A resource manager for optimal resource selection and fault tolerance service in grids

Hwa Min Lee, Sung Ho Chin, Jong Hyuk Lee, Dae Won Lee, Kwang Sik Chung, Soon Young Jung, Heonchang Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)

Abstract

In this paper, we address the issues of resource management and fault tolerance in Grids. In Grids, the state of the resources selected for job execution is a primary factor that determines computing performance. Specifically, we propose a resource manager for optimal resource selection. The resource manager automatically selects the optimal resources among candidate resources using a genetic algorithm. Typically, the probability of failure is higher in grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids, and grid services are often expected to meet some minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service to satisfy QoS requirements. We extend the definition of failures, such as process failure, processor failure, and network failure, and design the fault detector and fault manager. The simulation results indicate that our approaches are promising in that (1) our resource manager finds the optimal set of resources that guarantees the optimal performance, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that the submitted jobs complete and improves the performance of job execution through job migration even if some failures happen.
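
As a rough illustration of the genetic-algorithm selection step the abstract describes, the sketch below picks k resources out of a candidate pool. It is not the authors' algorithm: the function name select_resources, the speeds scores, the summed-speed fitness, and the crossover and mutation operators are all assumptions chosen to keep the example small, since the abstract does not specify the encoding or the performance model.

```python
import random


def select_resources(speeds, k, generations=60, pop_size=30, seed=0):
    """Pick k candidate resources with a toy genetic algorithm.

    speeds: hypothetical per-resource performance scores (higher is better).
    Fitness is the summed score of the chosen subset, a stand-in for
    whatever performance model the paper actually uses.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    n = len(speeds)

    def fitness(chrom):
        return sum(speeds[i] for i in chrom)

    def crossover(a, b):
        # Child draws k distinct resources from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(chrom, rate=0.2):
        # Occasionally swap one chosen resource for an unchosen one.
        if k < n and rng.random() < rate:
            chosen = list(chrom)
            unused = [i for i in range(n) if i not in chrom]
            chosen[rng.randrange(k)] = rng.choice(unused)
            return tuple(sorted(chosen))
        return chrom

    # Each chromosome is a sorted tuple of k distinct resource indices.
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children  # elitism keeps the best found so far
    return max(population, key=fitness)
```

Calling select_resources([3.0, 9.5, 1.2, 7.8, 4.4, 8.1, 2.0, 6.3], 3) returns a sorted tuple of three distinct indices into the candidate list. Because the elite half survives each generation, the best fitness never decreases, but a genetic algorithm of this kind is a heuristic and does not by itself guarantee the global optimum.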

Original language: English
Title of host publication: 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
Pages: 572-579
Number of pages: 8
Publication status: Published - 2004 Sep 29
Event: 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004 - Chicago, IL, United States
Duration: 2004 Apr 19 to 2004 Apr 22

Other

Other: 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004
Country: United States
City: Chicago, IL
Period: 04/4/19 to 04/4/22

Fingerprint

Fault tolerance
Managers
Quality of service
Detectors
Grid computing
Parallel processing systems
Genetic algorithms

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Lee, H. M., Chin, S. H., Lee, J. H., Lee, D. W., Chung, K. S., Jung, S. Y., & Yu, H. (2004). A resource manager for optimal resource selection and fault tolerance service in grids. In 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004 (pp. 572-579)

@inproceedings{ccc7f3aa93874885ac05b3b35e3c0229,
title = "A resource manager for optimal resource selection and fault tolerance service in grids",
author = "Lee, {Hwa Min} and Chin, {Sung Ho} and Lee, {Jong Hyuk} and Lee, {Dae Won} and Chung, {Kwang Sik} and Jung, {Soon Young} and Yu, Heonchang",
year = "2004",
month = "9",
day = "29",
language = "English",
isbn = "078038430X",
pages = "572--579",
booktitle = "2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004",

}

TY - GEN

T1 - A resource manager for optimal resource selection and fault tolerance service in grids

AU - Lee, Hwa Min

AU - Chin, Sung Ho

AU - Lee, Jong Hyuk

AU - Lee, Dae Won

AU - Chung, Kwang Sik

AU - Jung, Soon Young

AU - Yu, Heonchang

PY - 2004/9/29

Y1 - 2004/9/29

UR - http://www.scopus.com/inward/record.url?scp=4544223732&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4544223732&partnerID=8YFLogxK

M3 - Conference contribution

SN - 078038430X

SP - 572

EP - 579

BT - 2004 IEEE International Symposium on Cluster Computing and the Grid, CCGrid 2004

ER -