A resource management and fault tolerance services in grid computing

HwaMin Lee, KwangSik Chung, SungSo Chin, JongHyuk Lee, DaeWon Lee, Seongbin Park, Heonchang Yu

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor that determines computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures beyond the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and that job execution performance improves through job migration even if some failures occur.
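
The abstract does not say how the genetic algorithm encodes candidates or scores them, so the following is only a minimal sketch of the general idea of picking a set of resources with a genetic algorithm. Everything in it is an assumption made for illustration: the resource list, the fitness metric (speed weighted by availability), the subset encoding, and the names `fitness`, `crossover`, `mutate`, and `select_resources`. It is not the authors' resource manager.

```python
import random

# Hypothetical resource model: (name, relative speed, availability in [0, 1]).
RESOURCES = [
    ("node-a", 1.0, 0.99),
    ("node-b", 2.5, 0.60),
    ("node-c", 1.8, 0.95),
    ("node-d", 3.0, 0.40),
    ("node-e", 2.2, 0.90),
    ("node-f", 0.8, 0.97),
]


def fitness(subset, resources):
    """Assumed metric: sum of speed weighted by availability."""
    return sum(resources[i][1] * resources[i][2] for i in subset)


def crossover(a, b, k, rng):
    """Child draws k distinct resources from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k)))


def mutate(subset, n, rate, rng):
    """With probability `rate`, swap one chosen resource for an unchosen one."""
    unused = [i for i in range(n) if i not in subset]
    if not subset or not unused or rng.random() >= rate:
        return subset
    kept = list(subset)
    kept[rng.randrange(len(kept))] = rng.choice(unused)
    return tuple(sorted(kept))


def select_resources(resources, k, generations=40, pop_size=12, rate=0.3, seed=0):
    """Return (best subset of k indices, best fitness seen per generation)."""
    rng = random.Random(seed)
    n = len(resources)

    def score(subset):
        return fitness(subset, resources)

    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        elite = population[0]  # elitism: the best subset always survives
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), k, rng), n, rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [elite] + children
    best = max(population, key=score)
    history.append(score(best))
    return best, history
```

Calling `select_resources(RESOURCES, k=3)` returns a tuple of three resource indices plus the per-generation best fitness. Because the best subset is carried over unchanged each generation, that fitness never decreases, although a run of this kind is not guaranteed to reach the true optimum.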

Original language: English
Pages (from-to): 1305-1317
Number of pages: 13
Journal: Journal of Parallel and Distributed Computing
Volume: 65
Issue number: 11
DOI: 10.1016/j.jpdc.2005.05.026
Publication status: Published - 2005 Nov 1

Fingerprint

Grid computing
Resource Management
Fault tolerance
Managers
Resources
Quality of service
Fault
Detectors
Parallel processing systems
Genetic algorithms
Availability
Grid Service
Computational Grid
Parallel Computing
Migration

Keywords

  • Fault tolerance
  • Grid computing
  • Migration
  • Quality of service
  • Resource manager

ASJC Scopus subject areas

  • Computer Science Applications
  • Hardware and Architecture
  • Control and Systems Engineering

Cite this

A resource management and fault tolerance services in grid computing. / Lee, HwaMin; Chung, KwangSik; Chin, SungSo; Lee, JongHyuk; Lee, DaeWon; Park, Seongbin; Yu, Heonchang.

In: Journal of Parallel and Distributed Computing, Vol. 65, No. 11, 01.11.2005, p. 1305-1317.

@article{d2eb8f8359064fd58b3a4f35d8354ce6,
title = "A resource management and fault tolerance services in grid computing",
abstract = "In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor that determines computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures beyond the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and that job execution performance improves through job migration even if some failures occur.",
keywords = "Fault tolerance, Grid computing, Migration, Quality of service, Resource manager",
author = "HwaMin Lee and KwangSik Chung and SungSo Chin and JongHyuk Lee and DaeWon Lee and Seongbin Park and Heonchang Yu",
year = "2005",
month = "11",
day = "1",
doi = "10.1016/j.jpdc.2005.05.026",
language = "English",
volume = "65",
pages = "1305--1317",
journal = "Journal of Parallel and Distributed Computing",
issn = "0743-7315",
publisher = "Academic Press Inc.",
number = "11",

}

TY - JOUR

T1 - A resource management and fault tolerance services in grid computing

AU - Lee, HwaMin

AU - Chung, KwangSik

AU - Chin, SungSo

AU - Lee, JongHyuk

AU - Lee, DaeWon

AU - Park, Seongbin

AU - Yu, Heonchang

PY - 2005/11/1

Y1 - 2005/11/1

AB - In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor that determines computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures beyond the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and that job execution performance improves through job migration even if some failures occur.

KW - Fault tolerance

KW - Grid computing

KW - Migration

KW - Quality of service

KW - Resource manager

UR - http://www.scopus.com/inward/record.url?scp=26944441245&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=26944441245&partnerID=8YFLogxK

U2 - 10.1016/j.jpdc.2005.05.026

DO - 10.1016/j.jpdc.2005.05.026

M3 - Article

AN - SCOPUS:26944441245

VL - 65

SP - 1305

EP - 1317

JO - Journal of Parallel and Distributed Computing

JF - Journal of Parallel and Distributed Computing

SN - 0743-7315

IS - 11

ER -