A resource management and fault tolerance services in grid computing

Hwa Min Lee, Kwang Sik Chung, Sung So Chin, Jong Hyuk Lee, Dae Won Lee, Seongbin Park, Heon Chang Yu

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor in determining computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional definition of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that the submitted jobs complete and that the performance of job execution is improved by job migration even if some failures occur.
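The abstract does not give the chromosome encoding, fitness function, or genetic operators used by the resource manager, so the following is only a minimal sketch of how a genetic algorithm could pick a fixed-size set of resources from a candidate pool. Everything in it is an assumption made for illustration: the candidate list, the `fitness` objective (sum of speed times availability), the crossover and mutation rules, and all function and parameter names.

```python
import random

# Hypothetical candidate resources: (name, relative speed, availability in [0, 1]).
CANDIDATES = [
    ("node-a", 1.0, 0.99),
    ("node-b", 2.5, 0.60),
    ("node-c", 1.8, 0.95),
    ("node-d", 3.0, 0.40),
    ("node-e", 2.2, 0.90),
    ("node-f", 0.8, 0.999),
]


def fitness(selection):
    """Assumed objective: expected throughput = sum of speed * availability."""
    return sum(CANDIDATES[i][1] * CANDIDATES[i][2] for i in selection)


def crossover(a, b, k, rng):
    """Child draws k distinct resources from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k)))


def mutate(sel, rate, rng):
    """With probability `rate`, swap one selected resource for an unselected one."""
    outside = [i for i in range(len(CANDIDATES)) if i not in sel]
    if outside and rng.random() < rate:
        kept = list(sel)
        kept[rng.randrange(len(kept))] = rng.choice(outside)
        return tuple(sorted(kept))
    return sel


def select_resources(k, pop_size=20, generations=40, rate=0.2, seed=0):
    """Return (best selection of k candidate indices, best fitness per generation)."""
    rng = random.Random(seed)
    indices = range(len(CANDIDATES))
    population = [tuple(sorted(rng.sample(indices, k))) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 2]  # keep the better half (elitism)
        children = [
            mutate(crossover(*rng.sample(elite, 2), k, rng), rate, rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = max(population, key=fitness)
    return best, history
```

Calling `select_resources(3)` returns a tuple of three candidate indices together with the best fitness seen at the start of each generation. Because the better half of the population is carried over unchanged, that history never decreases. A genetic algorithm is a heuristic, so this sketch does not guarantee the true optimum.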

Original language: English
Pages (from-to): 1305-1317
Number of pages: 13
Journal: Journal of Parallel and Distributed Computing
Volume: 65
Issue number: 11
DOIs
Publication status: Published - 2005 Nov

Keywords

  • Fault tolerance
  • Grid computing
  • Migration
  • Quality of service
  • Resource manager

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Artificial Intelligence
