Management of fault tolerance information for coordinated checkpointing protocol without sympathetic rollbacks

Kwang Sik Chung, YoungJun J. Lee, Heonchang Yu, Won Gyu Lee

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

This paper presents the conditions for an extended global recovery line in coordinated checkpointing protocols and a new garbage collection protocol for checkpoints and message logs that avoids the sympathetic rollbacks caused by lost messages. Because previous work assumed that the communication channel does not lose in-transit messages, earlier garbage collection schemes for coordinated checkpointing delete all checkpoints except the last one on each process. However, a coordinated checkpointing protocol built on a reliable communication protocol (TCP) still loses in-transit messages when a failure occurs, and lost messages lead to sympathetic rollbacks of the faulty processes or related processes. Fault tolerance information therefore needs management methods that store and delete coordinated checkpoints and light message logs so that sympathetic rollbacks are avoided. We define the extended global recovery line conditions for garbage collection of checkpoints and of message logs kept for lost messages, and present a new garbage collection algorithm that operates within the extended global recovery line. The proposed algorithm piggybacks process information on each message, so no additional messages are needed for garbage collection or for the extended global recovery line. Because it relies on checkpoint information piggybacked on communication messages, the proposed algorithm is called the 'lazy garbage collection algorithm'.
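The piggybacking idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given in this record; it is a minimal illustration under stated assumptions: each process keeps a sender-side message log, and every outgoing message carries one piggybacked field (here called `stable_seq`, a hypothetical name) telling the peer which of its messages are already covered by the sender's latest checkpoint. The peer can then discard those log entries without any extra control message. The `Message` and `Process` classes and all their fields are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    receiver: str
    seq: int          # per-channel sequence number assigned by the sender
    payload: str
    # Assumed piggybacked field: highest seq received FROM `receiver` that the
    # sender's latest checkpoint already covers.
    stable_seq: int = 0


@dataclass
class Process:
    name: str
    last_sent: dict = field(default_factory=dict)     # peer -> last seq sent
    received: dict = field(default_factory=dict)      # peer -> highest seq received
    checkpointed: dict = field(default_factory=dict)  # peer -> highest seq covered by last checkpoint
    log: dict = field(default_factory=dict)           # peer -> {seq: Message} (sender-side log)

    def send(self, peer: str, payload: str) -> Message:
        seq = self.last_sent.get(peer, 0) + 1
        self.last_sent[peer] = seq
        msg = Message(self.name, peer, seq, payload,
                      stable_seq=self.checkpointed.get(peer, 0))
        # Keep a copy so the message can be replayed if it is lost in transit.
        self.log.setdefault(peer, {})[seq] = msg
        return msg

    def receive(self, msg: Message) -> None:
        self.received[msg.sender] = max(self.received.get(msg.sender, 0), msg.seq)
        # Lazy garbage collection: drop logged messages that the peer reports
        # as covered by its checkpoint; no dedicated control message is sent.
        peer_log = self.log.get(msg.sender, {})
        for seq in [s for s in peer_log if s <= msg.stable_seq]:
            del peer_log[seq]

    def checkpoint(self) -> None:
        # Everything received so far is now part of the saved state.
        self.checkpointed = dict(self.received)

    def logged_seqs(self, peer: str) -> list:
        return sorted(self.log.get(peer, {}))


p, q = Process("p"), Process("q")
m1, m2 = p.send("q", "a"), p.send("q", "b")
q.receive(m1)
q.checkpoint()                 # q's checkpoint covers m1 only
q.receive(m2)
p.receive(q.send("p", "ack"))  # the reply piggybacks stable_seq = 1
print(p.logged_seqs("q"))      # [2]: m1 was collected, m2 is still logged
```

In this sketch the log entry for `m2` survives because `q` has not yet checkpointed after receiving it, so `p` could still resend it if `q` rolled back. Collection is "lazy" in the sense that it only happens when ordinary application traffic flows back from the peer.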

Original language: English
Pages (from-to): 379-390
Number of pages: 12
Journal: Journal of Information Science and Engineering
Volume: 20
Issue number: 2
Publication status: Published - 2004 Mar 1

Keywords

  • Coordinated checkpointing protocol
  • Garbage collection
  • Message log
  • Sympathetic rollback

ASJC Scopus subject areas

  • Information Systems

Cite this

Management of fault tolerance information for coordinated checkpointing protocol without sympathetic rollbacks. / Chung, Kwang Sik; Lee, YoungJun J.; Yu, Heonchang; Lee, Won Gyu.

In: Journal of Information Science and Engineering, Vol. 20, No. 2, 01.03.2004, p. 379-390.

@article{7bb240ff8f7a465bb53ceb65fdb9dfa8,
title = "Management of fault tolerance information for coordinated checkpointing protocol without sympathetic rollbacks",
abstract = "This paper presents the condition for an extended global recovery line for coordinated checkpointing protocol and a new garbage collection protocol on checkpoints and message logs in order to avoid the sympathetic rollback caused by lost messages. Since previous works assumed the communication channel does not lose the in-transit messages, those works on garbage collection in coordinated checkpointing protocols delete all the checkpoints except for the last checkpoints on each process. But coordinated checkpointing protocol based on the communication protocol with reliability (TCP) causes in-transit messages to be lost when a failure occurs, and lost messages lead to sympathetic rollbacks of faulty processes or related processes. Thus there is a need for management methods of fault tolerance information that can store and delete the coordinated checkpoint and light message log to avoid sympathetic rollback. In this paper, we define the extended global recovery line conditions for garbage collection of checkpoints and message logs for lost messages, and present the new garbage collection algorithm within the extended global recovery line. The proposed algorithm uses piggybacked process information on each message so that the additional messages for garbage collection and extended global recovery line are not needed. Since it relies on the piggybacked checkpoint information in communication message, the proposed garbage collection algorithm is called 'the lazy garbage collection algorithm'.",
keywords = "Coordinated checkpointing protocol, Garbage collection, Message log, Sympathetic rollback",
author = "Chung, {Kwang Sik} and Lee, {YoungJun J.} and Yu, Heonchang and Lee, {Won Gyu}",
year = "2004",
month = "3",
day = "1",
language = "English",
volume = "20",
pages = "379--390",
journal = "Journal of Information Science and Engineering",
issn = "1016-2364",
publisher = "Institute of Information Science",
number = "2",
}

TY - JOUR

T1 - Management of fault tolerance information for coordinated checkpointing protocol without sympathetic rollbacks

AU - Chung, Kwang Sik

AU - Lee, YoungJun J.

AU - Yu, Heonchang

AU - Lee, Won Gyu

PY - 2004/3/1

Y1 - 2004/3/1

AB - This paper presents the condition for an extended global recovery line for coordinated checkpointing protocol and a new garbage collection protocol on checkpoints and message logs in order to avoid the sympathetic rollback caused by lost messages. Since previous works assumed the communication channel does not lose the in-transit messages, those works on garbage collection in coordinated checkpointing protocols delete all the checkpoints except for the last checkpoints on each process. But coordinated checkpointing protocol based on the communication protocol with reliability (TCP) causes in-transit messages to be lost when a failure occurs, and lost messages lead to sympathetic rollbacks of faulty processes or related processes. Thus there is a need for management methods of fault tolerance information that can store and delete the coordinated checkpoint and light message log to avoid sympathetic rollback. In this paper, we define the extended global recovery line conditions for garbage collection of checkpoints and message logs for lost messages, and present the new garbage collection algorithm within the extended global recovery line. The proposed algorithm uses piggybacked process information on each message so that the additional messages for garbage collection and extended global recovery line are not needed. Since it relies on the piggybacked checkpoint information in communication message, the proposed garbage collection algorithm is called 'the lazy garbage collection algorithm'.

KW - Coordinated checkpointing protocol

KW - Garbage collection

KW - Message log

KW - Sympathetic rollback

UR - http://www.scopus.com/inward/record.url?scp=1642358231&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=1642358231&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:1642358231

VL - 20

SP - 379

EP - 390

JO - Journal of Information Science and Engineering

JF - Journal of Information Science and Engineering

SN - 1016-2364

IS - 2

ER -