In this paper, we present FRASystem (Fault Tolerant & Recovery Agent System), a fault tolerance and recovery system for distributed computing systems that is built on multiple agents. Previous rollback-recovery protocols depended on the inherent communication mechanism and the underlying operating system, which degraded computing performance. We propose a rollback-recovery protocol that works independently of the operating system, which improves portability and extensibility. We define four types of agents: (1) a recovery agent, which performs the rollback-recovery protocol after a failure; (2) an information agent, which constructs domain knowledge as fault tolerance rules and information during failure-free operation; (3) a facilitator agent, which controls communication between agents; and (4) a garbage collection agent, which performs garbage collection of fault tolerance information that is no longer useful. Since agent failures may lead to inconsistent system states and a domino effect, we propose an agent recovery algorithm. A garbage collection protocol addresses the performance degradation caused by the growth of fault tolerance information saved in stable storage. We implemented a prototype of FRASystem using Java and CORBA and evaluated the proposed rollback-recovery protocol experimentally. The simulation results indicate that our protocol performs better than previous rollback-recovery protocols that use independent checkpointing and pessimistic message logging without agents. Our contributions are as follows: (1) this is the first rollback-recovery protocol using agents, (2) FRASystem does not depend on an operating system, and (3) FRASystem provides portability and extensibility.
This work was supported by the Soonchunhyang University Research Fund 20080152.
Keywords
- Distributed computing system
- Fault tolerance
- Multi-agent system
ASJC Scopus subject areas
- Computer Networks and Communications