TY - CHAP
T1 - Low-cost fault-tolerance protocol for large-scale network monitoring
AU - Ahn, Jin Ho
AU - Min, Sung Gi
AU - Choi, Young Il
AU - Lee, Byung Sun
N1 - Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2003
Y1 - 2003
N2 - A distributed hierarchical network monitoring model has been proposed to solve the scalability problem of the centralized model. In this distributed model, a top-level monitoring manager, called the main manager, obtains aggregate management information from mid-level managers, called domain managers, forming a hierarchical structure. However, if some of the monitoring managers crash, network elements cannot be monitored continuously and correctly until those managers are repaired. To address this important but previously unresolved issue, this paper presents a new fault-tolerance protocol for domain managers, named DMFTP, which allows the managers to make efficient use of their organizational structure. As a result, the protocol can minimize failure detection overhead and the number of live managers affected by each manager node crash. It also tolerates concurrent manager failures and, after the failed managers have been repaired, ensures their immediate and consistent recovery.
AB - A distributed hierarchical network monitoring model has been proposed to solve the scalability problem of the centralized model. In this distributed model, a top-level monitoring manager, called the main manager, obtains aggregate management information from mid-level managers, called domain managers, forming a hierarchical structure. However, if some of the monitoring managers crash, network elements cannot be monitored continuously and correctly until those managers are repaired. To address this important but previously unresolved issue, this paper presents a new fault-tolerance protocol for domain managers, named DMFTP, which allows the managers to make efficient use of their organizational structure. As a result, the protocol can minimize failure detection overhead and the number of live managers affected by each manager node crash. It also tolerates concurrent manager failures and, after the failed managers have been repaired, ensures their immediate and consistent recovery.
UR - http://www.scopus.com/inward/record.url?scp=35248877563&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=35248877563&partnerID=8YFLogxK
U2 - 10.1007/3-540-44863-2_50
DO - 10.1007/3-540-44863-2_50
M3 - Chapter
AN - SCOPUS:35248877563
SN - 9783540401964
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 504
EP - 513
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
A2 - Sloot, Peter M.A.
A2 - Abramson, David
A2 - Bogdanov, Alexander V.
A2 - Gorbachev, Yuriy E.
A2 - Dongarra, Jack J.
A2 - Zomaya, Albert Y.
PB - Springer Verlag
ER -