Fault tolerance technique offlining faulty blocks by heap memory management

Jaeyung Jun, Yoonah Paik, Gyeong Il Min, Seon Wook Kim, Youngsun Han

Research output: Contribution to journalArticle

Abstract

As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they have more faults. Thus, DRAM reliability becomes a major concern in computer systems. Previous studies have proposed many techniques preserving the reliability in various system components, such as DRAM internal, memory controller, caches, and operating systems. By reviewing the techniques, we identified the following two considerations: First, it is possible to recover faults with reasonable overhead at high fault rate only if the recovery unit is fine-grained. Second, since hardware modification requires additional cost in the employment of a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate the high fault rate. In this article, we propose a pure software-based recovery technique with fine-granularity. Our key idea is based on heap segments being managed by the system library with variable-sized chunks to handle dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages are allowed to be usable space. Our technique is implemented by modifying the operating system and the system library. Since hardware assistance is unnecessary in the implementation, we evaluated our method on a real machine. Our evaluation results show that our technique has negligible performance overhead at high bit error rate (BER) 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. Also, at the same BER, our method provides 5.22× usable space, compared with page-offline, which is the state-of-the-art pure software-based technique.

Original languageEnglish
Article number47
JournalACM Transactions on Design Automation of Electronic Systems
Volume24
Issue number4
DOIs
Publication statusPublished - 2019 Jun 1

Fingerprint

Fault tolerance
Data storage equipment
Recovery
Hardware
Bit error rate
Computer systems
Computer operating systems
Controllers
Costs

Keywords

  • DRAM fault recovery

ASJC Scopus subject areas

  • Computer Science Applications
  • Computer Graphics and Computer-Aided Design
  • Electrical and Electronic Engineering

Cite this

Fault tolerance technique offlining faulty blocks by heap memory management. / Jun, Jaeyung; Paik, Yoonah; Min, Gyeong Il; Kim, Seon Wook; Han, Youngsun.

In: ACM Transactions on Design Automation of Electronic Systems, Vol. 24, No. 4, 47, 01.06.2019.

Research output: Contribution to journalArticle

@article{21e77d7e185c4aa7a58784877887a4cb,
title = "Fault tolerance technique offlining faulty blocks by heap memory management",
abstract = "As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they have more faults. Thus, DRAM reliability becomes a major concern in computer systems. Previous studies have proposed many techniques preserving the reliability in various system components, such as DRAM internal, memory controller, caches, and operating systems. By reviewing the techniques, we identified the following two considerations: First, it is possible to recover faults with reasonable overhead at high fault rate only if the recovery unit is fine-grained. Second, since hardware modification requires additional cost in the employment of a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate the high fault rate. In this article, we propose a pure software-based recovery technique with fine-granularity. Our key idea is based on heap segments being managed by the system library with variable-sized chunks to handle dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages are allowed to be usable space. Our technique is implemented by modifying the operating system and the system library. Since hardware assistance is unnecessary in the implementation, we evaluated our method on a real machine. Our evaluation results show that our technique has negligible performance overhead at high bit error rate (BER) 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. Also, at the same BER, our method provides 5.22× usable space, compared with page-offline, which is the state-of-the-art pure software-based technique.",
keywords = "DRAM fault recovery",
author = "Jaeyung Jun and Yoonah Paik and Min, {Gyeong Il} and Kim, {Seon Wook} and Youngsun Han",
year = "2019",
month = "6",
day = "1",
doi = "10.1145/3329079",
language = "English",
volume = "24",
journal = "ACM Transactions on Design Automation of Electronic Systems",
issn = "1084-4309",
publisher = "Association for Computing Machinery (ACM)",
number = "4",

}

TY - JOUR

T1 - Fault tolerance technique offlining faulty blocks by heap memory management

AU - Jun, Jaeyung

AU - Paik, Yoonah

AU - Min, Gyeong Il

AU - Kim, Seon Wook

AU - Han, Youngsun

PY - 2019/6/1

Y1 - 2019/6/1

N2 - As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they have more faults. Thus, DRAM reliability becomes a major concern in computer systems. Previous studies have proposed many techniques preserving the reliability in various system components, such as DRAM internal, memory controller, caches, and operating systems. By reviewing the techniques, we identified the following two considerations: First, it is possible to recover faults with reasonable overhead at high fault rate only if the recovery unit is fine-grained. Second, since hardware modification requires additional cost in the employment of a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate the high fault rate. In this article, we propose a pure software-based recovery technique with fine-granularity. Our key idea is based on heap segments being managed by the system library with variable-sized chunks to handle dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages are allowed to be usable space. Our technique is implemented by modifying the operating system and the system library. Since hardware assistance is unnecessary in the implementation, we evaluated our method on a real machine. Our evaluation results show that our technique has negligible performance overhead at high bit error rate (BER) 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. Also, at the same BER, our method provides 5.22× usable space, compared with page-offline, which is the state-of-the-art pure software-based technique.

AB - As dynamic random access memory (DRAM) cells continue to be scaled down for higher density and capacity, they have more faults. Thus, DRAM reliability becomes a major concern in computer systems. Previous studies have proposed many techniques preserving the reliability in various system components, such as DRAM internal, memory controller, caches, and operating systems. By reviewing the techniques, we identified the following two considerations: First, it is possible to recover faults with reasonable overhead at high fault rate only if the recovery unit is fine-grained. Second, since hardware modification requires additional cost in the employment of a technique, a pure software-based recovery technique is preferable. However, in the existing software-based recovery technique, the recovery unit is too coarse-grained to tolerate the high fault rate. In this article, we propose a pure software-based recovery technique with fine-granularity. Our key idea is based on heap segments being managed by the system library with variable-sized chunks to handle dynamic allocation in user applications. In our technique, faulty blocks in pages are offlined by marking them as allocated chunks. Thus, not only fault-free pages but also the remaining clean blocks in faulty pages are allowed to be usable space. Our technique is implemented by modifying the operating system and the system library. Since hardware assistance is unnecessary in the implementation, we evaluated our method on a real machine. Our evaluation results show that our technique has negligible performance overhead at high bit error rate (BER) 5.12e-5, which a hardware-based recovery technique could not tolerate without unacceptable area overhead. Also, at the same BER, our method provides 5.22× usable space, compared with page-offline, which is the state-of-the-art pure software-based technique.

KW - DRAM fault recovery

UR - http://www.scopus.com/inward/record.url?scp=85069207735&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069207735&partnerID=8YFLogxK

U2 - 10.1145/3329079

DO - 10.1145/3329079

M3 - Article

VL - 24

JO - ACM Transactions on Design Automation of Electronic Systems

JF - ACM Transactions on Design Automation of Electronic Systems

SN - 1084-4309

IS - 4

M1 - 47

ER -