Design of Processing-"Inside"-Memory Optimized for DRAM Behaviors

Won Jun Lee, Chang Hyun Kim, Yoonah Paik, Jongsun Park, Il Park, Seon Wook Kim

Research output: Contribution to journal › Article

Abstract

The computing demands of today's systems are shifting rapidly from arithmetic to data processing as data volumes grow exponentially. As a result, processing-in-memory (PIM) has been actively studied to support data processing in or near memory devices, addressing the limited bandwidth and high power consumption caused by data movement between the CPU/GPU and memory. However, most PIM studies to date have designed the processing units only as accelerators on the base die of 3D-stacked DRAM rather than inside the memory itself, and they cannot service standard DRAM requests during PIM execution. In this paper, we therefore show how to design and operate PIM computing units inside DRAM, coordinating them effectively with standard DRAM operations while achieving full computing performance and minimizing implementation cost. To achieve these goals, we extend the standard DRAM state diagram to describe PIM behaviors in the same way that standard DRAM commands are scheduled and executed on the devices, and we exploit several levels of parallelism to overlap memory and computing operations. We also present how all the architecture layers, from applications through operating systems and memory controllers to PIM devices, must work together for effective execution, applying our approaches to our experimental platform. On our HBM2-based platform, which includes 16-cycle MAC (multiply-accumulate) units and 8-cycle reducers for matrix-vector multiplication, all-bank and per-bank scheduling run a (1024×1024) × (1024×1) 8-bit integer matrix-vector multiplication 406% and 35.2% faster, respectively, than merely burst-reading its operands at the full external DRAM bandwidth. It should be noted that PIM on the base die of a 3D-stacked memory can never perform better than what that full bandwidth provides.
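
To make the workload concrete, below is a minimal functional sketch, in Python/NumPy, of the matrix-vector multiplication the abstract describes: each of several bank-level MAC units multiply-accumulates one column slice of the matrix against the matching slice of the vector, and a reducer sums the per-bank partial results. The bank count, the column-wise partitioning, and all names here are illustrative assumptions for exposition, not the paper's actual HBM2 organization or scheduling.

```python
import numpy as np

# Illustrative parameters (assumptions, not the paper's configuration):
NUM_BANKS = 16   # banks assumed to participate in all-bank scheduling
N = 1024         # matrix dimension from the abstract's workload

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(N, N), dtype=np.int8)  # 8-bit integer matrix
x = rng.integers(-128, 128, size=N, dtype=np.int8)       # 8-bit integer vector

# Each bank is assumed to hold one column slice of A and the matching
# slice of x; its MAC unit multiply-accumulates over that slice.
cols_per_bank = N // NUM_BANKS
partials = []
for b in range(NUM_BANKS):
    lo, hi = b * cols_per_bank, (b + 1) * cols_per_bank
    # Per-bank MAC: partial dot products in wider precision (int32)
    # so the accumulation of 8-bit products cannot overflow.
    partials.append(A[:, lo:hi].astype(np.int32) @ x[lo:hi].astype(np.int32))

# Reducer: sum the per-bank partial results into the final vector y.
y = np.sum(partials, axis=0)

# Sanity check: the bank-partitioned reduction matches a direct product.
assert np.array_equal(y, A.astype(np.int32) @ x.astype(np.int32))
```

In this functional model the reducer is a plain summation across banks; the paper's point is about when those MAC and reduction operations can be scheduled relative to standard DRAM commands, which a timing-free sketch like this does not capture.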

Original language: English
Article number: 8743357
Pages (from-to): 82633-82648
Number of pages: 16
Journal: IEEE Access
Volume: 7
DOIs: 10.1109/ACCESS.2019.2924240
Publication status: Published - 2019 Jan 1

Keywords

  • DRAM
  • matrix-vector multiplication
  • parallelism
  • Processing-in-memory

ASJC Scopus subject areas

  • Computer Science(all)
  • Materials Science(all)
  • Engineering(all)

Cite this

Design of Processing-"Inside"-Memory Optimized for DRAM Behaviors. / Lee, Won Jun; Kim, Chang Hyun; Paik, Yoonah; Park, Jongsun; Park, Il; Kim, Seon Wook.

In: IEEE Access, Vol. 7, 8743357, 01.01.2019, p. 82633-82648.

Research output: Contribution to journal › Article

@article{1fa60b3d7a7d4a49bde8a363d134f26b,
title = "Design of Processing-{"}Inside{"}-Memory Optimized for DRAM Behaviors",
abstract = "The computing domain of today's computer systems is moving very fast from arithmetic to data processing as data volumes grow exponentially. As a result, processing-in-memory (PIM) studies have been actively conducted to support the data processing in or near memory devices to address the limited bandwidth and high power consumption due to data movement between CPU/GPU and memory. However, most PIM studies so far have been conducted in a way that the processing units are designed only as an accelerator on the base die of 3D-stacked DRAM, not involved inside memory while not servicing the standard DRAM requests during the PIM execution. Therefore, in this paper, we show how to design and operate the PIM computing units inside DRAM by effectively coordinating with standard DRAM operations while achieving the full computing performance and minimizing the implementation cost. To make our goals, we extend a standard DRAM state diagram to depict the PIM behaviors in the same way as standard DRAM commands are scheduled and operated on the DRAM devices and exploit several levels of parallelism to overlap memory and computing operations. Also, we present how the entire architecture layers from applications to operating systems, memory controllers, and PIM devices should work together for the effective execution by applying our approaches to our experiment platform. In our HBM2-based experimental platform to include 16-cycle MAC (Multiply-and-Add) units and 8-cycle reducers for a matrix-vector multiplication, we achieved 406{\%} and 35.2{\%} faster performance by the all-bank and the per-bank schedulings, respectively, at (1024×1024) × (1024×1) 8-bit integer matrix-vector multiplication than the execution of only its operand burst reads assuming the external full DRAM bandwidth. It should be noted that the performance of the PIM on a base die of a 3D-stacked memory cannot be better than that provided by the full bandwidth in any case.",
keywords = "DRAM, matrix-vector multiplication, parallelism, Processing-in-memory",
author = "Lee, {Won Jun} and Kim, {Chang Hyun} and Yoonah Paik and Jongsun Park and Il Park and Kim, {Seon Wook}",
year = "2019",
month = "1",
day = "1",
doi = "10.1109/ACCESS.2019.2924240",
language = "English",
volume = "7",
pages = "82633--82648",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}
