Abstract
In this paper, we propose Silent-PIM, which performs PIM computation with standard DRAM memory requests. It therefore requires no hardware modifications and allows the PIM memory device to perform computation while servicing non-PIM applications' memory requests. We achieve this design goal by preserving standard memory request behavior and satisfying the DRAM standard timing requirements. In addition, using standard memory requests makes it possible to use DMA as the PIM's offloading engine, which processes PIM memory requests quickly and frees the core to perform other tasks. We compared the performance of three LSTM kernels on real platforms: Silent-PIM modeled on an FPGA, a GPU, and a CPU. For (px512) x (512x2048) matrix multiplication with the batch size p varying from 1 to 128, Silent-PIM performed up to 16.9x and 24.6x faster than the GPU and CPU, respectively, at p=1, the case without any data reuse. At p=128, the case with the highest data reuse, the GPU performance was the highest, but the PIM still outperformed the CPU. Similarly, for (px2048) element-wise multiplication and addition, where there was no data reuse, Silent-PIM always achieved higher performance than both the CPU and GPU. The PIM's energy-delay product (EDP) was also superior to the others in all cases with no data reuse.
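The link between the batch size p and data reuse in the abstract can be illustrated with a small back-of-the-envelope calculation. The following is a minimal sketch, not the paper's methodology: the function names, the 2-byte element size, and the assumption that every operand is moved between memory and the processor exactly once are all hypothetical choices made for illustration. It also includes the standard definition of the energy-delay product (energy multiplied by execution time).

```python
# Illustrative sketch only. Matrix shapes come from the abstract; the element
# size and the "each operand moved exactly once" traffic model are assumptions.

def weight_reuse(p: int) -> int:
    """Times each weight element is used in a (p x k) x (k x n) product.

    Every weight W[i][j] is multiplied with one element from each of the
    p input rows, so p = 1 means each weight is used once (no reuse).
    """
    return p


def flops_per_byte(p: int, k: int = 512, n: int = 2048, elem_bytes: int = 2) -> float:
    """Arithmetic intensity of the matrix multiplication under ideal caching."""
    flops = 2 * p * k * n                                # one multiply + one add per term
    bytes_moved = elem_bytes * (p * k + k * n + p * n)   # input + weights + output
    return flops / bytes_moved


def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """EDP: lower is better, since it penalizes both energy and run time."""
    return energy_joules * delay_seconds


if __name__ == "__main__":
    for p in (1, 8, 128):
        print(f"p={p:3d}  weight reuse={weight_reuse(p):3d}  "
              f"FLOPs/byte={flops_per_byte(p):.2f}")
```

Under these assumptions, p=1 yields less than one floating-point operation per byte moved, which is the memory-bound regime where performing the computation inside memory is most attractive, while p=128 yields a far higher ratio that favors a compute-rich device such as a GPU.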
Original language | English |
---|---|
Journal | IEEE Transactions on Parallel and Distributed Systems |
Publication status | Accepted/In press - 2021 |
Keywords
- DMA
- Engines
- Field programmable gate arrays
- LSTM
- Memory management
- Performance evaluation
- Random access memory
- Silent-PIM
- Standards
- Timing
- in-memory processing
- standard memory requests
ASJC Scopus subject areas
- Signal Processing
- Hardware and Architecture
- Computational Theory and Mathematics