TY - GEN
T1 - Extending the ONNX Runtime Framework for the Processing-in-Memory Execution
AU - Kim, Seok Young
AU - Lee, Jaewook
AU - Kim, Chang Hyun
AU - Lee, Won Jun
AU - Kim, Seon Wook
N1 - Funding Information:
This work was supported in part by SK hynix Inc.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Attention-based models achieve sufficiently accurate performance on NLP tasks, but as model size grows, memory usage increases exponentially. Moreover, the large volume of data with low locality causes an excessive increase in power consumption for data movement. Processing-in-Memory (PIM), which places computing logic in or near memory, has therefore become an attractive solution to the memory bottleneck in system performance. While various design explorations of PIM architectures have been studied, efficient software frameworks for them have rarely been developed. This paper extends the ONNX Runtime framework for a PIM-based platform. The framework provides function abstractions for various PIM operations and easy programmability for users. Using the framework, we executed the BERT workload, which is dominant among attention-based models, with the GLUE dataset. By exploiting data-/bank-level parallelism and performing vector execution in each bank, our baseline PIM platform achieved average speedups of 1.64x and 1.71x over x86 and ARM CPUs, respectively.
AB - Attention-based models achieve sufficiently accurate performance on NLP tasks, but as model size grows, memory usage increases exponentially. Moreover, the large volume of data with low locality causes an excessive increase in power consumption for data movement. Processing-in-Memory (PIM), which places computing logic in or near memory, has therefore become an attractive solution to the memory bottleneck in system performance. While various design explorations of PIM architectures have been studied, efficient software frameworks for them have rarely been developed. This paper extends the ONNX Runtime framework for a PIM-based platform. The framework provides function abstractions for various PIM operations and easy programmability for users. Using the framework, we executed the BERT workload, which is dominant among attention-based models, with the GLUE dataset. By exploiting data-/bank-level parallelism and performing vector execution in each bank, our baseline PIM platform achieved average speedups of 1.64x and 1.71x over x86 and ARM CPUs, respectively.
KW - Attention-based Model
KW - Processing-in-Memory
UR - http://www.scopus.com/inward/record.url?scp=85128831479&partnerID=8YFLogxK
U2 - 10.1109/ICEIC54506.2022.9748444
DO - 10.1109/ICEIC54506.2022.9748444
M3 - Conference contribution
AN - SCOPUS:85128831479
T3 - 2022 International Conference on Electronics, Information, and Communication, ICEIC 2022
BT - 2022 International Conference on Electronics, Information, and Communication, ICEIC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 International Conference on Electronics, Information, and Communication, ICEIC 2022
Y2 - 6 February 2022 through 9 February 2022
ER -