Parallel huge matrix multiplication on a cluster with GPGPU accelerators

Seungyo Ryu, Dong Seung Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We design a parallel algorithm for multiplying huge matrices on a cluster of GPU nodes. Since the input matrices are too large to fit in memory, the algorithm repeatedly loads, computes, and stores partial matrix data between disk and the GPU buffer. The key to achieving the best speedup is not only running the GPUs at full performance but also reducing the overhead of data movement between disk and the GPU buffer. We devise an efficient way to lower the latency of supplying matching pairs of partial matrices to the GPU buffer, and we optimize data partitioning, distribution, and disk access in a pipelined fashion. Experimental results show that our algorithm outperforms a generic algorithm, reducing computing time by 45%. Moreover, the scalability of the algorithm improves as more GPU nodes are added.
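The out-of-core scheme the abstract describes (repeatedly loading matching pairs of partial matrices, multiplying them, and storing finished blocks back to disk) can be sketched in plain Python with NumPy memory-mapped files. This is an illustrative single-node sketch only: the tile size, file layout, and use of host memory in place of a real GPU buffer are assumptions, not the authors' MPI/GPU implementation.

```python
import numpy as np

def blocked_matmul(a_path, b_path, c_path, n, tile):
    """Multiply two n x n float64 matrices stored on disk, block by block.

    Only a few tile x tile blocks are resident in memory at once,
    mimicking a small GPU buffer; n is assumed divisible by tile.
    """
    A = np.memmap(a_path, dtype=np.float64, mode="r", shape=(n, n))
    B = np.memmap(b_path, dtype=np.float64, mode="r", shape=(n, n))
    C = np.memmap(c_path, dtype=np.float64, mode="w+", shape=(n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile))
            for k in range(0, n, tile):
                # load the matching pair of partial matrices, then compute
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            # store the finished block of the product back to disk
            C[i:i+tile, j:j+tile] = acc
    C.flush()
```

In the paper's setting, the inner block product would run on the GPU (e.g. a batched GEMM) while the next pair of blocks is fetched from disk, so that transfer and compute overlap in a pipeline; the sketch above shows only the blocking structure.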

Original language: English
Title of host publication: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 877-882
Number of pages: 6
ISBN (Print): 9781538655559
DOIs: 10.1109/IPDPSW.2018.00139
Publication status: Published - 3 Aug 2018
Event: 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 - Vancouver, Canada
Duration: 21 May 2018 – 25 May 2018



Keywords

  • GPU computing
  • Matrix multiplication
  • MPI
  • Parallel computing

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management

Cite this

Ryu, S., & Kim, D. S. (2018). Parallel huge matrix multiplication on a cluster with GPGPU accelerators. In Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 (pp. 877-882). [8425506] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPSW.2018.00139
