Where does the speedup go: Quantitative modeling of performance losses in shared-memory programs

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

Even fully parallel shared-memory program sections may perform significantly below the ideal speedup of P on P processors. Relatively little quantitative information is available about the sources of such inefficiencies. In this paper we present a speedup component model that fully accounts for the sources of performance loss in parallel program sections. The model categorizes the gap between measured and ideal speedup into four components: memory stalls, processor stalls, code overhead, and thread management overhead. These components are measured using hardware counters and timers, with which programs are instrumented automatically by our compiler. The speedup component model allows us, for the first time, to state quantitatively the reasons for less-than-optimal program performance on a per-program-section basis. The overhead components are chosen so that they can be associated directly with software and hardware techniques that may improve performance. Although general, our model is especially suited to the analysis of loop-oriented programs, such as those written with the OpenMP API. We have applied the model to compare three parallel code generation schemes for the Polaris parallelizing compiler. It helps us answer questions such as: what sources of inefficiency are present in compiler-parallelized programs? To address this question we have also implemented an alternative, thread-based code generation method.
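To make the decomposition concrete, the gap between ideal and measured speedup can be expressed as a sum of loss terms, one per overhead category. The notation below is an illustrative sketch under that assumption, not the paper's own formulation:

    Speedup_measured(P) = P − (ΔS_memory + ΔS_processor + ΔS_code + ΔS_thread_mgmt)

where each ΔS term denotes the speedup lost to that component, so the four terms together account for the entire gap P − Speedup_measured(P) in a fully parallel program section.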

Original language: English
Pages (from-to): 227-238
Number of pages: 12
Journal: Parallel Processing Letters
Volume: 10
Issue number: 2-3
Publication status: Published - 2000 Jun 1
Externally published: Yes

Fingerprint

Data storage equipment
Computer simulation
Hardware
Program compilers
Application programming interfaces (API)
Code generation

ASJC Scopus subject areas

  • Engineering (all)

Cite this

Where does the speedup go: Quantitative modeling of performance losses in shared-memory programs. / Kim, Seon Wook.

In: Parallel Processing Letters, Vol. 10, No. 2-3, 01.06.2000, p. 227-238.

Research output: Contribution to journal › Article

@article{fda63c6013a54dd194160029636b38db,
title = "Where does the speedup go: Quantitative modeling of performance losses in shared-memory programs",
abstract = "Even fully parallel shared-memory program sections may perform significantly below the ideal speedup of P on P processors. Relatively little quantitative information is available about the sources of such inefficiencies. In this paper we present a speedup component model that is able to fully account for sources of performance loss in parallel program sections. The model categorizes the gap between measured and ideal speedup into the four components memory stalls, processor stalls, code overhead, and thread management overhead. These model components are measured based on hardware counters and timers, with which programs are instrumented automatically by our compiler. The speedup component model allows us, for the first time, to quantitatively state the reasons for less-than-optimal program performance, on a program section basis. The overhead components are chosen such that they can be associated directly with software and hardware techniques that may improve performance. Although general, our model is especially suited for the analysis of loop-oriented programs, such as those written in the OpenMP API. We have applied this model to compare three parallel code generation schemes for the Polaris parallelizing compiler. It helps us answer questions such as, what sources of inefficiencies are present in compiler-parallelized programs. To discuss the question we have also implemented an alternative, thread-based code generation method.",
author = "Kim, {Seon Wook}",
year = "2000",
month = "6",
day = "1",
language = "English",
volume = "10",
pages = "227--238",
journal = "The BMJ",
issn = "0730-6512",
publisher = "Kluwer Academic Publishers",
number = "2-3",

}

TY - JOUR

T1 - Where does the speedup go

T2 - Quantitative modeling of performance losses in shared-memory programs

AU - Kim, Seon Wook

PY - 2000/6/1

Y1 - 2000/6/1

N2 - Even fully parallel shared-memory program sections may perform significantly below the ideal speedup of P on P processors. Relatively little quantitative information is available about the sources of such inefficiencies. In this paper we present a speedup component model that is able to fully account for sources of performance loss in parallel program sections. The model categorizes the gap between measured and ideal speedup into the four components memory stalls, processor stalls, code overhead, and thread management overhead. These model components are measured based on hardware counters and timers, with which programs are instrumented automatically by our compiler. The speedup component model allows us, for the first time, to quantitatively state the reasons for less-than-optimal program performance, on a program section basis. The overhead components are chosen such that they can be associated directly with software and hardware techniques that may improve performance. Although general, our model is especially suited for the analysis of loop-oriented programs, such as those written in the OpenMP API. We have applied this model to compare three parallel code generation schemes for the Polaris parallelizing compiler. It helps us answer questions such as, what sources of inefficiencies are present in compiler-parallelized programs. To discuss the question we have also implemented an alternative, thread-based code generation method.

AB - Even fully parallel shared-memory program sections may perform significantly below the ideal speedup of P on P processors. Relatively little quantitative information is available about the sources of such inefficiencies. In this paper we present a speedup component model that is able to fully account for sources of performance loss in parallel program sections. The model categorizes the gap between measured and ideal speedup into the four components memory stalls, processor stalls, code overhead, and thread management overhead. These model components are measured based on hardware counters and timers, with which programs are instrumented automatically by our compiler. The speedup component model allows us, for the first time, to quantitatively state the reasons for less-than-optimal program performance, on a program section basis. The overhead components are chosen such that they can be associated directly with software and hardware techniques that may improve performance. Although general, our model is especially suited for the analysis of loop-oriented programs, such as those written in the OpenMP API. We have applied this model to compare three parallel code generation schemes for the Polaris parallelizing compiler. It helps us answer questions such as, what sources of inefficiencies are present in compiler-parallelized programs. To discuss the question we have also implemented an alternative, thread-based code generation method.

UR - http://www.scopus.com/inward/record.url?scp=0034197216&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0034197216&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:0034197216

VL - 10

SP - 227

EP - 238

JO - Parallel Processing Letters

JF - Parallel Processing Letters

SN - 0730-6512

IS - 2-3

ER -