Decoupling local variable accesses in a wide-issue superscalar processor

Sangyeun Cho, Pen Chung Yew, Kyung Ho Lee

Research output: Chapter in Book/Report/Conference proceeding › Chapter

34 Citations (Scopus)

Abstract

Providing adequate data bandwidth is extremely important for a wide-issue superscalar processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add significantly to hardware complexity. This paper takes an alternative or complementary approach to providing more data bandwidth, called the data-decoupled architecture. The approach, with support from the compiler and/or hardware, partitions the memory stream into two independent streams early in the processor pipeline and feeds each stream to a separate memory access queue and cache. Under this model, the paper studies the potential of decoupling memory accesses to a program's local variables that are allocated on the run-time stack. Using a set of integer and floating-point programs from the SPEC95 benchmark suite, it is shown that local variable accesses constitute a large portion of all memory references, while their reference space is very small, averaging around 7 words per (static) procedure. To service local variable accesses quickly, two optimizations, fast data forwarding and access combining, are proposed and studied. Important design parameters, such as the cache size, the number of cache ports, and the degree of access combining, are studied through simulation. The potential performance of the proposed scheme is measured under various configurations, and it is concluded that the scheme can become a viable alternative to building a single multi-ported data cache.
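The stream partitioning and access combining described in the abstract can be illustrated with a toy Python sketch. It is not the authors' design: the abstract does not say how local variable accesses are identified or how combining is implemented. The sketch assumes that an access is "local" when its base register is a stack or frame pointer, that the base register is cache-line aligned, and that combining merges adjacent accesses to the same line. All names here (`MemAccess`, `partition`, `combine`, `STACK_BASE_REGS`, `line_bytes`, `degree`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: stack-relative accesses are recognised by their base register.
STACK_BASE_REGS = {"sp", "fp"}


@dataclass(frozen=True)
class MemAccess:
    op: str        # "load" or "store"
    base_reg: str  # base register of the address
    offset: int    # byte offset from the base register


def partition(stream: List[MemAccess]) -> Tuple[List[MemAccess], List[MemAccess]]:
    """Split one memory stream into (local-variable queue, general queue)."""
    local_q: List[MemAccess] = []
    general_q: List[MemAccess] = []
    for acc in stream:
        if acc.base_reg in STACK_BASE_REGS:
            local_q.append(acc)
        else:
            general_q.append(acc)
    return local_q, general_q


def combine(queue: List[MemAccess], line_bytes: int = 16,
            degree: int = 2) -> List[List[MemAccess]]:
    """Group up to `degree` adjacent accesses of the same kind that fall in
    the same cache line, so that each group could share one port access.
    Assumes the base register value is line-aligned (illustration only)."""
    groups: List[List[MemAccess]] = []
    for acc in queue:
        last = groups[-1] if groups else None
        if (last is not None
                and len(last) < degree
                and last[0].op == acc.op
                and last[0].base_reg == acc.base_reg
                and last[0].offset // line_bytes == acc.offset // line_bytes):
            last.append(acc)
        else:
            groups.append([acc])
    return groups


stream = [
    MemAccess("load", "sp", 0),
    MemAccess("load", "sp", 4),
    MemAccess("store", "r3", 0),
    MemAccess("load", "sp", 32),
    MemAccess("load", "fp", -8),
]
local_q, general_q = partition(stream)
groups = combine(local_q)
```

In this example the four stack-relative accesses go to the local queue and the `r3`-based store goes to the general queue; the first two loads share a line and are grouped, so the local queue needs three port accesses rather than four. The `degree` parameter loosely mirrors the "degree of access combining" that the abstract lists as a design parameter.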

Original language: English
Title of host publication: Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA
Publisher: IEEE
Pages: 100-110
Number of pages: 11
Publication status: Published - 1999
Externally published: Yes
Event: Proceedings of the 1999 26th Annual International Symposium on Computer Architecture - ISCA '99 - Atlanta, GA, USA
Duration: 1999 May 2 to 1999 May 4

Other

Other: Proceedings of the 1999 26th Annual International Symposium on Computer Architecture - ISCA '99
City: Atlanta, GA, USA
Period: 99/5/2 to 99/5/4

Fingerprint

Data storage equipment
Bandwidth
Computer hardware
Pipelines
Hardware

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Cho, S., Yew, P. C., & Lee, K. H. (1999). Decoupling local variable accesses in a wide-issue superscalar processor. In Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA (pp. 100-110). IEEE.

@inbook{447fcb4db84041489197e5322a476bde,
title = "Decoupling local variable accesses in a wide-issue superscalar processor",
author = "Sangyeun Cho and Yew, {Pen Chung} and Lee, {Kyung Ho}",
year = "1999",
language = "English",
pages = "100--110",
booktitle = "Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA",
publisher = "IEEE",

}

TY - CHAP

T1 - Decoupling local variable accesses in a wide-issue superscalar processor

AU - Cho, Sangyeun

AU - Yew, Pen Chung

AU - Lee, Kyung Ho

PY - 1999

Y1 - 1999

UR - http://www.scopus.com/inward/record.url?scp=0032686331&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032686331&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:0032686331

SP - 100

EP - 110

BT - Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA

PB - IEEE

ER -