TY - GEN
T1 - An FPGA approach to quantifying coherence traffic efficiency on multiprocessor systems
AU - Suh, Taeweon
AU - Lu, Shih Lien
AU - Lee, Hsien Hsin S.
PY - 2007
Y1 - 2007
N2 - Recently, there has been a surge of interest in using FPGAs for computer architecture research, with applications ranging from emulating and analyzing a new platform to accelerating microarchitectural simulation speed for design space exploration. This paper proposes and demonstrates a novel usage of FPGAs for measuring the efficiency of coherence traffic of an actual computer system. Our approach employs an FPGA acting as a bus agent, interacting with a real CPU in a dual-processor system to measure the intrinsic delay of coherence traffic. This technique eliminates non-deterministic factors in the measurement, such as the arbitration delay and stalls in the pipelined bus. It completely isolates the impact of pure coherence traffic delay on system performance while executing workloads natively. Our experiments show that the overall execution time of the benchmark programs on a system with coherence traffic was actually increased over one without coherence traffic. This indicates that cache-to-cache transfers are less efficient in an Intel-based server system, and that there is room for further improvement, such as the inclusion of the O state and cache line buffers in the memory controller.
AB - Recently, there has been a surge of interest in using FPGAs for computer architecture research, with applications ranging from emulating and analyzing a new platform to accelerating microarchitectural simulation speed for design space exploration. This paper proposes and demonstrates a novel usage of FPGAs for measuring the efficiency of coherence traffic of an actual computer system. Our approach employs an FPGA acting as a bus agent, interacting with a real CPU in a dual-processor system to measure the intrinsic delay of coherence traffic. This technique eliminates non-deterministic factors in the measurement, such as the arbitration delay and stalls in the pipelined bus. It completely isolates the impact of pure coherence traffic delay on system performance while executing workloads natively. Our experiments show that the overall execution time of the benchmark programs on a system with coherence traffic was actually increased over one without coherence traffic. This indicates that cache-to-cache transfers are less efficient in an Intel-based server system, and that there is room for further improvement, such as the inclusion of the O state and cache line buffers in the memory controller.
UR - http://www.scopus.com/inward/record.url?scp=48149102869&partnerID=8YFLogxK
U2 - 10.1109/FPL.2007.4380624
DO - 10.1109/FPL.2007.4380624
M3 - Conference contribution
AN - SCOPUS:48149102869
SN - 1424410606
SN - 9781424410606
T3 - Proceedings - 2007 International Conference on Field Programmable Logic and Applications, FPL
SP - 47
EP - 53
BT - Proceedings - 2007 International Conference on Field Programmable Logic and Applications, FPL
T2 - 2007 International Conference on Field Programmable Logic and Applications, FPL
Y2 - 27 August 2007 through 29 August 2007
ER -