Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

George C. Caragea, Alexandros Tzannes, Fuat Keceli, Rajeev Barua, Uzi Vishkin

Department of Computer Science, University of Maryland

International Journal of Parallel Programming, Volume 39, Number 5, 615-638, 2011

DOI:10.1007/s10766-011-0163-8

@article{caragea2011resource,
  title={Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores},
  author={Caragea, G.C. and Tzannes, A. and Keceli, F. and Barua, R. and Vishkin, U.},
  journal={International Journal of Parallel Programming},
  volume={39},
  number={5},
  pages={615--638},
  year={2011},
  publisher={Springer}
}


Source code package: Software Release of the XMT Environment


Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g., GPUs), where per-core power and area budgets favor numerous lighter cores with fewer resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetching increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches the hardware supports and optimizes for that limit. We implemented our algorithm in a GCC-derived compiler and evaluated its performance on an emerging fine-grained many-core architecture. Averaged across benchmarks, the RAP algorithm improves run-time by up to 40.15% over a well-known loop prefetching algorithm and by up to 34.79% over the state-of-the-art GCC implementation, depending on the hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers a 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.
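The general idea of bounding software prefetches by the hardware's capacity can be illustrated with a small C sketch. This is not the paper's RAP algorithm, whose details are not given in the abstract. It only shows, under simplifying assumptions, how a latency-based prefetch distance might be capped so that the prefetches in flight fit in a fixed number of per-core slots. The helper names (`latency_distance`, `resource_aware_distance`, `dot_prefetch`), the one-prefetch-per-element simplification, and the slot model are all hypothetical. `__builtin_prefetch` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: latency-based prefetch distance in loop iterations,
 * i.e. mem_latency_cycles / cycles_per_iter rounded up. */
static size_t latency_distance(size_t mem_latency_cycles, size_t cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (mem_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Hypothetical helper: cap the wanted distance so that num_streams arrays,
 * each with up to `distance` prefetches in flight, fit in `slots` entries
 * (assumed model of per-core prefetch storage). */
static size_t resource_aware_distance(size_t wanted, size_t num_streams, size_t slots)
{
    assert(num_streams > 0);
    size_t cap = slots / num_streams;
    return wanted < cap ? wanted : cap;
}

/* Dot product of a and b, prefetching both arrays `dist` iterations ahead.
 * For simplicity one prefetch is issued per element, not per cache line. */
static long dot_prefetch(const int *a, const int *b, size_t n, size_t dist)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (dist > 0 && i + dist < n) {
            __builtin_prefetch(&a[i + dist], 0, 3); /* read, high locality */
            __builtin_prefetch(&b[i + dist], 0, 3);
        }
        sum += (long)a[i] * b[i];
    }
    return sum;
}
```

With made-up numbers, a 200-cycle miss latency and 10 cycles per iteration give a wanted distance of 20 iterations. With two streams and 8 prefetch slots, the capped distance is 4, so the loop would be called as `dot_prefetch(a, b, n, 4)`.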

Tags: Algorithms, ASIC, Benchmarking, Computer science, Memory level parallelism, nVidia, Package, Prefetch

December 29, 2011 by hgpu
