Data Layout Transformation Exploiting Memory-Level Parallelism in Structured Grid Many-Core Applications

I-Jui Sung, John A. Stratton, Wen-Mei W. Hwu
Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT) 2010, Vienna, Austria, September 11-15, 2010


   author={Sung, I-Jui and Stratton, John A. and Hwu, Wen-Mei W.},
   title={Data layout transformation exploiting memory-level parallelism in structured grid many-core applications},
   booktitle={Proceedings of the 19th international conference on Parallel architectures and compilation techniques},
   series={PACT ’10},
   location={Vienna, Austria},
   address={New York, NY, USA},
   keywords={GPU, data layout transformation, parallel programming}





We present automatic data layout transformation as an effective compiler performance optimization for memory-bound structured grid applications. Structured grid applications include stencil codes and other code structures using a dense, regular grid as the primary data structure. Fluid dynamics and heat distribution, which both solve partial differential equations on a discretized representation of space, are representative of many important structured grid applications. Using the information available through variable-length array syntax, standardized in C99 and other modern languages, we have enabled automatic data layout transformations for structured grid codes with dynamically allocated arrays. We also present how a tool can guide these transformations to statically choose a good layout given a model of the memory system, using a modern GPU as an example. A transformed layout that distributes concurrent memory requests among parallel memory system components provides substantial speedup for structured grid applications by improving their achieved memory-level parallelism. Even with the overhead of more complex address calculations, we observe performance increases of up to 560% over the language-defined layout, and a 7% performance gain in the worst case, in which the language-defined layout and access pattern are already well-vectorizable by the underlying hardware.
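The index arithmetic behind a layout transformation can be illustrated with a minimal sketch. Everything named below (the grid dimensions, the per-cell field count `nf`, and the function names) is a hypothetical choice for illustration. The two layouts shown, fields innermost versus fields outermost, are a generic example of changing the address calculation and are not the specific layouts selected by the authors' tool.

```python
# Illustrative sketch only: names and layouts are assumptions, not the paper's tool.

def aos_offset(y, x, f, ny, nx, nf):
    """Row-major offset for a grid declared as A[ny][nx][nf] (fields innermost).

    This is what a language-defined layout gives for that declaration.
    """
    return (y * nx + x) * nf + f


def soa_offset(y, x, f, ny, nx, nf):
    """Offset after moving the field index outermost, as if declared A[nf][ny][nx]."""
    return (f * ny + y) * nx + x


def stride_between_neighbors(offset_fn, ny, nx, nf):
    """Distance in elements between the same field of cells x and x + 1."""
    return offset_fn(0, 1, 0, ny, nx, nf) - offset_fn(0, 0, 0, ny, nx, nf)


def is_bijection(offset_fn, ny, nx, nf):
    """Check that the layout maps every (y, x, f) to a distinct in-range offset."""
    offsets = {
        offset_fn(y, x, f, ny, nx, nf)
        for y in range(ny)
        for x in range(nx)
        for f in range(nf)
    }
    return offsets == set(range(ny * nx * nf))


if __name__ == "__main__":
    ny, nx, nf = 4, 8, 5  # hypothetical grid: 4 x 8 cells, 5 values per cell
    for name, fn in (("fields innermost", aos_offset), ("fields outermost", soa_offset)):
        print(name, "stride:", stride_between_neighbors(fn, ny, nx, nf),
              "valid:", is_bijection(fn, ny, nx, nf))
```

When neighboring threads each read the same field of neighboring cells, the stride between their addresses determines how the concurrent requests fall across the memory system. Both functions are valid permutations of the same storage, so a compiler that knows the array dimensions can substitute one address calculation for the other without changing program results.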