Trellis: Portability Across Architectures with a High-level Framework

Lukasz G. Szafaryn, Todd Gamblin, Bronis R. de Supinski, Kevin Skadron
University of Virginia
Journal of Parallel and Distributed Computing, 2013

The increasing computational needs of parallel applications inevitably require portability across parallel architectures, which now include heterogeneous processing resources, such as CPUs and GPUs, and multiple SIMD/SIMT widths. However, the lack of a common parallel programming paradigm that provides predictable, near-optimal performance on each resource leads to the use of low-level frameworks with architecture-specific optimizations, which in turn causes the code base to diverge and makes porting difficult. Our experiences with parallel applications and frameworks lead us to conclude that achieving performance portability requires structured code, a common set of high-level directives, and efficient mapping onto hardware. To demonstrate this concept, we develop Trellis, a prototype programming framework that allows the programmer to maintain a single generic and structured codebase that executes efficiently on both the CPU and the GPU. Our approach annotates such code with a single set of high-level directives, derived from both OpenMP and OpenACC, that is compatible with both architectures. Most importantly, motivated by the limitations of the OpenACC compiler in transforming such code into a GPU kernel, we introduce a thread synchronization directive and a set of transformation techniques that allow us to obtain GPU code with the desired parallelization and, as a result, better performance. While a common high-level programming framework for both CPU and GPU is currently not available, our analysis shows that even obtaining the best-case performance with OpenACC, the state-of-the-art solution for GPUs, requires modifying the structure of codes to properly exploit braided parallelism and to cope with conditional statements or serial sections. Although this already requires prior knowledge of compiler behavior, optimal performance is still unattainable because OpenACC lacks thread synchronization.
We describe how Trellis addresses these problems by showing that it achieves correct parallelization of the original codes for three parallel applications, with performance competitive with OpenMP and CUDA, improved programmability, and reduced overall code length.
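The abstract does not give Trellis's directive syntax, so the following is only a minimal sketch of the kind of code pattern it refers to, written in standard OpenMP C rather than in Trellis: two data-parallel loops and a serial section inside one parallel region, kept correct by the implicit barriers OpenMP places at the end of each `for` and `single` construct. The abstract attributes the remaining OpenACC performance gap to a lack of synchronization, and an in-region barrier of this kind appears to be what Trellis's thread synchronization directive is meant to provide on the GPU. The function `normalize` and its data are hypothetical illustrations, not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example (plain OpenMP, not Trellis syntax): two data-parallel
 * phases and a serial step inside ONE parallel region. Compiled without
 * OpenMP support, the pragmas are ignored and the function runs serially
 * with the same result. Returns the sum of squares of v[] and overwrites
 * each v[i] with v[i]^2 / sum. */
double normalize(size_t n, double *v)
{
    assert(n == 0 || v != NULL);
    double sum = 0.0; /* shared: combined by the reduction in phase 1 */
    double inv = 0.0; /* shared: written once in the serial section */

    #pragma omp parallel
    {
        /* Phase 1 (parallel): square each element and accumulate the total. */
        #pragma omp for reduction(+ : sum)
        for (size_t i = 0; i < n; i++) {
            v[i] = v[i] * v[i];
            sum += v[i];
        }
        /* Implicit barrier: the reduced sum is now visible to every thread. */

        /* Serial section: a single thread computes the scale factor. */
        #pragma omp single
        inv = (sum > 0.0) ? 1.0 / sum : 0.0;
        /* Implicit barrier: no thread reads inv before it has been written. */

        /* Phase 2 (parallel): rescale every element by the shared factor. */
        #pragma omp for
        for (size_t i = 0; i < n; i++) {
            v[i] *= inv;
        }
    }
    return sum;
}
```

Calling `normalize(4, v)` with `v = {1, 2, 3, 4}` returns 30 and leaves `v` holding each square scaled by 1/30. Keeping both loops and the serial step in one region avoids launching a new parallel region per phase; without in-region barriers, the usual workaround is to split the phases into separate regions or kernels.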