Automatic transformation and optimization of applications on GPUs and GPU clusters

Wenjing Ma
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
The Ohio State University, 2011









Modern accelerators and multi-core architectures offer significant computing power at a very modest cost. An important research issue on the software side is therefore how to make the best use of these devices, and how to deliver high performance without requiring users to invest heavily in learning the architecture and the programming model. Our goal is to address this problem by developing automatic code generation systems, particularly for GPUs and GPU clusters. We believe that focusing on specific application classes significantly simplifies the task of automatic code generation. We therefore built code generation and optimization systems for two classes of applications: data-intensive applications with generalized reductions, and tensor contraction functions.

First, we focused on a class of data-intensive applications whose processing structure is a generalized reduction. In the code generation systems we have built, the user input is an algorithm written in a high-level language, specifically C or MATLAB. Program analysis and code generation are then performed to generate code for a single GPU or a GPU cluster. The three specific systems we have built are GREENRIDE, a code generation system that provides GPU support for C programs; GMAT-DM, which translates MATLAB code into GPU-executable programs; and AUTO-GC, which provides GPU support for clusters by incorporating code generation for FREERIDE, a middleware that supports parallel execution for data mining.

For tensor contractions, we evaluated the automatically generated code on different GPUs and investigated algorithm optimizations for each card. This led to an auto-tuning framework that selects algorithms and parameters according to a cost model and thresholds extracted from simple micro-benchmarks. We also developed a loop transformation system for multi-level memory hierarchies. By focusing on the dominant factors of the computation, we were able to remove a large portion of the extra data movement between levels of the memory hierarchy.

In the future, we plan to extend our work in the following directions. The code generation system for data-intensive applications with reduction patterns could be applied to, and optimized for, other classes of applications. The integer programming model could also be used for other architectures, including future accelerators. We would like to consider heterogeneous systems for the loop transformation approach. The auto-tuning framework will be extended to include more parameters, enabling larger performance gains.
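
To make the term "generalized reduction" concrete, the following is a minimal sequential sketch in C, not code produced by any of the systems named above. It assumes a k-means-style assignment step over 1-D points as the example application. The names `ReductionObject`, `reduce_chunk`, `combine`, and `generalized_reduction` are hypothetical and chosen only for illustration. Each chunk of the data updates its own private reduction object with associative, commutative operations, and the private copies are then merged, which is the structure that makes such loops amenable to parallel execution.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: number of clusters in a k-means-style step. */
#define NUM_CLUSTERS 2

/* Hypothetical reduction object: per-cluster running sums and counts. */
typedef struct {
    double sum[NUM_CLUSTERS];
    size_t count[NUM_CLUSTERS];
} ReductionObject;

/* Return the index of the center closest to x (1-D points for brevity). */
size_t nearest_center(double x, const double centers[NUM_CLUSTERS]) {
    size_t best = 0;
    double best_dist = (x - centers[0]) * (x - centers[0]);
    for (size_t c = 1; c < NUM_CLUSTERS; ++c) {
        double d = (x - centers[c]) * (x - centers[c]);
        if (d < best_dist) {
            best_dist = d;
            best = c;
        }
    }
    return best;
}

/* Local reduction: process one chunk and accumulate into a private object.
 * The updates (+=, ++) are associative and commutative, so the order in
 * which elements are processed does not change the counts. */
void reduce_chunk(const double *data, size_t n,
                  const double centers[NUM_CLUSTERS], ReductionObject *out) {
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        size_t c = nearest_center(data[i], centers);
        out->sum[c] += data[i];
        out->count[c] += 1;
    }
}

/* Global combination: merge one private reduction object into another. */
void combine(ReductionObject *dst, const ReductionObject *src) {
    for (size_t c = 0; c < NUM_CLUSTERS; ++c) {
        dst->sum[c] += src->sum[c];
        dst->count[c] += src->count[c];
    }
}

/* Driver: split the input into num_chunks pieces, reduce each piece into
 * its own object, then combine. The chunks are processed one after another
 * here; on a GPU each chunk would map to a thread or thread block. */
ReductionObject generalized_reduction(const double *data, size_t n,
                                      const double centers[NUM_CLUSTERS],
                                      size_t num_chunks) {
    assert(num_chunks > 0);
    ReductionObject result = {{0.0}, {0}};
    size_t chunk = (n + num_chunks - 1) / num_chunks; /* ceiling division */
    for (size_t start = 0; start < n; start += chunk) {
        size_t len = (n - start < chunk) ? (n - start) : chunk;
        ReductionObject local = {{0.0}, {0}};
        reduce_chunk(data + start, len, centers, &local);
        combine(&result, &local);
    }
    return result;
}
```

Calling `generalized_reduction` on the points 1, 2, 9, 10, 11 with centers 0 and 10 and two chunks assigns two points to the first cluster and three to the second. Because the local objects are independent until `combine` runs, the chunk loop is the part a code generator could distribute across GPU threads or cluster nodes.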
