Automatic transformation and optimization of applications on GPUs and GPU clusters

Wenjing Ma
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
The Ohio State University, 2011


@phdthesis{ma2011automatic,
   title={Automatic transformation and optimization of applications on GPUs and GPU clusters},
   author={Ma, Wenjing},
   school={The Ohio State University},
   year={2011}
}

Modern accelerators and multi-core architectures offer significant computing power at very modest cost. Given this trend, an important research issue on the software side is how to make the best use of these devices, and how to achieve high performance without requiring users to invest heavily in learning the architecture and programming model. Our goal is to address this problem by developing automatic code generation systems, particularly for GPUs and GPU clusters. We believe that by focusing on specific application classes, the task of automatic code generation can be greatly simplified. We therefore built code generation and optimization systems for two classes of applications: data-intensive applications with generalized reductions, and tensor contraction functions.

First, we focused on data-intensive applications whose processing structure follows generalized reductions. In the code generation systems we built, the user input is an algorithm written in a high-level language, specifically C or MATLAB. Program analysis and code generation are performed to produce code for a single GPU or a GPU cluster. The three systems we built are GREENRIDE, a code generation system that provides GPU support for C programs; GMAT-DM, which translates MATLAB code into GPU-executable programs; and AUTO-GC, which provides GPU support for clusters by incorporating code generation for FREERIDE, a middleware supporting parallel execution of data mining applications.

For tensor contractions, we evaluated the automatically generated code on different GPUs and investigated algorithm optimizations for each card. This work led to an auto-tuning framework that selects algorithms and parameters according to a cost model, with thresholds extracted from simple micro-benchmarks. We also developed a loop transformation system for multi-level memory hierarchies.
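The generalized-reduction structure targeted by these systems can be sketched as follows. This is a minimal, illustrative Python rendering (the names and helper functions are assumptions, not the thesis's actual API): each input element maps to a cell of a reduction object and updates it with a commutative, associative operation, which is what allows a GPU backend to give each thread a private copy and merge the copies at the end.

```python
def generalized_reduction(data, assign, accumulate, reduction_object):
    # Sequential form of the pattern; a generated GPU version would keep
    # per-thread private copies of reduction_object and merge them at the end.
    for element in data:
        key = assign(element)  # which reduction cell this element maps to
        reduction_object[key] = accumulate(reduction_object[key], element)
    return reduction_object

# Example: a histogram, a classic generalized reduction similar in structure
# to the data mining codes (e.g. k-means) that FREERIDE targets.
hist = generalized_reduction(
    data=[0.1, 0.5, 0.52, 0.9],
    assign=lambda x: int(x * 4),        # bucket index in [0, 3]
    accumulate=lambda acc, x: acc + 1,  # count elements per bucket
    reduction_object=[0, 0, 0, 0],
)
# hist is now [1, 0, 2, 1]
```

Because `accumulate` is associative and commutative, per-thread partial histograms can be combined in any order, which is the property the code generators exploit for parallel execution.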
By focusing on the dominant factors of the computation, we were able to eliminate a large portion of the extra data movement between levels of the memory hierarchy. In the future, we plan to extend our work in the following directions. The code generation system for data-intensive applications with reduction patterns could be applied to, and optimized for, other classes of applications. The integer programming model could also be used for other architectures, including future accelerators. We would like to extend the loop transformation approach to heterogeneous systems. The auto-tuning framework will be extended to include more parameters, enabling further performance gains.
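The threshold-based selection used by the auto-tuning framework can be illustrated with a small sketch. The cost models and names below are hypothetical (the thesis's actual model is not reproduced here): micro-benchmarks fit a per-variant cost function, and the tuner records the problem size at which one algorithm variant overtakes another, then uses that threshold at run time.

```python
def find_threshold(cost_a, cost_b, sizes):
    """Smallest problem size at which variant B's modeled cost beats variant A's."""
    for n in sizes:
        if cost_b(n) < cost_a(n):
            return n
    return None  # B never wins in the measured range

def select_variant(n, threshold):
    # At run time, only a comparison against the precomputed threshold is needed.
    return "variant_b" if threshold is not None and n >= threshold else "variant_a"

# Illustrative costs from micro-benchmarks: A has low fixed overhead,
# B has higher setup cost but better per-element scaling.
cost_a = lambda n: 10 + 2.0 * n
cost_b = lambda n: 100 + 0.5 * n
threshold = find_threshold(cost_a, cost_b, range(1, 1000))
```

Extracting thresholds offline keeps the run-time decision cheap: the generated code only compares the problem size against a constant instead of evaluating the full cost model on every call.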
