
Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages

Shoaib Ashraf Kamil
EECS Department, University of California, Berkeley
University of California, Technical Report No. UCB/EECS-2012-255, 2012

@techreport{Kamil:EECS-2012-255,
   Author      = {Kamil, Shoaib Ashraf},
   Title       = {Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages},
   Institution = {EECS Department, University of California, Berkeley},
   Year        = {2012},
   Month       = {Dec},
   Number      = {UCB/EECS-2012-255},
   URL         = {http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-255.html}
}


As the complexity of machines and architectures has increased, performance tuning has become more challenging, leading to the failure of general compilers to generate the best possible optimized code. Expert performance programmers can often hand-write code that outperforms compiler-optimized low-level code by an order of magnitude. At the same time, the complexity of programs has also increased, with modern programs built on a variety of abstraction layers to manage complexity, yet these layers hinder efforts at optimization. In fact, it is common to lose one or two additional orders of magnitude in performance when going from a low-level language such as Fortran or C to a high-level language like Python, Ruby, or Matlab. General purpose compilers are limited by the inability of program analysis to determine programmer intent, as well as the lack of detailed performance models that always determine the best executable code for a given computation and architecture. The latter problem can be mitigated through auto-tuning, which generates many code variants for a particular problem and empirically determines which performs best on a given architecture.

This thesis addresses the problem of how to write programs at a high level while obtaining the performance of code written by performance experts at the low level. To do so, we build domain-specific embedded languages that generate low-level parallel code from a high-level language, and then use auto-tuning to determine the best performing low-level code. Such DSELs avoid analysis by restricting the domain while ensuring programmers specify high-level intent, and by performing empirical auto-tuning instead of modeling machine parameters. As a result, programmers write in high-level languages with portions of their code using DSELs, yet obtain performance equivalent to the best hand-optimized low-level code, across many architectures.

We present a methodology for building such auto-tuned DSELs, as well as a software infrastructure and example DSELs using the infrastructure, including a DSEL for structured grid computations and two DSELs for graph algorithms. The structured grid DSEL obtains over 80% of peak performance for a variety of benchmark kernels across different architectures, while the graph algorithm DSELs mitigate all performance loss due to using a high-level language. Overall, the methodology, infrastructure, and example DSELs point to a promising new direction for obtaining high performance while programming in a high-level language.
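The abstract describes the DSELs only in prose. The following is a minimal, illustrative Python sketch of what a structured-grid kernel written against such an embedded DSEL might look like. The StencilKernel base class, the interior_points/neighbors helpers, and the pure-Python fallback are assumptions made for this example, not an excerpt from the thesis; in a real auto-tuned DSEL the body of kernel would be introspected, translated into low-level parallel code, compiled, and tuned over code variants behind the scenes.

    # Illustrative sketch only: a hypothetical stencil DSEL embedded in Python.
    # The class and helper names are assumptions for this example, not the
    # thesis's actual API. In an auto-tuned DSEL, the kernel body would be
    # specialized to low-level parallel code and auto-tuned; here it simply
    # runs in pure Python so the example is self-contained and runnable.
    import numpy as np

    class StencilKernel:
        """Hypothetical DSEL base class exposing a restricted iteration interface."""

        def interior_points(self, grid):
            # Iterate over all non-boundary points of a 2D grid.
            nx, ny = grid.shape
            for i in range(1, nx - 1):
                for j in range(1, ny - 1):
                    yield (i, j)

        def neighbors(self, grid, point):
            # Yield the four von Neumann neighbors of a point.
            i, j = point
            yield grid[i - 1, j]
            yield grid[i + 1, j]
            yield grid[i, j - 1]
            yield grid[i, j + 1]

    class Laplacian2D(StencilKernel):
        """2D 5-point Laplacian expressed in the DSEL's restricted vocabulary."""

        def kernel(self, in_grid, out_grid):
            for p in self.interior_points(out_grid):
                out_grid[p] = sum(self.neighbors(in_grid, p)) - 4.0 * in_grid[p]

    if __name__ == "__main__":
        a = np.random.rand(64, 64)
        b = np.zeros_like(a)
        Laplacian2D().kernel(a, b)   # pure-Python stand-in for the specialized kernel
        print(b[1:4, 1:4])

In this style, the application programmer writes only the Laplacian2D class; because the kernel is restricted to the DSEL's interface, its intent is explicit and the specialization and auto-tuning machinery can generate and select optimized low-level variants without general-purpose program analysis.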