high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Source-to-source transformations for irregular and multithreaded code optimization

Source-to-source transformations for irregular and multithreaded code optimization

Julien Jaeger

University of Versailles, Saint-Quentin-en-Yvelines

University of Versailles, 2012

@article{jaeger2012source,

title={Source-to-source transformations for irregular and multithreaded code optimization},

author={JAEGER, J.},

year={2012}

}

Download (PDF)

View

Source

1527

views

Source-to-Source optimization is an efficient method to generate, from a basic implementation, a high performance program for the two main challenges that are irregular codes and heterogeneous implementation. In the last decade, general purpose CPUs moved towards multi-core architectures, and the end of the increase in processors frequency marked a turning point obtaining the best performance of a single chip, achieved only when efficiently considering the parallelism inside the chip. The optimization process is now a paramount key to have continuously increasing speed-up on newest architectures. Parallelization on a single chip brings new problems to consider, with the integration of different cache level on the chip, and having several threads running simultaneously and accessing to shared resources. Such coexistence implies that the different levels of parallelism (vector, Instruction Level Parallelism, threads, memory access) interacts more than ever, and optimization for high performance should consider all levels. A second paradigm shift occurs with the generalization of hardware accelerators and heterogeneous machines, requiring expertise in all architectures composing the heterogeneous system when generating an efficient code for the target. The complication of hardware architectures provides many challenges in the HPC area, especially for irregular codes, whether irregular in data access or control flow, since generating efficient version for such code on an only core remains difficult. In this dissertation, we will provide methods to generate efficient codes from an initial implementation for irregular programs and heterogeneous parallelizations. The remaining of Chapter 1 presents the evolution of machine architecture from the first scalar computer to nowadays multi-core and heterogeneous systems, the most used source-to-source optimizations and loop transformations, and an insight in hardware behaviour of vectorized computations. Chapter 2 describes our CPC framework, extracting codelets from an irregular codes, optimizing these codelets regardless the overall program, then predicting the overall speed-up of the all system. In Chapter 3, we develop methods, with more or less complexity and memory impact, to address alignment issues, due to vectorization or bank conflicts. We apply our methods on symptomatic stencil cases, and provide along with these methods an algorithm using them to generate heterogeneous codes for CPUs and GPUs. Parallelization techniques are discussed in Chapter 4 with the presentation of two works, one addressing the generation of parallelized codelets, the second scheduling sequential tasks on an heterogeneous system. To conclude, Chapter 5 will remind the contribution of the dissertation, and discuss the improvement and future development possible concerning the presented works.

Tags: Code generation, Computer science, CUDA, Heterogeneous systems, nVidia, nVidia Quadro FX 5800, Thesis

July 24, 2012 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Source-to-source transformations for irregular and multithreaded code optimization

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)

Source-to-source transformations for irregular and multithreaded code optimization

Share this:

Recent source codes

Most viewed papers (last 30 days)