high performance computing on graphics processing units: hgpu.org

System-Level Optimization and Code Generation for Graphics Processors using a Domain-Specific Language

Bo Qiao

Friedrich-Alexander-Universität Erlangen-Nürnberg

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), 2021

Source code package: From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization

As graphics processing units (GPUs) are increasingly used for general-purpose processing, efficient tooling for programming such parallel architectures becomes essential. Despite continuous programmability improvements in CUDA and OpenCL, they remain relatively low-level languages and require in-depth architecture knowledge to achieve high-performance implementations. Developers have to manage memory manually to exploit the multi-layered compute and memory hierarchy. Such hand-tuned expert implementations suffer from poor performance portability: existing implementations are not guaranteed to be efficient on new architectures, and developers have to repeat the tedious tuning and optimization for every architecture. To circumvent this issue, developers can use high-performance libraries offered by hardware vendors and open-source communities. Libraries are performance portable because maintaining the implementation is the library developer’s job. However, they lack programmability. Library functions come with pre-defined APIs, and the level of abstraction may not be sufficient for developers of a certain domain. Furthermore, library-based implementations preclude system-level optimizations across different functions. In this thesis, we present a domain-specific language (DSL) approach that achieves both performance portability and programmability within a particular domain. This is possible by exploiting domain-specific abstractions and combining them with architecture-specific optimizations. The abstractions enable programmability and flexibility for domain developers, and the compiler-based optimization facilitates performance portability across different architectures. The core of such a DSL approach is its optimization engine, which combines algorithm and hardware knowledge to explore the optimization space efficiently.
Our contributions in this thesis target system-level optimizations and code generation for GPU architectures. Today’s applications, such as those in image processing and machine learning, grow in complexity and consist of many kernels in a computation pipeline. Optimizing each kernel individually is no longer sufficient due to the rapid evolution of modern GPU architectures. Each architecture generation delivers higher computing power as well as higher memory bandwidth. Nevertheless, computing power generally increases faster than memory bandwidth. As a result, good locality is essential to achieve high-performance implementations. For example, the inter-kernel communications within an image processing pipeline are intensive and exhibit many opportunities for locality improvement. As the first contribution, we present a technique called kernel fusion to reduce the number of memory accesses to the slow GPU global memory. In addition, we automate the transformation in our source-to-source compiler by combining domain knowledge in image processing and architecture knowledge of GPUs. Another trend we can observe in recent architecture development is the increasing number of CUDA cores and streaming multiprocessors (SMs) for computation. Traditionally, GPU programming is about exploiting data-level parallelism. Following the single instruction, multiple threads (SIMT) execution model, data can be mapped to threads to benefit from the massive computing power. Nevertheless, small images that were considered costly on older architectures can no longer fully occupy the device on new GPUs. It becomes important to also explore kernel-level parallelism that can efficiently utilize the growing number of compute resources on the GPU. As the second contribution, we present concurrent kernel execution techniques to enable fine-grained resource sharing within the compute SMs.
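To make the kernel fusion idea concrete, the following minimal Python sketch contrasts two point operators run as separate passes with a single fused pass. It is an illustrative model only, not Hipacc code and not the implementation from the thesis: the `GlobalBuffer` class, the operators `brighten` and `threshold`, and the access counter are all hypothetical names introduced here. The counter simply models how many element reads and writes reach the slow buffer.

```python
# Illustrative model only: a Python list stands in for GPU global memory,
# and a counter records how many element reads/writes touch it.

class GlobalBuffer:
    """Hypothetical stand-in for an image stored in global memory."""

    def __init__(self, data):
        self.data = list(data)
        self.accesses = 0  # counts element reads and writes

    def load(self, i):
        self.accesses += 1
        return self.data[i]

    def store(self, i, value):
        self.accesses += 1
        self.data[i] = value


def brighten(pixel):
    # Hypothetical point operator: add a constant offset, saturate at 255.
    return min(pixel + 10, 255)


def threshold(pixel):
    # Hypothetical point operator: binarize around 128.
    return 255 if pixel >= 128 else 0


def run_unfused(src):
    """Two 'kernels': the intermediate image round-trips through global memory."""
    n = len(src.data)
    tmp = GlobalBuffer([0] * n)
    dst = GlobalBuffer([0] * n)
    for i in range(n):  # kernel 1 writes the intermediate image
        tmp.store(i, brighten(src.load(i)))
    for i in range(n):  # kernel 2 reads it back
        dst.store(i, threshold(tmp.load(i)))
    return dst, src.accesses + tmp.accesses + dst.accesses


def run_fused(src):
    """One fused 'kernel': the intermediate value stays in a local variable."""
    n = len(src.data)
    dst = GlobalBuffer([0] * n)
    for i in range(n):
        value = brighten(src.load(i))  # plays the role of a register
        dst.store(i, threshold(value))
    return dst, src.accesses + dst.accesses


pixels = [0, 100, 120, 127, 200, 250]
unfused_out, unfused_accesses = run_unfused(GlobalBuffer(pixels))
fused_out, fused_accesses = run_fused(GlobalBuffer(pixels))
```

In this model, both variants produce the same output, while the fused variant performs half as many buffer accesses because the intermediate image is never stored. The sketch covers point operators only; fusing local operators, where a consumer needs a neighbourhood of producer values, is not modelled here.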
In addition, we compare different implementation variants and develop analytic models to predict the suitable option based on algorithmic and architectural knowledge. After considering locality and parallelism, which are the two most essential optimization objectives on modern GPU architectures, we can start examining the possibilities to optimize the computations within an algorithm. As the third contribution in this thesis, we present single-kernel optimization techniques for the two most commonly used compute patterns in image processing, namely local and global operators. For local operators, we present a systematic analysis of an efficient border handling technique based on iteration space partitioning. We use domain and architecture knowledge to capture the trade-off between occupancy and instruction usage reduction. Our analytic model assists the transformation in the source-to-source compiler in deciding on the better implementation variant and improves the end-to-end code generation. For global operators, we present an efficient approach to perform global reductions on GPUs. Our approach benefits from the continuous effort of performance and programmability improvement by hardware vendors, for example, by utilizing new low-level primitives from Nvidia. The proposed techniques cover not only multi-kernel but also single-kernel optimization, and are seamlessly integrated into our image processing DSL and source-to-source compiler called Hipacc. In the end, the presented DSL framework can drastically improve the productivity of domain developers aiming for high-performance GPU implementations.
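The border handling idea for local operators can be illustrated with a small one-dimensional Python sketch. It is a simplified model under stated assumptions, not the code Hipacc generates: the function names, the clamp-to-edge border mode, and the 3-tap sum filter are hypothetical choices made for this example. The partitioned variant performs border checks only in the border regions, whereas the naive variant checks at every tap.

```python
def clamp(i, n):
    # Clamp-to-edge border mode (an assumed choice for this sketch).
    return max(0, min(i, n - 1))


def filter_naive(src, radius):
    """Every tap performs a border check, even deep inside the image."""
    n = len(src)
    out, checks = [], 0
    for x in range(n):
        acc = 0
        for dx in range(-radius, radius + 1):
            checks += 1
            acc += src[clamp(x + dx, n)]
        out.append(acc)
    return out, checks


def filter_partitioned(src, radius):
    """Split the iteration space into left border, interior, right border."""
    n = len(src)
    out, checks = [0] * n, 0
    lo = min(radius, n)        # first index whose window has no left overhang
    hi = max(lo, n - radius)   # first index whose window has right overhang
    border = list(range(0, lo)) + list(range(hi, n))
    for x in border:           # border regions: checked accesses
        acc = 0
        for dx in range(-radius, radius + 1):
            checks += 1
            acc += src[clamp(x + dx, n)]
        out[x] = acc
    for x in range(lo, hi):    # interior: the window is always in bounds
        out[x] = sum(src[x - radius:x + radius + 1])
    return out, checks


image_row = [1, 2, 3, 4, 5, 6, 7, 8]
naive_out, naive_checks = filter_naive(image_row, radius=1)
part_out, part_checks = filter_partitioned(image_row, radius=1)
```

Both variants compute the same result, but the partitioned one performs far fewer border checks. The sketch counts checks only; it does not model GPU occupancy, which is the other side of the trade-off the abstract mentions.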

Tags: Code generation, Computer science, CUDA, Image processing, Machine learning, nVidia, nVidia GeForce GTX 680, nVidia GeForce RTX 2080, OpenCL, Package, performance portability, Tesla K20, Thesis

January 2, 2022 by hgpu
