high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Programming Heterogeneous Systems with General and Domain-Specific Frameworks

Programming Heterogeneous Systems with General and Domain-Specific Frameworks

Lukas Sommer

Technische Universität Darmstadt

Technische Universität Darmstadt, 2021

BibTeX

Download (PDF)

View

Source

Source codes

Package:

The Darmstadt Automotive Parallel HeterogeNEous (DAPHNE) Benchmark-Suite

1247

views

As chip manufacturing processes are getting ever closer to what is physically possible, the projections made by Moore’s Law and Dennard Scaling no longer hold true, and CPU performance has been stagnating over the last decade. At the same time, the performance requirements of many important application areas, ranging from machine learning to scientific computing, are increasing at exponential rates, creating a demand that CPUs cannot satisfy anymore. In order to cater the performance hunger of these applications, computer architects have turned their attention towards heterogeneous systems. By combining CPUs with one or multiple accelerators, architects are seeking to provide the necessary performance through specialization and more efficient forms of parallelism. And while the accelerators have successfully delivered on the promised performance in many cases, programming these heterogeneous systems is becoming increasingly difficult, as developers need to take multiple devices, execution models, and data transfers into account. Over the course of this cumulative dissertation, we investigate two potential solutions to the enormous challenges of heterogeneous systems programming. General programming frameworks such as OpenMP define language constructs that reflect important fundamental computing patterns and allow developers to expose an application’s parallelism to the compiler for efficient mapping to the target hardware. Domain-specific programming frameworks, on the other hand, are tailored to a single domain and provide mechanisms to capture the high-level semantics and structure of an application, which is then again mapped to the computational units of the underlying hardware in an efficient fashion. In this thesis, we discuss the merits of both approaches in detail and show implementation examples for both. For general programming frameworks, the selection of the most suitable framework for a class of applications and target platform is a crucial step. Using automotive software development as an example, we perform an implementation study to extensively compare three different frameworks. Based on the findings from this implementation study, we identify a number of key factors to assess the suitability of general programming frameworks for applications and target platforms. One popular general programming framework is OpenMP, and the target offloading capabilities added in recent versions also make it an interesting candidate for targeting FPGAs. To enable the use of OpenMP for FPGA programming, we develop the first-ever prototype for OpenMP target offloading constructs on FPGAs via High-Level Synthesis. Furthermore, we design and implement an execution model and hardware extensions for multi-threaded execution in FPGA accelerators generated through High-Level Synthesis. By combining multi-threaded execution in the generated FPGA accelerators with OpenMP target offloading as programming interface, we do not only significantly reduce idle cycles and improve performance, but also provide an easy-to-use programming interface with intuitive mechanisms for data management. In order to showcase the implementation of a domain-specific programming framework, we develop a compiler for Sum-Product Networks, a class of machine learning models. By implementing compilation flows for CPUs, GPUs and FPGAs, we are able to cover a wide range of heterogeneous system setups and achieve improvements in inference throughput of multiple orders of magnitude compared to the existing Python-based libraries. The implementation of these toolflows, which for CPU and GPU is based on the modern MLIR framework, also illustrates the role compilers play for the future of heterogeneous computing.

Tags: Computer science, CUDA, FPGA, Heterogeneous systems, Machine learning, nVidia, nVidia GeForce RTX 2070, OpenCL, OpenMP, Package, Python, Thesis

November 21, 2021 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Programming Heterogeneous Systems with General and Domain-Specific Frameworks

Package:

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Programming Heterogeneous Systems with General and Domain-Specific Frameworks

Package:

Share this:

Recent source codes

Most viewed papers (last 30 days)