high performance computing on graphics processing units: hgpu.org

Improving the Performance, Portability, and Productivity of Hardware Accelerators

Pablo Antonio Martínez Sánchez

Universidad de Murcia

Universidad de Murcia, 2023

@article{martinez2023improving,
  title={Improving the Performance, Portability, and Productivity of Hardware Accelerators},
  author={Mart{\'i}nez S{\'a}nchez, Pablo Antonio},
  journal={Proyecto de investigaci{\'o}n},
  year={2023},
  publisher={Universidad de Murcia}
}

With the end of Moore’s Law and Dennard scaling, attention is shifting to new ways of enhancing computer performance. Improving microprocessor performance is becoming increasingly difficult, while demand for computational power keeps growing rapidly. In recent years, we have been witnessing a paradigm change: rather than using a single chip, the CPU, to compute everything, computers are evolving into more heterogeneous organizations. In this new configuration, multiple specialized chips compute specific workloads while the CPU orchestrates them and is used for actual computation only when no other chip can be used. These specialized chips are usually called accelerators. Because they are highly specialized, their architectures still have considerable room for improvement, unlike CPUs. Accelerators are far more efficient than CPUs in terms of performance, energy consumption, or both.

Like multicores, accelerators bring great benefits to computer performance but also notable challenges to the programming workflow. In environments with multiple accelerators, writing code for each of them is very inefficient, since each accelerator is programmed in a different language. Performance is also a concern, because programming languages often struggle to exploit the full potential of the hardware. Lastly, portability is complicated, because a program designed for a specific accelerator cannot be executed on a different one. Designing programming languages that provide productivity, performance, and portability is known as the P3 problem. To tackle it, in this thesis we have studied how two different single-source programming languages perform in real-world scenarios. After studying them in each of the three P3 categories, we found that they struggle to achieve good performance, portability, and productivity at the same time.
Therefore, we have proposed a new domain-specific language (DSL) specialized in deep neural networks (DNNs) that supports multiple heterogeneous architectures and achieves superior results in all P3 aspects.

Even though we can develop programs with decent portability, productivity, and performance in heterogeneous environments, a great deal of code has already been written. To target new hardware, this code would need to be rewritten in new languages. In this thesis, we propose a compiler that automatically matches existing code and replaces it with API calls. Since the target API can be reconfigured easily, our compiler can target either an optimized CPU library, which is more efficient than executing the handwritten code, or an API that relies on a hardware accelerator. Our proposal is designed for C/C++ and recognizes linear algebra and tensor codes. Its main strength is its ability to recognize simple code (e.g., the three-loop structure of matrix multiplication) as well as complex code constructs (such as the Strassen algorithm or hand-optimized vectorized code).

Furthermore, an increasingly common trend in SoC design is to include a sea of disparate accelerators inside the chip. Even though the hardware already offers unprecedented performance improvements, the software still struggles to take advantage of it. For example, there is no clear way to manage multiple accelerators to accelerate a given workload, or to assign accelerators to the right tasks automatically. Using multiple accelerators concurrently, much as instruction-level parallelism (ILP) exploits multiple functional units, is called Accelerator-Level Parallelism (ALP). In this thesis, we present a new proposal for exploiting ALP in heterogeneous environments: a framework capable of orchestrating multiple accelerators to run a single task jointly, significantly improving performance.
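
One simple way to picture several accelerators running a single task jointly is a static split of the work in proportion to each device's predicted throughput, so that all devices finish at roughly the same time. The sketch below illustrates only that idea and is not the framework proposed in the thesis: the function names `split_rows` and `predicted_makespan`, the unit "rows per second", and the throughput figures are all assumptions made for this example.

```python
def split_rows(total_rows, throughputs):
    """Split `total_rows` rows of one task across devices.

    `throughputs` holds a predicted rows-per-second figure per device
    (hypothetical numbers; a real system would obtain them from a
    performance model or from profiling).
    """
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need at least one device with positive throughput")

    total_throughput = sum(throughputs)
    # Ideal (fractional) share of each device.
    shares = [total_rows * t / total_throughput for t in throughputs]
    # Start from the rounded-down shares ...
    rows = [int(s) for s in shares]
    # ... and hand the leftover rows to the devices whose fractional
    # remainder is largest, so the total is preserved exactly.
    leftover = total_rows - sum(rows)
    by_remainder = sorted(range(len(rows)),
                          key=lambda d: shares[d] - rows[d],
                          reverse=True)
    for d in by_remainder[:leftover]:
        rows[d] += 1
    return rows


def predicted_makespan(rows, throughputs):
    """Predicted time until the slowest device finishes its rows."""
    return max(r / t for r, t in zip(rows, throughputs))
```

With the made-up throughputs `[300.0, 100.0, 100.0]`, `split_rows(1000, ...)` assigns 600, 200, and 200 rows, and every device is predicted to finish after 2 seconds. An even split of 334, 333, and 333 rows would leave the two slower devices running for more than 3 seconds. How close such a split comes to the optimum in practice depends on how accurate the throughput predictions are.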
We apply our framework to matrix multiplication and convolution use cases, demonstrating that it automatically schedules tasks between accelerators with a low prediction error and a work distribution very close to the optimal one.

As multicores did, heterogeneous computing is increasing the complexity of software development and, through its hardware diversity, making computer architecture more and more complex. All computer architecture advances have come with increasing hardware complexity, which we must tame to make computers practical and useful. New architectures, different from the long-standing CPU, bring unprecedented levels of performance and energy efficiency. In this thesis, we have shown that performance portability is possible with single-source languages, and we have presented a novel DSL for DNNs that achieves excellent performance, productivity, and portability in heterogeneous environments. We have also designed a novel methodology for automatically detecting acceleratable parts of CPU code and compiling them to specialized hardware accelerators. Lastly, we have proposed a framework for exploiting Accelerator-Level Parallelism in heterogeneous environments. We expect that the proposals described in this thesis will help improve the usability and performance of heterogeneous computing, which is set to become the standard for future-generation computing systems.
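
To give a feel for what detecting acceleratable code and replacing it with an API call can mean, here is a toy sketch. It is not the thesis's compiler, which targets C/C++ and also recognizes far more complex constructs. It works on Python source through the standard `ast` module (Python 3.9 or later), matches only the canonical i/j/k three-loop matrix multiplication, and rewrites it into a call to `gemm`, a hypothetical placeholder for a library routine such as a BLAS GEMM.

```python
import ast

# Hand-written three-loop matrix multiplication, given as source text.
NAIVE_SOURCE = '''
def matmul(A, B, C, n, m, p):
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
'''


def _index2(node):
    """Return (array, row, col) names if node is `array[row][col]`, else None."""
    if (isinstance(node, ast.Subscript)
            and isinstance(node.value, ast.Subscript)
            and isinstance(node.value.value, ast.Name)
            and isinstance(node.value.slice, ast.Name)
            and isinstance(node.slice, ast.Name)):
        return node.value.value.id, node.value.slice.id, node.slice.id
    return None


def _match_gemm(loop):
    """Return array names (A, B, C) if `loop` is the canonical nest, else None."""
    loop_vars = []
    node = loop
    for _ in range(3):  # expect three perfectly nested `for` loops
        if not (isinstance(node, ast.For) and isinstance(node.target, ast.Name)
                and len(node.body) == 1 and not node.orelse):
            return None
        loop_vars.append(node.target.id)
        node = node.body[0]
    i, j, k = loop_vars
    # Innermost statement must be `C[i][j] += A[i][k] * B[k][j]`.
    if not (isinstance(node, ast.AugAssign) and isinstance(node.op, ast.Add)
            and isinstance(node.value, ast.BinOp)
            and isinstance(node.value.op, ast.Mult)):
        return None
    c = _index2(node.target)
    a = _index2(node.value.left)
    b = _index2(node.value.right)
    if c is None or a is None or b is None:
        return None
    if c[1:] == (i, j) and a[1:] == (i, k) and b[1:] == (k, j):
        return a[0], b[0], c[0]
    return None


class GemmRewriter(ast.NodeTransformer):
    """Replace each matched loop nest with `gemm(A, B, C)` (hypothetical API)."""

    def visit_For(self, node):
        match = _match_gemm(node)
        if match is None:
            return self.generic_visit(node)  # keep looking inside this loop
        call = ast.Call(func=ast.Name(id="gemm", ctx=ast.Load()),
                        args=[ast.Name(id=name, ctx=ast.Load()) for name in match],
                        keywords=[])
        return ast.copy_location(ast.Expr(value=call), node)


def rewrite(source):
    """Return `source` with recognized matrix multiplications replaced."""
    tree = GemmRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Calling `rewrite(NAIVE_SOURCE)` returns a version of `matmul` whose body is the single call `gemm(A, B, C)`, while a loop nest that does not match, for example one that subtracts instead of multiplying, is left unchanged. This purely syntactic matcher already misses a reordered loop nest, which hints at why recognizing constructs such as the Strassen algorithm or hand-vectorized code is the harder part of the problem.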

Tags: Computer science, CUDA, Heterogeneous systems, Linear Algebra, Matrix multiplication, Neural networks, nVidia, nVidia GeForce RTX 2080, nVidia GeForce RTX 2080 Ti, Performance, performance portability, Thesis

July 16, 2023 by hgpu
