VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing

Jaebeom Jeon, Minseong Gil, Junsu Kim, Jaeyong Park, Gunjae Koo, Myung Kuk Yoon, Yunho Oh
Korea University, Seoul, South Korea
Proceedings of the 53rd International Conference on Parallel Processing (ICPP’24), 2024

@inproceedings{jeon2024vitbit,
  title     = {VitBit: Enhancing Embedded GPU Performance for AI Workloads through Register Operand Packing},
  author    = {Jeon, Jaebeom and Gil, Minseong and Kim, Junsu and Park, Jaeyong and Koo, Gunjae and Yoon, Myung Kuk and Oh, Yunho},
  booktitle = {Proceedings of the 53rd International Conference on Parallel Processing},
  pages     = {1012--1021},
  year      = {2024}
}

The rapid advancement of Artificial Intelligence (AI) necessitates significant enhancements in the energy efficiency of Graphics Processing Units (GPUs) for Deep Neural Network (DNN) workloads. This challenge is particularly critical for embedded GPUs, which operate within stringent power constraints. Traditional GPU architectures, designed to support a limited set of numeric formats, struggle to meet the diverse requirements of modern AI applications, which demand support for various numeric formats to optimize computational speed and efficiency. This paper proposes VitBit, a novel software technique designed to overcome these limitations by enabling efficient processing of arbitrary integer formats, especially those of 8 bits or fewer, which are increasingly prevalent in AI workloads. VitBit introduces two key innovations: packing of arbitrary integer formats for parallel computation, and simultaneous execution of Tensor cores alongside INT and FP (integer and floating-point) CUDA cores. This approach leverages architectural features of modern GPUs, such as those based on the NVIDIA Ampere architecture, which allow concurrent operation of FP32 and INT32 cores at full throughput. Our evaluation of VitBit on the NVIDIA Jetson AGX Orin demonstrates substantial improvements in arithmetic density and peak throughput, achieving up to a 22% reduction in execution time for benchmark AI workloads without compromising inference accuracy. VitBit effectively bridges the gap between current hardware capabilities and the computational demands of AI, offering a scalable and cost-effective method for enhancing GPU performance in AI applications.
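
The paper itself details VitBit's packing scheme; purely as an illustration of the general idea of register operand packing (not the authors' implementation), the CUDA sketch below packs four signed 8-bit operands into a single 32-bit register and consumes them with one __dp4a instruction, which is available on compute capability 6.1 and later, including the Ampere GPU of the Jetson AGX Orin. The kernel name, problem size, and launch configuration are hypothetical.

// Illustrative sketch only (not VitBit's actual scheme): four signed 8-bit
// operands are packed into one 32-bit register operand and multiplied and
// accumulated with a single __dp4a instruction (compute capability 6.1+).
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void packed_dot(const int8_t* a, const int8_t* b, int* out, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // 4 elements per thread
    if (i + 3 >= n) return;

    // Pack four consecutive 8-bit operands into one 32-bit register operand.
    unsigned ua = (unsigned)(uint8_t)a[i]
                | (unsigned)(uint8_t)a[i + 1] << 8
                | (unsigned)(uint8_t)a[i + 2] << 16
                | (unsigned)(uint8_t)a[i + 3] << 24;
    unsigned ub = (unsigned)(uint8_t)b[i]
                | (unsigned)(uint8_t)b[i + 1] << 8
                | (unsigned)(uint8_t)b[i + 2] << 16
                | (unsigned)(uint8_t)b[i + 3] << 24;

    // One instruction performs four int8 x int8 multiplies and accumulates into int32.
    atomicAdd(out, __dp4a((int)ua, (int)ub, 0));
}

int main() {
    const int n = 1024;  // hypothetical problem size
    int8_t *a, *b;
    int *out;
    cudaMallocManaged(&a, n);
    cudaMallocManaged(&b, n);
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = 2; }
    *out = 0;
    packed_dot<<<(n / 4 + 255) / 256, 256>>>(a, b, out, n);
    cudaDeviceSynchronize();
    printf("dot = %d\n", *out);  // expected: 2048
    return 0;
}

A plausible compile line for the Orin-class GPU would be nvcc -arch=sm_87 packed_dot.cu; packing this way lets one 32-bit integer instruction process four sub-word operands, which is the arithmetic-density effect the abstract describes.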