high performance computing on graphics processing units: hgpu.org

Posts

May, 4

Efficient deep learning inference on end devices

Deep Learning (DL) has become a cornerstone of modern Artificial Intelligence (AI), powering applications across healthcare, computer vision, and autonomous systems. However, executing DL inference on resource-constrained end devices—such as smartphones and IoT hardware—poses challenges due to limited computational resources, energy constraints, and real-time requirements. This thesis addresses the optimization of DL inference on Heterogeneous […]

OpenCL

May, 4

LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning

FPGAs are increasingly adopted in datacenter environments for their reconfigurability and energy efficiency. High-Level Synthesis (HLS) tools have eased FPGA programming by raising the abstraction level from RTL to untimed C/C++, yet attaining high performance still demands expert knowledge and iterative manual insertion of optimization pragmas to modify the microarchitecture. To address this challenge, we […]

May, 4

Mìmir: A real-time interactive visualization library for CUDA programs

Real-time visualization of computational simulations running over graphics processing units (GPU) is a valuable feature in modern science and technological research, as it allows researchers to visually assess the quality and correctness of their computational models during the simulation. Due to the high throughput involved in GPU-based simulations, classical visualization approaches such as ones based […]

CUDA

May, 4

Scaling On-Device GPU Inference for Large Generative Models

Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, […]

CUDA • OpenCL

May, 4

Dynamic Memory Management on GPUs with SYCL

Dynamic memory allocation is not traditionally available in kernels running on GPUs. This work aims to build on Ouroboros, an efficient dynamic memory management library for CUDA applications, by porting the code to SYCL, a cross-platform accelerator API. Since SYCL can be compiled to a CUDA backend, it is possible to compare the performance of […]

CUDA

Apr, 27

MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications

Modern cutting-edge AI applications are being developed over fast-evolving, heterogeneous, nascent hardware devices. This requires frequent reworking of the AI software stack to adopt bottom-up changes from new hardware, which takes time for general-purpose software libraries. Consequently, real applications often develop custom software stacks optimized for their specific workloads and hardware. Custom stacks help in […]

CUDA

Apr, 27

InteropUnityCUDA: A Tool for Interoperability Between Unity and CUDA

Introduction: Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel General-Purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support […]

CUDA • OpenGL

Apr, 27

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order […]

CUDA

Apr, 27

LithOS: An Operating System for Efficient Machine Learning on GPUs

The surging demand for GPUs in datacenters for machine learning (ML) has made efficient GPU utilization crucial. However, meeting the diverse needs of ML models while optimizing resource usage is challenging. To enable transparent, fine-grained GPU management that maximizes utilization and energy efficiency while maintaining strong isolation, an operating system (OS) approach is needed. This […]

CUDA

Apr, 27

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

The increasing scale of deep learning models has led to the development of various parallelization strategies for distributed training across accelerators. For example, fully sharded approaches like DeepSpeed ZeRO-3 and FSDP partition the parameters of each layer across multiple GPUs and gather them through communication when needed. These methods rely on optimizations such as prefetching, […]

CUDA

Apr, 13

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It provides a detailed mapping of current frameworks for distributed deep learning in multi-node and multi-GPU settings, including Horovod from Uber, DeepSpeed from Microsoft, and the built-in […]

CUDA

Apr, 13

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

CUDA (Compute Unified Device Architecture) parallel programming significantly improves computational efficiency across multiple fields. However, converting serial C code to CUDA poses challenges for non-experts, and traditional tools struggle with complex patterns. While LLMs (Large Language Models) enable automatic parallelization of complex patterns, they may generate CUDA code with synchronization and memory management issues. There […]

CUDA

Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)