
Posts

Jun, 22

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, […]
Jun, 22

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to […]
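The irregularity mentioned above is easy to see in a generic compressed sparse row (CSR) multiply, where the stored column indices drive data-dependent gathers from the dense vector. The C++ sketch below only illustrates that access pattern; it is not the paper's transformed code:

    #include <cstddef>
    #include <vector>

    // y = A * x, with A stored in CSR form (row_ptr, col_idx, vals).
    // The read x[col_idx[j]] is the "random" access: its address depends on
    // the sparsity pattern, so consecutive iterations rarely touch the same
    // cache line or coalesce well when mapped onto GPU threads.
    void spmv_csr(const std::vector<std::size_t>& row_ptr,
                  const std::vector<int>&         col_idx,
                  const std::vector<float>&       vals,
                  const std::vector<float>&       x,
                  std::vector<float>&             y) {
        for (std::size_t row = 0; row + 1 < row_ptr.size(); ++row) {
            float acc = 0.0f;
            for (std::size_t j = row_ptr[row]; j < row_ptr[row + 1]; ++j) {
                acc += vals[j] * x[col_idx[j]];   // irregular, indirect read
            }
            y[row] = acc;
        }
    }

Compiler transformations of the kind the title describes generally target exactly this gather, reorganizing data or iteration order so the indirect reads become more regular.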
Jun, 22

A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline

Over the past decades, Field-Programmable Gate Arrays (FPGAs) have become a popular choice for heterogeneous computing thanks to their flexibility, energy efficiency, and processing speed. OpenCL is used in FPGA heterogeneous computing for its high-level abstraction and cross-platform compatibility. Previous works have introduced OpenCL optimization techniques for FPGAs that leverage FPGA-specific advantages. However, the multi-kernel […]
Jun, 22

A First Look at Bugs in LLM Inference Engines

Large language model-specific inference engines (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities […]
Jun, 22

Engineering Supercomputing Platforms for Biomolecular Applications

A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4 and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, NVIDIA V100 and AMD MI250X GPU nodes, and an NVIDIA GH200 testbed. The raw performance, power efficiency and data storage requirements of the software were evaluated […]
Jun, 15

CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing […]
Jun, 15

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, owing to established user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant […]
Jun, 15

GPU Acceleration of SQL Analytics on Compressed Data

GPUs are uniquely suited to accelerating SQL analytics workloads thanks to their massive compute parallelism and High Bandwidth Memory (HBM): when datasets fit in GPU HBM, performance is unparalleled. Unfortunately, GPU HBM typically remains small compared with lower-bandwidth CPU main memory. Besides brute-force scaling across many GPUs, current solutions to accelerate queries […]
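Independent of this particular paper's approach, a common way to stretch limited HBM capacity is to evaluate predicates directly on compressed columns rather than decompressing them first. The C++ sketch below uses a hypothetical run-length-encoded layout (RleRun, count_gt_rle) purely to illustrate the idea:

    #include <cstdint>
    #include <vector>

    // A run-length-encoded column: `value` repeated `length` times.
    struct RleRun {
        std::int32_t  value;
        std::uint32_t length;
    };

    // Counts rows satisfying `value > threshold` without materializing the
    // uncompressed column; each run is tested once, so the working set stays
    // proportional to the compressed size, which is what matters when the
    // data has to fit in GPU HBM.
    std::uint64_t count_gt_rle(const std::vector<RleRun>& runs,
                               std::int32_t threshold) {
        std::uint64_t matches = 0;
        for (const RleRun& r : runs) {
            if (r.value > threshold) {
                matches += r.length;   // every row in the run qualifies at once
            }
        }
        return matches;
    }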
Jun, 15

Enabling Profile Guided Optimizations (PGO) for Graphics

This master's thesis presents an implementation that enables profile-guided optimizations (PGO) for mobile phone GPUs. PGO is an optimization technique that uses runtime profiling data, such as block frequencies and function call frequencies, to guide compiler optimizations. The implementation adapts the existing PGO infrastructure in LLVM to accommodate the architectural differences between CPUs […]
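As background on the technique itself, the standard instrumentation-based PGO workflow in LLVM/Clang for CPUs (not the thesis's mobile-GPU adaptation) compiles with instrumentation, runs the program to collect block and call frequencies, and recompiles with the merged profile. A minimal C++ example:

    // pgo_demo.cpp -- standard LLVM/Clang instrumentation-based PGO workflow:
    //   1) clang++ -O2 -fprofile-instr-generate pgo_demo.cpp -o pgo_demo
    //   2) ./pgo_demo            (writes default.profraw with block/call counts)
    //   3) llvm-profdata merge -o demo.profdata default.profraw
    //   4) clang++ -O2 -fprofile-instr-use=demo.profdata pgo_demo.cpp -o pgo_demo_opt
    #include <cstdio>

    // With a profile, the compiler learns that the error path is cold and the
    // arithmetic path is hot, which informs inlining and basic-block layout.
    int process(int v) {
        if (v < 0) {                 // cold branch: never taken in this run
            std::fprintf(stderr, "negative input\n");
            return 0;
        }
        return v * 2 + 1;            // hot path
    }

    int main() {
        long sum = 0;
        for (int i = 0; i < 1'000'000; ++i) {
            sum += process(i % 97);
        }
        std::printf("%ld\n", sum);
        return 0;
    }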
Jun, 15

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

Machine learning potentials (MLPs) have advanced rapidly and show great promise to transform molecular dynamics (MD) simulations. However, most existing software tools are tied to specific MLP architectures, lack integration with standard MD packages, or are not parallelizable across GPUs. To address these challenges, we present chemtrain-deploy, a framework that enables model-agnostic deployment of MLPs […]
Jun, 8

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

As multicore vector processors improve in computational and memory performance, running SIMT (Single Instruction Multiple Threads) programs on CPUs has become increasingly appealing, potentially eliminating the need for dedicated GPU hardware. SYCL is a royalty-free cross-platform C++ programming model for heterogeneous computing that implements the SIMT model and provides a path to run GPU programs […]
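For readers new to the model, a minimal SYCL 2020 kernel shows the SIMT style the abstract refers to; the same source can run on a CPU or a GPU depending on the device the queue selects. This is a generic vector-add sketch, not code from the paper:

    #include <sycl/sycl.hpp>
    #include <cstddef>

    int main() {
        constexpr std::size_t n = 1 << 20;
        sycl::queue q;                     // binds to a default device: CPU or GPU

        // Unified shared memory keeps the example short; buffers/accessors also work.
        float* a = sycl::malloc_shared<float>(n, q);
        float* b = sycl::malloc_shared<float>(n, q);
        float* c = sycl::malloc_shared<float>(n, q);
        for (std::size_t i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // SIMT-style kernel: one logical work-item per element.
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            c[i] = a[i] + b[i];
        }).wait();

        sycl::free(a, q);
        sycl::free(b, q);
        sycl::free(c, q);
        return 0;
    }

On a CPU backend, groups of these work-items are typically mapped onto vector lanes and threads, which is precisely the portability question the post examines.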
Jun, 8

Acceleration as a Service (XaaS) Source Containers

In this thesis, we address the challenge of performance portability in heterogeneous computing environments. Performance portability refers to the ability of an application to maintain high performance on multiple platforms without requiring extensive manual tuning for each system. Traditional containers fall short in this regard as they prioritize portability at the expense of architecture-specific optimizations. […]

