Posts
Jun 29
Survey of HPC in US Research Institutions
The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across […]
Jun 29
Omniwise: Predicting GPU Kernels Performance with LLMs
In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks previously out of reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline […]
Jun 29
GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
To design next-generation Graphics Processing Units (GPUs), architects rely on performance analyses to identify key bottlenecks and explore design spaces. Unfortunately, existing analysis mechanisms make it difficult for GPU architects to conduct fast and accurate performance analyses. These mechanisms can provide misleading insights into GPU […]
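The excerpt stops before the mechanism, but the title points at fine-grained stall cycle accounting. As a rough illustration of what cycle-stack accounting means in general, here is a minimal sketch; the stall categories and per-cycle attribution rule below are hypothetical, not GCStack's actual taxonomy:

```cpp
// Illustrative sketch only: a generic cycle-stack accounting pass over a
// trace of per-cycle issue states. Categories are hypothetical examples.
#include <array>
#include <cstdio>
#include <vector>

enum class Stall { None, MemData, MemStruct, Sync, Control, Idle };

// Count how many cycles fall into each category to form a "cycle stack":
// total cycles = productive cycles + sum of attributed stall cycles.
std::array<long, 6> buildCycleStack(const std::vector<Stall>& trace) {
    std::array<long, 6> stack{};
    for (Stall s : trace) ++stack[static_cast<int>(s)];
    return stack;
}

int main() {
    std::vector<Stall> trace = {Stall::None, Stall::MemData, Stall::MemData,
                                Stall::Sync, Stall::None, Stall::Idle};
    auto stack = buildCycleStack(trace);
    const char* names[] = {"productive", "mem-data", "mem-struct",
                           "sync", "control", "idle"};
    for (int i = 0; i < 6; ++i)
        std::printf("%-10s %ld cycles\n", names[i], stack[i]);
}
```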
Jun 29
No More Shading Languages: Compiling C++ to Vulkan Shaders
Graphics APIs have traditionally relied on shading languages; however, these languages have a number of fundamental defects and limitations. By contrast, GPU compute platforms offer powerful, feature-rich languages suitable for heterogeneous compute. We propose reframing shading languages as embedded domain-specific languages, layered on top of a more general language like C++, doing away with traditional […]
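To make the embedded-DSL idea concrete, here is a hypothetical sketch of a shader written as plain C++; the vec4 type and the notion of a C++-to-SPIR-V compiler lowering fragmentMain to a Vulkan fragment stage are illustrative assumptions, not the paper's actual toolchain:

```cpp
// Hypothetical sketch of the embedded-DSL idea: a "shader" that is just
// ordinary C++. A (hypothetical) C++-to-SPIR-V compiler would mark and
// lower fragmentMain; a host compiler treats it as a normal function.
struct vec4 { float x, y, z, w; };

inline vec4 mix(vec4 a, vec4 b, float t) {
    return {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
            a.z + (b.z - a.z) * t, a.w + (b.w - a.w) * t};
}

// Entry point: on device this would become the fragment stage.
vec4 fragmentMain(vec4 baseColor, vec4 tintColor, float amount) {
    return mix(baseColor, tintColor, amount);  // same code both ways
}

int main() {
    vec4 c = fragmentMain({1, 0, 0, 1}, {0, 0, 1, 1}, 0.5f);
    return c.x == 0.5f ? 0 : 1;  // host-side unit test of the "shader"
}
```

The appeal the excerpt hints at: the same function can be unit-tested on the host and compiled for the GPU, with no separate shading-language dialect to maintain.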
Jun 29
WiLLM: An Open Wireless LLM Communication System
The rapid evolution of LLMs threatens to overwhelm existing wireless infrastructure, necessitating architectural innovations for burgeoning mobile LLM services. This paper introduces WiLLM, the first open-source wireless system specifically designed for these services. First, we establish a new paradigm by deploying LLMs in core networks (CNs) with abundant GPUs. This enables distributed inference services, strategically […]
Jun 22
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those involving large language models (LLMs). To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, […]
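For context, the long-standing baseline the excerpt questions can be sketched as a greedy proximity-based allocator; the hop-count matrix and the greedy search below are illustrative assumptions, not LiteGD's dispatching algorithm:

```cpp
// Sketch of proximity-based GPU allocation: pick k GPUs minimizing total
// pairwise "distance" (e.g. NVLink/PCIe/NIC hops). Illustrative only.
#include <algorithm>
#include <cstdio>
#include <vector>

// hops[i][j]: communication distance between GPU i and GPU j.
using HopMatrix = std::vector<std::vector<int>>;

// Greedily grow a group of k GPUs around each seed; keep the cheapest group.
std::vector<int> allocateByProximity(const HopMatrix& hops, int k) {
    int n = static_cast<int>(hops.size());
    std::vector<int> best;
    long bestCost = -1;
    for (int seed = 0; seed < n; ++seed) {
        std::vector<int> group{seed};
        while (static_cast<int>(group.size()) < k) {
            int pick = -1; long pickCost = 0;
            for (int g = 0; g < n; ++g) {
                if (std::find(group.begin(), group.end(), g) != group.end())
                    continue;
                long c = 0;
                for (int m : group) c += hops[g][m];
                if (pick < 0 || c < pickCost) { pick = g; pickCost = c; }
            }
            group.push_back(pick);
        }
        long cost = 0;
        for (int a : group) for (int b : group) cost += hops[a][b];
        if (bestCost < 0 || cost < bestCost) { best = group; bestCost = cost; }
    }
    return best;
}

int main() {
    // 4 GPUs: pairs {0,1} and {2,3} share a fast link (1 hop), else 3 hops.
    HopMatrix hops = {{0,1,3,3},{1,0,3,3},{3,3,0,1},{3,3,1,0}};
    for (int g : allocateByProximity(hops, 2)) std::printf("GPU %d\n", g);
}
```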
Jun 22
A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs
Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to […]
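The irregularity the excerpt describes is easy to see in the standard CSR format, where the input-vector read is indirect through the column-index array. This is generic CSR SpMV for illustration, not the paper's compiler transformation:

```cpp
// Why compact sparse formats cause irregular access: in CSR SpMV the read
// x[A.col[j]] is data-dependent, so successive iterations (or GPU threads)
// touch scattered addresses and use the memory hierarchy poorly.
#include <cstdio>
#include <vector>

struct CSR {
    std::vector<int> rowPtr;   // size rows+1: start of each row in col/val
    std::vector<int> col;      // column index of each nonzero
    std::vector<float> val;    // value of each nonzero
};

void spmv(const CSR& A, const std::vector<float>& x, std::vector<float>& y) {
    for (size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
        float sum = 0.0f;
        for (int j = A.rowPtr[i]; j < A.rowPtr[i + 1]; ++j)
            sum += A.val[j] * x[A.col[j]];  // gather: the irregular access
        y[i] = sum;
    }
}

int main() {
    // 2x3 matrix [[1,0,2],[0,3,0]] in CSR form.
    CSR A{{0, 2, 3}, {0, 2, 1}, {1.f, 2.f, 3.f}};
    std::vector<float> x{1.f, 1.f, 1.f}, y(2);
    spmv(A, x, y);
    std::printf("%g %g\n", y[0], y[1]);  // prints: 3 3
}
```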
Jun 22
A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline
Over the past decades, Field-Programmable Gate Arrays (FPGAs) have become a popular choice for heterogeneous computing due to their flexibility, energy efficiency, and processing speed. OpenCL is used in FPGA heterogeneous computing for its high-level abstraction and cross-platform compatibility. Previous works have introduced OpenCL optimization techniques for FPGAs to leverage FPGA-specific advantages. However, the multi-kernel […]
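As background for the multi-kernel setting, a two-stage pipeline on the host side looks roughly like the sketch below, with an event chaining stage2 onto stage1. This shows only the standard OpenCL pattern, not the paper's CPU+FPGA platform or its optimizations; error checking is omitted for brevity:

```cpp
// Generic two-kernel OpenCL pipeline: stage2 consumes stage1's output buffer.
// The event makes the dependency explicit (an in-order queue also implies it).
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>

static const char* src = R"(
__kernel void stage1(__global float* buf) { buf[get_global_id(0)] += 1.0f; }
__kernel void stage2(__global float* buf) { buf[get_global_id(0)] *= 2.0f; }
)";

int main() {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    float host[4] = {0, 1, 2, 3};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof host, host, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    cl_kernel k1 = clCreateKernel(prog, "stage1", &err);
    cl_kernel k2 = clCreateKernel(prog, "stage2", &err);
    clSetKernelArg(k1, 0, sizeof buf, &buf);
    clSetKernelArg(k2, 0, sizeof buf, &buf);

    size_t gsz = 4; cl_event done1;
    clEnqueueNDRangeKernel(q, k1, 1, nullptr, &gsz, nullptr, 0, nullptr, &done1);
    clEnqueueNDRangeKernel(q, k2, 1, nullptr, &gsz, nullptr, 1, &done1, nullptr);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof host, host, 0, nullptr, nullptr);
    std::printf("%g %g %g %g\n", host[0], host[1], host[2], host[3]); // 2 4 6 8
}
```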
Jun 22
A First Look at Bugs in LLM Inference Engines
Large language model-specific inference engines (hereafter LLM inference engines) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities […]
Jun 22
Engineering Supercomputing Platforms for Biomolecular Applications
A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4, and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, NVIDIA V100 and AMD MI250X GPU nodes, and an NVIDIA GH200 testbed. The raw performance, power efficiency, and data storage requirements of the software were evaluated […]
Jun 15
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing […]
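Pipelines of this kind typically close the loop with compiler feedback. A minimal sketch, assuming a placeholder requestKernelFromLLM stub and nvcc on the PATH; neither the stub nor this loop is taken from the paper:

```cpp
// Hedged sketch of a generate-compile-check loop an LLM-driven pipeline
// might use: write the candidate kernel to disk, compile it with nvcc,
// and keep only candidates that build (benchmarking would follow).
#include <cstdlib>
#include <fstream>
#include <string>

// Placeholder assumption: a real pipeline would call an LLM API here with
// the task description plus compiler feedback from previous attempts.
std::string requestKernelFromLLM(const std::string& /*feedback*/) {
    return "__global__ void saxpy(int n, float a, const float* x, float* y) {\n"
           "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
           "  if (i < n) y[i] = a * x[i] + y[i];\n"
           "}\n";
}

bool compiles(const std::string& cudaSource) {
    std::ofstream("candidate.cu") << cudaSource;
    // -c: compile only; requires the CUDA toolkit to be installed.
    return std::system("nvcc -c candidate.cu -o candidate.o") == 0;
}

int main() {
    std::string feedback;
    for (int attempt = 0; attempt < 3; ++attempt) {
        if (compiles(requestKernelFromLLM(feedback))) return 0;  // accept
        feedback = "compilation failed";  // would carry the real nvcc log
    }
    return 1;  // no valid kernel produced
}
```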
Jun 15
HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, owing to entrenched user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant […]