high performance computing on graphics processing units: hgpu.org

Posts

Sep, 21

High Performance GPU Implementation of KNN Algorithm: A Review

With large volumes of complex data generated by different applications, Machine Learning (ML) algorithms alone may not yield significant performance benefits on a single or multi-core CPU. Applying optimization techniques to these ML algorithms in a High-Performance Computing (HPC) environment can give considerable speedups for high-dimensional datasets. One of the most widely used classification algorithms, […]

Sep, 14

Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs

This thesis aims at providing the ground work to facilitate a performance estimation model for CUDA kernels using a cycle counting model. After a short overview of past GPU performance modeling techniques, it conducts an exhaustive, in-depth analysis of Nvidia’s SASS instruction set and CUDA ELF formats for architectures Maxwell up to and including Blackwell, […]

CUDA

Sep, 14

An HPC Benchmark Survey and Taxonomy for Characterization

The field of High-Performance Computing (HPC) is defined by providing computing devices with highest performance for a variety of demanding scientific users. The tight co-design relationship between HPC providers and users propels the field forward, paired with technological improvements, achieving continuously higher performance and resource utilization. A key device for system architects, architecture researchers, and […]

CUDA

•

OpenCL

Sep, 14

Home-made Diffusion Model from Scratch to Hatch

We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inferring) on consumer-grade hardware. HDM achieves competitive 1024×1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: […]

Sep, 14

High Performance Matrix Multiplication

Matrix multiplication is the foundation from much of the success from high performance technologies like deep learning, scientific simulations, and video graphics. High level programming languages like Python and R rely on highly optimized low level libraries for performing core linear algebra operations like matrix multiplication from Basic Linear Algebra Subprograms (BLAS). This paper compares […]

CUDA

Sep, 14

Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software

The HPEC Graph Challenge is a collection of benchmarks representing complex workloads that test the hardware and software components of HPC systems, which traditional benchmarks, such as LINPACK, do not. The first benchmark, Subgraph Isomorphism, focused on several compute-bound and memory-bound kernels. The most recent of the challenges, the Anonymized Network Sensing Graph Challenge, represents […]

CUDA

Sep, 7

CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, […]

CUDA

•

OpenGL

Sep, 7

AnnotationGym: A Generic Framework for Automatic Source Code Annotation

A common approach to code optimization is to insert compiler hints in the source code using annotations. Two major challenges with using annotations effectively are their complexity and lack of portability. This means, first, that significant developer expertise is required, and, second, that the supported annotations, as well as their syntax and use, can vary […]

Sep, 7

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

We present a GPU implementation for the factorization and solution of block-tridiagonal symmetric positive definite linear systems, which commonly arise in time-dependent estimation and optimal control problems. Our method employs a recursive algorithm based on Schur complement reduction, transforming the system into a hierarchy of smaller, independent blocks that can be efficiently solved in parallel […]

CUDA

Sep, 7

GPU-acceleration of the Discontinuous Galerkin Shallow Water Equations Solver (DG-SWEM) using CUDA and OpenACC

This paper presents a porting of DG-SWEM, a discontinuous Galerkin solver for coastal ocean circulation, and in particular storm surge, to GPU using two separate approaches: CUDA Fortran and OpenACC. Time-explicit discontinuous Galerkin methods have been shown to exhibit a large amount of data parallelism due to the loose coupling between elements, and thus are […]

CUDA

Sep, 7

Managing Multi Instance GPUs for High Throughput and Energy Savings

Focus to learn morModern GPUs such as the Ampere series (A30, A100) as well as the Hopper series (H100, H200) offer performance as well as security isolation features. They also support a good amount of concurrency, but taking advantage of it can be quite challenging due to the complex constraints on partitioning the chip. In […]

CUDA

Aug, 31

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

This work elaborates on a High performance computing (HPC) architecture based on Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic resource scheduling and seamless integration of containerized microservices have been leveraged herein to manage CPU, GPU, and memory allocations efficiently in multi-node […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

High Performance GPU Implementation of KNN Algorithm: A Review

Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs

An HPC Benchmark Survey and Taxonomy for Characterization

Home-made Diffusion Model from Scratch to Hatch

High Performance Matrix Multiplication

Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software

CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

AnnotationGym: A Generic Framework for Automatic Source Code Annotation

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

GPU-acceleration of the Discontinuous Galerkin Shallow Water Equations Solver (DG-SWEM) using CUDA and OpenACC

Managing Multi Instance GPUs for High Throughput and Energy Savings

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)