high performance computing on graphics processing units: hgpu.org

Posts

Sep, 21

High Performance GPU Implementation of KNN Algorithm: A Review

With large volumes of complex data generated by different applications, Machine Learning (ML) algorithms alone may not yield significant performance benefits on a single or multi-core CPU. Applying optimization techniques to these ML algorithms in a High-Performance Computing (HPC) environment can give considerable speedups for high-dimensional datasets. One of the most widely used classification algorithms, […]

Sep, 21

Dato: A Task-Based Programming Model for Dataflow Accelerators

Recent deep learning workloads increasingly push computational demand beyond what current memory systems can sustain, with many kernels stalling on data movement rather than computation. While modern dataflow accelerators incorporate on-chip streaming to mitigate off-chip bandwidth limitations, existing programming models struggle to harness these capabilities effectively. Low-level interfaces provide fine-grained control but impose significant development […]

Sep, 21

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Recent advances in large language models (LLMs) demonstrate their effectiveness in scaling test-time compute for software engineering tasks. However, these approaches often focus on high-level solutions, with limited attention to optimizing low-level CUDA kernel implementations. Additionally, existing kernel generation benchmarks suffer from exploitable loopholes and insufficient diversity in testing conditions, hindering true generalization assessment. To […]

CUDA

Sep, 21

Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models

Automated kernel design is critical for overcoming software ecosystem barriers in emerging hardware platforms like RISC-V. While large language models (LLMs) have shown promise for automated kernel optimization, demonstrating success in CUDA domains with comprehensive technical documents and mature codebases, their effectiveness remains unproven for reference-scarce domains like RISC-V. We present Evolution of Kernels (EoK), […]

CUDA

Sep, 21

AI Factories: It’s time to rethink the Cloud-HPC divide

The strategic importance of artificial intelligence is driving a global push toward Sovereign AI initiatives. National governments are increasingly developing dedicated infrastructures, called AI Factories (AIF), to achieve technological autonomy and secure the resources necessary to sustain robust local digital ecosystems. In Europe, the EuroHPC Joint Undertaking is investing hundreds of millions of euros into […]

Sep, 14

Towards Calculating HPC CUDA Kernel Performance on Nvidia GPUs

This thesis aims to provide the groundwork for a performance estimation model for CUDA kernels based on cycle counting. After a short overview of past GPU performance modeling techniques, it conducts an exhaustive, in-depth analysis of Nvidia’s SASS instruction set and CUDA ELF formats for architectures from Maxwell up to and including Blackwell, […]

CUDA

Sep, 14

An HPC Benchmark Survey and Taxonomy for Characterization

The field of High-Performance Computing (HPC) is defined by providing computing devices with the highest performance for a variety of demanding scientific users. The tight co-design relationship between HPC providers and users, paired with technological improvements, propels the field forward, achieving continuously higher performance and resource utilization. A key device for system architects, architecture researchers, and […]

CUDA • OpenCL

Sep, 14

Combining Performance and Productivity: Accelerating the Network Sensing Graph Challenge with GPUs and Commodity Data Science Software

The HPEC Graph Challenge is a collection of benchmarks representing complex workloads that test the hardware and software components of HPC systems, which traditional benchmarks, such as LINPACK, do not. The first benchmark, Subgraph Isomorphism, focused on several compute-bound and memory-bound kernels. The most recent of the challenges, the Anonymized Network Sensing Graph Challenge, represents […]

CUDA

Sep, 14

Home-made Diffusion Model from Scratch to Hatch

We introduce Home-made Diffusion Model (HDM), an efficient yet powerful text-to-image diffusion model optimized for training (and inference) on consumer-grade hardware. HDM achieves competitive 1024×1024 generation quality while maintaining a remarkably low training cost of $535-620 using four RTX 5090 GPUs, representing a significant reduction in computational requirements compared to traditional approaches. Our key contributions include: […]

Sep, 14

High Performance Matrix Multiplication

Matrix multiplication is the foundation of much of the success of high-performance technologies such as deep learning, scientific simulations, and video graphics. High-level programming languages like Python and R rely on highly optimized low-level libraries, such as Basic Linear Algebra Subprograms (BLAS), for core linear algebra operations like matrix multiplication. This paper compares […]

CUDA

Sep, 7

CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, […]

CUDA • OpenGL

Sep, 7

AnnotationGym: A Generic Framework for Automatic Source Code Annotation

A common approach to code optimization is to insert compiler hints in the source code using annotations. Two major challenges with using annotations effectively are their complexity and lack of portability. This means, first, that significant developer expertise is required, and, second, that the supported annotations, as well as their syntax and use, can vary […]

* * *

Recent source codes

Allo: Accelerator Design Language

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

HPC Benchmark Survey

HDM: Home made Diffusion Models

General Matrix Multiplication (GEMM)

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

Most viewed papers (last 30 days)