high performance computing on graphics processing units: hgpu.org

Posts

Mar, 3

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability

The evolution of Graphics Processing Unit (GPU) compilers has facilitated the support for general-purpose programming languages across various architectures. The NVIDIA CUDA Compiler (NVCC) employs multiple compilation levels prior to generating machine code, implementing intricate optimizations to enhance performance. These optimizations influence the manner in which software is mapped to the underlying hardware, which can […]

CUDA

Feb, 24

KernelBench: Can LLMs Write Efficient GPU Kernels?

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs’ ability to write fast and correct kernels on a suite of 250 carefully […]

CUDA

Feb, 24

Seamless acceleration of Fortran intrinsics via AMD AI engines

A major challenge that the HPC community faces is how to continue delivering the performance demanded by scientific programmers, whilst meeting an increased emphasis on sustainable operations. Specialised architectures, such as FPGAs and AMD’s AI Engines (AIEs), have been demonstrated to provide significant energy efficiency advantages, however a major challenge is that to most effectively […]

Feb, 24

Forecasting time series with constraints

Time series forecasting presents unique challenges that limit the effectiveness of traditional machine learning algorithms. To address these limitations, various approaches have incorporated linear constraints into learning algorithms, such as generalized additive models and hierarchical forecasting. In this paper, we propose a unified framework for integrating and combining linear constraints in time series forecasting. Within […]

Feb, 24

Evaluating the Performance of the DeepSeek Model in Confidential Computing Environment

The increasing adoption of Large Language Models (LLMs) in cloud environments raises critical security concerns, particularly regarding model confidentiality and data privacy. Confidential computing, enabled by Trusted Execution Environments (TEEs), offers a promising solution to mitigate these risks. However, existing TEE implementations, primarily CPU-based, struggle to efficiently support the resource-intensive nature of LLM inference and […]

Feb, 24

The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Recent advances in Large Language Models have driven large-scale deployment, resulting in ever-growing inference time and energy demand. While manual optimization of low-level code implementations is feasible, it is an arduous task that requires deep expertise to balance the complex interplay of algorithmic, software, and hardware bottlenecks. This report presents the first comprehensive agentic framework […]

CUDA

Feb, 16

cuSZp2: A GPU Lossy Compressor with Extreme Throughput and Optimized Compression Ratio

Existing GPU lossy compressors suffer from expensive data movement overheads, inefficient memory access patterns, and high synchronization latency, resulting in limited throughput. This work proposes CUSZP2, a generic single-kernel error-bounded lossy compressor purely on GPUs designed for applications that require high speed, such as large-scale GPU simulation and large language model training. In particular, CUSZP2 […]

CUDA

Feb, 16

Leveraging LLVM OpenMP GPU Offload Optimizations for Kokkos Applications

OpenMP provides a cross-vendor API for GPU offload that can serve as an implementation layer under performance portability frameworks like the Kokkos C++ library. However, recent work identified some impediments to performance with this approach arising from limitations in the API or in the available implementations. Advanced programming concepts such as hierarchical parallelism and use […]

CUDA

Feb, 16

InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel, and practical LLM inference framework […]

Feb, 16

Teaching An Old Dog New Tricks: Porting Legacy Code to Heterogeneous Compute Architectures With Automated Code Translation

Legacy codes are in ubiquitous use in scientific simulations; they are well-tested and there is significant time investment in their use. However, one challenge is the adoption of new, sometimes incompatible computing paradigms, such as GPU hardware. In this paper, we explore using automated code translation to enable execution of legacy multigrid solver code on […]

CUDA, OpenCL

Feb, 16

Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics

Despite the high computational throughput of GPUs, limited memory capacity and bandwidth-limited CPU-GPU communication via PCIe links remain significant bottlenecks for accelerating large-scale data analytics workloads. This paper introduces Vortex, a GPU-accelerated framework designed for data analytics workloads that exceed GPU memory capacity. A key aspect of our framework is an optimized IO primitive that […]

CUDA

Feb, 10

Optimizing the optimizer: increasing performance efficiency of modern compilers

A long-standing goal, which is increasingly important in the post-Moore era, is to augment system performance by building more intelligent compilers. One of our motivating hypotheses is that much of the capability needed to advance compiler optimization is already present: state-of-the-art compilers not only provide a large set of code transformations, but also (by-and-large) correctly […]

* * *

Recent source codes

Allo: Accelerator Design Language

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

HPC Benchmark Survey

HDM: Home made Diffusion Models

General Matrix Multiplication (GEMM)

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
