high performance computing on graphics processing units: hgpu.org

Posts

Jul, 28

Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD’s HIP. We do so by extending Kernel Tuner, an open-source Python […]

CUDA

Jul, 28

Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs

Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have […]

CUDA

Jul, 28

RBMD: A molecular dynamics package enabling to simulate 10 million all-atom particles in a single graphics processing unit

This paper introduces a random-batch molecular dynamics (RBMD) package for fast simulations of particle systems at the nano/micro scale. Different from existing packages, the RBMD uses random batch methods for nonbonded interactions of particle systems. The long-range part of Coulomb interactions is calculated in Fourier space by the random batch Ewald algorithm, which achieves linear […]

CUDA

Jul, 14

Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems

Sparse matrix-vector multiplication (SpMV) is one of the important kernels of many iterative algorithms for solving sparse linear systems. The limited storage and computational resources of individual GPUs restrict both the scale and speed of SpMV computing in problem-solving. As real-world engineering problems continue to increase in complexity, the imperative for collaborative execution of iterative […]

CUDA

Jul, 14

The Impact of Modern Consumer GPUs on Commonly Used Secure Password Standards

As home network-based devices and servers become more accessible [1], the need for cybersecurity awareness and best practices to secure wireless network is increasingly important. With the growing affordability of advanced hardware technology, such as modern gaming PCs equipped with powerful graphics processing units (GPUs), which can facilitate password brute force cracking on a wider […]

CUDA

•

OpenCL

Jul, 14

Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper

Memory management across discrete CPU and GPU physical memory is traditionally achieved through explicit GPU allocations and data copy or unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system allocated memory, and cache-coherent NVLink-C2C interconnect, bringing an alternative solution for enabling a […]

CUDA

Jul, 14

Automating Heterogeneous Parallelism in Numerical Differential Equations

Scientific computing is an amalgamation of numerical methods and computer science. Developments in numerical analysis have allowed stable and accurate numerical schemes, whereas computer algorithms have been successfully adopted to standard multicore systems of today, enabling parallelism. Combining efficient numerical algorithms with efficient parallelism presents a challenge mainly due to the independent development of these […]

CUDA

Jul, 14

Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models

In High-Level Synthesis (HLS), converting a regular C/C++ program into its HLS-compatible counterpart (HLS-C) still requires tremendous manual effort. Various program scripts have been introduced to automate this process. But the resulting codes usually contain many issues that should be manually repaired by developers. Since Large Language Models (LLMs) have the ability to automate code […]

Jul, 7

Supercharging Federated Learning with Flower and NVIDIA FLARE

Several open-source systems, such as Flower and NVIDIA FLARE, have been developed in recent years while focusing on different aspects of federated learning (FL). Flower is dedicated to implementing a cohesive approach to FL, analytics, and evaluation. Over time, Flower has cultivated extensive strategies and algorithms tailored for FL application development, fostering a vibrant FL […]

Jul, 7

Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

The increasing adoption of large language models (LLMs) has created a pressing need for an efficient, secure and private serving infrastructure, which allows researchers to run open-source or custom fine-tuned LLMs and ensures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs […]

Jul, 7

Towards Unified Analysis of GPU Consistency

After more than 30 years of research, there is a solid understanding of the consistency guarantees given by CPU systems. Unfortunately, the same is not yet true for GPUs. The growing popularity of general purpose GPU programming has been a call for action which industry players like Nvidia and Khronos have answered by formalizing their […]

OpenCL

Jul, 7

Automatic Code Rewriting for Performance Portability

Rewriting code for cleanliness, API changes, and new programming models is a common, yet time-consuming task. This is important for HPC applications that desire performance portability in particular, since these applications are usually very long lived and wish to run on many architectures, so they need to be written such that they can make good […]

OpenCL

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs

RBMD: A molecular dynamics package enabling to simulate 10 million all-atom particles in a single graphics processing unit

Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems

The Impact of Modern Consumer GPUs on Commonly Used Secure Password Standards

Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper

Automating Heterogeneous Parallelism in Numerical Differential Equations

Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models

Supercharging Federated Learning with Flower and NVIDIA FLARE

Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

Towards Unified Analysis of GPU Consistency

Automatic Code Rewriting for Performance Portability

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)