high performance computing on graphics processing units: hgpu.org

Posts

Jun, 2

Classify QCD phase transition with deep learning

The state-of-the-art pattern recognition method in machine learning (deep convolution neural network) is used to identify the equation of state (EoS) employed in the relativistic hydrodynamic simulations of heavy ion collisions. High-level correlations of particle spectra in transverse momentum and azimuthal angle learned by the network act as an effective EoS-meter in deciphering the nature […]

OpenCL

Jun, 2

The Accelerator Wall: Limits of Chip Specialization

Specializing chips using hardware accelerators has become the prime means to alleviate the gap between the growing computational demands and the stagnating transistor budgets caused by the slowdown of CMOS scaling. Many of the benefits of chip specialization stem from optimizing a computational problem within a given chip’s transistor budget. Unfortunately, the stagnation of the […]

Jun, 2

A Development Platform for Embedded Domain-Specific Languages

The use of domain-specific languages (DSL) is a promising approach to helping programmers write an efficient program for high-performance computing. The programmers would feel difficulties in writing such a program by hand with only low-level abstractions, such as arrays and loops, provided by a general-purpose language. This chapter presents our new implementation technique for domain-specific […]

CUDA

Jun, 2

Heterogeneous Resource-Elastic Scheduling for CPU+FPGA Architectures

Heterogeneous computing is a key strategy to meet the requirements of many compute-intensive applications. However, currently, CPU+FPGA platforms are commonly underutilized as scheduling is often constrained to a run-to-completion model or acceleration of a single application at a time. To tackle this, this paper proposes heterogeneous resource-elastic scheduling for maximizing the utilization of both CPU […]

OpenCL

Jun, 2

Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the currently best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective […]

CUDA
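
The two forces named in the excerpt above (a regular gradient step plus a corrective pull toward the currently best-performing worker) can be sketched as follows. This is a toy, single-process illustration and not the paper's implementation: the function name `lsgd_step`, the step size `lr`, the pull strength `lam`, and the quadratic objective are all assumptions made for the example.

```python
def lsgd_step(workers, grad, loss, lr=0.1, lam=0.5):
    """One synchronous round over all workers (hypothetical helper).

    Each worker applies a gradient step and is additionally pulled
    toward the leader, i.e. the worker with the lowest current loss.
    """
    leader = min(workers, key=loss)  # currently best-performing worker
    return [
        [x - lr * g - lr * lam * (x - z)  # gradient step + pull toward leader
         for x, g, z in zip(w, grad(w), leader)]
        for w in workers
    ]


# Toy objective (assumption): f(w) = sum of squares, minimized at the origin.
def loss(w):
    return sum(x * x for x in w)


def grad(w):
    return [2.0 * x for x in w]


# Three simulated workers starting from different points.
workers = [[4.0, -2.0], [1.0, 0.5], [-3.0, 3.0]]
for _ in range(50):
    workers = lsgd_step(workers, grad, loss)

final_losses = [loss(w) for w in workers]
```

In this sketch the leader is re-selected every round, which is a simplification chosen for brevity. For the leader itself the pull term is zero, so it takes a plain gradient step.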

May, 30

Breadth-First Search using Dynamic Parallelism on the GPU

Breadth-First Search is an important basis for many different graph-based algorithms with applications ranging from peer-to-peer networking to garbage collection. However, the performance of different approaches depends strongly on the type of graph. In this paper, three algorithms of varying complexity are implemented using the CUDA Programming Model for the GPU and are compared to […]

CUDA

May, 30

The Impact of GPU DVFS on the Energy and Performance of Deep Learning: an Empirical Study

Over the past years, great progress has been made in improving the computing power of general-purpose graphics processing units (GPGPUs), which facilitates the prosperity of deep neural networks (DNNs) in multiple fields like computer vision and natural language processing. A typical DNN training process repeatedly updates tens of millions of parameters, which not only requires […]

CUDA

May, 30

Massively Parallel GPU Memory Compaction

Memory fragmentation is a widely studied problem of dynamic memory allocators. It is well known that fragmentation can lead to premature out-of-memory errors and poor cache performance. With the recent emergence of dynamic memory allocators for SIMD accelerators, memory fragmentation is becoming an increasingly important problem on such architectures. Nevertheless, it has received little attention […]

CUDA

May, 26

Multi-GPU Rendering with Vulkan API

Vulkan API provides a low-level interface to modern Graphics Processing Units (GPUs). With this thesis, we demonstrate how to use Vulkan to send commands explicitly to separate GPUs for implementing platform- and vendor-independent multi-GPU rendering. We describe how to implement the sort-first and sort-last approaches to perform parallel rendering with Vulkan. We introduce […]

May, 26

Automatic generation of warp-level primitives and atomic instructions for fast and portable parallel reduction on GPUs

Since the advent of GPU computing, GPU hardware has evolved at a fast pace. Since application performance heavily depends on the latest hardware improvements, performance portability is extremely challenging for GPU application library developers. Portability becomes even more difficult when new low-level instructions are added to the ISA (e.g., warp shuffle instructions) or the microarchitectural […]

CUDA

May, 26

Acceleration of Scientific Deep Learning Models on Heterogeneous Computing Platform with Intel FPGAs

AI and deep learning are experiencing explosive growth in almost every domain involving analysis of big data. Deep learning using Deep Neural Networks (DNNs) has shown great promise for such scientific data analysis applications. However, traditional CPU-based sequential computing can no longer meet the requirements of mission-critical applications, which are compute-intensive and require low latency […]

May, 26

Comparing Energy Efficiency of CPU, GPU and FPGA Implementations for Vision Kernels

Developing high performance embedded vision applications requires balancing run-time performance with energy constraints. Given the mix of hardware accelerators that exist for embedded computer vision (e.g. multi-core CPUs, GPUs, and FPGAs), and their associated vendor optimized vision libraries, it becomes a challenge for developers to navigate this fragmented solution space. To aid with determining which […]

CUDA

