high performance computing on graphics processing units: hgpu.org

Posts

Jun, 10

Measuring the Impact of Configuration Parameters in CUDA Through Benchmarking

The choice of threadblock size and shape is one of the most important user decisions when a parallel problem is coded to run on GPU architectures. In fact, threadblock configuration has a significant impact on the global performance of the program. Unfortunately, the programmer does not have enough information about the subtle interactions between this choice of […]

CUDA
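As a rough illustration of the configuration space the abstract above refers to, the following sketch enumerates candidate 2D threadblock shapes for a benchmark sweep and computes the matching grid size. It is not taken from the paper: the function names are hypothetical, and the limits (at most 1024 threads per block, warps of 32 threads, power-of-two dimensions only) are assumptions that hold for many recent NVIDIA GPUs but should be checked against the target device.

```python
def candidate_block_shapes(max_threads=1024, warp_size=32):
    """List power-of-two (x, y) block shapes worth benchmarking.

    Assumptions (not from the paper): a block may hold at most
    max_threads threads, and only shapes whose thread count is a
    multiple of the warp size are kept, so no warp is partly empty.
    """
    shapes = []
    x = 1
    while x <= max_threads:
        y = 1
        while x * y <= max_threads:
            if (x * y) % warp_size == 0:
                shapes.append((x, y))
            y *= 2
        x *= 2
    return shapes


def grid_for(width, height, block):
    """Number of blocks per dimension needed to cover a width x height domain."""
    bx, by = block
    # Ceiling division, so partial blocks at the border are included.
    return ((width + bx - 1) // bx, (height + by - 1) // by)


if __name__ == "__main__":
    for shape in candidate_block_shapes():
        print(shape, grid_for(1000, 1000, shape))
```

Shapes with the same thread count, such as (32, 8) and (8, 32), can perform very differently because they change which addresses neighbouring threads touch, which is why a sweep over shape and not only size is useful.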

Jun, 9

Scaling Fast Multipole Methods up to 4000 GPUs

The Fast Multipole Method (FMM) is a hierarchical N-body algorithm with linear complexity, high arithmetic intensity, high data locality, hierarchical communication patterns, and no global synchronization. The combination of these features allows the FMM to scale well on large GPU-based systems, and to use their compute capability effectively. We present a 1 PFlop/s […]

CUDA

Jun, 9

Fast Morphological Image Processing Open-Source Extensions for GPU processing with CUDA

GPU architectures offer a significant opportunity for faster morphological image processing, and the NVIDIA CUDA architecture offers a relatively inexpensive and powerful framework for performing these operations. However, the generic morphological erosion and dilation operation in the CUDA NPP library is relatively naive, and performance scales expensively with increasing structuring element size. The objective of […]

CUDA

Jun, 9

Autotuning Stencil-Based Computations on GPUs

Finite-difference, stencil-based discretization approaches are widely used in the solution of partial differential equations describing physical phenomena. Newton-Krylov iterative methods commonly used in stencil-based solutions generate matrices that exhibit diagonal sparsity patterns. To exploit these structures on modern GPUs, we extend the standard diagonal sparse matrix representation and define new matrix and vector data types […]

CUDA
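For readers unfamiliar with the standard diagonal (DIA) sparse format mentioned in the abstract above, here is a minimal sketch of a matrix-vector product in that layout. It shows only the textbook format, not the paper's extended representation or its new data types. The function names and the row-aligned storage convention (`data[d][i]` holds the entry in row `i`) are assumptions made for this example.

```python
def dia_matvec(offsets, data, x):
    """Compute y = A @ x for a square matrix A stored by diagonals.

    offsets[d] is the diagonal's offset from the main diagonal
    (0 = main, +1 = first superdiagonal, -1 = first subdiagonal).
    data[d][i] holds A[i][i + offsets[d]]; entries whose column would
    fall outside the matrix are padding and are never read.
    """
    n = len(x)
    y = [0.0] * n
    for off, diag in zip(offsets, data):
        # Rows for which column i + off lies inside the matrix.
        start = max(0, -off)
        stop = min(n, n - off)
        for i in range(start, stop):
            y[i] += diag[i] * x[i + off]
    return y


def tridiagonal_laplacian(n):
    """1-D three-point stencil [-1, 2, -1] stored in the layout above."""
    offsets = [-1, 0, 1]
    data = [[-1.0] * n, [2.0] * n, [-1.0] * n]
    return offsets, data


if __name__ == "__main__":
    offsets, data = tridiagonal_laplacian(4)
    print(dia_matvec(offsets, data, [1.0, 2.0, 3.0, 4.0]))
```

On a GPU the inner loop over `i` would typically map to one thread per row, so consecutive threads read consecutive entries of each diagonal; that regular access pattern is the usual reason stencil matrices are stored this way.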

Jun, 9

Encapsulated synchronization and load-balance in heterogeneous programming

Programming models and techniques to exploit parallelism in accelerators, such as GPUs, are different from those used in traditional parallel models for shared- or distributed-memory systems. It is a challenge to blend different programming models to coordinate and exploit devices with very different characteristics and computation powers. This paper presents a new extensible framework model […]

CUDA

Jun, 9

Sparse LU Factorization for Parallel Circuit Simulation on GPU

The sparse solver has become the bottleneck of SPICE simulators. There has been little work on GPU-based sparse solvers because of the high data dependency. This strong data dependency means that parallel sparse LU factorization runs efficiently on shared-memory computing devices. But the number of CPU cores sharing the same memory is often limited. The state of the […]

CUDA • OpenCL

Jun, 8

Decoupling Algorithms from Schedules for Easy Optimization of Image Processing Pipelines

Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. […]

CUDA

Jun, 8

Ameliorating Memory Contention of OLAP operators on GPU Processors

Implementations of database operators on GPU processors have shown dramatic performance improvement compared to multicore-CPU implementations. GPU threads can cooperate using shared memory, which is organized in interleaved banks and is fast only when threads read and modify addresses belonging to distinct memory banks. Therefore, data processing operators implemented on a GPU, in addition to […]

CUDA
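The bank behaviour described in the abstract above can be made concrete with a small model. The sketch below is a simplification under stated assumptions, not the paper's method: shared memory is treated as 32 banks of 4-byte words, word `w` lives in bank `w % 32`, threads reading the same word are served by one broadcast, and the function names are hypothetical.

```python
from collections import defaultdict


def conflict_degree(word_indices, num_banks=32):
    """Worst-case number of serialized accesses for one warp.

    word_indices[t] is the 4-byte word index accessed by thread t.
    Distinct words in the same bank are serialized; identical words
    count once, since they are assumed to be broadcast.
    """
    banks = defaultdict(set)
    for w in word_indices:
        banks[w % num_banks].add(w)
    return max(len(words) for words in banks.values())


def strided_access(stride, warp_size=32):
    """Word indices touched when thread t accesses element t * stride."""
    return [t * stride for t in range(warp_size)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride, conflict_degree(strided_access(stride)))
```

In this model a stride of 32 is the worst case, because every thread lands in the same bank, while padding the stride to 33 spreads the accesses over all banks again.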

Jun, 8

A Comparison of Algebraic Multigrid Preconditioners using Graphics Processing Units and Multi-Core Central Processing Units

The influence of multi-core central processing units and graphics processing units on several algebraic multigrid methods is investigated in this work. Different performance metrics traditionally employed for algebraic multigrid are reconsidered and reevaluated on these novel computing architectures. Our benchmark results show that with the use of graphics processing units for the solver phase, it […]

OpenCL

Jun, 8

Astrophysical Particle Simulations on Heterogeneous CPU-GPU Systems

Heterogeneous CPU-GPU nodes are becoming popular in HPC clusters. We need to rethink algorithms and optimization techniques for such systems depending on the relative performance of CPU vs. GPU. In this paper, we report a performance-optimized particle simulation code, "OTOO", that is based on the octree method, for heterogeneous systems. Main applications of […]

OpenCL

Jun, 8

Parallel random variates generator for GPUs based on normal numbers

Pseudorandom number generators are required for many computational tasks, such as stochastic modelling and simulation. This paper investigates the serial CPU and parallel GPU implementation of a Linear Congruential Generator based on the binary representation of the normal number $\alpha_{2,3}$. We adapted two methods of modular reduction which allowed us to perform most operations in […]

CUDA
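To illustrate how a linear congruential generator can be split across parallel threads, as the abstract above implies, here is a minimal sketch using jump-ahead by modular exponentiation. It does not use the paper's generator or its modular reduction methods: the multiplier and modulus are the well-known Park-Miller "minimal standard" values, chosen only for illustration, and the function names are hypothetical.

```python
# Illustrative parameters (an assumption, not the paper's generator):
# multiplicative LCG x_{k+1} = A * x_k mod M with the Park-Miller values.
A = 48271
M = 2**31 - 1


def lcg_next(x):
    """Advance the generator by one step."""
    return (A * x) % M


def lcg_jump(x0, k):
    """Return the state k steps after x0 without iterating.

    For a multiplicative LCG, x_k = A**k * x0 mod M, and pow(A, k, M)
    evaluates the power in O(log k) multiplications.
    """
    return (pow(A, k, M) * x0) % M


def split_streams(seed, num_streams, stream_len):
    """Starting states for consecutive, non-overlapping blocks of the sequence.

    Stream s begins stream_len * s steps after the seed, so each GPU
    thread could generate its own block independently.
    The seed must lie in 1 .. M - 1.
    """
    return [lcg_jump(seed, s * stream_len) for s in range(num_streams)]


if __name__ == "__main__":
    print(split_streams(seed=12345, num_streams=4, stream_len=1000))
```

Because each stream start is computed directly from the seed, the concatenation of all streams reproduces the serial sequence, which makes a parallel implementation easy to validate against a CPU reference.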

Jun, 6

DMA-Assisted, Intranode Communication in GPU Accelerated Systems

Accelerator awareness has become a pressing issue in data movement models, such as MPI, because of the rapid deployment of systems that utilize accelerators. In our previous work, we developed techniques to enhance MPI with accelerator awareness, thus allowing applications to easily and efficiently communicate data between accelerator memories. In this paper, we extend this […]

CUDA

Recent source codes

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models
