high performance computing on graphics processing units: hgpu.org

Posts

Oct, 5

Flexible, high performance convolutional neural networks for image classification

We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, […]

CUDA

Oct, 5

A parallel error diffusion implementation on a GPU

In this paper, we investigate the suitability of the GPU for a parallel implementation of the pinwheel error diffusion. We demonstrate a high-performance GPU implementation by efficiently parallelizing and unrolling the image processing algorithm. Our GPU implementation achieves a 10-30x speedup over a two-threaded CPU error diffusion implementation with comparable image quality. We […]

CUDA

Oct, 4

GPU performance comparison for accelerated radar data processing

Radar is a data-intensive measurement technique that often requires significant processing to make full use of the received signal. However, computing capacity is limited at remote or mobile radar installations, which limits the radar data products available for real-time decisions. We used graphics processing units (GPUs) to accelerate processing of high resolution phase-coded radar data from the […]

OpenCL

Oct, 4

A Massive Data Parallel Computational Framework on Petascale/Exascale Hybrid Computer Systems

Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA [1] and OpenCL [2] it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge [3] (a library based framework for heterogeneous multi-core systems), Zippy [4] (a framework for parallel […]

CUDA • OpenCL

Oct, 4

Architecture-Aware Optimization on a 1600-core Graphics Processor

The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated […]

CUDA • OpenCL

Oct, 4

Fine-grained Parallel ILU Preconditioners with Fill-ins for Multi-core CPUs and GPUs

Numerical simulation and its huge computational demands require a close coupling between efficient mathematical methods and their hardware-aware implementation on emerging and highly parallel computing platforms. The paradigm shift towards manycore parallelism not only offers high potential computing capability but also poses urgent challenges in designing scalable, portable, and flexible software […]

OpenCL

Oct, 4

GPU Algorithms for Diamond-based Multiresolution Terrain Processing

We present parallel algorithms for processing, extracting and rendering adaptively sampled regular terrain datasets represented as a multiresolution model defined by a super-square-based diamond hierarchy. This model represents a terrain as a nested triangle mesh generated through a series of longest edge bisections and encoded in an implicit hierarchical structure, which clusters triangles into diamonds […]

OpenCL • OpenGL

Oct, 4

Finite element assembly strategies on multi- and many-core architectures

We demonstrate that radically differing implementations of finite element methods are needed on multicore (CPU) and many-core (GPU) architectures, if their respective performance potential is to be realised. Our experimental investigations using a finite element advection-diffusion solver show that increased performance on each architecture can only be achieved by committing to specific and diverse algorithmic […]

CUDA • OpenCL

Oct, 4

Berkeley Dwarfs on CUDA

Graphics processing units (GPUs) have greatly improved their performance over the last ten years. The first graphics cards were developed in the late 1990s and were targeted at the mass market. These first cards were special-purpose hardware, designed to accelerate the graphics processing required in computer games. As the interest in computer games continued, GPU […]

CUDA • OpenCL

Oct, 4

Comparing Parallel Simulation of Social Agents using Cilk and OpenCL

Recent advances in wireless/mobile communication and body-worn sensors, together with ambient intelligence and seamlessly integrated pervasive technology, have paved the way for applications that operate on social signals, i.e., the sensing and processing of group behavior, interpersonal relationships, or emotions. Thinking at large scale, it should be apparent that modeling social systems allowing to study […]

OpenCL

Oct, 4

Optimization of the Gaussian Mixture Model Evaluation on GPU

In this paper we present a highly optimized implementation of the Gaussian mixture acoustic model evaluation algorithm. Evaluation of these likelihoods is one of the most computationally intensive parts of automatic speech recognizers, but it can be well parallelized and offloaded to GPU devices. Our approach offers significant speed-up compared to the recently published approaches, since it […]

CUDA • OpenCL

Oct, 4

Acceleration of Radiance for Lighting Simulation by Using Parallel Computing with OpenCL

We report on the acceleration of annual daylighting simulations for fenestration systems in the Radiance raytracing program. The algorithm was optimized to reduce both the redundant data input/output operations and the floating-point operations. To further accelerate the simulation speed, the calculation for matrix multiplications was implemented using parallel computing on a graphics processing unit. We […]

OpenCL


