high performance computing on graphics processing units: hgpu.org

Posts

Jun, 14

A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

Achieving computing at the exascale means accelerating today’s applications by one thousand times. Clearly, this cannot be accomplished by hardware alone, at least not in the short time frame expected for reaching this performance milestone. Thus, a lively discussion has begun in the last couple of years about programming models, software components and tools, and […]

CUDA

Jun, 14

A sparse octree gravitational N-body code that runs entirely on the GPU processor

We present parallel algorithms for constructing and traversing sparse octrees on graphics processing units (GPUs). The algorithms are based on parallel-scan and sort methods. To test the performance and feasibility, we implemented them in CUDA in the form of a gravitational tree-code which runs completely on the GPU. (The code is publicly available at: http://castle.strw.leidenuniv.nl/software.html) The […]

CUDA

Jun, 14

Design Exploration of Quadrature Methods in Option Pricing

This paper presents a novel parallel architecture for accelerating quadrature methods used for pricing complex multi-dimensional options, such as discrete barrier, Bermudan and American options. We explore different designs of the quadrature evaluation core including optimized pipelined hardware designs in reconfigurable logic and a compute unified device architecture (CUDA)-based graphics processing unit (GPU) design. A […]

CUDA

Jun, 14

Accelerating Parameter Sweep Applications Using CUDA

This paper proposes a parallelization scheme for parameter sweep (PS) applications using the compute unified device architecture (CUDA). Our scheme focuses on PS applications with irregular access patterns, which usually result in lower performance on the GPU. The key idea to resolve this irregularity is to exploit the similarity of data accesses between different parameters. […]

CUDA

Jun, 14

CUDA Implementation of ${\rm TE}^{z}$-FDTD Solution of Maxwell’s Equations in Dispersive Media

This letter presents a graphics processing unit (GPU) implementation of the finite-difference time-domain (FDTD) method for solving the two-dimensional electromagnetic fields inside dispersive media. The FDTD domain is truncated by the convolutional perfectly matched layer (CPML), and the piecewise-linear recursive-convolution (PLRC) formulation is used for modeling dispersive media. By using the newly introduced […]

CUDA

Jun, 14

Cellular Level Agent Based Modelling on the Graphics Processing Unit

Cellular level agent based modelling is reliant on either sequential processing environments or expensive and largely unavailable PC grids. The GPU offers an alternative architecture for such systems; however, the steep learning curve associated with the GPU's data-parallel architecture has previously limited the uptake of this emerging technology. In this paper we demonstrate a […]

CUDA

Jun, 14

A multi-platform linear algebra toolbox for finite element solvers on heterogeneous clusters

Heterogeneous clusters with multiple sockets and multicore-processors accelerated by dedicated coprocessors like GPUs, Cell BE, FPGAs or others nowadays provide unrivaled computing power in terms of floating point operations. Specific capabilities of additional processor technologies enable dedicated exploitation with respect to particular application and data characteristics. However, resource utilization, programmability, and scalability of applications across […]

Jun, 14

Speeding up the MATLAB Hyperspectral Image Analysis Toolbox using GPUs and the Jacket Toolbox

The Hyperspectral Image Analysis Toolbox (HIAT) is a MATLAB™ toolbox for the analysis of hyperspectral imagery. HIAT includes a collection of algorithms for processing of hyperspectral and multispectral imagery under the MATLAB environment. The objective of HIAT is to provide a suite of information extraction algorithms to users of hyperspectral and multispectral imagery across different […]

Jun, 14

Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs

Many time series data mining problems require subsequence similarity search as a subroutine. Dozens of similarity/distance measures have been proposed in the last decade and there is increasing evidence that Dynamic Time Warping (DTW) is the best measure across a wide range of domains. Given DTW’s usefulness and ubiquity, there has been a large community-wide […]

CUDA

Jun, 14

Memory-efficient implementation of a graphics processor-based cluster detection algorithm for large spatial databases

Numerous approaches have been proposed for detecting clusters, or groups of data, in spatial databases. Of these, the algorithm known as Density Based Spatial Clustering of Applications with Noise (DBSCAN) is a recent approach which has proven efficient for larger databases. Graphical Processing Units (GPUs), used originally to aid in the processing of high intensity graphics, […]

Jun, 14

Parallel implementation of artificial neural network training

In this paper we describe the implementation of a complete ANN training procedure for speech recognition using the block mode back-propagation learning algorithm. We exploit the high performance SIMD architecture of GPU using CUDA and its C-like language interface. We also compare the speed-up obtained implementing the training procedure only taking advantage of the multi-thread […]

CUDA

Jun, 14

CuParcone: A High-Performance Evolvable Neural Network Model

An algorithm for evolving recurrent neural networks via a genetic algorithm was implemented in CUDA, resulting in a system called CuParcone (CUDA based Partially Connected Neural Evolutionary). Run on an Nvidia Tesla "GPU supercomputer," CuParcone achieves a performance increase of 323 times in face gender recognition compared to the comparable Parcone algorithm on a […]

CUDA

Recent source codes

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models
