
Posts

Feb, 22

Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units

Many-core processors, such as graphic processing units (GPUs), are promising platforms for intrinsically parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM for multiple GPUs has not been studied extensively and systematically. In this article, we […]
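The LBM's intrinsic parallelism comes from its local stream-and-collide update: each lattice site only reads its immediate neighbours. As a minimal illustration (not the paper's 3D multi-GPU implementation; all names and the simplified D1Q2 diffusion lattice are our own), one time step might look like:

```python
# Sketch of one lattice Boltzmann time step on a 1D two-velocity (D1Q2)
# diffusion lattice with periodic boundaries. Purely illustrative: real LBM
# codes, including the one the paper describes, use 2D/3D lattices (e.g. D3Q19)
# and run the per-site loops in parallel on the GPU.

def lbm_step(f_plus, f_minus, omega=1.0):
    """One LBM step: collide toward local equilibrium, then stream."""
    n = len(f_plus)
    # Collision: relax each distribution toward equilibrium, which for
    # pure diffusion is half the local density at the site.
    for i in range(n):
        rho = f_plus[i] + f_minus[i]
        eq = 0.5 * rho
        f_plus[i] += omega * (eq - f_plus[i])
        f_minus[i] += omega * (eq - f_minus[i])
    # Streaming: shift right-moving and left-moving distributions to the
    # neighbouring sites (periodic wrap-around). Each output site depends
    # only on one neighbour, so this maps directly onto GPU threads.
    f_plus = [f_plus[(i - 1) % n] for i in range(n)]
    f_minus = [f_minus[(i + 1) % n] for i in range(n)]
    return f_plus, f_minus
```

Because collision is purely local and streaming touches only nearest neighbours, a cluster implementation need exchange only one boundary layer of distributions per step between GPUs.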
Feb, 22

GPU-Based Iterative Relative Fuzzy Connectedness Image Segmentation

This paper presents a parallel algorithm for the top of the line among the fuzzy connectedness algorithm family, namely the iterative relative fuzzy connectedness (IRFC) segmentation method. The algorithm of IRFC, realized via image foresting transform (IFT), is implemented by using NVIDIA’s compute unified device architecture (CUDA) platform for segmenting large medical image data sets. […]
Feb, 22

A Hierarchical Thread Scheduler and Register File for Energy-efficient Throughput Processors

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy […]
Feb, 22

Image segmentation using CUDA implementations of the Runge-Kutta-Merson and GMRES methods

Modern GPUs are well suited for performing image processing tasks. We utilize their high computational performance and memory bandwidth for image segmentation purposes. We segment cardiac MRI data by means of numerical solution of an anisotropic partial differential equation of the Allen-Cahn type. We implement two different algorithms for solving the equation on the CUDA […]
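The segmentation rests on time-stepping a reaction-diffusion equation of Allen-Cahn type. A minimal sketch of one explicit Euler step, assuming the isotropic form u_t = eps·Δu + u − u³ on a uniform grid (the paper uses an anisotropic variant and GPU solvers; function and parameter names here are our own):

```python
# Illustrative explicit Euler step for an Allen-Cahn-type equation
#   u_t = eps * Laplacian(u) + u - u**3
# on a 2D grid. The double-well term u - u**3 drives values toward the
# stable phases -1 and +1, which is what produces segmentation boundaries.

def allen_cahn_step(u, dt=0.01, eps=0.1):
    """One explicit time step; boundary cells are left unchanged
    (a crude stand-in for proper boundary conditions)."""
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # 5-point discrete Laplacian (unit grid spacing assumed).
            lap = (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]
                   - 4.0 * u[i][j])
            new[i][j] = u[i][j] + dt * (eps * lap + u[i][j] - u[i][j]**3)
    return new
```

Each grid point is updated independently from a fixed neighbourhood, which is why such stencil sweeps map well onto CUDA thread blocks; implicit variants lead to the linear systems that motivate the GMRES solver mentioned in the title.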
Feb, 22

High Performance N-Body Simulation and Visualization through CUDA Architecture

General-purpose computing on graphics processing units (GPGPU) has become a new paradigm for easily programming massively parallel processors. This hardware architecture is well suited to N-body problems such as molecular dynamics simulations, since each body can be computed on its own thread. Visualizations of molecular systems, such as the ‘claret’ simulator, have been developed. However […]
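The one-body-per-thread mapping works because, in a direct-sum N-body method, each body's acceleration is computed independently from a read-only snapshot of all positions. A hedged sketch (our own simplified 2D gravitational example, not the paper's code):

```python
# Direct O(N^2) N-body acceleration sum in 2D. The outer loop over body i
# carries no dependences, so on a GPU each iteration becomes one thread;
# here it is just a sequential Python loop for illustration.

def accelerations(pos, mass, G=1.0, soft=1e-3):
    """Gravitational acceleration on each body from all others.
    `soft` is a softening length that avoids the singularity at r = 0."""
    n = len(pos)
    acc = []
    for i in range(n):                 # independent per body -> one thread each
        ax = ay = 0.0
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r2 = dx * dx + dy * dy + soft * soft
            inv_r3 = r2 ** -1.5
            ax += G * mass[j] * dx * inv_r3
            ay += G * mass[j] * dy * inv_r3
        acc.append((ax, ay))
    return acc
```

On real hardware the inner loop is typically tiled through shared memory so a thread block reloads positions cooperatively, but the dependence structure is exactly the one above.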
Feb, 21

Variants of Mersenne Twister Suitable for Graphic Processors

This paper proposes a type of pseudorandom number generator, Mersenne Twister for Graphic Processor (MTGP), for efficient generation on graphics processing units (GPUs). MTGP supports large state sizes such as 11213 bits, and uses the high parallelism of GPUs in computing many steps of the recursion in parallel. The second proposal is a parameter-set generator […]
Feb, 21

Efficient Data Management for GPU Databases

General purpose GPUs are a new and powerful hardware device with a number of applications in the realm of relational databases. We describe a database framework designed to allow both CPU and GPU execution of queries. Through use of our novel data structure design and method of using GPU-mapped memory with efficient caching, we demonstrate […]
Feb, 21

High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures

Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from many fewer samples than required by the Nyquist criterion. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine […]
Feb, 21

Acceleration of Composite Order Bilinear Pairing on Graphics Hardware

Recently, composite-order bilinear pairing has been shown to be useful in many cryptographic constructions. However, it is time-costly to evaluate. This is because the composite order should be at least 1024 bits and, hence, the elliptic curve group order $n$ and base field become too large, rendering the bilinear pairing algorithm itself too slow to be […]
Feb, 21

GPGPU Processing in CUDA Architecture

The future of computation is the graphics processing unit, i.e. the GPU. Given the promise that graphics cards have shown in the field of image processing and accelerated rendering of 3D scenes, and the computational capability these GPUs possess, they are developing into powerful parallel computing units. It is quite simple to program a […]
Feb, 20

Implementation of LTE Mini receiver on GPUs

Long Term Evolution (LTE) is the latest standard for cellular mobile communication. To fully exploit the available spectrum, LTE utilizes feedback. Since the radio channel is varying in time, the feedback calculation is latency sensitive. In our upcoming LTE measurement with the Vienna Multiple Input Multiple Output (MIMO) Testbed, a low latency feedback calculation is […]
Feb, 20

Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops […]
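After skewing, the tiles of a DOACROSS loop nest can be executed wavefront by wavefront: every tile on the same anti-diagonal depends only on tiles from earlier wavefronts. A small sketch of that schedule (our own illustration of the general wavefront idea, not the paper's tile-size model):

```python
# Enumerate the tiles of a skewed 2D tile space in wavefront order.
# Tiles sharing a wavefront w = ti + tj have no mutual dependences and
# can run concurrently (e.g. one tile per GPU thread block); successive
# wavefronts must be separated by a synchronization.

def wavefronts(n_ti, n_tj):
    """Return tile coordinates grouped by wavefront index w = ti + tj."""
    fronts = []
    for w in range(n_ti + n_tj - 1):
        fronts.append([(ti, w - ti)
                       for ti in range(n_ti)
                       if 0 <= w - ti < n_tj])
    return fronts
```

The parallelism per wavefront grows from 1 tile up to min(n_ti, n_tj) and shrinks again, which is exactly why tile size selection is performance-critical here: smaller tiles expose more concurrent tiles per wavefront but pay more synchronization and lose locality.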

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
