high performance computing on graphics processing units: hgpu.org

Posts

May, 27

Performance Portability in Accelerated Parallel Kernels

Heterogeneous architectures, by definition, include multiple processing components with very different microarchitectures and execution models. In particular, computing platforms from supercomputers to smartphones can now incorporate both CPU and GPU processors. Disparities between CPU and GPU processor architectures have naturally led to distinct programming models and development patterns for each component. Developers for a specific system decompose […]

OpenCL

May, 27

A Performance Modeling and Optimization Analysis Tool for Sparse Matrix-Vector Multiplication on GPUs

This paper presents a performance modeling and optimization analysis tool to predict and optimize the performance of sparse matrix-vector multiplication (SpMV) on GPUs. We make the following contributions: (1) We present an integrated analytical and profile-based performance modeling to accurately predict the kernel execution times of CSR, ELL, COO, and HYB SpMV kernels. Our proposed […]

CUDA

May, 27

Rapid Computation of Sodium Bioscales Using GPU-Accelerated Image Reconstruction

Quantitative sodium magnetic resonance imaging permits noninvasive measurement of the tissue sodium concentration (TSC) bioscale in the brain. Computing the TSC bioscale requires reconstructing and combining multiple datasets acquired with a non-Cartesian acquisition that highly oversamples the center of k-space. Even with an optimized implementation of the algorithm to compute TSC, the overall processing time […]

CUDA

May, 27

Trapping of giant-planet cores – I. vortex aided trapping at the outer dead zone edge

In this paper the migration of a 10 Earth mass planetary core is investigated at the outer boundary of the dead zone of a protoplanetary disc by means of 2D hydrodynamic simulations done with the GPU version of the FARGO code. In the dead zone the effective viscosity is greatly reduced due to the disc […]

CUDA

May, 27

Scaling Radio Astronomy Signal Correlation on Heterogeneous Supercomputers Using Various Data Distribution Methodologies

Next generation radio telescopes will require orders of magnitude more computing power to provide a view of the universe with greater sensitivity. In the initial stages of the signal processing flow of a radio telescope, signal correlation is one of the largest challenges in terms of handling huge data throughput and intensive computations. We implemented […]

OpenCL

May, 26

GPU Accelerated XenDesktop for Designers and Engineers (webinar)

If you’ve ever wanted to virtualize your CAD or professional video graphics application and have the exact same local experience on a secure central platform, then this webinar provides insight on how to get there. Join Technology Evangelist, Thomas Poppelgaard, and learn how Citrix XenDesktop®, XenApp® and XenServer®, in combination with NVIDIA GRID VGX, makes […]

May, 26

Easily Accelerating Existing Monte Carlo Code: CVA and CCR Examples (webinar)

In this webinar, Hicham Lahlou, CEO & Co-founder, Xcelerit, will discuss real world applications of GPUs in risk management. He will show, using the Xcelerit SDK, how the complexity of GPU programming can be overcome allowing existing models and applications to be easily accelerated and extended, cutting software development and maintenance costs. The risks associated […]

May, 25

Enabling OS Research by Inferring Interactions in the Black-Box GPU Stack

General-purpose GPUs now account for substantial computing power on many platforms, but the management of GPU resources – cycles, memory, bandwidth – is frequently hidden in black-box libraries, drivers, and devices, outside the control of mainstream OS kernels. We believe that this situation is untenable, and that vendors will eventually expose sufficient information about cross-black-box […]

CUDA

May, 25

Use of Multi-GPU Systems for Larger Than Device FFTs: With Applications in Ultrasound Simulations

Ultrasound simulations are applications that are both computation- and communication-intensive. With better performance, implementations of these can be used in designing new ultrasound probes, developing better signal processing techniques, training new ultrasonographers, in treatment planning and many other uses [11]. The pseudo-spectral technique can be used effectively to express the wave-propagation […]

CUDA

May, 25

GPU-accelerated protein family identification for metagenomics

The clustering of putative protein/Open Reading Frame (ORF) sequences available from large-scale metagenomics survey projects is a core analytical function that has led to the identification and characterization of novel protein families of environmental microbial communities. The implementation of this function, however, is currently challenged not only by data size but also by data complexity. […]

CUDA

May, 25

Stencil and Lattice Structures for Field Equation Model Simulations on GPUs

Field equations can be numerically simulated by approximating a continuous space field by a discrete lattice. There are a number of different lattice geometries that can be used to approximate continuous space which may cause structural artefacts in the simulation. These different lattice structures require the use of different stencil operators to approximate the spatial […]

CUDA

May, 25

A Tuned, Concurrent-Kernel Approach to Speed Up the APSP Problem

The All-Pair Shortest-Path (APSP) problem is a well-known problem in graph theory whose objective is to find the shortest paths between any pair of nodes. Computing the distances from one source node to the rest and repeating this process for every node of the graph is an adequate solution for sparse graphs. During the last […]

CUDA
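
The repeated single-source approach mentioned in the abstract above can be sketched as follows. This is a minimal CPU illustration in Python, assuming non-negative edge weights and Dijkstra's algorithm as the single-source step; the names `dijkstra`, `apsp`, and `graph` are hypothetical, and this is not the paper's tuned, concurrent-kernel CUDA implementation.

```python
import heapq


def dijkstra(adj, source):
    """Single-source shortest distances (assumes non-negative weights).

    `adj` maps every node to a list of (neighbor, weight) pairs; every
    node must appear as a key, even if it has no outgoing edges.
    """
    dist = {node: float("inf") for node in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry; a shorter path was already found
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def apsp(adj):
    """All-pairs shortest paths: repeat the single-source step per node."""
    return {source: dijkstra(adj, source) for source in adj}


# Hypothetical toy graph, for illustration only.
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0)],
    "c": [("a", 1.0)],
    "d": [],
}

distances = apsp(graph)
```

Each call to `dijkstra` is independent of the others, which is what makes the per-source runs natural candidates for concurrent execution.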

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
