high performance computing on graphics processing units: hgpu.org

Posts

Dec, 21

Practical Symmetric Key Cryptography on Modern Graphics Hardware

Graphics processors are continuing their trend of vastly outperforming CPUs while becoming more general purpose. The latest generation of graphics processors has introduced the ability to handle integers natively. This has increased the GPU’s applicability to many fields, especially cryptography. This paper presents an application-oriented approach to block cipher processing on GPUs. A new block […]

CUDA

Dec, 21

Dust-Dust Collisional Charging and Lightning in Protoplanetary Discs

We study the role of dust-dust collisional charging in protoplanetary discs. We show that dust-dust collisional charging becomes an important process in determining the charge state of dust and gas, if there is dust enhancement and/or dust is fluffy, so that dust surface area per disc volume is locally increased. We solve the charge equilibrium […]

CUDA

Dec, 21

GPU-based ultra fast IMRT plan optimization

The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient’s geometry. Such efforts face major technical challenges in performing treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at […]

CUDA

Dec, 21

DSPSR: Digital Signal Processing Software for Pulsar Astronomy

DSPSR is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, DSPSR is […]

CUDA

Dec, 21

Efficiency of the energy transfer in the Fenna-Matthews-Olson complex using hierarchical equations on graphics processing units

We study the energy transfer in light-harvesting complexes (LHC) and the importance of quantum coherence and the backaction of the molecular environment on the energy flow. We calculate the energy-transfer efficiency and the trapping time for the Fenna-Matthews-Olson (FMO) complex within the exact hierarchical approach proposed by Ishizaki and Fleming (J. Chem. Phys. vol 130, […]

CUDA

Dec, 20

Comparing the Treecode with FMM on GPUs for vortex particle simulations of a leapfrogging vortex ring

We compare the performance of the treecode and the fast multipole method (FMM) on GPUs. These fast algorithms are used to accelerate a vortex particle simulation of two leapfrogging vortex rings. The performance of the treecode and FMM depends strongly on the number of particles, type of hardware, distribution of particles, and the required accuracy. […]

Dec, 20

A parallel immune algorithm for traveling salesman problem and its application on cold rolling scheduling

Parallel computing provides efficient solutions for combinatorial optimization problems. However, since communication among computing processes is rather costly, an actual parallel or distributed algorithm comes with substantial expenditures, such as hardware, management, and maintenance. In this study, a parallel immune algorithm based on a graphics processing unit (GPU) that originally comes to process the computer […]

Dec, 20

CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high-performance, data-parallel architecture of modern graphics processing units (GPUs) can reduce computational times by orders of magnitude. However, their massively threaded architecture introduces challenges when GPU resources are exceeded. […]

CUDA

Dec, 20

A closer look at GPUs

As the line between GPUs and CPUs begins to blur, it’s important to understand what makes GPUs tick.

Dec, 20

Accelerating the Fourier split operator method via graphics processing units

Current generations of graphics processing units have turned into highly parallel devices with general computing capabilities. Thus, graphics processing units may be utilized, for example, to solve time-dependent partial differential equations by the Fourier split operator method. In this contribution, we demonstrate that graphics processing units are capable of calculating fast Fourier transforms much […]

CUDA

Dec, 19

FFT and Convolution Performance in Image Filtering on GPU

Many contemporary visualization tools comprise some image filtering approach. Since image filtering approaches are very computationally demanding, acceleration using graphics hardware (GPU) is highly desirable to preserve the interactivity of the main visualization tool itself. In this article we take a close look at the GPU implementation of two basic approaches to image filtering: fast Fourier transform […]

Dec, 19

ATTILA: a cycle-level execution-driven simulator for modern GPU architectures

This work presents a cycle-level execution-driven simulator for modern GPU architectures. We discuss the simulation model used for our GPU simulator, based on the concept of boxes and signals, and the relation between the timing simulator and the functional emulator. The simulation model we use helps to increase the accuracy and reduce the number […]

OpenGL
