high performance computing on graphics processing units: hgpu.org

Posts

Jan, 6

Message passing on data-parallel architectures

This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement the “DCGN” API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to […]

Jan, 6

Accelerating phase unwrapping and affine transformations for optical quadrature microscopy using CUDA

Optical Quadrature Microscopy (OQM) is a process which uses phase data to capture information about the sample being studied. OQM is part of an imaging framework developed by the Optical Science Laboratory at Northeastern University. In one particular application of interest, the framework is used to extract phase information from the image of an embryo […]

CUDA

Jan, 6

Hardware-accelerated parallel non-photorealistic volume rendering

Non-photorealistic rendering can be used to illustrate subtle spatial relationships that might not be visible with more realistic rendering techniques. We present a parallel hardware-accelerated rendering technique, making extensive use of multi-texturing and paletted textures, for the interactive non-photorealistic visualization of scalar volume data. With this technique, we can render a 512x512x512 volume using non-photorealistic […]

Jan, 6

Interactive volume illustration

In this paper we describe non-photorealistic rendering techniques for volumetric data sets. First, we outline an automatic approach that generates line drawings to illustrate such data sets and to augment traditional volume rendering techniques. For a number of seed points that are placed appropriately to represent selected volume structures, curvature lines are traced and encoded […]

OpenGL

Jan, 6

Synthetic Aperture Radar Processing with GPGPU

This article focuses on methodologies, with recurrent use of code examples that follow the flow of the main steps of SAR processing. A comprehensive treatment was prevented by the wide range of variations of the focusing algorithm as well as the spread of applications. The reader should look at […]

CUDA

Jan, 6

An asymmetric distributed shared memory model for heterogeneous parallel systems

Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.

CUDA

Jan, 6

Efficient gather and scatter operations on graphics processors

Gather and scatter are two fundamental data-parallel operations, where a large number of data items are read (gathered) from or are written (scattered) to given locations. In this paper, we study these two operations on graphics processing units (GPUs).

CUDA
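
As a rough illustration of the two operations named in the abstract above, here is a minimal sequential sketch in Python. It is not the paper's GPU implementation; the function names `gather` and `scatter` and the `fill` parameter are assumptions chosen for this example. On a GPU, each loop iteration would typically be handled by its own thread.

```python
def gather(src, idx):
    """Read from the given locations: out[i] = src[idx[i]]."""
    return [src[i] for i in idx]


def scatter(values, idx, size, fill=0):
    """Write to the given locations: out[idx[i]] = values[i]."""
    out = [fill] * size          # positions never written keep `fill`
    for v, i in zip(values, idx):
        out[i] = v
    return out


data = [10, 20, 30, 40]
perm = [2, 0, 3, 1]              # hypothetical index list (a permutation)

gathered = gather(data, perm)
scattered = scatter(data, perm, len(data))
# For a permutation, scattering with the same indices undoes the gather.
restored = scatter(gathered, perm, len(data))
```

When the index list is a permutation, `scatter` with the same indices reverses `gather`, which is why `restored` equals `data` here.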

Jan, 6

Data-parallel algorithms and data structures

Abstract is not available.

Jan, 6

q-state Potts model metastability study using optimized GPU-based Monte Carlo algorithms

We implemented a GPU-based parallel code to perform Monte Carlo simulations of the two-dimensional q-state Potts model. The algorithm is based on a checkerboard update scheme and assigns an independent random number generator to each thread (one thread per spin). The implementation allows simulation of systems of up to ~10^9 spins with an average time […]

CUDA
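
The checkerboard scheme mentioned in the abstract above can be sketched sequentially in Python. This is an illustration under stated assumptions, not the authors' code: the Metropolis acceptance rule, the energy E = -Σ δ(s_i, s_j) over nearest-neighbour bonds, periodic boundaries, an even lattice size, and the helper names `make_lattice`, `count_equal` and `checkerboard_sweep` are all assumptions made for this example. One generator per site stands in for one generator per thread.

```python
import math
import random


def make_lattice(L, q, seed):
    """Random L x L spins in {0..q-1} plus one independent RNG per site."""
    init = random.Random(seed)
    spins = [[init.randrange(q) for _ in range(L)] for _ in range(L)]
    rngs = [[random.Random(seed * L * L + i * L + j + 1) for j in range(L)]
            for i in range(L)]
    return spins, rngs


def count_equal(spins, L, i, j, s):
    """Number of the four nearest neighbours (periodic) in state s."""
    return sum(spins[(i + di) % L][(j + dj) % L] == s
               for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)))


def checkerboard_sweep(spins, rngs, L, q, beta):
    """Update all sites of one colour, then all sites of the other.

    For even L, every neighbour of a site has the opposite colour, so
    same-colour updates do not depend on each other and could run in parallel.
    """
    for color in (0, 1):
        for i in range(L):
            for j in range(L):
                if (i + j) % 2 != color:
                    continue
                rng = rngs[i][j]
                old, new = spins[i][j], rng.randrange(q)
                # Energy change when replacing `old` by `new` at this site.
                dE = count_equal(spins, L, i, j, old) - count_equal(spins, L, i, j, new)
                if dE <= 0 or rng.random() < math.exp(-beta * dE):
                    spins[i][j] = new


L, q = 8, 3
spins, rngs = make_lattice(L, q, seed=1)
for _ in range(5):
    checkerboard_sweep(spins, rngs, L, q, beta=1.0)
```

Because each site draws from its own generator and same-colour sites never read each other, the result does not depend on the order in which sites of one colour are visited.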

Jan, 6

Fully-3D GPU PET reconstruction

Fully-3D iterative tomographic image reconstruction is computationally very demanding. Graphics Processing Units (GPUs) have been proposed for many years as potential accelerators for complex scientific problems, but it has not been until the recent advances in the programmability of GPUs that the best available reconstruction codes have started to be implemented to run on […]

Jan, 5

Speeding Up Homomorpic Hashing Using GPUs

Homomorphic hash functions (HHFs) have been applied to peer-to-peer networks with erasure coding or network coding to defend against pollution attacks. Unfortunately, HHFs are computationally expensive for contemporary CPUs. This paper proposes to exploit the computing power of graphics processing units (GPUs) for homomorphic hashing. Specifically, we demonstrate how to use NVIDIA GPUs and the computer […]

CUDA

Jan, 5

CULLIDE: interactive collision detection between complex models in large environments using graphics hardware

We present a novel approach for fast collision detection between multiple deformable and breakable objects in a large environment using graphics hardware. Our algorithm takes into account low bandwidth to and from the graphics cards and computes a potentially colliding set (PCS) using visibility queries. It involves no precomputation and proceeds in multiple stages: PCS […]

OpenGL
