high performance computing on graphics processing units: hgpu.org

Posts

Nov, 18

An adaptive Expectation-Maximization algorithm with GPU implementation for electron cryomicroscopy

Maximum-likelihood (ML) estimation has very desirable properties for reconstructing 3D volumes from noisy cryo-EM images of single macromolecular particles. Current implementations of ML estimation make use of the Expectation-Maximization (EM) algorithm or its variants. However, the EM algorithm is notoriously computation-intensive, as it involves integrals over all orientations and positions for each particle image. We […]

CUDA

Nov, 17

Correlation analysis on GPU systems using NVIDIA’s CUDA

Functional magnetic resonance imaging allows non-invasive measurements of brain dynamics and has already been used for neurofeedback experiments, which rely on real-time data processing. The limited computational resources that are typically available for this have hindered the use of connectivity analysis in this context. A basic, but already computationally demanding analysis method of neural […]

CUDA

Nov, 17

A Survey of Medical Image Registration on Multicore and the GPU

In this article, we look at early, recent, and state-of-the-art methods for registration of medical images using a range of high-performance computing (HPC) architectures including symmetric multiprocessing (SMP), massively multiprocessing (MMP), and architectures with distributed memory (DM), and nonuniform memory access (NUMA). The article is designed to be self-sufficient. We will take the time to […]

Nov, 17

TeraFLOP computing on a desktop PC with GPUs for 3D CFD

A very efficient implementation of a lattice Boltzmann (LB) kernel in 3D on a graphics processing unit using the compute unified device architecture interface developed by NVIDIA is presented. By exploiting the explicit parallelism offered by the graphics hardware, we obtain an efficiency gain of up to two orders of magnitude with respect to the […]

Nov, 17

FAST: fast architecture sensitive tree search on modern CPUs and GPUs

In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join and aggregation. However, unlike other primitives, tree search presents significant challenges due to […]

Nov, 17

Scalable parallel programming with CUDA

Is CUDA the parallel programming model that application developers have been waiting for?

CUDA

Nov, 17

Fast free-form deformation using graphics processing units

Many algorithms have been developed to perform non-rigid registration, a tool commonly used in medical image analysis. The free-form deformation algorithm is a well-established technique, but is extremely time-consuming. In this paper we present a parallel-friendly formulation of the algorithm suitable for graphics processing unit execution. Using our […]

CUDA

Nov, 17

A Graphics Parallel Memory Organization Exploiting Request Correlations

Real-time graphics applications require memory organizations featuring parallel pixel access and low-cost implementation. This work builds on a nonlinear skew mapping scheme and exploits the correlation between consecutive requests for pixels to design an efficient parallel memory organization. The mapping achieves parallel access to mn pixels, in various shapes, in a memory organized with mn […]

Nov, 17

permGPU: Using graphics processing units in RNA microarray association studies

BACKGROUND: Many analyses of microarray association studies involve permutation and bootstrap resampling, and cross-validation, that are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. RESULTS: We have developed a CUDA based implementation, permGPU, that employs graphics processing […]

CUDA

Nov, 17

Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices

In this paper, we propose an acceleration of collapsed variational Bayesian (CVB) inference for latent Dirichlet allocation (LDA) by using Nvidia CUDA compatible devices. While LDA is an efficient Bayesian multi-topic document model, it requires complicated computations for parameter estimation in comparison with other simpler document models, e.g. probabilistic latent semantic indexing, etc. Therefore, we […]

CUDA

Nov, 17

Accelerating simultaneous algebraic reconstruction technique with motion compensation using CUDA-enabled GPU

PURPOSE: To accelerate the simultaneous algebraic reconstruction technique (SART) with motion compensation for fast, high-quality computed tomography reconstruction by exploiting a CUDA-enabled GPU. METHODS: Two core techniques are proposed to fit SART into the CUDA architecture: (1) a ray-driven projection along with hardware trilinear interpolation, and (2) a voxel-driven back-projection that can avoid redundant computation […]

CUDA

Nov, 17

Eye-Full Tower: A GPU-based variable multibaseline omnidirectional stereovision system with automatic baseline selection for outdoor mobile robot navigation

In recent years, the number of researchers and projects developing omnidirectional vision systems for various applications has gradually increased. The primary factors contributing to this trend are the availability of inexpensive, high-resolution vision sensors, robust and fast computers and […]

CUDA

* * *

Recent source codes

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

KernelGYM & Dr. Kernel: A distributed GPU environment and a collection of RL training methods to support RL for Kernel Generations

Vortex-Optimized Light-weight Toolchain (VOLT)

SciDef: Automated Definition Extraction from Scientific Literature

Theorizer: from the paper Generating Literature-Driven Scientific Discoveries at Scale

bioagent-bench: Benchmark for evaluating LLM agents in bioinformatics

Benchmark suite for LLM inference on NVIDIA consumer GPUs

Nsight Python: a Python kernel profiling interface based on NVIDIA Nsight Tools

Awesome LLM-Driven Kernel Generation

Most viewed papers (last 30 days)