high performance computing on graphics processing units: hgpu.org

Posts

Aug, 10

A Flexible Multi-Volume Shader Framework for Arbitrarily Intersecting Multi-Resolution Datasets

We present a powerful framework for 3D-texture-based rendering of multiple arbitrarily intersecting volumetric datasets. Each volume is represented by a multi-resolution octree-based structure and we use out-of-core techniques to support extremely large volumes. Users define a set of convex polyhedral volume lenses, which may be associated with one or more volumetric datasets. The volumes or […]

OpenGL

Aug, 10

Real-time continuum grass

Simulating a grass field in real time has many applications, such as in virtual reality and games. Modeling accurate grass-grass, grass-object and grass-wind interactions incurs a high computational cost. In this paper, we present a method to simulate a grass field in real time by treating the field as a two-dimensional grid-based continuum and shifting the complex interactions […]

Aug, 10

Performance evaluation and optimization of random memory access on multicores with high productivity

The slow progress in memory access latencies in comparison to CPU speeds has resulted in memory accesses dominating code performance. While architectural enhancements have benefited applications with data locality and sequential access, random memory access still remains a cause for concern. Several benchmarks have been proposed to evaluate the random memory access performance on multicore […]

CUDA

Aug, 10

A parallel mapping of optical flow to Compute Unified Device Architecture for motion-based image segmentation

A correlation-based optical flow algorithm using compute unified device architecture (CUDA) technology to achieve fast motion-based image segmentation is described. Using CUDA, a 240 processor GPU implementation of an optimized correlation-based optical flow algorithm allows segmentation to be achieved at high frame rates on high-resolution video sequences. Details of the mapping of the optical flow […]

CUDA
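As a rough illustration of what a correlation-based optical flow step computes, the sketch below estimates the displacement of a single image block by maximizing normalized cross-correlation over a small search window. The function name `block_flow`, the block size, and the search radius are assumptions chosen for illustration; the abstract is truncated before it describes the authors' actual mapping, so this should not be read as their method. On a GPU, each block would typically be handled by its own thread or thread group.

```python
def block_flow(prev, curr, y, x, block=3, radius=2):
    """Return the (dy, dx) shift that best matches prev's block at (y, x) in curr.

    Hypothetical helper: images are lists of rows of numbers.
    """
    def patch(img, py, px):
        # Flatten a block x block window whose top-left corner is (py, px).
        return [img[py + j][px + i] for j in range(block) for i in range(block)]

    def ncc(a, b):
        # Zero-mean normalized cross-correlation of two flattened patches.
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        den = (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5
        return num / den if den else 0.0

    ref = patch(prev, y, x)
    h, w = len(curr), len(curr[0])
    best, best_score = (0, 0), float("-inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ny, nx = y + dy, x + dx
            # Skip candidate windows that would fall outside the image.
            if 0 <= ny <= h - block and 0 <= nx <= w - block:
                score = ncc(ref, patch(curr, ny, nx))
                if score > best_score:
                    best, best_score = (dy, dx), score
    return best


def make_frame(top, left, size=8):
    # Synthetic frame: zeros everywhere except a 3x3 pattern with values 1..9.
    img = [[0] * size for _ in range(size)]
    for j in range(3):
        for i in range(3):
            img[top + j][left + i] = 1 + 3 * j + i
    return img


prev_frame = make_frame(2, 2)
curr_frame = make_frame(3, 4)  # pattern moved down 1 row, right 2 columns
flow = block_flow(prev_frame, curr_frame, 2, 2)
```

Segmentation by motion would then group blocks with similar displacement vectors; that step is omitted here.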

Aug, 10

Approaches for parallelizing reductions on modern GPUs

GPU hardware and software have been evolving rapidly. CUDA versions 1.1 and higher started supporting atomic operations on device memory, and CUDA versions 1.2 and higher started supporting atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Prior to the availability of support for locking, these applications could only […]

CUDA
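The two styles of reduction that the abstract contrasts can be sketched on the CPU as follows. The abstract is truncated before it names the pre-locking approach, so the pairwise tree shown here is an assumption: it is a commonly used lock-free pattern, not necessarily the one the paper evaluates. The names `tree_reduce` and `locked_reduce` are hypothetical, and a Python lock only stands in for GPU atomic or locking primitives.

```python
import operator
import threading


def tree_reduce(values, op):
    """Pairwise reduction in about log2(n) passes, with no locks or atomics."""
    data = list(values)
    n = len(data)
    if n == 0:
        raise ValueError("cannot reduce an empty sequence")
    stride = 1
    while stride < n:
        # Each inner iteration stands for one independent GPU thread;
        # a barrier would separate successive passes on real hardware.
        for i in range(0, n - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]


def locked_reduce(values, op, initial, workers=4):
    """Every worker folds its elements into one shared accumulator under a lock."""
    total = initial
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        for v in chunk:
            with lock:  # stand-in for an atomic update on the GPU
                total = op(total, v)

    items = list(values)
    threads = [threading.Thread(target=work, args=(items[i::workers],))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


tree_sum = tree_reduce(range(1, 101), operator.add)
locked_sum = locked_reduce(range(1, 101), operator.add, 0)
```

Both versions assume `op` is associative and commutative, since neither preserves the original left-to-right evaluation order.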

Aug, 9

G-NetMon: A GPU-accelerated Network Performance Monitoring System

At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. In this work, we explore new opportunities in network traffic monitoring and analysis with GPUs. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab […]

CUDA

Aug, 9

G-NetMon: A GPU-accelerated Network Performance Monitoring System for Large Scale Scientific Collaborations

Network traffic is difficult to monitor and analyze, especially in high-bandwidth networks. Performance analysis, in particular, presents extreme complexity and scalability challenges. GPU (Graphics Processing Unit) technology has been utilized recently to accelerate general purpose scientific and engineering computing. GPUs offer extreme thread-level parallelism with hundreds of simple cores. Their data-parallel execution model can rapidly […]

CUDA

Aug, 9

Real-Time All-in-Focus Video-Based Rendering Using A Network Camera Array

We present a real-time video-based rendering system using a network camera array. Our system consists of 64 commodity network cameras that are connected to a single PC through Gigabit Ethernet. To render a high-quality novel view, we estimate a view-dependent per-pixel depth map in real time by using a layered representation. The rendering algorithm is […]

OpenGL

Aug, 9

Graphics Processing Units for Handhelds

During the past few years, mobile phones and other handheld devices have gone from only handling dull text-based menu systems to, on an increasing number of models, being able to render high-quality three-dimensional graphics at high frame rates. This paper is a survey of the special considerations that must be taken when designing graphics processing […]

Aug, 9

Geospatial visualization using hardware accelerated real-time volume rendering

We present a visualization framework using direct volume rendering techniques that achieves real-time performance and high image quality. The visualization program runs on a desktop as well as in an immersive environment. The application is named HurricaneVis, and it uses OpenGL, GLSL and VTK. For immersive visualization VRJuggler is added. To achieve real-time rendering rates […]

OpenGL

Aug, 9

Performance Evaluation of Feature Extraction Algorithm on GPGPU

NVIDIA’s GPGPU-based Compute Unified Device Architecture (CUDA) is a software platform for massively parallel high-performance computing on GPUs. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: many industry […]

CUDA

Aug, 9

Cache Miss Analysis for GPU Programs Based on Stack Distance Profile

Using the graphics processing unit (GPU) to accelerate general-purpose computation has attracted much attention from both academia and industry due to the GPU’s powerful computing capacity. Thus, optimization of GPU programs has become a popular research direction. To support general-purpose computing more efficiently, the GPU has integrated the general data […]
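For readers unfamiliar with the term in the title, the sketch below computes a textbook LRU stack distance profile for a short address trace and uses it to predict misses in a fully associative LRU cache. This is the general definition only; the abstract is truncated before it describes the paper's GPU-specific analysis, and the function names and the sample trace are assumptions made for illustration.

```python
from collections import Counter


def stack_distances(trace):
    """Return the LRU stack distance of each access; None marks a first (cold) access."""
    stack = []  # most recently used address sits at the end
    out = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Number of distinct addresses touched since the previous use of addr.
            out.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            out.append(None)
        stack.append(addr)
    return out


def lru_misses(distances, capacity):
    """Count misses for a fully associative LRU cache holding `capacity` lines."""
    return sum(1 for d in distances if d is None or d >= capacity)


trace = ["a", "b", "c", "a", "b", "d", "a"]  # hypothetical address trace
distances = stack_distances(trace)
profile = Counter(distances)  # histogram of distances: the stack distance profile
misses_3 = lru_misses(distances, 3)
misses_2 = lru_misses(distances, 2)
```

The appeal of the profile is that one pass over the trace yields the miss count for every cache capacity at once: an access hits exactly when its distance is smaller than the capacity.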
