high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

VolQD: Direct Volume Rendering of Multi-million Atom Quantum Dot Simulations

Volume and Isosurface Rendering with GPU-Accelerated Cell Projection

Volume exploration using ellipsoidal Gaussian transfer functions

Volume Raycasting Performance Using DirectCompute

Volume rendering visualization of 3D spherical mantle convection with an unstructured mesh

Volume Visualization: A Technical Overview with a Focus on Medical Applications

Volume-preserving FFD for programmable graphics hardware

Volumetric Ambient Occlusion

Volumetric Ambient Occlusion for Real-Time Rendering and Games

Volumetric Rendering Techniques for Scientific Visualization

Voreen: A Rapid-Prototyping Environment for Ray-Casting-Based Volume Visualizations

Voronoi Toolpaths for PCB Mechanical Etch: Simple and Intuitive Algorithms with the 3D GPU

Vortex Methods for Fluid Simulation in Computer Graphics

Vortex methods for incompressible flow simulations on the GPU

Vortex particle method and parallel computing

Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics

Voxelized Minkowski sum computation on the GPU with robust culling

VoxelPipe: a programmable pipeline for 3D voxelization

Voxels on fire

VSIPL++ Acceleration Using Commodity Graphics Processors

vSMC: Parallel Sequential Monte Carlo in C++

Vulkan 1.1.97 – A Specification (with all registered Vulkan extensions)

Vulnerability Analysis and Attacks on Intel Xeon Phi Coprocessor

Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

Wait-free programming for general purpose computations on graphics processors

waLBerla: A block-structured high-performance framework for multiphysics simulations

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

Wanted: Floating-Point Add Round-off Error instruction

Warp Size Impact in GPUs: Large or Small?

Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation

Warp-Level Parallelism: Enabling Multiple Replications In Parallel on GPU

WarpCore: A Library for fast Hash Tables on GPUs

WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU

Warped Register File: A Power Efficient Register File for GPGPUs

Warps and Atomics: Beyond Barrier Synchronization in the Verification of GPU Kernels

Wasserstein-Fisher-Rao Document Distance

Waste Not, Want Not! Managing relational data in asymmetric memories

Waste Not… Efficient Co-Processing of Relational Data

Water simulation based on HLSL

Water simulation for cell based sandbox games

Water Surface Animation using Damped Wave Equation and CUDA Acceleration

wav2letter++: The Fastest Open-source Speech Recognition System

Wave field synthesis for 3D audio: architectural prospectives

Wavefront raycasting using larger filter kernels for on-the-fly GPU gradient reconstruction

Wavelet Encoding and Multi-GPU Programming

Wavelet Model-based Stereo for Fast, Robust Face Reconstruction

WAYPOINT: scaling coherence to thousand-core architectures

WCCV: Improving the Vectorization of IF-statements with Warp-Coherent Conditions

Weak execution ordering – exploiting iterative methods on many-core GPUs

WebCL for Hardware-Accelerated Web Applications

Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems

Weighted Residuals for Very Deep Networks

WgPy: GPU-accelerated NumPy-like array library for web browsers

What you see is what you snap: snapping to geometry deformed on the GPU

When HLS Meets FPGA HBM: Benchmarking and Bandwidth Optimization

When Machine Learning Meets Quantum Computers: A Case Study

Where is the data? Why you cannot debate CPU vs. GPU performance without the answer

Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU

Whole-function vectorization

Why does PHM matter? – Nvidia’s GPU problems reviewed

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?

Why it is time for a HyPE: A Hybrid Query Processing Engine for Efficient GPU Coprocessing in DBMS

Wideband Channelization for Software-Defined Radio via Mobile Graphics Processors

WiLLM: An Open Wireless LLM Communication System

Wilson and Domainwall Kernels on Oakforest-PACS

Winograd Algorithm for AdderNet

Wire Speed Name Lookup: A GPU-based Approach

Wireless Interference Identification with Convolutional Neural Networks

word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement

Work Efficient Parallel Algorithms for Large Graph Exploration

Work in Progress: Vortex Detection and Visualization for Design of Micro Air Vehicles and Turbomachinery

Work Stealing Inside GPUs

Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths

Working With Incremental Spatial Data During Parallel (GPU) Computation

Workload Analysis and Efficient OpenCL-based Implementation of SIFT Algorithm on a Smartphone

Workload and network-optimized computing systems

Workload Aware Algorithms for Heterogeneous Platforms

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation

Workload Characterization of 3D Games

Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB

Workload Scheduling on Heterogeneous Devices

Workload-aware Automatic Parallelization for Multi-GPU DNN Training

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

WPA/WPA2 Password Security Testing using Graphics Processing Units

Wrinkling Coarse Meshes on the GPU

Writing a modular GPGPU program in Java

Writing a performance-portable matrix multiplication

Writing self-adaptive codes for heterogeneous systems

X-Device Query Processing by Bitwise Distribution

X-ray CT on the GPU

X-toon: an extended toon shader

XBOOLE-CUDA: Fast Boolean Operations on the GPU

Xbox 360 System Architecture

Xbox360 Front Side Bus – A 21.6 GB/s End-to-End Interface Design

Xeon Phi: A comparison between the newly introduced MIC architecture and a standard CPU through three types of problems

XeonPhi Meets Astrophysical Fluid Dynamics

XGBoost: Scalable GPU Accelerated Learning

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures

XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines

XML3D: interactive 3D graphics for the web

Brief statistics for this page

Titles: 100

Open PDFs: 94

Packages: 23
