high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

A Research of MapReduce with GPU Acceleration

A Resource Selection System for Cycle Stealing in GPU Grids

A Resource-Efficient Computing Paradigm for Computational Protein Modeling Applications

A Restructuring Algorithm for CUDA

A Reverse-Projecting Pixel-Level Painting Algorithm

A Review of CUDA, MapReduce, and Pthreads Parallel Computing Models

A Review of the Parallelization Strategies for Iterative Algorithms

A Review on Parallelization of Node based Game Tree Search Algorithms on GPU

A Rigid Body Physics Engine for Interactive Applications

A Road Marking Extraction Method Using GPGPU

A Run-Time Adaptive FPGA Architecture for Monte Carlo Simulations

A Runtime Controller for OpenCL Applications on Heterogeneous System Architectures

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

A Scala Prototype to Generate Multigrid Solver Implementations for Different Problems and Target Multi-Core Platforms

A Scalable and Reconfigurable Shared-Memory Graphics Cluster Architecture

A Scalable Approach to Solving Dense Linear Algebra Problems on Hybrid CPU-GPU Systems

A Scalable End-to-End Optimized Real-Time Image-Based Rendering Framework on Graphics Hardware

A Scalable Framework for Heterogeneous GPU-Based Clusters

A Scalable Framework for Monte Carlo Simulation Using FPGA-based Hardware Accelerators with Application to SPECT Imaging

A Scalable GPU-based Approach to Accelerate the Multiple-Choice Knapsack Problem

A scalable GPU-based approach to shading and shadowing for photorealistic real-time augmented reality

A Scalable graph-cut algorithm for N-D grids

A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators

A scalable hybrid algorithm based on domain decomposition and algebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs

A Scalable Hybrid FPGA/GPU FX Correlator

A Scalable Lane Detection Algorithm on COTSs with OpenCL

A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow

A Scalable, Efficient Scheme for Evaluation of Stencil Computations over Unstructured Meshes

A scalable, numerically stable, high-performance tridiagonal solver using GPUs

A scheduling and runtime framework for a cluster of heterogeneous machines with multiple accelerators

A Scheduling Framework for a Heterogeneous Parallel Architecture

A Screen Space Quality Method for Data Abstraction

A scripting language for Digital Content Creation applications

A second generation of DEFG: Declarative Framework for GPUs

A Second-Order Distributed Trotter-Suzuki Solver with a Hybrid Kernel

A Self-Optimizing Framework for Developing Metrology Software on Massive Parallel Processor Architectures

A self-organization based optical flow estimator with GPU implementation

A self-organization based optical flow estimator with GPU implementation (thesis)

A Semi-Automated Tool Flow for Roofline Analysis of OpenCL Kernels on Accelerators

A Shader Library for OpenGL 4 and GLSL 4.3 Learning and Development

A shared file system abstraction for heterogeneous architectures

A shared-scene-graph image-warping architecture for VR: Low latency versus image quality

A short guide to CUDA C: For physicists with multi-core graphics cards

A Short Note on Gaussian Process Modeling for Large Datasets using Graphics Processing Units

A SIMD Interpreter for Genetic Programming on GPU Graphics Cards

A SIMD-efficient 14 instruction shader program for high-throughput microtriangle rasterization

A Similarity Measure for GPU Kernel Subgraph Matching

A Similarity-Based Analysis Tool for Scientific Application Porting

A simple and efficient way to compute depth maps for multi-view videos

A simple and flexible volume rendering framework for graphics-hardware-based raycasting

A simple GPU-based approach for 3D Voronoi diagram construction and visualization

A simple method to accelerate fringe analysis algorithms based on graphics processing unit and MATLAB

A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures

A Simulation Framework for Scheduling Performance Evaluation on CPU-GPU Heterogeneous System

A simulation suite for lattice Boltzmann based real time CFD applications exploiting multi-level parallelism on modern multi-and many-core architectures

A Simulation Suite for Lattice-Boltzmann based Real-Time CFD Applications Exploiting Multi-Level Parallelism on modern Multi- and Many-Core Architectures

A Simulator for the Cafadis Real Time 3DTV Camera

A Single (Unified) Shader GPU Microarchitecture for Embedded Systems

A single-pass GPU ray casting framework for interactive out-of-core rendering of massive volumetric datasets

A small-world network model for distributed storage of semantic metadata

A Smart GPU Implementation of an Elliptic Kernel for an Ocean Global Circulation Model

A smooth particle hydrodynamics code to model collisions between solid, self-gravitating objects

A Software Framework for the Detection and Classification of Biological Targets in Bio-Nano Sensing

A Software-Based Self Test of CUDA Fermi GPUs

A Sorting Library for FPGA Implementation in OpenCL Programming

A Sparse Matrix Personality for the Convey HC-1

A sparse octree gravitational N-body code that runs entirely on the GPU processor

A Spiking Neural P system simulator based on CUDA

A Splitting Algorithm for Directional Regularization and Sparsification

A stand-alone Finite Difference Time Domain (FDTD) simulation for Integrated Optoelectronics Laboratory

A state-of-the-art password strength analysis demonstrator

A Static Analysis-based Cross-Architecture Performance Prediction Using Machine Learning

A Static Load Balancing Scheme for Parallel Volume Rendering on Multi-GPU Clusters

A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL

A Stencil DSEL for Single Code Accelerated Computing with SYCL

A stencil-based implementation of Parareal in the C++ domain specific embedded language STELLA

A Step towards Energy Efficient Computing: Redesigning A Hydrodynamic Application on CPU-GPU

A stereoscopic movie player with real-time content adaptation to the display geometry

A Stochastic-based Optimized Schwarz Method for the Gravimetry Equations on GPU Clusters

A straightforward CUDA implementation for interactive ray-tracing

A Straightforward Preprocessing Approach for Accelerating Convex Hull Computations on the GPU

A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs

A Strategy for Automatically Generating High Performance CUDA Code for a GPU Accelerator from a Specialized Fortran Code Expression

A Stream Processor Cluster Architecture Model with the Hybrid Technology of MPI and CUDA

A stream-computing extension to OpenMP

A streaming model for nested data parallelism

A streaming narrow-band algorithm: interactive computation and visualization of level sets

A structural analysis of the A5/1 state transition graph

A structured parallel periodic arnoldi shooting algorithm for RF-PSS analysis based on GPU platforms

A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers

A Study of CUDA Acceleration and Impact of Data Transfer Overhead in Heterogeneous Environment

A Study of Data Partitioning on OpenCL-based FPGAs

A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations

A study of integer sorting on multicores

A Study of Mixed Precision Strategies for GMRES on GPUs

A study of parallel evolution strategy: pattern search on a GPU computing platform

A Study of Parallel Sorting Algorithms Using CUDA and OpenMP

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

A Study of Productivity and Performance of Modern Vector Processors

Brief statistics for this page

Titles: 100

Download open PDFs: 93

Package packages: 12
