high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Revisiting Query Performance in GPU Database Systems

Revisiting sorting for GPGPU stream architectures

Revisiting the Case of ARM SoCs in High-Performance Computing Clusters

Revolutionary technologies for acceleration of emerging petascale applications

RGEM: A Responsive GPGPU Execution Model for Runtime Engines

Rgtsvm: Support Vector Machines on a GPU in R

Ringing: Frugal Subdivision of Curves and Surfaces

Rinnegan: Efficient Resource Use in Heterogeneous Architectures

Ripple: Simplified Large-Scale Computation on Heterogeneous Architectures with Polymorphic Data Layout

Rise of the Graphics Processor

Risk Estimation Without Using Stein’s Lemma — Application to Image Denoising

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks

RNA secondary structure prediction using dynamic programming algorithm – A review and proposed work

RNS-Based Elliptic Curve Point Multiplication for Massive Parallel Architectures

RoadRunner: a fast and flexible exoplanet transit model

Roberts edge detection algorithm based on GPU

Robotic approach to multi-beam optical tweezers with Computer Generated Hologram

Robust Adaptive 3-D Segmentation of Vessel Laminae From Fluorescence Confocal Microscope Images and Parallel GPU Implementation

Robust Computational Tools for Multiple Testing With Genetic Association Studies

Robust Edge Detection and GPU-Based Smoothing for Extracting Surface Primitives from Range Images

Robust foreground segmentation for GPU architecture in an immersive 3D videoconferencing system

Robust GPGPU plugin development for RapidMiner

Robust GPU-assisted camera tracking using free-form surface models

Robust LLM Training Infrastructure at ByteDance

Robust Low Complexity Feature Tracking using CUDA

Robust mesh reconstruction from unoriented noisy points

Robust modified L2 local optical flow estimation and feature tracking

Robust non-local denoising of colored depth data

Robust real time face recognition and tracking on GPU using fusion of RGB and depth image

Robust Real-Time Multiprocessor Interrupt Handling Motivated by GPUs

Rodinia: A benchmark suite for heterogeneous computing

Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Room acoustics modelling using GPU-accelerated finite difference and finite volume methods on a face-centered cubic grid

Rootbeer: Seamlessly using GPUs from Java

Rotationally invariant sparse patch matching on GPU and FPGA

Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 1. Generalized Born

RSVDPACK: Subroutines for computing partial singular value decompositions via randomized sampling on single core, multi core, and GPU architectures

RTCUDB: Building Databases with RT Processors

RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing

RTSL: a Ray Tracing Shading Language

RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location

RubiCL, a Library Providing Automatic Parallelisation on CPU and GPU devices

Rubus: A compiler for seamless and extensible parallelism

RUMD: A general purpose molecular dynamics package optimized to utilize GPU hardware down to a few thousand particles

Run-time Image and Video Resizing Using CUDA-enabled GPUs

Run-time Reconfigurable Multiprocessors

Run-time support for multi-level disjoint memory address spaces

Run, Stencil, Run! – A Comparison of Modern Parallel Programming Paradigms

Running Financial Risk Management Applications on FPGA in the Amazon Cloud

Running the NIM Next-Generation Weather Model on GPUs

Running unstructured grid-based CFD solvers on modern graphics hardware

Running unstructured grid-based CFD solvers on modern graphics hardware

Runtime Code Generation and Data Management for Heterogeneous Computing in Java

Runtime Comparison of CPU and GPU Using Portable Programming Models

Runtime Compilation of Array-Oriented Python Programs

Runtime Configurable Deep Neural Networks for Energy-Accuracy Trade-off

Runtime Performances Benchmark for Knowledge Graph Embedding Methods

Runtime Specialization for Heterogeneous CPU-GPU Platforms

Runtime Support for Adaptive Power Capping on Heterogeneous SoCs

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

Runtime Support toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems

Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures

Runtime Visualization of Application Progress and Monitoring of a GPU-enabled Parallel Environment

S-buffer: Sparsity-aware Multi-fragment Rendering

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs

Saddle Vertex Graph (SVG): A Novel Solution to the Discrete Geodesic Problem

Safe and Practical GPU Acceleration in TrustZone

Safe Asynchronous Multicore Memory Operations

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc

SafeGPU: Contract- and Library-Based GPGPU for Object-Oriented Languages

SAGA: SystemC Acceleration on GPU Architectures

SAGE: Self-Tuning Approximation for Graphics Engines

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems

Sailfish: a flexible multi-GPU implementation of the lattice Boltzmann method

SaLoBa: Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

Sample distribution shadow maps

SAPPORO: A way to turn your graphics cards into a GRAPE-6

Sapporo2: A versatile direct N-body library

SAR focusing of P-band ice sounding data using back-projection

SAR raw signal simulation based on GPU parallel computation

Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

SBArt4 – Breeding abstract animations in realtime

SBLOCK: A Framework for Efficient Stencil-Based PDE Solvers on Multi-core Platforms

SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing

Scalability Analysis of Parallel Algorithms on GPU Clusters

Scalability Analysis of Synchronous Data-Parallel Artificial Neural Network (ANN) Learners

Scalability and Optimization Strategies for GPU Enhanced Neural Networks (GeNN)

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs

Scalability of Higher-Order Discontinuous Galerkin FEM Computations for Solving Electromagnetic Wave Propagation Problems on GPU Clusters

Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism

Scalability of Self-organizing Maps on a GPU cluster using OpenCL and CUDA

Scalability Study of Deep Learning Algorithms in High Performance Computer Infrastructures

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

Scalable and deterministic timing-driven parallel placement for FPGAs

Scalable and High Performance Betweenness Centrality on the GPU

Scalable and highly parallel implementation of Smith-Waterman on graphics processing unit using CUDA

Scalable and Interactive Segmentation and Visualization of Neural Processes in EM Datasets

Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms

Brief statistics for this page

Titles: 100

Doubles: 1

Download open PDFs: 91

Package packages: 25

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs



