high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Detection of a faint fast-moving near-Earth asteroid using synthetic tracking technique

Detection of collisions and self-collisions using image-space techniques

Detection of retransmissions in 10G Ethernet using GPUs

Determinant Computation on the GPU using the Condensation Method

Determining the difficulty of accelerating problems on a GPU

Deterministic Parallelism

Deterministic Sample Sort For GPUs

Developing a compiler for the XeonPhi

Developing a CUDA solver for large sparse matrices for MARIN

Developing a High Performance GPGPU Compiler Using Cetus

Developing a High Performance Software Library with MPI and CUDA for Matrix Computations

Developing a massive real-time crowd simulation framework on the GPU

Developing a New Storage Format and a Warp-Based SpMV Kernel for Configuration Interaction Sparse Matrices on the GPU

Developing acquisition systems based on FPGA with OpenCL

Developing an OO Model for Generalized Matrix Multiplication: Preliminary Considerations

Developing and Deploying Advanced Algorithms to Novel Supercomputing Hardware

Developing and Evaluating clOpenCL Applications for Heterogeneous Clusters

Developing Extensible Lattice-Boltzmann Simulators for General-Purpose Graphics-Processing Units

Developing Performance-Portable Molecular Dynamics Kernels in OpenCL

Development and evaluation of a GPU-optimized N-body term for the simulation of biomolecules

Development and evaluation of scalable video motion estimators on GPU

Development methodologies for GPU and cluster of GPUs

Development of a Chemically Reacting Flow Solver on the Graphic Processing Units

Development of a CUDA Implementation of the 3D FDTD Method

Development of a Flow Solver with Complex Kinetics on the Graphic Processing Units

Development of a GPU based two-way time transfer modem

Development of a GPU-accelerated MIKE 21 Solver for Water Wave Dynamics

Development of a GPU-based High-Performance Radiative Transfer Model for the Infrared Atmospheric Sounding Interferometer (IASI)

Development of a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport

Development of a GPU-based multithreaded software application to calculate digitally reconstructed radiographs for radiotherapy

Development of a new framework for high performance volunteer computing

Development of a Restricted Additive Schwarz Preconditioner for Sparse Linear Systems on NVIDIA GPU

Development of a volume rendering system using 3D texture compression techniques on general-purpose personal computers

Development of an Algorithm for Extracting Parallelism and Pipeline Structure from Stream-based Processing flow with Spanning Tree

Development of an explicit pressure-based unstructured solver for three-dimensional incompressible flows with graphics hardware acceleration

Development of an unified FDTD-FEM library for electromagnetic analysis with CPU and GPU computing

Development of Bayesian analysis program for extraction of polarisation observables at CLAS

Development of Generic Scheduling Concepts for OpenGL ES 2.0

Development of High-Performance Software Components for Emerging Architectures

Development of JavaScript-based deep learning platform and application to distributed training

Development of Krylov and AMG linear solvers for large-scale sparse matrices on GPUs

Development of methods for the processing of mining images using genetic algorithms

Development of nonlinear filter bank system for real-time beautification of facial video using GPGPU

Development of Parallel Architectures for Radar/Video Signal Processing Applications

Development of Parallel Computation Tools

Development of Virtual Machine Tool for Simulation and Evaluation

Developmental Directions in Parallel Accelerators

Device Placement Optimization with Reinforcement Learning

Device specialization in heterogeneous multi-GPU environments

Devito: automated fast finite difference computation

DFG Implementation on Multi GPU Cluster with Computation-Communication Overlap

DGEMM on Integer Matrix Multiplication Unit

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Diagnosing Performance Bottlenecks in HPC Applications

Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

Diagrammatic Determinantal Quantum Monte Carlo Calculations on GPUs

DIANNE: Distributed Artificial Neural Networks for the Internet of Things

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Diderot: A Parallel DSL for Image Analysis and Visualization

DiffBench Meets DiffAgent: End-to-End LLM-Driven Diffusion Acceleration Code Generation

Different Optimization Strategies and Performance Evaluation of Reduction on Multicore CUDA Architecture

Differential evolution algorithm on the GPU with C-CUDA

Differential Evolution with parallelised objective functions using CUDA

Diffusion Curves: A Vector Representation for Smooth-Shaded Images

Digital beamforming using a GPU

Digital Marbling: a GPU Approach with Precomputed Velocity Field

Digital Signal Processing using Stream High Performance Computing: A 512-input Broadband Correlator for Radio Astronomy

Digitize Your Body and Action in 3-D at Over 10 FPS: Real Time Dense Voxel Reconstruction and Marker-less Motion Tracking via GPU Acceleration

Diplomat: Mapping of multi-kernel applications using a static dataflow abstraction

Direct Communication Methods for Distributed GPUs

Direct deconvolution of radio synthesis images using L1 minimisation

Direct evaluation of NURBS curves and surfaces on the GPU

Direct GPU Compilation and Execution for Host Applications with OpenMP Parallelism

Direct GPU/FPGA Communication Via PCI Express

Direct N-body code on low-power embedded ARM GPUs

Direct N-body Kernels for Multicore Platforms

Direct N-body simulations of globular clusters: (I) Palomar 14

Direct Numeric Simulation of Sheared Convective Boundary Layer Entrainment with GPUs

Direct Numerical Simulation and Large Eddy Simulation on a Turbulent Wall-Bounded Flow Using Lattice Boltzmann Method and Multiple GPUs

Direct numerical simulation of sub-grid structures in gas-solid flow — GPU implementation of macro-scale pseudo-particle modeling

Direct Numerical Simulation of Turbulence on Heterogenous Computer Systems: Architectures, Algorithms, and Applications

Direct Numerical Simulation of Turbulent Flows with Parallel Algorithms for Various Computing Architectures

Direct Point Rendering on GPU

Direct Self-Consistent Field Computations on GPU Clusters

Direct solution of the Boltzmann equation for a binary mixture on GPUs

Direct Visualization of Particle-Partition of Unity Data

Direct Volume Editing

Direct-to-indirect transfer for cinematic relighting

directCell: hybrid systems with tightly coupled accelerators

Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR

Directive-based Approach to Heterogeneous Computing

Directive-Based Compilers for GPUs

Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing

Directive-Based Partitioning and Pipelining for Graphical Processing Units

Directive-Based, High-Level Programming and Optimizations for High-Performance Computing with FPGAs

Directives Based Programming of GPU Accelerated Systems

DISC: A Dynamic Shape Compiler for Machine Learning Workloads

Disc: Approximative Nearest Neighbor Search using Ellipsoids for Photon Mapping on GPUs

Discontinuous Galerkin Methods on Graphics Processing Units for Nonlinear Hyperbolic Conservation Laws

Brief statistics for this page

Titles: 100

Download open PDFs: 93

Package packages: 15

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)