high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation

One weird trick for parallelizing convolutional neural networks

One-shot tuner for deep learning compilers

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs

Online Adaptive Code Generation and Tuning

Online Dynamic Graph Drawing

Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach

Online Performance Projection for Clusters with Heterogeneous GPUs

Online rapid prototyping of 3D objects using GPU-based 3D cloud computing: Application to 3D face modelling

Online video synthesis for removing occluding objects using multiple uncalibrated cameras via plane sweep algorithm

OP2: An Active Library Framework for Solving Unstructured Mesh-based Applications on Multi-Core and Many-Core Architectures

Opal: A Modular Framework for Optimizing Performance using Analytics and LLMs

Open Source Face Recognition API

Open SYCL on heterogeneous GPU systems: A case of study

Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark

OpenABLext: An automatic code generation framework for agent-based simulations on CPU-GPU-FPGA heterogeneous platforms

OpenACC – First Experiences with Real-World Applications

OpenACC cache Directive: Opportunities and Optimizations

OpenACC Implementations Comparison

OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs

OpenACC-based GPU Acceleration of a 3-D Unstructured Discontinuous Galerkin Method

OpenACC-based Snow Simulation

OpenCL – An effective programming model for data parallel computations at the Cell Broadband Engine

OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture

OpenCL 2.0 for FPGAs using OCLAcc

OpenCL 2.2 API Specification

OpenCL Accelerated Multi-GPU Cone-Beam Reconstruction

OpenCL Acceleration for TensorFlow

OpenCL Actors – Adding Data Parallelism to Actor-based Programming with CAF

OpenCL and parallel primitives for digital TV applications

OpenCL and the 13 Dwarfs: A Work in Progress

OpenCL API Extensions to achieve Multi-level Parallelism for Efficient Implementation of Strassen’s Matrix Multiplication on GPUs

OpenCL Based Digital Image Projection Acceleration

OpenCL Based High-Quality HEVC Motion Estimation on GPU

OpenCL based machine learning labeling of biomedical datasets

OpenCL C++

OpenCL Cryptographic Library

OpenCL embedded profile prototype in mobile device

OpenCL Evaluation for Numerical Linear Algebra Library Development

OpenCL Fast Fourier Transform

OpenCL Floating Point Software on Heterogeneous Architectures – Portable or Not?

OpenCL for Database Query Processing

OpenCL for FPGAs: Prototyping a Compiler

OpenCL for programming shared memory multicore CPUs

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration

OpenCL framework for a CPU, GPU, and FPGA Platform

OpenCL Implementation of a Color Based Object Tracking

OpenCL Implementation of a Parallel Universal Kriging Algorithm for Massive Spatial Data Interpolation on Heterogeneous Systems

OpenCL Implementation of LiDAR Data Processing

OpenCL Implementation of Montgomery Multiplication on FPGA

OpenCL Implementation of Motion Estimation for Cloud Video Processing

OpenCL in Action: How to Accelerate Graphics and Computations

OpenCL JIT Compilation for Dynamic Programming Languages

OpenCL Library for Parallel Graph Search Algorithms

OpenCL Numerical Simulations of Two-Fluid Compressible Flows With a 2D Random Choice Method

OpenCL parallel Processing using General Purpose Graphical Processing units – TiViPE software development

OpenCL Parallel Programming Development Cookbook

OpenCL Performance Evaluation on Modern Multi Core CPUs

OpenCL Performance on the Intel Heterogeneous Architecture Research Platform

OpenCL Performance Prediction using Architecture-Independent Features

OpenCL Programming by Example

OpenCL Programming Guide

OpenCL Programming Guide for Mac

OpenCL programming using Python syntax

OpenCL simulations of two-fluid compressible flows with a random choice method

OpenCL Sparse Linear Solver for Circuit Simulation

OpenCL Task Partitioning in the Presence of GPU Contention

OpenCL Vector Swizzling Optimization under Global Value Numbering

OpenCL vs: Accelerated Finite-Difference Digital Synthesis

OpenCL vs. OpenMP: A Programmability Debate

OpenCL-Accelerated Computation of a 3D SPECT Projection Operator for the Content Adaptive Mesh Model

OpenCL-accelerated object classification in video streams using Spatial Pooler of Hierarchical Temporal Memory

OpenCL-accelerated Point Feature Histogram and Its Application in Railway Track Point Cloud Data Processing

OpenCL-Accelerated Simplified General Perturbations 4 Algorithm

OpenCL-based Algorithm for Heat Load Modelling of District Heating System

OpenCL-based design methodology for application-specific processors

OpenCL-Based Design of an FPGA Accelerator for Phase-Based Correspondence Matching

OpenCL-Based Erasure Coding on Heterogeneous Architectures

OpenCL-Based FPGA Accelerator for 3D FDTD with Periodic and Absorbing Boundary Conditions

OpenCL-Based Implementation of an FPGA Accelerator for Molecular Dynamics Simulation

OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges

OpenCL-based optimizations for acceleration of object tracking on FPGAs and GPUs

OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework

OpenCL-HPX Integration

OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing

OpenCL-Z Android Released on Google Play

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

OpenCL: a viable solution for high-performance medical image reconstruction?

OpenCL: Make Ubiquitous Supercomputing Possible

OpenCL/CUDA algorithms for parallel decoding of any irregular LDPC code using GPU

OpenCL/OpenGL aproach for studying active Brownian motion

OpenCLIPER: an OpenCL-based C++ Framework for Overhead-Reduced Medical Image Processing and Reconstruction on Heterogeneous Devices

OpenCUDA+MPI: A Framework for Heterogeneous GP-GPU Distributed Computing

OpenDNN: An Open-source, cuDNN-like Deep Learning Primitive Library

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

OpenDwarfs: Characterization of Dwarf-Based Benchmarks on Fixed and Reconfigurable Architectures

OpenFace: A general-purpose face recognition library with mobile applications

OpenGL application live migration with GPU acceleration in personal cloud

OpenGL SuperBible: Comprehensive Tutorial and Reference (5th Edition)

Brief statistics for this page

Titles: 100

Download open PDFs: 93

Packages: 22

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)