high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

CHO: Towards a Benchmark Suite for OpenCL FPGA Accelerators

Cholla : A New Massively-Parallel Hydrodynamics Code For Astrophysical Simulation

Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency

CHPS: An Environment for Collaborative Execution on Heterogeneous Desktop Systems

Chrono: a parallel multi-physics library for rigid-body, flexible-body, and fluid dynamics

Chunkflow: Distributed Hybrid Cloud Processing of Large 3D Images by Convolutional Nets

CI/CD Efforts for Validation, Verification and Benchmarking OpenMP Implementations

Cinematic Particle Systems with OpenCL

Circular Hough Transform in OpenCL

CitiusSynapse: A Deep Learning Framework for Embedded Systems

CL-VIS: Visualization Platform for Understanding and Checking the OpenCL Programs

CL2QCD – Lattice QCD based on OpenCL

CL4SE: A Context Learning Benchmark For Software Engineering Tasks

Clacc: Translating OpenACC to OpenMP in Clang

Classical Mechanical Hard-Core Particles Simulated in a Rigid Enclosure using Multi-GPU Systems

Classical Simulation of Quantum Adiabatic Algorithms using Mathematica on GPUs

Classification-based Financial Markets Prediction using Deep Neural Networks

Classification of Higgs Boson Tau-Tau decays using GPU accelerated Neural Networks

Classification Performance of Convolutional Neural Networks

Classify QCD phase transition with deep learning

ClawHMMER: A Streaming HMMer-Search Implementation

CLBlast: A Tuned OpenCL BLAS Library

ClearPath: highly parallel collision avoidance for multi-agent simulation

ClearView: An Interactive Context Preserving Hotspot Visualization Technique

CLgrep: A Parallel String Matching Tool

Climbing Mont Blanc – A Training Site for Energy Efficient Programming on Heterogeneous Multicore Processors

Clinically applicable Monte Carlo-based biological dose optimization for the treatment of head and neck cancers with spot-scanning proton therapy

Clipmapping on the GPU

clMAGMA: High Performance Dense Linear Algebra with OpenCL

clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization

Clock Math – A System for Solving SLEs Exactly

CLOP: A Multi-stage Compiler to Seamlessly Embed Heterogeneous Code

clOpenCL – Supporting Distributed Heterogeneous Computing in HPC Clusters

CLort: High Throughput and Low Energy Network Intrusion Detection on IoT Devices with Embedded GPUs

Closing the Ninja Performance Gap through Traditional Programming and Compiler Technology

Cloth Simulation on the GPU

Cloth Simulation Using AABB Hierarchies and GPU Parallelism

CloudCL: Single-Paradigm Distributed Heterogeneous Computing for Cloud Infrastructures

Cloudlet-screen computing: A multi-core-based, cloud-computing-oriented, traditional-computing-compatible parallel computing Paradigm for the masses

clpeak – peak performance of your opencl device

clRNG: A Random Number API with Multiple Streams for OpenCL

clSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library

clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs

CLTestCheck: Measuring Test Effectiveness for GPU Kernels

cltorch: a Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL

CLTune: A Generic Auto-Tuner for OpenCL Kernels

CLUEstering: a high-performance density-based clustering library for scientific computing

ClusCo: clustering and comparison of protein models

Cluster and Fast-Update Simulations of Regular and Rewired Lattice Ising Models Using CUDA and Graphical Processing Units

Cluster versus GPU implementation of an Orthogonal Target Detection Algorithm for Remotely Sensed Hyperspectral Images

Cluster-Level Tuning of a Shallow Water Equation Solver on the Intel MIC Architecture

Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters

Clustering Based Search Algorithm For Motion Estimation

Clustering billions of data points using GPUs

Clustering coefficient queries on massive dynamic social networks

Clustering on GPU – A Brief Survey

Clustering Throughput Optimization on the GPU

ClusterWatch: Flexible, Lightweight Monitoring for High-end GPGPU Clusters

CMA-ES for Hyperparameter Optimization of Deep Neural Networks

CMCpy: Genetic Code-Message Coevolution Models in Python

CMLCompiler: A Unified Compiler for Classical Machine Learning

CnC-CUDA: declarative programming for GPUs

CNN2Gate: An Implementation of Convolutional Neural Networks Inference on FPGAs with Automated Design Space Exploration

CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis

Co-design of a particle-in-cell plasma simulation code for Intel Xeon Phi: a first look at Knights Landing

Co-processing SPMD Computation on GPUs and CPUs on Shared Memory System

Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU

Co-tuning of Software Specializers and Hardware Accelerators within a CNN Application

Coalition Structure Generation with the Graphic Processor Unit

Coalition Structure Generation with the Graphics Processing Unit

Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs

Coarse grain parallelization of evolutionary algorithms on GPGPU cards with EASEA

Coating Process Monitoring Using Computer Vision

CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning

Code Generation Compiler for the OpenMP 4.0 Accelerator Model onto OMPSS

Code Generation for a Variety of Accelerators for a Graph DSL

Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU

Code Generation for Embedded Heterogeneous Architectures on Android

Code Generation for High-Level Synthesis of Multiresolution Applications on FPGAs

Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views

Code Optimization and Performance Analysis of Oceanographic Software Package NEMO for GPGPU Systems

Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi

Code optimization based on source to source transformations using profile guided metrics

Code Optimization on GPUs

Code Optimization on Kepler GPUs and Xeon Phi

Code Optimization Techniques for Graphics Processing Units

Code Refinement of Stencil Codes

CodegenBench: Can LLMs Write Efficient Code Across Architectures?

CodePy

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Coding Ants: Using Ant Colony Optimization to Accelerate CT Reconstruction

CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices

Cofactorization on Graphics Processing Units

COFFEE: an Optimizing Compiler for Finite Element Local Assembly

Cognitive radio network for the smart grid: Experimental system architecture, control algorithms, security, and microgrid testbed

Coherence aware GPU-based ray casting for virtual colonoscopy

Coherent Photon Mapping on the Intel MIC Architecture

Coherent Spatiotemporal Filtering, Upsampling and Rendering of RGBZ Videos

Coherent transport by adiabatic passage on atom chips

Collaborative design and optimization using Collective Knowledge

Brief statistics for this page

Titles: 100

Download open PDFs: 93

Packages: 36
