high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Bone Structure Analysis with GPGPUs

Bonsai: A GPU Tree-Code

Boosted Algorithms for Visual Object Detection on Graphics Processing Units

Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables

Boosting Java Performance using GPGPUs

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Boosting quantum evolutions using Trotter-Suzuki algorithms on GPUs

Boosting sphere decoding speed through Graphic Processing Units

BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs

BOPM implemented on a GPU-architecture

Bothnia: a dual-personality extension to the Intel integrated graphics driver

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Bouncing Behavior of Microscopic Dust Aggregates

Bound the Peak Performance of SGEMM on GPU with software-controlled fast memory

Bounding the effect of partition camping in GPU kernels

Bounds Checking on GPU

Bounds on the Energy Consumption of Computational Kernels

Brain perfusion imaging: performance and accuracy

BrainCove: A Tool for Voxel-wise fMRI Brain Connectivity Visualization

BrainFrame: A heterogeneous accelerator platform for neuron simulations

BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism

Branch and Data Herding: Reducing Control and Memory Divergence for Error-tolerant GPU Applications

Breadth First Search Vectorization on the Intel Xeon Phi

Breadth-First Search using Dynamic Parallelism on the GPU

Breaking DVB-CSA

Breaking ECC2K-130

Breaking the GPU programming barrier with the auto-parallelising SAC compiler

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Bridging Control-Centric and Data-Centric Optimization

Bridging OpenCL and CUDA: A Comparative Analysis and Translation

Bridging parallel and reconfigurable computing with multilevel PGAS and SHMEM+

Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective

Bridging the GPGPU-FPGA efficiency gap

Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs

Bridging the Semantic Gaps of GPU Acceleration for Scaleout CNN-based Big Data Processing: Think Big, See Small

Brief announcement: better speedups for parallel max-flow

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Bringing OpenCL to Commodity RISC-V CPUs

Bringing Parallel Performance to Python with Domain-Specific Selective Embedded Just-in-Time Specialization

Brook for GPUs: Stream Computing on Graphics Hardware

Brownian Dynamics of Active Sphere Suspensions Confined Near a No-Slip Boundary

Brownian dynamics simulations on CPU and GPU with BD_BOX

Browsing a Large Collection of Community Photos Based on Similarity on GPU

Browsing Large Image Datasets through Voronoi Diagrams

Brute force de-shredding algorithm using the GPU

Brute-Force k-Nearest Neighbors Search on the GPU

BSGP: bulk-synchronous GPU programming

Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs

Buffer overflow vulnerabilities in CUDA: a preliminary analysis

Bufferless NOC Simulation of Large Multicore System on GPU Hardware

Build and Travel KD-Tree with CUDA

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

Building a Personal High Performance Computer with Heterogeneous Processors

Building a Real-Time Multi-GPU Platform: Robust Real-Time Interrupt Handling Despite Closed-Source Drivers

Building Correlators with Many-Core Hardware

Building Human Brain Network in 3D Coefficient Map Determined by X-ray Microtomography

Building Multiclass Nonlinear Classifiers with GPUs

Building Source-to-Source Compilers for Heterogeneous Targets

Building-Blocks for Performance Oriented DSLs

Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation

Bulk GCD Computation Using a GPU to Break Weak RSA Keys

Bump Mapping Unparametrized Surfaces on the GPU

Bundled depth-map merging for multi-view stereo

Burrows-Wheeler Aligner: A Parallel Approach

BVH for efficient raytracing of dynamic metaballs on GPU

C and CUDA Implementation for SIRT and SART Reconstruction Algorithms

C Language Extensions for Hybrid CPU/GPU Programming with StarPU

C to Cellular Automata and Execution on CPU, GPU and FPGA

C-DAC’s Efforts – Application Kernels on HPC Cluster with GPU Accelerators

C-for-Metal: High Performance SIMD Programming on Intel GPUs

C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++

Cache and bandwidth aware matrix multiplication on the GPU

Cache Miss Analysis for GPU Programs Based on Stack Distance Profile

Cache-efficient numerical algorithms using graphics hardware

CADDIES: A New Framework for Rapid Development of Parallel Cellular Automata Algorithms for Flood Simulation

Caffe con Troll: Shallow Ideas to Speed Up Deep Learning

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks

Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

CaffeLink: Mathematica binding for Caffe Deep Learning Framework

CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms

Calamari – A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Calculation by artificial compressibility method and virtual flux method on GPU

Calculation of fermion loops for eta-prime and nucleon scalar and electromagnetic form factors

Calculation of Force Field Grids for Molecular Docking Using Graphics Processing Unit

Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)

Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains with Optimization for the Intel Many Integrated Core Architecture

Calculation of weight vectors for wideband beamforming using Graphics Processing Units

CAMPAIGN: An open-source Library of GPU-accelerated Data Clustering Algorithms

Can CUDA be exposed through web services?

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

Can GPUs Sort Strings Efficiently?

Can Large Language Models Predict Parallel Code Performance?

Can PCM Benefit GPU? Reconciling Hybrid Memory Design with GPU Massive Parallelism for Energy Efficiency

Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

Can We Run in Parallel? Automating Loop Parallelization for TornadoVM

Canadian Hydrogen Intensity Mapping Experiment (CHIME) Pathfinder

Candidate set parallelization strategies for Ant Colony Optimization on the GPU

Brief statistics for this page

Titles: 100

Download open PDFs: 91

Package packages: 29

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

See all packages

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
