Papers on hgpu.org (.txt-file)
Breadth First Search Vectorization on the Intel Xeon Phi

Breadth-First Search using Dynamic Parallelism on the GPU

Breaking the GPU programming barrier with the auto-parallelising SAC compiler
Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Bridging Control-Centric and Data-Centric Optimization

Bridging OpenCL and CUDA: A Comparative Analysis and Translation

Bridging parallel and reconfigurable computing with multilevel PGAS and SHMEM+

Bridging the Gap between FPGAs and Multi-Processor Architectures: A Video Processing Perspective

Bridging the GPGPU-FPGA efficiency gap
Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs

Bridging the Semantic Gaps of GPU Acceleration for Scaleout CNN-based Big Data Processing: Think Big, See Small

Brief announcement: better speedups for parallel max-flow

Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs

Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs

Bringing OpenCL to Commodity RISC-V CPUs

Bringing Parallel Performance to Python with Domain-Specific Selective Embedded Just-in-Time Specialization

Brook for GPUs: Stream Computing on Graphics Hardware

Brownian Dynamics of Active Sphere Suspensions Confined Near a No-Slip Boundary

Brownian dynamics simulations on CPU and GPU with BD_BOX

Browsing a Large Collection of Community Photos Based on Similarity on GPU
Browsing Large Image Datasets through Voronoi Diagrams

Brute force de-shredding algorithm using the GPU

Brute-Force k-Nearest Neighbors Search on the GPU

BSGP: bulk-synchronous GPU programming

Buffer k-d Trees: Processing Massive Nearest Neighbor Queries on GPUs

Buffer overflow vulnerabilities in CUDA: a preliminary analysis

Bufferless NOC Simulation of Large Multicore System on GPU Hardware

Build and Travel KD-Tree with CUDA

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

Building a Personal High Performance Computer with Heterogeneous Processors
Building a Real-Time Multi-GPU Platform: Robust Real-Time Interrupt Handling Despite Closed-Source Drivers

Building Correlators with Many-Core Hardware

Building Human Brain Network in 3D Coefficient Map Determined by X-ray Microtomography

Building Multiclass Nonlinear Classifiers with GPUs

Building Source-to-Source Compilers for Heterogeneous Targets

Building-Blocks for Performance Oriented DSLs

Bulk Execution of Oblivious Algorithms on the Unified Memory Machine, with GPU Implementation

Bulk GCD Computation Using a GPU to Break Weak RSA Keys

Bump Mapping Unparametrized Surfaces on the GPU

Bundled depth-map merging for multi-view stereo

Burrows-Wheeler Aligner: A Parallel Approach

BVH for efficient raytracing of dynamic metaballs on GPU

C and CUDA Implementation for SIRT and SART Reconstruction Algorithms

C Language Extensions for Hybrid CPU/GPU Programming with StarPU

C to Cellular Automata and Execution on CPU, GPU and FPGA

C-DAC’s Efforts – Application Kernels on HPC Cluster with GPU Accelerators

C-for-Metal: High Performance SIMD Programming on Intel GPUs

C++ AMP: Accelerated Massive Parallelism with Microsoft Visual C++

Cache and bandwidth aware matrix multiplication on the GPU

Cache Miss Analysis for GPU Programs Based on Stack Distance Profile
Cache-efficient numerical algorithms using graphics hardware

CADDIES: A New Framework for Rapid Development of Parallel Cellular Automata Algorithms for Flood Simulation

Caffe con Troll: Shallow Ideas to Speed Up Deep Learning

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks

Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

CaffeLink: Mathematica binding for Caffe Deep Learning Framework

CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms

Calamari – A High-Performance Tensorflow-based Deep Learning Package for Optical Character Recognition

Calculation by artificial compressibility method and virtual flux method on GPU
Calculation of fermion loops for eta-prime and nucleon scalar and electromagnetic form factors

Calculation of Force Field Grids for Molecular Docking Using Graphics Processing Unit

Calculation of HELAS amplitudes for QCD processes using graphics processing unit (GPU)

Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains with Optimization for the Intel Many Integrated Core Architecture

Calculation of weight vectors for wideband beamforming using Graphics Processing Units

CAMPAIGN: An open-source Library of GPU-accelerated Data Clustering Algorithms

Can CUDA be exposed through web services?

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

Can GPUs Sort Strings Efficiently?

Can Large Language Models Predict Parallel Code Performance?

Can PCM Benefit GPU? Reconciling Hybrid Memory Design with GPU Massive Parallelism for Energy Efficiency

Can Portability Improve Performance? An Empirical Study of Parallel Graph Analytics

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

Can We Run in Parallel? Automating Loop Parallelization for TornadoVM

Canadian Hydrogen Intensity Mapping Experiment (CHIME) Pathfinder

Candidate set parallelization strategies for Ant Colony Optimization on the GPU

CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU

Canny edge detection on NVIDIA CUDA

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL

CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures

Capturing the Memory Topology of GPUs

Caracal: dynamic translation of runtime environments for GPUs

Caracteristiques arithmetiques des processeurs graphiques

CaravelaMPI: Message Passing Interface for Parallel GPU-Based Applications

Cardiac Dysrhythmia Detection with GPU-Accelerated Neural Networks

Cardiac simulation on multi-GPU platform
Cardiac tissue simulation using graphics hardware

Cartesian SENSE and k-t SENSE reconstruction using commodity graphics hardware

Cascaded Segmentation-Detection Networks for Word-Level Text Spotting

Case Studies in Acceleration of Heston’s Stochastic Volatility Financial Engineering Model: GPU, Cloud and FPGA Implementations

Case Study: GPU-based implementation of sequence pair based floorplanning using CUDA
Case study: Interactive rendering of adaptive mesh refinement data

Case study: Runtime reduction of a buffer insertion algorithm using GPU parallel programming
CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

Titles: 100
open PDFs: 87
packages: 29
