Papers on hgpu.org (.txt-file)

Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

Best Practice Guide – Intel Xeon Phi

Best Practice Guide Intel Xeon Phi v2.0

Best-effort semantic document search on GPUs

Betatron tune measurement with the LHC damper using a GPU

Better speedups using simpler parallel programming for graph connectivity and biconnectivity

Betweenness Centrality on GPUs and Heterogeneous Architectures

Beyond 16GB: Out-of-Core Stencil Computations

Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Beyond Amdahl’s Law: An Objective Function That Links Multiprocessor Performance Gains To Delay and Energy

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure

Beyond programmable shading (parts I and II)

Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes

BFROST: Binary Features from Robust Orientation Segment Tests accelerated on the GPU

Bi-directional Path Tracing on GPU

Bidimensional Median Filter for Parallel Computing Architectures

BIDMach: Large-scale Learning with Zero Memory Allocation

Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy

Big Integer Multiplication with CUDA FFT (cuFFT) Library

Bigger Buffer k-d Trees on Multi-Many-Core Systems

BigKernel — High Performance CPU-GPU Communication Pipelining for Big Data-style Applications

Billion-scale similarity search with GPUs

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Binary Interval Search (BITS): A Scalable Algorithm for Counting Interval Intersections

Binary Interval Search: a scalable algorithm for counting interval intersections

Binary Mesh Partitioning for Cache-Efficient Visualization

Binary Segmentation of Video Sequences in Real Time

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Binaural Simulations Using Audio Rate FDTD Schemes and CUDA

Binomial American Option Pricing on CPU-GPU Heterogeneous System

Bio-inspired computer visual system using GPU and Visual Pattern Assessment Language (ViPAL): Application on breast cancer prognosis

Bio-Inspired Optimization of Ultra-Wideband Patch Antennas Using Graphics Processing Unit Acceleration

Bio-sequence database scanning on a GPU

BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images

Bioinformatics Sequence Comparisons on Manycore Processors

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

Biomedical image analysis on a cooperative cluster of GPUs and multicores

Biomolecular electrostatics simulation with a parallel FMM-based BEM, using up to 512 GPUs

Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

Bit-level Parallelization of 3DES Encryption on GPU

Bit-Packed Damaged Lattice Potts Model Simulations with CUDA and GPUs

Bit-Parallel Multiple Pattern Matching

Bit-Vectorized GPU Implementation of a Stochastic Cellular Automaton Model for Surface Growth

Bitcoin and The Age of Bespoke Silicon

BitCracker: BitLocker meets GPUs

Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations

Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL

BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages

Black-Box Side-Channel Attacks Highlight the Importance of Countermeasures: An Analysis of the Xilinx Virtex-4 and Virtex-5 Bitstream Encryption Mechanism

BLAS Comparison on FPGA, CPU and GPU

Blasting through lattice calculations using CUDA

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Blind image deconvolution algorithm on NVIDIA CUDA platform

Blink: Fast and Generic Collectives for Distributed ML

Blister: GPU-based rendering of Boolean combinations of free-form triangulated shapes

Block based Singular Value Decomposition approach to matrix factorization for recommender systems

Block Conjugate Gradient Solver in OpenCL

Block Time Step Storage Scheme for Astrophysical N-body Simulations

Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems

Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs

Block-Size Independence for GPU Programs

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge

Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study

Blocking Self-avoiding Walks Stops Cyber-epidemics: A Scalable GPU-based Approach

Blocks and Fuel: Frameworks for deep learning

Boda-RTC: Productive Generation of Portable, Efficient Code for Convolutional Neural Networks on Mobile Computing Platforms

Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster

Boids that see: Using self-occlusion for simulating large groups on GPUs

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

Bone structure analysis on multiple GPGPUs

Bone Structure Analysis with GPGPUs

Boosted Algorithms for Visual Object Detection on Graphics Processing Units

Boosting GPU Virtualization Performance with Hybrid Shadow Page Tables

Boosting Java Performance using GPGPUs

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Boosting quantum evolutions using Trotter-Suzuki algorithms on GPUs

Boosting sphere decoding speed through Graphic Processing Units

BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs

BOPM implemented on a GPU-architecture

Bothnia: a dual-personality extension to the Intel integrated graphics driver

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Bouncing Behavior of Microscopic Dust Aggregates

Bound the Peak Performance of SGEMM on GPU with software-controlled fast memory

Bounding the effect of partition camping in GPU kernels

Bounds on the Energy Consumption of Computational Kernels

Brain perfusion imaging: performance and accuracy

BrainCove: A Tool for Voxel-wise fMRI Brain Connectivity Visualization

BrainFrame: A heterogeneous accelerator platform for neuron simulations

BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism

Branch and Data Herding: Reducing Control and Memory Divergence for Error-tolerant GPU Applications

Titles: 100
open PDFs: 96
packages: 35
