Papers on hgpu.org (.txt-file)
Belief Propagation by Message Passing in Junction Trees: Computing Each Message Faster Using GPU Parallelization
Belief Propagation on the GPU for Stereo Vision
Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!
Bempp-cl: A fast Python based just-in-time compiling boundary element library
BenchDirect: A Directed Language Model for Compiler Benchmarks
BenchFriend: Correlating the Performance of GPU Benchmarks
BENCHIP: Benchmarking Intelligence Processors
Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library
Benchmarking Across Platforms: European Option Pricing
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
Benchmarking and Implementation of Probability-Based Simulations on Programmable Graphics Cards
Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study
Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms
Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor
Benchmarking Deep Learning Models on Jetson TX2
Benchmarking GPU and CPU codes for Heisenberg spin glass overrelaxation
Benchmarking GPU and TPU Performance with Graph Neural Networks
Benchmarking GPU Devices with N-Body Simulations
Benchmarking GPUs to tune dense linear algebra
Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters
Benchmarking Intel Xeon Phi to Guide Kernel Design
Benchmarking Modern Edge Devices for AI Applications
Benchmarking Next Generation Hardware Platforms: An Experimental Approach
Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption
Benchmarking optimization algorithms for auto-tuning GPU kernels
Benchmarking Parallel Performance on Many-Core Processors
Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors
Benchmarking State-of-the-Art Deep Learning Software Tools
Benchmarking the cost of thread divergence in CUDA
Benchmarking the Intel Xeon Phi Coprocessor
Benchmarking the Memory Hierarchy of Modern GPUs
Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers
Benchmarking TPU, GPU, and CPU Platforms for Deep Learning
Benchmarks Based on Anti-Parallel Patterns for the Evaluation of GPUs
Benchmarks for Intel MIC Architecture
BenchPress: A Deep Active Benchmark Generator
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
Best Practice Guide – Intel Xeon Phi
Best Practice Guide Intel Xeon Phi v2.0
Best-effort semantic document search on GPUs
Betatron tune measurement with the LHC damper using a GPU
Better speedups using simpler parallel programming for graph connectivity and biconnectivity
Betweenness Centrality on GPUs and Heterogeneous Architectures
Beyond 16GB: Out-of-Core Stencil Computations
Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising
Beyond Amdahl’s Law: An Objective Function That Links Multiprocessor Performance Gains To Delay and Energy
Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure
Beyond programmable shading (parts I and II)
Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes
BFROST: Binary Features from Robust Orientation Segment Tests accelerated on the GPU
Bi-directional Path Tracing on GPU
Bidimensional Median Filter for Parallel Computing Architectures
BIDMach: Large-scale Learning with Zero Memory Allocation
Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy
Big Integer Multiplication with CUDA FFT (cuFFT) Library
Bigger Buffer k-d Trees on Multi-Many-Core Systems
BigKernel — High Performance CPU-GPU Communication Pipelining for Big Data-style Applications
Billion-scale similarity search with GPUs
Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models
Binary Interval Search (BITS): A Scalable Algorithm for Counting Interval Intersections
Binary Interval Search: a scalable algorithm for counting interval intersections
Binary Mesh Partitioning for Cache-Efficient Visualization
Binary Segmentation of Video Sequences in Real Time
BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
Binaural Simulations Using Audio Rate FDTD Schemes and CUDA
Binomial American Option Pricing on CPU-GPU Heterogeneous System
Bio-inspired computer visual system using GPU and Visual Pattern Assessment Language (ViPAL): Application on breast cancer prognosis
Bio-Inspired Optimization of Ultra-Wideband Patch Antennas Using Graphics Processing Unit Acceleration
Bio-sequence database scanning on a GPU
BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images
Bioinformatics Sequence Comparisons on Manycore Processors
Biomedical and Clinical English Model Packages in the Stanza Python NLP Library
Biomedical image analysis on a cooperative cluster of GPUs and multicores
Biomolecular electrostatics simulation with a parallel FMM-based BEM, using up to 512 GPUs
Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns
Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU
Bit-level Parallelization of 3DES Encryption on GPU
Bit-Packed Damaged Lattice Potts Model Simulations with CUDA and GPUs
Bit-Parallel Multiple Pattern Matching
Bit-Vectorized GPU Implementation of a Stochastic Cellular Automaton Model for Surface Growth
Bitcoin and The Age of Bespoke Silicon
BitCracker: BitLocker meets GPUs
Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations
BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages
Black-Box Side-Channel Attacks Highlight the Importance of Countermeasures: An Analysis of the Xilinx Virtex-4 and Virtex-5 Bitstream Encryption Mechanism
BLAS Comparison on FPGA, CPU and GPU
Blasting through lattice calculations using CUDA
BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing
Blind image deconvolution algorithm on NVIDIA CUDA platform
Blink: Fast and Generic Collectives for Distributed ML
Blister: GPU-based rendering of Boolean combinations of free-form triangulated shapes
Block based Singular Value Decomposition approach to matrix factorization for recommender systems
Block Conjugate Gradient Solver in OpenCL
Block Time Step Storage Scheme for Astrophysical N-body Simulations
Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems
Titles: 100
open PDFs: 97
packages: 30