Papers on hgpu.org (.txt-file)
BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU
Barra, a Modular Functional GPU Simulator for GPGPU
Barra: A Parallel Functional Simulator for GPGPU
BarraCUDA – a fast short read sequence aligner using graphics processing units
Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels
Barycentric coordinates computation in homogeneous coordinates
BASEMENT v3: a modular freeware for river process modelling over multiple computational backends
Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts
BAT: A Benchmark suite for AutoTuners
Batch Method for Efficient Resource Sharing in Real-time Multi-GPU Systems
Batch Records Insertion into Multidimensional Linear Dynamic Hashing Table on GPU
Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs
Batched Linear Algebra Problems on GPU Accelerators
Batched Matrix Computations on Hardware Accelerators
Batched Matrix Computations on Hardware Accelerators Based on GPUs
Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression
Batched Shift Reduce Parsing with Lists of Vectors on CUDA
Bayesian Image Restoration Using A Large-scale Total Patch Variation Prior
Bayesian inference for artificial perception using OpenCL on FPGAs and GPUs
Bayesian model comparison via sequential Monte Carlo
Bayesian neural networks for detecting epistasis in genetic association studies
Bayesian Neural Networks for Genetic Association Studies of Complex Disease
Bayesian Neural Networks in Data-Intensive High Energy Physics Applications
Bayesian Optimization for auto-tuning GPU kernels
Bayesian real-time perception algorithms on GPU
Bayesian Sparse Unsupervised Learning for Probit Models of Binary Data
Bayesian Sparsity-Path-Analysis of Genetic Association Signal using Generalized t Priors
Bayesian State-Space Modelling on High-Performance Hardware Using LibBi
BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem
BEAGLE: an Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics
Beam Dynamics Simulations Using GPUs
Beam Dynamics Simulations with a GPU-accelerated Version of ELEGANT
Beauty And The Beast: Exploiting GPUs In Haskell
Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation
Behavioral graph fraud detection in E-commerce
Behavioral Non-portability in Scientific Numeric Computing
Behavioral Spherical Harmonics for Long-Range Agents’ Interaction
Belief Propagation by Message Passing in Junction Trees: Computing Each Message Faster Using GPU Parallelization
Belief Propagation on the GPU for Stereo Vision
Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!
Bempp-cl: A fast Python based just-in-time compiling boundary element library
BenchDirect: A Directed Language Model for Compiler Benchmarks
BenchFriend: Correlating the Performance of GPU Benchmarks
BENCHIP: Benchmarking Intelligence Processors
Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library
Benchmarking Across Platforms: European Option Pricing
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
Benchmarking and Implementation of Probability-Based Simulations on Programmable Graphics Cards
Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study
Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms
Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor
Benchmarking Deep Learning Models on Jetson TX2
Benchmarking GPU and CPU codes for Heisenberg spin glass overrelaxation
Benchmarking GPU and TPU Performance with Graph Neural Networks
Benchmarking GPU Devices with N-Body Simulations
Benchmarking GPUs to tune dense linear algebra
Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters
Benchmarking Intel Xeon Phi to Guide Kernel Design
Benchmarking Modern Edge Devices for AI Applications
Benchmarking Next Generation Hardware Platforms: An Experimental Approach
Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption
Benchmarking optimization algorithms for auto-tuning GPU kernels
Benchmarking Parallel Performance on Many-Core Processors
Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors
Benchmarking State-of-the-Art Deep Learning Software Tools
Benchmarking the cost of thread divergence in CUDA
Benchmarking the Intel Xeon Phi Coprocessor
Benchmarking the Memory Hierarchy of Modern GPUs
Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers
Benchmarking Thread Block Cluster
Benchmarking TPU, GPU, and CPU Platforms for Deep Learning
Benchmarks Based on Anti-Parallel Patterns for the Evaluation of GPUs
Benchmarks for Intel MIC Architecture
BenchPress: A Deep Active Benchmark Generator
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
Best Practice Guide – Intel Xeon Phi
Best Practice Guide Intel Xeon Phi v2.0
Best-effort semantic document search on GPUs
Betatron tune measurement with the LHC damper using a GPU
Better speedups using simpler parallel programming for graph connectivity and biconnectivity
Betweenness Centrality on GPUs and Heterogeneous Architectures
Beyond 16GB: Out-of-Core Stencil Computations
Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising
Beyond Amdahl’s Law: An Objective Function That Links Multiprocessor Performance Gains To Delay and Energy
Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure
Beyond programmable shading (parts I and II)
Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes
BFROST: Binary Features from Robust Orientation Segment Tests accelerated on the GPU
Bi-directional Path Tracing on GPU
Bidimensional Median Filter for Parallel Computing Architectures
BIDMach: Large-scale Learning with Zero Memory Allocation
Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy
Big Integer Multiplication with CUDA FFT (cuFFT) Library
Bigger Buffer k-d Trees on Multi-Many-Core Systems
BigKernel — High Performance CPU-GPU Communication Pipelining for Big Data-style Applications
Titles: 100
open PDFs: 97
packages: 30