Papers on hgpu.org (.txt-file)

AvA: Accelerated Virtualization of Accelerators

AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries

AVSS2011 demo session: GPU enabled Smart Video Node

AVX-512 extension to OpenQCD 1.6

AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL

Axel: a heterogeneous cluster with FPGAs and GPUs

AZP: Automatic Specialization for Zero Values in Gaming Applications

b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

B-CALM: An open-source GPU-based 3D-FDTD with multi-pole dispersion for plasmonics

B-Calm: an Open-Source Multi-Gpu-Based 3D-FDTD with Multi-Pole Dispersion for Plasmonics

Back Ground Subtraction Algorithm For Moving Object Detection In FPGA

Backpropagation Training for Fisher Vectors within Neural Networks

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Bacon: A GPU Programming System With Just in Time Specialization

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Balancing locality and concurrency: solving sparse triangular systems on GPUs

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach

Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form

Bandicoot: A Templated C++ Library for GPU Linear Algebra

Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing

Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Bandwidth Reduction Through Multithreaded Compression of Seismic Images

Bandwidth Requirements of GPU Architectures

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Barra, a Modular Functional GPU Simulator for GPGPU

Barra: A Parallel Functional Simulator for GPGPU

BarraCUDA – a fast short read sequence aligner using graphics processing units

Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels

Barycentric coordinates computation in homogeneous coordinates

BASEMENT v3: a modular freeware for river process modelling over multiple computational backends

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

BAT: A Benchmark suite for AutoTuners

Batch Method for Efficient Resource Sharing in Real-time Multi-GPU Systems

Batch Records Insertion into Multidimensional Linear Dynamic Hashing Table on GPU

Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

Batched Linear Algebra Problems on GPU Accelerators

Batched Matrix Computations on Hardware Accelerators

Batched Matrix Computations on Hardware Accelerators Based on GPUs

Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

Batched Shift Reduce Parsing with Lists of Vectors on CUDA

Bayesian Image Restoration Using A Large-scale Total Patch Variation Prior

Bayesian inference for artificial perception using OpenCL on FPGAs and GPUs

Bayesian model comparison via sequential Monte Carlo

Bayesian neural networks for detecting epistasis in genetic association studies

Bayesian Neural Networks for Genetic Association Studies of Complex Disease

Bayesian Neural Networks in Data-Intensive High Energy Physics Applications

Bayesian Optimization for auto-tuning GPU kernels

Bayesian real-time perception algorithms on GPU

Bayesian Sparse Unsupervised Learning for Probit Models of Binary Data

Bayesian Sparsity-Path-Analysis of Genetic Association Signal using Generalized t Priors

Bayesian State-Space Modelling on High-Performance Hardware Using LibBi

BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem

BEAGLE: an Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics

Beam Dynamics Simulations Using GPUs

Beam Dynamics Simulations with a GPU-accelerated Version of ELEGANT

Beauty And The Beast: Exploiting GPUs In Haskell

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

Behavioral graph fraud detection in E-commerce

Behavioral Non-portability in Scientific Numeric Computing

Behavioral Spherical Harmonics for Long-Range Agents’ Interaction

Belief Propagation by Message Passing in Junction Trees: Computing Each Message Faster Using GPU Parallelization

Belief Propagation on the GPU for Stereo Vision

Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!

Bempp-cl: A fast Python based just-in-time compiling boundary element library

BenchDirect: A Directed Language Model for Compiler Benchmarks

BenchFriend: Correlating the Performance of GPU Benchmarks

BENCHIP: Benchmarking Intelligence Processors

Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library

Benchmarking Across Platforms: European Option Pricing

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

Benchmarking and Implementation of Probability-Based Simulations on Programmable Graphics Cards

Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study

Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms

Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor

Benchmarking Deep Learning Models on Jetson TX2

Benchmarking GPU and CPU codes for Heisenberg spin glass overrelaxation

Benchmarking GPU and TPU Performance with Graph Neural Networks

Benchmarking GPU Devices with N-Body Simulations

Benchmarking GPUs to tune dense linear algebra

Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters

Benchmarking Intel Xeon Phi to Guide Kernel Design

Benchmarking Modern Edge Devices for AI Applications

Benchmarking Next Generation Hardware Platforms: An Experimental Approach

Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption

Benchmarking optimization algorithms for auto-tuning GPU kernels

Benchmarking Parallel Performance on Many-Core Processors

Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors

Benchmarking State-of-the-Art Deep Learning Software Tools

Benchmarking the cost of thread divergence in CUDA

Benchmarking the Intel Xeon Phi Coprocessor

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

Benchmarking Thread Block Cluster

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Benchmarks Based on Anti-Parallel Patterns for the Evaluation of GPUs

Benchmarks for Intel MIC Architecture

BenchPress: A Deep Active Benchmark Generator

BePilot: An AI Programming Assistant for Compiler Backend Development

Titles: 100
open PDFs: 97
packages: 34
