high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Benchmarking Harp-DAAL: High Performance Hadoop on KNL Clusters

Benchmarking Intel Xeon Phi to Guide Kernel Design

Benchmarking Modern Edge Devices for AI Applications

Benchmarking Next Generation Hardware Platforms: An Experimental Approach

Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption

Benchmarking optimization algorithms for auto-tuning GPU kernels

Benchmarking Parallel Performance on Many-Core Processors

Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors

Benchmarking State-of-the-Art Deep Learning Software Tools

Benchmarking the cost of thread divergence in CUDA

Benchmarking the Intel Xeon Phi Coprocessor

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

Benchmarking Thread Block Cluster

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Benchmarks Based on Anti-Parallel Patterns for the Evaluation of GPUs

Benchmarks for Intel MIC Architecture

BenchPress: A Deep Active Benchmark Generator

BePilot: An AI Programming Assistant for Compiler Backend Development

Berkeley Dwarfs on CUDA

Best bang for your buck: GPU nodes for GROMACS biomolecular simulations

Best Practice Guide – GPGPU

Best Practice Guide – Intel Xeon Phi

Best Practice Guide Intel Xeon Phi v2.0

Best-effort semantic document search on GPUs

Betatron tune measurement with the LHC damper using a GPU

Better GPU Hash Tables

Better speedups using simpler parallel programming for graph connectivity and biconnectivity

Betweenness Centrality on GPUs and Heterogeneous Architectures

Beyond 16GB: Out-of-Core Stencil Computations

Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Beyond Amdahl’s Law: An Objective Function That Links Multiprocessor Performance Gains To Delay and Energy

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure

Beyond programmable shading (parts I and II)

Beyond Straightforward Vectorization of Lightweight Data Compression Algorithms for Larger Vector Sizes

BFROST: Binary Features from Robust Orientation Segment Tests accelerated on the GPU

Bi-directional Path Tracing on GPU

Bidimensional Median Filter for Parallel Computing Architectures

BIDMach: Large-scale Learning with Zero Memory Allocation

Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy

Big Integer Multiplication with CUDA FFT (cuFFT) Library

Bigger Buffer k-d Trees on Multi-Many-Core Systems

BigKernel — High Performance CPU-GPU Communication Pipelining for Big Data-style Applications

Bilateral Filtering with CUDA

Billion-scale similarity search with GPUs

Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models

Binary Interval Search (BITS): A Scalable Algorithm for Counting Interval Intersections

Binary Interval Search: a scalable algorithm for counting interval intersections

Binary Mesh Partitioning for Cache-Efficient Visualization

Binary Segmentation of Video Sequences in Real Time

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Binaural Simulations Using Audio Rate FDTD Schemes and CUDA

Binomial American Option Pricing on CPU-GPU Heterogeneous System

Bio-inspired computer visual system using GPU and Visual Pattern Assessment Language (ViPAL): Application on breast cancer prognosis

Bio-Inspired Optimization of Ultra-Wideband Patch Antennas Using Graphics Processing Unit Acceleration

Bio-sequence database scanning on a GPU

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

BioEM: GPU-accelerated computing of Bayesian inference of electron microscopy images

Bioinformatics Sequence Comparisons on Manycore Processors

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

Biomedical image analysis on a cooperative cluster of GPUs and multicores

Biomolecular electrostatics simulation with a parallel FMM-based BEM, using up to 512 GPUs

Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

Bit-level Parallelization of 3DES Encryption on GPU

Bit-Packed Damaged Lattice Potts Model Simulations with CUDA and GPUs

Bit-Parallel Multiple Pattern Matching

Bit-Vectorized GPU Implementation of a Stochastic Cellular Automaton Model for Surface Growth

Bitcoin and The Age of Bespoke Silicon

BitCracker: BitLocker meets GPUs

Bitmap Filter: Speeding up Exact Set Similarity Joins with Bitwise Operations

Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL

BlaBla: Linguistic Feature Extraction for Clinical Analysis in Multiple Languages

Black-Box Side-Channel Attacks Highlight the Importance of Countermeasures: An Analysis of the Xilinx Virtex-4 and Virtex-5 Bitstream Encryption Mechanism

BLAS Comparison on FPGA, CPU and GPU

Blasting through lattice calculations using CUDA

BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Blind image deconvolution algorithm on NVIDIA CUDA platform

Blink: Fast and Generic Collectives for Distributed ML

Blister: GPU-based rendering of Boolean combinations of free-form triangulated shapes

Block based Singular Value Decomposition approach to matrix factorization for recommender systems

Block Conjugate Gradient Solver in OpenCL

Block Time Step Storage Scheme for Astrophysical N-body Simulations

Block-asynchronous Multigrid Smoothers for GPU-accelerated Systems

Block-Parallel IDA* for GPUs

Block-Relaxation Methods for 3D Constant-Coefficient Stencils on GPUs and Multicore CPUs

Block-Size Independence for GPU Programs

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge

Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study

Blocking Self-avoiding Walks Stops Cyber-epidemics: A Scalable GPU-based Approach

Blocks and Fuel: Frameworks for deep learning

Blum Blum Shub on the GPU

Boda-RTC: Productive Generation of Portable, Efficient Code for Convolutional Neural Networks on Mobile Computing Platforms

Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster

Boids that see: Using self-occlusion for simulating large groups on GPUs

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

BoltzGen: Toward Universal Binder Design

Bone structure analysis on multiple GPGPUs

Brief statistics for this page

Titles: 100

Download open PDFs: 97

Package packages: 33

