high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

AutoParBench: A Unified Test Framework for OpenMP-based Parallelizers

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPhase: Compiler Phase-Ordering for High Level Synthesis with Deep Reinforcement Learning

Autotuning CUDA Compiler Parameters for Heterogeneous Applications using the OpenTuner Framework

Autotuning CUDA: Applying NLP Techniques to LS-CAT

Autotuning for Automatic Parallelization on Heterogeneous Systems

Autotuning GEMMs for Fermi

Autotuning GPU Kernels via Static and Predictive Analysis

Autotuning of Pattern Runtimes for Accelerated Parallel Systems

Autotuning OpenACC Work Distribution via Direct Search

Autotuning OpenCL Workgroup Size for Stencil Patterns

Autotuning Programs with Algorithmic Choice

Autotuning Stencil-Based Computations on GPUs

Autotuning Stencils Codes with Algorithmic Skeletons

Autotuning Tensor Contraction Computations on GPUs

Autotuning Wavefront Abstractions for Heterogeneous Architectures

Autotuning Wavefront Patterns for Heterogeneous Architectures

Autotuning, Code Generation and Optimizing Compiler Technology for GPUs

Auxiliary Image Regularization for Deep CNNs with Noisy Labels

AvA: Accelerated Virtualization of Accelerators

AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries

AVSS2011 demo session: GPU enabled Smart Video Node

AVX-512 extension to OpenQCD 1.6

AXC: A new format to perform the SpMV oriented to Intel Xeon Phi architecture in OpenCL

Axel: a heterogeneous cluster with FPGAs and GPUs

AZP: Automatic Specialization for Zero Values in Gaming Applications

b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

B-CALM: An open-source GPU-based 3D-FDTD with multi-pole dispersion for plasmonics

B-Calm: an Open-Source Multi-Gpu-Based 3D-FDTD with Multi-Pole Dispersion for Plasmonics

Back Ground Subtraction Algorithm For Moving Object Detection In FPGA

Backpropagation Training for Fisher Vectors within Neural Networks

BaCO: A Fast and Portable Bayesian Compiler Optimization Framework

Bacon: A GPU Programming System With Just in Time Specialization

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Balancing locality and concurrency: solving sparse triangular systems on GPUs

Balancing Tracking Granularity and Parallelism in Many-Task Systems: The Horizons Approach

Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form

Bandicoot: A Templated C++ Library for GPU Linear Algebra

Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing

Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Bandwidth Reduction Through Multithreaded Compression of Seismic Images

Bandwidth Requirements of GPU Architectures

BANG: Billion-Scale Approximate Nearest Neighbor Search using a Single GPU

Barnes-hut treecode on GPU

Barra, a Modular Functional GPU Simulator for GPGPU

Barra: A Parallel Functional Simulator for GPGPU

BarraCUDA – a fast short read sequence aligner using graphics processing units

Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels

Barycentric coordinates computation in homogeneous coordinates

BASEMENT v3: a modular freeware for river process modelling over multiple computational backends

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

BAT: A Benchmark suite for AutoTuners

Batch Method for Efficient Resource Sharing in Real-time Multi-GPU Systems

Batch Records Insertion into Multidimensional Linear Dynamic Hashing Table on GPU

Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

Batched Linear Algebra Problems on GPU Accelerators

Batched Matrix Computations on Hardware Accelerators

Batched Matrix Computations on Hardware Accelerators Based on GPUs

Batched Multi Triangulation

Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

Batched Shift Reduce Parsing with Lists of Vectors on CUDA

Bayesian Image Restoration Using A Large-scale Total Patch Variation Prior

Bayesian inference for artificial perception using OpenCL on FPGAs and GPUs

Bayesian model comparison via sequential Monte Carlo

Bayesian neural networks for detecting epistasis in genetic association studies

Bayesian Neural Networks for Genetic Association Studies of Complex Disease

Bayesian Neural Networks in Data-Intensive High Energy Physics Applications

Bayesian Optimization for auto-tuning GPU kernels

Bayesian real-time perception algorithms on GPU

Bayesian Sparse Unsupervised Learning for Probit Models of Binary Data

Bayesian Sparsity-Path-Analysis of Genetic Association Signal using Generalized t Priors

Bayesian State-Space Modelling on High-Performance Hardware Using LibBi

BbmTTP: Beat-based Parallel Simulated Annealing Algorithm on GPGPUs for the Mirrored Traveling Tournament Problem

BEAGLE: an Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics

Beam Dynamics Simulations Using GPUs

Beam Dynamics Simulations with a GPU-accelerated Version of ELEGANT

Beauty And The Beast: Exploiting GPUs In Haskell

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

Behavioral graph fraud detection in E-commerce

Behavioral Non-portability in Scientific Numeric Computing

Behavioral Spherical Harmonics for Long-Range Agents’ Interaction

Belief Propagation by Message Passing in Junction Trees: Computing Each Message Faster Using GPU Parallelization

Belief Propagation on the GPU for Stereo Vision

Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!

Bempp-cl: A fast Python based just-in-time compiling boundary element library

BenchDirect: A Directed Language Model for Compiler Benchmarks

BenchFriend: Correlating the Performance of GPU Benchmarks

BENCHIP: Benchmarking Intelligence Processors

Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library

Benchmarking Across Platforms: European Option Pricing

Benchmarking and Dissecting the Nvidia Hopper GPU Architecture

Benchmarking and Implementation of Probability-Based Simulations on Programmable Graphics Cards

Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study

Benchmarking and Optimization of Gradient Boosted Decision Tree Algorithms

Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor

Benchmarking Deep Learning Models on Jetson TX2

Benchmarking GPU and CPU codes for Heisenberg spin glass overrelaxation

Benchmarking GPU and TPU Performance with Graph Neural Networks

Benchmarking GPU Devices with N-Body Simulations

Benchmarking GPUs to tune dense linear algebra

Brief statistics for this page

Titles: 100

Download open PDFs: 97

Package packages: 38

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

Analyzing the Impact of Kernel Fusion on GPU Tensor Operation Performance: A Systematic Performance Study

IntelliKit: Agent-first tooling for AMD hardware

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

DITRON: Distributed Compiler based on Triton for Parallel Systems

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)