high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

JSDoop and TensorFlow.js: Volunteer Distributed Web Browser-Based Neural Network Training

Julia as a unifying end-to-end workflow language on the Frontier exascale system

Jump flooding in GPU with applications to Voronoi diagram and distance transform

Just-in-time Acceleration of JavaScript

Just-in-Time Catching Test Generation at Meta

Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading

K-Means on Commodity GPUs with CUDA

K-Means on GPU: A Review

K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching

k+-buffer: Fragment Synchronized k-buffer

K3 Moore’s Law in the Era of GPU Computing

KAdvice: infering synchronization patterns from an existing codebase

KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks

Kalman Filter Tracking on Parallel Architectures

Kalman-Filter-Based Particle Tracking on Parallel Architectures at Hadron Colliders

kANN on the GPU with Shifted Sorting

Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras

Kargus: a Highly-scalable Software-based Intrusion Detection System

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

Kd-Jump: a Path-Preserving Stackless Traversal for Faster Isosurface Raytracing on GPUs

KD-tree acceleration structures for a GPU raytracer

Kd-tree Based N-Body Simulations with Volume-Mass Heuristic on the GPU

kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos

Keeneland: Bringing heterogeneous GPU computing to the computational science community

KEET: Explaining Performance of GPU Kernels Using LLM Agents

Keras Sig: Efficient Path Signature Computation on GPU in Keras 3

Kerncap: Automated Kernel Extraction and Isolation for AMD GPUs

Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications

Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)

Kernel Tuner: A search-optimizing GPU code auto-tuner

Kernel Tuning Toolkit

Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Kernel-as-a-Service: A Serverless Interface to GPUs

Kernel-Centric Optimizations for Deep Neural Networks on GPGPU

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

KernelBench: Can LLMs Write Efficient GPU Kernels?

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

KERNELGEN – A Toolchain for Automatic GPU-centric Applications Porting

KernelGen – the design and implementation of a next generation compiler platform for accelerating numerical models on GPUs

KernelInterceptor: automating GPU kernel verification by intercepting kernels and their parameters

Kernelized Renyi distance for speaker recognition

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

KeSCo: Compiler-based Kernel Scheduling for Multi-task GPU Applications

Kevin: Multi-Turn RL for Generating CUDA Kernels

Key derivation functions and their GPU implementation

Key Reconciliation with Low-Density Parity-Check Codes for Long-Distance Quantum Cryptography

Keynote address: Immersive exploration of large datasets

KForge: LLM-Driven Cross-Platform Kernel Generation for AI Accelerators

KFusion: Obtaining Modularity and Performance with Regards to General Purpose GPU Computing and Co-processors

Kinematic Modelling of Disc Galaxies using Graphics Processing Units

Kinetics of liquid-solid phase transition in large nickel clusters

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Kite: Braided Parallelism for Heterogeneous Systems

KLARAPTOR: A Tool for Dynamically Finding Optimal Kernel Launch Parameters Targeting CUDA Programs

kNN Query Processing in Metric Spaces Using GPUs

Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen

Kokkos: Enabling performance portability across manycore architectures

Krylov Subspace Accelerated Algebraic Multigrid for Mimetic Finite Differences on GPUs

KUDA: GPU Accelerated Split Race Checker

LAMDA: Learning-Assisted Multi-Stage Autotuning for FPGA Design Closure

LAMMPS’ PPPM Long-Range Solver for the Second Generation Xeon Phi

LAMMPScuda – a new GPU accelerated Molecular Dynamics Simulations Package and its Application to Ion-Conducting Glasses

Landau Gauge Fixing on GPUs

Landau Gauge Fixing on GPUs and String Tension

Langevin dynamics simulations of biomolecules on graphics processors

Language Modeling with Gated Convolutional Networks

Language virtualization for heterogeneous parallel computing

Large calculation of the flow over a hypersonic vehicle using a GPU

Large data real-time classification with Non-negative Matrix Factorization and Self-Organizing Maps on GPU

Large data visualization on distributed memory multi-GPU clusters

Large Graphs on multi-GPUs

Large Integer Arithmetic in GPU for Cryptography

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

Large neighborhood local search optimization on graphics processing units

Large scale 3D shape retrieval by exploiting multi-core and GPU

Large Scale Artificial Neural Network Training Using Multi-GPUs

Large Scale Bioinformatics Data Mining with Parallel Genetic Programming on Graphics Processing Units

Large Scale DNA Sequence Alignment and Kernel Method Implemented with GPUs

Large Scale Finite Element Analysis Using GPU Parallel Computing

Large Scale GPU Accelerated PPMLR-MHD Simulations for Space Weather Forecast

Large Scale GPU Based Simulations of Turbulent Bubbly Flow in a Square Duct

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Large Scale Monte Carlo Tree Search on GPU

Large scale parallel state space search utilizing graphics processing units and solid state disks

Large Scale Physical Modeling Sound Synthesis

Large Scale Plane Wave Pseudopotential Density Functional Theory Calculations on GPU Clusters

Large Scale Simulations of the Euler Equations on GPU Clusters

Large Speed Increase Using Novel GPU Based Algorithms to Simulate Cardiac Excitation Waves in a Rabbit Ventricle

Large steps in GPU-based deformable bodies simulation

Large-eddy simulations with ClimateMachine: a new open-source code for atmospheric simulations on GPUs and CPUs

Large-Scale Compute-Intensive Analysis via a Combined In-Situ and Co-Scheduling Workflow Approach

Large-Scale Data Computing Performance Comparisons on SYCL Heterogeneous Parallel Processing Layer Implementations

Large-Scale Deep Learning on the YFCC100M Dataset

Large-scale deep unsupervised learning using graphics processors

Brief statistics for this page

Titles: 100

Open PDFs: 94

Packages: 35
