high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Large-Scale DNS of Gas-Solid Flow on Mole-8.5

Large-scale ferrofluid simulations on graphics processing units

Large-scale FFT on GPU clusters

Large-Scale Geospatial Processing on Multi-Core and Many-Core Processors: Evaluations on CPUs, GPUs and MICs

Large-Scale High-Lundquist Number Reduced MHD Simulations of the Solar Corona Using GPU Accelerated Machines

Large-scale image analysis using docker sandboxing

Large-scale mixer simulations using massively parallel GPU architectures

Large-scale Monte Carlo simulation of two-dimensional classical XY model using multiple GPUs

Large-Scale Motion Modelling using a Graphical Processing Unit

Large-scale multi-dimensional document clustering on GPU clusters

Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters

Large-scale network simulation over heterogeneous computing architecture

Large-Scale Paralleled Sparse Principal Component Analysis

Large-Scale Physics-Based Terrain Editing Using Adaptive Tiles on the GPU

Large-Scale Sound Field Rendering in Rectangular Room with Specular Reflection

Large-Scale Stereo Display Wall Using Programmable Graphics Hardware

Large-Scale Stochastic Learning using GPUs

Large-scale transient stability simulation on graphics processing units

Large-scale Virtual Acoustics Simulation at Audio Rates Using Three Dimensional Finite Difference Time Domain and Multiple GPUs

Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation

Larrabee: a many-core x86 architecture for visual computing

Latency considerations of depth-first GPU ray tracing

Lattice Based Volumetric Global Illumination

Lattice Boltzmann based PDE solver on the GPU

Lattice Boltzmann Method for Simulating Turbulent Flows

Lattice Boltzmann Simulation of Binary Mixture Diffusion Using Modern Graphics Processors

Lattice Boltzmann Simulations of Multiphase Flows

Lattice Boltzmann simulations of the permeability and capillary adsorption of cement model microstructures

Lattice Boltzmann Simulations on a GPU: An optimization approach using C++ AMP

Lattice Group Models: GPU Acceleration and Numerics

Lattice QCD as a video game

Lattice QCD based on OpenCL

Lattice QCD on Intel Xeon Phi

Lattice QCD on new chips: a community summary

Lattice QCD simulations using the OpenACC platform

Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors

Lattice Quantum Chromodynamics on Intel Xeon Phi based supercomputers

Lattice Simulations using OpenACC compilers

Lattice SU(2) on GPU’s

Lattice-based flow field modeling

Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors

Lattice-Boltzmann simulation of the shallow-water equations with fluid-structure interaction on multi-and manycore processors

Lattice-Boltzmann water waves

LatticeQCD using OpenCL

Launch-time Optimization of OpenCL Kernels

Layered Interpretation of Street View Images

Lazy Solid Texture Synthesis

LazyTensor: combining eager execution with domain-specific compilers

LBCL: multi-device automatic load balancing

LBM based flow simulation using GPU computing processor

LDetector: A Low Overhead Race Detector For GPU Programs

Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models

Learnergy: Energy-based Machine Learners

Learning a Metric Embedding for Face Recognition using the Multibatch Method

Learning Better Encoding for Approximate Nearest Neighbor Search with Dictionary Annealing

Learning Blood Management in Orthopedic Surgery through Gameplay

Learning hash codes for efficient content reuse detection

Learning Massive Graph Embeddings on a Single Machine

Learning Random Forests on the GPU

Learning Representation for Scene Understanding: Epitomes, CRFs, and CNNs

Learning Sparse Recurrent Neural Networks in Language Modeling

Learning Structured Sparsity in Deep Neural Networks

Learning to Detect Roads in High-Resolution Aerial Images

Learning to Optimize Tensor Programs

Learning Two-View Stereo Matching

Least Squares on GPUs in Multiple Double Precision

Lectures on Parallel Computing

LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations

LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory

Legion: Programming Distributed Heterogeneous Architectures with Logical Regions

Legolizer: A Real-Time System for Modeling and Rendering LEGO Representations of Boundary Models

Lensed: a code for the forward reconstruction of lenses and sources from strong lensing observations

Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications

Lessons learned from contrasting a BLAS kernel implementations

Lessons learned in a decade of research software engineering GPU applications

Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame

Let’s sort this out: GPGPU Verification of Radix Sort

Lettuce: PyTorch-based Lattice Boltzmann Framework

Level Sets and Voronoi based Feature Extraction from any Imagery

Level-of-Detail Triangle Strips for Deforming Meshes

Leveraging AI Ecosystem for Portable and Sustainable GPU Kernels in HPC

Leveraging Binary Translation for Heterogeneous Profiling

Leveraging Computation Sharing and Parallel Processing in Location-Based Services

Leveraging Data-Flow Information for Efficient Scheduling of Task-Parallel Programs on Heterogeneous Systems

Leveraging LLVM OpenMP GPU Offload Optimizations for Kokkos Applications

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs

Leveraging on High-Performance Computing and Cloud Technologies in Digital Libraries: A Case Study

Leveraging Parallelism with CUDA and OpenCL

Leveraging the potential of task-based programming with OpenMP task graphs

Levy Flights for Particle Swarm Optimisation Algorithms on Graphical Processing Units

LeXInt: GPU-accelerated Exponential Integrators package

LHCb GPU acceleration project

libcloudph++ 0.1: single-moment bulk, double-moment bulk, and particle-based warm-rain microphysics library in C++

libCudaOptimize: an Open Source Library of GPU-based Metaheuristics

libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

libmolgrid: GPU Accelerated Molecular Gridding for Deep Learning Applications

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

libWater: Heterogeneous Distributed Computing Made Easy

LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning

Brief statistics for this page

Titles: 100

Download open PDFs: 96

Packages: 25

