high performance computing on graphics processing units: hgpu.org

Papers on hgpu.org (.txt-file)

Light Loss-Less Data Compression, with GPU Implementation

Light propagation for mixed polygonal and volumetric data

Light Propagation Maps on Parallel Graphics Architectures

Lighting Details Preserving Photon Density Estimation

LightNet: A Versatile, Standalone Matlab-based Environment for Deep Learning

Lightning: Scaling the GPU Programming Model Beyond a Single GPU

LightPlay: Efficient Replay with GPUs

LightRNN: Memory and Computation-Efficient Recurrent Neural Networks

LightScan: Faster Scan Primitive on CUDA Compatible Manycore Processors

Lightweight bleeding and smoke effect for surgical simulators

Lightweight Modular Staging and Embedded Compilers: Abstraction Without Regret for High-Level High-Performance Programming

Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs

Lina: a fast design optimisation tool for software-based FPGA programming

linalg: Matrix Computations in Apache Spark

Line-art Illustration of Dynamic and Specular Surfaces

Linear Algebra Algorithms for Hybrid Architectures with XKaapi

Linear algebra operators for GPU implementation of numerical algorithms

Linear Feature Detection on GPUs

Linear genetic programming GPGPU on Microsoft’s Xbox 360

Linear optimization on modern GPUs

Linear Performance-Breakdown Model: A Framework for GPU kernel programs performance analysis

Linear Solvers for Stable Fluids: GPU vs CPU

Linearised inversion with GPUs

Linpack evaluation on a supercomputer with heterogeneous accelerators

linus: Conveniently explore, share, and present large-scale biological trajectory data from a web browser

liquidSVM: A Fast and Versatile SVM package

List Mode PET reconstruction

Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

Literature Review and Implementation Overview: High Performance Computing with Graphics Processing Units for Classroom and Research Use

Literature review: Build and Travel KD-Tree with CUDA

Literature Review: Parallel Computing on linear equations of linear elastic FEM stimulation with CUDA

LithOS: An Operating System for Efficient Machine Learning on GPUs

Live Migration for OpenCL FPGA Accelerators

Live Migration of FPGA Applications

Live, Video-Rate Super-Resolution Microscopy Using Structured Illumination and Rapid GPU-Based Parallel Processing

Living Flows: Enhanced Exploration of Edge-Bundled Graphs Based on GPU-Intensive Edge Rendering

LLload: An Easy-to-Use HPC Utilization Tool

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

LLMPerf: GPU Performance Modeling meets Large Language Models

LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs

LLOR: Automated Repair of OpenMP Programs

LLVM to PTX Backend

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

LN-Annote: An Alternative Approach to Information Extraction from Emails using Locally-Customized Named-Entity Recognition

LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure

LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs

Load Balanced Parallel GPU Out-of-Core for Continuous LOD Model Visualization

Load Balancing for Constraint Solving with GPUs

Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability

Load Balancing in Data Warehouse – Evolution and Perspectives

Load Balancing Utilizing Data Redundancy in Distributed Volume Rendering

Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study

Load-Balanced Multi-GPU Ambient Occlusion for Direct Volume Rendering

Local Alignment Tool Based on Hadoop Framework and GPU Architecture

Local Histogram Modification Based Contrast Enhancement with GPU Acceleration

Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid

Local Search Algorithms on Graphics Processing Units. A Case Study: The Permutation Perceptron Problem

Local Volatility FX Basket Option on CPU and GPU

Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments

Locality Analysis for Characterizing Applications Based on Sparse Matrices

Locality and parallelism optimization for dynamic programming algorithm in bioinformatics

Locality Aware Work-Stealing Based Scheduling in Hybrid CPU-GPU

Locality optimization on a NUMA architecture for hybrid LU factorization

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Locality-Aware Mapping of Nested Parallel Patterns on GPUs

Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

Locality-Aware Work Stealing on Multi-CPU and Multi-GPU Architectures

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs

Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures

Location-based Matching in Publish/Subscribe Revisited

LOD Terrain Rendering by Local Parallel Processing on GPU

Log File Regular Expression Pattern Matching And Capture With GPUs

LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment

LoGV: Low-overhead GPGPU Virtualization

Long Code for Code Search

Long time-scale simulations of in vivo diffusion using GPU hardware

Long Timestep Molecular Dynamics on the Graphical Processing Unit

Long-time Simulations with Complex Code Using Multiple Nodes of Intel Xeon Phi Knights Landing

Loo.py: From Fortran to performance via transformation and substitution rules

Loo.py: transformation-based code generation for GPUs and CPUs

Looking at the surprise: Bottom-up attentional control of an active camera system

LookNN: Neural Network with No Multiplication

Loop Perforation in OpenACC

Loop Transformation Recipes for Code Generation and Auto-Tuning

LoopBench: An Evaluation of Loop Acceleration in Heterogeneous Systems

LOOPer: A Learned Automatic Code Optimizer For Polyhedral Compilers

Loose capacity-constrained representatives for the qualitative visual analysis in molecular dynamics

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

Lossless Compression of Variable-Precision Floating-Point Buffers on GPUs

Lossless data compression on GPGPU architectures

Lossless LZW Data Compression Algorithm on CUDA

Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation

Low Complexity Corner Detector Using CUDA for Multimedia Applications

Low cost approach to real-time vehicle to vehicle communication using parallel CPU and GPU processing

Low cost, high performance GPU computing solution for atomic resolution cryoEM single-particle reconstruction

Low Latency Complex Event Processing on Parallel Hardware

Low latency photon mapping using block hashing

Brief statistics for this page

Titles: 100

Download open PDFs: 95

Packages: 26

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
