high performance computing on graphics processing units: hgpu.org

hgpu.org » Linear Algebra

A Comparison of Potential Interfaces for Batched BLAS Computations

Samuel D. Relton, Pedro Valero-Lara, Mawussi Zounon

Tags: BLAS, Computer science, CUDA, Linear Algebra, nVidia, Package, Tesla K40

August 11, 2016 by hgpu

Adaptive GPU Array Layout Auto-Tuning

Nicolas Weber, Michael Goesele

Tags: Computer science, CUDA, Linear Algebra, nVidia, nVidia GeForce GTX Titan X, Package, Performance, Tesla K20

April 29, 2016 by hgpu

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Fused Kernel Library (FKL)

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

GPUHammer: Rowhammer Attacks on GPU Memories are Practical

Block: Balance Loader of LLM Serving with Context, Knowledge and Predictive Scheduling

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

SIGMo: Scalable Isomorphism Graph Matching on GPUs

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

GEAK-agent: LLM-based AI agent, which can write correct and efficient GPU kernels automatically

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

OpenDwarfs 2025: re-engineered version of the OpenDwarfs benchmark suite, for compatibility with modern platforms

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

Specx: Speculative task-based runtime system

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
