high performance computing on graphics processing units: hgpu.org


A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Chetan Jhurani, Paul Mullowney


Tags: BLAS, CUBLAS, CUDA, Dense linear algebra, GEMM, Linear Algebra, nVidia, Parallel programming, Tesla K20

April 9, 2013 by chetan.jhurani

