high performance computing on graphics processing units: hgpu.org

hgpu.org » nVidia GeForce RTX 3060

GenVectorX: A performance-portable SYCL library for Lorentz Vectors operations

Monica Dessole, Jolly Chen, Axel Naumann

Tags: CUDA, nVidia, nVidia A100, nVidia GeForce RTX 3060, nVidia L4, oneAPI, Package, Performance, Physics, SYCL

December 10, 2023 by hgpu

Towards a Benchmarking Suite for Kernel Tuners

Jacob O. Tørring, Ben van Werkhoven, Filip Petrovič, Floris-Jan Willemsen, Jiří Filipovič, Anne C. Elster

Tags: Auto-Tuning, Benchmarking, Computer science, CUDA, nVidia, nVidia GeForce RTX 2080 Ti, nVidia GeForce RTX 3060, nVidia GeForce RTX 3090, nVidia Titan RTX, Package, performance portability

March 19, 2023 by hgpu

Extending MAGMA Portability with OneAPI

Anna Fortenberry, Stanimire Tomov

Tags: Computer science, CUDA, Heterogeneous systems, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce RTX 3060, oneAPI, Package, performance portability

December 25, 2022 by hgpu

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

DVM: Real-Time Kernel Generation for Dynamic AI Models

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs

True 4-Bit Quantized CNN Training on CPU

True 4-Bit Quantized Convolutional Neural Network Training on CPU: Achieving Full-Precision Parity

cuFuzz: A GPU-oriented coverage-guided fuzzer for userland CUDA application

Hunting CUDA Bugs at Scale with cuFuzz

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization


