high performance computing on graphics processing units: hgpu.org

On the Partitioning of GPU Power among Multi-Instances

Tirth Vamja, Kaustabha Ray, Felix George, UmaMaheswari C Devi

Tags: Computer science, CUDA, Energy-efficient computing, Machine learning, Matrix multiplication, nVidia, nVidia A100, nVidia V100, Performance

February 3, 2025 by hgpu

Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs

Deniz Elbek, Kamer Kaya

Tags: Code generation, Computer science, CUDA, nVidia, nVidia A100, nVidia Quadro GV100, PTX, Sparse matrix

February 3, 2025 by hgpu

CPU-GPU co-execution through the exploitation of hybrid technologies via SYCL

Nozal Raúl, Jose Luis Bosque

Tags: Computer science, CUDA, Heterogeneous systems, Hybrid computing, LLVM, load balancing, nVidia, nVidia GeForce GT 1030, oneAPI, OpenCL, performance portability, SYCL

February 3, 2025 by hgpu

Profiling Apple Silicon Performance for ML Training

Dahua Feng, Zhiming Xu, Rongxiang Wang, Felix Xiaozhu Lin

Tags: AI, Apple M2 Max, Apple M2 Pro, Apple M2 Ultra, Computer science, CUDA, Linear Algebra, LLM, Machine learning, nVidia, nVidia GeForce RTX 4090, nVidia GeForce RTX 2080 Ti, nVidia Quadro RTX 4000, nVidia RTX A6000, Performance, PyTorch

February 3, 2025 by hgpu

Column-Oriented Datalog on the GPU

Yihao Sun, Sidharth Kumar, Thomas Gilray, Kristopher Micinski

Source codes

Tags: Computer science, CUDA, Databases, nVidia, nVidia H100, Package

January 27, 2025 by hgpu

Adaptive Optimization Techniques for High-Performance Computing

Gulsum Gudukbay Akbulut

Tags: Computer science, CUDA, Deep learning, HPC, Machine learning, nVidia, Optimization, Performance, Tesla K80, Thesis

January 27, 2025 by hgpu

Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis

Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu

Tags: Benchmarking, Computer science, CUDA, Hardware Architecture, nVidia, nVidia A100, nVidia GeForce RTX 4090, nVidia H800, Performance, PTX

January 27, 2025 by hgpu

Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU

Naifeng Zhang, Franz Franchetti

Source codes

Tags: Code generation, Computer science, CUDA, Linear Algebra, Modular arithmetic, nVidia, nVidia GeForce RTX 4090, nVidia H100, nVidia V100, Package, Security

January 20, 2025 by hgpu

GSParLib: A multi-level programming interface unifying OpenCL and CUDA for expressing stream and data parallelism

Dinei A. Rockenbach, Gabriell Araujo, Dalvan Griebler, Luiz Gustavo Fernandes

Source codes

Tags: Computer science, CUDA, Data parallelism, nVidia, nVidia GeForce RTX 3090, OpenCL, OpenMP, Package

January 20, 2025 by hgpu

A User’s Guide to KSig: GPU-Accelerated Computation of the Signature Kernel

Csaba Tóth, Danilo Jr Dela Cruz, Harald Oberhauser

Source codes

Tags: Computer science, CUDA, Machine learning, nVidia, nVidia A100, Package, Python

January 20, 2025 by hgpu

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Jonah Ekelund, Stefano Markidis, Ivy Peng

Source codes

Tags: Computer science, CUDA, FDTD, HIP, nVidia, nVidia A100, nVidia H100, Package, Performance

January 20, 2025 by hgpu

Keras Sig: Efficient Path Signature Computation on GPU in Keras 3

Rémi Genet, Hugo Inzirillo

Source codes

Tags: Computer science, CUDA, Deep learning, Machine learning, nVidia, nVidia GeForce RTX 4090, Package, Python, TensorFlow

January 20, 2025 by hgpu
