high performance computing on graphics processing units: hgpu.org

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Johannes Pekkilä, Oskar Lappi, Fredrik Robertsén, Maarit J. Korpi-Lagg

Source codes

Tags: AMD Radeon Instinct MI100, AMD Radeon Instinct MI250X, ATI, Computer science, CUDA, Energy-efficient computing, HIP, nVidia, nVidia A100, nVidia V100, Package, Performance, PyTorch, Stencil computation

June 16, 2024 by hgpu

How much can we gain from Tensor Kernel Fusion on GPUs?

Wei Sun, Ang Li, Sander Stuijk, Henk Corporaal

Tags: Computer science, CUDA, Deep learning, Matrix multiplication, Neural networks, nVidia, nVidia A100, nVidia H100

June 16, 2024 by hgpu

Gaining Cross-Platform Parallelism for HAL’s Molecular Dynamics Package using SYCL

Viktor Skoblin, Felix Höfling, Steffen Christgau

Source codes

Tags: AMD Radeon Instinct MI210, ATI, Computer science, CUDA, Molecular dynamics, nVidia, nVidia A100, nVidia A40, Package, Physics, SYCL

June 9, 2024 by hgpu

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

Tags: Computer science, GPU cluster, Heterogeneous systems, nVidia, nVidia A100, nVidia V100, Tesla T4

June 9, 2024 by hgpu

An implementation of tensor product patch smoothers on GPU

Cu Cui, Paul Grosse-Bley, Guido Kanschat, Robert Strzodka

Tags: CUDA, FEM, Finite element method, Mathematics, Numerical Analysis, nVidia, nVidia A100

June 2, 2024 by hgpu

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

L.A. Torres, Carlos J. Barrios H, Yves Denneulin

Source codes

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, Neural networks, nVidia, nVidia A100, Package, Performance, SYCL

June 2, 2024 by hgpu

Enabling full-speed random access to the entire memory on the A100 GPU

Alden Walker

Tags: Computer science, nVidia, nVidia A100, Performance, PTX

May 26, 2024 by hgpu

ArchesWeather: An efficient AI weather forecasting model at 1.5° resolution

Guillaume Couairon, Christian Lessig, Anastase Charantonis, Claire Monteleoni

Tags: Deep learning, Earth and Space Sciences, nVidia, nVidia A100, Weather prediction

May 26, 2024 by hgpu

GPU Implementations for Midsize Integer Addition and Multiplication

Cosmin E. Oancea, Stephen M. Watt

Tags: Algorithms, Computer science, CUDA, nVidia, nVidia A100, Performance, Programming Languages

May 26, 2024 by hgpu

STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning

Roberto L. Castro, Diego Andrade, Basilio B. Fraguela

Tags: Auto-Tuning, Computer science, CUDA, Deep learning, Machine learning, nVidia, nVidia A100, Tesla T4

May 26, 2024 by hgpu

Kernel-Centric Optimizations for Deep Neural Networks on GPGPU

Zhaodong Chen

Tags: Computer science, Computer vision, CUDA, Deep learning, Neural networks, nVidia, nVidia A100, nVidia V100, Thesis

May 26, 2024 by hgpu

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach

Urvij Saroliya, Eishi Arima, Dai Liu, Martin Schulz

Tags: Computer science, CUDA, Heterogeneous systems, Machine learning, nVidia, nVidia A100, PC cluster, Task scheduling

May 20, 2024 by hgpu
