high performance computing on graphics processing units: hgpu.org

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

Lingqi Zhang, Jiajun Huang, Sheng Di, Satoshi Matsuoka, Mohamed Wahib

Tags: Computer science, CUDA, nVidia, nVidia A100, Performance

March 10, 2025 by hgpu

SUperman: Efficient Permanent Computation on GPUs

Deniz Elbek, Fatih Taşyaran, Bora Uçar, Kamer Kaya

Tags: Computer science, CUDA, HPC, Numerical Analysis, nVidia, nVidia A100, nVidia Quadro GV100, OpenMPI, Package

March 10, 2025 by hgpu

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework

Changqing Shi, Yufei Sun, Rui Chen, Jiahao Wang, Qiang Guo, Chunye Gong, Yicheng Sui, Yutong Jin, Yuzhi Zhang

Tags: Code generation, Computer science, CUDA, HPC, Memory model, nVidia, OpenCL, Package

March 3, 2025 by hgpu

pyATF: Constraint-Based Auto-Tuning in Python

Richard Schulze, Sergei Gorlatch, Ari Rasch

Tags: Auto-Tuning, Compilers, Computer science, CUDA, nVidia A100, OpenCL, Package, Performance, Python

March 3, 2025 by hgpu

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun

Tags: Benchmarking, Code generation, Computer science, CUDA, Deep learning, LLM, nVidia, nVidia A100, Package, Python

March 3, 2025 by hgpu

CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

Tags: AMD Radeon Instinct MI210, ATI, Computer science, CUDA, Deep learning, nVidia, nVidia A100, nVidia H100, nVidia RTX A6000, Package, ROCm

March 3, 2025 by hgpu

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability

Pau López Castillón, Xavier Caricchio Hernández, Leonidas Kosmidis

Tags: Compilers, Computer science, CUDA, nVidia, nVidia GeForce GTX 1080 Ti

March 3, 2025 by hgpu

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini

Tags: AI, Benchmarking, Computer science, CUDA, LLM, Machine learning, nVidia, nVidia L40s, Package, PyTorch

February 24, 2025 by hgpu

Seamless acceleration of Fortran intrinsics via AMD AI engines

Nick Brown, Gabriel Rodríguez Canal

Tags: AI, AMD, Computer science, Fortran, Linear Algebra, Package, Performance

February 24, 2025 by hgpu

Forecasting time series with constraints

Nathan Doumèche, Francis Bach, Éloi Bedek, Gérard Biau, Claire Boyer, Yannig Goude

Tags: AI, Benchmarking, Computer science, Linear Algebra, Machine learning, nVidia, nVidia L4, Package

February 24, 2025 by hgpu

Evaluating the Performance of the DeepSeek Model in Confidential Computing Environment

Ben Dong, Qian Wang

Tags: Benchmarking, Cloud, Computer science, LLM, nVidia, nVidia A100, Performance, Security

February 24, 2025 by hgpu

The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition

Robert Tjarko Lange, Aaditya Prasad, Qi Sun, Maxence Faldor, Yujin Tang, David Ha

Tags: AI, Computer science, CUDA, LLM, nVidia, nVidia H100, Package, Performance

February 24, 2025 by hgpu

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

exa-AMD: Exascale Accelerated Materials Discovery

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

No More Shading Languages: Compiling C++ to Vulkan Shaders

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs
