high performance computing on graphics processing units: hgpu.org

Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs

Yehia Arafa, Ammar ElWazir, Abdelrahman ElKanishy, Youssef Aly, Ayatelrahman Elsayed, Abdel-Hameed Badawy, Gopinath Chennupati, Stephan Eidenbenz, Nandakishore Santhi

Tags: Computer science, CUDA, Energy-efficient computing, Measurement techniques, nVidia, nVidia GeForce GTX 1080 Ti, nVidia GeForce GTX Titan V, nVidia GeForce GTX Titan X, nVidia Titan RTX, PTX

February 23, 2020 by hgpu

Instructions’ Latencies Characterization for NVIDIA GPGPUs

Yehia Arafa, Abdel-Hameed Badawy, Gopinath Chennupati, Nandakishore Santhi, Stephan Eidenbenz

Tags: Benchmarking, Computer science, CUDA, nVidia, nVidia GeForce GTX Titan X, nVidia Titan RTX, Performance, PTX, Tesla K40, Tesla P100, Tesla V100

May 23, 2019 by hgpu

CUDA au Coq: A Framework for Machine-validating GPU Assembly Programs

Benjamin Ferrell, Jun Duan, Kevin W. Hamlen

Tags: Algorithms, Compilers, Computer science, CUDA, nVidia, PTX

May 15, 2019 by hgpu

* * *

Capturing the Memory Topology of GPUs

Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors

Research and Development of Porting SYCL on QNX Operating System for High Parallelism

Improving Performance and Energy Efficiency of GPUs through Locality Analysis

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs

SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing

Automatic Kernel Generation for Volta Tensor Cores

GEVO: GPU Code Optimization using Evolutionary Computation

Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs

Instructions’ Latencies Characterization for NVIDIA GPGPUs

CUDA au Coq: A Framework for Machine-validating GPU Assembly Programs
