high performance computing on graphics processing units: hgpu.org

FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem

Nick Iliev, Amit R Trivedi

Tags: CNN, Computer science, Deep learning, FPGA, Matrix decomposition, Neural networks, nVidia, nVidia Jetson AGX Xavier, Performance

February 13, 2022 by hgpu

Performance prediction of deep learning applications training in GPU as a service systems

Marco Lattuada, Eugenio Gianniti, Danilo Ardagna, Li Zhang

Tags: Computer science, Deep learning, Machine learning, Neural networks, nVidia, nVidia GeForce GTX 1080 Ti, Performance, Tesla K80, Tesla M60

January 30, 2022 by hgpu

Reusing Auto-Schedules for Efficient DNN Compilation

Perry Gibson, José Cano

Tags: Algorithms, Computer science, Machine learning, Neural networks, Performance, Programming Languages

January 23, 2022 by hgpu

Building a Performance Model for Deep Learning Recommendation Model Training on GPUs

Zhongyi Lin, Louis Feng, Ehsan K. Ardestani, Jaewon Lee, John Lundell, Changkyu Kim, Arun Kejariwal, John D. Owens

Tags: Algorithms, Computer science, CUDA, Deep learning, Machine learning, nVidia, nVidia GeForce GTX Titan XP, Package, Performance, Tesla P100, Tesla V100

January 23, 2022 by hgpu

Fancier: A Unified Framework for Java, C, and OpenCL Integration

Sergio Afonso, Francisco Almeida

Tags: Computer science, Image processing, Java, OpenACC, OpenCL, Package, Performance

January 16, 2022 by hgpu

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

Tags: Compilers, Computer science, CUDA, nVidia, Package, Performance, Tesla V100

January 16, 2022 by hgpu

Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment

Hulin Dai, Xuan Peng, Xuanhua Shi, Ligang He, Qian Xiong, Hai Jin

Tags: Benchmarking, Computer science, CUDA, Deep learning, Neural networks, nVidia, Performance, PyTorch, TensorFlow, Tesla P100

January 9, 2022 by hgpu

A Variant of Concurrent Constraint Programming on GPU

Pierre Talbot, Frederic Pinel, Pascal Bouvry

Tags: Computer science, CUDA, nVidia, nVidia Quadro RTX 5000, Package, Performance

January 2, 2022 by hgpu

PROGRAML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations

Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle, Hugh Leather

Tags: Compilers, Computer science, Flow analysis, Machine learning, OpenCL, Package, Performance

January 2, 2022 by hgpu

Optimization of Compiler-generated OpenCL CNN Kernels and Runtime for FPGAs

Seung-Hun Chung

Tags: Computer science, FPGA, Neural networks, nVidia, nVidia GeForce GTX 1060, OpenCL, Performance, Thesis

December 19, 2021 by hgpu

Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA

Christopher Noel Hesse

Tags: Benchmarking, Computer science, Deep learning, FPGA, Neural networks, nVidia, nVidia DGX-1, nVidia GeForce RTX 2070, nVidia Jetson Nano, Performance, Tesla V100

December 5, 2021 by hgpu

Bayesian Optimization for auto-tuning GPU kernels

Floris-Jan Willemsen, Rob van Nieuwpoort, Ben van Werkhoven

Tags: Bayesian, Computer science, CUDA, Machine learning, nVidia, nVidia A100, nVidia GeForce GTX Titan X, nVidia GeForce RTX 2070, OpenCL, Package, Performance

December 5, 2021 by hgpu
