high performance computing on graphics processing units: hgpu.org

Manuel López-Martínez, Germán Díaz-Flórez, Santiago Villagrana-Barraza, Luis O. Solís-Sánchez, Héctor A. Guerrero-Osuna, Genaro M. Soto-Zarazúa, Carlos A. Olvera-Olvera

Tags: CUDA, Deep learning, Distributed computing, HPC, Image processing, Neural networks, nVidia, nVidia GeForce GTX 1050 Ti

June 4, 2023 by hgpu

ARK: GPU-driven Code Execution for Distributed Deep Learning

Changho Hwang, KyoungSoo Park, Ran Shu, Xinyuan Qu, Peng Cheng, Yongqiang Xiong

Tags: Computer science, Deep learning, Distributed computing, nVidia, nVidia V100

March 12, 2023 by hgpu

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication

Jinfan Chen, Shigang Li, Ran Guo, Jinhui Yuan, Torsten Hoefler

Tags: Computer science, CUDA, Deep learning, Distributed computing, nVidia, nVidia A100, Package, Tesla P100

January 22, 2023 by hgpu

Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments

Nina Herrmann, Herbert Kuchen

Tags: Computer science, CUDA, Distributed computing, Heterogeneous systems, nVidia, nVidia GeForce GTX 750 Ti, nVidia GeForce RTX 2080 Ti, nVidia Quadro K620

January 15, 2023 by hgpu

Kernel-as-a-Service: A Serverless Interface to GPUs

Nathan Pemberton, Anton Zabreyko, Zhoujie Ding, Randy Katz, Joseph Gonzalez

Tags: Cloud, Computer science, CUDA, Deep learning, Distributed computing, nVidia, Package, Tesla V100

December 25, 2022 by hgpu

SIGMo: Scalable Isomorphism Graph Matching on GPUs

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

DGEMM without FP64 Arithmetic - using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

GEAK-agent: LLM-based AI agent, which can write correct and efficient GPU kernels automatically

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

OpenDwarfs 2025: re-engineered version of the OpenDwarfs benchmark suite, for compatibility with modern platforms

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

Specx: Speculative task-based runtime system

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

* * *

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Validation of GPU Computation in Decentralized, Trustless Networks

Development of a new framework for high performance volunteer computing

Composing Distributed Computations Through Task and Kernel Fusion

HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

Redco: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability

A High-Performance Computing Cluster for Distributed Deep Learning: A Practical Case of Weed Classification Using Convolutional Neural Network Models

ARK: GPU-driven Code Execution for Distributed Deep Learning

AutoDDL: Automatic Distributed Deep Learning with Asymptotically Optimal Communication

Distributed Calculations with Algorithmic Skeletons for Heterogeneous Computing Environments
