high performance computing on graphics processing units: hgpu.org

hgpu.org » paper » dCUDA: hardware supported overlap of computation and communication

dCUDA: hardware supported overlap of computation and communication

Tobias Gysi, Jeremia Bär, Torsten Hoefler

SC ’16 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016

BibTeX

Download (PDF)

View

Source

2358

views

Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to make best use of the available hardware resources. With dCUDA, we apply latency hiding at cluster scale to automatically overlap computation and communication. Our benchmarks demonstrate perfect overlap for memory bandwidth-bound tasks and good overlap for compute-bound tasks.

Tags: Distributed computing, latency hiding, programming model, remote memory access

November 29, 2016 by gysit

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

dCUDA: hardware supported overlap of computation and communication

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

dCUDA: hardware supported overlap of computation and communication

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)