
Communication-minimizing Asynchronous Tensor Parallelism

Siddharth Singh, Zack Sating, Abhinav Bhatele
Department of Computer Science, University of Maryland, College Park, USA
arXiv:2305.13525 [cs.LG], (22 May 2023)

@misc{singh2023communicationminimizing,
   title={Communication-minimizing Asynchronous Tensor Parallelism},
   author={Siddharth Singh and Zack Sating and Abhinav Bhatele},
   year={2023},
   eprint={2305.13525},
   archivePrefix={arXiv},
   primaryClass={cs.LG}
}


As state-of-the-art neural networks scale to billions of parameters, designing parallel algorithms that can train these networks efficiently on multi-GPU clusters has become critical. This paper presents Tensor3D, a novel three-dimensional (3D) approach to parallelizing tensor computations that strives to minimize the idle time incurred due to communication in parallel training of large multi-billion parameter models. First, we introduce an intelligent distribution of neural network parameters across GPUs that eliminates communication required for satisfying data dependencies of individual layers. Then, we propose a novel overdecomposition of the parallel training process, using which we achieve significant overlap of communication with computation, thereby reducing GPU idle time. Finally, we present a communication model, which helps users identify communication-optimal decompositions of available hardware resources for a given neural network. For a 28B parameter CNN on 256 A100 GPUs, Tensor3D improves the training time by nearly 60% as compared to Megatron-LM.
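The sketch below illustrates the general idea of overlapping communication with computation via overdecomposition, as described in the abstract, using PyTorch's asynchronous collectives. It is a minimal illustration under assumptions, not the authors' Tensor3D implementation: the chunking scheme, the use of all-reduce, and the function name `forward_with_overlap` are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): split a layer's work into
# chunks so that the collective for one chunk runs in the background while
# the matmul for the next chunk executes, reducing GPU idle time.
import torch
import torch.distributed as dist

def forward_with_overlap(chunks, weight, process_group):
    """Overdecomposed forward pass: launch each chunk's all-reduce
    asynchronously and only wait when its result is actually needed."""
    pending = []
    for x in chunks:
        y = x @ weight                                    # local computation for this chunk
        work = dist.all_reduce(y, group=process_group,
                               async_op=True)             # communication proceeds in background
        pending.append((y, work))                         # next chunk's matmul overlaps with it
    outputs = []
    for y, work in pending:
        work.wait()                                       # block only at the point of use
        outputs.append(y)
    return torch.cat(outputs, dim=0)

if __name__ == "__main__":
    # Launch with e.g.: torchrun --nproc_per_node=2 overlap_sketch.py
    dist.init_process_group(backend="gloo")
    chunks = list(torch.randn(8, 64).chunk(4, dim=0))     # overdecompose the batch into 4 chunks
    weight = torch.randn(64, 64)
    out = forward_with_overlap(chunks, weight, dist.group.WORLD)
    print(out.shape)
```

In this toy version the overlap comes purely from issuing non-blocking collectives; Tensor3D's actual scheme additionally restructures how parameters are distributed in 3D so that per-layer data-dependency communication is eliminated, which this sketch does not attempt to model.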
