
Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

Arpan Jain
The Ohio State University
The Ohio State University, 2023

@phdthesis{jain2023novel,
   title={Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems},
   author={Jain, Arpan},
   year={2023},
   school={The Ohio State University}
}

Deep learning has achieved state-of-the-art performance in several artificial intelligence tasks like object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs). Its rise can be attributed to the availability of large datasets and computational power. Large-scale Deep Neural Networks (DNNs) can provide state-of-the-art performance by learning complex relationships, enabling them to push the boundaries of artificial intelligence. However, training such large-scale DNNs is a compute-intensive task: these models can have billions of parameters, which increases both the memory and computational requirements of training. Hence, distributed DNN training has become the default approach for training large-scale DNNs like AmoebaNet, GPT-3, and T5.

Broadly, the DNN training pipeline can be divided into three phases: 1) Data Loading and Data Augmentation, 2) Forward/Backward Pass, and 3) Model Validation. Traditionally, these phases are executed sequentially on a single CPU or GPU due to a lack of additional resources. Multiple processing elements can be used to parallelize the computation in each phase and reduce the overall training time. In this dissertation, we propose novel parallelization strategies for distributed DNN training that alleviate bottlenecks in the different phases of the pipeline and parallelize the computation across multiple processing elements. Such strategies must be designed carefully: naive parallelization may yield no performance benefit because of high communication overhead, so the work must be distributed while keeping communication costs low.

There are several challenges in the existing DNN training pipeline. Data loading/augmentation and model validation can account for up to 20% of the overall training time, making the training of large-scale DNNs time-consuming. We therefore propose a new parallelization scheme that uses the computing power of NVIDIA's recently released Data Processing Units (DPUs) to offload the data loading and model validation phases and accelerate Data Parallelism.

The forward and backward passes remain the most compute-intensive phase of the DNN training pipeline. Increasing the number of layers and parameters to achieve better accuracy has become a common approach in deep learning: in the last couple of years, DNNs such as AmoebaNet, T5, and GPT-3 have pushed the boundaries of parameter and layer counts. However, computation and memory requirements grow with the number of layers and parameters, so these models cannot be trained efficiently on a single processing element. Broadly, large-scale DNNs fall into two categories: 1) in-core models (DNNs that fit inside the memory of a single processing element) and 2) out-of-core models (DNNs that are too large to fit inside the memory of a single processing element). Technically, an in-core model can be trained on a single processing element, but the training time becomes prohibitively long, making such training impractical.
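
To make the data-parallel baseline discussed above concrete, the following is a minimal sketch of one synchronous data-parallel training step, in which every processing element works on its own shard of the mini-batch and gradients are averaged before the update. It assumes PyTorch with an already initialized torch.distributed process group; the function and variable names are illustrative and are not taken from the dissertation.

import torch.distributed as dist

def data_parallel_step(model, optimizer, loss_fn, inputs, targets):
    # Forward and backward pass on this rank's local shard of the mini-batch.
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Average gradients across all ranks so every replica applies the same update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
    optimizer.step()
    return loss.item()

The per-step allreduce in this loop is the kind of communication overhead that grows with model size and process count, which is why naive scaling alone may yield little benefit.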
For such models, we need parallelization strategies that accelerate training by exploiting the inherent parallelism of the DNN architecture. Moreover, because of the limited memory of modern accelerators like GPUs, several large-scale DNNs cannot be trained on a single processing element at all, so the layers/neurons must be distributed among multiple processing elements to reduce the memory requirement on each one. In this dissertation, we propose several novel parallelization strategies that alleviate the current bottlenecks in the different phases of the DNN training pipeline and reduce the overall training time. The key idea is to develop a custom parallelization strategy for each type of DNN architecture so that it can exploit the architecture's inherent parallelism and computation pattern, while remaining generic enough to apply to a large number of deep learning models. We have developed several such strategies, including Data Sub-Graph Parallelism, Bi-Directional Parallelism, and Hybrid Five-Dimensional Parallelism, which accelerate DNN training for in-core models, out-of-core models, and out-of-core layers, respectively. The developed strategies are evaluated on large-scale GPU systems and are made available as public software releases or published papers.
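
As an illustration of the layer-distribution idea behind the out-of-core strategies, the sketch below splits a model's layers across two GPUs so that neither device has to hold the full parameter set, with activations moved between devices during the forward pass. It assumes PyTorch and two visible CUDA devices; the class name and layer sizes are illustrative and not taken from the dissertation.

import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    # Naive layer-wise model parallelism: the first group of layers lives on
    # GPU 0 and the remaining layers on GPU 1, so no single device needs to
    # hold all of the model's parameters.
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1000)).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))      # compute the first stage on GPU 0
        return self.stage1(x.to("cuda:1"))   # move activations to GPU 1 and finish there

A drawback of this naive split is that only one GPU is active at a time; hybrid schemes like the strategies summarized above aim to keep multiple processing elements busy while limiting the communication between them.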