swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight
Tsinghua University, National Supercomputing Center in Wuxi
arXiv:1903.06934 [cs.DC], 16 Mar 2019
@misc{fang2019swcaffe,
  title={swCaffe: a Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight},
  author={Jiarui Fang and Liandeng Li and Haohuan Fu and Jinlei Jiang and Wenlai Zhao and Conghui He and Xin You and Guangwen Yang},
  year={2019},
  eprint={1903.06934},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, one of the fastest supercomputers in the world, which adopts a unique many-core heterogeneous architecture with 40,960 SW26010 processors connected through a customized communication network. First, we present a set of design principles for fully exploiting the performance of this many-core architecture. Second, we propose optimization strategies for redesigning a variety of neural network layers on top of Caffe. Third, we put forward a topology-aware parameter synchronization scheme that scales synchronous Stochastic Gradient Descent (SGD) efficiently across many processors. We evaluate our framework by training a variety of widely used neural networks on the ImageNet dataset. On a single node, swCaffe achieves 23%–119% of the performance of Caffe running on an NVIDIA K40m GPU, depending on the network. Compared with Caffe on CPU, swCaffe runs 3.04x–7.84x faster across all networks. Finally, we present the scalability of swCaffe for training ResNet-50 and AlexNet at the scale of 1,024 nodes.
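For context on the parameter synchronization step mentioned above, the following is a minimal sketch of synchronous SGD with allreduce-based gradient averaging, the general pattern that a topology-aware synchronization scheme optimizes. It is not the paper's actual implementation: the use of mpi4py/NumPy and the local_gradient placeholder are assumptions made purely for illustration.

# Minimal synchronous-SGD sketch (illustrative only, not swCaffe code).
# Each MPI rank computes a gradient on its local mini-batch shard, the
# gradients are summed with Allreduce and averaged, and every rank applies
# the identical update so model replicas stay in sync.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

def local_gradient(params, batch):
    # Hypothetical placeholder: gradient of the loss on this rank's shard.
    return np.zeros_like(params)

params = np.zeros(1_000_000, dtype=np.float32)  # flattened model parameters
lr = 0.01

for step in range(100):
    batch = None                                # each rank loads its own shard
    grad = local_gradient(params, batch)

    # Synchronous step: sum gradients across all ranks, then average.
    global_grad = np.empty_like(grad)
    comm.Allreduce(grad, global_grad, op=MPI.SUM)
    global_grad /= world_size

    # Identical update on every rank keeps parameters consistent.
    params -= lr * global_grad

In practice, the cost of the Allreduce dominates at large node counts, which is why mapping the reduction onto the machine's actual network topology (as the paper's scheme does for TaihuLight's customized interconnect) matters for scaling to 1,024 nodes.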
March 24, 2019 by hgpu