high performance computing on graphics processing units: hgpu.org


Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs

Lena Oden, Benjamin Klenk, Holger Froning

Fraunhofer Institute for Industrial Mathematics, Competence Center High Performance Computing, Kaiserslautern, Germany

Fraunhofer Institute for Industrial Mathematics, 2014

@inproceedings{oden2014energy,
  title={Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs},
  author={Oden, Lena and Klenk, Benjamin and Froning, Holger},
  booktitle={Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on},
  pages={483--492},
  year={2014},
  organization={IEEE}
}


GPUs have become highly popular in High Performance Computing because of their massive parallelism and high performance per watt. Despite this popularity, data transfer between multiple GPUs in a cluster remains a problem. Most communication models require the CPU to control the data flow, and intermediate staging copies to host memory are often unavoidable. Both factors raise CPU and memory utilization; as a result, overall performance decreases and power consumption increases. Collective operations such as reduce and allreduce are very common in scientific simulations and are highly performance-sensitive. Thanks to their massive parallelism, GPUs are well suited to such operations, but they only excel if they can process the problem in-core. Global GPU Address Spaces (GGAS) enable direct GPU-to-GPU communication in heterogeneous clusters; this is fully in line with the GPU's thread-collective execution model and requires neither CPU assistance nor staging copies in host memory. As we will see, GGAS helps to process collective operations among distributed GPUs in-core. In this paper, we introduce the implementation and optimization of collective reduce and allreduce operations using GGAS as the communication model. Compared to message passing, we obtain a speedup of 1.7x for small data sizes. A detailed analysis based on power measurements of the CPU, host memory, and GPU reveals that GGAS as a communication model not only saves cycles but also dramatically reduces power and energy consumption. For instance, an allreduce operation can save half of the energy through reduced power consumption combined with lower run time.
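For readers unfamiliar with the two collectives named in the abstract, the following is a minimal sketch of their semantics only. It simulates ranks as plain Python lists on the CPU and does not reproduce GGAS or the authors' implementation. The function names `reduce_to_root` and `allreduce` are hypothetical, and the recursive-doubling exchange pattern is a standard textbook scheme assumed here for illustration, with a power-of-two rank count and a commutative operator.

```python
from operator import add


def reduce_to_root(buffers, op=add):
    """Combine per-rank vectors elementwise; only the root would hold the result."""
    result = list(buffers[0])
    for buf in buffers[1:]:
        result = [op(a, b) for a, b in zip(result, buf)]
    return result


def allreduce(buffers, op=add):
    """Recursive-doubling allreduce over simulated ranks (assumed power of two).

    Returns one vector per rank; after the final step every rank holds
    the same fully combined vector.
    """
    n = len(buffers)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two rank count"
    state = [list(b) for b in buffers]
    step = 1
    while step < n:
        # Each rank r pairs with rank r XOR step and combines both vectors.
        state = [
            [op(a, b) for a, b in zip(state[r], state[r ^ step])]
            for r in range(n)
        ]
        step *= 2
    return state


if __name__ == "__main__":
    ranks = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(reduce_to_root(ranks))
    print(allreduce(ranks))
```

With n ranks, the recursive-doubling loop runs log2(n) exchange steps, which is why low per-message latency matters so much for small data sizes.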

Tags: Computer science, CUDA, Distributed computing, Energy-efficient computing, Heterogeneous systems, nVidia, Tesla K20

July 17, 2014 by hgpu
