
A Scalable Framework for Heterogeneous GPU-Based Clusters

Fengguang Song, Jack Dongarra, Stanimire Tomov
Innovative Computing Laboratory, University of Tennessee, Knoxville TN 37996-3450
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), 2012

@article{song2012scalable,
   title={A Scalable Framework for Heterogeneous GPU-Based Clusters},
   author={Song, F. and Dongarra, J. and Tomov, S.},
   year={2012}
}


GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance. However, there is little parallel software available that can efficiently utilize all CPU cores and all GPUs on a heterogeneous system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so that communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks and to transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [24] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also attain high performance on distributed-memory clusters without GPUs and on shared-system multi-GPU configurations.
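The idea of running a serial program to generate tasks and then firing them in dataflow order can be illustrated with a small sketch. This is not the authors' runtime: it assumes uniform tiles rather than hybrid-size ones, runs in a single process with no MPI or CUDA, and only records and orders the kernel calls of a tile Cholesky factorization instead of executing them. The function names (`generate_tasks`, `build_dependencies`, `fire_in_dataflow_order`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict, deque


def generate_tasks(nt):
    """Run the serial tile Cholesky loop nest and record each kernel call.

    Each task is (name, reads, writes); reads and writes are sets of tile
    coordinates (i, j) in the lower triangle. Written tiles are updated
    in place, so a write also implies a read of the same tile.
    """
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", set(), {(k, k)}))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", {(k, k)}, {(i, k)}))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", {(i, k)}, {(i, i)}))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", {(i, k), (j, k)}, {(i, j)}))
    return tasks


def build_dependencies(tasks):
    """Derive dependencies from the serial order of tile accesses."""
    last_writer = {}              # tile -> index of the last task that wrote it
    readers = defaultdict(list)   # tile -> tasks that read it since that write
    deps = [set() for _ in tasks]
    for t, (_, reads, writes) in enumerate(tasks):
        for tile in reads | writes:
            if tile in last_writer:          # read-after-write, write-after-write
                deps[t].add(last_writer[tile])
        for tile in writes:
            deps[t].update(readers[tile])    # write-after-read
        for tile in reads:
            readers[tile].append(t)
        for tile in writes:
            last_writer[tile] = t
            readers[tile] = []
    return deps


def fire_in_dataflow_order(tasks, deps):
    """Fire each task as soon as all of its inputs are available."""
    remaining = [len(d) for d in deps]
    successors = defaultdict(list)
    for t, d in enumerate(deps):
        for p in d:
            successors[p].append(t)
    ready = deque(t for t, n in enumerate(remaining) if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(tasks[t][0])
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return order


if __name__ == "__main__":
    tasks = generate_tasks(3)
    deps = build_dependencies(tasks)
    for name in fire_in_dataflow_order(tasks, deps):
        print(name)
```

In a distributed setting, the ready queue would be replaced by per-node schedulers and the tile reads by data transfers. The sketch shows only how dependencies fall out of the serial order of tile accesses.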
