A Scalable Framework for Heterogeneous GPU-Based Clusters
Innovative Computing Laboratory, University of Tennessee, Knoxville TN 37996-3450
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), 2012
@article{song2012scalable,
title={A Scalable Framework for Heterogeneous GPU-Based Clusters},
author={Song, F. and Dongarra, J. and Tomov, S.},
year={2012}
}
GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users because of their high energy efficiency and much improved single-node computational performance. However, little parallel software is available that can efficiently utilize all CPU cores and all GPUs on a heterogeneous system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system that schedules tasks and transparently transfers data between hybrid CPU-GPU compute nodes. The runtime system employs a novel distributed task-assignment protocol to resolve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [24] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also attain high performance on distributed-memory clusters without GPUs and on shared-system multi-GPU machines.
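The idea of running a serial program to generate tasks and then firing them in dataflow order can be illustrated with a small sketch. The Python code below is not the authors' runtime system: it is a single-process toy that assumes uniform tiles (no hybrid tile sizes, no GPUs, no MPI), and the function names `generate_cholesky_tasks`, `build_dependencies`, and `dataflow_order` are hypothetical names chosen for illustration. It records the kernel calls of a standard tile Cholesky loop nest as tasks, infers dependencies from the tiles each task reads and writes, and releases a task once all of its predecessors have finished.

```python
from collections import defaultdict, deque


def generate_cholesky_tasks(nt):
    """Run the serial tile-Cholesky loop nest over an nt x nt tile grid and
    record each kernel call as a task (name, tiles read, tiles updated)."""
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [], [(k, k)]))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], [(i, k)]))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k)], [(i, i)]))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j},{k})", [(i, k), (j, k)], [(i, j)]))
    return tasks


def build_dependencies(tasks):
    """Infer dependencies from tile accesses in serial program order.
    Updated tiles are treated as read-modify-write."""
    last_writer = {}
    readers = defaultdict(list)   # readers of a tile since its last write
    deps = [set() for _ in tasks]
    for t, (_, reads, writes) in enumerate(tasks):
        for tile in reads:
            if tile in last_writer:
                deps[t].add(last_writer[tile])    # read after write
            readers[tile].append(t)
        for tile in writes:
            if tile in last_writer:
                deps[t].add(last_writer[tile])    # write after write
            deps[t].update(readers[tile])         # write after read
            last_writer[tile] = t
            readers[tile] = []
        deps[t].discard(t)
    return deps


def dataflow_order(tasks, deps):
    """Fire tasks as soon as all of their predecessors have completed."""
    remaining = [len(d) for d in deps]
    successors = defaultdict(list)
    for t, d in enumerate(deps):
        for p in d:
            successors[p].append(t)
    ready = deque(t for t, n in enumerate(remaining) if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return order


# Example: a 3 x 3 tile grid.
tasks = generate_cholesky_tasks(3)
deps = build_dependencies(tasks)
for t in dataflow_order(tasks, deps):
    print(tasks[t][0])
```

In a distributed setting, the ready queue would be replaced by per-node queues and the completion notifications by messages between nodes; the sketch only shows how dependencies fall out of the serial task stream.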
April 7, 2012 by hgpu