A Scalable Framework for Heterogeneous GPU-Based Clusters


Fengguang Song, Jack Dongarra, Stanimire Tomov

Innovative Computing Laboratory, University of Tennessee, Knoxville TN 37996-3450

ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), 2012

@article{song2012scalable,
  title={A Scalable Framework for Heterogeneous GPU-Based Clusters},
  author={Song, F. and Dongarra, J. and Tomov, S.},
  year={2012}
}


GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance. However, there is little parallel software available that can efficiently utilize all CPU cores and all GPUs on a heterogeneous system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to resolve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [24] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also attain high performance on distributed-memory clusters without GPUs and on shared-system multi-GPU machines.
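The dataflow idea described in the abstract (run a serial program to generate tasks, then fire each task once its inputs are ready) can be illustrated with a small single-process sketch. This is not the authors' runtime system. It assumes uniform tiles instead of hybrid-size tasks, uses NumPy in place of CPU and GPU kernels, and omits MPI, CUDA, and the distributed task-assignment protocol. The function names generate_tasks, build_dependencies, and run_dataflow are hypothetical and chosen only for illustration.

```python
import numpy as np
from collections import defaultdict, deque

def generate_tasks(nt):
    """Walk the serial tile-Cholesky loop nest; emit (kernel, read_tiles, written_tile)."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", [], (k, k)))                  # factor diagonal tile in place
        for i in range(k + 1, nt):
            tasks.append(("TRSM", [(k, k)], (i, k)))         # solve panel tile
        for i in range(k + 1, nt):
            tasks.append(("SYRK", [(i, k)], (i, i)))         # update diagonal tile
            for j in range(k + 1, i):
                tasks.append(("GEMM", [(i, k), (j, k)], (i, j)))  # update off-diagonal tile
    return tasks

def build_dependencies(tasks):
    """Derive task-graph edges from tile accesses (read-after-write, write-after-write,
    write-after-read), in the order the serial program issued the tasks."""
    last_writer, readers = {}, defaultdict(list)
    succ, indeg = defaultdict(set), [0] * len(tasks)
    for t, (_, reads, write) in enumerate(tasks):
        deps = {last_writer[tile] for tile in reads if tile in last_writer}
        if write in last_writer:
            deps.add(last_writer[write])
        deps.update(readers[write])
        for d in deps:
            succ[d].add(t)
        indeg[t] = len(deps)
        for tile in reads:
            readers[tile].append(t)
        last_writer[write] = t
        readers[write] = []
    return succ, indeg

def run_dataflow(A, nb):
    """Factor SPD matrix A (size divisible by nb) by firing tasks as they become ready."""
    nt = A.shape[0] // nb
    L = A.copy()
    tile = lambda i, j: L[i * nb:(i + 1) * nb, j * nb:(j + 1) * nb]  # view into L
    tasks = generate_tasks(nt)
    succ, indeg = build_dependencies(tasks)
    ready = deque(t for t, d in enumerate(indeg) if d == 0)
    order = []
    while ready:
        t = ready.pop()  # any ready task may run; here the most recently enabled one
        kernel, reads, w = tasks[t]
        C = tile(*w)
        if kernel == "POTRF":
            C[:] = np.linalg.cholesky(C)
        elif kernel == "TRSM":
            C[:] = np.linalg.solve(tile(*reads[0]), C.T).T   # C <- C * Lkk^{-T}
        elif kernel == "SYRK":
            C -= tile(*reads[0]) @ tile(*reads[0]).T
        else:  # GEMM
            C -= tile(*reads[0]) @ tile(*reads[1]).T
        order.append(t)
        for s in succ[t]:            # completing t may enable its successors
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return np.tril(L), order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((8, 8))
    A = M @ M.T + 8 * np.eye(8)      # symmetric positive definite test matrix
    L, order = run_dataflow(A, nb=2)
    print("matches NumPy Cholesky:", np.allclose(L, np.linalg.cholesky(A)))
```

In this sketch the execution order is determined only by the tile dependencies, not by the order in which the serial loop issued the tasks, which is the property a dynamic scheduler can exploit to overlap computation with communication.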

Tags: Algorithms, Computer science, CUDA, Factorization, GPU cluster, Heterogeneous systems, Linear Algebra, MPI, nVidia, Tesla M2070

April 7, 2012 by hgpu


