A Scalable Framework for Heterogeneous GPU-Based Clusters

Fengguang Song, Jack Dongarra, Stanimire Tomov

Innovative Computing Laboratory, University of Tennessee, Knoxville TN 37996-3450

ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012), 2012

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance. However, there is little parallel software available that can efficiently utilize all CPU cores and all GPUs on a heterogeneous system. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to resolve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [24] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also attain high performance on distributed-memory clusters without GPUs and on shared-system multi-GPUs.
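
The abstract's idea of running a serial program to generate tasks and then firing them in dataflow order can be illustrated with a small single-process sketch. This is not the authors' runtime: it uses uniform tiles rather than hybrid-size tasks, involves no MPI or CUDA, and the function names (`generate_tasks`, `build_dependencies`, `fire_in_waves`) are hypothetical names chosen for illustration. It assumes the standard right-looking tile Cholesky loop nest, records which tiles each task reads and writes, derives dependencies from those sets, and then releases every task whose inputs are ready.

```python
from collections import defaultdict


def generate_tasks(nt):
    """Run the serial tile Cholesky loop nest on an nt x nt tile grid and
    record each task as (name, tiles_read, tile_written). No math is done."""
    tasks = []
    for k in range(nt):
        tasks.append((f"POTRF({k})", [(k, k)], (k, k)))
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k), (i, k)], (i, k)))
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{k})", [(i, k), (i, i)], (i, i)))
            for j in range(k + 1, i):
                tasks.append(
                    (f"GEMM({i},{j},{k})", [(i, k), (j, k), (i, j)], (i, j))
                )
    return tasks


def build_dependencies(tasks):
    """For each task, collect the ids of earlier tasks it must wait for."""
    last_writer = {}               # tile -> id of the last task that wrote it
    readers = defaultdict(list)    # tile -> ids that read it since that write
    deps = []
    for tid, (_, reads, write) in enumerate(tasks):
        wait_for = set()
        for tile in reads:         # read-after-write (covers write-after-write,
            if tile in last_writer:  # since the written tile is also read)
                wait_for.add(last_writer[tile])
        wait_for.update(readers[write])  # write-after-read
        deps.append(wait_for)
        for tile in reads:
            if tile != write:
                readers[tile].append(tid)
        last_writer[write] = tid
        readers[write] = []
    return deps


def fire_in_waves(tasks, deps):
    """Repeatedly fire every task whose dependencies have all completed.
    Tasks within one wave are independent and could run concurrently."""
    done, waves = set(), []
    remaining = set(range(len(tasks)))
    while remaining:
        ready = sorted(t for t in remaining if deps[t] <= done)
        waves.append([tasks[t][0] for t in ready])
        done.update(ready)
        remaining.difference_update(ready)
    return waves


if __name__ == "__main__":
    tasks = generate_tasks(3)
    for step, wave in enumerate(fire_in_waves(tasks, build_dependencies(tasks))):
        print(step, wave)
```

In a distributed setting, each wave member would be dispatched to whichever node owns the tile it writes, and the tiles it reads would be sent over the network or PCI-Express first. That is the part where the partitioning method and communication threads described in the abstract come in, and it is outside the scope of this sketch.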

Tags: Algorithms, Computer science, CUDA, Factorization, GPU cluster, Heterogeneous systems, Linear Algebra, MPI, nVidia, Tesla M2070

April 7, 2012 by hgpu
