UNICORN: A Bulk Synchronous Programming Model, Framework and Runtime for Hybrid CPU-GPU Clusters

Tarun Beri
Department of Computer Science and Engineering, Indian Institute of Technology Delhi
Indian Institute of Technology Delhi, 2016


@phdthesis{beri2016unicorn,
   title={UNICORN: A Bulk Synchronous Programming Model, Framework and Runtime for Hybrid CPU-GPU Clusters},
   author={Beri, Tarun},
   school={Indian Institute of Technology Delhi},
   year={2016}
}


Rapid evolution of graphics processing units (GPUs) into general-purpose computing devices has made them vital to high performance computing clusters. These computing environments consist of multiple nodes connected by a high-speed network such as InfiniBand, with each node comprising several multi-core processors and several many-core accelerators. The difficulty of programming hybrid CPU-GPU clusters often prevents software from exploiting their full computational power. This thesis addresses this difficulty and presents Unicorn, a novel parallel programming model for hybrid CPU-GPU clusters, along with the design and implementation of its runtime. In particular, this thesis demonstrates that efficient distributed shared-memory-style programming is possible, and that the simplicity of shared-memory-style programming can be retained across CPUs and GPUs in a cluster, minus the frustration of dealing with race conditions. Moreover, this can be done with a unified abstraction, avoiding much of the complication of dealing with hybrid architectures. This is achieved with the help of transactional semantics, deferred bulk data synchronization, subtask pipelining and various communication and computation scheduling optimizations. Unicorn provides a bulk synchronous programming model with a global address space. It schedules concurrent tasks of a program in an architecture- and topology-oblivious manner. It hides the network and exposes CPUs and accelerators loosely as bulk synchronous computing units with logical phases, respectively, of local computation and communication. Each task is further decomposed into coarse-grained, concurrently executable subtasks that Unicorn schedules transparently onto available CPU and GPU devices in the cluster.
Subtasks employ transactional memory semantics to access and synchronize data, i.e., they check out a private view of the global shared memory before their local computation phase and check in to the global shared memory afterwards, optionally resolving conflicting writes in a reduction step. Unicorn's main design goals are easy programmability and a deterministic parallel execution environment. Device, node and cluster management are completely handled by the runtime; no such API is exposed to the application programmer. Load balancing, scheduling and scalability are also fully transparent to the application code. Application programs do not change from cluster to cluster to maintain efficiency; rather, Unicorn adapts the execution to the set of available devices, the network and their dynamic load. Application code is oblivious to data placement within the cluster as well as to changes in network interfaces and data availability patterns. Unicorn's programming model, being deterministic, eliminates data races and deadlocks. To provide efficiency, Unicorn's runtime employs several optimizations. These include prefetching task data and pipelining subtasks in order to overlap their communication with computation. Unicorn employs pipelining at two levels: first to hide data transfer costs among cluster nodes, and second to hide DMA communication costs between CPUs and GPUs on each node. Among other optimizations, Unicorn's work-stealing based scheduler employs a two-level victim selection technique to reduce the overhead of steal operations. Further, it employs a proactive and aggressive stealing mechanism to prevent these pipelines from stalling during a steal operation.
To prevent a subtask running on a slow device, or on a device behind a slow network or I/O link, from becoming a bottleneck for the entire task, Unicorn reassesses its scheduling decisions at runtime and schedules a duplicate instance of a straggling subtask on a potentially faster device. Unicorn also employs a software LRU cache at every GPU in the cluster to prevent data shared between subtasks from being DMA'ed more than once. To further boost GPU performance, Unicorn makes aggressive use of CUDA streams and schedules multiple subtasks for simultaneous execution. To evaluate the design and implementation of Unicorn, we parallelize several coarse-grained scientific workloads using Unicorn. We study the scalability and performance of these benchmarks, as well as the response of Unicorn's runtime under stress tests such as changing the input data availability. We also study the load balancing achieved in these experiments and the amount of time the runtime spends in communication. We find that parallelizing coarse-grained applications like matrix multiplication or 2D FFT with our system requires only about 30 lines of C code to set up the runtime; the rest of the application code is a regular single-CPU/GPU implementation. This indicates the ease of extending sequential code to a parallel environment. The execution is efficient as well. Using GPUs only, when multiplying two square matrices of size 65536 x 65536, Unicorn achieves a peak performance of 7.81 TFlop/s over 28 Tesla M2070 GPUs (1.03 TFlop/s theoretical peak per GPU) of our 14-node cluster, with subtasks of size 4096 x 4096. In comparison, CUPLAPACK [28], a linear algebra package specifically coded and optimized from scratch, reports 8 TFlop/s while multiplying two square matrices of size 62000 x 62000 using 32 Quadro FX 5800 GPUs (0.624 TFlop/s theoretical peak per GPU) of a 16-node cluster connected via QDR InfiniBand.
Fine-grained applications, however, may not fit our system as efficiently. Such applications often require frequent communication of small data, which runs counter to our bulk synchronous design; more advanced optimizations may be needed before such applications can run profitably on our system.

HGPU group © 2010-2023 hgpu.org
