High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms
Center for Comprehensive Informatics, Emory University, Atlanta, GA 30322
arXiv:1209.3332 [cs.DC] (14 Sep 2012)
@article{2012arXiv1209.3332T,
author={Teodoro, George and Pan, Tony and Kurc, Tahsin M. and Kong, Jun and Cooper, Lee A. D. and Saltz, Joel H.},
title={High-throughput Execution of Hierarchical Analysis Pipelines on Hybrid Cluster Platforms},
journal={ArXiv e-prints},
archivePrefix={"arXiv"},
eprint={1209.3332},
primaryClass={"cs.DC"},
keywords={Distributed, Parallel, and Cluster Computing; Systems and Control},
year={2012},
month={sep}
}
We propose, implement, and experimentally evaluate a runtime middleware to support high-throughput execution on hybrid cluster machines of large-scale analysis applications. A hybrid cluster machine consists of computation nodes which have multiple CPUs and general purpose graphics processing units (GPUs). Our work targets scientific analysis applications in which datasets are processed in application-specific data chunks, and the processing of a data chunk is expressed as a hierarchical pipeline of operations. The proposed middleware system combines a bag-of-tasks style execution with coarse-grain dataflow execution. Data chunks and associated data processing pipelines are scheduled across cluster nodes using a demand driven approach, while within a node operations in a given pipeline instance are scheduled across CPUs and GPUs. The runtime system implements several optimizations, including performance aware task scheduling, architecture aware process placement, data locality conscious task assignment, and data prefetching and asynchronous data copy, to maximize utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. The application and performance benefits of the runtime middleware are demonstrated using an image analysis application, which is employed in a brain cancer study, on a state-of-the-art hybrid cluster in which each node has two 6-core CPUs and three GPUs. Our results show that implementing and scheduling application data processing as a set of fine-grain operations provide more opportunities for runtime optimizations and attain better performance than a coarser-grain, monolithic implementation. The proposed runtime system can achieve high-throughput processing of large datasets – we were able to process an image dataset consisting of 36,848 4Kx4K-pixel image tiles at about 150 tiles/second rate on 100 nodes.
September 18, 2012 by hgpu