Merge: a programming model for heterogeneous multi-core systems
Dept. of Electrical Engineering, Stanford University, Stanford, CA, USA
In ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems (2008), pp. 287-296
@article{linderman2008merge,
title={Merge: a programming model for heterogeneous multi-core systems},
author={Linderman, M.D. and Collins, J.D. and Wang, H. and Meng, T.H.},
journal={ACM SIGOPS Operating Systems Review},
volume={42},
number={2},
pages={287--296},
issn={0163-5980},
year={2008},
publisher={ACM}
}
In this paper we propose the Merge framework, a general purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency. The Merge framework provides (1) a predicate dispatch-based library system for managing and invoking function variants for multiple architectures; (2) a high-level, library-oriented parallel language based on map-reduce; and (3) a compiler and runtime which implement the map-reduce language pattern by dynamically selecting the best available function implementations for a given input and machine configuration. Using a generic sequencer architecture interface for heterogeneous accelerators, the Merge framework can integrate function variants for specialized accelerators, offering the potential for to-the-metal performance for a wide range of heterogeneous architectures, all transparent to the user. The Merge framework has been prototyped on a heterogeneous platform consisting of an Intel Core 2 Duo CPU and an 8-core 32-thread Intel Graphics and Media Accelerator X3000, and a homogeneous 32-way Unisys SMP system with Intel Xeon processors. We implemented a set of benchmarks using the Merge framework and enhanced the library with X3000 specific implementations, achieving speedups of 3.6x to 8.5x using the X3000 and 5.2x to 22x using the 32-way system relative to the straight C reference implementation on a single IA32 core.
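As a rough illustration of the predicate-dispatch idea in the abstract, the sketch below registers several variants of one function and picks the first whose predicate accepts the current input and machine configuration. This is not Merge's actual API or syntax: the names `variant`, `dispatch`, `run`, and `sum_squares`, the set-of-strings machine description, and the size threshold are all hypothetical, and the "accelerator" variant is an ordinary Python stand-in rather than real accelerator code.

```python
from functools import reduce

# Hypothetical registry: function name -> list of (predicate, target, implementation).
_VARIANTS = {}


def variant(name, target, predicate):
    """Register an implementation of `name` for `target`, guarded by `predicate`."""
    def register(fn):
        _VARIANTS.setdefault(name, []).append((predicate, target, fn))
        return fn
    return register


def dispatch(name, data, machine):
    """Return the first registered variant whose predicate accepts (data, machine)."""
    for predicate, target, fn in _VARIANTS[name]:
        if predicate(data, machine):
            return target, fn
    raise LookupError(f"no variant of {name!r} matches")


# Assumed rule: use the accelerator variant only if the machine has one
# and the input is large enough to be worth offloading (threshold is made up).
@variant("sum_squares", "accelerator",
         lambda data, machine: "accelerator" in machine and len(data) >= 4)
def sum_squares_accel(data):
    # Stand-in for a specialized kernel; plain Python here.
    return sum(x * x for x in data)


# Generic fallback, written as an explicit map followed by a reduce.
@variant("sum_squares", "cpu", lambda data, machine: True)
def sum_squares_cpu(data):
    return reduce(lambda acc, x: acc + x, map(lambda x: x * x, data), 0)


def run(name, data, machine):
    """Select a variant for this input and machine, then invoke it."""
    target, fn = dispatch(name, data, machine)
    return target, fn(data)
```

Calling `run("sum_squares", [1, 2, 3, 4], {"cpu", "accelerator"})` selects the accelerator variant, while the same call with a three-element list or with a machine description of `{"cpu"}` falls back to the CPU variant. The caller never names a target, which is the sense in which the abstract describes variant selection as transparent to the user.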
December 10, 2010 by hgpu