An OpenCL framework for heterogeneous multicores with local memory
Seoul National University, Seoul, South Korea
In PACT ’10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques (2010), pp. 193-204
@conference{lee2010opencl,
title={An OpenCL framework for heterogeneous multicores with local memory},
author={Lee, J. and Kim, J. and Seo, S. and Kim, S. and Park, J. and Kim, H. and Dao, T.T. and Cho, Y. and Seo, S.J. and Lee, S.H. and others},
booktitle={Proceedings of the 19th international conference on Parallel architectures and compilation techniques},
pages={193--204},
year={2010},
organization={ACM}
}
In this paper, we present the design and implementation of an Open Computing Language (OpenCL) framework that targets heterogeneous accelerator multicore architectures with local memory. The architecture consists of a general-purpose processor core and multiple accelerator cores that typically do not have any cache. Instead, each accelerator core has a small internal local memory. To overcome the limited size of the local memory, our OpenCL runtime is based on software-managed caches and coherence protocols that guarantee OpenCL memory consistency. To boost performance, the runtime relies on three source-code transformation techniques performed by our OpenCL C source-to-source translator: work-item coalescing, web-based variable expansion, and preload-poststore buffering. Work-item coalescing serializes multiple SPMD-like tasks that execute concurrently in the presence of barriers and runs them sequentially on a single accelerator core. It requires web-based variable expansion to allocate local memory for private variables. Preload-poststore buffering is a buffering technique that eliminates the overhead of software cache accesses. Together with work-item coalescing, it has a synergistic effect on performance. We show the effectiveness of our OpenCL framework by evaluating its performance on a system that consists of two Cell BE processors. The experimental results show that our approach is promising.
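The idea behind work-item coalescing and variable expansion can be illustrated with a minimal sketch in plain C. It is not the output of the authors' translator: the kernel, the names `reverse_in_group_coalesced`, `run_all_groups`, and `WG_SIZE`, and the fixed work-group size are all hypothetical, chosen only to show how a kernel with one barrier might be serialized on a single core.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OpenCL C kernel being transformed (shown for reference only):
 *
 *   __kernel void reverse_in_group(__global int *out, __global const int *in) {
 *       __local int tile[WG_SIZE];
 *       size_t lid = get_local_id(0), gid = get_global_id(0);
 *       int x = in[gid] * 2;              // private, live across the barrier
 *       tile[lid] = x;
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       out[gid] = tile[WG_SIZE - 1 - lid] + x;
 *   }
 */

enum { WG_SIZE = 4 };  /* assumed work-group size for this sketch */

/* Runs one whole work-group sequentially. The kernel body is split at the
 * barrier, and each region is wrapped in a loop over the work-items. */
static void reverse_in_group_coalesced(int *out, const int *in, size_t group_id)
{
    int tile[WG_SIZE];  /* stands in for the work-group's __local array */
    int x[WG_SIZE];     /* private variable x, expanded to one slot per work-item */
    size_t base = group_id * WG_SIZE;

    /* Region before the barrier, executed for every work-item in turn. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        x[lid] = in[base + lid] * 2;
        tile[lid] = x[lid];
    }

    /* Finishing the first loop plays the role of the barrier: all writes to
     * tile are complete before any work-item starts the second region. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid) {
        out[base + lid] = tile[WG_SIZE - 1 - lid] + x[lid];
    }
}

/* Hypothetical driver: runs every work-group of an n-element range. */
static void run_all_groups(int *out, const int *in, size_t n)
{
    assert(n % WG_SIZE == 0);  /* sketch assumes whole work-groups only */
    for (size_t g = 0; g < n / WG_SIZE; ++g) {
        reverse_in_group_coalesced(out, in, g);
    }
}
```

The private variable `x` is expanded into an array because its value is defined before the barrier and used after it; a private variable used only within a single region could stay a scalar, which is why the expansion is applied selectively.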
January 16, 2011 by hgpu