MacroSS: macro-SIMDization of streaming applications
Advanced Computer Architecture Laboratory, University of Michigan – Ann Arbor, MI
Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (ASPLOS '10), 2010
@article{hormati2010macross,
title={{MacroSS}: Macro-{SIMD}ization of streaming applications},
author={Hormati, A.H. and Choi, Y. and Woh, M. and Kudlur, M. and Rabbah, R. and Mudge, T. and Mahlke, S.},
journal={ACM SIGARCH Computer Architecture News},
volume={38},
number={1},
pages={285--296},
year={2010},
publisher={ACM}
}
SIMD (Single Instruction, Multiple Data) engines are an essential part of processors in various computing markets, from servers to the embedded domain. Although SIMD-enabled architectures can boost the performance of many application domains by exploiting data-level parallelism, it is very challenging for compilers and programmers alike to identify and transform the parts of a program that will benefit from a particular SIMD engine. The focus of this paper is the problem of SIMDization for the growing application domain of streaming. Streaming applications are an ideal match for multi-core architectures, such as shared/distributed memory systems and tiled architectures, as well as for single-core systems. Since these architectures, in most cases, also provide SIMD acceleration units, it is highly beneficial to generate SIMD code from streaming programs. Specifically, we introduce MacroSS, which is capable of performing macro-SIMDization on high-level streaming graphs. Macro-SIMDization uses high-level information, such as the execution rates of actors and the communication patterns between them, to transform the graph structure, vectorize the actors of a streaming program, and generate intermediate code. We also propose low-overhead architectural modifications that accelerate the shuffling of data elements between the scalar and vectorized parts of a streaming program. Our experiments show that MacroSS generates code that, on average, outperforms scalar code compiled with current state-of-the-art auto-vectorizing compilers by 54%. Using the low-overhead data-shuffling hardware, performance improves by an additional 8% with less than 1% area overhead.
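The core idea behind macro-SIMDization, executing several consecutive firings of a streaming actor in the lanes of a SIMD unit, can be illustrated with a minimal C sketch. This is not MacroSS's actual generated code: the Scale actor, its pop/push rates, and all names below are hypothetical, and SSE intrinsics stand in for whatever target ISA the compiler would emit.

/*
 * Minimal sketch of vectorizing a streaming actor across firings.
 * The hypothetical actor "Scale" pops one float and pushes one float
 * per firing; four consecutive firings are mapped onto the four lanes
 * of a 128-bit SSE register.
 */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

#define N 16  /* total firings; assumed to be a multiple of 4 */

/* Scalar work function: one firing consumes and produces one element. */
static float scale_fire(float in) { return in * 2.5f; }

/* Vectorized work function: four firings execute at once, one per lane. */
static void scale_fire_x4(const float *in, float *out) {
    __m128 v = _mm_loadu_ps(in);              /* pop 4 input elements   */
    v = _mm_mul_ps(v, _mm_set1_ps(2.5f));     /* 4 firings in parallel  */
    _mm_storeu_ps(out, v);                    /* push 4 output elements */
}

int main(void) {
    float in[N], out_scalar[N], out_vec[N];
    for (int i = 0; i < N; i++) in[i] = (float)i;

    for (int i = 0; i < N; i++)      /* scalar schedule: N firings     */
        out_scalar[i] = scale_fire(in[i]);

    for (int i = 0; i < N; i += 4)   /* vectorized schedule: N/4 steps */
        scale_fire_x4(&in[i], &out_vec[i]);

    for (int i = 0; i < N; i++)
        if (out_scalar[i] != out_vec[i]) { puts("mismatch"); return 1; }
    puts("scalar and vectorized schedules agree");
    return 0;
}

In the vectorized schedule, each loop iteration performs four firings of the actor; deciding where such lane-mapping is legal and profitable, restructuring the graph's steady-state schedule accordingly, and moving data between scalar and vectorized actors are the steps MacroSS automates, with the paper's proposed hardware accelerating the scalar-to-vector data shuffling.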