SIMD Divergence Optimization through Intra-Warp Compaction

Aniruddha S. Vaidya, Anahita Shayesteh, Dong Hyuk Woo, Roy Saharoy, Mani Azimi
Intel Corporation, Santa Clara, CA, USA
40th International Symposium on Computer Architecture (ISCA), 2013

@inproceedings{vaidya2013simd,
   title={SIMD divergence optimization through intra-warp compaction},
   author={Vaidya, Aniruddha S. and Shayesteh, Anahita and Woo, Dong Hyuk and Saharoy, Roy and Azimi, Mani},
   booktitle={Proceedings of the 40th Annual International Symposium on Computer Architecture},
   pages={368--379},
   year={2013},
   organization={ACM}
}

SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today’s GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
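The cycle-compression idea behind BCC and SCC can be illustrated with a small model: when a wide logical SIMD instruction is issued to a narrower physical execution unit over several cycles, any cycle whose lane group is entirely turned off can be skipped (BCC), and remapping active lanes so they pack into fewer groups can save further cycles (SCC). The sketch below is a minimal simulation of that counting argument only; the warp width, lane-group width, and the idealized lane-repacking step are illustrative assumptions, not the parameters or mechanisms of the architecture studied in the paper.

from math import ceil

WARP_WIDTH = 32    # logical SIMD width (assumed for illustration)
GROUP_WIDTH = 8    # physical lanes issued per cycle (assumed)
GROUPS = WARP_WIDTH // GROUP_WIDTH

def baseline_cycles(mask):
    """Without compression, every lane group is issued, active or not."""
    return GROUPS

def bcc_cycles(mask):
    """Basic cycle compression: skip groups with no active lanes."""
    cycles = 0
    for g in range(GROUPS):
        group = mask[g * GROUP_WIDTH:(g + 1) * GROUP_WIDTH]
        if any(group):
            cycles += 1
    return cycles

def scc_cycles(mask):
    """Swizzled-cycle compression (idealized): active lanes are remapped
    so they pack into as few lane groups as possible."""
    active = sum(mask)
    return ceil(active / GROUP_WIDTH) if active else 0

# Example: a divergent branch leaves only lanes 0-3 and 16-19 active.
mask = [1 if (i < 4 or 16 <= i < 20) else 0 for i in range(WARP_WIDTH)]
print(baseline_cycles(mask), bcc_cycles(mask), scc_cycles(mask))  # 4 2 1

In this toy example the baseline issues 4 cycles regardless of the mask, BCC drops the two all-inactive groups to issue 2, and SCC packs the 8 active lanes into a single cycle, which mirrors the kind of execution-cycle reduction the paper reports for divergent workloads.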
