Task Superscalar: An Out-of-Order Task Pipeline

Yoav Etsion, Felipe Cabarcas, Alejandro Rico, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, Mateo Valero

Barcelona Supercomputing Center (BSC), Barcelona, Spain

Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, 2010

DOI:10.1109/MICRO.2010.13

@inproceedings{etsion2010task,
  title={Task superscalar: An out-of-order task pipeline},
  author={Etsion, Y. and Cabarcas, F. and Rico, A. and Ramirez, A. and Badia, R.M. and Ayguade, E. and Labarta, J. and Valero, M.},
  booktitle={Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture},
  pages={89--100},
  year={2010},
  organization={IEEE Computer Society}
}

We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline frontend that can be embedded into any manycore fabric and that manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60 ns per task and dynamically uncover data dependencies among as many as ~50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95-255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit manycore systems effectively, while simultaneously simplifying their programming model.
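The idea in the abstract, detecting inter-task dependencies from annotated inputs and outputs and then running independent tasks out of order, can be illustrated with a small software sketch. This is not the paper's hardware design. The function names (`build_dependencies`, `schedule_waves`), the tuple format for tasks, and the example data are all hypothetical. The sketch also orders write-after-read and write-after-write conflicts conservatively, which is an assumption made here for simplicity.

```python
from collections import defaultdict


def build_dependencies(tasks):
    """Derive dependencies from annotated inputs/outputs (illustrative only).

    tasks: list of (name, inputs, outputs) tuples in sequential program order.
    Returns a dict mapping each task index to the set of earlier task
    indices that must finish before it may start.
    """
    last_writer = {}            # datum -> index of the most recent writer
    readers = defaultdict(set)  # datum -> tasks that read it since that write
    deps = {}
    for i, (_name, inputs, outputs) in enumerate(tasks):
        waits_for = set()
        for datum in inputs:
            if datum in last_writer:
                waits_for.add(last_writer[datum])   # read-after-write
        for datum in outputs:
            if datum in last_writer:
                waits_for.add(last_writer[datum])   # write-after-write
            waits_for |= readers[datum]             # write-after-read
        deps[i] = waits_for
        # Record this task's accesses for the tasks that follow it.
        for datum in inputs:
            readers[datum].add(i)
        for datum in outputs:
            last_writer[datum] = i
            readers[datum] = set()
    return deps


def schedule_waves(tasks, deps):
    """Group tasks into waves; tasks in one wave have no unmet dependencies."""
    done = set()
    pending = set(range(len(tasks)))
    waves = []
    while pending:
        ready = sorted(i for i in pending if deps[i] <= done)
        waves.append([tasks[i][0] for i in ready])
        done.update(ready)
        pending.difference_update(ready)
    return waves


# Hypothetical task stream, as a sequential thread would generate it.
tasks = [
    ("t0", {"a"}, {"b"}),       # b = f(a)
    ("t1", {"c"}, {"d"}),       # d = g(c)
    ("t2", {"b", "d"}, {"e"}),  # e = h(b, d): must wait for t0 and t1
    ("t3", {"c"}, {"f"}),       # f = k(c): independent of t2
]
deps = build_dependencies(tasks)
waves = schedule_waves(tasks, deps)
print(waves)
```

In this example `t3` is issued after `t2` in program order but has no unmet dependencies, so it lands in the first wave alongside `t0` and `t1`, while `t2` waits for both of its producers. That reordering is the task-level analogue of out-of-order instruction issue.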

Tags: Benchmarking, Cell processor, Computer science, Performance, Programming techniques

September 11, 2011 by hgpu