Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL

Ashwin M. Aji, Antonio J. Pena, Pavan Balaji, Wu-chun Feng

AMD Research, Advanced Micro Devices, Inc.

Proceedings of the IEEE Cluster, 2015

@InProceedings{aji-queue-sch-cluster15,
  author={Aji, Ashwin M. and Pena, Antonio J. and Balaji, Pavan and Feng, Wu-chun},
  title={Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL},
  booktitle={IEEE Cluster},
  address={Chicago, Illinois},
  month={September},
  year={2015}
}

OpenCL is a portable interface that can be used to program cluster nodes with heterogeneous compute devices. The OpenCL specification tightly binds its workflow abstraction, or "command queue", to a specific device for the entire program. For best performance, the user has to find the ideal queue-device mapping at command queue creation time, an effort that requires a thorough understanding of the match between the characteristics of all the underlying device architectures and the kernels in the program. In this paper, we propose to add scheduling attributes to the OpenCL context and command queue objects that can be leveraged by an intelligent runtime scheduler to automatically perform ideal queue-device mapping. Our proposed extensions enable the average OpenCL programmer to focus on the algorithm design rather than scheduling and automatically gain performance without sacrificing programmability. As an example, we design and implement an OpenCL runtime for task-parallel workloads, called MultiCL, which efficiently schedules command queues across devices. Within MultiCL, we implement several key optimizations to reduce runtime overhead. Our case studies include the SNU-NPB OpenCL benchmark suite and a real-world seismology simulation. We show that, on average, users have to apply our proposed scheduler extensions to only four source lines of code in existing OpenCL applications in order to automatically benefit from our runtime optimizations. We also show that MultiCL always maps command queues to the optimal device set with negligible runtime overhead.
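
To make the queue-device mapping problem in the abstract concrete, here is a toy sketch in Python. It is not MultiCL's algorithm or API. It assumes a hypothetical table of estimated run times per command queue and device (the `profile` values are invented for illustration), assumes queues mapped to the same device run one after another, and simply tries every mapping to find the one with the shortest overall finish time. The function name `best_mapping` is likewise an assumption.

```python
from itertools import product


def best_mapping(cost):
    """Exhaustively search for the queue-to-device mapping with the smallest makespan.

    cost[q][d] is the estimated run time of command queue q on device d.
    Returns (mapping, makespan), where mapping[q] is the device index chosen for queue q.
    Assumption: queues that share a device execute serially on it.
    """
    num_queues = len(cost)
    num_devices = len(cost[0])
    best = None
    best_makespan = float("inf")
    # Enumerate every possible assignment of queues to devices.
    for mapping in product(range(num_devices), repeat=num_queues):
        load = [0.0] * num_devices
        for q, d in enumerate(mapping):
            load[d] += cost[q][d]
        makespan = max(load)  # the program finishes when the busiest device does
        if makespan < best_makespan:
            best, best_makespan = mapping, makespan
    return best, best_makespan


# Hypothetical profile: 3 command queues, 2 devices (say, device 0 = CPU, device 1 = GPU).
profile = [
    [4.0, 1.0],  # queue 0 is much faster on device 1
    [2.0, 3.0],  # queue 1 is slightly faster on device 0
    [5.0, 2.0],  # queue 2 is faster on device 1
]

mapping, makespan = best_mapping(profile)
print(mapping, makespan)
```

The exhaustive search grows as (devices)^(queues), so it only suits a handful of queues. It is meant to show why choosing the mapping by hand requires knowing how each kernel behaves on each device, which is the burden the proposed scheduler extensions aim to remove.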

Tags: Computer science, Heterogeneous systems, nVidia, OpenCL, Performance, Seismology, Task scheduling, Tesla C2050

February 23, 2016 by hgpu
