A Comprehensive Performance Analysis of HSA and OpenCL 2.0

Saoni Mukherjee, Yifan Sun, Paul Blinzer, Amir Kavyan Ziabari, David Kaeli
Northeastern University, Boston, MA
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2016

@inproceedings{mukherjee2016comprehensive,
   title={A comprehensive performance analysis of HSA and OpenCL 2.0},
   author={Mukherjee, Saoni and Sun, Yifan and Blinzer, Paul and Ziabari, Amir Kavyan and Kaeli, David},
   booktitle={2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
   pages={183--193},
   year={2016},
   organization={IEEE}
}


Heterogeneous systems that marry CPUs and GPUs in a range of configurations are quickly becoming the design paradigm for today’s platforms because of their impressive parallel processing capabilities. However, in many existing heterogeneous systems, the GPU is treated only as an accelerator by the CPU, working as a slave to the CPU master. Recently, a new class of devices and changes to the system runtime model have begun to appear that enable accelerators to be treated as first-class computing devices. To support programmability and efficiency of heterogeneous programming, the HSA Foundation introduced the Heterogeneous System Architecture (HSA), which defines a platform and runtime architecture that provides rich support for OpenCL 2.0 features, including shared virtual memory, dynamic parallelism, and improved atomic operations. In this paper, we provide the first comprehensive study of OpenCL 2.0 and HSA 1.0 execution, considering OpenCL 1.2 as the baseline. For workloads, we develop a suite of OpenCL micro-benchmarks designed to highlight the features of these emerging standards, and we also use real-world applications to better understand their impact at the application level. To fully exercise the new features provided by the HSA model, we experiment with a producer-consumer algorithm and persistent kernels. We find that by using HSA signals, we can remove 92% of the overhead due to synchronous kernel launches. In our real-world applications, the OpenCL 2.0 runtime achieves up to a 1.2X speedup, while the HSA 1.0 runtime achieves a 2.7X speedup over OpenCL 1.2.
