https://hgpu.org/?p=12996
Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units