https://hgpu.org/?p=7916
Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU