https://hgpu.org/?p=5255
A fast GEMM implementation on the Cypress GPU