https://hgpu.org/?p=945
Cache and bandwidth aware matrix multiplication on the GPU