https://hgpu.org/?p=10896
Anatomy of High-Performance Many-Threaded Matrix Multiplication