https://hgpu.org/?p=11082
Lessons learned from contrasting BLAS kernel implementations