https://hgpu.org/?p=16662
Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design