https://hgpu.org/?p=12907
KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators