https://hgpu.org/?p=18884
Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs