https://hgpu.org/?p=4399
Auto-tuning Dense Matrix Multiplication for GPGPU with Cache