https://hgpu.org/?p=9341
Optimizing CUDA Code By Kernel Fusion - Application on BLAS