https://hgpu.org/?p=1081
Improving Performance of Matrix Multiplication and FFT on GPU