https://hgpu.org/?p=16989
Improving the Performance of Fully Connected Neural Networks by Out-of-Place Matrix Transpose