https://hgpu.org/?p=14780
Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm