https://hgpu.org/?p=17219
cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs