https://hgpu.org/?p=16149
TTC: A Tensor Transposition Compiler for Multiple Architectures