https://hgpu.org/?p=13843
Batched Matrix Computations on Hardware Accelerators