https://hgpu.org/?p=15449
Performance Portable GPU Code Generation for Matrix Multiplication