https://hgpu.org/?p=13415
Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model