https://hgpu.org/?p=8056
Dense Matrix Computation on a Heterogenous Architecture: A Block Synchronous Approach