https://hgpu.org/?p=11008
Optimizing Data Locality for Iterative Matrix Solvers on CUDA