https://hgpu.org/?p=4313
CUDA Based Fast Implementation of Very Large Matrix Computation