https://hgpu.org/?p=6416
Efficient Parallel Nonnegative Least Squares on Multicore Architectures