https://hgpu.org/?p=1300
Power Efficient Large Matrices Multiplication by Load Scheduling on Multi-core and GPU Platform with CUDA