https://hgpu.org/?p=10330
Performance Drawbacks for Matrix Multiplication using Set Associative Cache in GPU devices