https://hgpu.org/?p=1181
Faster matrix-vector multiplication on GeForce 8800GTX