Register packing for cyclic reduction: a case study
University of California, Davis
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, 2011
@inproceedings{davidson2011register,
title={Register packing for cyclic reduction: a case study},
author={Davidson, A. and Owens, J.D.},
booktitle={Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units},
pages={4},
year={2011},
organization={ACM}
}
We generalize a method for avoiding GPU shared-memory communication when dealing with a down-sweep pattern. We apply this generalization to Cyclic Reduction (CR), a tridiagonal solver that exhibits this pattern. Previously, Cyclic Reduction performed poorly compared to other tridiagonal solvers on the GPU because of shared-memory bandwidth bottlenecks and limited step-efficiency. We address this problem by applying our methodology for reducing shared-memory communication in the down-sweep. Our re-mapping also allows Cyclic Reduction to solve larger systems directly in a virtual block. By using our generalized mapping, we improve Cyclic Reduction's performance on a GPU by a factor of 3 to 4.5 over the original CR implementation, making it 1.5 to 3 times faster than other GPU tridiagonal solvers.
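For readers unfamiliar with the algorithm, the following is a minimal serial sketch of textbook Cyclic Reduction in Python. It is not the paper's GPU register-packing mapping; it only illustrates the forward-reduction phase followed by the down-sweep (back-substitution) phase whose communication the paper targets. The function name `cyclic_reduction` and the coefficient layout (`a` sub-diagonal, `b` diagonal, `c` super-diagonal, `d` right-hand side) are assumptions chosen for illustration, and no pivoting is performed, so it is only suitable for well-conditioned systems such as diagonally dominant ones.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    Illustrative serial sketch (hypothetical helper, no pivoting).
    a[0] and c[-1] are treated as zero.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)  # work on copies
    a[0] = 0.0
    c[-1] = 0.0

    # Forward reduction: at stride s, each kept equation i eliminates
    # its neighbours at i - s and i + s, halving the active system.
    s = 1
    while 2 * s - 1 < n:
        for i in range(2 * s - 1, n, 2 * s):
            lo, hi = i - s, i + s
            has_hi = hi < n
            k1 = a[i] / b[lo]
            k2 = c[i] / b[hi] if has_hi else 0.0
            b[i] -= c[lo] * k1 + (a[hi] * k2 if has_hi else 0.0)
            d[i] -= d[lo] * k1 + (d[hi] * k2 if has_hi else 0.0)
            a[i] = -a[lo] * k1
            c[i] = -c[hi] * k2 if has_hi else 0.0
        s *= 2

    # Down-sweep (back substitution): solve the single remaining equation,
    # then fill in the eliminated unknowns level by level.
    x = [0.0] * n
    while s >= 1:
        for i in range(s - 1, n, 2 * s):
            left = x[i - s] if i - s >= 0 else 0.0
            right = x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        s //= 2
    return x
```

In a GPU implementation, each inner loop would be executed in parallel by threads that exchange neighbouring coefficients, which is where the shared-memory traffic discussed in the abstract arises.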
September 22, 2011 by hgpu