Optimizing Data Locality for Iterative Matrix Solvers on CUDA
Department of Electrical and Computer Engineering, University of Maine, Orono, ME, USA
The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'13), 2013
@inproceedings{flagg2013optimizing,
title={Optimizing Data Locality for Iterative Matrix Solvers on CUDA},
author={Flagg, Raymond and Monk, Jason},
booktitle={The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'13)},
year={2013}
}
Solving systems of linear equations is an important problem that spans almost all fields of science and mathematics. When these systems grow large, iterative methods are used to solve them. This paper looks at optimizing these methods for the CUDA architecture. It discusses a multi-threaded CPU implementation, a GPU implementation, and a data-optimized GPU implementation. The optimized version uses an extra kernel to rearrange the problem data so that the solver performs a minimal number of memory accesses with minimal thread divergence. The plain GPU implementation achieved a total speedup of 1.60X over the CPU version, whereas the optimized version achieved a total speedup of 1.78X. This paper demonstrates the impact of pre-organizing the data in iterative methods.
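The paper's kernels are not reproduced in this listing; the sketch below is only a minimal illustration of the idea, assuming a dense Jacobi solver. A hypothetical rearrange kernel transposes the row-major matrix into column-major order so that, in the subsequent jacobi_step kernel, the threads of a warp (one per row) load consecutive addresses and follow the same control path. The names rearrange and jacobi_step and all parameters are illustrative assumptions, not the authors' code.

#include <cuda_runtime.h>

// Illustrative rearrangement kernel (an assumption, not the paper's code):
// transpose the row-major matrix A into column-major layout At so that,
// inside the solver, consecutive threads read consecutive addresses and
// the loads coalesce.
__global__ void rearrange(const float *A, float *At, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        At[col * n + row] = A[row * n + col];
}

// One Jacobi sweep: thread i updates x_new[i]. With At in column-major
// order, the load At[j * n + i] is contiguous across the threads i of a
// warp for each fixed j, and the loop has no data-dependent branching,
// so the warp does not diverge.
__global__ void jacobi_step(const float *At, const float *b,
                            const float *x_old, float *x_new, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sigma = 0.0f;
    for (int j = 0; j < n; ++j)
        if (j != i)
            sigma += At[j * n + i] * x_old[j];
    x_new[i] = (b[i] - sigma) / At[i * n + i];
}

The one-time transpose costs a single extra kernel launch, but every subsequent iteration then reads the matrix with coalesced accesses, which is the kind of memory-access reduction the abstract describes.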
December 4, 2013 by hgpu