CUDA Based Fast Implementation of Very Large Matrix Computation
Dept. of Comput. Sci. & Technol., Hunan Int. Econ. Univ., Changsha, China
International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010
CUDA (Compute Unified Device Architecture) acceleration of very large scale matrix-vector and matrix-matrix multiplication is presented in this paper. The intrinsic parallelism in these matrix computations is exploited thoroughly. By dividing the entire matrix computation into multiple sub-groups, scalable performance improvement can be achieved using multiple GPUs. The key operations are accelerated on the GPU, and the CUDA-related data storage, thread hierarchy, and kernel implementations are proposed. Several optimization methods are also employed, including coalesced global memory access, on-the-fly reduction, bank-conflict-free shared memory usage, loop unrolling, removal of unnecessary synchronization, and concurrent execution on the device through streams. Experimental results show that CUDA-accelerated matrix multiplication achieves a maximum speedup of about 8.5x.
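The abstract does not include source code, but several of the optimizations it names (shared-memory tiling with coalesced global loads, bank-conflict-free shared memory access, and loop unrolling) appear together in the standard tiled matrix-matrix kernel. Below is a minimal sketch in that style; the kernel name, the tile width of 16, the row-major layout, and the assumption that n is a multiple of the tile width are all illustrative choices, not the authors' implementation.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; one thread computes one element of C

// Sketch of a tiled matrix-matrix multiply C = A * B for n x n row-major
// matrices, assuming n is a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    // Shared-memory tiles: threads with consecutive threadIdx.x read
    // consecutive global addresses, so loads are coalesced; the resulting
    // shared-memory access pattern is free of bank conflicts.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Inner product over the tile; unrolled as the abstract mentions.
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Launched with a TILE x TILE thread block per output tile (e.g. `matmul_tiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n)`), this pattern also extends naturally to the paper's multi-GPU scheme: each sub-group of output rows can be assigned to a different device and overlapped via streams.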