
CUDA Based Fast Implementation of Very Large Matrix Computation

Yinghong Sun, Yuanman Tong
Department of Computer Science and Technology, Hunan International Economics University, Changsha, China
International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010


This paper presents CUDA (Compute Unified Device Architecture) acceleration of very large scale matrix-vector and matrix-matrix multiplication. The intrinsic parallelism in these matrix computations is exploited thoroughly. By dividing the entire matrix computation into multiple sub-groups, scalable performance improvement can be achieved using multiple GPUs. The key operations are accelerated on the GPU, and the corresponding CUDA data storage, thread hierarchy, and kernel implementations are proposed. Several optimization methods are also employed, including coalesced global memory access, on-the-fly reduction, bank-conflict-free shared memory usage, loop unrolling, removal of unnecessary synchronization, and concurrent execution on the device through streams. Experimental results show that CUDA-accelerated matrix multiplication achieves a maximum speedup of about 8.5x.
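As a rough illustration of the kind of kernel the abstract describes (not the authors' actual code), the sketch below shows a standard shared-memory tiled matrix-matrix multiply that combines three of the named optimizations: coalesced global memory loads, bank-conflict-free shared memory tiles, and an unrolled inner loop. The kernel name matmul_tiled, the tile size TILE, and the assumption that N is a multiple of TILE are illustrative choices, not taken from the paper.

#include <cuda_runtime.h>

#define TILE 16

// Illustrative sketch: tiled matrix-matrix multiply C = A * B for square
// N x N row-major matrices, assuming N is a multiple of TILE (not the
// paper's implementation). Each block computes one TILE x TILE tile of C;
// threads cooperatively stage tiles of A and B in shared memory so each
// global element is read only N/TILE times.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: consecutive threadIdx.x values touch
        // consecutive global addresses in both A and B.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Unrolled inner product over the shared tiles; the access
        // pattern (broadcast on As, stride-1 on Bs) avoids bank conflicts.
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

A launch such as matmul_tiled<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(dA, dB, dC, N) would compute the full product, assuming dA, dB, and dC are device buffers of N*N floats.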
