CUDA Based Fast Implementation of Very Large Matrix Computation

Yinghong Sun, Yuanman Tong
Dept. of Comput. Sci. & Technol., Hunan Int. Econ. Univ., Changsha, China
International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010

@inproceedings{sun2010cuda,

   title={CUDA Based Fast Implementation of Very Large Matrix Computation},

   author={Sun, Y. and Tong, Y.},

   booktitle={The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies},

   pages={487--491},

   year={2010},

   organization={IEEE}

}


This paper presents CUDA (Compute Unified Device Architecture) acceleration of very large scale matrix-vector and matrix-matrix multiplication. The intrinsic parallelism in these matrix computations is exploited thoroughly. By dividing the entire matrix computation into multiple sub-groups, scalable performance improvements can be achieved using multiple GPUs. The key operations are accelerated on the GPU, and the associated CUDA data storage, thread hierarchy, and kernel implementations are proposed. Several optimization methods are also employed, including coalesced global memory access, on-the-fly reduction, bank-conflict-free shared memory usage, loop unrolling, removal of unnecessary synchronization, and concurrent execution on the device through streams. Experimental results show that CUDA-accelerated matrix multiplication achieves a maximum speedup of about 8.5x.
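The paper itself does not include source code; as an illustration only, a minimal tiled matrix-multiply kernel of the kind these optimizations describe (shared-memory staging, coalesced global loads, unrolled inner loop) might look like the sketch below. The tile width and the assumption that the matrix dimension n is a multiple of the tile width are choices made here for brevity, not details taken from the paper.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile width, not a value from the paper

// Computes C = A * B for n x n row-major matrices, assuming n % TILE == 0.
// Each block produces one TILE x TILE tile of C, staging operand tiles in
// shared memory. Adjacent threads in a warp read adjacent global addresses,
// so the loads are coalesced; the [TILE][TILE] shared-memory layout keeps
// the inner-product reads free of bank conflicts for TILE <= 32.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Coalesced loads: threadIdx.x varies fastest across addresses.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // tile fully staged before use

        #pragma unroll   // loop unrolling, one of the listed optimizations
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // tile fully consumed before overwrite
    }
    C[row * n + col] = acc;
}
```

Launching this kernel on separate `cudaStream_t` streams, one per matrix sub-group, is the natural way to realize the concurrent-execution and multi-GPU partitioning the abstract mentions.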
