high performance computing on graphics processing units: hgpu.org


CUDA Based Fast Implementation of Very Large Matrix Computation

Yinghong Sun, Yuanman Tong

Dept. of Comput. Sci. & Technol., Hunan Int. Econ. Univ., Changsha, China

International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010

DOI:10.1109/PDCAT.2010.45

@inproceedings{sun2010cuda,
  title={CUDA Based Fast Implementation of Very Large Matrix Computation},
  author={Sun, Y. and Tong, Y.},
  booktitle={The 11th International Conference on Parallel and Distributed Computing, Applications and Technologies},
  pages={487--491},
  year={2010},
  organization={IEEE}
}


This paper presents CUDA (Compute Unified Device Architecture) acceleration of very large scale matrix-vector and matrix-matrix multiplication. The intrinsic parallelism in the matrix computations is exploited thoroughly. By dividing the entire matrix computation into multiple sub-groups, scalable performance improvement can be achieved using multiple GPUs. The key operations are accelerated on the GPU, and the CUDA-related data storage, thread hierarchy, and kernel implementation are proposed. Several optimization methods are also employed, including coalesced global memory access, on-the-fly reduction, bank-conflict-free shared memory usage, loop unrolling, removal of unnecessary synchronization, and concurrent execution on the device through streams. Experimental results show that a maximum speedup of about 8.5 times can be achieved for CUDA-accelerated matrix multiplication.
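The idea of dividing a matrix computation into sub-groups can be illustrated with a small CPU-only sketch. The abstract does not say how the paper partitions the matrix, so the row-wise split below is an assumption, and the names `split_rows`, `matvec_block`, and `partitioned_matvec` are hypothetical. Worker threads stand in for the multiple GPUs; this is not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(n_rows, n_groups):
    """Partition range(n_rows) into n_groups contiguous, nearly equal (start, stop) ranges."""
    base, extra = divmod(n_rows, n_groups)
    ranges, start = [], 0
    for g in range(n_groups):
        # The first `extra` groups take one additional row each.
        stop = start + base + (1 if g < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def matvec_block(rows, x):
    """Multiply one sub-group of rows by x, accumulating each dot product as it goes."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def partitioned_matvec(A, x, n_groups=2):
    """Compute y = A x by handing each row sub-group to its own worker.

    Assumption: a row-wise split. Each worker stands in for one device.
    """
    ranges = split_rows(len(A), n_groups)
    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        # map() preserves input order, so the partial results concatenate into y.
        parts = list(pool.map(lambda r: matvec_block(A[r[0]:r[1]], x), ranges))
    y = []
    for part in parts:
        y.extend(part)
    return y


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    x = [1, 1]
    print(partitioned_matvec(A, x, n_groups=2))
```

Because each sub-group depends only on its own rows and the shared vector, the sub-groups can be processed independently, which is what makes a split of this kind a natural fit for several devices.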

Tags: Computer science, CUDA, Linear Algebra, Matrix multiplication, nVidia

June 10, 2011 by hgpu


