CUDA Based Fast Implementation of Very Large Matrix Computation
Dept. of Comput. Sci. & Technol., Hunan Int. Econ. Univ., Changsha, China
International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2010
CUDA (Compute Unified Device Architecture) acceleration of very large scale matrix-vector and matrix-matrix multiplication is presented in this paper. The intrinsic parallelism in these matrix computations is exploited thoroughly. By dividing the entire matrix computation into multiple sub-groups, scalable performance improvement can be achieved using multiple GPUs. The key operations are accelerated on the GPU, and the CUDA-related data storage, thread hierarchy, and kernel implementations are proposed. Several optimization methods are also employed, including coalesced global memory access, on-the-fly reduction, bank-conflict-free shared memory usage, loop unrolling, removal of unnecessary synchronization, and concurrent execution on the device through streams. Experimental results show that CUDA-accelerated matrix multiplication achieves a maximum speedup of about 8.5x.
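The abstract does not include source code, but several of the optimizations it names (shared-memory tiling with coalesced global loads, bank-conflict-free shared memory access, and loop unrolling) appear together in the standard tiled matrix-matrix kernel. Below is a minimal sketch in that style; the kernel name, the tile width of 16, the row-major layout, and the assumption that n is a multiple of the tile width are all illustrative choices, not the authors' implementation.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width; one thread computes one element of C

// Sketch of a tiled matrix-matrix multiply C = A * B for n x n row-major
// matrices, assuming n is a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    // Shared-memory tiles: threads with consecutive threadIdx.x read
    // consecutive global addresses, so loads are coalesced; the resulting
    // shared-memory access pattern is free of bank conflicts.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Stage one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Inner product over the tile; unrolled as the abstract mentions.
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Launched with a TILE x TILE thread block per output tile (e.g. `matmul_tiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n)`), this pattern also extends naturally to the paper's multi-GPU scheme: each sub-group of output rows can be assigned to a different device and overlapped via streams.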