high performance computing on graphics processing units: hgpu.org


Optimizing Strassen matrix multiply on GPUs

Ayaz ul Hasan Khan, Mayez Al-Mouhamed, Allam Fatayer

Department of Computer Engineering, College of Computer Science & Engineering, KFUPM, Dhahran, 31261, Saudi Arabia

16th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2015

DOI:10.1109/SNPD.2015.7176172

@inproceedings{al2015optimizing,
  title={Optimizing {S}trassen matrix multiply on {GPU}s},
  author={Al-Mouhamed, Mayez and Fatayer, Allam and others},
  booktitle={Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2015 16th IEEE/ACIS International Conference on},
  pages={1--6},
  year={2015},
  organization={IEEE}
}


Many-core systems are designed primarily for applications with abundant data parallelism. Strassen Matrix Multiply (MM) can be formulated as a depth-first (DFS) traversal of a recursion tree in which all cores work in parallel on each of the NxN sub-matrices; this reduces storage at the cost of substantial data movement to gather and aggregate the results. We propose Strassen and Winograd algorithms (S-MM and W-MM) based on three optimizations: a set of basic algebra functions to reduce overhead, invocation of an efficient library (CUBLAS 5.5), and parameter tuning of a parametric kernel to improve resource occupancy. On GPUs, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library, running up to twice as fast for large arrays satisfying N>=2048 and N>=3072, respectively. Compared to the NVIDIA SDK library, S-MM and W-MM achieved speedups between 20x and 80x for the above arrays. The proposed approach can be used to enhance the performance of the CUBLAS and MKL libraries.
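The one-level recursion mentioned in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' GPU code: it uses plain Python lists, and the helper names (`matmul`, `add`, `sub`, `split`, `join`, and the `gemm` parameter) are hypothetical. `matmul` stands in for the tuned library GEMM (CUBLAS in the paper) that would handle the seven half-size products, and the `add`/`sub` helpers play the role of the basic algebra functions. The Strassen form below uses 7 multiplications and 18 block additions; the Winograd form uses 7 multiplications and 15 block additions. Both assume square matrices of even order.

```python
# One-level Strassen and Winograd matrix products (illustrative sketch only).
# `matmul` is a hypothetical stand-in for a tuned library GEMM call.

def matmul(X, Y):
    """Classical product of two matrices stored as nested lists."""
    Yt = list(zip(*Y))  # columns of Y
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def split(X):
    """Return quadrants X11, X12, X21, X22 of a matrix of even order."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def join(C11, C12, C21, C22):
    """Reassemble four quadrants into one matrix."""
    return ([a + b for a, b in zip(C11, C12)] +
            [a + b for a, b in zip(C21, C22)])

def strassen_one_level(A, B, gemm=matmul):
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven half-size products instead of the classical eight.
    M1 = gemm(add(A11, A22), add(B11, B22))
    M2 = gemm(add(A21, A22), B11)
    M3 = gemm(A11, sub(B12, B22))
    M4 = gemm(A22, sub(B21, B11))
    M5 = gemm(add(A11, A12), B22)
    M6 = gemm(sub(A21, A11), add(B11, B12))
    M7 = gemm(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    return join(C11, C12, C21, C22)

def winograd_one_level(A, B, gemm=matmul):
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Eight pre-additions; intermediate sums are reused.
    S1 = add(A21, A22)
    S2 = sub(S1, A11)
    S3 = sub(A11, A21)
    S4 = sub(A12, S2)
    T1 = sub(B12, B11)
    T2 = sub(B22, T1)
    T3 = sub(B22, B12)
    T4 = sub(T2, B21)
    # Seven half-size products.
    P1 = gemm(A11, B11)
    P2 = gemm(A12, B21)
    P3 = gemm(S4, B22)
    P4 = gemm(A22, T4)
    P5 = gemm(S1, T1)
    P6 = gemm(S2, T2)
    P7 = gemm(S3, T3)
    # Seven post-additions.
    U2 = add(P1, P6)
    U3 = add(U2, P7)
    U4 = add(U2, P5)
    return join(add(P1, P2), add(U4, P3), sub(U3, P4), add(U3, P5))
```

Passing a different callable as `gemm` is how a faster base multiply would be plugged in; applying the same function recursively inside `gemm` would give deeper recursion levels, which the abstract does not report on.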

Tags: Algorithms, Computer science, CUBLAS, CUDA, Data parallelism, Matrix multiplication, nVidia, Tesla K20

August 11, 2015 by hgpu

