high performance computing on graphics processing units: hgpu.org


Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

Chetan Jhurani

Tech-X Corporation, Boulder

@article{tkron,
  title={Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs},
  journal={Submitted},
  url={\url{users.ices.utexas.edu/~chetan/preprints/2013-CJ-kron.pdf}},
  author={{Chetan Jhurani}},
  year={2013}
}


We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical across the batch but the operands in a matrix-free application vary. Any batched GEMM (General Matrix Multiply) implementation, for example ours or the one in cuBLAS, can also be used to perform batched Kronecker products on GPUs. However, the specialized implementation presented here is faster and uses less memory, partly because a simple GEMM-based approach would require extra copies to and from main memory. We focus on matrix sizes less than or equal to 16, since these are the typical polynomial degrees in finite elements, but the implementation can easily be extended to other sizes. We obtain 143 and 285 GFlop/s in single-precision real arithmetic when processing matrices of size 10 and 16, respectively, on an NVIDIA Tesla K20c using CUDA 5.0. The corresponding speeds for 3-D array Kronecker products are 126 and 268 GFlop/s, respectively. Double precision is easily supported using the C++ template mechanism.
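To make the "matrix-free" Kronecker product action concrete, the following is a minimal CPU reference sketch in C++, not the paper's GPU kernel. It relies on the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ) for column-major vec, so the large Kronecker matrix is never formed and each batch item reduces to two small matrix products. The function names `idx`, `kron_apply`, and `batched_kron_apply`, the column-major layout, and the contiguous batch storage are all assumptions made for illustration; the paper's actual interface may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: column-major index of element (i, j) in a matrix with `rows` rows.
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t rows) {
    return i + j * rows;
}

// Computes Y = B * X * A^T for square n x n column-major matrices.
// By the identity kron(A, B) * vec(X) = vec(B * X * A^T), this applies the
// Kronecker product to X without forming the n^2 x n^2 matrix kron(A, B).
template <typename T>
void kron_apply(std::size_t n, const T* A, const T* B, const T* X, T* Y) {
    std::vector<T> tmp(n * n, T(0));
    // tmp = B * X
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t i = 0; i < n; ++i)
                tmp[idx(i, j, n)] += B[idx(i, k, n)] * X[idx(k, j, n)];
    // Y = tmp * A^T, i.e. Y(i, j) = sum_k tmp(i, k) * A(j, k)
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) {
            T sum = T(0);
            for (std::size_t k = 0; k < n; ++k)
                sum += tmp[idx(i, k, n)] * A[idx(j, k, n)];
            Y[idx(i, j, n)] = sum;
        }
}

// Applies the same pair (A, B) to every operand in a batch.
// X and Y hold `batch` matrices of size n x n stored back to back (assumed layout).
template <typename T>
void batched_kron_apply(std::size_t n, std::size_t batch,
                        const std::vector<T>& A, const std::vector<T>& B,
                        const std::vector<T>& X, std::vector<T>& Y) {
    assert(A.size() == n * n && B.size() == n * n);
    assert(X.size() == batch * n * n);
    Y.assign(batch * n * n, T(0));
    for (std::size_t b = 0; b < batch; ++b)
        kron_apply(n, A.data(), B.data(),
                   X.data() + b * n * n, Y.data() + b * n * n);
}
```

Because the element type is a template parameter, the same code serves `float` and `double`, mirroring the abstract's remark about double precision. On a GPU, the outer loop over the batch is the natural axis of parallelism, since `A` and `B` are shared and only the operands vary.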

Tags: BLAS, CUBLAS, CUDA, Kronecker product, nVidia, Tesla K20

April 10, 2013 by chetan.jhurani


