high performance computing on graphics processing units: hgpu.org

Anatomy of High-Performance Many-Threaded Matrix Multiplication

Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, Field Van Zee

Institute for Computational Engineering and Sciences and Department of Computer Science, The University of Texas at Austin, Austin TX, 78712

28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), 2014

@conference{BLIS3,
author={Tyler M. Smith and Robert A. {v}an~{d}e~{G}eijn and Mikhail Smelyanskiy and Jeff R. Hammond and Field G. {V}an~{Z}ee},
title={Anatomy of High-Performance Many-Threaded Matrix Multiplication},
booktitle={28th IEEE International Parallel \& Distributed Processing Symposium (IPDPS 2014)},
year={2014},
note={Submitted}
}

Source code package: BLIS

BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this exposes a finer level of parallelism that greatly simplifies the multithreading of GEMM, and how it creates additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient but also necessary for scalability. The resulting implementations deliver what we believe to be the best open-source performance for these architectures, achieving both impressive performance and excellent scalability.
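The loop structure described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the BLIS implementation: the names `gemm` and `micro_kernel`, the block-size constants, and the loop numbering in the comments are assumptions made for illustration, and the packing of A and B into contiguous buffers that a real implementation would perform is omitted. The three outer loops stand for the loops around the inner kernel, and the two inner loops stand for the ones BLIS exposes inside it.

```python
# Hypothetical, tiny block sizes for illustration only; real values are tuned
# per architecture to match cache and register capacities.
NC, KC, MC = 8, 4, 6   # cache-level blocking of the n, k and m dimensions
NR, MR = 4, 2          # register-level blocking: the micro-tile is MR x NR


def micro_kernel(A, B, C, i0, j0, p0, mr, nr, kc):
    """Update an mr x nr micro-tile of C with kc successive rank-1 updates."""
    for p in range(p0, p0 + kc):
        for i in range(i0, i0 + mr):
            a = A[i][p]
            for j in range(j0, j0 + nr):
                C[i][j] += a * B[p][j]


def gemm(A, B, C):
    """Compute C += A * B on lists of lists with five loops around micro_kernel."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, NC):                  # outer loop over columns of C
        nc = min(NC, n - jc)
        for pc in range(0, k, KC):              # outer loop over the k dimension
            kc = min(KC, k - pc)
            for ic in range(0, m, MC):          # outer loop over rows of C
                mc = min(MC, m - ic)
                # The next two loops are the ones exposed inside the inner kernel.
                for jr in range(jc, jc + nc, NR):
                    nr = min(NR, jc + nc - jr)
                    for ir in range(ic, ic + mc, MR):
                        mr = min(MR, ic + mc - ir)
                        micro_kernel(A, B, C, ir, jr, pc, mr, nr, kc)
    return C
```

In this sketch, every loop except the one over the k dimension touches a disjoint part of C on each iteration, so those four loops are natural candidates for dividing work among threads, while splitting the k loop would require combining partial results. Only `micro_kernel` would need to be rewritten for a new architecture.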

Tags: Computer science, Intel Phi, Intel Xeon Phi, Matrix multiplication, Performance

November 13, 2013 by hgpu
