high performance computing on graphics processing units: hgpu.org


Opportunities for Parallelism in Matrix Multiplication

Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond and Field G. Van Zee

Institute for Computational Engineering and Sciences and Department of Computer Science, The University of Texas at Austin, Austin TX, 78712

The University of Texas at Austin, Department of Computer Science. Technical Report TR-13-20, FLAME Working Note 71, 2013

@article{smith2013opportunities,
  title={Opportunities for Parallelism in Matrix Multiplication},
  author={Smith, Tyler M and van de Geijn, Robert A and Smelyanskiy, Mikhail and Hammond, Jeff R and Van Zee, Field G},
  year={2013}
}


BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM and opens additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient but also necessary for scalability. The resulting implementations deliver what we believe to be the best open-source performance for these architectures, achieving both impressive performance and excellent scalability.
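The loop structure the abstract describes can be sketched in a few lines of pure Python. This is only an illustration, not the BLIS implementation: the block sizes (`NC`, `KC`, `MC`, `NR`, `MR`), the function names, and the exact loop order are assumptions modelled on common descriptions of the GotoBLAS/BLIS approach, and the packing of blocks into contiguous buffers is omitted. Three loops surround a hypothetical `inner_kernel`, two more loops inside it iterate over small tiles, and a `micro_kernel` updates each tile. The sketch parallelizes only one loop around the inner kernel, using a thread pool, to show why independent blocks of C make this straightforward.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative block sizes (assumed; real values are tuned per architecture).
NC, KC, MC, NR, MR = 8, 4, 6, 2, 3


def micro_kernel(A, B, C, i0, i1, j0, j1, p0, p1):
    """Update tile C[i0:i1][j0:j1] with A[i0:i1][p0:p1] times B[p0:p1][j0:j1]."""
    for p in range(p0, p1):              # a sequence of rank-1 updates
        for i in range(i0, i1):
            a_ip = A[i][p]
            for j in range(j0, j1):
                C[i][j] += a_ip * B[p][j]


def inner_kernel(A, B, C, ic, jc, pc, m, n, k):
    """The two additional loops exposed inside the former inner kernel."""
    i_end, j_end, p_end = min(ic + MC, m), min(jc + NC, n), min(pc + KC, k)
    for jr in range(jc, j_end, NR):      # loop over column tiles of width NR
        for ir in range(ic, i_end, MR):  # loop over row tiles of height MR
            micro_kernel(A, B, C,
                         ir, min(ir + MR, i_end),
                         jr, min(jr + NR, j_end),
                         pc, p_end)


def gemm(A, B, C, workers=1):
    """Compute C += A * B for matrices stored as lists of lists."""
    m, k, n = len(A), len(B), len(B[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for jc in range(0, n, NC):       # outer loop over columns of C
            for pc in range(0, k, KC):   # loop over the shared dimension
                # Each ic block writes a disjoint set of rows of C,
                # so these blocks can safely run concurrently.
                futures = [pool.submit(inner_kernel, A, B, C, ic, jc, pc, m, n, k)
                           for ic in range(0, m, MC)]
                for f in futures:
                    f.result()           # wait; re-raises any worker exception
    return C
```

Calling `gemm(A, B, C, workers=2)` accumulates the product into `C`. The loop over `pc` stays sequential here because successive `pc` iterations update the same entries of C. The two loops inside `inner_kernel` are also independent across tiles, which is the kind of finer-grained parallelism the abstract refers to. Python threads will not speed up this arithmetic, so the sketch shows only the structure.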

Tags: Computer science, Intel Xeon Phi, Matrix multiplication

January 5, 2014 by hgpu

