high performance computing on graphics processing units: hgpu.org

An Improved Magma Gemm For Fermi Graphics Processing Units

Rajib Nath, Stanimire Tomov, Jack Dongarra

University of Tennessee, USA

International Journal of High Performance Computing Applications, Vol. 24, No. 4. (1 November 2010), pp. 511-515

DOI:10.1177/1094342010385729

Source code package: Magma Gemm For Fermi Graphics Processing Units

We present an improved matrix-matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets NVIDIA Fermi graphics processing units (GPUs) using the Compute Unified Device Architecture (CUDA). We show how to modify the previous MAGMA GEMM kernels to make more efficient use of Fermi's new architectural features, most notably its extended memory hierarchy and memory sizes. The improved kernels run at up to 300 GFlop/s in double precision and up to 645 GFlop/s in single precision arithmetic (on a C2050), which are 58% and 63% of the theoretical peak, respectively. We compare the improved kernels with the currently available version in CUBLAS 3.1. Further, we show the effect of the new kernels on higher-level dense linear algebra (DLA) routines such as the one-sided matrix factorizations, and compare their performance with that of the corresponding, currently available routines running on homogeneous multicore systems.
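The percentages quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the peak rates (515 GFlop/s double precision, 1030 GFlop/s single precision) are assumed from NVIDIA's published Tesla C2050 figures and do not appear in the abstract, and the function names `gemm_flops`, `percent_of_peak`, and `gemm_seconds` are hypothetical helpers, not part of MAGMA or CUBLAS.

```python
# Assumption: theoretical peak rates of the Tesla C2050 in GFlop/s,
# taken from NVIDIA's published specifications (not stated in the abstract).
PEAK_GFLOPS = {"double": 515.0, "single": 1030.0}

# Achieved GEMM rates in GFlop/s, as quoted in the abstract.
ACHIEVED_GFLOPS = {"double": 300.0, "single": 645.0}


def gemm_flops(m, n, k):
    """Conventional operation count for C = alpha*A*B + beta*C,
    with A of size m x k and B of size k x n: 2*m*n*k."""
    return 2 * m * n * k


def percent_of_peak(achieved_gflops, peak_gflops):
    """Achieved rate expressed as a percentage of the theoretical peak."""
    return 100.0 * achieved_gflops / peak_gflops


def gemm_seconds(m, n, k, gflops):
    """Time one GEMM would take if it sustained the given rate."""
    return gemm_flops(m, n, k) / (gflops * 1e9)


if __name__ == "__main__":
    for precision in ("double", "single"):
        pct = percent_of_peak(ACHIEVED_GFLOPS[precision], PEAK_GFLOPS[precision])
        print(f"{precision}: {pct:.1f}% of assumed peak")
```

Under these assumed peaks, the ratios round to the 58% and 63% given in the abstract. `gemm_seconds` can be used the same way to estimate how long a GEMM of a given size would take at a sustained rate.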

Tags: BLAS, Computer science, CUDA, Linear Algebra, Matrix multiplication, nVidia, Package, Tesla C2050

January 23, 2011 by hgpu
