high performance computing on graphics processing units: hgpu.org

Fast Implementation of DGEMM on Fermi GPU

Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips, Yungang Bao, Ninghui Sun

Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences

ACM/IEEE Supercomputing (SC’11), 2011

Package: High Performance DGEMM on GPU (NVIDIA/ATI)

In this paper we present a thorough account of tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. Our optimizations include software pipelining, vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves performance comparable to that of the latest CUBLAS library. We improve on this further with an implementation in native machine language, which yields a 20% increase in performance: the achieved peak performance (efficiency) rises from 302 Gflop/s (58%) to 362 Gflop/s (70%).
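The two-level blocking mentioned in the abstract can be illustrated with a minimal CPU sketch in plain C++. This is not the authors' CUDA or machine-language kernel, and it does not model software pipelining, vector memory operations, or instruction scheduling. The names `dgemm_blocked` and `max_abs_diff_vs_naive` and the sizes `TILE` and `REG` are hypothetical, chosen only for illustration: the outer `TILE` loops stand in for the tile a thread block would stage in shared memory, and the small `acc` array stands in for the per-thread register tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed, illustrative sizes; the paper's actual blocking factors are not given here.
constexpr std::size_t TILE = 32;  // outer tile edge (analogue of a shared-memory tile)
constexpr std::size_t REG = 4;    // inner tile edge (analogue of a register tile)

// Computes C = alpha * A * B + beta * C for row-major A (m x k), B (k x n), C (m x n).
void dgemm_blocked(std::size_t m, std::size_t n, std::size_t k, double alpha,
                   const std::vector<double>& A, const std::vector<double>& B,
                   double beta, std::vector<double>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (double& c : C) c *= beta;  // apply beta once, then accumulate products
    // Level 1: walk TILE x TILE blocks of C and TILE-wide slices of the k dimension.
    for (std::size_t i0 = 0; i0 < m; i0 += TILE)
    for (std::size_t j0 = 0; j0 < n; j0 += TILE)
    for (std::size_t p0 = 0; p0 < k; p0 += TILE) {
        const std::size_t i1 = std::min(i0 + TILE, m);
        const std::size_t j1 = std::min(j0 + TILE, n);
        const std::size_t p1 = std::min(p0 + TILE, k);
        // Level 2: each REG x REG micro-tile keeps its partial sums in a small
        // local array, so every loaded element of A is reused across a row of acc.
        for (std::size_t i = i0; i < i1; i += REG)
        for (std::size_t j = j0; j < j1; j += REG) {
            const std::size_t ri = std::min(REG, i1 - i);
            const std::size_t rj = std::min(REG, j1 - j);
            double acc[REG][REG] = {};
            for (std::size_t p = p0; p < p1; ++p)
                for (std::size_t a = 0; a < ri; ++a) {
                    const double av = A[(i + a) * k + p];
                    for (std::size_t b = 0; b < rj; ++b)
                        acc[a][b] += av * B[p * n + (j + b)];
                }
            for (std::size_t a = 0; a < ri; ++a)
                for (std::size_t b = 0; b < rj; ++b)
                    C[(i + a) * n + (j + b)] += alpha * acc[a][b];
        }
    }
}

// Hypothetical helper: largest absolute difference between the blocked result
// and a straightforward triple loop on small deterministic inputs.
double max_abs_diff_vs_naive(std::size_t m, std::size_t n, std::size_t k) {
    std::vector<double> A(m * k), B(k * n), C(m * n), C0(m * n);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] = double(i % 7) - 3.0;
    for (std::size_t i = 0; i < B.size(); ++i) B[i] = double(i % 5) - 2.0;
    for (std::size_t i = 0; i < C.size(); ++i) C[i] = C0[i] = double(i % 3);
    const double alpha = 1.5, beta = 0.5;
    dgemm_blocked(m, n, k, alpha, A, B, beta, C);
    double worst = 0.0;
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t p = 0; p < k; ++p) sum += A[i * k + p] * B[p * n + j];
            const double ref = alpha * sum + beta * C0[i * n + j];
            worst = std::max(worst, std::fabs(ref - C[i * n + j]));
        }
    return worst;
}
```

Calling `max_abs_diff_vs_naive(70, 45, 33)` exercises sizes that are not multiples of either tile edge, which is where blocked kernels most often go wrong. On a GPU the same structure would map the outer tile to a thread block and the inner tile to a single thread, but that mapping is an assumption of this sketch rather than a description of the paper's code.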

Tags: Algorithms, Benchmarking, Computer science, CUBLAS, CUDA, Matrix multiplication, nVidia, Optimization, Performance, Tesla C2050

October 17, 2011 by hgpu
