high performance computing on graphics processing units: hgpu.org

GPU Matrix Multiplication

Junjie Li, Sanjay Ranka, Sartaj Sahni

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611

Chapter in the book "Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications", Chapman-Hall/CRC Press, 2013

Graphics Processing Units (GPUs) were originally developed to meet the computational needs of algorithms for rendering computer graphics. The rapid growth in the sophistication of graphics applications such as computer games has resulted in GPUs that have hundreds of processors and peak performance near a teraflop, and that sell for hundreds of dollars to a few thousand dollars. Although GPUs are optimized for graphics calculations, their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications. This effort has long-lasting potential because the widespread use of GPUs in the vibrant computer games industry almost ensures their longevity. Unlike traditional multimillion-dollar supercomputers, whose development cost had to be borne entirely by a relatively small supercomputing community, GPUs are backed by a very large gaming industry. This makes it more likely that GPU architectures will remain economically viable and will continue to evolve. Although the cost of a GPU measured in dollars per peak gigaflop is very low, obtaining performance near the peak requires very careful programming of the application code. This programming is complicated by the availability of several different memories (e.g., device memory, shared memory, constant cache, texture cache, and registers), each with a different latency, and by the partitioning of the available scalar processors, or cores, into groups called streaming multiprocessors (Figure 1.1). In this chapter, we explore the intricacies of programming a GPU to obtain high performance for the multiplication of two single-precision square matrices. We focus our development on NVIDIA’s Tesla series of GPUs, of which the C1060 is an example (Figure 1.2). Our example programs are developed using CUDA.
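To make the role of the different memories concrete, the following is a minimal sketch of a shared-memory tiled multiplication of two single-precision square matrices in CUDA. It is not taken from the chapter: the kernel name `matmulTiled`, the host helper `maxErrorVsCpu`, the tile width of 16, and the test data are all assumptions chosen for illustration, and the code is not tuned for the C1060 or any other device.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile width; illustrative, not a tuned value

// Computes C = A * B for n x n row-major single-precision matrices.
// Each thread block produces one TILE x TILE tile of C, staging the
// needed tiles of A and B in shared memory to cut device-memory reads.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // per-thread accumulator, held in a register

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros so any n works.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: runs the kernel on small test data and returns
// the largest absolute difference from a straightforward CPU product.
float maxErrorVsCpu(int n) {
    std::vector<float> hA(n * n), hB(n * n), hC(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5f;   // small values keep sums exact in float
        hB[i] = (i % 5) - 2.0f;
    }
    size_t bytes = sizeof(float) * n * n;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulTiled<<<grid, block>>>(dA, dB, dC, n);
    if (cudaGetLastError() != cudaSuccess) std::printf("kernel launch failed\n");

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[i * n + j]));
        }
    return maxErr;
}

int main() {
    float err = maxErrorVsCpu(100);  // 100 is not a multiple of TILE
    std::printf("max abs error: %g\n", err);
    return err < 1e-3f ? 0 : 1;
}
```

In this sketch each element of A and B is read from device memory once per tile rather than once per multiply-add, which is the kind of memory-hierarchy consideration the abstract refers to. Reaching near-peak performance on a particular GPU would need further tuning that this example does not attempt.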

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, nVidia, Tesla C1060

June 18, 2013 by hgpu
