GPU Matrix Multiplication

Junjie Li, Sanjay Ranka, Sartaj Sahni
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
Chapter in the book "Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications", Chapman-Hall/CRC Press, 2013


@incollection{li2013gpu,
   title={GPU Matrix Multiplication},
   author={Li, Junjie and Ranka, Sanjay and Sahni, Sartaj},
   booktitle={Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications},
   publisher={Chapman-Hall/CRC Press},
   year={2013}
}

Graphics Processing Units (GPUs) were developed originally to meet the computational needs of algorithms for rendering computer graphics. The rapid and enormous growth in the sophistication of graphics applications such as computer games has resulted in the availability of GPUs that have hundreds of processors, offer peak performance near a teraflop, and sell for hundreds to a few thousand dollars. Although GPUs are optimized for graphics calculations, their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications.

The effort being expended in this direction has long-lasting potential because the widespread use of GPUs in the vibrant computer-games industry all but ensures the longevity of GPUs. So, unlike traditional multimillion-dollar supercomputers, whose development cost had to be borne entirely by a relatively small supercomputing community, GPUs are backed by a very large gaming industry. This makes it more likely that GPU architectures will remain economically viable and will continue to evolve.

Although the cost of a GPU measured in dollars per peak gigaflop is very low, obtaining performance near the peak requires very careful programming of the application code. This programming is complicated by the availability of several different memories (e.g., device memory, shared memory, constant cache, texture cache, and registers), each with a different latency, and by the partitioning of the available scalar processors or cores into groups called streaming multiprocessors (Figure 1.1). In this chapter, we explore the intricacies of programming a GPU to obtain high performance for the multiplication of two single-precision square matrices. We focus our development on NVIDIA’s Tesla series of GPUs, of which the C1060 is an example (Figure 1.2). Our example programs are developed using CUDA.
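To make the memory-hierarchy concern concrete, the following is a minimal sketch of the standard shared-memory tiling approach to single-precision matrix multiply in CUDA. It is illustrative only, not the chapter's code: the kernel name, the tile size of 16, and the assumption that n is a multiple of the tile size are choices made here for brevity. Each thread block stages one tile of A and one tile of B in low-latency shared memory, so each element fetched from device memory is reused TILE times.

```cuda
#define TILE 16

// Computes C = A * B for n x n single-precision matrices stored in
// row-major order. Assumes n is a multiple of TILE for simplicity.
__global__ void matmulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C this thread owns
    float sum = 0.0f;

    // Slide across the row band of A and the column band of B, one tile
    // at a time; each iteration loads two tiles from device memory once.
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               // wait until both tiles are loaded

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // finish with tiles before reloading
    }
    C[row * n + col] = sum;
}
```

A typical launch pairs a TILE x TILE thread block with an (n/TILE) x (n/TILE) grid, e.g. `matmulTiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(dA, dB, dC, n);`. The chapter's development refines this basic pattern; register blocking and tile-size tuning per streaming multiprocessor are what close most of the remaining gap to peak.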