GPU Matrix Multiplication

Junjie Li, Sanjay Ranka, Sartaj Sahni
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611
Chapter in the book "Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications", Chapman-Hall/CRC Press, 2013

   title={GPU Matrix Multiplication},

   author={Li, Junjie and Ranka, Sanjay and Sahni, Sartaj},

   journal={chapter in Multi- and Many-Core Technologies: Architectures, Programming, Algorithms, and Applications, Chapman Hall},



Download Download (PDF)   View View   Source Source   



Graphics Processing Units (GPUs) were developed originally to meet the computational needs of algorithms for rendering computer graphics. The rapid and enormous growth in sophistication of graphics applications such as computer games has resulted in the availability of GPUs that have hundreds of processors and peak performance near a teraflop and that sell for hundreds of dollars to a few thousand dollars. Although GPUs are optimized for graphics calculations, their low cost per gigaflop has motivated significant research into their efficient use for non-graphics applications. The effort being expended in this direction has long-lasting potential because the widespread use of GPUs in the vibrant computer games industry almost ensures the longevity of GPUs. So, unlike traditional multimillion dollar supercomputers whose development cost had to be borne entirely by a relatively small supercomputing community, GPUs are backed by a very large gaming industry. This makes it more likely that GPU architectures will remain economically viable and will continue to evolve. Although the cost of a GPU measured as dollars per peak gigaflop is very low, obtaining performance near the peak requires very careful programming of the application code. This programming is complicated by the availability of several different memories (e.g., device memory, shared memory, constant cache, texture cache, and registers), each with different latency and the partitioning of the available scalar processors or cores into groups called streaming multiprocessors (Figure 1.1). In this chapter, we explore the intricacies of programming a GPU to obtain high performance for the multiplication of two single-precision square matrices. We focus our development to NVIDIA’s Tesla series of GPUs of which the C1060 is an example (Figure 1.2). Our example programs are developed using CUDA.
VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)

* * *

* * *

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.2
  • SDK: AMD APP SDK 2.9

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: