Sparse Matrix Algorithms Using GPGPU

Kaupo Kuresson
Faculty of Mathematics and Computer Science, Institute of Computer Science, University of Tartu
University of Tartu, 2012

   title={Sparse Matrix Algorithms Using GPGPU},

   author={Kuresson, Kaupo},


   school={Tartu {"U}likool}


The purpose of this thesis was to benchmark and compare different representations of sparse matrices and algorithms for multiplying them with a vector. Also, to see the performance differences of running the algorithms on a CPU and GPU(s). Four different storage formats were tested – full matrix storage, coordinate storage (COO), ELLPACK (ELL), compressed sparse row (CSR) and compressed diagonal storage (CDS). Performance tests were run on a desktop computer and also on EENet (the Estonian Education and Research Network) worknodes. The EENet worknodes added the opportunity for dividing the workload between their two GPUs. Using OpenCL gave a speedup of 3,6 times over pure C++ code when using the full storage method and basic algorithm. With more complex storage formats, the speed gain was even more distinct. Converting the vectors into images did not give the expected speedup on most cases. Still, it performed slightly better on the nVidia hardware. An option would be to create both kinds on kernels in a situation like this – one with image support, another with normal memory access, and see which one performs better. The conversion into textures requires only slight modifications of the kernel and host-code. The problem with using OpenCl is the need to effectively parallelize the original task and use as much of the available computing power as possible. As the results show, the performance is highly dependent on the type of matrix and hardware used. For an all-round choice the CSR seems to be the best, being the fastest in all tests. This may of course change with the selection of new matrices and further optimization of kernels. The performance benefit when using multiple devices also depends on the type of matrices used – with smaller ones, the additional overhead of creating a new command queue and kernel execution can nullify the advantage of more processing power. With larger matrices, speed ups of up to 60% were noted.
VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)

* * *

* * *

Follow us on Twitter

HGPU group

1660 peoples are following HGPU @twitter

Like us on Facebook

HGPU group

334 people like HGPU on Facebook

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.3
  • SDK: AMD APP SDK 3.0

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: