Parallel Implementations of the Cholesky Decomposition on CPUs and GPUs

Joao Paulo Tarasconi Ruschel
Universidade Federal Do Rio Grande Do Sul, Instituto De Informatica
Universidade Federal Do Rio Grande Do Sul, 2016






As Central Processing Units (CPUs) and Graphics Processing Units (GPUs) grow more capable, different approaches and designs for implementing data-intensive algorithms must be studied and compared. This work compares several algorithm designs and parallelization APIs (such as OpenMP, OpenCL, and CUDA) on both CPU and GPU platforms. We used the Cholesky decomposition, a high-level arithmetic algorithm used in many linear algebra problems, as the benchmark because it is easily parallelizable yet has considerable data dependence between elements. We carried out experiments with the different designs and APIs to find the techniques that yield the best performance on each platform. We also compared these implementations with state-of-the-art solutions (such as LAPACK and cuSOLVER) and provide insights into the differences in implementation and performance. Our experiments showed that parallelization on the CPU tends to perform better than on the GPU for this particular kind of algorithm, owing to the algorithm's intrinsically memory-intensive nature and the overhead of memory transfers, and that attempts at code micro-optimization do not offer any significant speedup.
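To make the mix of parallelism and data dependence mentioned in the abstract concrete, here is a minimal sequential sketch of a column-by-column Cholesky decomposition in plain Python. It is an illustration only, not the implementation evaluated in the thesis; the function name `cholesky` and the list-of-lists matrix layout are assumptions chosen for brevity.

```python
import math


def cholesky(a):
    """Return the lower-triangular factor L with L * L^T == a.

    `a` is assumed to be a symmetric positive-definite matrix given as a
    list of lists. Hypothetical helper for illustration only.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: needs every previously computed entry of row j,
        # so the columns must be processed in order (the data dependence).
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j do not depend on each
        # other, so this loop is the natural candidate for parallelization.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L
```

In this sketch the outer loop over columns is inherently sequential, while the inner loop over rows could be distributed across threads. Each inner iteration reads earlier columns of `L`, which hints at why memory traffic matters for this kind of algorithm.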