Solving dense linear systems on platforms with multiple hardware accelerators

Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí, Robert A. van de Geijn
Departamento de Ingeniería y Ciencia de Computadores, Universidad Jaume I, 12.071-Castellón, Spain
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2009, PPoPP ’09

@article{quintana2009solving,
   title={Solving dense linear systems on platforms with multiple hardware accelerators},
   author={Quintana-Ort{\'i}, G. and Igual, F.D. and Quintana-Ort{\'i}, E.S. and van de Geijn, R.A.},
   journal={ACM SIGPLAN Notices},
   volume={44},
   number={4},
   pages={121--130},
   issn={0362-1340},
   year={2009},
   publisher={ACM}
}

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly, so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performance of around 550 and 450 single-precision GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.
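
The idea of a runtime that keeps block copies coherent in software can be illustrated with a small toy model. The sketch below is not the SuperMatrix implementation: the task generator is the textbook tiled Cholesky decomposition, while the function names, the static block-to-device mapping and the write-invalidate policy are all assumptions chosen for illustration. It counts only host-to-device block copies and ignores write-backs.

```python
from collections import defaultdict

def cholesky_tasks(nb):
    """Yield (kernel, output_block, input_blocks) for a tiled lower
    Cholesky factorization on an nb x nb grid of blocks."""
    for k in range(nb):
        yield ("CHOL", (k, k), [])
        for i in range(k + 1, nb):
            yield ("TRSM", (i, k), [(k, k)])
        for i in range(k + 1, nb):
            yield ("SYRK", (i, i), [(i, k)])
            for j in range(k + 1, i):
                yield ("GEMM", (i, j), [(i, k), (j, k)])

def count_transfers(nb, num_devices, use_cache):
    """Count host-to-device block copies in a toy model (hypothetical policy)."""
    cached = defaultdict(set)  # device -> blocks with a valid local copy
    transfers = 0
    for _, out, inputs in cholesky_tasks(nb):
        # Assumed static mapping: the task runs on the device owning its output block.
        dev = (out[0] + out[1]) % num_devices
        # The output block is read-modify-write, so it must be present too.
        for blk in [out] + inputs:
            if not use_cache or blk not in cached[dev]:
                transfers += 1
                cached[dev].add(blk)
        # Write-invalidate: copies of the updated block on other devices become stale.
        for other in range(num_devices):
            if other != dev:
                cached[other].discard(out)
    return transfers

if __name__ == "__main__":
    for use_cache in (False, True):
        print(use_cache, count_transfers(8, 4, use_cache))
```

Comparing the two counts for the same grid shows how reusing valid local copies reduces traffic in this model; the numbers say nothing about the performance figures reported in the paper.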