Solving dense linear systems on platforms with multiple hardware accelerators

Gregorio Quintana-Orti, Francisco D. Igual, Enrique S. Quintana-Orti, Robert A. van de Geijn

Departamento de Ingenieria y Ciencia de Computadores, Universidad Jaume I, 12.071-Castellon, Spain

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2009, PPoPP ’09

DOI:10.1145/1504176.1504196

Source code package: FLAME

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly, so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. Our experimental evaluation on an 8-core Intel Xeon host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performance of around 550 and 450 single-precision GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.
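The abstract does not say which coherence policies the runtime uses, so the following is only an illustrative sketch of the general idea, not SuperMatrix code. It assumes a write-invalidate, write-back policy and a static mapping of output blocks to accelerators; the names `BlockDirectory`, `cholesky_tasks`, `count_transfers`, and `HOST` are hypothetical. It counts the block transfers a tiled Cholesky factorization would need with such a software cache, compared with naively copying every operand in and every result out.

```python
from dataclasses import dataclass, field

HOST = -1  # hypothetical id for host memory; accelerators are 0, 1, 2, ...


@dataclass
class BlockDirectory:
    """Toy directory: which memories hold a valid copy of each block."""
    valid: dict = field(default_factory=dict)  # block -> set of holders
    transfers: int = 0

    def read(self, block, device):
        """Make `block` readable on `device`, copying only on a miss."""
        holders = self.valid.setdefault(block, {HOST})  # blocks start on host
        if device not in holders:
            if HOST not in holders:
                # Write-back: the only valid copy lives on another device.
                self.transfers += 1
                holders.add(HOST)
            if device != HOST:
                self.transfers += 1  # host -> device copy
            holders.add(device)

    def write(self, block, device):
        """`device` overwrites `block`: invalidate every other copy."""
        self.valid[block] = {device}


def cholesky_tasks(n):
    """Yield (name, output_block, input_blocks) for an n-by-n tiled Cholesky."""
    for k in range(n):
        yield ("chol", (k, k), [(k, k)])
        for i in range(k + 1, n):
            yield ("trsm", (i, k), [(k, k), (i, k)])
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                name = "syrk" if i == j else "gemm"
                yield (name, (i, j), [(i, k), (j, k), (i, j)])


def count_transfers(n, num_devices):
    """Return (transfers with the software cache, transfers without it)."""
    directory = BlockDirectory()
    naive = 0
    for _, out, inputs in cholesky_tasks(n):
        # Assumed static mapping: the device that owns the output runs the task.
        device = (out[0] + out[1]) % num_devices
        operands = set(inputs)
        for block in operands:
            directory.read(block, device)
        directory.write(out, device)
        naive += len(operands) + 1  # copy every operand in, the result out
    for block in list(directory.valid):
        directory.read(block, HOST)  # final flush of results to the host
    return directory.transfers, naive
```

Under these assumptions, `count_transfers(2, 1)` returns `(6, 10)`: the cache avoids re-sending blocks that are already valid on the accelerator and defers write-backs until a block is actually needed elsewhere.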

Tags: Computer science, CUDA, GPU cluster, Linear Algebra, nVidia, Package, Task scheduling, Tesla S870

January 17, 2011 by hgpu
