high performance computing on graphics processing units: hgpu.org

Solving dense linear systems on platforms with multiple hardware accelerators

Gregorio Quintana-Orti, Francisco D. Igual, Enrique S. Quintana-Orti, Robert A. van de Geijn

Departamento de Ingenieria y Ciencia de Computadores, Universidad Jaume I, 12.071-Castellon, Spain

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2009, PPoPP ’09

DOI:10.1145/1504176.1504196

@article{quintana2009solving,
  title={Solving dense linear systems on platforms with multiple hardware accelerators},
  author={Quintana-Ort{\'i}, G. and Igual, F.D. and Quintana-Ort{\'i}, E.S. and van de Geijn, R.A.},
  journal={ACM SIGPLAN Notices},
  volume={44},
  number={4},
  pages={121--130},
  issn={0362-1340},
  year={2009},
  publisher={ACM}
}

Package: FLAME

In a previous PPoPP paper we showed how the FLAME methodology, combined with the SuperMatrix runtime system, yields a simple yet powerful solution for programming dense linear algebra operations on multicore platforms. In this paper we provide further evidence that this approach solves the programmability problem for this domain by targeting a more complex architecture, composed of a multicore processor and multiple hardware accelerators (GPUs, Cell B.E., etc.), each with its own local memory, resulting in a platform more reminiscent of a heterogeneous distributed-memory system. In particular, we show that the FLAME programming model accommodates this new situation effortlessly, so that no significant change needs to be made to the codebase. All complexity is hidden inside the SuperMatrix runtime scheduling mechanism, which incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. Our experimental evaluation on an Intel Xeon 8-core host linked to an NVIDIA Tesla S870 platform with four GPUs delivers peak performances around 550 and 450 (single-precision) GFLOPS for the matrix-matrix product and the Cholesky factorization, respectively, which we believe to be the best performance numbers posted on this new architecture for such operations.
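The idea of a runtime that applies software cache coherence to matrix tiles can be illustrated with a small model. The sketch below is not the SuperMatrix implementation and makes several assumptions: the helper names (`cholesky_tasks`, `owner`, `TileCache`, `count_transfers`) are hypothetical, tiles are mapped to a 2x2 grid of devices in a block-cyclic fashion, and coherence follows a simple write-invalidate policy. It only counts how many tile transfers into device memory a blocked Cholesky factorization would need with and without keeping tiles cached on the devices.

```python
def cholesky_tasks(nt):
    """Enumerate tile-level tasks of a right-looking blocked Cholesky
    factorization on an nt x nt grid of tiles (lower triangle only).
    Each task is (kernel, output_tile, input_tiles)."""
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", (k, k), []))              # factor diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", (i, k), [(k, k)]))     # solve panel tile
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):                # update trailing tiles
                if i == j:
                    tasks.append(("SYRK", (i, j), [(i, k)]))
                else:
                    tasks.append(("GEMM", (i, j), [(i, k), (j, k)]))
    return tasks


def owner(tile, rows=2, cols=2):
    """Assumed 2-D block-cyclic mapping of a tile to a device id."""
    i, j = tile
    return (i % rows) * cols + (j % cols)


class TileCache:
    """Hypothetical model: tracks which tiles are valid in each device memory."""

    def __init__(self, num_devices):
        self.valid = [set() for _ in range(num_devices)]
        self.transfers = 0

    def read(self, dev, tile):
        # A miss costs one transfer into the device's local memory.
        if tile not in self.valid[dev]:
            self.transfers += 1
            self.valid[dev].add(tile)

    def write(self, dev, tile):
        # Output tiles are read-modify-write; other copies become stale.
        self.read(dev, tile)
        for d, tiles in enumerate(self.valid):
            if d != dev:
                tiles.discard(tile)


def count_transfers(nt, rows=2, cols=2, use_cache=True):
    """Count tile transfers when each task runs on the owner of its output."""
    cache = TileCache(rows * cols)
    for _kernel, out, inputs in cholesky_tasks(nt):
        dev = owner(out, rows, cols)
        if not use_cache:
            cache.valid[dev].clear()   # forget everything: always re-transfer
        for tile in inputs:
            cache.read(dev, tile)
        cache.write(dev, out)
    return cache.transfers


if __name__ == "__main__":
    for nt in (4, 8, 16):
        print(nt, count_transfers(nt, use_cache=False), count_transfers(nt))
```

Running the script prints, for each tile count, the number of transfers without and with the cached model, which gives a feel for why keeping tiles resident on the accelerators matters. Real runtimes also have to account for limited device memory and for writing results back to the host, which this sketch ignores.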

Tags: Computer science, CUDA, GPU cluster, Linear Algebra, nVidia, Package, Task scheduling, Tesla S870

January 17, 2011 by hgpu
