high performance computing on graphics processing units: hgpu.org


A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators

Hatem Ltaief, Stanimire Tomov, Rajib Nath, Peng Du and Jack Dongarra

Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville

High Performance Computing for Computational Science – VECPAR 2010, Lecture Notes in Computer Science, 2011, Volume 6449/2011, 93-101

DOI:10.1007/978-3-642-19328-6_11

@article{ltaief2011scalable,
  title={A scalable high performant {C}holesky factorization for multicore with {GPU} accelerators},
  author={Ltaief, H. and Tomov, S. and Nath, R. and Du, P. and Dongarra, J.},
  journal={High Performance Computing for Computational Science--VECPAR 2010},
  pages={93--101},
  year={2011},
  publisher={Springer}
}


We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable, high-performance algorithms for these emerging systems stem from their heterogeneity, their massive parallelism, and the huge gap between the GPUs' compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures already developed for homogeneous multicores and for hybrid GPU-based computing. The result is a scalable hybrid Cholesky factorization of unprecedented performance. In particular, using NVIDIA's Tesla S1070 (four C1060 GPUs, each with 30 cores @ 1.44 GHz) connected to two dual-core AMD Opteron processors @ 1.8 GHz, we reach up to 1.163 TFlop/s in single precision and up to 275 GFlop/s in double precision arithmetic. Compared with the embarrassingly parallel xGEMM over four GPUs, where no communication between GPUs is involved, our algorithm still runs at 73% and 84% of the xGEMM rate in single and double precision arithmetic, respectively.
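To make the structure of such a factorization concrete, here is a minimal Python sketch of a generic tiled (blocked) right-looking Cholesky factorization A = LL^T of a symmetric positive definite matrix. It is an illustration under stated assumptions, not the authors' hybrid CPU-GPU code: the helper names (`potrf`, `trsm`, `update`, `tiled_cholesky`) and the tile size `nb` are hypothetical, chosen only to mirror the LAPACK/BLAS kernels xPOTRF, xTRSM, xSYRK and xGEMM. Each per-tile call is the kind of task a runtime could distribute across CPU cores and GPUs; that scheduling and the CPU-GPU data movement are not shown.

```python
import math

# Hypothetical sketch: a generic tiled (blocked) right-looking Cholesky,
# A = L L^T, for a dense symmetric positive definite matrix stored as a
# list of row lists. It is NOT the paper's hybrid CPU-GPU implementation;
# the helpers only mirror the LAPACK/BLAS tile kernels POTRF, TRSM, SYRK, GEMM.


def potrf(a, r0, r1):
    """Unblocked Cholesky of the already-updated diagonal tile a[r0:r1, r0:r1]."""
    for j in range(r0, r1):
        s = a[j][j] - sum(a[j][p] ** 2 for p in range(r0, j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(s)
        for i in range(j + 1, r1):
            s = a[i][j] - sum(a[i][p] * a[j][p] for p in range(r0, j))
            a[i][j] = s / a[j][j]


def trsm(a, r0, r1, c0, c1):
    """Panel solve: a[r0:r1, c0:c1] <- a[r0:r1, c0:c1] * Lkk^{-T},
    where Lkk is the lower-triangular factor stored in a[c0:c1, c0:c1]."""
    for i in range(r0, r1):
        for j in range(c0, c1):  # forward substitution along the row
            s = a[i][j] - sum(a[i][p] * a[j][p] for p in range(c0, j))
            a[i][j] = s / a[j][j]


def update(a, r0, r1, s0, s1, c0, c1):
    """Trailing update a[r0:r1, s0:s1] -= a[r0:r1, c0:c1] * a[s0:s1, c0:c1]^T.
    Acts as GEMM on off-diagonal tiles and as SYRK (lower half only) on diagonal ones."""
    for i in range(r0, r1):
        for j in range(s0, min(s1, i + 1)):
            a[i][j] -= sum(a[i][p] * a[j][p] for p in range(c0, c1))


def tiled_cholesky(a, nb):
    """Overwrite a with its lower Cholesky factor L using nb x nb tiles; returns a."""
    n = len(a)
    tiles = [(t, min(t + nb, n)) for t in range(0, n, nb)]
    for k, (k0, k1) in enumerate(tiles):
        potrf(a, k0, k1)                      # factor diagonal tile (k, k)
        for i0, i1 in tiles[k + 1:]:
            trsm(a, i0, i1, k0, k1)           # solve the tiles (i, k) below it
        for ii in range(k + 1, len(tiles)):
            i0, i1 = tiles[ii]
            for j0, j1 in tiles[k + 1:ii + 1]:
                update(a, i0, i1, j0, j1, k0, k1)  # update tile (ii, j), j <= ii
    for i in range(n):                        # clear the untouched upper triangle
        for j in range(i + 1, n):
            a[i][j] = 0.0
    return a


if __name__ == "__main__":
    A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    print(tiled_cholesky(A, nb=2))  # [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

For n much larger than nb, most of the roughly n^3/3 floating-point operations of the factorization fall in the trailing `update` calls (the SYRK/GEMM work), which is why xGEMM is a natural performance ceiling for the comparison in the abstract; a factorization that takes t seconds then runs at about n^3/(3t) flop/s.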

Tags: Computer science, CUDA, Factorization, Linear Algebra, nVidia, Tesla S1070

June 4, 2011 by hgpu

