high performance computing on graphics processing units: hgpu.org


A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Liu Li, Liu Li, Yang Guangwen

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Chinese Journal of Electronics, Vol.21, No.1, 2012

@article{liu2012highly,
  title={A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization},
  author={LIU, L. and YANG, G.},
  year={2012}
}


In this paper, we aim to accelerate sparse LU factorization on the GPU. We present a tiled storage format and a parallel algorithm that improve the memory access pattern, and a register blocking method that compresses the on-chip working set. The OpenMP implementation of our algorithm delivers more stable performance across different matrices and outperforms SuperLU and KLU by 1.88 to 6 times on an Intel 8-core CPU for matrices from the Florida matrix collection. Based on this algorithm, we further propose a GPU-CPU hybrid pipelined scheme that overlaps computation on the CPU with computation on the GPU. Compared with the better of SuperLU and KLU on an Intel 8-core CPU, our algorithm achieves a 1.1- to 19.7-fold speedup on the GPU in double precision. Compared with the OpenMP implementation of our algorithm on an Intel 8-core CPU, our GPU implementation achieves a 2-fold speedup in the best cases.
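The abstract does not spell out the tiled storage format, so the following is only a minimal sketch of the general idea, under stated assumptions: the matrix is cut into fixed-size square tiles, only tiles containing a nonzero are kept, and a right-looking LU factorization without pivoting then works tile by tile and skips absent tiles. The names `to_tiles`, `from_tiles`, and `tiled_lu`, the dictionary layout, and the lack of pivoting are all hypothetical choices made for illustration and are not taken from the paper.

```python
# Illustrative sketch: the tile layout, function names, and the absence of
# pivoting are assumptions for this example, not the paper's actual design.

def to_tiles(dense, tile):
    """Keep only the tile x tile blocks of a square matrix that contain a nonzero."""
    n = len(dense)
    assert n % tile == 0, "sketch assumes the size is a multiple of the tile size"
    tiles = {}
    for bi in range(n // tile):
        for bj in range(n // tile):
            block = [list(row[bj * tile:(bj + 1) * tile])
                     for row in dense[bi * tile:(bi + 1) * tile]]
            if any(v != 0.0 for row in block for v in row):
                tiles[(bi, bj)] = block
    return tiles

def from_tiles(tiles, n, tile):
    """Expand the tile dictionary back into a dense n x n list of lists."""
    dense = [[0.0] * n for _ in range(n)]
    for (bi, bj), block in tiles.items():
        for r in range(tile):
            for c in range(tile):
                dense[bi * tile + r][bj * tile + c] = block[r][c]
    return dense

def tiled_lu(tiles, nb, tile):
    """Right-looking LU without pivoting; L (unit diagonal) and U overwrite the tiles."""
    for k in range(nb):
        d = tiles[(k, k)]  # assumes every diagonal tile exists with nonzero pivots
        for p in range(tile):  # dense LU of the diagonal tile
            for r in range(p + 1, tile):
                d[r][p] /= d[p][p]
                for c in range(p + 1, tile):
                    d[r][c] -= d[r][p] * d[p][c]
        for j in range(k + 1, nb):  # row panel: U_kj = L_kk^{-1} A_kj
            b = tiles.get((k, j))
            if b is None:
                continue
            for r in range(tile):
                for p in range(r):
                    for c in range(tile):
                        b[r][c] -= d[r][p] * b[p][c]
        for i in range(k + 1, nb):  # column panel: L_ik = A_ik U_kk^{-1}
            b = tiles.get((i, k))
            if b is None:
                continue
            for r in range(tile):
                for c in range(tile):
                    for p in range(c):
                        b[r][c] -= b[r][p] * d[p][c]
                    b[r][c] /= d[c][c]
        for i in range(k + 1, nb):  # Schur update, skipping absent tiles
            for j in range(k + 1, nb):
                lik, ukj = tiles.get((i, k)), tiles.get((k, j))
                if lik is None or ukj is None:
                    continue
                # A missing target tile is created here (fill-in at tile level).
                t = tiles.setdefault((i, j), [[0.0] * tile for _ in range(tile)])
                for r in range(tile):
                    for c in range(tile):
                        t[r][c] -= sum(lik[r][p] * ukj[p][c] for p in range(tile))
    return tiles
```

In this sketch each tile is a small dense block, so the inner loops touch contiguous data and never branch on individual nonzeros. That is the kind of regular access pattern a tiled layout is meant to provide, though a real GPU implementation would likely use flat arrays rather than Python dictionaries.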

Tags: Algorithms, Compression, Computer science, CUDA, Factorization, nVidia, Sparse matrix, Tesla C1060

January 8, 2012 by hgpu

