high performance computing on graphics processing units: hgpu.org


Balancing locality and concurrency: solving sparse triangular systems on GPUs

Andrea Picciau, Gordon E. Inggs, John Wickerson, Eric C. Kerrigan, George A. Constantinides

Department of Electrical and Electronic Engineering, Imperial College London, UK

IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC ’16), 2016

@article{picciau2016balancing,
  title={Balancing locality and concurrency: solving sparse triangular systems on GPUs},
  author={Picciau, Andrea and Inggs, Gordon E and Wickerson, John and Kerrigan, Eric C and Constantinides, George A},
  publisher={IEEE},
  year={2016}
}


Many numerical optimisation problems rely on fast algorithms for solving sparse triangular systems of linear equations (STLs). To accelerate the solution of such equations, two types of approaches have been used: on GPUs, concurrency has been prioritised to the disadvantage of data locality, while on multi-core CPUs, data locality has been prioritised to the disadvantage of concurrency. In this paper, we discuss the interaction between data locality and concurrency in the solution of STLs on GPUs, and we present a new algorithm that balances both. We demonstrate empirically that, subject to there being enough concurrency available in the input matrix, our algorithm outperforms Nvidia’s concurrency-prioritising CUSPARSE algorithm for GPUs. Experimental results show a maximum speedup of 5.8-fold. Our solution algorithm, which we have implemented in OpenCL, requires a pre-processing phase that partitions the graph associated with the input matrix into sub-graphs, whose data can be stored in low-latency local memories. This preliminary analysis phase is expensive, but because it depends only on the input matrix, its cost can be amortised when solving for many different right-hand sides.
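As a rough illustration of the ideas in the abstract, the sketch below solves a small lower-triangular system stored in CSR form. It is not the authors' partitioning algorithm and it is not OpenCL: it uses classic level scheduling, a common concurrency-oriented technique, only to show where concurrency comes from (rows in the same level do not depend on one another) and why an analysis phase that depends only on the matrix can be reused across right-hand sides. The function names `build_levels` and `solve_lower` and the 4x4 example matrix are assumptions made for this example.

```python
from typing import List, Sequence


def build_levels(indptr: Sequence[int], indices: Sequence[int]) -> List[List[int]]:
    """Analysis phase: group rows of a lower-triangular CSR matrix into levels.

    A row's level is one more than the highest level among the rows it
    depends on, so all rows within a level could be solved concurrently.
    This step looks only at the sparsity pattern, never at the right-hand side.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:  # off-diagonal entry: row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    levels: List[List[int]] = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def solve_lower(indptr, indices, data, levels, b) -> List[float]:
    """Solve phase: forward substitution, visiting rows level by level."""
    x = [0.0] * len(b)
    for rows in levels:
        # On a GPU, each row in `rows` could be handled by a separate thread;
        # here the rows are simply processed sequentially.
        for i in rows:
            acc = float(b[i])
            diag = None
            for k in range(indptr[i], indptr[i + 1]):
                j = indices[k]
                if j == i:
                    diag = data[k]
                else:
                    acc -= data[k] * x[j]
            if diag is None or diag == 0:
                raise ValueError(f"row {i} has no non-zero diagonal entry")
            x[i] = acc / diag
    return x


# Hypothetical example matrix (dense view):
#   [2 0 0 0]
#   [0 4 0 0]
#   [1 0 1 0]
#   [0 2 3 5]
indptr = [0, 1, 2, 4, 7]
indices = [0, 1, 0, 2, 1, 2, 3]
data = [2.0, 4.0, 1.0, 1.0, 2.0, 3.0, 5.0]

levels = build_levels(indptr, indices)  # computed once per matrix
x1 = solve_lower(indptr, indices, data, levels, [2.0, 4.0, 3.0, 18.0])
x2 = solve_lower(indptr, indices, data, levels, [4.0, 8.0, 2.0, 4.0])
```

Here `levels` is built once and reused for both right-hand sides, which mirrors the amortisation argument in the abstract. Rows 0 and 1 fall into the same level and are independent, while rows 2 and 3 must wait for earlier results. A locality-aware scheme such as the one the paper describes would instead group rows into sub-graphs sized to fit in local memory, which this sketch does not attempt.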

Tags: Algorithms, AMD FirePro W5000, ATI, Computer science, Linear Algebra, nVidia, nVidia Quadro K4000, OpenCL, Sparse matrix

November 8, 2016 by hgpu
