high performance computing on graphics processing units: hgpu.org

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Raymond Flagg, Jason Monk, Yifeng Zhu, Bruce Segee

Department of Electrical and Computer Engineering, University of Maine, Orono, ME, USA

The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’13), 2013

@article{flagg2013optimizing,
  title={Optimizing Data Locality for Iterative Matrix Solvers on CUDA},
  author={Flagg, Raymond and Monk, Jason and Zhu, Yifeng and Segee, Bruce},
  year={2013}
}

Solving systems of linear equations is an important problem that spans almost all fields of science and mathematics. As these systems grow in size, iterative methods are used to solve them. This paper looks at optimizing these methods for CUDA architectures. It discusses a multi-threaded CPU implementation, a GPU implementation, and a data-optimized GPU implementation. The optimized version uses an extra kernel to rearrange the problem data so that the solver performs a minimal number of memory accesses and incurs minimal thread divergence. The baseline GPU implementation achieved a total speedup of 1.60X over the CPU version, whereas the optimized version achieved a total speedup of 1.78X. This paper demonstrates the importance of pre-organizing the data in iterative methods and its impact on performance.
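The abstract does not show the rearranging kernel itself, so the following is only a minimal CPU-side sketch of the general idea, not the authors' method. It assumes a Jacobi solver, a CSR input matrix, and an ELLPACK-style layout as the reordering; all three are choices made for illustration, and the names `reorder` and `jacobi` are hypothetical. The one-time pre-pass pulls the diagonal out of each row and pads the remaining entries to a common width in column-major order. In a CUDA version with one thread per row, that layout would let neighbouring threads read neighbouring addresses and run the same number of loop iterations with no `j != i` branch.

```python
def reorder(row_ptr, col_idx, vals):
    """One-time pre-pass (hypothetical name): split the diagonal out of a CSR
    matrix and pad the off-diagonal entries into a column-major ELL layout."""
    n = len(row_ptr) - 1
    diag = [0.0] * n
    rows = []
    for i in range(n):
        off = []
        for p in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[p] == i:
                diag[i] = vals[p]
            else:
                off.append((col_idx[p], vals[p]))
        rows.append(off)
    width = max(len(off) for off in rows)
    # Padding slots keep column 0 and value 0.0, so they add nothing to a sum.
    ell_cols = [0] * (width * n)
    ell_vals = [0.0] * (width * n)
    for i, off in enumerate(rows):
        for k, (j, a) in enumerate(off):
            # Slot k of row i sits next to slot k of row i + 1 in memory.
            ell_cols[k * n + i] = j
            ell_vals[k * n + i] = a
    return diag, width, ell_cols, ell_vals


def jacobi(diag, width, ell_cols, ell_vals, b, iters):
    """Jacobi iteration on the reordered data, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = [0.0] * n
        for i in range(n):          # would be one GPU thread per row
            s = 0.0
            for k in range(width):  # same trip count for every row
                slot = k * n + i
                s += ell_vals[slot] * x[ell_cols[slot]]
            x_new[i] = (b[i] - s) / diag[i]
        x = x_new
    return x


if __name__ == "__main__":
    # 3x3 diagonally dominant system [[4,1,0],[1,4,1],[0,1,4]] x = [5,6,5]
    layout = reorder([0, 2, 5, 7], [0, 1, 0, 1, 2, 1, 2],
                     [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0])
    print(jacobi(*layout, [5.0, 6.0, 5.0], 50))
```

The pre-pass costs one extra sweep over the matrix, but every later iteration reuses the reordered arrays, which is why paying for the rearrangement up front can make sense in an iterative method. The padding trades some wasted storage for uniform work per row, so this particular layout suits matrices whose rows have similar lengths.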

Tags: CUDA, Mathematics, nVidia, nVidia GeForce GTX 580, nVidia GeForce GTX 680

December 4, 2013 by hgpu
