high performance computing on graphics processing units: hgpu.org

Register packing for cyclic reduction: a case study

Andrew Davidson, John D. Owens

University of California, Davis

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, 2011

DOI:10.1145/1964179.1964185

@inproceedings{davidson2011register,
  title={Register packing for cyclic reduction: a case study},
  author={Davidson, A. and Owens, J.D.},
  booktitle={Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units},
  pages={4},
  year={2011},
  organization={ACM}
}

We generalize a method for avoiding GPU shared-memory communication in computations that follow a down-sweep pattern, and apply it to Cyclic Reduction (CR), a tridiagonal solver with this pattern. Previously, CR performed poorly compared to other tridiagonal solvers on the GPU because of shared-memory bandwidth bottlenecks and low step efficiency. We address this problem with our down-sweep methodology for reducing shared-memory communication. Our remapping also allows CR to solve larger systems directly in a virtual block. With this generalized mapping, CR runs 3-4.5x faster on a GPU than the original CR implementation, making it 1.5-3x faster than other GPU tridiagonal solvers.
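For readers unfamiliar with Cyclic Reduction, the following is a minimal serial sketch of the textbook algorithm in Python. It is not the authors' GPU implementation and does not show register packing or any shared-memory mapping. It only illustrates the forward-reduction and back-substitution phases that a GPU version would parallelize. The function name `cyclic_reduction`, the argument layout (`a` sub-diagonal, `b` diagonal, `c` super-diagonal, `d` right-hand side), and the restriction to n = 2^k - 1 unknowns are assumptions made for this example. No pivoting is done, so a diagonally dominant system is assumed.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic reduction.

    Illustrative serial sketch: assumes n = 2**k - 1 unknowns,
    a[0] == 0, c[n-1] == 0, and no need for pivoting.
    """
    n = len(b)
    if n == 0 or (n + 1) & n != 0:
        raise ValueError("this sketch assumes n = 2**k - 1 unknowns")
    # Work on float copies so the caller's lists are left untouched.
    a, b, c, d = (list(map(float, v)) for v in (a, b, c, d))
    x = [0.0] * n

    # Forward reduction: at each level, eliminate the neighbours at
    # distance `stride` from every equation that stays active.
    stride = 1
    while 2 * stride <= n:
        for i in range(2 * stride - 1, n, 2 * stride):
            lo, hi = i - stride, i + stride  # hi < n because n = 2**k - 1
            k1 = a[i] / b[lo]
            k2 = c[i] / b[hi]
            b[i] -= c[lo] * k1 + a[hi] * k2
            d[i] -= d[lo] * k1 + d[hi] * k2
            a[i] = -a[lo] * k1
            c[i] = -c[hi] * k2
        stride *= 2

    # Back substitution: solve the single remaining equation, then
    # fill in the unknowns that were eliminated at each earlier level.
    while stride >= 1:
        for i in range(stride - 1, n, 2 * stride):
            left = x[i - stride] if i - stride >= 0 else 0.0
            right = x[i + stride] if i + stride < n else 0.0
            x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
        stride //= 2
    return x


# Usage: a 7-unknown diagonally dominant system.
n = 7
sub = [0.0] + [-1.0] * (n - 1)
diag = [4.0] * n
sup = [-1.0] * (n - 1) + [0.0]
rhs = [1.0] * n
solution = cyclic_reduction(sub, diag, sup, rhs)
```

Each level of either loop touches elements at a power-of-two stride, and the iterations within one level are independent of each other. On a GPU these are the steps that would run in parallel and exchange data between threads.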

Tags: Algorithms, Computer science, CUDA, Mathematical Software, nVidia, nVidia GeForce GTX 460, Performance

September 22, 2011 by hgpu
