Morph Algorithms on GPUs

Rupesh Nasre, Martin Burtscher, Keshav Pingali

Inst. for Computational Engineering and Sciences, University of Texas at Austin, USA

18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’13), 2013

DOI:10.1145/2442516.2442531

@inproceedings{nasre2013morph,
  title={Morph algorithms on GPUs},
  author={Nasre, Rupesh and Burtscher, Martin and Pingali, Keshav},
  booktitle={Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming},
  pages={147--156},
  year={2013},
  organization={ACM}
}

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs. In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka’s Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges. We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, and conflict detection, along with several optimizations that improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs. This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.
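To make the notion of a morph algorithm concrete, here is a minimal sequential sketch of Boruvka's MST algorithm in Python. It is not the paper's GPU implementation and makes no claim about how the authors structure their code; the function name `boruvka_mst`, the edge-list input format, and the union-find representation of merged components are all assumptions chosen for illustration. What it shows is the morphing behavior itself: each round merges components along their cheapest outgoing edges, so the effective graph shrinks as the algorithm runs.

```python
def boruvka_mst(num_nodes, edges):
    """Sequential Boruvka sketch (hypothetical helper, not the paper's code).

    edges is a list of (u, v, weight) tuples over nodes 0..num_nodes-1.
    Returns (total_weight, chosen_edges) for a minimum spanning forest.
    """
    parent = list(range(num_nodes))  # union-find: one component per node

    def find(x):
        # Follow parent links to the component root, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen_edges = []
    total_weight = 0
    num_components = num_nodes

    while num_components > 1:
        # Phase 1: each component picks its cheapest outgoing edge.
        # Ties are broken by edge index so the choice is consistent.
        cheapest = {}
        for idx, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge is internal to a component, ignore it
            for root in (ru, rv):
                best = cheapest.get(root)
                if best is None or (w, idx) < (edges[best][2], best):
                    cheapest[root] = idx

        if not cheapest:
            break  # remaining components are disconnected from each other

        # Phase 2: contract along the selected edges. This is the step that
        # changes the graph structure: components are merged away.
        for idx in sorted(set(cheapest.values())):
            u, v, w = edges[idx]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                chosen_edges.append((u, v, w))
                total_weight += w
                num_components -= 1

    return total_weight, chosen_edges


example_edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 4), (0, 2, 5)]
total, tree = boruvka_mst(4, example_edges)
print(total, tree)
```

In this sequential version the two phases run one after another without interference. In a parallel setting, many components would select and contract edges at the same time, which is where the concurrent modification and conflict detection concerns raised in the abstract come in.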

Tags: Algorithms, Computer science, CUDA, Graph theory, nVidia, Performance, Programming techniques, Tesla C2070

March 12, 2013 by hgpu
