On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest

hgpu.org » Applications » Computer science » On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest

On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest

Abdullah Gharaibeh, Lauro Beltrao Costa, Elizeu Santos-Neto, Matei Ripeanu

Department of Electrical and Computer Engineering, The University of British Columbia

IEEE International Parallel & Distributed Processing Systems (IPDPS), 2013

@article{gharaibeh2013graphs,

title={On Graphs, GPUs, and Blind Dating},

author={Gharaibeh, A. and Costa, L.B. and Santos-Neto, E. and Ripeanu, M.},

year={2013}

}

Download (PDF)

View

Source

2281

views

Graph processing has gained renewed attention. The increasing large scale and wealth of connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable information from large scale graphs. Hybrid systems that host processing units optimized for both fast sequential processing and bulk processing (e.g., GPUaccelerated systems) have the potential to cope with the heterogeneous structure of real graphs and enable high performance graph processing. Reaching this point, however, poses multiple challenges. The heterogeneity of the processing elements (e.g., GPUs implement a different parallel processing model than CPUs and have much less memory) and the inherent irregularity of graph workloads require careful graph partitioning and load assignment. In particular, the workload generated by a partitioning scheme should match the strength of the processing element the partition is allocated to. This work explores the feasibility and quantifies the performance gains of such low-cost partitioning schemes. We propose to partition the workload between the two types of processing elements based on vertex connectivity. We show that such partitioning schemes offer a simple, yet efficient way to boost the overall performance of the hybrid system. Our evaluation illustrates that processing a 4-billion edges graph on a system with one CPU socket and one GPU, while offloading as little as 25% of the edges to the GPU, achieves 2x performance improvement over state-of–the-art implementations running on a dual-socket symmetric system. Moreover, for the same graph, a hybrid system with dualsocket and dual-GPU is capable of 1.13 Billion breadth-first search traversed edge per second, a performance rate that is competitive with the latest entries in the Graph500 list, yet at a much lower price point.

Tags: Computer science, CUDA, Heterogeneous systems, nVidia, Tesla C2075

January 10, 2013 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org