high performance computing on graphics processing units: hgpu.org

Optimization of linked list prefix computations on multithreaded GPUs using CUDA

Zheng Wei, Joseph JaJa

Department of Electrical and Computer Engineering, Institute for Advanced Computer Studies, University of Maryland, College Park, U.S.A.

In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS) (April 2010), pp. 1-8.

DOI:10.1109/IPDPS.2010.5470455

@conference{wei2010optimization,
  title={Optimization of linked list prefix computations on multithreaded GPUs using CUDA},
  author={Wei, Z. and JaJa, J.},
  booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
  pages={1--8},
  issn={1530-2075},
  year={2010},
  organization={IEEE}
}

We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GeForce 200 series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060.
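The abstract only summarizes the randomization step, so the following is a minimal sequential sketch of one common splitter-based scheme for list prefix sums, not the authors' CUDA implementation. It assumes the list is stored as a successor array `succ` (with `-1` marking the tail) and a value array `val`, and that every node is reachable from `head`. The function name `list_prefix_sums` and the parameters `num_splitters` and `seed` are hypothetical, chosen for illustration. Random splitters cut the list into short sublists; each sublist walk is independent of the others, which is the kind of fine-grain work that could be assigned to one GPU thread per splitter.

```python
import random


def list_prefix_sums(succ, val, head, num_splitters, seed=0):
    """Inclusive prefix sums along a linked list stored in arrays.

    succ[i] is the index of the node after i, or -1 at the tail.
    Assumes a single non-empty list in which every node is reachable from head.
    """
    n = len(succ)
    rng = random.Random(seed)

    # Step 1: pick random splitters. The head is always a splitter so that
    # every node belongs to exactly one sublist.
    extra = min(max(num_splitters - 1, 0), n - 1)
    others = [i for i in range(n) if i != head]
    splitters = [head] + rng.sample(others, extra)
    is_splitter = [False] * n
    for s in splitters:
        is_splitter[s] = True

    # Step 2: each splitter walks its own sublist until it reaches the next
    # splitter or the tail. The iterations of this outer loop do not depend
    # on each other, so they could run in parallel.
    local = [0] * n    # prefix sum within the node's sublist
    owner = [-1] * n   # splitter that owns each node
    total = {}         # sum of each sublist
    next_splitter = {} # splitter that follows each sublist, or -1
    for s in splitters:
        running = 0
        node = s
        while True:
            running += val[node]
            local[node] = running
            owner[node] = s
            node = succ[node]
            if node == -1 or is_splitter[node]:
                break
        total[s] = running
        next_splitter[s] = node

    # Step 3: a short sequential scan over the splitters, in list order,
    # gives the sum of everything that precedes each sublist.
    offset = {}
    acc = 0
    s = head
    while s != -1:
        offset[s] = acc
        acc += total[s]
        s = next_splitter[s]

    # Step 4: add each sublist's offset to its local prefix sums.
    return [local[i] + offset[owner[i]] for i in range(n)]


if __name__ == "__main__":
    # List order is 3 -> 0 -> 4 -> 1 -> 2.
    succ = [4, 2, -1, 0, 1]
    val = [10, 20, 30, 40, 50]
    print(list_prefix_sums(succ, val, head=3, num_splitters=2))
    # prints [50, 120, 150, 40, 100]
```

The result is indexed by node, not by list position, and it does not depend on which splitters are drawn; the random choice only affects how evenly the work is divided among sublists.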

Tags: Computer science, CUDA, List ranking, nVidia, Tesla C1060

November 6, 2010 by hgpu
