Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

hgpu.org » Applications » Computer science » Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

Da Li, Hancheng Wu, Michela Becchi

Dept. of Electrical and Computer Engineering, University of Missouri – Columbia

International Conference on Parallel Processing (ICPP ’15), 2015

BibTeX

Download (PDF)

View

Source

1634

views

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naive mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed to increase the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear in the presence of recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.

Tags: Computer science, nVidia, OpenCL, Tesla K20

December 4, 2015 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org