Nested Data-Parallelism on the GPU
University of Chicago, 2012
@techreport{bergstrom2012nested,
  title       = {Nested Data-Parallelism on the GPU},
  author      = {Bergstrom, Lars and Reppy, John},
  institution = {University of Chicago},
  year        = {2012}
}
Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than those available on CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that apply uniformly over vectors. Porting algorithms with irregular control flow or data layout that do not easily fit this model requires laborious manual program conversion.

Language support for programming GPUs is rather limited. The low-level GPU languages, such as CUDA and OpenCL, appear general-purpose, but must be used in restricted styles to achieve adequate performance. Some effort has been made to support GPU computation from high-level languages, such as C++, X10, and Haskell, but these mechanisms are limited to computations over flat vectors.

NESL is a first-order functional language, designed by Guy Blelloch, that allows programmers to write irregular-parallel programs (such as parallel divide-and-conquer algorithms) for wide-vector parallel computers. NESL supports nested data parallelism (NDP), in which the elements of a data-parallel computation can themselves be data-parallel computations, and it uses a novel compilation technique, called flattening, to map nested computations onto a SIMD execution model. This work demonstrated that NDP can be used effectively for a wide range of irregular-parallel computations.

Wide-vector machines pose many of the same programming challenges as GPUs, so it is worth exploring whether an NDP language like NESL can be an effective tool for programming GPUs. This paper presents the results of such an exploration.
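To make the flattening idea concrete, here is a minimal sketch (in Python, not NESL, and not the paper's implementation): a nested vector is represented as one flat data vector plus a segment descriptor of lengths, so that a nested operation such as a per-subvector sum can be carried out over the flat representation. All names below (`flatten`, `segmented_sum`) are illustrative, not from the paper.

```python
# Sketch of Blelloch-style flattening: a nested sequence becomes a
# flat data vector plus a segment-length descriptor, the representation
# that lets nested computations map onto a SIMD execution model.

def flatten(nested):
    """Convert a nested list into (flat data, segment lengths)."""
    data = [x for seg in nested for x in seg]
    lengths = [len(seg) for seg in nested]
    return data, lengths

def segmented_sum(data, lengths):
    """Sum each segment of the flat vector; on a real SIMD target this
    would be a single segmented-scan over the whole data vector."""
    sums, i = [], 0
    for n in lengths:            # one entry per segment
        sums.append(sum(data[i:i + n]))
        i += n
    return sums

nested = [[1, 2, 3], [4], [5, 6]]
data, lengths = flatten(nested)
print(segmented_sum(data, lengths))  # -> [6, 4, 11]
```

The point of the representation is that the irregular, nested structure lives entirely in the segment descriptor, while the arithmetic runs over one uniform flat vector, which is what a SIMD machine executes well.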
We describe our port of the NESL implementation to GPUs and present empirical evidence that NDP on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. We also compare our performance with hand-tuned CUDA programs. While our performance does not match that of hand-tuned code, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.
March 31, 2012 by hgpu