
Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms

Bin Ren
The Ohio State University, 2014

@phdthesis{ren2014supporting,
   title={Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms},
   author={Ren, Bin},
   year={2014},
   school={The Ohio State University}
}


SIMD accelerators and many-core coprocessors offering both coarse-grained and fine-grained parallelism have become increasingly popular. Streaming SIMD Extensions (SSE), Graphics Processing Units (GPUs), and the Intel Xeon Phi (MIC) can provide orders-of-magnitude better performance and efficiency for parallel workloads than single-core CPUs. However, parallelizing irregular applications involving dynamic data structures and irregular memory access on these platforms is not straightforward, due to their intensive control-flow dependences and lack of memory locality. Our efforts focus on three classes of irregular applications: irregular tree and graph traversals, irregular reductions, and dynamically allocated arrays and lists. We explore mechanisms for parallelizing them on various parallel architectures from both fine-grained and coarse-grained perspectives.

We first focus on the traversal of irregular trees and graphs, more specifically, a class of applications that traverse many pointer-intensive data structures, e.g., random forests and regular expression matching, on fine-grained SIMD architectures such as SSE and GPUs. We address this problem by developing an intermediate language for specifying such traversals, followed by a run-time scheduler that maps traversals to SIMD units. A key idea in our run-time scheme is converting branches into arithmetic operations, which then allows us to use SIMD hardware. To make our approach fast, we demonstrate several optimizations, including a stream compaction method that aids with control flow in SIMD, a set of layouts that reduce memory latency, and a tiling approach that enables more effective prefetching. Using our approach, we demonstrate significant increases in single-core performance over optimized baselines for two applications.

However, different SIMD architectures have different features, so a significant challenge beyond our previous work is to automatically optimize applications for various architectures, i.e., to achieve performance portability. Moreover, the memory hierarchy is one of the first architectural features programmers look to when optimizing their applications. We therefore design a portable optimization engine for accelerating irregular data traversal applications on various SIMD architectures, with an emphasis on improving data locality and hiding memory latency. The contributions of this work are twofold. First, we develop a set of data layout optimizations that improve spatial locality for applications that traverse many irregular data structures; unlike prior data layout optimizations, our approach incorporates a notion of both inter-thread and intra-thread spatial reuse into the layout. Second, we enable performance portability for these applications on SIMD architectures by accurately modeling the impact of inter-thread and intra-thread locality on program performance; as a consequence, our model can predict which data layout optimization to use on a wide variety of SIMD architectures.
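To make the inter-thread versus intra-thread layout distinction above concrete, the following is a minimal C++ sketch, assuming a pool of fixed-size tree nodes flattened into a single array; the Node fields and helper names are illustrative assumptions, not taken from the thesis.

// Intra-thread spatial reuse: each tree's nodes are stored
// contiguously, so a single traversal stays within its own cache lines.
// Inter-thread spatial reuse: node i of every tree is interleaved, so
// SIMD lanes traversing different trees in lockstep touch adjacent memory.

struct Node { float threshold; int feature, left, right; };

inline const Node& intra_layout(const Node* pool, int nodes_per_tree,
                                int tree, int node) {
    return pool[tree * nodes_per_tree + node];   // trees back to back
}

inline const Node& inter_layout(const Node* pool, int num_trees,
                                int tree, int node) {
    return pool[node * num_trees + tree];        // levels interleaved
}

Which indexing scheme wins depends on how much the lanes diverge; choosing between such layouts per architecture is exactly the decision the performance model described above is meant to predict.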
We next explore the possibility of efficiently parallelizing two irregular reduction applications, via data layout optimization, on the Intel Xeon Phi, an emerging many-core coprocessor architecture with long SIMD vectors. During this process, we also identify a general data management problem in the CPU-coprocessor programming model: automating and optimizing the transfer of dynamically allocated data structures between the CPU and coprocessors. For dynamic multi-dimensional arrays, we design a set of compile-time solutions involving heap layout transformation, while for other irregular data structures, such as linked lists, we improve an existing shared-memory runtime solution to reduce transfer costs.

Dynamically allocated data structures such as lists are also commonly used in high-level programming languages, such as Python, to support dynamic, flexible features that increase programmer productivity. To parallelize applications written in such high-level languages on both coarse-grained and fine-grained parallel platforms, we design a compilation system that linearizes dynamic data structures into arrays and invokes low-level multi-core and many-core libraries. A critical issue with our linearization method is that it incurs extra data structure transformation overhead, especially for irregular data structures that are not reused frequently. To address this challenge, we design a set of transformation optimization algorithms, including an inter-procedural Partial Redundancy Elimination (PRE) algorithm, to minimize the data transformation overhead automatically.
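As a rough illustration of the linearization idea, the C++ sketch below flattens a pointer-based list into a contiguous buffer that can be handed to a low-level library or copied to a coprocessor in one bulk transfer; the ListNode type and function name are hypothetical, not the thesis's actual compiler output.

#include <vector>

// Hypothetical node type; the real inputs would be, e.g., Python lists.
struct ListNode {
    double value;
    ListNode* next;
};

// Linearize: chase pointers once on the host, producing a flat array
// that supports a single bulk copy and vectorized processing, instead
// of pointer chasing on the device side.
std::vector<double> linearize(const ListNode* head) {
    std::vector<double> flat;
    for (const ListNode* p = head; p != nullptr; p = p->next)
        flat.push_back(p->value);
    return flat;
}

Because each call re-walks the list, repeated linearization of an unchanged structure is redundant work; hoisting or caching such calls is the kind of redundancy the inter-procedural PRE algorithm described above is designed to eliminate automatically.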