Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads

hgpu.org » Applications » Computer science » Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads

Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads

Yanwen Xu, Ang Li, Tyler Sorensen

University of California, Santa Cruz

2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’23), 2023

BibTeX

Download (PDF)

View

Source

Source codes

Package:

Redwood heterogeneous framework

808

views

Shared memory heterogeneous systems are now mainstream, with nearly every mobile phone and tablet containing integrated processing units. However, developing applications for such devices is difficult as workloads must be decomposed across different processing units, and the decomposition must be flexible to account for the growing diversity of devices, each with different relative processing unit throughput. Furthermore, many devices require distinct programming front ends, requiring significant effort to write cross-platform applications. In this work, we identify a pragmatic class of applications, which we call traverse-compute applications, that are ideal for shared memory heterogeneous systems. These applications have a flexible heterogeneous decomposition where CPUs excel at traversing a tree structure, while accelerators excel at node computations. Leveraging this insight, we present Redwood: a framework for writing heterogeneous traverse-compute workloads. Redwood provides a simple processing unit abstraction and a tree traversal library that enables heterogeneous optimizations. Using Redwood, we implement Grove, a benchmark suite containing nine pragmatic tree traversal applications, e.g., k-nearest neighbors. We instantiate Redwood for three different heterogeneous programming platforms: CUDA, SYCL, and High-Level Synthesis; we use Grove to evaluate five shared memory heterogeneous systems. Our evaluation highlights the importance of flexible heterogeneous decomposition as the optimal parameters differ widely across platforms and applications. However, once optimally configured, heterogeneous implementations can provide up to 13.53x speedups (geomean of 3.01x) over homogeneous implementations, showcasing the potential of heterogeneous computing for these workloads.

Tags: Benchmarking, Computer science, CUDA, FPGA, Heterogeneous systems, HLS, Intel UHD 630, nVidia, nVidia Jetson AGX Xavier, nVidia Jetson Nano, Package, Performance, SYCL

May 7, 2023 by hgpu

No votes yet.

Please wait...

high performance computing on graphics processing units: hgpu.org