high performance computing on graphics processing units: hgpu.org


A Tuning Framework for Software-Managed Memory Hierarchies

Manman Ren, Ji Young Park, Mike Houston, Alex Aiken, William J. Dally

Stanford University

Proceedings of the 17th international conference on Parallel architectures and compilation techniques, 2008, p.280-291

@conference{ren2008tuning,
  title={A tuning framework for software-managed memory hierarchies},
  author={Ren, M. and Park, J.Y. and Houston, M. and Aiken, A. and Dally, W.J.},
  booktitle={Proceedings of the 17th international conference on Parallel architectures and compilation techniques},
  pages={280--291},
  year={2008},
  organization={ACM}
}


New architectures are emerging at a rapid pace. Architectures with multiple processing units on a chip and with deep memory hierarchies have become pervasive, while architectures with software-managed memory hierarchies (such as the Sony/Toshiba/IBM Cell processor) have gained popularity. Because of this increased architectural complexity, re-targeting a legacy application to a new architecture requires a great deal of time spent porting and tuning. To achieve both portability and high performance on modern machines, we propose a programming environment that includes a portable language (Sequoia), a portable runtime, and a tuning framework. In this thesis, we focus on the design and implementation of the tuning framework.

Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires meticulous tuning of programs to the machine's particular characteristics. Further, the choices made when tuning a program for one machine will typically be very different from those made when tuning the same program for a different machine. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters that require tuning, ranging (for example) from subarray sizes to compiler flags to loop optimizations to decomposition strategies. Manually searching the resulting large, non-linear space of program parameters is a tedious process of trial and error. These challenges motivate the design of an automatic tuning framework.

In this dissertation, we present a general framework for automatically tuning arbitrary applications to machines with software-managed memory hierarchies. The tuning framework matches the decomposition strategies to the memory hierarchies. It uses a search algorithm, specialized to software-managed memory hierarchies, that achieves good performance quickly due to the smoothness of the search space. The framework also applies a novel fusion algorithm that considers multiple outermost loop levels in a single step. The knowledge learned when searching the tunable space is used to guide the selection of a fusion configuration. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony Playstation 3s. The tuning framework matches or exceeds the performance of the best available hand-tuned version coded in Sequoia.
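The abstract does not spell out the search algorithm, so the following is only an illustrative sketch of why a smooth search space helps: when neighbouring parameter values have similar cost, a simple greedy neighbour search can converge quickly. Everything here is an assumption made for illustration and is not taken from the paper: the function `greedy_tune`, the parameter names `tile_m` and `tile_n`, and the synthetic cost function `toy_cost` (a real tuner would compile and time the program instead).

```python
import math
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def greedy_tune(
    space: Dict[str, List[int]],
    cost: Callable[[Config], float],
    start: Config,
) -> Tuple[Config, float]:
    """Coordinate-wise hill climbing over a discrete parameter space.

    Illustrative only: this is NOT the paper's algorithm. It relies on the
    assumption that the cost surface is smooth, so stepping to an adjacent
    value of one parameter at a time tends to move toward a good optimum.
    """
    current = dict(start)
    best = cost(current)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            idx = values.index(current[name])
            # Try the two adjacent values of this parameter.
            for nidx in (idx - 1, idx + 1):
                if 0 <= nidx < len(values):
                    candidate = dict(current, **{name: values[nidx]})
                    c = cost(candidate)
                    if c < best:
                        current, best = candidate, c
                        improved = True
                        break  # move on to the next parameter
    return current, best


def toy_cost(cfg: Config) -> float:
    """Hypothetical smooth cost with its minimum at tile_m=32, tile_n=64."""
    return (math.log2(cfg["tile_m"]) - 5) ** 2 + (math.log2(cfg["tile_n"]) - 6) ** 2


# Hypothetical tunable space: candidate subarray (tile) sizes.
space = {
    "tile_m": [8, 16, 32, 64, 128],
    "tile_n": [8, 16, 32, 64, 128, 256],
}
best_config, best_cost = greedy_tune(space, toy_cost, {"tile_m": 8, "tile_n": 8})
print(best_config, best_cost)
```

On a non-smooth cost surface this kind of search can stall in a local minimum, which is why the smoothness property mentioned in the abstract matters for a search of this style.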

Tags: Algorithms, Cell processor, Compilers, Computer science, Optimization, Performance, Playstation, Programming Languages

February 26, 2011 by hgpu

