
Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures

Vignesh Trichy Ravi
The Ohio State University, 2012

@phdthesis{ravi2012runtime,
   title={Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures},
   author={Ravi, V. T.},
   year={2012},
   school={The Ohio State University}
}


In recent years, multi-core CPUs and many-core GPUs have emerged as mainstream, cost-effective means of scaling performance. Consequently, heterogeneous computing platforms that combine a CPU and a GPU are receiving wide attention; such architectures are now pervasive across notebooks, desktops, clusters, supercomputers, and cloud environments. While they offer huge computing potential, state-of-the-art software support lacks many of the features needed to improve the performance and utilization of such systems. We focus on three important problems: (i) although machines containing both a multi-core CPU and a GPU are widely available, there is no standard software support that enables an application to harness the aggregate compute power of both devices; (ii) although GPUs offer very high peak performance, their utilization is often low, which is an important concern in heavily shared cloud environments, and while resource sharing is a classic way to improve utilization, there is no software support that truly shares a GPU; and (iii) in shared supercomputers and cloud environments, a critical software component is the job scheduler, which aims to improve resource utilization and maximize aggregate throughput, so we formulate and revisit scheduling problems for CPU-GPU clusters.

For the first problem, we have developed a runtime system that enables an application to benefit simultaneously from the aggregate computing power of the available CPU and GPU. Starting from a high-level API, the runtime system transparently handles concurrency control and automatically and efficiently distributes work between the CPU and GPU. This work has also been extended and optimized for the structured-grid computation pattern. Our evaluation shows that significant performance benefits can be achieved while also improving user productivity.
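The dynamic work distribution described above can be sketched with a simple chunk-queue scheme: each device worker repeatedly pulls the next chunk, so the faster device automatically processes more of the input. This is a minimal illustrative sketch, not the thesis's actual runtime; the function names (`distribute`, `process_cpu`, `process_gpu`) and the chunking policy are assumptions, and Python threads stand in for real CPU and GPU execution paths.

```python
import queue
import threading

def distribute(work_items, process_cpu, process_gpu, chunk_size=4):
    """Dynamically split work between a CPU and a GPU worker.

    Each worker repeatedly pulls the next chunk from a shared queue,
    so the faster device naturally ends up processing more chunks.
    """
    chunks = queue.Queue()
    for i in range(0, len(work_items), chunk_size):
        chunks.put(work_items[i:i + chunk_size])

    results, lock = [], threading.Lock()

    def worker(process):
        while True:
            try:
                chunk = chunks.get_nowait()
            except queue.Empty:
                return  # no chunks left; this worker is done
            out = [process(x) for x in chunk]
            with lock:
                results.extend(out)

    threads = [threading.Thread(target=worker, args=(p,))
               for p in (process_cpu, process_gpu)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because chunks are claimed on demand rather than split statically, no profiling of relative device speeds is required, which is one common way such runtimes keep both devices busy.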
For the second problem, we have developed a framework with runtime support that enables one or more applications to transparently share one or more GPUs. We use consolidation as the mechanism for sharing a GPU, and address the underlying conceptual problems through affinity scores and molding. The affinity score between two or more kernels indicates the potential performance improvement from consolidating them, while molding reshapes kernels with conflicting resource requirements so that they can share a GPU efficiently. We demonstrate significant performance improvements from these GPU sharing mechanisms.

For the third problem, our scheduling formulations actively exploit the portability offered by programming models such as OpenCL to automatically map jobs to the CPU and GPU resources in a cluster. Based on this assumption, we have developed a number of scheduling schemes with two different goals: one targets system-wide metrics such as global throughput (makespan) and latency, while the other targets market-based metrics (known as value or yield) as defined or agreed between the user and the service provider. Our schemes improve utilization, and thus global throughput, by minimizing resource idle time and by efficiently handling the trade-off between queuing delay and the penalty of running on a non-optimal resource. When the goal is to improve yield, the scheduling decisions also factor in the parameters of the value functions, in addition to the aforementioned trade-offs. Our experimental results show that our schemes can significantly outperform state-of-the-art solutions in practice.
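The affinity-score idea above can be illustrated with a greedy pairing sketch: given a scoring function, kernels are paired for co-execution when consolidation is expected to help, and run alone otherwise. This is a hypothetical sketch, not the thesis's algorithm; the `affinity` function's form, the greedy pairing rule, and the zero threshold are all assumptions made for illustration.

```python
def consolidate(kernels, affinity):
    """Greedily pair kernels for GPU co-execution by affinity score.

    `affinity(a, b)` estimates the benefit of running kernels a and b
    together on one GPU (> 0 means consolidation is expected to help).
    Kernels with no beneficial partner are scheduled alone.
    """
    remaining = list(kernels)
    groups = []
    while len(remaining) > 1:
        a = remaining.pop(0)
        # pick the partner with the highest affinity to `a`
        best = max(remaining, key=lambda b: affinity(a, b))
        if affinity(a, best) > 0:
            remaining.remove(best)
            groups.append((a, best))   # consolidate the pair
        else:
            groups.append((a,))        # no good partner: run alone
    if remaining:
        groups.append((remaining[0],))
    return groups
```

In a real system the score would come from profiled resource usage (e.g., compute-bound kernels pairing well with memory-bound ones), and molding would additionally shrink a kernel's resource footprint before pairing.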

* * *

HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors
