
Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures

Vignesh Trichy Ravi
The Ohio State University, 2012

@phdthesis{ravi2012runtime,
  title  = {Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures},
  author = {Ravi, V.T.},
  year   = {2012},
  school = {The Ohio State University}
}


In recent years, multi-core CPUs and many-core GPUs have emerged as mainstream and cost-effective means of scaling. Consequently, heterogeneous computing platforms consisting of both CPUs and GPUs are receiving wide attention, and such architectures are now pervasive across notebooks, desktops, clusters, supercomputers, and cloud environments. While they offer enormous computing potential, state-of-the-art software support lacks many of the features needed to improve the performance and utilization of these systems. We focus on three important problems: (i) although machines with both a multi-core CPU and a GPU are widely available, there is no standard software support that enables an application to harness the aggregate compute power of both devices; (ii) although GPUs offer very high peak performance, their utilization is often low, which is a serious concern in heavily shared cloud environments, and while resource sharing is a classic way to improve utilization, there is no software support to truly share GPUs; and (iii) in shared supercomputers and cloud environments, a critical software component is the job scheduler, which aims to improve resource utilization and maximize aggregate throughput, so we formulate and revisit scheduling problems for CPU-GPU clusters.

For the first problem, we have developed a runtime system that enables an application to simultaneously benefit from the aggregate computing power of the available CPU and GPU. Starting from a high-level API, the runtime system transparently handles concurrency control and automatically distributes work between the CPU and the GPU. This work has been extended and optimized for the structured-grid computation pattern. Our evaluation shows that significant performance benefits can be achieved while also improving user productivity.

For the second problem, we have developed a framework with runtime support that enables one or more applications to transparently share one or more GPUs. We use consolidation as the mechanism for sharing a GPU and address the conceptual problems of consolidation through an affinity score and molding. In particular, the affinity score between two or more kernels indicates the potential performance improvement from consolidating them, while molding enables efficient GPU sharing among kernels with conflicting resource requirements. We demonstrate significant performance improvements from these GPU sharing mechanisms.

For the third problem, our scheduling formulations actively exploit the portability offered by programming models such as OpenCL to automatically map jobs to CPU and GPU resources in a cluster. Based on this assumption, we have developed a number of scheduling schemes with two different goals: one targets system-wide metrics such as global throughput (makespan) and latency, while the other targets market-based metrics (known as value or yield) agreed upon between the user and the service provider. Our scheduling schemes improve utilization, and thus global throughput, by minimizing resource idle time and by efficiently handling the trade-off between queuing delay and the penalty of running on a non-optimal resource. When the goal is to improve yield, the scheduling decisions also take into account various parameters of the value functions, in addition to the aforementioned trade-offs. Our experimental results show that these schemes can significantly outperform state-of-the-art solutions in practice.
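To make the dynamic work-distribution idea of the first problem concrete, the following is a minimal C++ sketch, not the thesis runtime itself: a shared atomic chunk counter is drained by a CPU worker and a GPU worker, so the faster device naturally claims more of the work. The process_chunk function and its per-chunk costs are hypothetical stand-ins for real device kernels.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr int kNumChunks = 64;   // total work, split into equal chunks
std::atomic<int> next_chunk{0};  // shared counter: index of the next unclaimed chunk

// Hypothetical stand-in for running one chunk on a device; the sleep models
// per-chunk execution time (the GPU is assumed faster purely for the demo).
void process_chunk(int /*chunk*/, int cost_us) {
    std::this_thread::sleep_for(std::chrono::microseconds(cost_us));
}

// Each worker repeatedly claims the next unprocessed chunk, so the faster
// device ends up doing more of the work -- the essence of dynamic
// (rather than static) CPU-GPU work distribution.
void worker(const char* name, int cost_us) {
    int done = 0;
    for (int c = next_chunk.fetch_add(1); c < kNumChunks;
         c = next_chunk.fetch_add(1)) {
        process_chunk(c, cost_us);
        ++done;
    }
    std::printf("%s processed %d of %d chunks\n", name, done, kNumChunks);
}

int main() {
    std::thread cpu(worker, "CPU", 800);  // slower device
    std::thread gpu(worker, "GPU", 200);  // faster device
    cpu.join();
    gpu.join();
    return 0;
}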
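The affinity-score and molding ideas of the second problem can be illustrated with a toy model. The sketch below assumes each kernel is summarized by the fraction of GPU compute and memory bandwidth it demands and treats consolidation as attractive when the combined demand on each resource fits within capacity; the actual metric and molding policy are defined in the dissertation, so the profile fields and the 1.0 threshold here are illustrative assumptions only.

#include <algorithm>
#include <cstdio>

struct KernelProfile {
    double sm_fraction;   // fraction of streaming multiprocessors it keeps busy
    double mem_fraction;  // fraction of memory bandwidth it consumes
};

// Toy affinity score: consolidation looks attractive when the two kernels
// stress different resources, i.e. their combined demand on each resource
// stays within capacity (<= 1.0). Higher is better.
double affinity_score(const KernelProfile& a, const KernelProfile& b) {
    double compute = a.sm_fraction + b.sm_fraction;
    double memory  = a.mem_fraction + b.mem_fraction;
    return 2.0 - std::max(compute, memory);
}

// Toy "molding": shrink one kernel's resource footprint (e.g. launch fewer
// thread blocks) so the pair fits, trading its own speed for better sharing.
KernelProfile mold(KernelProfile k, double scale) {
    k.sm_fraction  *= scale;
    k.mem_fraction *= scale;
    return k;
}

int main() {
    KernelProfile compute_bound{0.8, 0.2};
    KernelProfile memory_bound{0.3, 0.7};

    double score = affinity_score(compute_bound, memory_bound);
    std::printf("affinity score = %.2f -> %s\n", score,
                score >= 1.0 ? "consolidate" : "try molding");

    if (score < 1.0) {
        KernelProfile molded = mold(memory_bound, 0.5);
        std::printf("after molding: score = %.2f\n",
                    affinity_score(compute_bound, molded));
    }
    return 0;
}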
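The queuing-delay versus non-optimal-resource trade-off of the third problem can likewise be shown with a toy greedy scheduler: each job is placed on whichever resource (CPU or GPU node) gives it the earliest completion time, so a job may accept a slower device rather than wait in the GPU queue. The job run-time estimates and the earliest-completion rule are illustrative assumptions, not the thesis's actual scheduling schemes.

#include <algorithm>
#include <cstdio>
#include <string>

struct Job {
    std::string name;
    double cpu_time;   // estimated run time on a CPU node (seconds)
    double gpu_time;   // estimated run time on a GPU node (seconds)
};

struct Resource {
    std::string name;
    double free_at;    // time at which the resource becomes idle (seconds)
};

// Place each job on whichever resource minimizes its completion time
// (queuing delay + run time), even if that means paying a penalty by
// running on the non-preferred device. This keeps both resource types
// busy and lowers makespan compared to always waiting for the GPU.
void schedule(Job jobs[], int n, Resource& cpu, Resource& gpu) {
    for (int i = 0; i < n; ++i) {
        double cpu_finish = cpu.free_at + jobs[i].cpu_time;
        double gpu_finish = gpu.free_at + jobs[i].gpu_time;
        Resource& pick = (gpu_finish <= cpu_finish) ? gpu : cpu;
        double finish  = (gpu_finish <= cpu_finish) ? gpu_finish : cpu_finish;
        pick.free_at = finish;
        std::printf("%s -> %s, finishes at %.0fs\n",
                    jobs[i].name.c_str(), pick.name.c_str(), finish);
    }
    std::printf("makespan = %.0fs\n", std::max(cpu.free_at, gpu.free_at));
}

int main() {
    // j3 ends up on the CPU: running on the slower device beats waiting
    // behind the other jobs in the GPU queue.
    Job jobs[] = {{"j1", 60, 20}, {"j2", 50, 25}, {"j3", 40, 30}, {"j4", 45, 35}};
    Resource cpu{"CPU node", 0.0}, gpu{"GPU node", 0.0};
    schedule(jobs, 4, cpu, gpu);
    return 0;
}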