Characterizing and Predicting Scientific Workloads for Heterogeneous Computing Systems
Beau Johnston
The Australian National University
The Australian National University, 2019
@phdthesis{johnston2019characterizing,
  title={Characterizing and Predicting Scientific Workloads for Heterogeneous Computing Systems},
  author={Johnston, Beau},
  school={The Australian National University},
  year={2019}
}
The next generation of supercomputers will feature a diverse mix of accelerator devices. The increase in heterogeneity is explained by the nature of supercomputing workloads: certain devices offer acceleration, or a shorter time to completion, for particular application programs. Certain characteristics of these programs are fixed and impose fundamental limitations on the workloads regardless of which accelerator is used for the computation; for instance, a graph traversal program always exhibits the same high-branch, low-computation properties regardless of which device executes it. To support efficient scheduling on High Performance Computing (HPC) systems, it is necessary to make accurate performance predictions for workloads on varied compute devices. This is challenging because diverse computation, communication and memory-access characteristics result in varying performance between devices. On HPC systems a single node may feature a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and a Field-Programmable Gate Array (FPGA) or Many Integrated Core (MIC) device. This work presents a device-independent predictor: a methodology that uses device-independent characteristics of scientific codes to select the optimal accelerator device with regard to execution time or energy expenditure.
Open Computing Language (OpenCL) is a programming model designed to facilitate the development of application codes capable of running on multiple different devices. First released in late 2008, it defines a C-like language used to write kernels that can be compiled to run on these different devices. Implementations of the current release (2.2) exist for CPUs, GPUs, FPGAs and Intel MIC systems, and as such there is increasing interest in the use of OpenCL for developing scientific applications designed to run on next-generation supercomputer systems. This thesis seeks to use the device-independent characteristics of an OpenCL code to select the optimal accelerator device on which to execute each OpenCL kernel. Consideration is given to both execution time and energy usage.
The first focus of this work is a comprehensive benchmark suite for OpenCL in the heterogeneous HPC setting: an extended and enhanced version of the OpenDwarfs OpenCL benchmark suite. Our extensions improve the portability and robustness of the applications, the correctness of results, the choice of problem sizes, and the diversity of coverage through the inclusion of additional application patterns. This work manifests in performance measurements over 12 applications on a set of 15 devices.
We next present our Architecture Independent Workload Characterization (AIWC) tool, which characterizes OpenCL kernels according to a set of architecture-independent features. Features are measured by counting target characteristics collected during program execution in a simulator. They are presented as 28 metrics in four categories: parallelism (how well an algorithm scales in response to core count); compute (the diversity of instructions); memory (working-memory footprint and entropy measurements, which correspond to caching characteristics); and control (branching and program flow).
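To make the memory category concrete, the following is a minimal illustrative sketch, not AIWC's actual implementation, of how an entropy-style metric could be computed from a recorded trace of memory addresses; the trace contents and the function name are placeholders.

import math
from collections import Counter

def memory_address_entropy(addresses):
    # Shannon entropy (in bits) of a memory-address trace: low entropy
    # indicates a small, regularly revisited set of addresses (cache
    # friendly), high entropy indicates widely scattered accesses.
    counts = Counter(addresses)
    total = len(addresses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Placeholder traces for comparison.
strided = [i % 8 for i in range(1024)]                    # 8 addresses, each hit 128 times
scattered = [(i * 2654435761) % 4096 for i in range(1024)]  # 1024 distinct addresses
print(memory_address_entropy(strided))    # 3.0 bits
print(memory_address_entropy(scattered))  # 10.0 bits

A device-dependent profiler would instead report cache hit rates tied to a particular cache hierarchy; an entropy measure of this kind depends only on the program's access pattern.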
The metrics collected are primarily used to predict execution times, but since they represent structural characteristics of the underlying program and are free from architectural traits, they can also be used for diversity analysis of benchmark suites, for identifying program requirements (which allows the automatic calculation of theoretical peak performance on a given device), and for examining differences between kernels to expose the phase-transitional properties of application codes. We also discuss the design decisions made in collecting AIWC features.
Finally, this work culminates in a methodology that uses AIWC features to train a random forest model capable of predicting accelerator execution times. We use this model to predict execution times for a set of 37 computational kernels running on 15 different devices representing a broad range of CPU, GPU and MIC architectures. The predictions are highly accurate, differing from the measured experimental run-times by an average of only 1.2%. A previously unencountered code can be instrumented with AIWC to allow performance prediction across the full range of modelled devices. The results suggest that this methodology supports the correct selection of the most appropriate device for a previously unencountered code and is highly relevant to the efficient scheduling of codes on emerging heterogeneous supercomputing systems.
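As a hedged illustration of this final step (the thesis's concrete model, feature set and tooling may differ), the sketch below trains a random forest regressor that maps 28-element AIWC-style feature vectors to measured kernel run-times for a single device; the data here are synthetic placeholders.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder data: one row per kernel invocation, one column per
# AIWC-style architecture-independent metric; targets are run-times (ms).
rng = np.random.default_rng(0)
X = rng.random((200, 28))
y = 50 + 400 * X[:, 0] + 30 * rng.random(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train one regressor for the target device; because AIWC features are
# device-independent, the same feature vector can be fed to every device's model.
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(X_test)
relative_error = np.abs(predicted - y_test) / y_test
print(f"mean relative error: {relative_error.mean():.1%}")

With one such model per target device, the predicted run-times (or energy estimates) for an unseen kernel can be compared across devices to decide where to schedule it.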