Posts
Jan, 8
Arax: a runtime framework for decoupling applications from heterogeneous accelerators
Today, using multiple heterogeneous accelerators efficiently from applications and high-level frameworks, such as Tensor-Flow and Caffe, poses significant challenges in three respects: (a) sharing accelerators, (b) allocating available resources elastically during application execution, and (c) reducing the required programming effort. In this paper, we present Arax, a runtime system that decouples applications from heterogeneous accelerators […]
Dec, 25
Understanding the Impact of Input Entropy on FPU, CPU, and GPU Power
Power is increasingly becoming a limiting resource in high-performance, GPU-accelerated computing systems. Understanding the range and sources of power variation is essential in setting realistic bounds on rack and system peak power, and developing techniques that minimize energy. While variations arising during manufacturing and other factors like algorithm among others have been previously studied, this […]
Dec, 25
Extending MAGMA Portability with OneAPI
As the architectures of super-computing systems are continually changing, it is important to maintain efficient code portability in order to continue to take advantage of the computing capabilities of the diverse and evolving hardware in these systems. Intel has adopted an open standard programming interface for heterogeneous systems called oneAPI, designed to allow code portability […]
Dec, 25
mu-grind: A Framework for Dynamically Instrumenting HLS-Generated RTL
High-level synthesis compilers (HLS) enable the rapid creation of accelerator circuits. Unfortunately, compiler generated RTL (H-RTL) is inconsistent in terms of quality, hard to comprehend, and tends to be brittle [28, 41]. This paper develops a framework to help HLS compiler architects inspect and profile H-RTL. Prior state-of-the-art tools [23, 57] have predominantly focused on […]
Dec, 25
GPU Load Balancing
Fine-grained workload and resource balancing is the key to high performance for regular and irregular computations on the GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques to build an abstraction that addresses the difficulty of scheduling computations on the GPU. We propose a GPU fine-grained load-balancing abstraction that decouples load […]
Dec, 25
Kernel-as-a-Service: A Serverless Interface to GPUs
Serverless computing has made it easier than ever to deploy applications over scalable cloud resources, all the while driving higher utilization for cloud providers. While this technique has worked well for easily divisible resources like CPU and local DRAM, it has struggled to incorporate more expensive and monolithic resources like GPUs or other application accelerators. […]
Dec, 19
A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware
Agent-Based Simulation (ABS) is a modelling approach where simulated entities i.e., agents, perform actions autonomously and interact with other agents based on a set of rules. ABSs have demonstrated their usefulness in various domains such as transportation, social science, or biology. Agent-based simulators commonly rely vastly on Central Processing Unit (CPU)-based sequential execution. As a […]
Dec, 19
Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views
Programming in low-level imperative languages provides good performance but is error-prone. In contrast, high-level functional programming is usually free from low-level errors but performance suffers from costly abstractions. To benefit from both worlds, approaches like Lift compile from high-level functional programs to high-performance imperative code. However, problems such as removing high-level abstraction costs and handling […]
Dec, 19
A Study on the Intersection of GPU Utilization and CNN Inference
There has been significant progress in developing neural network architectures that both achieve high predictive performance and that also achieve high application-level inference throughput (e.g., frames per second). Another metric of increasing importance is GPU utilization during inference: the measurement of how well a deployed neural network uses the computational capabilities of the GPU on […]
Dec, 19
FLIA: Architecture of Collaborated Mobile GPU and FPGA Heterogeneous Computing
Accelerators, such as GPUs (Graphics Processing Unit) that is suitable for handling highly parallel data, and FPGA (Field Programmable Gate Array) with algorithms customized architectures, are widely adopted. The motivation is that algorithms with various parallel characteristics can efficiently map to the heterogeneous computing architecture by collaborated GPU and FPGA. However, current applications always utilize […]
Dec, 19
Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)
This paper introduces the Yet Another Kernel Launcher (YAKL) C++ portability library, which strives to enable user-level code with the look and feel of Fortran code. The intended audience includes both C++ developers and Fortran developers unfamiliar with C++. The C++ portability approach is briefly explained, YAKL’s main features are described, and code examples are […]
Dec, 11
Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems
We analyze the performance portability of the skeleton-based, single-source multi-backend high-level programming framework SkePU across multiple different CPU–GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of […]