Posts
Aug 7
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In the academic field, researchers gain access to such resources through High […]
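For a concrete picture of such a job, below is a minimal sketch of a distributed PyTorch training script, assuming a `torchrun` launch from inside a container (e.g., `apptainer exec --nv image.sif torchrun --nnodes=2 --nproc_per_node=4 train.py`); the file and image names are illustrative, not the paper's actual workflow.

```python
# train.py -- minimal distributed data-parallel sketch (illustrative only).
# Assumes a container image that ships PyTorch, CUDA, and NCCL, and a launch
# via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients sync via all-reduce

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(32, 128, device=local_rank)
    y = torch.randint(0, 10, (32,), device=local_rank)
    for _ in range(10):
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```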
Aug 7
Real-Time High-Performance Computing for Embedded Control Systems
Critical real-time systems include a wide spectrum of computer systems whose correct behavior is dictated not only by correct functionality but also by their timely execution with respect to predefined deadlines. The increasing demand for higher performance in these systems has led the industry to recently include embedded Graphics Processing Units (GPUs), mainly for machine […]
Aug 7
COX: Exposing CUDA Warp-Level Functions to CPUs
As CUDA becomes the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms becomes a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to extra steps in the translation, the support is always a […]
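To make "warp-level functions" concrete, here is a pure-Python model of CUDA's `__shfl_down_sync` and the classic warp-sum reduction built on it; these are the SIMT semantics that any CPU port has to reproduce with scalar loops or SIMD. This sketch illustrates the semantics only and says nothing about how COX itself implements the mapping.

```python
# Pure-Python model of one 32-lane warp executing __shfl_down_sync.
WARP_SIZE = 32

def shfl_down_sync(lanes, delta):
    # Each lane i reads the value held by lane i + delta; lanes whose source
    # falls outside the warp keep their own value, as in CUDA.
    return [lanes[i + delta] if i + delta < WARP_SIZE else lanes[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(lanes):
    # Butterfly reduction: after log2(32) = 5 steps, lane 0 holds the sum.
    delta = WARP_SIZE // 2
    while delta >= 1:
        shifted = shfl_down_sync(lanes, delta)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        delta //= 2
    return lanes[0]

print(warp_reduce_sum(list(range(32))))  # 496 == sum(range(32))
```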
Aug 7
Design and Implementation of ShenWei Universal C/C++
The ShenWei many-core series processors powering multiple cutting-edge supercomputers are equipped with their unique on-chip heterogeneous architecture. They have long required programmers to write separate code for the control part on the Management Processing Element (MPE) and the accelerated part on the Compute Processing Element (CPE), which is similar to open standards like OpenCL. Such a programming model […]
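The comparison to OpenCL is a useful mental model: host code and device kernels live in separate worlds and communicate through explicit buffers. Below is a small PyOpenCL example of that split, purely for illustration (ShenWei's own toolchain differs).

```python
# Host program drives a separately compiled device kernel -- the two-part
# programming model the MPE/CPE split resembles. Illustrative OpenCL only.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void scale(__global float *a, const float factor) {
    int i = get_global_id(0);
    a[i] *= factor;
}
"""
prg = cl.Program(ctx, kernel_src).build()      # device code, built separately

a = np.arange(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=a)
prg.scale(queue, a.shape, None, buf, np.float32(2.0))  # launch on the device
cl.enqueue_copy(queue, a, buf)                         # copy the result back
print(a)
```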
Jul 24
Demystifying Dependency Bugs in Deep Learning Stack
Recent breakthroughs in deep learning (DL) techniques have stimulated significant growth in developing DL-enabled applications. These DL applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. A persistent challenge in dependency management across the […]
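A small taste of why this stack is brittle: even within one Python process, several layers report versions that must agree with one another and with the host driver. The introspection below is illustrative only, not the paper's tooling.

```python
# Print the versions that must line up across the DL stack: Python runtime,
# framework, the CUDA toolkit the framework was built against, and cuDNN.
import platform

import torch

print("python:", platform.python_version())
print("torch :", torch.__version__)
print("cuda  :", torch.version.cuda)             # toolkit torch was built with
if torch.cuda.is_available():
    print("cudnn :", torch.backends.cudnn.version())
    print("device:", torch.cuda.get_device_name(0))
```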
Jul 24
CPU-GPU Layer-Switched Low Latency CNN Inference
Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. The embedded CPU and GPU within an HMPSoC can both perform inference using CNNs. However, common practice is to run a CNN on whichever HMPSoC component (CPU or GPU) provides the best performance (lowest latency) for that CNN. […]
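As a toy illustration of the per-layer angle (not the paper's system), one can time individual CNN layers on CPU and GPU and observe that the faster device can differ layer by layer; the sketch below ignores the CPU-GPU transfer costs that a real layer-switching schedule must also weigh.

```python
# Time each layer of a small CNN on CPU and (if present) GPU. Illustrative
# measurement only; device-switch transfer costs are deliberately ignored.
import time

import torch

layers = [torch.nn.Conv2d(3, 32, 3), torch.nn.Conv2d(32, 64, 3),
          torch.nn.Conv2d(64, 128, 3)]

def layer_latency(layer, inp, device, reps=20):
    layer, inp = layer.to(device), inp.to(device)
    with torch.no_grad():
        for _ in range(5):                       # warm-up
            out = layer(inp)
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(reps):
            out = layer(inp)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / reps, out.cpu()

inp = torch.randn(1, 3, 224, 224)
for i, layer in enumerate(layers):
    cpu_t, out = layer_latency(layer, inp, "cpu")
    if torch.cuda.is_available():
        gpu_t, _ = layer_latency(layer, inp, "cuda")
        print(f"layer {i}: cpu {cpu_t*1e3:.2f} ms vs gpu {gpu_t*1e3:.2f} ms")
    else:
        print(f"layer {i}: cpu {cpu_t*1e3:.2f} ms (no GPU available)")
    inp = out                                    # feed the next layer
```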
Jul 24
FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis
FPGAs are emerging in the High-Performance Computing domain thanks to their promise of better energy efficiency and low control latency compared with other devices such as CPUs or GPUs. Despite these benefits, their full inclusion into HPC systems still faces several challenges. First, the complexity of FPGAs makes them more difficult to program compared to […]
Jul 24
Theseus: A Library for Differentiable Nonlinear Optimization
We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several […]
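To make "differentiable nonlinear least squares" concrete, here is a minimal Gauss-Newton loop in plain PyTorch in which every solver step is differentiable, so gradients flow from the fitted parameters back to an outer learnable quantity (here, per-point weights). This is a sketch of the concept only, not Theseus's actual API.

```python
# Differentiable Gauss-Newton: the inner solver is built from autograd-
# friendly ops, so its output can sit inside an outer learning loop.
import torch

x = torch.linspace(0, 1, 20)
y = 2.0 * torch.exp(-1.5 * x)                    # synthetic targets
weights = torch.ones(20, requires_grad=True)     # outer learnable parameter

def residuals(theta):
    a, b = theta
    return weights * (a * torch.exp(b * x) - y)  # fit y ~ a * exp(b * x)

theta = torch.tensor([1.0, -1.0])
for _ in range(10):                              # Gauss-Newton iterations
    J = torch.autograd.functional.jacobian(residuals, theta, create_graph=True)
    r = residuals(theta)
    theta = theta + torch.linalg.solve(J.T @ J, -J.T @ r)

theta.sum().backward()                           # grads reach `weights`
print(theta.detach(), weights.grad.norm())
```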
Jul 24
On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" […]
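The bandwidth argument behind ring-all-reduce is textbook arithmetic (not the paper's scheduling model): with p workers and n bytes of gradients, each worker transfers 2(p-1)/p * n bytes per iteration, almost independent of p, but every step synchronizes the whole ring, so a single contended link stalls every worker.

```python
# Per-worker traffic of ring-all-reduce: a reduce-scatter phase plus an
# all-gather phase, each moving (p - 1) chunks of n / p bytes.
def ring_all_reduce_bytes(n_bytes, p):
    return 2 * (p - 1) / p * n_bytes

n = 100e6                                        # 100 MB of gradients (assumed)
for p in (2, 4, 8, 16):
    sec = ring_all_reduce_bytes(n, p) / 10e9     # assumed 10 GB/s link
    print(f"p={p:2d}: {ring_all_reduce_bytes(n, p)/1e6:6.1f} MB/worker, "
          f"{sec*1e3:5.1f} ms (bandwidth term only)")
```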
Jul 17
Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation
We present a series of dataflow dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. We provide a specialised algorithm to solve these minimisation problems, based […]
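The reduction is easy to play with on a toy dependency graph, e.g. via networkx's vertex-cut routine (purely illustrative; the paper supplies its own specialised algorithm).

```python
# Toy data-dependency graph: nodes are arrays, edges are dependencies, and a
# minimum vertex cut between GPU-resident inputs and host-needed outputs
# marks the cheapest set of arrays to transfer.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("input", "a"), ("input", "b"),   # intermediate arrays on the GPU
    ("a", "c"), ("b", "c"),
    ("c", "output"),                  # result the host must read
])

print(nx.minimum_node_cut(g, "input", "output"))  # {'c'}: copy just 'c'
```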
Jul 17
Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments
With the improvement of global infrastructure, Cyber-Physical Systems (CPS) have become an important component of Industry 4.0. In such systems, applications and machines work together to handle interdependent tasks. Machine learning methods in CPS require the monitoring of computational algorithms, including adopting optimizations, fine-tuning cyber systems, improving resource utilization, as well […]
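As a loose sketch of the energy-aware idea (a toy heuristic with assumed numbers, not the paper's method), a scheduler might greedily place each task on the node with the lowest incremental energy cost while penalising load imbalance:

```python
# Greedy energy-aware placement over three hypothetical nodes; the per-node
# energy costs and the imbalance penalty are assumptions for illustration.
nodes = {"edge-cpu": 1.0, "edge-gpu": 0.4, "cloud": 0.7}  # joules per unit work
load = {n: 0.0 for n in nodes}

def assign(work):
    # minimise this task's energy plus a penalty on the node's current load
    best = min(nodes, key=lambda n: work * nodes[n] + 0.5 * load[n])
    load[best] += work
    return best

for w in [5, 3, 8, 2, 7, 4]:
    print(f"work={w} -> {assign(w):8s} load={load}")
```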
Jul 17
Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading
Following the mass adoption of external accelerators for high performance computing, the overall performance of many applications has become increasingly dependent on relatively small accelerated kernels. As static analysis is fundamentally limited by dynamic values and external definitions, standard ahead-of-time compilation is not always sufficient to achieve the best performance. Furthermore, many users looking to […]