Posts
Aug, 21
Optimization of GPU workloads using natural language processing based on deep learning techniques
Setting program parameters is challenging due to the abstract relationship between hardware and software. Accurate automatic optimization algorithms are required to cope with the complexity and variety of current hardware and software. Autotuning has always relied on time-consuming trial-and-error approaches. Machine learning (ML) and Natural Language Processing (NLP) have flourished over […]
Aug, 7
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In the academic field, researchers get access to this kind of resource through High […]
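The distributed-training piece of such a workflow commonly reduces to a data-parallel script launched, e.g. via torchrun, inside the container on each node. A generic sketch with placeholder model, data, and loss (this is standard PyTorch DDP plumbing, not the paper's specific workflow):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched on each node, e.g.: torchrun --nnodes=2 --nproc_per_node=4 train.py
# (torchrun sets RANK/LOCAL_RANK/WORLD_SIZE). Model, data, and loss below are
# placeholders; only the DDP plumbing is the point.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).to(f"cuda:{local_rank}"),
            device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                       # stand-in training loop
    x = torch.randn(32, 128, device=f"cuda:{local_rank}")
    loss = model(x).square().mean()           # dummy objective
    opt.zero_grad()
    loss.backward()                           # DDP all-reduces gradients here
    opt.step()

dist.destroy_process_group()
```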
Aug, 7
Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities
To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the […]
Aug, 7
Real-Time High-Performance Computing for Embedded Control Systems
Critical real-time systems include a wide spectrum of computer systems whose correct behavior is dictated not only by correct functionality but also by their timely execution with respect to predefined deadlines. The increasing demand for higher performance in these systems has led the industry to recently include embedded Graphics Processing Units (GPUs), mainly for machine […]
Aug, 7
COX: Exposing CUDA Warp-Level Functions to CPUs
As CUDA becomes the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms becomes a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to extra steps in the translation, the support is always a […]
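Conceptually, supporting warp-level functions on a CPU means making a warp's 32 implicit lanes explicit. The Python sketch below illustrates the semantics of __shfl_down_sync in a warp-wide sum reduction; it shows the behaviour being translated, not COX's actual translation scheme:

```python
# The 32 implicit lanes of a warp become an explicit list, and
# __shfl_down_sync becomes an index shift over that list.
WARP_SIZE = 32

def shfl_down(lanes, delta):
    # Lane i reads the value held by lane i + delta; out-of-range lanes
    # keep their own value, matching __shfl_down_sync's behaviour.
    return [lanes[i + delta] if i + delta < WARP_SIZE else lanes[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(lanes):
    # Classic warp tree reduction: after log2(32) = 5 shuffle steps,
    # lane 0 holds the sum of all 32 lanes.
    delta = WARP_SIZE // 2
    while delta > 0:
        lanes = [a + b for a, b in zip(lanes, shfl_down(lanes, delta))]
        delta //= 2
    return lanes[0]

print(warp_reduce_sum(list(range(WARP_SIZE))))  # 496 == sum(range(32))
```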
Aug, 7
Design and Implementation of ShenWei Universal C/C++
The ShenWei many-core series processors powering multiple cutting-edge supercomputers are equipped with a unique on-chip heterogeneous architecture. They have long required programmers to write separate code for the control part on the Management Processing Element (MPE) and the accelerated part on the Compute Processing Element (CPE), which is similar to open standards like OpenCL. Such a programming model […]
Jul, 24
Demystifying Dependency Bugs in Deep Learning Stack
Recent breakthroughs in deep learning (DL) techniques have stimulated significant growth in developing DL-enabled applications. These DL applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. A persistent challenge in dependency management across the […]
Jul, 24
CPU-GPU Layer-Switched Low Latency CNN Inference
Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. The embedded CPU and GPU within an HMPSoC can both perform inference using CNNs. However, common practice is to run a given CNN on whichever HMPSoC component (CPU or GPU) provides the best performance (lowest latency) for that CNN. […]
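Mechanically, switching layers between processors amounts to splitting the network at some layer and moving the activation across devices once. A minimal PyTorch sketch, with an arbitrary split point standing in for whatever a latency-aware scheduler would choose:

```python
import torch
import torch.nn as nn

# Toy CNN split into two stages; the switch point here is arbitrary, purely
# for illustration. A real system would pick it from per-layer latency data.
stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
stage2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

gpu = "cuda" if torch.cuda.is_available() else "cpu"
stage1.to(gpu)      # early, convolution-heavy layers on the GPU
stage2.to("cpu")    # remaining layers on the CPU

x = torch.randn(1, 3, 224, 224, device=gpu)
with torch.no_grad():
    y = stage2(stage1(x).cpu())   # one activation transfer at the switch point
print(y.shape)                    # torch.Size([1, 10])
```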
Jul, 24
FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis
FPGAs are emerging in the High-Performance Computing domain thanks to their promise of better energy efficiency and lower control latency compared with other devices such as CPUs or GPUs. Despite these benefits, their full inclusion into HPC systems still faces several challenges. First, FPGA complexity makes them more difficult to program compared to […]
Jul, 24
Theseus: A Library for Differentiable Nonlinear Optimization
We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several […]
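The essence of DNLS is that the solver itself is differentiable, so gradients of an outer, learned objective can flow through the inner optimization. A rough sketch in plain PyTorch of unrolled Gauss-Newton on an invented exponential model (a conceptual illustration only, not Theseus's API):

```python
import torch

# Gauss-Newton steps are unrolled with create_graph=True, so an outer loss
# could backpropagate through the inner solver. The model, data, and initial
# guess are invented for illustration.
def residual(theta, x, y):
    return theta[0] * torch.exp(theta[1] * x) - y   # model: y ≈ a * exp(b * x)

def gauss_newton(x, y, theta0, steps=10):
    theta = theta0
    for _ in range(steps):
        J = torch.autograd.functional.jacobian(
            lambda t: residual(t, x, y), theta, create_graph=True)
        r = residual(theta, x, y)
        theta = theta + torch.linalg.solve(J.T @ J, -J.T @ r)  # normal equations
    return theta

x = torch.linspace(0, 1, 20)
y = 2.0 * torch.exp(0.5 * x)                         # noiseless "measurements"
print(gauss_newton(x, y, torch.tensor([1.5, 0.3])))  # -> approx. [2.0, 0.5]
```

A production solver adds damping (Levenberg-Marquardt), sparsity, and batching; the point here is only that each step is an ordinary differentiable tensor operation.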
Jul, 24
On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale models, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" […]
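For context, the ring-all-reduce pattern lets N workers compute an element-wise sum of their gradient vectors in 2*(N-1) steps, with each ring link carrying only one chunk per step. A toy single-process simulation (indices follow the textbook formulation; nothing here is specific to the paper's scheduling problem):

```python
import numpy as np

# N workers, each holding a vector split into N chunks, complete an
# all-reduce in 2*(N-1) steps.
N = 4
data = [np.random.rand(N, 8) for _ in range(N)]  # data[w][c] = chunk c on worker w
expected = np.sum(data, axis=0)                  # element-wise sum over all workers

# Reduce-scatter: at step s, worker w sends chunk (w - s) mod N to worker
# (w + 1) mod N, which accumulates it. Afterwards worker w owns the fully
# reduced chunk (w + 1) mod N.
for s in range(N - 1):
    for w in range(N):
        c = (w - s) % N
        data[(w + 1) % N][c] += data[w][c]

# All-gather: circulate each reduced chunk once around the ring, overwriting
# the stale partial copies.
for s in range(N - 1):
    for w in range(N):
        c = (w + 1 - s) % N
        data[(w + 1) % N][c] = data[w][c]

assert all(np.allclose(d, expected) for d in data)
print("all workers hold the full sum")
```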
Jul, 17
Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation
We present a series of dataflow-dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. We provide a specialised algorithm to solve these minimisation problems, based […]
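As a toy illustration of that reduction, a minimum vertex cut can be computed with the classic node-splitting trick and off-the-shelf max-flow. The dependency graph, costs, and names below are invented for illustration, not taken from the paper:

```python
import networkx as nx

# Split each vertex v into v_in -> v_out with capacity = the cost of cutting
# v; dependency edges get infinite capacity, so only vertices can be cut and
# an ordinary s-t min cut yields a minimum vertex cut.
dep = nx.DiGraph([("src", "a"), ("src", "b"), ("a", "c"), ("b", "c"), ("c", "sink")])
cost = {"a": 3, "b": 2, "c": 4}   # e.g. host-transfer cost of each intermediate

flow = nx.DiGraph()
for v in dep.nodes:
    if v in cost:
        flow.add_edge((v, "in"), (v, "out"), capacity=cost[v])
    else:
        flow.add_edge((v, "in"), (v, "out"))   # no capacity attr = infinite
for u, v in dep.edges:
    flow.add_edge((u, "out"), (v, "in"))       # dependencies are uncuttable

cut_value, (S, T) = nx.minimum_cut(flow, ("src", "in"), ("sink", "out"))
print(cut_value, [v for v in cost if (v, "in") in S and (v, "out") in T])
# -> 4 ['c']: cutting vertex "c" alone (cost 4) beats cutting {a, b} (cost 5)
```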