Posts
Nov, 14
Performance Optimisations for Heterogeneous Managed Runtime Systems
High demand for greater computational capability and power efficiency has led commodity devices to integrate diverse hardware resources. Desktops, laptops, and smartphones have embraced heterogeneity through multi-core Central Processing Units (CPUs), energy-efficient integrated Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), powerful discrete GPUs, and Tensor Processing Units (TPUs). To ease the programmability of […]
Nov, 14
Safe and Practical GPU Acceleration in TrustZone
We present a holistic design for GPU-accelerated computation in TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the interactions in the TEE at run time. This paper addresses the approach’s key missing piece – the recording environment, […]
Nov, 14
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployments. In this work, we propose Model Quantization […]
Nov, 14
Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py
Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a requisite loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing […]
Nov, 7
Collage: Automated Integration of Deep Learning Backends
Strong demands for efficient deployment of Deep Learning (DL) applications prompt the rapid development of a rich DL ecosystem. To keep up with its fast advancement, it is crucial for DL frameworks to efficiently integrate a variety of optimized libraries and runtimes as their backends and generate the fastest possible executable by using them properly. […]
Nov, 7
iGUARD: In-GPU Advanced Race Detection
Newer use cases of GPU (Graphics Processing Unit) computing, e.g., graph analytics, look less like traditional bulk-synchronous GPU programs. To cater to the needs of emerging applications with semantically richer and finer grain sharing patterns, GPU vendors have been introducing advanced programming features, e.g., scoped synchronization and independent thread scheduling. While these features can speed […]
Nov, 7
Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application
FPGAs are an interesting platform for the implementation of network-attached accelerators, either in the form of smart network interface cards or as In-Network Processing accelerators. Both application scenarios require a high-throughput hardware network stack. In this work, we integrate such a stack into the open-source TaPaSCo framework and implement a library of easy-to-use design primitives […]
Nov, 7
Principles towards Real-Time Simulation of Material Point Method on Modern GPUs
Physics-based simulation has been actively employed in generating offline visual effects in the film and animation industry. However, the computations required for high-quality scenarios are generally immense, deterring its adoption in real-time applications, e.g., virtual production, avatar live-streaming, and cloud gaming. We summarize the principles that can accelerate the computation pipeline on single-GPU and multi-GPU […]
Nov, 7
Source-to-Source Automatic Differentiation of OpenMP Parallel Loops
This paper presents our work toward correct and efficient automatic differentiation of OpenMP parallel worksharing loops in forward and reverse mode. Automatic differentiation is a method to obtain gradients of numerical programs, which are crucial in optimization, uncertainty quantification, and machine learning. The computational cost to compute gradients is a common bottleneck in practice. For […]
Oct, 31
Improving Performance and Energy Efficiency of GPUs through Locality Analysis
The massive parallelism provided by general-purpose GPUs (GPGPUs) possessing numerous compute threads in their streaming multiprocessors (SMs) and enormous memory bandwidths have made them the de-facto accelerator of choice in many scientific domains. To support the complex memory access patterns of applications, GPGPUs have a multi-level memory hierarchy consisting of a huge register file and […]
Oct, 31
Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Together with better performance come larger model sizes. This imposes challenges to the memory wall of the current accelerator hardware such as GPU. It is never ideal to train large models such as Vision […]
Oct, 31
Mixed precision in Graphics Processing Unit
Modern graphics processing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages in terms of both energy efficiency and computation. In this document, we take stock of the different applications of mixed precision. We recall the standards currently used in the overwhelming majority of […]