Posts
Nov, 21
Programming Heterogeneous Systems with General and Domain-Specific Frameworks
As chip manufacturing processes approach what is physically possible, the projections made by Moore's Law and Dennard scaling no longer hold, and CPU performance has stagnated over the last decade. At the same time, the performance requirements of many important application areas, ranging from machine learning to scientific computing, […]
Nov, 21
Accelerating JPEG Decompression on GPUs
The JPEG compression format has been the standard for lossy image compression for multiple decades, offering high compression rates at minor perceptual loss in image quality. For GPU-accelerated computer vision and deep learning tasks, such as the training of image classification models, efficient JPEG decoding is essential due to limitations in memory bandwidth. As […]
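As a concrete starting point, the sketch below decodes a JPEG directly on the GPU using torchvision's nvJPEG-backed decoder; a CUDA-enabled torchvision build is assumed, and the file name is illustrative.

```python
# Minimal sketch: GPU-side JPEG decoding via torchvision's nvJPEG-backed
# decoder. Assumes a CUDA-enabled torchvision build; the path is illustrative.
from torchvision.io import read_file, decode_jpeg

raw = read_file("image.jpg")             # raw JPEG bytes, still on the CPU
img = decode_jpeg(raw, device="cuda")    # decoding happens on the GPU (nvJPEG)
print(img.shape, img.dtype, img.device)  # [3, H, W], torch.uint8, cuda:0
```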
Nov, 21
QGTC: Accelerating Quantized GNN via GPU Tensor Core
In recent years, quantized graph neural networks (QGNNs) have attracted considerable research and industry attention due to their high robustness and low computation and memory overhead. Unfortunately, the performance gains of QGNNs have never been realized on modern GPU platforms. To this end, we propose the first Tensor Core (TC) based computing framework, […]
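For intuition about why low-bit workloads map well to specialized hardware paths, here is a toy NumPy illustration of 1-bit matrix multiplication via bit-packing and popcounts; it only mimics the idea and is in no way QGTC's Tensor Core implementation.

```python
# Toy sketch of low-bit GEMM: pack {0,1} values into words and multiply
# via AND + popcount. (Illustrative only; not QGTC's TC kernels.)
import numpy as np

def pack_bits(m: np.ndarray) -> np.ndarray:
    """Pack a {0,1} matrix row-wise into uint8 words."""
    return np.packbits(m.astype(np.uint8), axis=1)

def binary_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """C[i, j] = popcount(row_i(A) AND row_j(B)) for packed 1-bit rows."""
    pa, pb = pack_bits(a), pack_bits(b)
    out = np.zeros((a.shape[0], b.shape[0]), dtype=np.int32)
    for i in range(pa.shape[0]):
        for j in range(pb.shape[0]):
            out[i, j] = np.unpackbits(pa[i] & pb[j]).sum()
    return out

a = np.random.randint(0, 2, (4, 64))
b = np.random.randint(0, 2, (3, 64))
assert (binary_matmul(a, b) == a @ b.T).all()  # matches the float result
```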
Nov, 21
FLOWER: A Comprehensive Dataflow Compiler for High-Level Synthesis
FPGAs have found their way into data centers as accelerator cards, making reconfigurable computing more accessible for high-performance applications. At the same time, new high-level synthesis compilers like Xilinx Vitis and runtime libraries such as XRT attract software programmers into the reconfigurable domain. While software programmers are familiar with task-level and data-parallel programming, FPGAs often […]
Nov, 14
General purpose lattice QCD code set Bridge++ 2.0 for high performance computing
Bridge++ is a general-purpose code set for numerical simulations of lattice QCD, aiming at readable, extensible, and portable code while keeping practically high performance. The previous version of Bridge++ was implemented in double precision with a fixed data layout. To exploit the high arithmetic capability of new processor architectures, we extend the Bridge++ […]
Nov, 14
Performance Optimisations for Heterogeneous Managed Runtime Systems
High demand for increased computational capabilities and power efficiency has led commodity devices to integrate diverse hardware resources. Desktops, laptops, and smartphones have embraced heterogeneity through multi-core Central Processing Units (CPUs), energy-efficient integrated Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), powerful discrete GPUs, and Tensor Processing Units (TPUs). To ease the programmability of […]
Nov, 14
Safe and Practical GPU Acceleration in TrustZone
We present a holistic design for GPU-accelerated computation in a TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the interactions in the TEE at run time. This paper addresses the approach's key missing piece – the recording environment, […]
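The record-and-replay idea can be illustrated with a toy sketch (hypothetical names throughout, not the paper's implementation): interactions with the GPU are logged ahead of time and later re-issued verbatim, so the TEE never needs the full driver stack.

```python
# Toy illustration of record-and-replay (not the paper's code): ahead of
# time, log each CPU->GPU interaction; at run time, replay the log inside
# the TEE instead of hosting the full GPU driver stack there.
from dataclasses import dataclass

@dataclass
class Interaction:
    kind: str    # e.g. "reg_write" (illustrative interaction type)
    addr: int    # MMIO register offset (illustrative)
    value: int

recording: list[Interaction] = []

def record(kind: str, addr: int, value: int) -> None:
    """Recording phase: capture one interaction with the GPU."""
    recording.append(Interaction(kind, addr, value))

def replay(write_mmio) -> None:
    """Replay phase: re-issue the captured interactions in order."""
    for ia in recording:
        write_mmio(ia.addr, ia.value)

# Record outside the TEE, then replay inside it:
record("reg_write", 0x1000, 0x1)   # illustrative register pokes
record("reg_write", 0x1004, 0xFF)
replay(lambda addr, val: print(f"mmio[{addr:#x}] <- {val:#x}"))
```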
Nov, 14
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployments. In this work, we propose Model Quantization […]
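As a baseline for the kind of recipe such a benchmark would standardize, here is a minimal post-training dynamic quantization sketch in PyTorch; the model and layer choices are illustrative, and this is just one of the many algorithm/hardware combinations MQBench aims to cover.

```python
# Minimal sketch: post-training dynamic quantization in PyTorch.
# Model and layer choices are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Quantize Linear weights to int8; activations stay float (dynamic scheme).
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(qmodel(x).shape)  # torch.Size([1, 10])
```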
Nov, 14
Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py
Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a corresponding loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing […]
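To make the comparison concrete, a minimal mpi4py sketch appears below: each rank sums a slice of a range and the partial sums are reduced to rank 0. The script name in the launch command is illustrative.

```python
# Minimal mpi4py sketch: each rank computes a partial sum, reduced to
# rank 0. Run with e.g. `mpiexec -n 4 python sum.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank sums its own strided slice of 0..99 (illustrative workload).
local = sum(range(rank, 100, size))
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(f"total across {size} ranks: {total}")  # 4950
```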
Nov, 7
Collage: Automated Integration of Deep Learning Backends
Strong demand for the efficient deployment of Deep Learning (DL) applications has prompted the rapid development of a rich DL ecosystem. To keep up with its fast advancement, it is crucial for DL frameworks to efficiently integrate a variety of optimized libraries and runtimes as their backends and generate the fastest possible executable by using them properly. […]
Nov, 7
iGUARD: In-GPU Advanced Race Detection
Newer use cases of GPU (Graphics Processing Unit) computing, e.g., graph analytics, look less like traditional bulk-synchronous GPU programs. To cater to the needs of emerging applications with semantically richer and finer-grained sharing patterns, GPU vendors have been introducing advanced programming features, e.g., scoped synchronization and independent thread scheduling. While these features can speed […]
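The kind of fine-grained sharing bug such a detector targets can be sketched in Numba CUDA (illustrative only, not iGUARD's instrumentation): an unsynchronized read-modify-write on shared data races, while an atomic update does not. A CUDA-capable GPU is assumed.

```python
# Illustrative Numba CUDA kernels: an unsynchronized read-modify-write on
# a shared counter is a data race; cuda.atomic.add makes it race-free.
# (Sketch only; not iGUARD's instrumentation.)
from numba import cuda
import numpy as np

@cuda.jit
def racy_count(counter):
    counter[0] += 1                 # data race: plain read-modify-write

@cuda.jit
def safe_count(counter):
    cuda.atomic.add(counter, 0, 1)  # race-free atomic increment

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
safe_count[64, 128](counter)        # 64 blocks x 128 threads
print(counter.copy_to_host()[0])    # 8192 with atomics; racy_count may lose updates
```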
Nov, 7
Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application
FPGAs are an interesting platform for the implementation of network-attached accelerators, either in the form of smart network interface cards or as In-Network Processing accelerators. Both application scenarios require a high-throughput hardware network stack. In this work, we integrate such a stack into the open-source TaPaSCo framework and implement a library of easy-to-use design primitives […]