Posts
May 20
Glow: Graph Lowering Compiler Techniques for Neural Networks
This paper presents the design of Glow, a machine learning compiler for heterogeneous hardware. Glow takes a pragmatic approach to compilation that enables the generation of highly optimized code for multiple targets. It lowers the traditional neural network dataflow graph into a two-phase, strongly-typed intermediate representation. The high-level intermediate representation allows the optimizer to perform […]
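To make the lowering idea concrete, here is a minimal Python sketch of one rewrite step, turning a high-level FullyConnected node into simpler linear-algebra nodes a backend could map to hardware. The class and operation names are illustrative placeholders, not Glow's actual API.

```python
# Sketch of "graph lowering": a high-level, strongly-typed node is
# rewritten into lower-level nodes. Names are illustrative, not Glow's.
from dataclasses import dataclass

@dataclass
class Node:
    op: str
    inputs: tuple
    shape: tuple  # a strongly-typed IR carries shapes/types on every value

def lower(node):
    """Lower one high-level node into low-level nodes (one rewrite step)."""
    if node.op == "FullyConnected":
        x, w, b = node.inputs
        mm = Node("MatMul", (x, w), node.shape)
        return Node("BroadcastAdd", (mm, b), node.shape)
    return node  # already low-level

x = Node("Input", (), (64, 1024))
w = Node("Weight", (), (1024, 10))
b = Node("Weight", (), (10,))
fc = Node("FullyConnected", (x, w, b), (64, 10))
print(lower(fc))  # MatMul + BroadcastAdd, ready for target-specific codegen
```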
May 20
OpenCL Vector Swizzling Optimization under Global Value Numbering
Under the Heterogeneous System Architecture (HSA), devices work together to improve performance. The Open Computing Language (OpenCL) provides the functionality to carry out parallel computing across different devices. An OpenCL vector bundles identical data types and operates on all of them at once. However, optimizing vector swizzling is difficult because swizzling operations can span several OpenCL vectors. In this paper, new […]
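A rough Python sketch of why value numbering suits swizzles: a swizzle is just an index mask, nested swizzles compose into a single mask, and equivalent expressions then share one value number. The mask encoding and GVN table here are illustrative, not the paper's representation.

```python
# A swizzle such as v.wzyx selects vector lanes by index, so nested
# swizzles compose into one mask, and GVN can give syntactically
# different but equivalent expressions the same value number.
SWIZZLE = {"x": 0, "y": 1, "z": 2, "w": 3}

def mask(s):                      # "wzyx" -> (3, 2, 1, 0)
    return tuple(SWIZZLE[c] for c in s)

def compose(outer, inner):        # (v.inner).outer == v.composed
    return tuple(inner[i] for i in outer)

value_numbers = {}
def gvn(base, m):
    """Assign one value number per distinct (base, mask) pair."""
    key = (base, m)
    if key not in value_numbers:
        value_numbers[key] = len(value_numbers)
    return value_numbers[key]

a = gvn("v", compose(mask("yx"), mask("zw")))  # (v.zw).yx
b = gvn("v", mask("wz"))                       # v.wz, the same selection
print(a == b)                                  # True: one vector op suffices
```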
May 20
Generic System Calls for GPUs
GPUs are becoming first-class compute citizens and increasingly support programmability-enhancing features such as shared virtual memory and hardware cache coherence. This enables them to run a wider variety of programs. However, a key aspect of general-purpose programming where GPUs still have room for improvement is the ability to invoke system calls. We explore how to […]
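As a conceptual model only (actual designs use shared virtual memory and interrupts, not host threads), the request/response structure of a GPU-invoked system call can be sketched in Python like this:

```python
# Conceptual model of GPU system calls: "GPU" worker threads place
# requests in a shared queue and a CPU-side service thread performs the
# actual call on their behalf. This mimics the protocol's shape only.
import threading, queue, os

requests = queue.Queue()

def cpu_service():
    while True:
        tid, name, args, reply = requests.get()
        if name == "shutdown":
            break
        if name == "getpid":                      # the one syscall we model
            reply.put(os.getpid())

def gpu_thread(tid):
    reply = queue.Queue()
    requests.put((tid, "getpid", (), reply))      # "invoke" the system call
    print(f"gpu thread {tid}: pid={reply.get()}") # block until serviced

svc = threading.Thread(target=cpu_service)
svc.start()
workers = [threading.Thread(target=gpu_thread, args=(i,)) for i in range(4)]
for t in workers: t.start()
for t in workers: t.join()
requests.put((0, "shutdown", (), None))
svc.join()
```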
May 20
Neural Multi-scale Image Compression
This study presents a new lossy image compression method that exploits the multi-scale features of natural images. Our model consists of two networks: a multi-scale lossy autoencoder and a parallel multi-scale lossless coder. The multi-scale lossy autoencoder extracts multi-scale image features into quantized variables, and the parallel multi-scale lossless coder enables rapid and accurate lossless coding […]
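A minimal PyTorch sketch of the two ingredients the excerpt names, an encoder applied at several image scales and a rounding quantizer with a straight-through gradient; the layer sizes and the number of scales are arbitrary placeholders, not the paper's architecture.

```python
# Toy multi-scale autoencoder: encode the image at two scales, quantize
# the features by rounding (straight-through gradient), decode the finest.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Quantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)            # quantize to integers
    @staticmethod
    def backward(ctx, g):
        return g                         # straight-through estimator

class MultiScaleAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 8, 4, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1)
    def forward(self, x):
        codes = []
        for scale in (1.0, 0.5):                    # two image scales
            xs = F.interpolate(x, scale_factor=scale) if scale != 1.0 else x
            codes.append(Quantize.apply(self.enc(xs)))
        recon = self.dec(codes[0])                  # decode finest scale
        return codes, recon

model = MultiScaleAE()
img = torch.rand(1, 3, 32, 32)
codes, recon = model(img)
print([c.shape for c in codes], recon.shape)
```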
May 20
Deep Neural Machine Translation with Weakly-Recurrent Units
Recurrent neural networks (RNNs) have for years represented the state of the art in neural machine translation. Recently, new architectures have been proposed that can leverage parallel computation on GPUs better than classical RNNs. Faster training and inference, combined with different sequence-to-sequence modeling, also lead to performance improvements. While the new models completely depart from […]
May 12
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL framework supports portable programming across a wide range of computing devices and is gaining influence in programming next-generation accelerators. Characterizing the performance of these devices across a range of applications requires a diverse, portable and configurable […]
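A toy PyOpenCL harness illustrating this kind of portable characterization: enumerate every device on the system, run the same kernel on each, and time it with OpenCL profiling events. This is a sketch of the approach, not the paper's benchmark suite.

```python
# Run one SAXPY kernel on every available OpenCL device and report the
# kernel time from profiling events.
import numpy as np
import pyopencl as cl

SRC = ("__kernel void saxpy(__global float *y, __global const float *x,"
       " float a) { size_t i = get_global_id(0); y[i] += a * x[i]; }")

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
y = np.zeros(n, dtype=np.float32)

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        ctx = cl.Context(devices=[dev])
        q = cl.CommandQueue(
            ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
        mf = cl.mem_flags
        xb = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
        yb = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=y)
        k = cl.Program(ctx, SRC).build().saxpy
        evt = k(q, (n,), None, yb, xb, np.float32(2.0))
        evt.wait()
        ms = (evt.profile.end - evt.profile.start) * 1e-6
        print(f"{dev.name.strip()}: {ms:.3f} ms")
```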
May 12
EngineCL: Usability and Performance in Heterogeneous Computing
Heterogeneous systems composed of a CPU and a set of hardware accelerators have become one of the most common architectures today, thanks to their excellent performance and low energy consumption. However, their heterogeneity makes them very complex to program, and achieving performance portability across different devices is harder still. This paper presents EngineCL, a […]
May 12
Modeling and Evaluation of Synchronous Stochastic Gradient Descent in Distributed Deep Learning on Multiple GPUs
With huge amounts of training data, deep learning has achieved great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed across a cluster equipped with accelerators like GPUs. With the rapid increase in GPU computing power, data communication among GPUs has become a […]
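A toy numpy model of one synchronous S-SGD step on a linear regression problem; the gradient averaging stands in for the all-reduce whose communication cost such modeling work analyzes. The problem setup is a placeholder, not the paper's workload.

```python
# Synchronous SGD: each "GPU" computes a gradient on its shard of the
# minibatch, gradients are averaged (the all-reduce), and every worker
# applies the identical update.
import numpy as np

rng = np.random.default_rng(0)
X, w_true = rng.normal(size=(512, 8)), rng.normal(size=8)
y = X @ w_true
w = np.zeros(8)
P, lr = 4, 0.1                       # number of workers, learning rate

for step in range(100):
    shards = np.array_split(rng.permutation(512)[:128], P)
    grads = []
    for idx in shards:               # each worker: local gradient
        err = X[idx] @ w - y[idx]
        grads.append(X[idx].T @ err / len(idx))
    g = np.mean(grads, axis=0)       # all-reduce: average across workers
    w -= lr * g                      # identical update everywhere
print(np.linalg.norm(w - w_true))    # converges toward the true weights
```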
May 12
Parallel Programming for FPGAs
This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details are, of necessity, different from parallel programming for multicore processors or GPUs, many of the fundamental […]
May 12
Efficient Hardware Acceleration on SoC-FPGA with OpenCL
Field Programmable Gate Arrays (FPGAs) are overtaking conventional processors in the field of high-performance computing. With the advent of modern FPGA architectures and high-level synthesis tools, FPGAs can now easily be used to accelerate computationally intensive applications such as AI and cognitive computing. One of the advantages of raising the level of […]
May 5
Simulating Quantum Computers Using OpenCL
I present QCGPU, an open source Rust library for simulating quantum computers. QCGPU uses the OpenCL framework to enable acceleration by devices such as GPUs, FPGAs and DSPs. I perform a number of optimizations, including parallelizing operations such as the application of gates and the calculation of various state probabilities for the purpose of measurement. […]
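For intuition, here is what a state-vector simulator computes, written serially in numpy: applying a single-qubit gate updates every pair of amplitudes that differ only in the target bit, and measurement probabilities are the squared amplitude magnitudes. QCGPU performs the same updates in parallel via OpenCL kernels; this is only a serial analogue.

```python
# Apply a 2x2 gate to one qubit of an n-qubit state vector, then read
# the measurement probabilities as |amplitude|^2.
import numpy as np

def apply_gate(state, gate, target):
    out = state.copy()
    bit = 1 << target
    for i in range(len(state)):
        if not i & bit:                      # i has the target bit clear
            a0, a1 = state[i], state[i | bit]
            out[i] = gate[0, 0] * a0 + gate[0, 1] * a1
            out[i | bit] = gate[1, 0] * a0 + gate[1, 1] * a1
    return out

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate
state = np.zeros(4, dtype=complex)
state[0] = 1.0                                # |00>
state = apply_gate(state, H, target=0)
print(np.abs(state) ** 2)                     # [0.5, 0.5, 0.0, 0.0]
```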
May 5
Fast Simulation of Large-Scale Floods Based on GPU Parallel Computing
Computing speed is a significant issue for large-scale flood simulations that must respond in real time for disaster prevention and mitigation. Even today, most large-scale flood simulations are run on supercomputers because of the massive amounts of data and computation involved. In this work, a two-dimensional shallow water model based on an unstructured Godunov-type finite […]
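The computational core such models parallelize is a per-cell finite-volume update in which each cell's new state depends only on its neighbors' fluxes, which is exactly the independence GPUs exploit. Below is a deliberately simplified 1-D Lax-Friedrichs analogue in numpy; the paper itself uses a 2-D unstructured Godunov-type scheme.

```python
# 1-D shallow water equations on a dam-break initial state, advanced with
# Lax-Friedrichs interface fluxes. Every cell update is independent given
# its neighbors, so the loop body maps naturally onto GPU threads.
import numpy as np

g, n, dx, dt = 9.81, 200, 1.0, 0.01
h = np.where(np.arange(n) < n // 2, 2.0, 1.0)   # water depth: dam break
hu = np.zeros(n)                                # momentum

def flux(h, hu):
    u = hu / h
    return np.array([hu, hu * u + 0.5 * g * h * h])

for step in range(200):
    U = np.array([h, hu])
    F = flux(h, hu)
    # Lax-Friedrichs flux at the interface between cells i and i+1
    Fi = 0.5 * (F[:, :-1] + F[:, 1:]) - 0.5 * (dx / dt) * (U[:, 1:] - U[:, :-1])
    h[1:-1] -= dt / dx * (Fi[0, 1:] - Fi[0, :-1])
    hu[1:-1] -= dt / dx * (Fi[1, 1:] - Fi[1, :-1])

print(h.min(), h.max())  # the wave spreads out from the dam break
```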