Posts
May, 5
Auto-tuning Streamed Applications on Intel Xeon Phi
Many-core accelerators, as represented by the Xeon Phi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve the overall system performance. Unlocking this performance potential requires software to effectively partition the hardware resources to maximize the overlap between host-device communication and accelerator computation, and to match the granularity […]
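For context, a minimal C++ sketch of the overlap the paper's tuner tries to maximize (this is not the paper's auto-tuner; transfer() and compute() are stand-ins, and the chunk size is the granularity knob a tuner would search over):

// Minimal sketch: double-buffered streaming where the "transfer" of chunk i+1
// overlaps the "compute" of chunk i. The chunk size is the granularity knob an
// auto-tuner would explore.
#include <algorithm>
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

using Chunk = std::vector<double>;

// Stand-in for a host-to-device copy (a real code would use an asynchronous
// offload/copy API).
Chunk transfer(Chunk host) { return host; }

// Stand-in for the accelerator kernel.
double compute(const Chunk& dev) {
    return std::accumulate(dev.begin(), dev.end(), 0.0);
}

int main() {
    const std::size_t total = 1 << 20, chunk = 1 << 16;   // granularity knob
    std::vector<double> input(total, 1.0);
    double result = 0.0;

    // Prefetch the first chunk, then keep one transfer in flight while computing.
    auto inflight = std::async(std::launch::async, transfer,
                               Chunk(input.begin(), input.begin() + chunk));
    for (std::size_t off = 0; off < total; off += chunk) {
        Chunk ready = inflight.get();                     // wait for chunk i
        if (off + chunk < total) {                        // start copying chunk i+1
            auto next_begin = input.begin() + (off + chunk);
            auto next_end   = input.begin() + std::min(off + 2 * chunk, total);
            inflight = std::async(std::launch::async, transfer,
                                  Chunk(next_begin, next_end));
        }
        result += compute(ready);                         // overlaps with the copy
    }
    std::printf("sum = %f\n", result);
}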
May, 5
Tiramisu: A Code Optimization Framework for High Performance Systems
This paper introduces Tiramisu, an optimization framework designed to generate efficient code for high-performance systems such as multicores, GPUs, FPGAs, distributed machines, or any combination of these. Tiramisu relies on a flexible representation based on the polyhedral model and introduces a novel four-level IR that allows full separation between algorithms, schedules, data-layouts and communication. This […]
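A conceptual C++ sketch of the algorithm/schedule separation the abstract describes (this is not Tiramisu's API; loop tiling is used here only as an example of a schedule decision kept apart from the computation itself):

// The "algorithm" (what is computed) stays fixed, while a separately chosen
// "schedule" (here, a tile size) controls how the loop nest is executed.
#include <algorithm>
#include <cstdio>
#include <vector>

// Algorithm: C = A + B over an N x N grid.
void add(const std::vector<float>& A, const std::vector<float>& B,
         std::vector<float>& C, int N, int tile) {
    for (int ii = 0; ii < N; ii += tile)                        // schedule: outer tile loops
        for (int jj = 0; jj < N; jj += tile)
            for (int i = ii; i < std::min(ii + tile, N); ++i)   // intra-tile loops
                for (int j = jj; j < std::min(jj + tile, N); ++j)
                    C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main() {
    const int N = 512;
    std::vector<float> A(N * N, 1.f), B(N * N, 2.f), C(N * N);
    add(A, B, C, N, /*tile=*/64);   // the tile size is a schedule decision,
                                    // independent of the algorithm above
    std::printf("C[0] = %f\n", C[0]);
}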
May, 5
A Survey of ReRAM-based Architectures for Processing-in-memory and Neural Networks
As data movement operations and the power budget become key bottlenecks in the design of computing systems, the interest in unconventional approaches such as processing-in-memory (PIM) and machine learning (ML), especially neural network (NN) based accelerators, has grown significantly. Resistive RAM (ReRAM) is a promising technology for efficiently architecting PIM and NN based accelerators due to its […]
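As background for why ReRAM suits PIM and NN acceleration (standard crossbar behavior, not material from the survey itself): a ReRAM crossbar evaluates a matrix-vector product in place, with cell conductances acting as weights and bit-line currents as the output sums. A small C++ model of that computation:

// Each bit-line current is I[j] = sum_i V[i] * G[i][j]
// (Ohm's law plus Kirchhoff's current law), modeled digitally here.
#include <cstdio>
#include <vector>

int main() {
    const int rows = 4, cols = 3;
    std::vector<std::vector<double>> G = {    // conductances = stored weights
        {0.1, 0.2, 0.3},
        {0.4, 0.5, 0.6},
        {0.7, 0.8, 0.9},
        {1.0, 1.1, 1.2}};
    std::vector<double> V = {1.0, 0.5, 0.25, 0.0};   // input voltages

    std::vector<double> I(cols, 0.0);                // bit-line currents
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            I[j] += V[i] * G[i][j];                  // in-place analog summation, simulated

    for (int j = 0; j < cols; ++j)
        std::printf("I[%d] = %f\n", j, I[j]);
}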
Apr, 28
Multidimensional Parallelization for Streaming Text Processing Applications Based on Parabix Framework
Streaming text processing is important for transforming and analyzing the rapidly growing data in modern society. Unfortunately, text processing software written using the sequential byte-at-a-time processing model fails to take full advantage of the resources available on modern processors for many reasons, including significant branch misprediction penalties due to the input-dependent branching structures of text […]
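An illustrative C++ sketch of the bit-parallel style this line of work favors (not the Parabix framework itself): bytes in a block are summarized as 64-bit position masks, and the block is then processed with branch-free bitwise operations instead of input-dependent branches.

// Requires C++20 for std::popcount.
#include <algorithm>
#include <bit>
#include <cstdint>
#include <cstdio>
#include <string>

// Mark every occurrence of `ch` in a block of up to 64 bytes.
std::uint64_t mark(const char* block, std::size_t len, char ch) {
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < len; ++i)
        mask |= std::uint64_t(block[i] == ch) << i;   // comparison, no data-dependent branch
    return mask;
}

int main() {
    std::string text = "one,two,,three,four\nfive,six\n";
    std::size_t lines = 0, fields = 0;
    for (std::size_t off = 0; off < text.size(); off += 64) {
        std::size_t len = std::min<std::size_t>(64, text.size() - off);
        std::uint64_t nl    = mark(text.data() + off, len, '\n');
        std::uint64_t comma = mark(text.data() + off, len, ',');
        lines  += std::popcount(nl);                  // whole-block bit counting
        fields += std::popcount(comma | nl);          // field terminators per block
    }
    std::printf("lines=%zu fields=%zu\n", lines, fields);
}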
Apr, 28
A Comparative Study on Exact Triangle Counting Algorithms on the GPU
We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic approach achieving the […]
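A sequential C++ sketch of the set-intersection formulation, for reference (the paper's implementations are GPU-based; this shows only the counting scheme): edges are oriented from the lower- to the higher-numbered vertex, and each directed edge contributes the size of the intersection of its endpoints' out-neighbor lists, so every triangle u < v < w is counted exactly once.

#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int>>;   // out-neighbor lists, sorted ascending

long long count_triangles(const Graph& out) {
    long long triangles = 0;
    for (int u = 0; u < (int)out.size(); ++u)
        for (int v : out[u]) {
            // Merge-based intersection of out[u] and out[v].
            std::size_t i = 0, j = 0;
            while (i < out[u].size() && j < out[v].size()) {
                if      (out[u][i] < out[v][j]) ++i;
                else if (out[u][i] > out[v][j]) ++j;
                else { ++triangles; ++i; ++j; }
            }
        }
    return triangles;
}

int main() {
    // 4-clique on vertices 0..3: edges stored only toward higher vertex ids.
    Graph out = {{1, 2, 3}, {2, 3}, {3}, {}};
    std::printf("triangles = %lld\n", count_triangles(out));   // expect 4
}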
Apr, 28
A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs
We propose and evaluate a novel strategy for tuning the performance of a class of stencil computations on Graphics Processing Units. The strategy uses a machine learning model to predict the optimal way to load data from memory followed by a heuristic that divides other optimizations into groups and exhaustively explores one group at a […]
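A hedged C++ sketch of the group-wise exhaustive search component (the machine learning model for memory loads is omitted; the parameter groups, candidate values, and benchmark() cost function below are placeholders, not the paper's):

// Split the tuning parameters into groups, exhaustively try each group while
// holding the other groups at their current best values, and keep the best
// configuration found.
#include <cstdio>
#include <vector>

using Config = std::vector<int>;                 // one value per parameter

// Hypothetical cost model standing in for a real kernel timing run.
double benchmark(const Config& c) {
    double t = 0;
    for (std::size_t i = 0; i < c.size(); ++i) t += (c[i] - 8) * (c[i] - 8) + i;
    return t;
}

int main() {
    // Parameter indices per group, and candidate values per parameter.
    std::vector<std::vector<int>> groups = {{0, 1}, {2}};        // e.g. {tileX, tileY}, {unroll}
    std::vector<std::vector<int>> values = {{4, 8, 16}, {4, 8, 16}, {1, 2, 4, 8}};
    Config best = {4, 4, 1};
    double best_t = benchmark(best);

    for (const auto& group : groups) {           // tune one group at a time
        Config trial = best;
        std::vector<std::size_t> idx(group.size(), 0);
        while (true) {                           // exhaustive search over this group
            for (std::size_t k = 0; k < group.size(); ++k)
                trial[group[k]] = values[group[k]][idx[k]];
            double t = benchmark(trial);
            if (t < best_t) { best_t = t; best = trial; }
            std::size_t k = 0;                   // odometer-style increment
            while (k < idx.size() && ++idx[k] == values[group[k]].size())
                idx[k++] = 0;
            if (k == idx.size()) break;
        }
    }
    std::printf("best config: %d %d %d (cost %.1f)\n", best[0], best[1], best[2], best_t);
}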
Apr, 28
Accelerating Blockchain Search of Full Nodes Using GPUs
Blockchain is a distributed ledger system based on a P2P network and originally used for a cryptocurrency system. The P2P network of Blockchain is maintained by full nodes, which are in charge of verifying all the transactions in the network. However, most Blockchain user nodes do not act as full nodes, because the workload of full […]
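For context, a simplified CPU-side C++ sketch of the kind of search a full node performs (the data layout and query are hypothetical, not the paper's design): scanning every block for transactions that touch a given address, a brute-force search that parallelizes naturally across a GPU's threads.

#include <cstdio>
#include <string>
#include <vector>

struct Tx    { std::string from, to; double amount; };
struct Block { std::vector<Tx> txs; };

std::vector<Tx> find_by_address(const std::vector<Block>& chain, const std::string& addr) {
    std::vector<Tx> hits;
    for (const Block& b : chain)                 // each block (or range of blocks)
        for (const Tx& t : b.txs)                // could be scanned by a separate thread
            if (t.from == addr || t.to == addr)
                hits.push_back(t);
    return hits;
}

int main() {
    std::vector<Block> chain = {
        {{{"alice", "bob", 1.0}, {"bob", "carol", 0.5}}},
        {{{"carol", "alice", 0.25}}}};
    for (const Tx& t : find_by_address(chain, "alice"))
        std::printf("%s -> %s : %.2f\n", t.from.c_str(), t.to.c_str(), t.amount);
}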
Apr, 28
Automatic generation of CUDA code performing tensor manipulations using C++ expression templates
We present a C++ library, TLoops, which uses a hierarchy of expression templates to represent operations upon tensorial quantities in single lines of C++ code that resemble analytic equations. These expressions may be run as-is, but may also be used to emit equivalent low-level C or CUDA code, which either performs the operations more quickly […]
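A minimal C++ sketch of the expression-template technique in general (not the TLoops library): the overloaded operator+ builds a lightweight expression object instead of computing immediately, and the assignment evaluates the whole expression in a single loop with no temporaries.

#include <cstddef>
#include <cstdio>
#include <vector>

template <class L, class R>
struct Add {                                   // node representing "lhs + rhs"
    const L& lhs; const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0) : data(n, v) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <class E>
    Vec& operator=(const E& expr) {            // single evaluation loop
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

template <class L, class R>
Add<L, R> operator+(const L& a, const R& b) { return {a, b}; }

int main() {
    Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), r(4);
    r = a + b + c;                             // builds Add<Add<Vec,Vec>,Vec>, evaluated in one loop
    std::printf("r[0] = %f\n", r[0]);          // 6.0
}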
Apr, 25
BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism
Project page: BrainSlug: Transparent Neural Network Acceleration (http://www.brainslug.info/)
Neural network frameworks such as PyTorch and TensorFlow are the workhorses of numerous machine learning applications ranging from object recognition to machine translation. While these frameworks are versatile and straightforward to use, the training of and inference in deep neural networks is resource (energy, compute, and […]
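A rough C++ sketch of one common reading of depth-first execution (an assumption on my part, not BrainSlug's implementation): push a cache-sized block of data through several consecutive layers before touching the next block, rather than running each layer over the whole batch.

#include <cstdio>
#include <vector>

using Block = std::vector<float>;
// Stand-in "layer": an element-wise op with a per-layer scale.
void apply_layer(Block& x, float scale) { for (float& v : x) v = v * scale + 1.f; }

int main() {
    const std::size_t n_blocks = 8, block_size = 1024;
    std::vector<float> scales = {0.5f, 2.0f, 1.5f};              // three layers
    std::vector<Block> data(n_blocks, Block(block_size, 1.f));

    // Breadth-first (typical framework default): layer by layer over all blocks.
    // for (float s : scales) for (Block& b : data) apply_layer(b, s);

    // Depth-first: each block runs through every layer while still in cache.
    for (Block& b : data)
        for (float s : scales)
            apply_layer(b, s);

    std::printf("data[0][0] = %f\n", data[0][0]);
}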
Apr, 25
A Survey of Techniques for Dynamic Branch Prediction
The branch predictor (BP) is an essential component in modern processors, since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on the wrong path. However, reducing the latency and storage overhead of BPs while maintaining high accuracy presents significant challenges. In this paper, we present a survey of dynamic branch prediction […]
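For reference, a textbook example of the kind of scheme such a survey covers: a bimodal predictor built from a table of 2-bit saturating counters indexed by the low bits of the branch address, sketched in C++.

#include <array>
#include <cstdint>
#include <cstdio>

class BimodalPredictor {
    std::array<std::uint8_t, 1024> counters{};            // 0..3, start "strongly not taken"
    static std::size_t index(std::uint64_t pc) { return (pc >> 2) % 1024; }
public:
    bool predict(std::uint64_t pc) const { return counters[index(pc)] >= 2; }
    void update(std::uint64_t pc, bool taken) {           // saturate at 0 and 3
        std::uint8_t& c = counters[index(pc)];
        if (taken  && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};

int main() {
    BimodalPredictor bp;
    const std::uint64_t pc = 0x400100;
    int correct = 0;
    for (int i = 0; i < 100; ++i) {                        // loop branch: taken 99x, then not taken
        bool outcome = (i < 99);
        correct += (bp.predict(pc) == outcome);
        bp.update(pc, outcome);
    }
    std::printf("correct: %d / 100\n", correct);           // mispredicts only while warming up and at loop exit
}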
Apr, 22
Fast inference of deep neural networks in FPGAs for particle physics
Recent results at the Large Hadron Collider (LHC) have pointed to enhanced physics capabilities through the improvement of real-time event processing techniques. Machine learning methods are ubiquitous and have proven to be very powerful in LHC physics, and in particle physics as a whole. However, exploration of the use of such techniques in low-latency, low-power […]
Apr, 22
CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU
Neural networks have been successfully used in many applications, but due to their computational complexity they are difficult to implement on embedded devices. Neural networks are inherently approximate and can thus be simplified. In this paper, we propose CANNA, a gradual training approximation scheme that adaptively sets the level of hardware approximation depending on the neural network’s […]
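A hedged C++ sketch of the general idea of configurable approximation (not CANNA's algorithm or hardware model; the sensitivity values and accuracy budget below are invented): lower each layer's precision step by step and keep the most aggressive level whose accuracy loss stays within a budget.

#include <cstdio>
#include <vector>

// Placeholder: pretend accuracy degrades faster for "sensitive" layers as the
// approximation level (0 = exact, 3 = most approximate) increases. A real flow
// would retrain and validate instead.
double evaluate(const std::vector<int>& levels) {
    const double sensitivity[] = {0.4, 1.5, 0.2, 2.0};     // per-layer, made up
    double acc = 95.0;
    for (std::size_t l = 0; l < levels.size(); ++l)
        acc -= sensitivity[l] * levels[l];
    return acc;
}

int main() {
    const double budget = 1.0;                             // max accuracy drop allowed
    std::vector<int> levels(4, 0);                         // start exact everywhere
    const double baseline = evaluate(levels);

    for (std::size_t l = 0; l < levels.size(); ++l) {      // per layer, go as low as the budget allows
        while (levels[l] < 3) {
            ++levels[l];
            if (baseline - evaluate(levels) > budget) { --levels[l]; break; }
        }
    }
    for (std::size_t l = 0; l < levels.size(); ++l)
        std::printf("layer %zu -> approximation level %d\n", l, levels[l]);
}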