Posts
Feb, 10
Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures
Deep learning applications can recognise images and speech with great accuracy, and they are now used everywhere in our daily lives. However, deploying deep learning architectures such as deep neural networks on embedded systems is challenging because of their demanding computational and power requirements. Hence, sophisticated algorithms and methods that […]
Feb, 9
Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We […]
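To make the layout contrast concrete, here is a minimal plain-C++ sketch (not using Ikra-Cpp itself) of the same particle data stored in AOS and SOA form; the SOA loop touches contiguous arrays, which is what makes it SIMD-friendly.

// Minimal sketch (plain C++, not Ikra-Cpp): the same particle data stored
// as Array of Structures (AOS) versus Structure of Arrays (SOA).
#include <cstddef>
#include <vector>

// AOS: the fields of one object are adjacent; a loop over one field strides
// across the others, which hinders SIMD loads.
struct ParticleAOS { float x, y, vx, vy; };

// SOA: each field is stored contiguously, so a loop over x walks a dense
// array that vectorizes naturally.
struct ParticlesSOA {
    std::vector<float> x, y, vx, vy;
};

void step_aos(std::vector<ParticleAOS>& ps, float dt) {
    for (auto& p : ps) { p.x += p.vx * dt; p.y += p.vy * dt; }
}

void step_soa(ParticlesSOA& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i) {
        ps.x[i] += ps.vx[i] * dt;   // contiguous, SIMD-friendly accesses
        ps.y[i] += ps.vy[i] * dt;
    }
}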
Feb, 9
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High-Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip […]
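As a rough illustration of the spatial blocking the abstract refers to, here is a plain-C++ sketch (not the paper's OpenCL/FPGA code) of a 2D 5-point stencil processed tile by tile so that each tile's working set fits in fast local memory; the block sizes bx and by are illustrative parameters.

// Minimal sketch (plain C++): spatial blocking for a 2D 5-point stencil.
// The grid is swept in bx-by-by tiles so each tile's data stays resident
// in on-chip/cache memory while it is being updated.
#include <algorithm>
#include <vector>

void stencil_blocked(const std::vector<float>& in, std::vector<float>& out,
                     int nx, int ny, int bx, int by) {
    for (int jb = 1; jb < ny - 1; jb += by) {
        for (int ib = 1; ib < nx - 1; ib += bx) {
            int jend = std::min(jb + by, ny - 1);
            int iend = std::min(ib + bx, nx - 1);
            for (int j = jb; j < jend; ++j)        // update one tile completely
                for (int i = ib; i < iend; ++i)
                    out[j * nx + i] = 0.25f * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                               in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
        }
    }
}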
Feb, 9
MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster
In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture. According to their different roles in the system, nodes are configured with different specifications. Based on the topology of the whole system’s […]
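For readers unfamiliar with SSGD, the following single-process C++ sketch (a simplification, not MiMatrix's distributed implementation) shows the core synchronous update: gradients from all workers are averaged before one weight update is applied, after which every worker proceeds with the same weights.

// Minimal sketch (single process, no real interconnect): the synchronous
// SGD update performed once per step from the gradients all workers report.
#include <cstddef>
#include <vector>

void ssgd_step(std::vector<float>& weights,
               const std::vector<std::vector<float>>& worker_grads,
               float lr) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
        float g = 0.0f;
        for (const auto& grad : worker_grads) g += grad[i];   // sum over workers
        g /= static_cast<float>(worker_grads.size());         // average gradient
        weights[i] -= lr * g;                                  // one shared update
    }
}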
Feb, 9
Interactive GPU active contours for segmenting inhomogeneous objects
We present a segmentation software package primarily targeting medical and biological applications, with a high level of visual feedback and several usability enhancements over existing packages. Specifically, we provide a substantially faster GPU implementation of the local Gaussian distribution fitting energy model, which can segment inhomogeneous objects with poorly defined boundaries as often encountered in […]
Feb, 9
Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach
Many-core accelerators, as represented by the Xeon Phi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve the overall system performance. Unlocking this performance potential requires software to effectively partition the hardware resources to maximize the overlap between host-device communication and accelerator computation, and to match the granularity […]
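The overlap being tuned can be pictured with a small C++ double-buffering sketch (plain threads rather than any Xeon Phi offload API, with placeholder transfer and compute functions): while chunk k is computed on the accelerator, chunk k+1 is transferred.

// Minimal sketch (plain C++): double buffering so the transfer of the next
// chunk overlaps with the computation of the current one.
#include <future>
#include <vector>

struct Chunk { std::vector<float> data; };

Chunk transfer(const Chunk& host_chunk) { return host_chunk; } // stand-in for host->device copy
void compute(Chunk&) {}                                        // stand-in for accelerator kernel

void process_stream(const std::vector<Chunk>& chunks) {
    if (chunks.empty()) return;
    Chunk current = transfer(chunks[0]);
    for (std::size_t k = 0; k + 1 < chunks.size(); ++k) {
        auto next = std::async(std::launch::async, transfer, std::cref(chunks[k + 1]));
        compute(current);           // runs while the next transfer is in flight
        current = next.get();
    }
    compute(current);               // last chunk has nothing left to overlap with
}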
Feb, 3
REOH: Runtime Energy Optimization for Heterogeneous Systems
Significant effort has been devoted to choosing the best configuration of a computing system to run an application energy-efficiently. However, available tuning approaches mainly focus on homogeneous systems and do not extend to heterogeneous systems, which include several components (e.g., CPUs, GPUs) with different architectures. This study proposes a holistic tuning approach called REOH, based […]
Feb, 3
Accelerating recurrent neural network language model based online speech recognition system
This paper presents methods to accelerate recurrent neural network based language models (RNNLMs) for online speech recognition systems. Firstly, a lossy compression of the past hidden layer outputs (history vector) with caching is introduced in order to reduce the number of LM queries. Next, RNNLM computations are deployed in a CPU-GPU hybrid manner, which computes […]
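As a loose illustration of the caching idea (not the paper's lossy compression of hidden-layer history vectors), the C++ sketch below caches LM scores keyed by a hash of the few most recent word IDs, so repeated queries with the same recent context skip the network evaluation; the type and helper names are made up for the example.

// Minimal sketch (hypothetical, simplified): a score cache that avoids
// repeated RNNLM queries for identical recent contexts.
#include <cstdint>
#include <deque>
#include <unordered_map>

struct LMCache {
    std::unordered_map<std::uint64_t, float> scores;

    // Collapse the recent words plus the candidate word into one hash key
    // (FNV-1a style); this plays the role of a compressed history here.
    static std::uint64_t key(const std::deque<int>& recent_words, int next_word) {
        std::uint64_t h = 1469598103934665603ull;
        auto mix = [&h](std::uint64_t v) { h = (h ^ v) * 1099511628211ull; };
        for (int w : recent_words) mix(static_cast<std::uint64_t>(w));
        mix(static_cast<std::uint64_t>(next_word));
        return h;
    }

    template <class RunRNN>   // RunRNN: callable returning the RNNLM score
    float score(const std::deque<int>& recent, int next_word, RunRNN run_rnn) {
        auto k = key(recent, next_word);
        auto it = scores.find(k);
        if (it != scores.end()) return it->second;   // cache hit: no LM query
        float s = run_rnn();                         // cache miss: query the RNNLM
        scores.emplace(k, s);
        return s;
    }
};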
Feb, 3
A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques
Developing efficient software and hardware has never been harder, whether it is for a tiny IoT device or an exascale supercomputer. Apart from the ever-growing design and optimization complexity, there are even more fundamental problems, such as the lack of interdisciplinary knowledge required for effective software/hardware co-design and a growing technology transfer gap between academia […]
Feb, 3
Efficient SIMD Vectorization for Hashing in OpenCL
Hashing is at the core of many efficient database operators such as hash-based joins and aggregations. Vectorization is a technique that uses Single Instruction Multiple Data (SIMD) instructions to process multiple data elements at once. Applying vectorization to hash tables results in promising speedups for build and probe operations. However, vectorization typically requires intrinsics – […]
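A small plain-C++ sketch of the idea (standing in for OpenCL vector types or intrinsics): four multiplicative hashes are computed per iteration in independent lanes, the data-parallel pattern that SIMD hardware can execute with single wide instructions.

// Minimal sketch (plain C++, not OpenCL): four-lane multiplicative hashing.
// Each lane is independent, so a vectorizing compiler or explicit SIMD/uint4
// code can process the lanes together.
#include <cstddef>
#include <cstdint>

void hash4(const std::uint32_t* keys, std::uint32_t* out,
           std::size_t n, std::uint32_t table_size) {
    constexpr std::uint32_t kMul = 2654435761u;      // Knuth's multiplicative constant
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int lane = 0; lane < 4; ++lane)         // independent lanes -> vectorizable
            out[i + lane] = (keys[i + lane] * kMul) % table_size;
    }
    for (; i < n; ++i)                               // scalar tail for leftover keys
        out[i] = (keys[i] * kMul) % table_size;
}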
Feb, 3
Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning
The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call "direct optimization", requires deep changes within each framework […]
Jan, 30
The 4th International Conference on Control, Automation and Robotics (ICCAR), 2018
ICCAR 2018 is a not-to-be-missed opportunity that distills the most current knowledge on a rapidly advancing discipline into one conference. Join key researchers and established professionals in the field of control, automation and robotics as they assess the current state of the art and map out crucial areas for future research. It will provide a valuable opportunity for researchers, […]