Posts
Feb, 8
Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB
In this paper we evaluate the performance and energy effectiveness of FPGA and CPU devices for a class of parallel computing applications in which the workload can be distributed in a way that enables simultaneous computing in addition to simple offloading. The FPGA device is programmed via OpenCL using the recent availability of commercial […]
Feb, 8
Integrating GPGPU computations with CPU coroutines in C++
We present results on the integration of two major GPGPU APIs with a reactor-based event processing model in C++ that utilizes coroutines. Given the current lack of a universally usable GPGPU programming interface that gives optimal performance, and the debates about the style of implementing asynchronous computing in C++, we present a working implementation that allows a uniform and seamless […]
Feb, 8
Collaborative design and optimization using Collective Knowledge
Designing faster, more energy-efficient and reliable computer systems requires effective collaboration between hardware designers, system programmers and performance analysts, as well as feedback from system users. We present Collective Knowledge (CK), an open framework for reproducible and collaborative design and optimization. CK enables systematic and reproducible experimentation, combined with leading-edge predictive analytics to […]
Feb, 6
GPU Hackathons, 2016
Background: General-purpose Graphics Processing Units (GPGPUs) potentially offer exceptionally high memory bandwidth and performance for a wide range of applications. The challenge in utilizing such accelerators has been the difficulty in programming them. The OpenACC Directives for Accelerators offers straightforward pragma extensions to C++ and Fortran to address this programming hurdle, but other GPU programming […]
Feb, 6
FPGA Based Implementation of Deep Neural Networks Using On-chip Memory Only
Deep neural networks (DNNs) demand a very large amount of computation and weight storage, and thus efficient implementation using special-purpose hardware is highly desired. In this work, we have developed an FPGA-based fixed-point DNN system that uses only on-chip memory, so that external DRAM is never accessed. The execution time and energy consumption of the developed […]
Feb, 6
Asynchronous Methods for Deep Reinforcement Learning
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. […]
Feb, 6
Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation
This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First, we redesign and improve our previous GPU implementation, which exploited CUDA dynamic parallelism (CDP). Then we implement three GPU versions, i.e., the naive […]
Feb, 6
PRISM-PSY: Precise GPU-Accelerated Parameter Synthesis for Stochastic Systems
In this paper we present PRISM-PSY, a novel tool that performs precise GPU-accelerated parameter synthesis for continuous-time Markov chains and time-bounded temporal logic specifications. We redesign, in terms of matrix-vector operations, the recently formulated algorithms for precise parameter synthesis in order to enable effective data-parallel processing, which results in significant acceleration on many-core architectures. High […]
Feb, 6
EIE: Efficient Inference Engine on Compressed Deep Neural Network
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware can help the computation, fetching the weights from DRAM can be as much as two orders of magnitude […]
Feb, 4
A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs
Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDKs for programming FPGAs. However, the architecture of FPGAs is significantly different from that of CPUs and GPUs, for which OpenCL was originally designed. Tuning OpenCL code for good performance on FPGAs is still an open problem, since the existing OpenCL tools and models designed […]
Feb, 4
Workshop on Exascale Multi/Many Core Computing Systems, 2016
Context: Exascale computing will revolutionize computational science and engineering by providing 1000x the capabilities of currently available computing systems, while having a similar power footprint. The HPC community is working towards the development of the first Exaflop computer after reaching the Petaflop milestone in 2008. There are concerns that computer designs based on existing […]
Feb, 3
Optimization and Large Scale Computation of an Entropy-Based Moment Closure
We present computational advances and results in the implementation of an entropy-based moment closure, M_N, in the context of linear kinetic equations, with an emphasis on heterogeneous and large-scale computing platforms. Entropy-based closures are known in several cases to yield more accurate results than closures based on standard spectral approximations, such as P_N, but the […]