Posts
Feb, 10
Using Meta-heuristics and Machine Learning for Software Optimization of Parallel Computing Systems: A Systematic Literature Review
While the modern parallel computing systems offer high performance, utilizing these powerful computing resources to the highest possible extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile-time and run-time. Determining the optimal set of parameters in a […]
Feb, 10
Zorua: Enhancing Programming Ease, Portability, and Performance in GPUs by Decoupling Programming Models from Resource Management
The application resource specification – a static specification of several parameters such as the number of threads and the scratchpad memory usage per thread block–forms a critical component of the existing GPU programming models. This specification determines the performance of the application during execution because the corresponding on-chip hardware resources are allocated and managed purely […]
Feb, 10
Running Financial Risk Management Applications on FPGA in the Amazon Cloud
Nowadays, risk analysis and management is a core part of the daily operations in the financial industry, and strictly enforced by regulatory agencies. At the same time, large financial corporations have started migrating their operations into cloud services. Since the latter use a pay-per-use business model, there is a real need for implementations with high […]
Feb, 10
Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated […]
Feb, 10
Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures
Deep learning applications are able to recognise images and speech with great accuracy, and their use is now everywhere in our daily lives. However, developing deep learning architectures such as deep neural networks in embedded systems is a challenging task because of the demanding computational resources and power consumption. Hence, sophisticated algorithms and methods that […]
Feb, 9
Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout
Structure of Arrays (SOA) is a well-studied data layout technique for SIMD architectures. Previous work has shown that it can speed up applications in high-performance computing by several factors compared to a traditional Array of Structures (AOS) layout. However, most programmers are used to AOS-style programming, which is more readable and easier to maintain. We […]
Feb, 9
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieve this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip […]
Feb, 9
MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster
In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). This architecture of our heterogeneous cluster is inspired by Harvard architecture. Regarding to different roles in the system, nodes are configured as different specifications. Based on the topology of the whole system’s […]
Feb, 9
Interactive GPU active contours for segmenting inhomogeneous objects
We present a segmentation software package primarily targeting medical and biological applications, with a high level of visual feedback and several usability enhancements over existing packages. Specifically, we provide a substantially faster GPU implementation of the local Gaussian distribution fitting energy model, which can segment inhomogeneous objects with poorly defined boundaries as often encountered in […]
Feb, 9
Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach
Many-core accelerators, as represented by the XeonPhi coprocessors and GPGPUs, allow software to exploit spatial and temporal sharing of computing resources to improve the overall system performance. To unlock this performance potential requires software to effectively partition the hardware resource to maximize the overlap between hostdevice communication and accelerator computation, and to match the granularity […]
Feb, 3
REOH: Runtime Energy Optimization for Heterogeneous Systems
Significant efforts have been devoted to choosing the best configuration of a computing system to run an application energy efficiently. However, available tuning approaches mainly focus on homogeneous systems and are inextensible for heterogeneous systems which include several components (e.g., CPUs, GPUs) with different architectures. This study proposes a holistic tuning approach called REOH, based […]
Feb, 3
Accelerating recurrent neural network language model based online speech recognition system
This paper presents methods to accelerate recurrent neural network based language models (RNNLMs) for online speech recognition systems. Firstly, a lossy compression of the past hidden layer outputs (history vector) with caching is introduced in order to reduce the number of LM queries. Next, RNNLM computations are deployed in a CPU-GPU hybrid manner, which computes […]