Posts
Nov, 3
A Comparative Measurement Study of Deep Learning as a Service Framework
Big data powered Deep Learning (DL) and its applications have blossomed in recent years, fueled by three technological trends: a large amount of digitized data openly accessible, a growing number of DL software frameworks in open source and commercial markets, and a selection of affordable parallel computing hardware devices. However, no single DL framework, to […]
Nov, 3
OpenCL Performance Prediction using Architecture-Independent Features
OpenCL is an attractive model for heterogeneous high-performance computing systems, with wide support from hardware vendors and significant performance portability. To support efficient scheduling on HPC systems it is necessary to perform accurate performance predictions for OpenCL workloads on varied compute devices, which is challenging due to diverse computation, communication and memory access characteristics which […]
Oct, 28
High Performance Computing with FPGAs and OpenCL
In this work we evaluate the potential of FPGAs for accelerating HPC workloads as a more power-efficient alternative to GPUs. Using High-Level Synthesis and a large set of optimization techniques, we show that FPGAs can achieve better performance than CPUs, and better power efficiency than both CPUs and GPUs for typical HPC workloads. Furthermore, we […]
Oct, 28
Automatic Mapping for OpenCL-Programs on CPU/GPU Heterogeneous Platforms
Heterogeneous computing systems with multiple CPUs and GPUs are increasingly popular. Today, heterogeneous platforms are deployed in many setups, ranging from low-power mobile systems to high performance computing systems. Such platforms are usually programmed using OpenCL which allows to execute the same program on different types of device. Nevertheless, programming such platforms is a challenging […]
Oct, 28
The Ocean Tensor Package
Matrix and tensor operations form the basis of a wide range of fields and applications, and in many cases constitute a substantial part of the overall computational complexity. The ability of general-purpose GPUs to speed up many of these operations and enable others has resulted in a widespread adaptation of these devices. In order for […]
Oct, 28
Towards Efficient Large-Scale Graph Neural Network Computing
Recent deep learning models have moved beyond low-dimensional regular grids such as image, video, and speech, to high-dimensional graph-structured data, such as social networks, brain connections, and knowledge graphs. This evolution has led to large graph-based irregular and sparse models that go beyond what existing deep learning frameworks are designed for. Further, these models are […]
Oct, 28
Improving OpenCL Performance by Specializing Compiler Phase Selection and Ordering
Automatic compiler phase selection/ordering has traditionally been focused on CPUs and, to a lesser extent, FPGAs. We present experiments regarding compiler phase ordering specialization of OpenCL kernels targeting a GPU. We use iterative exploration to specialize LLVM phase orders on 15 OpenCL benchmarks to an NVIDIA GPU. We analyze the generated NVIDIA PTX code for […]
Oct, 21
Using Compiler Snippets to Exploit Parallelism on Heterogeneous Hardware: A Java Reduction Case Study
Parallel skeletons are essential structured design patterns for efficient heterogeneous and parallel programming. They allow programmers to express common algorithms in such a way that it is much easier to read, maintain, debug and implement for different parallel programming models and parallel architectures. Reductions are one of the most common parallel skeletons. Many programming frameworks […]
Oct, 21
Non-Uniform Domain Decomposition for Heterogeneous Accelerated Processing Units
The use of heterogeneous architectures has become indispensable in optimizing application performance. Nowadays, one of the most popular heterogeneous architectures is discrete CPU+GPU. Despite the high computational power present in such architectures, in many cases, memory data transfers between CPU and GPU are significant performance bottlenecks. As an attempt to mitigate performance costs involved in […]
Oct, 21
A Survey of FPGA-based Accelerators for Convolutional Neural Networks
Deep convolutional neural networks (CNNs) have recently shown very high accuracy in a wide range of cognitive tasks and due to this, they have received significant interest from the researchers. Given the high computational demands of CNNs, custom hardware accelerators are vital for boosting their performance. The high energy-efficiency, computing capabilities and reconfigurability of FPGA […]
Oct, 21
Exploiting Task Parallelism with OpenCL: A Case Study
While data parallelism aspects of OpenCL have been of primary interest due to the massively data parallel GPUs being on focus, OpenCL also provides powerful capabilities to describe task parallelism. In this article we study the task parallel concepts available in OpenCL and find out how well the different vendor-specific implementations can exploit task parallelism […]
Oct, 21
AI Benchmark: Running Deep Neural Networks on Android Smartphones
Over the last years, the computational power of mobile devices such as smartphones and tablets has grown dramatically, reaching the level of desktop computers available not long ago. While standard smartphone apps are no longer a problem for them, there is still a group of tasks that can easily challenge even high-end devices, namely running […]