Posts
Feb, 16
Task-based, GPU-accelerated and Robust Library for Solving Dense Nonsymmetric Eigenvalue Problems
In this paper, we present the StarNEig library for solving dense nonsymmetric standard and generalized eigenvalue problems. The library is built on top of the StarPU runtime system and targets both shared and distributed memory machines. Some components of the library have support for GPU acceleration. The library is currently in an early beta state […]
Feb, 9
Working With Incremental Spatial Data During Parallel (GPU) Computation
Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data […]
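As a rough illustration of the FRNN idea mentioned in this abstract, the sketch below bins 2D points into a uniform grid whose cell width equals the search radius, so a query only has to inspect the 3x3 block of cells around the query point. This is a minimal sequential sketch, not the paper's GPU data structure; the class name `UniformGrid` and its `insert`/`query` methods are hypothetical names chosen for the example.

```python
from collections import defaultdict
from math import floor


class UniformGrid:
    """Hypothetical uniform-grid index for 2D fixed-radius queries."""

    def __init__(self, radius):
        self.radius = radius
        # Maps an integer cell coordinate to the points stored in that cell.
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Cell width equals the search radius (an assumption of this sketch).
        return (floor(x / self.radius), floor(y / self.radius))

    def insert(self, x, y, data):
        self.cells[self._cell(x, y)].append((x, y, data))

    def query(self, x, y):
        """Return the data of every stored point within `radius` of (x, y)."""
        cx, cy = self._cell(x, y)
        r2 = self.radius ** 2
        found = []
        # Any point within one radius must lie in the surrounding 3x3 cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for px, py, data in self.cells.get((cx + dx, cy + dy), ()):
                    if (px - x) ** 2 + (py - y) ** 2 <= r2:
                        found.append(data)
        return found


grid = UniformGrid(radius=1.0)
grid.insert(0.2, 0.2, "a")
grid.insert(0.9, 0.4, "b")
grid.insert(2.5, 2.5, "c")
near = sorted(grid.query(0.5, 0.5))
```

Tying the cell width to the radius keeps the neighbourhood fixed at nine cells; a finer grid would trade more cell lookups for fewer distance tests.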
Feb, 9
Automated Runtime Analysis and Adaptation for Scalable Heterogeneous Computing
In the last decade, computer hardware has undergone tectonic shifts because sequential CPU performance has reached its physical limits. As a consequence, current high-performance computing (HPC) systems integrate a wide variety of compute resources with different capabilities and execution models, ranging from multi-core CPUs to many-core accelerators. While such heterogeneous systems […]
Feb, 9
TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory
Memristor-based, non-von-Neumann architectures performing tensor operations directly in memory are a promising approach to address the ever-increasing demand for energy-efficient, high-throughput hardware accelerators for Machine Learning (ML) inference. A major challenge for the programmability and exploitation of such Computing-In-Memory (CIM) architectures consists in the efficient mapping of tensor operations from high-level ML frameworks to fixed-function […]
Feb, 9
MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA
OpenCL for FPGA enables developers to design FPGAs using a programming model similar to that used for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing works either focus primarily on optimizing single kernels or solely depend on channels to design multi-kernel pipelines. In this paper, […]
Feb, 9
A Language for Describing Optimization Strategies
Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages – like C or OpenCL – force the programmer to intertwine the code describing functionality and optimizations. This results in a nightmare for portability which is particularly problematic given the accelerating trend towards specialized […]
Feb, 2
GPU-accelerated dynamic programming for join-order optimization
Relational databases need to select efficient join orders, as inefficient join orders can increase the query execution time by several orders of magnitude. To select efficient join orders, relational databases can apply an exhaustive search using dynamic programming. Unfortunately, the applicability of sequential dynamic programming variants is limited to simple queries due to the exhaustive […]
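To make the exhaustive dynamic-programming search concrete, here is a minimal sequential sketch that finds the cheapest plan for every subset of tables by combining the best plans of its two halves. It is not the paper's GPU algorithm: the cost model (sum of intermediate result sizes), the function name `best_join_order`, and the example cardinalities and selectivities are all assumptions made for illustration, and cross products are allowed for simplicity.

```python
from itertools import combinations


def best_join_order(cards, sel):
    """cards: {table: row count}; sel: {frozenset({a, b}): selectivity}.

    Returns (cost, plan), where cost is the sum of intermediate result
    sizes (a hypothetical cost model) and plan is a nested tuple of tables.
    """
    tables = sorted(cards)

    def size(subset):
        # Estimated rows: product of cardinalities and pairwise selectivities.
        rows = 1.0
        for t in subset:
            rows *= cards[t]
        for a, b in combinations(sorted(subset), 2):
            rows *= sel.get(frozenset((a, b)), 1.0)
        return rows

    # Base case: scanning a single table contributes no join cost.
    best = {frozenset([t]): (0.0, t) for t in tables}

    # Build plans for subsets of increasing size from smaller optimal plans.
    for k in range(2, len(tables) + 1):
        for combo in combinations(tables, k):
            subset = frozenset(combo)
            out_rows = size(subset)
            candidates = []
            for i in range(1, k):
                for left in combinations(combo, i):
                    l = frozenset(left)
                    r = subset - l
                    l_cost, l_plan = best[l]
                    r_cost, r_plan = best[r]
                    candidates.append((l_cost + r_cost + out_rows, (l_plan, r_plan)))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(tables)]


cost, plan = best_join_order(
    {"A": 1000, "B": 10, "C": 100},
    {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1},
)
```

The number of subsets grows exponentially with the number of tables, which is the scaling problem the abstract refers to.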
Feb, 2
Non-Determinism in TensorFlow ResNets
We show that the stochasticity in training ResNets for image classification on GPUs in TensorFlow is dominated by the non-determinism from GPUs, rather than by the initialisation of the weights and biases of the network or by the sequence of minibatches given. The standard deviation of test set accuracy is 0.02 with fixed seeds, compared […]
Feb, 2
Optimization of a discontinuous Galerkin solver with OpenCL and StarPU
With recent advances in microprocessor design, the optimization of computing software has become more and more technical. One of the difficulties is to transform sequential algorithms into parallel ones. A possible solution is the task-based design. In this approach, it is possible to describe the parallelization possibilities of the algorithm automatically. The task-based design is […]
Feb, 2
Noise Removal from Remote Sensed Images by NonLocal Means with OpenCL Algorithm
We introduce a multi-platform portable implementation of the NonLocal Means methodology aimed at noise removal from remotely sensed images. It is particularly suited for hyperspectral sensors, for which real-time applications are not possible with CPU-based algorithms alone. In the last decades computational devices have usually been a compound of cross-vendor sets of specifications (heterogeneous […]
Feb, 2
Interoperable GPU Kernels as Latency Improver for MEC
Mixed reality (MR) applications are expected to become common when 5G goes mainstream. However, the latency requirements are challenging to meet due to the resources required by video-based remoting of graphics, that is, decoding video codecs. We propose an approach towards tackling this challenge: a client-server implementation for transacting intermediate representation (IR) between a mobile […]
Jan, 26
Using Parallel Programming Models for Automotive Workloads on Heterogeneous Systems – a Case Study
Due to the ever-increasing computational demand of automotive applications, and in particular autonomous driving functionalities, the automotive industry and supply vendors are starting to adopt parallel and heterogeneous embedded platforms for their products. However, C and C++, the currently dominating programming languages in this industry, do not provide sufficient mechanisms to target such platforms. Established […]