
Posts

Mar, 17

Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as tensor cores: special processing units dedicated to performing fast matrix-multiply-accumulate (MMA) operations and accelerating Deep Learning applications. In this work we present the idea of using tensor cores for a different purpose, namely the parallel arithmetic reduction problem, and propose […]
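
As a rough illustration of the idea (a hand-written sketch, not the authors' code), the CUDA kernel below lets each warp fold 16x16 tiles of a half-precision array with one tensor-core MMA per tile: multiplying an all-ones matrix by a data tile puts the tile's column sums in every row of the product, so a single mma_sync reduces 256 elements. The kernel name and launch assumptions (length a multiple of 256, at most 256 threads per block, sm_70 or newer) are assumptions of the sketch.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void warp_mma_reduce(const half *data, float *partials, int num_tiles) {
    const int warps_per_block = blockDim.x / warpSize;
    const int warp = blockIdx.x * warps_per_block + threadIdx.x / warpSize;
    const int lane = threadIdx.x % warpSize;
    const int total_warps = gridDim.x * warps_per_block;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> tile;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(ones, __float2half(1.0f));
    wmma::fill_fragment(acc, 0.0f);

    for (int t = warp; t < num_tiles; t += total_warps) {
        wmma::load_matrix_sync(tile, data + t * 256, 16);
        wmma::mma_sync(acc, ones, tile, acc);        // acc += ones * tile
    }

    __shared__ float smem[8][16 * 16];               // one 16x16 result per warp
    float *out = smem[threadIdx.x / warpSize];
    wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
    if (lane == 0) {                                 // row 0 holds the column sums
        float s = 0.0f;
        for (int j = 0; j < 16; ++j) s += out[j];
        partials[warp] = s;                          // host or a second pass folds these
    }
}
```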
Mar, 17

TensorFlow Doing HPC

TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow was initially designed for developing Machine Learning (ML) applications, it in fact aims to support a much broader range of application kinds that lie outside the ML domain and can possibly include […]
Mar, 10

Improving GPU Performance through Instruction Redistribution and Diversification

As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate into the peak performance that GPUs can offer, often leaving GPU resources under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels of […]
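
The general effect can be imitated by hand in a small CUDA example (a sketch of the underlying idea only; the paper itself works at the compiler level): giving each thread two independent accumulation chains diversifies its instruction stream, so the scheduler can overlap their latencies instead of relying on extra threads. All names here are illustrative.

```cuda
__global__ void sum_two_chains(const float *x, float *out, int n) {
    const int stride = gridDim.x * blockDim.x;
    float a = 0.0f, b = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += 2 * stride) {
        a += x[i];                                   // independent of the add below,
        if (i + stride < n) b += x[i + stride];      // so the two chains can overlap
    }
    atomicAdd(out, a + b);                           // fold the per-thread partials
}
```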
Mar, 10

Energy Efficient Parallel K-Means Clustering for an Intel Hybrid Multi-Chip Package

FPGA devices have proven to be good candidates for accelerating applications from a variety of research topics. For instance, machine learning applications such as K-Means clustering usually rely on large amounts of data to be processed, and, despite the performance offered by other architectures, FPGAs can offer better energy efficiency. With that in mind, Intel® […]
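
For reference, the computational core being accelerated is compact. Below is a minimal sketch of the K-Means assignment step, written in CUDA only to keep this page's examples in one language (the paper targets an FPGA through OpenCL); the names and row-major layout are assumptions.

```cuda
// Each thread labels one point with its nearest of k centroids.
__global__ void assign_clusters(const float *points, const float *centroids,
                                int *labels, int n, int k, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int best = 0;
    float best_d = 3.4e38f;                          // ~FLT_MAX
    for (int c = 0; c < k; ++c) {
        float d = 0.0f;
        for (int j = 0; j < dim; ++j) {              // squared Euclidean distance
            float diff = points[i * dim + j] - centroids[c * dim + j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    labels[i] = best;                                // the update step then averages per label
}
```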
Mar, 10

On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation

Over the past decade, accelerator-based supercomputers have grown from 0% to 42% of the performance share on the TOP500. Ideally, GPU-accelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or, for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due […]
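
The kind of mechanical rewrite such a translator performs can be seen on a toy kernel (hand-translated here; this is not the output of the paper's tool): CUDA's built-in indices and qualifiers map onto OpenCL's.

```cuda
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // -> get_global_id(0)
    if (i < n) y[i] = a * x[i] + y[i];
}
/* The corresponding OpenCL kernel:
__kernel void saxpy(int n, float a, __global const float *x, __global float *y) {
    int i = get_global_id(0);                        // one flat global index
    if (i < n) y[i] = a * x[i] + y[i];
}
   Likewise __shared__ maps to __local and __syncthreads() to
   barrier(CLK_LOCAL_MEM_FENCE). */
```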
Mar, 10

GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding

Learning continuous representations of nodes has recently attracted growing interest in both academia and industry, due to their simplicity and effectiveness in a variety of applications. Most existing node embedding algorithms and systems can process networks with hundreds of thousands or a few millions of nodes. However, how to scale them to […]
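
As a rough sketch of the inner GPU step such systems run (GraphVite's real kernels are more elaborate; every name below is hypothetical), one block can apply a logistic, skip-gram-style update to a sampled edge (u, v), with label 1.0f for observed edges and 0.0f for negative samples. Launch with one block per sample and blockDim.x equal to the embedding dimension.

```cuda
__device__ float sigmoidf(float z) { return 1.0f / (1.0f + expf(-z)); }

__global__ void sgd_edge_update(float *emb, const int *us, const int *vs,
                                const float *labels, int num_samples,
                                int dim, float lr) {
    int s = blockIdx.x;                   // one sampled edge per block
    int j = threadIdx.x;                  // one embedding dimension per thread
    if (s >= num_samples) return;
    float *eu = emb + (long long)us[s] * dim;
    float *ev = emb + (long long)vs[s] * dim;

    __shared__ float dot;                 // naive dot product by thread 0;
    if (j == 0) {                         // a real kernel reduces in parallel
        float d = 0.0f;
        for (int t = 0; t < dim; ++t) d += eu[t] * ev[t];
        dot = d;
    }
    __syncthreads();

    float g = lr * (labels[s] - sigmoidf(dot));  // logistic-loss gradient scale
    float du = g * ev[j], dv = g * eu[j];        // read both before writing
    atomicAdd(&eu[j], du);                       // hogwild-style updates: other
    atomicAdd(&ev[j], dv);                       // blocks may touch the same rows
}
```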
Mar, 10

Custom Code Generation for a Graph DSL

Graph algorithms are at the heart of several applications, and achieving high performance with them has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to parallelize automatically, due to access patterns influenced by the input graph, which is unavailable until execution. Prior research has addressed this issue […]
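
The irregularity in question is easy to see in the shape of code a graph DSL compiler typically emits. The vertex-parallel CUDA kernel below (illustrative only, not the paper's generated output) walks a CSR graph, so each thread's loop bound and memory targets depend on its vertex's degree and neighbor list.

```cuda
__global__ void relax_vertices(const int *row_ptr, const int *col_idx,
                               const int *dist, int *new_dist, int num_vertices) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;
    int best = dist[v];
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {  // degree-dependent loop
        int u = col_idx[e];                              // data-dependent access
        best = min(best, dist[u] + 1);                   // unit-weight relaxation
    }
    new_dist[v] = best;
}
```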
Mar, 3

Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation

Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important goal due to large differences among emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether applying the OpenACC directive-based programming model to these kernels can result in performance […]
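
To show the flavor of the approach on a deliberately simplified kernel (a toy one-dimensional Lennard-Jones-style loop, not one of the paper's MD kernels): a pair-interaction loop annotated with OpenACC directives, which the compiler can target to a multicore CPU or a GPU without source changes, e.g. with nvc++ -acc=gpu.

```cpp
void pair_forces(int n, const float *px, float *fx) {
    #pragma acc parallel loop copyin(px[0:n]) copyout(fx[0:n])
    for (int i = 0; i < n; ++i) {
        float f = 0.0f;
        #pragma acc loop reduction(+:f)
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            float r = px[j] - px[i];
            float inv2 = 1.0f / (r * r);
            float inv6 = inv2 * inv2 * inv2;
            f += 24.0f * inv6 * (2.0f * inv6 - 1.0f) * inv2 * r;  // LJ-style term
        }
        fx[i] = f;
    }
}
```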
Mar, 3

Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL

Heterogeneous systems are the core architecture of most High Performance Computing nodes, due to their excellent performance and energy efficiency. However, a key challenge remains: programmability, specifically releasing the programmer from the burden of managing data and devices with different architectures. To this end, we extend EngineCL to support FPGA devices. […]
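
The scheduling decision at the heart of co-execution can be sketched independently of EngineCL's API (which is not reproduced here): split one kernel's index space across devices in proportion to their measured throughputs. For example, proportional_split(n, {1, 6, 3}) hands a CPU 10%, a GPU 60% and an FPGA 30% of the n iterations.

```cpp
#include <cstddef>
#include <vector>

struct Split { std::size_t begin, end; };   // half-open range [begin, end)

std::vector<Split> proportional_split(std::size_t n,
                                      const std::vector<double> &throughput) {
    double total = 0.0;
    for (double t : throughput) total += t;
    std::vector<Split> splits;
    std::size_t begin = 0;
    for (std::size_t d = 0; d < throughput.size(); ++d) {
        std::size_t len = (d + 1 == throughput.size())
            ? n - begin                                   // last device: remainder
            : static_cast<std::size_t>(n * throughput[d] / total);
        splits.push_back({begin, begin + len});
        begin += len;
    }
    return splits;
}
```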
Mar, 3

Application level energy measurements and models for hybrid platform with accelerators

High Performance Computing is essential to continued advancement in many scientific and engineering fields. In recent years, due to the scale of the platforms and the breakdown of the scaling laws that had long supported rapid expansion, energy efficiency has emerged as a new design constraint on HPC platforms and applications. This constraint has increased the […]
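
On Nvidia GPUs, one common way to take the application-level power samples such models are built on is NVML; the minimal C program below reads the current board power draw (a generic illustration, not the paper's measurement setup). Build with gcc power.c -lnvidia-ml.

```c
#include <nvml.h>
#include <stdio.h>

int main(void) {
    nvmlDevice_t dev;
    unsigned int milliwatts = 0;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetHandleByIndex(0, &dev);          // first GPU in the system
    nvmlDeviceGetPowerUsage(dev, &milliwatts);    // instantaneous board power
    printf("GPU power draw: %.1f W\n", milliwatts / 1000.0);
    nvmlShutdown();
    return 0;
}
```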
Mar, 3

Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs

With the ubiquity of accelerators, such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill-set of the average scientist in domains outside of computer science. It is thus imperative to decouple programming paradigms and architecture-specific implementation from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric […]
Mar, 3

cuSten – CUDA Finite Difference and Stencil Library

In this paper we present cuSten, a new library of functions to handle the implementation of 2D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy-to-use functions that speed up development of numerical codes on GPU platforms. The paper also presents an example of this library applied […]
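
To see what such a library wraps, here is the kind of plain CUDA kernel it abstracts away (an illustration, not cuSten's API): a 5-point finite-difference Laplacian over the interior of an nx-by-ny grid with spacing h, one thread per grid point.

```cuda
__global__ void laplacian5(const float *in, float *out, int nx, int ny, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || j < 1 || i >= nx - 1 || j >= ny - 1) return;  // skip the boundary
    int c = j * nx + i;
    out[c] = (in[c - 1] + in[c + 1] + in[c - nx] + in[c + nx]
              - 4.0f * in[c]) / (h * h);
}
```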

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
