Aug, 8

PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve result quality is becoming increasingly realistic with new networking technologies. In order to make such a computing scheme feasible, an application programming layer that can provide both low latency and scalable utilization of […]
Jul, 25

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting utilization of GPUs without causing performance degradation of individual workloads is an important and […]
Jul, 25

Face.evoLVe: A High-Performance Face Recognition Library

In this paper, we develop face.evoLVe – a comprehensive library that collects and implements a wide range of popular deep learning-based methods for face recognition. First of all, face.evoLVe is composed of key components that cover the full process of face analytics, including face alignment, data processing, various backbones, losses, and alternatives with bags of […]
Jul, 25

A method for decompilation of AMD GCN kernels to OpenCL

Introduction: Decompilers are useful tools for software analysis and support in the absence of source code. They are available for many hardware architectures and programming languages. However, none of the existing decompilers support modern AMD GPU architectures such as AMD GCN and RDNA. Purpose: We aim to develop the first assembly decompiler tool for a […]
Jul, 25

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and the consequent increasing adoption of Parallel Ultra-Low Power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to efficiently run a wide range of algorithms, including key Non-Neural ML […]
Jul, 25

StreamBlocks: A compiler for heterogeneous dataflow computing

To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be difficult to predict in advance and require experiments and measurements. When an investigation requires rewriting part of the system in a new […]
Jul, 18

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration

Backward projection is one of the most time-consuming steps in method-based iterative reconstruction computed tomography. The 3D backprojection memory access pattern is potentially regular enough to efficiently exploit the computational power of acceleration boards based on GPUs or FPGAs. High-level tools such as HLS or OpenCL make it easier to take such particular memory accesses into account during the design […]
Jul, 18

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs), such as SNORT, rely upon rule sets designed to alert system administrators of malicious packets. Methods for deep-packet inspection, which often depend upon regular-expression searches, can be accelerated on programmable-logic […]
Jul, 18

Designing a high-performance boundary element library with OpenCL and Numba

The Bempp boundary element library is a well-known library for the simulation of a range of electrostatic, acoustic and electromagnetic problems in homogeneous bounded and unbounded domains. It originally started as a traditional C++ library with a Python interface. Over the last two years we have completely redesigned Bempp as a native Python library, […]
Jul, 18

Optimisation and GPU code generation of Stencils for Futhark

Stencils are a common problem in the area of scientific computing. Exploiting parallelism is central to optimising the execution time of stencils running on large amounts of data. For this reason, stencils are well suited to run in a GPGPU setting. However, programming stencils to run on massively-parallel hardware […]
Jul, 18

GPTPU: Accelerating Applications using Edge Tensor Processing Units

Neural network (NN) accelerators have been integrated into a wide spectrum of computer systems to accommodate the rapidly growing demands for artificial intelligence (AI) and machine learning (ML) applications. NN accelerators share the idea of providing native hardware support for operations on multidimensional tensor data. Therefore, NN accelerators are theoretically tensor processors that can improve system […]
Jul, 11

Bringing OpenCL to Commodity RISC-V CPUs

The importance of open-source hardware has been increasing in recent years with the introduction of the RISC-V Open ISA. This has also accelerated the push for support of the open-source software stack from compiler tools to full-blown operating systems. Parallel computing with today’s Application Programming Interfaces such as OpenCL has proven to be effective at […]

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors
