
Posts

Aug 8

ndzip-gpu: Efficient Lossless Compression of Scientific Floating-Point Data on GPUs

Lossless data compression is a promising software approach for reducing the bandwidth requirements of scientific applications on accelerator clusters without introducing approximation errors. Suitable compressors must be able to effectively compact floating-point data while saturating the system interconnect to avoid introducing unnecessary latencies. We present ndzip-gpu, a novel, highly efficient GPU parallelization scheme for the block […]
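
The building block shared by many lossless floating-point compressors, including this family, is a residual transform that turns smooth data into bit patterns with long runs of leading zeros. The sketch below is a minimal CUDA illustration of that idea (XOR each value's bit pattern with its predecessor and record the leading-zero count); it is not ndzip's actual pipeline, which uses a multi-dimensional integer Lorenzo transform and a block-based coder.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Hedged sketch: XOR-with-predecessor residual transform, a common first
    // stage in lossless floating-point compressors. Not ndzip's actual scheme;
    // it only shows why residuals of smooth data have many leading zero bits
    // that a later coding stage can drop.
    __global__ void xor_residuals(const float *in, uint32_t *residual,
                                  int *leading_zeros, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint32_t cur  = __float_as_uint(in[i]);
        uint32_t prev = (i == 0) ? 0u : __float_as_uint(in[i - 1]);
        uint32_t r    = cur ^ prev;          // small for smooth data
        residual[i]      = r;
        leading_zeros[i] = __clz(r);         // bits a coder could omit
    }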
Aug 8

Performance assessment of CUDA and OpenACC in large scale combustion simulations

GPUs have climbed to the top of supercomputer systems, making life harder for many legacy scientific codes. Nowadays, many recipes are used to port such codes, with little clarity about which is the best option. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, in the multi-physics CFD […]
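
To make the comparison concrete, here is the same SAXPY loop in both models; a minimal, hedged illustration of the programming-model difference, not code from the paper (which ports a multi-physics CFD solver).

    // Minimal SAXPY in both models (illustrative only).
    // CUDA: explicit kernel and explicit launch configuration.
    __global__ void saxpy_cuda(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    // launched as: saxpy_cuda<<<(n + 255) / 256, 256>>>(n, a, x, y);

    // OpenACC: the same loop annotated with a directive; the compiler
    // generates the kernel and the data movement.
    void saxpy_acc(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }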
Aug 8

On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors

Heterogeneous microprocessors which integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be […]
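
A hedged sketch of the kind of low-overhead sharing the abstract refers to: with a single managed allocation, CPU and GPU touch the same buffer without explicit copies. The calls are standard CUDA runtime API; on an integrated CPU-GPU chip the allocation is genuinely shared physical memory, while on a discrete GPU the runtime migrates pages on demand.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *data = nullptr;
        // One allocation visible to both CPU and GPU; no explicit cudaMemcpy.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes
        scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU updates in place
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);              // CPU reads result
        cudaFree(data);
        return 0;
    }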
Aug 8

ScaleHLS: Scalable High-Level Synthesis through MLIR

High-level Synthesis (HLS) has been widely adopted as it significantly improves the hardware design productivity and enables efficient design space exploration (DSE). HLS tools can be used to deliver solutions for many different kinds of design problems, which are often better solved with different levels of abstraction. While existing HLS tools are built using compiler […]
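
For readers unfamiliar with HLS, the design space mentioned above is largely spanned by directives such as pipelining, unrolling, and array partitioning. The sketch below annotates a dot product with Vitis-HLS-style pragmas to show the knobs a DSE engine tunes; ScaleHLS itself operates on MLIR rather than on pragma-annotated C, so this is an illustration of the problem, not of the tool.

    // Illustrative HLS kernel (Vitis-HLS-style pragmas, not ScaleHLS's MLIR
    // representation). DSE would sweep the unroll factor, initiation interval,
    // and partitioning to trade area against throughput.
    float dot(const float a[1024], const float b[1024])
    {
        #pragma HLS ARRAY_PARTITION variable=a cyclic factor=8
        #pragma HLS ARRAY_PARTITION variable=b cyclic factor=8
        float acc = 0.0f;
        for (int i = 0; i < 1024; ++i) {
            #pragma HLS PIPELINE II=1
            #pragma HLS UNROLL factor=8
            acc += a[i] * b[i];
        }
        return acc;
    }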
Aug 8

PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve the result quality is a solution that becomes increasingly realistic with new networking technologies. In order to make such a computing scheme feasible, an application programming layer that can provide both low latency and scalable utilization of […]
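
Because PoCL-R sits behind the standard OpenCL API, an application enumerates remote devices exactly as it would local ones. The sketch below is generic OpenCL host code (C API), not anything PoCL-R-specific; which devices the platform exposes is a property of the runtime configuration, not of new API calls.

    #include <cstdio>
    #include <CL/cl.h>

    int main()
    {
        // Standard OpenCL discovery; with a distributed runtime the platform
        // can expose devices that physically live on an edge server cluster.
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);

        cl_device_id devices[16];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

        cl_int err = CL_SUCCESS;
        cl_context ctx = clCreateContext(nullptr, num_devices, devices,
                                         nullptr, nullptr, &err);
        cl_command_queue q = clCreateCommandQueueWithProperties(
            ctx, devices[0], nullptr, &err);

        printf("found %u device(s)\n", num_devices);
        // ... build program, create kernels, enqueue work as usual ...
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }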
Jul 25

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting utilization of GPUs without causing performance degradation of individual workloads is an important and […]
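
As a hedged illustration of the sharing problem: two independent kernels launched into separate CUDA streams can overlap on one GPU when neither saturates its SMs or memory bandwidth. The snippet only shows the manual mechanism; deciding automatically whether and how to co-locate workloads is what the compiler-guided approach addresses.

    #include <cuda_runtime.h>

    __global__ void workload_a(float *x, int n) { /* ... */ }
    __global__ void workload_b(float *y, int n) { /* ... */ }

    void colocate(float *x, float *y, int n)
    {
        // Separate streams let the kernels execute concurrently on one GPU
        // if enough resources are free; whether that helps or hurts each
        // workload is the placement decision being automated.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        workload_a<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
        workload_b<<<(n + 255) / 256, 256, 0, s2>>>(y, n);
        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }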
Jul 25

Face.evoLVe: A High-Performance Face Recognition Library

In this paper, we develop face.evoLVe – a comprehensive library that collects and implements a wide range of popular deep learning-based methods for face recognition. First of all, face.evoLVe is composed of key components that cover the full process of face analytics, including face alignment, data processing, various backbones, losses, and alternatives with bags of […]
Jul 25

StreamBlocks: A compiler for heterogeneous dataflow computing

To increase performance and efficiency, systems use FPGAs as reconfigurable accelerators. A key challenge in designing these systems is partitioning computation between processors and an FPGA. An appropriate division of labor may be difficult to predict in advance and require experiments and measurements. When an investigation requires rewriting part of the system in a new […]
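
StreamBlocks compiles CAL dataflow programs; as a hedged, language-shifted illustration of the model it targets, here is a toy two-actor pipeline in plain C++ where actors communicate only through a FIFO. Partitioning then amounts to deciding which actors run on the processor and which are synthesized to the FPGA, without rewriting the actors themselves.

    #include <cstdio>
    #include <queue>

    // Toy dataflow pipeline: actors exchange tokens only through FIFOs, so
    // each actor can be mapped to a CPU thread or an FPGA pipeline stage
    // independently. Illustrative C++, not CAL or StreamBlocks output.
    struct Fifo { std::queue<int> q; };

    void producer(Fifo &out, int n)            // actor 1: emits tokens
    {
        for (int i = 0; i < n; ++i) out.q.push(i * i);
    }

    void consumer(Fifo &in)                    // actor 2: consumes tokens
    {
        while (!in.q.empty()) {
            printf("%d\n", in.q.front());
            in.q.pop();
        }
    }

    int main()
    {
        Fifo f;
        producer(f, 8);
        consumer(f);
        return 0;
    }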
Jul 25

A method for decompilation of AMD GCN kernels to OpenCL

Introduction: Decompilers are useful tools for software analysis and support in the absence of source code. They are available for many hardware architectures and programming languages. However, none of the existing decompilers support modern AMD GPU architectures such as AMD GCN and RDNA. Purpose: We aim to develop the first assembly decompiler tool for a […]
Jul 25

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting a shift toward near-sensor processing at the extreme edge and the consequent increasing adoption of Parallel Ultra-Low Power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to efficiently run a wide range of algorithms, including key Non-Neural ML […]
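
As a hedged illustration of what parallelizing a non-neural ML kernel means on such a multi-core device, here is the assignment step of k-means split across cores with OpenMP. PULP clusters use their own fork-join runtime rather than plain OpenMP, so this shows the parallel structure of the workload, not the paper's implementation.

    #include <cfloat>

    // K-means assignment step: each point is labelled with its nearest
    // centroid. The loop over points is embarrassingly parallel, which is
    // exactly what a multi-core cluster exploits. OpenMP used purely for
    // illustration.
    void assign_clusters(const float *points, const float *centroids,
                         int *labels, int n_points, int n_clusters, int dim)
    {
        #pragma omp parallel for
        for (int p = 0; p < n_points; ++p) {
            float best = FLT_MAX;
            int best_k = 0;
            for (int k = 0; k < n_clusters; ++k) {
                float d = 0.0f;
                for (int j = 0; j < dim; ++j) {
                    float diff = points[p * dim + j] - centroids[k * dim + j];
                    d += diff * diff;
                }
                if (d < best) { best = d; best_k = k; }
            }
            labels[p] = best_k;
        }
    }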
Jul 18

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration

Backward projection is one of the most time-consuming steps in model-based iterative reconstruction computed tomography. The 3D backprojection memory access pattern is potentially regular enough to efficiently exploit the computational power of acceleration boards based on GPUs or FPGAs. High-level tools such as HLS or OpenCL make it easier to take such particular memory accesses into account during the design […]
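
To see why the access pattern matters, consider a voxel-driven backprojection: each voxel accumulates contributions from every projection angle, and neighbouring voxels read neighbouring detector bins, which is the regularity that FPGA and GPU designs exploit. The sketch below is a generic, simplified parallel-beam CUDA kernel (the paper targets OpenCL on FPGA); names and geometry are illustrative.

    // Simplified voxel-driven backprojection (parallel beam, illustrative).
    // Neighbouring voxels hit neighbouring detector bins, so accesses are
    // regular enough to stream through on-chip memory on FPGA or to coalesce
    // on GPU. Real cone-beam geometry and weighting are omitted.
    __global__ void backproject(float *volume, const float *sino,
                                const float *cos_t, const float *sin_t,
                                int nx, int ny, int nz,
                                int n_angles, int n_bins)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = blockIdx.z;
        if (x >= nx || y >= ny || z >= nz) return;

        float fx = x - nx / 2.0f, fy = y - ny / 2.0f;
        float acc = 0.0f;
        for (int a = 0; a < n_angles; ++a) {
            float s = fx * cos_t[a] + fy * sin_t[a];   // detector coordinate
            int bin = (int)(s + n_bins / 2.0f);
            if (bin >= 0 && bin < n_bins)
                acc += sino[(a * nz + z) * n_bins + bin];
        }
        volume[(z * ny + y) * nx + x] += acc;
    }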
Jul 18

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs), such as SNORT, rely upon rule sets designed to alert system administrators of malicious packets. Methods for deep-packet inspection, which often depend upon regular-expression searches, can be accelerated on programmable-logic […]
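
The standard way to map regular expressions onto programmable logic is to keep an NFA's states as a one-hot bit vector and update all of them in parallel for each input byte, giving one character per clock once pipelined. The hedged sketch below shows that update for the toy pattern "ab*c" in HLS-style C++; it is a generic construction, not the paper's SNORT-derived rule set.

    #include <cstdint>

    // One-hot NFA for the toy pattern "ab*c": bit 0 = start (always live, so
    // the pattern is found anywhere in the stream), bit 1 = saw 'a' (b* loop),
    // bit 2 = accept. One input byte per update.
    uint8_t step(uint8_t state, char c)
    {
        uint8_t next = 0x1;                          // start state stays live
        if ((state & 0x1) && c == 'a') next |= 0x2;  // s0 -a-> s1
        if ((state & 0x2) && c == 'b') next |= 0x2;  // s1 -b-> s1
        if ((state & 0x2) && c == 'c') next |= 0x4;  // s1 -c-> accept
        return next;
    }

    bool match(const char *buf, int len)
    {
        uint8_t state = 0x1;
        for (int i = 0; i < len; ++i) {
            #pragma HLS PIPELINE II=1                // one character per cycle
            state = step(state, buf[i]);
            if (state & 0x4) return true;            // pattern seen
        }
        return false;
    }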

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: