27019

Posts

Jul, 17

Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading

Following the mass adoption of external accelerators for high performance computing, the overall performance of many applications has become increasingly dependent on relatively small accelerated kernels. As static analysis is fundamentally limited by dynamic values and external definitions, standard ahead-of-time compilation is not always sufficient to achieve the best performance. Furthermore, many users looking to […]
Jul, 17

The OpenMP Cluster Programming Model

Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has shown to be an efficient […]
Jul, 17

High Performance Simulation for Scalable Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning experiments and open-source training environments are typically limited in scale, supporting tens or sometimes up to hundreds of interacting agents. In this paper we demonstrate the use of Vogue, a high performance agent based model (ABM) framework. Vogue serves as a multi-agent training environment, supporting thousands to tens of thousands of interacting […]
Jul, 10

Design and Implementation of CNN-FPGA accelerator based on Open Computing Language

In a wide range of applications, convolutional neural networks (CNNs) have been widely used, including face and speech recognition, picture retrieval and classification, and automated driving. As a result, CNN accelerators have become a popular topic of discourse. CNN Accelerators Graphics processing units (GPU) are often employed in CNN accelerators, and they are referred to […]
Jul, 10

Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml

Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of […]
Jul, 10

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model […]
Jul, 10

DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware

Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a […]
Jul, 10

FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL

This dissertation is primarily presenting the design of Digital Signal Processing (DSP) between the transmission in Bluetooth Low Energy Physical Layer (BLE PHY), and its implementation in a Field Programmable Gate Array (FPGA) device with Open Computing Language (OpenCL). During the design of DSP, it bases on the In-Phase/Quadrature-Phase (IQ) architecture to construct the modulation […]
Jul, 3

Evaluation of Intel’s DPC++ Compatibility Tool in heterogeneous computing

The Intel DPC++ Compatibility Tool is a component of the Intel oneAPI Base Toolkit. This tool automatically transforms CUDA code into Data Parallel C++ (DPC++), thus assisting in the migration process. DPC++ is an implementation of the programming standard for heterogeneous computing known as SYCL, which unifies the development of parallel applications on CPUs, GPUs […]
Jul, 3

Novel Parallel Approaches to Efficiently Solve Spatial Problems on Heterogeneous CPU-GPU Systems

In recent years, approaches that seek to extract valuable information from large datasets have become particularly relevant in today’s society. In this category, we can highlight those problems that comprise data analysis distributed across two-dimensional scenarios called spatial problems. These usually involve processing (i) a series of features distributed across a given plane or (ii) […]
Jul, 3

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes

Nowadays, High-performance Computing (HPC) scientific applications often face performance challenges when running on heterogeneous supercomputers, so do scalability, portability, and efficiency issues. For years, supercomputer architectures have been rapidly changing and becoming more complex, and this challenge will become even more complicated as we enter the exascale era, where computers will exceed one quintillion calculations […]
Jul, 3

Tensor Computation Based on Heterogeneous Memory

Tensors, which generalize matrices to more than two dimensions, are fundamental to many disciplines, such as scientific computing and machine learning. Improving the performance and scalability of tensor computation is essential to those domains. The recent advance of heterogeneous memory is promising to deliver large-scale, high-performance tensor computation. However, it is challenging to leverage memory […]

* * *

* * *

* * *

HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors

Contact us: