
Posts

Jul, 17

High Performance Simulation for Scalable Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning experiments and open-source training environments are typically limited in scale, supporting tens or sometimes up to hundreds of interacting agents. In this paper we demonstrate the use of Vogue, a high-performance agent-based model (ABM) framework. Vogue serves as a multi-agent training environment, supporting thousands to tens of thousands of interacting […]
Jul, 10

Design and Implementation of CNN-FPGA accelerator based on Open Computing Language

Convolutional neural networks (CNNs) have been widely used in a wide range of applications, including face and speech recognition, image retrieval and classification, and automated driving. As a result, CNN accelerators have become a popular topic of discussion. Graphics processing units (GPUs) are often employed in CNN accelerators, and they are referred to […]
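
For context, the core operation such an accelerator offloads is the 2-D convolution itself. The sketch below is a plain, unoptimized reference version (single channel, valid padding, illustrative names), not the OpenCL kernel from the paper.

    #include <vector>
    #include <cstddef>

    // Naive reference 2-D convolution (valid padding, single input channel),
    // the core operation a CNN accelerator offloads. Dimensions and names are
    // illustrative only.
    std::vector<float> conv2d(const std::vector<float>& in, std::size_t H, std::size_t W,
                              const std::vector<float>& k, std::size_t K) {
        const std::size_t oh = H - K + 1, ow = W - K + 1;
        std::vector<float> out(oh * ow, 0.0f);
        for (std::size_t y = 0; y < oh; ++y)
            for (std::size_t x = 0; x < ow; ++x)
                for (std::size_t ky = 0; ky < K; ++ky)
                    for (std::size_t kx = 0; kx < K; ++kx)
                        out[y * ow + x] += in[(y + ky) * W + (x + kx)] * k[ky * K + kx];
        return out;
    }
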
Jul, 10

Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml

Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of […]
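
As a reminder of what has to be mapped onto the FPGA fabric, the sketch below spells out a single LSTM time step in plain C++ with float arithmetic. Real HLS implementations, including those generated by hls4ml, use fixed-point types, unrolling, and pipelining; this is only an illustration of the recurrence, not the tool's output.

    #include <vector>
    #include <cmath>
    #include <cstddef>

    // One LSTM time step: four gates are computed from the input x_t and the
    // previous hidden state, then the cell state c and hidden state h are
    // updated. Weight layout is illustrative: row-major [gate][hidden][input+hidden].
    struct LSTMState { std::vector<float> h, c; };

    static float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

    void lstm_step(const std::vector<float>& x,   // input at time t (size nx)
                   const std::vector<float>& W,   // weights, size 4*nh*(nx+nh)
                   const std::vector<float>& b,   // biases, size 4*nh
                   LSTMState& s, std::size_t nx, std::size_t nh) {
        std::vector<float> gates(4 * nh);
        for (std::size_t g = 0; g < 4 * nh; ++g) {
            float acc = b[g];
            for (std::size_t j = 0; j < nx; ++j) acc += W[g * (nx + nh) + j] * x[j];
            for (std::size_t j = 0; j < nh; ++j) acc += W[g * (nx + nh) + nx + j] * s.h[j];
            gates[g] = acc;
        }
        for (std::size_t k = 0; k < nh; ++k) {
            const float i = sigmoidf(gates[0 * nh + k]);   // input gate
            const float f = sigmoidf(gates[1 * nh + k]);   // forget gate
            const float o = sigmoidf(gates[2 * nh + k]);   // output gate
            const float u = std::tanh(gates[3 * nh + k]);  // candidate update
            s.c[k] = f * s.c[k] + i * u;
            s.h[k] = o * std::tanh(s.c[k]);
        }
    }
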
Jul, 10

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model […]
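
To make the idea concrete, here is a hedged sketch of the kind of rewrite such tools automate: a CUDA-style SAXPY kernel (shown in the comment) re-expressed as an ordinary CPU parallel loop over the flattened thread index. This is illustrative only and not the paper's tool chain.

    #include <cstddef>

    // A CUDA-style kernel
    //   __global__ void saxpy(int n, float a, float* x, float* y) {
    //       int i = blockIdx.x * blockDim.x + threadIdx.x;
    //       if (i < n) y[i] = a * x[i] + y[i];
    //   }
    // can be lowered to a CPU parallel loop: thread blocks collapse into
    // loop iterations over the flattened index.
    void saxpy_cpu(std::size_t n, float a, const float* x, float* y) {
        #pragma omp parallel for
        for (long long i = 0; i < static_cast<long long>(n); ++i)
            y[i] = a * x[i] + y[i];
    }
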
Jul, 10

DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware

Privacy and security concerns are growing as machine learning reaches diverse application domains. Data holders want to train or infer with private data while exploiting accelerators, such as GPUs, hosted in the cloud. Cloud systems are vulnerable to attackers who compromise the privacy of data and the integrity of computations. Tackling such a […]
Jul, 10

FPGA Implementation of Bluetooth Low Energy Physical Layer with OpenCL

This dissertation primarily presents the design of the Digital Signal Processing (DSP) chain for transmission in the Bluetooth Low Energy Physical Layer (BLE PHY), and its implementation on a Field Programmable Gate Array (FPGA) device with the Open Computing Language (OpenCL). The DSP design is based on the In-Phase/Quadrature-Phase (IQ) architecture to construct the modulation […]
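
As a rough illustration of the IQ architecture mentioned above, the sketch below generates I/Q samples for a binary frequency modulation: each bit sets the sign of the frequency deviation, a phase accumulator integrates it, and I and Q are the cosine and sine of the accumulated phase. The Gaussian pulse shaping of real BLE GFSK is omitted and all parameters are illustrative.

    #include <vector>
    #include <cmath>
    #include <cstddef>

    // Minimal IQ-based frequency modulation sketch (illustrative, not the
    // dissertation's design): a phase accumulator driven by the bit value,
    // with I = cos(phase) and Q = sin(phase). Gaussian pulse shaping omitted.
    struct IQ { float i, q; };

    std::vector<IQ> modulate(const std::vector<int>& bits,
                             std::size_t samples_per_symbol,
                             float deviation_rad_per_sample) {
        std::vector<IQ> out;
        float phase = 0.0f;
        for (int b : bits) {
            const float dphi = b ? +deviation_rad_per_sample : -deviation_rad_per_sample;
            for (std::size_t s = 0; s < samples_per_symbol; ++s) {
                phase += dphi;
                out.push_back({std::cos(phase), std::sin(phase)});
            }
        }
        return out;
    }
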
Jul, 3

Novel Parallel Approaches to Efficiently Solve Spatial Problems on Heterogeneous CPU-GPU Systems

In recent years, approaches that seek to extract valuable information from large datasets have become particularly relevant. In this category, we can highlight problems that involve data analysis distributed across two-dimensional scenarios, called spatial problems. These usually involve processing (i) a series of features distributed across a given plane or (ii) […]
Jul, 3

Evaluation of Intel’s DPC++ Compatibility Tool in heterogeneous computing

The Intel DPC++ Compatibility Tool is a component of the Intel oneAPI Base Toolkit. This tool automatically transforms CUDA code into Data Parallel C++ (DPC++), thus assisting in the migration process. DPC++ is an implementation of the programming standard for heterogeneous computing known as SYCL, which unifies the development of parallel applications on CPUs, GPUs […]
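
For readers unfamiliar with DPC++, the snippet below is a minimal SYCL vector add of the general shape such migrated code takes: a queue replaces the CUDA stream, unified shared memory replaces cudaMalloc, and parallel_for replaces the kernel launch. It is a hand-written illustration, not the Compatibility Tool's actual output for any particular input.

    #include <sycl/sycl.hpp>
    #include <cstddef>

    // Minimal DPC++/SYCL vector add, roughly the shape of code produced when
    // a simple CUDA kernel launch is migrated (illustrative, hand-written).
    int main() {
        constexpr std::size_t n = 1 << 20;
        sycl::queue q;                                   // selects a default device (CPU or GPU)
        float* x = sycl::malloc_shared<float>(n, q);     // unified shared memory
        float* y = sycl::malloc_shared<float>(n, q);
        for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {   // replaces the CUDA launch
            y[i] += x[i];
        }).wait();

        sycl::free(x, q);
        sycl::free(y, q);
        return 0;
    }
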
Jul, 3

Optimizing the Performance of Parallel and Concurrent Applications Based on Asynchronous Many-Task Runtimes

Nowadays, High-Performance Computing (HPC) scientific applications often face performance, scalability, portability, and efficiency challenges when running on heterogeneous supercomputers. For years, supercomputer architectures have been rapidly changing and becoming more complex, and this challenge will become even more complicated as we enter the exascale era, where computers will exceed one quintillion calculations […]
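
The flavor of the asynchronous many-task model can be sketched with nothing more than standard C++ futures: work is decomposed into independent tasks whose results are composed through futures rather than through a bulk-synchronous loop. Runtimes such as HPX generalize this with distributed, lightweight tasks; the toy below only illustrates the programming style.

    #include <future>
    #include <functional>
    #include <numeric>
    #include <vector>
    #include <iostream>

    // Toy many-task style: split a reduction into independent tasks and
    // compose their results through futures as they complete.
    double partial_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
    }

    int main() {
        std::vector<double> data(1'000'000, 1.0);
        const std::size_t tasks = 8, chunk = data.size() / tasks;
        std::vector<std::future<double>> futs;
        for (std::size_t t = 0; t < tasks; ++t)
            futs.push_back(std::async(std::launch::async, partial_sum,
                                      std::cref(data), t * chunk, (t + 1) * chunk));
        double total = 0.0;
        for (auto& f : futs) total += f.get();   // results arrive as tasks finish
        std::cout << total << "\n";
        return 0;
    }
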
Jul, 3

Tensor Computation Based on Heterogeneous Memory

Tensors, which generalize matrices to more than two dimensions, are fundamental to many disciplines, such as scientific computing and machine learning. Improving the performance and scalability of tensor computation is essential to those domains. Recent advances in heterogeneous memory promise to deliver large-scale, high-performance tensor computation. However, it is challenging to leverage memory […]
Jul, 3

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms at a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both the memory and instruction bottlenecks. Our algorithm comes with an […]
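
A common way to make nearest neighbor search compute-bound, and hence able to approach an accelerator's peak FLOP/s, is to expand the squared distance as ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2 so that the dominant work becomes a dense matrix multiply between queries and the database. The scalar sketch below shows the idea for a single query; it is illustrative and not the paper's algorithm.

    #include <vector>
    #include <limits>
    #include <cstddef>

    // Brute-force 1-NN via the expansion ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2:
    // maximizing 2*dot - ||x||^2 is equivalent to minimizing the distance, and
    // batched over many queries the dot products become one large matrix multiply.
    // Illustrative only; not the paper's implementation.
    std::size_t nearest(const std::vector<float>& db, std::size_t n, std::size_t d,
                        const std::vector<float>& q) {
        std::size_t best = 0;
        float best_score = std::numeric_limits<float>::lowest();
        for (std::size_t i = 0; i < n; ++i) {
            float dot = 0.0f, norm = 0.0f;
            for (std::size_t j = 0; j < d; ++j) {
                dot  += db[i * d + j] * q[j];
                norm += db[i * d + j] * db[i * d + j];
            }
            const float score = 2.0f * dot - norm;
            if (score > best_score) { best_score = score; best = i; }
        }
        return best;
    }
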
Jun, 26

SnuHPL: high performance LINPACK for heterogeneous GPUs

These days, it is typical for a large-scale cluster system to contain different kinds of GPUs. However, HPL (High-Performance LINPACK), the de facto standard LINPACK implementation for evaluating the performance of a cluster system, was originally designed to work only on homogeneous CPU-only systems. In this paper, we develop SnuHPL, an optimized HPL for clusters of […]
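
One ingredient any heterogeneous HPL needs is to give each GPU a share of the matrix proportional to its measured speed. The toy routine below splits N work units (for instance, block-columns) by relative throughput; it is only an illustration of the load-balancing idea, not SnuHPL's actual distribution scheme.

    #include <vector>
    #include <numeric>
    #include <cstddef>

    // Split n work units across devices in proportion to measured throughput,
    // so a mixed-GPU cluster is not held back by its slowest device.
    // Illustrative sketch; assumes gflops is non-empty.
    std::vector<std::size_t> split_by_speed(std::size_t n, const std::vector<double>& gflops) {
        const double total = std::accumulate(gflops.begin(), gflops.end(), 0.0);
        std::vector<std::size_t> share(gflops.size(), 0);
        std::size_t assigned = 0;
        for (std::size_t g = 0; g + 1 < gflops.size(); ++g) {
            share[g] = static_cast<std::size_t>(n * gflops[g] / total);
            assigned += share[g];
        }
        share.back() = n - assigned;   // last device absorbs the rounding remainder
        return share;
    }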

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
