
Dec, 6

High-Throughput Parallel Viterbi Decoder on GPU Tensor Cores

Many research works have implemented the Viterbi decoding algorithm on GPUs instead of FPGAs because this platform provides considerable flexibility in addition to great performance. The recently introduced Tensor cores in modern GPU architectures provide incredible computing capability. This paper proposes a novel parallel implementation of the Viterbi decoding algorithm based on Tensor […]
Nov, 29

Evaluating the Performance and Portability of Contemporary SYCL Implementations

SYCL is a single-source programming model for heterogeneous systems; it promises improved maintainability, productivity, and opportunity for compiler optimization, when compared to accelerator specific programming models. Several implementations of the SYCL standard have been developed over the past few years, including several backends using contemporary accelerator languages, like OpenCL, CUDA, and HIP. These implementations vary […]
Nov, 29

Efficient Deep Neural Network Inference for Embedded Systems: A Mixture of Experts Approach

Deep neural networks (DNNs) have become one of the dominant machine learning approaches in recent years for many application domains. Unfortunately, DNNs are not well suited to addressing the challenges of embedded systems, where on-device inference on battery-powered, resource-constrained devices is often infeasible due to prohibitively long inferencing time and resource requirements. Furthermore, offloading computation […]
Nov, 29

HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC

Hardware-agnostic programming with high performance portability will be the bedrock for realizing the ubiquitous adoption of emerging accelerator technologies in future heterogeneous high-performance computing (HPC) systems, which is the key to achieving the next level of HPC performance on an expanding accelerator landscape. In this paper, we present HALO 1.0, an open-ended extensible multi-agent software […]
Nov, 29

BootCMatchG: An adaptive Algebraic MultiGrid linear solver for GPUs

Sparse solvers are one of the building blocks of any technology for reliable and high-performance scientific and engineering computing. In this paper we present a software package which implements an efficient multigrid sparse solver running on Graphics Processing Units. The package is a branch of a wider initiative of software development for sparse Linear Algebra computations on emergent HPC […]
Nov, 29

AZP: Automatic Specialization for Zero Values in Gaming Applications

Recent research has shown that dynamic zeros in shader programs of gaming applications can be effectively leveraged with a profile-guided, code-versioning transform. This transform duplicates code, specializes one path assuming certain key program operands, called versioning variables, are zero, and leaves the other path unspecialized. Dynamically, depending on the versioning variable’s value, either the specialized […]
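The code-versioning idea described in this abstract can be illustrated with a minimal sketch. It is not taken from the AZP paper: the shading formula and the names `shade_generic`, `shade_specular_zero`, and `shade_versioned` are hypothetical, and `specular` simply stands in for a versioning variable.

```python
def shade_generic(albedo, light, specular):
    # Unspecialized path: evaluates every term of the (made-up) shading formula.
    return albedo * light + specular * light * light


def shade_specular_zero(albedo, light):
    # Specialized path: written under the assumption that specular == 0,
    # so the whole specular term has been folded away.
    return albedo * light


def shade_versioned(albedo, light, specular):
    # Runtime check on the versioning variable picks one of the two copies.
    if specular == 0:
        return shade_specular_zero(albedo, light)
    return shade_generic(albedo, light, specular)
```

Both paths return the same value whenever the versioning variable is zero; the specialized copy just does less work. A real transform would make this choice per shader, guided by profile data, which this sketch does not model.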
Nov, 22

Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL

This book is about programming for data parallelism using C++. If you are new to parallel programming, that is okay. If you have never heard of SYCL or the DPC++ compiler, that is also okay. SYCL is an industry-driven Khronos standard adding data parallelism to C++ for heterogeneous systems. DPC++ is an open source compiler […]
Nov, 22

A Survey of System Architectures and Techniques for FPGA Virtualization

FPGA accelerators are gaining increasing attention in both cloud and edge computing because of their hardware flexibility, high computational throughput, and low power consumption. However, the design flow of FPGAs often requires specific knowledge of the underlying hardware, which hinders the wide adoption of FPGAs by application developers. Therefore, the virtualization of FPGAs becomes extremely […]
Nov, 22

A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. When training a DNN model, the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped […]
Nov, 22

Ginkgo – A Math Library designed for Platform Portability

The first associations to software sustainability might be the existence of a continuous integration (CI) framework; the existence of a testing framework composed of unit tests, integration tests, and end-to-end tests; and also the existence of software documentation. However, when asking what is a common deathblow for a scientific software product, it is often the […]
Nov, 22

GPURepair: Automated Repair of GPU Kernels

This paper presents a tool for repairing errors in GPU kernels written in CUDA or OpenCL due to data races and barrier divergence. Our novel extension to prior work can also remove barriers that are deemed unnecessary for correctness. We implement these ideas in our tool called GPURepair, which uses GPUVerify as the verification oracle […]
Nov, 15

Adaptive Data Migration in Load-Imbalanced HPC Applications

Distributed parallel applications need to maximize and maintain computer resource utilization and be portable across different machines. Balanced execution of some applications requires more effort than others because their data distribution changes over time. Data re-distribution at runtime requires elaborate schemes that are expensive and may benefit particular applications. This dissertation discusses a solution for […]

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
