25243

Posts

Jun, 27

Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that integrate the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of heterogeneous systems. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically […]
Jun, 27

Efficient heterogeneous matrix profile on a CPU + High Performance FPGA with integrated HBM

In this work, we study the problem of efficiently executing a state-of-the-art time series algorithm class – SCAMP – on a heterogeneous platform comprised of CPU + High Performance FPGA with integrated HBM (High Bandwidth Memory). The geometry of the algorithm (a triangular matrix walk) and the FPGA capabilities pose two challenges. First, several replicated […]
Jun, 27

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization […]
Jun, 27

Lettuce: PyTorch-based Lattice Boltzmann Framework

The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent […]
Jun, 20

GPUAPI: Multi-level Chapel Runtime API for GPUs

Chapel is inherently well suited not only for homogeneous nodes but also heterogeneous nodes because they employ the concept of locales, distributed domains, forall/reduce constructs, and implicit communications. However, it is unfortunate that there is room for further improvements in supporting GPU in Chapel. This paper addresses some of the key limitations of past approaches […]
Jun, 20

Study and evaluation of improved automatic GPU offloading method

With the slowing down of Moore’s law, the use of hardware other than CPUs, such as graphics processing units (GPUs) or field-Programmable gate arrays (FPGAs), is increasing. However, when using heterogeneous hardware other than CPUs, barriers to technical skills, such for compute unified device architecture (CUDA) and open computing language (OpenCL), are high. Therefore, I […]
Jun, 20

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

For many, Graphics Processing Units (GPUs) provides a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI-workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI […]
Jun, 20

StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs

The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other — less-known — machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation […]
Jun, 20

Experience Report: Writing A Portable GPU Runtime with OpenMP 5.1

GPU runtimes are historically implemented in CUDA or other vendor specific languages dedicated to GPU programming. In this work we show that OpenMP 5.1, with minor compiler extensions, is capable of replacing existing solutions without a performance penalty. The result is a performant and portable GPU runtime that can be compiled with LLVM/Clang to Nvidia […]
Jun, 6

DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications

Deep learning has been shown as a successful method for various tasks, and its popularity results in numerous open-source deep learning software tools. Deep learning has been applied to a broad spectrum of scientific domains such as cosmology, particle physics, computer vision, fusion, and astrophysics. Scientists have performed a great deal of work to optimize […]
Jun, 6

Data-Driven Analysis and Design of Vulkan Ray-Tracing Applications using Automatic Instrumentation

Modern graphics Application Programming Interfaces (APIs) provide first-class support for ray tracing. Hardware vendors implement drivers for the graphics API including a black-box compiler. The black-box compiler creates architecture-specific binaries that leverage ray-tracing hardware acceleration. Ray-tracing support in modern APIs allows all geometry and shaders to be specified for a single execution. Thus, ray tracing […]
Jun, 6

Optimization of Heterogeneous Systems with AI Planning Heuristics and Machine Learning: A Performance and Energy Aware Approach

Heterogeneous computing systems provide high performance and energy efficiency. However, to optimally utilize such systems, solutions that distribute the work across host CPUs and accelerating devices are needed. In this paper, we present a performance and energy aware approach that combines AI planning heuristics for parameter space exploration with a machine learning model for performance […]

* * *

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors

Contact us: