Posts

Feb, 20

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) has served as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always deliver good SpMM performance across different input data (e.g., >85% performance loss with a single algorithm). In […]
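For readers who have not met the kernel before, here is a minimal C++ sketch of SpMM (C = A·B with A sparse in CSR format and B, C dense); the function and variable names are our own illustration, not taken from the paper.

```cpp
#include <vector>
#include <cstddef>

// Minimal SpMM sketch: C = A * B, where A (m x k) is sparse in CSR form
// and B (k x n), C (m x n) are dense row-major. Illustrative only.
void spmm_csr(std::size_t m, std::size_t n,
              const std::vector<std::size_t>& row_ptr,   // size m+1
              const std::vector<std::size_t>& col_idx,   // size nnz
              const std::vector<double>& values,         // size nnz
              const std::vector<double>& B,              // k * n
              std::vector<double>& C)                    // m * n, zero-initialized
{
    for (std::size_t i = 0; i < m; ++i) {                // each sparse row of A
        for (std::size_t p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
            const double a = values[p];
            const std::size_t k = col_idx[p];
            for (std::size_t j = 0; j < n; ++j)          // scale row k of B
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

On a GPU this triple loop can be parallelized across rows, nonzeros, or output columns, which is exactly why a single static mapping struggles across different sparsity patterns.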
Feb, 20

Lightning: Scaling the GPU Programming Model Beyond a Single GPU

The GPU programming model is primarily designed to support the development of applications that run on one GPU. However, a single GPU is limited in memory capacity and compute power. To handle large problems that exceed these capabilities, one must rewrite application code to manually transfer data between GPU […]
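To picture the manual data movement the abstract refers to, here is a hedged C++ sketch that uses the CUDA runtime API to scatter a buffer across two GPUs; it shows the kind of hand-written orchestration code the paper aims to avoid, not Lightning's own interface.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hand-written multi-GPU partitioning: each device gets half of the input.
// Illustrative baseline code, not part of the Lightning programming model.
void scatter_halves(const float* host_data, std::size_t n,
                    float** dev_buf /* array of 2 device pointers */)
{
    const std::size_t half = n / 2;
    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);                                     // select the GPU
        cudaMalloc(reinterpret_cast<void**>(&dev_buf[dev]),
                   half * sizeof(float));                       // allocate on it
        cudaMemcpy(dev_buf[dev], host_data + dev * half,        // copy its slice
                   half * sizeof(float), cudaMemcpyHostToDevice);
    }
}
```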
Feb, 13

Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era

Given the need for efficient high-performance computing, computer architectures combining CPUs, GPUs, and FPGAs are nowadays prevalent. However, each of these components suffers from electrical-level security risks. As we move to heterogeneous systems, with the potential for multitenancy, it is essential to understand and investigate how the security vulnerabilities of individual components may affect the system as […]
Feb, 13

Pattern-based Programming Abstractions for Heterogeneous Parallel Computing

Contemporary computer architectures utilize wide multi-core processors, accelerators such as GPUs, and clustering of individual computers into complex large-scale systems. These hardware trends are prevalent across computers of all sizes, from the largest supercomputers down to the smallest mobile phones. While these innovations provide high peak computing performance, software developers find it increasingly difficult to […]
Feb, 13

The Ecological Footprint of Neural Machine Translation Systems

Over the past decade, deep learning (DL) has led to significant advancements in various fields of artificial intelligence, including machine translation (MT). These advancements would not be possible without the ever-growing volumes of data and the hardware that allows large DL models to be trained efficiently. Due to the large number of computing cores as […]
Feb, 13

Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS

High-level synthesis (HLS) can be used to create hardware accelerators for compute-intense software parts such as loop structures. Usually, this process requires a significant amount of user interaction to steer kernel selection and optimizations. This can be tedious and time-consuming. In this article, we present an approach that fully autonomously finds independent loop iterations and reductions […]
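As a minimal illustration (ours, not the article's), the loops below contrast iterations that are fully independent with a reduction whose loop-carried dependence an HLS tool must detect before it can parallelize.

```cpp
#include <vector>
#include <cstddef>

// Independent iterations: each c[i] depends only on a[i] and b[i],
// so an HLS tool can unroll or pipeline the loop freely.
void vec_add(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& c)
{
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] + b[i];
}

// Reduction: 'sum' carries a dependence across iterations; parallelizing it
// requires recognizing the pattern and restructuring it, e.g., as an adder tree.
float vec_sum(const std::vector<float>& a)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i];
    return sum;
}
```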
Feb, 13

FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem

This article presents a novel low latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8×8 or 16×16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. […]
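Functionally, an FC layer is a matrix-vector product plus a bias add; the C++ sketch below shows a generic block-tiled version of that product to convey how the work can be split into tiles for an array of PEs. It is our own illustration under assumed names (fc_layer_tiled, TILE), not the paper's checkerboard decomposition.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Generic tiled matrix-vector product y = W * x + b, processing W in
// TILE x TILE blocks. Each block is the kind of unit a PE array could
// consume independently; the actual FC-Accel mapping differs.
constexpr std::size_t TILE = 16;

void fc_layer_tiled(const std::vector<float>& W,  // rows * cols, row-major
                    const std::vector<float>& x,  // cols
                    const std::vector<float>& b,  // rows
                    std::vector<float>& y,        // rows
                    std::size_t rows, std::size_t cols)
{
    y = b;                                                   // start from bias
    for (std::size_t r0 = 0; r0 < rows; r0 += TILE)          // tile rows
        for (std::size_t c0 = 0; c0 < cols; c0 += TILE)      // tile cols
            for (std::size_t r = r0; r < std::min(r0 + TILE, rows); ++r)
                for (std::size_t c = c0; c < std::min(c0 + TILE, cols); ++c)
                    y[r] += W[r * cols + c] * x[c];          // one MAC
}
```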
Feb, 6

Flashlight: Enabling Innovation in Tools for Machine Learning

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to […]
Feb, 6

Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering

We present Dr.Jit, a domain-specific just-in-time compiler for physically based rendering and its derivatives. Dr.Jit traces high-level programs (e.g., written in Python) and compiles them into efficient CPU or GPU megakernels. It achieves state-of-the-art performance thanks to global optimizations that specialize code generation to the rendering or optimization task at hand. While Dr.Jit drastically simplifies […]
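The megakernel idea can be pictured without Dr.Jit's actual API: instead of running each stage as a separate kernel with intermediate buffers, the traced program is fused into one specialized pass. The plain C++ loops below are only a conceptual analogy, not Dr.Jit code.

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Staged execution: each stage reads and writes a full intermediate array,
// which is what separate GPU kernels would do.
std::vector<float> staged(const std::vector<float>& x)
{
    std::vector<float> t(x.size()), out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) t[i] = std::exp(x[i]);        // stage 1
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = t[i] / (1.0f + t[i]); // stage 2
    return out;
}

// Fused ("megakernel") execution: the whole chain runs per element with no
// intermediate storage -- the effect a tracing JIT achieves automatically.
std::vector<float> fused(const std::vector<float>& x)
{
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float e = std::exp(x[i]);
        out[i] = e / (1.0f + e);
    }
    return out;
}
```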
Feb, 6

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today’s scientific high performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce data volumes but […]
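For context on what "error-bounded" means here, the sketch below shows the simplest absolute-error-bounded quantizer: every reconstructed value stays within a user-chosen bound eb of the original. This is a generic textbook scheme under assumed names, not SZx's algorithm.

```cpp
#include <vector>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Quantize each value to an integer multiple of 2*eb (eb = absolute error
// bound); decoding the integer reproduces the value to within +/- eb.
// Generic error-bounded quantization, not the SZx pipeline.
std::vector<std::int64_t> quantize(const std::vector<double>& data, double eb)
{
    std::vector<std::int64_t> q(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        q[i] = static_cast<std::int64_t>(std::llround(data[i] / (2.0 * eb)));
    return q;
}

std::vector<double> dequantize(const std::vector<std::int64_t>& q, double eb)
{
    std::vector<double> out(q.size());
    for (std::size_t i = 0; i < q.size(); ++i)
        out[i] = static_cast<double>(q[i]) * 2.0 * eb;        // |error| <= eb
    return out;
}
```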
Feb, 6

Porting OpenACC to OpenMP on heterogeneous systems

This documentation is designed for beginners in Graphics Processing Unit (GPU) programming who want to become familiar with the OpenACC and OpenMP offloading models. Here we present an overview of these two programming models as well as of GPU architectures. Specifically, we provide some insights into the functionality of these models and perform experiments involving different […]
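To give a flavor of the two models, the same vector-add loop is shown below offloaded once with an OpenACC directive and once with a rough OpenMP-offloading counterpart; clause details and the required compiler flags vary by toolchain, so treat this as a sketch rather than the documentation's exact examples.

```cpp
#include <cstddef>

// OpenACC version: the directive asks the compiler to offload the loop
// and handle data movement for a[], b[], c[].
void vec_add_acc(const float* a, const float* b, float* c, std::size_t n)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// OpenMP offloading version: a roughly equivalent target construct.
void vec_add_omp(const float* a, const float* b, float* c, std::size_t n)
{
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```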
Feb, 6

GC3: An Optimizing Compiler for GPU Collective Communication

Machine learning models made up of millions or billions of parameters are often trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communication used in these applications becomes a bottleneck. Custom collective algorithms optimized for both particular network topologies and application specific communication patterns can […]
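As an example of what a collective algorithm looks like when spelled out, the sketch below prints the reduce-scatter schedule of a textbook ring all-reduce on n ranks; custom schedules of this kind, tuned to topology and application, are what such a compiler targets. This is the textbook baseline only, not GC3's DSL or generated code.

```cpp
#include <cstdio>

// Print the reduce-scatter phase of a textbook ring all-reduce on n ranks:
// at step s, rank r sends chunk (r - s) mod n to rank (r + 1) mod n.
void print_ring_reduce_scatter(int n)
{
    for (int step = 0; step < n - 1; ++step)
        for (int rank = 0; rank < n; ++rank) {
            const int chunk = ((rank - step) % n + n) % n;   // chunk to forward
            const int dst   = (rank + 1) % n;                // ring neighbour
            std::printf("step %d: rank %d sends chunk %d to rank %d\n",
                        step, rank, chunk, dst);
        }
}
```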

* * *

HGPU group © 2010-2023 hgpu.org

All rights belong to the respective authors