Posts

Feb, 13

Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS

High-level synthesis (HLS) can be used to create hardware accelerators for compute-intensive software parts such as loop structures. Usually, this process requires a significant amount of user interaction to steer kernel selection and optimizations. This can be tedious and time-consuming. In this article, we present an approach that fully autonomously finds independent loop iterations and reductions […]
Feb, 13

FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem

This article presents a novel low latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8×8 or 16×16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. […]
Feb, 6

Flashlight: Enabling Innovation in Tools for Machine Learning

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to […]
Feb, 6

Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering

We present Dr.Jit, a domain-specific just-in-time compiler for physically based rendering and its derivative. Dr.Jit traces high-level programs (e.g., written in Python) and compiles them into efficient CPU or GPU megakernels. It achieves state-of-the-art performance thanks to global optimizations that specialize code generation to the rendering or optimization task at hand. While Dr.Jit drastically simplifies […]
Feb, 6

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today’s scientific high performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce the data volumes but […]
Feb, 6

Porting OpenACC to OpenMP on heterogeneous systems

This documentation is designed for beginners in Graphics Processing Unit (GPU) programming who want to become familiar with the OpenACC and OpenMP offloading models. Here we present an overview of these two programming models as well as of the GPU architectures. Specifically, we provide some insights into the functionality of these models and perform experiments involving different […]
Feb, 6

GC3: An Optimizing Compiler for GPU Collective Communication

Machine learning models made up of millions or billions of parameters are often trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a bottleneck. Custom collective algorithms optimized for both particular network topologies and application specific communication patterns can […]
Jan, 30

Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure

Providing parallel programming education is an emerging challenge: it requires teaching approaches that further the learning process and a complex infrastructure that provides a suitable environment for laboratory practical classes. Failing to prioritize parallel programming requirements in the training of future computing professionals can lead to a significant training gap, negatively impacting the efficient use of current […]
Jan, 30

Performance prediction of deep learning applications training in GPU as a service systems

Data analysts predict that the GPU as a Service (GPUaaS) market will grow from US$700 million in 2019 to $7 billion in 2025, with a compound annual growth rate of over 38%, to support 3D models, animated video processing, and gaming. GPUaaS adoption will also be boosted by the use of graphics processing units (GPUs) […]
Jan, 30

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, […]
Jan, 30

GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration

Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to ubiquitous graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, meeting the demand for novel GNN models and fast inference simultaneously is challenging because of the gap between the difficulty in developing efficient FPGA accelerators and the […]
Jan, 30

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

In a general graph data structure like an adjacency matrix, when edges are homogeneous, the connectivity of two nodes can be sufficiently represented using a single bit. This insight has, however, not yet been adequately exploited by the existing matrix-centric graph processing frameworks. This work fills the void by systematically exploring the bit-level representation of […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org