Posts

Feb, 13

Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS

High-level synthesis (HLS) can be used to create hardware accelerators for compute-intensive software parts such as loop structures. Usually, this process requires a significant amount of user interaction to steer kernel selection and optimizations. This can be tedious and time-consuming. In this article, we present an approach that fully autonomously finds independent loop iterations and reductions […]
Feb, 13

FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem

This article presents a novel low latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8×8 or 16×16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. […]
Feb, 6

Flashlight: Enabling Innovation in Tools for Machine Learning

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to […]
Feb, 6

Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering

We present Dr.Jit, a domain-specific just-in-time compiler for physically based rendering and its derivative. Dr.Jit traces high-level programs (e.g., written in Python) and compiles them into efficient CPU or GPU megakernels. It achieves state-of-the-art performance thanks to global optimizations that specialize code generation to the rendering or optimization task at hand. While Dr.Jit drastically simplifies […]
Feb, 6

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today’s scientific high performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce the data volumes but […]
Feb, 6

Porting OpenACC to OpenMP on heterogeneous systems

This documentation is designed for beginners in Graphics Processing Unit (GPU) programming who want to become familiar with the OpenACC and OpenMP offloading models. Here we present an overview of these two programming models as well as of the GPU architectures. Specifically, we provide some insights into the functionality of these models and perform experiments involving different […]
Feb, 6

GC3: An Optimizing Compiler for GPU Collective Communication

Machine learning models made up of millions or billions of parameters are often trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a bottleneck. Custom collective algorithms optimized for both particular network topologies and application specific communication patterns can […]
Jan, 30

Teaching Parallel Programming in Containers: Virtualization of a Heterogeneous Local Infrastructure

Providing parallel programming education is an emerging challenge: it requires teaching approaches that further the learning process and a complex infrastructure that provides a suitable environment for laboratory practical classes. Failing to prioritize parallel programming requirements in the training of future computing professionals can lead to a significant training gap, negatively impacting the efficient use of current […]
Jan, 30

Performance prediction of deep learning applications training in GPU as a service systems

Data analysts predict that the GPU as a Service (GPUaaS) market will grow from US$700 million in 2019 to $7 billion in 2025, with a compound annual growth rate of over 38%, to support 3D models, animated video processing, and gaming. GPUaaS adoption will also be boosted by the use of graphics processing units (GPUs) […]
Jan, 30

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, […]
Jan, 30

GenGNN: A Generic FPGA Framework for Graph Neural Network Acceleration

Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to ubiquitous graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, meeting the demand for novel GNN models and fast inference simultaneously is challenging because of the gap between the difficulty in developing efficient FPGA accelerators and the […]
Jan, 30

Bit-GraphBLAS: Bit-Level Optimizations of Matrix-Centric Graph Processing on GPU

In a general graph data structure like an adjacency matrix, when edges are homogeneous, the connectivity of two nodes can be sufficiently represented using a single bit. This insight has, however, not yet been adequately exploited by the existing matrix-centric graph processing frameworks. This work fills the void by systematically exploring the bit-level representation of […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org