high performance computing on graphics processing units: hgpu.org

Posts

Feb, 20

A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices

Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast on-device DL inference, DL libraries play as critical a role as algorithms and hardware do. Unfortunately, no prior work has examined the ecosystem of modern DL libraries in depth or provided quantitative results on their performance. In […]

OpenCL • OpenGL

Feb, 20

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) serves as a fundamental component in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In […]

CUDA

Feb, 20

Lightning: Scaling the GPU Programming Model Beyond a Single GPU

The GPU programming model is primarily designed to support the development of applications that run on one GPU. However, just a single GPU is limited in its capabilities in terms of memory capacity and compute power. To handle large problems that exceed these capabilities, one must rewrite application code to manually transfer data between GPU […]

CUDA • OpenCL

Feb, 13

Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era

Given the need for efficient high-performance computing, computer architectures combining CPUs, GPUs, and FPGAs are nowadays prevalent. However, each of these components suffers from electrical-level security risks. Moving to heterogeneous systems, with the potential of multitenancy, it is essential to understand and investigate how the security vulnerabilities of individual components may affect the system as […]

Feb, 13

Pattern-based Programming Abstractions for Heterogeneous Parallel Computing

Contemporary computer architectures utilize wide multi-core processors, accelerators such as GPUs, and clustering of individual computers into complex large-scale systems. These hardware trends are prevalent across computers of all sizes, from the largest supercomputers down to the smallest mobile phones. While these innovations provide high peak computing performance, software developers find it increasingly difficult to […]

CUDA • OpenCL

Feb, 13

FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem

This article presents a novel low latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8×8 or 16×16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. […]

Feb, 13

The Ecological Footprint of Neural Machine Translation Systems

Over the past decade, deep learning (DL) has led to significant advancements in various fields of artificial intelligence, including machine translation (MT). These advancements would not be possible without the ever-growing volumes of data and the hardware that allows large DL models to be trained efficiently. Due to the large number of computing cores as […]

CUDA

Feb, 13

Improving Loop Parallelization by a Combination of Static and Dynamic Analyses in HLS

High-level synthesis (HLS) can be used to create hardware accelerators for compute-intensive software parts such as loop structures. Usually, this process requires a significant amount of user interaction to steer kernel selection and optimizations, which can be tedious and time-consuming. In this article, we present an approach that fully autonomously finds independent loop iterations and reductions […]

Feb, 6

Flashlight: Enabling Innovation in Tools for Machine Learning

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to […]

CUDA

Feb, 6

Dr.Jit: A Just-In-Time Compiler for Differentiable Rendering

We present Dr.Jit, a domain-specific just-in-time compiler for physically based rendering and its derivative. Dr.Jit traces high-level programs (e.g., written in Python) and compiles them into efficient CPU or GPU megakernels. It achieves state-of-the-art performance thanks to global optimizations that specialize code generation to the rendering or optimization task at hand. While Dr.Jit drastically simplifies […]

CUDA

Feb, 6

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today’s scientific high performance computing (HPC) applications and advanced instruments are producing vast volumes of data across a wide range of domains, which places a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce the data volumes but […]

CUDA

Feb, 6

Porting OpenACC to OpenMP on heterogeneous systems

This documentation is designed for beginners in Graphics Processing Unit (GPU) programming who want to get familiar with the OpenACC and OpenMP offloading models. Here we present an overview of these two programming models as well as of GPU architectures. Specifically, we provide some insights into the functionality of these models and perform experiments involving different […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools


Most viewed papers (last 30 days)