high performance computing on graphics processing units: hgpu.org

Posts

Dec, 25

mu-grind: A Framework for Dynamically Instrumenting HLS-Generated RTL

High-level synthesis compilers (HLS) enable the rapid creation of accelerator circuits. Unfortunately, compiler generated RTL (H-RTL) is inconsistent in terms of quality, hard to comprehend, and tends to be brittle [28, 41]. This paper develops a framework to help HLS compiler architects inspect and profile H-RTL. Prior state-of-the-art tools [23, 57] have predominantly focused on […]

Dec, 25

GPU Load Balancing

Fine-grained workload and resource balancing is the key to high performance for regular and irregular computations on the GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques to build an abstraction that addresses the difficulty of scheduling computations on the GPU. We propose a GPU fine-grained load-balancing abstraction that decouples load […]

CUDA

Dec, 25

Kernel-as-a-Service: A Serverless Interface to GPUs

Serverless computing has made it easier than ever to deploy applications over scalable cloud resources, all the while driving higher utilization for cloud providers. While this technique has worked well for easily divisible resources like CPU and local DRAM, it has struggled to incorporate more expensive and monolithic resources like GPUs or other application accelerators. […]

CUDA

Dec, 19

A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware

Agent-Based Simulation (ABS) is a modelling approach where simulated entities i.e., agents, perform actions autonomously and interact with other agents based on a set of rules. ABSs have demonstrated their usefulness in various domains such as transportation, social science, or biology. Agent-based simulators commonly rely vastly on Central Processing Unit (CPU)-based sequential execution. As a […]

OpenCL

Dec, 19

Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views

Programming in low-level imperative languages provides good performance but is error-prone. In contrast, high-level functional programming is usually free from low-level errors but performance suffers from costly abstractions. To benefit from both worlds, approaches like Lift compile from high-level functional programs to high-performance imperative code. However, problems such as removing high-level abstraction costs and handling […]

OpenCL

Dec, 19

Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)

This paper introduces the Yet Another Kernel Launcher (YAKL) C++ portability library, which strives to enable user-level code with the look and feel of Fortran code. The intended audience includes both C++ developers and Fortran developers unfamiliar with C++. The C++ portability approach is briefly explained, YAKL’s main features are described, and code examples are […]

Dec, 19

A Study on the Intersection of GPU Utilization and CNN Inference

There has been significant progress in developing neural network architectures that both achieve high predictive performance and that also achieve high application-level inference throughput (e.g., frames per second). Another metric of increasing importance is GPU utilization during inference: the measurement of how well a deployed neural network uses the computational capabilities of the GPU on […]

Dec, 19

FLIA: Architecture of Collaborated Mobile GPU and FPGA Heterogeneous Computing

Accelerators, such as GPUs (Graphics Processing Unit) that is suitable for handling highly parallel data, and FPGA (Field Programmable Gate Array) with algorithms customized architectures, are widely adopted. The motivation is that algorithms with various parallel characteristics can efficiently map to the heterogeneous computing architecture by collaborated GPU and FPGA. However, current applications always utilize […]

OpenCL

Dec, 11

Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems

We analyze the performance portability of the skeleton-based, single-source multi-backend high-level programming framework SkePU across multiple different CPU–GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of […]

CUDA

•

OpenCL

Dec, 11

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads

With the rise of AI in recent years and the increase in complexity of the models, the growing demand in computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly more potent accelerators and the use of large compute clusters. However, the gain in prediction […]

Dec, 11

Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training

The most compute-intensive stage of deep neural network (DNN) training is matrix multiplication where the multiply-accumulate (MAC) operator is key. To reduce training costs, we consider using low-precision arithmetic for MAC operations. While low-precision training has been investigated in prior work, the focus has been on reducing the number of bits in weights or activations […]

CUDA

Dec, 11

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems

Novel artificial intelligence (AI) technology has expedited various scientific research, e.g., cosmology, physics and bioinformatics, inevitably becoming a significant category of workload on high performance computing (HPC) systems. Existing AI benchmarks tend to customize well-recognized AI applications, so as to evaluate the AI performance of HPC systems under predefined problem size, in terms of datasets […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Posts

mu-grind: A Framework for Dynamically Instrumenting HLS-Generated RTL

GPU Load Balancing

Kernel-as-a-Service: A Serverless Interface to GPUs

A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware

Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views

Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)

A Study on the Intersection of GPU Utilization and CNN Inference

FLIA: Architecture of Collaborated Mobile GPU and FPGA Heterogeneous Computing

Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads

Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)