high performance computing on graphics processing units: hgpu.org

Posts

Dec, 25

mu-grind: A Framework for Dynamically Instrumenting HLS-Generated RTL

High-level synthesis compilers (HLS) enable the rapid creation of accelerator circuits. Unfortunately, compiler-generated RTL (H-RTL) is inconsistent in terms of quality, hard to comprehend, and tends to be brittle [28, 41]. This paper develops a framework to help HLS compiler architects inspect and profile H-RTL. Prior state-of-the-art tools [23, 57] have predominantly focused on […]

Dec, 25

Kernel-as-a-Service: A Serverless Interface to GPUs

Serverless computing has made it easier than ever to deploy applications over scalable cloud resources, all the while driving higher utilization for cloud providers. While this technique has worked well for easily divisible resources like CPU and local DRAM, it has struggled to incorporate more expensive and monolithic resources like GPUs or other application accelerators. […]

CUDA

Dec, 19

A Framework to Generate High-Performance Time-stepped Agent-based Simulations on Heterogeneous Hardware

Agent-Based Simulation (ABS) is a modelling approach where simulated entities, i.e., agents, perform actions autonomously and interact with other agents based on a set of rules. ABSs have demonstrated their usefulness in various domains such as transportation, social science, or biology. Agent-based simulators commonly rely heavily on Central Processing Unit (CPU)-based sequential execution. As a […]

OpenCL

Dec, 19

Code Generation from Functional to Imperative: Combining Destination-Passing Style and Views

Programming in low-level imperative languages provides good performance but is error-prone. In contrast, high-level functional programming is usually free from low-level errors but performance suffers from costly abstractions. To benefit from both worlds, approaches like Lift compile from high-level functional programs to high-performance imperative code. However, problems such as removing high-level abstraction costs and handling […]

OpenCL

Dec, 19

Portable C++ Code that can Look and Feel Like Fortran Code with Yet Another Kernel Launcher (YAKL)

This paper introduces the Yet Another Kernel Launcher (YAKL) C++ portability library, which strives to enable user-level code with the look and feel of Fortran code. The intended audience includes both C++ developers and Fortran developers unfamiliar with C++. The C++ portability approach is briefly explained, YAKL’s main features are described, and code examples are […]

Dec, 19

A Study on the Intersection of GPU Utilization and CNN Inference

There has been significant progress in developing neural network architectures that both achieve high predictive performance and that also achieve high application-level inference throughput (e.g., frames per second). Another metric of increasing importance is GPU utilization during inference: the measurement of how well a deployed neural network uses the computational capabilities of the GPU on […]

Dec, 19

FLIA: Architecture of Collaborated Mobile GPU and FPGA Heterogeneous Computing

Accelerators such as GPUs (Graphics Processing Units), which are suited to handling highly parallel data, and FPGAs (Field Programmable Gate Arrays), with algorithm-customized architectures, are widely adopted. The motivation is that algorithms with various parallel characteristics can be mapped efficiently to a heterogeneous computing architecture of collaborating GPUs and FPGAs. However, current applications always utilize […]

OpenCL

Dec, 11

Assessing Application Efficiency and Performance Portability in Single-Source Programming for Heterogeneous Parallel Systems

We analyze the performance portability of the skeleton-based, single-source multi-backend high-level programming framework SkePU across multiple different CPU–GPU heterogeneous systems. Thereby, we provide a systematic application efficiency characterization of SkePU-generated code in comparison to equivalent hand-written code in more low-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of […]

CUDA • OpenCL

Dec, 11

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads

With the rise of AI in recent years and the increase in complexity of the models, the growing demand in computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly more potent accelerators and the use of large compute clusters. However, the gain in prediction […]

Dec, 11

Mixing Low-Precision Formats in Multiply-Accumulate Units for DNN Training

The most compute-intensive stage of deep neural network (DNN) training is matrix multiplication where the multiply-accumulate (MAC) operator is key. To reduce training costs, we consider using low-precision arithmetic for MAC operations. While low-precision training has been investigated in prior work, the focus has been on reducing the number of bits in weights or activations […]

CUDA

Dec, 11

SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems

Novel artificial intelligence (AI) technology has expedited various scientific research, e.g., cosmology, physics and bioinformatics, inevitably becoming a significant category of workload on high performance computing (HPC) systems. Existing AI benchmarks tend to customize well-recognized AI applications, so as to evaluate the AI performance of HPC systems under predefined problem size, in terms of datasets […]

CUDA

Dec, 11

Towards energy efficiency and productivity for decision making in mobile robot navigation

Our goal in this work is to make it easy and feasible to implement solutions for autonomous decision-making and planning under uncertainty on low-power mobile platforms. We focus on practical applications, such as autonomous driving and service robotics, that must run on mobile SoC platforms. These applications often have real-time execution constraints. The main challenge […]

OpenCL

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
