
Posts

Aug, 7

COX: Exposing CUDA Warp-Level Functions to CPUs

As CUDA becomes the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms becomes a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to extra steps in the translation, the support is always a […]
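For readers unfamiliar with the warp-level functions in question, the sketch below (my own illustration, not COX's implementation) emulates the CUDA primitive __shfl_down_sync on a CPU by materializing all 32 lane values in an array, then uses it for the classic warp tree reduction:

```cpp
// Conceptual sketch: emulating the CUDA warp-level primitive
// __shfl_down_sync on a CPU with a plain 32-element array.
#include <array>
#include <cstdio>

constexpr int kWarpSize = 32;

// CPU stand-in for __shfl_down_sync(0xffffffff, val, delta): lane i reads
// the value held by lane i + delta (or keeps its own value at the edge).
std::array<int, kWarpSize> shfl_down(const std::array<int, kWarpSize>& lanes,
                                     int delta) {
    std::array<int, kWarpSize> out{};
    for (int i = 0; i < kWarpSize; ++i)
        out[i] = (i + delta < kWarpSize) ? lanes[i + delta] : lanes[i];
    return out;
}

int main() {
    std::array<int, kWarpSize> val{};
    for (int i = 0; i < kWarpSize; ++i) val[i] = i + 1;  // lanes hold 1..32

    // Classic warp-level tree reduction: after log2(32) shuffle steps,
    // lane 0 holds the sum of all lanes, just as it would on a GPU.
    for (int delta = kWarpSize / 2; delta > 0; delta /= 2) {
        auto shifted = shfl_down(val, delta);
        for (int i = 0; i < kWarpSize; ++i) val[i] += shifted[i];
    }
    std::printf("warp sum at lane 0: %d\n", val[0]);  // 528 = 32*33/2
    return 0;
}
```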
Aug, 7

Design and Implementation of ShenWei Universal C/C++

The ShenWei many-core series processors powering multiple cutting-edge supercomputers are equipped with a unique on-chip heterogeneous architecture. They have long required programmers to write separate code for the control part on the Management Processing Element (MPE) and the accelerated part on the Compute Processing Elements (CPEs), similar to open standards like OpenCL. Such a programming model […]
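As a loose analogy only (ShenWei's real toolchains, such as the athread interface, differ), the sketch below models the MPE/CPE division of labor with standard C++ threads: control code partitions the work, and compute "elements" run the accelerated part:

```cpp
// Conceptual analogy, not ShenWei's API: the MPE/CPE split modeled
// with std::thread to show the control/compute division of labor.
#include <thread>
#include <vector>
#include <cstdio>

// "CPE kernel": the accelerated part, run by each compute element on a slice.
void cpe_kernel(const float* x, float* y, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) y[i] = 2.0f * x[i] + 1.0f;
}

int main() {
    constexpr size_t n = 1 << 20, kNumCpes = 64;  // 64 CPEs per core group
    std::vector<float> x(n, 1.0f), y(n, 0.0f);

    // "MPE control code": partition the work, launch it across the CPEs,
    // then wait, much as a host does in OpenCL-like offloading models.
    std::vector<std::thread> cpes;
    for (size_t c = 0; c < kNumCpes; ++c) {
        size_t begin = c * n / kNumCpes, end = (c + 1) * n / kNumCpes;
        cpes.emplace_back(cpe_kernel, x.data(), y.data(), begin, end);
    }
    for (auto& t : cpes) t.join();

    std::printf("y[0] = %.1f (expect 3.0)\n", y[0]);
    return 0;
}
```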
Jul, 24

Demystifying Dependency Bugs in Deep Learning Stack

Recent breakthroughs in deep learning (DL) techniques have stimulated significant growth in developing DL-enabled applications. These DL applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. A persistent challenge in dependency management across the […]
Jul, 24

CPU-GPU Layer-Switched Low Latency CNN Inference

Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. Both the embedded CPU and the GPU within an HMPSoC can perform CNN inference. However, common practice is to run a CNN on whichever HMPSoC component (CPU or GPU) provides the best performance (lowest latency) for that CNN. […]
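A minimal sketch of the layer-switching idea, with made-up latency numbers (my own illustration, not the paper's scheduler): once each layer's CPU and GPU latencies and the CPU-GPU transfer cost are known, choosing a per-layer device assignment becomes a small dynamic program:

```cpp
// Per-layer device assignment via dynamic programming: each layer runs on
// the CPU or the GPU, and switching devices between layers costs a transfer.
#include <array>
#include <vector>
#include <algorithm>
#include <cstdio>

int main() {
    // Per-layer latencies in ms (illustrative, not measured): {CPU, GPU}.
    std::vector<std::array<double, 2>> lat = {
        {4.0, 9.0},   // small early layer: CPU wins
        {6.0, 2.5},   // wide conv layer: GPU wins
        {5.5, 2.0},
        {1.5, 3.0},   // small head: CPU wins
    };
    const double kSwitchMs = 1.2;  // CPU<->GPU activation transfer cost

    // dp[d] = best total latency so far with the current layer on device d.
    std::array<double, 2> dp = {lat[0][0], lat[0][1]};
    for (size_t l = 1; l < lat.size(); ++l) {
        std::array<double, 2> next;
        for (int d = 0; d < 2; ++d)
            next[d] = lat[l][d] +
                      std::min(dp[d], dp[1 - d] + kSwitchMs);  // stay or switch
        dp = next;
    }
    std::printf("best end-to-end latency: %.1f ms\n", std::min(dp[0], dp[1]));
    return 0;
}
```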
Jul, 24

FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis

The emergence of FPGAs in the High-Performance Computing domain is driven by their promise of better energy efficiency and lower control latency, compared with other devices such as CPUs or GPUs. Despite these benefits, their full inclusion into HPC systems still faces several challenges. First, the complexity of FPGAs makes them more difficult to program compared to […]
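For readers new to High Level Synthesis, a minimal HLS-flavored kernel might look like the following. The pragma style is assumed (Xilinx Vitis HLS syntax); a plain C++ compiler simply ignores the pragmas, so the same source can also be tested on a CPU:

```cpp
// HLS-flavored SAXPY kernel sketch: the pragmas request pipelining and
// array partitioning when synthesized; they are no-ops for a CPU build.
#include <cstdio>

constexpr int N = 8;

void saxpy_kernel(const float a, const float x[N], const float y[N],
                  float out[N]) {
#pragma HLS ARRAY_PARTITION variable=x complete
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = a * x[i] + y[i];  // one result per clock once pipelined
    }
}

int main() {
    float x[N], y[N], out[N];
    for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0f; }
    saxpy_kernel(2.0f, x, y, out);
    std::printf("out[3] = %.1f (expect 7.0)\n", out[3]);
    return 0;
}
```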
Jul, 24

Theseus: A Library for Differentiable Nonlinear Optimization

We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several […]
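The code below is not the Theseus API; it is a bare-bones Gauss-Newton loop for a tiny nonlinear least-squares problem (fitting y = a*exp(b*x)), included only to make concrete the kind of solver such a library wraps and differentiates through:

```cpp
// Gauss-Newton for a two-parameter nonlinear least-squares fit.
#include <cmath>
#include <cstdio>

int main() {
    const double xs[] = {0.0, 1.0, 2.0, 3.0};
    double ys[4];
    for (int i = 0; i < 4; ++i) ys[i] = 2.0 * std::exp(0.5 * xs[i]);  // truth

    double a = 1.0, b = 0.1;  // initial guess for (a, b)
    for (int it = 0; it < 20; ++it) {
        // Accumulate the normal equations J^T J d = J^T r for theta = (a, b).
        double h11 = 0, h12 = 0, h22 = 0, g1 = 0, g2 = 0;
        for (int i = 0; i < 4; ++i) {
            double e = std::exp(b * xs[i]);
            double r = a * e - ys[i];           // residual
            double ja = e, jb = a * xs[i] * e;  // dr/da, dr/db
            h11 += ja * ja; h12 += ja * jb; h22 += jb * jb;
            g1 += ja * r;   g2 += jb * r;
        }
        double det = h11 * h22 - h12 * h12;
        a -= ( h22 * g1 - h12 * g2) / det;  // Cramer's rule on the 2x2 system
        b -= (-h12 * g1 + h11 * g2) / det;
    }
    std::printf("a = %.4f, b = %.4f (truth: 2, 0.5)\n", a, b);
    return 0;
}
```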
Jul, 24

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale DL training, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" […]
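For context, the ring-all-reduce pattern being scheduled can be simulated in a few lines: N workers exchange gradient chunks around a ring for 2*(N-1) steps, a reduce-scatter phase followed by an all-gather phase. A self-contained sketch (my own, with toy data):

```cpp
// In-process simulation of ring-all-reduce over N workers, one chunk each.
#include <vector>
#include <cstdio>

int main() {
    const int N = 4;
    std::vector<std::vector<double>> g(N, std::vector<double>(N));
    for (int w = 0; w < N; ++w)
        for (int c = 0; c < N; ++c) g[w][c] = w + 1;  // worker w's gradient

    // Reduce-scatter: each step moves one chunk to the next ring node;
    // after N-1 steps, worker w holds the fully reduced chunk (w+1) mod N.
    for (int s = 0; s < N - 1; ++s)
        for (int w = 0; w < N; ++w) {
            int c = ((w - s - 1) % N + N) % N;
            g[w][c] += g[(w - 1 + N) % N][c];
        }

    // All-gather: the completed chunks circulate once more around the ring.
    for (int s = 0; s < N - 1; ++s)
        for (int w = 0; w < N; ++w) {
            int c = ((w - s) % N + N) % N;
            g[w][c] = g[(w - 1 + N) % N][c];
        }

    for (int c = 0; c < N; ++c)
        std::printf("chunk %d on worker 0: %.0f (expect 10)\n", c, g[0][c]);
    return 0;
}
```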
Jul, 17

Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation

We present a series of dataflow dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. We provide a specialised algorithm to solve these minimisation problems, based […]
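As a toy illustration of that reduction (not the Futhark compiler's code), the sketch below applies the standard node-splitting transformation to a tiny dependency graph and computes the minimum vertex cut as a max-flow:

```cpp
// Minimum vertex cut via max-flow (Edmonds-Karp) after node splitting.
#include <vector>
#include <queue>
#include <algorithm>
#include <cstdio>

const int INF = 1 << 28;

int maxflow(std::vector<std::vector<int>>& cap, int s, int t) {
    int n = cap.size(), flow = 0;
    while (true) {
        std::vector<int> prev(n, -1);
        std::queue<int> q;  // BFS for a shortest augmenting path
        q.push(s); prev[s] = s;
        while (!q.empty() && prev[t] < 0) {
            int u = q.front(); q.pop();
            for (int v = 0; v < n; ++v)
                if (prev[v] < 0 && cap[u][v] > 0) { prev[v] = u; q.push(v); }
        }
        if (prev[t] < 0) return flow;  // no augmenting path left
        int aug = INF;
        for (int v = t; v != s; v = prev[v]) aug = std::min(aug, cap[prev[v]][v]);
        for (int v = t; v != s; v = prev[v]) {
            cap[prev[v]][v] -= aug; cap[v][prev[v]] += aug;
        }
        flow += aug;
    }
}

int main() {
    // Tiny dependency graph: source S feeds arrays A and B; both flow into
    // C; C feeds the host-needed result T. Node splitting: each array v
    // becomes v_in -> v_out with capacity 1 (transferring v costs 1).
    enum { S, Ain, Aout, Bin, Bout, Cin, Cout, T, NV };
    std::vector<std::vector<int>> cap(NV, std::vector<int>(NV, 0));
    cap[Ain][Aout] = cap[Bin][Bout] = cap[Cin][Cout] = 1;  // vertex capacities
    cap[S][Ain] = cap[S][Bin] = INF;   // original edges stay unbounded
    cap[Aout][Cin] = cap[Bout][Cin] = INF;
    cap[Cout][T] = INF;

    // Max-flow == min vertex cut: here 1, i.e. transferring C alone
    // suffices instead of transferring both A and B.
    std::printf("min transfers needed: %d\n", maxflow(cap, S, T));
    return 0;
}
```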
Jul, 17

Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments

With the improvement of global infrastructure, Cyber-Physical Systems (CPS) have become an important component of Industry 4.0. In such systems, applications and machines work together to manage task interdependencies. Machine learning methods in CPS require the monitoring of computational algorithms, including adopting optimizations, fine-tuning cyber systems, and improving resource utilization, as well […]
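As one hypothetical flavor of such a policy (a generic greedy heuristic, not the paper's method), the sketch below places each task on whichever heterogeneous node adds the least energy while keeping its queue under a deadline; all numbers are illustrative:

```cpp
// Greedy energy-aware placement across heterogeneous nodes.
#include <vector>
#include <cstdio>

struct Node { const char* name; double watts; double speed; double busy; };

int main() {
    std::vector<Node> nodes = {
        {"big-core", 6.0, 2.0, 0.0},    // fast but power-hungry
        {"little-core", 1.5, 0.8, 0.0},
        {"edge-gpu", 10.0, 4.0, 0.0},
    };
    std::vector<double> tasks = {8, 3, 5, 2, 6};  // work units per task
    const double kDeadline = 10.0;                // seconds per node queue

    for (double work : tasks) {
        int best = -1; double bestEnergy = 1e30;
        for (size_t i = 0; i < nodes.size(); ++i) {
            double t = work / nodes[i].speed;  // task runtime on this node
            double e = t * nodes[i].watts;     // task energy in joules
            if (nodes[i].busy + t <= kDeadline && e < bestEnergy) {
                bestEnergy = e; best = (int)i;
            }
        }
        if (best < 0) { std::printf("task dropped: no feasible node\n"); continue; }
        nodes[best].busy += work / nodes[best].speed;
        std::printf("work %.0f -> %s (%.1f J)\n", work, nodes[best].name, bestEnergy);
    }
    return 0;
}
```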
Jul, 17

Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading

Following the mass adoption of external accelerators for high performance computing, the overall performance of many applications has become increasingly dependent on relatively small accelerated kernels. As static analysis is fundamentally limited by dynamic values and external definitions, standard ahead-of-time compilation is not always sufficient to achieve the best performance. Furthermore, many users looking to […]
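A minimal example of the kind of small accelerated kernel in question is the OpenMP target-offload loop below; the build flags in the comment are assumed examples for a Clang/LLVM toolchain:

```cpp
// OpenMP target-offload SAXPY. Example build (assumed flags):
//   clang++ -O2 -fopenmp --offload-arch=sm_80 saxpy.cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float* x = new float[n];
    float* y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // The accelerated kernel: data is mapped to the device, and the loop
    // is distributed across the device's teams and threads.
#pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    std::printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    delete[] x; delete[] y;
    return 0;
}
```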
Jul, 17

The OpenMP Cluster Programming Model

Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient […]
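For readers unfamiliar with OpenMP task parallelism, the sketch below shows a small task graph with depend clauses, the single-node style that a cluster-wide task model generalizes:

```cpp
// A three-task OpenMP dependency graph: two producers and one consumer.
#include <cstdio>

int main() {
    int a = 0, b = 0, c = 0;
#pragma omp parallel
#pragma omp single
    {
#pragma omp task depend(out: a)
        a = 1;                      // producer task
#pragma omp task depend(out: b)
        b = 2;                      // independent producer, may run first
#pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                  // runs only after both producers finish
#pragma omp taskwait
        std::printf("c = %d (expect 3)\n", c);
    }
    return 0;
}
```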
Jul, 17

High Performance Simulation for Scalable Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning experiments and open-source training environments are typically limited in scale, supporting tens or sometimes up to hundreds of interacting agents. In this paper we demonstrate the use of Vogue, a high-performance agent-based model (ABM) framework. Vogue serves as a multi-agent training environment, supporting thousands to tens of thousands of interacting […]
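As a rough sketch of what a large-scale ABM step loop looks like (my own toy code, not the Vogue API), the example below updates ten thousand agents per tick using a structure-of-arrays layout:

```cpp
// Toy agent-based model: thousands of agents advanced in a flat array
// each tick, the layout that keeps large ABMs fast on CPUs and GPUs.
#include <vector>
#include <random>
#include <cstdio>

struct Agents {              // structure-of-arrays: cache/GPU friendly
    std::vector<float> x, v;
};

int main() {
    const int n = 10000, steps = 100;
    Agents a{std::vector<float>(n, 0.0f), std::vector<float>(n, 0.0f)};
    std::mt19937 rng(42);
    std::normal_distribution<float> noise(0.0f, 0.1f);

    for (int t = 0; t < steps; ++t)
        for (int i = 0; i < n; ++i) {             // one tick per agent
            a.v[i] = 0.9f * a.v[i] + noise(rng);  // momentum + exploration
            a.x[i] += a.v[i];
        }

    double mean = 0;
    for (int i = 0; i < n; ++i) mean += a.x[i];
    std::printf("mean position after %d steps: %.3f\n", steps, mean / n);
    return 0;
}
```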

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org