high performance computing on graphics processing units: hgpu.org

Posts

Mar, 21

AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries

Edge computing offers the distinct advantage of harnessing compute capabilities on resources located at the edge of the network to run workloads of relatively weak user devices. This is achieved by offloading computationally intensive workloads, such as deep learning from user devices to the edge. Using the edge reduces the overall communication latency of applications […]

CUDA

Mar, 21

A Deep Learning Based Cost Model for Automatic Code Optimization

Enabling compilers to automatically optimize code has been a longstanding goal for the compiler community. Efficiently solving this problem requires using precise cost models. These models predict whether applying a sequence of code transformations reduces the execution time of the program. Building an analytical cost model to do so is hard in modern x86 architectures […]

Mar, 19

Reviewing GPU architectures to build efficient back projection for parallel geometries

Back-Projection is the major algorithm in Computed Tomography to reconstruct images from a set of recorded projections. It is used for both fast analytical methods and high-quality iterative techniques. X-ray imaging facilities rely on Back-Projection to reconstruct internal structures in material samples and living organisms with high spatial and temporal resolution. Fast image reconstruction is […]

CUDA

•

OpenCL

Mar, 14

DISC: A Dynamic Shape Compiler for Machine Learning Workloads

Many recent machine learning models show dynamic shape characteristics. However, existing AI compiler optimization systems suffer a lot from problems brought by dynamic shape models, including compilation overhead, memory usage, optimization pipeline and deployment complexity. This paper provides a compiler system to natively support optimization for dynamic shape workloads, named DISC. DISC enriches a set […]

CUDA

Mar, 14

Transparent Checkpointing for OpenGL Applications on GPUs

This work presents transparent checkpointing of OpenGL applications, refining the split-process technique[1] for application in GPU-based 3D graphics. The split-process technique was earlier applied to checkpointing MPI and CUDA programs, enabling reinitialization of driver libraries. The presented design targets practical, checkpoint-package agnostic checkpointing of OpenGL applications. An early prototype is demonstrated on Autodesk Maya. Maya […]

OpenGL

Mar, 14

A Compiler Infrastructure for Accelerator Generators

We present Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. Calyx combines a hardware-like structural language with a software-like control flow representation with loops and conditionals. This split representation enables a new class of hardware-focused optimizations that require both structural and control flow information which are crucial for high-level programming […]

Mar, 14

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis […]

CUDA

Mar, 14

hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate […]

Mar, 7

LazyTensor: combining eager execution with domain-specific compilers

Domain-specific optimizing compilers have demonstrated significant performance and portability benefits, but require programs to be represented in their specialized IRs. Existing frontends to these compilers suffer from the "language subset problem" where some host language features are unsupported in the subset of the user’s program that interacts with the domain-specific compiler. By contrast, define-by-run ML […]

Mar, 7

Benchmarking Modern Edge Devices for AI Applications

AI (artificial intelligence) has grown at an overwhelming speed for the last decade, to the extent that it has become one of the mainstream tools that drive the advancements in science and technology. Meanwhile, the paradigm of edge computing has emerged as one of the foremost areas in which applications using the AI technology are […]

Mar, 7

Integrating Accelerators in Heterogeneous Systems

This work studies programmability enhancing abstractions in the context of accelerators and heterogeneous systems. Specifically, the focus is on adapting abstractions that have been successfully established to improve the programmability of CPUs. Specialized accelerators including GPUs, TPUs, and FPGAs promise to deliver orders of magnitude improvements in performance and energy efficiency. However, to exploit these […]

CUDA

Mar, 7

Scalable communication for high-order stencil computations using CUDA-aware MPI

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

AVEC: Accelerator Virtualization in Cloud-Edge Computing for Deep Learning Libraries

A Deep Learning Based Cost Model for Automatic Code Optimization

Reviewing GPU architectures to build efficient back projection for parallel geometries

DISC: A Dynamic Shape Compiler for Machine Learning Workloads

Transparent Checkpointing for OpenGL Applications on GPUs

A Compiler Infrastructure for Accelerator Generators

ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX

hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices

LazyTensor: combining eager execution with domain-specific compilers

Benchmarking Modern Edge Devices for AI Applications

Integrating Accelerators in Heterogeneous Systems

Scalable communication for high-order stencil computations using CUDA-aware MPI

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)