high performance computing on graphics processing units: hgpu.org

Posts

Mar, 30

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

CUDA Graphs — a recent hardware feature introduced for NVIDIA GPUs — aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data […]

CUDA

Mar, 30

Efficient allocation of image recognition and LLM tasks on multi-GPU system

This work is concerned with the evaluation of the performance of parallelization of learning and tuning processes for image classification and large language models. For machine learning model in image recognition, various parallelization methods are developed based on different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. A detailed description […]

CUDA

Mar, 30

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

There is a growing interest in the computer architecture community to incorporate heterogeneity and specialization to improve performance. Developers can write heterogeneous applications that consist of host code and kernel code, where compute-intensive kernels can be offloaded from CPU to GPU, FPGA, or quantum computer. However, the high complexity of these systems can pose challenges […]

Mar, 30

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The […]

Mar, 30

Analyzing Modern NVIDIA GPU cores

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of its design and […]

Mar, 23

The Shamrock code: I- Smoothed Particle Hydrodynamics on GPUs

We present Shamrock, a performance portable framework developed in C++17 with the SYCL programming standard, tailored for numerical astrophysics on Exascale architectures. The core of Shamrock is an accelerated parallel tree with negligible construction time, whose efficiency is based on binary algebra. The Smoothed Particle Hydrodynamics algorithm of the Phantom code is implemented in Shamrock. […]

CUDA

Mar, 23

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how […]

Mar, 23

Hercules: A Compiler for Productive Programming of Heterogeneous Systems

Modern computing systems increasingly rely on composing heterogeneous devices to improve performance and efficiency. Programming these systems is often unproductive: algorithm implementations must be coupled to system-specific logic, including device-specific optimizations, partitioning, and inter-device communication and synchronization, which requires developing different programs for different system configurations. We propose the Juno language, which represents general purpose […]

CUDA

Mar, 23

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tilebased approach. While traditional GPU programming often relies on low level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by […]

CUDA

Mar, 23

LLMPerf: GPU Performance Modeling meets Large Language Models

Performance modeling, a pivotal domain in program cost analysis, currently relies on manually crafted models constrained by various program and hardware limitations, especially in the intricate landscape of GPGPU. Meanwhile, Large Language Models (LLMs) have demonstrated their effectiveness in addressing diverse programming challenges. Our work establishes a connection between LLMs and performance modeling, employing the […]

OpenCL

Mar, 10

WgPy: GPU-accelerated NumPy-like array library for web browsers

To execute scientific computing programs such as deep learning at high speed, GPU acceleration is a powerful option. With the recent advancements in web technologies, interfaces like WebGL and WebGPU, which utilize GPUs on the client side of web applications, have become available. On the other hand, Pyodide, a Python runtime that operates on web […]

CUDA

Mar, 10

A Microbenchmark Framework for Performance Evaluation of OpenMP Target Offloading

We present a framework based on Catch2 to evaluate performance of OpenMP’s target offload model via micro-benchmarks. The compilers supporting OpenMP’s target offload model for heterogeneous architectures are currently undergoing rapid development. These developments influence performance of various complex applications in different ways. This framework can be employed to track the impact of compiler upgrades […]

high performance computing on graphics processing units: hgpu.org

Posts

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

Efficient allocation of image recognition and LLM tasks on multi-GPU system

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

Analyzing Modern NVIDIA GPU cores

The Shamrock code: I- Smoothed Particle Hydrodynamics on GPUs

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Hercules: A Compiler for Productive Programming of Heterogeneous Systems

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

LLMPerf: GPU Performance Modeling meets Large Language Models

WgPy: GPU-accelerated NumPy-like array library for web browsers

A Microbenchmark Framework for Performance Evaluation of OpenMP Target Offloading

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)