Posts

Nov, 20

Going green: optimizing GPUs for energy efficiency through model-steered auto-tuning

Graphics Processing Units (GPUs) have revolutionized the computing landscape over the past decade. However, the growing energy demands of data centres and computing facilities equipped with GPUs come with significant capital and environmental costs. The energy consumption of GPU applications greatly depends on how well they are optimized. Auto-tuning is an effective and commonly applied […]
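
A minimal sketch of the idea, assuming an exhaustive search over launch parameters; `run_kernel` and `measure_energy_joules` are hypothetical stand-ins for a real harness (for instance, one built on NVML power sampling), not the paper's tuner:

```python
# Hedged sketch: exhaustive auto-tuning of GPU kernel launch parameters
# for energy. Both helpers below are placeholders, not a real GPU API.
import itertools
import time

def run_kernel(block_x, block_y):
    # Placeholder: launch the kernel and wait for completion.
    time.sleep(0.001 * block_x / block_y)  # fake workload

def measure_energy_joules(fn):
    # Placeholder: a real tuner would sample GPU power via NVML around
    # fn() and integrate power over the elapsed time.
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    assumed_power_watts = 250.0  # assumption: constant board power
    return assumed_power_watts * elapsed

search_space = {"block_x": [16, 32, 64, 128], "block_y": [1, 2, 4, 8]}

best = None
for block_x, block_y in itertools.product(*search_space.values()):
    energy = measure_energy_joules(lambda: run_kernel(block_x, block_y))
    if best is None or energy < best[0]:
        best = (energy, block_x, block_y)

print("lowest-energy configuration:", best)
```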
Nov, 20

TorchOpt: An Efficient Library for Differentiable Optimization

Recent years have witnessed a boom in differentiable optimization algorithms. These algorithms exhibit different execution patterns, and their execution needs massive computational resources that go beyond a single CPU and GPU. Existing differentiable optimization libraries, however, cannot support efficient algorithm development and multi-CPU/GPU execution, making the development of differentiable optimization algorithms often cumbersome and […]
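
The pattern such libraries accelerate can be sketched in plain PyTorch: unroll a few inner optimizer steps and backpropagate through them to a meta-parameter. This shows only the underlying idea, not TorchOpt's API:

```python
# Hedged sketch of differentiable optimization: differentiate a final
# loss through 5 unrolled SGD steps w.r.t. the inner learning rate.
import torch

torch.manual_seed(0)
x = torch.randn(32, 4)
y = x @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])

lr = torch.tensor(0.1, requires_grad=True)   # meta-parameter
w = torch.zeros(4, 1, requires_grad=True)    # inner parameter

for _ in range(5):                           # unrolled inner loop
    loss = ((x @ w - y) ** 2).mean()
    (g,) = torch.autograd.grad(loss, w, create_graph=True)
    w = w - lr * g                           # differentiable update

outer_loss = ((x @ w - y) ** 2).mean()
outer_loss.backward()                        # flows back through all steps
print("d(outer loss)/d(lr) =", lr.grad.item())
```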
Nov, 13

Capturing the Memory Topology of GPUs

Optimizing program code is an essential process in High-Performance Computing and beyond. Given the trend in recent years of employing graphics cards as general-purpose accelerators and the broader rise in the importance of GPUs, optimizing GPU code is crucial in order to achieve the best possible performance of a […]
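
The standard measurement idea behind such work is pointer chasing: timing chains of dependent loads over growing working sets exposes latency jumps at each level of the memory hierarchy. The Python sketch below illustrates the technique on the CPU only; real GPU topology benchmarks implement the same loop in CUDA:

```python
# Hedged sketch: pointer-chasing latency over growing working sets.
# Interpreter overhead dominates in Python, so the cache steps are far
# less crisp than in a native implementation; the structure is the point.
import time
import numpy as np

def chase_latency_ns(n_elems, steps=100_000):
    # Build a random cyclic permutation so each load depends on the last.
    idx = np.random.permutation(n_elems).astype(np.int64)
    chain = np.empty(n_elems, dtype=np.int64)
    chain[idx] = np.roll(idx, 1)
    p = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        p = chain[p]
    return (time.perf_counter() - t0) / steps * 1e9

for kib in [4, 64, 1024, 16384]:          # working set in KiB
    n = kib * 1024 // 8                   # number of int64 elements
    print(f"{kib:>6} KiB: {chase_latency_ns(n):8.1f} ns/load")
```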
Nov, 13

A Study on Neural-based Code Summarization in Low-resource Settings

Automated software engineering with deep learning techniques has been explored extensively thanks to breakthroughs in code representation learning. Many code intelligence approaches have been proposed for the field's downstream tasks in recent years, contributing significant performance progress. Code summarization has been the central research topic among these downstream tasks because of […]
Nov, 13

pyGSL: A Graph Structure Learning Toolkit

We introduce pyGSL, a Python library that provides efficient implementations of state-of-the-art graph structure learning models along with diverse datasets to evaluate them on. The implementations are written in GPU-friendly ways, allowing one to scale to much larger network tasks. A common interface is introduced for algorithm unrolling methods, unifying implementations of recent state-of-the-art techniques […]
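
For flavor, a generic gradient-based formulation of graph structure learning (not pyGSL's interface): recover a nonnegative, symmetric adjacency matrix from signals assumed smooth on the graph, balancing Dirichlet energy against sparsity and degree terms:

```python
# Hedged sketch of graph structure learning from smooth signals.
import torch

torch.manual_seed(0)
n, k = 10, 50
X = torch.randn(n, k)                      # n nodes, k graph signals

theta = torch.rand(n, n, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)

for step in range(200):
    W = torch.relu(theta)                  # nonnegative weights
    W = (W + W.T) / 2                      # enforce symmetry
    deg = W.sum(dim=1)
    L = torch.diag(deg) - W                # combinatorial Laplacian
    dirichlet = torch.trace(X.T @ L @ X)   # smoothness of X on the graph
    loss = dirichlet + 0.5 * W.sum() - torch.log(deg + 1e-8).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("edges above threshold:", (torch.relu(theta) > 0.1).sum().item())
```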
Nov, 13

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

GPUs are essential for accelerating latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as shown by an empirical measurement study of DNN inference […]
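
A toy version of the provisioning problem, with an assumed fixed slowdown per co-located workload in place of iGniter's measured interference model, could be a greedy placement like this:

```python
# Hedged sketch: greedy interference-aware placement under latency SLOs.
# The 25%-per-neighbor slowdown is an assumption, not a measured model.
workloads = [("resnet50", 8.0, 15.0),   # (name, solo latency ms, SLO ms)
             ("bert", 12.0, 25.0),
             ("vgg16", 10.0, 20.0),
             ("mobilenet", 3.0, 10.0)]
SLOWDOWN_PER_NEIGHBOR = 1.25

def predicted(lat_ms, n_neighbors):
    return lat_ms * (SLOWDOWN_PER_NEIGHBOR ** n_neighbors)

gpus = []                                # each GPU holds a list of workloads
for name, solo, slo in workloads:
    for gpu in gpus:
        fits = predicted(solo, len(gpu)) <= slo and all(
            predicted(s, len(gpu)) <= o for _, s, o in gpu)
        if fits:
            gpu.append((name, solo, slo))
            break
    else:
        gpus.append([(name, solo, slo)])   # open a new GPU

for i, gpu in enumerate(gpus):
    print(f"GPU {i}: {[w[0] for w in gpu]}")
```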
Nov, 13

Multi-GPU thermal lattice Boltzmann simulations using OpenACC and MPI

We assess the performance of the hybrid Open Accelerator (OpenACC) and Message Passing Interface (MPI) approach for thermal lattice Boltzmann (LB) simulations accelerated by multiple graphics processing units (GPUs). OpenACC accelerates computation on a single GPU, and MPI synchronizes the information between multiple GPUs. With a single GPU, the two-dimensional (2D) simulation achieved 1.93 billion […]
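
The MPI half of such a hybrid code boils down to a halo exchange between neighboring ranks. The mpi4py sketch below shows the communication pattern with a simple stencil update standing in for the GPU-offloaded LB kernel; it illustrates the approach, not the paper's code:

```python
# Hedged sketch: 1-D domain decomposition with halo exchange (mpi4py).
# Run with: mpiexec -n 4 python halo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
nx, ny = 64, 256                          # per-rank interior grid
f = np.random.rand(nx + 2, ny)            # rows 0 and nx+1 are halos

up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for step in range(10):
    # Exchange boundary rows with both neighbors.
    comm.Sendrecv(f[1], dest=up, recvbuf=f[nx + 1], source=down)
    comm.Sendrecv(f[nx], dest=down, recvbuf=f[0], source=up)
    # Stand-in for the LB collision + streaming step done on the GPU.
    f[1:nx + 1] = 0.25 * (f[0:nx] + f[2:nx + 2] + 2 * f[1:nx + 1])

print(f"rank {rank}: mean = {f[1:nx + 1].mean():.4f}")
```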
Nov, 6

Enabling Data Movement and Computation Pipelining in Deep Learning Compiler

Pipelining between data loading and computation is a critical tensor program optimization for GPUs. Multi-stage pipelining across the multi-level buffer hierarchy of GPUs is particularly indispensable on the latest NVIDIA Ampere GPUs to reduce resource idleness and guarantee kernel performance. Currently, developers rely on expert-written libraries such as cuBLAS to access the pipelining […]
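
The essence of the optimization is double buffering: prefetch the data for step i+1 while computing on step i. A host-side Python analogue of a two-stage pipeline:

```python
# Hedged sketch: overlap "loading" and "compute" with a 2-stage pipeline.
import time
from concurrent.futures import ThreadPoolExecutor

def load(i):
    time.sleep(0.05)                      # stand-in for a memory copy
    return [i] * 1000

def compute(batch):
    time.sleep(0.05)                      # stand-in for the kernel
    return sum(batch)

n = 8
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as loader:
    pending = loader.submit(load, 0)      # prime the pipeline
    for i in range(n):
        batch = pending.result()
        if i + 1 < n:
            pending = loader.submit(load, i + 1)   # overlap the next load
        compute(batch)
elapsed = time.perf_counter() - t0
print(f"pipelined: {elapsed:.2f}s (fully serial would be ~{n * 0.1:.2f}s)")
```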
Nov, 6

Using scheduling entropy amplification in CUDA/OpenMP code to exhibit non-reproducibility issues

Rounding errors and cancellations, which arise with every floating-point operation, combined with the lack of control over execution order in parallel code, lead to numerical issues such as non-reproducibility. In order to make such numerical issues easier to discover, in this article we propose a simple solution based on an index interposer and […]
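
The root cause is easy to demonstrate: floating-point addition is not associative, so any schedule that reorders a parallel reduction can change the result:

```python
# Summing the same values in three orders gives three different answers.
import random

random.seed(42)
values = [random.uniform(-1, 1) * 10 ** random.randint(0, 8)
          for _ in range(100_000)]

shuffled_order = list(values)
random.shuffle(shuffled_order)

print(f"forward : {sum(values):.15g}")
print(f"backward: {sum(reversed(values)):.15g}")
print(f"shuffled: {sum(shuffled_order):.15g}")  # typically all differ
```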
Nov, 6

An Open-source FPGA Library for Data Sorting

Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing because their flexibility enables the building of application-specific computation pipelines and data supply systems. In addition to this flexibility, FPGA vendors offer OpenCL toolchains that reduce the programming effort required for FPGA development. However, […]
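
A classic building block of FPGA sorters is the bitonic sorting network, whose fixed compare-exchange pattern maps directly onto a hardware pipeline. A Python sketch of the network (input length must be a power of two):

```python
# Hedged sketch: iterative bitonic sorting network (ascending order).
def bitonic_sort(a):
    n = len(a)
    k = 2
    while k <= n:                    # size of bitonic subsequences
        j = k // 2
        while j > 0:                 # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 9, 1, 6, 2, 8, 5]))  # [1, 2, 3, 5, 6, 7, 8, 9]
```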
Nov, 6

The Open MatSci ML Toolkit: A Flexible Framework for Machine Learning in Materials Science

We present the Open MatSci ML Toolkit: a flexible, self-contained, and scalable Python-based framework to apply deep learning models and methods on scientific data with a specific focus on materials science and the OpenCatalyst Dataset. Our toolkit provides: 1. A scalable machine learning workflow for materials science leveraging PyTorch Lightning, which enables seamless scaling across […]
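
For readers unfamiliar with the underlying pattern, a minimal PyTorch Lightning workflow looks like the sketch below; the regression task is a made-up stand-in for a materials-property target, not the toolkit's actual interface:

```python
# Hedged sketch: the generic Lightning pattern (model + loss + optimizer
# in a LightningModule, device placement and scaling handled by Trainer).
import torch
import pytorch_lightning as pl

class PropertyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

x, y = torch.randn(256, 16), torch.randn(256, 1)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=32)
pl.Trainer(max_epochs=1, logger=False,
           enable_checkpointing=False).fit(PropertyRegressor(), loader)
```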
Nov, 6

Apple Silicon Performance in Scientific Computing

With the release of the Apple Silicon System-on-a-Chip processors, and the impressive performance shown in general use by both the M1 and M1 Ultra, we explore the potential of Apple Silicon processors for scientific computing. Both the M1 and M1 Ultra are compared to current state-of-the-art data-center GPUs, including an NVIDIA V100 with PCIe, […]
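
Comparisons of this kind typically rest on dense linear algebra throughput. A minimal sketch of such a measurement, timing a large matrix multiply and reporting sustained GFLOP/s:

```python
# Hedged sketch: single-precision GEMM throughput via NumPy's BLAS.
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
a @ b                                     # warm-up run

reps = 5
t0 = time.perf_counter()
for _ in range(reps):
    a @ b
elapsed = (time.perf_counter() - t0) / reps
print(f"{2 * n ** 3 / elapsed / 1e9:.1f} GFLOP/s ({elapsed * 1e3:.1f} ms)")
```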

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors