
Posts

Jul, 16

Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires significant effort and expertise in both the application and systems domains. This is […]
Jul, 16

Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks

Building large AI fleets to support rapidly growing DL workloads is an active research topic for modern cloud providers. Generating accurate benchmarks plays an essential role in designing the fast-paced software and hardware solutions in this space. Two fundamental challenges to making this scalable are (i) workload representativeness and (ii) the ability to quickly […]
Jul, 16

Tile-based Lightweight Integer Compression in GPU

GPUs are increasingly used for high-performance and interactive data analytics workloads due to their capability to accelerate computation using massive parallelism. A key constraint of GPU-based data analytics today is the limited memory capacity in GPU devices. Data compression is a powerful technique that can mitigate the capacity limitation in two ways: (1) fitting more […]
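The paper's exact codec is not reproduced here, but a minimal sketch of one common lightweight scheme, frame-of-reference encoding over fixed-size tiles, conveys the general idea; the tile size, the fixed 16-bit delta width, and all names below are illustrative assumptions rather than the paper's design:

```cuda
// Minimal sketch of tile-based frame-of-reference (FOR) compression:
// one block compresses one 256-element tile by finding the tile
// minimum and storing each value as a narrow delta from it.
// Tile size and the fixed 16-bit delta width are assumptions.
#include <cstdint>

constexpr int TILE = 256;

__global__ void for_compress(const uint32_t* in, uint32_t* refs,
                             uint16_t* deltas, int n) {
    __shared__ uint32_t smin[TILE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * TILE + tid;

    smin[tid] = (idx < n) ? in[idx] : UINT32_MAX;
    __syncthreads();

    // Tree reduction: after the loop, smin[0] holds the tile minimum.
    for (int s = TILE / 2; s > 0; s >>= 1) {
        if (tid < s && smin[tid + s] < smin[tid]) smin[tid] = smin[tid + s];
        __syncthreads();
    }
    if (tid == 0) refs[blockIdx.x] = smin[0];

    // Assumes each delta fits in 16 bits; a real codec would choose
    // the bit width per tile based on the tile's value range.
    if (idx < n) deltas[idx] = (uint16_t)(in[idx] - smin[0]);
}
// Launch: for_compress<<<(n + TILE - 1) / TILE, TILE>>>(in, refs, deltas, n);
```

Decompression is the inverse, in[i] == refs[i / TILE] + deltas[i], which is why such schemes decompress fast enough to feed GPU query operators.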
Jul, 16

Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU

Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack […]
Jul, 16

Improving the Performance, Portability, and Productivity of Hardware Accelerators

With the end of Moore's Law and Dennard scaling, attention is shifting to new ways of enhancing computer performance. Improving microprocessor performance is becoming increasingly complex, whereas computational power demands still grow tremendously fast. In recent years, we have been witnessing a paradigm change: rather than using one single chip, the CPU, for computing everything, computers […]
Jul, 9

Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc

Leveraging Graphics Processing Units (GPUs) to accelerate scientific software has proven to be highly successful, but in order to extract more performance, GPU programmers must overcome the high latency costs associated with their use. One method of reducing or hiding this latency cost is to use asynchronous streams to issue commands to the GPU. While […]
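PETSc's stream abstraction is not shown here, but the underlying latency-hiding pattern is plain CUDA: split the work into chunks and enqueue each chunk's transfers and kernel on its own stream so that copies for one chunk overlap compute for another. The chunk count, the scale kernel, and all names below are illustrative, not PETSc code:

```cuda
// Minimal sketch of latency hiding with CUDA streams: each chunk's
// host-to-device copy, kernel, and device-to-host copy go on a
// separate stream, letting the hardware overlap them across chunks.
#include <cuda_runtime.h>

__global__ void scale(double* x, double a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Assumes h is pinned memory (cudaMallocHost); with pageable memory
// the async copies degrade to effectively synchronous ones. Also
// assumes n % nchunks == 0 and nchunks <= 8.
void scale_async(double* h, double* d, int n, int nchunks) {
    cudaStream_t streams[8];
    int chunk = n / nchunks;
    for (int c = 0; c < nchunks; ++c) {
        cudaStreamCreate(&streams[c]);
        int off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(double),
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[c]>>>(d + off, 2.0, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(double),
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    for (int c = 0; c < nchunks; ++c) {
        cudaStreamSynchronize(streams[c]);
        cudaStreamDestroy(streams[c]);
    }
}
```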
Jul, 9

Modeling Parallel Programs using Large Language Models

Parallel software codes in high performance computing (HPC) continue to grow in complexity and scale as we enter the exascale era. A diverse set of emerging hardware and programming paradigms make developing, optimizing, and maintaining parallel software burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. […]
Jul, 9

Optimization Techniques for GPU Programming

In the past decade, Graphics Processing Units (GPUs) have played an important role in the field of high-performance computing, and they continue to advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, something that is not trivial. This survey discusses various optimization […]
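As a flavor of the techniques such a survey covers, the contrast below shows memory coalescing, one of the best-known GPU optimizations: consecutive threads should touch consecutive addresses so the hardware can merge a warp's loads into a few wide transactions. The kernels are illustrative examples, not taken from the survey:

```cuda
// Both kernels reduce an n x n row-major matrix A, one output per
// thread; only the access pattern differs.

// Uncoalesced: thread j sums row j, so at each step the threads of a
// warp read addresses n floats apart, splitting every load into many
// memory transactions.
__global__ void row_sum_slow(const float* A, float* out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.0f;
    for (int k = 0; k < n; ++k) s += A[(size_t)j * n + k];
    out[j] = s;
}

// Coalesced: thread j sums column j, so at each step a warp reads the
// contiguous run A[k*n + j .. k*n + j + 31] in one wide transaction.
__global__ void col_sum_fast(const float* A, float* out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float s = 0.0f;
    for (int k = 0; k < n; ++k) s += A[(size_t)k * n + j];
    out[j] = s;
}
```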
Jul, 9

Matrix Multiplication Using Only Addition

Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only […]
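The paper's circuit-level technique is not reproduced here; the sketch below only illustrates the underlying principle that a*b equals the sum of shifted copies of a over the set bits of b, so an integer matrix product can be built from adds and shifts alone. All names are illustrative:

```cuda
// Generic illustration of multiplier-free integer multiply via
// shift-and-add, not the paper's specific method.
#include <cstdint>

__device__ uint32_t mul_add_only(uint32_t a, uint32_t b) {
    uint32_t acc = 0;
    while (b) {
        if (b & 1u) acc += a;  // add the current shifted copy of a
        a <<= 1;               // next power-of-two multiple of a
        b >>= 1;
    }
    return acc;  // equals a*b modulo 2^32, like a hardware multiplier
}

__global__ void matmul_add_only(const uint32_t* A, const uint32_t* B,
                                uint32_t* C, int n) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n || c >= n) return;
    uint32_t s = 0;
    for (int k = 0; k < n; ++k)
        s += mul_add_only(A[r * n + k], B[k * n + c]);
    C[r * n + c] = s;
}
```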
Jul, 9

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design […]
Jul, 2

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SYCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and […]
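For reference, AXPY, the simplest kernel in the evaluated suite, computes y = a*x + y; below is a CUDA version (one of the programming models the study targets) of the kind such tools are prompted to generate:

```cuda
// AXPY: y = a*x + y, one element per thread.
__global__ void axpy(int n, double a, const double* x, double* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Launch example: axpy<<<(n + 255) / 256, 256>>>(n, a, x, y);
```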
Jul, 2

cuSLINK: Single-linkage Agglomerative Clustering on the GPU

In this paper, we propose cuSLINK, a novel and state-of-the-art reformulation of the SLINK algorithm on the GPU, which requires only O(Nk) space and uses a parameter k to trade off space and time. We also propose a set of novel and reusable building blocks that compose cuSLINK. These building blocks include highly optimized computational […]
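cuSLINK's actual primitives are not reproduced here; to give a sense of the kind of building block single-linkage clustering needs, below is a deliberately naive kernel that finds, for each point, its nearest neighbor in a different cluster, the core query behind each merge step. All names are illustrative, and production code like cuSLINK avoids this O(N^2) scan:

```cuda
#include <cfloat>

// For each point i, find the closest point j with a different cluster
// label (brute force, one thread per point, squared distances).
// Single-linkage repeatedly merges the cluster pair realizing the
// smallest such distance.
__global__ void cross_cluster_1nn(const float* pts, const int* label,
                                  int n, int dim,
                                  float* best_dist, int* best_idx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best = FLT_MAX;
    int arg = -1;
    for (int j = 0; j < n; ++j) {
        if (label[j] == label[i]) continue;
        float d = 0.0f;
        for (int k = 0; k < dim; ++k) {
            float t = pts[i * dim + k] - pts[j * dim + k];
            d += t * t;
        }
        if (d < best) { best = d; arg = j; }
    }
    best_dist[i] = best;
    best_idx[i] = arg;
}
```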
