
Posts

Jul, 9

Optimization Techniques for GPU Programming

In the past decade, Graphics Processing Units have played an important role in high-performance computing, and they continue to advance new fields such as IoT, autonomous vehicles, and exascale computing. It is therefore important to understand how to extract performance from these processors, which is not trivial. This survey discusses various optimization […]
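
As a concrete illustration of the kind of technique such surveys cover, consider memory coalescing: threads in a warp should access consecutive addresses. The CUDA sketch below is our own illustration, not taken from the survey:

```cpp
// Illustrative CUDA sketch: the coalesced kernel lets each warp read
// consecutive addresses; the strided kernel scatters its accesses,
// which typically costs a large bandwidth penalty on most GPUs.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive per warp
    if (i < n) out[i] = in[i];
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // gaps between lanes
    if (i < n) out[i] = in[i];
}
```
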
Jul, 9

Matrix Multiplication Using Only Addition

Matrix multiplication consumes a large fraction of the time taken in many machine-learning algorithms. Thus, accelerator chips that perform matrix multiplication faster than conventional processors or even GPUs are of increasing interest. In this paper, we demonstrate a method of performing matrix multiplication without a scalar multiplier circuit. In many cases of practical interest, only […]
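
To make the idea concrete, here is a hedged sketch (our illustration, not the paper's circuit): when one operand is quantized to 8 bits, each scalar product in the inner loop can be replaced by a handful of shifted additions driven by the set bits of the weight.

```cpp
#include <cstdint>
#include <vector>

// Hedged sketch: a multiply-free inner product for 8-bit unsigned weights.
// Each product a[i]*w[i] becomes at most 8 shifted additions, one per set
// bit of the weight -- shifts and adds only, no multiplier.
int64_t dot_add_only(const std::vector<int32_t>& a,
                     const std::vector<uint8_t>& w) {
    int64_t acc = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        int64_t x = a[i];
        for (int b = 0; b < 8; ++b)      // walk the bits of the weight
            if (w[i] & (1u << b))
                acc += x << b;           // shift-and-add
    }
    return acc;
}
```
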
Jul, 9

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design […]
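
The flavour of search such an automatic planner runs can be sketched as follows; the cost model and the numbers below are illustrative placeholders, not the paper's:

```cpp
#include <cstdio>

// Hedged sketch: enumerate (data, tensor, pipeline) parallel degrees whose
// product matches the GPU count and keep the plan with the lowest estimated
// per-GPU memory. The footprint numbers and toy cost model are assumptions.
int main() {
    const double params_gb = 48.0, acts_gb = 64.0;   // assumed model footprint
    const int gpus = 16;
    double best = 1e30; int bd = 0, bt = 0, bp = 0;
    for (int dp = 1; dp <= gpus; ++dp)
        for (int tp = 1; tp <= gpus / dp; ++tp) {
            if (gpus % (dp * tp)) continue;
            int pp = gpus / (dp * tp);
            // toy model: params+grads+optimizer shard over tp*pp,
            // activations shard over dp*tp
            double mem = 4.0 * params_gb / (tp * pp) + acts_gb / (dp * tp);
            if (mem < best) { best = mem; bd = dp; bt = tp; bp = pp; }
        }
    std::printf("dp=%d tp=%d pp=%d est=%.1f GB/GPU\n", bd, bt, bp, best);
}
```
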
Jul, 2

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SYCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and […]
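
For reference, AXPY (y = a*x + y) is the simplest of these kernels; a hand-written C++ OpenMP-offload version, one of the programming models the study targets, looks roughly like this:

```cpp
#include <cstddef>
#include <vector>

// Reference sketch of AXPY with OpenMP target offload: the array sections
// are mapped to the device and the loop is distributed across teams.
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = y.size();
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```
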
Jul, 2

cuSLINK: Single-linkage Agglomerative Clustering on the GPU

In this paper, we propose cuSLINK, a novel and state-of-the-art reformulation of the SLINK algorithm on the GPU which requires only O(Nk) space and uses a parameter k to trade off space and time. We also propose a set of novel and reusable building blocks that compose cuSLINK. These building blocks include highly optimized computational […]
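
For context, here is a hedged CPU reference of the classic SLINK recurrence (Sibson's algorithm) that cuSLINK reformulates for the GPU; it produces the pointer representation of the single-linkage dendrogram in O(N^2) time and O(N) space:

```cpp
#include <limits>
#include <vector>

// CPU reference sketch of Sibson's SLINK: pi[j] is the first cluster point j
// merges into, lambda[j] the merge height. dist(j, i) is any metric.
template <class Dist>
void slink(int n, Dist dist, std::vector<int>& pi, std::vector<double>& lambda) {
    const double INF = std::numeric_limits<double>::infinity();
    pi.assign(n, 0); lambda.assign(n, INF);
    std::vector<double> M(n);
    for (int i = 0; i < n; ++i) {
        pi[i] = i; lambda[i] = INF;
        for (int j = 0; j < i; ++j) M[j] = dist(j, i);
        for (int j = 0; j < i; ++j) {
            if (lambda[j] >= M[j]) {                     // j now merges into i
                M[pi[j]] = std::min(M[pi[j]], lambda[j]);
                lambda[j] = M[j];
                pi[j] = i;
            } else {
                M[pi[j]] = std::min(M[pi[j]], M[j]);
            }
        }
        for (int j = 0; j < i; ++j)                      // relabeling pass
            if (lambda[j] >= lambda[pi[j]]) pi[j] = i;
    }
}
```
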
Jul, 2

Out-of-the-box library support for DBMS operations on GPUs

GPU-accelerated query execution is still an active research topic in the database community, as GPU architectures remain heterogeneous and vary in their capabilities (e.g., their newest selling point: tensor cores). Hence, many researchers devise optimal operator implementations for a specific device generation, involving tedious operator tuning by hand. Alternatively, there is a […]
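
A hedged sketch of the library-first alternative: a relational selection (a WHERE predicate) written against Thrust, leaving the kernel choice to the library rather than hand-tuning per device generation. The column values and predicate are invented for illustration:

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <vector>

// Selection via an out-of-the-box library primitive: thrust::copy_if picks
// its own implementation for the current device.
struct Above100 {
    __host__ __device__ bool operator()(int p) const { return p > 100; }
};

int main() {
    std::vector<int> h{12, 250, 99, 400, 101};           // toy price column
    thrust::device_vector<int> price(h.begin(), h.end());
    thrust::device_vector<int> hits(price.size());
    auto end = thrust::copy_if(price.begin(), price.end(),
                               hits.begin(), Above100{});
    hits.resize(end - hits.begin());   // surviving tuples: 250, 400, 101
    return 0;
}
```
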
Jul, 2

SYCL compute kernels for ExaHyPE

We discuss three SYCL realisations of a simple Finite Volume scheme over multiple Cartesian patches. The realisation flavours differ in how they map the compute steps onto loops and tasks: we compare an implementation that exclusively uses a cascade of for-loops to a version that uses nested parallelism, and finally benchmark these […]
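
A hedged sketch of two such flavours (the Finite Volume update itself is a placeholder, not ExaHyPE's, and u/unew are assumed to be USM device allocations): the first keeps the cell loops as a plain for-loop cascade inside one kernel, the second exposes them to the runtime as a 3D index space.

```cpp
#include <sycl/sycl.hpp>

void update(sycl::queue& q, const float* u, float* unew, int patches, int n) {
    // Flavour 1: one work-item per patch, cascade of for-loops inside
    q.parallel_for(sycl::range<1>(patches), [=](sycl::id<1> p) {
        const size_t base = p[0] * n * n;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                unew[base + i * n + j] = u[base + i * n + j];  // placeholder update
    });
    // Flavour 2: the cell loops become part of the parallel index space
    q.parallel_for(sycl::range<3>(patches, n, n), [=](sycl::id<3> c) {
        const size_t idx = (c[0] * n + c[1]) * n + c[2];
        unew[idx] = u[idx];                                    // same placeholder
    });
    q.wait();
}
```
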
Jul, 2

Managing, Profiling, and Optimizing Heterogeneous GPU Workloads

The popularity of machine learning (ML) workloads has made GPU instance offerings ubiquitous in the cloud, introducing new challenges in managing, profiling, and optimizing GPU workloads. Cloud providers assign passthrough GPUs directly to virtual machines (VMs) for high performance, but doing so renders VM migration non-functional, limiting cloud operators' ability to manage hardware resources. Existing […]
Jun, 25

Deep Language Models for Software Testing and Optimisation

Developing software is difficult. A challenging part of production development is ensuring programs are correct and fast, two properties addressed through software testing and optimisation. While both tasks still rely on manual effort and expertise, the recent surge in software applications has made them tedious and time-consuming. In this fast-paced environment, manual testing […]
Jun, 25

DGEMM on Integer Matrix Multiplication Unit

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMUs). It is […]
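
A hedged scalar sketch of the slicing idea behind running FP64 GEMM on integer units (the paper's Ozaki-style scheme uses more slices plus explicit error control): split each value into high/low integer slices, form all slice products in integer arithmetic, then recombine the partial sums in FP64.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const double a = 1.2345678901234, b = -2.3456789012345;
    const double S = 1 << 24;          // slice scale: 24 bits per slice
    // two slices per value: a ~= ah/S + al/S^2
    int64_t ah = std::llround(a * S), al = std::llround((a * S - ah) * S);
    int64_t bh = std::llround(b * S), bl = std::llround((b * S - bh) * S);
    // these integer products are what an IMMU would batch up as GEMMs
    double p = double(ah * bh) / (S * S)
             + (double(ah * bl) + double(al * bh)) / (S * S * S)
             + double(al * bl) / (S * S * S * S);
    std::printf("sliced %.15g  direct %.15g\n", p, a * b);
}
```
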
Jun, 25

Compilation and Design Space Exploration of Dataflow Programs for Heterogeneous CPU-GPU Platforms

Today’s continued increase in demand for processing power, despite the slowdown of Moore’s law, has led to an increase in processor count, which has resulted in energy consumption and distribution problems. To address this, there is a growing trend toward creating more complex heterogeneous systems in which multicore and many-core processors, GPUs, FPGAs, and DSPs are combined in […]
Jun, 25

GPU First – Execution of Legacy CPU Codes on GPUs

Utilizing GPUs is critical for high performance on heterogeneous systems. However, leveraging the full potential of GPUs for accelerating legacy CPU applications can be a challenging task for developers. The porting process requires identifying code regions amenable to acceleration, managing distinct memories, synchronizing host and device execution, and handling library functions that may not be […]
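
To illustrate the manual porting burden the abstract describes, here is our own before/after sketch of a single step: a legacy CPU loop next to a minimal OpenMP target-offload port, where the developer must now also manage device data movement.

```cpp
#include <cstddef>

void scale_cpu(double* a, std::size_t n, double s) {
    for (std::size_t i = 0; i < n; ++i) a[i] *= s;   // legacy CPU code
}

void scale_gpu(double* a, std::size_t n, double s) {
    // ported version: the map clause handles host/device data transfer
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (std::size_t i = 0; i < n; ++i) a[i] *= s;
}
```
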

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
