high performance computing on graphics processing units: hgpu.org

Posts

Jul, 7

Automatic Code Rewriting for Performance Portability

Rewriting code for cleanliness, API changes, and new programming models is a common, yet time-consuming task. This is important for HPC applications that desire performance portability in particular, since these applications are usually very long lived and wish to run on many architectures, so they need to be written such that they can make good […]

OpenCL

Jul, 7

PSCToolkit: solving sparse linear systems with a large number of GPUs

In this chapter, we describe the Parallel Sparse Computation Toolkit (PSCToolkit), a suite of libraries for solving large-scale linear algebra problems in an HPC environment. In particular, we focus on the tools provided for the solution of symmetric and positive-definite linear systems using up to 8192 GPUs on the EuroHPC-JU Leonardo supercomputer. PSCToolkit is an […]

CUDA

Jun, 30

Composing Distributed Computations Through Task and Kernel Fusion

We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that enables the necessary analyses for the fusion of distributed tasks to be performed in a scalable manner. We pair task fusion with a JIT compiler […]

CUDA

Jun, 30

CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

Bayesian optimization is a powerful method for automating tuning of compilers. The complex landscape of autotuning provides a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized benchmarks has limited the study of Bayesian optimization within the domain. To address this, we present CATBench, a comprehensive benchmarking suite that captures […]

OpenCL

Jun, 30

Adapting database components to heterogeneous environments

Data management has seen rapid evolution during the last years, influenced by factors such as data explosion, the prevalence of machine and deep learning, the slowdown of Moore’s law and the popularity of hardware accelerators. Data processing systems are trying to adapt to all these trends by building monolithic and highly specialized systems, which are […]

CUDA

Jun, 30

A Survey of General-purpose Polyhedral Compilers

Since the 1990’s many implementations of polyhedral compilers have been written and distributed, either as source-to-source translating compilers or integrated into wider purpose compilers. This paper provides a survey on those various available implementations as of today, 2024. We list and describe most commonly available polyhedral schedulers and compiler implementations. Then, we compare the general-purpose […]

CUDA

•

OpenCL

Jun, 30

How to Rent GPUs on a Budget

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in […]

Jun, 23

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can […]

CUDA

Jun, 23

GPU Parallelization of Astronomical Image Subtraction

Astronomical image subtraction is a method for generating a difference image from two images, which covers the same area but taken at different times, in order to see changes over time. Due to the images being taken at different times, one of the images has to be convolved, to match the atmospheric conditions of the […]

OpenCL

Jun, 23

An End-to-End Programming Model for AI Engine Architectures

The proliferation of deep learning in various domains has led to remarkable advancements in artificial intelligence applications, such as large-language models for scientific use cases. However, the concomitant exponential growth in computational demands, driven by the development of ever-larger deep learning models, presents significant challenges in terms of resource consumption and sustainability. This dissertation addresses […]

Jun, 23

Optimal Kernel Orchestration for Tensor Programs with Korch

Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of […]

CUDA

Jun, 23

COOK Access Control on an embedded Volta GPU

The last decade has seen the emergence of a new generation of multi-core in response to advances in machine learning, and in particular Deep Neural Network (DNN) training and inference tasks. These platforms, like the JETSON AGX XAVIER, embed several cores and accelerators in a SWaP- efficient (Size Weight and Power) package with a limited […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Automatic Code Rewriting for Performance Portability

PSCToolkit: solving sparse linear systems with a large number of GPUs

Composing Distributed Computations Through Task and Kernel Fusion

CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

Adapting database components to heterogeneous environments

A Survey of General-purpose Polyhedral Compilers

How to Rent GPUs on a Budget

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

GPU Parallelization of Astronomical Image Subtraction

An End-to-End Programming Model for AI Engine Architectures

Optimal Kernel Orchestration for Tensor Programs with Korch

COOK Access Control on an embedded Volta GPU

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Most viewed papers (last 30 days)