
Posts

Jun, 30

How to Rent GPUs on a Budget

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in […]
Jun, 23

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes on the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can […]
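To make the offloading idea concrete, the sketch below shows one hybrid CPU-GPU optimizer step in CUDA: gradients and parameters are staged into pinned host memory, the SGD update runs on the CPU, and the new parameters are copied back. This is an illustration of the general technique, not the paper's implementation; all names (d_params, d_grads, lr) are assumed.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// One CPU-offloaded SGD step. Illustrative sketch only, not the paper's
// code: gradients/parameters are staged to pinned host memory, the update
// runs on the CPU, and the result is copied back to the GPU.
void cpu_offloaded_sgd_step(float* d_params, const float* d_grads,
                            size_t n, float lr, cudaStream_t stream) {
    float *h_params, *h_grads;
    cudaMallocHost(&h_params, n * sizeof(float)); // pinned: enables async copies
    cudaMallocHost(&h_grads,  n * sizeof(float));
    cudaMemcpyAsync(h_grads, d_grads, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(h_params, d_params, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);                // wait for both transfers
    for (size_t i = 0; i < n; ++i)                // optimizer math on the CPU
        h_params[i] -= lr * h_grads[i];
    cudaMemcpyAsync(d_params, h_params, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaFreeHost(h_params);
    cudaFreeHost(h_grads);
}
```

Staging through pinned buffers keeps the copies asynchronous with respect to other streams; real offloaded optimizers exploit this to overlap CPU updates for one shard of the model with GPU compute on another.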
Jun, 23

GPU Parallelization of Astronomical Image Subtraction

Astronomical image subtraction is a method for generating a difference image from two images of the same area taken at different times, in order to see changes over time. Because the images are taken at different times, one of them has to be convolved to match the atmospheric conditions of the […]
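The per-pixel core of difference imaging parallelizes naturally: each GPU thread convolves the reference image at its pixel and subtracts the result from the science image. The CUDA kernel below is a naive illustrative sketch under assumed names (d_sci, d_ref, a K×K matching kernel k), not the paper's optimized implementation.

```cpp
#include <cuda_runtime.h>

// Difference image: subtract a convolved reference from the science image.
// Illustrative sketch: d_ref is blurred with a small KxK matching kernel k
// (K assumed odd) so its effective PSF matches d_sci before subtraction.
__global__ void subtract_convolved(const float* d_sci, const float* d_ref,
                                   const float* k, int K,
                                   float* d_diff, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int r = K / 2;
    float conv = 0.0f;
    for (int j = -r; j <= r; ++j)            // naive convolution of the
        for (int i = -r; i <= r; ++i) {      // reference image
            int xx = min(max(x + i, 0), W - 1);   // clamp at image borders
            int yy = min(max(y + j, 0), H - 1);
            conv += d_ref[yy * W + xx] * k[(j + r) * K + (i + r)];
        }
    d_diff[y * W + x] = d_sci[y * W + x] - conv; // per-pixel difference
}
```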
Jun, 23

An End-to-End Programming Model for AI Engine Architectures

The proliferation of deep learning in various domains has led to remarkable advancements in artificial intelligence applications, such as large language models for scientific use cases. However, the concomitant exponential growth in computational demands, driven by the development of ever-larger deep learning models, presents significant challenges in terms of resource consumption and sustainability. This dissertation addresses […]
Jun, 23

Optimal Kernel Orchestration for Tensor Programs with Korch

Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of […]
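As a minimal illustration of the fusion transformation such systems reason about, the CUDA sketch below shows two elementwise operators (add, ReLU) first as separate kernels with a global-memory intermediate, then fused into one kernel whose intermediate stays in a register. The operators and names are assumptions for illustration, not Korch's code.

```cpp
#include <cuda_runtime.h>

// Unfused: two operators, two kernels, with the intermediate t = a + b
// written to and re-read from global memory.
__global__ void add_op(const float* a, const float* b, float* t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) t[i] = a[i] + b[i];
}
__global__ void relu_op(const float* t, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(t[i], 0.0f);
}

// Fused: one kernel; the intermediate lives in a register, so the
// global-memory round trip for t disappears.
__global__ void add_relu_fused(const float* a, const float* b,
                               float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(a[i] + b[i], 0.0f);
}
```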
Jun, 23

COOK Access Control on an embedded Volta GPU

The last decade has seen the emergence of a new generation of multi-core platforms in response to advances in machine learning, in particular Deep Neural Network (DNN) training and inference tasks. These platforms, like the JETSON AGX XAVIER, embed several cores and accelerators in a SWaP-efficient (Size, Weight, and Power) package with a limited […]
Jun, 16

Understanding GPU Triggering APIs for MPI+X Communication

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast stream/graph- and kernel-triggered MPI communication abstractions, whose principal purpose is to enhance the performance of communication when GPU kernels create or […]
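For context, the host-driven baseline that triggered abstractions aim to improve on looks roughly like the sketch below: the host must block on the stream that produced the data before posting the send, serializing kernel execution and communication. The sketch assumes a CUDA-aware MPI implementation; buffer and parameter names are illustrative.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Conventional host-driven MPI+CUDA: the host blocks on the stream that
// produced d_buf, then posts the send. Stream/graph- and kernel-triggered
// APIs aim to enqueue the send from the GPU side and remove this round trip.
void host_driven_send(float* d_buf, int count, int peer,
                      cudaStream_t stream, MPI_Comm comm) {
    cudaStreamSynchronize(stream);            // wait for the producing kernel
    MPI_Send(d_buf, count, MPI_FLOAT, peer,   // CUDA-aware MPI can send
             /*tag=*/0, comm);                // directly from device memory
}
```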
Jun, 16

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent introduction of AMD-manufactured graphics processors to the world’s fastest supercomputers, tuning strategies established for previous hardware generations must be re-evaluated. In this study, […]
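As a concrete reference point, the kernel below is a naive 2D 5-point stencil in CUDA (trivially portable to HIP); parameters such as the thread-block shape are exactly the tuning knobs whose best values can differ between AMD and Nvidia hardware. Names and coefficients are illustrative.

```cpp
#include <cuda_runtime.h>

// Naive 2D 5-point stencil; one thread per interior grid point.
// The launch block shape (e.g. 32x8 vs 16x16) is the kind of parameter
// whose best value differs across GPU vendors and must be re-tuned.
__global__ void stencil5(const float* in, float* out, int W, int H) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= W - 1 || y >= H - 1) return; // skip boundary
    out[y * W + x] = 0.25f * (in[y * W + x - 1] + in[y * W + x + 1]
                            + in[(y - 1) * W + x] + in[(y + 1) * W + x]);
}
```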
Jun, 16

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, prior work largely overlooks the diverse key factors […]
Jun, 16

A methodology for comparing optimization algorithms for auto-tuning

Adapting applications to optimally utilize available hardware is no mean feat: the plethora of optimization choices makes it infeasible to explore them all manually. To this end, auto-tuning frameworks are used to automate this task, which in turn use optimization algorithms to efficiently search the vast search spaces. However, there is a lack of comparability in studies […]
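For a sense of what such frameworks automate, the sketch below shows the brute-force baseline: time a (hypothetical) kernel under each candidate block size with CUDA events and keep the fastest. The optimization algorithms such studies compare replace this exhaustive loop with a guided search over the same space.

```cpp
#include <cuda_runtime.h>
#include <initializer_list>
#include <cstdio>

// Illustrative tuning target; real auto-tuners take the kernel and its
// tunable parameters as inputs.
__global__ void kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

// Time one configuration with CUDA events (warm-up runs and repetitions
// omitted for brevity; real tuners add both).
float time_config(float* d_x, int n, int block) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    int grid = (n + block - 1) / block;
    cudaEventRecord(t0);
    kernel<<<grid, block>>>(d_x, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms; cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    return ms;
}

int main() {
    const int n = 1 << 24;
    float* d_x; cudaMalloc(&d_x, n * sizeof(float));
    int best = 0; float best_ms = 1e30f;
    for (int block : {64, 128, 256, 512, 1024}) { // exhaustive search
        float ms = time_config(d_x, n, block);
        if (ms < best_ms) { best_ms = ms; best = block; }
    }
    printf("best block size: %d (%.3f ms)\n", best, best_ms);
    cudaFree(d_x);
    return 0;
}
```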
Jun, 16

How much can we gain from Tensor Kernel Fusion on GPUs?

Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks, where it combines multiple consecutive kernels into a single larger kernel. This approach aims to enhance performance by reducing the need for slow off-chip memory accesses. Instead, intermediate results between successive kernels are stored in faster on-chip memory like shared […]
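The sketch below illustrates this with a fused pointwise-then-smooth kernel in CUDA: the intermediate tile, which two unfused kernels would pass through global memory, is staged in shared memory instead, with a one-element halo because the second stage reads neighbors. Names, the operators, and the TILE size are assumptions for illustration.

```cpp
#include <cuda_runtime.h>

#define TILE 256  // threads per block; launch with blockDim.x == TILE

// Fused pipeline: stage 1 scales each element, stage 2 applies a 3-point
// smooth. The stage-1 result is kept in shared memory (tile + halo) rather
// than in a global-memory temporary between two separate kernels.
// Launch: scale_then_smooth<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
__global__ void scale_then_smooth(const float* in, float* out, int n) {
    __shared__ float s[TILE + 2];             // tile plus left/right halo
    int g = blockIdx.x * TILE + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                  // local index (skip halo slot)
    s[l] = (g < n) ? in[g] * 2.0f : 0.0f;     // stage 1: pointwise op
    if (threadIdx.x == 0)                     // left halo element
        s[0] = (g > 0) ? in[g - 1] * 2.0f : 0.0f;
    if (threadIdx.x == TILE - 1)              // right halo element
        s[TILE + 1] = (g + 1 < n) ? in[g + 1] * 2.0f : 0.0f;
    __syncthreads();                          // tile fully staged on-chip
    if (g < n)                                // stage 2: 3-point smooth
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;
}
```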
Jun, 9

Memory Interference and Performance Prediction in GPU-Accelerated Heterogeneous Systems

Nowadays, a variety of applications, including automated factories, autonomous vehicles, and Cyber-Physical Systems (CPS), are experiencing significant growth. Given the diverse range of challenges that must be addressed, such as real-time management and visualization of a factory’s current state through a 3D digital twin, trajectory calculation within autonomous vehicles, visualizing Human Machine Interfaces (HMI), […]
