high performance computing on graphics processing units: hgpu.org

Posts

Mar, 27

Migrating CUDA to oneAPI: A Smith-Waterman Case Study

To face the programming challenges related to heterogeneous computing, Intel recently introduced oneAPI, a new programming environment that allows code developed in the Data Parallel C++ (DPC++) language to run on different devices such as CPUs, GPUs, and FPGAs, among others. To tackle CUDA-based legacy codes, oneAPI provides a compatibility tool (dpct) that facilitates the migration […]

CUDA

Mar, 20

Managing Extreme Heterogeneity in Next Generation HPC Systems

As traditional high performance computing architectures are unable to meet the energy and performance requirements of increasingly intensive applications, HPC centers are moving towards incorporating heterogeneous node architectures in next-generation HPC systems. While GPUs have become quite popular over the last few years as accelerators, other novel acceleration devices such as FPGAs and neural network […]

OpenCL

Mar, 20

Machine Learning for CUDA+MPI Design Rules

We present a new strategy for automatically exploring the design space of key CUDA+MPI programs and providing design rules that discriminate slow from fast implementations. In such programs, the order of operations (e.g., GPU kernels, MPI communication) and the assignment of operations to resources (e.g., GPU streams) make the space of possible designs enormous. Systems experts […]

CUDA

Mar, 20

DISTAL: The Distributed Tensor Algebra Compiler

We introduce DISTAL, a compiler for dense tensor algebra that targets modern distributed and heterogeneous systems. DISTAL lets users independently describe how tensors and computation map onto target machines through separate format and scheduling languages. The combination of choices for data and computation distribution creates a large design space that includes many algorithms from both […]

CUDA

Mar, 20

Concurrent CPU-GPU Task Programming using Modern C++

In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of heterogeneous decomposition strategies. Our new CPU-GPU programming model allows users to express a problem in a way that […]

CUDA

Mar, 20

Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library

In this paper, we present an early version of a SYCL-based FFT library, capable of running on all major vendor hardware, including CPUs and GPUs from AMD, ARM, Intel and NVIDIA. Although preliminary, the aim of this work is to seed further developments for a rich set of features for calculating FFTs. It has the […]

OpenCL

Mar, 11

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

We present hipBone, an open source performance-portable proxy application for the Nek5000 (and NekRS) CFD applications. HipBone is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several novel algorithmic and implementation improvements which optimize its performance on modern fine-grain parallel GPU accelerators. Our optimizations include a conversion to store the […]

CUDA

Mar, 6

Integrating SkePU’s algorithmic skeletons with GPI on a cluster

As processors’ clock speeds flattened out in the early 2000s, multi-core processors became more prevalent, and so did parallel programming. However, this programming paradigm introduces additional complexities, and the SkePU framework was created to combat them. SkePU does this by offering a single-threaded interface which executes the user’s code in parallel in accordance with a chosen […]

CUDA • OpenCL

Mar, 6

Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Mobile GPUs, as ubiquitous and powerful accelerators, play an important role in accelerating on-device DNN (Deep Neural Network) inference. The frequent upgrades and diversity of mobile GPUs require automatic kernel generation to empower fast DNN deployment. However, currently generated kernels perform poorly. The goal of this paper is to rapidly generate high-performance kernels for […]

OpenCL

Mar, 6

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

Protein structure prediction is an important method for understanding gene translation and protein function in the domain of structural biology. AlphaFold introduced the Transformer model to the field of protein structure prediction with atomic accuracy. However, training and inference of the AlphaFold model are time-consuming and expensive because of the special performance characteristics and huge […]

CUDA

Mar, 6

Query Processing on Tensor Computation Runtimes

The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in new hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now part of the offerings of major cloud providers. Meanwhile, by hiding the low-level complexity through a tensor-based interface, tensor […]

CUDA

Mar, 6

Enabling On-Device Smartphone GPU based Training: Lessons Learned

Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing works have focused on reducing the computational and resource overheads of running Deep Neural Network (DNN) inference on resource-constrained mobile devices. However, the other aspect of DNN operations, i.e., training (forward and backward passes) on smartphone GPUs, has received little attention thus […]

OpenCL

* * *

Recent source codes

Allo: Accelerator Design Language

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

HPC Benchmark Survey

HDM: Home made Diffusion Models

General Matrix Multiplication (GEMM)

CrossTL: Universal Programming Language & Translator

TBD-GPU

DG-SWEM - The Discontinuous Galerkin Shallow Water Equation Model

torchPDLP: Primal-Dual Linear Programming in PyTorch. In collaboration with AMD and IPAM

Benchmarks for Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs
