high performance computing on graphics processing units: hgpu.org

Posts

Apr, 10

ALPINIST: An Annotation-Aware GPU Program Optimizer

GPU programs are widely used in industry. To obtain the best performance, a typical development process involves the manual or semi-automatic application of optimizations prior to compiling the code. To avoid the introduction of errors, we can augment GPU programs with (pre- and postcondition-style) annotations to capture functional properties. However, keeping these annotations correct when […]

CUDA • OpenCL

Mar, 27

Advanced Joins on GPUs

Over the past years, the rise of the General Purpose GPU (GPGPU) paradigm has become more evident in high-performance computing. The massive parallelism that GPUs offer at low cost is the catalyst for their adoption in numerous computationally intensive applications, where tremendous speedup gains are reported due to the ease of parallelization of the algorithms they […]

CUDA

Mar, 27

One-shot tuner for deep learning compilers

Auto-tuning DL compilers are gaining ground as an optimizing back-end for DL frameworks. While existing work can generate deep learning models that exceed the performance of hand-tuned libraries, it still suffers from prohibitively long auto-tuning time due to repeated hardware measurements in large search spaces. In this paper, we take a neural-predictor-inspired approach to […]

CUDA

Mar, 27

Simulation Methodologies for Mobile GPUs

GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the […]

OpenCL

Mar, 27

Data transfer optimizations for heterogeneous managed runtime systems

Nowadays, most programmable systems contain multiple hardware accelerators with different characteristics. In order to use the available hardware resources and improve the performance of their applications, developers must use a low-level language, such as C/C++. Achieving the same goal from a high-level managed language (Java, Haskell, C#) poses several challenges such as the inability to […]

CUDA • OpenCL

Mar, 27

Migrating CUDA to oneAPI: A Smith-Waterman Case Study

To face the programming challenges related to heterogeneous computing, Intel recently introduced oneAPI, a new programming environment that allows code developed in the Data Parallel C++ (DPC++) language to be run on different devices such as CPUs, GPUs, and FPGAs, among others. To tackle CUDA-based legacy codes, oneAPI provides a compatibility tool (dpct) that facilitates the migration […]

CUDA

Mar, 20

Managing Extreme Heterogeneity in Next Generation HPC Systems

As traditional high performance computing architectures are unable to meet the energy and performance requirements of increasingly intensive applications, HPC centers are moving towards incorporating heterogeneous node architectures in next-generation HPC systems. While GPUs have become quite popular over the last few years as accelerators, other novel acceleration devices such as FPGAs and neural network […]

OpenCL

Mar, 20

Machine Learning for CUDA+MPI Design Rules

We present a new strategy for automatically exploring the design space of key CUDA+MPI programs and providing design rules that discriminate slow from fast implementations. In such programs, the order of operations (e.g., GPU kernels, MPI communication) and the assignment of operations to resources (e.g., GPU streams) make the space of possible designs enormous. Systems experts […]

CUDA

Mar, 20

Concurrent CPU-GPU Task Programming using Modern C++

In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of heterogeneous decomposition strategies. Our new CPU-GPU programming model allows users to express a problem in a way that […]

CUDA

Mar, 20

DISTAL: The Distributed Tensor Algebra Compiler

We introduce DISTAL, a compiler for dense tensor algebra that targets modern distributed and heterogeneous systems. DISTAL lets users independently describe how tensors and computation map onto target machines through separate format and scheduling languages. The combination of choices for data and computation distribution creates a large design space that includes many algorithms from both […]

CUDA

Mar, 20

Benchmarking a Proof-of-Concept Performance Portable SYCL-based Fast Fourier Transformation Library

In this paper, we present an early version of a SYCL-based FFT library, capable of running on all major vendor hardware, including CPUs and GPUs from AMD, ARM, Intel and NVIDIA. Although preliminary, the aim of this work is to seed further developments for a rich set of features for calculating FFTs. It has the […]

OpenCL

Mar, 11

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

We present hipBone, an open source performance-portable proxy application for the Nek5000 (and NekRS) CFD applications. HipBone is a fully GPU-accelerated C++ implementation of the original NekBone CPU proxy application with several novel algorithmic and implementation improvements which optimize its performance on modern fine-grain parallel GPU accelerators. Our optimizations include a conversion to store the […]

CUDA


* * *

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC
