high performance computing on graphics processing units: hgpu.org


Automated GPU Kernel Transformations in Large-Scale Production Stencil Applications

Mohamed Wahib, Naoya Maruyama

RIKEN Advanced Institute for Computational Science, Kobe, Japan

ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC’15), 2015

DOI:10.1145/2749246.2749255


This paper proposes an end-to-end framework for automatically transforming stencil-based CUDA programs to exploit inter-kernel data locality. The CUDA-to-CUDA transformation collectively replaces the user-written kernels with auto-generated kernels optimized for data reuse. The transformation is based on two basic operations, kernel fusion and kernel fission, and relies on a series of automated steps: gathering metadata, generating graphs that express dependencies and precedence constraints, searching for optimal kernel fissions and fusions, and generating optimized code. The framework is designed to be flexible enough to accommodate different applications, and it allows the programmer to monitor and amend the intermediate results of each phase of the transformation. We demonstrate the practicality and effectiveness of automatic transformations in exploiting exposed data locality using a variety of real-world applications with large codebases that contain dozens of kernels and data arrays. Experimental results show that the proposed end-to-end automated approach, with minimal intervention from the user, improved the performance of six applications, with speedups ranging from 1.12x to 1.76x.
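To make the idea of kernel fusion concrete, here is a minimal sketch of the general technique, not the framework's generated code. It assumes plain Python loops standing in for CUDA kernels, and the function names (`smooth_kernel`, `diff_kernel`, `fused_kernel`) are hypothetical. The two separate kernels communicate through an intermediate array, whereas the fused version keeps the intermediate value in a local variable and reads the input once per point.

```python
# Illustrative sketch only: plain Python loops stand in for CUDA kernels.
# All function names here are hypothetical and not taken from the paper.

def smooth_kernel(a):
    """Kernel 1: 3-point stencil. Reads a, writes the intermediate array b."""
    n = len(a)
    b = list(a)  # boundary points are copied unchanged
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    return b


def diff_kernel(a, b):
    """Kernel 2: pointwise. Re-reads a and reads b, writes c."""
    return [b[i] - a[i] for i in range(len(a))]


def fused_kernel(a):
    """Fused kernel: the smoothed value stays in a local variable
    instead of making a round trip through an intermediate array."""
    n = len(a)
    c = [0.0] * n  # at the boundaries b[i] == a[i], so c[i] == 0
    for i in range(1, n - 1):
        smoothed = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        c[i] = smoothed - a[i]
    return c


a = [1.0, 4.0, 9.0, 16.0, 25.0]
unfused = diff_kernel(a, smooth_kernel(a))
fused = fused_kernel(a)
```

On a GPU, the saving would come from avoiding the global-memory write and re-read of `b` between two kernel launches. Fusing kernels where the second one is itself a stencil over `b` is harder, since neighbouring values of `b` are then needed as well; this sketch deliberately avoids that case.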

Tags: CUDA, Stencil computation

April 9, 2015 by wahibium
