
Directive-Based Partitioning and Pipelining for Graphics Processing Units

X. Cui, T. R. W. Scogland, B. R. de Supinski, W. Feng
Virginia Tech, Blacksburg, VA 24060
IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017

@inproceedings{cui2017directive,
   title={Directive-Based Partitioning and Pipelining for Graphics Processing Units},
   author={Cui, Xuewen and Scogland, Thomas RW and de Supinski, Bronis R and Feng, Wu-chun},
   booktitle={Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International},
   pages={575--584},
   year={2017},
   organization={IEEE}
}


The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC, and OpenCL can efficiently offload compute-intensive workloads to these devices. By default, these models naively offload computation without overlapping it with communication (copying data to or from the device). Achieving high performance can require extensive refactoring and hand-tuning to apply optimizations such as pipelining. Further, users must manually partition the dataset whenever its size is larger than device memory, which can be especially difficult when the device memory size is not exposed to the user. We propose a directive-based partitioning and pipelining extension for accelerators appropriate for either OpenMP or OpenACC. Its interface supports overlap of data transfers and kernel computation without explicit user splitting of data. It can map data to a pre-allocated device buffer and automate memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation with four different applications. The experimental results show that our approach can reduce memory usage by 52% to 97% while delivering a 1.41x to 1.65x speedup over the naive offload model.
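The directive syntax itself is not reproduced on this page, but the abstract's point becomes concrete when compared with the hand-written pipelining it aims to automate. The following CUDA sketch is illustrative only and not taken from the paper: the array size, chunk and stream counts, and the scale kernel are assumptions. It processes the data in chunks through small pre-allocated device buffers so that host-to-device copies, kernel launches, and device-to-host copies for different chunks overlap, while device memory stays bounded by NSTREAMS * CHUNK elements instead of the full dataset.

// Minimal sketch (assumed example, not from the paper) of manual chunked
// pipelining with CUDA streams and a bounded device-memory footprint.
#include <cuda_runtime.h>
#include <stdio.h>

#define N        (1 << 24)      // total number of elements
#define CHUNKS   8              // number of sub-tasks the array is split into
#define NSTREAMS 2              // concurrent streams, each with its own buffer
#define CHUNK    (N / CHUNKS)   // elements per chunk

__global__ void scale(float *d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main(void) {
    float *h;
    cudaMallocHost((void **)&h, N * sizeof(float));   // pinned host memory enables async copies
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    cudaStream_t s[NSTREAMS];
    float *dbuf[NSTREAMS];                            // one CHUNK-sized device buffer per stream
    for (int j = 0; j < NSTREAMS; ++j) {
        cudaStreamCreate(&s[j]);
        cudaMalloc((void **)&dbuf[j], CHUNK * sizeof(float));
    }

    // Round-robin chunks over the streams. Operations within a stream execute
    // in order, so each stream can safely reuse its device buffer for its next
    // chunk, while different streams overlap copies and kernels.
    for (int c = 0; c < CHUNKS; ++c) {
        int j = c % NSTREAMS;
        size_t off = (size_t)c * CHUNK;
        cudaMemcpyAsync(dbuf[j], h + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[j]);
        scale<<<(CHUNK + 255) / 256, 256, 0, s[j]>>>(dbuf[j], CHUNK, 2.0f);
        cudaMemcpyAsync(h + off, dbuf[j], CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[j]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %.1f\n", h[0]);                    // expect 2.0
    for (int j = 0; j < NSTREAMS; ++j) {
        cudaStreamDestroy(s[j]);
        cudaFree(dbuf[j]);
    }
    cudaFreeHost(h);
    return 0;
}

Writing, indexing, and tuning this chunking by hand for each offloaded region is exactly the refactoring burden the proposed directive extension is intended to remove.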