high performance computing on graphics processing units: hgpu.org

Study of Bandwidth Partitioning for Co-executing GPU Kernels

Erik Melander

Department of Information Technology, Uppsala University

Uppsala University, 2017

@misc{melander2017study,
  title={Study of Bandwidth Partitioning for Co-executing GPU Kernels},
  author={Melander, Erik},
  year={2017}
}

Co-executing GPU kernels on a partitioned GPU has been shown to improve utilization efficiency of poorly scaling tasks. While kernels can be executed in parallel, data transfers to the GPU are serial, which can negatively impact the parallelism and predictability of the kernels. In this work we implement a fairness-based approach to memory transfers by chunking data sets and transferring the chunks interleaved, and we evaluate the overhead of this approach. We then develop a model to predict when kernels will start under this implementation. We found that chunked transfers in a single CUDA stream have only a small overhead compared to serial transfers, while event-synchronized transfers in several streams have a larger overhead, particularly for chunk sizes below 500 KB. The prediction models accurately estimate kernel starting times and return transfer times with less than 2.7% relative error.
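
To make the idea of interleaved, chunked transfers concrete, the following is a minimal sketch of a toy timing model, not the model developed in the thesis. It assumes a single transfer link with constant bandwidth, round-robin ordering of chunks across data sets, and that a kernel can start as soon as the last chunk of its data set has arrived. The function name `transfer_finish_times`, the parameter `per_chunk_overhead`, and the example sizes are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def transfer_finish_times(
    sizes: Dict[str, float],
    bandwidth: float,
    chunk: Optional[float] = None,
    per_chunk_overhead: float = 0.0,
) -> Dict[str, float]:
    """Return the time at which each data set has fully arrived.

    sizes: data set name -> size (e.g. in MB), transferred in dict order.
    bandwidth: link bandwidth in the same size unit per time unit.
    chunk: None means one serial transfer per data set; otherwise data
           sets are split into chunks of this size and sent round-robin.
    per_chunk_overhead: hypothetical fixed cost added to every chunk.
    """
    finish: Dict[str, float] = {}
    now = 0.0

    if chunk is None:
        # Serial case: each data set waits for all earlier ones.
        for name, size in sizes.items():
            now += size / bandwidth
            finish[name] = now
        return finish

    remaining = dict(sizes)
    while remaining:
        # One round: send one chunk of every data set that still has data.
        for name in list(remaining):
            sent = min(chunk, remaining[name])
            now += sent / bandwidth + per_chunk_overhead
            remaining[name] -= sent
            if remaining[name] <= 0:
                finish[name] = now
                del remaining[name]
    return finish


# Toy example: a large data set A queued ahead of a small data set B.
sizes = {"A": 8.0, "B": 2.0}
serial = transfer_finish_times(sizes, bandwidth=1.0)
interleaved = transfer_finish_times(sizes, bandwidth=1.0, chunk=1.0)
print("serial:     ", serial)
print("interleaved:", interleaved)
```

In this toy setting the small data set B finishes after 4 time units instead of 10 when chunks are interleaved, while A still finishes at 10, which illustrates the fairness motivation. With a nonzero `per_chunk_overhead`, smaller chunks increase the total transfer time, which is qualitatively in line with the overhead reported for small chunk sizes.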

Tags: Computer science, CUDA, nVidia, Performance, Tesla K20, Thesis

December 7, 2017 by hgpu
