Investigating Host-Device communication in a GPU-based H.264 encoder

Kristoffer Egil Bonarjee
Department of Informatics, University of Oslo
University of Oslo, 2012





Modern graphics processing units (GPUs) are powerful parallel processors, capable of running thousands of concurrent threads. While originally limited to graphics processing, newer generations can be used for general-purpose computing (GPGPU). Through frameworks such as nVidia's Compute Unified Device Architecture (CUDA) and OpenCL, GPU programs can be written using established programming languages such as C and C++, with minor extensions. The wide deployment of GPUs, their low cost of entry, and their high performance make them an attractive target for workloads formerly reserved for supercomputers or special-purpose hardware. While the programming languages are similar, the hardware architecture differs significantly from that of a CPU. In addition, the GPU is connected through a comparatively slow interconnect, the PCI Express bus. Hence, it is easy to fall into performance pitfalls if these characteristics are not taken into account.

In this thesis, we have investigated the performance pitfalls of an H.264 encoder written for nVidia GPUs. More specifically, we looked into the interaction between the host CPU and the GPU. We did not focus on optimizing GPU code, but rather on how execution and communication were handled by the CPU code. Because much manual labour is required to optimize GPU code, it is easy to neglect the CPU part of accelerated applications. Through our experiments, we examined multiple issues in the host application that can affect performance. By moving IO operations into separate host threads, we masked the latencies associated with reading input from secondary storage. By analyzing the state shared between the host and the device, we were able to reduce the time spent synchronizing data by transferring only the actual changes. Using CUDA streams, we further enhanced our work on input prefetching by transferring input frames to device memory in parallel with the encoding.
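The stream-based overlap of frame uploads with encoding described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual code: the kernel name `encode_frame`, the double-buffering scheme, and the frame count are assumptions for the example. The key CUDA mechanisms are real: pinned host memory (`cudaMallocHost`) is required for truly asynchronous copies, and `cudaMemcpyAsync` on one stream can overlap with a kernel running on another.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one encoding stage; the real encoder's
// kernels are not shown in the abstract.
__global__ void encode_frame(const unsigned char *frame,
                             unsigned char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = frame[i];  // placeholder for actual encoding work
}

int main(void) {
    const int FRAME_BYTES = 1920 * 1080 * 3 / 2;  // one YUV 4:2:0 frame
    const int NUM_FRAMES  = 16;                   // assumed for the sketch
    unsigned char *h_frames[2], *d_frames[2], *d_out;
    cudaStream_t copy_stream, exec_stream;

    cudaStreamCreate(&copy_stream);
    cudaStreamCreate(&exec_stream);
    cudaMalloc(&d_out, FRAME_BYTES);
    for (int i = 0; i < 2; i++) {
        // Pinned (page-locked) host buffers enable asynchronous DMA.
        cudaMallocHost(&h_frames[i], FRAME_BYTES);
        cudaMalloc(&d_frames[i], FRAME_BYTES);
    }

    // Double-buffered pipeline: while frame f is encoded on exec_stream,
    // frame f+1 is uploaded over PCIe on copy_stream.
    for (int f = 0; f < NUM_FRAMES; f++) {
        int cur = f & 1, next = cur ^ 1;
        if (f + 1 < NUM_FRAMES)
            cudaMemcpyAsync(d_frames[next], h_frames[next], FRAME_BYTES,
                            cudaMemcpyHostToDevice, copy_stream);
        encode_frame<<<(FRAME_BYTES + 255) / 256, 256, 0, exec_stream>>>(
            d_frames[cur], d_out, FRAME_BYTES);
        cudaStreamSynchronize(exec_stream);  // frame f finished
        cudaStreamSynchronize(copy_stream);  // frame f+1 now resident
    }
    return 0;
}
```

In this pattern the PCIe transfer cost is hidden behind kernel execution, which is the effect the abstract attributes to combining prefetching with CUDA streams.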
We also experimented with concurrent kernel execution to preprocess future frames in parallel with encoding. While we only touched upon the possibilities of concurrent kernel execution, the results were promising. Our results show that a significant improvement can be achieved by focusing optimization effort on the host part of a GPU application. To reach peak performance, the host code must be designed for low latency in job dispatching and GPU memory management; otherwise the GPU will idle while waiting for more work. With the rapid advancement of GPU technology, this trend is likely to intensify.
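The concurrent-kernel experiment mentioned above can be sketched as two kernels issued to different streams. The kernel names and launch configuration here are illustrative assumptions; the underlying mechanism is real: on GPUs with concurrent kernel execution (Fermi-class and newer), kernels launched on distinct streams may run simultaneously when sufficient resources are free.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels: a lightweight preprocessing pass for the next
// frame and the main encoding pass for the current frame.
__global__ void preprocess_frame(unsigned char *frame, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] = frame[i];  // placeholder for real preprocessing
}

__global__ void encode_frame(const unsigned char *frame,
                             unsigned char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = frame[i];  // placeholder for real encoding
}

void encode_with_lookahead(unsigned char *d_cur, unsigned char *d_next,
                           unsigned char *d_out, int n,
                           cudaStream_t prep_stream,
                           cudaStream_t enc_stream) {
    int blocks = (n + 255) / 256;
    // Issued on separate streams: if the encoder leaves SMs idle, the
    // preprocessing of the next frame can fill them concurrently.
    preprocess_frame<<<blocks, 256, 0, prep_stream>>>(d_next, n);
    encode_frame<<<blocks, 256, 0, enc_stream>>>(d_cur, d_out, n);
    // Synchronize before the frames swap roles for the next iteration.
    cudaStreamSynchronize(enc_stream);
    cudaStreamSynchronize(prep_stream);
}
```

Whether the two kernels actually overlap depends on their resource footprints; a preprocessing kernel that saturates the device on its own would simply serialize behind the encoder.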
