high performance computing on graphics processing units: hgpu.org

Investigating Host-Device communication in a GPU-based H.264 encoder

Kristoffer Egil Bonarjee

Department of Informatics, University of Oslo

University of Oslo, 2012


Modern graphics processing units (GPUs) are powerful parallel processors, capable of running thousands of concurrent threads. While originally limited to graphics processing, newer generations can be used for general-purpose computing (GPGPU). Through frameworks such as nVidia Compute Unified Device Architecture (CUDA) and OpenCL, GPU programs can be written in established programming languages (with minor extensions) such as C and C++. Their widespread deployment, low cost of entry and high performance make GPUs an attractive target for workloads formerly reserved for supercomputers or special hardware. While the programming language is similar, the hardware architecture itself is significantly different from that of a CPU. In addition, the GPU is connected through a comparatively slow interconnect, the PCI Express bus. Hence, it is easy to fall into performance pitfalls if these characteristics are not taken into account. In this thesis, we have investigated the performance pitfalls of an H.264 encoder written for nVidia GPUs. More specifically, we looked into the interaction between the host CPU and the GPU. We did not focus on optimizing GPU code, but rather on how execution and communication were handled by the CPU code. As much manual labour is required to optimize GPU code, it is easy to neglect the CPU part of accelerated applications. Through our experiments, we have looked into multiple issues in the host application that can affect performance. By moving I/O operations into separate host threads, we masked the latencies associated with reading input from secondary storage. By analyzing the state shared between the host and the device, we were able to reduce the time spent synchronizing data by transferring only actual changes. Using CUDA streams, we further enhanced our work on input prefetching by transferring input frames to device memory in parallel with the encoding.
We also experimented with concurrent kernel execution to preprocess future frames in parallel with encoding. While we only touched upon the possibilities of concurrent kernel execution, the results were promising. Our results show that a significant improvement can be achieved by focusing optimization effort on the host part of a GPU application. To reach peak performance, the host code must be designed for low latency in job dispatching and GPU memory management. Otherwise, the GPU will idle while waiting for more work. With the rapid advancement of GPU technology, this trend is likely to escalate.
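The input prefetching idea described in the abstract (reading frames on a separate host thread so the encoder loop never waits on storage) can be illustrated with a small host-only sketch. This is not the thesis's implementation: `Frame`, `FrameQueue`, `read_frames`, `encode_all` and `demo` are hypothetical names chosen for illustration, the "encoder" only counts frames, and no CUDA calls are made. It assumes fixed-size raw frames and drops a trailing partial frame.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <istream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical frame type: one raw input frame as a byte buffer.
using Frame = std::vector<char>;

// Bounded, thread-safe queue that hands frames from the reader thread
// to the encoder loop.
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Blocks while the queue is full, so the reader stays at most
    // `capacity` frames ahead of the consumer.
    void push(Frame frame) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return frames_.size() < capacity_; });
        frames_.push(std::move(frame));
        not_empty_.notify_one();
    }

    // Signals that no more frames will arrive.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        not_empty_.notify_all();
    }

    // Blocks until a frame is available. Returns false once the queue
    // has been closed and fully drained.
    bool pop(Frame& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !frames_.empty() || closed_; });
        if (frames_.empty()) return false;
        out = std::move(frames_.front());
        frames_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    std::queue<Frame> frames_;
    bool closed_ = false;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs on the reader thread: reads fixed-size frames until the input
// ends, then closes the queue. A trailing partial frame is dropped.
void read_frames(std::istream& input, std::size_t frame_bytes, FrameQueue& queue) {
    for (;;) {
        Frame frame(frame_bytes);
        input.read(frame.data(), static_cast<std::streamsize>(frame_bytes));
        if (static_cast<std::size_t>(input.gcount()) != frame_bytes) break;
        queue.push(std::move(frame));
    }
    queue.close();
}

// Stand-in for the encoder loop: consumes prefetched frames and
// returns how many were processed.
std::size_t encode_all(std::istream& input, std::size_t frame_bytes,
                       std::size_t prefetch_depth) {
    assert(frame_bytes > 0);
    FrameQueue queue(prefetch_depth);
    std::thread reader(read_frames, std::ref(input), frame_bytes, std::ref(queue));
    std::size_t encoded = 0;
    Frame frame;
    while (queue.pop(frame)) {
        // A real encoder would transfer `frame` to the device and
        // launch its kernels here.
        ++encoded;
    }
    reader.join();
    return encoded;
}

// Self-check on an in-memory stream: 10 bytes hold two full 4-byte
// frames plus 2 leftover bytes, which are dropped.
std::size_t demo() {
    std::istringstream input(std::string(10, 'x'));
    return encode_all(input, 4, 2);
}
```

The bounded queue keeps the reader only a few frames ahead, which limits host memory use while still hiding read latency. In a GPU setting, the same hand-off point is where an asynchronous host-to-device copy on a separate CUDA stream could be issued, as the abstract describes; that part is left out here to keep the sketch runnable without a GPU.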

Tags: CUDA, H.264/AVC, Image processing, nVidia, nVidia GeForce GTX 480, Prefetch, Thesis, Video encoding

October 28, 2012 by hgpu

