Design and Development of an Efficient H. 264 Video Encoder for CPU/GPU using OpenCL
Computer Engineering, M. S. Ramaiah School of Advanced Studies, Bangalore
SASTECH Journal, Volume 11, Issue 2, 2012
@article{laeeq2012design,
title={Design and Development of an Efficient H. 264 Video Encoder for CPU/GPU using OpenCL},
author={Laeeq, S.M. and Gangadhar, ND and Chacko, B.G.},
year={2012}
}
Video codecs have undergone dramatic improvements and increased in complexity over the years owing to various commercial products like mobiles and Tablet PCs. With the emergence of standards, such H.264 which has emerged as the de facto standard for video, uniformity in the delivery of video is observed. With constraints of memory and transmission bandwidth, focus has been on the effective compression and decompression of video. Multicore architectures have increasingly becoming available on mobiles and Tablet PCs. As codecs have increased in complexity and become computationally intensive, it is all the more important to leverage such computation over multicore hardware architectures. OpenCL programming framework for programming multicore hardware architectures such as CPUs, GPUs and DSPs has grown to a high level of maturity. In this study an efficient H.264 video codec is developed using OpenCL for multicore architectures based on the x264 open source H.264 library. The x264 library is profiled using sample videos on a CPU and performance hotspots are identified for optimisation. These hotspots are optimized by means of encapsulation into the OpenCL kernel loops where 4 parallel threads are created by OpenMP. Further, compiler optimization flags and assembly instructions within the x264 library are used to improve memory efficiency and execution speed. Programs to identify and use the queried OpenCL CPU device and analyze the PCI bandwidth between the host and the device are developed. When launched over CPU and GPU platforms, with OpenCL API’s and multi threading, improvements in time of execution and the number of systems calls made are observed. The hotspot of x264_pixel_satd_8*4 resulted in 1.2 seconds gain as compared with earlier non OpenCL based optimization on CPU and 0.4 seconds gain for a GPU. The degradation in performance on a GPU platform is due to the read and write latencies. However, along with the use of compiler optimization flags and invoking assembly instructions in the entire x264 library resulted in a 4.3X improvement on a CPU and a 4.2X on a GPU platform. It can be concluded that, along with multithreading with OpenCL, the traditional approach of compiler level optimization is important as it deals with the core improvement in the application considered.
November 6, 2012 by hgpu