{"id":6421,"date":"2011-11-28T18:06:16","date_gmt":"2011-11-28T16:06:16","guid":{"rendered":"http:\/\/hgpu.org\/?p=6421"},"modified":"2011-11-28T18:06:16","modified_gmt":"2011-11-28T16:06:16","slug":"compute-unified-device-architecture-implementation-of-a-block-matching-algorithm-for-multiple-graphical-processing-unit-cards","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=6421","title":{"rendered":"Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards"},"content":{"rendered":"<p>We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses the summed absolute difference error criterion and a full grid search (FS) to find the optimal block displacement. In this evaluation, we compared the execution times of GPU and CPU implementations for images of various sizes, using integer and noninteger search grids. The results show that using a GPU card can shorten computation time by a factor of 200 for an integer search grid and 1000 for a noninteger search grid. The additional speedup for a noninteger search grid comes from the GPU's built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized non-full-grid-search CPU-based motion estimation methods, namely the implementation of the pyramidal Lucas-Kanade optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264\/AVC standard. 
In these comparisons, the FS GPU implementation still showed a modest improvement even though its computational complexity is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence with a resolution of 720 x 480 pixels, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames per second using two NVIDIA Tesla C1060 GPU cards.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses summed absolute difference error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation, we compared the execution [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[36,89,33,3],"tags":[1787,659,14,125,1786,20,896,199,1006],"class_list":["post-6421","post","type-post","status-publish","format-standard","hentry","category-algorithms","category-nvidia-cuda","category-image-processing","category-paper","tag-algorithms","tag-computational-complexity","tag-cuda","tag-h-264avc","tag-image-processing","tag-nvidia","tag-optical-flow","tag-tesla-c1060","tag-tesla-c2070"],"views":2076,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\
/index.php?rest_route=\/wp\/v2\/posts\/6421","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6421"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/6421\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6421"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6421"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6421"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}