{"id":8119,"date":"2012-08-28T09:42:22","date_gmt":"2012-08-28T06:42:22","guid":{"rendered":"http:\/\/hgpu.org\/?p=8119"},"modified":"2012-08-28T09:42:22","modified_gmt":"2012-08-28T06:42:22","slug":"optimization-techniques-for-cuda-application","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=8119","title":{"rendered":"Optimization Techniques for CUDA Application"},"content":{"rendered":"<p>In this paper, we summarize our experimental results from applying various optimization techniques to CUDA applications running on NVIDIA Fermi GPUs. Our experiments on matrix multiplication and breadth-first search algorithms show that optimization techniques such as coalesced global memory access, conflict-free shared memory access, and data pre-fetching improve the performance of applications running on GPUs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this paper, we summarize our experimental results from applying various optimization techniques to CUDA applications running on NVIDIA Fermi GPUs. 
Our experiments on matrix multiplication and breadth-first search algorithms show that optimization techniques such as coalesced global memory access, conflict-free shared memory access, and data pre-fetching improve the performance of applications running on [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[11,89,3],"tags":[1782,14,324,20,298,67],"class_list":["post-8119","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-nvidia-cuda","category-paper","tag-computer-science","tag-cuda","tag-matrix-multiplication","tag-nvidia","tag-optimization","tag-performance"],"views":2395,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/8119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=8119"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route
=\/wp\/v2\/posts\/8119\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=8119"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=8119"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=8119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}