{"id":893,"date":"2010-10-27T13:07:47","date_gmt":"2010-10-27T13:07:47","guid":{"rendered":"http:\/\/hgpu.org\/?p=893"},"modified":"2010-10-27T13:07:47","modified_gmt":"2010-10-27T13:07:47","slug":"on-the-limits-of-gpu-acceleration-4","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=893","title":{"rendered":"On the limits of GPU acceleration"},"content":{"rendered":"<p>This paper throws a small &#8220;wet blanket&#8221; on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations&#8211;(a) iterative sparse linear solvers; (b) sparse Cholesky factorization; and (c) the fast multipole method&#8211;exhibit complex behavior and vary in computational intensity and memory-reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU can deliver better performance, but we find that, given at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, two modern quad-core CPU sockets can roughly match one or two GPUs in performance.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This paper throws a small &#8220;wet blanket&#8221; on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. 
These computations&#8211;(a) iterative sparse linear solvers; (b) sparse Cholesky factorization; and (c) the fast multipole method&#8211;exhibit complex behavior and vary in computational intensity [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,3],"tags":[1782,20,251,67,199,244],"class_list":["post-893","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-paper","tag-computer-science","tag-nvidia","tag-nvidia-geforce-gtx-285","tag-performance","tag-tesla-c1060","tag-tesla-s1070"],"views":3042,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/893","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=893"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/893\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=893"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=893"},{"taxonomy":"post_tag","embeddable":true
,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=893"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}