{"id":2885,"date":"2011-02-17T16:05:46","date_gmt":"2011-02-17T16:05:46","guid":{"rendered":"http:\/\/hgpu.org\/?p=2885"},"modified":"2011-02-17T16:05:46","modified_gmt":"2011-02-17T16:05:46","slug":"medium-grained-functions-mapping-using-modern-gpus","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=2885","title":{"rendered":"Medium-Grained Functions Mapping using Modern GPUs"},"content":{"rendered":"<p>Map is a higher-order function that applies a given function to a list, or several lists, of elements and produces a list of results. The mapped function is applied to each element independently, so it can be evaluated for all elements in parallel, which makes the GPU an attractive platform for implementing it. Although map exposes a high level of parallelism when applied to a sufficiently large number of elements, implementing it can be difficult because the granularity of the mapped function must be matched to the granularity of the GPU parallel model. In this paper, we show the performance gap between fine-grained (per-thread) and coarse-grained (per-block) implementations of a mapped function and introduce a medium-grained implementation that can fill this gap. We also discuss some memory access implications arising from this method and show, by example, how to use them to estimate the performance of different implementations.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Map is a higher-order function that applies a given function to a list, or several lists, of elements and produces a list of results. The mapped function is applied to each element independently, so it can be evaluated for all elements in parallel, which makes the GPU an attractive platform for implementing it. 
Although [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[11,89,3],"tags":[1782,14,273,20,234,67],"class_list":["post-2885","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-nvidia-cuda","category-paper","tag-computer-science","tag-cuda","tag-memory-model","tag-nvidia","tag-nvidia-geforce-gtx-280","tag-performance"],"views":2051,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/2885","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2885"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/2885\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2885"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2
%2Fcategories&post=2885"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2885"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}