{"id":12097,"date":"2014-05-20T00:18:12","date_gmt":"2014-05-19T21:18:12","guid":{"rendered":"http:\/\/hgpu.org\/?p=12097"},"modified":"2014-05-20T00:18:12","modified_gmt":"2014-05-19T21:18:12","slug":"targetdp-an-abstraction-of-lattice-based-parallelism-with-portable-performance","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=12097","title":{"rendered":"targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance"},"content":{"rendered":"<p>To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer&#8217;s perspective, it is also important that code can be maintained in a portable manner across a range of hardware. Here we present targetDP, a lightweight programming layer that allows the abstraction of data parallelism for applications that employ structured grids. A single source code may be used to target both thread level parallelism (TLP) and instruction level parallelism (ILP) on either SIMD multi-core CPUs or GPU-accelerated platforms. targetDP is implemented via standard C preprocessor macros and library functions, can be added to existing applications incrementally, and can be combined with higher-level paradigms such as MPI. We present CPU and GPU performance results for a benchmark taken from the lattice Boltzmann application that motivated this work. These demonstrate not only performance portability, but also the improved optimisation resulting from the intelligent exposure of ILP.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer&#8217;s perspective, it is also important that code can be maintained in a portable manner across a range of hardware. Here we present targetDP, a lightweight programming layer that allows the abstraction [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[89,104,3],"tags":[14,263,1795,1483,108,20,1543],"class_list":["post-12097","post","type-post","status-publish","format-standard","hentry","category-nvidia-cuda","category-fluid-dynamics","category-paper","tag-cuda","tag-data-parallelism","tag-fluid-dynamics","tag-intel-xeon-phi","tag-lattice-boltzmann-model","tag-nvidia","tag-tesla-k40"],"views":1797,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/12097","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=12097"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/12097\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=12097"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=12097"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=12097"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}