{"id":27021,"date":"2022-07-17T13:27:30","date_gmt":"2022-07-17T10:27:30","guid":{"rendered":"https:\/\/hgpu.org\/?p=27021"},"modified":"2022-07-17T13:27:30","modified_gmt":"2022-07-17T10:27:30","slug":"reducing-synchronous-gpu-memory-transfers-design-and-implementation-of-a-futhark-compiler-optimisation","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=27021","title":{"rendered":"Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation"},"content":{"rendered":"<p>We present a series of dataflow-dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. We provide a specialised algorithm to solve these minimisation problems, based on the Ford-Fulkerson max-flow algorithm, and detail techniques to model conditional execution and loops in a pure functional programming language. We present our work in the context of the array programming language Futhark, in whose compiler we have implemented our techniques. Empirical evaluation of 27 benchmark programs on four GPUs shows mean speedups of 117\u2013158%, heavily skewed by significant improvements to a few programs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We present a series of dataflow-dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. 
We provide a specialised algorithm to solve these minimisation problems, based [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,89,90,3],"tags":[2087,7,1782,14,20,2066,1793,67,390],"class_list":["post-27021","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-nvidia-cuda","category-opencl","category-paper","tag-amd-radeon-instinct-mi100","tag-ati","tag-computer-science","tag-cuda","tag-nvidia","tag-nvidia-a100","tag-opencl","tag-performance","tag-thesis"],"views":1343,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/27021","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=27021"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/27021\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=27021"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=27021"},{"taxonomy":"post_tag","embeddable":true,"href":"https:
\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=27021"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}