{"id":29823,"date":"2025-03-23T19:48:55","date_gmt":"2025-03-23T17:48:55","guid":{"rendered":"https:\/\/hgpu.org\/?p=29823"},"modified":"2025-03-23T19:48:55","modified_gmt":"2025-03-23T17:48:55","slug":"concurrent-scheduling-of-high-level-parallel-programs-on-multi-gpu-systems","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=29823","title":{"rendered":"Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems"},"content":{"rendered":"<p>Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how graph-based intermediate representations help moving such scheduling work out of the critical path. In the context of SYCL programs distributed onto accelerator clusters, we introduce the instruction graph, a low-level representation that preserves full concurrency between memory management, data transfers, MPI peer-to-peer communication and kernel invocation. Through integration within the Celerity runtime, we demonstrate how instruction-graph scheduling enables a system architecture that performs this analysis concurrently with execution. Using a scheduler lookahead mechanism, we further detect changing access patterns to optimize memory allocation in the presence of virtualized buffers. We show the effectiveness of our method through strong-scaling benchmarks with multiple Celerity applications on up to 128 GPUs in a production cluster.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,3],"tags":[451,1782,1682,20,2066,176,1586,1845,854],"class_list":["post-29823","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-paper","tag-benchmarking","tag-computer-science","tag-hpc","tag-nvidia","tag-nvidia-a100","tag-package","tag-performance-portability","tag-sycl","tag-task-scheduling"],"views":1468,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/29823","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=29823"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/29823\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=29823"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=29823"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=29823"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}