{"id":29810,"date":"2025-03-10T19:03:18","date_gmt":"2025-03-10T17:03:18","guid":{"rendered":"https:\/\/hgpu.org\/?p=29810"},"modified":"2025-03-10T19:03:18","modified_gmt":"2025-03-10T17:03:18","slug":"mpache-interaction-aware-multi-level-cache-bypassing-on-gpus","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=29810","title":{"rendered":"Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs"},"content":{"rendered":"<p>Graphics Processing Units (GPUs) are essential for general-purpose applications and are commonly leveraging multi-level caches to alleviate memory access pressure. However, the default cache management may lose opportunities for optimal performance in different applications. Although existing cache bypassing techniques tend to address this challenge, these methods predominantly concentrate on single-level cache, thus restricting their potential for further enhancements. To mitigate this issue, we propose Mpache, a novel software-based mechanism designed to bypass multi-level caches based on the characterization of load instructions. Mpache constructs an interaction graph and analyzes the cooperation and contention among instructions. Then, the profiling data of bypassing effectiveness guides Mpache to select the appropriate cache levels to bypass for each instruction. Finally, the design is integrated into the compiler to enable automatic bypassing for existing workloads. Evaluations on off-the-shelf GPUs show that Mpache achieves an average 1.15\u00d7 speedup over the default cache policy, and effectively outperforms prior arts.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Graphics Processing Units (GPUs) are essential for general-purpose applications and are commonly leveraging multi-level caches to alleviate memory access pressure. However, the default cache management may lose opportunities for optimal performance in different applications. 
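The paper itself includes no code, but the low-level mechanism a compiler pass like Mpache's can target is documented: NVIDIA PTX cache operators that select, per load instruction, which cache levels a global load uses (`.ca` caches at all levels and is the default; `.cg` bypasses L1 and caches only at L2). The sketch below illustrates this per-instruction control in CUDA via inline PTX; the helper names and the choice of which operand to bypass are illustrative assumptions, not Mpache's actual analysis, which derives these decisions automatically from its interaction graph and profiling data.

```cuda
#include <cuda_runtime.h>

// Load that caches at all levels (PTX ".ca", the default policy).
__device__ __forceinline__ float load_ca(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Load that bypasses L1 and caches only at L2 (PTX ".cg").
__device__ __forceinline__ float load_cg(const float* p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Hypothetical kernel: x is assumed to be a streaming operand with no reuse,
// so bypassing L1 for it avoids evicting lines that loads of y can still hit.
__global__ void saxpy_bypass(int n, float a,
                             const float* x, const float* y, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xv = load_cg(x + i);  // contending load: skip L1
        float yv = load_ca(y + i);  // cooperating load: keep in L1
        out[i] = a * xv + yv;
    }
}
```

In this framing, Mpache's contribution is automating the choice made by hand above: its interaction graph distinguishes cooperating from contending load instructions, and profiling of bypass effectiveness selects the cache level(s) to skip for each one before the compiler emits the corresponding cache operators.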