{"id":19134,"date":"2019-09-22T19:57:22","date_gmt":"2019-09-22T16:57:22","guid":{"rendered":"https:\/\/hgpu.org\/?p=19134"},"modified":"2019-09-22T19:57:22","modified_gmt":"2019-09-22T16:57:22","slug":"alpyna-acceleration-of-loops-in-python-for-novel-architectures","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=19134","title":{"rendered":"ALPyNA: Acceleration of Loops in Python for Novel Architectures"},"content":{"rendered":"<p>We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this is combined with type and bounds information discovered at runtime, to auto-generate high-performance kernels for offload to GPU. We demonstrate speedups of up to 1000x relative to the native CPython interpreter across four array-intensive numerical Python benchmarks. Performance improvement is related to both iteration domain size and dependence graph complexity. Nevertheless, this approach promises to bring the benefits of manycore parallelism to application developers.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[11,89,3],"tags":[215,1782,14,20,1956,513],"class_list":["post-19134","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-nvidia-cuda","category-paper","tag-code-generation","tag-computer-science","tag-cuda","tag-nvidia","tag-nvidia-geforce-gtx-1060","tag-python"],"views":2097,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/19134","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=19134"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/19134\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=19134"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=19134"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=19134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}