{"id":16447,"date":"2016-08-23T02:29:13","date_gmt":"2016-08-22T23:29:13","guid":{"rendered":"http:\/\/hgpu.org\/?p=16447"},"modified":"2016-08-23T02:29:13","modified_gmt":"2016-08-22T23:29:13","slug":"magma-batched-a-batched-blas-approach-for-small-matrix-factorizations-and-applications-on-gpus","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=16447","title":{"rendered":"MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs"},"content":{"rendered":"<p>A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subprograms) routines, Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how to optimize batched GEMV and GEMM to help batched advanced factorizations (e.g. bi-diagonalization) and other BLAS routines (e.g. forward\/back substitution) achieve optimal performance on GPUs. Our solutions achieved speedups of up to 2.8-3x over CUBLAS and MKL solutions, wherever possible. We applied our batched methodology to a real-world hydrodynamics application by reformulating the tensor operations into batched BLAS GEMV and GEMM operations. A 2.5x speedup and a 1.4x greenup are obtained by changing 10% of the code. We accelerated and scaled it on the Titan supercomputer to 4096 nodes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subprograms) routines, Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how to optimize batched GEMV and GEMM to help batched advanced factorizations (e.g. 
bi-diagonalization) [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,89,104,3],"tags":[430,1782,14,288,1795,37,20,1543],"class_list":["post-16447","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-nvidia-cuda","category-fluid-dynamics","category-paper","tag-blas","tag-computer-science","tag-cuda","tag-factorization","tag-fluid-dynamics","tag-linear-algebra","tag-nvidia","tag-tesla-k40"],"views":3031,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/16447","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=16447"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/16447\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=16447"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=16447"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=16447"}],"
curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}