{"id":3694,"date":"2011-04-25T11:42:51","date_gmt":"2011-04-25T11:42:51","guid":{"rendered":"http:\/\/hgpu.org\/?p=3694"},"modified":"2011-04-25T11:42:51","modified_gmt":"2011-04-25T11:42:51","slug":"optimized-hpl-for-amd-gpu-and-multi-core-cpu-usage","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=3694","title":{"rendered":"Optimized HPL for AMD GPU and multi-core CPU usage"},"content":{"rendered":"<p>The installation of the LOEWE-CSC (http:\/\/csc.uni-frankfurt.de\/csc\/?51) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and thus maximize the GPU load. A\u00a0work-stealing scheduler was implemented to add the remaining CPU resources to the DGEMM. On the GPU, the DGEMM achieves 497\u00a0GFlop\/s (90.9% of the theoretical peak). Combined with the 24-core Magny-Cours CPUs, 623\u00a0GFlop\/s (83.6% of the peak) are achieved. The HPL (http:\/\/www.netlib.org\/benchmark\/hpl\/algorithm.html) benchmark was modified to perform well with one MPI process per node. The modifications include multi-threading, vectorization, use of the GPU DGEMM, cache optimizations, and a new lookahead algorithm. A\u00a0Linpack performance of 70% of the theoretical peak is achieved, and this performance scales linearly to hundreds of nodes.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The installation of the LOEWE-CSC (http:\/\/csc.uni-frankfurt.de\/csc\/?51) supercomputer at the Goethe University in Frankfurt led to the development of a Linpack implementation that can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. 
The DGEMM library is tuned to hide all DMA transfer times and [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[11,3],"tags":[7,455,1782,106,452,37,242,67],"class_list":["post-3694","post","type-post","status-publish","format-standard","hentry","category-computer-science","category-paper","tag-ati","tag-ati-radeon-hd-5870","tag-computer-science","tag-gpu-cluster","tag-heterogeneous-systems","tag-linear-algebra","tag-mpi","tag-performance"],"views":2347,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/3694","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3694"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/3694\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3694"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3694"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3694"}],"curies":
[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}