{"id":28922,"date":"2023-12-31T18:08:49","date_gmt":"2023-12-31T16:08:49","guid":{"rendered":"https:\/\/hgpu.org\/?p=28922"},"modified":"2023-12-31T18:08:49","modified_gmt":"2023-12-31T16:08:49","slug":"optimization-of-ported-cfd-kernels-on-intel-data-center-gpu-max-1550-using-oneapi-esimd","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=28922","title":{"rendered":"Optimization of Ported CFD Kernels on Intel Data Center GPU Max 1550 using oneAPI ESIMD"},"content":{"rendered":"<p>We describe our experience porting FUN3D\u2019s CUDA-optimized kernels to Intel oneAPI SYCL. We faced several challenges, including foremost the suboptimal performance of the oneAPI code on Intel\u2019s new data center GPU. Suboptimal performance of the oneAPI code was due primarily to high register spills, memory latency, and poor vectorization. We addressed these issues by implementing the kernels using Intel oneAPI\u2019s Explicit SIMD SYCL extension (ESIMD) API. The ESIMD API enables the writing of explicitly vectorized kernel code, gives more precise control over register usage and prefetching, and better handles thread divergence compared to SYCL. The ESIMD code outperforms the optimized SYCL code by up to a factor of 3.6, depending on the kernel. We also compared the performance of three ESIMD kernels on the Intel Data Center Max 1550 GPU with the CUDA-optimized versions on NVIDIA V100 and A100 GPUs. We found the performance of a single tile of the Intel GPU using ESIMD greater than NVIDIA V100 and similar to NVIDIA A100.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We describe our experience porting FUN3D\u2019s CUDA-optimized kernels to Intel oneAPI SYCL. We faced several challenges, including foremost the suboptimal performance of the oneAPI code on Intel\u2019s new data center GPU. Suboptimal performance of the oneAPI code was due primarily to high register spills, memory latency, and poor vectorization. We addressed these issues by implementing [&hellip;]<\/p>\n","protected":false},"author":351,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[89,104,3],"tags":[1600,14,1795,905,2133,20,2066,2118,67,1845],"class_list":["post-28922","post","type-post","status-publish","format-standard","hentry","category-nvidia-cuda","category-fluid-dynamics","category-paper","tag-cfd","tag-cuda","tag-fluid-dynamics","tag-intel","tag-intel-data-center-gpu-max-1550","tag-nvidia","tag-nvidia-a100","tag-oneapi","tag-performance","tag-sycl"],"views":1562,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/28922","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/351"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=28922"}],"version-history":[{"count":0,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/28922\/revisions"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=28922"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=28922"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=28922"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}