{"id":20755,"date":"2020-05-02T18:40:43","date_gmt":"2020-05-02T15:40:43","guid":{"rendered":"https:\/\/hgpu.org\/?p=20755"},"modified":"2020-05-02T18:40:45","modified_gmt":"2020-05-02T15:40:45","slug":"cuda-kat-the-cuda-kernel-authors-toolkit","status":"publish","type":"post","link":"https:\/\/hgpu.org\/?p=20755","title":{"rendered":"cuda-kat: The CUDA Kernel Author&#8217;s Toolkit"},"content":{"rendered":"<p>An install-less, header-only library which is a loosely-coupled collection of utility functions and classes for writing device-side CUDA code (kernels and non-kernel functions). These let us:<\/p>\n<p>* Write templated device-side code without constantly running up against not-trivially-templatable bits.<br \/>\n* Use standard-library(-like) containers in device-side code (but not have to use them).<br \/>\n* Not repeat ourselves as much (the DRY principle).<br \/>\n* Use fewer magic numbers.<br \/>\n* Make our device-side code less cryptic and idiosyncratic, with clearer naming and semantics.<\/p>\n<p>&#8230; while not committing to any particular framework, paradigm or class hierarchy &#8211; and not compromising performance.<\/p>\n<p>Library facilities include:<\/p>\n<p>Templated versions of math functions | GPU-enabled versions of std::array, std::span and std::tuple | Wrapper functions for non-exposed PTX instructions | Templated versions of PTX intrinsics | Warp-, block- and grid-level sequence operations | Warp-, block- and grid-level atomic mechanisms | Effective access to shared memory | On-device stringstreams and ostream-like classes | etc.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An install-less, header-only library which is a loosely-coupled collection of utility functions and classes for writing device-side CUDA code (kernels and non-kernel functions). These let us: * Write templated device-side code without constantly running up against not-trivially-templatable bits. 
* Use standard-library(-like) containers in device-side code (but not have to use them). * Not repeat ourselves as [&hellip;]<\/p>\n","protected":false},"author":769,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[89,3],"tags":[2053,14,2054,20,176],"class_list":["post-20755","post","type-post","status-publish","format-standard","hentry","category-nvidia-cuda","category-paper","tag-cpp","tag-cuda","tag-library","tag-nvidia","tag-package"],"views":2036,"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/20755","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/users\/769"}],"replies":[{"embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=20755"}],"version-history":[{"count":1,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/20755\/revisions"}],"predecessor-version":[{"id":20827,"href":"https:\/\/hgpu.org\/index.php?rest_route=\/wp\/v2\/posts\/20755\/revisions\/20827"}],"wp:attachment":[{"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=20755"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=20755"},{"taxonomy":"post_tag","embeddable":true,"href"
:"https:\/\/hgpu.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=20755"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}