cuda-kat: The CUDA Kernel Author’s Toolkit
The toolkit is meant to let us:
* Write templated device-side code without constantly coming up against not-trivially-templatable bits.
* Use standard-library(-like) containers in device-side code (without being required to use them).
* Repeat ourselves less (the DRY principle).
* Use fewer magic numbers.
* Make our device-side code less cryptic and idiosyncratic, with clearer naming and semantics.
… while not committing to any particular framework, paradigm or class hierarchy – and not compromising performance.
Library facilities include:
* Templated versions of math functions
* GPU-enabled versions of std::array, std::span and std::tuple
* Wrapper functions for non-exposed PTX instructions
* Templated versions of PTX intrinsics
* Warp-, block- and grid-level sequence operations
* Warp-, block- and grid-level atomic mechanisms
* Effective access to shared memory
* On-device string streams and ostream-like classes
* … and more.
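To make two of the goals above more concrete (templatable math functions and fewer magic numbers), here is a minimal sketch of the underlying idea. It is not cuda-kat's actual API: the namespace `sketch`, the macro `SKETCH_HD` and the functions `square_root`, `euclidean_norm` and `lane_of` are all hypothetical names chosen for illustration. The sketch assumes a warp size of 32, which holds on current NVIDIA GPUs.

```cpp
#include <math.h>

// Expands to CUDA's execution-space qualifiers under nvcc and to nothing
// elsewhere, so this sketch also compiles as plain C++.
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

namespace sketch {  // hypothetical namespace, NOT cuda-kat's own

// One overloaded name in place of the C-style pair sqrtf / sqrt, so that
// type-generic code does not need to know which one to call.
SKETCH_HD inline float  square_root(float x)  { return sqrtf(x); }
SKETCH_HD inline double square_root(double x) { return sqrt(x); }

// A named constant instead of a bare 32 scattered through kernels.
// Assumption: warps have 32 threads.
constexpr unsigned warp_size = 32;

// Type-generic device-side code can now be written once for float and double.
template <typename T>
SKETCH_HD T euclidean_norm(T x, T y)
{
    return square_root(x * x + y * y);
}

// Index of a thread within its warp, given its index within the block.
SKETCH_HD constexpr unsigned lane_of(unsigned thread_index)
{
    return thread_index % warp_size;
}

} // namespace sketch
```

With these in place, a templated kernel could call `sketch::euclidean_norm(x, y)` for either floating-point type and `sketch::lane_of(threadIdx.x)` instead of writing `threadIdx.x % 32` by hand. The library's real facilities cover far more ground than this sketch; consult its headers for the actual names and signatures.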