high performance computing on graphics processing units: hgpu.org


cuda-kat: The CUDA Kernel Author’s Toolkit

Eyal Rozenberg


An install-less, header-only library: a loosely coupled collection of utility functions and classes for writing device-side CUDA code (kernels and non-kernel functions). These let us:

* Write templated device-side code without constantly running into bits that are not trivially templatable.
* Use standard-library(-like) containers in device-side code (without being required to use them).
* Not repeat ourselves as much (the DRY principle).
* Use fewer magic numbers.
* Make our device-side code less cryptic and idiosyncratic, with clearer naming and semantics.

… while not committing to any particular framework, paradigm or class hierarchy – and not compromising performance.
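To make the "fewer magic numbers" and "clearer naming" goals concrete, here is a minimal sketch of the idea. It is not cuda-kat's actual API: the `sketch` namespace, the `SKETCH_HD` macro and the function names are all hypothetical, and a one-dimensional thread block is assumed.

```cpp
// Lets the same file build with nvcc or with a plain C++ compiler.
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

namespace sketch {  // hypothetical namespace; not cuda-kat's API

// The warp size is 32 on all NVIDIA GPUs to date. Naming it once
// replaces literals such as 32, 31 or 0x1f scattered through kernels.
constexpr unsigned warp_size = 32;

// Position of a thread within its warp.
SKETCH_HD constexpr unsigned lane_index(unsigned thread_index)
{
    return thread_index % warp_size;
}

// Index, within the block, of the warp a thread belongs to.
SKETCH_HD constexpr unsigned warp_index(unsigned thread_index)
{
    return thread_index / warp_size;
}

// Number of warps needed to cover num_threads threads (rounding up).
SKETCH_HD constexpr unsigned warps_needed(unsigned num_threads)
{
    return (num_threads + warp_size - 1) / warp_size;
}

// Compile-time checks of the helpers above.
static_assert(lane_index(37) == 5, "thread 37 is lane 5 of its warp");
static_assert(warp_index(37) == 1, "thread 37 is in the second warp");
static_assert(warps_needed(33) == 2, "33 threads span two warps");

}  // namespace sketch
```

In a kernel, one would then write `sketch::lane_index(threadIdx.x)` rather than `threadIdx.x & 0x1f`, which states the intent and keeps the constant in one place.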

Library facilities include, among others:

* Templated versions of math functions
* GPU-enabled versions of std::array, std::span and std::tuple
* Wrapper functions for non-exposed PTX instructions
* Templated versions of PTX intrinsics
* Warp-, block- and grid-level sequence operations
* Warp-, block- and grid-level atomic mechanisms
* Effective access to shared memory
* On-device string streams and ostream-like classes
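As an illustration of what a templated version of an intrinsic can look like, consider population count: CUDA exposes it as two separately named functions, `__popc` for 32-bit and `__popcll` for 64-bit operands, which is awkward in generic code. The sketch below wraps both behind one template. It is an assumption-laden illustration, not cuda-kat's implementation: the `sketch` namespace, the `SKETCH_HD` macro and the name `population_count` are hypothetical, and a portable loop stands in for the intrinsics when compiling for the host.

```cpp
#include <cassert>
#include <type_traits>

// Lets the same file build with nvcc or with a plain C++ compiler.
#ifdef __CUDACC__
#define SKETCH_HD __host__ __device__
#else
#define SKETCH_HD
#endif

namespace sketch {  // hypothetical namespace; not cuda-kat's API

// Number of set bits in x, for any unsigned integer type up to 64 bits.
template <typename T>
SKETCH_HD inline int population_count(T x)
{
    static_assert(std::is_unsigned<T>::value, "unsigned types only");
    static_assert(sizeof(T) <= sizeof(unsigned long long), "at most 64 bits");
#ifdef __CUDA_ARCH__
    // Device code: pick the intrinsic matching the operand width.
    return sizeof(T) <= sizeof(unsigned)
        ? __popc(static_cast<unsigned>(x))
        : __popcll(static_cast<unsigned long long>(x));
#else
    // Host fallback: clear the lowest set bit until none remain.
    unsigned long long v = x;
    int count = 0;
    for (; v != 0; v &= v - 1) { ++count; }
    return count;
#endif
}

// Host-side usage example.
inline void host_demo()
{
    assert(population_count(0xF0u) == 4);   // 32-bit operand
    assert(population_count(~0ull) == 64);  // 64-bit operand
}

}  // namespace sketch
```

Generic device code can then call `sketch::population_count(x)` without knowing the width of `x`; the width dispatch is resolved at compile time because `sizeof(T)` is a constant.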

Tags: cpp, CUDA, library, nVidia, Package

May 2, 2020 by epk
