high performance computing on graphics processing units: hgpu.org

cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

Antti-Pekka Hynninen, Dmitry I. Lyakh

NVIDIA Corporation, Santa Clara, CA 95050

arXiv:1705.01598 [cs.MS], (3 May 2017)

Package: CUDA Tensor Transpose (cuTT) library

We introduce the CUDA Tensor Transpose (cuTT) library that implements high-performance tensor transposes for NVIDIA GPUs with Kepler and newer architectures. cuTT achieves high performance by (a) utilizing two GPU-optimized transpose algorithms that both use a shared memory buffer in order to reduce global memory access scatter, and by (b) computing memory positions of tensor elements using a thread-parallel algorithm. We evaluate the performance of cuTT on a variety of benchmarks with tensor ranks ranging from 2 to 12 and show that cuTT performance is independent of the tensor rank and that it performs no worse than an approach based on code generation. We develop a heuristic scheme for choosing the optimal parameters for tensor transpose algorithms by implementing an analytical GPU performance model that can be used at runtime without the need for performance measurements or profiling. Finally, by integrating cuTT into the tensor algebra library TAL-SH, we significantly reduce the tensor transpose overhead in tensor contractions, achieving overhead as low as one percent for arithmetically intensive tensor contractions.
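To make the "memory positions of tensor elements" in the abstract concrete, the following is a minimal serial reference sketch in Python of what a tensor transpose computes. It is not cuTT's code or API: the function names `strides_first_fastest` and `transpose_reference` are hypothetical, the first-index-fastest storage layout is an assumption made for illustration, and the sketch uses neither shared memory nor thread parallelism. It only shows how a permutation maps each element's source offset to its destination offset.

```python
from itertools import product


def strides_first_fastest(dims):
    """Strides of a dense tensor whose first index varies fastest (assumed layout)."""
    strides, step = [], 1
    for d in dims:
        strides.append(step)
        step *= d
    return strides


def transpose_reference(src, dims, perm):
    """Permute a flat tensor so that output dimension k is input dimension perm[k].

    Hypothetical helper for illustration only; a GPU library would compute
    these offsets in parallel rather than in a serial loop.
    """
    out_dims = [dims[p] for p in perm]
    in_strides = strides_first_fastest(dims)
    out_strides = strides_first_fastest(out_dims)
    dst = [None] * len(src)
    for idx in product(*(range(d) for d in dims)):
        # Linear offset of this element in the source layout.
        src_pos = sum(i * s for i, s in zip(idx, in_strides))
        # Linear offset of the same element in the permuted layout.
        dst_pos = sum(idx[p] * s for p, s in zip(perm, out_strides))
        dst[dst_pos] = src[src_pos]
    return dst, out_dims


# Usage: a rank-2 transpose of a 2 x 3 tensor stored as a flat list.
flat = list(range(6))
transposed, new_dims = transpose_reference(flat, [2, 3], [1, 0])
```

In this sketch the source reads are contiguous while the destination writes are scattered across memory. That scatter is the access pattern the abstract says the shared memory buffer is meant to reduce.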

Tags: Algorithms, Benchmarking, Code generation, Computer science, CUDA, nVidia, Package, Tesla K20, Tesla K40, Tesla M40, Tesla P100

May 6, 2017 by hgpu
