
A Non-linear GPU Thread Map for Triangular Domains

Cristobal A. Navarro, Benjamin Bustos, Nancy Hitschfeld
Instituto de Informatica, Universidad Austral de Chile
arXiv:1609.01490 [cs.DC], (6 Sep 2016)

@article{navarro2016nonlinear,
   title={A Non-linear GPU Thread Map for Triangular Domains},
   author={Navarro, Cristobal A. and Bustos, Benjamin and Hitschfeld, Nancy},
   year={2016},
   month={sep},
   eprint={1609.01490},
   archivePrefix={arXiv},
   primaryClass={cs.DC}
}


There is a stage in the GPU computing pipeline where a grid of thread-blocks, in parallel space, is mapped onto the problem domain, in data space. Since the parallel space is restricted to a box-type geometry, the mapping approach is typically a k-dimensional bounding box (BB) that covers a p-dimensional data space. Threads that fall inside the domain perform computations, while threads that fall outside are discarded at runtime. In this work we study the case of mapping threads efficiently onto triangular-domain problems and propose a non-linear block-space map $\lambda(\omega)$, based on the properties of the lower-triangular matrix, that reduces the number of unnecessary threads from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$. Performance results for global memory accesses show an improvement of up to 18% with respect to the bounding-box approach, placing $\lambda(\omega)$ in second place, below the rectangular-box approach and above the recursive-partition and upper-triangular approaches. In shared memory scenarios, $\lambda(\omega)$ was the fastest approach, achieving a 7% performance improvement while preserving thread locality. These results make $\lambda(\omega)$ an interesting map for efficient GPU computing on parallel problems that define a triangular domain, with or without neighborhood interactions. The extension to tetrahedral domains is also analyzed, with applications to triplet-interaction n-body problems.
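The core idea behind such a triangular map can be sketched as follows: the n(n+1)/2 useful cells of a lower-triangular domain are enumerated by a single linear index ω, and the row/column coordinate (i, j) is recovered by inverting the triangular-number formula with a square root (hence a non-linear map). This is a minimal host-side Python illustration, not the paper's exact λ(ω), which operates at block granularity and addresses floating-point precision issues on the GPU; the function name `lambda_map` is ours.

```python
import math

def lambda_map(omega):
    # Recover (i, j) in the lower-triangular domain from linear index omega:
    # i is the largest integer with i*(i+1)/2 <= omega (inverse triangular number),
    # and j is the offset within row i.
    i = int((math.sqrt(8.0 * omega + 1.0) - 1.0) / 2.0)
    j = omega - i * (i + 1) // 2
    return i, j

n = 5
# Enumerating all n(n+1)/2 indices covers exactly the cells (i, j) with j <= i < n,
# so no threads need to be discarded, unlike a bounding-box map over the full n x n grid.
coords = [lambda_map(w) for w in range(n * (n + 1) // 2)]
```

In a CUDA kernel the same computation would be applied to a flattened block index to obtain the block's coordinate in the triangular grid, avoiding the O(n^2) discarded blocks of the bounding-box approach.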

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
