high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Efficient Hash Tables on the GPU

Efficient Hash Tables on the GPU

Dan Anthony Feliciano Alcantara

University of California, Davis

University of California, Davis, 134 pages, 2011

@phdthesis{alcantara2012efficient,

title={Efficient Hash Tables on the GPU},

author={Alcantara, D.A.F.},

year={2012},

school={UNIVERSITY OF CALIFORNIA, DAVIS}

}

Download (PDF)

View

Source

3866

views

Advances in GPU architecture have made efficient implementations of hash tables possible, allowing fast parallel constructions and retrievals despite the uncoalesced memory accesses naturally incurred by hashing algorithms. The key is to mitigate the penalty of these accesses by minimizing the number that occur and utilizing the cache (when one is available). Most work done on parallel hashing is ill-equipped for this objective and relies on the theoretical PRAM model, which abstracts away the difficulties of programming on actual hardware. We examine hashing schemes from a practical perspective using NVIDIA’s CUDA architecture. Our main contribution is a set of parallel implementations for open addressing, chaining, and cuckoo hashing. We analyze each method and identify when applications should use one over another. Because each makes different performance trade-offs, we compare them using three metrics: memory usage, construction time, and retrieval efficiency. Retrieval efficiency considers both the average time and deviation from it, since answering some queries can be several orders of magnitude more difficult than others. Our quadratic probing implementation shows this as the hash table becomes more compact: on a GTX 470, using datasets containing 10M random key-value pairs, it has respective rates of [369M, 723M, 539M] pairs per second (pps) for insertion, retrieving every input item, and retrieving 10M keys absent from the table when using 2N space. For 1.05N space, these rates significantly drop to [162M, 208M, 46M] pps, reflecting the difficulty of terminating both insertions and queries. Applications requiring more robust retrieval could benefit from our chaining implementation, which eschews linked lists and uses radix sort for an efficient parallel construction. When using 2N space, the rates are [344M, 436M, 624M] pps, while for 1.05N the rates are [449M, 211M, 126M] pps. For compact tables, its construction rate is almost 3x faster than quadratic probing with a smaller drop in retrieval efficiency for failed queries. However, the number of probes required to answer queries grows drastically for compact tables, leading to poorer retrieval rates. Cuckoo hashing is better suited for these cases, trading a more complicated construction for guaranteed constant time retrievals. It has rates of [366M, 670M, 501M] pps using 2 N space and [133M, 386M, 258M] pps for 1.05N. Our method is also adaptable and can be specialized for situations where multiple values are stored per key.

Tags: Algorithms, Computer science, CUDA, Hashing, nVidia, nVidia GeForce GTX 280, nVidia GeForce GTX 470, Thesis

April 14, 2012 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Efficient Hash Tables on the GPU

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)

Efficient Hash Tables on the GPU

Share this:

Recent source codes

Most viewed papers (last 30 days)