Parallelizing General Histogram Application for CUDA Architectures

Ugljesa Milic, Isaac Gelado, Nikola Puzovic, Alex Ramirez, Milo Tomasevic

Barcelona Supercomputing Center, Centro Nacional de Supercomputacion, Barcelona, Spain

IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, 2013

Histogramming is a tool commonly used in data analysis. Although its serial version is simple to implement, parallelizing it efficiently and scalably can be challenging. This holds especially for platforms that contain one or several massively parallel devices, such as CUDA-capable GPUs, because of issues with domain decomposition, use of global memory, and similar concerns. In this paper we compare two approaches to implementing general-purpose histogramming on GPUs. The first algorithm is based on private copies of the bin counters, stored in shared memory for each block of threads. The second uses the Thrust library to sort the input elements and then to search for upper bounds according to the bin widths. For both algorithms we analyze how the speedup over the sequential version depends on the size of the input collection, the number of bins, and the type and distribution of the input elements. We also overlap data transfers between the host CPU and the CUDA device with kernel execution. For both algorithms we analyze the pros and cons in detail. For example, the privatization strategy can be up to 2x faster than sort-search with realistic inputs, but it can support only a limited number of bins. On the other hand, the sort-search strategy achieves about 50% higher speedup than privatization when the input consists of characters, and it can support an unlimited number of bins. Finally, we perform an exploration to determine the optimal algorithm depending on the characteristics and values of the input parameters.
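The two strategies named in the abstract can be illustrated with a minimal CPU-side sketch in C++. This is not the authors' GPU code: the function names, the integer input type, the equal-width bins, and the `block_size` parameter are all assumptions made for illustration. The first function emulates privatization by giving each chunk of the input its own bin counters (standing in for per-block shared memory) and merging them afterwards; the second emulates sort-search with `std::sort` and `std::upper_bound`, the standard-library counterparts of the Thrust calls the abstract mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption for both functions: every input value lies in
// [0, num_bins * bin_width), and bin b covers [b*bin_width, (b+1)*bin_width).

// Privatization (CPU emulation): each chunk of `block_size` elements updates a
// private histogram, which is then merged into the global one. On a GPU the
// private copy would live in shared memory and both updates would be atomic.
std::vector<unsigned> histogram_privatized(const std::vector<int>& data,
                                           int num_bins, int bin_width,
                                           std::size_t block_size) {
    assert(num_bins > 0 && bin_width > 0 && block_size > 0);
    const std::size_t bins = static_cast<std::size_t>(num_bins);
    std::vector<unsigned> global(bins, 0u);
    for (std::size_t start = 0; start < data.size(); start += block_size) {
        const std::size_t end = std::min(start + block_size, data.size());
        std::vector<unsigned> priv(bins, 0u);  // stand-in for shared memory
        for (std::size_t i = start; i < end; ++i) {
            assert(data[i] >= 0 && data[i] < num_bins * bin_width);
            ++priv[static_cast<std::size_t>(data[i] / bin_width)];
        }
        for (std::size_t b = 0; b < bins; ++b) {
            global[b] += priv[b];  // merge step
        }
    }
    return global;
}

// Sort-search (CPU emulation): sort the input, find for each bin the position
// just past its largest value, and take differences of consecutive positions.
std::vector<unsigned> histogram_sort_search(std::vector<int> data,
                                            int num_bins, int bin_width) {
    assert(num_bins > 0 && bin_width > 0);
    std::sort(data.begin(), data.end());
    std::vector<unsigned> hist(static_cast<std::size_t>(num_bins), 0u);
    std::size_t prev = 0;
    for (int b = 0; b < num_bins; ++b) {
        const int upper = (b + 1) * bin_width - 1;  // largest value in bin b
        const std::size_t cum = static_cast<std::size_t>(
            std::upper_bound(data.begin(), data.end(), upper) - data.begin());
        hist[static_cast<std::size_t>(b)] = static_cast<unsigned>(cum - prev);
        prev = cum;
    }
    return hist;
}
```

The sketch also hints at the trade-off the abstract describes: the private histogram must fit in a small per-block memory, which limits the number of bins, whereas the sort-search variant needs no per-block storage and so places no such limit on the bin count.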

Tags: Algorithms, CUDA, Image processing, nVidia, Search strategies, Tesla C2070, Tesla K20

June 17, 2013 by hgpu
