high performance computing on graphics processing units: hgpu.org


Approaches for parallelizing reductions on modern GPUs

Xin Huo, V.T. Ravi, Wenjing Ma, G. Agrawal

Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA

International Conference on High Performance Computing (HiPC), 2010

DOI:10.1109/HIPC.2010.5713189

@inproceedings{huo2010approaches,
  title={Approaches for parallelizing reductions on modern GPUs},
  author={Huo, X. and Ravi, V. T. and Ma, W. and Agrawal, G.},
  booktitle={High Performance Computing (HiPC), 2010 International Conference on},
  pages={1--10},
  year={2010},
  organization={IEEE}
}


GPU hardware and software have been evolving rapidly. CUDA versions 1.1 and higher support atomic operations on device memory, and CUDA versions 1.2 and higher support atomic operations on shared memory. This paper focuses on parallelizing applications involving reductions on GPUs. Before support for locking was available, these applications could only be parallelized using full replication, i.e., by creating a copy of the reduction object for each thread. From CUDA 1.1 onwards (1.2 for shared memory), atomic operations are another option, though some effort is still required to support locking on floating-point numbers and coarse-grained locking. Based on the tradeoffs between locking and full replication, we also introduce a hybrid approach, in which a group of threads uses atomic operations to update one copy of the reduction object. We evaluate the relative performance of these three approaches using three data mining algorithms that follow the reduction structure: k-means clustering, Principal Component Analysis (PCA), and k-nearest neighbor search (kNN). We show how the relative performance of these techniques can vary depending upon the application and its parameters. The hybrid approach we have introduced clearly outperforms the other approaches in several cases.
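The hybrid scheme described in the abstract can be illustrated with a short CUDA sketch. This is not the authors' implementation: the reduction object here is assumed to be a small array of per-bin float sums, and the names `atomic_add_float_cas`, `hybrid_reduce`, `hybrid_sums`, and the constants `K`, `GROUPS`, and `THREADS` are hypothetical. Each thread block keeps `GROUPS` copies of the reduction object in shared memory, the threads of one group update their shared copy atomically, and the copies are merged into the global result at the end. The float add is built from `atomicCAS` to show the kind of extra effort the abstract mentions for floating-point values (hardware of compute capability 2.0 and later offers a native float `atomicAdd`). Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int K = 4;          // bins in the reduction object (assumed for illustration)
constexpr int GROUPS = 4;     // copies of the reduction object per thread block
constexpr int THREADS = 128;  // threads per block; must be a multiple of GROUPS

// Float add emulated with a compare-and-swap loop on the 32-bit pattern.
__device__ float atomic_add_float_cas(float* addr, float val) {
    int* iaddr = reinterpret_cast<int*>(addr);
    int old = *iaddr;
    int assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) + val;
        old = atomicCAS(iaddr, assumed, __float_as_int(updated));
    } while (assumed != old);  // retry if another thread changed the value
    return __int_as_float(old);
}

// Hybrid reduction: one shared-memory copy per group of threads.
__global__ void hybrid_reduce(const float* x, const int* label, int n, float* sums) {
    __shared__ float copies[GROUPS][K];
    for (int i = threadIdx.x; i < GROUPS * K; i += blockDim.x)
        (&copies[0][0])[i] = 0.0f;
    __syncthreads();

    // Threads with consecutive ids form a group and share one copy.
    int group = threadIdx.x / (THREADS / GROUPS);
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomic_add_float_cas(&copies[group][label[i]], x[i]);
    __syncthreads();

    // Merge the per-group copies, then fold the block result into global memory.
    for (int k = threadIdx.x; k < K; k += blockDim.x) {
        float s = 0.0f;
        for (int g = 0; g < GROUPS; ++g) s += copies[g][k];
        atomic_add_float_cas(&sums[k], s);
    }
}

// Host wrapper: sums x[i] into bin label[i] (labels must lie in [0, K)).
std::vector<float> hybrid_sums(const std::vector<float>& x,
                               const std::vector<int>& label) {
    assert(x.size() == label.size());
    int n = static_cast<int>(x.size());
    float* dx = nullptr;
    float* dsums = nullptr;
    int* dl = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dl, n * sizeof(int));
    cudaMalloc(&dsums, K * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dl, label.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dsums, 0, K * sizeof(float));  // all-zero bits are 0.0f

    hybrid_reduce<<<8, THREADS>>>(dx, dl, n, dsums);

    std::vector<float> out(K);
    cudaMemcpy(out.data(), dsums, K * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dl);
    cudaFree(dsums);
    return out;
}
```

In this sketch, setting `GROUPS` to 1 leaves a single atomically updated copy per block (the locking end of the spectrum), while setting it equal to `THREADS` gives every thread its own copy, which corresponds to full replication. Intermediate values trade shared-memory use against contention on the atomic updates.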

Tags: Algorithms, Clustering, Computer science, CUDA, Data mining, Nearest neighbour, nVidia

August 10, 2011 by hgpu

