high performance computing on graphics processing units: hgpu.org


Performance modeling of atomic additions on GPU scratchpad memory

Juan Gomez-Luna, Jose Maria Gonzalez-Linares, Jose Ignacio Benavides, Nicolas Guil

Department of Computer Architecture and Electronics, University of Cordoba, Spain

IEEE Transactions on Parallel and Distributed Systems, 2012

@article{gomez2012performance,

title={Performance modeling of atomic additions on GPU scratchpad memory},

author={G{\'o}mez-Luna, J. and Gonz{\'a}lez-Linares, J.M. and Benavides, J.I. and Guil, N.},

year={2012}

}


GPU applications that use a scatter approach suffer from write contention when an output element is updated atomically by more than one input element. Colliding threads are serialized, which seriously harms performance. Dealing with this requires a proper understanding of how the scratchpad (shared) memory behaves under conflicting accesses from concurrent threads. This paper therefore presents an exhaustive microbenchmark-based analysis of atomic additions in shared memory that quantifies the impact of access conflicts on latency and throughput. The analysis led us to discover the lock mechanism that enables atomic updates to shared memory and to propose a performance model that estimates the latency penalties caused by position conflicts and bank conflicts. From this model we derived experiments that show how to optimize applications that use atomic operations. Position conflicts can be reduced by replication, and bank conflicts by padding. The benefits of these techniques are illustrated by optimizing two widely used voting processes: the centroid update step in k-means clustering and histogram calculation.
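To make the distinction between position conflicts and bank conflicts concrete, the following host-side C++ sketch counts both for a single warp under three histogram layouts: one shared histogram, one replicated copy per thread, and replicated copies padded by one word. It is only an illustration, not the paper's latency model. It assumes 32 threads per warp and 32 shared-memory banks of 4-byte words (as on Fermi-class GPUs such as the GeForce GTX 580), and the names `conflict_degree` and `warp_words` are hypothetical helpers introduced here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <map>
#include <vector>

// Assumptions (not taken from the paper): 32 threads per warp and
// 32 banks, with consecutive 4-byte words mapped to consecutive banks.
constexpr int kWarpSize = 32;
constexpr int kBanks = 32;

struct ConflictDegree {
    int position;  // max number of threads updating the same word
    int bank;      // max number of distinct words falling in the same bank
};

// words[i] is the shared-memory word index updated by thread i of one warp.
ConflictDegree conflict_degree(const std::vector<int>& words) {
    std::map<int, int> hits_per_word;
    for (int w : words) ++hits_per_word[w];

    std::array<int, kBanks> distinct_words_per_bank{};
    int position = 0;
    for (const auto& kv : hits_per_word) {
        position = std::max(position, kv.second);
        ++distinct_words_per_bank[kv.first % kBanks];
    }
    int bank = *std::max_element(distinct_words_per_bank.begin(),
                                 distinct_words_per_bank.end());
    return {position, bank};
}

// Worst case for a histogram: every thread of the warp votes for the same bin.
// copies: number of replicated sub-histograms (thread i uses copy i % copies).
// stride: words between the starts of consecutive copies
//         (stride == number of bins means no padding).
std::vector<int> warp_words(int bin, int copies, int stride) {
    assert(copies > 0 && stride > bin);
    std::vector<int> words(kWarpSize);
    for (int lane = 0; lane < kWarpSize; ++lane)
        words[lane] = (lane % copies) * stride + bin;
    return words;
}
```

For a 256-bin histogram in which every thread votes for bin 7, `conflict_degree(warp_words(7, 1, 256))` reports a position conflict of degree 32 (all threads hit one word). With 32 copies and no padding, `warp_words(7, 32, 256)` removes the position conflict but places all 32 words in bank 7, because 256 is a multiple of 32. Padding each copy by one word, `warp_words(7, 32, 257)`, spreads the 32 words over 32 different banks.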

Tags: Clustering, Computer science, CUDA, nVidia, nVidia GeForce GTX 580, Optimization

November 14, 2012 by hgpu
