high performance computing on graphics processing units: hgpu.org

Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures

Thomas R. W. Scogland, Wu-chun Feng

Department of Computer Science, Virginia Tech

Virginia Polytechnic Institute & State University, Technical Report number: TR-14-03, 2014

@article{scogland2014design,
  title={Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures},
  author={Scogland, Thomas RW and Feng, Wu-chun},
  year={2014}
}

As core counts increase and heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput concurrent FIFO queue for many-core architectures that avoids the bottlenecks common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to run up to three orders of magnitude (1000x) faster than lock-free and combining queues on GPU platforms and up to two times (2x) faster on CPU devices. These results deliver two critical insights into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can actually increase throughput.
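To make the abstract's second insight concrete, the following is a minimal sketch of a blocking, ticket-based bounded FIFO queue. It is a generic illustration and not the authors' design: the names `TicketQueue` and `demo_sum`, the capacity of 64, and the yield-based waiting are all assumptions made for this example. Each operation claims a position with a single atomic fetch-and-add and then waits on its own slot, instead of retrying a compare-and-swap on a shared head or tail as many lock-free queues do.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch (not the paper's queue): a bounded FIFO in which each
// operation takes a ticket with one fetch_add and then blocks on its slot.
template <typename T, std::size_t N>
class TicketQueue {
    static_assert(N > 0, "capacity must be positive");
    struct Slot {
        // Even value 2*r: slot is empty and ready for the enqueue of round r.
        // Odd value 2*r+1: slot holds the value written in round r.
        std::atomic<std::size_t> turn{0};
        T value{};
    };
    Slot slots_[N];
    std::atomic<std::size_t> head_{0};  // next dequeue ticket
    std::atomic<std::size_t> tail_{0};  // next enqueue ticket

public:
    void enqueue(const T& v) {
        const std::size_t t = tail_.fetch_add(1, std::memory_order_relaxed);
        Slot& s = slots_[t % N];
        const std::size_t round = t / N;
        // Block until the previous occupant of this slot has been consumed.
        while (s.turn.load(std::memory_order_acquire) != 2 * round)
            std::this_thread::yield();
        s.value = v;
        s.turn.store(2 * round + 1, std::memory_order_release);
    }

    T dequeue() {
        const std::size_t h = head_.fetch_add(1, std::memory_order_relaxed);
        Slot& s = slots_[h % N];
        const std::size_t round = h / N;
        // Block until the matching enqueue has published its value.
        while (s.turn.load(std::memory_order_acquire) != 2 * round + 1)
            std::this_thread::yield();
        T v = s.value;
        s.turn.store(2 * round + 2, std::memory_order_release);
        return v;
    }
};

// Illustrative driver: each producer enqueues 1..per_producer, consumers
// split the dequeues evenly, and the sum of all dequeued values is returned.
// Assumes producers * per_producer is divisible by consumers.
std::uint64_t demo_sum(int producers, int consumers, int per_producer) {
    TicketQueue<int, 64> q;
    std::atomic<std::uint64_t> sum{0};
    std::vector<std::thread> threads;
    const int per_consumer = producers * per_producer / consumers;
    for (int p = 0; p < producers; ++p)
        threads.emplace_back([&] {
            for (int i = 1; i <= per_producer; ++i) q.enqueue(i);
        });
    for (int c = 0; c < consumers; ++c)
        threads.emplace_back([&] {
            std::uint64_t local = 0;
            for (int i = 0; i < per_consumer; ++i) local += q.dequeue();
            sum += local;
        });
    for (auto& t : threads) t.join();
    return sum.load();
}
```

The trade-off in this sketch mirrors the one the abstract describes: a stalled thread can hold up others waiting on the same slot, so the queue offers no lock-free progress guarantee, but contention on the shared counters is limited to one fetch-and-add per operation. On a GPU the waiting strategy would need to differ from `std::this_thread::yield()`.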

Tags: ATI, ATI Radeon HD 7970, Computer science, Intel Xeon Phi, nVidia, nVidia GeForce GTX 280, OpenCL, Performance, Tesla K20

August 15, 2014 by hgpu
