To GPU Synchronize or Not GPU Synchronize?

Wu-Chun Feng, Shucai Xiao

Department of Computer Science, Virginia Tech

Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2010, pp. 3801-3804

DOI:10.1109/ISCAS.2010.5537722

@conference{feng2010gpu,
  title={To GPU synchronize or not GPU synchronize?},
  author={Feng, W. and Xiao, S.},
  booktitle={Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on},
  pages={3801--3804},
  year={2010},
  organization={IEEE}
}

The graphics processing unit (GPU) has evolved from being a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. By modifying the GPU’s stream processor to support “general-purpose computation” on the GPU (GPGPU), applications that perform massive vector operations can realize many orders-of-magnitude improvement in performance over a traditional processor, i.e., the CPU. However, the breadth of general-purpose computation that can be efficiently supported on a GPU has largely been limited to highly data-parallel or task-parallel applications due to the lack of explicit support for communication between streaming multiprocessors (SMs) on the GPU. Such communication can occur via the global memory of a GPU, but it then requires a barrier synchronization across the SMs of the GPU in order to complete the communication between SMs. Although our previous work demonstrated that implementing barrier synchronization on the GPU itself can significantly improve performance and deliver correct results in critical bioinformatics applications, guaranteeing the correctness of inter-SM communication is only possible if a memory consistency model is assumed. To address this problem, NVIDIA recently introduced the __threadfence() function in CUDA 2.2, a function that can guarantee the correctness of GPU-based inter-SM communication. However, this function currently introduces so much overhead that, when it is used in (direct) GPU synchronization, GPU synchronization actually performs worse than indirect synchronization via the CPU, thus raising the question of whether “to GPU synchronize or not GPU synchronize?”
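The barrier-plus-fence pattern described in the abstract can be illustrated with a CPU-side analogy. The following is a minimal sketch in standard C++, not CUDA code and not the authors' implementation: each `std::thread` stands in for a thread block on one SM, ordinary heap memory stands in for GPU global memory, and `std::atomic_thread_fence` plays the role that `__threadfence()` plays on the GPU. The names `GridBarrier` and `run_barrier_demo` are hypothetical and were chosen only for this illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for a counter-based barrier across thread blocks.
// The counter only ever grows; round r completes when it reaches
// (r + 1) * num_blocks, so it never has to be reset between rounds.
struct GridBarrier {
    std::atomic<int> arrived{0};

    void wait(int goal) {
        // Analogue of __threadfence(): make this thread's earlier writes
        // visible before announcing arrival at the barrier.
        std::atomic_thread_fence(std::memory_order_release);
        arrived.fetch_add(1, std::memory_order_relaxed);

        // Spin until every participant has arrived.
        while (arrived.load(std::memory_order_relaxed) < goal) {
            std::this_thread::yield();
        }

        // Pairs with the release fences of the other participants, so
        // their pre-barrier writes can be read safely after this point.
        std::atomic_thread_fence(std::memory_order_acquire);
    }
};

// Each "block" writes a value, synchronizes, then reads its neighbor's
// value. Returns true if every read observed the expected value.
bool run_barrier_demo(int num_blocks, int rounds) {
    GridBarrier barrier;
    // Two buffers, alternated per round, so a fast block cannot overwrite
    // a slot that a slower neighbor has not read yet.
    std::vector<int> buf[2] = {std::vector<int>(num_blocks),
                               std::vector<int>(num_blocks)};
    std::atomic<int> mismatches{0};

    auto block = [&](int id) {
        for (int r = 0; r < rounds; ++r) {
            std::vector<int>& data = buf[r % 2];
            data[id] = r * num_blocks + id;      // write to "global memory"
            barrier.wait((r + 1) * num_blocks);  // barrier across "SMs"
            int neighbor = (id + 1) % num_blocks;
            if (data[neighbor] != r * num_blocks + neighbor) {
                mismatches.fetch_add(1);
            }
        }
    };

    std::vector<std::thread> workers;
    for (int id = 0; id < num_blocks; ++id) {
        workers.emplace_back(block, id);
    }
    for (auto& w : workers) {
        w.join();
    }
    return mismatches.load() == 0;
}
```

Calling `run_barrier_demo(4, 1000)` runs four simulated blocks through 1000 barrier rounds. Without the two fences, the C++ memory model would not guarantee that a block sees its neighbor's write after the barrier, which mirrors the correctness concern the abstract raises. The sketch says nothing about the overhead of `__threadfence()` on real hardware, which is the paper's actual subject.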

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 280, Programming techniques, Synchronization

March 12, 2011 by hgpu
