high performance computing on graphics processing units: hgpu.org

Simultaneous Branch and Warp Interweaving for Sustained GPU Performance

Nicolas Brunie, Sylvain Collange, Gregory Diamos

ARENAIRE (Inria Grenoble Rhône-Alpes / LIP Laboratoire de l’Informatique du Parallélisme), INRIA – CNRS : UMR5668 – Université Claude Bernard – Lyon I – École Normale Supérieure de Lyon

HAL : ensl-00649650, version 1, 2011

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into so-called warps, amortizing the cost of instruction fetch, decode, and control logic over multiple execution units. When individual threads take divergent execution paths, the paths are processed sequentially, which defeats part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions, instead of a single one, to be scheduled to disjoint subsets of the same row of execution units. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads return to lockstep execution at control-flow reconvergence points without hindering their ability to run branches in parallel. To support (2), we propose a lane shuffling technique that lets it benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while keeping the same instruction-fetch and processing-unit resources as the contemporary Fermi GPU architecture.
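
The effect of co-issuing the two sides of a divergent branch, technique (1) in the abstract, can be illustrated with a toy cycle count. The Python sketch below is not the authors' simulator or their reconvergence mechanism. It assumes a single warp, a single branch with two paths, one instruction issued per path per cycle, and no memory or fetch stalls. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def path_sizes(taken_mask):
    """Return (threads on the taken path, threads on the other path)."""
    taken = sum(1 for t in taken_mask if t)
    return taken, len(taken_mask) - taken


def serialized_cycles(taken_mask, len_taken, len_not_taken):
    """Baseline SIMT model: each non-empty path is issued alone, in sequence."""
    taken, not_taken = path_sizes(taken_mask)
    return (len_taken if taken else 0) + (len_not_taken if not_taken else 0)


def co_issued_cycles(taken_mask, len_taken, len_not_taken):
    """Idealized co-issue model: both paths run at once on disjoint lanes,
    so the longer path determines the cycle count."""
    taken, not_taken = path_sizes(taken_mask)
    return max(len_taken if taken else 0, len_not_taken if not_taken else 0)


def lane_utilization(taken_mask, len_taken, len_not_taken, cycles):
    """Fraction of lane-cycles doing useful work over the divergent region."""
    taken, not_taken = path_sizes(taken_mask)
    busy = taken * len_taken + not_taken * len_not_taken
    return busy / (len(taken_mask) * cycles)


if __name__ == "__main__":
    # Assumed example: 32-lane warp, 8 threads take a 10-instruction path,
    # the other 24 threads take a 6-instruction path.
    mask = [True] * 8 + [False] * 24
    for name, model in (("serialized", serialized_cycles),
                        ("co-issued", co_issued_cycles)):
        cycles = model(mask, 10, 6)
        print(name, cycles, lane_utilization(mask, 10, 6, cycles))
```

With these assumed numbers, the serialized model needs 16 cycles at a lane utilization of 0.4375, while the idealized co-issue model needs 10 cycles at 0.7. The gap shrinks as the two paths become more unequal in length, and it vanishes when the warp does not diverge at all. These figures come from the toy model only and say nothing about the speedups reported in the paper.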

Tags: Computer science, CUDA, nVidia, Performance

December 15, 2011 by hgpu
