high performance computing on graphics processing units: hgpu.org

Exploiting concurrent kernel execution on graphic processing units

Lingyuan Wang, Miaoqing Huang, Tarek El-Ghazawi

ECE Department, The George Washington University

International Conference on High Performance Computing and Simulation (HPCS), 2011

DOI:10.1109/HPCSim.2011.5999803

@inproceedings{wang2011exploiting,
  title={Exploiting concurrent kernel execution on graphic processing units},
  author={Wang, L. and Huang, M. and El-Ghazawi, T.},
  booktitle={High Performance Computing and Simulation (HPCS), 2011 International Conference on},
  pages={24--32},
  year={2011},
  organization={IEEE}
}

Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. To maximize the benefit of GPUs on a multicore platform, the CPU threads of a parallel application need a mechanism for sharing this computing resource efficiently. NVIDIA's Fermi architecture pioneered concurrent kernel execution; however, only kernels belonging to the same thread context can execute in parallel. To make the best use of a GPU device in a multi-threaded application environment, this paper explores techniques for effectively sharing a context, i.e., context funneling, which can be done either manually at the application level or automatically by the GPU runtime starting from CUDA v4.0. In synthetic microbenchmark tests, we find that both funneling mechanisms exploit concurrent kernel execution better than traditional context switching, and therefore improve overall application performance. We also find that manual funneling provides the highest performance and more explicit control, while CUDA v4.0 provides better productivity with good performance. Finally, we assess the impact of these techniques on a compact application benchmark, SSCA#3 – SAR sensor processing.
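The abstract does not spell out how manual funneling is implemented. One plausible reading is that a single host thread owns the GPU context and every other CPU thread hands its kernel launches to that owner. The sketch below models only that host-side structure in plain Python, with no GPU calls at all. `ContextFunnel`, `launch`, `saxpy`, and `run_workers` are hypothetical names invented for illustration and are not taken from the paper or from CUDA. Because the stand-in "kernels" run one after another on the owner thread, the sketch shows who issues the launches, not the overlap that concurrent kernel execution would give on a real device.

```python
import queue
import threading


class ContextFunnel:
    """Hypothetical stand-in for the one thread that owns the GPU context."""

    def __init__(self):
        self._requests = queue.Queue()
        self.launch_log = []  # (kernel_name, issuing_thread_name) pairs
        self._owner = threading.Thread(target=self._serve, name="context-owner")
        self._owner.start()

    def _serve(self):
        # Only this thread ever "touches the context". A real version would
        # presumably issue each request asynchronously into its own stream.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            kernel_name, kernel, args, done, result = item
            result.append(kernel(*args))
            self.launch_log.append((kernel_name, threading.current_thread().name))
            done.set()

    def launch(self, kernel_name, kernel, *args):
        """Callable from any worker thread; blocks until the work is done."""
        done, result = threading.Event(), []
        self._requests.put((kernel_name, kernel, args, done, result))
        done.wait()
        return result[0]

    def close(self):
        self._requests.put(None)
        self._owner.join()


def saxpy(a, xs, ys):
    # Placeholder "kernel": ordinary Python standing in for device code.
    return [a * x + y for x, y in zip(xs, ys)]


def run_workers(num_workers=4):
    """Start several CPU threads that all funnel launches through one owner."""
    funnel = ContextFunnel()
    outputs = {}

    def worker(i):
        outputs[i] = funnel.launch(f"saxpy-{i}", saxpy, i, [1, 2, 3], [10, 20, 30])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    funnel.close()
    return outputs, funnel.launch_log
```

Calling `run_workers(4)` returns the per-worker results together with a log in which every launch is attributed to the `context-owner` thread, which is the property funneling is meant to provide. The blocking `launch` is a simplification chosen to keep the sketch short; real kernel launches are asynchronous.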

Tags: Computer science, CUDA, nVidia, Performance, Programming techniques, Tesla C2070

November 19, 2011 by hgpu
