high performance computing on graphics processing units: hgpu.org

Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems

Sundaresan Venkatasubramanian, Richard W. Vuduc

Georgia Institute of Technology, College of Computing, School of Computer Science, 266 Ferst Drive, Atlanta, Georgia, USA

In ICS ’09: Proceedings of the 23rd international conference on Supercomputing (2009), pp. 244-255.

DOI:10.1145/1542275.1542312

@conference{venkatasubramanian2009tuned,
  title={Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems},
  author={Venkatasubramanian, S. and Vuduc, R.W.},
  booktitle={Proceedings of the 23rd international conference on Supercomputing},
  pages={244--255},
  year={2009},
  organization={ACM}
}

We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi’s iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060, and 78% on a C870. Motivated to find a still faster implementation, we further consider “wildly asynchronous” implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations. By doing so, we trade off more flops, via more iterations to converge, for a higher degree of asynchronous parallelism. Our wild implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly “fast-and-loose” algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
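To make the contrast between synchronized Jacobi and delayed synchronization concrete, here is a minimal CPU-side sketch in plain Python. It is not the authors' CUDA implementation and does not reproduce their tuning. It assumes a 12x12 grid on the unit square with zero boundary values and a constant right-hand side, and it emulates "delayed synchronization" deterministically: each row block runs several sweeps against stale copies of its neighbours' rows before publishing its results. The function names, the two-block partition `BLOCKS`, and the `local_sweeps` parameter are all hypothetical choices made for illustration. True chaotic relaxation on a GPU is non-deterministic, which this sketch does not capture.

```python
"""Synchronized Jacobi vs. a deterministic emulation of delayed synchronization."""


def stencil(u, f, h2, i, j):
    # 5-point Jacobi update for -laplace(u) = f, where h2 is the squared grid spacing.
    return 0.25 * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1] + h2 * f[i][j])


def jacobi_sweep(u, f, h):
    """One fully synchronized sweep: every point reads the previous iterate."""
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = stencil(u, f, h * h, i, j)
    return new


def delayed_sync_round(u, f, h, blocks, local_sweeps):
    """Each row block runs `local_sweeps` sweeps against stale neighbour rows."""
    n = len(u)
    result = [row[:] for row in u]
    for lo, hi in blocks:
        local = [row[:] for row in u]  # rows outside [lo, hi) stay frozen (stale halo)
        for _ in range(local_sweeps):
            new = [row[:] for row in local]
            for i in range(lo, hi):
                for j in range(1, n - 1):
                    new[i][j] = stencil(local, f, h * h, i, j)
            local = new
        result[lo:hi] = local[lo:hi]  # "synchronize": publish this block's rows
    return result


def residual(u, f, h):
    """Max-norm residual of the discrete equation over interior points."""
    n = len(u)
    return max(
        abs(f[i][j] - (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                       - u[i][j - 1] - u[i][j + 1]) / (h * h))
        for i in range(1, n - 1) for j in range(1, n - 1)
    )


def solve(step, n=12, tol=1e-6, max_rounds=10000):
    """Iterate `step` from a zero guess until the residual drops below `tol`."""
    h = 1.0 / (n - 1)
    f = [[1.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    rounds = 0
    while residual(u, f, h) >= tol and rounds < max_rounds:
        u = step(u, f, h)
        rounds += 1
    return u, rounds


BLOCKS = [(1, 6), (6, 11)]  # hypothetical split: two row blocks covering interior rows 1..10
u_sync, sweeps = solve(jacobi_sweep)
u_wild, rounds = solve(lambda u, f, h: delayed_sync_round(u, f, h, BLOCKS, 4))
print("synchronized sweeps:", sweeps)
print("delayed-sync rounds:", rounds, "with", 4 * rounds, "block-local sweeps in total")
```

With `local_sweeps=1` the emulated round is identical to an ordinary Jacobi sweep. Larger values let each block do more arithmetic between synchronization points, which is the trade the abstract describes; whether that pays off on real hardware depends on synchronization costs that this sketch does not model.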

Tags: Computer science, CUDA, nVidia, nVidia Quadro FX 570, Performance, Tesla C1060, Tesla C870

November 22, 2010 by hgpu
