high performance computing on graphics processing units: hgpu.org

Tuned and asynchronous stencil kernels for CPU/GPU systems (thesis)

Sundaresan Venkatasubramanian

Georgia Institute of Technology

Georgia Institute of Technology, 2009

We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi’s iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060. Seeking a still faster implementation, we further consider “wildly asynchronous” implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2 to 2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result motivates research on similarly “fast-and-loose” algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
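To make the contrast in the abstract concrete, the following is a minimal pure-Python sketch, not the thesis code. It assumes the model problem -Δu = f on the unit square with zero boundary values, f = 1, and a 5-point stencil; the function names, grid size, and tolerance are all hypothetical choices made for illustration. `jacobi_sweep` reads only values from the previous iteration, which is what a synchronization barrier between iterations guarantees. `relaxed_sweep` updates the grid in place, so each point may read a mix of old and new neighbor values. Run sequentially, that in-place ordering coincides with Gauss-Seidel and is only one deterministic stand-in for the nondeterministic interleavings a truly asynchronous GPU kernel would see.

```python
def residual_norm(u, f, h):
    """Max-norm residual of the 5-point discretization of -Laplace(u) = f."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                   - u[i][j - 1] - u[i][j + 1]) / (h * h)
            worst = max(worst, abs(lap - f[i][j]))
    return worst


def jacobi_sweep(u, f, h):
    """Synchronized sweep: every update reads only the previous iterate."""
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1]
                                + h * h * f[i][j])
    return new


def relaxed_sweep(u, f, h):
    """In-place sweep: updates may read neighbors already updated this sweep.

    Sequentially this is Gauss-Seidel ordering, used here only as a
    stand-in for removing the barrier between iterations.
    """
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1]
                              + h * h * f[i][j])
    return u


def solve(sweep, n=17, tol=1e-6, max_iters=20000):
    """Iterate `sweep` from a zero guess until the residual drops below tol.

    Returns (solution grid, number of sweeps performed).
    """
    h = 1.0 / (n - 1)
    f = [[1.0] * n for _ in range(n)]   # assumed right-hand side f = 1
    u = [[0.0] * n for _ in range(n)]   # zero boundary and initial guess
    for k in range(1, max_iters + 1):
        u = sweep(u, f, h)
        if residual_norm(u, f, h) < tol:
            return u, k
    return u, max_iters


if __name__ == "__main__":
    for name, sweep in (("jacobi", jacobi_sweep), ("relaxed", relaxed_sweep)):
        u, iters = solve(sweep)
        print(name, "sweeps:", iters, "center value:", u[8][8])
```

Both variants stop at the same residual tolerance, so they reach the same accuracy, while the number of sweeps depends on the order in which updated values become visible. On a GPU, the potential benefit of the relaxed version comes from dropping the barrier between iterations rather than from any particular update order.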

Tags: Computer science, CUDA, nVidia, nVidia Quadro FX 570, Performance, Tesla C1060, Tesla C870, Thesis

March 6, 2011 by hgpu
