high performance computing on graphics processing units: hgpu.org

Fast and Accurate Finite-Element Multigrid Solvers for PDE Simulations on GPU Clusters

Dominik Göddeke

Fakultät für Mathematik der Technischen Universität Dortmund

@phdthesis{Goeddeke:2010:FAA,
  author={Dominik G{\"o}ddeke},
  title={Fast and Accurate Finite-Element Multigrid Solvers for {PDE} Simulations on {GPU} Clusters},
  school={{T}echnische {U}niversit{\"a}t Dortmund, {F}akult{\"a}t f{\"u}r {M}athematik},
  year={2010},
  month={may},
  note={\url{http://hdl.handle.net/2003/27243}}
}

The main contribution of this thesis is to demonstrate that graphics processors (GPUs), as representatives of emerging many-core architectures, are very well suited for the fast and accurate solution of large sparse linear systems of equations, using parallel multigrid methods on heterogeneous compute clusters. Such systems arise, for instance, in the discretisation of (elliptic) partial differential equations with finite elements. We report at least one order of magnitude speedup over highly tuned conventional CPU implementations, without sacrificing either accuracy or functionality. In more detail, this thesis includes the following contributions:

Single precision floating point computations may be insufficient for the class of problems considered in this thesis. We revisit mixed precision iterative refinement techniques, not only to increase the accuracy of computed results, but also to increase the efficiency of the solution process. Both on CPUs and on GPUs, we demonstrate a significant performance improvement without loss of accuracy compared to computing in high precision only.

We present efficient parallelisation techniques for multigrid solvers on graphics hardware, in particular for numerically strong smoothers and preconditioners that are suitable for highly anisotropic grids and operators. For instance, an efficient formulation of the cyclic reduction algorithm to solve tridiagonal systems is developed. In view of hardware-oriented numerics, we carefully analyse the trade-off between numerical and runtime performance for inexact parallelisation techniques that decouple some of the inherently sequential characteristics of strong smoothing operators.

For large-scale established software frameworks, a re-implementation tailored to novel hardware platforms is often prohibitively expensive. We develop a ‘minimally invasive’ approach to integrate support for co-processor hardware like GPUs into FEAST, a finite element discretisation and solver toolbox. Our technique has the major advantage that applications built on top of the toolbox do not have to be changed at all to benefit from co-processor acceleration. The approach is evaluated for benchmark problems in linearised elasticity and stationary laminar flow, computed on large-scale GPU-enhanced clusters. Good speedup factors and near-ideal weak scalability are observed. The achievable speedup is analysed and a theoretical speedup model is presented.

Finally, we provide a historical overview of scientific computing on graphics hardware, from its early beginnings in 2001/2002, when GPGPU was an obscure research topic pursued by few, to today's widespread adoption. We discuss the evolution of the hardware and the programming model, and provide a comprehensive bibliography of publications related to PDE simulations on GPUs.
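The abstract names cyclic reduction for tridiagonal systems but gives no details of the formulation developed in the thesis. As a rough illustration of the general idea only, the Python sketch below eliminates every even-indexed unknown from the odd-indexed equations, solves the resulting half-size tridiagonal system recursively, and then back-substitutes. The function name `cyclic_reduction` and the coefficient layout (`a`, `b`, `c` for the sub-, main and super-diagonal, `d` for the right-hand side) are assumptions made for this example. The code is sequential, uses no pivoting, and is not the GPU implementation described in the thesis.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive cyclic reduction (illustrative sketch).

    Row i reads: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
    a[0] and c[-1] lie outside the matrix and are ignored.
    No pivoting is done, so the matrix should be e.g. diagonally dominant.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: use the even neighbour rows i-1 and i+1 to remove
    # x[i-1] and x[i+1] from every odd row i. The rows of one level are
    # independent of each other, which is what makes the method parallel.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        lower = a[i - 1] if i - 1 > 0 else 0.0
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            upper = c[i + 1] if i + 1 < n - 1 else 0.0
            diag_fix = gamma * a[i + 1]
            rhs_fix = gamma * d[i + 1]
        else:
            gamma = upper = diag_fix = rhs_fix = 0.0
        ra.append(-alpha * lower)
        rb.append(b[i] - alpha * c[i - 1] - diag_fix)
        rc.append(-gamma * upper)
        rd.append(d[i] - alpha * d[i - 1] - rhs_fix)

    # Solve the half-size system for the odd-indexed unknowns.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: each even unknown follows from its own row.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


# Usage: 1D Poisson-type matrix tridiag(-1, 2, -1) with n = 7.
# The right-hand side is chosen so that the exact solution is 1, 2, ..., 7.
n = 7
x = cyclic_reduction([-1.0] * n, [2.0] * n, [-1.0] * n, [0.0] * (n - 1) + [8.0])
print(x)  # entries should match 1..7 up to rounding error
```

Each reduction level halves the number of unknowns, so the recursion depth grows only logarithmically with the system size. A parallel implementation would process all rows of one level concurrently rather than in a Python loop.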

Tags: CUDA, Finite element method, Fluid dynamics, GPU cluster, Navier-Stokes equations, NSEs, nVidia, nVidia GeForce 6800, nVidia GeForce 7600 GT, nVidia GeForce 7900 GT, nVidia GeForce 8600 GT, nVidia GeForce 8800 GTX, nVidia GeForce 9600 GT, nVidia GeForce GTX 280, OpenGL, Review, Thesis

December 16, 2010 by hgpu
