high performance computing on graphics processing units: hgpu.org

Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results

Johannes Habich, Christian Feichtinger, Harald Kostler, Georg Hager, Gerhard Wellein

Erlangen Regional Computing Center, University of Erlangen-Nuremberg, Germany

arXiv:1112.0850v1 [cs.PF] (5 Dec 2011)

@article{2011arXiv1112.0850H,
  author={Habich, Johannes and Feichtinger, Christian and Kostler, Harald and Hager, Georg and Wellein, Gerhard},
  title={{Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results}},
  journal={ArXiv e-prints},
  archivePrefix={arXiv},
  eprint={1112.0850},
  primaryClass={cs.PF},
  keywords={Computer Science - Performance},
  year={2011},
  month={dec}
}

GPUs offer several times the floating-point performance and memory bandwidth of current standard two-socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be ported successfully to GPUs and to perform well on them. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann-based incompressible flow solver for GPGPUs and CPUs, implemented with NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access to the GPU memory. The importance of occupancy is discussed, as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single-phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS (million lattice updates per second) in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.
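The claim that the algorithm is limited by memory bandwidth can be made concrete with a back-of-the-envelope traffic model. The sketch below is not taken from the paper. It assumes that one D3Q19 cell update loads 19 distribution values and stores 19, with no other memory traffic, and it converts the MLUPS figures quoted above into the sustained data rate they would imply. The helpers `soa_index` and `aos_index` are hypothetical names that illustrate the two basic data layouts (structure of arrays versus array of structures), and the domain size is an arbitrary choice.

```python
# Back-of-the-envelope traffic model for a D3Q19 lattice update.
# ASSUMPTION (not stated in the abstract): one cell update loads 19
# distribution values and stores 19, with no other memory traffic.
Q = 19  # discrete velocities in the D3Q19 model


def bytes_per_update(bytes_per_value, q=Q):
    """Minimum bytes moved per lattice cell update: q loads plus q stores."""
    return 2 * q * bytes_per_value


def implied_bandwidth_gbs(mlups, bytes_per_value):
    """Sustained data rate in GB/s (1 GB = 1e9 bytes) implied by an MLUPS figure."""
    return mlups * 1e6 * bytes_per_update(bytes_per_value) / 1e9


def soa_index(cell, q, n_cells):
    """Structure of arrays: all cells for direction q are stored contiguously."""
    return q * n_cells + cell


def aos_index(cell, q, n_dirs=Q):
    """Array of structures: all directions of one cell are stored contiguously."""
    return cell * n_dirs + q


if __name__ == "__main__":
    # MLUPS figures quoted in the abstract for the Tesla C2070.
    for label, mlups, size in (("single", 650, 4), ("double", 290, 8)):
        print(f"{label} precision: {implied_bandwidth_gbs(mlups, size):.1f} GB/s")
    # Address stride between neighboring cells for a fixed direction q.
    n_cells = 64 ** 3  # hypothetical domain size
    print("SoA stride:", soa_index(1, 5, n_cells) - soa_index(0, 5, n_cells))
    print("AoS stride:", aos_index(1, 5) - aos_index(0, 5))
```

Under these assumptions, the reported figures correspond to roughly 99 GB/s in single precision and 88 GB/s in double precision. The stride of 1 in the structure-of-arrays layout means that threads working on neighboring cells touch neighboring addresses, which is the access pattern GPUs generally serve most efficiently; in the array-of-structures layout the same accesses are 19 values apart.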

Tags: Algorithms, ATI, ATI Radeon HD 6970, CUDA, Fluid dynamics, Lattice Boltzmann model, nVidia, nVidia GeForce 8800 GTX, OpenCL, Optimization, Performance, Tesla C1060, Tesla C2070

December 6, 2011 by hgpu
