high performance computing on graphics processing units: hgpu.org


Performance analysis and optimization of a CFD application

Wentao Zhang

University of Illinois at Urbana-Champaign

University of Illinois at Urbana-Champaign, 2015

@article{zhang2015performance,
  title={Performance analysis and optimization of a CFD application},
  author={Zhang, Wentao},
  year={2015}
}


This thesis documents the analysis and optimization of a high-order finite difference computational fluid dynamics (CFD) application (PlasComCM). Performance bottlenecks were identified using performance tools and hardware counters. The performance analysis of PlasComCM showed that the quantity of memory accesses and the lack of vectorization inhibited optimal serial performance on an x86-based CPU. Optimization techniques including pointer dereferencing, loop transformation and Fortran SIMD directives were applied to the top 10 time-consuming subroutines to remove obstacles to vectorization and to improve the serial performance. Details about the optimization techniques are presented and their impacts on performance are discussed. A 63% reduction in the number of memory loads and a serial speedup of 2.02 were obtained from the optimization efforts. Using the optimized serial program as the codebase, further investigation focused on the analysis and optimization of parallel heterogeneous execution on a dual-socket node fitted with an Intel Xeon Phi MIC card. To reduce the overhead created by host-accelerator copies in heterogeneous execution, the data layout of the halo region was changed from a "star" shape to a "box" shape to agglomerate small communications and to create a larger work granularity. Preliminary results of running PlasComCM on Intel Xeon Phis in symmetric mode are also presented, where it was found that a 20% reduction in wall-clock time can be obtained for a particular problem size when using 2 SandyBridge sockets + 1 Phi card compared with 2 SandyBridge sockets alone.
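The abstract reports its results in two different forms: a speedup factor (2.02) and percentage reductions (63% fewer memory loads, 20% less wall-clock time). The short sketch below converts between the two so the figures can be compared on one scale. It is only an arithmetic illustration. The function names are hypothetical and not taken from the thesis, and it assumes "speedup" means old time divided by new time.

```python
# Hypothetical helpers (not from the thesis) for comparing the reported figures.
# Assumption: speedup = old_time / new_time.

def speedup_to_time_reduction(speedup):
    """Fraction of run time removed for a given speedup factor."""
    return 1.0 - 1.0 / speedup


def reduction_to_ratio(reduction):
    """Old/new ratio implied by a fractional reduction (e.g. 0.20 for 20%)."""
    return 1.0 / (1.0 - reduction)


# Figures quoted in the abstract.
serial_speedup = 2.02        # serial speedup after optimization
load_reduction = 0.63        # reduction in the number of memory loads
wallclock_reduction = 0.20   # 2 sockets + 1 Phi vs 2 sockets

if __name__ == "__main__":
    print(f"Serial run-time reduction: {speedup_to_time_reduction(serial_speedup):.1%}")
    print(f"Memory loads, old/new ratio: {reduction_to_ratio(load_reduction):.2f}x")
    print(f"Heterogeneous speedup: {reduction_to_ratio(wallclock_reduction):.2f}x")
```

Under these assumptions, the serial speedup of 2.02 corresponds to roughly half the original run time, and the 20% wall-clock reduction in symmetric mode corresponds to a speedup of about 1.25.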

Tags: Finite difference, Fluid dynamics, Fortran, Heterogeneous systems, Intel Xeon Phi, Performance, Thesis

October 18, 2015 by hgpu

