high performance computing on graphics processing units: hgpu.org

Auto-Vectorizing a Large-scale Production Unstructured-mesh CFD Application

G.R. Mudalige, I.Z. Reguly, M.B. Giles

Oxford eResearch Centre, University of Oxford, U.K.

3rd Workshop on Programming Models for SIMD/Vector Processing (WPMVP), 2016

For modern x86-based CPUs with increasingly long vector lengths, achieving good vectorization has become very important for gaining higher performance. Explicit SIMD vector programming techniques have been shown to give near-optimal performance; however, they are difficult to implement for all classes of applications, particularly those with very irregular memory accesses, and they usually require considerable refactoring of the code. Vector intrinsics are also not available for languages such as Fortran, which is still heavily used in large production applications. The alternative is to depend on compiler auto-vectorization, which has usually been less effective in vectorizing codes with irregular memory access patterns. In this paper we present recent research exploring techniques to gain compiler auto-vectorization for unstructured-mesh applications. A key contribution is the details of software techniques that achieve auto-vectorization for a large production-grade unstructured-mesh application from the CFD domain, so as to benefit from the vector units on the latest Intel processors without a significant code rewrite. We use code generation tools in the OP2 domain-specific library to apply the auto-vectorizing optimizations automatically to the production code base, and further compare the performance of the application with that of other parallelizations, such as on the latest NVIDIA GPUs. We see considerable performance improvements with auto-vectorization. The most compute-intensive parallel loops in the large CFD application show speedups of nearly 40% on a 20-core Intel Haswell system compared to their non-vectorized versions. However, not all loops gain from vectorization: loops with lower computational intensity lose performance due to the associated overheads.
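The abstract does not spell out the code transformation itself. One common way to make an indirectly addressed mesh loop friendlier to a compiler's auto-vectorizer is to split it into gather, compute, and scatter phases over small blocks of elements. The following is only a minimal sketch of that general idea in C, not the paper's or OP2's generated code: the names `SIMD_VEC`, `edge_flux`, and `edge_loop_blocked`, the two-nodes-per-edge mapping layout, and the flux formula are all assumptions made for illustration.

```c
#include <stddef.h>

#define SIMD_VEC 4  /* hypothetical block width; tune to the target's vector length */

/* Hypothetical per-edge kernel: a simple flux between the two nodes of an edge. */
static inline double edge_flux(double xa, double xb) { return 0.5 * (xb - xa); }

/* edge2node holds two node indices per edge; res accumulates per-node residuals. */
void edge_loop_blocked(size_t n_edges, const int *edge2node,
                       const double *x, double *res)
{
    size_t e = 0;
    for (; e + SIMD_VEC <= n_edges; e += SIMD_VEC) {
        double xa[SIMD_VEC], xb[SIMD_VEC], f[SIMD_VEC];

        /* Gather: indirect reads are copied into contiguous local arrays. */
        for (int i = 0; i < SIMD_VEC; i++) {
            xa[i] = x[edge2node[2 * (e + i)]];
            xb[i] = x[edge2node[2 * (e + i) + 1]];
        }

        /* Compute: unit-stride and dependency-free, so it can be vectorized.
           The pragma takes effect only when OpenMP SIMD support is enabled. */
        #pragma omp simd
        for (int i = 0; i < SIMD_VEC; i++)
            f[i] = edge_flux(xa[i], xb[i]);

        /* Scatter: serial indirect increments, so two edges sharing a node
           never update it in the same vector operation. */
        for (int i = 0; i < SIMD_VEC; i++) {
            res[edge2node[2 * (e + i)]]     += f[i];
            res[edge2node[2 * (e + i) + 1]] -= f[i];
        }
    }

    /* Remainder edges fall back to the plain scalar loop. */
    for (; e < n_edges; e++) {
        double f = edge_flux(x[edge2node[2 * e]], x[edge2node[2 * e + 1]]);
        res[edge2node[2 * e]]     += f;
        res[edge2node[2 * e + 1]] -= f;
    }
}
```

The gather and scatter loops are pure overhead relative to the scalar version, which is consistent with the abstract's observation that loops with low computational intensity can lose performance: the vectorized compute phase has to do enough work per element to pay for the extra data movement.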

Tags: cfd, Code generation, Computer science, CUDA, Fluid dynamics, Fortran, nVidia, Programming techniques, Tesla K80

February 25, 2016 by hgpu
