Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs
Department of Computing, Imperial College London
Imperial College London, 2015
@article{dearman2015exploring,
title={Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs},
author={Dearman, Hector},
year={2015}
}
Finite Element Methods (FEM) are ubiquitous in science and engineering, where they are used in fields as diverse as structural analysis, ocean modeling and bioengineering. FEM allow us to find approximate solutions to a system of partial differential equations over an unstructured mesh. The first phase of solving a FEM problem, local assembly, involves computing a tensor for every element in the mesh. Local assembly is extremely data-parallel: each entry in each tensor may be computed independently, making local assembly an excellent target for General Purpose Graphics Processing Units. We systematically investigate optimisations to improve the performance of the local assembly phase of FEM on GPUs for a broad range of problems. We look at four classes of optimisations: effective use of constant memory, tuning the kernel launch parameters, using multiple threads per element, and loop unrolling. The optimisations are implemented in the Firedrake toolchain, particularly in PyOP2 and COFFEE, and the performance improvement of each optimisation is measured using three representative benchmarks. To ensure our results are robust, we consider each of these benchmarks in the context of a variety of element shapes and polynomial degrees of the basis functions. Combining these optimisations, we achieve speedups of up to 35 times over Firedrake’s current performance on some benchmarks and an average speedup of 13 times across all benchmarks. Finally, we measure the absolute performance of the combined optimisations, showing that we achieve up to 78% of peak FLOPs on some benchmarks and an average of 57% of peak FLOPs across all benchmarks on an NVIDIA GRID K520.
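The data-parallel structure described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the Firedrake, PyOP2 or COFFEE implementation. It assumes the simplest possible case, a mass matrix on linear (P1) triangles, where entry (i, j) of the element tensor is area/12 times 2 on the diagonal and 1 off it. All function names and the two-triangle mesh are hypothetical. The sketch contrasts one unit of work per element with one unit of work per tensor entry, the latter mimicking the "multiple threads per element" strategy.

```python
# Illustrative only: P1 mass-matrix local assembly on a tiny triangle mesh.
# None of these names come from Firedrake, PyOP2 or COFFEE.

def triangle_area(p0, p1, p2):
    """Area of a triangle from its vertex coordinates."""
    return 0.5 * abs((p1[0] - p0[0]) * (p2[1] - p0[1])
                     - (p2[0] - p0[0]) * (p1[1] - p0[1]))

def local_entry(coords, cells, e, i, j):
    """Entry (i, j) of the element mass matrix for cell e.

    Depends only on the cell's own vertices, so any (e, i, j) can be
    evaluated independently of every other entry.
    """
    p0, p1, p2 = (coords[v] for v in cells[e])
    area = triangle_area(p0, p1, p2)
    return area / 12.0 * (2.0 if i == j else 1.0)

def assemble_per_element(coords, cells):
    """One unit of work per element: each computes its whole 3x3 tensor."""
    return [[[local_entry(coords, cells, e, i, j) for j in range(3)]
             for i in range(3)]
            for e in range(len(cells))]

def assemble_per_entry(coords, cells):
    """One unit of work per tensor entry (several 'threads' per element)."""
    out = [[[0.0] * 3 for _ in range(3)] for _ in cells]
    for tid in range(len(cells) * 9):   # flat, thread-like index
        e, rem = divmod(tid, 9)         # which element
        i, j = divmod(rem, 3)           # which entry of its tensor
        out[e][i][j] = local_entry(coords, cells, e, i, j)
    return out

# Hypothetical mesh: the unit square split into two triangles.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cells = [(0, 1, 2), (1, 3, 2)]
```

Both decompositions produce the same tensors because no entry reads another entry's result. On a GPU the flat index `tid` would correspond to a thread index, and the choice between the two mappings is one of the tuning decisions the abstract refers to.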
November 3, 2015 by hgpu