
Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs

Hector Dearman
Department of Computing, Imperial College London, 2015

@article{dearman2015exploring,
   title={Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs},
   author={Dearman, Hector},
   year={2015}
}

Finite Element Methods (FEM) are ubiquitous in science and engineering, where they are used in fields as diverse as structural analysis, ocean modelling and bioengineering. FEM allow us to find approximate solutions to a system of partial differential equations over an unstructured mesh. The first phase of solving a FEM problem, local assembly, involves computing a tensor for every element in the mesh. Local assembly is extremely data-parallel: each entry in each tensor may be computed independently, making it an excellent target for General Purpose Graphics Processing Units. We systematically investigate optimisations to improve the performance of the local assembly phase of FEM on GPUs for a broad range of problems. We look at four classes of optimisations: effective use of constant memory, tuning the kernel launch parameters, using multiple threads per element and loop unrolling. The optimisations are implemented in the Firedrake toolchain, particularly in PyOP2 and COFFEE, and the performance improvement of each optimisation is measured using three representative benchmarks. To ensure our results are robust, we consider each of these benchmarks in the context of a variety of element shapes and polynomial degrees of the basis functions. Combining these optimisations, we achieve speedups of up to 35 times over Firedrake's current performance on some benchmarks, and an average speedup of 13 times across all benchmarks. Finally, we measure the absolute performance of the combined optimisations, showing that we achieve up to 78% of peak FLOPs on some benchmarks and an average of 57% of peak FLOPs across all benchmarks on an NVIDIA GRID K520.
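The interplay of these optimisations is easiest to see in miniature. The following is a minimal, hypothetical CUDA sketch (it is not code from Firedrake, PyOP2 or COFFEE) of local assembly for a P1 mass matrix on triangles, using one thread per element. It directly illustrates two of the four optimisation classes: the reference-element tabulation lives in constant memory, since it is read-only and shared by every element, and the compile-time trip counts let the quadrature loops be fully unrolled. The block size in the launch configuration is the kind of parameter the tuning experiments would sweep. All names and values here are illustrative assumptions.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define NQP 3   // quadrature points (degree-2 rule on the reference triangle)
#define NBF 3   // P1 basis functions per triangle

// Basis function values at the quadrature points, and the quadrature
// weights. These are identical for every element, so constant memory
// (cached and broadcast to all threads) is a natural home for them.
__constant__ double c_phi[NQP][NBF] = {
    {2.0/3, 1.0/6, 1.0/6},
    {1.0/6, 2.0/3, 1.0/6},
    {1.0/6, 1.0/6, 2.0/3},
};
__constant__ double c_w[NQP] = {1.0/6, 1.0/6, 1.0/6};

// One thread per element: each thread computes its full 3x3 local tensor.
__global__ void mass_local_assembly(int nelems,
                                    const double *detJ,  // per-element Jacobian determinant
                                    double *out)         // nelems x (3*3) local tensors
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nelems) return;

    double A[NBF][NBF] = {};  // accumulated in registers

    // All trip counts are compile-time constants, so the compiler can
    // fully unroll the nest and keep A out of local memory.
    #pragma unroll
    for (int q = 0; q < NQP; ++q)
        #pragma unroll
        for (int i = 0; i < NBF; ++i)
            #pragma unroll
            for (int j = 0; j < NBF; ++j)
                A[i][j] += c_w[q] * c_phi[q][i] * c_phi[q][j] * detJ[e];

    for (int i = 0; i < NBF; ++i)
        for (int j = 0; j < NBF; ++j)
            out[e * NBF * NBF + i * NBF + j] = A[i][j];
}

int main() {
    const int nelems = 1024;
    const int block = 128;  // a launch parameter worth sweeping per problem
    double *d_detJ, *d_out;
    cudaMalloc(&d_detJ, nelems * sizeof(double));
    cudaMalloc(&d_out, nelems * NBF * NBF * sizeof(double));

    std::vector<double> detJ(nelems, 1.0);  // unit elements for a smoke test
    cudaMemcpy(d_detJ, detJ.data(), nelems * sizeof(double), cudaMemcpyHostToDevice);

    mass_local_assembly<<<(nelems + block - 1) / block, block>>>(nelems, d_detJ, d_out);

    double a00;
    cudaMemcpy(&a00, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    printf("A[0][0] = %f (exact value is 1/12 = %f)\n", a00, 1.0 / 12.0);

    cudaFree(d_detJ);
    cudaFree(d_out);
    return 0;
}

For higher polynomial degrees the local tensor grows quickly, which is where the remaining optimisation class, multiple threads per element, pays off: since each (i, j) entry is independent, the entries can be distributed across threads rather than serialised within one.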