high performance computing on graphics processing units: hgpu.org

Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs

Hector Dearman

Department of Computing, Imperial College London

Imperial College London, 2015

@article{dearman2015exploring,
  title={Exploring Optimisations for the Local Assembly phase of Finite Element Methods on GPUs},
  author={Dearman, Hector},
  year={2015}
}

Finite Element Methods (FEM) are ubiquitous in science and engineering, where they are used in fields as diverse as structural analysis, ocean modeling and bioengineering. FEM allow us to find approximate solutions to a system of partial differential equations over an unstructured mesh. The first phase of solving a FEM problem, local assembly, involves computing a tensor for every element in the mesh. Local assembly is extremely data-parallel: each entry in each tensor may be computed independently, which makes local assembly an excellent target for General Purpose Graphics Processing Units. We systematically investigate optimisations to improve the performance of the local assembly phase of FEM on GPUs for a broad range of problems. We look at four classes of optimisations: effective use of constant memory, tuning the kernel launch parameters, using multiple threads per element, and loop unrolling. The optimisations are implemented in the Firedrake toolchain, particularly in PyOP2 and COFFEE, and the performance improvement of each optimisation is measured using three representative benchmarks. To ensure our results are robust, we consider each of these benchmarks across a variety of element shapes and polynomial degrees of the basis functions. Combining these optimisations, we achieve speedups of up to 35 times over Firedrake's current performance on some benchmarks and an average speedup of 13 times across all benchmarks. Finally, we measure the absolute performance of the combined optimisations, showing that we achieve up to 78% of peak FLOPs on some benchmarks and an average of 57% of peak FLOPs across all benchmarks on an NVIDIA GRID K520.
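To make the data-parallel structure described in the abstract concrete, the following is a minimal sketch in plain Python, not code from the thesis or from Firedrake, PyOP2 or COFFEE. It assumes the simplest possible case: the mass matrix for linear (P1) triangles, whose local tensor has the closed form area/12 * (2 on the diagonal, 1 elsewhere). The names `triangle_area`, `mass_entry` and `assemble_local` are hypothetical. Each iteration of the flat loop plays the role of one GPU thread handling a single (element, i, j) entry, which loosely mirrors the "multiple threads per element" idea.

```python
def triangle_area(p0, p1, p2):
    """Area of a triangle given its three (x, y) vertices."""
    return 0.5 * abs((p1[0] - p0[0]) * (p2[1] - p0[1])
                     - (p2[0] - p0[0]) * (p1[1] - p0[1]))


def mass_entry(area, i, j):
    """Entry (i, j) of the P1 triangle mass matrix: area/12 * (2 if i == j else 1)."""
    return area / 12.0 * (2.0 if i == j else 1.0)


def assemble_local(coords, cells):
    """Compute one 3x3 local tensor per cell.

    The loop runs over a flat work-item index, so each iteration depends only
    on its own (element, i, j) triple. On a GPU, each iteration could be
    assigned to its own thread (an assumption of this sketch).
    """
    n = 3  # basis functions per P1 triangle
    tensors = [[[0.0] * n for _ in range(n)] for _ in cells]
    for work_item in range(len(cells) * n * n):
        e, rest = divmod(work_item, n * n)  # which element
        i, j = divmod(rest, n)              # which tensor entry
        p0, p1, p2 = (coords[v] for v in cells[e])
        tensors[e][i][j] = mass_entry(triangle_area(p0, p1, p2), i, j)
    return tensors


# Hypothetical two-triangle mesh covering the unit square.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cells = [(0, 1, 2), (1, 3, 2)]
local_tensors = assemble_local(coords, cells)
```

Because no iteration reads another iteration's output, the loop order is irrelevant. That independence is the property that makes local assembly map well onto GPU threads. Real kernels for higher polynomial degrees evaluate quadrature sums rather than a closed form, which is where loop unrolling and constant-memory placement of basis-function tables become relevant.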

Tags: Benchmarking, Computer science, CUDA, Differential equations, FEM, Finite element method, nVidia, nVidia GRID K520, OpenCL, Partial differential equations, PDEs, Thesis

November 3, 2015 by hgpu
