Tuning a Finite Difference Computation for Parallel Vector Processors
Institut für Angewandte Mathematik, Friedrich-Schiller-Universität Jena, Jena, Germany
International Symposium on Parallel and Distributed Computing (ISPDC), 2012
@inproceedings{Zumbusch:2012*2,
  author    = {G. Zumbusch},
  title     = {Tuning a Finite Difference Computation for Parallel Vector Processors},
  booktitle = {accepted for Proc. International Symposium on Parallel and Distributed Computing (ISPDC) 2012},
  year      = {2012},
  publisher = {IEEE Press},
  pdf       = {http://cse.mathe.uni-jena.de/pub/zumbusch/ispdc12.pdf},
  annote    = {refereed}
}
Current CPU and GPU architectures make heavy use of data and instruction parallelism at different levels. Floating point operations are organised in vector instructions of increasing vector length. For performance reasons it is mandatory to use the vector instructions efficiently. Several ways of tuning a model problem, a finite difference stencil computation, are discussed. The combination of vectorisation and an interleaved data layout, cache aware algorithms, loop unrolling, parallelisation and parameter tuning leads to optimised implementations that reach 90% of the peak performance of the floating point pipelines on recent Intel Sandy Bridge and AMD Bulldozer CPU cores, both with AVX vector instructions, as well as on Nvidia Fermi/Kepler GPU architectures. Furthermore, we present numbers for parallel multi-core/multi-processor and multi-GPU configurations. These regularly achieve more than an order of magnitude speedup compared to a standard implementation. The analysis may also explain deficiencies of automatic vectorisation for a linear data layout and serve as a foundation for efficient implementations of more complex expressions.
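To illustrate the kind of kernel under discussion, here is a minimal sketch (not the paper's implementation) of a vectorised 3-point finite difference stencil on a plain linear data layout, written with AVX intrinsics; the array size N, the stencil weights and the function names are arbitrary choices for this example. The unaligned neighbour loads it requires are exactly the kind of linear-layout overhead that the interleaved layout studied in the paper is designed to avoid.

/* Minimal illustrative sketch, NOT the paper's implementation:
 * vectorised 3-point stencil u_new[i] = u[i-1] - 2*u[i] + u[i+1]
 * on a linear data layout, using AVX intrinsics.
 * Compile e.g. with: gcc -mavx -O2 stencil.c */
#include <immintrin.h>
#include <stdio.h>

#define N 1024

static double u[N], u_new[N];

static void stencil_avx(void)
{
    const __m256d two = _mm256_set1_pd(2.0);
    int i;
    /* 4 doubles per iteration; the shifted neighbour accesses
     * force unaligned loads on this linear layout */
    for (i = 1; i + 4 <= N - 1; i += 4) {
        __m256d left   = _mm256_loadu_pd(&u[i - 1]);
        __m256d centre = _mm256_loadu_pd(&u[i]);
        __m256d right  = _mm256_loadu_pd(&u[i + 1]);
        __m256d r = _mm256_add_pd(left, right);           /* u[i-1] + u[i+1] */
        r = _mm256_sub_pd(r, _mm256_mul_pd(two, centre)); /* ... - 2*u[i]    */
        _mm256_storeu_pd(&u_new[i], r);
    }
    for (; i < N - 1; i++)  /* scalar remainder loop */
        u_new[i] = u[i - 1] - 2.0 * u[i] + u[i + 1];
}

int main(void)
{
    for (int i = 0; i < N; i++)
        u[i] = (double)i * i;  /* second difference of i^2 is constant 2 */
    stencil_avx();
    printf("u_new[10] = %f\n", u_new[10]);  /* expect 2.0 */
    return 0;
}

This naive version corresponds to what automatic vectorisation of a linear layout typically produces; the paper's tuned variants additionally apply the interleaved layout, cache aware blocking, loop unrolling and parallelisation mentioned in the abstract.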
May 24, 2012 by hgpu