high performance computing on graphics processing units: hgpu.org


Vectorized Higher Order Finite Difference Kernels

Gerhard Zumbusch

Friedrich-Schiller-Universität Jena, Institut für Angewandte Mathematik, 07743 Jena, Germany

State-of-the-Art in Scientific and Parallel Computing (PARA), 2012

@inproceedings{Zumbusch2012Vectorized,
  author={G. Zumbusch},
  title={Vectorized Higher Order Finite Difference Kernels},
  booktitle={PARA 2012, State-of-the-Art in Scientific and Parallel Computing},
  year={2012},
  editor={P. Manninen},
  series={LNCS},
  pages={15},
  publisher={Springer},
  address={Heidelberg},
  pdf={http://cse.mathe.uni-jena.de/pub/zumbusch/para12.pdf},
  ps={http://cse.mathe.uni-jena.de/pub/zumbusch/para12.ps.gz},
  annote={refereed}
}


Several highly optimized implementations of Finite Difference schemes are discussed. The combination of vectorization and an interleaved data layout, spatial and temporal loop tiling algorithms, loop unrolling, and parameter tuning leads to efficient computational kernels in one to three spatial dimensions, with truncation errors of order two to twelve, for both isotropic and compact anisotropic stencils. The kernels are implemented on and tuned for several processor architectures: recent Intel Sandy Bridge, Ivy Bridge, and AMD Bulldozer CPU cores, all with AVX vector instructions; the Nvidia Kepler and Fermi and the AMD Southern Islands and Northern Islands GPU architectures; and some older architectures for comparison. The kernels are based either on a cache-aware spatial loop or on time-slicing, which computes several time steps at once. Furthermore, vector components can be independent, grouped into short vectors of SSE, AVX, or GPU warp size, or grouped into larger virtual vectors with explicit synchronization. The optimal choice of the algorithm and its parameters depends on both the Finite Difference stencil and the processor architecture.
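To make the idea of combining vectorization with an interleaved data layout more concrete, here is a minimal C sketch. It is not the author's kernel. It assumes W independent one-dimensional problems whose values at each grid point are stored next to each other, and applies the standard fourth-order central difference for the second derivative (one of the orders two to twelve mentioned above). The function name `laplace1d_o4_interleaved`, the width `W = 4`, and the choice to leave the two boundary points at each end untouched are hypothetical illustration choices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaving width: W independent problems share each grid
   point. With double precision, W = 4 matches one 256-bit AVX register. */
#define W 4

/* Fourth-order central difference for the second derivative,
     u''(x_i) ~ (-u[i-2] + 16 u[i-1] - 30 u[i] + 16 u[i+1] - u[i+2]) / (12 h^2),
   applied to W interleaved problems. The value of problem v at grid point i
   is stored at u[i*W + v], so the inner loop over v reads and writes W
   contiguous doubles and is a natural target for SIMD vectorization.
   The two boundary points at each end (i < 2, i >= n - 2) are not written. */
void laplace1d_o4_interleaved(const double *u, double *out, size_t n, double h)
{
    assert(h > 0.0);
    const double c = 1.0 / (12.0 * h * h);
    for (size_t i = 2; i + 2 < n; ++i) {
        const double *um2 = u + (i - 2) * W; /* point i-2, all W problems */
        const double *um1 = u + (i - 1) * W; /* point i-1 */
        const double *u0  = u + i * W;       /* point i   */
        const double *up1 = u + (i + 1) * W; /* point i+1 */
        const double *up2 = u + (i + 2) * W; /* point i+2 */
        double *q = out + i * W;
        /* Contiguous loop over the W interleaved problems. */
        for (size_t v = 0; v < W; ++v) {
            q[v] = c * (-um2[v] + 16.0 * um1[v] - 30.0 * u0[v]
                        + 16.0 * up1[v] - up2[v]);
        }
    }
}
```

Because the fourth-order stencil is exact for polynomials up to degree five, feeding it, for example, u(x) = (v + 1) x^2 + x^4 for problem v should reproduce u''(x) = 2(v + 1) + 12 x^2 at every interior point up to rounding. With an optimizing compiler (e.g. `-O3 -mavx` for GCC or Clang), the inner loop over `v` is a candidate for auto-vectorization, though whether it is actually vectorized depends on the compiler.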

Tags: Algorithms, ATI, ATI Radeon HD 6990, ATI Radeon HD 7970, Computer science, Finite difference, nVidia, nVidia GeForce GTX 590, nVidia GeForce GTX 680, OpenCL, Performance, Tesla C2050, Tesla K20

December 10, 2012 by hgpu

