An auto-tuning framework for parallel multicore stencil computations


Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams

CRD, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010

DOI:10.1109/IPDPS.2010.5470421


Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22x speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems.
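To make the idea of stencil auto-tuning concrete, here is a minimal sketch in Python. It is not the paper's framework, which generates Fortran, C, or CUDA from a Fortran 95 expression; it only illustrates the general pattern of writing a stencil once, deriving a parameterized variant (here, loop blocking), and choosing parameters by timing. The 5-point stencil, the function names (`laplacian_reference`, `laplacian_blocked`, `autotune`), the grid size, and the candidate block sizes are all assumptions made for illustration.

```python
import time


def laplacian_reference(grid):
    """Straightforward 5-point stencil over the interior of a 2D grid."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]
                         - 4.0 * grid[i][j])
    return out


def laplacian_blocked(grid, bi, bj):
    """Same stencil, traversed in bi x bj tiles (hypothetical tuned variant)."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, m - 1, bj):
            # Clamp each tile to the interior so edge tiles may be smaller.
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, m - 1)):
                    out[i][j] = (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1]
                                 - 4.0 * grid[i][j])
    return out


def time_once(grid, bi, bj):
    """Wall-clock time of one blocked sweep."""
    start = time.perf_counter()
    laplacian_blocked(grid, bi, bj)
    return time.perf_counter() - start


def autotune(grid, candidates, repeats=3):
    """Time every candidate block size and return the fastest with its time."""
    best, best_time = None, float("inf")
    for bi, bj in candidates:
        elapsed = min(time_once(grid, bi, bj) for _ in range(repeats))
        if elapsed < best_time:
            best, best_time = (bi, bj), elapsed
    return best, best_time


# Illustrative problem size and search space (assumed, not from the paper).
grid = [[float(i * j % 7) for j in range(64)] for i in range(64)]
candidates = [(8, 8), (16, 64), (64, 16), (64, 64)]
best_block, best_seconds = autotune(grid, candidates)
print("fastest block size:", best_block)
```

Which block size wins depends on the machine and the run, which is the reason for searching empirically rather than fixing the parameters by hand. In interpreted Python the differences are mostly loop overhead; blocking pays off through cache behavior in compiled code of the kind the paper targets.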

Tags: Code generation, Computer science, CUDA, Fortran, nVidia, nVidia GeForce GTX 280

July 18, 2011 by hgpu
