high performance computing on graphics processing units: hgpu.org

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick

CRD/NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

In Proceedings of the 2008 ACM/IEEE conference on Supercomputing (2008), pp. 1-12

DOI:10.1109/SC.2008.5222004

@conference{datta2009stencil,
  title={Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures},
  author={Datta, K. and Murphy, M. and Volkov, V. and Williams, S. and Carter, J. and Oliker, L. and Patterson, D. and Shalf, J. and Yelick, K.},
  booktitle={High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for},
  pages={1--12},
  year={2009},
  organization={IEEE}
}

Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies, we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications for scientific algorithm development.
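To make the two ideas in the abstract concrete, here is a minimal sketch, not the authors' code. It assumes a generic 7-point nearest-neighbor stencil on a 3D grid and uses loop blocking as a stand-in optimization, since the abstract does not name the specific optimizations. All names (`make_grid`, `update_point`, `sweep_naive`, `sweep_blocked`, `autotune`) and the weights `alpha` and `beta` are hypothetical. The "auto-tuner" is simply an exhaustive timing search over candidate block sizes.

```python
import time


def make_grid(n):
    """Build an n*n*n grid (nested lists) with a simple non-constant fill."""
    return [[[float(i + 2 * j + 3 * k) for k in range(n)]
             for j in range(n)] for i in range(n)]


def update_point(src, i, j, k, alpha, beta):
    """7-point stencil: weighted centre plus the six face neighbours."""
    return alpha * src[i][j][k] + beta * (
        src[i - 1][j][k] + src[i + 1][j][k]
        + src[i][j - 1][k] + src[i][j + 1][k]
        + src[i][j][k - 1] + src[i][j][k + 1])


def sweep_naive(src, dst, alpha, beta):
    """One out-of-place sweep over all interior points, no blocking."""
    n = len(src)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                dst[i][j][k] = update_point(src, i, j, k, alpha, beta)


def sweep_blocked(src, dst, alpha, beta, bi, bj):
    """Same sweep, but the i and j loops are tiled into bi-by-bj blocks."""
    n = len(src)
    for i0 in range(1, n - 1, bi):
        for j0 in range(1, n - 1, bj):
            for i in range(i0, min(i0 + bi, n - 1)):
                for j in range(j0, min(j0 + bj, n - 1)):
                    for k in range(1, n - 1):
                        dst[i][j][k] = update_point(src, i, j, k, alpha, beta)


def autotune(n, candidates, alpha=0.4, beta=0.1, trials=3):
    """Time each candidate (bi, bj) and return the fastest with its time.

    Assumption: best-of-`trials` wall-clock time is the tuning metric.
    """
    src = make_grid(n)
    dst = make_grid(n)
    best_time, best_cfg = None, None
    for bi, bj in candidates:
        elapsed = []
        for _ in range(trials):
            start = time.perf_counter()
            sweep_blocked(src, dst, alpha, beta, bi, bj)
            elapsed.append(time.perf_counter() - start)
        t = min(elapsed)
        if best_time is None or t < best_time:
            best_time, best_cfg = t, (bi, bj)
    return best_cfg, best_time


if __name__ == "__main__":
    cfg, seconds = autotune(24, [(2, 2), (4, 4), (8, 8), (22, 22)])
    print("fastest block size:", cfg, "time per sweep (s):", seconds)
```

In pure Python the interpreter overhead swamps any cache effect, so the chosen block size here carries no performance meaning. The sketch only shows the structure: a parameterized kernel plus a search loop that picks parameters by measured runtime on the machine at hand.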

Tags: Cell processor, Computer science, CUDA, Finite difference, nVidia, nVidia GeForce GTX 280, Performance

December 14, 2010 by hgpu
