high performance computing on graphics processing units: hgpu.org


Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Aparna Chandramowlishwaran, Samuel Williams, Leonid Oliker, Ilya Lashuk, George Biros, Richard Vuduc

CRD, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium (April 2010), pp. 1-12.

DOI:10.1109/IPDPS.2010.5470415

@conference{chandramowlishwaran2010optimizing,
  title={Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures},
  author={Chandramowlishwaran, A. and Williams, S. and Oliker, L. and Lashuk, I. and Biros, G. and Vuduc, R.},
  booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
  pages={1--12},
  issn={1530-2075},
  year={2010},
  organization={IEEE}
}


This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multi-core systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25x on Intel’s quad-core Nehalem, 9.4x on AMD’s quad-core Barcelona, and 37.6x on Sun’s Victoria Falls (dual-socket configurations on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA’s most advanced GPU architecture.
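For readers unfamiliar with the method: the FMM is a fast approximation of pairwise sums over N points, which cost O(N^2) when evaluated directly. The sketch below shows only that direct baseline, not the FMM and not the authors' code. The 1/r kernel, the function name `direct_potential`, and the sample points are assumptions chosen for illustration.

```python
import math

def direct_potential(points, charges):
    """Direct O(N^2) sum: phi_i = sum over j != i of q_j / |x_i - x_j|.

    Illustrative baseline only; the 1/r kernel is an assumption.
    """
    n = len(points)
    phi = [0.0] * n
    for i in range(n):          # each target i is independent of the others
        acc = 0.0
        for j in range(n):
            if i == j:
                continue        # skip the singular self-interaction
            acc += charges[j] / math.dist(points[i], points[j])
        phi[i] = acc
    return phi

# Three unit charges; hypothetical sample data.
points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [1.0, 1.0, 1.0]
phi = direct_potential(points, charges)
```

Because each iteration of the outer loop writes only its own `phi[i]`, loops of this shape are natural candidates for the kind of thread-level parallelization the abstract mentions.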

Tags: Computer science, CUDA, Fast multipole method, MPI, nVidia, OpenMP, Performance, Tesla S1070

November 19, 2010 by hgpu

