Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures
CRD, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium (April 2010), pp. 1-12.
@conference{chandramowlishwaran2010optimizing,
title={Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures},
author={Chandramowlishwaran, A. and Williams, S. and Oliker, L. and Lashuk, I. and Biros, G. and Vuduc, R.},
booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
pages={1--12},
issn={1530-2075},
year={2010},
organization={IEEE}
}
This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our numerous findings, we show that optimization and parallelization can improve double-precision performance by 25x on Intel’s quad-core Nehalem, 9.4x on AMD’s quad-core Barcelona, and 37.6x on Sun’s Victoria Falls (dual-sockets on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA’s most advanced GPU architecture.
November 19, 2010 by hgpu