high performance computing on graphics processing units: hgpu.org


Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures

Aparna Chandramowlishwaran, Samuel Williams, Leonid Oliker, Ilya Lashuk, George Biros, Richard Vuduc

CRD, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium (April 2010), pp. 1-12.

DOI:10.1109/IPDPS.2010.5470415

@conference{chandramowlishwaran2010optimizing,
  title={Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures},
  author={Chandramowlishwaran, A. and Williams, S. and Oliker, L. and Lashuk, I. and Biros, G. and Vuduc, R.},
  booktitle={Parallel \& Distributed Processing (IPDPS), 2010 IEEE International Symposium on},
  pages={1--12},
  issn={1530-2075},
  year={2010},
  organization={IEEE}
}


This work presents the first extensive study of single-node performance optimization, tuning, and analysis of the fast multipole method (FMM) on modern multicore systems. We consider single- and double-precision arithmetic with numerous performance enhancements, including low-level tuning, numerical approximation, data structure transformations, OpenMP parallelization, and algorithmic tuning. Among our findings, we show that optimization and parallelization can improve double-precision performance by 25x on Intel’s quad-core Nehalem, 9.4x on AMD’s quad-core Barcelona, and 37.6x on Sun’s Victoria Falls (dual-socket configurations on all systems). We also compare our single-precision version against our prior state-of-the-art GPU-based code and show, surprisingly, that the most advanced multicore architecture (Nehalem) reaches parity in both performance and power efficiency with NVIDIA’s most advanced GPU architecture.

Tags: Computer science, CUDA, Fast multipole method, MPI, nVidia, OpenMP, Performance, Tesla S1070

November 19, 2010 by hgpu


