A Massively Parallel Adaptive Fast Multipole Method on Heterogeneous Architectures
Lawrence Livermore National Laboratory, Livermore, CA
Communications of the ACM, 55(5), 2012
@article{Lashuk:2012:MPA:2160718.2160740,
  author     = {Lashuk, Ilya and Chandramowlishwaran, Aparna and Langston, Harper and Nguyen, Tuan-Anh and Sampath, Rahul and Shringarpure, Aashay and Vuduc, Richard and Ying, Lexing and Zorin, Denis and Biros, George},
  title      = {A massively parallel adaptive fast multipole method on heterogeneous architectures},
  journal    = {Commun. ACM},
  issue_date = {May 2012},
  volume     = {55},
  number     = {5},
  month      = may,
  year       = {2012},
  issn       = {0001-0782},
  pages      = {101--109},
  numpages   = {9},
  url        = {http://doi.acm.org/10.1145/2160718.2160740},
  doi        = {10.1145/2160718.2160740},
  acmid      = {2160740},
  publisher  = {ACM},
  address    = {New York, NY, USA}
}
We describe a parallel fast multipole method (FMM) for highly nonuniform particle distributions. We employ both distributed-memory parallelism (via MPI) and shared-memory parallelism (via OpenMP and GPU acceleration) to rapidly evaluate two-body nonoscillatory potentials in three dimensions on heterogeneous high-performance computing architectures. We have performed scalability tests with up to 30 billion particles on 196,608 cores of the AMD/Cray-based Jaguar system at ORNL. On a GPU-enabled system (NSF’s Keeneland at Georgia Tech/ORNL), we observed a 30× speedup over a single-core CPU implementation and a 7× speedup over a multicore CPU implementation. By combining GPUs with MPI, we achieve under 10 ns per particle with six digits of accuracy for a run with 48 million nonuniformly distributed particles on 192 GPUs.
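
To make the GPU-acceleration angle concrete, below is a minimal sketch (not the authors' code; the kernel name, argument names, and one-thread-per-target layout are our own assumptions) of the direct near-field "P2P" evaluation of the Laplace potential phi(x_i) = sum_j q_j / |x_i - x_j|, the kind of nonoscillatory two-body interaction that FMM implementations typically offload to the GPU:

    // Hypothetical CUDA sketch of the direct (P2P) near-field step of an FMM.
    // Names and layout are illustrative, not taken from the paper.
    #include <cuda_runtime.h>

    // One thread per target: phi[i] = sum_j q_j / |x_i - x_j|.
    __global__ void p2p_laplace(const float4 *src,   // src[j] = (x, y, z, charge q)
                                const float3 *trg,   // target positions
                                float *phi,          // output potentials
                                int nsrc, int ntrg)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= ntrg) return;

        float3 xi = trg[i];
        float acc = 0.0f;
        for (int j = 0; j < nsrc; ++j) {
            float dx = src[j].x - xi.x;
            float dy = src[j].y - xi.y;
            float dz = src[j].z - xi.z;
            float r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > 0.0f)                       // skip self-interaction
                acc += src[j].w * rsqrtf(r2);    // q_j / |x_i - x_j|
        }
        phi[i] = acc;
    }

A host launch such as p2p_laplace<<<(ntrg + 255) / 256, 256>>>(d_src, d_trg, d_phi, nsrc, ntrg); evaluates all targets in one pass. In a full FMM this loop would run only over sources in a box's near-field interaction list, with far-field contributions handled through multipole and local expansions; staging the source array through shared memory is the usual next optimization for a kernel like this.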
July 5, 2012 by hgpu