Accelerating Kirchhoff Migration by CPU and GPU Cooperation

Jairo Panetta, Thiago Teixeira, Paulo R. P. de Souza Filho, Carlos A. da Cunha Finho, David Sotelo, Fernando M. da Motta, Silvio S. Pinheiro, Ivan P. Junior, Andre L. Rosa, Luiz R. Monnerat, Leandro T. Carneiro, Carlos H. B. de Albrecht

Tecnologia Geofisica, Petroleo Brasileiro SA, PETROBRAS, Rio de Janeiro, Brazil

In 21st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD ’09), 31 October 2009, pp. 26-32.

DOI:10.1109/SBAC-PAD.2009.29

@conference{panetta2009accelerating,
  title={Accelerating Kirchhoff Migration by CPU and GPU Cooperation},
  author={Panetta, J. and Teixeira, T. and de Souza Filho, P.R.P. and da Cunha Finho, C.A. and Sotelo, D. and da Motta, F. and Pinheiro, S.S. and Pedrosa, I. and Rosa, A.L.R. and Monnerat, L.R. and others},
  booktitle={Computer Architecture and High Performance Computing, 2009. SBAC-PAD’09. 21st International Symposium on},
  pages={26--32},
  year={2009},
  organization={IEEE}
}

We discuss the performance of Petrobras' production Kirchhoff prestack seismic migration on a cluster of 64 GPUs and 256 CPU cores. Porting and optimizing the application hot spot (98.2% of single-CPU-core execution time) to a single GPU reduces total execution time by a factor of 36 on a control run. We then argue against the usual practice of porting the next hot spot (1.5% of single-CPU-core execution time) to the GPU. Instead, we show that CPU and GPU cooperation reduces total execution time by a factor of 59 on the same control run. Remaining GPU idle cycles are eliminated by overloading the GPU with multiple requests originating from distinct CPU cores. However, increasing the number of CPU cores in the computation reduces the gain, because runs without GPUs benefit from the added parallelism while runs with GPUs saturate the GPU. We then obtain close to perfect speed-up on the full cluster over a homogeneous load obtained by replicating the control run data. To cope with the heterogeneous load of real-world data, we present a dynamic load balancing scheme that reduces total execution time by a factor of 20 on runs that use all GPUs and half of the cluster's CPU cores, relative to runs that use all CPU cores but no GPU.
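The speed-up figures in the abstract can be put in context with a small Amdahl's-law calculation. The sketch below is not taken from the paper: it assumes the simplest possible model, in which only the 98.2% hot spot is accelerated and everything else runs unchanged on one CPU core. The function names are hypothetical and chosen for illustration only.

```python
def overall_speedup(p, s):
    """Amdahl's law: a fraction p of the run time is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_ceiling(p):
    """Upper bound on overall speed-up when only the fraction p is accelerated."""
    return 1.0 / (1.0 - p)


def implied_kernel_speedup(p, total):
    """Invert Amdahl's law: the hot-spot speed-up needed to reach `total` overall."""
    remaining = 1.0 / total - (1.0 - p)  # time budget left for the hot spot
    if remaining <= 0:
        raise ValueError("overall speed-up exceeds the Amdahl ceiling 1/(1-p)")
    return p / remaining


# Figures quoted in the abstract; the model itself is an assumption.
HOT_SPOT_FRACTION = 0.982   # hot spot share of single-core execution time
PORTED_SPEEDUP = 36.0       # hot spot ported to a single GPU
COOPERATIVE_SPEEDUP = 59.0  # CPU and GPU cooperation

if __name__ == "__main__":
    kernel = implied_kernel_speedup(HOT_SPOT_FRACTION, PORTED_SPEEDUP)
    print(f"implied hot-spot speed-up: {kernel:.1f}x")
    print(f"ceiling for hot-spot-only porting: {amdahl_ceiling(HOT_SPOT_FRACTION):.1f}x")
```

Under this model, an overall factor of 36 corresponds to a hot-spot speed-up of roughly 100, and porting the hot spot alone could never exceed a factor of about 55.6. The reported cooperative factor of 59 lies above that ceiling, so the single-hot-spot model cannot account for it; the time outside the hot spot must also have shrunk, which is consistent with the abstract's emphasis on CPU and GPU cooperation.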

Tags: CUDA, Geoscience, GPU cluster, nVidia, Physics, Tesla C1060

October 28, 2010 by hgpu
