Auto-Tuning of Data Communication on Heterogeneous Systems
Barcelona Supercomputing Center
7th IEEE International Symposium on Embedded Multicore/Many-core System-on-Chip, 2013
@inproceedings{jorda2013auto,
title={Auto-Tuning of Data Communication on Heterogeneous Systems},
author={Jorda, Marc and Tanasic, Ivan and Cabezas, Javier and Vilanova, Lluis and Gelado, Isaac and Navarro, Nacho},
booktitle={7th IEEE International Symposium on Embedded Multicore/Many-core System-on-Chip},
year={2013}
}
Heterogeneous systems formed by traditional CPUs and compute accelerators, such as GPUs, are becoming widely used to build modern supercomputers. However, many different system topologies, i.e., how CPUs, accelerators, and I/O devices are interconnected, are being deployed. Each system organization presents different trade-offs when transferring data between CPUs, accelerators, and nodes within a cluster, and each requires a different software implementation to achieve optimal data communication bandwidth. Hence, there is great potential for auto-tuning applications to match the constraints of the system where the code is executed. In this paper we explore the potential impact of the two key optimizations for achieving optimal data transfer bandwidth: topology-aware process placement policies and double-buffering. We design a set of experiments to evaluate all possible alternatives and run each of them on different hardware configurations. We show that optimal data transfer mechanisms depend not only on the hardware topology, but also on the application dataset. Our experimental results show that auto-tuning applications to match the hardware topology, together with double-buffering for large data transfers, can improve the data transfer bandwidth by ~70% for local communication. We also show that double-buffering large data transfers is key to achieving optimal bandwidth on remote communication for transfers larger than 128 KB, while topology-aware policies produce minimal benefits.
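As a rough illustration of the double-buffering optimization discussed in the abstract, the sketch below overlaps staging copies with asynchronous DMA transfers using two pinned buffers and two CUDA streams. This is not the authors' implementation; the function name, the 1 MiB chunk size, and the host-to-device direction are assumptions for the example.

```cuda
// Minimal double-buffering sketch (hypothetical, not from the paper):
// while the GPU DMAs chunk i from one pinned staging buffer, the host
// copies chunk i+1 into the other one.
#include <cuda_runtime.h>
#include <cstring>

#define CHUNK (1 << 20)   // 1 MiB chunks; the best size is hardware-dependent

// Copy `bytes` from pageable host memory `src` to device memory `dst`.
void double_buffered_h2d(void *dst, const void *src, size_t bytes)
{
    void *stage[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&stage[i], CHUNK);   // pinned memory enables async DMA
        cudaStreamCreate(&stream[i]);
    }

    size_t offset = 0;
    int buf = 0;
    while (offset < bytes) {
        size_t len = bytes - offset < CHUNK ? bytes - offset : CHUNK;
        // Make sure the previous transfer using this staging buffer is done
        cudaStreamSynchronize(stream[buf]);
        memcpy(stage[buf], (const char *)src + offset, len);
        cudaMemcpyAsync((char *)dst + offset, stage[buf], len,
                        cudaMemcpyHostToDevice, stream[buf]);
        offset += len;
        buf ^= 1;                           // switch to the other buffer
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(stage[i]);
    }
}
```

For small transfers the extra staging copy and synchronization dominate, which is consistent with the abstract's observation that double-buffering pays off mainly for large transfers (e.g., above 128 KB for remote communication).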
June 17, 2013 by hgpu