
ClusterWatch: Flexible, Lightweight Monitoring for High-end GPGPU Clusters

Magdalena Slawinska, Karsten Schwan, Greg Eisenhauer
College of Computing, Georgia Institute of Technology, Atlanta, Georgia 30332-0250
CERCS Technical Report GIT-CERCS-13-07, 2013

@article{slawinska2013clusterwatch,
   title={ClusterWatch: Flexible, Lightweight Monitoring for High-end GPGPU Clusters},
   author={Slawinska, Magdalena and Schwan, Karsten and Eisenhauer, Greg},
   year={2013}
}


The ClusterWatch middleware provides runtime flexibility in what system-level metrics are monitored, how frequently such monitoring is done, and how metrics are combined to obtain reliable information about the current behavior of GPGPU clusters. Interesting attributes of ClusterWatch are (1) the ease with which different metrics can be added to the system, by simply deploying additional "cluster spies"; (2) the ability to filter and process monitoring metrics at their sources, to reduce data movement overhead; (3) flexibility in the rate at which monitoring is done; (4) efficient movement of monitoring data into backend stores for long-term or historical analysis; and, most importantly, (5) specific support for monitoring the behavior and use of the GPGPUs used by applications. This paper presents our initial experiences with using ClusterWatch to assess the performance behavior of a larger-scale GPGPU-based simulation code. We report the overheads seen when using ClusterWatch, the experimental results obtained for the simulation, and the manner in which ClusterWatch will interact with infrastructures for detailed program performance monitoring and profiling, such as TAU or Lynx. Experiments conducted on the NICS Keeneland Initial Delivery System (KIDS), with up to 64 nodes, demonstrate low monitoring overheads for high-fidelity assessments of the simulation's performance behavior, for both its CPU and GPU components.
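To make attributes (2) and (3) more concrete, the following is a minimal Python sketch of a spy that samples one metric at a configurable period and drops uninteresting samples before they leave the node. It is not ClusterWatch code: the `Spy` class, the `min_delta` threshold filter, the `run` driver, and the hard-coded readings are all hypothetical names and choices made for illustration, and the abstract does not say which filters ClusterWatch actually applies.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Record = Tuple[str, float, float]  # (metric name, timestamp in seconds, value)


@dataclass
class Spy:
    """Hypothetical 'cluster spy': samples one metric and filters it at the source."""
    name: str
    read_metric: Callable[[], float]  # how the metric is obtained (assumed)
    period_s: float                   # sampling period; can be changed between samples
    min_delta: float = 0.0            # forward a sample only if it moved at least this much
    _last_sent: Optional[float] = field(default=None, init=False)

    def sample(self, now_s: float) -> Optional[Record]:
        """Take one sample; return a record to forward, or None if filtered out."""
        value = self.read_metric()
        if self._last_sent is not None and abs(value - self._last_sent) < self.min_delta:
            return None  # dropped at the source, so it is never moved off the node
        self._last_sent = value
        return (self.name, now_s, value)


def run(spy: Spy, duration_s: float) -> List[Record]:
    """Drive a spy over simulated time and collect the records it would forward."""
    forwarded: List[Record] = []
    t = 0.0
    while t < duration_s:
        record = spy.sample(t)
        if record is not None:
            forwarded.append(record)
        t += spy.period_s  # re-read each step, so a runtime rate change takes effect
    return forwarded


# Stand-in metric source: a fixed sequence instead of a real GPU utilization query.
readings = iter([10.0, 10.5, 11.0, 40.0, 41.0, 80.0])
gpu_spy = Spy("gpu_util", lambda: next(readings), period_s=1.0, min_delta=5.0)
forwarded = run(gpu_spy, duration_s=6.0)
print(forwarded)
```

With these made-up readings, six samples are taken but only three are forwarded, which is the kind of data-movement reduction that source-side filtering is meant to provide. A real spy would replace the lambda with an actual metric query and the simulated clock with a timer.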
