Analyzing Resource Utilization in an HPC System: A Case Study of NERSC’s Perlmutter

Jie Li, George Michelogiannakis, Brandon Cook, Dulanya Cooray, Yong Chen
Texas Tech University, Lubbock, TX 79409, USA
arXiv:2301.05145 [cs.DC], (12 Jan 2023)

@misc{https://doi.org/10.48550/arxiv.2301.05145,
   doi={10.48550/ARXIV.2301.05145},
   url={https://arxiv.org/abs/2301.05145},
   author={Li, Jie and Michelogiannakis, George and Cook, Brandon and Cooray, Dulanya and Chen, Yong},
   keywords={Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences},
   title={Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter},
   publisher={arXiv},
   year={2023},
   copyright={Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}

The resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can lead to underutilization of HPC resources. In this study, we comprehensively analyzed the resource usage and characteristics of NERSC Perlmutter, a state-of-the-art HPC system with both CPU-only and GPU-accelerated nodes. Our three-week usage analysis revealed that the majority of jobs had low CPU utilization and that around 86% of both CPU and GPU-enabled jobs used 50% or less of the available host memory. Additionally, 52.1% of GPU-enabled jobs used up to 25% of the GPU memory, and the memory capacity was over-provisioned in some ways for all jobs. The study also found that 60% of GPU-enabled jobs had idle GPUs, which could indicate that resource underutilization may occur as users adapt workflows to a system with new resources. Our research provides valuable insights into performance characterization and offers new perspectives for system operators to understand and track the migration of workloads. Furthermore, it can be extremely useful for designing, optimizing, and procuring HPC systems.