
Memory Bandwidth and Latency in HPC: System Requirements and Performance Impact

Milan Radulovic
Departament d’Arquitectura de Computadors (DAC), Universitat Politècnica de Catalunya (UPC)
Universitat Politècnica de Catalunya, 2019

@phdthesis{radulovic2019memory,
   title={Memory bandwidth and latency in HPC: system requirements and performance impact},
   author={Radulovi{\'c}, Milan},
   year={2019},
   school={Universitat Polit{\`e}cnica de Catalunya}
}


The memory system is a major contributor to the deployment and operational costs of large-scale high-performance computing (HPC) clusters, and it is one of the most critical aspects of a system's design in terms of performance. However, the next generation of HPC systems poses significant challenges for main memory, and it is questionable whether current memory technologies will meet the required goals. This motivates extensive research into future memory architectures and their capacity, performance and reliability. In this thesis we focus on the HPC performance aspects of memory system design, covering memory bandwidth and latency.

We start our study with an extensive analysis that evaluates and compares three mainstream and five alternative HPC architectures with respect to memory bandwidth and latency. The increasing diversity of HPC systems on the market makes their evaluation and comparison in terms of HPC features complex and multi-dimensional, and there is as yet no well-established methodology for a unified evaluation of HPC systems and workloads that quantifies the main performance bottlenecks. Our work provides a significant body of useful information for HPC practitioners and infrastructure providers, and emphasizes four usually overlooked aspects of HPC system evaluation.

Understanding the dominant performance bottlenecks of HPC applications is essential for designing a balanced HPC system. In our study, we execute a set of real HPC applications from diverse scientific fields and quantify key performance bottlenecks: FLOPS performance and memory bandwidth congestion. We show that the results depend significantly on the number of execution processes, which is typically overlooked in benchmark suites, and argue for guidance on selecting a representative scale for the experiments. We also find that average measurements of performance metrics and bottlenecks can be highly misleading, and suggest reporting them as the percentage of execution time in which applications use certain portions of the maximum sustained values.

Innovations in 3D-stacking technology enable DRAM devices with much higher bandwidths than traditional DIMMs. The first such products have hit the market, and some of the publicity claims that they will break through the memory wall. We summarize our preliminary analysis and expectations of how such 3D-stacked DRAMs will affect the memory wall for a set of representative HPC applications. Higher bandwidth may lower average latency, provided that the applications offer sufficient memory-level parallelism (MLP) and that CPU architectures can exploit it. We conclude that although 3D-stacked DRAM is a major technological innovation, it is unlikely to break through the memory wall; at best, it moves it.

Novel memory systems are typically explored with hardware simulators that are slow and often have a simplified or obsolete model of the CPU. We propose an analytical model that quantifies the impact of main memory on application performance and on system power and energy consumption, based on the memory system and application profiles. The model is evaluated on a mainstream platform comprising various DDR3 memory configurations, and on an alternative platform comprising DDR4 and 3D-stacked high-bandwidth memory. The evaluation shows that the model predictions are accurate, typically within 2% of the values measured on actual hardware. Additionally, we compare the model's performance estimates with simulation results: the model is significantly more accurate than the simulator and three orders of magnitude faster, so it can be used to analyze production HPC applications on arbitrarily sized systems.

Overall, we believe our study provides valuable insights into the importance of memory bandwidth and latency in HPC: their role in the evaluation and comparison of HPC platforms, guidelines for measuring and presenting the related performance bottlenecks, and an understanding and modeling of their performance, power and energy impact.
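As a minimal sketch of the reporting style suggested in the abstract (the percentage of execution time spent above given fractions of the maximum sustained value, rather than a single average), the Python snippet below post-processes a sampled bandwidth trace. The trace, the sustained-bandwidth figure and the threshold fractions are hypothetical placeholders, not data from the thesis.

# Sketch: share of execution time spent at or above given fractions of the
# maximum sustained memory bandwidth (hypothetical profile data).
def time_above_thresholds(bw_samples_gbs, max_sustained_gbs, fractions=(0.5, 0.8, 0.9)):
    """Return {fraction: share of samples at or above fraction * max_sustained}."""
    report = {}
    for f in fractions:
        cutoff = f * max_sustained_gbs
        above = sum(1 for bw in bw_samples_gbs if bw >= cutoff)
        report[f] = above / len(bw_samples_gbs)
    return report

# Hypothetical per-interval bandwidth samples (GB/s) and a STREAM-like sustained peak.
trace = [12.0, 48.5, 55.1, 57.8, 20.3, 58.9, 59.4, 31.7]
peak = 60.0

for frac, share in time_above_thresholds(trace, peak).items():
    print(f">= {frac:.0%} of sustained bandwidth: {share:.0%} of execution time")

With these made-up numbers the average bandwidth is about 43 GB/s, which hides the fact that the application runs at 90% or more of the sustained bandwidth in half of the sampled intervals.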
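The point that higher bandwidth only helps if applications expose enough memory-level parallelism can be pictured with Little's law applied to the memory system: achievable bandwidth is roughly the number of outstanding cache-line misses times the line size divided by the memory latency. The latency, line size and MLP values below are illustrative assumptions, not measurements from the thesis.

# Little's-law sketch: bandwidth ~= MLP * cache-line size / memory latency.
LINE_BYTES = 64        # assumed cache-line size
LATENCY_NS = 90.0      # assumed average main-memory latency

def achievable_bandwidth_gbs(mlp):
    """Bandwidth (GB/s) sustainable with `mlp` cache-line misses in flight."""
    return mlp * LINE_BYTES / LATENCY_NS   # bytes per nanosecond equals GB/s

for mlp in (4, 10, 20):
    print(f"MLP = {mlp:2d}: at most ~{achievable_bandwidth_gbs(mlp):.1f} GB/s")

Under these assumptions, a core that keeps only a handful of misses in flight cannot come close to saturating a 3D-stacked device offering hundreds of GB/s, which is why the extra raw bandwidth alone does not remove the memory wall.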
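As an illustration of the analytical-model approach described in the abstract (not the actual model from the thesis), the sketch below splits an application's profiled runtime into a CPU component and a memory-stall component and scales only the stall component when the effective memory latency changes. The profile numbers and the linear-scaling assumption are hypothetical.

# First-order sketch: runtime on a target memory system from a baseline profile.
def predicted_runtime(t_cpu_s, t_mem_stall_s, latency_ratio):
    """t_cpu_s: time not bound by main memory (assumed unchanged);
    t_mem_stall_s: time stalled on main memory on the baseline system;
    latency_ratio: target effective latency / baseline effective latency."""
    return t_cpu_s + t_mem_stall_s * latency_ratio

# Hypothetical baseline profile: 100 s total, of which 30 s stalled on main memory.
t_cpu, t_stall = 70.0, 30.0
for name, ratio in [("high-bandwidth memory, -40% effective latency", 0.6),
                    ("larger but slower memory, +50% effective latency", 1.5)]:
    print(f"{name}: predicted {predicted_runtime(t_cpu, t_stall, ratio):.1f} s "
          f"vs. {t_cpu + t_stall:.1f} s baseline")

A model used in practice would derive the stall share and the latency change from measured memory-system and application profiles rather than assume them, as the thesis does when validating its predictions against real hardware.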
