Conquering Noise with Hardware Counters on HPC Systems

Introduction

As systems grow in performance and complexity, it becomes increasingly important to examine an application's scaling behavior and identify performance bottlenecks at an early stage. Unfortunately, modeling this behavior is challenging in the presence of noise: measurements become irreproducible and misleading, and the resulting models can deviate strongly from the actual behavior. While noise affects the application runtime, it has little to no effect on some hardware counters, such as those counting floating-point operations. However, selecting appropriate counters for performance modeling requires some investigation. In this project, we perform a noise analysis of various hardware counters.

Methods

Using a noise generator called NOIGENA, we inject noise on top of the existing system noise to inspect the counters' variability. We perform the analysis on five different HPC systems (including the Lichtenberg system) with three applications under various noise patterns, and we categorize the counters across the systems according to their noise resilience.
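To illustrate what such a variability analysis can look like, the following minimal Python sketch compares repeated readings of a metric with and without injected noise. It is not the project's actual procedure: the function names, the 5% threshold, and the sample values are assumptions made up for illustration only.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (0.0 if the mean is zero)."""
    m = mean(samples)
    return stdev(samples) / m if m else 0.0


def relative_mean_shift(baseline, noisy):
    """Relative change of the mean reading once noise is injected."""
    base = mean(baseline)
    return abs(mean(noisy) - base) / base if base else 0.0


def classify(baseline, noisy, threshold=0.05):
    """Label a metric as 'resilient' or 'noise-sensitive'.

    The 5% threshold is a hypothetical choice, not a value from the project.
    """
    spread = max(coefficient_of_variation(baseline),
                 coefficient_of_variation(noisy))
    shift = relative_mean_shift(baseline, noisy)
    return "resilient" if max(spread, shift) <= threshold else "noise-sensitive"


# Invented readings from three repetitions each; these are NOT measurements.
readings = {
    "PAPI_FP_OPS": {"baseline": [1000000, 1000010, 999990],
                    "noisy": [1000005, 1000000, 999995]},
    "runtime_s": {"baseline": [10.0, 10.2, 9.9],
                  "noisy": [12.5, 14.1, 11.8]},
}

labels = {name: classify(r["baseline"], r["noisy"])
          for name, r in readings.items()}
```

With these invented values, the floating-point counter stays within the threshold while the runtime does not, which mirrors the qualitative observation described in the introduction.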

Results

Based on our measurements and a variability analysis of all available hardware counters, we identified counters that are suitable for empirical performance modeling under noisy conditions. Furthermore, we created a best-practice guide for users who want to employ hardware counters to improve their performance models.

Discussion

In this project, we investigated the noise resilience of hardware counters using three application benchmarks and five evaluation systems with diverse hardware architectures. We examined all available PAPI preset events and a selected set of native events on these systems and analyzed their reliability in the presence and absence of injected noise. Our analysis confirmed the results of previous studies, showing that all counters measuring either floating-point operations or instructions are noise-resilient. Overall, it revealed that noise generally affects hardware counters, independent of the system architecture. Furthermore, the reliability of many counters depends significantly on the system architecture: while the instruction and cycle counters are highly reliable on some systems, they are much more prone to noise on others. Our best-practice user guide therefore enables application developers and researchers who want to analyze or optimize the performance of their code to easily identify the hardware counters relevant for performance analysis on their system architecture.
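The system dependence described above can be made concrete with a small Python sketch that separates counters rated resilient on every system from those whose rating differs between systems. The system names, labels, and helper function are hypothetical and invented for illustration; they do not reproduce the project's results.

```python
def split_by_portability(labels_per_system):
    """Split counters into those resilient everywhere and those that vary.

    labels_per_system maps a system name to {counter: label}, where each
    label is either 'resilient' or 'noise-sensitive'.
    Only counters available on every system are considered.
    """
    per_system = list(labels_per_system.values())
    common = set.intersection(*(set(c) for c in per_system))
    portable = sorted(c for c in common
                      if all(s[c] == "resilient" for s in per_system))
    dependent = sorted(c for c in common
                       if len({s[c] for s in per_system}) > 1)
    return portable, dependent


# Invented labels for two placeholder systems; these are NOT project results.
example = {
    "system_a": {"PAPI_FP_OPS": "resilient", "PAPI_TOT_CYC": "resilient"},
    "system_b": {"PAPI_FP_OPS": "resilient", "PAPI_TOT_CYC": "noise-sensitive"},
}

portable, dependent = split_by_portability(example)
```

A counter that lands in the second list would need to be checked per system before it is used for modeling, which is the kind of lookup a per-architecture guide supports.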

Outlook

Future work will focus on generating performance models with the inspected hardware counters and on expanding the noise analysis to include other noise sources, such as I/O contention.

Last Update

  • 2023-02-16 16:38

Participating Universities