
Investigating Memory Failure Prediction Across CPU Architectures

Qiao Yu∗†, Wengui Zhang‡, Min Zhou‡¶, Jialiang Yu‡, Zhenli Sheng‡, Jasmin Bogatinovski†, Jorge Cardoso∗§ and Odej Kao†
∗ Huawei Munich Research Center, Germany   † Technical University of Berlin, Germany
‡ Huawei Technologies Co., Ltd, China   § CISUC, University of Coimbra, Portugal
{qiao.yu, zhangwengui1, zhoumin27, yujialiang, shengzhenli, jorge.cardoso}@huawei.com
{jasmin.bogatinovski, odej.kao}@tu-berlin.de
¶ Corresponding author: [email protected]

arXiv:2406.05354v1 [cs.AR] 8 Jun 2024

Abstract—Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunctions in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we conduct memory failure prediction on the different processor platforms, achieving up to 15% improvement in F1-score compared to the existing algorithm. Finally, an MLOps (Machine Learning Operations) framework is provided to consistently improve failure prediction in the production environment.

Index Terms—Memory, Failure prediction, Uncorrectable error, Reliability, Machine Learning, AIOps, ML for Systems

I. INTRODUCTION

With the expansion of cloud computing and big data services, the challenge of maintaining the Reliability, Availability, and Serviceability (RAS)1 of servers is intensifying due to memory failures, which represent a significant fraction of hardware malfunctions [1]–[3]. These failures often occur as Correctable Errors (CEs) and Uncorrectable Errors (UEs). To tackle these issues, Error Correction Code (ECC) mechanisms such as SEC-DED [4], Chipkill [5], and SDDC [6] are utilized to detect and correct errors. For instance, Chipkill ECC is capable of correcting all erroneous bits from a single DRAM (Dynamic Random Access Memory) chip. However, its efficacy diminishes when errors span multiple chips, leading to system failures caused by UEs. Furthermore, the ECC mechanisms on modern Intel platform servers do not offer the same level of protection as Chipkill ECC, making them vulnerable to certain error patterns originating from a single chip [7]. Therefore, relying exclusively on ECC mechanisms for memory reliability proves inadequate, as memory failures remain a prevalent source of system failures.

1 Reliability, Availability, and Serviceability are the foundational pillars used to measure the dependability of computer systems.

To enhance memory reliability, numerous studies [8]–[14] have delved into the correlations between memory errors and memory failures, laying the groundwork for our research. Machine Learning (ML) techniques have been employed for predicting memory failures [15]–[24], using CE information from large-scale datacenters to forecast UEs. These investigations have effectively exploited the spatial distribution of CEs to improve memory failure prediction. Additionally, system-level workload metrics such as memory utilization and read/write accesses have been considered for memory failure prediction in [25]–[27]. Results from [27] indicate that workload metrics play a minor role compared to other CE-related features. Research in [28] focuses on CE storms (a high frequency of CEs in a brief timeframe) and UEs to predict DRAM-caused node unavailability (DCNU), highlighting the significance of spatio-temporal features of CEs. Furthermore, [7] explores specific error bit patterns and their association with DRAM UEs, developing rule-based indicators for DRAM failure prediction that vary by manufacturer and part number, adapted to the ECC designs of modern Intel Skylake and Cascade Lake servers. Moreover, [29], [30] examine the distribution of error bits and propose a hierarchical, system-level method for predicting memory failures, leveraging error bit characteristics.

However, the incidence of UEs is influenced not just by DRAM faults but also by differences across CPU architectures, due to the diverse ECC mechanisms in use, which can alter the patterns of memory failures observed. Understanding and modeling these failure patterns across various CPU platforms and ECC types is essential for accurate prediction of UEs. This gap in research motivates us to undertake the first study of DRAM failures comparing X86 and ARM systems, specifically the Intel Purley and Whitley platforms and the Huawei ARM K920 (anonymized to protect confidentiality) processor. By analyzing the relationship between UEs and fault patterns across these processor platforms, we aim to create targeted memory failure prediction algorithms. Additionally, we acknowledge the dynamic nature of server configurations, CPU architectures, memory types, and workloads. To address these variables, we introduce an MLOps framework designed to accommodate such changes, thereby continuously improving failure prediction throughout the lifecycle of the production environment.

The main contributions of this paper are as follows:
• We present the first memory failure study between X86 and ARM systems, specifically focusing on the Intel X86 Purley and Whitley platforms as well as the Huawei ARM K920 processor platform, in large-scale datacenters. Different fault modes within the DRAM hierarchy are associated with memory failures across these platforms.
• We develop ML-based algorithms for predicting memory failures, leveraging the identified DRAM fault modes to anticipate UEs on these platforms.
• We establish an MLOps framework for failure prediction to facilitate collaboration across teams within the organization and help manage the lifecycle of production ML algorithms.

The organization of this paper is as follows: Section II provides the background of this work. Section III describes the dataset employed in our data analysis. Section IV details the problem formulation and performance metrics. Section V uncovers high-level fault modes within the DRAM hierarchy and their relationship to UEs across various platforms. Section VI demonstrates the use of machine learning techniques for memory failure prediction. Section VII introduces our MLOps framework for failure prediction. Related work is shown in Section VIII. Section IX concludes this paper.

II. BACKGROUND

A. Terminology

A fault in DRAM acts as the root cause for an error, which may arise from a variety of sources, including particle impacts, cosmic rays, or manufacturing defects.

An error occurs when a DIMM sends incorrect data to the memory controller, deviating from what the ECC [4]–[6], [31] expects, indicative of an underlying fault. Memory errors, depending on the ECC's correction capacity, are classified into two main types: Correctable Errors (CEs) and Uncorrectable Errors (UEs). Two specific types of UEs are well described in [15]: 1) sudden UEs, which result from component malfunctions that immediately corrupt data, and 2) predictable UEs, which initially appear as CEs but evolve into UEs over time. Sudden UEs occur without prior CEs, whereas predictable UEs may be forecasted through CEs using algorithms designed for failure prediction. In this study, our focus is on predicting predictable UEs, as they constitute the majority of memory failures, as described in Section III.

B. DRAM Organization and Access

Fig. 1 illustrates the memory's hierarchical layout and its CPU interactions. Figure 1(1) shows a DIMM rank made up of DRAM chips organized by banks, rows, and columns, where data moves from memory cells to the memory controller, which can generally detect and correct CEs. Figure 1(2) shows the data transmission of x4 DDR4 DRAM chips via the Data Bus (DQs) upon CPU requests, involving 8 beats of 72 bits (64 data bits plus 8 ECC bits). Implementing the ECC, the memory controller detects and corrects errors, as shown in Figure 1(3). Note that the exact ECC algorithms are highly confidential and never exposed, and the ECC checking bit addresses can be decoded to locate specific errors in DQs and beats. Finally, all these logs, including corrected and uncorrected errors, events, and memory specifications, are recorded in the Baseboard Management Controller (BMC)2.

Fig. 1: Memory Organization. [Figure: (1) memory organization: memory channel, memory controller (ECC), DIMM rank, DRAM chips (devices), banks, rows, columns, and cells; (2) memory access: data bits and ECC bits transferred to the CPU over DQs in beats; (3) ECC check at the memory controller.]

2 BMC is a specialized processor built into the server's motherboard, designed to supervise the physical status of computers, network servers, and additional hardware components.

C. Memory RAS Techniques

DRAM subsystems leverage RAS features for protection, including proactive VM migrations to minimize interruptions and CE storm3 suppression to prevent service degradation. Advanced RAS techniques are designed to protect server-grade machines by avoiding faulty regions and employing sparing techniques such as bit, row/column, and bank/chip sparing (e.g., Partial Cache Line Sparing (PCLS) [32], Post Package Repair (PPR) [33], and Intel's Adaptive Double Device Data Correction (ADDDC) [34], [35]). Software sparing mechanisms, such as page offlining, mitigate memory errors [34], [36], [37]. However, these approaches may increase redundancy and overhead, affecting performance and limiting their universal applicability. Memory failure prediction plays a key role in foreseeing UEs and implementing specific mitigation strategies.

3 A CE storm occurs when CE interruptions repeatedly occur multiple times, e.g., 10 times.

III. DATASET

Our dataset, sourced from Huawei cloud datacenters, includes system configurations, Machine Check Exception (MCE) logs, and memory events (CE storms, etc.), focusing on DIMMs experiencing CEs and omitting those with sudden UEs due to the lack of predictive data. We examined error logs from approximately 250,000 servers across the Intel Purley and Whitley platforms (including Skylake, Cascade Lake, and Icelake) as well as the Huawei K920 processor platform. Table I provides an overview of our data, which includes over 90,000 DDR4 DIMMs from various manufacturers, spanning different CPU architectures, with CEs recorded from January to October 2023. Within the Intel platforms, predictable UEs constitute 73% of the UEs on the Purley platform, surpassing the rate of sudden UEs. In contrast, the Whitley platform shows a higher incidence of sudden UEs than Purley, despite having a lower total UE rate. Meanwhile, on the ARM system with the K920 processor platform, there is a significant predominance of predictable UEs over sudden UEs, showcasing a variance in ratios compared to the X86 systems; the overall rate of UEs in the K920 dataset is also lower than that of the Intel platforms. Note that these statistics are specific to the datasets analyzed, and the observed variations may be influenced by several factors, such as workload, server age, and distinct RAS mechanisms.
TABLE I: Description of Dataset.
CPU Platform | DIMMs with CEs | DIMMs with UEs | Predictable UE DIMMs in % | Sudden UE DIMMs in %
Intel Purley | > 50,000 | > 2,000 | 73% | 27%
Intel Whitley | > 10,000 | > 400 | 42% | 58%
K920 | > 30,000 | > 600 | 82% | 18%

In particular, the ECC used in contemporary Intel platforms, which is integral for error correction and detection, is considered weaker than Chipkill. This is partly because some of the extra bits previously used by Intel ECCs are reallocated for other uses [7], such as to store ownership or security information, or to mark failed areas of DRAM. This suggests that the observed discrepancies in UE rates across different architectures could stem from the unique ECC mechanisms employed.

Finding 1. The UE and sudden UE rates show variation between X86 and ARM systems. This discrepancy could be attributed to the distinct ECC mechanisms implemented within these differing architectures.

IV. PROBLEM FORMULATION AND PERFORMANCE MEASURES

The problem of predicting memory failures is approached as a binary classification task, following the methodology in [29], [30]. As illustrated in Figure 3, an algorithm at time t uses data from a historical observation window △td to predict failures within a future prediction period [t + △tl, t + △tl + △tp], where △tl represents the lead time [38] before a failure occurs, and △tp is the duration of the prediction window. Event samples are recorded at intervals of △is (e.g., CE events are logged every minute), and predictions are made at intervals of △ip (every 5 minutes). The observation window (△td) and the prediction validation window (△tp) are set to 5 days and 30 days, respectively, to facilitate early proactive strategies. Note that these parameters were optimized based on empirical data from the production environment. The lead prediction time △tl ∈ (0, 3h] ranges up to 3 hours, designed for specific operational scenarios. A True Positive (TP) denotes a correctly predicted failure within the window. A False Positive (FP) represents an incorrect forecast. A False Negative (FN) describes a failure that happens without an earlier warning, and a True Negative (TN) is identified when no failure is anticipated and none takes place. The performance of the algorithm is evaluated using Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · Precision · Recall / (Precision + Recall).

Fig. 3: Failure prediction problem definition [29]. [Figure: a timeline with the observation window in the past, the lead prediction window and prediction validation window in the future, the sampling and prediction intervals, the present time, and the failure time.]

VM Interruption Reduction Rate (VIRR). Prior works [7], [18], [20], [28]–[30] have introduced cost-aware models to assess the benefits of memory failure prediction. In this work, we emphasize the VM Interruption Reduction Rate (VIRR) [29], illustrated in Figure 2, as it more precisely reflects the effects on customer experience.

Fig. 2: VM Interruption under Failure Prediction. [Figure: failure prediction outcomes (True Positive, False Positive, False Negative, True Negative) for abnormal and normal memory, routed to memory mitigation, live migration, cold migration, and VM interruption.]

Understanding the VIRR involves considering Va as the average number of VMs per server. Without predictive capabilities, the interruptions are calculated as ❶ V = Va · (TP + FN), as seen in the Abnormal branch of Figure 2. While proactive VM live migrations and memory mitigation techniques aim to minimize interruptions without service disruption, a notable fraction of VMs might still undergo cold migration, which typically causes VM interruptions. Cold migrations are often the fallback when live migrations or memory mitigation are unfeasible, due to limited resources or unexpected failures, and are a common approach for VM reallocation and maintenance. The fraction of VMs under such migrations is denoted as yc. As a result, in Figure 2 we define ❷ V1′ = Va · yc · (TP + FP) as the volume of interruptions from cold migrations triggered by positive failure predictions (TP + FP). On the other side, missed failure predictions lead to increased interruptions, represented by ❸ V2′ = Va · FN. Considering the prediction algorithm, the total interruptions are ❹ V′ = V1′ + V2′. The VIRR is thus VIRR = (V − V′) / V, which simplifies to (1 − yc / precision) · recall, according to [29].

In practical production settings, yc remains a positive value, since VMs may need cold migration due to the failure of live migration or memory mitigation. When a model's precision falls below the fraction of cold migrations (precision < yc), the VIRR becomes negative, indicating an increase in VM interruptions. Conversely, high-precision models achieve a positive VIRR, amplified by their recall. Based on our observations, we have conservatively set yc = 0.1 for our evaluation. Note that this value is already pessimistic, anticipating a reduction in yc as cloud infrastructure evolves and expands.
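As a worked illustration of the Section IV measures, the short sketch below computes Precision, Recall, F1, and VIRR from outcome counts. The helper name and the example counts are illustrative assumptions; only yc = 0.1 follows the text, and the first example's counts are chosen so that precision and recall roughly match the LightGBM results on Intel Purley reported later in Table II.

```python
def prediction_metrics(tp: int, fp: int, fn: int, y_c: float = 0.1):
    """Precision, Recall, F1 and VIRR as defined in Section IV.

    VIRR = (V - V') / V = (1 - y_c / precision) * recall, where y_c is the
    fraction of VMs that still undergo a cold migration after a positive
    failure prediction.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    virr = (1 - y_c / precision) * recall if precision > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "virr": virr}

# Illustrative counts: precision ~0.54, recall 0.80, F1 ~0.65, VIRR ~0.65.
print(prediction_metrics(tp=80, fp=68, fn=20))
# A model whose precision falls below y_c yields a negative VIRR, i.e. more
# VM interruptions than having no prediction at all.
print(prediction_metrics(tp=5, fp=95, fn=95))
```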
V. FAULT ANALYSIS

Our analysis investigates the high-level fault modes in the DRAM hierarchy and correlates them with UE rates across various platforms. We consider faults at the DRAM level, including cell, column, row, and bank faults, as illustrated in Figure 1(1). A Cell fault occurs when the CEs in a cell surpass a set threshold, while Row and Column faults are identified by exceeding thresholds across a row or column, respectively. Bank faults arise when the thresholds for both row and column faults within a bank are exceeded. Additionally, when CEs affect only a single device, this constitutes a Single-device fault; in contrast, if CEs extend across multiple devices, it is a Multi-device fault. Further details on fault definitions and threshold settings can be found in [12], [29], [30]. The approach to calculating the relative UE rate depicted in Figure 4 follows previous studies [9], [11], [21], [29], [30], categorizing DIMMs according to distinct fault types (e.g., cell faults) and assessing the percentage of DIMMs that encounter UEs.
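To make the fault-mode definitions above concrete, here is a minimal classification sketch. The record format and the threshold values are placeholder assumptions; the production thresholds follow [12], [29], [30].

```python
from collections import Counter

def classify_faults(ce_records, cell_th=2, row_th=2, col_th=2):
    """Assign coarse fault modes to one DIMM from its CE records.

    ce_records: list of (device, bank, row, column) tuples, one per CE.
    """
    records = list(ce_records)
    cells = Counter(records)
    rows = Counter((d, b, r) for d, b, r, _ in records)
    cols = Counter((d, b, c) for d, b, _, c in records)
    devices = {d for d, _, _, _ in records}

    modes = set()
    if any(n >= cell_th for n in cells.values()):
        modes.add("cell")
    faulty_row_banks = {k[:2] for k, n in rows.items() if n >= row_th}  # (device, bank)
    faulty_col_banks = {k[:2] for k, n in cols.items() if n >= col_th}
    if faulty_row_banks:
        modes.add("row")
    if faulty_col_banks:
        modes.add("column")
    if faulty_row_banks & faulty_col_banks:  # row and column faults in the same bank
        modes.add("bank")
    modes.add("single-device" if len(devices) == 1 else "multi-device")
    return modes

# Repeated CEs in one cell of one chip: cell, row, column, bank, single-device.
print(classify_faults([(0, 1, 7, 3), (0, 1, 7, 3), (0, 1, 7, 9), (0, 1, 2, 3)]))
```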
Fig. 4: Relative % of UE. [Figure: relative percentage of UEs for cell, column, row, bank, single-device, and multi-device faults on the Intel Purley, Intel Whitley, and K920 platforms.]

As shown in Fig. 4, most UEs are attributed to faults in higher-level components, such as row and bank faults, across all platforms. Specifically, on the Intel Purley platform, the primary source of UEs is single-device faults. Conversely, in both the Whitley and K920 platforms, UEs predominantly arise from multi-device faults, possibly due to variations in ECC mechanisms.

Finding 2. The Intel Purley platform primarily experiences UEs due to single-device faults, a trend that appears to diminish in the Whitley platform. Meanwhile, the K920 platform exhibits fewer single-device faults, potentially attributed to the efficiency of its K920-SDDC.

We then investigate the failure patterns of error bits in DQs and beats, similar to [30]. As shown in Fig. 5, we examined the counts and intervals of error DQs and beats in x4 bit-width DRAM. On the Intel Purley platform, error DQ and beat counts of 2 and a 4-beat interval are associated with significantly higher UE rates than other error DQ and beat counts and intervals. By contrast, the Intel Whitley platform exhibited higher UE rates with 4 error DQs and 5 error beats. However, in our observations, variations in UE rates were not significantly influenced by the intervals between DQs and beats. Thus, the failure patterns on Intel's more advanced Whitley platform are markedly different from those observed on the Purley platform.

Finding 3. At the bit level within the Intel architecture, distinct DQ and beat patterns emerge, highlighting the importance of formulating failure patterns tailored to specific platforms due to the potential variations in their underlying ECC mechanisms.
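The count and interval features examined in Fig. 5 can be derived from the decoded error-bit positions of a corrected access; a minimal sketch follows. Treating an interval as the span between the outermost erroneous DQs or beats is our simplifying assumption about the definition used in [30], and the function name is illustrative.

```python
def dq_beat_features(error_bits):
    """Count and interval features of the error bits of one corrected access.

    error_bits: set of (dq, beat) positions decoded from the BMC log, within
    the 8-beat transfer of 72 bits described in Section II-B.
    """
    dqs = sorted({dq for dq, _ in error_bits})
    beats = sorted({beat for _, beat in error_bits})
    return {
        "dq_count": len(dqs),
        "beat_count": len(beats),
        "dq_interval": dqs[-1] - dqs[0] if dqs else 0,
        "beat_interval": beats[-1] - beats[0] if beats else 0,
    }

# Two erroneous DQs and two beats that are four beats apart, the pattern
# associated with elevated UE rates on the Purley platform.
print(dq_beat_features({(8, 1), (9, 5)}))
# {'dq_count': 2, 'beat_count': 2, 'dq_interval': 1, 'beat_interval': 4}
```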
VI. FAILURE PREDICTION

We develop our failure prediction using tree-based algorithms (Random Forest and LightGBM [28]–[30]) and the deep-learning FT-Transformer [39]. The tree-based ensemble techniques have been prevalently utilized in previous memory failure prediction literature [15], [18]–[20], [27], while the FT-Transformer is considered a leading algorithm for handling tabular data in the field of deep learning. The experimental design and feature generation follow the methodology in [30], which categorizes samples into two classes: Positive and Negative. DIMMs expected to experience at least one UE within the prediction window are categorized as Positive, while those expected not to have any UE are classified as Negative. Sample labeling is based on the time interval between a CE and its subsequent UE, with specifics on the interval settings available in [29], [30]. Features used in our models include DRAM characteristics such as manufacturer, data width, frequency, and chip process, the CE rate, the outputs of our fault analysis, and memory events. The performance of these algorithms was evaluated using precision, recall, F1-score, and VIRR as described in Section IV.

Comparison with the existing approach. We compare our algorithms with the reproduced Risky CE Pattern approach from [7], particularly for the Intel Skylake/Cascade Lake (Purley platform) architecture. However, we noted a lack of dedicated memory failure prediction algorithms for the Intel Whitley platform and the Huawei ARM K920 platform. Table II shows the superior performance of our method, achieving a high F1-score of 0.64 and a VIRR of 0.65 using LightGBM on the Intel Purley platform, outperforming the rule-based risky CE pattern algorithm. Additionally, it scores a 0.50 F1-score on the Intel Whitley platform using the FT-Transformer, and a 0.54 F1-score and 0.46 VIRR on the K920 architecture with LightGBM. Overall, the LightGBM results outperformed the other machine learning methods, including the deep-learning FT-Transformer, which agrees with the finding in [40].

Finding 4. Prediction efficacy varies across platforms; the Intel Whitley platform demonstrates comparatively weaker predictive performance than both the Intel Purley and K920 platforms.
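As a concrete illustration of the Section VI setup, the sketch below labels per-DIMM samples and trains a LightGBM classifier. The column names, hyperparameters, and the labeling helper are illustrative assumptions rather than the production configuration, which follows [29], [30].

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical feature columns; the real feature set follows [29], [30].
CATEGORICAL = ["manufacturer", "data_width", "chip_process"]
NUMERIC = ["ce_rate_1h", "ce_rate_5d", "dq_count", "beat_count",
           "row_faults", "bank_faults", "ce_storm_events"]

def label_dimm(t_now, ue_times, lead=pd.Timedelta("3h"), horizon=pd.Timedelta("30D")) -> int:
    """Positive (1) if the DIMM sees a UE inside [t_now + lead, t_now + lead + horizon].

    Simplified reading of the lead time and 30-day prediction window in Section IV.
    """
    return int(any(t_now + lead <= t <= t_now + lead + horizon for t in ue_times))

def train_and_evaluate(train_df: pd.DataFrame, test_df: pd.DataFrame, y_c: float = 0.1) -> dict:
    """Train on labelled per-DIMM samples and report the Section IV measures."""
    X_train = pd.get_dummies(train_df[CATEGORICAL + NUMERIC], columns=CATEGORICAL)
    X_test = pd.get_dummies(test_df[CATEGORICAL + NUMERIC], columns=CATEGORICAL)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

    model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, train_df["label"])
    pred = model.predict(X_test)

    precision = precision_score(test_df["label"], pred)
    recall = recall_score(test_df["label"], pred)
    virr = (1 - y_c / precision) * recall if precision > 0 else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1_score(test_df["label"], pred), "virr": virr}
```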
Fig. 5: Analyses of Error Bits in Intel Platforms: Highlighting The Highest Rate with Red Bar. [Figure: relative UE rate versus error DQ count, beat count, DQ interval, and beat interval, for the Intel Purley (top row) and Intel Whitley (bottom row) platforms.]

TABLE II: Overview of Algorithm Performance Comparisons. (X denotes the absence of prediction values)
Algorithm | Intel Purley (Precision / Recall / F1 / VIRR) | Intel Whitley (Precision / Recall / F1 / VIRR) | K920 (Precision / Recall / F1 / VIRR)
Risky CE Pattern [7] | 0.53 / 0.46 / 0.49 / 0.37 | X / X / X / X | X / X / X / X
Random forest | 0.61 / 0.62 / 0.61 / 0.52 | 0.34 / 0.46 / 0.39 / 0.32 | 0.44 / 0.51 / 0.47 / 0.39
LightGBM | 0.54 / 0.80 / 0.64 / 0.65 | 0.46 / 0.54 / 0.49 / 0.45 | 0.51 / 0.57 / 0.54 / 0.46
FT-Transformer | 0.49 / 0.74 / 0.59 / 0.58 | 0.53 / 0.49 / 0.50 / 0.40 | 0.40 / 0.54 / 0.46 / 0.41

VII. MLOPS OF FAILURE PREDICTION

After developing machine learning (ML) algorithms that accurately predict memory failures, it becomes crucial both to maintain and enhance these algorithms and to automate their operation within the data center. The MLOps framework is ideally suited for this purpose, ensuring the continuous accuracy and applicability of our memory failure prediction algorithms. Figure 6 illustrates an overview of the MLOps framework for memory failure prediction, with the workflow introduced in stages as follows:
Data Pipeline: The initial stage involves collecting raw data from various sources. For example, CE and UE logs are collected by the BMC and are then processed and stored in the Data Lake, alongside other data sources such as runtime workload metrics (e.g., CPU utilization) and environmental metrics (e.g., server locations, temperatures), using the Huawei Data Lake Insight (DLI) solution.
Feature Store: A feature store acts as a centralized repository for transforming, storing, cataloging, and serving the features used in model training and inference. It ensures consistency between the training and prediction phases and accelerates the development of machine learning models by making features readily accessible. This involves:
• Transformation: Converting raw data into features suitable for machine learning algorithms. This process is divided into batch and stream transformations for model training and online prediction, respectively.
• Storage: Once transformed, features are stored in an accessible format for training and prediction. They are cataloged with registry information to standardize features across all teams' models. For example, CEs are converted into temporal and spatial features within the DRAM hierarchy; this conversion includes the distribution of error bits across DQs and beats and the number of faults within different time intervals (1 min, 1 h, 5 d, etc.), while memory configurations such as manufacturer and DRAM process are further encoded as static features (a sketch of such a batch transformation follows this list).
• Serving: The feature store serves features for model training and inference, enabling Data Scientists to select features on demand for training different models based on varying requirements. For instance, Data Scientists might develop various models tailored to distinct CPU architectures, utilizing unique features for each.
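As a sketch of the batch transformation in the Storage step above, the function below aggregates raw CE events into time-windowed features. The event schema, the column names, and the chosen windows are illustrative assumptions, not the production feature registry.

```python
import pandas as pd

# Assumed raw schema: one row per CE event with a timestamp, its decoded
# location (device, row), and the per-event error-bit count (dq_count).
WINDOWS = {"1min": "1min", "1h": "1h", "5d": "5D"}

def batch_ce_features(ce_events: pd.DataFrame, now: pd.Timestamp) -> dict:
    """Batch transformation of one DIMM's CE log into temporal/spatial features."""
    features = {}
    for name, offset in WINDOWS.items():
        window = ce_events[ce_events["timestamp"] >= now - pd.Timedelta(offset)]
        features[f"ce_count_{name}"] = len(window)
        features[f"distinct_rows_{name}"] = window["row"].nunique()
        features[f"distinct_devices_{name}"] = window["device"].nunique()
        features[f"max_dq_count_{name}"] = int(window["dq_count"].max()) if len(window) else 0
    return features
```

A stream transformation for online prediction would apply the same aggregation incrementally, while static attributes such as manufacturer and DRAM process are encoded once and stored alongside these features.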
ML Deployment: This phase involves (1) model training, which comprises algorithm selection, hyperparameter tuning, and the application of the configured algorithms to the prepared datasets. This task can be performed manually by Data Scientists or through automated tools like AutoML. Once models are trained and show substantial improvements in predefined benchmark evaluations, they advance to deployment in the production environment. This deployment leverages a (2) Continuous Integration and Continuous Delivery (CI/CD) pipeline, which automates the integration, testing, and deployment of ML algorithms, thereby ensuring that models can be consistently updated and reliably released within the production environment. Subsequently, the successfully deployed models continue delivering (3) online prediction utilizing streaming data, and the prediction results are served to our cloud service.
Cloud Service: An alarm is triggered in the Cloud Alarm System upon predicting a memory failure. Depending on the use case, memory RAS techniques are then implemented to mitigate the failure, with VMs being migrated automatically or manually as required.
Monitoring: Throughout the MLOps workflow, each phase is continuously monitored through dashboards. This includes monitoring data collection rates, feature importance, and algorithm performance, as well as VM migrations and service interruptions. To enhance algorithm accuracy and ensure fairness in predictions, feedback is proactively gathered from the cloud service. Dashboards are implemented in both testing and production settings to closely observe algorithm performance, as well as the rates of false negatives and false positives. This dual-environment monitoring facilitates the ongoing refinement of the algorithms.
Fig. 6: The MLOps Framework of Failure Prediction. [Figure: the workflow spans a Data Pipeline (BMC log collection, workload and environment data sources, Data Lake), a Feature Store (batch and stream transformation into temporal, spatial, and static features, feature registry, on-demand feature selection, serving), ML Deployment (model training, model registry, CI/CD, online prediction), and the Cloud Service (Cloud Alarm System, memory RAS techniques, VM migration, feedback), with monitoring at every stage; Data Engineers, Data Scientists, and MLOps Engineers collaborate across the stages.]

In our memory failure prediction development, collaboration across various teams is essential. This collaborative effort spans from the Data Engineers, who are responsible for processing new data and integrating it into the Data Lake in response to Data Scientists' requests, to the Data Scientists, who analyze this data, develop predictive algorithms, and specify requirements for operational deployment, and to the MLOps Engineers, who take on the critical role of re-implementing and deploying the newly developed algorithms into the production environment.

VIII. RELATED WORK

Empirical research [8]–[12] has laid the groundwork in the study of memory errors, focusing on correlation analyses and failure modes. These works serve as foundational elements for developing memory failure prediction algorithms. This section highlights significant contributions in memory failure prediction.

The ensemble learning approaches [15], [17], [18], [20] have constantly improved memory failure prediction by leveraging correctable errors, event logs, and sensor metrics. Node/server-level memory unavailability prediction methods are proposed in [21], [28], considering both UE and CE storm/CE-driven prediction. Li et al. [7] explored correlations between CEs and UEs using error bit information and DIMM part numbers, creating a new risky CE indicator for UE prediction across different manufacturers and part numbers. Peng et al. [23] designed DRAM failure forecasters by utilizing different UCE types. Yu et al. [29], [30] further examined the distribution of error bits and proposed a hierarchical, system-level approach for predicting memory failures by utilizing error-bit features.

However, the literature mentioned does not examine failure patterns across various processor platforms, nor does it engage in the development of ML models specifically designed for distinct CPU architectures to improve prediction. In our previous work [41], we explored memory failure patterns across various CPU architectures. In this extended version, we further expand the fault analysis, present four findings, and establish an MLOps framework to continuously improve failure prediction models throughout their lifecycle.

IX. CONCLUSION

We present the first comprehensive analysis of DRAM failures spanning both X86 and ARM systems across various platforms in large-scale datacenters. From our analytical and predictive modeling work, we report four findings: 1) UE and sudden UE rates differ between X86 and ARM systems. 2) Fault modes vary across architectures. 3) Bit-level failure patterns of DRAM are architecture-dependent. 4) Prediction accuracy differs by platform. Utilizing datasets from the production environment, our approach showcased a 15% enhancement in F1-score compared to the method in [7], specifically within the Intel Purley platform. Moreover, we executed initial experiments on the Intel Whitley and ARM-based platforms, achieving F1-scores of 0.50 and 0.54, along with VIRRs of 0.45 and 0.46, respectively. Finally, we present our MLOps framework for memory failure prediction, implemented in the production environment. This framework is designed to ensure the continuous enhancement and maintenance of failure prediction performance.

ACKNOWLEDGEMENT

We thank the anonymous reviewers from DSN'24 for their constructive comments.
REFERENCES

[1] "Intel MCA+MFP Helps JD Stable and Efficient Cloud Services." [Online]. Available: https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2023-12/mca-mfp-helps-jd-stable-and-efficient-cloud-services.pdf
[2] G. Wang, L. Zhang, and W. Xu, "What can we learn from four years of data center hardware failures?" in 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2017, pp. 25–36.
[3] P. Notaro, Q. Yu, S. Haeri, J. Cardoso, and M. Gerndt, "An optical transceiver reliability study based on SFP monitoring and OS-level metric data," in 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2023, pp. 1–12.
[4] M. Y. Hsiao, "A class of optimal minimum odd-weight-column SEC-DED codes," IBM Journal of Research and Development, 1970.
[5] T. J. Dell, "A white paper on the benefits of chipkill-correct ECC for PC server main memory," in Computer Science, 1997. [Online]. Available: https://asset-pdf.scinapse.io/prod/48011110/48011110.pdf
[6] "Intel® E7500 chipset MCH Intel® x4 Single Device Data Correction (x4 SDDC) implementation and validation." [Online]. Available: https://www.intel.com/content/dam/doc/application-note/e7500-chipset-mch-x4-single-device-data-correction-note.pdf
[7] C. Li, Y. Zhang, J. Wang, H. Chen, X. Liu, T. Huang, L. Peng, S. Zhou, L. Wang, and S. Ge, "From correctable memory errors to uncorrectable memory errors: What error bits tell," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '22. IEEE Press, 2022.
[8] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: A large-scale field study," in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 193–204. [Online]. Available: https://doi.org/10.1145/1555349.1555372
[9] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, "Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field," in 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Rio de Janeiro, Brazil: IEEE, 2015, pp. 415–426.
[10] V. Sridharan and D. Liberty, "A study of DRAM failures in the field," in SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1–11.
[11] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi, "Memory errors in modern systems: The good, the bad, and the ugly," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 297–310. [Online]. Available: https://doi.org/10.1145/2694344.2694348
[12] M. V. Beigi, Y. Cao, S. Gurumurthi, C. Recchia, A. Walton, and V. Sridharan, "A systematic study of DDR4 DRAM faults in the field," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 991–1002.
[13] M. Patel, T. Shahroodi, A. Manglik, A. G. Yaglikci, A. Olgun, H. Luo, and O. Mutlu, "A case for transparent reliability in DRAM systems," 2022.
[14] J. Jung and M. Erez, "Predicting future-system reliability with a component-level DRAM fault model," in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 944–956. [Online]. Available: https://doi.org/10.1145/3613424.3614294
[15] I. Giurgiu, J. Szabo, D. Wiesmann, and J. Bird, "Predicting DRAM reliability in the field with machine learning," in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference: Industrial Track, ser. Middleware '17. New York, NY, USA: Association for Computing Machinery, Dec. 2017, pp. 15–21. [Online]. Available: https://doi.org/10.1145/3154448.3154451
[16] X. Du and C. Li, "Memory failure prediction using online learning," in Proceedings of the International Symposium on Memory Systems, ser. MEMSYS '18. New York, NY, USA: Association for Computing Machinery, Oct. 2018, pp. 38–49. [Online]. Available: https://doi.org/10.1145/3240302.3240309
[17] X. Du, C. Li, S. Zhou, M. Ye, and J. Li, "Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data," in 2020 16th European Dependable Computing Conference (EDCC), Sep. 2020, pp. 41–46, ISSN: 2641-810X.
[18] I. Boixaderas, D. Zivanovic, S. Moré, J. Bartolome, D. Vicente, M. Casas, P. M. Carpenter, P. Radojković, and E. Ayguadé, "Cost-aware prediction of uncorrected DRAM errors in the field," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. Atlanta, Georgia: IEEE Press, Nov. 2020, pp. 1–15.
[19] F. Yu, H. Xu, S. Jian, C. Huang, Y. Wang, and Z. Wu, "DRAM failure prediction in large-scale data centers," in 2021 IEEE International Conference on Joint Cloud Computing (JCC). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2021, pp. 1–8. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/JCC53141.2021.00012
[20] X. Du and C. Li, "Predicting uncorrectable memory errors from the correctable error history: No free predictors in the field," in The International Symposium on Memory Systems, ser. MEMSYS 2021. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3488423.3519316
[21] Z. Cheng, S. Han, P. P. C. Lee, X. Li, J. Liu, and Z. Li, "An in-depth correlative study between DRAM errors and server failures in production data centers," in 2022 41st International Symposium on Reliable Distributed Systems (SRDS), 2022, pp. 262–272.
[22] J. Bogatinovski, O. Kao, Q. Yu, and J. Cardoso, "First CE matters: On the importance of long term properties on memory failure prediction," in 2022 IEEE International Conference on Big Data (Big Data), 2022, pp. 4733–4736.
[23] X. Peng, Z. Huang, A. Cantrell, B. H. Shu, K. K. Xie, Y. Li, Y. Li, L. Jiang, Q. Xu, and M.-C. Yang, "Expert: Exploiting DRAM error types to improve the effective forecasting coverage in the field," in 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S), 2023, pp. 35–41.
[24] Z. Liu, C. Benge, Y. Dagli, and S. Jiang, "STIM: Predicting memory uncorrectable errors with spatio-temporal transformer."
[25] X. Sun, K. Chakrabarty, R. Huang, Y. Chen, B. Zhao, H. Cao, Y. Han, X. Liang, and L. Jiang, "System-level hardware failure prediction using deep learning," in 2019 56th ACM/IEEE Design Automation Conference (DAC), Jun. 2019, pp. 1–6, ISSN: 0738-100X.
[26] L. Mukhanov, K. Tovletoglou, H. Vandierendonck, D. S. Nikolopoulos, and G. Karakonstantis, "Workload-aware DRAM error prediction using machine learning," in 2019 IEEE International Symposium on Workload Characterization (IISWC), 2019, pp. 106–118.
[27] X. Wang, Y. Li, Y. Chen, S. Wang, Y. Du, C. He, Y. Zhang, P. Chen, X. Li, W. Song, Q. Xu, and L. Jiang, "On workload-aware DRAM failure prediction in large-scale data centers," in 2021 IEEE 39th VLSI Test Symposium (VTS), 2021, pp. 1–6.
[28] P. Zhang, Y. Wang, X. Ma, Y. Xu, B. Yao, X. Zheng, and L. Jiang, "Predicting DRAM-caused node unavailability in hyper-scale clouds," in 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2022, pp. 275–286.
[29] Q. Yu, W. Zhang, P. Notaro, S. Haeri, J. Cardoso, and O. Kao, "HiMFP: Hierarchical intelligent memory failure prediction for cloud service reliability," in 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2023, pp. 216–228.
[30] Q. Yu, W. Zhang, J. Cardoso, and O. Kao, "Exploring error bits for memory failure prediction: An in-depth correlative study," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, Oct. 2023. [Online]. Available: http://dx.doi.org/10.1109/ICCAD57390.2023.10323692
[31] "Memory RAS configuration user's guide." [Online]. Available: https://www.supermicro.com/manuals/other/Memory RAS Configuration User Guide.pdf
[32] X. Du and C. Li, "DPCLS: Improving partial cache line sparing with dynamics for memory error prevention," in 2020 IEEE 38th International Conference on Computer Design (ICCD), 2020, pp. 197–204.
[33] C.-S. Hou, Y.-X. Chen, J.-F. Li, C.-Y. Lo, D.-M. Kwai, and Y.-F. Chou, "A built-in self-repair scheme for DRAMs with spare rows, columns, and bits," in 2016 IEEE International Test Conference (ITC), 2016, pp. 1–7.
[34] X. Du, C. Li, S. Zhou, X. Liu, X. Xu, T. Wang, and S. Ge, "Fault-aware prediction-guided page offlining for uncorrectable memory error prevention," in 2021 IEEE 39th International Conference on Computer Design (ICCD), 2021, pp. 456–463.
[35] X. Jian and R. Kumar, "Adaptive reliability chipkill correct (ARCC)," in 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), 2013, pp. 270–281.
[36] X. Du and C. Li, “Combining error statistics with failure prediction in
memory page offlining,” in Proceedings of the International Symposium
on Memory Systems, ser. MEMSYS ’19. New York, NY, USA:
Association for Computing Machinery, 2019, p. 127–132. [Online].
Available: https://doi.org/10.1145/3357526.3357527
[37] D. Tang, P. Carruthers, Z. Totari, and M. Shapiro, “Assessment of the
effect of memory page retirement on system ras against hardware faults,”
in International Conference on Dependable Systems and Networks
(DSN’06), 2006, pp. 365–370.
[38] K. A. Alharthi, A. Jhumka, S. Di, L. Gui, F. Cappello, and S. McIntosh-
Smith, “Time machine: Generative real-time model for failure (and
lead time) prediction in hpc systems,” in 2023 53rd Annual IEEE/IFIP
International Conference on Dependable Systems and Networks (DSN),
2023, pp. 508–521.
[39] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting
deep learning models for tabular data,” Advances in Neural Information
Processing Systems, vol. 34, pp. 18 932–18 943, 2021.
[40] L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based
models still outperform deep learning on typical tabular data?” Advances
in Neural Information Processing Systems, vol. 35, pp. 507–520, 2022.
[41] Q. Yu, J. Cardoso, and O. Kao, “Unveiling dram failures across
different cpu architectures in large-scale datacenters,” in 2024 IEEE 44th
International Conference on Distributed Computing Systems (ICDCS).
IEEE, 2024, pp. 1–2, to appear.
