
Investigating Memory Failure Prediction Across CPU Architectures

Qiao Yu∗†, Wengui Zhang‡, Min Zhou‡¶, Jialiang Yu‡, Zhenli Sheng‡, Jasmin Bogatinovski†, Jorge Cardoso∗§ and Odej Kao†
∗ Huawei Munich Research Center, Germany   † Technical University of Berlin, Germany
‡ Huawei Technologies Co., Ltd, China   § CISUC, University of Coimbra, Portugal
{qiao.yu, zhangwengui1, zhoumin27, yujialiang, shengzhenli, jorge.cardoso}@huawei.com
{jasmin.bogatinovski, odej.kao}@tu-berlin.de
¶ Corresponding author: [email protected]

arXiv:2406.05354v1 [cs.AR] 8 Jun 2024

Abstract—Large-scale datacenters often experience memory failures, where Uncorrectable Errors (UEs) highlight critical malfunctions in Dual Inline Memory Modules (DIMMs). Existing approaches primarily utilize Correctable Errors (CEs) to predict UEs, yet they typically neglect how these errors vary between different CPU architectures, especially in terms of Error Correction Code (ECC) applicability. In this paper, we investigate the correlation between CEs and UEs across different CPU architectures, including X86 and ARM. Our analysis identifies unique patterns of memory failure associated with each processor platform. Leveraging Machine Learning (ML) techniques on production datasets, we conduct memory failure prediction on the different processor platforms, achieving up to 15% improvement in F1-score compared to the existing algorithm. Finally, an MLOps (Machine Learning Operations) framework is provided to consistently improve failure prediction in the production environment.

Index Terms—Memory, Failure prediction, Uncorrectable error, Reliability, Machine Learning, AIOps, ML for Systems

I. INTRODUCTION

With the expansion of cloud computing and big data services, the challenge of maintaining the Reliability, Availability, and Serviceability (RAS)1 of servers is intensifying due to memory failures, which represent a significant fraction of hardware malfunctions [1]–[3]. These failures often occur as Correctable Errors (CEs) and Uncorrectable Errors (UEs). To tackle these issues, Error Correction Code (ECC) mechanisms such as SEC-DED [4], Chipkill [5], and SDDC [6] are utilized to detect and correct errors. For instance, Chipkill ECC is capable of correcting all erroneous bits from a single DRAM (Dynamic Random Access Memory) chip. However, its efficacy diminishes when errors span multiple chips, leading to system failures caused by UEs. Furthermore, the ECC mechanisms on modern Intel platform servers do not offer the same level of protection as Chipkill ECC, making them vulnerable to certain error patterns originating from a single chip [7]. Therefore, relying exclusively on ECC mechanisms for memory reliability proves inadequate, as memory failures remain a prevalent source of system failures.

1 Reliability, Availability, and Serviceability are the foundational pillars used to measure the dependability of computer systems.

To enhance memory reliability, numerous studies [8]–[14] have delved into the correlations between memory errors and memory failures, laying the groundwork for our research. Machine Learning (ML) techniques have been employed for predicting memory failures [15]–[24], using CE information from large-scale datacenters to forecast UEs. These investigations have effectively exploited the spatial distribution of CEs to improve memory failure prediction. Additionally, system-level workload metrics such as memory utilization and read/write accesses have been considered for memory failure prediction in [25]–[27]. Results from [27] indicate that workload metrics play a minor role compared to other CE-related features. Research in [28] focuses on CE storms (a high frequency of CEs in a brief timeframe) and UEs to predict DRAM-caused node unavailability (DCNU), highlighting the significance of spatio-temporal features of CEs. Furthermore, [7] explores specific error bit patterns and their association with DRAM UEs, developing rule-based indicators for DRAM failure prediction that vary by manufacturer and part number, adapted to the ECC designs of modern Intel Skylake and Cascade Lake servers. Moreover, [29], [30] examine the distribution of error bits and propose a hierarchical, system-level method for predicting memory failures, leveraging error bit characteristics.

However, the incidence of UEs is influenced not just by DRAM faults but also by differences across CPU architectures, due to the diverse ECC mechanisms in use, which can alter the patterns of memory failures observed. Understanding and modeling these failure patterns across various CPU platforms and ECC types is essential for accurate prediction of UEs. This gap in research motivates us to undertake the first study of DRAM failures comparing X86 and ARM systems, specifically the Intel Purley and Whitley platforms and the Huawei ARM K920 (anonymized to protect confidentiality) processor. By analyzing the relationship between UEs and fault patterns across these processor platforms, we aim to create targeted memory failure prediction algorithms. Additionally, we acknowledge the dynamic nature of server configurations, CPU architectures, memory types, and workloads. To address these variables, we introduce an MLOps framework designed to accommodate such changes, thereby continuously improving failure prediction throughout the lifecycle of the production environment.

The main contributions of this paper are as follows:
• We present the first memory failure study between X86 and ARM systems, specifically focusing on the Intel X86 Purley and Whitley platforms as well as the Huawei ARM K920 processor platform, in large-scale datacenters. Different fault modes within the DRAM hierarchy are associated with memory failures across these platforms.
• We develop ML-based algorithms for predicting memory failures, leveraging the identified DRAM fault modes to anticipate UEs on these platforms.
• We establish an MLOps framework for failure prediction to facilitate collaboration across teams within the organization and help manage the lifecycle of production ML algorithms.

The organization of this paper is as follows: Section II provides the background of this work. Section III describes the dataset employed in our data analysis. Section IV details the problem formulation and performance metrics. Section V uncovers high-level fault modes within the DRAM hierarchy and their relationship to UEs across various platforms. Section VI demonstrates the use of machine learning techniques for memory failure prediction. Section VII introduces our MLOps framework for failure prediction. Related work is shown in Section VIII. Section IX concludes this paper.

II. BACKGROUND

A. Terminology

A fault in DRAM acts as the root cause for an error, which may arise from a variety of sources, including particle impacts, cosmic rays, or manufacturing defects.

An error occurs when a DIMM sends incorrect data to the memory controller, deviating from what the ECC [4]–[6], [31] expects, indicative of an underlying fault. Memory errors, depending on the ECC's correction capacity, are classified into two main types: Correctable Errors (CEs) and Uncorrectable Errors (UEs). Two specific types of UEs are well described in [15]: 1) sudden UEs, which result from component malfunctions that immediately corrupt data, and 2) predictable UEs, which initially appear as CEs but evolve into UEs over time. Sudden UEs occur without prior CEs, whereas predictable UEs may be forecasted through CEs using algorithms designed for failure prediction. In this study, our focus is on predicting predictable UEs, as they constitute the majority of memory failures, as described in Section III.

B. DRAM Organization and Access

Fig. 1 illustrates the memory's hierarchical layout and its CPU interactions. Figure 1(1) shows a DIMM rank made up of DRAM chips organized by banks, rows, and columns, where data moves from memory cells to the memory controller, which can generally detect and correct CEs. Figure 1(2) shows the data transmission of x4 DDR4 DRAM chips via the Data Bus (DQs) upon CPU requests, involving 8 beats of 72 bits (64 data bits plus 8 ECC bits). Implementing the ECC, the memory controller detects and corrects errors, as shown in Figure 1(3). Note that the exact ECC algorithms are highly confidential and never exposed, and the ECC checking bit addresses can be decoded to locate specific errors in DQs and beats. Finally, all these logs, including corrected and uncorrected errors, events, and memory specifications, are recorded in the Baseboard Management Controller (BMC)2.

Fig. 1: Memory Organization. [Figure: (1) memory organization: memory channel, memory controller (ECC), DIMM rank, DRAM chips (devices), banks, rows, columns, and cells; (2) memory access: data bits and ECC bits transferred to the CPU over DQs in beats; (3) ECC check at the memory controller.]

2 BMC is a specialized processor built into the server's motherboard, designed to supervise the physical status of computers, network servers, and additional hardware components.

C. Memory RAS Techniques

DRAM subsystems leverage RAS features for protection, including proactive VM migrations to minimize interruptions and CE storm3 suppression to prevent service degradation. Advanced RAS techniques are designed to protect server-grade machines by avoiding faulty regions and employing sparing techniques such as bit, row/column, and bank/chip sparing (e.g., Partial Cache Line Sparing (PCLS) [32], Post Package Repair (PPR) [33], and Intel's Adaptive Double Device Data Correction (ADDDC) [34], [35]). Software sparing mechanisms, such as page offlining, mitigate memory errors [34], [36], [37]. However, these approaches may increase redundancy and overhead, affecting performance and limiting their universal applicability. Memory failure prediction plays a key role in foreseeing UEs and implementing specific mitigation strategies.

3 A CE storm occurs when CE interruptions repeatedly occur multiple times, e.g., 10 times.

III. DATASET

Our dataset, sourced from Huawei cloud datacenters, includes system configurations, Machine Check Exception (MCE) logs, and memory events (CE storms, etc.), focusing on DIMMs experiencing CEs and omitting those with sudden UEs due to the lack of predictive data. We examined error logs from approximately 250,000 servers across the Intel Purley and Whitley platforms (including Skylake, Cascade Lake, and Icelake) as well as the Huawei K920 processor platform. Table I provides an overview of our data, which includes over 90,000 DDR4 DIMMs from various manufacturers, spanning different CPU architectures, with CEs recorded from January to October 2023. Within the Intel platforms, predictable UEs constitute 73% of the UEs on the Purley platform, surpassing the rate of sudden UEs. In contrast, the Whitley platform shows a higher incidence of sudden UEs than Purley, despite having a lower total UE rate. Meanwhile, on the ARM system with the K920 processor platform, there is a significant predominance of predictable UEs over sudden UEs, showcasing a variance in ratios compared to the X86 systems; the overall rate of UEs in the K920 dataset is also lower than that of the Intel platforms. Note that these statistics are specific to the datasets analyzed, and the observed variations may be influenced by several factors, such as workload, server age, and distinct RAS mechanisms.
TABLE I: Description of Dataset.
CPU Platform | DIMMs with CEs | DIMMs with UEs | Predictable UE DIMMs in % | Sudden UE DIMMs in %
Intel Purley | > 50,000 | > 2,000 | 73% | 27%
Intel Whitley | > 10,000 | > 400 | 42% | 58%
K920 | > 30,000 | > 600 | 82% | 18%

In particular, the ECC used in contemporary Intel platforms, which is integral for error correction and detection, is considered weaker than Chipkill. This is partly because some of the extra bits previously used by Intel ECCs are reallocated for other uses [7], such as to store ownership or security information, or to mark failed areas of DRAM. This suggests that the observed discrepancies in UE rates across different architectures could stem from the unique ECC mechanisms employed.

Finding 1. The UE and sudden UE rates show variation between X86 and ARM systems. This discrepancy could be attributed to the distinct ECC mechanisms implemented within these differing architectures.

IV. PROBLEM FORMULATION AND PERFORMANCE MEASURES

The problem of predicting memory failures is approached as a binary classification task, following the methodology in [29], [30]. As illustrated in Figure 3, an algorithm at time t uses data from a historical observation window △td to predict failures within a future prediction period [t + △tl, t + △tl + △tp], where △tl represents the lead time [38] before a failure occurs, and △tp is the duration of the prediction window. Event samples are recorded at intervals of △is (e.g., CE events are logged every minute), and predictions are made at intervals of △ip (every 5 minutes). The observation window (△td) and the prediction validation window (△tp) are set to 5 days and 30 days, respectively, to facilitate early proactive strategies. Note that these parameters were optimized based on empirical data from the production environment. The lead prediction time △tl ∈ (0, 3h] ranges up to 3 hours, designed for specific operational scenarios. A True Positive (TP) denotes a correctly predicted failure within the window. A False Positive (FP) represents an incorrect forecast. A False Negative (FN) describes a failure that happens without an earlier warning, and a True Negative (TN) is identified when no failure is anticipated and none takes place. The performance of the algorithm is evaluated using Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 · Precision · Recall / (Precision + Recall).

Fig. 3: Failure prediction problem definition [29]. [Figure: a timeline with the observation window in the past, the lead prediction window and prediction validation window in the future, the sampling and prediction intervals, the present time, and the failure time.]

VM Interruption Reduction Rate (VIRR). Prior works [7], [18], [20], [28]–[30] have introduced cost-aware models to assess the benefits of memory failure prediction. In this work, we emphasize the VM Interruption Reduction Rate (VIRR) [29], illustrated in Figure 2, as it more precisely reflects the effects on customer experience.

Fig. 2: VM Interruption under Failure Prediction. [Figure: failure prediction outcomes (True Positive, False Positive, False Negative, True Negative) for abnormal and normal memory, routed to memory mitigation, live migration, cold migration, and VM interruption.]

Understanding the VIRR involves considering Va as the average number of VMs per server. Without predictive capabilities, the interruptions are calculated as ❶ V = Va · (TP + FN), as seen in the Abnormal branch of Figure 2. While proactive VM live migrations and memory mitigation techniques aim to minimize interruptions without service disruption, a notable fraction of VMs might still undergo cold migration, which typically causes VM interruptions. Cold migrations are often the fallback when live migrations or memory mitigation are unfeasible, due to limited resources or unexpected failures, and are a common approach for VM reallocation and maintenance. The fraction of VMs under such migrations is denoted as yc. As a result, in Figure 2 we define ❷ V1′ = Va · yc · (TP + FP) as the volume of interruptions from cold migrations triggered by positive failure predictions (TP + FP). On the other side, missed failure predictions lead to increased interruptions, represented by ❸ V2′ = Va · FN. Considering the prediction algorithm, the total interruptions are ❹ V′ = V1′ + V2′. The VIRR is thus VIRR = (V − V′) / V, which simplifies to (1 − yc / precision) · recall, according to [29].

In practical production settings, yc remains a positive value, since VMs may need cold migration due to the failure of live migration or memory mitigation. When a model's precision falls below the fraction of cold migrations (precision < yc), the VIRR becomes negative, indicating an increase in VM interruptions. Conversely, high-precision models achieve a positive VIRR, amplified by their recall. Based on our observations, we have conservatively set yc = 0.1 for our evaluation. Note that this value is already pessimistic, anticipating a reduction in yc as cloud infrastructure evolves and expands.
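As a worked illustration of the Section IV measures, the short sketch below computes Precision, Recall, F1, and VIRR from outcome counts. The helper name and the example counts are illustrative assumptions; only yc = 0.1 follows the text, and the first example's counts are chosen so that precision and recall roughly match the LightGBM results on Intel Purley reported later in Table II.

```python
def prediction_metrics(tp: int, fp: int, fn: int, y_c: float = 0.1):
    """Precision, Recall, F1 and VIRR as defined in Section IV.

    VIRR = (V - V') / V = (1 - y_c / precision) * recall, where y_c is the
    fraction of VMs that still undergo a cold migration after a positive
    failure prediction.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    virr = (1 - y_c / precision) * recall if precision > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "virr": virr}

# Illustrative counts: precision ~0.54, recall 0.80, F1 ~0.65, VIRR ~0.65.
print(prediction_metrics(tp=80, fp=68, fn=20))
# A model whose precision falls below y_c yields a negative VIRR, i.e. more
# VM interruptions than having no prediction at all.
print(prediction_metrics(tp=5, fp=95, fn=95))
```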
V. FAULT ANALYSIS

Our analysis investigates the high-level fault modes in the DRAM hierarchy and correlates them with UE rates across various platforms. We consider faults at the DRAM level, including cell, column, row, and bank faults, as illustrated in Figure 1(1). A Cell fault occurs when the CEs in a cell surpass a set threshold, while Row and Column faults are identified by exceeding thresholds across a row or column, respectively. Bank faults arise when the thresholds for both row and column faults within a bank are exceeded. Additionally, when CEs affect only a single device, this constitutes a Single-device fault; in contrast, if CEs extend across multiple devices, it is a Multi-device fault. Further details on fault definitions and threshold settings can be found in [12], [29], [30]. The approach to calculating the relative UE rate depicted in Figure 4 follows previous studies [9], [11], [21], [29], [30], categorizing DIMMs according to distinct fault types (e.g., cell faults) and assessing the percentage of DIMMs that encounter UEs.
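To make the fault-mode definitions above concrete, here is a minimal classification sketch. The record format and the threshold values are placeholder assumptions; the production thresholds follow [12], [29], [30].

```python
from collections import Counter

def classify_faults(ce_records, cell_th=2, row_th=2, col_th=2):
    """Assign coarse fault modes to one DIMM from its CE records.

    ce_records: list of (device, bank, row, column) tuples, one per CE.
    """
    records = list(ce_records)
    cells = Counter(records)
    rows = Counter((d, b, r) for d, b, r, _ in records)
    cols = Counter((d, b, c) for d, b, _, c in records)
    devices = {d for d, _, _, _ in records}

    modes = set()
    if any(n >= cell_th for n in cells.values()):
        modes.add("cell")
    faulty_row_banks = {k[:2] for k, n in rows.items() if n >= row_th}  # (device, bank)
    faulty_col_banks = {k[:2] for k, n in cols.items() if n >= col_th}
    if faulty_row_banks:
        modes.add("row")
    if faulty_col_banks:
        modes.add("column")
    if faulty_row_banks & faulty_col_banks:  # row and column faults in the same bank
        modes.add("bank")
    modes.add("single-device" if len(devices) == 1 else "multi-device")
    return modes

# Repeated CEs in one cell of one chip: cell, row, column, bank, single-device.
print(classify_faults([(0, 1, 7, 3), (0, 1, 7, 3), (0, 1, 7, 9), (0, 1, 2, 3)]))
```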
Fig. 4: Relative % of UE. [Figure: relative percentage of UEs for cell, column, row, bank, single-device, and multi-device faults on the Intel Purley, Intel Whitley, and K920 platforms.]

As shown in Fig. 4, most UEs are attributed to faults in higher-level components, such as row and bank faults, across all platforms. Specifically, on the Intel Purley platform, the primary source of UEs is single-device faults. Conversely, in both the Whitley and K920 platforms, UEs predominantly arise from multi-device faults, possibly due to variations in ECC mechanisms.

Finding 2. The Intel Purley platform primarily experiences UEs due to single-device faults, a trend that appears to diminish in the Whitley platform. Meanwhile, the K920 platform exhibits fewer single-device faults, potentially attributed to the efficiency of its K920-SDDC.

We then investigate the failure patterns of error bits in DQs and beats, similar to [30]. As shown in Fig. 5, we examined the counts and intervals of error DQs and beats in x4 bit-width DRAM. On the Intel Purley platform, error DQ and beat counts of 2 and a 4-beat interval are associated with significantly higher UE rates than other error DQ and beat counts and intervals. By contrast, the Intel Whitley platform exhibited higher UE rates with 4 error DQs and 5 error beats. However, in our observations, variations in UE rates were not significantly influenced by the intervals between DQs and beats. Thus, the failure patterns on Intel's more advanced Whitley platform are markedly different from those observed on the Purley platform.

Finding 3. At the bit level within the Intel architecture, distinct DQ and beat patterns emerge, highlighting the importance of formulating failure patterns tailored to specific platforms due to the potential variations in their underlying ECC mechanisms.
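The count and interval features examined in Fig. 5 can be derived from the decoded error-bit positions of a corrected access; a minimal sketch follows. Treating an interval as the span between the outermost erroneous DQs or beats is our simplifying assumption about the definition used in [30], and the function name is illustrative.

```python
def dq_beat_features(error_bits):
    """Count and interval features of the error bits of one corrected access.

    error_bits: set of (dq, beat) positions decoded from the BMC log, within
    the 8-beat transfer of 72 bits described in Section II-B.
    """
    dqs = sorted({dq for dq, _ in error_bits})
    beats = sorted({beat for _, beat in error_bits})
    return {
        "dq_count": len(dqs),
        "beat_count": len(beats),
        "dq_interval": dqs[-1] - dqs[0] if dqs else 0,
        "beat_interval": beats[-1] - beats[0] if beats else 0,
    }

# Two erroneous DQs and two beats that are four beats apart, the pattern
# associated with elevated UE rates on the Purley platform.
print(dq_beat_features({(8, 1), (9, 5)}))
# {'dq_count': 2, 'beat_count': 2, 'dq_interval': 1, 'beat_interval': 4}
```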
VI. FAILURE PREDICTION

We develop our failure prediction using tree-based algorithms (Random Forest and LightGBM [28]–[30]) and the deep-learning FT-Transformer [39]. The tree-based ensemble techniques have been prevalently utilized in previous memory failure prediction literature [15], [18]–[20], [27], while the FT-Transformer is considered a leading algorithm for handling tabular data in the field of deep learning. The experimental design and feature generation follow the methodology in [30], which categorizes samples into two classes: Positive and Negative. DIMMs expected to experience at least one UE within the prediction window are categorized as Positive, while those expected not to have any UE are classified as Negative. Sample labeling is based on the time interval between a CE and its subsequent UE, with specifics on the interval settings available in [29], [30]. Features used in our models include DRAM characteristics such as manufacturer, data width, frequency, and chip process, the CE rate, the outputs of our fault analysis, and memory events. The performance of these algorithms was evaluated using precision, recall, F1-score, and VIRR as described in Section IV.

Comparison with the existing approach. We compare our algorithms with the reproduced Risky CE Pattern approach from [7], particularly for the Intel Skylake/Cascade Lake (Purley platform) architecture. However, we noted a lack of dedicated memory failure prediction algorithms for the Intel Whitley platform and the Huawei ARM K920 platform. Table II shows the superior performance of our method, achieving a high F1-score of 0.64 and a VIRR of 0.65 using LightGBM on the Intel Purley platform, outperforming the rule-based risky CE pattern algorithm. Additionally, it scores a 0.50 F1-score on the Intel Whitley platform using the FT-Transformer, and a 0.54 F1-score and 0.46 VIRR on the K920 architecture with LightGBM. Overall, the LightGBM results outperformed the other machine learning methods, including the deep-learning FT-Transformer, which agrees with the finding in [40].

Finding 4. Prediction efficacy varies across platforms; the Intel Whitley platform demonstrates comparatively weaker predictive performance than both the Intel Purley and K920 platforms.
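As a concrete illustration of the Section VI setup, the sketch below labels per-DIMM samples and trains a LightGBM classifier. The column names, hyperparameters, and the labeling helper are illustrative assumptions rather than the production configuration, which follows [29], [30].

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical feature columns; the real feature set follows [29], [30].
CATEGORICAL = ["manufacturer", "data_width", "chip_process"]
NUMERIC = ["ce_rate_1h", "ce_rate_5d", "dq_count", "beat_count",
           "row_faults", "bank_faults", "ce_storm_events"]

def label_dimm(t_now, ue_times, lead=pd.Timedelta("3h"), horizon=pd.Timedelta("30D")) -> int:
    """Positive (1) if the DIMM sees a UE inside [t_now + lead, t_now + lead + horizon].

    Simplified reading of the lead time and 30-day prediction window in Section IV.
    """
    return int(any(t_now + lead <= t <= t_now + lead + horizon for t in ue_times))

def train_and_evaluate(train_df: pd.DataFrame, test_df: pd.DataFrame, y_c: float = 0.1) -> dict:
    """Train on labelled per-DIMM samples and report the Section IV measures."""
    X_train = pd.get_dummies(train_df[CATEGORICAL + NUMERIC], columns=CATEGORICAL)
    X_test = pd.get_dummies(test_df[CATEGORICAL + NUMERIC], columns=CATEGORICAL)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

    model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
    model.fit(X_train, train_df["label"])
    pred = model.predict(X_test)

    precision = precision_score(test_df["label"], pred)
    recall = recall_score(test_df["label"], pred)
    virr = (1 - y_c / precision) * recall if precision > 0 else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1_score(test_df["label"], pred), "virr": virr}
```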
Fig. 5: Analyses of Error Bits in Intel Platforms: Highlighting The Highest Rate with Red Bar. [Figure: relative UE rate versus error DQ count, beat count, DQ interval, and beat interval, for the Intel Purley (top row) and Intel Whitley (bottom row) platforms.]

TABLE II: Overview of Algorithm Performance Comparisons. (X denotes the absence of prediction values)
Algorithm | Intel Purley (Precision / Recall / F1 / VIRR) | Intel Whitley (Precision / Recall / F1 / VIRR) | K920 (Precision / Recall / F1 / VIRR)
Risky CE Pattern [7] | 0.53 / 0.46 / 0.49 / 0.37 | X / X / X / X | X / X / X / X
Random forest | 0.61 / 0.62 / 0.61 / 0.52 | 0.34 / 0.46 / 0.39 / 0.32 | 0.44 / 0.51 / 0.47 / 0.39
LightGBM | 0.54 / 0.80 / 0.64 / 0.65 | 0.46 / 0.54 / 0.49 / 0.45 | 0.51 / 0.57 / 0.54 / 0.46
FT-Transformer | 0.49 / 0.74 / 0.59 / 0.58 | 0.53 / 0.49 / 0.50 / 0.40 | 0.40 / 0.54 / 0.46 / 0.41

VII. MLOPS OF FAILURE PREDICTION

After developing machine learning (ML) algorithms that accurately predict memory failures, it becomes crucial both to maintain and enhance these algorithms and to automate their operation within the data center. The MLOps framework is ideally suited for this purpose, ensuring the continuous accuracy and applicability of our memory failure prediction algorithms. Figure 6 illustrates an overview of the MLOps framework for memory failure prediction, with the workflow introduced in stages as follows:
Data Pipeline: The initial stage involves collecting raw data from various sources. For example, CE and UE logs are collected by the BMC and are then processed and stored in the Data Lake, alongside other data sources such as runtime workload metrics (e.g., CPU utilization) and environmental metrics (e.g., server locations, temperatures), using the Huawei Data Lake Insight (DLI) solution.
Feature Store: A feature store acts as a centralized repository for transforming, storing, cataloging, and serving the features used in model training and inference. It ensures consistency between the training and prediction phases and accelerates the development of machine learning models by making features readily accessible. This involves:
• Transformation: Converting raw data into features suitable for machine learning algorithms. This process is divided into batch and stream transformations for model training and online prediction, respectively.
• Storage: Once transformed, features are stored in an accessible format for training and prediction. They are cataloged with registry information to standardize features across all teams' models. For example, CEs are converted into temporal and spatial features within the DRAM hierarchy; this conversion includes the distribution of error bits across DQs and beats and the number of faults within different time intervals (1 min, 1 h, 5 d, etc.), while memory configurations such as manufacturer and DRAM process are further encoded as static features (a sketch of such a batch transformation follows this list).
• Serving: The feature store serves features for model training and inference, enabling Data Scientists to select features on demand for training different models based on varying requirements. For instance, Data Scientists might develop various models tailored to distinct CPU architectures, utilizing unique features for each.
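As a sketch of the batch transformation in the Storage step above, the function below aggregates raw CE events into time-windowed features. The event schema, the column names, and the chosen windows are illustrative assumptions, not the production feature registry.

```python
import pandas as pd

# Assumed raw schema: one row per CE event with a timestamp, its decoded
# location (device, row), and the per-event error-bit count (dq_count).
WINDOWS = {"1min": "1min", "1h": "1h", "5d": "5D"}

def batch_ce_features(ce_events: pd.DataFrame, now: pd.Timestamp) -> dict:
    """Batch transformation of one DIMM's CE log into temporal/spatial features."""
    features = {}
    for name, offset in WINDOWS.items():
        window = ce_events[ce_events["timestamp"] >= now - pd.Timedelta(offset)]
        features[f"ce_count_{name}"] = len(window)
        features[f"distinct_rows_{name}"] = window["row"].nunique()
        features[f"distinct_devices_{name}"] = window["device"].nunique()
        features[f"max_dq_count_{name}"] = int(window["dq_count"].max()) if len(window) else 0
    return features
```

A stream transformation for online prediction would apply the same aggregation incrementally, while static attributes such as manufacturer and DRAM process are encoded once and stored alongside these features.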
ML Deployment: This phase involves (1) model training, which comprises algorithm selection, hyperparameter tuning, and the application of the configured algorithms to the prepared datasets. This task can be performed manually by Data Scientists or through automated tools like AutoML. Once models are trained and show substantial improvements in predefined benchmark evaluations, they advance to deployment in the production environment. This deployment leverages a (2) Continuous Integration and Continuous Delivery (CI/CD) pipeline, which automates the integration, testing, and deployment of ML algorithms, thereby ensuring that models can be consistently updated and reliably released within the production environment. Subsequently, the successfully deployed models continue delivering (3) online prediction utilizing streaming data, and the prediction results are served to our cloud service.
Cloud Service: An alarm is triggered in the Cloud Alarm System upon predicting a memory failure. Depending on the use case, memory RAS techniques are then implemented to mitigate the failure, with VMs being migrated automatically or manually as required.
Monitoring: Throughout the MLOps workflow, each phase is continuously monitored through dashboards. This includes monitoring data collection rates, feature importance, and algorithm performance, as well as VM migrations and service interruptions. To enhance algorithm accuracy and ensure fairness in predictions, feedback is proactively gathered from the cloud service. Dashboards are implemented in both testing and production settings to closely observe algorithm performance, as well as the rates of false negatives and false positives. This dual-environment monitoring facilitates the ongoing refinement of the algorithms.
Fig. 6: The MLOps Framework of Failure Prediction. [Figure: the workflow spans a Data Pipeline (BMC log collection, workload and environment data sources, Data Lake), a Feature Store (batch and stream transformation into temporal, spatial, and static features, feature registry, on-demand feature selection, serving), ML Deployment (model training, model registry, CI/CD, online prediction), and the Cloud Service (Cloud Alarm System, memory RAS techniques, VM migration, feedback), with monitoring at every stage; Data Engineers, Data Scientists, and MLOps Engineers collaborate across the stages.]

In our memory failure prediction development, collaboration across various teams is essential. This collaborative effort spans from the Data Engineers, who are responsible for processing new data and integrating it into the Data Lake in response to Data Scientists' requests, to the Data Scientists, who analyze this data, develop predictive algorithms, and specify requirements for operational deployment, and to the MLOps Engineers, who take on the critical role of re-implementing and deploying the newly developed algorithms into the production environment.

VIII. RELATED WORK

Empirical research [8]–[12] has laid the groundwork in the study of memory errors, focusing on correlation analyses and failure modes. These works serve as foundational elements for developing memory failure prediction algorithms. This section highlights significant contributions in memory failure prediction.

The ensemble learning approaches [15], [17], [18], [20] have constantly improved memory failure prediction by leveraging correctable errors, event logs, and sensor metrics. Node/server-level memory unavailability prediction methods are proposed in [21], [28], considering both UE and CE storm/CE-driven prediction. Li et al. [7] explored correlations between CEs and UEs using error bit information and DIMM part numbers, creating a new risky CE indicator for UE prediction across different manufacturers and part numbers. Peng et al. [23] designed DRAM failure forecasters by utilizing different UCE types. Yu et al. [29], [30] further examined the distribution of error bits and proposed a hierarchical, system-level approach for predicting memory failures by utilizing error-bit features.

However, the literature mentioned does not examine failure patterns across various processor platforms, nor does it engage in the development of ML models specifically designed for distinct CPU architectures to improve prediction. In our previous work [41], we explored memory failure patterns across various CPU architectures. In this extended version, we further expand the fault analysis, present four findings, and establish an MLOps framework to continuously improve failure prediction models throughout their lifecycle.

IX. CONCLUSION

We present the first comprehensive analysis of DRAM failures spanning both X86 and ARM systems across various platforms in large-scale datacenters. From our analytical and predictive modeling work, we report four findings: 1) UE and sudden UE rates differ between X86 and ARM systems. 2) Fault modes vary across architectures. 3) Bit-level failure patterns of DRAM are architecture-dependent. 4) Prediction accuracy differs by platform. Utilizing datasets from the production environment, our approach showcased a 15% enhancement in F1-score compared to the method in [7], specifically within the Intel Purley platform. Moreover, we executed initial experiments on the Intel Whitley and ARM-based platforms, achieving F1-scores of 0.50 and 0.54, along with VIRRs of 0.45 and 0.46, respectively. Finally, we present our MLOps framework for memory failure prediction, implemented in the production environment. This framework is designed to ensure the continuous enhancement and maintenance of failure prediction performance.

ACKNOWLEDGEMENT

We thank the anonymous reviewers from DSN'24 for their constructive comments.
REFERENCES

[1] "Intel MCA+MFP Helps JD Stable and Efficient Cloud Services." [Online]. Available: https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2023-12/mca-mfp-helps-jd-stable-and-efficient-cloud-services.pdf
[2] G. Wang, L. Zhang, and W. Xu, "What can we learn from four years of data center hardware failures?" in 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2017, pp. 25–36.
[3] P. Notaro, Q. Yu, S. Haeri, J. Cardoso, and M. Gerndt, "An optical transceiver reliability study based on SFP monitoring and OS-level metric data," in 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 2023, pp. 1–12.
[4] M. Y. Hsiao, "A class of optimal minimum odd-weight-column SEC-DED codes," IBM Journal of Research and Development, 1970.
[5] T. J. Dell, "A white paper on the benefits of chipkill-correct ECC for PC server main memory," in Computer Science, 1997. [Online]. Available: https://asset-pdf.scinapse.io/prod/48011110/48011110.pdf
[6] "Intel® E7500 chipset MCH Intel® x4 Single Device Data Correction (x4 SDDC) implementation and validation." [Online]. Available: https://www.intel.com/content/dam/doc/application-note/e7500-chipset-mch-x4-single-device-data-correction-note.pdf
[7] C. Li, Y. Zhang, J. Wang, H. Chen, X. Liu, T. Huang, L. Peng, S. Zhou, L. Wang, and S. Ge, "From correctable memory errors to uncorrectable memory errors: What error bits tell," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '22. IEEE Press, 2022.
[8] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: A large-scale field study," in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 193–204. [Online]. Available: https://doi.org/10.1145/1555349.1555372
[9] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, "Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field," in 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Rio de Janeiro, Brazil: IEEE, 2015, pp. 415–426.
[10] V. Sridharan and D. Liberty, "A study of DRAM failures in the field," in SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1–11.
[11] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi, "Memory errors in modern systems: The good, the bad, and the ugly," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 297–310. [Online]. Available: https://doi.org/10.1145/2694344.2694348
[12] M. V. Beigi, Y. Cao, S. Gurumurthi, C. Recchia, A. Walton, and V. Sridharan, "A systematic study of DDR4 DRAM faults in the field," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 991–1002.
[13] M. Patel, T. Shahroodi, A. Manglik, A. G. Yaglikci, A. Olgun, H. Luo, and O. Mutlu, "A case for transparent reliability in DRAM systems," 2022.
[14] J. Jung and M. Erez, "Predicting future-system reliability with a component-level DRAM fault model," in Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 944–956. [Online]. Available: https://doi.org/10.1145/3613424.3614294
[15] I. Giurgiu, J. Szabo, D. Wiesmann, and J. Bird, "Predicting DRAM reliability in the field with machine learning," in Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference: Industrial Track, ser. Middleware '17. New York, NY, USA: Association for Computing Machinery, Dec. 2017, pp. 15–21. [Online]. Available: https://doi.org/10.1145/3154448.3154451
[16] X. Du and C. Li, "Memory failure prediction using online learning," in Proceedings of the International Symposium on Memory Systems, ser. MEMSYS '18. New York, NY, USA: Association for Computing Machinery, Oct. 2018, pp. 38–49. [Online]. Available: https://doi.org/10.1145/3240302.3240309
[17] X. Du, C. Li, S. Zhou, M. Ye, and J. Li, "Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data," in 2020 16th European Dependable Computing Conference (EDCC), Sep. 2020, pp. 41–46, ISSN: 2641-810X.
[18] I. Boixaderas, D. Zivanovic, S. Moré, J. Bartolome, D. Vicente, M. Casas, P. M. Carpenter, P. Radojković, and E. Ayguadé, "Cost-aware prediction of uncorrected DRAM errors in the field," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '20. Atlanta, Georgia: IEEE Press, Nov. 2020, pp. 1–15.
[19] F. Yu, H. Xu, S. Jian, C. Huang, Y. Wang, and Z. Wu, "DRAM failure prediction in large-scale data centers," in 2021 IEEE International Conference on Joint Cloud Computing (JCC). Los Alamitos, CA, USA: IEEE Computer Society, Aug. 2021, pp. 1–8. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/JCC53141.2021.00012
[20] X. Du and C. Li, "Predicting uncorrectable memory errors from the correctable error history: No free predictors in the field," in The International Symposium on Memory Systems, ser. MEMSYS 2021. New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3488423.3519316
[21] Z. Cheng, S. Han, P. P. C. Lee, X. Li, J. Liu, and Z. Li, "An in-depth correlative study between DRAM errors and server failures in production data centers," in 2022 41st International Symposium on Reliable Distributed Systems (SRDS), 2022, pp. 262–272.
[22] J. Bogatinovski, O. Kao, Q. Yu, and J. Cardoso, "First CE matters: On the importance of long term properties on memory failure prediction," in 2022 IEEE International Conference on Big Data (Big Data), 2022, pp. 4733–4736.
[23] X. Peng, Z. Huang, A. Cantrell, B. H. Shu, K. K. Xie, Y. Li, Y. Li, L. Jiang, Q. Xu, and M.-C. Yang, "Expert: Exploiting DRAM error types to improve the effective forecasting coverage in the field," in 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S), 2023, pp. 35–41.
[24] Z. Liu, C. Benge, Y. Dagli, and S. Jiang, "STIM: Predicting memory uncorrectable errors with spatio-temporal transformer."
[25] X. Sun, K. Chakrabarty, R. Huang, Y. Chen, B. Zhao, H. Cao, Y. Han, X. Liang, and L. Jiang, "System-level hardware failure prediction using deep learning," in 2019 56th ACM/IEEE Design Automation Conference (DAC), Jun. 2019, pp. 1–6, ISSN: 0738-100X.
[26] L. Mukhanov, K. Tovletoglou, H. Vandierendonck, D. S. Nikolopoulos, and G. Karakonstantis, "Workload-aware DRAM error prediction using machine learning," in 2019 IEEE International Symposium on Workload Characterization (IISWC), 2019, pp. 106–118.
[27] X. Wang, Y. Li, Y. Chen, S. Wang, Y. Du, C. He, Y. Zhang, P. Chen, X. Li, W. Song, Q. Xu, and L. Jiang, "On workload-aware DRAM failure prediction in large-scale data centers," in 2021 IEEE 39th VLSI Test Symposium (VTS), 2021, pp. 1–6.
[28] P. Zhang, Y. Wang, X. Ma, Y. Xu, B. Yao, X. Zheng, and L. Jiang, "Predicting DRAM-caused node unavailability in hyper-scale clouds," in 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2022, pp. 275–286.
[29] Q. Yu, W. Zhang, P. Notaro, S. Haeri, J. Cardoso, and O. Kao, "HiMFP: Hierarchical intelligent memory failure prediction for cloud service reliability," in 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2023, pp. 216–228.
[30] Q. Yu, W. Zhang, J. Cardoso, and O. Kao, "Exploring error bits for memory failure prediction: An in-depth correlative study," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, Oct. 2023. [Online]. Available: http://dx.doi.org/10.1109/ICCAD57390.2023.10323692
[31] "Memory RAS configuration user's guide." [Online]. Available: https://www.supermicro.com/manuals/other/Memory RAS Configuration User Guide.pdf
[32] X. Du and C. Li, "DPCLS: Improving partial cache line sparing with dynamics for memory error prevention," in 2020 IEEE 38th International Conference on Computer Design (ICCD), 2020, pp. 197–204.
[33] C.-S. Hou, Y.-X. Chen, J.-F. Li, C.-Y. Lo, D.-M. Kwai, and Y.-F. Chou, "A built-in self-repair scheme for DRAMs with spare rows, columns, and bits," in 2016 IEEE International Test Conference (ITC), 2016, pp. 1–7.
[34] X. Du, C. Li, S. Zhou, X. Liu, X. Xu, T. Wang, and S. Ge, "Fault-aware prediction-guided page offlining for uncorrectable memory error prevention," in 2021 IEEE 39th International Conference on Computer Design (ICCD), 2021, pp. 456–463.
[35] X. Jian and R. Kumar, "Adaptive reliability chipkill correct (ARCC)," in 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), 2013, pp. 270–281.
[36] X. Du and C. Li, “Combining error statistics with failure prediction in
memory page offlining,” in Proceedings of the International Symposium
on Memory Systems, ser. MEMSYS ’19. New York, NY, USA:
Association for Computing Machinery, 2019, p. 127–132. [Online].
Available: https://doi.org/10.1145/3357526.3357527
[37] D. Tang, P. Carruthers, Z. Totari, and M. Shapiro, “Assessment of the
effect of memory page retirement on system ras against hardware faults,”
in International Conference on Dependable Systems and Networks
(DSN’06), 2006, pp. 365–370.
[38] K. A. Alharthi, A. Jhumka, S. Di, L. Gui, F. Cappello, and S. McIntosh-
Smith, “Time machine: Generative real-time model for failure (and
lead time) prediction in hpc systems,” in 2023 53rd Annual IEEE/IFIP
International Conference on Dependable Systems and Networks (DSN),
2023, pp. 508–521.
[39] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, “Revisiting
deep learning models for tabular data,” Advances in Neural Information
Processing Systems, vol. 34, pp. 18 932–18 943, 2021.
[40] L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based
models still outperform deep learning on typical tabular data?” Advances
in Neural Information Processing Systems, vol. 35, pp. 507–520, 2022.
[41] Q. Yu, J. Cardoso, and O. Kao, “Unveiling dram failures across
different cpu architectures in large-scale datacenters,” in 2024 IEEE 44th
International Conference on Distributed Computing Systems (ICDCS).
IEEE, 2024, pp. 1–2, to appear.
