Modeling Repairable System Failure Data Using
NHPP Reliability Growth Model
A Thesis
Presented To
Eastern Washington University
Cheney, Washington
By
Eunice Ofori-Addo
Winter 2023
THESIS OF EUNICE OFORI-ADDO APPROVED BY
ABSTRACT
by
Eunice Ofori-Addo
Winter 2023
Stochastic point processes have been widely used to describe the behaviour of
repairable systems. The Crow nonhomogeneous Poisson process (NHPP), often
known as the Power Law model, is regarded as one of the best models for repairable
systems. For our case-study data, the goodness-of-fit test rejects the intensity
function of the Power Law model, and so the log-linear model was fitted and tested
for goodness-of-fit. The Weibull Time to Event recurrent neural network (WTTE-RNN)
framework, a probabilistic deep learning model for failure data, is also explored.
However, we find that the WTTE-RNN framework is only appropriate for failure data
with independent and identically distributed interarrival times of successive
failures, and so it cannot be applied to a nonhomogeneous Poisson process.
ACKNOWLEDGMENTS
We would like to thank Evan Felix and David Brown from PNNL for collecting
the data and sharing it. The data was collected and made available using the
Molecular Science Computing Facility (MSCF) in the William R. Wiley Environ-
mental Molecular Sciences Laboratory, a national scientific user facility sponsored
by the U.S. Department of Energy’s Office of Biological and Environmental Re-
search and located at the Pacific Northwest National Laboratory, operated for
the Department of Energy by Battelle.
I would also like to specifically thank Dr. Christian Hansen, my thesis advisor,
for introducing me to the thesis topic and the field of reliability and for his helpful
comments, which led to improvements in the presentation of this thesis. Thank
you, Dr. Xiuqin Bai and Mrs. Lynnae Daniels, for agreeing to serve on my thesis
committee.
Table of Contents
1 Introduction
  1.1 Objective of Study
  1.2 A Brief History of Reliability
  1.3 Thesis Outline
2 Literature Review
3 Methodology
  3.1 Basic Reliability Terms
  3.2 Basic Concepts for Stochastic Point Processes
  3.3 Models Applicable to Repairable Systems
  3.4 NHPP Reliability Growth Models
  3.5 Nonparametric Methods
  3.6 Probabilistic Deep Learning Model for Failure Data
4 Case Study
  4.1 Data
  4.2 Results
5 Conclusion
6 Vitae
List of Figures
1 Times Between Successive Failures of Happy, Sad and Non-Committal Systems (Source: Ascher & Feingold, 1984)
2 Example of three types of cumulative hazard function: (a) constant hazard rate, (b) increasing hazard rate and (c) decreasing hazard rate
3 Bathtub shaped ROCOF
4 MCNF plots for three different types of systems (Source: [59])
5 Types of Censoring
6 Relationship between the number of failures N(t), the interarrival times Xi, and failure times Ti
7 Superposition of Renewal Processes
8 Layers of a Recurrent Neural Network
9 X* represents censored age of component
10 Mean cumulative number of failures from the HPC System
11 Fit of HPP Model
12 Fit of Power Law Process
13 Fit of Cox-Lewis Log-Linear Model
14 Plot of Failure Intensity Function and Cumulative Failures for the Power Law Process
15 Plot of Failure Intensity Function and Cumulative Failures for the Log-Linear Model
1 Introduction
Tragedies like the Deepwater Horizon oil spill, the Boeing 737 Max airplane ac-
cidents, and the Chernobyl disaster brought reliability issues in design to light.
Reliability is the ability of a system or component to perform its required func-
tions under specified conditions for a specified period of time, according to the
IEEE Standard Computer Dictionary.[29] It is a measure of the likelihood that a
system will not fail. A system is deemed reliable if it satisfies the specified perfor-
mance standards and operates faultlessly for a specified period. Probability and
statistics serve as good tools for studying reliability: while probabilistic
modeling and statistical analysis cannot directly improve reliability, they can be
used to predict reliability using experimental data, test data, or field performance
failure data.[31] Stochastic processes are among the most powerful mathematical tools
for studying models in reliability theory. This study presents the fundamental
concepts in stochastic modeling for repairable systems. To minimize failures in
repairable systems, it is crucial to understand how frequently failures might happen,
which involves predicting when failures will occur. As a result, reliability
engineers and practitioners must be educated on stochastic processes and models
that are important for system reliability.
For the purpose of this thesis, a system is defined as a collection of two or more
parts which is designed to carry out one or more functions.[7] Systems can be
classified into two categories: repairable systems and non-repairable systems.
Non-repairable systems are those that are not repaired when they fail; they are
discarded after failure. A light bulb is an example of a non-repairable system.
A broken light bulb cannot be fixed and must be replaced. A satellite is also
considered non-repairable because of its complexity and location in space. Once
a satellite is launched into space, it is not easily accessible for repairs. Since
most systems are, at least in principle, repairable in nature, a non-repairable
system is commonly referred to as a component or part. Typically, a repairable
system is made up of components or parts that are discarded or replaced completely
upon failure. Components are parts of the larger system that have a direct effect on
the system’s reliability.
As the name implies, repairable systems are restored to operation upon failure
by means other than replacing the entire system. Ascher and Feingold[7] in their
book define a repairable system as "a system which, after failing to perform
one or more of its functions satisfactorily, can be restored to fully satisfactory
performance by any method, other than replacement of the entire system."
Repairable systems house components that are non-repairable. Common examples
of repairable systems include automobiles, computers, printers, etc. If a component
or subsystem fails and renders an automobile inoperative, that component is
typically repaired or replaced rather than purchasing a new vehicle. The engine,
transmission, brakes, tires, and electrical systems are all examples of repairable
parts of a car. When these parts break down or fail, they can be fixed to get the
car back to working properly. Being able to repair an automobile can increase its
lifespan and decrease the need for complete replacement, which can be advantageous
from an economic standpoint. In other cases, repairing the system can be more
expensive than replacing it entirely. An example is mobile devices. Repairing
a broken smartphone or laptop, such as one with a cracked screen, a motherboard
issue, or an issue with other internal components, can frequently be more costly
than purchasing a new one. This is due to the high cost of replacement parts, as
well as the specialized skills required to repair the complex system. It is
important to understand the type of system being analyzed and use the appropriate
reliability methods and tools.
In this study, when we refer to a system, we mean a repairable system. Ascher and
Feingold presented the following example of a "happy", "sad" and "non-committal"
system from different sets of data. Figure 1 depicts failure data from three
different systems. The first system, the "happy" system, fails less and less
frequently over time. Failures occur more and more frequently in the "sad" system,
and at a nearly constant rate in the "non-committal" system. The happy system
suggests that system reliability improves with age, whereas the sad system suggests
that system reliability deteriorates with age. The non-committal system's
reliability is neither improving nor deteriorating. According to Ascher and
Feingold, practitioners and reliability engineers are often unaware of the
different cases of interarrival times of failures in a system and mistakenly
believe that all systems have independent and identically distributed (i.i.d.)
times between subsequent failures. They did, however, highlight the use of point
process models to analyse data from repairable systems.
During WWII, a group in Germany led by Wernher von Braun worked on developing
the V-1 missile.[4] Following the war, it was revealed that the first ten V-1
missiles were all failures. Despite efforts to supply high-quality parts and pay
close attention to detail, all of the first missiles either exploded on the launch
pad or landed "too soon" (in the English Channel). Throughout the 1940s and 1950s,
poor field reliability of military equipment drew attention to the need for more
formal methods of reliability engineering. Ad hoc studies were initiated under
the leadership of the US Department of Defense (DoD), which eventually coalesced
into a new discipline, reliability engineering. Several groups began formal
research on reliability issues, and these had a significant impact on the
statistical treatment of the area. The United States Department of Defense
established the Advisory Group on the Reliability of Electronic Equipment (AGREE)
in 1952, and it later produced widely used specification standards for the
reliability of electronic equipment. The Professional Group on Quality Control
of the IRE was formed in 1949 and became the IEEE Reliability Society in 1963.
Also, the military handbook MIL-HDBK-217 was published in the 1960s. It was
intended to provide a standard for the prediction of failures of electronic
military parts and systems in order to improve the reliability of the equipment
being designed.
In the 1970s, interest in the risks and safety issues associated with the
construction and operation of nuclear power plants grew in the United States as
well as other parts of the world. A large research commission led by Professor
Norman Rasmussen was formed in the United States to investigate the problem. The
multimillion-dollar project yielded the Rasmussen report, WASH-1400 (NUREG-75/014).
Despite its flaws, this report was the first serious safety analysis of a system
as complex as a nuclear power plant.[2] Similar research has been conducted in
Europe and Asia. The oil crisis renewed interest in energy efficiency in Norway,
particularly in the offshore oil industry. Engineers developed and used risk
analysis and decision analysis techniques to improve the reliability of oil and
gas pipelines while also supporting the reduction in asset costs.
The semiconductor industry began to expand in the 1980s, and the field of re-
liability engineering began to be applied to the software industry. Reliability
engineering entered the digital age in the 1990s. As more industries began to use
computers and software, having reliable software and hardware became increas-
ingly vital.
Work at the UK Atomic Energy Authority (UKAEA), the Royal Radar Establishment
(RRE), and the Rome Air Development Center (RADC) in the US led to the creation
of failure data banks. Since the 1960s, failure data have been published.
2 Literature Review
Over the past 50 years, the theory and methods of repairable system reliability
have been extensively developed and acknowledged in a number of publications.
For this section of the study, we investigate related studies on statistical methods
and mathematical models relevant to the reliability of repairable systems.
One of the key areas of research on statistical methods for reliability in the 1960s
and 1970s was concerned with drawing parametric inferences related to compo-
nents using univariate life distributions considering both censored and uncen-
sored observations. Much of the work in this area is reviewed in books by Bain
(1978), Lawless (1982), and Nelson (1982).[9, 40, 48] The exponential, lognormal,
gamma, and Weibull distributions are all commonly used for modeling component
failure data, though the Weibull and exponential are more popular in reliability
practice. Prior research on parametric statistical inference with both complete
and censored data focused on the exponential distribution (e.g., Bartholomew
1957).[11] Up until the 1960s, almost all the studies on statistical analysis of the
reliability of components assumed failure times to be exponentially distributed.
However, the study by Zelen et al. (1961)[67] showed that the exponential
distribution was not an appropriate model in many situations, and the Weibull
distribution became more popular for life distributions.
ley distribution has been used for modeling failure data and reliability.[39, 10]
The Lindley distribution was developed by British statistician D. V. Lindley in
a paper published in 1958.[42] The properties of the distribution itself remained
relatively unstudied until a 2008 publication by Ghitany et al., but even so, the
Lindley distribution has been used to model real-world data, including failure
data.
asymptotically normal under certain conditions. Nelson (1982) presents hazard
plots for multiply censored data.[47] Nelson (1982) and Lawless (1982)[30]
have shown that the cumulative hazard plot can be useful for rough estimation
of parameters in parametric models and distributions. In 2020, Jiang et al.[35]
developed a non-parametric likelihood based estimation procedure for left trun-
cated and right censored data using B-splines. This method is useful for dealing
with data that was collected considerably later than the system’s production or
installation date.
In an HPC system, replacing components upon failure does not restore the sys-
tem to "as good as new" condition, hence Crow's (1993) claim that the renewal
process is not a suitable model.[22] There are two approaches that have been
adopted in relevant studies to model the reliability of repairable systems: stochas-
tic point processes and differential equations. Ascher and Feingold (1984) pro-
vide a brilliant survey and discussion of five stochastic point process models that
are applicable to repairable systems. They emphasize the importance of the
times between failures being independently and identically distributed for the
Homogeneous Poisson Process (HPP). Çinlar (1975) defines a HPP as an orderly
stochastic process with stationary, independent increments. HPP is the simplest
model for failures in a repairable system.[17] Many researchers have looked into
the link between the nonhomogeneous Poisson process (NHPP) and the HPP
including Brown and Proschan (1983), Bartoszyński et al. (1981) and Feigin
(1979).[12, 16, 27] In NHPP models, the intensity function is assumed to depend
on the cumulative system operating time, i.e. the age of the system, and not
necessarily on the time of the most recent failure (Ascher & Feingold, 1984).
Modeling repairable systems that deteriorate or improve over time requires the
use of an NHPP. A model proposed by Ibrahimi (1993) on conditional NHPPs provided
valuable techniques for analyzing the reliability of complex repairable systems
that are influenced by external factors; his work on conditional NHPPs can also
be utilized to model reliability growth.[25] The most commonly used model for
reliability growth was proposed by Crow (1974): the Weibull Process (different
from the Weibull distribution), also known as the Power Law model. Other NHPP
reliability growth models include the Cox-Lewis (Cozzolino) model proposed by
Cox & Lewis (1966) and later by Cozzolino (1968).[19, 20] This model, also known
as the log-linear model, is utilized in situations where the Power Law model is
rejected by a goodness-of-fit test. There is also a generalized version,
Cozzolino's "Initial Defects" model.
The differential equation reliability growth models are based on an approach
quite different from the point process approach. These models can be a very
useful technique for reflecting known underlying mechanisms which contribute to
reliability growth. For instance, if the rate of improvement is known to be
inversely proportional to some power of time, this fact can be explicitly
considered. Schafer et al. (1975) treat these models in detail in their article
"Reliability Growth Study".[54] Differential equation reliability models include
what has become known as the IBM model by Rosner (1961)[51], the exponential
single-term power series model by Perkowski and Hartvigsen (1962), etc. The
Lloyd-Lipow model developed by Lloyd and Lipow (1962) estimates the reliability
of a system comprised of a single failure mode.[44] Other models include the
Aroef model by Aroef (1957) and the Simple Exponential Model.
Much research on repairable systems analysis has been conducted; here are a few
more notable works. In the study by Ascher and Hansen (1998) titled "Spurious
Exponentiality Observed When Incorrectly Fitting a Distribution to Nonstationary
Data"[8], the authors stressed that ignoring the chronological ordering of
interarrival times may lead to misleading results about the system's behaviour.
They also refuted the notion that the exponential distribution and the HPP can
be used interchangeably. Ascher and Hansen point out, "The close mathematical
relationship between the HPP and the exponential distribution has led many
practitioners to incorrectly use the two concepts interchangeably, and many
falsely believe that if the assumed 'distribution' of interarrival times is
exponential, e.g., when represented in a histogram, then it follows that the HPP
model can be justified as an appropriate model for the system failures." In
addition to failure time data, modern reliability databases typically also
include information on the type of failure, the type of maintenance, and other
factors. For recent
literature, Lindqvist (2007) reviewed basic modeling approaches for failure and
maintenance data from repairable systems and presented a framework where the
observed events are modeled as marked point processes, with marks labeling the
types of events.[43]
There has been growing literature in the area of machine learning for reliability
engineering. The rise in data availability and computer capacity in recent years
has fostered significant progress in machine learning research. This will have a
profound impact on academia and industry. Supervised learning algorithms have
been used to estimate the remaining useful life (RUL), which is the number of
remaining years that a component will function. Kang et al. (2021) propose a
model that applies normalization and principal component analysis for predicting
the remaining life of equipment in continuous production lines.[37]
Vanderhaegen et al. adopted a deep neural network model to predict a specific
human car-driving violation.[60] Recurrent neural networks (RNNs) are a type of
neural network suited to handling time series data and other sequential data.[23]
In his thesis, Egil Martinsson (2017)[45] developed an RNN model for predicting
time to event. The model, known as the Weibull Time to Event Recurrent Neural
Network (WTTE-RNN), estimates the time to the next event in the case of discrete
or continuous censored data and outputs parameters of the distribution of time
to the next event.
3 Methodology
3.1 Basic Reliability Terms
Reliability Function: The reliability function R(x), also known as the survival
function S(x), is the probability of an item operating for a certain amount of
time without failure. The reliability function is the complement of the cumulative
distribution function. If X is a random variable representing the time to failure
of a component, the reliability function R(x) is defined as

R(x) = P (X > x)
R(x) represents the probability that the component is operating correctly at time
x. R(x) is a monotone non-increasing function of x. The item is assumed to
be working properly or operational at time x = 0 and no item can work forever
without failure: i.e. R(0) = 1 and lim_{x→∞} R(x) = 0. We can also define the
probability of failure of component at or before time x, F (x), the cumulative
distribution function (CDF) as:
F (x) = 1 − R(x)
Hazard Rate: The hazard rate h(x) is the instantaneous rate of failure at time x,
given that the item has survived to time x. It is defined as

\[ h(x) = \lim_{dx \to 0} \frac{P(x < X \le x + dx)}{dx} \cdot \frac{1}{S(x)} = \frac{f(x)}{1 - F(x)} \]
The simplest models, such as MIL-HDBK-217F (1991), assume a constant hazard
rate. However, field data revealed that most systems do not have a constant
hazard rate. Some probability distributions such as the Weibull distribution are
often used in reliability engineering to represent time-dependent failure behavior.
Other models, such as the "roller coaster" model[65] and mixed distribution
models[33], have been proposed to model hazard rates.
Figure 2: Example of three types of cumulative hazard function: (a) constant hazard rate, (b) increasing hazard rate, (c) decreasing hazard rate.
In Figure 2, H(x) is plotted against x. For a constant hazard rate, the plot of
H(x) is a linearly increasing function of time, implying that the hazard rate
does not change with age. A concave-up plot of the CHF against time implies an
increasing hazard rate, whereas a concave-down plot implies a decreasing hazard
rate.
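To make these quantities concrete, the following Python sketch (not from the thesis; the Weibull parameters are illustrative) evaluates R(x), h(x), and H(x) for a Weibull lifetime and checks the complement relation F(x) = 1 − R(x); SciPy's weibull_min is assumed to be available.

# Sketch (not from the thesis): Weibull reliability, hazard, and cumulative
# hazard, illustrating R(x) = 1 - F(x) and the concavity of H(x).
import numpy as np
from scipy.stats import weibull_min

alpha, beta = 10.0, 0.7                      # beta < 1 -> decreasing hazard rate
x = np.linspace(0.1, 30, 300)

R = weibull_min.sf(x, beta, scale=alpha)     # reliability R(x) = P(X > x)
F = weibull_min.cdf(x, beta, scale=alpha)    # CDF F(x) = 1 - R(x)
f = weibull_min.pdf(x, beta, scale=alpha)    # density f(x)
h = f / R                                    # hazard h(x) = f(x) / (1 - F(x))
H = -np.log(R)                               # cumulative hazard H(x) = -ln R(x)

assert np.allclose(F, 1 - R)
# For beta < 1, H(x) is concave down (decreasing hazard rate);
# for beta > 1 it is concave up (increasing hazard rate).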
Rate of Occurrence of Failures (ROCOF): The ROCOF, m(t), is the derivative of the
mean cumulative number of failures M(t) = E[N(t)]:

\[
m(t) = \lim_{\Delta t \to 0} \frac{M(t + \Delta t) - M(t)}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{E[N(t + \Delta t)] - E[N(t)]}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{E[N(t + \Delta t) - N(t)]}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{E[N(t, t + \Delta t)]}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{\text{expected number of failures in } (t, t + \Delta t]}{\Delta t}
\]
The ROCOF m(t) has some resemblance to the hazard rate h(x).[32] The quantity
h(x)∆x is approximately the probability that a component will fail within the
time interval (x, x + ∆x] given that the component is operating at time x.
Similarly, the quantity m(t)∆t is approximately the probability that a system
will fail within the time interval (t, t + ∆t]. Given these similarities, it is
not surprising that the two functions are often confused with one another.[32]
Bathtub Curve: The ROCOF is generally a function of time and may follow
the bathtub curve. Figure 3 depicts a typical bathtub curve. This is perhaps
the most famous graphical representation in the field of reliability. It is named
the bathtub curve based on the fact that its shape resembles a cross-sectional
view of a bathtub. The curve is divided into three different sections. The first
section is the burn-in period, also known as the early failure or infant
mortality period. The middle section is referred to as the useful life or random
failures period, and it is assumed that failures occur at random, with a constant
ROCOF. The wear-out failures are described in the latter section of the curve,
and it is assumed that the ROCOF increases as wear-out accelerates. The burn-in
stage is characterized by a high initial ROCOF that gradually decreases as
defective parts fail early. This is followed by a slower, steady ROCOF dominated
by randomly distributed failures. Failures during this time frame are usually
covered by warranties. As parts wear out, the ROCOF rises again at the end of
life. Failures in this stage can be attributed to aging, fatigue, wear-out, etc.,
and they are to be expected.
Accelerated life testing[46] subjects the components or subsystems under test to
extreme operating and/or environmental conditions, such as high temperatures,
in order to induce failures in a short period of time. This enables the fast identi-
fication of weaker components.
Ascher (1984) also suggests that the bathtub curve for a repairable system sub-
stantially differs from that of a component or part. He claimed that the time
between successive failures in the burn-in phase of the repairable system tends to
increase, so the rate of occurrence of failures tends to decrease. However, failures
of a part or component tend to occur more frequently, i.e., the interarrival times
tend to become smaller in the initial phase of the bathtub curve. Even though he
does not discuss the reasons for these types of behaviors, these differences in be-
haviors could be attributed to the lifetime of the systems. Typically, a repairable
system has an unlimited lifetime, while a component has only one.
Mean Cumulative Number of Failures (MCNF): The plot of the mean cumulative
number of failures is a useful tool for analyzing failure data as it reveals
trends in the occurrence of failures. Figure 4 shows three different cumulative
plots. The plot of the MCNF reveals the evolution of failures with time. The
shape of the curve indicates whether the number of failures in the system is
increasing (or worsening), decreasing (or improving) or staying the same (or
stable) over time (see Figure 4). The system has a decreasing ROCOF and is
improving if the MCNF curve is concave-down. A trendless (straight line) MCNF
curve indicates that the system is staying the same. A concave-up or steep MCNF
curve shows the system has an increasing ROCOF and is getting worse, as there
are more failures as time progresses.
The MCNF is calculated incrementally for each failure event, taking into account
the number of systems at risk at the time. The MCNF must be adjusted for the
presence of censoring, i.e. when some observations do not have complete failure
information. MCNF can be calculated based on parametric and non-parametric
methods. These methods will be discussed in later sections.
Mean Time Between Failures (MTBF): The MTBF, the mean operating time between
failures, is a commonly used metric to represent the overall reliability of
repairable systems. It represents how long a repairable system can operate
without being interrupted by a failure. Let Xi for i = 1, 2, 3, . . . represent
the times between failures, or interarrival times; the mean or expected time
between failures is E[Xi].
The MTBF can also be defined as the area under the reliability function R(x):

\[ \mathrm{MTBF} = \int_0^{\infty} R(x)\,dx \]
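As a quick numerical check of this identity, a minimal Python sketch (not from the thesis; the exponential lifetime and rate are illustrative) integrates R(x), where the MTBF should equal 1/ρ.

# Sketch (not from the thesis): MTBF as the integral of the reliability
# function, for an exponential lifetime with rate rho (MTBF = 1/rho).
import numpy as np
from scipy.integrate import quad

rho = 0.5                               # illustrative failure rate
R = lambda x: np.exp(-rho * x)          # reliability of an exponential lifetime

mtbf, _ = quad(R, 0, np.inf)            # MTBF = integral_0^inf R(x) dx
print(mtbf)                             # ~2.0 = 1/rho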
The use of MTBF as a reliability metric assumes the failure process is a renewal
process, i.e., the system is "as good as new" after each repair. It also assumes
the times between failures are independent and exponentially distributed with a
constant rate of occurrence of failures (i.e., an HPP), meaning there are no
early failures or wear-out. In their research[59], Trindade and Nathan argue
that MTBF masks information and fails to account for trends in failure data.
Different systems may have the same MTBF but very different failure behavior.
Mean Time To Failure (MTTF): This is the same as the mean time between
failures but for non-repairable systems or components. MTTF is a maintenance
metric that measures the average amount of time a component operates before it
fails.
Censoring: This occurs when the precise failure time of the part or component
is unknown. Data for which the exact failure time is known is referred to as
"complete data." It is not always possible to collect data for a lifetime
distribution in a complete form. As a result, one may have to handle data that
is censored or truncated. There are two common types of censored data in
reliability: right censoring and left censoring.
1. Right Censoring. Observations may cease before the failure has occurred at
time T . When a component’s failure time is unknown but only known to
exceed a certain point in time, it is said to be right censored. This can also
be the case when the component is removed from the observation before it
failed. C denotes the time at which observation ceases.
Ti = min(T, C)
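The construction Ti = min(T, C) is easy to express in code. The sketch below (not from the thesis; the failure times are simulated) builds right-censored observations together with the event indicator used later by the nonparametric estimators and the WTTE-RNN loss.

# Sketch (not from the thesis): right-censored observations T_i = min(T, C)
# with an event indicator (1 = failure observed, 0 = censored).
import numpy as np

rng = np.random.default_rng(0)
T = rng.weibull(1.5, size=8) * 10.0     # true (possibly unobserved) failure times
C = np.full(8, 9.0)                     # observation ends at time C = 9

time = np.minimum(T, C)                 # observed time T_i = min(T, C)
event = (T <= C).astype(int)            # 1 if the failure occurred before C
print(np.round(time, 2), event)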
Figure 5: Types of Censoring
3.2 Basic Concepts for Stochastic Point Processes
A point process is a stochastic model that describes the occurrence of events over
time. A model of a repairable system must describe the occurrence of events over
time. In the context of reliability, events are typically referred to as
"failures". In repairable systems analysis, the two main models used are stochastic point
processes and differential equations. This thesis will focus on only stochastic
point processes for modeling failures in a repairable system.
A stochastic point process is simply a random collection of points that fall into
some space. The continuum is time in our application, even though Cox (1966)
labels this an oversimplification. Hence, we are dealing with temporal point pro-
cesses. Temporal point processes represent a set of observed events that occur at
various points in time.
A counting process {N(t), t ≥ 0} satisfies the following conditions:
1. N(t) ≥ 0.
2. N(t) is integer valued.
3. If s < t, then N(s) ≤ N(t).
4. For s < t, N(t) − N(s) represents the number of failures that have occurred
in the interval (s, t].
Figure 6: Relationship between the number of failures N (t), the interarrival times,
Xi and failure times, Ti .
The random variable N(t) is the number of failures which occur during (0, t].
{N(t), t ≥ 0} is the integer-valued counting process which includes both the
number of failures in (0, t], N(t), and the moments in time T1, T2, T3, . . . at
which they occur. N(t) represents the cumulative number of failures. The expected
value of N(t), denoted by M(t) = E[N(t)], is the MCNF.
Independent Increments
If a counting process has independent increments, then there is no dependence
between the number of failures in an interval and the number of failures in
another interval. Mathematically, a counting process {N (t), t ≥ 0} is said to
have independent increments if for all t0 ≤ t1 ≤ . . . ≤ tk , k = 2, 3, . . . , N (t1 ) −
N (t0 ), . . . , N (tk ) − N (tk−1 ) are independent random variables.
Stationary Increments
A counting process {N (t), t ≥ 0} has stationary increments if for any two points
t > s ≥ 0, and any ∆ > 0 the random variables N (t) − N (s) and N (t + ∆) −
N (s + ∆) are identically distributed.
The renewal process (see Section 3.3) has independent and identically distributed
interarrival times, so under synchronous sampling it yields a stationary sequence
of interarrival times. However, the process is still not stationary because it
does not have stationary increments.
Intensity Function
The intensity function of a stochastic point process, ρ(t), is the same as the
ROCOF associated with a repairable system.
Typically, the assumptions we make about how a system ages and how failure
and repair affect it will influence our choice of a repairable system model.
Minimal Repair
Minimal repair means that the repair done on a system leaves the system in
exactly the same condition as it was before the failure.[14] The assumption of
minimal repair leads to the nonhomogeneous Poisson process (NHPP). The NHPP
is often a good model for repairable systems because it can model systems that
are deteriorating or improving.
3.3 Models Applicable to Repairable Systems
This section is a brief discussion of commonly used point processes which have
been applied to model repairable systems.
3.3.1 Poisson Process
Consider a process of point events occurring on the real axis, and let
N(t, t + ∆t) denote the number of failures in the small time interval (t, t + ∆t].
For a Poisson process with rate ρ, the number of failures in (0, t] satisfies

\[ P[N(t) = n] = e^{-\rho t} \frac{(\rho t)^n}{n!} \]

i.e., N(t) has a Poisson distribution with mean ρt, where ρ is called the
intensity function or the rate of the Poisson process. The most commonly used
probabilistic models in the counting process framework are the homogeneous and
nonhomogeneous Poisson processes.
3.3.2 Homogeneous Poisson Process
The homogeneous Poisson process (HPP) with rate ρ satisfies the following
condition (Ascher and Feingold, 1984):
(a) The number of failures in any interval of length t2 − t1 has a Poisson
distribution with mean ρ(t2 − t1). That is, for all t2 > t1 ≥ 0,

\[ Pr[N(t_2) - N(t_1) = j] = \frac{e^{-\rho(t_2 - t_1)}\,\{\rho(t_2 - t_1)\}^j}{j!} \]

for j ≥ 0. From condition (a) it follows that

\[ E[N(t_2) - N(t_1)] = \rho(t_2 - t_1), \]

where the constant ρ is the rate of occurrence of failures. Since the HPP has
stationary, independent increments, m(t) = ρ = 1/E[X], i.e. the ROCOF is a
constant. From the definition, the reliability function R(t1, t2) is

\[ R(t_1, t_2) = Pr[N(t_2) - N(t_1) = 0] = e^{-\rho(t_2 - t_1)}. \]
The times between failures of an HPP are exponentially distributed with mean 1/ρ,
and the time to the nth failure, Tn, from a system modeled by an HPP has a gamma
distribution. A system that fails in accordance with an HPP has no memory of its
age. This is the most basic model for repairable systems; however, it should be
used with caution.[14] The HPP cannot be used to describe systems that
deteriorate or improve because it has a constant intensity function or ROCOF.
3.3.3 Nonhomogeneous Poisson Process
Definition 6 (Ascher and Feingold, 1984) The number of failures in any interval
(t1, t2) has a Poisson distribution with mean \( \int_{t_1}^{t_2} \rho(t)\,dt \).
That is, for all t2 > t1 ≥ 0,

\[ Pr[N(t_2) - N(t_1) = j] = \frac{e^{-\int_{t_1}^{t_2} \rho(t)\,dt}\,\left\{\int_{t_1}^{t_2} \rho(t)\,dt\right\}^j}{j!} \]

for j ≥ 0. From the above, it follows that

\[ E[N(t_2) - N(t_1)] = \int_{t_1}^{t_2} \rho(t)\,dt. \]

For an NHPP, the interarrival times Xi are neither independent nor identically
distributed. The interarrival times are not independent samples from any single
distribution, including the exponential distribution. However, the independent
increment property still holds. The NHPP is characterized by the minimal repair
assumption, which states that the system after repair is only as good as it was
immediately before the failure. A popular case of the NHPP is the Power Law
Process (see Section 3.4.1).
3.3.4 Renewal Process
The renewal process is also a generalization of the HPP. As in the HPP, the times
between successive failures are independently and identically distributed, but
with an arbitrary PDF f(x) rather than necessarily an exponential one. When
sampled synchronously, the renewal process is an example of a transient point
process[7] and the ROCOF is asymptotically constant. The term "good as new" has
been used to describe renewal. The ROCOF for the renewal process can be derived
from the PDF as[18]

\[ m^*(s) = \frac{f^*(s)}{1 - f^*(s)} \]

where f*(s) and m*(s) denote the Laplace transforms of the PDF and the ROCOF,
respectively.
In Figure 7, the failures of the n sockets are depicted by the red crosses, and
the failures of the system, depicted by the black crosses on the bottom
horizontal line, are formed by the union of all the failures of the n sockets.
3.4 NHPP Reliability Growth Models
The most commonly used stochastic process for modeling reliability growth is
the nonhomogeneous Poisson process (NHPP). The Power Law model and the
Log-linear model are discussed in this section.
3.4.1 The Power Law Model
Suppose a system is put into operation at time T0 = 0, and let T1 < T2 < . . . < Tn
be the first n arrival times of a random point process. The power law process is
defined by its intensity function (or ROCOF). The Crow/AMSAA reliability model
is as follows:

\[ \rho(t) = \lambda \beta t^{\beta - 1}, \qquad \lambda > 0,\ \beta > 0,\ t > 0 \tag{4} \]

where t is the age of the system. The process in Equation (4) can be expressed
through the mean cumulative number of failures function

\[ M(t) = E[N(t)] = \lambda t^{\beta} \tag{5} \]
When β = 1, the interarrival times Xi follow an exponential distribution and the
process reduces to an HPP. In the presence of reliability growth, however, the
interarrival times should be stochastically increasing. This occurs for the
Weibull process when 0 < β < 1, i.e. when ρ(t) is decreasing.[22] Finkelstein
(1976) reparameterized the intensity function as[28]

\[ \rho(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta - 1}, \qquad \alpha > 0,\ \beta > 0,\ t > 0 \tag{6} \]

\[ M(t) = \left(\frac{t}{\alpha}\right)^{\beta} \tag{7} \]

where \( \alpha = (1/\lambda)^{1/\beta} \). Here α and β represent the process's
scale and shape parameters, respectively.
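One way to make the power law process tangible is to simulate it. The Python sketch below (not from the thesis) uses the standard time-transformation argument: if Γ1 < Γ2 < . . . are arrival times of a unit-rate HPP, then Ti = M⁻¹(Γi) = α Γi^(1/β) are arrival times of an NHPP with mean function M(t) = (t/α)^β. The parameter values are borrowed from the case-study estimates purely for illustration.

# Sketch (not from the thesis): simulating power law process arrival times by
# inverting the mean function M(t) = (t/alpha)**beta.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 1.14683, 0.76865       # illustrative values (case-study estimates)
T_end = 1400.0                       # observation window in days

gammas = np.cumsum(rng.exponential(1.0, size=5000))  # unit-rate HPP arrivals
T = alpha * gammas ** (1.0 / beta)                   # power law arrival times
T = T[T <= T_end]                                    # time-truncated data
print(len(T), "failures; expected", (T_end / alpha) ** beta)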
3.4.2 Parameter Estimation of the Power Law Model
The values of λ and β will be estimated based on failure data from k systems.
Assume we have k systems, whose operation starts at times Sq and ends at Tq for
q = 1, . . . , k. Nq is the total number of failures for the qth system, and Xiq
is the age of the qth system at its ith failure. The maximum likelihood estimates
for λ and β are given by

\[ \hat{\lambda} = \frac{\sum_{q=1}^{k} N_q}{\sum_{q=1}^{k} \left(T_q^{\hat{\beta}} - S_q^{\hat{\beta}}\right)} \tag{8} \]

and

\[ \hat{\beta} = \frac{\sum_{q=1}^{k} N_q}{\hat{\lambda} \sum_{q=1}^{k} \left(T_q^{\hat{\beta}} \ln T_q - S_q^{\hat{\beta}} \ln S_q\right) - \sum_{q=1}^{k} \sum_{i=1}^{N_q} \ln X_{iq}} \tag{9} \]
Eqns. (8) and (9) must be solved by an iterative procedure. If all the systems
have the same start time at zero, i.e. Sq = 0, and all end at the same time
Tq = T, then these equations simplify to Eqns. (10) and (11) below.

\[ \hat{\lambda} = \frac{\sum_{q=1}^{k} N_q}{k\,T^{\hat{\beta}}} \tag{10} \]

\[ \hat{\beta} = \frac{\sum_{q=1}^{k} N_q}{\sum_{q=1}^{k} \sum_{i=1}^{N_q} \ln(T / X_{iq})} \tag{11} \]
In Eqns. (10) and (11) the maximum likelihood estimates are in closed form.
Also, when k = 1, the estimates for λ and β are

\[ \hat{\lambda} = \frac{N_1}{T_1^{\hat{\beta}}} \tag{12} \]

\[ \hat{\beta} = \frac{N_1}{\sum_{i=1}^{N_1} \ln(T_1 / X_{i1})} \tag{13} \]
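A minimal sketch (not from the thesis) of the closed-form single-system estimates in Eqns. (12) and (13), applied to data simulated as in the sketch above; the estimates should be close to the generating values.

# Sketch (not from the thesis): closed-form MLEs for a single system observed
# on (0, T1], per Eqns. (12) and (13).
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 1.14683, 0.76865
T1 = 1400.0
X = alpha * np.cumsum(rng.exponential(1.0, 2000)) ** (1.0 / beta)
X = X[X <= T1]                                # failure ages of the single system
N1 = len(X)

beta_hat = N1 / np.sum(np.log(T1 / X))        # Eqn. (13)
lam_hat = N1 / T1 ** beta_hat                 # Eqn. (12)
print(beta_hat, lam_hat)                      # near beta = 0.769 and
                                              # lambda = alpha**(-beta) ~ 0.90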
A common test for trend considers the hypotheses

H0 : β = 1, i.e. no trend in the data (homogeneous Poisson process)
H1 : β ≠ 1

The alternative hypothesis implies that there is a trend in the data
(nonhomogeneous Poisson process). The MIL-HDBK 189 test is based on the test
statistic
\[ U = 2 \sum_{i=1}^{m-1} \ln\frac{T_m}{T_i} \tag{14} \]
• If we fail to reject the null hypothesis, there is not sufficient evidence
against the homogeneous Poisson process, and the HPP is an appropriate model
to use.

In the case of no growth, β is equal to 1; when β < 1, the process indicates
reliability growth; when β > 1, the process indicates reliability deterioration.
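A hedged sketch (not from the thesis) of the statistic in Eqn. (14). We assume failure-truncated data, for which U is commonly referred to a chi-square distribution with 2(m − 1) degrees of freedom; the degrees-of-freedom convention differs for time-truncated data.

# Sketch (not from the thesis): MIL-HDBK 189 trend statistic of Eqn. (14),
# U = 2 * sum_{i=1}^{m-1} ln(T_m / T_i), compared to chi-square(2(m-1)).
import numpy as np
from scipy.stats import chi2

def mil_hdbk_189(arrival_times):
    T = np.sort(np.asarray(arrival_times, dtype=float))
    m = len(T)
    U = 2.0 * np.sum(np.log(T[-1] / T[:-1]))
    df = 2 * (m - 1)
    p = 2 * min(chi2.cdf(U, df), chi2.sf(U, df))   # two-sided p-value
    return U, p

# Example: arrival times with lengthening gaps (an improving system)
U, p = mil_hdbk_189([5, 30, 80, 170, 300, 480, 700])
print(U, p)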
A goodness-of-fit test for the power law model is based on the Cramér-von Mises
statistic. The observation of the system ends at time T, the ith arrival time is
Ti, and N is the number of system failures. The statistic can be written as

\[ C_R^2 = \frac{1}{12N} + \sum_{i=1}^{N} \left( \hat{R}_i - \frac{2i - 1}{2N} \right)^2, \qquad \hat{R}_i = \left( \frac{T_i}{T} \right)^{\hat{\beta}} \]

C_R^2 has its own critical values for various values of N. A value of C_R^2
larger than the critical value leads to the rejection of the null hypothesis and
the conclusion that the model does not fit adequately.
Model selection can also be based on Akaike's information criterion,
AIC = −2 Loglikelihood + 2k, where k is the number of estimated parameters;
smaller AIC values indicate a better trade-off between fit and complexity.
3.5 Nonparametric Methods
3.5.1 Nelson-Aalen Estimator
The Nelson-Aalen estimator provides a nonparametric estimate of the cumulative
hazard function H(x) and the mean cumulative number of failures M(t). Without
any parametric assumptions, the hazard rate h(x) might be any nonnegative
function, making estimation problematic.[3] Yet, it turns out that estimating
the cumulative hazard function

\[ H(x) = \int_0^x h(s)\,ds \tag{18} \]

is simple
without making any assumptions about the distribution of h(x). This is analogous
to estimating the cumulative distribution function, which is significantly simpler
than estimating the density function. The result is the Nelson-Aalen estimator,
which is given as (Nelson, 1969)
\[ \hat{H}(x) = \sum_{T_j \le x} \frac{1}{Y(T_j)} \tag{19} \]
where Y (t) is the number of individuals at risk at time t (in survival analysis
terms). Eqn. 19 is an increasing step function, which may look like a smooth
curve with a large sample of data. Eqn. 19 can also be defined as
\[ \hat{H}(x) = \sum_{x_i \le x} \frac{d_i}{n_i}, \qquad x \ge 0 \tag{20} \]
di = 1 when failures occur at distinct times and there are no multiple failures
occurring at the same time.
Nelson (1988) developed a nonparametric estimate for the mean cumulative num-
ber of failures, M (t), given by
\[ \hat{M}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}, \qquad t \ge 0 \tag{21} \]
where ti stands for the observed failure times, di is the number of failures ob-
served at these times and ni the number of systems in operation at these times.
As discussed in section 3.1, the shape of the MCNF curve reveals the system’s
behavior. Nelson-Aalen estimator has the advantage of being able to be used on
both complete and censored data.
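Eqn. (21) is straightforward to implement. The Python sketch below (not from the thesis) computes the step-function estimate from distinct event times and the corresponding numbers at risk; it assumes one failure per listed time (di = 1), so tied events should be grouped beforehand.

# Sketch (not from the thesis): the Nelson (1988) nonparametric MCNF estimate
# of Eqn. (21), M_hat(t) = sum_{t_i <= t} d_i / n_i.
import numpy as np

def mcnf(event_times, n_at_risk):
    """Step-function MCNF at each distinct event time.

    event_times : sorted distinct failure times t_i
    n_at_risk   : number of systems under observation at each t_i
    """
    t = np.asarray(event_times, dtype=float)
    n = np.asarray(n_at_risk, dtype=float)
    # d_i = 1 per listed event; group duplicates beforehand if needed
    return t, np.cumsum(1.0 / n)

t, M_hat = mcnf([3, 7, 12, 20, 31], [10, 10, 9, 9, 8])  # toy data
print(dict(zip(t, np.round(M_hat, 3))))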
A simple graphical method is to plot the interarrival times in chronological
order to test for trend, since the existence of a trend shows that the data is
nonstationary. If there is a trend, check for the NHPP or other nonstationary
models. No trend implies that the Xi's may be identically distributed but not
necessarily independent; one can then check for dependence and use the
appropriate models. For an improving (deteriorating) system, successive
interarrival failure times will likely become larger (smaller).
The Laplace Test can also be used to determine whether or not an observed series
of events has a trend. The hypothesis test is
H0 : No trend
Ha : Trend
The test uses chronologically ordered arrival times T1 , T2 , . . . Tm .[8] The Laplace
test statistic is

\[ U_L = \frac{\dfrac{1}{m-1}\sum_{i=1}^{m-1} T_i - \dfrac{T_m}{2}}{T_m \sqrt{\dfrac{1}{12(m-1)}}} \tag{22} \]
Interpretation: Under H0, UL is approximately standard normal. A value of UL
near zero supports the hypothesis of no trend. A significantly negative UL
indicates that interarrival times tend to become longer (an improving system),
while a significantly positive UL indicates a deteriorating system.
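A small sketch (not from the thesis) of the Laplace statistic in Eqn. (22), with a two-sided p-value from the standard normal approximation:

# Sketch (not from the thesis): Laplace trend test of Eqn. (22) for
# chronologically ordered arrival times T_1, ..., T_m.
import numpy as np
from scipy.stats import norm

def laplace_test(arrival_times):
    T = np.sort(np.asarray(arrival_times, dtype=float))
    m = len(T)
    Tm = T[-1]
    UL = (T[:-1].mean() - Tm / 2.0) / (Tm * np.sqrt(1.0 / (12.0 * (m - 1))))
    p = 2 * norm.sf(abs(UL))        # two-sided p-value under H0: no trend
    return UL, p

UL, p = laplace_test([5, 30, 80, 170, 300, 480, 700])
print(UL, p)   # UL < 0: gaps lengthen, consistent with an improving system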
3.6 Probabilistic Deep Learning Model for Failure Data
This section introduces a recurrent neural network based on survival analysis.
The Weibull Time To Event Recurrent Neural Nekwork by Egil Martinsson (2016)
[45]. In contrast to one of the main assumptions in survival analysis, which is the
occurrence of a single event or failure, WTTE-RNN can model recurring events
or multiple failures. We will describe the WTTE-RNN model using a general
framework for censored data, but before then, we present some fundamental deep
learning concepts.
Figure 8: Layers of A Recurrent Neural Network
constructing the likelihood function, we use it as a loss function for the model:
we choose θ to maximize the probability of X = x given θ, which is the same as
minimizing the negative log-likelihood − log(L(x, θ)).
3.6.4 WTTE-RNN
WTTE-RNN is a framework for predicting the time until the next event as a
discrete or continuous Weibull distribution, with two parameters of the Weibull
distribution being the output of a recurrent neural network. The Weibull dis-
tribution is used in the model because it has some good attributes, such as a
closed-form PDF, CDF and hazard function, and it can be used as a good ap-
proximation for many distributions such as the exponential distribution. The
model is trained using a special objective function that allows use of censored
data by constructing a likelihood function for censoring.
Given t ∈ [0, ∞), scale parameter α ∈ (0, ∞) and shape parameter β ∈ (0, ∞), a
random variable X ∼ Weibull(α, β) (continuous case) has cumulative hazard
function

\[ H(x) = \left(\frac{x}{\alpha}\right)^{\beta} \tag{23} \]

and hazard function

\[ h(x) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta - 1} \tag{24} \]
The loss function of the WTTE-RNN is built from the likelihood

\[ L(x, \theta) = \prod_{i}^{n} Pr(X = x_i)^{u_i} \cdot Pr(X > x_i)^{1 - u_i} \tag{25} \]

\[ \log(L(x, \theta)) = \sum_{i}^{n} u_i \log\left(e^{-H(x_i)} h(x_i)\right) + \sum_{i}^{n} (1 - u_i) \log\left(e^{-H(x_i)}\right) \tag{26} \]

\[ = \sum_{i}^{n} \left[ u_i \cdot \log(h(x_i)) - H(x_i) \right] \tag{27} \]
32
The framework used an exponential activation function for the scale parameter,
α > 0,

\[ f(x) = \begin{cases} x, & \text{if } x > 0 \\ a(e^x - 1), & \text{if } x \le 0 \end{cases} \tag{28} \]

and the softplus activation function for the shape parameter, β. The activation
functions are used to ensure that the outputs of the parameters of the Weibull
distribution, α and β, are positive.
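The loss in Eqn. (27) and the positivity transforms are easy to express directly. The sketch below (not from the thesis) is a plain NumPy version; in the WTTE-RNN itself, a_raw and b_raw would be outputs of the recurrent network rather than fixed scalars.

# Sketch (not from the thesis): censored Weibull negative log-likelihood of
# Eqn. (27), with positivity transforms for alpha and beta.
import numpy as np

def weibull_loss(x, u, a_raw, b_raw):
    """Negative log-likelihood; u=1 observed event, u=0 right censored."""
    alpha = np.exp(a_raw)                  # exponential activation, alpha > 0
    beta = np.log1p(np.exp(b_raw))         # softplus activation, beta > 0
    H = (x / alpha) ** beta                # cumulative hazard, Eqn. (23)
    h = (beta / alpha) * (x / alpha) ** (beta - 1)   # hazard, Eqn. (24)
    loglik = u * np.log(h) - H             # Eqn. (27)
    return -np.mean(loglik)

x = np.array([2.0, 5.0, 7.5, 9.0])        # times to event / censoring
u = np.array([1, 1, 0, 1])                # third observation is censored
print(weibull_loss(x, u, a_raw=1.5, b_raw=0.5))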
4 Case Study
High-Performance Computing (HPC) uses supercomputers and computer clusters to
solve advanced computational problems that require computing power and
performance beyond the capabilities of a typical desktop computer. These
large computational problems exist in numerous fields such as science and en-
gineering. HPC clusters have three main components: compute, network, and
storage components. HPC clusters have multiple compute servers (or computers)
networked together into a cluster. Each cluster’s nodes work in parallel with one
another, boosting processing speed to achieve high-performance computing. To
capture the output, the cluster is networked to a data storage system. All HPC
cluster nodes have the same components as a laptop or desktop: CPU cores (also
known as processors), memory (or RAM), and disk space. What distinguishes a
personal computer from a cluster node is the quantity, quality, and power of the
components.
The rise of big data and artificial intelligence has contributed to an increase in
demand for high-performance computing systems in both industry and academia.
The growing demands have resulted in frequent HPC failures. These failures
may cause system outages, which can be costly and disruptive to institutions and
the people who rely on them. This emphasizes the significance of ensuring the
reliability of HPC systems. As a result, a detailed understanding of failure char-
acteristics can better guide HPC management and thereby limit the occurrence
of failures, improving system performance and reliability. The goal of this case
study is to model the reliability of an HPC system based on failure data
obtained during its first years of operation, using stochastic point process
models.
4.1 Data
The case study uses data on the hardware failures of HPC computing systems from
the Computer Failure Data Repository (CFDR).[55] This data set is a record of
hardware failures recorded on the High Performance Computing System-2 (MPP2)
operated by the Environmental and Molecular Science Laboratory (EMSL), Molecular
Science Computing Facility (MSCF) at Pacific Northwest National Laboratory
(PNNL) from November 2003 through September 2007. Below is a description of the
HPC system from PNNL provided by CFDR.[55]
The MPP2 computing system has the following equipment and capabilities:
• HP/Linux Itanium-2
– 366 nodes are ”thin” compute nodes with 10 Gbyte RAM and 10 Gbyte
local disk
– 34 nodes are Lustre server nodes (32 OSS, 2 MDS)
– 2 nodes are administrative nodes
– 4 nodes are login nodes
The applications running on this system are typically large-scale scientific
simulations or visualization applications. The data contains an entry for any
failure that occurred during the observation period and that required the
attention of a system administrator. For each hardware failure, the data set
includes a timestamp for when the failure happened, the node affected, what
failed in the node, a description of the failure, and the repair action taken.
Table 2 shows the first five rows of the raw data from PNNL.
The system went into operation at the same time that hardware failure record
keeping began.[1] There is no information given on the exact start and end dates
of the study; all that is known is that the study began in November 2003 and
ended in September 2007. The dates November 1, 2003, and September 1, 2007 were
selected for the start and end of the study, respectively. This way we could
calculate the arrival times of failures, Ti, and the interarrival times, or the
times between successive failures, Xi. Failure arrival times Ti and interarrival
times Xi were both measured in days for the study. Our data is censored. Since
we are dealing with a repairable system, there could be more than one failure in
the system during the observation period (Nov. '03 - Sept. '07). Components in
the system are replaced upon failure. In the study, we treat the censored age as
the time between the last failure for each component and the time the study
ends. See Figure 9.
4.2 Results
This section reports the results from the models used in this study. JMP,[53]
statistical software from SAS Institute with strong support for reliability
analysis, was used for the analysis.
Figure 10: Mean cumulative number of failures from the HPC System
From the plot of the MCNF, there is no evidence of the system getting worse
with age. We will conduct the trend test to determine whether or not the system
is improving.
The trend test suggests that the system under study is improving: the
interarrival times of failures are getting longer as the system ages.
We can see that HPP may not be a good fit for our failure data. We now check
with two NHPP reliability growth models, the Power Law process and the Log-
Linear model.
The intensity function and mean cumulative number of failures of the fitted
Power Law process are

\[ \hat{\rho}(t) = 0.67024 \left(\frac{t}{1.14683}\right)^{-0.23135} \]

\[ \hat{M}(t) = \left(\frac{t}{1.14683}\right)^{0.76865} \]
The MIL-HDBK 189 test: U = 208.700, p-value < 0.0001. The test rejects the
null hypothesis of β = 1 (or constant MTBF) in favor of the Nonhomogeneous
Poisson process as a model for this data.
Figure 12 fits a Power Law process NHPP reliability growth model to the failure
arrival data. The dotted line in the figure is a pointer that shows the value at
the selected point of the graph. The failure intensity is a decreasing function
of time. This demonstrates that our system is improving.
Figure 13 shows the fit of the log-linear model on the failure data. The
log-linear model seems to have a better fit compared to the power law process.
Figure 15 shows that the log-linear model also has a decreasing ROCOF.
To select the most appropriate NHPP model for failure data from our HPC sys-
tem, we calculated the mean square errors (MSE) between the observed mean
cumulative number of failures, M (t), and the estimated mean cumulative number
of failures, M
d (t) for each NHPP reliability growth model. Because the log-linear
Model               −2Loglikelihood   AIC         BIC
HPP                 19409.121         19413.121   19425.540
Power Law Process   19200.421         19204.421   19216.846
Log-Linear Model    18919.436         18923.436   18935.861
model proposed by Cox and Lewis has a much lower MSE, it is chosen.
\[ MSE = \frac{1}{n} \sum_{i=1}^{n} \left( M(t_i) - \hat{M}(t_i) \right)^2 \]
Model               MSE
Power Law Process   37600.699
Log-Linear Model    3387.614
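The MSE comparison is simple to reproduce. The sketch below (not from the thesis; the observed values are toy numbers) evaluates the fitted power law MCNF at a few failure times and computes the mean squared error against observed cumulative counts.

# Sketch (not from the thesis): MSE between an observed MCNF and a model's
# fitted MCNF evaluated at the same failure times.
import numpy as np

def mcnf_mse(M_obs, M_fit):
    M_obs, M_fit = np.asarray(M_obs), np.asarray(M_fit)
    return np.mean((M_obs - M_fit) ** 2)

t = np.array([10.0, 50.0, 200.0, 800.0])
M_obs = np.array([4.0, 15.0, 40.0, 120.0])       # toy observed values
M_fit = (t / 1.14683) ** 0.76865                 # fitted power law MCNF
print(mcnf_mse(M_obs, M_fit))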
Figure 12: Fit of Power Law Process
Figure 14: Plot of Failure Intensity Function and Cumulative Failures for Power
Law Process
Figure 15: Plot of Failure Intensity function and Cumulative Failures for the Log-
linear model
5 Conclusion
Several studies have highlighted the significance of studying failure data from
repairable systems. The thesis discusses the fundamental concepts of reliability
engineering for repairable systems. It also looks at stochastic point processes as
they apply to repairable systems. A case study is undertaken to see how the
reliability models discussed apply to failure data.
In the case study, we analyze failure data collected at Pacific Northwest
National Laboratory between 2003 and 2007. We find that the interarrival times
of failures of the HPC system studied are not independent and identically
distributed. For this reason, the HPP is not an appropriate model to use. After
conducting the Laplace trend test, we discovered a trend in the data. The
decreasing rate of occurrence of failures as the system ages indicates that we
have an improving system. The plot of the intensity function indicates that the
system is in the burn-in stage, where there is an initial high number of
failures due to defective parts in the system. These parts are replaced, and the
occurrence of failures decreases as the system ages. The log-linear model is
chosen as the appropriate model for the failure data used in the case study,
based on the results of the goodness-of-fit tests and the mean squared error
(MSE).
References
[1] MPP2 - Cluster Platform 6000 rx2600 Itanium2 1.5 GHz, Quadrics. Available
at https://www.top500.org/system/173082/.

[14] A. Basu and S. Rigdon, Statistical Methods for the Reliability of
Repairable Systems, 2000.

[15] N. Breslow and J. Crowley, A large sample study of the life table and
product limit estimates under random censorship, The Annals of Statistics,
(1974), pp. 437–453.

[16] M. Brown and F. Proschan, Imperfect repair, Journal of Applied Probability,
20 (1983), pp. 851–859.

[18] D. Cox and H. Miller, The Theory of Stochastic Processes, Methuen & Co.,
Ltd., London, UK, 1965.

[23] R. DiPietro and G. D. Hager, Deep learning: RNNs and LSTM, in Handbook of
Medical Image Computing and Computer Assisted Intervention, Elsevier, 2020,
pp. 503–519.

[30] S. M. Gore, review of Statistical Models and Methods for Lifetime Data by
Jerald F. Lawless (Wiley, New York, 1982), Statistics in Medicine, 1 (1982),
pp. 293–294.

[32] C. K. Hansen, "Reliability," Realizing Complex System Design (Eds.
Sheppard, J. W. and Ambler, A. P.), CRC Press (to appear), 2023, ch. 2.

[34] Y. Hu, X. Miao, Y. Si, E. Pan, and E. Zio, Prognostics and health
management: A review from the perspectives of design, development and decision,
Reliability Engineering & System Safety, 217 (2022).

[40] J. F. Lawless, Statistical Models and Methods for Lifetime Data, John
Wiley & Sons, 2011.

[41] L. Lee and S. K. Lee, Some results on inference for the Weibull process,
Technometrics, 20 (1978), pp. 41–45.

[46] W. Nelson, Accelerated life testing - step-stress models and data analyses,
IEEE Transactions on Reliability, R-29 (1980), pp. 103–108.

[48] W. B. Nelson, Applied Life Data Analysis, John Wiley & Sons, 2005.

[50] S. E. Rigdon and A. P. Basu, The power law process: a model for the
reliability of repairable systems, Journal of Quality Technology, 21 (1989),
pp. 251–260.

[52] J. H. Saleh and K. Marais, Highlights from the early (and pre-) history of
reliability engineering, Reliability Engineering & System Safety, 91 (2006),
pp. 249–256.

[56] C. Singh, Failure data analysis for transit vehicles, in Proceedings of
the Annual Reliability and Maintainability Symposium, Washington, DC, January
23-25, 1979, no. IEEE 79CH1429-OR, 1979.

[59] D. Trindade and S. Nathan, Field Data Analysis for Repairable Systems:
Status and Industry Trends, 2008, pp. 397–412.

[62] G. Weckman, R. Shell, and J. Marvel, Modeling the reliability of repairable
systems in the aviation industry, Computers & Industrial Engineering, 40 (2001),
pp. 51–63.

[65] K. L. Wong, The bathtub does not hold water any more, Quality and
Reliability Engineering International, 4 (1988), pp. 279–282.

[69] E. Zio, Reliability engineering: Old problems and new challenges,
Reliability Engineering & System Safety, 94 (2009), pp. 125–141.
6 Vitae
Professional
Experience: Internship, Carnegie Mellon University, Pittsburgh,
Pennsylvania, 2019