
Eastern Washington University

EWU Digital Commons

EWU Masters Thesis Collection Student Research and Creative Works

Winter 2023

Modeling repairable system failure data using NHPP reliability growth mode.

Eunice Ofori-Addo
Eastern Washington University

Follow this and additional works at: https://dc.ewu.edu/theses

Part of the Discrete Mathematics and Combinatorics Commons, Numerical Analysis and Scientific
Computing Commons, and the Systems Architecture Commons

Recommended Citation
Ofori-Addo, Eunice, "Modeling repairable system failure data using NHPP reliability growth mode." (2023).
EWU Masters Thesis Collection. 880.
https://dc.ewu.edu/theses/880

This Thesis is brought to you for free and open access by the Student Research and Creative Works at EWU Digital
Commons. It has been accepted for inclusion in EWU Masters Thesis Collection by an authorized administrator of
EWU Digital Commons. For more information, please contact [email protected].
Modeling Repairable System Failure Data Using
NHPP Reliability Growth Model

A Thesis
Presented To
Eastern Washington University
Cheney, Washington

In Partial Fulfilment of the Requirements for the Degree


Master of Science in Applied Mathematics

By
Eunice Ofori-Addo
Winter 2023
THESIS OF EUNICE OFORI-ADDO APPROVED BY

DR. CHRISTIAN K. HANSEN, GRADUATE STUDY COMMITTEE          DATE

DR. XIUQIN BAI, GRADUATE STUDY COMMITTEE                   DATE

MRS. LYNNAE DANIELS, GRADUATE STUDY COMMITTEE              DATE
ABSTRACT

MODELING REPAIRABLE SYSTEM FAILURE DATA USING NHPP


RELIABILITY GROWTH MODEL

by

Eunice Ofori-Addo

Winter 2023

Stochastic point processes have been widely used to describe the behaviour of
repairable systems. The Crow nonhomogeneous Poisson process (NHPP), often
known as the Power Law model, is regarded as one of the best models for repairable
systems. For the data considered in this study, a goodness-of-fit test rejects the
intensity function of the power law model, and so the log-linear model was fitted
and tested for goodness-of-fit. The Weibull Time to Event recurrent neural
network (WTTE-RNN) framework, a probabilistic deep learning model for failure
data, is also explored. However, we find that the WTTE-RNN framework is only
appropriate for failure data with independent and identically distributed interarrival
times of successive failures, and so cannot be applied to a nonhomogeneous
Poisson process.

ACKNOWLEDGMENTS

I would like to thank Evan Felix and David Brown from PNNL for collecting
the data and sharing it. The data was collected and made available using the
Molecular Science Computing Facility (MSCF) in the William R. Wiley Environ-
mental Molecular Sciences Laboratory, a national scientific user facility sponsored
by the U.S. Department of Energy’s Office of Biological and Environmental Re-
search and located at the Pacific Northwest National Laboratory, operated for
the Department of Energy by Battelle.

I would also like to specifically thank Dr. Christian Hansen, my thesis advisor,
for introducing me to the thesis topic and the field of reliability and for his helpful
comments, which led to improvements in the presentation of this thesis. Thank
you, Dr. Xiuqin Bai and Mrs. Lynnae Daniels, for agreeing to serve on my thesis
committee.

Table of Contents

1 Introduction
  1.1 Objective of Study
  1.2 A Brief History of Reliability
  1.3 Thesis Outline

2 Literature Review

3 Methodology
  3.1 Basic Reliability Terms
  3.2 Basic Concepts for Stochastic Point Processes
  3.3 Models Applicable to Repairable Systems
  3.4 NHPP Reliability Growth Models
  3.5 Nonparametric Methods
  3.6 Probabilistic Deep Learning Model for Failure Data

4 Case Study
  4.1 Data
  4.2 Results

5 Conclusion

6 Vitae

List of Figures

1 Times Between Successive Failures of Happy, Sad and Non-Committal Systems (Source: Ascher & Feingold, 1984)
2 Example of three types of cumulative hazard function: (a) constant hazard rate, (b) increasing hazard rate and (c) decreasing hazard rate
3 Bathtub shaped ROCOF
4 MCNF plots for three different types of systems (Source: [59])
5 Types of Censoring
6 Relationship between the number of failures N(t), the interarrival times Xi and failure times Ti
7 Superposition of Renewal Processes
8 Layers of A Recurrent Neural Network
9 X* represents censored age of component
10 Mean cumulative number of failures from the HPC System
11 Fit of HPP Model
12 Fit of Power Law Process
13 Fit of Cox-Lewis Log-Linear Model
14 Plot of Failure Intensity Function and Cumulative Failures for Power Law Process
15 Plot of Failure Intensity Function and Cumulative Failures for the Log-Linear Model
1 Introduction
Tragedies like the Deepwater Horizon oil spill, the Boeing 737 Max airplane ac-
cidents, and the Chernobyl disaster brought reliability issues in design to light.
Reliability is the ability of a system or component to perform its required func-
tions under specified conditions for a specified period of time, according to the
IEEE Standard Computer Dictionary.[29] It is a measure of the likelihood that a
system will not fail. A system is deemed reliable if it satisfies the specified perfor-
mance standards and operates faultlessly for a specified period. Probability and
statistics serve as good tools for assessing reliability. While probabilistic
modeling and statistical analysis cannot directly improve reliability, they can be
used to predict reliability using experimental data, test data, or field performance
failure data.[31] Stochastic processes are the most powerful mathematical tools for
studying models in reliability theory. This study presents the fundamental con-
cepts in stochastic modeling for repairable systems. It is crucial to understand
how frequently failures might happen in order to minimize failures in repairable
systems. This involves predicting when failures will occur. As a result, reliability
engineers and practitioners must be educated on stochastic processes and models
that are important for system reliability.

For the purpose of this thesis, a system is defined as a collection of two or more
parts which is designed to carry out one or more functions.[7] Systems can be clas-
sified into two categories: repairable systems and non-repairable systems. Systems
that are non-repairable are those that are not repaired when they fail. They are
discarded after failure. A light bulb is an example of a non-repairable system.
A broken light bulb cannot be fixed and must be replaced. A satellite is also
considered non-repairable because of its complexity and location in space. Once
a satellite is launched into space, it is not easily accessible for repairs. Since most
systems are, at least in principle, repairable in nature, non-repairable systems are
commonly referred to as components or parts. Typically, a repairable system is
made up of components or parts that are discarded or replaced completely upon
failure. Components are parts of the larger system that have a direct effect on
the system’s reliability.
As the name implies, repairable systems are restored to operation upon failure
by means other than replacing the entire system. Ascher and Feingold[7] in their
book define a repairable system as "a system which, after failing to perform
one or more of its functions satisfactorily, can be restored to fully satisfactory
performance by any method, other than replacement of the entire system." Re-
pairable systems house components that are non-repairable. Common examples
of repairable systems include automobiles, computers, printers, etc. If a compo-
nent or subsystem fails and renders an automobile inoperative, that component is
typically repaired or replaced rather than purchasing a new vehicle. The engine,
transmission, brakes, tires, and electrical systems are all examples of repairable
parts of a car. When these parts break down or fail, they can be fixed to get the
car back to working properly. Being able to repair an automobile can increase its

lifespan and decrease the need for complete replacement, which can be advantageous
from an economic standpoint. In other cases, repairing the system can be more
expensive than replacing it entirely. An example is mobile devices. Repairing
a broken smartphone or laptop, such as one with a cracked screen, a mother-
board issue in laptops, or an issue with other internal components, can frequently
be more costly than purchasing a new one. This is due to the high cost of re-
placement parts, as well as the specialized skills required to repair the complex
system. It is important to understand the type of system being analyzed and use
the appropriate reliability methods and tools. We will use the following terms:

1. Part or Component refers to an item that cannot be repaired and is
discarded after it fails.

2. Socket is an equipment position which, at any given time, holds a part of
a given type.[7]

3. System is a collection of two or more sockets and their associated parts,
interconnected to perform one or more functions.

In this study, when we refer to a system, we mean a repairable system. Ascher and
Feingold presented the following example of a "happy", "sad" and "non-committal"
system from different sets of data.

Figure 1: Times Between Successive Failures of Happy, Sad and Non-Committal
Systems. (Source: Ascher & Feingold, 1984)

Figure 1 depicts failure data from three

different systems. The first system, the "happy" system, fails less and less
frequently as it ages. Failures occur more and more frequently in the "sad" system,
and at roughly equal intervals in the "non-committal" system. The happy system suggests that system reliability
improves with age, whereas the sad system suggests that system reliability de-
teriorates with age. The non-committal system’s reliability is neither improving

nor deteriorating. According to Ascher and Feingold, practitioners and reliability
engineers are unaware of the different cases of interarrival times of failures in a
system and mistakenly believe that all systems have independent and identically
distributed (i.i.d) times between subsequent failures. They did, however, highlight
the use of point process models to analyse data from repairable systems.

1.1 Objective of Study


The purpose of this thesis is to investigate and gain a better understanding of
models utilized on failure data in repairable systems. Also, to understand the
behavior of these systems.

1.2 A Brief History of Reliability


The word reliability can be traced back to 1816 and is first attested to by the poet
Samuel Taylor Coleridge.[52] However, the modern field of reliability engineering
did not emerge until the twentieth century. Dr. Walter A. Shewhart at Bell
Labs promoted product improvement through statistical process control in the
1920s, around the time that Waloddi Weibull was working on statistical models
for fatigue. W. Weibull's work later led to the development of the Weibull
distribution in the 1950s.[63] The modern use of the term reliability was defined
by the United States military in the 1940s, as a product that would operate when
expected and for a specified period.

During WWII, a group in Germany led by Wernher von Braun worked on devel-
oping the V-1 missile.[4] Following the war, it was revealed that the first ten V-1
missiles were all failures. Despite efforts to supply high-quality parts and pay
close attention to detail, all of the first missiles either exploded on the launch pad
or landed ”too soon” (in the English Channel). Throughout the 1940s and 1950s,
poor field reliability of military equipment drew attention to the need for more
formal methods of reliability engineering. Ad hoc studies were initiated under
the leadership of the US Department of Defense (DoD), which eventually coa-
lesced into a new discipline, reliability engineering. Several groups began formal
research on reliability issues. These had a significant impact on the statistical
treatment of the area. The United States Department of Defense established the
Advisory Group on the Reliability of Electronic Equipment (AGREE) in 1952,
and it later produced widely used specification standards for the reliability of
electronic equipment. The Professional Group on Quality Control of the IRE
was formed in 1949 and became the IEEE Reliability Society in 1963. Also, the
military handbook MIL-HDBK-217 was published in the early 1960s. It was intended to
provide a standard for the prediction of failures of electronic military parts and
systems in order to improve the reliability of the equipment being designed.

In the 1970s, interest in the risks and safety issues associated with the construc-
tion and operation of nuclear power plants grew in the United States as well
as other parts of the world. A large research commission led by Professor Nor-
man Rasmussen was formed in the United States to investigate the problem. The
multimillion-dollar project yielded the Rasmussen report, WASH-1400 (NUREG-
75/014). Despite its flaws, this report is the first serious safety analysis of such a
complex system as a nuclear power plant.[2] Similar research has been conducted
in Europe and Asia. The oil crisis renewed interest in energy efficiency in Nor-
way, particularly in the offshore oil industry. Engineers developed and used risk
analysis and decision analysis techniques to improve the reliability of oil and gas
pipelines while also supporting the reduction in asset costs.

The semiconductor industry began to expand in the 1980s, and the field of re-
liability engineering began to be applied to the software industry. Reliability
engineering entered the digital age in the 1990s. As more industries began to use
computers and software, having reliable software and hardware became increas-
ingly vital.

In the twenty-first century, the field of reliability engineering has continued to


evolve. Engineers are facing new challenges as a result of new technologies such
as the internet of things (IoT) and big data. To meet these challenges, engineers
are employing techniques such as predictive maintenance and artificial intelligence
(AI). In recent years, Prognostic Health Maintenance (PHM) has emerged as a
key enabler of reliable and efficient systems.[34] PHM is a multifaceted discipline
that connects the study of failure mechanisms to product life cycle management.
Health management is the technique of using diagnostic and prognostic informa-
tion to intelligently manage the use and maintenance of a system. The end goal
is enhanced reliability, safety, and minimized maintenance cost. According to
the IEEE Reliability Society, PHM goes hand-in-hand with reliability, as PHM
can directly improve effective reliability, availability, mission reliability, system
safety, and maintenance by being able to provide on-condition health, prognose
pending failure, and predict future health status. PHM was first introduced in
the Department of Energy (DoE) and the DoD. The primary motivation for its
implementation was to simultaneously reduce operating and support costs for the
military and industries in the United States while increasing system availability.
PHM gained popularity in both academia and industry. Over the last decade, it
has seen growing research across multiple disciplines, such as machine learning
and stochastic modeling in reliability engineering.[34]

1.2.1 Failure Data


The advent of the electronic age, accelerated by the Second World War, led to the
need for more complex mass-produced components and parts with a higher level of
variability in the parameters and dimensions involved. The 1940s and 1950s mili-
tary equipment field reliability experience brought attention to the need for more
formal reliability engineering approaches. This led to the gathering of failure data
from the field as well as the analysis of test results. In the mid-1960s, efforts at the

UK Atomic Energy Authority (UKAEA), the Royal Radar Establishment (RRE), and
the Rome Air Development Center (RADC) in the US led to the creation of failure
data-banks. Since the 1960s, failure data have been published.

1.3 Thesis Outline


The rest of this paper is organized as follows: Section 2 gives a brief review of the
existing literature on modeling failures in repairable systems. Section 3 introduces
basic reliability terms such as the hazard function, the mean cumulative number
of failures, and the bathtub curve and points out the confusion with the term
”failure rate” in reliability literature. The third section also provides definitions
and concepts of stochastic point processes that are relevant to the study, as well
as point process models often used in repairable systems. The models used for
the study, as well as the statistical tests and methods carried out in the research,
are discussed. The section closes with an overview of the Weibull Time to Event Recurrent
Neural Network (WTTE-RNN) model. Section 4 focuses on a case study on mod-
eling hardware failure data from a High-Performance Computing (HPC) system,
describing the data used and how it was preprocessed. The results from the
case study are also presented. Lastly, the conclusion section provides a brief summary
of what the study entailed.

2 Literature Review
Over the past 50 years, the theory and methods of repairable system reliability
have been extensively developed and acknowledged in a number of publications.
For this section of the study, we investigate related studies on statistical methods
and mathematical models relevant to the reliability of repairable systems.

One of the key areas of research on statistical methods for reliability in the 1960s
and 1970s was concerned with drawing parametric inferences related to compo-
nents using univariate life distributions considering both censored and uncen-
sored observations. Much of the work in this area is reviewed in books by Bain
(1978), Lawless (1982), and Nelson (1982).[9, 40, 48] The exponential, lognormal,
gamma, and Weibull distributions are all commonly used for modeling component
failure data, though the Weibull and exponential are more popular in reliability
practice. Prior research on parametric statistical inference with both complete
and censored data focused on the exponential distribution (e.g., Bartholomew
1957).[11] Up until the 1960s, almost all the studies on statistical analysis of the
reliability of components assumed failure times to be exponentially distributed.
However, the study by Zelen et al. (1961)[67] showed that the exponential dis-
tribution was not an appropriate model in many situations, while the Weibull
distribution became more popular for life distributions. In recent years, the Lind-
ley distribution has been used for modeling failure data and reliability.[39, 10]
The Lindley distribution was developed by British statistician D. V. Lindley in
a paper published in 1958.[42] The properties of the distribution itself remained
relatively unstudied until a 2008 publication by Ghitany et al., but even so, the
Lindley distribution has been used to model real-world data, including failure
data.

Non-parametric estimation attempts to estimate an unknown function from a


sample of data in the absence of any distribution assumptions. As a result, the
non-parametric estimation approaches have no significant related parameters. Ka-
plan and Meier’s work in 1958 [38] began the advancement in lifetime data analysis
with incomplete (or censored) data using non-parametric techniques. Their work
resulted in the development of the product-limit estimate, popularly known as
the Kaplan-Meier estimator. This is a well-known method for calculating sur-
vival over time from lifetime data despite having censored data. Closely related
to this is the Nelson-Aalen estimator. Since no distribution assumptions are re-
quired, one important application of the Nelson-Aalen estimator is to check the
fit of parametric models graphically, which is why Nelson first introduced it in
1969 & 1972. Altshuler (1970) [5] developed the Nelson-Aalen estimator indepen-
dently of Nelson in the context of competing-risks animal experiments. Later,
Aalen (1975, 1978b) applied the estimator to Markov chains and other event history
models.[3] This estimator has become a widely used tool in reliability. Breslow
and Crowley (1974)[15] studied the asymptotic behavior of the Kaplan-Meier and
Nelson-Aalen estimators. They showed that the two estimators are consistent and

asymptotically normal under certain conditions. Nelson (1982) presents hazard
plots for multiply censored data.[47] Nelson (1982) and J.F. Lawless (1982)[30]
have shown that the cumulative hazard plot can be useful for rough estimation
of parameters in parametric models and distributions. In 2020, Jiang et al.[35]
developed a non-parametric likelihood based estimation procedure for left trun-
cated and right censored data using B-splines. This method is useful for dealing
with data that was collected considerably later than the system’s production or
installation date.

Modeling the reliability of repairable systems caught the attention of professionals


and researchers working in the field after the publication of the book by Ascher
and Feingold (1984). The book has served as a primary reference for a large
number of studies on the reliability of repairable systems.[62, 14, 64, 61, 69] The
reliability models for repairable systems and non-repairable parts have some sim-
ilarities. However, they cannot be modeled with the same reliability models.
Doing so will lead to errors and inaccurate results.[6] The article by Basile et al.
(2004)[13] focuses on the identification of reliability models for non-repairable and
repairable systems.
Non-repairable system reliability models employ univariate probability distri-
butions. A non-repairable system consists of only one component in operation.
Upon failure, the component is discarded.[7] The renewal process is an appropri-
ate model to use when the system is comprised of only one component. The study
of non-repairable systems consists of determining the distribution of failure times.
Of late, there have been some interesting works published on reliability models for
non-repairable systems. Fang et al. (2021) in their work, ”Reliability evaluation of
non-repairable systems with failure mechanism trigger effect,” proposed a method
that integrated ”Petri Net” and Monte Carlo to evaluate reliability.[26] Zhai et
al. (2018) proposed a combinatorial model, named aggregated binary decision
diagram (ABDD) for reliability analysis of non-repairable parallel phased-mission
systems (PMS) subject to dynamic demand requirements.[68] Yañez et al. (2002)
proposed a generalized renewal process (GRP) to address the disparities in re-
newal states whether the system is renewed to a ”new” state (i.e., as good as
new) or repaired to the condition it was in immediately before failing (i.e., as bad
as old).[66]

In an HPC system, replacing components upon failure does not restore the sys-
tem to ”as good as new” condition, hence Crow’s (1993) claim that the renewal
process is not a suitable model.[22] There are two approaches that have been
adopted in relevant studies to model the reliability of repairable systems: stochas-
tic point processes and differential equations. Ascher and Feingold (1984) pro-
vide a brilliant survey and discussion of five stochastic point process models that
are applicable to repairable systems. They emphasize the importance of the
times between failures being independently and identically distributed for the
Homogeneous Poisson Process (HPP). Çinlar (1975) defines a HPP as an orderly
stochastic process with stationary, independent increments. HPP is the simplest

model for failures in a repairable system.[17] Many researchers have looked into
the link between the nonhomogeneous Poisson process (NHPP) and the HPP
including Brown and Proschan (1983), Bartoszyński et al. (1981) and Feigin
(1979).[12, 16, 27] In NHPP models, the intensity function is assumed to depend
on the cumulative system operating time, i.e. the age of the system, and not
necessarily on the time of the most recent failure (Ascher & Feingold, 1984).
Modeling repairable systems that deteriorate or improve over time requires the
use of a NHPP. A new model proposed by Ibrahimi (1993) on conditional NHPP
provided valuable techniques for analyzing the reliability of complex repairable
systems that are influenced by external factors. His work on conditional NHPP
can also be utilized to model reliability growth.[25] The most commonly used
model for reliability growth was proposed by Crow (1974). The Weibull Process
(different from the Weibull distribution) is also known as the Power Law model.
Other NHPP reliability growth models include The Cox-Lewis (Cozzolino) model
proposed by Cox & Lewis (1966) and later by Cozzolino (1968).[19, 20] This model,
also known as the log-linear model, is utilized in situations where the Power Law
model is rejected by a goodness-of-fit test. There is also a generalized version,
Cozzolino’s ”Initial Defects” Model.
The differential equation reliability growth models are based on an approach quite
different from the point process approaches. These models can be a very use-
ful technique for reflecting known underlying mechanisms which contribute to
reliability growth. For instance, if the rate of improvement is known to be in-
versely proportional to some power of time, this fact can be explicitly considered.
Schafer et al. (1975) treat these models in detail in their article ”Reliability
Growth Study”.[54] Differential equation reliability models include what has be-
come known as the IBM model by Rosner (1961)[51], the exponential single-term
power series model by Perkowski and Hartvigsen (1962), etc. Lloyd-Lipow model
developed by Lloyd and Lipow (1962) estimates the reliability of a system com-
prised of a single failure mode.[44] Other models include the Aroef model by Aroef
(1957) and the Simple Exponential Model.

Much research on repairable systems analysis has been conducted. Here are
a few more notable works. In the study by Ascher and Hansen (1998) titled
”Spurious Exponentiality Observed When Incorrectly Fitting a Distribution to
Nonstationary Data” [8] , the authors stressed that ignoring the chronological
ordering of interarrival times may lead to misleading results about the system’s
behaviour. They also refuted the notion that exponential distribution and HPP
can be used interchangeably. Ascher and Hansen point out, "The close mathe-
matical relationship between the HPP and the exponential distribution has led
many practitioners to incorrectly use the two concepts interchangeably, and many
falsely believe that if the assumed 'distribution' of interarrival times is exponen-
tial, e.g., when represented in a histogram, then it follows that the HPP model
can be justified as an appropriate model for the system failures." In addition to
failure time data, modern reliability databases typically also include information
on the type of failure, the type of maintenance, and other factors. For recent

literature, Lindqvist (2007) reviewed basic modeling approaches for failure and
maintenance data from repairable systems and presented a framework where the
observed events are modeled as marked point processes, with marks labeling the
types of events.[43]

There has been growing literature in the area of machine learning for reliability
engineering. The rise in data availability and computer capacity in recent years
has fostered significant progress in machine learning research. This will have a
profound impact on academia and industry. Supervised learning algorithms have
been used to estimate the remaining useful life (RUL), which is the remaining
time that a component is expected to function. Kang et al. (2021) propose a
model that applies normalization and principal component analysis for predict-
ing the remaining life of equipment in continuous production lines.[37]
Vanderhaegen et al. adopted a deep neural network model to predict
a specific human car-driving violation.[60] Recurrent neural networks (RNNs)
are a type of neural network suited to handling time series data and
other sequential data.[23] In his thesis, Egil Martinsson (2017)[45] developed an
RNN model for predicting time to event. The model, known as the Weibull Time to
Event Recurrent Neural Network (WTTE-RNN) estimates the time to the next
event in the case of discrete or continuous censored data and outputs parameters
of the distribution of time to the next event.

3 Methodology
3.1 Basic Reliability Terms
Reliability Function: The reliability function R(x), also known as the survival
function S(x), is the probability of an item operating for a certain amount of
time without failure. The reliability function is the complement of the cumulative
distribution function. If X is a random variable representing the time to failure
of a component, the reliability function R(x) is defined as

R(x) = Pr(X > x) = ∫_x^∞ f(s) ds,    x ≥ 0

R(x) represents the probability that the component is operating correctly at time
x. R(x) is a monotone non-increasing function of x. The item is assumed to
be working properly or operational at time x = 0 and no item can work forever
without failure: i.e. R(0) = 1 and limx→∞ R(x) = 0. We can also define the
probability of failure of component at or before time x, F (x), the cumulative
distribution function (CDF) as:

F (x) = P (X ≤ x), x≥0

The distribution function can also be defined in terms of reliability as:

F (x) = 1 − R(x)

F is a continuous and differentiable function "almost everywhere" with proba-
bility density function f defined as the derivative of the cumulative distribution
function:

f(x) = dF(x)/dx,    x > 0
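To make these definitions concrete, here is a small illustrative sketch (not from the thesis; the Weibull shape and scale are arbitrary assumptions) that evaluates R(x), F(x) and f(x) for a Weibull time-to-failure distribution and checks that R(x) = 1 − F(x).

```python
# Illustrative sketch (not from the thesis): evaluating R(x), F(x) and f(x)
# for an assumed Weibull time-to-failure distribution.
from scipy.stats import weibull_min

shape, scale = 1.5, 1000.0           # arbitrary assumed parameters (hours)
X = weibull_min(c=shape, scale=scale)

x = 500.0                            # age of interest
F = X.cdf(x)                         # probability of failure at or before x
R = X.sf(x)                          # reliability R(x) = Pr(X > x)
f = X.pdf(x)                         # density f(x) = dF(x)/dx

print(f"F({x}) = {F:.4f}, R({x}) = {R:.4f}, f({x}) = {f:.2e}")
assert abs(R - (1.0 - F)) < 1e-12    # R(x) is the complement of the CDF
```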

Hazard Function: Also known as the hazard rate or force of mortality(FOM),


given by h(x). h(x)dx is approximately the probability that a component fails in
a small time interval (x, x+dx) given that it has survived from time zero until the
beginning of the time interval. The hazard function is the ratio of the probability
density function to the survival function. It is applied to non-repairable items
(component or part). x represents the time to failure of a component. The
hazard function can be expressed as probability of failure between time x and
x + dx, given that there were no failures up to time x. The probability expression
is written as:
P(x < X ≤ x + dx | X > x) = P(x < X ≤ x + dx) / P(X > x)

The hazard function is derived below:

h(x) = lim_{dx→0} [P(x < X ≤ x + dx) / dx] · [1 / S(x)]
     = f(x) / (1 − F(x))
The simplest models, such as MIL-HDBK-217F (1991), assume a constant hazard
rate. However, field data revealed that most systems do not have a constant
hazard rate. Some probability distributions such as the Weibull distribution are
often used in reliability engineering to represent time-dependent failure behavior.
Other models, such as the ”roller coaster” [65] model and mixed distribution
models[33], have been proposed to model hazard rates.

Another measure of reliability is the cumulative hazard function (CHF),
defined by

H(x) = ∫_0^x h(s) ds,    x ≥ 0

You can interpret H(x) as the cumulative amount of hazard up to time x.

Figure 2: Example of three types of cumulative hazard function: (a) constant
hazard rate, (b) increasing hazard rate and (c) decreasing hazard rate

In figure 2, H(x) is plotted against x. For a constant hazard rate, H(x) is a
linearly increasing function of time, implying that the hazard rate does not change
with age. A concave-up plot of CHF against time implies an increasing hazard
rate, whereas a concave-down plot of CHF implies a decreasing hazard rate.
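As an illustration of these three regimes (a sketch under assumed Weibull parameters, not part of the thesis), the code below evaluates h(x) = f(x)/R(x) and H(x) = −ln R(x) for shape parameters below, at, and above 1, which give decreasing, constant and increasing hazard rates respectively.

```python
# Illustrative sketch: hazard and cumulative hazard for assumed Weibull shapes.
import numpy as np
from scipy.stats import weibull_min

x = np.linspace(0.1, 3.0, 4)
for shape in (0.5, 1.0, 1.5):               # decreasing, constant, increasing hazard
    dist = weibull_min(c=shape, scale=1.0)  # scale fixed arbitrarily at 1
    h = dist.pdf(x) / dist.sf(x)            # hazard rate h(x) = f(x)/R(x)
    H = -np.log(dist.sf(x))                 # cumulative hazard H(x) = -ln R(x)
    print(f"shape {shape}: h = {np.round(h, 3)}, H = {np.round(H, 3)}")
```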

Rate of Occurrence of Failures: For repairable systems, the intensity of fail-


ures is described by the rate of occurrence of failures (ROCOF) or intensity func-
tion. ROCOF is the probability of failure in a small interval divided by the length
of the interval.[14] The estimation of ROCOF can become a very complicated pro-
cedure.[18] When M(t) is differentiable, we define the ROCOF as

m(t) = M′(t)

where M(t) is the mean cumulative number of failures (MCNF):

M(t) = E[N(t)] = expected number of events (failures) in (0, t]

m(t) = lim_{Δt→0} [M(t + Δt) − M(t)] / Δt
     = lim_{Δt→0} [E[N(t + Δt)] − E[N(t)]] / Δt
     = lim_{Δt→0} E[N(t + Δt) − N(t)] / Δt
     = lim_{Δt→0} E[N(t, t + Δt)] / Δt
     = lim_{Δt→0} (expected number of failures in (t, t + Δt]) / Δt

So the expected number of failures in (t, t + Δt] is approximately m(t) · Δt.
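This approximation is easy to verify numerically. The sketch below is illustrative only: it assumes a power-law MCNF M(t) = λt^β (anticipating Section 3.4) with made-up parameters and compares a finite-difference estimate of m(t) with the exact derivative.

```python
# Illustrative sketch: ROCOF as the derivative of the MCNF, m(t) = M'(t),
# for an assumed power-law form M(t) = lam * t**beta.
lam, beta = 0.5, 1.3                    # made-up parameters

def M(t):
    return lam * t ** beta              # mean cumulative number of failures

t, dt = 100.0, 1e-6
m_numeric = (M(t + dt) - M(t)) / dt     # finite-difference approximation
m_exact = lam * beta * t ** (beta - 1)  # analytic M'(t)
print(m_numeric, m_exact)               # the two values agree closely
```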


ROCOF is different from the hazard rate as it is a property of a sequence of failure
times as opposed to a property of a single time to failure.

The ROCOF m(t) has some resemblance to the hazard rate h(x).[32] The quan-
tity h(x)∆x is approximately the probability that a component will fail within
the time interval (x, x + ∆x] given that the component is operating at time x. Simi-
larly, the quantity m(t)∆t is approximately the probability that a system will fail
within the time interval (t, t + ∆t]. Given these similarities, it is not surprising
that the two functions are often confused with one another.[32]

The ”Failure Rate” Confusion: There have been some misunderstandings


regarding the use of terminologies to describe the system and its components,
especially with the term ”failure rate.” Ascher (1984) argues in his book about the
misuse of the term ”failure rate.” The term was frequently used interchangeably
in literature for both force of mortality (FOM) or hazard rate of a non-repairable
system and rate of occurrence of failures (ROCOF) of sequences of failures in a
repairable system.[56, 57, 58] A subtle source of confusion about ”failure rate” is
the improper use of the term in an official publication, MIL-HDBK-217D (1982).
Reliability engineers and researchers can make poor analysis decisions due to a
lack of distinction.
The term has also been used to express the reliability of non-repairable com-
ponents and non-repairable components functioning inside a repairable system.
It also has been used to express the reliability of repairable systems. However,
the meaning of the term failure rate is different in each of these contexts. Hence,
the term ”failure rate” must be avoided.[8]

Bathtub Curve: The ROCOF is generally a function of time and may follow
the bathtub curve. Figure 3 depicts a typical bathtub curve. This is perhaps
the most famous graphical representation in the field of reliability. It is named
the bathtub curve based on the fact that its shape resembles a cross-sectional

view of a bathtub. The curve is divided into three different sections. The first is the
burn-in period, also known as the early failures or infant mortality period. The middle sec-
tion is referred to as the useful life or random failures period, and it is assumed
that failures occur at random, with a constant ROCOF. The wear-out failures
are described in the latter section of the curve, and it is assumed that the RO-
COF increases as the wear-out accelerates. The burn-in stage is characterized by

Figure 3: Bathtub shaped ROCOF

a high initial ROCOF that gradually decreases when defective parts fail early.
This is followed by a slower, steady ROCOF dominated by randomly distributed
failures. Failures during this time frame are usually covered by warranties. As
parts wear out, the ROCOF rises again at the end of life. Failures in this stage
can be attributed to aging, fatigue, wear-out, etc. and they are to be expected.
Accelerated life testing[46] subjects the components or subsystems under test to
extreme operating and/or environmental conditions, such as high temperatures,
in order to induce failures in a short period of time. This enables the fast identi-
fication of weaker components.
Ascher (1984) also suggests that the bathtub curve for a repairable system sub-
stantially differs from that of a component or part. He claimed that the time
between successive failures in the burn-in phase of the repairable system tends to
increase, so the rate of occurrence of failures tends to decrease. However, failures
of a part or component tend to occur more frequently, i.e., the interarrival times
tend to become smaller in the initial phase of the bathtub curve. Even though he
does not discuss the reasons for these types of behaviors, these differences in be-
haviors could be attributed to the lifetime of the systems. Typically, a repairable
system has an unlimited lifetime, while a component has only one.

Mean Cumulative Number of Failures (MCNF) The cumulative plot is


often used to visualize the mean cumulative number of failures for a repairable
system over the age of the system. It is a great tool for identifying patterns in

data as it reveals trends in the occurrence of failures. Figure 4 shows three
different cumulative plots. The plot of MCNF reveals the evolution of failures
with time.

Figure 4: MCNF plots for three different types of systems: (a) stable system,
(b) improving system, (c) worsening system. (Source: [59])

The shape of the curve indicates whether the number of failures in
the system is increasing (worsening), decreasing (improving) or staying the
same (stable) over time. The system has a decreasing ROCOF
and is improving if the MCNF curve is concave-down. A trendless (straight line)
MCNF curve indicates that the system is staying the same. A concave-up or steep
MCNF curve shows the system has an increasing ROCOF and is getting worse
as there are more failures as time progresses.
The MCNF is calculated incrementally for each failure event, taking into account
the number of systems at risk at the time. The MCNF must be adjusted for the
presence of censoring, i.e., when some observations do not have complete failure
information. The MCNF can be calculated using parametric or non-parametric
methods. These methods will be discussed in later sections.
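As an illustrative sketch of the incremental calculation just described (the failure ages are invented, and all systems are assumed observed over a common window so the number at risk stays constant), a simple nonparametric MCNF estimate adds 1/(number of systems at risk) at each failure age:

```python
# Illustrative sketch: nonparametric MCNF estimate for n systems observed
# over a common window [0, tau]; the failure ages below are invented.
import numpy as np

failures = [                       # per-system failure ages (hours)
    [120.0, 350.0, 900.0],
    [500.0],
    [80.0, 640.0, 700.0, 980.0],
]
tau = 1000.0                       # common end of observation (time truncation)
n = len(failures)                  # systems at risk throughout [0, tau]

ages = np.sort(np.concatenate([np.asarray(f) for f in failures]))
mcnf = np.cumsum(np.full(ages.size, 1.0 / n))   # each failure adds 1/n
for t, m in zip(ages, mcnf):
    print(f"age {t:6.1f} h   MCNF = {m:.3f}")
```

With staggered installation dates or censoring, the number of systems at risk would change with age, and each failure would instead contribute 1/(systems at risk at that age).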

Mean Time Between Failure (MTBF): MTBF, the mean operating time
between failures, is a commonly used metric to represent the overall reliability
of repairable systems. It represents how long a repairable system can operate
without being interrupted by a failure. Let Xi for i = 1, 2, 3, . . . represent the
times between failures or interarrival times; the mean or expected time between
failures is E[Xi].
The arithmetic mean value of the reliability function, R(x), is another way to
define MTBF, as shown below:

MTBF = ∫_0^∞ R(x) dx

Assuming a constant rate of occurrence of failures, MTBF is defined as the
inverse of the rate of occurrence of failures:

MTBF = 1/λ

where λ is the constant rate of occurrence of failures.
There are numerous issues associated with the use of this metric, and it also
requires a number of assumptions. First, it assumes that failures of a repairable system
form a renewal process, i.e., the system is "as good as new" after each repair. It also
assumes the times between failures are independent and exponentially distributed
with a constant rate of occurrence of failures (i.e., a HPP), meaning there are no early
failures or wear-out. In their research,[59] Trindade and Nathan argue that MTBF
masks information and fails to account for trends in failure data. Different systems
may have the same MTBF but very different failure behavior.
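Connecting the two definitions above, a brief hedged sketch (not from the thesis): for an assumed constant ROCOF λ, numerically integrating R(x) = e^(−λx) recovers MTBF = 1/λ.

```python
# Illustrative sketch: MTBF as the integral of the reliability function,
# checked against 1/lambda for an assumed constant ROCOF.
import numpy as np
from scipy.integrate import quad

lam = 0.002                             # assumed failures per hour

def R(x):
    return np.exp(-lam * x)             # reliability under constant ROCOF

mtbf, _ = quad(R, 0, np.inf)            # MTBF = integral of R(x) on [0, inf)
print(mtbf, 1.0 / lam)                  # both print 500.0 hours
```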

Mean Time To Failure (MTTF): This is the same as the mean time between
failures but for non-repairable systems or components. MTTF is a maintenance
metric that measures the average amount of time a component operates before it
fails.

Censoring: This occurs when the precise failure time of the part or component
is unknown. Data for which the exact failure time is known is referred to as
”complete data.” It is not always possible to collect data for lifetime distribution
in a complete form. As a result, you may have to handle data that is censored
or truncated. There are two common types of censored data in reliability: right
censoring and left censoring.

1. Right Censoring. Observations may cease before the failure has occurred at
time T . When a component’s failure time is unknown but only known to
exceed a certain point in time, it is said to be right censored. This can also
be the case when the component is removed from the observation before it
failed. C denotes the time at which observation ceases.

Ti = min(T, C)

Usually, 1 is used as an indicator for failure occurrence and 0 otherwise.


Here’s an example of right censoring: Suppose a study is conducted on the
reliability of hard disk drives in a computer. We want to know how long
the disks last before they fail. The computer has seven hard drives in their
slots and we observe them over a one-year period. During this period, three
disks failed and were replaced with new ones. At the end of the one-year
period, seven hard disk drives were still operating: the four original drives that
never failed and the three replacements. We do not know their
failure times beyond the one-year period because we stopped observing. The
failure times of the four disk drives that did not fail during the observation
period are right censored, and so are those of the three replacement disks.
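A hedged sketch of how the hard-drive example might be encoded (all numbers invented): each record carries an observed time Ti = min(T, C) and an indicator that is 1 for an observed failure and 0 for a right-censored observation.

```python
# Illustrative sketch: right-censored records for the hard-drive example.
# Times are in days over a 365-day window; the values are invented.
C = 365.0                                # observation ends at one year

# Failure day for the three drives seen to fail; None for drives still
# running at the end of observation.
raw_times = [90.0, 200.0, 310.0, None, None, None, None]

records = [(t, 1) if t is not None else (C, 0) for t in raw_times]
print(records)   # [(90.0, 1), (200.0, 1), (310.0, 1), (365.0, 0), ...]
```

The three replacement disks would contribute further censored records whose operating times run only from their installation dates to the end of the year.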

2. Left Censoring. This happens when a component of a repairable system has


been working for an unknown amount of time before we start observing it.
This type of data is uncommon in the study of systems in reliability.

Failure Truncation and Time Truncation: The data are considered to be


failure truncated if the monitoring of the repairable system ends after a specified
number of failures, say n. If monitoring stops at a predetermined time T, the
data is said to be time truncated.[14] The data used for the case study in this
study (see Section 4.1) is time truncated.

Figure 5: Types of Censoring
study (see 4.1) is time truncated.

3.2 Basic Concepts for Stochastic Point Processes


Definition 1 Let (Ω, F, P ) be a probability space and let T be an arbitrary set
(called an index set). Any collection of random variables X = {Xt : t ∈ T }
defined on (Ω, F, P ) is called a stochastic process with index set T .

Every t ∈ T corresponds to some random variable Xt . A random experiment


has the outcome ω ∈ Ω according to the probability measure P . A realization
(or sample path) of a stochastic process corresponds to the outcome ω. The set
of all possible realizations of a stochastic process is called ensemble. Stochastic
processes are classified according to state space and time domain. The state
space can be either discrete or continuous, and the time domain can also be
either discrete or continuous.

An important class of stochastic processes have the Markov property. A stochastic


process has the Markov property if its future evolution depends only on its current
state, and does not depend on past history.

A point process is a stochastic model that describes the occurrence of events over
time. A model of a repairable system must describe the occurrence of events over
time. In the context of reliability, events are typically referred to as “failures”.
In repairable systems analysis, the two main models used are stochastic point

processes and differential equations. This thesis will focus on only stochastic
point processes for modeling failures in a repairable system.

Definition 2 (Ascher and Feingold, 1984) A stochastic point process is a math-


ematical model for a physical phenomenon characterized by highly localized events
distributed randomly in a continuum.

A stochastic point process is simply a random collection of points that fall into
some space. The continuum is time in our application, even though Cox (1966)
labels this an oversimplification. Hence, we are dealing with temporal point pro-
cesses. Temporal point processes represent a set of observed events that occur at
various points in time.

Before we discuss some basic concepts in stochastic point processes, we introduce


the counting process. Suppose a component in a repairable system is put into
operation at time T0 . The first failure of the component will occur at time T1 .
The faulty component will be replaced, and the system will be restored to normal
operation. The second and third failure will occur at T2 and T3 and so on. We
thus get a sequence of failure times T1 , T2 , T3 , . . .. Let Xi be the time between
failure i − 1 and failure i for i = 1, 2, 3, . . .. The counting process is used to model
a sequence of failure events. The sequence of interarrival times, X1 , X2 , X3 , . . .
will generally not be independent and identically distributed - unless the system
is restored to ”as good as new” condition.

Definition 3 (Ross 1996) A stochastic process {N (t), t ≥ 0} is said to be a


counting process if N (t) satisfies:
1. N (t) ≥ 0

2. N (t) is integer valued

3. If s < t, then N (s) ≤ N (t)

4. For s < t, [N (t)−N (s)] represents the number of failures that have occurred
in the interval (s, t]

A counting process may be represented as either a sequence of failure times or a


sequence of interarrival times, as both representations contain the same informa-
tion about the counting process.[49]

Arrival and Interarrival Times


Ti , i = 1, 2, 3, . . . measures the total time from the start of operation T0 to the
ith failure and is called the arrival time to that failure. Ti is a random variable.
Xi , i = 1, 2, 3, . . . is the interarrival time between failure i − 1 and failure i. Xi is a
random variable. Since the origin for Xi is the arrival time of the failure at i − 1,
we say that the Xi's are chronologically ordered.[7] The arrival times satisfy

Tk = X1 + X2 + X3 + . . . + Xk

Figure 6: Relationship between the number of failures N (t), the interarrival times,
Xi and failure times, Ti .

The random variable, N (t), is the number of failures which occur during (0, t].
{N (t), t ≥ 0} is the integer valued counting process which includes both the
number of failures in (0, t], N (t), and the moment in time T1 , T2 , T3 , . . . at which
they occur. N (t) represents the cumulative number of failures. The expected
value of N(t) is denoted by M(t), i.e. M(t) = E[N(t)], which is the MCNF.
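A brief hedged sketch of this bookkeeping (the interarrival times are invented): arrival times are cumulative sums of the interarrival times, and N(t) counts the arrivals in (0, t].

```python
# Illustrative sketch: interarrival times X_i, arrival times
# T_k = X_1 + ... + X_k, and the counting process N(t).
import numpy as np

X = np.array([4.0, 2.5, 6.0, 1.5, 3.0])  # invented interarrival times
T = np.cumsum(X)                         # arrival times T_1, ..., T_5
print(T)                                 # [ 4.   6.5 12.5 14.  17. ]

def N(t):
    """Number of failures in (0, t]."""
    return int(np.searchsorted(T, t, side="right"))

print(N(10.0), N(17.0))                  # 2 failures by t=10, 5 by t=17
```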

Independent Increments
If a counting process has independent increments, then there is no dependence
between the number of failures in an interval and the number of failures in
another interval. Mathematically, a counting process {N (t), t ≥ 0} is said to
have independent increments if for all t0 ≤ t1 ≤ . . . ≤ tk , k = 2, 3, . . . , N (t1 ) −
N (t0 ), . . . , N (tk ) − N (tk−1 ) are independent random variables.

Stationary Increments
A counting process {N (t), t ≥ 0} has stationary increments if for any two points
t > s ≥ 0, and any ∆ > 0 the random variables N (t) − N (s) and N (t + ∆) −
N (s + ∆) are identically distributed.

Synchronous and Asynchronous sampling


The process is sampled asynchronously when arrival times of fail-
ures are observed after the time the system was placed into operation, without
prior knowledge of the failures before the observation began, i.e. the system starts
operation at T−∞ but observation begins at T0. Sampling is synchronous when
the system is put into operation simultaneously with the start of observation.[7]

Stationary point process


A stochastic point process is said to be stationary if its increments are stationary.

The renewal process (see section 3.3) has independent and identically distributed
interarrival times, so under synchronous sampling it has a stationary sequence of
interarrival times. However, the process is still not stationary because it does not
have stationary increments.

Rate of occurence of failures for stationary, transient and nonstationary process


ROCOF can be stationary, transient or nonstationary. It might be constant but
Cox and Lewis (1966) point out that ”the possibility of a constant ROCOF is
usually ignored...." It is shown (in Cox and Lewis, 1966) that for a stationary
process

m(t) ≡ M′(t) ≡ (d/dt)E[N(t)] = 1/E[Xi] = 1/E[X] = m,
the ROCOF is the reciprocal of the mean time between failure or interarrival
times. The ROCOF of an asynchronously sampled point process, such as the
renewal process, is just the reciprocal of the mean of each interarrival times. The
time invariance of the asynchronous ROCOF characterizes a stationary process,
i.e. events of the process occur at a constant rate.[8] The ROCOF for a transient
process is time dependent at the beginning of the process but eventually approaches
the constant m = 1/E[X].[7] A nonstationary point process has a time dependent
ROCOF which could asymptotically approach a constant. It can approach an
asymptote regardless of whether the process is sampled synchronously or asyn-
chronously.

Intensity Function
The intensity function of a stochastic point process, ρ(t), is the same as the
ROCOF associated with a repairable system.

Improving and deteriorating properties of a stochastic point process


A repairable system is said to be deteriorating if the times between failures have
a tendency to become shorter as it ages. When the times between failures have
a tendency to increase, the system is improving.[14]

Chronological ordering of component failure times


This is when component failure times are chronologically ordered. A key point to
remember is that if data exist in a specific order (chronologically or otherwise),
the data should be initially evaluated in that order.[36]

Typically, the assumptions we make about how a system ages and how failure
and repair affect it will influence our choice of a repairable system model.

Renewal (or Perfect) Repair


A renewal repair presumes that the system is restored to like-new condition fol-
lowing the repair. If every repair is a renewal, then the time between failures is
independently and identically distributed. The renewal process (see section 3.3.4)
is a suitable model for the system.

Minimal Repair
Minimal repair means that the repair done on a system leaves the system in
exactly the same condition as it was before the failure.[14] The assumption of
minimal repair leads to the nonhomogeneous Poisson process (NHPP). The NHPP
is often a good model for repairable systems because it can model systems that
are deteriorating or improving.

3.3 Models Applicable to Repairable Systems
This section is a brief discussion of commonly used point processes which have
been applied to model repairable systems.

3.3.1 Probabilistic Models: The Poisson Process


Poisson processes are the most commonly used probabilistic models in a counting
process.

Definition 4 (Poisson Process) A counting process N (t) is said to be a Pois-


son process if

1. The cumulative number of failures at time t = 0 is 0, i.e. N (0) = 0.

2. {N (t) t ≥ 0} has independent increments.

Consider a process of point events occurring on the real axis. Let N(t, t + ∆t)
denote the number of events in the small time interval (t, t + ∆t]. Then

Pr[N(t, t + ∆t) = 0] = 1 − ρ∆t + o(∆t)    (1)

Pr[N(t, t + ∆t) = 1] = ρ∆t + o(∆t)    (2)

Pr[N(t, t + ∆t) > 1] = o(∆t)    (3)

where ρ is called the intensity function or the rate of the Poisson process.

The properties above of the Poisson process imply that

P[N(t) = n] = e^(−ρt) (ρt)^n / n!

is a Poisson distribution with mean ρt. The most commonly used probabilistic
models in the counting process are homogeneous and nonhomogeneous Poisson
processes.

3.3.2 Homogeneous Poisson Process (HPP)


HPP is defined as a sequence of independent and identically exponentially dis-
tributed Xi ’s. Çinlar (1975) defines HPP as the orderly stochastic process with
stationary, independent increments.

Definition 5 (Ascher and Feingold, 1984) The counting process {N (t), t ≥


0} is said to be an HPP if

(a) The number of failures in any interval of length t2 − t1 has a Poisson dis-
tribution with mean ρ(t2 − t1 ). That is, for all t2 > t1 ≥ 0,

Pr[N(t2) − N(t1) = j] = e^(−ρ(t2−t1)) [ρ(t2 − t1)]^j / j!

for j ≥ 0. From condition (a) it follows that

E[N(t2) − N(t1)] = ρ(t2 − t1)

where the constant ρ is the rate of occurrence of failures. Since the HPP has sta-
tionary, independent increments, m(t) = ρ = 1/E[X], i.e. the ROCOF
is a constant. From the definition, the reliability function R(t1, t2) is

R(t1 , t2 ) = e−ρ(t2 −t1 )

The times between failures of an HPP are exponentially distributed with mean 1/ρ,
and the time to the nth failure, Tn, from a system modeled by an HPP has a gamma
distribution. A system that fails in accordance with an HPP has no memory of
its age. This is the most basic model for repairable systems, however it should
be used with caution.[14] The HPP cannot be used to describe systems that
deteriorate or improve because it has a constant intensity function or ROCOF.
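As a hedged illustration of these properties (not from the thesis), the sketch below simulates an HPP by summing i.i.d. exponential interarrival times and checks that the average count over a window of length t is close to E[N(t)] = ρt; the rate and horizon are assumptions.

```python
# Illustrative sketch: simulating an HPP with assumed rate rho and checking
# that the mean of N(t) over many replications is close to rho * t.
import numpy as np

rng = np.random.default_rng(0)
rho, t_end, reps = 0.2, 1000.0, 500      # assumed rate, horizon, replications

counts = []
for _ in range(reps):
    X = rng.exponential(1.0 / rho, size=400)  # i.i.d. exponential interarrivals
    T = np.cumsum(X)                          # failure times
    counts.append(int(np.sum(T <= t_end)))    # N(t_end)

print(np.mean(counts), rho * t_end)      # sample mean vs. E[N(t)] = 200
```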

3.3.3 Nonhomogeneous Poisson Process (NHPP)


NHPP is a direct generalization of HPP. The rate of occurrence of failure (RO-
COF) for NHPP is assumed to vary with time, m(t) = ρ(t), rather than being
constant.

Definition 6 (Ascher and Feingold, 1984) The number of failures in any in-
terval (t1, t2] has a Poisson distribution with mean ∫_{t1}^{t2} ρ(t)dt. That is, for all
t2 > t1 ≥ 0,

Pr[N(t2) − N(t1) = j] = e^(−∫_{t1}^{t2} ρ(t)dt) {∫_{t1}^{t2} ρ(t)dt}^j / j!

for j ≥ 0. From the above, it follows that

E[N(t2) − N(t1)] = ∫_{t1}^{t2} ρ(t)dt

From the definition, the reliability function R(t1, t2) is

R(t1, t2) = e^(−∫_{t1}^{t2} ρ(t)dt)

For NHPP, the interarrival times Xi ’s are neither independent nor identically
distributed. The interarrival times are not independent samples from any single
distribution, including the exponential distribution. However, the independent
increment property still holds. The NHPP is characterized by the minimal repair
assumption, which states that the system after repair is only as good as it was
immediately before the failure. A popular case of the NHPP is the Power Law
Process (see Section 3.4.1).
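As a hedged sketch (not the thesis's method), one standard way to simulate an NHPP is thinning: generate candidates from an HPP whose rate bounds ρ(t) from above on the horizon, and accept each candidate with probability ρ(t)/ρ_max. The power-law intensity and its parameters below are assumptions.

```python
# Illustrative sketch: simulating an NHPP by thinning, with an assumed
# power-law intensity rho(t) = lam * beta * t**(beta - 1).
import numpy as np

rng = np.random.default_rng(1)
lam, beta, t_end = 0.5, 1.3, 100.0

def rho(t):
    return lam * beta * t ** (beta - 1.0)   # increasing, since beta > 1

rho_max = rho(t_end)                        # upper bound on [0, t_end]
t, failures = 0.0, []
while True:
    t += rng.exponential(1.0 / rho_max)     # candidate from bounding HPP
    if t > t_end:
        break
    if rng.uniform() < rho(t) / rho_max:    # accept w.p. rho(t)/rho_max
        failures.append(t)

print(len(failures), lam * t_end ** beta)   # N(t_end) vs. E[N(t_end)]
```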

3.3.4 Renewal Process
The renewal process is also a generalization of the HPP. The times between succes-
sive failures are, as in the HPP, independently and identically distributed, but with
an arbitrary common PDF f(x) rather than necessarily an exponential one. When
sampled synchronously, the renewal process is an example of a transient point
process[7] and the ROCOF is asymptotically constant. The term "good as new"
has been used to describe renewal. The ROCOF for the renewal process can be
derived from the PDF as[18]

m*(s) = f*(s) / (1 − f*(s))

where f*(s) and m*(s) denote the Laplace transforms of the PDF and the ROCOF.

3.3.5 Superimposed Renewal Process (SRP)


Suppose that there are n renewal processes operating independently of each other.
Then the stochastic process formed by the union of all events is known as the
superposition of n renewal processes or a superimposed renewal process (SRP).
Çinlar (1972) presents a thorough review of the SRP. In terms of reliability, SRP
can be explained as follows. Suppose a system is made up of n sockets, each of
which contains a component. When a component fails, the entire system fails, and
the failed component is replaced with a new, identical component. The socket
is considered ”as good as new” immediately after the component is replaced.
The replacement can be regarded as a renewal. Note that an SRP does not have
independent and identically distributed interarrival times (see figure 7).

Figure 7: Superposition of Renewal Processes

In figure 7, an SRP is formed by failures observed from n sockets. The failures in each
of the n sockets are depicted by the red crosses, and the failures of the system,
depicted by the black crosses on the bottom horizontal line, are formed by the
union of all the failures of the n sockets.
3.4 NHPP Reliability Growth Models
The most commonly used stochastic process for modeling reliability growth is
the nonhomogeneous Poisson process (NHPP). The Power Law model and the
Log-linear model are discussed in this section.

3.4.1 Power Law model


A number of reliability growth models have been proposed in the literature for
estimating system reliability. The power law model, often known as the Weibull
process, is the most commonly discussed NHPP model in the literature. Its
popularity is perhaps due to the study conducted by Duane (1964). In
this study, Duane[24] noticed that the cumulative rates of failure plotted against
the cumulative operating time were close to a straight line on a ln-ln scale. Crow
(1974)[21] noted that the Duane postulate could be stochastically represented as
a Weibull process. More studies were carried out on the power law model after
Duane’s study. Rigdon and Basu (1989)[50] provided a thorough review of the
model. Lee and Lee (1978)[41] and Bain and Engelhardt (1980)[9] addressed point
estimates and proposed tests for the model’s parameters. Thompson (1988) and
Ascher and Feingold (1984) examine the applications of this model and present
various inference tools.

Suppose a system is put into operation at time T0 = 0, and let T1 < T2 < . . . < Tn
be the first n arrival times of a random point process. The power law process
is defined by its intensity function (or ROCOF). The Crow/AMSAA reliability
model is

$$\rho(t) = \lambda\beta t^{\beta-1}, \qquad \lambda > 0,\ \beta > 0,\ t > 0 \qquad (4)$$

where t is the age of the system. The process in Equation (4) can be expressed through
the mean cumulative number of failures function

$$M(t) = E[N(t)] = \lambda t^{\beta} \qquad (5)$$

When β = 1, the interarrival times Xi follow an exponential distribution and the
process reduces to an HPP. In the presence of reliability growth,
however, the interarrival times should be stochastically increasing. This occurs for
the Weibull process when 0 < β < 1, i.e. when ρ(t) is decreasing.[22] Finkelstein
(1976) reparameterized the intensity function as [28]

$$\rho(t) = \frac{\beta}{\alpha}\left(\frac{t}{\alpha}\right)^{\beta-1}, \qquad \alpha > 0,\ \beta > 0,\ t > 0 \qquad (6)$$

$$M(t) = \left(\frac{t}{\alpha}\right)^{\beta} \qquad (7)$$

where $\alpha = (1/\lambda)^{1/\beta}$. Here α and β represent the process's scale and shape
parameters, respectively.
3.4.2 Parameter Estimation of the Power Law Model
The values of λ and β will be estimated based on failure data from the k systems.
Assume we have k systems, whose operation starts at times Sq and ends at Tq
for q = 1, . . . , k. Nq is the total number of failures for the qth system and Xiq is
the age of the system at the ith occurrence of failure. The maximum likelihood
estimates for λ and β are given by

$$\hat{\lambda} = \frac{\sum_{q=1}^{k} N_q}{\sum_{q=1}^{k} \left(T_q^{\hat{\beta}} - S_q^{\hat{\beta}}\right)} \qquad (8)$$

and

$$\hat{\beta} = \frac{\sum_{q=1}^{k} N_q}{\hat{\lambda} \sum_{q=1}^{k} \left(T_q^{\hat{\beta}} \ln T_q - S_q^{\hat{\beta}} \ln S_q\right) - \sum_{q=1}^{k} \sum_{i=1}^{N_q} \ln X_{iq}} \qquad (9)$$

Eqns. (8) and (9) must be solved by an iterative procedure. If all the systems have
the same start time at zero, i.e. Sq = 0, and all end at the same time Tq = T,
then these equations simplify to Eqns. (10) and (11) below.

$$\hat{\lambda} = \frac{\sum_{q=1}^{k} N_q}{k\,T^{\hat{\beta}}} \qquad (10)$$

$$\hat{\beta} = \frac{\sum_{q=1}^{k} N_q}{\sum_{q=1}^{k} \sum_{i=1}^{N_q} \ln\left(\frac{T}{X_{iq}}\right)} \qquad (11)$$

In Eqns. (10) and (11) the maximum likelihood estimates are in closed form.
Also, when k = 1, the estimates for λ and β are

$$\hat{\lambda} = \frac{N_1}{T_1^{\hat{\beta}}} \qquad (12)$$

$$\hat{\beta} = \frac{N_1}{\sum_{i=1}^{N_1} \ln\left(\frac{T_1}{X_{i1}}\right)} \qquad (13)$$
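A minimal sketch of the single-system, time-truncated case of Eqns. (12) and (13) is given below (Python; the failure ages are hypothetical and for illustration only).

```python
import numpy as np

def power_law_mle(failure_ages, T):
    """Closed-form MLEs for a single system observed on (0, T],
    following Eqns. (12) and (13); failure_ages are the system ages
    at each failure (the X_i1's)."""
    x = np.asarray(failure_ages, dtype=float)
    n = x.size
    beta_hat = n / np.sum(np.log(T / x))
    lam_hat = n / T**beta_hat
    return lam_hat, beta_hat

# Hypothetical failure ages (days) for illustration:
lam_hat, beta_hat = power_law_mle([5.0, 40.0, 120.0, 300.0, 700.0], T=1000.0)
print(f"lambda_hat = {lam_hat:.4f}, beta_hat = {beta_hat:.4f}")
```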

3.4.3 Test of Significance for β̂ - MIL-HDBK-189 test (1981)


This test for tracking reliability growth was developed by the US Army Materiel
Systems Analysis Activity (AMSAA) (Unkle and Venkataraman, 2002). It as-
sumes that the ROCOF of failure events is $\rho(t) = \lambda\beta t^{\beta-1}$.

The hypotheses to be tested are:

H0 : β = 1, i.e. no trend in the data (homogeneous Poisson process).

H1 : β ≠ 1, i.e. there is a trend in the data (nonhomogeneous Poisson process).

The MIL-HDBK-189 test is based on the test statistic

$$U = 2\sum_{i=1}^{m-1} \ln\left(\frac{T_m}{T_i}\right) \qquad (14)$$

Under the null hypothesis, U follows a chi-square distribution, χ², with 2(m − 1)
degrees of freedom. Interpretation:

• If the null hypothesis is rejected, you might conclude that your data has a
trend and should be modeled with a nonhomogeneous Poisson process, such
as the power law process.

• If you fail to reject the null hypothesis, there is not sufficient evidence
against the homogeneous Poisson process, and the HPP may be an appropriate
model to use.

In the case of no growth, β is equal to 1; when β < 1, the process indicates
reliability growth; when β > 1, the process indicates reliability deterioration.
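The statistic of Eqn. (14) is easy to compute directly; the sketch below (Python with SciPy; the arrival times are hypothetical) returns U and a two-sided p-value from the χ² distribution with 2(m − 1) degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def mil_hdbk_189_test(arrival_times):
    """MIL-HDBK-189 test of H0: beta = 1 (HPP) for a failure-truncated
    system, using Eqn. (14); returns U and a two-sided p-value."""
    t = np.sort(np.asarray(arrival_times, dtype=float))
    m = t.size
    U = 2.0 * np.sum(np.log(t[-1] / t[:-1]))
    df = 2 * (m - 1)
    cdf = chi2.cdf(U, df)
    return U, 2.0 * min(cdf, 1.0 - cdf)

# Hypothetical arrival times for illustration:
U, p = mil_hdbk_189_test([5.0, 40.0, 120.0, 300.0, 700.0])
print(f"U = {U:.3f}, p-value = {p:.4f}")
```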

3.4.4 Cox-Lewis Log-Linear Model


Another NHPP reliability growth model is the log-linear model proposed by
Cox and Lewis (1966). If the power law model of Section 3.4.1, with intensity
function $\rho(t) = \lambda\beta t^{\beta-1}$, is rejected by a goodness-of-fit test, the log-linear model
can be fitted. The failure intensity function of the log-linear model is given as

$$\rho(t) = e^{\alpha_0 + \alpha_1 t}, \qquad -\infty < \alpha_0, \alpha_1 < \infty,\ t \geq 0 \qquad (15)$$

This intensity function has the advantage that $\hat{\rho}(t)$ is automatically non-negative,
so no nonlinear constraints need to be placed on the parameter estimates. The
parameters α0 and α1 can be estimated from the failure data, although despite the
advantages of the log-linear intensity function, estimating the parameters can be
difficult. The mean cumulative number of failures function is

$$M(t) = \frac{e^{\alpha_0 + \alpha_1 t} - e^{\alpha_0}}{\alpha_1} \qquad (16)$$
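Unlike the power law model, the log-linear model has no closed-form MLEs, but the general NHPP log-likelihood, Σ log ρ(ti ) − M (T ), can be maximized numerically. The sketch below (Python with SciPy; the data and starting values are hypothetical) illustrates one way to do this.

```python
import numpy as np
from scipy.optimize import minimize

def loglinear_nhpp_mle(arrival_times, T):
    """Numerical MLE for the Cox-Lewis log-linear NHPP on (0, T].
    The NHPP log-likelihood is sum(log rho(t_i)) - M(T), with
    rho(t) = exp(a0 + a1 t) and M(T) given by Eqn. (16)."""
    t = np.asarray(arrival_times, dtype=float)

    def negloglik(params):
        a0, a1 = params
        if abs(a1) < 1e-12:                 # limit a1 -> 0: M(T) = T * exp(a0)
            M_T = T * np.exp(a0)
        else:
            M_T = (np.exp(a0 + a1 * T) - np.exp(a0)) / a1
        return -(np.sum(a0 + a1 * t) - M_T)

    res = minimize(negloglik, x0=[np.log(t.size / T), 0.0], method="Nelder-Mead")
    return res.x  # (a0_hat, a1_hat)

# Hypothetical arrival times for illustration:
a0_hat, a1_hat = loglinear_nhpp_mle([5.0, 40.0, 120.0, 300.0, 700.0], T=1000.0)
print(f"alpha0_hat = {a0_hat:.4f}, alpha1_hat = {a1_hat:.6f}")
```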

3.4.5 Goodness-of-Fit Test for NHPP models


The Cramér–von Mises statistic [21] for the goodness-of-fit test is

$$C_R^2 = \frac{1}{12m} + \sum_{i=1}^{m} \left(\hat{R}_i - \frac{2i-1}{2m}\right)^2 \qquad (17)$$

where

$$\hat{R}_i = \left(\frac{T_i}{T}\right)^{\hat{\beta}}$$

The observation on the system ends at time T, the ith arrival time is given as Ti ,
and m is the number of system failures.
$C_R^2$ has its own critical values for various values of m. A value of $C_R^2$ larger
than the critical value leads to the rejection of the null hypothesis

H0 : the failure times were governed by a power law process.

and the conclusion that the model does not fit adequately.
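A sketch of the computation of Eqn. (17) is given below (Python; the arrival times and β̂ are hypothetical, and the resulting statistic must still be compared against tabulated critical values for the given m).

```python
import numpy as np

def cramer_von_mises(arrival_times, T, beta_hat):
    """Cramer-von Mises statistic of Eqn. (17) for the power law
    process: small values indicate an adequate fit."""
    t = np.sort(np.asarray(arrival_times, dtype=float))
    m = t.size
    R = (t / T) ** beta_hat
    i = np.arange(1, m + 1)
    return 1.0 / (12 * m) + np.sum((R - (2 * i - 1) / (2 * m)) ** 2)

# Hypothetical values for illustration:
CR2 = cramer_von_mises([5.0, 40.0, 120.0, 300.0, 700.0], T=1000.0, beta_hat=0.77)
print(f"CR^2 = {CR2:.4f}")
```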

An information-based model selection procedure is another method for evaluating
a model's goodness of fit; it enables the comparison of several models.
Two widely used information criteria for assessing model fit are Akaike's infor-
mation criterion (AIC) and the Bayesian information criterion (BIC). The AIC and
BIC are computed as follows:

AIC = −2 log-likelihood + 2k

BIC = −2 log-likelihood + k log(n)

where k denotes the total number of parameters in the model and n denotes the
total number of observations (i.e. failures). The model with the lowest AIC/BIC
values is preferred.

3.5 Nonparametric Methods
3.5.1 Nelson-Aalen Estimator
The Nelson–Aalen estimator provides a nonparametric estimate of the cumulative
hazard function H(x) and of the mean cumulative number of failures, M (t). With-
out any parametric assumptions, the hazard rate h(x) might be any nonnegative
function, making estimation problematic.[3] However, it turns out that estimating
the cumulative hazard function

$$H(x) = \int_0^x h(s)\,ds \qquad (18)$$

is simple, without making any assumptions about the distribution of h(x). This is analogous
to estimating the cumulative distribution function, which is significantly simpler
than estimating the density function. The result is the Nelson–Aalen estimator,
which is given as (Nelson, 1969)

$$\hat{H}(x) = \sum_{T_j \leq x} \frac{1}{Y(T_j)} \qquad (19)$$

where Y (t) is the number of individuals at risk at time t (in survival analysis
terms). Eqn. (19) is an increasing step function, which may look like a smooth
curve with a large sample of data. Eqn. (19) can also be written as

$$\hat{H}(x) = \sum_{x_i \leq x} \frac{d_i}{n_i}, \qquad x \geq 0 \qquad (20)$$

where di = 1 when failures occur at distinct times and there are no multiple failures
occurring at the same time.

Nelson (1988) developed a nonparametric estimate for the mean cumulative num-
ber of failures, M (t), given by

$$\hat{M}(t) = \sum_{t_i \leq t} \frac{d_i}{n_i}, \qquad t \geq 0 \qquad (21)$$

where ti stands for the observed failure times, di is the number of failures ob-
served at these times, and ni is the number of systems in operation at these times.
As discussed in section 3.1, the shape of the MCNF curve reveals the system's
behavior. The Nelson–Aalen estimator has the advantage of being applicable to
both complete and censored data.
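A minimal sketch of Eqn. (21) is given below (Python; the failure times and at-risk counts are hypothetical, and the at-risk counts must be supplied aligned with the distinct failure times).

```python
import numpy as np

def mcnf_estimate(failure_times, n_at_risk):
    """Nelson-style nonparametric estimate of the mean cumulative
    number of failures, Eqn. (21): at each distinct failure time t_i,
    the MCNF jumps by d_i / n_i, where d_i is the number of failures
    observed at t_i and n_i the number of systems still in operation."""
    times, d = np.unique(np.asarray(failure_times, dtype=float), return_counts=True)
    n = np.asarray(n_at_risk, dtype=float)  # n_i aligned with the distinct times
    return times, np.cumsum(d / n)

# Hypothetical data for illustration: three distinct failure times, with
# 10, 9 and 9 systems in operation at those times.
times, mcnf = mcnf_estimate([2.0, 5.0, 5.0, 11.0], n_at_risk=[10, 9, 9])
print(dict(zip(times, np.round(mcnf, 3))))
```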

3.5.2 Laplace Trend Test


For a counting process, the times between successive failures may tend to get
longer or shorter. A trend test is performed to determine whether the system is im-
proving or deteriorating. It is essential to first analyze the Xi 's in chronological
order, since the existence of a trend shows that the data are nonstationary. Plot
the interarrival times in chronological order: if there is a trend, consider the NHPP
or other nonstationary models; if there is no trend, the Xi 's may be identically
distributed but not necessarily independent, in which case one can check for
dependence and use the appropriate models. For an improving (deteriorating)
system, successive interarrival failure times will tend to become larger (smaller).

The Laplace test can also be used to determine whether or not an observed series
of events has a trend. The hypothesis test is

H0 : No trend
Ha : Trend

The test uses chronologically ordered arrival times T1 , T2 , . . . , Tm .[8] The Laplace
test statistic is

$$U_L = \frac{\frac{1}{m-1}\sum_{i=1}^{m-1} T_i - \frac{T_m}{2}}{T_m \sqrt{\frac{1}{12(m-1)}}} \qquad (22)$$

The null hypothesis is rejected and there is evidence of trend if:

UL > Zα/2 (reliability deteriorating)
UL < −Zα/2 (reliability improving)

Interpretation:

• Values of UL below −Zα/2 indicate a downward or decreasing trend, i.e. a
decreasing rate of occurrence of failures.

• Values of UL above Zα/2 indicate an upward or increasing trend, i.e. an
increasing rate of occurrence of failures.
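The statistic of Eqn. (22) is again straightforward to compute; the sketch below (Python with SciPy; the arrival times are hypothetical) returns UL and a two-sided normal p-value.

```python
import numpy as np
from scipy.stats import norm

def laplace_test(arrival_times):
    """Laplace trend test of Eqn. (22) for a failure-truncated process;
    returns U_L and a two-sided p-value. A large negative U_L indicates
    an improving system, a large positive U_L a deteriorating one."""
    t = np.sort(np.asarray(arrival_times, dtype=float))
    m = t.size
    UL = (np.mean(t[:-1]) - t[-1] / 2.0) / (t[-1] * np.sqrt(1.0 / (12 * (m - 1))))
    return UL, 2.0 * norm.sf(abs(UL))

# Hypothetical arrival times for illustration:
UL, p = laplace_test([5.0, 40.0, 120.0, 300.0, 700.0])
print(f"U_L = {UL:.3f}, p-value = {p:.4f}")
```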

3.6 Probabilistic Deep Learning Model for Failure Data
This section introduces a recurrent neural network framework based on survival
analysis: the Weibull Time to Event Recurrent Neural Network (WTTE-RNN) by
Egil Martinsson (2016).[45] In contrast to one of the main assumptions in survival
analysis, namely the occurrence of a single event or failure, WTTE-RNN can model
recurring events or multiple failures. We will describe the WTTE-RNN model
using a general framework for censored data, but first we present some fundamental
deep learning concepts.

3.6.1 Brief Introduction to Deep Learning


Deep learning is a subset of machine learning inspired by the structure of the
human brain. Deep learning is built on artificial neural networks, which were
influenced by biological neurons.

A neural network consists of an input layer, an output layer and, in between,
hidden layers. The layers are connected via nodes, and these connections form a
“network” of interconnected nodes. The output of a neural network is computed
through a series of calculations: first, the dot product between the inputs and
their respective weights is computed; then the bias term is added and the acti-
vation function is applied to the result. The activation function decides whether
a node should be activated, i.e. whether its output exceeds a certain threshold
value and is passed on. The output of one node thus becomes the input of the
next node. This process of passing information from one layer to the next defines
the network as a feedforward network. Neural networks learn through a feedback
process known as backpropagation, which involves comparing the output a network
produces with the actual output and using the difference between them to modify
the weights of the connections between the units in the network, working backward
from the output units through the hidden units to the input units.

3.6.2 Recurrent Neural Networks (RNN)


RNNs are a class of neural networks that utilise sequential or time series data.
They are commonly used in natural language processing, time series analysis,
and machine translation. The internal memory of RNNs helps them remember
important things about the input they received, which allows them to anticipate
what will happen next with great accuracy. Figure 8 demonstrates how RNNs
transfer information between layers.

3.6.3 Censoring and The Likelihood Function


Figure 8: Layers of a Recurrent Neural Network

Let L(x, θ) be the likelihood function, the joint PDF of the sample Z = Z1 , ..., Zn .
WTTE-RNN uses a likelihood function as its loss function: by constructing the
likelihood function and using it as the loss, we maximize the probability of X = x
given θ, which is the same as minimizing the negative log-likelihood − log(L(x, θ)).

3.6.4 WTTE-RNN
WTTE-RNN is a framework for predicting the time until the next event as a
discrete or continuous Weibull distribution, with the two parameters of the Weibull
distribution being the output of a recurrent neural network. The Weibull dis-
tribution is used in the model because it has some good attributes, such as a
closed-form PDF, CDF and hazard function, and it can serve as a good ap-
proximation to many distributions, such as the exponential distribution. The
model is trained using a special objective function that allows the use of censored
data by constructing a likelihood function for censoring.

Given t ∈ [0, ∞), scale parameter α ∈ (0, ∞) and shape parameter β ∈ (0, ∞),
a random variable X ∼ W eibull(α, β) (continuous case) has cumulative hazard
function

$$H(x) = \left(\frac{x}{\alpha}\right)^{\beta} \qquad (23)$$

and hazard function

$$h(x) = \frac{\beta}{\alpha}\left(\frac{x}{\alpha}\right)^{\beta-1} \qquad (24)$$

The loss function of WTTE-RNN is built from the likelihood

$$L(x, \theta) = \prod_{i}^{n} Pr(X = x_i)^{u_i} \cdot Pr(X > x_i)^{1-u_i} \qquad (25)$$

$$\log(L(x, \theta)) = \sum_{i}^{n} u_i \log\left(e^{-H(x_i)} h(x_i)\right) + \sum_{i}^{n} (1 - u_i) \log\left(e^{-H(x_i)}\right) \qquad (26)$$

$$= \sum_{i}^{n} \left[u_i \cdot \log(h(x_i)) - H(x_i)\right] \qquad (27)$$

where ui is 0 if the timestep is right censored and 1 otherwise.

The framework used an exponential activation function for the scale parameter,
α > 0,

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ a(e^x - 1), & \text{if } x \leq 0 \end{cases} \qquad (28)$$

and the softplus activation function for the shape parameter, β:

$$f(x) = \log(1 + e^x) \qquad (29)$$

The activation functions are used to ensure that the outputs of the parameters of
the Weibull distribution, α and β, are positive.
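To make the objective concrete, the sketch below (Python; the batch values are hypothetical) evaluates the negative of the log-likelihood in Eqn. (27) for the continuous Weibull case, which is the quantity a WTTE-RNN would minimize during training.

```python
import numpy as np

def weibull_censored_nll(x, u, alpha, beta):
    """Negative log-likelihood of Eqn. (27) for the continuous Weibull
    case: x are times to event, u = 1 for observed events and u = 0 for
    right-censored ones; alpha, beta > 0 are the network outputs."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    H = (x / alpha) ** beta                                         # cumulative hazard, Eqn. (23)
    log_h = np.log(beta / alpha) + (beta - 1) * np.log(x / alpha)   # log hazard, Eqn. (24)
    return -np.sum(u * log_h - H)

# Hypothetical batch for illustration: the last observation is censored.
nll = weibull_censored_nll(x=[1.2, 0.7, 3.0], u=[1, 1, 0], alpha=2.0, beta=1.3)
print(f"negative log-likelihood = {nll:.4f}")
```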

In the context of reliability, the WTTE-RNN model assumes that interarrival
times are i.i.d. For this reason, the model is not appropriate for modeling relia-
bility improvement or deterioration. It can, however, be a good machine learning
model for renewal processes and the HPP, where interarrival times are i.i.d.
4 Case Study
High-Performance Computing (HPC) uses supercomputers and computer clus-
ters to solve advanced computational problems that require computing power and
performance beyond the capabilities of a typical desktop computer. These
large computational problems exist in numerous fields such as science and en-
gineering. HPC clusters have three main components: compute, network, and
storage. An HPC cluster consists of multiple compute servers (or computers)
networked together; the cluster's nodes work in parallel with one another, boosting
processing speed to achieve high-performance computing. To capture the output,
the cluster is networked to a data storage system. All HPC cluster nodes have the
same components as a laptop or desktop: CPU cores (also known as processors),
memory (or RAM), and disk space. What distinguishes a personal computer from
a cluster node is the quantity, quality, and power of the components.

The rise of big data and artificial intelligence has contributed to an increase in
demand for high-performance computing systems in both industry and academia.
The growing demands have resulted in frequent HPC failures. These failures
may cause system outages, which can be costly and disruptive to institutions and
the people who rely on them. This emphasizes the significance of ensuring the
reliability of HPC systems. As a result, a detailed understanding of failure char-
acteristics can better guide HPC management and thereby limit the occurrence
of failures, improving system performance and reliability. The goal of this case
study is to model the reliability of HPC systems based on failure data obtained
during roughly the first four years of operation, using stochastic point process models.

4.1 Data
The case study uses data on hardware failures of HPC computing systems
from the Computer Failure Data Repository (CFDR).[55] This data set is a record
of hardware failures recorded on the High Performance Computing System-2
(MPP2) operated by the Environmental and Molecular Science Laboratory (EMSL),
Molecular Science Computing Facility (MSCF) at Pacific Northwest National
Laboratory (PNNL) from November 2003 through September 2007. Below
is a description of the HPC system from PNNL provided by CFDR.[55]

The MPP2 computing system has the following equipment and capabilities:

• HP/Linux Itanium-2

  – 980 nodes/1960 Itanium-2 processors (Madison, 1.5 GHz) configured as follows:
  – 574 nodes are ”fat” compute nodes with 10 Gbyte RAM and 430 Gbyte local disk
  – 366 nodes are ”thin” compute nodes with 10 Gbyte RAM and 10 Gbyte local disk
  – 34 nodes are Lustre server nodes (32 OSS, 2 MDS)
  – 2 nodes are administrative nodes
  – 4 nodes are login nodes

• Quadrics QsNetII interconnect

  – 11.8 TFlops peak theoretical performance
  – 9.7 terabytes of RAM
  – 450 terabytes of local scratch disk space
  – 53 terabytes shared cluster file system, Lustre

The applications running on this system are typically large-scale scientific simu-
lations or visualization applications. The data contains an entry for any failure
that occurred during the observation period and that required the attention of a
system administrator. For each hardware failure, the data set includes a times-
tamp for when the failure happened, the node affected, what failed in the node
affected, a description of the failure, and the repair action taken. Table 1 shows
the first five rows of the raw data from PNNL.

Date                  HardwareID   What Failed   Description of Failure           Action
2003-11-29 00:00:00   node 13      DISK          I/O error on Drive SDG           REPLACE
2003-11-29 07:00:00   node 25      DISK          I/O error on Drive sdf           REPLACE
2003-11-29 07:00:00   node 30      DISK          I/O error on Drive sdb and sdc   REPLACE
2003-11-29 07:00:00   node 63      DISK          I/O error on Drive sdh           REPLACE
2003-11-29 08:00:00   node 380     DISK          I/O error on Drive sdf           REPLACE

Table 1: Sample of hardware failure data from PNNL

4.1.1 Data Preprocessing


Several preprocessing steps were applied to the dataset before arriving at one
suitable for the analysis used in this research. We selected the subset of the
dataset in which components had been replaced after failure. The system was placed into
operation at the same time that hardware failure record keeping began.[1] There
is no information given on the exact start and end dates of the study; all that is
known is that the study began in November 2003 and ended in September 2007.
The dates November 1, 2003, and September 1, 2007 were therefore selected for the
start and end of the study, respectively. This way we could calculate the arrival
times of failures, Ti , and the interarrival times, or times between successive failures,
Xi . Both were measured in days for the study. Our data is censored: since we
are dealing with a repairable system, there could be more than one failure in the
system during the observation period (Nov. ’03 – Sept. ’07), and components in the
system are replaced upon failure. In the study, we treat the censored age as the
time between the last failure for each component and the time the study ends.
See figure 9.

Figure 9: X ∗ represents censored age of component.

4.2 Results
This section reports the results from the models used in this study. JMP,[53]
a statistical software package from SAS Institute, was used for the analysis.

4.2.1 Nelson-Aalen MCNF Plot for HPC system


We start our analysis by plotting the mean cumulative number of failures as a
function of the system's age. If the time interval between successive failures becomes
longer, the system's reliability improves; conversely, if the interval between fail-
ures is decreasing, the system's reliability is deteriorating (see figure 10).

Figure 10: Mean cumulative number of failures from the HPC System

From the plot of the MCNF, there is no evidence of the system getting worse
with age. We will conduct the trend test to determine whether or not the system
is improving.

4.2.2 Result of Laplace Trend Test


The test statistic of the Laplace trend test for testing the null hypothesis of
no trend against the alternative hypothesis of trend in the data at the 95% confidence
level is UL = −8.9837 < −Zα/2 = −1.96. There is enough evidence to reject the null
hypothesis of no trend. We can conclude that the reliability of the HPC system under
study is improving: the interarrival times of failures are getting longer as the system
ages.

4.2.3 Results from parametric models


Although the nonparametric estimates reveal that the HPP is not a
good model for our system due to evidence of trend, let us see how an HPP model
fits and performs with our data.

Figure 11: Fit of HPP Model

We can see that the HPP may not be a good fit for our failure data. We now check
two NHPP reliability growth models, the Power Law process and the Log-
Linear model.

Parameter   Estimate   Std. Error   95% C.I. (Lower)   95% C.I. (Upper)
α           1.14683    0.22692      0.70206            1.59159
β           0.76865    0.01464      0.73995            0.79735

Table 2: Parameter Estimates of Power Law Process

The estimated intensity function and mean cumulative number of failures of the
Power Law process are

$$\hat{\rho}(t) = 0.67024\left(\frac{t}{1.14683}\right)^{-0.23135}$$

$$\hat{M}(t) = \left(\frac{t}{1.14683}\right)^{0.76865}$$

The MIL-HDBK-189 test gives U = 208.700, p-value < 0.0001. The test rejects the
null hypothesis of β = 1 (or constant MTBF) in favor of the nonhomogeneous
Poisson process as a model for these data.

Figure 12 fits a Power Law process NHPP reliability growth model to the failure
arrival data. The dotted line in figure 15 is a pointer that shows the value at
the selected point of the graph. The failure intensity is a decreasing function of
time, which demonstrates that our system is improving.

Parameter   Estimate    Std. Error   95% C.I. (Lower)   95% C.I. (Upper)
α0          -1.863074   0.0325852    -1.926940          -1.799207
α1          -4.375e-5   2.0318e-6    -4.773e-5          -3.977e-5

Table 3: Parameter Estimates of Log-Linear Model

Figure 13 shows the fit of the log-linear model on the failure data. The log-linear
model appears to fit better than the power law process. Fig. 14
shows that the log-linear model also has a decreasing ROCOF.

4.2.4 Results from Goodness-of-fit tests


The Cramér–von Mises goodness-of-fit test gives CR2 = 2.303485 with a p-value
less than 0.01. We conclude that the Power Law model does not provide an
adequate fit to the data. The log-linear model fits the failure data from the MPP2
HPC system better than the Power Law process, according to the AIC/BIC values
in table 4.

To select the most appropriate NHPP model for failure data from our HPC sys-
tem, we also calculated the mean squared error (MSE) between the observed mean
cumulative number of failures, M (t), and the estimated mean cumulative number
of failures, $\hat{M}(t)$, for each NHPP reliability growth model:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(M(t_i) - \hat{M}(t_i)\right)^2$$

Because the log-linear model proposed by Cox and Lewis has a much lower MSE,
it is chosen.

Model               −2Loglikelihood   AIC         BIC
HPP                 19409.121         19413.121   19425.54
Power Law Process   19200.421         19204.421   19216.846
Log-Linear Model    18919.436         18923.436   18935.861

Table 4: Goodness-of-fit: AIC & BIC

Model               MSE
Power Law Process   37600.699
Log-Linear Model    3387.614

Table 5: Mean Squared Error

Figure 12: Fit of Power Law Process

Figure 13: Fit of Cox-Lewis Log-Linear Model

Figure 14: Plot of Failure Intensity Function and Cumulative Failures for Power Law Process

Figure 15: Plot of Failure Intensity Function and Cumulative Failures for the Log-Linear Model
5 Conclusion
Several studies have highlighted the significance of studying failure data from
repairable systems. The thesis discusses the fundamental concepts of reliability
engineering for repairable systems. It also looks at stochastic point processes as
they apply to repairable systems. A case study is undertaken to see how the
reliability models discussed apply to failure data.

In the case study we analyze failure data collected between 2003 and 2007 at
Pacific Northwest National Laboratory. We find that the interarrival
times of failures of the HPC system studied are not independent and identically
distributed; for this reason, the HPP is not an appropriate model to use. After
conducting the Laplace trend test, we discovered a trend in the data. The de-
creasing rate of occurrence of failures as the system ages indicates that we have
an improving system. The plot of the intensity function indicates that the system
is in the burn-in stage, where there is an initial high number of failures due to
defective parts in the system; these parts are replaced, and the occurrence of
failures decreases as the system ages. The log-linear model is chosen as an ap-
propriate model for the failure data used in the case study, based on the results
of the goodness-of-fit tests and the mean squared error (MSE).

5.0.1 Future Work

The WTTE-RNN model can be extended with other distributions. Future work
will be to develop a variant of the WTTE-RNN that is appropriate for reliability
growth models.

References
[1] Mpp2 - cluster platform 6000 rx2600 itanium2 1.5 ghz, quadrics. Available at https://www.top500.org/system/173082/.

[2] Reactor safety study. An assessment of accident risks in U.S. commercial nuclear power plants. Executive summary: main report. [PWR and BWR].

[3] O. Aalen, O. Borgan, and H. Gjessing, Survival and event history analysis: a process point of view, Springer Science & Business Media, 2008.

[4] H. Altaleb and R. Zoltán, A brief overview of systems reliability.

[5] B. Altshuler, Theory for the measurement of competing risks in animal experiments, Mathematical Biosciences, 6 (1970), pp. 1–11.

[6] H. Ascher, Repairable Systems Reliability, vol. 32, 03 2008.

[7] H. Ascher and H. Feingold, Repairable systems reliability: modeling, inference, misconceptions and their causes, (1984).

[8] H. E. Ascher and C. K. Hansen, Spurious exponentiality observed when incorrectly fitting a distribution to nonstationary data, IEEE Transactions on Reliability, 47 (1998), pp. 451–459.

[9] L. Bain, Statistical analysis of reliability and life-testing models: Theory and methods, New York, Marcel Dekker, Inc. (Statistics: Textbooks and Monographs), 24 (1978), p. 464.

[10] H. S. Bakouch, B. M. Al-Zahrani, A. A. Al-Shomrani, V. A. Marchi, and F. Louzada, An extended lindley distribution, Journal of the Korean Statistical Society, 41 (2012), pp. 75–85.

[11] D. J. Bartholomew, A problem in life testing, Journal of the American Statistical Association, 52 (1957), pp. 350–355.

[12] R. Bartoszynski, B. W. Brown, C. M. McBride, and J. R. Thompson, Some nonparametric techniques for estimating the intensity function of a cancer related nonstationary poisson process, The Annals of Statistics, (1981), pp. 1050–1060.

[13] O. Basile, P. Dehombreux, and F. Riane, Identification of reliability models for non repairable and repairable systems with small samples, (2004).

[14] A. Basu and S. Rigdon, Statistical methods for the reliability of repairable systems, (2000).

[15] N. Breslow and J. Crowley, A large sample study of the life table and product limit estimates under random censorship, The Annals of Statistics, (1974), pp. 437–453.
[16] M. Brown and F. Proschan, Imperfect repair, Journal of Applied Probability, 20 (1983), pp. 851–859.

[17] E. Cinlar, Introduction to Stochastic Processes, HWA-TAI Book, 1983.

[18] D. Cox and H. Miller, The theory of stochastic processes, Methuen & Co., Ltd, London, UK, (1965).

[19] D. R. Cox and P. A. Lewis, The statistical analysis of series of events, (1966).

[20] J. M. Cozzolino, Probabilistic models of decreasing failure rate processes, Naval Research Logistics Quarterly, 15 (1968), pp. 361–374.

[21] L. H. Crow, Reliability analysis for complex, repairable systems, tech. rep., Army Materiel Systems Analysis Activity, Aberdeen Proving Ground, MD, 1975.

[22] L. H. Crow, Confidence interval procedures for reliability growth analysis, 1977.

[23] R. DiPietro and G. D. Hager, Deep learning: RNNs and LSTM, in Handbook of medical image computing and computer assisted intervention, Elsevier, 2020, pp. 503–519.

[24] J. Duane, Learning curve approach to reliability monitoring, IEEE Transactions on Aerospace, 2 (1964), pp. 563–566.

[25] N. Ebrahimi, Improvement and deterioration of a repairable system, Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 55 (1993), pp. 233–243.

[26] J. Fang, R. Kang, and Y. Chen, Reliability evaluation of non-repairable systems with failure mechanism trigger effect, Reliability Engineering & System Safety, 210 (2021), p. 107454.

[27] P. D. Feigin, On the characterization of point processes with the order statistic property, Journal of Applied Probability, 16 (1979), pp. 297–304.

[28] J. Finkelstein, Confidence bounds on the parameters of the weibull process, Technometrics, 18 (1976), pp. 115–117.

[29] A. Geraci, IEEE standard computer dictionary: Compilation of IEEE standard computer glossaries, IEEE Press, 1991.

[30] S. M. Gore, Statistical models and methods for lifetime data, Jerald F. Lawless, Wiley, New York, 1982 (book review), Statistics in Medicine, 1 (1982), pp. 293–294.

[31] L. J. Gullo and D. Raheja, Design for Reliability, Wiley, 2012.
[32] C. K. Hansen, “Reliability,” Realizing Complex System Design (Eds. Sheppard, J.W. and Ambler, A.P.), CDC Press (to appear), 2023, ch. 2.

[33] C. K. Hansen and P. Thyregod, Component lifetime models based on weibull mixtures and competing risks, Quality and Reliability Engineering International, 8 (1992), pp. 325–333.

[34] Y. Hu, X. Miao, Y. Si, E. Pan, and E. Zio, Prognostics and health management: A review from the perspectives of design, development and decision, Reliability Engineering & System Safety, 217 (2022).

[35] W. Jiang, Z. Ye, and X. Zhao, Reliability estimation from left-truncated and right-censored data using splines, Statistica Sinica, 30 (2020), pp. 845–875.

[36] B. L. Joiner, Lurking variables: Some examples, The American Statistician, 35 (1981), pp. 227–233.

[37] Z. Kang, C. Catal, and B. Tekinerdogan, Remaining useful life (RUL) prediction of equipment in production lines using artificial neural networks, Sensors, 21 (2021), p. 932.

[38] E. L. Kaplan and P. Meier, Nonparametric estimation from incomplete observations, Journal of the American Statistical Association, 53 (1958), pp. 457–481.

[39] H. Krishna and K. Kumar, Reliability estimation in lindley distribution with progressively type II right censored sample, Mathematics and Computers in Simulation, 82 (2011), pp. 281–294.

[40] J. F. Lawless, Statistical models and methods for lifetime data, John Wiley & Sons, 2011.

[41] L. Lee and S. K. Lee, Some results on inference for the weibull process, Technometrics, 20 (1978), pp. 41–45.

[42] D. V. Lindley, Fiducial distributions and bayes' theorem, Journal of the Royal Statistical Society, Series B (Methodological), (1958), pp. 102–107.

[43] B. H. Lindqvist, On the statistical modeling and analysis of repairable systems, (2006).

[44] D. K. Lloyd and M. Lipow, Reliability: Management, methods, and mathematics, (1962).

[45] E. Martinsson, Wtte-rnn: Weibull time to event recurrent neural network, PhD thesis, Chalmers University of Technology & University of Gothenburg, 2016.
[46] W. Nelson, Accelerated life testing - step-stress models and data analyses, IEEE Transactions on Reliability, R-29 (1980), pp. 103–108.

[47] ——, Applied life data analysis, Wiley, 1982.

[48] W. B. Nelson, Applied life data analysis, John Wiley & Sons, 2005.

[49] M. Rausand and A. Hoyland, System reliability theory: models, statistical methods, and applications, vol. 396, John Wiley & Sons, 2003.

[50] S. E. Rigdon and A. P. Basu, The power law process: a model for the reliability of repairable systems, Journal of Quality Technology, 21 (1989), pp. 251–260.

[51] N. Rosner, System analysis - non-linear estimation techniques, in Proceedings National Symposium on Reliability and Quality Control, 1961, pp. 203–207.

[52] J. H. Saleh and K. Marais, Highlights from the early (and pre-) history of reliability engineering, Reliability Engineering & System Safety, 91 (2006), pp. 249–256.

[53] SAS Institute Inc., Cary, NC, JMP, 1989–2023.

[54] R. Schafer, R. Sallee, and J. Torrez, Reliability growth study, Hughes Aircraft Company, RADC-TR-75-253, (1975).

[55] B. Schroeder and G. A. Gibson, The computer failure data repository (CFDR), in Workshop on Reliability Analysis of System Failure Data (RAF'07), MSR Cambridge, UK, 2007.

[56] C. Singh, Failure data analysis for transit vehicles, in Proceedings of the Annual Reliability and Maintainability Symposium, Washington, DC, January 23-25, 1979, no. IEEE 79CH1429-OR Conf Paper, 1979.

[57] K. Strandberg, Dependability, performance guarantees, some specifications and evaluation principles, in Proc. EUROCON 1982, North-Holland Publishing Company, 1982, pp. 963–969.

[58] W. Thompson, On the foundations of reliability, Technometrics, 23 (1981), pp. 1–13.

[59] D. Trindade and S. Nathan, Field Data Analysis for Repairable Systems: Status and Industry Trends, 08 2008, pp. 397–412.

[60] F. Vanderhaegen, S. Zieba, S. Enjalbert, and P. Polet, A benefit/cost/deficit (BCD) model for learning from human errors, Reliability Engineering & System Safety, 96 (2011), pp. 757–766.

[61] J. Vatn, P. Hokstad, and L. Bodsberg, An overall model for maintenance optimization, Reliability Engineering & System Safety, (1996).
[62] G. Weckman, R. Shell, and J. Marvel, Modeling the reliability of repairable systems in the aviation industry, Computers & Industrial Engineering, 40 (2001), pp. 51–63.

[63] W. Weibull, A statistical distribution function of wide applicability, Journal of Applied Mechanics, (1951).

[64] T. M. Welte and K. Wang, Models for lifetime estimation: an overview with focus on applications to wind turbines, Advances in Manufacturing, (2014).

[65] K. L. Wong, The bathtub does not hold water any more, Quality and Reliability Engineering International, 4 (1988), pp. 279–282.

[66] M. Yanez, F. Joglar, and M. Modarres, Generalized renewal process for analysis of repairable systems with limited failure experience, Reliability Engineering & System Safety, 77 (2002), pp. 167–180.

[67] M. Zelen and M. C. Dannemiller, The robustness of life testing procedures derived from the exponential distribution, Technometrics, 3 (1961), pp. 29–49.

[68] Q. Zhai, L. Xing, R. Peng, and J. Yang, Aggregated combinatorial reliability model for non-repairable parallel phased-mission systems, Reliability Engineering & System Safety, 176 (2018), pp. 242–250.

[69] E. Zio, Reliability engineering: Old problems and new challenges, Reliability Engineering & System Safety, 94 (2009), pp. 125–141.
6 Vitae

Author: Eunice Ofori-Addo

Place of Birth: Tema, Ghana

Undergraduate Schools Attended: Kwame Nkrumah University of Science and Technology

Degrees Awarded: Bachelor of Science, 2019, Kwame Nkrumah University of Science and Technology

Honors and Awards: Graduate Service Appointment, Mathematics Department, 2021-2023, Eastern Washington University

First Class Honors, 2019, Kwame Nkrumah University of Science and Technology

Professional Experience: Internship, Carnegie Mellon University, Pittsburgh, Pennsylvania, 2019

Internship, University of Kiel, Kiel, Germany, 2017