Overview of Reliability Engineering: Eric Marsden
Eric Marsden
<[email protected]>
Context
▷ I have a fleet of airline engines and want to anticipate when they may fail
Failure
Loss of ability to perform as required.
Fault
Inability to perform as required, due to an internal state [IEV 192-04-01]
▷ When a failure occurs, the item enters the failed state. A failure may
occur:
• while running
• while in standby
• due to demand
Error
Discrepancy between a computed, observed, or measured value or condition and
the true, specified, or theoretically correct value or condition.
Source: International Electrotechnical Vocabulary (IEV) part 192 on dependability, item 192-03-02
Failure mode
The way a failure is observed on a failed item.
Source: International Electrotechnical Vocabulary (IEV) part 192 on dependability, item 192-03-17
Failure classification
▷ Effects:
• safe failures
• dangerous failures
▷ Detectability:
• detected: revealed by online diagnostics
• undetected: revealed by functional tests or upon a real demand for activation
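Models the transitions between the correct state and the failed state.
[Figure: an item with inputs and outputs, and a two-state diagram. The item moves from the correct state to the failed state at failure rate λ, and returns from the failed state to the correct state at repair rate μ.]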
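The two-state picture can be turned into numbers with a minimal sketch, assuming both rates are constant (exponentially distributed times to failure and to repair) and that the item starts in the correct state. Under those assumptions the probability of being in the correct state at time t has a standard closed form. The function name and the numerical rates are illustrative, not taken from the slides.

```python
import math

def prob_correct_state(t, failure_rate, repair_rate):
    """Probability of being in the correct state at time t,
    for a two-state model with constant rates, starting correct at t = 0."""
    total = failure_rate + repair_rate
    long_run = repair_rate / total
    return long_run + (failure_rate / total) * math.exp(-total * t)

# Hypothetical rates per hour (illustrative values only)
lam = 1 / 1000   # failure rate λ: one failure per 1000 hours on average
mu = 1 / 10      # repair rate μ: repairs take 10 hours on average

p_start = prob_correct_state(0, lam, mu)
p_long_run = prob_correct_state(1e6, lam, mu)
print(p_start, p_long_run)
```

For large t the result settles at μ / (λ + μ), the long-run fraction of time spent in the correct state.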
The “safe failure fraction”
▷ Not all failures are dangerous: the system may have been designed to
tolerate them.
▷ Importance of the coverage of the error detection mechanisms, measured by
the “safe failure fraction”: conditional probability that a failure will be
safe, or dangerous-but-detected.
[Figure: state diagram with a correct state (service OK), a state where service is degraded but safe, and a failed state. Transitions: failure rate λ, rate of safe or dangerous-and-detected failure 𝜆𝑆, and repair rate μ back to the correct state.]
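As a minimal sketch of this measure, assume failures are split into three classes with constant rates: safe, dangerous-but-detected and dangerous-and-undetected. The conditional probability that a failure is safe or dangerous-but-detected is then a ratio of rates. The function name, the three-way split and the numerical rates below are assumptions made for illustration.

```python
def safe_failure_fraction(rate_safe, rate_dangerous_detected,
                          rate_dangerous_undetected):
    """Fraction of all failures that are safe or dangerous-but-detected."""
    total = rate_safe + rate_dangerous_detected + rate_dangerous_undetected
    return (rate_safe + rate_dangerous_detected) / total

# Hypothetical failure rates per hour (illustrative values only)
sff = safe_failure_fraction(
    rate_safe=4e-6,
    rate_dangerous_detected=5e-6,
    rate_dangerous_undetected=1e-6,
)
print(sff)
```

Improving diagnostic coverage moves failures from the undetected class to the detected class, which raises the fraction without changing the total failure rate.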
Reliability: definitions
▷ The reliability 𝑅(𝑡) of an item at time 𝑡 is the probability that the item
performs the required function in the interval [0, 𝑡], given the stress and
environmental conditions in which it operates
▷ 𝑅(𝑡) represents the probability that the item is working correctly at time 𝑡
▷ Properties:
• 𝑅(𝑡) is non-increasing (no rising from the dead)
• 𝑅(0) = 1 (no immediate death/failure)
• lim_{𝑡→∞} 𝑅(𝑡) = 0 (no eternal life)
Interpreting the reliability function
[Figure: two panels with time to failure (T) on the horizontal axis and a vertical scale from 0 to 1, illustrating P(T ≤ t) and P(T > t).]
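The two quantities are complementary: P(T > t) is the reliability 𝑅(𝑡) and P(T ≤ t) is the cumulative distribution function of the time to failure. A minimal sketch with scipy, assuming an exponential lifetime with a mean of 8000 hours purely for illustration, checks this relation and the three properties of 𝑅(𝑡) numerically.

```python
import numpy as np
import scipy.stats

# Illustrative lifetime distribution (an assumption, any lifetime model works)
dist = scipy.stats.expon(scale=8000)

t = np.linspace(0, 100_000, 201)
F = dist.cdf(t)   # P(T <= t): probability of failure by time t
R = dist.sf(t)    # P(T > t): survival function, i.e. the reliability R(t)

print(np.allclose(F + R, 1.0))    # R(t) = 1 - F(t)
print(R[0])                       # R(0) = 1
print(np.all(np.diff(R) <= 0))    # R is non-increasing
```

scipy calls 𝑅(𝑡) the survival function, hence the method name `sf`.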
Problem
The lifetime of a modern low-wattage electronic light bulb is known to be
exponentially distributed with a mean of 8000 hours.
Q1 Find the proportion of bulbs that may be expected to fail before 7000
hours of use.
Solution
The time to failure of our light bulbs can be modelled by the distribution
dist = scipy.stats.expon(scale=8000)
Q1: The CDF gives us the probability that the lifetime is ≤ 𝑡. We want
dist.cdf(7000) which is 0.583137. So about 58% of light bulbs will fail
before they reach 7000 hours of operation.
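The calculation can be reproduced with a short script. This is a sketch of the same steps, with the import made explicit; the variable name `p_fail` is ours.

```python
import scipy.stats

# Exponential lifetime with a mean of 8000 hours (scale = mean)
dist = scipy.stats.expon(scale=8000)

# Probability that a bulb fails before 7000 hours of use
p_fail = dist.cdf(7000)
print(round(p_fail, 4))   # 0.5831
```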
Problem
A particular electronic device will only function correctly if two essential
components both function correctly. The lifetime of the first component
is known to be exponentially distributed with a mean of 5000 hours and
the lifetime of the second component (whose failures can be assumed to be
independent of those of the first component) is known to be exponentially
distributed with a mean of 7000 hours. Find the proportion of devices that
may be expected to fail before 6000 hours of use.
Exercise
Solution
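The device will only be working after 6000 hours if both components are
still operating. The probability of the first component still working is
> pa = 1 - scipy.stats.expon(scale=5000).cdf(6000)
> pa
0.3011942119122022
and the probability of the second component still working is
> pb = 1 - scipy.stats.expon(scale=7000).cdf(6000)
> pb
0.42437284567695
Since the failures are independent, the probability that the device is still
working is pa × pb ≈ 0.128, so about 87% of devices may be expected to fail
before 6000 hours of use.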
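The whole calculation for this two-component device can be sketched as a script. It uses the survival function `sf`, which equals 1 - cdf, and relies on the stated independence of the two components to multiply the probabilities. Variable names are ours.

```python
import scipy.stats

first = scipy.stats.expon(scale=5000)    # mean lifetime 5000 hours
second = scipy.stats.expon(scale=7000)   # mean lifetime 7000 hours
t = 6000

# Both components must survive for the device to work (independent failures)
p_device_works = first.sf(t) * second.sf(t)
p_device_fails = 1 - p_device_works
print(round(p_device_fails, 3))   # 0.872
```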
Hazard function
The hazard function or failure rate function ℎ(𝑡) gives the conditional rate of
failure at time 𝑡: ℎ(𝑡)𝑑𝑡 is the conditional probability of failure in the interval
𝑡 to 𝑡 + 𝑑𝑡, given that no failure has occurred by 𝑡.
ℎ(𝑡) = 𝑓(𝑡) / 𝑅(𝑡)
where 𝑓 (𝑡) is the probability density function (failure density function) and
𝑅(𝑡) is the reliability function.
It describes how likely the item is to leave a given state after having spent
a given time in that state.
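A minimal sketch of the ratio 𝑓(𝑡)/𝑅(𝑡) using scipy, where `pdf` plays the role of 𝑓 and `sf` the role of 𝑅. The helper name `hazard` and the two example distributions are assumptions chosen for illustration: an exponential lifetime has a constant hazard, while a Weibull lifetime with shape parameter 2 has a hazard that increases with age.

```python
import numpy as np
import scipy.stats

def hazard(dist, t):
    """Hazard function h(t) = f(t) / R(t) for a frozen scipy distribution."""
    return dist.pdf(t) / dist.sf(t)

t = np.array([1000.0, 4000.0, 8000.0])

# Exponential lifetime: hazard is constant and equal to 1 / mean
h_expon = hazard(scipy.stats.expon(scale=8000), t)

# Weibull lifetime with shape 2: hazard grows linearly with time (wear-out)
h_weibull = hazard(scipy.stats.weibull_min(c=2, scale=8000), t)

print(h_expon)
print(h_weibull)
```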
Bathtub curve
▷ Useful life period: probability of failure is roughly constant
[Figure: bathtub curve. Vertical axis: failure rate. Curves: early “infant mortality” failure, constant (random) failures, wear-out failures, and the observed failure rate.]
▷ Mean time to failure (MTTF) = 𝔼(𝑇) = ∫₀^∞ 𝑅(𝑡) 𝑑𝑡
▷ Often calculated by dividing the total operating time of the units tested by
the total number of failures encountered
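Both routes to the MTTF can be sketched in a few lines. The first integrates the reliability function numerically for an assumed exponential lifetime with a mean of 8000 hours; the second applies the operating-time-over-failures estimate to made-up test figures. All numbers and variable names are illustrative.

```python
import numpy as np
import scipy.integrate
import scipy.stats

# Route 1: integrate R(t) from 0 to infinity (assumed exponential lifetime)
dist = scipy.stats.expon(scale=8000)
mttf_integral, _abs_error = scipy.integrate.quad(dist.sf, 0, np.inf)

# Route 2: estimate from test data (hypothetical figures)
total_operating_hours = 40_000   # summed over all units tested
number_of_failures = 5
mttf_estimate = total_operating_hours / number_of_failures

print(mttf_integral, mttf_estimate)
```

For the exponential case the integral recovers the mean of the distribution, which is also what `dist.mean()` returns.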
Availability
The ability of an item (under combined aspects of its reliability, maintainability
and maintenance support) to perform its required function at a stated instant of
time or over a stated period of time [BS 4778]
▷ The availability 𝐴(𝑡) of an item at time 𝑡 is the probability that the item is
correctly working at time 𝑡
▷ Mean availability = MTTF / (MTTF + MTTR)
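A short worked example of the mean availability formula, using hypothetical values for MTTF and MTTR. The function name and the conversion to expected downtime per year are ours.

```python
def mean_availability(mttf, mttr):
    """Mean availability from mean time to failure and mean time to repair."""
    return mttf / (mttf + mttr)

# Hypothetical item: fails every 990 hours on average, takes 10 hours to repair
a = mean_availability(mttf=990.0, mttr=10.0)

# Expected downtime over one year of 8760 hours
hours_down_per_year = (1 - a) * 8760
print(a, hours_down_per_year)
```

The same availability can come from very different items: one that fails rarely but is slow to repair, or one that fails often but is repaired quickly.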
Reliability ≠ availability
𝐴 = MTTF / (MTTF + MTTR)
Also note that reliability ≠ safety
Maintainability
The ability of an item, under stated conditions of use, to be retained in, or restored
to, a state in which it can perform its required functions, when maintenance is
performed under stated conditions and using prescribed procedures and resources
[BS 4778]
[Figure: timeline alternating between an operational period (MTTF) and an under-repair period (MTTR). MTBF: mean time between failures.]
Exercise
Problem
For a large computer installation, the maintenance logbook shows that over a
period of a month there were 15 unscheduled maintenance actions or downtimes,
and a total of 1200 minutes in emergency maintenance status. Based upon
prior data on this equipment, the reliability engineer expects repair times to be
exponentially distributed. A warranty contract between the computer company
and the customer calls for a penalty payment for any downtime exceeding 100
minutes. Find the following:
1 The MTTR and repair rate
4 The time within which 95% of the maintenance actions can be completed
Exercise
Solution
1 MTTR = 1200/15 = 80 minutes and the repair rate μ is 1/80 = 0.0125 repairs per
minute. Our probability distribution for repair times is
dist = scipy.stats.expon(scale=80).
4 The time within which 95% of the maintenance actions can be completed is
dist.ppf(0.95) ≈ 240 minutes.
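The two answers above can be reproduced with a short script, sketched here with the logbook figures from the problem statement. Variable names are ours.

```python
import scipy.stats

downtime_minutes = 1200     # total time in emergency maintenance
maintenance_actions = 15    # number of unscheduled maintenance actions

mttr = downtime_minutes / maintenance_actions   # mean time to repair, minutes
repair_rate = 1 / mttr                          # repairs per minute

# Repair times assumed exponentially distributed, as stated in the problem
dist = scipy.stats.expon(scale=mttr)
t95 = dist.ppf(0.95)   # time within which 95% of repairs are completed

print(mttr, repair_rate, round(t95, 1))
```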
Exercise
Problem
From field data in an oil field, the time to failure of a pump, 𝑋, is known to be
normally distributed. The mean and standard deviation of the time to failure are
estimated from historical data as 3200 and 600 hours, respectively.
1 What is the probability that a pump will fail after it has worked for 2000 hours?
2 If two pumps work in parallel (the system can meet performance requirements
with a single operating pump), what is the probability that the system will
fail after it has worked for 2000 hours? Assume that pump failures are
independent events.
Exercise
Solution
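The following is a sketch of one possible approach, not the author's worked solution. It reads "fail after it has worked for 2000 hours" as "is still working at 2000 hours", which is an assumption about the wording, and it also prints the complementary probabilities in case the other reading is intended. Variable names are ours.

```python
import scipy.stats

# Time to failure of one pump: normal, mean 3200 hours, std dev 600 hours
pump = scipy.stats.norm(loc=3200, scale=600)
t = 2000

# Single pump
p_single_survives = pump.sf(t)    # P(X > 2000): fails after 2000 hours
p_single_fails = pump.cdf(t)      # P(X <= 2000): fails before 2000 hours

# Two independent pumps in parallel: the system is down only if both have failed
p_parallel_fails = p_single_fails ** 2
p_parallel_survives = 1 - p_parallel_fails

print(p_single_survives, p_parallel_survives)
```

Because 2000 hours is two standard deviations below the mean, a single pump is very likely to still be working, and the parallel arrangement makes early system failure rarer still.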
Was some of the content unclear? Which parts were most useful to
you? Your comments to [email protected]
(email) or @LearnRiskEng (Twitter) will help us to improve these
materials. Thanks!
https://risk-engineering.org/reliability-engineering/