100% found this document useful (1 vote)
140 views34 pages

Overview of Reliability Engineering: Eric Marsden

Reliability engineering aims to ensure that systems function as required over time. Reliability engineers address questions about when failures occur, why they occur, and how to reduce failures. Key concepts include failures, faults, errors, failure modes, failure classification (safe/dangerous, detected/undetected), common cause failures, and reliability definitions. Reliability is quantified using metrics like mean time between failures and survival functions, with the goal of modeling and improving system dependability.

Uploaded by

Luis Gonzalo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
140 views34 pages

Overview of Reliability Engineering: Eric Marsden

Reliability engineering aims to ensure that systems function as required over time. Reliability engineers address questions about when failures occur, why they occur, and how to reduce failures. Key concepts include failures, faults, errors, failure modes, failure classification (safe/dangerous, detected/undetected), common cause failures, and reliability definitions. Reliability is quantified using metrics like mean time between failures and survival functions, with the goal of modeling and improving system dependability.

Uploaded by

Luis Gonzalo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 34

Overview of reliability engineering

Eric Marsden
<[email protected]>
Context

▷ I have a fleet of airline engines and want to anticipate when they may fail

▷ I am purchasing pumps for my refinery and want to understand the


MTBF, lambda etc. provided by the manufacturers

▷ I want to compare different system designs to determine the impact of


architecture on availability
Reliability engineering

▷ Reliability engineering is the discipline of ensuring that a system will


function as required over a specified time period when operated and
maintained in a specified manner.

▷ Reliability engineers address 3 basic questions:


• When does something fail?
• Why does it fail?
• How can the likelihood of failure be reduced?
Failure

Failure
Loss of ability to perform as required.

▷ A failure is always related to a required function. The function is often


specified together with a performance requirement (eg. “must handle up
to 3 tonnes per minute”, “must respond within 0.1 seconds”).

▷ A failure occurs when the function cannot be performed or has a


performance that falls outside the performance requirement.

Source: International Electrotechnical Vocabulary (IEV) part 192 on dependability


Fault

Fault
Inability to perform as required, due to an internal state [IEV 192-04-01]

▷ While a failure is an event that occurs at a specific point in time, a fault is


a state that will last for a shorter or longer period.

▷ When a failure occurs, the item enters the failed state. A failure may
occur:
• while running
• while in standby
• due to demand

Source: International Electrotechnical Vocabulary (IEV) part 192 on dependability


Error

Error
Discrepancy between a computed, observed, or measured value or condition and
the true, specified, or theoretically correct value or condition.

▷ An error is present when the performance of a function deviates from the


target performance, but still satisfies the performance requirement

▷ An error will often, but not always, develop into a failure

Source: International Electrotechnical Vocabulary (IEV) part 192 on dependability, item 192-03-02
Failure mode

Failure mode
The way a failure is observed on a failed item.

▷ An item can fail in many different ways: a failure mode is a description of


a possible state of the item after it has failed

Source: International Electrotechnical Vocabulary (IEV) part 192 on dependability, item 192-03-17
Failure classification

IEC 61508 classifies failures according to their:


▷ Causes:
• random (hardware) faults
• systematic faults (including software faults)

▷ Effects:
• safe failures
• dangerous failures

▷ Detectability:
• detected: revealed by online diagnostics
• undetected: revealed by functional tests or upon a real demand for activation

IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems


Markovian models

inputs outputs
λ Models the transitions between correct state
and failed state.

Assumption: nothing in the past determines


future events except for current state.
failure rate λ

Failure and repair are stochastic processes.


correct failed
state state Availability = proportion of time spent in
correct state.
repair rate μ
The “safe failure fraction”

failure rate λ

correct failed
state state

repair rate μ
The “safe failure fraction”

failure rate λ Not all failures are dangerous: the system may have
been designed to tolerate them.
correct failed
state state

repair rate μ

rate of non-detected and


dangerous failures 𝜆𝐷
dangerous
state

service
repair rate μ
OK
degraded
but safe
rate of safe or dangerous service
and detected failure 𝜆𝑆
The “safe failure fraction”

failure rate λ Not all failures are dangerous: the system may have
been designed to tolerate them.
correct failed
state state

repair rate μ

rate of non-detected and


dangerous failures 𝜆𝐷
dangerous
state

service
repair rate μ
OK Importance of the coverage of the error detection
degraded
but safe
mechanisms, measured by the “safe failure fraction”:
rate of safe or dangerous service conditional probability that a failure will be safe, or
and detected failure 𝜆𝑆
dangerous-but-detected.
Failure classification

▷ Safe undetected (SU): A spurious (untimely) activation of a component


when not demanded

▷ Safe detected (SD): A non-critical alarm raised by the component

▷ Dangerous detected (DD): A critical diagnostic alarm reported by the


component, which will, as long as it is not corrected prevent the safety
function from being executed

▷ Dangerous undetected (DU): A critical dangerous failure which is not


reported and remains hidden until the next test or demanded activation of
the safety function
Common cause failures

Common cause failure


A failure that is the result of one or more events, causing concurrent failures of
two or more separate channels in a multiple channel system, leading to system
failure [IEC 61508]

▷ Typical examples: loss of electricity supply, massive physical destruction

▷ More subtle example: loss of clock function (electronics), common


maintenance procedure
Reliability: definitions

Reliability [ISO 8402]


The ability of an item to perform a required function, under given environmental
and operational conditions for a stated period of time.

▷ The reliability 𝑅(𝑡) of an item at time 𝑡 is the probability that the item
performs the required function in the interval [0–𝑡] given the stress and
environmental conditions in which it operates
Reliability: definitions

▷ If 𝑋 is a random variable representing time to failure of an item, the


survival function (or reliability function) 𝑅(𝑡) is

𝑅(𝑡) = Pr(𝑋 > 𝑡)

▷ 𝑅(𝑡) represents the probability that the item is working correctly at time 𝑡

▷ Properties:
• 𝑅(𝑡) is non-increasing (no rising from the dead)
• 𝑅(0) = 1 (no immediate death/failure)
• lim
𝑡→∞
𝑅(𝑡) = 0 (no eternal life)
Interpreting the reliability function

1 1

P(T ≤ t)

Survival function R(t)


Probability F(t)

P(T > t)

0
t 0
t
Time to failure (T) Time to failure (T)

Cumulative distribution function Reliability function


Tells you the probability that lifetime is ≤ 𝑡 Tells you the probability that lifetime is > 𝑡

𝐹(𝑡) = 𝑃(𝑇 ≤ 𝑡) 𝑅(𝑡) = 𝑃(𝑇 > 𝑡) = 1 − 𝐹(𝑡)


Exercise

Problem
The lifetime of a modern low-wattage electronic light bulb is known to be
exponentially distributed with a mean of 8000 hours.

Q1 Find the proportion of bulbs that may be expected to fail before 7000
hours use.

Q2 What is the lifetime that we have 95% confidence will be exceeded?

For more on the reliability of solid-state lamps, see energy.gov


Exercise

Solution
The time to failure of our light bulbs can be modelled by the distribution

dist = scipy.stats.expon(scale=8000)

Q1: The CDF gives us the probability that the lifetime is ≤ 𝑡. We want
dist.cdf(7000) which is 0.583137. So about 58% of light bulbs will fail
before they reach 7000 hours of operation.

Q2: We need the 0.05 quantile of the lifetime distribution, dist.ppf(0.05)


which is around 410 hours.
Exercise

Problem
A particular electronic device will only function correctly if two essential
components both function correctly. The lifetime of the first component
is known to be exponentially distributed with a mean of 5000 hours and
the lifetime of the second component (whose failures can be assumed to be
independent of those of the first component) is known to be exponentially
distributed with a mean of 7000 hours. Find the proportion of devices that
may be expected to fail before 6000 hours use.
Exercise

Solution
The device will only be working after 6000 hours if both components are
operating. The probability of the first component still working is

> pa = 1 - scipy.stats.expon(scale=5000).cdf(6000)
> pa
0.3011942119122022

and likewise for the second component

> pb = 1 - scipy.stats.expon(scale=7000).cdf(6000)
> pb
0.42437284567695

The probability of both working is pa × pb = 0.127818, so the proportion of


devices that can be expected to fail before 6000 hours use is around 87%.
Hazard function

Hazard function
The hazard function or failure rate function ℎ(𝑡) gives the conditional probability
of failure in the interval 𝑡 to 𝑡 + 𝑑𝑡, given that no failure has occurred by 𝑡.

𝑓 (𝑡)
ℎ(𝑡) =
𝑅(𝑡)

where 𝑓 (𝑡) is the probability density function (failure density function) and
𝑅(𝑡) is the reliability function.

It’s the probability of quitting a given state after having spent a given time
in that state.
Bathtub curve

▷ Early failure (“burn-in”, “infant Decreasing


failure
Constant
failure
Increasing
failure
mortality” period): high hazard rate rate rate rate

due to manufacturing and design


problems

Failure rate
Early Observed failure
“infant rate
▷ Useful life period: probability of mortality”
failure Wear Out
failures
failure is roughly constant Constant (random)
failures

▷ Wearout period: hazard rate starts


to increase due to aging (corrosion,
Time
wear, fatigue)
Reliability measures


▷ Mean time to failure (MTTF) = 𝔼(𝑇 ) = ∫0 𝑅(𝑡)𝑑𝑡

▷ Often calculated by dividing the total operating time of the units tested by
the total number of failures encountered

▷ Often modelled by a Weibull distribution (systems affected by wear) or an


exponential distribution (systems not affected by wear, such as
electronics)
Availability

Availability
The ability of an item (under combined aspects of its reliability, maintainability
and maintenance support) to perform its required function at a stated instant of
time or over a stated period of time [BS 4778]

▷ The availability 𝐴(𝑡) of an item at time 𝑡 is the probability that the item is
correctly working at time 𝑡

𝑀𝑇 𝑇 𝐹
▷ Mean availability =
𝑀𝑇 𝑇 𝐹 + 𝑀𝑇 𝑇 𝑅
Reliability ≠ availability

Reliable system with poor availability


Note the important difference
MTTF MTTR
between:
▷ reliability (failure-free operation
during an interval), measured by
the MTTF
time
▷ availability (instantaneous
Available system with poor reliability failure-free operation on demand,
independently of number of
failure/repair cycles), measured by

MTTF
𝐴=
MTTF + MTTR
time


Also note that reliability
safety
Maintainability

Maintainability
The ability of an item, under stated conditions of use, to be retained in, or restored
to, a state in which it can perform its required functions, when maintenance is
performed under stated conditions and using prescribed procedures and resources
[BS4778]

▷ Measured by MTTR: mean time to repair

▷ Commonly modelled by a lognormal distribution


Reliability measures

MTBF = MTTR + MTTF


MTBF

MTTF MTTR
operational under repair

time

fault multiple errors are possible in this period

een
MTBF: mean time betw
failures
Exercise

Problem
For a large computer installation, the maintenance logbook shows that over a
period of a month there were 15 unscheduled maintenance actions or downtimes,
and a total of 1200 minutes in emergency maintenance status. Based upon
prior data on this equipment, the reliability engineer expects repair times to be
exponentially distributed. A warranty contract between the computer company
and the customer calls for a penalty payment for any downtime exceeding 100
minutes. Find the following:
1 The MTTR and repair rate

2 The probability that the warranty requirement is being met

3 The median time to repair

4 The time within which 95% of the maintenance actions can be completed
Exercise

Solution
1 MTTR = 1200/15 = 80 minutes and the repair rate μ is 1/80 = 0.0125. Our
probability distribution for repair times is dist =
scipy.stats.expon(scale=80).

2 The probability of time to repair not exceeding 100 minutes is


dist.cdf(100) = 71%.

3 The median time to repair is dist.ppf(0.5) = 55 minutes.

4 The time within which 95% of the maintenance actions can be completed is
dist.ppf(0.95) = 240 minutes.
Exercise

Problem
From field data in an oil field, the time to failure of a pump, 𝑋, is known to be
normally distributed. The mean and standard deviation of the time to failure are
estimated from historical data as 3200 and 600 hours, respectively.
1 What is the probability that a pump will fail after it has worked for 2000 hours?

2 If two pumps work in parallel (the system can meet performance requirements
with a single operating pump), what is probability that the system will fail after
it has worked for 2000 hours? Assume that pump failures are independent
events.
Exercise

Solution

1 Random variable 𝑋 can be represented by the model scipy.stats.norm(3200, 600).


We want to assess Pr(𝑋 > 2000), which is 1 − Pr(𝑋 ≤ 2000), or 1 -
scipy.stats.norm(3200, 600).cdf(2000), or 0.977.
2 Let’s call 𝑌 the random variable representing time to failure of the redundant pump system,
and 𝑋1 and 𝑋2 the time to failure of pumps 1 and 2 respectively. We want to determine
Pr(𝑌 > 2000), which is 1 − Pr(𝑌 ≤ 2000).
This is 1 − Pr(𝑋1 ≤ 2000 ∧ 𝑋2 ≤ 2000) (given the parallel configuration of the pumps, the
system fails when both of the pumps fail).
Given that pump failure is independent, that’s 1 − Pr(𝑋1 ≤ 2000) × Pr(𝑋2 ≤ 2000). It’s 1 -
scipy.stats.norm(3200, 600).cdf(2000)**2, which is 0.9994.
Further reading
This presentation is distributed under the terms of the
Creative Commons Attribution – Share Alike licence

▷ Wired.com article Why Things Fail: From Tires to Helicopter Blades,


Everything Breaks Eventually from 2010

For more free content on risk engineering,


visit risk-engineering.org
Feedback welcome!
This presentation is distributed under the terms of the
Creative Commons Attribution – Share Alike licence

@LearnRiskEng

fb.me/RiskEngineering

Was some of the content unclear? Which parts were most useful to
you? Your comments to [email protected]
(email) or @LearnRiskEng (Twitter) will help us to improve these https://risk-engineering.org/
materials. Thanks! reliability-engineering/

For more free content on risk engineering,


visit risk-engineering.org

You might also like