2024 What Is Reliability (Tim Adams)
Timothy C. Adams
NASA Kennedy Space Center
Engineering
[email protected]
Prospectus
Claim: We are not done with Reliability until we are done with Safety!
− And if we continue to use a risk matrix, then we need to use it properly. Both axes
(likelihood and consequence) of the risk matrix need adequate attention.
Goal: Inform decision makers to embrace and use the Reliability discipline.
− This is important for decisions under risk (and with uncertainty) since ...
− Actual Results = Planned Results +/- Risk.
− Risk is potential loss in failure space. “Potential” is the likelihood axis of the risk matrix.
− The likelihood axis is the probability of failure (pf) axis, and pf = 1 - Reliability.
− So, what is reliability?
Risk: Reliability Makes the Likelihood Component
Risk Likelihood (Y) = Probability of Failure (pf) = Unreliability (U) = 1 - Probability of Success (ps) = 1 - Reliability (R)
Risk Decisions (Handling, Responding): Accept (fight), Avoid (flight), Hold (freeze), Mitigate (change), and Transfer (share)
Engineering Assurance and the RMA Program
The Reliability-Maintainability-Availability Program consists of integrated and sequenced tasks that are
implemented throughout the item’s life cycle. These tasks are customized to fit the needs of specific items. Resources:
• NASA-STD-8729.1, NASA Reliability and Maintainability (R&M) Standard for Spaceflight and Support Systems
• System Reliability Toolkit - V: New Approaches and Practical Applications, Quanterion, 2015
• Life Cycle Reliability Engineering, Yang, Wiley, 2007
• The Process of Reliability Engineering: Creating Reliability Plans That Add Value, Carlson & Schenkelberg, FMS Reliability, 2023
Definitions: R (↑) and its Counterparts M (↓) and A (↑ ↓)
1 An item is hardware (Hw), software (Sw), orgware (Ow, i.e., humans), interfaces, or a combination of these.
2 System Availability = A_Hardware * A_Software * A_Orgware * A_Interfaces, where * denotes “and.”
System Effectiveness is: (1) System Availability, (2) Dependability (i.e., operating condition, trustworthiness),
and (3) Capability (i.e., meets mission demands).
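As a minimal sketch of the System Availability product, the snippet below multiplies the four contributor availabilities. It assumes the contributors are statistically independent (the slide only says * denotes "and"), and every numeric value and name in it is hypothetical.

```python
from math import prod


def system_availability(availabilities):
    """Multiply contributor availabilities (assumes independent contributors)."""
    for name, value in availabilities.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} availability must be in [0, 1], got {value}")
    return prod(availabilities.values())


# Hypothetical values, chosen only to illustrate the arithmetic.
example = {
    "hardware": 0.999,
    "software": 0.995,
    "orgware": 0.990,
    "interfaces": 0.998,
}

a_sys = system_availability(example)
print(f"System availability: {a_sys:.4f}")
```

Because every factor is at most 1, the system value can never exceed the weakest contributor, which is one reason all four areas need attention.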
Reliability Statements: Writing Quantitative Goals
Tip: Use the ABCD mnemonic to write goals and requirements. Example:
Degree 80,000 miles/8 years [with 0.9999 probability and 95% confidence].
1 The "probability of success" portion of the goal statement for reliability (or availability) is commonly called
reliability (or availability). However, this shortcut is oversimplified: it omits the details needed to make or
verify the claim statistically.
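To illustrate why the confidence portion of a goal statement matters for verification, here is a hedged sketch of one common approach (not necessarily the author's): a zero-failure demonstration test based on the binomial model. It asks how many independent trials must all succeed before the stated probability can be claimed at the stated confidence. The function name is hypothetical, and treating each trial as one full mission is an assumption.

```python
from math import ceil, log


def zero_failure_sample_size(reliability, confidence):
    """Smallest n such that reliability**n <= 1 - confidence.

    Binomial zero-failure (success-run) reasoning: if the true reliability
    were only `reliability`, n straight successes would occur with
    probability reliability**n, which must be small enough to rule out.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return ceil(log(1.0 - confidence) / log(reliability))


# The probability and confidence from the example goal statement.
n_needed = zero_failure_sample_size(0.9999, 0.95)
print(f"Failure-free trials needed: {n_needed:,}")
```

Under these assumptions, the example goal of 0.9999 probability at 95% confidence needs tens of thousands of failure-free trials, which is why the verification details cannot be left out of the statement.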
Setting the Goal: How Many 9s Are Required?
Example: What should be the top-level goal for the loss of electrical power?
IF probability electric power is ...   THEN expected electric power unavailability in one year is …
On         Off              Seconds     Minutes     Hours    Days
0          1          31,536,000.00  525,600.00  8,760.00  365.00
0.9        0.1         3,153,600.00   52,560.00    876.00   36.50
0.99       0.01          315,360.00    5,256.00     87.60    3.65
0.999      0.001          31,536.00      525.60      8.76    0.37
0.9999     0.0001          3,153.60       52.56      0.88    0.04
0.99999    0.00001           315.36        5.26      0.09    0.00
0.999999   0.000001           31.54        0.53      0.01    0.00
0.9999999  0.0000001           3.15        0.05      0.00    0.00
Note: 1 per 1,000,000 is about 1/16 inch per one mile. Actually, 10^6 * 1/16” = 98.64% of 1 mile.
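The "how many 9s" table can be regenerated with a few lines of arithmetic. This sketch assumes a 365-day year, which matches the 31,536,000 seconds in the table; the function and constant names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a 365-day year


def expected_downtime(p_on):
    """Return expected yearly unavailability as (seconds, minutes, hours, days)."""
    p_off = 1.0 - p_on
    seconds = p_off * SECONDS_PER_YEAR
    return seconds, seconds / 60, seconds / 3600, seconds / 86400


# One row per "number of nines", from zero nines (always off) to seven nines.
for nines in range(0, 8):
    p_on = 1.0 - 10.0 ** (-nines)
    sec, minutes, hours, days = expected_downtime(p_on)
    print(f"{p_on:<10.7f} {sec:>16,.2f} {minutes:>12,.2f} {hours:>10,.2f} {days:>8,.2f}")
```

Each additional 9 cuts the expected downtime by a factor of ten, so the top-level goal can be chosen by asking how much yearly downtime the mission can tolerate.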
For sub-goals, start with and decompose the probability portion of the overall goal.
− This decomposition (called allocation) distributes system level reliability to lower elements.
− One allocation method is the nth root of system reliability; n is the number of serial elements.
The “nth root” allocation method provides minimum element reliability …
− That can serve as a minimum "design to" requirement for each serial element.
− Is larger than "absolute minimum element reliability," the notion where other serial elements
have perfect reliability. Thus, absolute minimum element reliability equals system reliability.
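The nth-root allocation described above can be sketched as follows. The system goal and the number of serial elements are hypothetical values, and the function name is an assumption made for illustration.

```python
def allocate_nth_root(system_reliability, n_serial_elements):
    """Equal allocation: each of n serial elements gets R_system ** (1 / n)."""
    if not 0.0 < system_reliability <= 1.0:
        raise ValueError("system_reliability must be in (0, 1]")
    if n_serial_elements < 1:
        raise ValueError("n_serial_elements must be at least 1")
    return system_reliability ** (1.0 / n_serial_elements)


r_system = 0.9999  # hypothetical system-level goal
n = 4              # hypothetical number of serial elements

r_element = allocate_nth_root(r_system, n)

# Recombining n equal serial elements must reproduce the system goal.
r_check = r_element ** n

print(f"Minimum 'design to' reliability per element: {r_element:.6f}")
print(f"Recombined system reliability:               {r_check:.6f}")
print(f"Absolute minimum element reliability:        {r_system:.6f}")
```

The allocated element value is always at least as large as the system goal, consistent with the point that the absolute minimum element reliability equals system reliability.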
Reliability: From Definition to Analytical Products
For a “nay,” see slide 11.
Required Resources via Statistics: Data and Models
For the item under study, data includes: (1) Both failures and non-failures (censored
data) and (2) Applicable areas: hardware, software, orgware (humans), and interfaces.
Data types for RMA (more on slide 12) and common (not the only) math models:
1. Time-based (clock) data
• Continuous (e.g., jet engine run hours)
• Lifetime math model: Weibull
• Repair time math model: Lognormal
2. Event-based (demand) data
• Discrete (e.g., landing gear actuations)
• Math models: Binomial and Poisson
3. Stress (load) and strength (capacity) data
• Example: See diagram
• Diagram annotation: This area corresponds to the probability of failure due to variation
(uncertainty) in stress and strength.
• Note: A safety factor does not characterize the uncertainty in the item’s stress and strength.
4. Combination
• Time-to-failure data at different stress levels
• Common math model: Covariate Weibull
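For the stress and strength data type, the note about safety factors can be illustrated with a small sketch. It assumes stress and strength are independent and normally distributed, which is one common simplification and not something the slide specifies. All numbers and names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def interference_failure_probability(mean_strength, sd_strength, mean_stress, sd_stress):
    """P(stress > strength) for independent, normally distributed stress and strength."""
    margin_mean = mean_strength - mean_stress
    margin_sd = sqrt(sd_strength ** 2 + sd_stress ** 2)
    return normal_cdf(-margin_mean / margin_sd)


# Two hypothetical designs with the same safety factor (150 / 100 = 1.5)
# but different amounts of scatter in stress and strength.
pf_tight = interference_failure_probability(150.0, 5.0, 100.0, 5.0)
pf_loose = interference_failure_probability(150.0, 20.0, 100.0, 20.0)

print(f"Safety factor 1.5, small scatter: pf = {pf_tight:.3e}")
print(f"Safety factor 1.5, large scatter: pf = {pf_loose:.3e}")
```

Both designs share one safety factor, yet their failure probabilities differ by many orders of magnitude, which is the sense in which a safety factor alone does not characterize the uncertainty.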
Example: Time-Based Data for both R and M
MIL-HDBK-217: Should this Data Still Be Used?
MIL-HDBK-217 = Reliability Prediction of Electronic Equipment; Version F = 1991 - 1995
Definitions: Failure and Failure Data as a Storyline
For a photocopier, which events are a failure? It depends on the mission and your “model of the world.”
• Is old but works • Has cracked glass • Is out of toner • Is in use by others
• Does not do color • Will not power up • Is being repaired • Is not permitted for
“To understand system assurance, one has to understand the definition of a failure and hazard. If a system does not
meet the reasonable expectation of the user, then it has failed, even though it meets the specifications. When failures
result in hazards, accidents can occur.”
Source: Assurance Technologies Principles and Practices, 2nd ed., Raheja & Allocco 2006, p. 5
Vocabulary: Electropedia (IEV Online) is a resource from the IEC, which prepares and publishes international standards.
• Failure (of an item), IEV 192-03-01: Loss of ability to perform as required. See also 192-03-03.
• Hazard, IEV 903-01-02: Potential source of harm … qualified with origin (e.g., fire hazard).
Related Concepts: Item = hardware, software, orgware (humans), interfaces, or combination. Safety = freedom
from accident and loss. Risk (in failure space) = potential loss. RMA = defined on slides 4 and 5.
Data Storyline: Failure Event (item’s what, when, & where) → Failure Mode (observed what & how much) →
Failure Mechanism (why did it fail; causes) → Failure Recurrence Control (how to prevent, mitigate, respond to).
Tip: For RMA data, collect at least the operational type. Operational data: Operating behaviors and outcomes,
inferred by the design model, non-physical characteristics, and uses time and counts. Technical data, by contrast:
Functional capability, contained in the design model, physical characteristics, and uses various units of measure.
Definitions: Bathtub Curve and Failure Rate Types
Definition: Durability
Example: Estimate Reliability Given TTF Data
Example: Estimate Reliability Given λ(t) Function
Question: … 500 hours?
Given: λ(t) = (2.628 × 10^-5) * t^0.643, the failure-rate function (or hazard function) for
item X from the lab, a handbook, or journal paper. Note: Confirm before using λ(t) as a …
[Plot: Failure Rate, λ(t) = (2.628 × 10^-5) * t^0.643; vertical axis from 0.000 to 0.003]
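A hedged sketch of how reliability follows from a failure-rate function: R(t) = exp(-H(t)), where H(t) is the integral of λ from 0 to t. It uses the slide's λ(t) = (2.628 × 10^-5) * t^0.643 and assumes that t is in hours, that λ(t) applies from t = 0, and that the slide's question concerns reliability at 500 hours. Function names are hypothetical.

```python
from math import exp

COEFFICIENT = 2.628e-5  # from the slide's failure-rate function
EXPONENT = 0.643        # from the slide's failure-rate function


def failure_rate(t):
    """Hazard function lambda(t) = COEFFICIENT * t ** EXPONENT."""
    return COEFFICIENT * t ** EXPONENT


def cumulative_hazard(t):
    """Closed-form integral of failure_rate from 0 to t."""
    return COEFFICIENT * t ** (EXPONENT + 1.0) / (EXPONENT + 1.0)


def reliability(t):
    """R(t) = exp(-cumulative hazard)."""
    return exp(-cumulative_hazard(t))


def cumulative_hazard_numeric(t, steps=100_000):
    """Trapezoidal-rule cross-check of the closed-form integral."""
    width = t / steps
    total = 0.5 * (failure_rate(0.0) + failure_rate(t))
    total += sum(failure_rate(i * width) for i in range(1, steps))
    return total * width


t_hours = 500.0
print(f"lambda({t_hours:.0f}) = {failure_rate(t_hours):.5f} per hour")
print(f"R({t_hours:.0f})      = {reliability(t_hours):.4f}")
```

Under these assumptions the reliability at 500 hours comes out to about 0.65. The increasing failure rate (exponent greater than zero) indicates wear-out behavior.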
Alternatives to TTF and λ(t): Estimate Reliability
* PoF is how it should perform under specified conditions. Statistical modeling is how it did perform.
Examples: Physics of Failure (PoF)
From Design for Reliability, Crowe & Feinberg, 2001
• Temperature and Humidity Related Failures - Peck Model
• Vibration Related Failures - MIL-STD 810E
• Wearout (Fatigue) Failures via Uniform Cyclic Load - S-N Curve
Example: Find System Reliability (part 1 of 2)
Given
• Objective: Find the reliability (probability of success) for System X.
• Configuration: As shown in the diagram, System X has two items in series; the second item has
two items in parallel. All items operate independently of each other.
o Independence means the occurrence of success or failure in any one of the elements does
not affect the probabilities of the occurrences of the other events.
• Selected Probability Laws:
o Two items in series: Probability of A and B = P(A and B) = P(A)*P(B).
o Two items in parallel: Probability of B1 or B2 = P(B1 or B2) = 1 - [1 - P(B1)]*[1 - P(B2)].
• Data (3 types): (1) The reliability for each element (block). (2) The likelihood System X will be
needed, the initiating event, is one. (3) The consequence (e.g., loss of life, loss of property,
additional cost, delayed schedule, loss of reputation) for failure is quantitatively unknown.
Example: Find System Reliability (part 2 of 2)
Solution
• Write Outcome Statement (in English): Let S denote success and S’ denote failure for “not S” or the complement
of success. Based on the configuration of System X, system reliability is P(S) = P[A and (B1 or B2)].
• Method 1 - Solve via Event Tree: (1) A, B1, and B2 generate eight (2^3) possible scenarios. Scenarios need to be
assessed for applicability. (2) B is dependent on A. When A fails, P(A’ and B) = P(A’)*P(B given A’) = 0.1*0 = 0.
(3) Because B1 and B2 are independent, P(B1 and B2) = P(B1)*P(B2).
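The System X calculation can be sketched two ways: directly from the series and parallel probability laws, and by enumerating all eight scenarios. P(A) = 0.9 is implied by the slide's P(A’) = 0.1; the values for B1 and B2 are hypothetical because the block values come from a diagram that is not reproduced here. The enumeration counts every scenario in which A fails as a system failure, which gives the same total as treating B as not reachable once A fails.

```python
from itertools import product

P_A = 0.9   # implied by P(A') = 0.1 on the slide
P_B1 = 0.8  # hypothetical value for illustration
P_B2 = 0.8  # hypothetical value for illustration


def series_parallel_reliability(p_a, p_b1, p_b2):
    """P(S) = P[A and (B1 or B2)] using the series and parallel laws."""
    p_b = 1.0 - (1.0 - p_b1) * (1.0 - p_b2)  # parallel pair B1, B2
    return p_a * p_b                          # A in series with the pair


def scenario_reliability(p_a, p_b1, p_b2):
    """Sum the probabilities of the success scenarios among all 2**3 outcomes."""
    total = 0.0
    for a_ok, b1_ok, b2_ok in product((True, False), repeat=3):
        probability = (
            (p_a if a_ok else 1.0 - p_a)
            * (p_b1 if b1_ok else 1.0 - p_b1)
            * (p_b2 if b2_ok else 1.0 - p_b2)
        )
        if a_ok and (b1_ok or b2_ok):
            total += probability
    return total


r_formula = series_parallel_reliability(P_A, P_B1, P_B2)
r_scenarios = scenario_reliability(P_A, P_B1, P_B2)
print(f"Formula:   P(S) = {r_formula:.4f}")
print(f"Scenarios: P(S) = {r_scenarios:.4f}")
```

Agreement between the two methods is a useful check that the outcome statement P[A and (B1 or B2)] was translated correctly.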
♦ Reliability is not Lean Six Sigma, SPC, Safety, Risk … (go to slide)
Business Model: “Eight Dimensions of Quality”
This model by David A. Garvin (Harvard Business School) “breaks down the word
quality into manageable parts … can serve as a framework for strategic analysis.”
Data Types: Quality vs. Reliability
“Every product possesses a number of elements that jointly describe what the user
or consumer thinks of as quality. These parameters are often called quality
characteristics … several types:
1. Physical: Length, weight, voltage, viscosity
2. Sensory: Taste, appearance, color
3. Time Orientation: Reliability, durability, and serviceability [Maintainability].”
Note on item 3: These three are included in David A. Garvin’s “Eight Dimensions of Quality” model.
Source: Introduction To Statistical Quality Control 3rd ed., Montgomery, 1997, p. 6
“Not all discrepancies or defects lead to low reliability. For example, these defects
may degrade quality but not reliability:
− The wrong shade of a color, a light dent on the surface of a casting, a scratch on the
paint, a poor surface finish, the wrong plating on screws
However, for example, these defects or flaws usually reduce reliability:
− A poor weld, a cold-soldered joint, leaving out a lock washer, using the wrong flux, not
cleaning the surfaces to be joined, a large dent on a spring, an improper crimp on a wire
joint.”
Source: Assurance Technologies Principles and Practices, 2nd ed., Raheja & Allocco, 2006, p. 66
Reliability is not …
− Lean Six Sigma.
• Lean Six Sigma is a process improvement approach that uses a collaborative team effort.
• Lean Six Sigma is based on DMAIC (define, measure, analyze, improve, and control).
• Reliability is a “design to” attribute and a measure of effectiveness (not efficiency).
− Quality or Statistical Process Control (SPC).
• “… reliability incorporates the passage of time [number of demands or load], whereas quality
does not, because it is a static descriptor of an item … High reliability implies high quality, but
the converse is not necessarily true.”
Source: Reliability, Probabilistic Models and Statistical Methods, 3rd ed., Leemis, 2025, p. 4
• In the Apollo Space Program, quality meant the item was built so that it would work; reliability
meant the item was designed so that it would work.
− Safety.
• Reliability is concerned with the cause of and likelihood of failure—and ensuring no loss of
the item’s intended function and mission. Safety is concerned with failures that create hazards.
− Risk.
• However, “not reliable” (Y) and “not safe” (Z) provide the content for the risk scenario (X).
• For details on risk as { X, Y, Z }, see Kaplan & Garrick, Jan 1981. More on risk – slide 3.
Summary: (Reliability * Safety) ⊂ Risk ⊂ Good Decision, where * denotes “and” and ⊂ denotes “subset of.”
An Idealized Work Process:
To Produce Safety and RMA Analyses and Assessments
[Diagram labels: WHAT; EFFORT (from 0%); WHO (Engineering; Safety & Mission Assurance); WHEN (Start to Finish)]
Analytical Products:
FFBD = Functional Flow Block Diagram
RBDA = Reliability Block Diagram Analysis
FMEA = Failure Modes & Effects Analysis
FTA = Fault Tree Analysis
PRA = Probabilistic Risk Assessment
Theme:
This work sequence (WHEN) builds and uses analytical products (WHAT) in an optimum manner, especially
during the Design Phase. The appropriate mix of experts (WHO and EFFORT) makes and delivers the right
analytical product at the right time. In addition to serving the intended purpose at the desired time, each
analytical product serves as an input that expands the technical fidelity of the analytical products that follow.
Author:
Tim Adams is a NASA Senior Engineer in the Engineering Directorate at the Kennedy Space
Center (KSC). He serves as a technical resource in engineering assurance with a specialty in
quantitative Reliability Engineering and Technical Risk. He is the founder and Technical
Editor of KSC Reliability, a website for practitioners in Reliability, Safety, and Systems
Engineering. Tim started with NASA at the Johnson Space Center with the Mission
Operations Directorate. In Reliability and Risk, Tim has received three NASA medals, an
Employee of the Year Award with the Office of the Chief Engineer, a Commendation Award in
Project Management, and the Silver Snoopy Award from Astronaut William F. Readdy.
With the American Society for Quality (ASQ), Tim is a senior member and a Certified Reliability
Engineer (CRE), and he served on the national ASQ team that reviewed the CRE exam. In
Mathematics, Tim is a member of Pi Mu Epsilon, a national honorary society in mathematics.
Acknowledgments:
The Office of Safety and Mission Assurance (OSMA)
− Brent Heard
− Anthony "Tony" DiVenti
Goddard Space Flight Center
− Charlie Knapp
Kennedy Space Center
− Susan Riccetti
− Anthony "Tony" Mansk