NASA Technical Memorandum 106313

Design for Reliability: Preferred Practices

Vincent Lalli
National Aeronautics and Space Administration
Lewis Research Center, Cleveland, Ohio 44135

Prepared for the Reliability and Maintainability Symposium cosponsored by ASQC, IIE, IEEE, SOLE, IES, and AIAA, Anaheim, California, January 24-27, 1994
SUMMARY AND PURPOSE

This tutorial summarizes reliability experience from both NASA and industry and reflects engineering practices that support current and future civil space programs. These practices were collected from various NASA field centers and were reviewed by a committee of senior technical representatives from the participating centers (members are listed at the end). The material for this tutorial was taken from the publication issued by the NASA Reliability and Maintainability Steering Committee (NASA Reliability Preferred Practices for Design and Test, NASA TM-4322, 1991). Reliability must be an integral part of the systems engineering process. Although both disciplines must be weighted equally with other technical and programmatic demands, the application of sound reliability principles will be the key to the effectiveness and affordability of America's space program. Our space programs have shown that reliability efforts must focus on the design characteristics that affect the frequency of failure. Herein, we emphasize that these identified design characteristics must be controlled by applying conservative engineering principles. This tutorial should be used to assess your current reliability techniques, thus promoting an active technical interchange between reliability and design engineering that focuses on the design margins and their potential impact on maintenance and logistics requirements. By applying these practices and guidelines, reliability organizations throughout NASA and the aerospace community will continue to contribute to a systems development process which assures that operating environments are well defined and verified by test or analysis and that design criteria drive a conservative design approach.
Vincent Lalli has worked at NASA Lewis Research Center since 1963, when he was hired as an aerospace technologist. Presently, as an adjunct to his work for the Office of Mission Safety and Assurance in design, analysis, and failure metrics, he is responsible for product assurance management and also teaches courses to assist with NASA's training needs. Mr. Lalli graduated from Case Western Reserve University with a B.S. and an M.S. in electrical engineering. In 1959, as a research assistant at Case, and later at Picatinny Arsenal, he helped to develop electronic fuses and special devices. From 1956 to 1963 he worked at TRW as a design, lead, and group engineer. Mr. Lalli is a registered engineer in Ohio and a member of Eta Kappa Nu, IEEE, IPC, ANSI, and ASME.
1.0 OVERVIEW

1.1 Applicability

The design practices that have contributed to NASA mission success represent the "best technical advice" on reliability design and test practices. These practices are not requirements but rather proven technical approaches that can enhance system reliability. This tutorial is divided into two technical sections.
Section II contains reliability practices, including design criteria, test procedures, and analytical techniques, that have been successfully applied in previous spaceflight programs. Section III contains reliability guidelines, including techniques currently applied to spaceflight projects, where insufficient information exists to certify that the technique will contribute to mission success.
1.2 Discussion

Experience from NASA's successful extended-duration space missions shows that four elements contribute to high reliability: (1) understanding stress factors imposed on flight hardware by the operating environment; (2) controlling the stress factors through the selection of conservative design criteria; (3) conducting an appropriate analysis to identify and track high stress points in the design (prior to qualification testing or flight use); and (4) selecting redundancy alternatives to provide the necessary function(s) should failure occur.
2.3 Document Referencing

The following example shows the document numbering system applicable to the practices and guidelines, using the practice "Part Junction Temperature," practice number PD-ED-1204.

Key to nomenclature.--The number has five elements:

1. P  - Practice
2. D  - Design factors
3. ED - Engineering design (other codes: AP - Analytical procedures; TE - Test considerations and procedures)
4. 12 - Series number
5. 04 - Practice number within the series

The information is for use throughout NASA and the aerospace community to assist in the design and development of highly reliable equipment and assemblies. The practices include recommended analysis procedures, redundancy considerations, parts selection, environmental requirements considerations, and test requirements and procedures.
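The numbering scheme can be applied mechanically; a minimal sketch follows. The function and field names are my own, and the "G" (guideline) and "T" (test) codes are inferred from document numbers such as GD-AP-2301 and PT-TE-1410 that appear elsewhere in this tutorial.

```python
# Illustrative parser for practice/guideline numbers such as "PD-ED-1204".
def parse_number(doc_id):
    prefix, discipline, serial = doc_id.split("-")
    kinds = {"P": "practice", "G": "guideline"}          # element 1
    factors = {"D": "design", "T": "test"}               # element 2
    disciplines = {"ED": "engineering design",           # elements 3-4
                   "AP": "analytical procedures",
                   "TE": "test considerations and procedures"}
    return {"kind": kinds[prefix[0]],
            "factor": factors[prefix[1]],
            "discipline": disciplines[discipline],
            "series": serial[:2],                        # element 4 of key
            "number": serial[2:]}                        # element 5 of key

print(parse_number("PD-ED-1204"))
```

For example, PD-ED-1204 decodes to a practice, design factors, engineering design, series 12, practice 04.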
2.4 Practices

Each practice is presented in a standard format:

Practice: A brief statement of the practice.

Benefit: A concise statement of the technical improvement realized from implementing the practice.

Programs That Certified Usage: Identifiable programs or projects that applied the practice.

Center to Contact for More Information: The NASA Center, usually the sponsoring Center, that is the source of additional information.

Implementation Method: A brief description of the process, sufficient not to design but to understand how the practice is implemented.

Technical Rationale: A brief technical justification for use of the practice.

Impact of Nonpractice: A brief statement of what can be expected if the practice is avoided.

Related Practices: Practices in other topic areas that contain additional information.

The following practices had been certified as of January 1993:

High-Voltage Power Supply Design and Manufacturing Practices
Class-S Parts in High-Reliability Applications
Part Junction Temperature
Welding Practices for 2219 Aluminum and Inconel 718
Power Line Filters
Magnetic Design Control for Science Instruments
Static Cryogenic Seals for Launch Vehicle Applications
Ammonia-Charged Aluminum Heat Pipes with Extruded Wicks
Assessment and Control of Electrical Charges
Combination Methods for Deriving Structural Design Loads Considering Vibro-Acoustic, etc., Responses
Design and Analysis of Electronic Circuits for Worst-Case Environments and Part Variations
Selection of Spacecraft Materials and Supporting Vacuum Outgassing Data
Heat Sinks for Parts Operated in Vacuum
Environmental Test Sequencing
Random Vibration Testing
Electrostatic Discharge (ESD) Test Practices
Electrical Shielding of Power, Signal, and Control Cables
Electrical Grounding Practices for Aerospace Hardware
Preliminary Design Review
Active Redundancy
Structural Laminate Composites for Space Applications
Application of Ablative Composites to Nozzles for Reusable Solid Rocket Motors
Vehicle Integration/Tolerance Buildup Practices
Battery Selection Practice for Aerospace Power Systems
Magnetic Field Restraints for Spacecraft Systems and Subsystems
Surface Charging and Electrostatic Discharge Analysis
Independent Review of Reliability Analyses
Part Electrical Stress Analysis
Problem/Failure Report Independent Review/Approval
Risk Rating of Problem/Failure Reports
Thermal Analysis of Electronic Assemblies to the Piece Part Level
Failure Modes, Effects, and Criticality Analysis (FMECA)
EEE Parts Screening
Thermal Cycling
Thermographic Mapping of PC Boards
Thermal Test Levels
Powered-On Vibration
Sinusoidal Vibration
Assembly Acoustic Tests
Pyrotechnic Shock
Thermal Vacuum Versus Thermal Atmospheric Test of Electronic Assemblies

2.5 Typical Reliability Practice

A typical reliability practice, Environmental Factors, is illustrated in this section. Environmental factors are very important in system design, so equipment operating conditions must be identified. Systems designed to have adequate environmental strength perform well in the field and satisfy our customers. Failure to perform a detailed life-cycle environment profile can lead to overlooking environmental factors whose effect is critical to equipment reliability. Not including these factors in the environmental design criteria and test program can lead to environment-induced failures during spaceflight operations.

Benefit: Adequate environmental strength is incorporated into the design.

Programs That Certified Usage: SERT I and II, CTS, ACTS, space experiments, launch vehicles, space power systems, and Space Station Freedom.

Center to Contact for More Information: NASA Lewis Research Center.

Implementation Method: Develop a life-cycle environment profile. Describe anticipated events from final factory acceptance through removal from inventory. Identify significant natural and induced environments for each event. Describe environmental and stress conditions in narrative and statistical form.
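The implementation method above is essentially a structured inventory of events and their environments. A minimal sketch of such a profile as data follows; the event names and environment entries are illustrative, not taken from the practice.

```python
# Minimal life-cycle environment profile: anticipated events from final
# factory acceptance through end of life, each with its significant
# natural and induced environments. All entries are illustrative.
profile = [
    ("factory acceptance", ["handling shock", "ambient temperature"]),
    ("transport and storage", ["vibration", "temperature cycling", "humidity"]),
    ("launch", ["random vibration", "acoustics", "pyrotechnic shock"]),
    ("on-orbit operations", ["thermal vacuum", "radiation", "micrometeoroids"]),
]

def environments_for(event_name):
    """Return the environment list recorded for an event, or [] if none."""
    for event, envs in profile:
        if event == event_name:
            return envs
    return []

for event, envs in profile:
    print(f"{event}: {', '.join(envs)}")
```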
Technical Rationale: The following table summarizes, for each environment, its principal effects and the typical failures it induces.

Environment: High temperature
  Principal effects: Thermal aging (oxidation, cracking, chemical reaction); softening, melting, and sublimation; viscosity reduction and evaporation; physical expansion
  Typical induced failures: Insulation failure; alteration of electrical properties; structural failure; increased mechanical stress; increased wear on moving parts

Environment: Low temperature
  Principal effects: Increased viscosity and solidification; embrittlement; physical contraction
  Typical induced failures: Loss of lubrication; alteration of electrical properties; loss of mechanical strength; cracking and fracture; structural collapse; seal failure

Environment: Temperature shock
  Principal effects: Mechanical stress
  Typical induced failures: Structural collapse or weakening; seal damage

Environment: Wind
  Principal effects: Force application; deposition of materials; heat loss (low velocity); heat gain (high velocity)
  Typical induced failures: Structural collapse; interference with function; loss of mechanical strength; clogging; accelerated abrasion; aggravation of low- and high-temperature effects

Environment: Rain
  Principal effects: Physical stress; water absorption and immersion; erosion; corrosion
  Typical induced failures: Structural weakening or collapse; increase in weight; electrical failure; removal of protective coatings; surface deterioration

Environment: High relative humidity
  Principal effects: Moisture absorption; chemical reaction
  Typical induced failures: Swelling or rupture of container; physical breakdown; loss of electrical strength

Environment: Low relative humidity
  Principal effects: Desiccation (embrittlement, granulation)
  Typical induced failures: Loss of mechanical strength; structural collapse; alteration of electrical properties

Environment: High pressure
  Principal effects: Compression
  Typical induced failures: Structural collapse; seal penetration; interference with function

Environment: Low pressure
  Principal effects: Expansion; outgassing; reduced dielectric strength of air
  Typical induced failures: Fracture of container; explosive expansion; alteration of electrical properties; loss of mechanical strength; insulation breakdown and arc-over

Environment: Solar radiation
  Principal effects: Actinic and physicochemical reactions; embrittlement
  Typical induced failures: Surface deterioration; alteration of electrical properties; discoloration of materials

Environment: Sand and dust
  Principal effects: Abrasion; clogging
  Typical induced failures: Increased wear; interference with function; alteration of electrical properties

Environment: Salt spray
  Principal effects: Chemical reactions; corrosion; electrolysis
  Typical induced failures: Increased wear; loss of mechanical strength; alteration of electrical properties; surface deterioration; increased conductivity

Environment: Nuclear/cosmic radiation
  Principal effects: Heating; transmutation and ionization
  Typical induced failures: Alteration of chemical, physical, and electrical properties; production of gases and secondary particles; insulators become conductive

Environment: Acceleration
  Principal effects: Mechanical stress
  Typical induced failures: Structural collapse

Environment: Vibration
  Principal effects: Mechanical stress; fatigue
  Typical induced failures: Loss of mechanical strength; interference with function; increased wear; structural collapse

Environment: Magnetic fields
  Principal effects: Induced magnetization
  Typical induced failures: Interference with function; alteration of electrical properties; induced heating
3.2 Format

The following format is used to present the guidelines:
Guideline: A brief statement of the guideline.

Benefit: A concise statement of the technical improvement realized from implementing the guideline.
Impact of Nonpractice: Failure to perform a detailed life-cycle environment profile can lead to overlooking environmental factors whose effect is critical to equipment reliability. If these factors are not included in the environmental design criteria and test program, environment-induced failures may occur during spaceflight operations.

References:

Government
1. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E, Notice 1, January 1990.
2. Reliability/Design Thermal Applications. MIL-HDBK-251, January 1978.
3. Electronic Reliability Design Handbook. MIL-HDBK-338-1A, October 1988.
4. Environmental Test Methods and Engineering Guidelines. MIL-STD-810E, July 1989.

Industry
5. Space Station Freedom Electric Power System Reliability and Maintainability Guidelines Document. EID-00866, Rocketdyne Division, Rockwell International, 1990.
6. Society of Automotive Engineers: Reliability, Maintainability, and Supportability Guidebook. SAE G-11, 1990.
Center to Contact for More Information: The NASA Center, usually the sponsoring Center, that is the source of additional information.

Implementation Method: A brief description of the process, sufficient not to design but to understand how the guideline is implemented.

Technical Rationale: A brief technical justification for use of the guideline.

Impact of Nonpractice: A brief statement of what can be expected if the guideline is avoided.

Related Practices: Practices in other topic areas that contain additional information.

References: Sources that contain additional information on the guideline.
3.3 Guidelines as of January 1993

GD-ED-2201  Fastener Standardization and Selection Considerations
GD-ED-2202  Design Considerations for Selection of Thick-Film Microelectronic Circuits
GD-ED-2203  Design Checklists for Microcircuits
GD-AP-2301  Analysis of Earth Orbit Environmental Heating
GT-TE-2401  Environmental
The reliability design guidelines for consideration by the aerospace community are presented herein. These guidelines contain information that represents a technically credible process applied to ongoing NASA programs and projects. Unlike a reliability design practice, a guideline lacks specific operational experience or data to validate its contribution to mission success.

3.4 Typical Reliability Guideline

A typical reliability guideline is illustrated in this section. Environmental heating for Earth-orbiting systems is an important design consideration. Designers should use currently accepted values for the solar constant, albedo factor, and Earth radiation when calculating the heat balance of Earth orbiters. These calculations can accurately predict the thermal environment of orbiting devices. Failure to use these constants can result in an incomplete thermal analysis and grossly underestimated temperature variations of the components.
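As a rough numeric sketch of how such constants enter a heat-balance estimate, consider bounding the incident flux with hot-, nominal-, and cold-case values. The simple albedo term (solar constant times albedo factor times a view factor) and the 0.3 view factor are illustrative assumptions for this sketch, not part of the guideline.

```python
# Hot-, nominal-, and cold-case incident flux on an Earth orbiter.
# Solar-constant and albedo values are the currently accepted values
# cited by the guideline; the albedo model and view factor are
# simplifying assumptions for illustration only.
SOLAR = {"nominal": 1367.5, "hot": 1422.0, "cold": 1318.0}   # W/m^2
ALBEDO = {"nominal": 0.30, "hot": 0.35, "cold": 0.25}

def incident_flux(case, view_factor=0.3):
    """Direct solar plus Earth-reflected (albedo) flux, W/m^2."""
    s = SOLAR[case]
    return s + s * ALBEDO[case] * view_factor

for case in ("cold", "nominal", "hot"):
    print(f"{case}: {incident_flux(case):.1f} W/m^2")
```

Bracketing the analysis between the cold and hot cases exposes the temperature variation that a single nominal-value analysis would miss.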
References:
1. Leffler, J.M.: Spacecraft External Heating Variations in Orbit. AIAA Paper 87-1596, June 1987.
2. Reliability/Design, Thermal Applications. MIL-HDBK-251, 1978.
3. Incropera, F.P.; and DeWitt, D.P.: Fundamentals of Heat and Mass Transfer. Second ed., John Wiley & Sons, 1985.
Analysis of Earth Orbit Environmental Heating (GD-AP-2301)

Guideline: Use currently accepted values for the solar constant, albedo factor, and Earth radiation when calculating the heat balance of Earth orbiters. This guideline provides the heating rate for a blackbody case without considering spectral effects or collimation.

Benefit: The thermal environment of orbiting devices is accurately predicted.

Center to Contact for More Information: Goddard Space Flight Center.

Implementation Method: Use the following currently accepted values:
Solar constant, W/m^2: nominal, 1367.5; winter solstice, 1422.0; summer solstice, 1318.0
Albedo factor: nominal, 0.30; hot, 0.35; cold, 0.25
Earth radiation: nominal, 255 K equivalent blackbody temperature

Technical Rationale: Modification of the energy incident on a spacecraft due to Earth-Sun distance variation and the accuracy of the solar constant are of sufficient magnitude to be important parameters in performing a thermal analysis.

Impact of Nonpractice: Failure to use these constants results in an incomplete thermal analysis and grossly underestimated temperature variations of components.

NASA RELIABILITY AND MAINTAINABILITY STEERING COMMITTEE

The following members of the NASA Reliability and Maintainability Steering Committee may be contacted for more information about the practices and guidelines:

Dan Lee, Ames Research Center, MS 218-7 DQR, Moffett Field, California 94035
Jack Remez, Goddard Space Flight Center, Bldg. 6, Rm S233, Code 302, Greenbelt, Maryland 20771
Thomas Gindoff, Jet Propulsion Laboratory, California Institute of Technology, MS 301-456, SEC 521, 4800 Oak Grove Drive, Pasadena, California 91109
Nancy Steisslinger, Lyndon B. Johnson Space Center, Bldg. 45, Rm 613, Code NB23, Houston, Texas 77058
Leon Migdalski, John F. Kennedy Space Center, RT-ENG-2, KSC HQS 3548, Kennedy Space Center, Florida 32899
Salvatore Bavuso, Langley Research Center, MS 478, 5 Freeman Road, Hampton, Virginia 23665-5225
Vincent Lalli, Lewis Research Center, MS 501-4, Code 0152, 21000 Brookpark Road, Cleveland, Ohio 44135
Donald Bush, George C. Marshall Space Flight Center, CT11, Bldg. 4103, Marshall Space Flight Center, Alabama 35812
Ronald Lisk, NASA Headquarters, Code QS, Washington, DC 20546
PART II--RELIABILITY TRAINING

1.0 INTRODUCTION TO RELIABILITY
"Reliability" applies to systems consisting of people, machines, and written information. A system is reliable if those who need it can depend on it over a reasonable period of time and if it satisfies their needs. Of the people involved in a system, some rely on it, some keep it reliable, and some do both. Several kinds of machines comprise a system: mechanical, electrical, and electronic. The written information defines people's roles in the system: sales literature; system specifications; detailed manufacturing drawings; software, programs, and procedures; operating and repair instructions; and inventory control. Reliability engineering is the discipline that defines specific tasks done while a system is being planned, designed, manufactured, used, and improved. Outside of the usual engineering and management tasks, these tasks ensure that the people in the system attend to all those details that keep it operating reliably. Reliability engineering is necessary because, as users of rapidly changing technology and as members of large complex systems, we cannot otherwise ensure that essential details affecting reliability are not overlooked.
1.1 Failure Physics

The theme of this tutorial is failure physics: the study of how products, hardware, software, and systems fail and what can be done about it. Training in reliability must begin with a review of mathematics and a description of the elements that contribute to product failures.

Consider the following examples of failure analysis. A semiconductor diode developed a short. Analysis showed that a surge voltage was occurring occasionally, exceeding the breakdown voltage of the diode and burning it up. The problem: stress exceeding strength, a type I failure. A transistor suddenly stopped functioning. Analysis showed that aluminum metallization opened at an oxide step on the chip, the opening accelerated by the neckdown of the metallization at the step. In classical terminology, this failure, caused by a manufacturing flaw, is a random failure (type II). These two failure types are shown in figure 1. Formerly, most of the design control efforts shown in the figure were aimed at the type I failure. Although such design controls are important, most equipment failures in the field bear no relation to the results of reasonable stress analyses during design. These failures are type II (i.e., those caused by built-in flaws).

1.2 New Direction

The new direction in reliability engineering will be toward a more realistic recognition of the causes and effects of failures. The new boundaries proposed for reliability engineering are to exclude management, applied mathematics, and double checking. These functions are important and may still be performed by reliability engineers. However, reliability engineering is to be a synthesizing function devoted to flaw control. The functions presented in figure 2 relate to the following tasks:

(1) Search the engineering design, manufacturing, and support planning processes for flaws.
(2) Engage the material technologists to determine the flaw failure mechanisms.
(3) Develop flaw control techniques and send information back to the engineers responsible for design, manufacture, and support planning.
[Figure 1: failure mechanism examples - electromigration, cathode depletion, bearing wear, oxide pinhole breakdown, electromigration around a flaw.]

[Figure 2.--Role of reliability engineering for the 1990's: design, manufacturing, and support planning; flaw (failure) mechanisms; material technology; completed systems/equipment; maintenance plan and test equipment.]
The training topics expected from reliability engineering are those provided by the following program.

1.3 Training Materials as of June 1992

This tutorial covers only specific areas of the reliability training program. The following list of contents shows the areas covered by the NASA training publication (1253), available from the National Technical Information Service, Springfield, Virginia (telephone 487-4650). An evaluation form is included in the appendix.

Introduction to Reliability
  Era of Semiconductors
  Period of Awakening
  New Direction
  Concluding Remarks

Reliability Mathematics and Failure Physics
  Mathematics Review: Notation; Manipulation of Exponential Functions; Rounding Data; Integration Formulas; Differential Formulas; Partial Derivatives; Expansion of (a + b)^n
  Failure Physics
  Probability Theory Fundamentals
  Probability Theorems
  Concept of Reliability: Reliability as Probability of Success; Reliability as Absence of Failure
  Product Application
  K-Factors
  Concluding Remarks

Exponential Distribution and Reliability
  Exponential Distribution
  Failure Rate Definition
  Failure Rate Dimensions
  "Bathtub" Curve
  Mean Time Between Failures
  Reliability Models
  Calculations of Pc for Single Devices
  Calculation of Reliability for Series-Connected Devices
  Calculation of Reliability for Devices Connected in Parallel (Redundancy)
  Calculation of Reliability for a Complete System

Introduction to Software Reliability
  Categories of Software Defects
  Variables Affecting Software Reliability
  Processing Environments
  Severity of Software Defects
  Software Failure Rates
  Software Bugs Compared With Hardware Failures

Predicting Reliability
  Use of Failure Rates in Tradeoffs
  Nonoperating Applications
  Equipment Standardization
  Allocation of Failures
  Part Derating

Importance of Learning From Each Failure
  Failure Reporting, Analysis, Corrective Action, Concurrence
  Case Study--Achieving Launch Vehicle Reliability
  Challenge to Achieving Reliability Goals
  Launch and Flight Reliability
  Field Failure Problem
  Mechanical Tests
  Runup and Rundown Tests
  Summary of Case Study
  Concluding Remarks

Reliability Program Establishment
  Goals and Objectives
  Reliability Performance Specification
  Field Support and Repair Activities
  Management Requirements

Probability Density Functions
  Nonsymmetrical Density Functions
  Application of the Normal Distribution
  Two-Limit Distribution Tests
  Reliability Predictions
  Effects of Tolerance on a Product
  Notes on Tolerance Effects
  Tolerance Accumulation: A How-To-Do-It Guide
  Concluding Remarks

Examples
Answers
2.0 MATHEMATICS AND FAILURE PHYSICS

2.1 Axiomatic Methods and Models
When most engineers think of reliability, they think of parts, since parts are the building blocks of products. All agree that a reliable product must have reliable parts. But what makes a part reliable? When asked, nearly all engineers would say a reliable part is one purchased according to a certain source control document and bought from an approved vendor. Unfortunately, these two qualifications are not always guarantees of reliability. The following case illustrates this problem.
A clock purchased according to PD 4600008 was procured from an approved vendor for use in the ground support equipment of a missile system and was subjected to qualification tests as part of the reliability program. These tests consisted of high- and low-temperature, mechanical shock, temperature shock, vibration, and humidity tests. The clocks from the then sole-source vendor failed two of the tests: low temperature and humidity. A failure analysis revealed that lubricants in the clock's mechanism froze and that the seals were not adequate to protect the mechanism from humidity. A second approved vendor was selected. His clocks failed the high-temperature test. In the process the dial hands and numerals turned black, making readings impossible from a distance of 2 feet. A third approved vendor's clocks passed all of the tests except mechanical shock, which cracked two of the cases. Ironically, the fourth approved vendor's clocks, though less expensive, passed all the tests. The point of this illustration is that four clocks, each designed to the same specification and procured from a qualified vendor, all performed differently in the same environments. Why did this happen? The specification did not include the gear lubricant, the type of coating on the hands and numerals, or the type of case material. Many similar examples could be cited, ranging from requirements for glue and paint to complete assemblies and systems, and the key to answering these problems can best be stated as follows: To know how reliable a product is or how to design a reliable product, you must know how many ways its parts can fail and the types and magnitude of stresses that cause such failures.
Think about this: if you knew every conceivable way a missile could fail, and if you knew the type and level of stress required to produce each failure, you could build a missile that would never fail, because you could eliminate (1) as many ways of failure as possible, (2) as many stresses as possible, and (3) the remaining potential failures by controlling the level of the remaining stresses. Sound simple? Well, it would be except that, despite the thousands of failures observed in industry each day, we still know very little about why things fail and even less about how to control these failures. However, through systematic data accumulation and study, we learn more each day. As stated at the outset, this tutorial introduces some basic concepts of failure physics: failure modes (how failures are revealed); failure mechanisms (what produces the failure mode); and failure stresses (what activates the failure mechanisms). The theory and the practical tools available for controlling failures are presented also.
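The mode/mechanism/stress vocabulary can be captured as a simple record; a sketch follows, populated with the wet tantalum capacitor storage failure discussed later in this tutorial. The class name and field names are my own.

```python
# Sketch of the failure-physics vocabulary as a data structure.
from dataclasses import dataclass

@dataclass
class FailureRecord:
    mode: str        # how the failure is revealed
    mechanism: str   # what produces the failure mode
    stress: str      # what activates the failure mechanism

capacitor_open = FailureRecord(
    mode="open circuit in a wet tantalum capacitor",
    mechanism="end-seal deterioration allowing electrolyte leakage",
    stress="long-term unmaintained storage",
)
print(capacitor_open.mode)
```

Recording failures this way makes the control options explicit: eliminate the mode (do not use the part), redesign against the mechanism (better seals), or limit the stress (storage conditions).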
2.2 Concept of Reliability

Although the classical definition of reliability is adequate for most purposes, we are going to modify it somewhat and examine reliability from a slightly different viewpoint. Consider this definition: Reliability is the probability that the critical failure modes of a device will not occur during a specified period of time and under specified conditions when used in the manner and for the purpose intended. Essentially, this modification replaces the words "a device will operate successfully" with the words "critical failure modes . . . will not occur." This means that if all the possible failure modes of a device (ways the device can fail) and their probabilities of occurrence are known, the probability of success (or the reliability of a device) can be stated. It can be stated in terms of the probability that those failure modes critical to the performance of the device will not occur. Just as we needed a clear definition of success when using the classical definition, we must also have a clear definition of failure when using the modified definition. For example, assume that a resistor has only two failure modes: it can open or it can short. If the probability that the resistor will not short is 0.99 and the probability that it will not open is 0.9, the reliability of the resistor (or the probability that the resistor will not short or open) is given by
Rresistor = (Probability of no shorts)(Probability of no opens) = (0.99)(0.9) = 0.891
Note that we have multiplied the probabilities. Probability theorem 2 therefore requires that the open-failure-mode probability and the short-failure-mode probability be independent of each other. This condition is satisfied because an open-failure mode cannot occur simultaneously with a short mode.
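The arithmetic of the resistor example can be checked directly; a minimal sketch:

```python
# Reliability of the resistor example: the joint probability that
# neither critical failure mode (short, open) occurs. The two mode
# probabilities are treated as independent, per probability theorem 2.
p_no_short = 0.99
p_no_open = 0.90

r_resistor = p_no_short * p_no_open
print(round(r_resistor, 3))  # 0.891
```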
2.3 Product Application

This section relates reliability (or the probability of success) to product failures.

2.3.1 Product failure modes.--In general, critical equipment failures may be classified as catastrophic part failures, tolerance failures, and wearout failures. The expression for reliability then becomes

R = PcPtPw

where

Pc  probability that catastrophic part failures will not occur
Pt  probability that tolerance failures will not occur
Pw  probability that wearout failures will not occur

As in the resistor example, these probabilities are multiplied together because they are considered to be independent of each other. However, this may not always be true because an out-of-tolerance failure, for example, may evolve into or result from a catastrophic part failure. Nevertheless, in this tutorial they are considered independent, and exceptions are pointed out as required.

2.3.2 Inherent product reliability.--Consider the inherent reliability Ri of a product. Think of the expression Ri = PcPtPw as representing the potential reliability of a product as described by its documentation, or let it represent the reliability inherent in the design drawings instead of the reliability of the manufactured hardware. This inherent reliability is predicated on the decisions and actions of many people. If they change, the inherent reliability could change. Why do we consider inherent reliability? Because the facts of failure are these: When a design comes off the drawing board, the parts and materials have been selected; the tolerance, error, stress, and other performance analyses have been performed; the type of packaging is firm; the manufacturing processes and fabrication techniques have been decided; and usually the test methods and the quality acceptance criteria have been selected. At this point the design documentation represents some potential reliability that can never be increased except by a design change or good maintenance. However, the possibility exists that the actual reliability observed when the documentation is transformed into hardware will be much less than the potential reliability of the design. To understand why this is true, consider the hardware to be a black box with a hole in both the top and bottom. Inside are potential failures that limit the inherent reliability of the design. When the hardware is operated, these potential failures fall out the bottom (i.e., operating failures are observed). The rate at which the failures fall out depends on how the box or hardware is operated. Unfortunately, we never have just the inherent failures to worry about because other types of failures are being added to the box through the hole in the top. These other failures are generated by the manufacturing, quality, and logistics functions, by the user or customer, and even by the reliability organization itself. We discuss these added failures and their contributors in the following paragraphs, but it is important to understand that, because of the added failures, the observed failures will be greater than the inherent failures of the design.

2.4 K-Factors

The other contributors to product failure just mentioned are called K-factors; they have a value between 0 and 1 and modify the inherent reliability:

Rproduct = Ri(KqKmKrKlKu)

where

Ri  inherent reliability
Kq  quality test methods and acceptance criteria
Km  manufacturing and fabrication and assembly techniques
Kr  reliability engineering activities
Kl  logistics activities
Ku  the user or customer

Any K-factor can cause reliability to go to zero. If each K-factor equals 1 (the goal), Rproduct = Ri.
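A minimal sketch of this model follows; the numeric K values and probabilities are illustrative, not from the text.

```python
# Observed product reliability as the inherent reliability Ri = PcPtPw
# degraded by the K-factors; each K lies in [0, 1], so any K at zero
# drives the product reliability to zero. All values are illustrative.
def product_reliability(pc, pt, pw, k_factors):
    r = pc * pt * pw                  # inherent reliability Ri
    for k in k_factors.values():
        r *= k                        # each K-factor only degrades R
    return r

ks = {"Kq": 0.98, "Km": 0.97, "Kr": 1.00, "Kl": 0.99, "Ku": 0.95}
print(product_reliability(0.999, 0.99, 0.98, ks))
```

With every K at 1 (the goal), the observed reliability equals the inherent reliability; a single K at 0 makes the product reliability 0 regardless of how good the design is.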
2.5 Variables Affecting Failure Rates
Part failure rates are affected by (1) acceptance criteria, (2) all environments, (3) application, and (4) storage. To reduce the occurrence of part failures, we observe failure modes, learn what caused the failure (the failure stress), determine why it failed (the failure mechanism), and then take action to eliminate the failure. For example, one of the failure modes observed during a storage test was an "open" in a wet tantalum capacitor. The failure mechanism was end-seal deterioration, allowing the electrolyte to leak. One obvious way to avoid this failure mode in a system that must be stored for long periods without maintenance is not to use wet tantalum capacitors. If this is impossible, the best solution would be to redesign the end seals. Further testing would be required to isolate the exact failure stress that produces the failure mechanism. Once isolated, the failure mechanism can often be eliminated through redesign or additional process controls.
2.6 Use of Failure Rates in Tradeoffs
Failure rate tables and derating curves are useful to a designer because they enable him to make reliability tradeoffs and provide a practical method of establishing derating requirements. For example, suppose we have two design concepts for performing some function. If the failure rate of concept A is 10 times higher than that of concept B, one can expect concept B to fail one-tenth as often as concept A. If it is desirable to use concept A for other reasons, such as cost, size, performance, or weight, the derating failure rate curves can be used to improve concept A's failure rate (e.g., select components with a lower failure rate, derate the components more, or both). An even better approach is to find ways to reduce the complexity, and thus the failure rate, of concept A.

Figure 3 illustrates the use of failure rate data in tradeoffs. This figure gives a failure-rate-versus-temperature curve for the electronics of a complex (over 35 000 parts) piece of ground support equipment. The curve was developed as follows: (1) A failure rate prediction was performed by using component failure rates and their application factors KA for an operating temperature of 25 C. The resulting failure rate was chosen as a reference point. (2) Predictions were then made by using the same method for temperatures of 50, 75, and 100 C. The ratios of these predictions to the reference point were plotted versus component operating temperature, giving the resulting curve for the equipment. This curve was then used to provide tradeoff criteria for using air-conditioning versus blowers to cool the equipment.

To illustrate, suppose the maximum operating temperatures expected are 50 C with air-conditioning and 75 C with blowers. Suppose further that the required failure rate for the equipment, if it is to meet its reliability goal, is one failure per 50 hr. A failure rate prediction at 25 C might indicate a failure rate of 1 per 100 hr. From the figure, note that the maximum allowable operating temperature is therefore 60 C, since the maximum allowable failure rate ratio is 2; that is, at 60 C the equipment failure rate will be (1/100)(2) = 1/50, which is the required failure rate. If blowers are used for cooling, the equipment must operate at temperatures as high as 75 C; if air-conditioning is used, the temperature need not exceed 50 C. Therefore, air-conditioning must be used if we are to meet the reliability requirement.

Other factors must be examined before we make a final decision. Whatever type of cooling equipment is selected, the total system reliability now becomes

RT = ReRc

Therefore, the effect of the cooling equipment's reliability on the system must be calculated. A more important consideration is the effect on system reliability should the cooling equipment fail. Because temperature control appears to be critical, loss of it may have serious system consequences. Therefore, it is too soon to rule out blowers entirely. A failure mode, effects, and criticality analysis (FMECA) must be made on both cooling methods to examine all possible failure modes and their effects on the system. Only then will we have sufficient information to make a sound decision.
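The air-conditioning versus blower tradeoff above can be worked numerically. The curve points below are illustrative stand-ins for a figure-3-style ratio curve, not values taken from the report:

```python
# Failure-rate-ratio curve: ratio of the predicted failure rate at temperature T
# to the 25 C reference prediction. These sample points are illustrative only.
curve = [(25.0, 1.0), (50.0, 1.6), (60.0, 2.0), (75.0, 3.2), (100.0, 6.0)]

def ratio_at(temp_c: float) -> float:
    """Linearly interpolate the failure-rate ratio at temp_c."""
    for (t0, r0), (t1, r1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return r0 + (r1 - r0) * (temp_c - t0) / (t1 - t0)
    raise ValueError("temperature outside curve range")

reference_rate = 1 / 100   # predicted failures per hour at 25 C
required_rate = 1 / 50     # failure rate needed to meet the reliability goal

# Maximum allowable ratio is 2.0, reached at 60 C on this illustrative curve:
max_ratio = required_rate / reference_rate
print(max_ratio)           # 2.0
print(ratio_at(60.0))      # 2.0 -> 60 C is the allowable operating limit
# Blowers (75 C) exceed the limit; air-conditioning (50 C) does not:
print(ratio_at(75.0) > max_ratio, ratio_at(50.0) <= max_ratio)  # True True
```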
2.7
When a product fails, a valuable piece of information about it has been generated because we have the opportunity to learn how to improve the product if we take the right actions. Failures can be classified as
(1) Catastrophic (a shorted transistor or an open wire-wound resistor)
(2) Degradation (a change in transistor gain or in the resistor value)
(3) Wearout (brush wear in an electric motor)
These three failure categories can be subclassified further:
(1) Independent (a shorted capacitor in a radiofrequency amplifier being unrelated to a low-emission cathode in a picture tube)
(2) Cascade (the shorted capacitor in the radiofrequency amplifier causing excessive current to flow in its transistor and burning the collector beam lead open)
(3) Common mode (uncured resin being present in motors)

[Figure 3. Equipment failure rate ratio versus component temperature, 20 to 100 C.]
Much can be learned from each failure by using these categories, good failure reporting and analysis, and a concurrence system, and by taking corrective action. Failure analysis determines what caused the part to fail. Corrective action ensures that the cause is dealt with. Concurrence informs management of actions being taken to avoid another failure. These data enable all personnel to compare the part ratings with the use stresses and verify the margin. (The exceptions to this are not dealt with at this time.)

Because tolerances must be expected in all manufacturing, important questions to ask about testing are: (1) How is the reliability affected? (2) How can tolerances be analyzed and what methods are available? (3) What is Pt in the product reliability model? Performance parameters are often affected by part tolerances: outputs shift up or down, transfer functions change, and components may be so loose that excessive vibration results.

3.0 TESTING FOR RELIABILITY

3.1 Test Objectives

It can be inferred that 1000 test samples are required to demonstrate a reliability requirement of 0.999. Because of cost and more, this is impractical. Furthermore, the samples tested often may not even approach the total production of a product (called the product population); we must demonstrate reliability on a few samples. Thus, the main objective of a reliability test is to test an available device so that the data will allow a statistical conclusion to be reached about similar devices that will not or cannot be tested. That is, the test results measure the reliability of the population and are used for predicting the reliability of similar items that will not be tested and that often have not yet been manufactured. In subsequent sections, we discuss confidence levels, attribute methods, and life test methods, show how the data will be analyzed, and introduce the subject and use of confidence in testing.

3.2 Confidence Levels

Mr. Igor Bazovsky, in his book Reliability Theory and Practice, discusses "confidence":

We know that statistical estimates are more likely to be close to the true value as the sample size increases. Thus, there is a close correlation between the accuracy of an estimate and the size of the sample from which it was obtained. Only an infinitely large sample size could give us 100 percent confidence, or certainty, that a measured statistical parameter coincides with the true value. In this context, confidence is a mathematical probability relating the mutual positions of the true value of a parameter and its estimate. When the estimate of a parameter is obtained from a reasonably sized sample, we may logically assume that the true value of that parameter will be somewhere in the neighborhood of the estimate, to the right or to the left. Therefore, it would be more meaningful to express statistical estimates in terms of a range or interval, with an associated probability or confidence that the true value lies within such an interval, than to express them as point estimates. This is exactly what we are doing when we assign confidence limits to point estimates obtained from statistical measurements.

In other words, rather than as point estimates, it would be more meaningful to express statistical estimates as a range (or interval), with an associated probability (or confidence) that the true value lies within the interval. Confidence is a statistical term; it reflects and depends on the amount of supporting data.

3.3 Attribute Test Methods
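The 1000-sample figure cited above can be checked with the standard success-run (zero-failure) relation, n = ln(1 - C)/ln(R). The formula is a common industry rule rather than one stated in this report, so take this as an illustrative sketch:

```python
import math

def success_run_samples(reliability: float, confidence: float) -> int:
    """Number of zero-failure test samples needed to demonstrate
    `reliability` at the given `confidence` level (success-run formula)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

# Demonstrating R = 0.999 at about 63 percent confidence takes roughly
# 1000 samples, consistent with the figure cited in the text:
print(success_run_samples(0.999, 0.632))  # 1000
# A looser requirement needs far fewer samples:
print(success_run_samples(0.9, 0.9))      # 22
```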
Qualification, preflight certification, and design verification tests are categorized as attribute tests (refs. 5 and 6). They demonstrate that a product is good or bad without showing how good or how bad. To know how reliable a product is, one must know the magnitudes involved. In many of these tests, one or two samples are subjected to stress, usually the maximum anticipated operational limit. If both samples pass, the device is qualified, preflight certified, or design verified; the true objective, however, is to have the device's reliability demonstrated.
In summary, an attribute test is not a satisfactory method of testing for reliability because it can only identify gross design and manufacturing problems; it is an adequate method of testing for reliability only when sufficient samples are tested to establish an acceptable level of statistical confidence.

3.4 Test-To-Failure Methods

The purpose of the test-to-failure method is to develop a failure distribution for a product under one or more types of stress. The results are used to calculate the demonstrated reliability of the device for each stress. In this case the demonstrated population reliability will usually be the Pt or Pw product reliability term.

In this discussion of test-to-failure methods, the term "safety factor" SF is included because it is often confused with safety margin SM. Safety factor is widely used in industry to describe the assurance against failure that is built into structural products. Of the many definitions of safety factor, the most commonly used is the ratio of mean strength to reliability boundary:

SF = (mean strength)/Rb

When we deal with materials with clearly defined, repeatable, and "tight" strength distributions, such as sheet and structural steel or aluminum, using SF presents little risk. However, when we deal with plastics, fiberglass, and other metal substitutes or processes with wide variations in strength or repeatability, using SM provides a clearer picture of what is happening (fig. 4). In most cases, we must know the safety margin to understand how accurate the safety factor may be.

[Figure 4. Strength distributions p(x) for (a) structure A (9.68 percent defective) and (b) structure B (13.13 percent defective), showing safety factor SF, safety margin SM = 4.0, and reliability boundary Rb.]

In summary, test-to-failure methods can be used to develop a strength distribution that provides a good estimate of the Pt and Pw product reliability terms without the need for the large samples required for attribute tests; the results of a test-to-failure exposure of a device can be used to predict the reliability of similar devices that cannot or will not be tested; testing to failure provides a means of evaluating the failure modes and mechanisms of devices so that improvements can be made; confidence levels can be applied to the safety margins and to the resulting population reliability estimates; and the accuracy of a safety factor can be known only if the associated safety margin is known.

3.5 Life Test Methods

Life tests are conducted to illustrate how the failure rate of a typical system or complex subsystem varies during its operating life. Such data provide valuable guidelines for controlling product reliability. They help to establish burn-in requirements, to predict spare part requirements, and to understand the need for or lack of need for a system overhaul program. Such data are obtained through laboratory life tests or from the normal operation of a fielded system.

In summary, life tests are performed to evaluate product failure rate characteristics; if failures include all causes of system failure, the failure rate of the system is the only true factor available for evaluating the system's performance; life tests at the part level require large sample sizes if realistic failure rate characteristics are to be identified; laboratory life tests must simulate the major factors that influence failure rates in a device during field operations; the use of running averages in the analysis of life data will identify burn-in and wearout regions if such exist; and failure rates are statistics and therefore are subject to confidence levels when used in making predictions.

Figure 5 illustrates what might be called a failure surface for a typical product. It shows system failure rate versus operating time and environmental stress, three parameters that describe a surface such that, given an environmental stress and an operating time, the failure rate is a point on the surface. Test-to-failure methods generate lines on the surface parallel to the stress axis; life tests generate lines on the surface parallel to the time axis. Therefore, these tests provide a good description of the failure surface and, consequently, of the reliability of a product. Attribute tests result only in a point on the surface if failures occur and a point somewhere within the volume if failures do not occur. For this reason, attribute testing is the least desirable method for ascertaining reliability.
Of course, in the case of missile flights or other events that produce go/no-go results, an attribute analysis is the only way to determine product reliability.
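The distinction between safety factor and safety margin can be shown numerically. Here SM is taken in its common form, the number of strength standard deviations between the mean strength and Rb; that definition and all the numbers below are illustrative assumptions, not values from the report:

```python
def safety_factor(mean_strength: float, rb: float) -> float:
    """SF: ratio of mean strength to the reliability boundary Rb."""
    return mean_strength / rb

def safety_margin(mean_strength: float, std_strength: float, rb: float) -> float:
    """SM: strength standard deviations between the mean and Rb
    (a common definition, assumed here; the text defines only SF)."""
    return (mean_strength - rb) / std_strength

# Two structures with the same safety factor but different strength scatter:
sf_a = safety_factor(mean_strength=2000.0, rb=1000.0)        # SF = 2.0
sf_b = safety_factor(mean_strength=2000.0, rb=1000.0)        # SF = 2.0
sm_a = safety_margin(2000.0, std_strength=250.0, rb=1000.0)  # SM = 4.0
sm_b = safety_margin(2000.0, std_strength=500.0, rb=1000.0)  # SM = 2.0

# Identical safety factors can hide very different failure risks:
print(sf_a, sf_b)  # 2.0 2.0
print(sm_a, sm_b)  # 4.0 2.0
```

This is the point of figure 4: without the safety margin, the safety factor alone cannot say how much of the strength distribution lies below the reliability boundary.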
4.0 SOFTWARE
For the purposes of this tutorial, quality is closely related to the process, and reliability is closely related to the product. Thus, both span the life cycle. Before we stratify software reliability, the progress of hardware reliability will be reviewed. Over the past 25 years, the industry observed (1) the initial assignment of "wizard status" to hardware reliability modeling, theory, and analysis, (2) the growth of the field, and (3) the final establishment of hardware reliability as a science. One of the major problems was aligning reliability predictions and field performance. Once that was accomplished, the wizard status was removed from hardware reliability. The emphasis in hardware reliability from now to the year 2000 will be on system failure modes and effects.

Software reliability has not yet become classified as a science, for many reasons. The difficulty in assessing software reliability is analogous to the problem of assessing the reliability of a new hardware device with unknown reliability characteristics. The existence of 30 to 50 different software reliability models indicates the lack of organization in this area. Hardware reliability began at a few companies and later became the focus of the Advisory Group on Reliability of Electronic Equipment; the field then logically progressed through different models in sequence over the years. In contrast, numerous people and companies simultaneously entered the software reliability field in its major areas: cost, complexity, and reliability. The difference is that at least 100 times as many people are now studying software reliability as initially studied hardware reliability. The existence of so many models and their proponents tends to mask the fact that several of these models have shown excellent correlations between software performance predictions and actual software field performance: the Musa model as applied to communications systems and the Xerox model as applied to office copiers. There are also reasons for not accepting software reliability as a science, and they are discussed next. One impediment to the establishment of software reliability as a science is the tendency toward programming development philosophies such as (1) "do it right the first time" (a reliability model is not needed), or (2) "quality is a programmer's development tool," or
(3) "quality is the same as reliability and is measured by the number of defects in a program and not by its reliability." All of these philosophies tend to eliminate probabilistic measures because the managers consider a programmer to be a software factory whose quality output is controllable, adjustable, or both. In actuality, hardware design can be controlled for reliability characteristics better than software design can. Design philosophy experiments that failed to enhance hardware reliability are again being formulated for software design (ref. 9). Quality and reliability are not the same: quality is characteristic, and reliability is probabilistic. Our approach draws the line between quality and reliability because quality is concerned with the development process and reliability is concerned with the operating product. Many models have been developed, and a number of the measurement models show great promise. Predictive models have been far less successful, partly because a data base (such as MIL-HDBK-217E, ref. 10) is not yet available for software. Software reliability often has to use other methods; it must be concerned with the process of software product development.
"It is contrary to the definition of reliability to apply reliability analysis to a system that never really works. This means that the software which still has bugs in it really has never worked in the true sense of reliability in the hardware sense." Large, complex software programs used in the communications industry are usually operating with some software bugs. Thus, a reliability analysis of such software is different from a reliability analysis of established hardware. Software reliability is not alone in the need for establishing qualitative and quantitative models. In the early 1980's, work was done on a combined hardware/software reliability model. A theory for combining well-known hardware and software models in a Markov process was developed. A consideration was the topic of software bugs and errors based on experience in the telecommunications field. To synthesize the manifestations of software bugs, some of the following hardware trends for these systems should be noted: (1) hardware transient failures increase as integrated circuits become denser; (2) hardware transient failures tend to remain constant or increase slightly with time after the burn-in; and (3) hardware (integrated circuit) catastrophic failures decrease with time after the burn-in phase. These trends affect the operational software of communications systems. If the transient failures increase, the error analysis and system security software are called into action more often. This increases the risk of misprocessing a given transaction in the communications system. A decrease in the catastrophic failure rate of integrated circuits can be significant (ref. 12). An order-of-magnitude decrease in the failure rate of 4K memory devices between the first year and the twentieth year is predicted. We also tend to oversimplify the actual situations.
Even with five vendors of these 4K devices, the manufacturing quality control person may have to set up different screens to eliminate the defective devices from different vendors. Thus, the system software will see many different transient memory problems, and combinations of them, in operation. Central control technology has prevailed in communications systems for 25 years. The industry has used many of its old modeling tools and applied them directly to distributed control structures. Most modeling research was performed on large duplex processors. With an evolution through forms of multiple duplex processors and load-sharing processors and on to the present forms of distributed processing architectures, the modeling tools need to be verified. With fully distributed control systems, the software reliability model must be conceptually matched to the software design in order to achieve valid predictions of reliability. The following trends can be formulated for software transient failures: (1) software transient failures decrease
4.1 Hardware and Software Failures
Microprocessor-based products have more refined failure definitions. Four types of failure may be considered: (1) hardware catastrophic, (2) hardware transient, (3) software catastrophic, and (4) software transient. In general, the catastrophic failure categories require a physical or remote hardware replacement or a manual or remote software program patch. The transient failure categories result in either restarts or reloads for the microprocessor-based systems, subsystems, or individual units and may or may not require further correction. A recent reliability analysis of such a system assigned ratios for these categories. Hardware transient faults were assumed to occur at 10 times the hardware catastrophic rate, and software transient faults were assumed to occur at 100 to 500 times the software catastrophic rate. The time of day is of great concern in reliability modeling and analysis. Although hardware catastrophic failures occur at any time of the day, they often manifest themselves during busier system processing times. On the other hand, hardware and software transient failures generally occur during the busy hours. When a system's predicted reliability is close to the specified reliability, a sensitivity analysis must be performed.
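The assumed transient-to-catastrophic ratios can be explored with a short sensitivity sketch. All rates below are hypothetical, chosen only to show how the 100x-to-500x software spread dominates the total failure-event rate:

```python
# Assumed catastrophic failure rates (failure events per million hours);
# these numbers are hypothetical, not taken from the report.
hw_catastrophic = 10.0
sw_catastrophic = 2.0

# Ratios quoted in the text: hardware transients at 10 times the hardware
# catastrophic rate; software transients at 100 to 500 times the software
# catastrophic rate.
hw_transient = 10 * hw_catastrophic
sw_transient_low = 100 * sw_catastrophic
sw_transient_high = 500 * sw_catastrophic

total_low = hw_catastrophic + hw_transient + sw_catastrophic + sw_transient_low
total_high = hw_catastrophic + hw_transient + sw_catastrophic + sw_transient_high
print(total_low, total_high)  # 312.0 1112.0

# The assumed 100x-to-500x software ratio alone swings the total event rate
# by more than a factor of 3, which is why a sensitivity analysis is needed
# when the predicted reliability is close to the specification.
```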
4.2 Modeling of Software Reliability

Many models have been proposed for quantifying software reliability. In a fully distributed architecture, a model that captures the effects of transient faults could be a valuable tool for making design decisions. A fully distributed control structure can be configured to operate as its own error filter (i.e., with error entry checking and error removal checks): in a hierarchy of processing levels, each level checks the one below and prevents errors or transient faults from propagating through the system. Central control structures do not provide this filtering. For modeling, one can choose a decreasing, constant, or increasing bug removal rate; a decreasing rate is generally encountered. Table I, an example of a five-level criticality index for defects, shows bug manifestation and defect removal rates and illustrates the flexibility of such a classification.

TABLE I.--BUG MANIFESTATION AND DEFECT REMOVAL RATES

Failure effect            Bug manifestation rate    Defect removal rate
Errors come and go        4 per day                 1 per month
Errors are repeated       3 per day                 1 per week
Service is affected       2 per week                1 per month
System is partially down  1 per month               2 per year
System stops              1 per two years           1 per year

Another concern is that the faster a software program is run, the more likely it is to cause errors (such as encountered in central control architectures); this "missing link" between the performance of large hardware/software systems and the occurrence of software bugs needs further discussion. Several benchmark methods can be used to model specific causes of software unreliability.

4.3 Quality and Performance

The concept of quality, and the need for "doing it right," must be considered before we go on to specific software quality items. The first perspective on quality defects is that, by perfecting the design, we would achieve as few defects as possible. A second perspective is that, by quantifying bug manifestations, we can obtain an accurate estimate of the measurable software parameters and implement defect tradeoffs on a probability and criticality basis. The key to achieving this is quality management.
Quality appears to have a third major factor in addition to product and process: the environment. People are important; they make the process or the product successful. The next step is to discuss what the process of achieving quality in software consists of and how quality management is involved. The purpose of quality management for programming products is to ensure that a preselected software quality level has been achieved on schedule and in a cost-effective manner. In developing a quality management system, the programming product's critical life-cycle-phase reviews provide the reference base for tracking the achievement of quality objectives. The International Electrotechnical Commission (IEC) system life-cycle phases presented in their guidelines for reliability and maintainability management are (1) concept and definition, (2) design and development, (3) manufacturing, installation, and acceptance, (4) operation and maintenance, and (5) disposal. In general, a phase-cost study shows the increasing cost of correcting programming defects in later phases of a programming product's life. Also, the higher the level of software quality, the more life-cycle costs are reduced.
4.4 Software Quality

The next step is to look at specific software quality items. Software quality is defined as "the achievement of a preselected software quality level within the costs, schedule, and productivity boundaries established by management" (ref. 10). However, agreement on such a definition is often difficult to achieve because software metrics vary more than those for hardware, software reliability management has focused on the product, and software quality management has focused on the process. In practice, the quality emphasis can change with respect to the specific product application environment. Different perspectives of software product quality have been presented over the years. However, in today's literature there is general agreement that the proper quality level for a particular software product should be determined in the concept and definition phase and that quality managers should monitor the project during the remaining life-cycle phases to ensure the proper quality level. The developer of a methodology for assessing the quality of a software product must respond to the specific characteristics of the product. There can be no single quality metric. The process of assessing the quality of a software product begins with the selection of specific characteristics, quality metrics, and performance criteria. With respect to software quality, several areas of interest are (1) characteristics, (2) metrics, (3) overall metrics, and (4) standards. Areas (1) and (2) are applicable during both the design and development phase and the operation and maintenance phase. In general, area (2) is used during the design and development phase before the acceptance phase for a given software product. The following discussion will concern area (2).

4.5 Software Quality Metrics

The entire area of software measurements and metrics has been widely discussed and is the subject of many publications. Notable is the guide for software reliability measurement developed by the Institute of Electrical and Electronics Engineers (IEEE) Computer Society's working group on metrics. A basis for software quality standardization was also issued by the IEEE. Software metrics cannot be developed before the cause and effect of a defect have been established for a given product with relation to its product life cycle. A typical cause-and-effect chart for a software product includes the process indicator. At the testing stage of product development, the evolution of software quality levels can be assessed by characteristics such as freedom from error, successful test case completion, and the estimate of the software bugs remaining. For example, these process indicators can be used to predict slippage of the product delivery date and the inability to meet original design goals. When the programming product enters the qualification, installation, and acceptance phase and continues into the maintenance and enhancements phase, the concept of performance is important in the quality characteristic activity. This concept is shown in table II, where the 5 IEC system life-cycle phases have been expanded to 10 software life-cycle phases.
4.6 Concluding Remarks

This section presented a snapshot of software quality assurance today. Continuing research is concerned with the use of overall software quality metrics and better software prediction tools for determining the defect population. In addition, simulators and code generators are being further developed so that high-quality software can be produced. Process indicators are closely related to software quality, and some include them as a stage in software development. In general, such measures as (1) test cases completed versus test cases planned and (2) the number of lines of code developed versus the number expected give an indication of the overall company or corporate progress toward a quality software product. Too often,
TABLE II.--MEASUREMENTS AND PROGRAMMING PRODUCT LIFE CYCLE

[The 5 International Electrotechnical Commission (IEC) life-cycle phases have been expanded to 10 software phases.]

System life-cycle phase                        Software life-cycle phase
Concept and definition                         Conceptual planning (1)
                                               Requirements definition (2)
Design and development                         Product definition (3)
                                               Top-level design (4)
                                               Detailed design (5)
                                               Implementation (6)
Manufacturing, installation, and acceptance    Testing and integration (7)
                                               Qualification, installation, and acceptance (8)
Operation and maintenance                      Maintenance and enhancements (9)
Disposal                                       Disposal (10)

The measurements applied across these phases are quality metrics (a), process indicators (b), and performance measures (c).

(a) Metrics: qualitative assessment, quantitative prediction, or both.
(b) Indicators: month-by-month tracking of key project parameters.
(c) Measures: quantitative performance assessment.
personnel are moved from one project to another, and thus the lagging projects improve but the leading projects decline in their process indicators. The life cycle for programming products should not be disrupted. Performance measures, including such criteria as the percentage of proper transactions, the number of system restarts, the number of system reloads, and the percentage of uptime, should reflect the user's viewpoint. In general, the determination of applicable quality measures for a given software product development is viewed as a specific task of the software quality assurance function. The determination of the process indicators and performance measures is a task of the software quality standards function.
5.0 RELIABILITY MANAGEMENT

To design for successful reliability and continue to provide customers with a reliable product, the following steps are necessary:
(1) Determine the reliability goals to be met.
(2) Construct a symbolic representation.
(3) Determine the logistics support and repair philosophy.
(4) Select the reliability analysis procedure.
(5) Select the data sources for failure rates and repair rates.
(6) Determine the failure rates and the repair rates.
(7) Perform the necessary calculations.
(8) Validate and verify the reliability.
(9) Measure reliability until customer shipment.

5.1 Goals and Objectives
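Steps (2), (6), and (7) of the process above can be sketched for the simplest symbolic representation, a series model with constant failure rates. The unit names and rates below are illustrative assumptions, not values from the report:

```python
import math

# Step (2): symbolic representation -- a series system of three units.
# Step (6): failure rates (failures per hour), assumed here for illustration.
failure_rates = {"power_supply": 20e-6, "processor": 5e-6, "io_unit": 12e-6}

def series_reliability(rates: dict[str, float], mission_hours: float) -> float:
    """Step (7): perform the calculations. For constant failure rates, unit
    reliability over mission time t is exp(-lambda * t), and the series
    system reliability is the product of the unit reliabilities, which
    equals exp(-t * sum of the rates)."""
    total_rate = sum(rates.values())
    return math.exp(-total_rate * mission_hours)

# Reliability over a 1000-hr mission with the assumed rates:
print(round(series_reliability(failure_rates, mission_hours=1000.0), 4))  # 0.9637
```

The same skeleton extends to redundant (parallel) branches and to the repair rates of steps (5) and (6) when availability rather than mission reliability is the goal.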
Goals must be placed into the proper perspective. Because goals are often examined by using models that the producer develops, one of the weakest links in the reliability process is the modeling. Dr. John D. Spragins, an editor for the IEEE Transactions on Computers, corroborates this fact with the following statement (ref. 13):

"Some standard definitions of reliability or availability, such as those based on the probability that all components of a system are operational at a given time, can be dismissed as irrelevant when studying large telecommunication networks. Many telecommunication networks are so large that the probability they are operational according to this criterion may be very nearly zero; at least one item of equipment may be down essentially all of the time. The typical user, however, does not see this unless he or she happens to be the unlucky person whose equipment is down; the system may still operate perfectly from this user's point of view. A more meaningful criterion is one based on the reliability seen by typical users. The reliability apparent to system operators is another valid, but distinct, criterion. (Since system operators commonly consider systems down only after failures have been reported to them, and may not hear of short self-clearing outages, their reliability estimates are often higher than the values seen by users.)"

5.3 Human Reliability

The major objectives of reliability management are to ensure that a selected reliability level for a product can be achieved on schedule in a cost-effective manner and that the customer perceives the selected reliability level. The current emphasis in reliability management is on meeting or exceeding customer expectations. We can view this as a challenge, but it should be viewed as the bridge between the user and the producer or provider. This bridge is actually "human reliability." In the past, the producer was concerned with the process and the product and found reliability measurements that addressed both. Often there was no correlation between field data, the customer's perception of reliability, and the producer's reliability metrics. Surveys then began to indicate that the customer distinguished between reliability performance, response to order placement, technical support, service quality, etc. Human reliability is defined (ref. 17) as "...the probability of accomplishing a job or task successfully by humans at any required stage in system operations within a specified minimum time limit (if the time requirement is specified)." Although customers generally are not yet requiring human reliability models in addition to the requested hardware and software reliability models, the science of human reliability is well established.
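The definition of human reliability above suggests a simple two-factor sketch: treat task success as error-free performance multiplied by the probability of finishing within the specified time limit. Every number below, and the lognormal completion-time assumption, is illustrative only; none comes from the references.

```python
import math

# Illustrative human-reliability estimate per the two-part definition:
# success = error-free performance AND completion within the time limit.
p_error_free = 0.98    # probability the task is performed without error (assumed)
mu, sigma = 2.0, 0.4   # lognormal parameters of completion time, minutes (assumed)
t_limit = 12.0         # specified time limit, minutes (assumed)

# P(completion time <= t_limit) for a lognormal(mu, sigma) distribution,
# evaluated through the standard normal CDF via math.erf.
z = (math.log(t_limit) - mu) / sigma
p_on_time = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

human_reliability = p_error_free * p_on_time
print(f"P(on time) = {p_on_time:.3f}")
print(f"human reliability = {human_reliability:.3f}")
```

With these assumed values the time requirement, not error-free performance, dominates the result; tightening `t_limit` lowers the estimate quickly.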
Reliability objectives can be defined differently for various systems. An example from the telecommunications industry (ref. 14) is presented in table III.
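The contrast between reliability criteria that Spragins describes above can be made concrete with a small calculation. The element count, per-element availability, and path length below are assumed for illustration, not drawn from any reference.

```python
# Two availability criteria for a large network (illustrative numbers).
a = 0.9995      # steady-state availability of one network element (assumed)
n = 10_000      # elements in a large network (assumed)
k = 5           # elements a typical user's connection traverses (assumed)

# Criterion 1: probability that every element is up simultaneously --
# for large n this is small even with excellent per-element availability.
p_all_up = a ** n

# Criterion 2: availability seen by a typical user, whose traffic
# touches only the k elements on his or her path.
p_user = a ** k

print(f"P(all {n} elements up)     = {p_all_up:.4f}")
print(f"P(typical user's path up) = {p_user:.4f}")
```

Even at 99.95 percent per element, the whole-network criterion collapses toward zero while the per-user figure stays above 99.7 percent, which is exactly why the per-user criterion is the more meaningful one.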
5.2 Specification

A system can have a survivability specification that is based on customer requirements. The survivability of a telecommunications network is defined as the ability of the network to perform under stress caused by cable cuts or sudden and lengthy traffic overloads and after failures, including equipment breakdowns. Thus, performance and availability have been combined into a unified metric. One area of telecommunications where these principles have been applied is the design and implementation of fiber-based networks. Roohy-Laleh et al. (ref. 15) state "...the statistical observation that on the average 56 percent of the pairs in a copper cable are cut when the cable is dug up, makes the copper network 'structurally survivable.'" On the other hand, a fiber network can be assumed to be an all or nothing situation with 100 percent of the circuits being affected by a cable cut, failure, etc. In this case study (ref. 15), "...cross connects and allocatable capacity are utilized by the intelligent network operation system to dynamically reconfigure the network in the case of failures." Figure 6 (from ref. 16) presents a concept for specification targets.
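The copper-versus-fiber contrast above can be sketched numerically. Only the 56 percent figure comes from the text; the circuit count is assumed for illustration.

```python
# Expected circuits lost to a single cable cut: copper vs. fiber.
n_circuits = 1000        # circuits carried by the cable (assumed for illustration)
copper_cut_frac = 0.56   # average fraction of copper pairs cut (ref. 15 figure)

# Copper degrades gracefully -- "structurally survivable" -- losing only
# the cut fraction; a fiber cut is all or nothing.
copper_lost = copper_cut_frac * n_circuits
fiber_lost = 1.0 * n_circuits

print(f"copper cable cut: {copper_lost:.0f} of {n_circuits} circuits lost")
print(f"fiber cable cut:  {fiber_lost:.0f} of {n_circuits} circuits lost")
```

This is the case for the intelligent cross-connect reconfiguration described in the case study: because a fiber cut affects every circuit, survivability must be restored by rerouting rather than by the cable's physical structure.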
5.4 Customer Reliability Growth

Reliability growth has been studied, modeled, and analyzed, usually from the design and development viewpoint. Seldom is the process or product studied from the customer's perspective. Furthermore, the reliability that the first customer observes with the first shipment
[TABLE III.--Reliability objectives for telecommunications systems (ref. 14). Modules or systems listed: telephone instrument, electronic key system, traffic service position system (TSPS), Class 5 office, Class 4 office, and Class 3 office. Objectives include mean time between loss of service, major loss of service, minor loss of service, mishandled calls, and system outage.]

[Figure 6.--Specification targets (ref. 16). The figure plots availability in percent (a1, a2, up to 100) against performance regions ranging from fully operational, through subliminal availability (major failures P1, minor failures P2), degraded operation, and subliminal performance (75 or 65 percent at load factor B), down to unusable.]
can be quite different from the reliability that a customer will observe with a unit or system produced 5 years later, or with the last shipment. Because the customer's experience can vary with the maturity of a system, reliability growth is an important concept to customers and should be considered in their purchasing decisions. One key to reliability growth is the ability to define the goals for the product or service from the customer's perspective while reflecting the actual situation in which the customer obtains the product or service. For large telecommunications switching systems, the rule of thumb for reliability growth has been that systems are often allowed to operate at a lower availability than the specified availability goal for the first 6 months to 1 year of operation (ref. 18). In addition, component part replacement rates have often been allowed to be 50 percent higher than specified for the first 6 months of operation. These allowances accommodated craftspersons' learning patterns, software patches, design errors, etc. Another key to reliability growth is to have its measurement encompass the entire life cycle of the product. The concept is not new; only here the emphasis is placed on the customer's perspective.

Reliability growth can be specified from "day 1" in product development and can be measured or controlled with a 10-year life until "day 5000." We can apply the philosophy of reliability knowledge generation principles, which is to generate reliability knowledge at the earliest possible time in the planning process and to add to this base for the duration of the product's useful life. To accurately measure and control reliability growth, we must examine the entire manufacturing life cycle. One method is the construction of a production life-cycle reliability growth chart. In certain large telecommunications systems, the long installation time allows the electronic part reliability to grow so that the customer observes both the design and the production growth. Large, complex systems often offer an environment unique to each product installation, which dictates that a significant reliability growth will occur. Yet, given the difference that size and complexity impose on resultant product reliability growth, corporations with large product lines should not present overall reliability growth curves on a corporate basis but must present individual product-line reliability growth pictures to achieve total customer satisfaction.
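One common way to draw a production life-cycle reliability growth chart of the kind described above is with the Duane power-law model, under which cumulative MTBF rises as a power of cumulative operating time. The model choice and every parameter below are illustrative assumptions; the text does not prescribe them.

```python
# Duane reliability-growth sketch:
#   MTBF_c(t) = MTBF_c(t0) * (t / t0) ** alpha
# where MTBF_c is cumulative MTBF and alpha is the growth slope.
mtbf_0 = 50.0   # cumulative MTBF (hours) observed at time t0 (assumed)
t0 = 100.0      # cumulative operating hours at first observation (assumed)
alpha = 0.35    # growth slope; values near 0.3-0.5 are commonly cited (assumed)

def cumulative_mtbf(t: float) -> float:
    """Cumulative MTBF under the Duane model at cumulative time t (hours)."""
    return mtbf_0 * (t / t0) ** alpha

# Chart the growth from "day 1" out through the product's useful life.
for t in (100.0, 1_000.0, 10_000.0, 50_000.0):
    print(f"t = {t:>8.0f} h   cumulative MTBF = {cumulative_mtbf(t):7.1f} h")
```

Plotted on log-log axes these points fall on a straight line of slope alpha, which is what makes such a chart a convenient way to track growth across design, production, and installation.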
APPENDIX--COURSE EVALUATION

NASA SAFETY TRAINING CENTER (NSTC) COURSE EVALUATION

Name:
Course Title:
Sponsor:

1. What ...

3. How will the skills/knowledge you gained in this course/workshop help you to perform better in your job?

   5 Excellent   3 Fair   1 Poor

6. Please rate the applicability of this course to your work.

   5 Excellent   3 Fair   1 Poor

(OVER)

7. As a customer of the NASA Safety Training Center (NSTC), how would you rate our services?

   5 Excellent   3 Fair   1 Poor

   Comments:

8. Please rate the following items (5 Excellent, 4, 3 Fair, 2, 1 Poor):
   1. Overall course content
   2. Achievement of course objectives
   3. Instructor's knowledge of subject
   4. Instructor's presentation methods
   5. Instructor's ability to address questions
   6. Quality of textbook/workbook (if applicable)
   7. Training facilities
   8. Time allotted for the course

   Comments:

9. Training expense other than tuition (if applicable):
   Travel (including plane fare, taxi, car rental, and tolls)
   Per Diem
   Total

10. Please send this evaluation to:
    NASA Safety Training Center
    Webb, Murray & Associates, Inc.
    1730 NASA Road One, Suite 102
    Houston, Texas 77058

THANK YOU!
REFERENCES

1. Electronic Reliability Design Handbook. MIL-HDBK-338, vols. 1 and 2, Oct. 1988.
2. Reliability Modeling and Prediction. MIL-STD-756B, Aug. 1982.
3. ... Theory and Practice.
5. Reliability Test Methods, Plans, and Environments for Engineering Development, Qualification, and Production. MIL-HDBK-781, July 1987.
7. Laube, R.B.: Methods to Assess the Success of Test Programs. J. Environ. Sci., vol. 26, no. 2, Mar.-Apr. 1983, pp. 54-58.
8. Test Requirements for Space Vehicles. MIL-STD-1540B, Oct. 1982.
9. Siewiorek, D.P.; and Swarz, R.S.: The Theory and Practice of Reliable System Design. Digital Press, Bedford, MA, 1982, pp. 206-211.
10. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E, Jan. 1990.
11. Nathan, I.: A Deterministic Model To Predict "Error-Free" Status of Complex Software Development.
12. Schick, G.J.; and Wolverton, R.W.: An Analysis of Competing Software Reliability Models. IEEE Trans. Software Eng., vol. SE-4, no. 2, Mar. 1978, pp. 104-120.
13. Spragins, J.D., et al.: Current Telecommunication Network Reliability Models: A Critical Assessment. IEEE J. Sel. Areas Commun., vol. SAC-4, no. 7, Oct. 1986, pp. 1168-1173.
14. Malec, H.A.: Reliability Optimization in Telephone Switching Systems Design. IEEE Trans. Rel., vol. R-26, no. 3, Aug. 1977, pp. 203-208.
15. Roohy-Laleh, E., et al.: A Procedure for Designing a Low Connected Survivable Fiber Network. IEEE J. Sel. Areas Commun., vol. SAC-4, no. 7, Oct. 1986, pp. 1112-1117.
16. Jones, D.R.; and Malec, H.A.: Communications Systems Performability: New Horizons. 1989 IEEE International Conference on Communications, vol. 1, IEEE, 1989, pp. 1.4.1-1.4.9.
17. Dhillon, B.S.: Human Reliability: With Human Factors. Pergamon Press, 1986.
18. Conroy, R.A.; Malec, H.A.; and Van Goethem, J.: The Design, Applications, and Performance of the System-12 Distributed Computer Architecture. First International Conference on Computers and Applications, E.A. Parrish and S. Jiang, eds., IEEE, 1984, pp. 186-195.
REPORT DOCUMENTATION PAGE

Report Date: October 1994
Report Type: Technical Memorandum
Title and Subtitle: Design for Reliability: NASA Reliability Preferred Practices for Design and Test
Funding Numbers: WU-323-44-19
Author: Vincent R. Lalli
Performing Organization: National Aeronautics and Space Administration, Lewis Research Center, Cleveland, Ohio 44135-3191
Performing Organization Report Number: E-8053
Sponsoring/Monitoring Agency: National Aeronautics and Space Administration, Washington, D.C.
Supplementary Notes: Prepared for the Reliability and Maintainability Symposium cosponsored by ASQC, IIE, IEEE, SOLE, IES, AIAA, and SSS, Anaheim, California, January 24-27, 1994. Responsible person, Vincent R. Lalli, organization code 0152, 433-2354.
Distribution/Availability Statement: Unclassified - Unlimited; Subject Category 18
Number of Pages: 27
Price Code: A03
Security Classification of Report and Abstract: Unclassified

Standard Form 298 (Rev. 2-89), prescribed by ANSI Std. Z39-18