Master Report John Zanoff
By
John Zanoff III
MASTER OF ENGINEERING
Approved
______________________________________
Dr. A. Ertas
______________________________________
Dr. T. T. Maxwell
______________________________________
Dr. E. W. Kiesling
______________________________________
Dr. M. M. Tanik
ACKNOWLEDGEMENTS
First of all, I would like to thank my family, especially my wife, because without her support and understanding, I would not have undertaken this endeavor to further my education. I would also like to thank my daughter, a junior at the University of Texas at Austin, who provided me the
I would be remiss if I did not thank some of the individuals in the class who also played a substantial part in my successful completion of this program. Terrence Chan, Schuyler Deitch, and I teamed for all of the group assignments during the year, and their youth and motivation kept me going. I only hope that I was able to repay their energy with some wisdom of my own.
I would also like to thank all of the professors and instructors who took the time from their busy schedules to further my education. I would especially like to thank Dr. Ertas and Dr. Maxwell for their tireless support of the class as well as their ability to keep things moving over a long distance.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ........................................................................................................ II
DISCLAIMER.............................................................................................................................. V
ABSTRACT................................................................................................................................. VI
LIST OF FIGURES .................................................................................................................. VII
LIST OF TABLES ...................................................................................................................VIII
CHAPTER I
INTRODUCTION......................................................................................................................... 1
CHAPTER II
BACKGROUND ........................................................................................................................... 3
2.1 The Need For Prognostics and Diagnostics in New Designs 3
2.1.1 Customer Requirements ............................................................................................... 3
2.1.2 Total Cost of Ownership .............................................................................................. 3
2.1.2.1 Long Term Savings.......................................................................................... 4
2.1.2.2 Reduced Levels of Sparing and Logistics Impact ........................................... 5
2.2 Impact on Reliability 6
2.2.1 Partial Operability ........................................................................................................ 8
CHAPTER III
DEFINING PROGNOSTICS AND DIAGNOSTICS.............................................................. 10
3.1 Understanding the Terms 10
3.1.1 Diagnostics ................................................................................................................. 10
3.1.2 Prognostics ................................................................................................................. 11
CHAPTER IV
IMPLEMENTATION ................................................................................................................ 12
4.1 Utilization of Built In Test (BIT) 12
4.2 Diagnostics 13
4.2.1 Types of Diagnostics .................................................................................................. 13
4.2.1.1 Failure Modes, Effects and Criticality Analysis (FMECA) Role in Health
Management .............................................................................................................. 13
4.2.1.2 Software Implementation of Prognostics and Diagnostics ............................ 19
4.2.1.3 Canaries ......................................................................................................... 27
4.3 Prognostics 28
4.3.1 Modeling of the System ............................................................................................. 29
4.3.2 Approaches to Implementing Prognostics Capability................................................ 32
4.3.3 Physics of Failure ....................................................................................................... 38
4.3.3.1 Identification of Failure Mechanisms ............................................................ 40
4.3.3.2 Steps for Physics-of-Failure Accelerated Life Testing.................................. 42
4.3.3.3 Problems with Accelerated Life Testing ....................................................... 43
4.4 Monitoring 43
4.4.1 In-Situ Sensor Monitoring.......................................................................................... 45
4.4.3 Maintenance ............................................................................................................... 47
4.4.3.1 Condition-based Maintenance (CBM)........................................................... 48
4.4.3.2 Maintenance Activities Prior to Failure......................................................... 52
4.4.3.3 Value of the Data Mining Process ................................................................. 53
4.4.4 Knowledge Database.................................................................................................. 55
4.4.5 Applicability to the Aviation Industry ....................................................................... 57
4.4.6 Applicability to the Medical Industry ........................................................................ 58
CHAPTER V
SUMMARY AND CONCLUSIONS ......................................................................................... 63
REFERENCES............................................................................................................................ 66
APPENDIX A
DRIVING FORCE BEHIND THE CONTENTS OF THIS MASTER’S REPORT .............. 1
Introduction 1
The Logistics Footprint [24] 2
Embedded Diagnostics and Prognostics Synchronization [24] 4
Summary [24] 8
APPENDIX B
RELIABILITY IMPACT ON DESIGNS ................................................................................... 1
Design in Reliability Early in the Development Process ...................................................... 1
Conduct Lower Level Testing............................................................................................... 2
Rely More on Engineering Design Analysis and Less on Predictions.................................. 2
Perform Engineering Analyses of Commercial-Off-The-Shelf Equipment.......................... 3
Provide Reliability Incentives in Contracts........................................................................... 3
DISCLAIMER
The opinions expressed in this report are strictly those of the author and are not necessarily those of Raytheon, Texas Tech University, or any U.S. Government agency.
An attempt has been made to acknowledge all of the people involved in this paper, but due to the sheer size of the material, someone may have been missed. This is unintentional and it is my wish
ABSTRACT
As a reliability engineer for the past seventeen years, I have seen a transformation from simple, easily repairable systems to incredibly complex designs that are almost impossible to repair, especially in times of actual combat. The Army, in its latest weapon system development effort, has required of potential bidders that the reliability of these new systems be an order of magnitude higher for the end user, that the systems be easily and quickly repairable in the field, and that, during times of actual combat, they WILL NOT FAIL. On initial thought, this seems an insurmountable task, but after doing some research and benchmarking with commercial companies, it can be realized. If existing design and testing philosophies from other technologies are incorporated into these designs, we can meet the needs of the end user.
This report was initiated to accumulate information in order to develop a plan that puts into use the requirements, design methodologies (both hardware and software), and capabilities needed to propose and design systems that are as reliable as required for the next generation of military hardware. The plan encompasses the necessary requirements, testing, ideas and objectives for meeting those requirements. An extensive list of materials, including papers presented at various seminars and working groups, extensive company research, benchmarking and actual case studies, was used as the basis of this report, and this document has a high potential of being utilized as a Raytheon "Best Practice." There is a certain amount of synergy to be gained by pursuing this subject matter, as well as the potential to produce something that could have a very positive impact on future development within the company and, most importantly, a positive impact on the ability to better support the customer or end user of the product.
LIST OF FIGURES
Figure 4: Architecture of the Prognostics & Health Management Design Tool [14] 19
Figure 11: Idealized bathtub reliability curves for the test circuit and the prognostic
cell [4] 46
Figure 12: Decision Tree for Pre and Post Failure Maintenance Actions 49
Figure 14: Example of a Bayesian Network for Monitoring Motor Stress Factors [18] 55
LIST OF TABLES
CHAPTER I
INTRODUCTION
The overwhelming desire to design and produce systems that are an order of magnitude more
reliable, maintainable and supportable than their predecessors carries with it a responsibility for
It is of utmost importance to have a clearly defined strategy of justifiable, attainable goals and
definable tasks and processes to be successful. This paper will serve to collect and define the
necessary steps to attain the most reliable system design possible and to also document the required
Diagnosis and prognosis are processes of assessment of a system’s health – past, present and
future – based on observed data and available knowledge about the system. Diagnosis is an
assessment about the current (and past) health of a system based on observed symptoms, and
prognosis is an assessment of the future health. While diagnostics, the science dealing with diagnosis, has existed as a discipline in medicine and in system maintenance and failure analysis, prognostics is a relatively new area of research. Indeed, the word "prognostics" does not yet exist in English
language dictionaries. The research community developing diagnosis and prognosis techniques to aid
Condition-Based Maintenance (CBM) likely coined the term. A working definition of prognostics is
“the capability to provide early detection and isolation of precursor and/or incipient fault condition to
a component or sub-element failure condition, and to have the technology and means to manage and
predict the progression of this fault condition to component failure.” This definition includes
prognosis (i.e., the prediction of the course of a fault or ailment) only as a second stage of the
prognostic process. The preceding stage of detection and isolation could be considered a diagnosis
process. Even the traditional meaning of prognosis used in the field of medicine implies the
recognition of the presence of a disease and its character prior to performing prognosis. However, in
the maintenance context, the final step of making decisions about maintenance and mission planning
should also be included, while developing prognostics, for more convenient traceability to top-level
goals [1:1]. Due to the nature of the observed data and the available knowledge, the diagnostic and
prognostic methods are often a combination of statistical inference and machine learning methods as
well as embedded hardware, firmware, components or software to monitor, store and communicate
conditions or impending faults. The development (or selection) of appropriate methods requires
appropriate formulation of the learning and inference problems that support the goals of diagnosis and
prognosis. An important aspect of the formulation is modeling – relating the real system to its
mathematical abstraction.
Since both diagnostics and prognostics are concerned with health assessment, it is logical to
study them together. However, the decision-making goals of the two are different. Diagnosis results
are used for reactive (post failure) decisions about corrective (repair/replacement) actions; prognosis
results are used for proactive (pre-failure) decisions about preventive and/or evasive actions (CBM,
mission reconfiguration, etc.) with the economic goal of maximizing the service life of
replaceable/serviceable components while minimizing operational risk. On the other hand, in many
situations diagnosis and prognosis aid each other. Diagnostic techniques can be used to isolate
components where incipient faults have occurred based on observed degradation in system
performance. These components can then become the focus of prognostic methods [1:2] to estimate
when incipient faults would progress to critical levels causing system failure. Prognostics could then
be used to update the failure rates (reliabilities) of the system components, and in the event of a
failure these updated reliability values can be used to isolate the failed component(s) via a more
CHAPTER II
BACKGROUND
2.1 The Need For Prognostics and Diagnostics in New Designs
2.1.1 Customer Requirements
The increasing cost and complexity of military weapon systems over the past four decades have driven the technology to the point that these systems are nearly impossible to repair without extensive knowledge of the system and cumbersome test equipment. Along with this comes the heavy price tag of maintaining these items during a potential service life of more than 25 years, a cost which includes replacement items, training the maintainers, and repair of the failed items removed from the system. The commercial world has not only embraced the idea of including Built-In Test (BIT) and an extensive diagnostics capability; recent developments in the area of prognostics, or "repair before fail," have been implemented in even the most complex items and for various applications.
This technology has only recently been required for military systems, and many defense
contractors are struggling with the implementation of this technology in new designs. The ability to
repair items prior to failure, especially on a wartime footing, will provide the advantage to those
As in any complex design, the initial or inherent reliability is at its peak during the early
phase of the operating cycle. If it is assumed that the design has been thoroughly tested and latent
defects and infant mortality are not an issue, then it will perform “as advertised” for a period of time
without any sort of failure or unscheduled maintenance action. In many cases, especially as the
complexity of the item increases, the ability to perform or meet the performance requirements suffers
from failures that were not anticipated. Car manufacturers pride themselves on the fact that their
designs are considerably more reliable than in previous years, even when taking into account that
these vehicles are considerably more complex than ones built only 20 years prior. The ability for a
product or system to operate with minimal upkeep for an extended period of time is one of the inputs
The ability to predict the time to conditional or mechanical failure (on a real-time basis) is of enormous benefit, and health management systems that can effectively implement these capabilities offer a great opportunity in terms of reducing the overall Life Cycle Cost (LCC) [2:2] of operating
The reduction of the overall cost of ownership of a product while in operation is at least
partially based on the required maintenance, either planned or unplanned, and upkeep to ensure that
the system is operating at full or acceptable performance. The more maintenance or “down time”
experienced with a particular piece of equipment means loss of revenue, reduced capability, increased
overhead or “out of pocket” expenses and, most of all, a customer that is also likely to be unhappy
with their product. The evolution of the automotive industry over the last forty years is a perfect example.
During the 1960s, most Japanese-manufactured automobiles were perceived as having poor workmanship and materials, to such an extent that the American auto manufacturers were in control of the market. A transformation occurred in the late 1960s that was to be a monumental turnaround in the industry. Japanese manufacturers began to understand what was needed to increase both the quality and reliability of their product to the point of exceeding the American automotive industry. These manufacturers flooded the market in the early to mid-1970s with a product superior in quality, reliability and economy, and hence with an overall lower cost of ownership than that of their US-made competition. The fuel crisis was also instrumental in these
compact, fuel-efficient vehicles dominating the market. It took the Fords, Chevrolets and Chryslers a
substantial amount of time to recover from this due to their enormous infrastructure and cost of
The fact that smaller American-made cars were rolling off the assembly lines did not deter the buying public from returning to the Nissans and Toyotas, because the US-made cars were still not very reliable and the cost of ownership outweighed the initial purchase price of the vehicle.
The American public wanted transportation that could be trusted to get them where they wanted to
go, and did not want to have to worry about whether or not it was going to get them there.
This concept of reduced maintenance, lower operating costs and reduced down time is now
being embraced within the Department of Defense. It has been shown that the more reliable a system
is, the less it will cost to operate and maintain. When diagnostics and prognostics are incorporated,
the logistics tasks also benefit from the fact that the system, based on either historical data or known failure modes, can "communicate" its health to the maintainers to reduce repair time as well as the cost of spare parts. This will be discussed in more detail in the next section.
maintainers, it is much easier to perform either post or, more importantly, pre failure maintenance on
the hardware. Once a failure history has been collected on a particular piece of hardware, this
knowledge can be instrumental in determining over the life of the system what needs to be purchased
on an ongoing basis to ensure that the system will continue to operate. Depending on the technology,
there is apt to be failures that occur in one particular part of the hardware more often than others, like
motors, bearings and other electro-mechanical features that have a wear out cycle based on friction,
heat, plastic deformation, etc. Because of this, and if the hardware is monitored correctly, knowing
what will fail can have a significant impact on operation and support cost over the useful life. This
can also be used to determine if the design itself needs to be evaluated or specific portions of the
system be redesigned due to unknown or unforeseen environments that it is operating in that were not known during the initial development. The bottom line is that the system needs to be designed to have the highest inherent reliability that the operating environment allows, and then be monitored for any signs of premature failure or design defects that would not otherwise be encountered during normal operation.
2.2 Impact on Reliability
Reliability is defined as the ability of a product to perform as intended (i.e., without failure
and within specified performance limits) for a specified time, in its life cycle application
environment. The objective of reliability prediction is to support decisions related to the operation and
• Reducing the output penalties to include outage repair and labor costs
• Helping in the design of future products, by improved safety margins and reduced
failures
properties, fabrication process, and the life cycle environment. Material properties and geometry of a
product are not exactly the same for all the products coming out of a production line. There may be
changes in these with variations in the fabrication process affecting the reliability and can be
product degradation in its life cycle environment. By knowing about impending failure, based on
actual life cycle application condition, procedures can be developed to mitigate, manage or maintain
the product.
In health monitoring a product’s degradation is quantified by continuous or periodic
measurement, sensing, recording, and interpretation of physical parameters related to the product’s
life cycle environment and converting the measured data into some metric associated with either the
fraction of product degradation or the remaining life of the product (in terms of days, distance in
miles, cycles to failure, etc.). A product’s degradation can be assessed in terms of physical
degradation (e.g., cracks, deflection, delamination) and electrical degradation (e.g., increase in
product’s operating parameters (e.g., electrical, mechanical, or acoustic) from expected values.
Methods employed for health monitoring of electro-mechanical attributes [4:2] are non-destructive
test (e.g., ultrasonic inspection, liquid penetrant inspection, and visual inspection) and operating
parameter monitoring (e.g., vibration monitoring, oil consumption monitoring and thermography
(infrared) monitoring).
The science of reliability is based on the premise that hardware can be characterized to be at a
certain spot on the reliability “bathtub” curve. This curve depicts the life cycle of hardware based on
its maturity, technology basis, useful life and when wear out or failure is likely to occur. Figure 1 is a
depiction of the reliability bathtub curve with points annotated on it based on where in the lifecycle
the hardware is. Although widely accepted as the traditional constant failure rate curve during useful
life, more and more research is being performed in the area of Physics of Failure, which is a modeling
technique used to better determine when a system or component is likely to fail. This topic will be
Figure 1: Classical Bathtub Reliability Curve [5]
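As an illustration of the three regions of the bathtub curve, the following short Python sketch (an illustrative example with assumed parameter values, not taken from the referenced work) combines three Weibull hazard rates: a decreasing-rate term for infant mortality (shape < 1), a constant-rate term for useful life (shape = 1), and an increasing-rate term for wear-out (shape > 1).

import numpy as np

def weibull_hazard(t, beta, eta):
    # Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1)
    return (beta / eta) * (t / eta) ** (beta - 1)

t = np.linspace(1, 5000, 200)  # operating hours (illustrative values)

# Infant mortality (beta < 1), useful life (beta = 1), wear-out (beta > 1)
h_total = (weibull_hazard(t, beta=0.5, eta=300)
           + weibull_hazard(t, beta=1.0, eta=2000)
           + weibull_hazard(t, beta=4.0, eta=6000))

for hours in (10, 1000, 4500):
    i = np.argmin(np.abs(t - hours))
    print(f"t = {hours:5d} h  hazard = {h_total[i]:.5f} failures/h")

Plotting h_total against t reproduces the bathtub shape of Figure 1: the first term dominates early, the middle term sets the flat useful-life region, and the last term dominates during wear-out.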
There are really two concepts that need to be understood and specified during the
development process. The more known or standard specification is Mean Time Between Failures
(MTBF), which signifies any failure that would or could occur on the system thereby making the
system unable to complete its intended function. An additional requirement that can be specified may
be Mean Time Between Essential Function Failure (MTBEFF). This is a partially failed state or "degraded" mode of operation in which, while not fully functional, the system can still perform its intended function through the use of either partial redundancy, backup systems, or increased operator workload. Examples range from the extreme case of an aircraft losing one out of two or more engines and still having the ability to maintain flight control and land safely, to the other extreme of a sensor system with multiple sensors in which, if one failed, one of the others could at least partially fulfill the role of the failed sensor.
Due to cost, weight, volume and power consumption constraints, the ability to incorporate redundancy is a luxury not available in most applications, especially if the system is man-portable or airlift capable. Because of this, it is essential to ensure that the design is the most reliable it can be based on
the technology available, and to also incorporate the ability to test the health of the system so that an
informed decision can be made of its ability to perform a specific function or task.
The goal here is two-fold: to allow the operation or use of a system to continue with sub-optimal performance after a partial failure, and to still meet the system's inherent reliability.
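The distinction between MTBF and MTBEFF can be made concrete with a minimal sketch (the operating history and failure records below are illustrative assumptions): MTBF counts every failure, while MTBEFF counts only failures that removed an essential function.

# Minimal sketch (illustrative data): MTBF counts every failure, while MTBEFF
# counts only failures that stopped the system from performing its essential
# function (degraded-but-operable events are excluded).
operating_hours = 12_000.0

# Each record: (description, essential_function_lost)
failures = [
    ("sensor channel 3 lost (redundant channel carried the load)", False),
    ("primary power supply failure",                                True),
    ("cooling fan degraded, reduced duty cycle",                    False),
    ("servo drive failure, turret inoperative",                     True),
]

mtbf = operating_hours / len(failures)
essential = sum(1 for _, lost in failures if lost)
mtbeff = operating_hours / essential if essential else float("inf")

print(f"MTBF   = {mtbf:.0f} hours")
print(f"MTBEFF = {mtbeff:.0f} hours")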
CHAPTER III
DEFINING PROGNOSTICS AND DIAGNOSTICS
3.1 Understanding the Terms
3.1.1 Diagnostics
Diagnosis is the process of identification of the cause(s) of a problem. The causes of interest
may not be “root” causes. Depending on the application or the level of maintenance activity,
diagnosis can imply the isolation of a faulty component, a failure mode, or a failure condition. It
involves the recognition of one or more symptoms or anomalies (that what is observed is not normal),
and their association with a ground truth. The cause could be within the system (failed component) or
be an external factor which subsequently ‘damages’ the system and prevents it from functioning
normally in the future. The objectives, job function or capacity of the user of the diagnostic results determine whether the external root cause, the damaged system component(s), or both need to be addressed (i.e., whether the corrective action should focus on eliminating the external root cause, repairing the damaged component(s), or both).
In many large systems instrumented with built-in sensors and diagnostic tests, the steps of
anomaly detection and root-cause isolation are distinct. The failure of one or more tests signifies
anomalies or failures, and the processing of these results to isolate the failure source constitutes the
isolation step. In some applications, however, failure detection and isolation are not separate steps.
For example [1:2], diagnostic problems are often formulated as classification problems: associate a
vector of features obtained from the system with a class corresponding either to the normal (healthy)
state or to one of the failure modes. Such approaches are effective during corrective maintenance if
the relationship between the failure modes constituting the classes and the implicated components is
obvious or implied.
3.1.2 Prognostics
While diagnosis or diagnostics is based on observed data and available knowledge about the
system and its environment, prognosis makes use of not only the historical data and available
knowledge, but also profiles of future usage and external factors. Both diagnosis and prognosis can be
formulated as inference problems that depend on the objectives of diagnosis and prognosis and the
nature of the available data and system knowledge. Two characteristics common to all applications
[1:2] are the incomplete or imprecise knowledge about the systems of interest, especially in the
failure space, and the uncertainty or randomness of observed data. The development of an appropriate
decision model, thus, requires consideration of these characteristics along with the objectives of the
decision problem.
CHAPTER IV
IMPLEMENTATION
4.1 Utilization of Built In Test (BIT)
The concept of Built-In Test (BIT) is another technique employed for diverse applications. BIT is a hardware-software diagnostic means to identify and locate faults. Various types of BIT concepts are employed in electronic systems [4:3]: interruptive BIT (I-BIT), periodic BIT (P-BIT) and continuous BIT (C-BIT). The concept of I-BIT is that normal equipment operation is suspended during BIT operation; such BITs are typically initiated by the operator or during a power-up process. P-BIT is the ability of the hardware to "self check" at specific times during its operation on a non-interference basis, and it will cease if necessary to avoid degradation in performance. The concept of C-BIT is that equipment is monitored continuously and automatically
BIT is a technology that has been in place for many years in military hardware. The ability to
incorporate testing at some designated level to detect failures allows for both diagnostic and repair
capability. It provides a "go/no go" indication as to the ability of a piece of hardware to perform its mission. The drawback to this technology is that it cannot predict failures, but only announce that they have occurred. In some cases, depending on the depth to which this testing capability has been incorporated, a system or weapon could actually be failed or unable to perform its intended use, with the failure masked because it was not monitored. Although this concept was generally thought of 20 years ago as an innovative approach to improve the health monitoring of systems, the complexity of current technology as well as the necessity of more reliable hardware have been the driving forces behind the utilization of more complex monitoring, fault isolation and maintenance avoidance. The
transition into a more robust testing scheme is the reason that enhanced hardware diagnostics are
utilized.
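The scheduling difference among the three BIT types described above can be sketched as follows. This is a hypothetical illustration (the class name, test lists, and period value are assumptions), not an implementation from the referenced systems.

# Hypothetical sketch of the three BIT types: I-BIT runs only when requested
# (operator or power-up) and suspends normal operation, P-BIT runs at scheduled
# times on a non-interference basis, and C-BIT runs continuously in the background.
import time

def run_tests(tests):
    return all(test() for test in tests)

class BitController:
    def __init__(self, i_bit, p_bit, c_bit, p_bit_period_s=60.0):
        self.i_bit, self.p_bit, self.c_bit = i_bit, p_bit, c_bit
        self.p_bit_period_s = p_bit_period_s
        self._last_p_bit = 0.0

    def initiated_bit(self):
        # I-BIT: normal operation is suspended while the full test suite runs.
        return run_tests(self.i_bit)

    def background_cycle(self, system_busy):
        # C-BIT: always executed; P-BIT: only when due and not interfering.
        healthy = run_tests(self.c_bit)
        now = time.monotonic()
        if not system_busy and now - self._last_p_bit >= self.p_bit_period_s:
            healthy = healthy and run_tests(self.p_bit)
            self._last_p_bit = now
        return healthy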
4.2 Diagnostics
Diagnostics is the ability of a system to process data in order to communicate to the user what
is or has failed. This information is usually in the form of a test result that shows a failure and the
most likely cause of that failure. These results are based on the data that is collected and compared
against historical database information that is gathered from previous failures, and the actual cause of
the failure.
4.2.1 Types of Diagnostics
Discrete diagnostics are algorithms that produce 0 or 1 depending on whether a threshold has been exceeded. Many types of Built In Test are discrete diagnostics; an example would be an Exhaust Gas Temperature (EGT) reading that has exceeded a predetermined level.
Continuous diagnostics are algorithms [14:5] designed to observe transitional effects and
diagnose a failure mode based on the method and rate in which the effect is changing. Continuous
diagnostics are usually associated with observing the severity of failure mode symptoms. Examples
of continuous diagnostics would be a spike energy monitor for identifying low levels of bearing race
spalling or an Artificial Intelligence (AI) classifier for diagnosing that a valve is sticking. The
"Detection Confidence score (0-1) – (DDC)" and "% false positive score (0-1) – (DFP)" can be used
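The contrast between the two classes of diagnostics can be sketched briefly. The thresholds, full-scale values, and weighting below are assumptions chosen only to illustrate the idea of a discrete (0/1) result versus a continuous severity result.

# Sketch (assumed parameter values): a discrete diagnostic returns 0/1 against
# a threshold, while a continuous diagnostic grades severity from the level and
# rate of change of the monitored feature.
def discrete_egt_diagnostic(egt_reading_c, limit_c=650.0):
    # 1 = failure indicated (threshold exceeded), 0 = no failure indicated
    return 1 if egt_reading_c > limit_c else 0

def continuous_bearing_diagnostic(spike_energy_history):
    # Severity in [0, 1] based on the latest level and its rate of change.
    latest = spike_energy_history[-1]
    rate = spike_energy_history[-1] - spike_energy_history[0]
    rate /= max(len(spike_energy_history) - 1, 1)
    level_score = min(latest / 10.0, 1.0)         # assumed full scale of 10
    trend_score = min(max(rate, 0.0) / 0.5, 1.0)  # assumed 0.5 units/sample limit
    return 0.5 * level_score + 0.5 * trend_score

print(discrete_egt_diagnostic(675.0))
print(continuous_bearing_diagnostic([2.0, 2.6, 3.4, 4.5]))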
4.2.1.1 Failure Modes, Effects and Criticality Analysis (FMECA) Role in Health
Management
Military Standard 1629, Revision A, defines the attributes of a FMECA. Figure 2 is a
depiction of how the criticality and potential for occurrence is mapped in a matrix for determination
of potential critical issues that would need to be addressed. In addition, the following information is
provided to better understand the probability of occurrence and severity classification of a FMECA:
Severity Classifications:
• Category I - Catastrophic - A failure which may cause death or weapon system loss.
• Category II - Critical - A failure which may cause severe injury, major property damage, or major system damage which will result in mission loss.
• Category III - Marginal - A failure which may cause minor injury, minor property damage, or minor system damage which will result in a delay or loss of availability or mission degradation.
• Category IV - Minor - A failure not serious enough to cause injury, property damage, or system damage, but which will result in unscheduled maintenance or repair.
Probability of Occurrence:
• Level A - Frequent probability. The frequent probability is defined as a single failure mode probability greater than 0.2 of the overall system probability of failure during the defined mission period.
• Level B - Reasonably probable. The reasonably probable level is defined as a single failure mode probability which is more than 0.1 but less than 0.2 of the overall system probability of failure during the defined mission period.
• Level C - Occasional probability. The occasional probability is defined as a single failure mode probability which is more than 0.01 but less than 0.1 of the overall system probability of failure during the defined mission period.
• Level D - Remote probability. The remote probability is defined as a single failure mode probability which is more than 0.001 but less than 0.01 of the overall system probability of failure during the defined mission period.
• Level E - Extremely unlikely probability. The extremely unlikely probability is defined as a probability which is less than 0.001 of the overall system probability of failure during the defined mission period.
A simple mapping of these severity categories and occurrence levels into a criticality matrix is sketched below.
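The sketch below combines the severity categories (I-IV) and occurrence levels (A-E) above into a criticality score. The numerical ranks, the acceptance threshold, and the failure mode names are assumptions for illustration only; they are not the program's actual criteria.

# Sketch of a criticality matrix built from the severity categories (I-IV) and
# occurrence levels (A-E) above. The ranking and acceptance rule are assumed.
SEVERITY_RANK = {"I": 4, "II": 3, "III": 2, "IV": 1}        # I = most severe
OCCURRENCE_RANK = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}  # A = most frequent

def criticality(severity, occurrence):
    return SEVERITY_RANK[severity] * OCCURRENCE_RANK[occurrence]

def needs_action(severity, occurrence, threshold=8):
    # Assumed rule: any severity/occurrence product of 8 or more is flagged.
    return criticality(severity, occurrence) >= threshold

failure_modes = [
    ("turret drive motor seizure",  "II", "C"),
    ("status lamp burnout",         "IV", "A"),
    ("fire-control processor loss", "I",  "D"),
]
for name, sev, occ in failure_modes:
    flag = "REVIEW" if needs_action(sev, occ) else "accept"
    print(f"{name:30s} severity {sev:>3s}  occurrence {occ}  -> {flag}")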
The application of “health” or “condition” monitoring systems serves to increase the overall
consistent health management philosophy integrates the results from the health monitoring system for
• Prediction, with confidence bounds, of the Remaining Useful Life (RUL) of critical
components, and
• Isolation of the root cause of failures after the failure effects have been observed.
If RUL predictions can be made, the allocation of replacement parts or refurbishment actions
can be scheduled in an optimum fashion to reduce the overall operational and maintenance logistic
footprints. Fault isolation is a critical component to maximizing system availability and minimizing
place after in-field failures (and substantial costs) have been incurred.
Because an initial system FMECA is performed during the preliminary design stage, it is a
perfect link between the critical overall system failure modes and the health management system
designed to help mitigate those failure modes. Hence, a key aspect of the process links [14:2] this
traditional FMECA analysis with health management system design optimization based on failure
mode coverage and life cycle cost analysis. Figure 2 [14:2] depicts the process of utilizing a FMECA
during the design stage to enhance the overall ability to monitor the health of a system.
2. The effects of each failure mode ranging from a local level to the end effect
3. The criticality of the Failure mode (I – IV), where (I) is the most critical
While this type of failure mode analysis is beneficial in getting an initial (though generally
unsubstantiated) measure of system reliability and identifying candidates for redundancy, there are
several areas where fundamental improvements can be made so that FMECA’s can assist in health
and component level indications that the likelihood of a substantial failure mode has increased.
These indications are failure mode symptoms that occur prior to failure. An example of failure mode
symptoms associated with a bearing would be an increase in spike energy or an increase in the oil
particulate count.
3. Does not address the sensors and sensor placement requirements to observe failure mode
symptoms or effects.
4. Does not address health management technologies for diagnosing and prognosing faults.
except it contains failure mode symptoms, as well as sensors and diagnostic/prognostic technologies.
Alternately, a system response model may be used for assessing sensor placements and observability
of simulated failure modes thus offsetting the manual burden of creating the FMECA. Finally,
Figure 4 [14:3] provides an overview of the approach to health management system design
optimization. A basic description of each block will be given first, then details associated with each
block will follow. First, a Function Block diagram of the system must be created that models the
energy flow relationships among components. This functional block diagram provides a clear vision
The information from the Functional Block diagram and the tabular FMECA is automatically
combined to create a graphical health management environment that contains all of the failure mode
attributes as well as health management technologies. The graphical health management environment
is simply a sophisticated interface to a relational database. Once the graphical health management
system has been developed, attributes are assigned to the failure modes, connections, sensors and
diagnostic/prognostic technologies. The attributes are information like historical failure rates (failures per 10^6 operating hours), replacement hardware costs, false alarm rates, etc., which are used to generate a fitness function for assessing the benefits of the health management system configuration. The "fitness" function criteria include system availability, reliability, and cost. Some of these attributes must be manually determined, if known, while others are derived from the attributes of the effectiveness tests or from pre-developed databases. Finally, the health management configuration is automatically optimized from a cost/benefit standpoint using a genetic algorithm approach. The net result is a configuration that maintains the highest system reliability to cost/benefit ratio.
Figure 4: Architecture of the Prognostics & Health Management Design Tool [14]
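The cost/benefit "fitness" idea described before Figure 4 can be sketched as follows. The sensor catalog, coverage fractions, weights, and the simple random search standing in for a full genetic algorithm are assumptions for illustration; they are not the design tool's actual attributes or optimizer.

# Sketch of a fitness function trading failure mode coverage against cost, with
# a random search standing in for the genetic algorithm (all values assumed).
import random

SENSOR_CATALOG = {
    # name: (cost, fraction of critical failure modes covered) - assumed values
    "vibration":   (1200.0, 0.35),
    "oil_debris":  (1800.0, 0.30),
    "temperature": ( 300.0, 0.15),
    "current":     ( 500.0, 0.20),
}

def fitness(config):
    # Higher coverage of critical failure modes is rewarded; cost is penalized.
    coverage = min(sum(SENSOR_CATALOG[s][1] for s in config), 1.0)
    cost = sum(SENSOR_CATALOG[s][0] for s in config)
    return 10_000.0 * coverage - cost

def random_config():
    return [s for s in SENSOR_CATALOG if random.random() < 0.5]

best = max((random_config() for _ in range(200)), key=fitness)
print("best configuration:", best, "fitness:", round(fitness(best), 1))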
Capabilities that have driven an increase in the use of CBM include the need for:
• Signal processing
• Prognostics
• Decision aiding
In addition, a Human System Interface (HSI) is required to provide user access to the system.
The implementation of a CBM system usually requires the integration of a variety of hardware and
software components. Across the range of military and industrial application of CBM, there is a broad
In addition, due to the potential costs of incorporating these CBM systems, a determination
Standardization of specifications within the community of CBM users will, ideally, drive the
CBM supplier base to produce interchangeable hardware and software components. A non-
proprietary standard that is widely adopted will result in a free market for CBM components. The
• Reduced prices
For a particular system integration task, an open systems approach requires a set of public
component interface standards and may also require a separate set of public specifications for the
functional behavior of the components. The underlying standards of an open system may result from
the activities of a standards organization or an industry consortium team, or may be the result of market forces. Standards produced by standards organizations are called de jure standards. De facto standards are those that arise from the marketplace. Regardless of the history of the standards that support an open system, it is required that the standards are published and publicly available at a minimal cost. An example of an open de jure standard is IEEE 802.3, which defines medium access protocols and physical media specifications for Ethernet LANs. Examples of open de facto standards are the UNIX OS and HTTP. An open system standard that
receives widespread market acceptance can have great benefits to consumers and also to suppliers of
these products. The emergence of the IBM PC architecture as a market-leading de facto standard stimulated the market for both PC hardware and software suppliers. It also led to price reductions due to market competition in the face of rapid technology advances. The emergence of a dominant proprietary standard for PC operating systems and software (Windows) resulted in benefits to consumers in terms of application interoperability, but arguably at the cost of increased prices and reduced performance compared to the possibilities that an open system software standard might have offered. An open system architecture yields several technical benefits: system capability can be readily extended by
adding additional components, and system performance can be readily enhanced by adding
A complete architecture for CBM systems should cover the range of functions from data
collection through the recommendation of specific maintenance actions. The key functions [23:2] that
• Decision aiding: maintenance recommendations, or evaluation of asset readiness for a
Typically, CBM system integrators will utilize a variety of commercial off the shelf (COTS)
hardware and software products (using a combination of proprietary and open standards). Due to the
difficulty in integrating products from multiple vendors, the integrator is often limited in the system
capabilities that can be readily deployed. For some applications a system developer will engineer an application-specific system solution. When user requirements drive custom solutions, a significant
part of the overall systems engineering effort is the definition and specification of system interfaces.
The use of open interface standards would significantly reduce the time required to develop and
integrate specialized system components. CBM system developers and suppliers must make decisions
about how the functional capabilities are distributed, or clustered within the system. Due to
integration difficulties in the current environment, suppliers are encouraged to design and build
components that integrate a number of CBM functions. Furthermore, proprietary interfaces are often
used to lock customers into a single source solution, especially for software components. An ideal
Open System Architecture for CBM should support both granular approaches (individual components
A given CBM architecture may limit the flexibility and performance of system
implementations if it does not take into account data flow requirements. To support the full range of
CBM data flow requirements [23:3], the architecture should support both time-based and event-based data reporting and processing. Time-based data reporting can be further categorized as periodic or aperiodic. An event-based approach supports data reporting and processing based upon the occurrence of events (limit exceedances, state changes, etc.). A specific requirement that may be imposed on a
CBM system involves the timeliness of data reporting. Timeliness requirements may be defined
broadly as time-critical or non time-critical. The non time-critical category applies to data messages
or processing for which delays have no significant impact on the usefulness of the data or processing
result. The time-critical category implies a limited temporal validity to the data or processing result
requiring deterministic and short time delays. Two different messaging approaches [23:5] may also be employed within a system: synchronous and asynchronous. In the synchronous model, the message
sender waits for a confirmation response from the message receiver before proceeding to its next task.
In the asynchronous model, the message sender does not wait for a response from the receiver and
closes the communication. The asynchronous model is generally more applicable to time-critical
communications.
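The two messaging models can be contrasted with a minimal sketch. This is an illustrative example using standard Python threading and queues (the message text and timeout are assumptions), not a description of any particular CBM product's messaging layer.

# Sketch of the two messaging models described above (illustrative only).
import queue, threading

def receiver(inbox, acks):
    while True:
        msg = inbox.get()
        if msg is None:
            break
        acks.put(f"ack:{msg}")  # confirmation used only by synchronous senders

inbox, acks = queue.Queue(), queue.Queue()
threading.Thread(target=receiver, args=(inbox, acks), daemon=True).start()

def send_synchronous(msg, timeout_s=1.0):
    # Sender blocks until the receiver confirms, then proceeds to its next task.
    inbox.put(msg)
    return acks.get(timeout=timeout_s)

def send_asynchronous(msg):
    # Sender posts the message and immediately continues; no wait for an ack.
    inbox.put(msg)

print(send_synchronous("limit exceedance on channel 7"))
send_asynchronous("periodic vibration snapshot")
inbox.put(None)  # stop the receiver thread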
Current PC, Networking, and Internet technologies provide a readily available, cost effective,
easily implemented communications backbone for CBM systems. These computer networks are built
over a combination of open and proprietary standards. Software technologies are rapidly developing
to support distributed software architectures over the Internet and across LANs. There is a large
potential payback associated with market acceptance of an open standard for distributed CBM system
architectures. With the ready availability of network connectivity, the largest need is in the area of
One model for distributed computing is Web-based computing [23:6]. The Web-based model
utilizes HTTP servers that function primarily as document servers. The most common medium of
information transport on the Web is the HTML page; HTML is a format for describing the content
and appearance of a document. An alternate format for information transport over the Web is
becoming increasingly popular: XML (eXtensible Markup Language). In contrast to HTML, XML is
focused on describing information content and information relationships. XML is readily parsed into
data elements that application programs can understand and serves as an ideal means of data and
information transport over the web. A simple model for data access over the web is a smart sensor
that periodically posts new data in XML format to a Web page. Information consumers may access
that updated data directly from the Web page. HTTP servers also provide remote access to application
programs by means of the Common Gateway Interface (CGI). In this model, the interface between the
remote client and the Web server is by means of HTML pages or XML; the web server utilizes the
CGI to communicate with the application program. The web-based distributed computing model
requires that each data server have HTTP server software. With the development of compact and
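As a small illustration of the XML transport idea described above, the sketch below shows the kind of document a smart sensor might post and how a consumer could parse it with the Python standard library. The tag names, identifiers, and values are assumptions for this example, not a published schema.

# Illustration of sensor data carried as XML (tag names and values are assumed).
import xml.etree.ElementTree as ET

posted_by_sensor = """
<sensorReading>
  <sensorId>pump-03-accel-x</sensorId>
  <timestamp>2004-05-17T14:32:00Z</timestamp>
  <measurement units="g" type="rms_vibration">0.41</measurement>
</sensorReading>
"""

root = ET.fromstring(posted_by_sensor)
value = float(root.findtext("measurement"))
units = root.find("measurement").get("units")
print(root.findtext("sensorId"), "=", value, units)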
With the growth of distributed computing [23:19], a class of software solutions is evolving
which enable tighter coupling of distributed applications and hide some of the inherent complexities
of distributed software solutions. The general term [23:5] for these software solutions is middleware. Middleware allows one application program to communicate with another program as if the two programs were located on the same computer. Current middleware technologies include the Common Object Request Broker Architecture (CORBA), Microsoft's Distributed Component Object Model (DCOM), SUN's Java Remote Method Invocation (RMI), and Web-based Remote Procedure Call (RPC). CORBA is an open middleware standard developed and
maintained by the Object Management Group (OMG). A number of companies have CORBA based
product offerings for a variety of hardware and OS platforms. DCOM is the extension of Microsoft’s
object technology to distributed software objects; DCOM is built into the Windows 2000 operating
system and Windows NT 4.0. A number of software companies have ported DCOM to other OS
platforms, however DCOM is just one component of the complete solution for distributed computing,
which is provided by Windows 2000. The SUN solution for distributed computing uses JAVA RMI
for managing calls to distributed Java objects. JAVA RMI operates over IIOP (the Internet Inter-Orb
Protocol), the CORBA protocol for communication between distributed software components. This
allows JAVA RMI based solutions some level of interoperability with CORBA based solutions.
In order to standardize architecture for CBM components, the first step is to assign the CBM
system functions defined earlier to a set of standard software components. The software architecture
has been described in terms of functional layers. Starting with sensing and data acquisition and
progressing towards decision support, the general functions of the layers [23:12] are given below:
Layer 1 – Sensor Module: The sensor module has been generalized to represent the software
module, which provides system access to digitized sensor or transducer data. The sensor module may
represent a specialized data acquisition module that has analog feeds from legacy sensors, or it may
collect and consolidate sensor signals from a data bus. Alternately, it might represent the software
interface to a smart sensor (e.g. IEEE 1451 compliant sensor). The sensor module is a server of
Layer 2 – Signal Processing: The signal processing module acquires input data from sensor
modules or from other signal processing modules and performs single and multichannel signal
transformations and CBM feature extraction. The outputs of the signal-processing layer include:
digitally filtered sensor data, frequency spectra, virtual sensor signals, and CBM features.
Layer 3 – Condition Monitor: The condition monitor acquires input data from sensor
modules, signal-processing modules, and from other condition monitors. The primary function of the
condition monitor is to compare CBM features against expected values or operational limits and
output enumerated condition indicators (e.g. level low, level normal, level high, etc). The condition
monitor also generates alerts based on defined operational limits. When appropriate data is available,
the condition monitor may generate assessments of operational context (current operational state or
operational environment). Context assessments are treated, and output, as condition indicators. The
condition monitor may schedule the reporting of the sensor, signal processing, or other condition monitors based on condition or context indicators; in this role it acts as a test coordinator. The
condition monitor also archives data from the Signal Processing and Sensor Modules, which may be
Layer 4 – Health Assessment: The health assessment layer acquires input data from condition monitors or from other health assessment modules. The primary function of the health assessment layer is to determine whether the health of the monitored equipment is degraded. If the health is degraded, the health assessment layer may generate a
diagnostic record, which proposes one or more possible fault conditions with an associated
confidence. The health assessment module should take into account trends in the health history,
operational status and loading, and the maintenance history. The health assessment module should
Layer 5 – Prognostics: Depending on the modeling approach that is used for prognostics, the
prognostic layer may need to acquire data from any of the lower layers within the architecture. The
primary function of the prognostic layer is to project the current health state of equipment into the
future taking into account estimates of future usage profiles. The prognostics layer may report health
status at a future time, or may estimate the remaining useful life (RUL) of an asset given its projected
usage profile. Assessments of future health or RUL may have an associated diagnosis of the projected
fault condition. The prognostic module should maintain its own archive of required historical data.
Layer 6 – Decision Support: The decision support module acquires data primarily from the
health assessment and prognostics layers. The primary function of the decision support module is to
provide recommended actions and alternatives and the implications of each recommended action.
Recommended actions may include modifying the operation or configuration of equipment in order to accomplish mission objectives, or modifying mission profiles to allow mission
completion. The decision support module needs to take into account operational history (including
usage and maintenance), current and future mission profiles, high-level unit objectives, and resource
constraints.
Layer 7 – Presentation: The presentation layer may access data from any of the other layers
within the architecture. Typically high-level status (health assessments, prognostic assessments, or
decision support recommendations) and alerts would be displayed, with the ability to drill down when
anomalies are reported. In many cases the presentation layer will have multiple layers of access
depending on the information needs of the user. It may also be implemented as an integrated user
interface, which takes into account information needs of the users other than CBM.
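A compact sketch of how the lower layers described above can be chained together is given below. The class and method names, the feature chosen, and the limit value are assumptions made for illustration; they are not the published OSA-CBM interface definitions.

# Compact sketch of Layers 1-4 of the architecture described above
# (names and values are assumed, not the published OSA-CBM interfaces).
from statistics import mean

class SensorModule:               # Layer 1: serves digitized sensor data
    def __init__(self, samples):
        self.samples = samples
    def read(self):
        return self.samples

class SignalProcessing:           # Layer 2: signal transformation / feature extraction
    def features(self, samples):
        return {"mean_abs": mean(abs(s) for s in samples)}

class ConditionMonitor:           # Layer 3: compare features against limits
    def __init__(self, limit):
        self.limit = limit
    def indicator(self, features):
        return "level high" if features["mean_abs"] > self.limit else "level normal"

class HealthAssessment:           # Layer 4: report degraded / not degraded
    def assess(self, indicator):
        return {"degraded": indicator == "level high", "indicator": indicator}

sensor = SensorModule([0.2, -0.3, 0.9, -1.1, 0.8])
feats = SignalProcessing().features(sensor.read())
ind = ConditionMonitor(limit=0.5).indicator(feats)
print(HealthAssessment().assess(ind))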
4.2.1.3 Canaries
The canary or prognostic monitor approach integrates several sacrificial sensors into the life monitoring system that are capable of providing a cumulative record of the life consumption being extracted by the environmental stresses. This is called the canary approach because it is patterned after the role of the caged canary bird that gave the early coal miners warning of impending life threats from excessive carbon monoxide in the air, which was undetectable by the miners. The prognostic monitor cells are fabricated along with the actual circuit on a semiconductor device. The prognostic monitor thus experiences the same
manufacturing process and the same environmental parameters as that of the actual circuit. Hence as
long as the operational parameters are the same, the damage rate is expected to be the same for both
the circuits. By incorporating these monitors as a part of the sub system we ensure that the cells see
the same operational environment as the product from fabrication, to test to operation. It assures that
any parameter that affects the product reliability will also affect the monitor causing its failure. This
approach enables the cells to overcome the limitations of off-line tests, which are often performed to
represent an average expected performance of circuits and have no means to account for the effects of
Prognostic monitors employ accelerated and calibrated stress conditions [5:3] to increase
their rate of degradation relative to the companion functional circuits in which they are incorporated;
thereby assuring the monitor will fail before the circuit. Incorporating the monitors as part of sub-
system circuits insures that they will see the same operational environment as the system components
from fabrication, to test, to burn-in, to operation. This assures that any variation that would affect the
system reliability (e.g. process-induced or installation-induced damage, voltage transients or spikes,
temperature variations, etc.) will also affect the monitor causing its premature failure relative to that
of the circuit. Thus, traditional methods of lifetime prediction based on conditions experienced by the
integrated circuit up to system insertion and offline testing of other components is replaced by a
technique that takes into consideration any impact the actual operating environment may have on
system lifetime.
The purpose of the prognostic cell [5:5] is to predict circuit failure. To achieve this, the
prognostic cell must fail prior to the circuit on the same chip for all realistic operating conditions. The
prognostic cell failure distribution of a particular chip must therefore be linked to the fabrication and
operating conditions in the same way as the circuit failure distribution of that same chip. If the
complete prognostic cell failure distribution lies before the onset of the circuit failure distribution, all
prognostic cells will fail prior to circuit failure. The challenge to building a useful prognostic cell is to
ensure this type of predictive capability without severely reducing useful circuit lifetime. This
4.3 Prognostics
The early research on prognostics has dealt largely with specific applications or case studies.
This is expectedly so, since prognostics as an engineering problem arose from a need to promote
CBM practices for reducing costs incurred during inefficient schedule-based preventive maintenance.
requires reference to the application domain: either to obtain data required for training prognostic
algorithms, or to model the functional relationships based on the physical laws governing the system.
Prognostics is receiving the most attention for systems consisting of mechanical and structural components, where there is an opportunity to change current maintenance practices from schedule-based to condition-based maintenance. Much of the development effort in prognostics has been spurred by the military seeking to change its
maintenance practices; to a lesser extent, the need for prognostics has also been recognized by other
industries relying heavily on mechanical plant equipment such as power turbines, diesel engines, and
other rotating machinery. Unlike electronic or electrical systems, mechanical systems typically fail
slowly as structural failures (faults) in mechanical parts progress slowly to a critical level, and
monitoring the growth of these failures provides an opportunity to assess degradation and re-compute
remaining component life over a period of time. The typical scenario is a slowly evolving structural
failure due to fatigue; the causes of this fatigue are repetitive stresses induced by vibration from
rotating machinery (high cycle fatigue) and stresses due to temperature cycles (low cycle fatigue). In
electronic parts, faults are rather abrupt at the component level. Due to the complexity of geometry-
dependent physical models of mechanical structures, developing failure evolution models for such
applications can quickly become computationally prohibitive or intractable, and recourse is taken to
empirical (data dependent) forecasting models. Thus, developing prognostics for mechanical systems
has become heavily dependent on the availability of reliable historical data, even where some
physical modeling is possible. The selection of the techniques too depends on the nature of the data.
Most efforts are still in their infancy, and therefore results are not easily available in the public
domain. Researchers have reported general approaches [1:3] and the specific methods (neural
networks, wavelets, fuzzy methods, etc.) used in the context of specific machinery or components, but
results pertaining to success with validation efforts are not readily (or publicly) available. One reason
that prevents public reporting of these results could be that the studies are dependent on possibly
proprietary data either supplied by the customer (system owner) or collected at significant cost.
4.3.1 Modeling of the System
Models that relate the physical system to the data observed from it are an important part of both diagnostics and prognostics. As noted in the previous section, several approaches are being researched and applied in the development of
diagnostics/prognostics and health management. Some of the model paradigms [1:4-5] are described
below:
Physical models – Physical models are models founded in the natural laws governing the
system operation or used to design the system, e.g., structural mechanics (properties of materials –
solid, liquid and gas), statics and dynamics of rigid bodies (e.g., finite-element models),
thermodynamics, etc. Physical models are usually developed to explain the normal behavior of a
system to facilitate system engineering, not the failure behavior. Indeed, the failure space of systems
tends to be larger than the normal functional space. Physics-based failure models need to be specially
built – usually one model for each failure mode. Needless to say, intricate knowledge possessed by
domain experts (scientists and engineers) is required, and, hence, such modeling is expensive.
Reliability models – Reliability models include statistical failure models of individual components, and system reliability evaluation using reliability block diagrams. The failure behavior of individual components is typically characterized by probability distribution functions based on the statistical analysis of empirical and laboratory data. When a
system assembled using such individual components is considered, the reliability block diagrams are
used to analyze the overall system reliability using probabilistic and graph-theoretic techniques.
be accounted for. The reliability models are helpful in engineering the overall health management
system by identifying parts of the system in need of health monitoring and diagnostics/prognostics.
Component reliabilities can be used to update periodic maintenance and inspection schedules, and
computed reliabilities of each subassembly, module, etc. can be used to efficiently isolate causes of
anomalies or failures.
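To make the roll-up from component reliabilities to a system-level figure concrete, the short Python sketch below evaluates a toy reliability block diagram; the component names and reliability values are hypothetical and are not taken from this report or its cited sources.

# Minimal reliability block diagram roll-up (illustrative values only).
# Series blocks: all components must work; parallel blocks: at least one must work.

from math import prod

def series_reliability(reliabilities):
    """Reliability of components in series: R = R1 * R2 * ... * Rn."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """Reliability of redundant (parallel) components: R = 1 - (1-R1)(1-R2)...(1-Rn)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

if __name__ == "__main__":
    # Hypothetical subsystem: a pump and a controller in series,
    # feeding a dual-redundant sensor pair.
    pump, controller = 0.97, 0.99
    sensor_pair = parallel_reliability([0.95, 0.95])
    system = series_reliability([pump, controller, sensor_pair])
    print(f"Redundant sensor pair reliability: {sensor_pair:.4f}")
    print(f"Overall system reliability:        {system:.4f}")

Subassembly reliabilities computed in this manner can be ranked to decide where health monitoring and diagnostic effort is best spent, as described above.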
Machine Learning models – Machine learning models are purely data dependent models
and require sufficient amounts of relevant historical training data to be effective. The most prominent
techniques in this class are the neural-network-based techniques. Neural networks are useful for
modeling phenomena that are hard to model using parametric/analytic expressions and equations, but
the downside is that they are hard to validate and, furthermore, do not enhance the basic
understanding of the system/process under study. The learning demonstrated by neural networks is
often impressive, but care is required during generalizations made using them.
Dependency models – A dependency model captures cause-effect relationships. At a lower level of detail, causes can be associated with physical components, effects with failures of components, diagnostic tests, or symptoms, and the relations between causes and effects with physical links between components or directions of energy flow. Furthermore, a priori knowledge about occurrences of component failures or failure modes can be specified and used in the analysis for designing appropriate diagnostics/prognostics (see Figure 5 [1:6]).
4.3.2 Approaches to Implementing Prognostics Capability
For a health management or CBM system to possess prognostics implies [3:4] the ability to predict a future condition. Inherently probabilistic or uncertain in nature, prognostics can be applied to system/component failure modes governed by material condition or by functional loss. Like diagnostic algorithms, prognostic algorithms can be generic in design but specific in terms of
application. A prognostic model must have the ability to predict or forecast the future condition of a
component and/or system of components given the past and current information. Within the health
management system architecture, the prognostic module function is to intelligently utilize diagnostic
results, experienced based information and statistically estimated future conditions to determine the
remaining useful life or failure probability of a component or subsystem. Prognostic reasoners can draw on a variety of information sources. Some of the information that may be required, depending on the type of prognostics approach, includes:
• Failure History
• Current Conditions
• Maintenance History
Examples of prognostics approaches [3:4] and [2:3-5] that have been successfully applied include the following.
1. Experienced-Based Prognostics: When a model of the component is absent and there is an insufficient sensor network to assess condition, an experienced-based prognostic model may be the only alternative. This form of prognostic model is the least complex and requires the failure history or “by-design” recommendations of the component under similar operation. Typically, failure and/or inspection data is compiled from legacy systems and a
Weibull distribution or other statistical distribution is fitted to the data. An example of this type of formulation is shown in Figure 6. The resulting distribution can be used to drive interval-based maintenance practices that can then be updated at regular intervals. An example may be the maintenance scheduling for a low criticality component that has little or no sensed parameters associated with it. In this case, the prognosis of when the component will fail or degrade to an unacceptable condition must be based solely on analysis of past experience. Depending on the maintenance complexity and criticality associated with the component, the prognostics system may be set up for a maintenance interval (i.e., replace every 1000 +/- 20 Effective Operating Hours) and then updated as more data becomes available. Having an automated maintenance database is important for implementing this approach.
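As an illustration of the experienced-based approach, the Python sketch below fits a two-parameter Weibull distribution to hypothetical legacy time-to-failure records and derives an interval-based replacement recommendation; the data values and the B10 (10% unreliability) target are assumptions made for the example, not figures from the cited sources.

# Experienced-based prognostics sketch: fit a Weibull distribution to legacy
# failure data and derive an interval-based replacement recommendation.
# All numbers below are hypothetical.

import numpy as np
from scipy import stats

# Legacy times-to-failure (effective operating hours) from a maintenance database.
times_to_failure = np.array([820, 910, 980, 1010, 1040, 1100, 1150, 1230, 1300, 1380])

# Fit a two-parameter Weibull (location fixed at zero).
shape, loc, scale = stats.weibull_min.fit(times_to_failure, floc=0)

# Choose the replacement interval as the B10 life: the time by which 10% of
# units are expected to have failed.
b10_life = stats.weibull_min.ppf(0.10, shape, loc=loc, scale=scale)

print(f"Weibull shape (beta): {shape:.2f}")
print(f"Weibull scale (eta):  {scale:.1f} hrs")
print(f"Suggested replacement interval (B10 life): {b10_life:.0f} hrs")

As new failure or inspection records accumulate, the fit can simply be re-run, which is the update capability suggested in the figure below.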
[Figure 6: Legacy-based Weibull formulation – legacy data and new in-field inspection results (probability density functions) feed a maintenance-action update capability.]
2. Evolutionary (Trending) Prognostics: This approach relies on gauging the proximity and rate of change of the current component condition (i.e., features) along a path toward functional failure, and it is well suited to conditional failures such as compressor or turbine flow path degradation. Generally, evolutionary prognostics works well for system level degradation because conditional loss is typically the result of the interaction of multiple components functioning improperly as a whole. This approach requires that sufficient sensor information is available to assess the current condition of the system or subsystem
and relative level of uncertainty in this measurement. Furthermore, the parametric conditions that
signify known performance related faults must be identifiable. While a physical model, such as a gas
path analysis or control system simulation, is beneficial, it is not a strict requirement for this technical approach. An alternative to the physical model is built-in “expert” knowledge of the fault condition and its progression.
[Figure 7: Statistical feature shifts over time – features 1–3 are tracked from T0 to T1 and their degradation paths predicted forward.]
3. AI-Based Prognostics: Predicting the fault/failure degradation paths of measured/extracted feature(s) as they progress over time is another commonly utilized prognostic approach. In this approach, neural networks or other AI techniques are trained on features that progress through a failure. In such cases, the probability of failure as defined by some measure of the “ground truth” is required as a priori information, as described earlier. This
“ground truth” information that is used to train the predictive network is usually obtained from
inspection data. Based on the input features and desired output prediction, the network will
automatically adjust its weights and thresholds based on the relationships between the probability of
failure curve and the correlated feature magnitudes. Figure 8 [2:4] shows an example of a neural
network after being trained on some vibration feature data sets. The difference between the neural network output and the “ground truth” probability of failure curve reflects the residual error that remains even after the network parameters have been optimized to minimize it. Once trained, the neural network architecture can be used to intelligently predict the progression of these same features for a different test case.
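A minimal sketch of this idea is shown below, using scikit-learn's MLPRegressor on synthetic vibration-like features; the feature construction and the linear "ground truth" probability-of-failure curve are assumptions made purely for illustration and do not reproduce the data behind Figure 8.

# AI-based prognostics sketch: train a small neural network to map feature
# magnitudes to a "ground truth" probability of failure. Synthetic data only.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Simulate three features that grow as a component progresses toward failure.
n_steps = 200
progress = np.linspace(0.0, 1.0, n_steps)            # 0 = healthy, 1 = failed
features = np.column_stack([
    progress + 0.05 * rng.standard_normal(n_steps),        # e.g. vibration RMS
    progress**2 + 0.05 * rng.standard_normal(n_steps),     # e.g. bearing energy
    0.5 * progress + 0.05 * rng.standard_normal(n_steps),  # e.g. temperature rise
])
prob_of_failure = progress                             # assumed ground-truth curve

model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=5000, random_state=0)
model.fit(features, prob_of_failure)

# Apply the trained network to a new (unseen) feature vector.
new_features = np.array([[0.7, 0.5, 0.35]])
print(f"Predicted probability of failure: {model.predict(new_features)[0]:.2f}")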
[Figure 8: Neural network trained on vibration features (e.g., F2, accelerometer #3) – input features and weights map to time-from-failure prediction groupings (20 to 6.9 hrs).]
4. State Estimator Prognostics: State estimation techniques such as Kalman filters [2:5] or
various other tracking filters can also be implemented as a prognostic technique. In this type of
application, the minimization of error between a model and measurement is used to predict future
feature behavior. Either fixed or adaptable filter gains can be utilized (Kalman is typically adapted,
while Alpha-Beta-Gamma is fixed) within an nth-order state variable vector. For a given measured or extracted feature f, the state vector containing the feature and its first two time derivatives can be written as

x = [ f   ḟ   f̈ ]^T
Then, the state transition equation is used to update these states based upon a model. A
simple Newtonian model of the relationship between the feature position, velocity and acceleration
can be used if constant acceleration is assumed. This simple kinematics equation can be expressed as
follows:
f(n+1) = f(n) + ḟ(n)·t + ½·f̈(n)·t²
where f is again the feature and t is the time period between updates. There is an assumed noise level on the measurements and model related to typical signal-to-noise problems and unmodeled physics. The error covariances associated with the measurement noise vectors are typically developed based on actual noise variances, while the process noise is assumed based on the kinematic
model. In the end, the tracking filter approach is used to track and smooth the features related to the
prediction of a given failure mode progression, and thus, it is used in conjunction with a diagnosis.
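A minimal sketch of the fixed-gain (Alpha-Beta-Gamma) variant of this tracking-filter idea is given below for a single feature, using the constant-acceleration update introduced above; the gain values and the vibration-feature measurements are invented for illustration only.

# State-estimator prognostics sketch: an alpha-beta-gamma tracking filter that
# smooths a degradation feature and extrapolates it forward in time.
# Gains and measurements below are hypothetical.

def abg_filter(measurements, dt, alpha=0.5, beta=0.3, gamma=0.1):
    """Track feature value, its rate, and its acceleration with fixed gains."""
    f, fd, fdd = measurements[0], 0.0, 0.0
    for z in measurements:
        # Predict with the constant-acceleration (Newtonian) model.
        f_pred = f + fd * dt + 0.5 * fdd * dt**2
        fd_pred = fd + fdd * dt
        # Correct with the measurement residual.
        residual = z - f_pred
        f = f_pred + alpha * residual
        fd = fd_pred + beta * residual / dt
        fdd = fdd + gamma * residual / (0.5 * dt**2)
    return f, fd, fdd

if __name__ == "__main__":
    dt = 1.0  # hours between updates
    vibration_feature = [1.00, 1.02, 1.05, 1.11, 1.18, 1.27, 1.39, 1.52]
    f, fd, fdd = abg_filter(vibration_feature, dt)
    # Extrapolate the smoothed feature 5 time steps ahead.
    horizon = 5 * dt
    forecast = f + fd * horizon + 0.5 * fdd * horizon**2
    print(f"Current estimate: {f:.2f}, forecast in {horizon:.0f} hrs: {forecast:.2f}")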
5. Physics-Based Prognostics: A physics-based model is a technically comprehensive modeling approach that has been traditionally used for component failure mode prognostics. It can be used to evaluate the distribution of remaining useful life of a component as a function of the severity of a given fault. The results from such a model can then be used to create a neural network or probabilistic-
based autonomous system for real-time failure prognostic predictions. Other information used as
input to the prognostic model includes diagnostic results, current condition assessment data and
operational profile predictions. This knowledge rich information can be generated from multi-
sensory data fusion combined with in-field experience and maintenance information that can be
obtained from data mining processes. While the failure modes may be unique from component to
component, the physics-based methodology can be applied to many different types of mechanical components and systems.
[Figure 9: Prognostic module inputs – diagnostic results, experienced-based information, and expected future conditions (based on historical conditions) are combined to produce current and future failure predictions.]
Use of physics-of-failure concepts during the design of electronic products can provide
validated, engineering models for root-cause failure mechanisms that can be of great benefit during
the design & evaluation of accelerated reliability tests. The growing popularity of the physics-of-
failure approach to electronics reliability has the potential of improving the effectiveness of
accelerated reliability testing. Two general areas of research [22:1] are needed for accelerated testing
to become more widely used. First, more research is required in the failure-mechanism models area,
especially for probabilistic failure mechanism models. Secondly, additional research is required on
stress margins. The physics-of-failure approach uses knowledge of root-cause failure processes to prevent product failures through robust design and manufacturing practices. In particular, the approach:
• Identifies potential failure mechanisms (the electrical, chemical, mechanical, structural, or thermal processes leading to failure), failure sites, and failure modes (which result from the activation of failure mechanisms, and are usually precipitated by failures at specific sites);
• Uses appropriate failure models and their input parameters, including those associated with material characteristics, geometry at the failure sites, and operating and environmental loads;
• Provides information to plan tests and screens, and to determine electrical and thermo-mechanical stress margins;
• Uses generic failure models that are as effective for new materials and structures as for existing ones; and
• Encourages innovative, cost effective design through the use of realistic reliability assessment.
A central feature of the physics of failure approach [22:4] is that reliability modeling (i.e., modeling of time to failure) is based on root-cause failure mechanisms. These models explicitly address the design parameters that have been found to influence hardware reliability strongly, including material properties, defects, and electrical, chemical, thermal and mechanical stresses. The goal is to keep the modeling, in a particular application, as simple as feasible without losing the cause-effect relationships that drive the dominant failure mechanisms.
Many accelerated tests have been conducted without understanding which failure
mechanisms are being accelerated, how they are accelerated, or if the failure mechanisms occur under
usage conditions [22:5]. Understanding these factors is essential for conducting successful accelerated
life tests. Physics of failure aids in determining which failure mechanisms and sites to accelerate, which stress(es) to accelerate, the most effective way to apply the stresses, and the approximate acceleration factors involved.
Many accelerated tests (e.g. MIL-STD-883 and MIL-STD-810 tests) do not require the failure
mechanisms to be identified. If the life at the usage condition is desired, the failure mechanisms must
be identified. Failure mechanisms can occur at many sites, which must also be identified. Numerous
stresses can act on electronic equipment, but there are usually one or two stresses that correspond to a
particular failure mechanism. These stresses should be used to accelerate the failure mechanisms.
Failure mechanisms that are dormant under usage conditions may begin to become dominant at the
accelerated stress conditions. This phenomenon is called failure mechanism shifting [22:5], but it can be avoided by performing a physics of failure analysis to determine stress limits that reduce the life of the dominant failure mechanism sufficiently without introducing new failure mechanisms.
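By way of illustration only (neither this report nor the cited source prescribes a specific model here), a temperature-activated failure mechanism is commonly scaled between test and usage conditions with an Arrhenius acceleration factor; the activation energy, temperatures, and test life below are assumed values.

# Illustrative Arrhenius acceleration-factor calculation for a temperature-
# activated failure mechanism. Activation energy and temperatures are assumed.

import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_af(ea_ev, t_use_c, t_test_c):
    """Acceleration factor between usage and accelerated-test temperatures."""
    t_use = t_use_c + 273.15
    t_test = t_test_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_test))

if __name__ == "__main__":
    af = arrhenius_af(ea_ev=0.7, t_use_c=55.0, t_test_c=125.0)
    test_hours_to_failure = 500.0
    print(f"Acceleration factor: {af:.1f}")
    print(f"Estimated life at usage conditions: {test_hours_to_failure * af:,.0f} hrs")

Keeping the accelerated stress low enough that such a factor remains meaningful, without activating new mechanisms, is exactly the stress-limit analysis described above.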
One of the keys to successful acceleration modeling is the explicit treatment of failure mechanisms within devices, since failure mechanisms can have different life distributions and acceleration models. Consider, for example, a device with three dominant failure mechanisms acting at associated failure sites within the device. While the time to
failure of each of the failure mechanisms may be influenced by changes in several parameters,
including loads, geometries, material properties, defect magnitudes and stresses, for the purposes of
this explanation, focus will be placed on the impact of stress on time-to-failure. Figure 10 depicts the relationship between the time-to-failure distribution of each of the three failure mechanisms and the applied stress.
[Figure 10: Life versus stress relationships for failure mechanisms 1, 2 and 3.]
The stress dependencies of microelectronic failure mechanisms are known to vary considerably. The relationship between time-to-failure and stress may be elusive unless failure mechanisms receive explicit treatment. Addressing dominant failure mechanisms and sites directly provides critical reliability insight pertaining to:
• Determination of the dominant failure mechanisms and sites that are the weakest links in the product; and
• Assessment of product reliability under intended usage conditions.
Without this insight, the information needed to assess and build reliability into electronic products, and to compress reliability test time, is not available. In order that accelerated reliability testing be practical and successful, identification and modeling of the dominant product failure mechanisms and sites must be available from the reliability evaluation [22:7]. The general procedure for planning and conducting such a test involves the following steps.
1. The potential failure mechanisms, sites and modes of the product are determined. The analysis should be based on current failure-mechanism models and good engineering judgment. The analysis should also identify the dominant failure mechanism, what stress(es) accelerate the failure mechanism, what parameters influence the failure mechanism, and the appropriate failure model.
2. The failure mechanisms to accelerate during the test are selected. The electronic circuit card or microelectronic package under test will have numerous failure mechanisms, and the dominant mechanisms should be selected for acceleration.
3. The various stresses and their limits must be studied for each failure mechanism under
consideration. Stresses affect different failure mechanisms differently and therefore must be selected so they
affect the appropriate failure mechanisms and do not introduce extraneous failure mechanisms.
4. The appropriate stress parameters to be accelerated and the magnitude of the stress are
determined. Based on the physics-of-failure analysis, the stress magnitude must be selected so that
failure mechanisms dominant under usage stress are also dominant under the accelerated stress.
5. The type of accelerated life test is determined (i.e., constant load acceleration or step-stress acceleration).
6. The test is performed. During the test, root-cause failure analysis should be performed to confirm which failure mechanisms are actually being accelerated.
7. The test data are interpreted, which includes extrapolating the accelerated test results to normal operating conditions. Failure modes and mechanisms should be examined to determine if they match those expected under usage conditions.
8. If more than one failure mechanism is considered, the competing risk model can be used to
combine the time-to-failure distributions of the failure mechanisms to form a composite distribution.
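The competing-risk combination in step 8 can be sketched as follows. Assuming, purely for illustration, that each mechanism's time to failure is Weibull distributed and the mechanisms are independent, the composite reliability is the product of the individual survival functions; the parameters below are hypothetical.

# Competing-risk sketch: combine the time-to-failure distributions of two
# independent failure mechanisms into a composite distribution.
# Weibull parameters below are hypothetical.

import numpy as np
from scipy import stats

t = np.linspace(1, 20000, 400)  # hours

# Mechanism 1 (e.g. a wear-out process) and mechanism 2 (e.g. a fatigue process).
mech1 = stats.weibull_min(c=3.0, scale=8000.0)
mech2 = stats.weibull_min(c=1.5, scale=12000.0)

# Composite reliability: the part survives only if both mechanisms survive.
r_composite = mech1.sf(t) * mech2.sf(t)
f_composite = 1.0 - r_composite  # composite probability of failure by time t

# Time by which roughly 10% of units fail from either mechanism.
t_b10 = t[np.searchsorted(f_composite, 0.10)]
print(f"Composite B10 life: about {t_b10:.0f} hrs")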
Difficulties associated with evaluating accelerated reliability tests have limited their application and acceptance [22:8]. These difficulties can be traced, in part, to a lack of information concerning the dominant failure mechanisms and sites from reliability evaluations.
Difficult issues associated with the evaluation of accelerated reliability tests include the
following:
• Determination of the failure mechanisms, sites and modes that will be dominant under accelerated and usage conditions;
• Assurance that the dominant failure mechanisms are, or are not, likely to occur under intended usage conditions; and
• Assessment of product reliability under intended usage conditions from the failure mechanism models.
4.4 Monitoring
Monitoring of equipment, with appropriate analytical techniques applied to the data collected,
provides an important window into the health of the machinery. It further provides guidance for
informed operation and maintenance of the equipment. The General Electric Research and
Development Center [20:1] has focused on an expanded role of equipment monitoring that applies
analytical techniques to the collected data, incorporating component design information and specific operating experience, resulting in information that aids in decision making and operational planning. This work has been applied at varying levels to medical imaging equipment, locomotives, and power generation equipment.
Technical Approach:
A five-step approach [20:2] has been developed to evaluate machine condition and translate it into information that supports operating and maintenance decisions.
The first step is collecting equipment data, which includes sensor information as well as
control system events and plant operation activities. Selection and location of sensors requires an
understanding of machine dynamics. Also, sensors must have accuracy commensurate with the
analysis objectives and must be sampled at a frequency suitable for those analyses.
Diagnosis can be performed by comparing the calculated performance to a baseline condition and to the detailed design performance expected. For mechanical diagnosis, baseline vibration signatures are often taken when the equipment is installed and after each major overhaul, to be used for this comparison in order to determine whether significant changes have taken place. Baseline information is therefore an essential input. Another important aspect of equipment monitoring is the identification of faulty sensors. Data that is suspect must be identified so it is not used to generate incorrect parameters, which would lead to erroneous conclusions.
Determining the presence and strength of problem symptoms involves the use of advanced
signal processing and feature extraction techniques to resolve small but significant signals in the
presence of the high-noise environment of most operating situations. Merging advanced diagnostic and signal processing techniques with machine-specific knowledge creates the ability to reliably detect and characterize problem symptoms.
4. Identifying Causes of Symptoms
Identification of the root cause of problems is essential for improving performance and
availability and represents the most significant engineering challenge. Simply observing that, for
example, bearing vibration levels are excessive is not enough. It is necessary to find that the cause of
a vibration problem is misalignment and not mass unbalance. Without careful diagnosis of the root
cause of a detected problem, it is easy to attempt remedies that (a) relieve symptoms temporarily
rather than effect a cure for the disease or (b) unnecessarily extend the duration of a maintenance event.
A wide range of corrective actions is available to remedy problems. For especially severe
problems, it may be important to change the operating state of the machine in order to avert additional
damage. For less severe problems, operating limitations may be placed on the unit to allow continued
operation until the next scheduled outage. Corrective actions often include the scheduling of
inspection and maintenance procedures for repairs. The long-term goal is to develop optimized
operational recommendations that can balance maintenance requirements for remedying diagnosed problems against operational demands.
The in-situ sensor approach [4:4] can be described by referring to the idealized bathtub
failure rate curves. Semiconductor reliability is often represented as an idealized bathtub curve, which
can be divided into three regions, i.e., infant mortality region, useful life region and the wear out
region (see Figure 11). The infant mortality region begins at time zero and is characterized by a
high but rapidly decreasing failure rate. Most failures in this region result from defects caused in the
material structure during the manufacturing, handling, or assembly. These defects may take the form
of missing metal from the side of a thin film conductor, internal cracks, and foreign inclusions. After
the infant mortality region the failure rate decreases to a lower value and remains almost constant for
a long period of time. This long period of an almost constant failure rate is known as the useful life
period. Ultimately the failure rate begins to increase as materials start degrading and wear out failures
occur at an increasing rate. This is a result of continuously increasing damage accumulation in the
product. This region in the bathtub curve is commonly referred to as the wear-out or end-of-life period. Figure 11 also shows two idealized bathtub curves, one for the actual circuit and the other for the prognostic cell.
Figure 11: Idealized bathtub reliability curves for the test circuit and the prognostic cell
[4]
In the infant mortality region failures are often due to defects introduced during
manufacturing, handling, and storage. Because the cells are designed to be a part of the actual chip,
defects introduced in the product affect the prognostic cells in the same way as they affect the actual circuitry. As a result, infant mortality can also be expected for the cells. The infant mortality effect can be increased in the prognostic cells by intentionally adding defects to them. This process, known as error seeding, introduces new defects, which can interact with the defects caused during manufacturing, handling and storage. The combined effect of both types of defects makes the chip fail during functional testing.
Because of the accelerated failure mechanisms, the prognostic cells have higher failure rates than the actual circuit for the entire life period. Further, since the failure mechanisms are accelerated in the prognostic cells [4:5], the wear-out region occurs earlier than that of the actual circuit, as also depicted in Figure 11. To predict the end-of-life period, the prognostic cells must fail prior to the actual circuit failure. As the failures of both circuits take the form of distributions, the failure distribution of the prognostic cells on a particular chip must lie before the failure distribution of the actual circuit. If the complete prognostic cell failure distribution lies before the onset of the circuit failure distribution, then all prognostic cells will fail prior to the circuit failure, thereby predicting failure of the actual circuit. In other words, the failure points of the prognostic cells must be calibrated in such a way that their failures occur before the actual circuit wear-out region.
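A simple Monte Carlo check of this calibration requirement is sketched below; the two wear-out distributions (the prognostic cell accelerated relative to the host circuit) are hypothetical and serve only to show how the probability of the cell failing first, and the resulting warning lead time, could be estimated.

# Prognostic-cell calibration sketch: estimate the probability that the
# (accelerated) prognostic cell fails before the actual circuit.
# Distribution parameters are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
n_trials = 100_000

# Wear-out times (hours): the prognostic cell is accelerated, so its scale is smaller.
cell_life = rng.weibull(4.0, n_trials) * 30_000     # prognostic cell
circuit_life = rng.weibull(4.0, n_trials) * 60_000  # actual circuit

cell_fails_first = cell_life < circuit_life
warning_fraction = np.mean(cell_fails_first)
lead_time = np.mean((circuit_life - cell_life)[cell_fails_first])

print(f"Fraction of units where the cell fails first: {warning_fraction:.3f}")
print(f"Average warning lead time when it does:       {lead_time:,.0f} hrs")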
4.4.3 Maintenance
The maintenance infrastructure for large systems (such as weapon or sensor system) is
typically structured as a hierarchical organization for effective management of resources and logistics.
The applicability of prognostics, just as the applicability of diagnostics, is, thus, tied to the
maintenance activity and requirements at the respective hierarchical layers. For example, the
maintenance activity of the defense forces is usually organized into three levels: the Organization
level (O-level or, in the Army’s case, Unit-level) is the lowest-level activity and functions in the
mission (system operation) environment; Intermediate-level (I-level) is the next higher-level, and
Depot-level (D-level) is the highest level and supports off-line functions. The main purpose of
prognostics is to anticipate and prevent critical failures and the subsequent need for corrective
maintenance [1:4]. At the O-level, preventive maintenance consists of scheduled inspections, LRU
replacement, and on-system (e.g., on-aircraft or flight-line) servicing. The advantage of prognostics
would be realized in an O-level maintenance activity if the schedules for preventive maintenance can
be updated for individual pieces of equipment and their components based on operational data and
future usage. This implies predicting time to failure of Line Replaceable Units (LRU’s), as well as the
impact of such failure on the system. For the entire system, times-to-failure need to be tracked at the
LRU and system levels. The utility of prognostics at the I- and D-levels is lower than at the O-level.
The I-level activity is primarily concerned with scheduled maintenance and repairing/servicing
LRU’s that can be serviced without having to send parts to the Depot. So, prognostics can be used to
predict parts requirements and thus support lower inventory requirements. The D-level maintenance is
concerned with specialized repairs or overhauls of LRUs as well as with inspection, servicing and
repair/replacement of Shop Replaceable Units (SRU’s). Thus, prognostics have minimal applicability
to the D-level. The types of prognostic techniques depend on the level of complexity of the
subsystem, assembly, or component. The prognostic techniques at the individual parts level (e.g.,
SRU) tend to be physics based or based on reliability model updates. The prognostic techniques at
the sub-system or assembly levels (e.g., LRU) tend to be less dependent on physical models and more dependent on data-driven, empirical models. A complete condition-monitoring program encompasses failure analysis, on-line diagnostics, diagnostic data interpretation, management and communication, follow-up corrective actions and, lastly, program maintenance. One of the difficult areas in the process is identifying the root and contributing causes of failures [16:1], and selecting the appropriate on-line diagnostic tools to
address the correct failure contributors. Figure 12 depicts the decision points for both Preventive and
Corrective maintenance.
[Figure content: Maintenance divides into Preventive maintenance (scheduled, continuous, or on request) and Corrective maintenance (scheduled, deferred, or immediate).]
Figure 12: Decision Tree for Pre and Post Failure Maintenance Actions
Both condition assessment and CBM are concepts involving the application of new
technologies and techniques of equipment diagnostics while the equipment remains in full operation.
While some terminology may imply relatively new techniques, it should be borne in mind that the idea
of condition-based maintenance has been around for many years. As an example, a thermal replica,
used in many temperature monitoring and protective devices, addresses one of the most important condition parameters: operating temperature.
The benefits of CBM programs lie in the elimination of many time-based maintenance tasks,
in exchange for maintenance tasks deemed necessary due to the actual condition of the equipment.
While the specific condition is always monitored during normal operation, its evaluation serves to
better manage the life and therefore the reliability of a specific asset. The corrective actions may take
various forms, such as changes to the equipment operating regime or specific discrete maintenance actions.
The CBM approach constitutes a dramatic qualitative leap in managing the equipment
reliability compared to the conventional off-line diagnostics, where the condition of the equipment
often remains unknown until an outage is underway. It follows that the condition-based maintenance
approach offers reduction in the equipment downtime, improvement in the equipment reliability and
dramatic reduction of the asset operating costs. Another advantage is the deferral of planned maintenance outages.
CBM adds two enormously important dimensions to classical predictive maintenance. First,
CBM deals with the entire system as an entity. This holistic approach to maintenance represents a
major shift from the piecemeal methodologies of the past. While CBM can still be implemented “one
step at a time,” it realizes its greatest potential when applied consistently and evenly across the entire
range of system maintenance concepts. The second added dimension is the concept of ignoring or
extending maintenance intervals. Predictive Maintenance (PDM) trending techniques have been used
historically to confirm maintenance decisions that would previously have been based on expert
opinions. While this approach may often find problems not otherwise identifiable, it does little toward
reducing the cost of classical preventive maintenance programs. In fact, because of the additional
analysis required, PDM may actually increase day-to-day costs slightly for some installations. CBM
on the other hand, because of its systemic approach, usually decreases long term maintenance costs.
Consider Figure 13 [17:4] for example. After all of the various criteria are entered into the CBM
model [17:4], and the analysis is performed, the results can cause the maintenance interval to be
decreased, maintained or increased. In other words, there is an actual possibility that maintenance intervals may be extended relative to traditional time-based maintenance methods.
[Figure 13: Inputs to the CBM model – PDM techniques for all equipment, subjective criteria (environment, etc.), economic factors and risk assessment; equipment found out of norms has its interval decreased.]
In the past, maintenance results from any given interval were reviewed and used only to confirm or adjust the next scheduled interval.
Trending and statistical analysis are the fundamental building blocks of CBM. Comparing absolute data values and, perhaps more importantly, comparing data deviations via statistical analysis provides information never before available. Obviously, a statistically relevant database is required.
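As a sketch of what such trending and statistical comparison can look like, the fragment below flags readings that deviate from a baseline population by more than a chosen number of standard deviations; the baseline values and the three-sigma threshold are assumptions made for illustration.

# Trending sketch: compare new readings against a baseline population and flag
# statistically significant deviations. Baseline data and threshold are assumed.

import numpy as np

baseline = np.array([2.1, 2.0, 2.2, 1.9, 2.1, 2.0, 2.05, 2.15])  # e.g. vibration (mm/s)
mu, sigma = baseline.mean(), baseline.std(ddof=1)

new_readings = np.array([2.1, 2.3, 2.6, 3.0])
z_scores = (new_readings - mu) / sigma

for reading, z in zip(new_readings, z_scores):
    status = "OUT OF NORM" if abs(z) > 3.0 else "ok"
    print(f"reading={reading:.2f}  z={z:+.1f}  {status}")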
Maintenance management software (MMS) [17:6] has been available for many years. Such
software may vary slightly from one manufacturer to another, but the basic purpose and design are
similar from one package to another. Fundamental equipment information is stored — usually in a
detailed manner. Information such as size, date of purchase, ratings, cost, maintenance cycle, and
equipment specific notes are all maintained. Most MMS packages will even print out work orders
when calendar based preventive maintenance schedules dictate. Few, if any, of these packages will
store maintenance results, and none will perform the trending or statistical analysis required for full condition-based maintenance.
The aim of prognostics is to anticipate component failure as far into the future as possible. Routinely a pending failure is discovered, but with minimal advance warning. The approach to planning maintenance varies between systems or plants and generally takes one of three forms: 1) corrective (run-to-failure) maintenance on low-cost, non-critical equipment; 2) predictive maintenance on higher cost critical equipment, with potentially inaccurate decisions; or 3) proactive maintenance based on regular inspection and experience.
The goal is to minimize the number of unplanned repairs and to be able to recommend a
maintenance schedule that does not needlessly over inspect or wastefully replace components that still
have useful remaining life. A proactive maintenance process includes regular inspections and
component replacements based upon industry standards and the experience unique to the application
in a particular system. An asset management database [18:2] of the equipment is used to manage the
maintenance schedule. This historical database maintains a record of all the service actions noted
above. The responsible party can monitor and confirm that predictive maintenance inspections are
being accomplished for the critical equipment and view the results. For example, motors that exhibit
increased winding resistance or elevated vibration levels are judged to be in a deteriorating condition.
Historical data then comprises a record of the time intervals between failures and replacement or
refurbishment of components or equipment. Best practice is to produce weekly summary reports that
are reviewed by the plant engineer so that trends may be discovered and changes in the maintenance
schedule determined.
This process is much improved over a paper based system but still requires time and effort to
be used effectively to identify developing failures. One could envision bringing data from continuous
monitoring of the critical motors to help identify developing deterioration, in addition to the current practice, which depends solely upon historical service data to determine component failure rates at a plant site. We desire to extend the approach to improve the Weibull failure rates through real-time sensing of stress factors that affect the health of the motor components. This is an area of active research and development.
The goal is more than just reacting quickly to customer requests, problems, or failures; it is to anticipate developing failure conditions. The concept of real-time data mining brings together the means to exploit data to maximum effect, supporting efficient discovery that allows planning of the most effective action in the face of unpredictability and emerging events. Service businesses offering real-time technologies must provide more than just sensing networks; as part of the data stream they must include embedded data fusion, cleaning, and data mining services that are in part based on knowledge management integrated with proprietary analytical technologies. These innovative business solutions must incorporate key technologies such as sensing and acquisition engines, dynamic data assimilation (preprocessing, fusion and cleaning engines), analytic engines, modeling engines, knowledge engines, learning engines, inference engines and visualization dashboards. These engines facilitate more effective plant decision-making, reducing operations and maintenance costs.
Motor prediction in industrial plant operations is a good example. Situation analysis must
project the possible alternative failures through use of both historical and streaming input data sets in
conjunction with knowledge representation. Motor data is acquired continuously, even after the
analysis starts. As real data becomes available about a particular motor, the situation analysis updates
the local model and knowledge representation. As the effects of deterioration are identified through
monitoring for stress factors, then the possible alternative failures are recomputed. In effect, the
model [18:2] computes in parallel with the actual motor operation so as to determine the most likely developing failures and the remaining useful life of the motor.
Use of a Weibull analysis tool, in conjunction with continuous monitoring that operates Bayesian networks to better identify deteriorating motor health, helps to improve two activities: 1) maintenance planning that tries to optimize when and what inspections are accomplished, and 2) identification of developing component failures before they become functional failures.
Bayesian networks (BN’s) may be used to represent dependency structures among motor life
[18:4] and sensed factors affecting motor health. Today these are most often used as an after-the-fact
diagnostic tool as opposed to an anticipatory prognostic tool. Next generation motor prognostic tools
should integrate sensor data with Weibull based structural equation models as derived from BN
skeletons. By this means, dynamic (i.e., time-evolving), interactive and coupled processes, represented by directed graphs, can yield solutions to these complex, interacting behavioral problems. These knowledge-driven engines represent a first step toward machine learning in real time in the plant.
The root nodes of the network correspond to stress factors, with the leaf nodes of the tree corresponding to symptoms that can be sensed. Probabilities are used to distinguish between competing root causes when there is uncertainty. A BN derives all the implications of the beliefs that are input to it; some of these will be facts that can be checked against observations, or simply against the experience of the engineers. The power of BNs comes through application of the cause and effect rules along with the Bayesian probabilities to propagate an assessment consistently. Now both the on-line, continuous data and the walk-around inspection data are used as evidence to identify, in the face of uncertainty, developing component failures.
Development of a BN starts by using a Failure Modes and Effects Analysis (FMEA). Using
prior observations combined with current knowledge gives an approach to dealing with uncertainty in
maintenance intervention decisions and addressing the problems surrounding uncertainty. Error
ranges for uncertainty in the data must be created and analyzed during operations. Data assimilation studies can employ Monte-Carlo methods in order to determine sensitivity, analytic effectiveness, data cleaning, and data filtering requirements.
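The fragment below sketches the kind of reasoning such a network performs, using a deliberately tiny two-cause, two-symptom structure with invented probabilities; real networks of this type are derived from FMEA data, as noted above, and would contain many more nodes.

# Tiny Bayesian-network style sketch: two candidate root causes (stress factors)
# and two sensed symptoms. All probabilities are invented for illustration.

from itertools import product

# Prior probabilities of the root causes being present.
p_misalignment = 0.05
p_bearing_wear = 0.10

# P(symptom observed | cause states), assuming symptoms depend only on the causes.
def p_high_vibration(misalign, wear):
    return 0.95 if (misalign or wear) else 0.05

def p_high_current(misalign, wear):
    return 0.80 if misalign else 0.10

# Joint probability of a cause combination together with the observed evidence
# (both symptoms present).
def joint(misalign, wear):
    prior = (p_misalignment if misalign else 1 - p_misalignment) * \
            (p_bearing_wear if wear else 1 - p_bearing_wear)
    return prior * p_high_vibration(misalign, wear) * p_high_current(misalign, wear)

total = sum(joint(m, w) for m, w in product([True, False], repeat=2))
p_misalign_given_evidence = sum(joint(True, w) for w in [True, False]) / total
p_wear_given_evidence = sum(joint(m, True) for m in [True, False]) / total

print(f"P(misalignment | evidence) = {p_misalign_given_evidence:.2f}")
print(f"P(bearing wear | evidence) = {p_wear_given_evidence:.2f}")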
[Figure content: example network nodes include high vibration, high current, high negative-sequence current, ambient temperature, and torque pulses.]
Figure 14: Example of a Bayesian Network for Monitoring Motor Stress Factors [18]
4.4.4 Knowledge Database
Although seemingly trivial in nature, the accumulation of failure data and/or the capture of sensed data is becoming a science of its own. There is almost as much research in this field as there is in the study of prognostics and diagnostics. This is extremely important because the capture of specific pieces of data pre- and post-failure is essential to develop or update the diagnostic and prognostic models, to maintain the history of a system, and to better predict what has failed so that the initiation of repair efforts can begin. In classical terms, the initial “fault detection” must occur on a system and then the process of “fault isolation” can begin. There are two specific areas of study in this field.
The first is knowledge fusion [26:6], which is the co-ordination of individual data reports
from a variety of sensors. It is higher level than pure ‘data fusion,’ which generally seeks to correlate
common-platform data. Knowledge fusion, for example, seeks to integrate reports from acoustic,
vibration, oil analysis, and other sources, and eventually to incorporate trend data, histories, and other contextual information.
The second area is termed data mining [27:3], which is the nontrivial process of identifying
valid, novel, potentially useful, and ultimately understandable patterns in data. Within a large section
of the data mining research community, data mining has been identified with the multi-disciplinary
field called Knowledge Discovery in Databases (KDD), in which data mining is just one step in a
series of steps starting with data preparation and ending with presentation, evaluation and utilization
of mined results. The first steps in a KDD process, such as data cleaning, data integration, and data
selection, can often be performed using basic database manipulation and queries, followed by
additional high-level analysis using On-line Analytical Processing (OLAP) techniques. This initial
stage is then followed by the application of machine learning techniques to actually mine patterns in
the data. A number of data mining approaches have been studied and evaluated. These include several classification and prediction techniques such as learning of decision rules or decision trees, neural networks, and data clustering. Since a large part of the current maintenance data consists of nominal (categorical) attributes, techniques that handle such data well are of particular interest.
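A minimal sketch of the mining step on nominal (categorical) maintenance records is shown below, using one-hot style encoding and a decision tree; the records, attribute names and root causes are fabricated purely to illustrate the pipeline.

# Data-mining sketch: learn decision rules from nominal maintenance records.
# The records below are fabricated for illustration.

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

records = [
    {"component": "pump",  "symptom": "vibration", "duty": "continuous"},
    {"component": "pump",  "symptom": "leak",      "duty": "intermittent"},
    {"component": "motor", "symptom": "vibration", "duty": "continuous"},
    {"component": "motor", "symptom": "overheat",  "duty": "continuous"},
    {"component": "valve", "symptom": "leak",      "duty": "intermittent"},
    {"component": "motor", "symptom": "overheat",  "duty": "intermittent"},
]
root_cause = ["misalignment", "seal wear", "misalignment",
              "winding fault", "seal wear", "winding fault"]

# Encode the nominal attributes as binary indicator features.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)

# Learn and print human-readable decision rules.
tree = DecisionTreeClassifier(random_state=0).fit(X, root_cause)
print(export_text(tree, feature_names=vectorizer.get_feature_names_out().tolist()))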
The essence of these efforts is that continuous updates and improvements based on actual failure and repair data are validated and stored so that, when certain triggers or limits are reached within a sensor or set of sensors, the diagnostic process sends an alarm or warning to the operator or to a repair or maintenance facility so that the appropriate action can take place: either repairing the failure or, in the case of prognostics, performing a maintenance action prior to failure.
4.4.5 Applicability to the Aviation Industry
Because the airline industry has always struggled with the massive overhead of maintaining and supporting the aircraft in its fleets, the development of more efficient designs has always been a priority in the interactions between the airlines and the major aircraft manufacturers. One of the tradeoffs that has occurred is the ability to
extend the range that twin-engine aircraft can operate over water. This is an extensive and costly
evolution in the design process that is monitored very closely by the governing bodies of commercial
air travel. The other factor is that the engine manufacturers are under extreme pressure to design these engines for greater and greater reliability. This is the major reason that the primary non-mechanical portion of the engine, the fuel control system, is normally a triple-redundant design.
With the advent and study of more elaborate prognostics, various prognostics and health
monitoring technologies have been developed [3:1] that aid in the detection and classification of
developing system faults. However, these technologies have traditionally focused on fault detection
and isolation within an individual subsystem. Health management system developers are just
beginning to address the concepts of prognostics and the integration of anomaly, diagnostic and
prognostic technologies across subsystems and systems. Hence, the ability to detect and isolate
impending faults or to predict the future condition of a component or subsystem based on its current
diagnostic state and available operating data continues to be a high priority research topic not only for
the airline industry, but for many areas where a failure occurring could have catastrophic
consequences.
In general, health management technologies will observe features associated with anomalous
system behavior and relate these features to useful information about the system’s condition. In the
case of prognostics, this information relates to the condition at some future time. Inherently probabilistic or uncertain in nature, prognostics can be applied to system/component failure modes governed by material condition or by functional loss. Like diagnostic algorithms, prognostic algorithms can be generic in design but specific in terms of application. Various approaches to prognostics have been developed that range in fidelity from simple historical failure rate models to detailed physics-based models.
The ability to monitor equipment in the medical industry similar to what is done in the
aviation industry has also been investigated and implemented to a large extent. Having the capability
to evaluate the health of critical and expensive medical testing equipment on a real time basis has
been instrumental in reducing the unavailability of this equipment due to unforeseen failures and the associated repair costs.
For specific medical imaging equipment manufactured by General Electric [21:1] the service
organization is responsible for a significant portion of total revenue for a complex medical imaging
system. The traditional service delivery model developed for this purpose is highly field engineer
centric with an emphasis on building field engineer to customer relationships. In order to maximize
the service revenue, the service delivery model needs to become highly service centric with an
emphasis on driving proactive smart diagnosis. Field engineers can evolve to become more service
centric by increasing their productivity, allowing them to spend less time commuting to customer
sites and service more customers per day. Technologies such as remote diagnostics, virtual
monitoring of machines, improved communication, and efficient online parts ordering have contributed to this shift in the service delivery paradigm. However, field engineers are still, on average, only able to service one major customer call per day, and they still respond in a reactive mode to notifications of machine failures.
Multiple site visits and wrong part orders are a result of inaccurate diagnosis, which leads to downtime for customers and higher service costs. Poor diagnostics, coupled with an ad-hoc mix of indicators designed into the system during the development stage of a program and a trial-and-error approach to diagnosis, lead to multiple site visits by field engineers. Accurate and consistent diagnosis is therefore essential.
Because of these issues, General Electric has done extensive research in the field of
serviceability of medical testing equipment that is in the hands of their customers. The incredibly
high non-recurring cost of purchasing this equipment makes it imperative that the time the equipment
is down for maintenance is minimized. In order to accomplish this, a Design for Serviceability (DFS) tool was developed to support diagnosis and repair of the equipment. Figure 16 [21:7] is a depiction of the three main components of the DFS tool.
[Figure 16: DFS tool overview – indications from system events and operator observations feed the authoring, diagnosing (most likely failure) and reporting components, with field feedback on recommendations and final actions.]
The authoring component captures the domain knowledge necessary to create a Constrained Bayesian Network. The analysis component consists of a decision
engine that updates the probabilities of failure modes given a set of known observations. The
reporting component provides design engineers with a report on serviceability during new product
development. The report enables design engineers to make intelligent design decisions based on
expected payback and resources available. Once the design engineer is satisfied with the results from
the report, domain experts validate the created Bayesian model before releasing it for use by the
Diagnosing application. Service feedback is the process of bringing the performance of the causality engine back to design engineers so that the information may be used to improve diagnosis and the underlying Bayesian model.
The DFS Tool [21:7] could be integrated into a Customer’s current new product development
process. Ideally, field engineers will use the Diagnosing Tool before visiting a customer site so that
parts may be ordered beforehand and the problem can be fixed in a single trip.
A Bayesian network is a technology used for handling uncertainty and has been applied to
diagnosis of mechanical and electrical systems. One reason is the ability to capture heuristic
knowledge [21:i] when no failure data is available but expert opinion is abundant, which is the case
during new product development. Online engineers and field engineers could use the Bayesian
serviceability tool to diagnose failures given observations from diverse sources and provide service
feedback for improving the diagnostic accuracy of the Bayesian model. A diagnostic model strategy selects a model for diagnosis given a set of indicators, and a recommendation strategy presents the field engineer with a customized service recommendation. Two authoring alternatives allow engineers to construct the diagnostic models.
The ability to generate a service recommendation from the observations for any diagnostic
model is essential to increasing the ability to fix a problem right the first time, which directly impacts
field engineer productivity. The Design For Six Sigma methodology is followed to measure the tool
capability in meeting this customer requirement. Unit testing is employed to generate opportunities
for measurement. Based on the amount of testing, the capability metric for the tool to generate a
service recommendation is 3.51 sigma with 19,610 defects per million opportunities [21:iv].
Another application that can utilize this technology is the remote monitoring of elderly persons who live alone in their own homes. Applying sensors in certain areas of the home and utilizing adaptive modeling makes it possible to associate sensor events with daily activities. An experiment was carried out in this regard by General Electric’s Global Research organization [19:2]. Figure 17 [19:1] shows a simplified model of the caregiver monitoring system.
Figure 17: Simplified Model of Caregiver Monitoring System [19]
The initial results of this experiment have been positive, and the ability to reduce the stress on
caregivers has been one of the most positive aspects of this research. As seen in Table 1 [19:4], this research was based on inputs from a selection of twenty-one caregivers who reported what gave them the most stress on a daily basis in caring for elderly individuals who live alone.
CHAPTER V
SUMMARY AND CONCLUSIONS
There are inherent design requirements that need to be identified and understood prior to fully comprehending how to facilitate the incorporation of diagnostics and prognostics into a system design. Within the health management architecture, the prognostic function must intelligently utilize diagnostic results, experienced-based information and statistically estimated future conditions to determine the remaining useful life or failure probability of a component, module or system. A prognostic model must have the ability to predict or forecast the future condition of a component and/or system of components given the past and current information.
Data availability, dominant failure or degradation mode of interest, modeling and system
knowledge, accuracies required and criticality of the application are some of the variables that
determine the choice of prognostic approach. The ability to predict the time to conditional or mechanical failure (on a real-time basis) is of enormous benefit, and health management systems that can effectively implement the capabilities presented herein offer a great opportunity in terms of reducing the overall Life Cycle Costs (LCC) of operating systems as well as decreasing the logistics footprint required to support them.
Recent proposal activity with the U.S. Army has dictated that competence be gained in the
application of prognostics for military systems either currently or in near term development. These
requirements are being backed with funding to incorporate these features. In order to accomplish this,
the following main ideas must be thoroughly thought out and understood prior to and during the early stages of the design process:
• A thorough understanding of failure modes and testability is the precursor for effective diagnostics and prognostics.
• Failure, repair and sensed data must be captured and included in a knowledge database that is built upon during the life of the system for continuous improvement of the diagnostic and prognostic models.
As has been proven time and time again in commercial industry, these techniques can be applied to designs with great success, yielding increased customer satisfaction, lower life cycle cost and, most importantly, improved system availability.
Table 2: Acronym List
Acronym Description
BIT Built In Test
BN Bayesian Network
C-BIT Continuous BIT
CBM Condition-Based Maintenance
CGI Common Gateway Interface
CORBA Common Object Request Broker Architecture
COTS Commercial Off The Shelf
DCOM Distributed Component Object Model
DFS Design for Serviceability
D-LEVEL Depot Level
EGT Exhaust Gas Temperature
FMEA Failure Modes and Effects Analysis
FMECA Failure Modes, Effects and Criticality Analysis
HIS Human System Interface
HM Health Monitoring
HTML Hypertext Mark-up Language
HTTP Hypertext Transfer Protocol
I-BIT Interruptive BIT
I-LEVEL Intermediate Level
KDD Knowledge Discovery in Databases
LAN Local Area Network
LCC Life Cycle Cost
LRU Line Replaceable Unit
MMS Maintenance Management Software
MTBEFF Mean Time Between Essential Function Failure
MTBF Mean Time Between Failure
OLAP On-Line Analytical Processing
O-LEVEL Organizational Level
OMG Object Management Group
P-BIT Periodic BIT
PHM Prognostic Health Monitoring
POF Physics of Failure
RMI Remote Method Invocation
RUL Remaining Useful Life
XML eXtensible Mark-up Language
REFERENCES
[1] Mathur, A., Cavanaugh, K., Pattipati, K., Willett, P., and Galie, T., “Reasoning and
Modeling Systems in Diagnosis and Prognosis,” Proceedings of the SPIE Aerosense
Conference, Orlando, FL, April 16-20, 2001.
[2] Byington, C., Roemer, M., Galie, T., “Prognostic Enhancements to Diagnostic
Systems for Improved Condition-Based Maintenance,” IEEE Aerospace Conference,
Big Sky MT, March 2002.
[3] Byington, C., Roemer, M., Galie, T., “Prognostic Enhancements to Gas Turbine
Diagnostic Systems,” IEEE Aerospace Conference, Big Sky MT, March 2003.
[4] Satchidananda, M., Pecht, M., Goodman, D., “In-situ Sensors for Product Reliability
Monitoring,” CALCE Electronic Products and Systems Center, University of
Maryland, 2002.
[5] Goodman, D., “Prognostic Techniques for Semiconductor Failure Modes,” Ridgetop
Group, Inc., 2000.
[6] Byington, C., Safa-Bakhsh, R., “Metrics Evaluation and Tool Development for
Health and Usage Monitoring System Technology,” AHS Forum 59, Phoenix, AZ,
American Helicopter Society International, May 6-8, 2003.
[8] Byington, C., Kalgren, P., Johns, R., Beers, R., “Embedded Diagnostic/Prognostic
Reasoning and Information Continuity for Improved Avionics Maintenance,”
AUTOTESTCON 2003, Anaheim, California, 2003.
[9] Vachtsevanos, G., “Cost & Complexity Trade-off in Prognostics,” Paper Presented at
NDIA Conference on Intelligent Vehicles, Traverse City, MI, 2003.
[11] Mortin, D., “Prognostics Overview (Usage Based),” Paper Presented During Industry
Day for potential subcontractors, USAMSAA, 2003.
[12] Kozera, M., “Brigade Combat Team Diagnostics and Prognostics,” Paper Presented
Presented During Industry Day for potential subcontractors, US Army Tank-
automotive and Armaments Command, 2003.
[13] Chelidze, D., “A Dynamical Systems Approach to Failure Prognosis,” Journal of
Vibration and Acoustics, 2003.
[14] Kacprzynski, G., Hess, A., “Health Management System Design: Development,
Simulation and Cost/Benefit Optimization,” IEEE Conference, Big Sky, MT, March
2002.
[15] Roemer, M., “Assessment of Data and Knowledge Fusion Strategies for Prognostics
and Health management,” IEEE Conference, Big Sky, MT, March 2001.
[17] Cadick, J., “Condition Based Maintenance…How to Get Started…,” The Cadick
Corporation, Garland, TX, 1999.
[18] Sutherland, H., Repoff, T., House, M., Flickinger, G., “Prognostics: A New Look at
Statistical Life Prediction for Condition-Based Maintenance,” GE Global Research,
February 2003.
[19] Cuddihy, P., Ganesh, M., Graichen, C., Weisenberg, J., “Remote Monitoring and
Adaptive Models for Caregiver Piece of Mind,” GE Global Research, July 2003.
[20] Azzaro, S., Johnson, T., Graichen, M., “Remote Monitoring Techniques and
Diagnostics Selected Topics,” GE Research & Development Center, January 2002.
[21] Chen, C., “Bayesian Serviceability Tool for Diagnosing Complex Medical Imaging
Machines,” GE Global Research, January 2003.
[25] Mortin, D., Yukas, S., Cushing, M., “Five Key Ways to Improve Reliability,”
Reliability Analysis Center (RAC) Journal, Rome, NY, Second Quarter 2003.
[26] Hadden, G., Bergstrom, P., Samad, T., Holt-Bennett, B., Vachtsevanos, G., Van
Dyke, J., “Application Challenges: System Health Management for Complex
Systems,” Proceedings of the 5th International Workshop on Embedded HPC
Systems and Applications (EHPC’200), 2000.
[27] Mathur, A., "Data Mining of Aviation Data for Advancing Health Management,"
SPIE's 16th International Symposium on AeroSense, Aerospace/Defence Sensing,
Simulation and Controls, 2002.
APPENDIX A
Driving Force Behind the Contents of this Master’s Report
The attached document was taken directly from a presentation given on the topic of Army
Transformation. It outlines the needs and requirements for the technology shift into embedded
diagnostics and prognostics from the Army’s point of view to reduce the overall Operations &
Support Cost of hardware after production during its operational life. All new development for the
Army will utilize the concepts and requirements that are explained in this document. I have copied it here for reference.
Introduction
The Army Vision for the 21st Century is a rapidly deployable, highly mobile fighting force
with the lethality and survivability needed to achieve a decisive victory against any adversary. To
support this vision the Army’s logistics system must be versatile, agile, sustainable and affordable.
The Army Transformation is bringing about these fundamental changes in the Army’s structure,
equipment and doctrine. Additionally, while the Army’s science and technology, research and
development, and procurement investments are being focused to create and field the Objective Force
over the next 10 to 15 years, selected portions of the legacy forces are being recapitalized to bridge
the gap between today’s Army and the Objective Force. The responsibility for sustaining today’s
force and the transforming Army is the business of the Deputy Chief of Staff (DCS) G-4, Army, who oversees Army logistics.
Figure excerpted from [24]
The Logistics Footprint [24]
One of the Army Transformation’s goals is to reduce the logistics footprint of combat support
and combat service support while enhancing the sustainability, deployability, readiness and reliability
of military systems. This requires new logistics processes and dramatic changes in current business
processes to support the new force. These processes are focused on the weapons systems, and must
be readiness-driven, lean, and agile. They must detect and correct problems early, allocate resources
where they are most needed, and continuously drive cost and labor out of the system. One of the key
enablers for the objective sustainment processes envisioned is to equip platforms with self-reporting,
real-time, embedded diagnostics and prognostics systems. This enabler promises to replace entire
segments of the traditional logistics support structure. Such systems would contribute directly to:
• Improved readiness for weapons platforms and support equipment
Adding embedded diagnostic and prognostics capabilities to equipment and developing the
infrastructure needed to generate maximum benefit from the prognostics data represent major
challenges. The infrastructure needed to transmit, store and use the information is complex, requiring
changes to many existing and emerging communications and information systems. The potential
application to Army platforms includes vehicles, aircraft and marine craft numbering thousands of
platforms. Therefore, an implementation strategy is needed that achieves maximum benefit with the
resources available, recognizing that technology is continually evolving. This strategy should define
the following:
• When, where and how much diagnostic and prognostic capability should be embedded?
• What policy and doctrine additions/changes will be required to support the Interim and Objective Forces?
• What are the funding implications related to the Program Objective Memorandum (POM)?
Figure excerpted from [24]
The Army has recognized embedded diagnostics and prognostics as a transformation enabler and has required the consideration and planning of this technology for inclusion on new and retrofitted equipment for several years. Unfortunately, funding limitations and detailed specifications of requirements have delayed and in fact have inhibited their development and integration. However, changing operational concepts and the emerging vision of the Objective Force requirements now make their integration a necessity. Furthermore, the application of this technology promises to transform the Army’s supply chain management of consumables, repairables, and the end items themselves.
Embedded Diagnostics and Prognostics Synchronization [24]
There is a need to apply these embedded diagnostic and prognostic capabilities across the
entire Army, employing communications systems and modifying information systems to make use of
the new sources of information. The Army’s diagnostics and prognostics community of combat
developers, materiel developers, and logisticians has been working to achieve the Chief of Staff of
the Army’s goal of putting embedded diagnostics and prognostics on all weapons systems. This
requires that systems that have historically been developed independently be synchronized to support
an overall system of systems. Subsequently, the DCS G-4, Army directed the U.S. Army Logistics
Transformation Agency (LTA), the Army’s integrator of logistics systems and processes, to
coordinate and synchronize these efforts under a project called Embedded Diagnostics and Prognostics Synchronization (EDAPS).
The EDAPS project is an over-arching process that coordinates a unified Army strategy
synchronizing the Army’s current diagnostics and prognostics initiatives. The G-4 tasking calls for
LTA to pull together all the key diagnostics and prognostics players from across the Army and
develop an end state that considers all the current diagnostics and prognostics pilots, programs, and
plans and integrates the current programs and initiatives. The EDAPS project objectives include the
following sub-tasks:
• Influence the requirements of future operational and management systems
• Identify policy and programmatic gaps and redundancies and define and then re-
engineer the operational architecture and its business processes from the platform level to the national level
The project's scope of work includes the legacy fleets and the transformation to the Objective Force, and the project team draws its membership
from the Army's diagnostics and prognostics community. The team's first order of business was to
define the operational architecture, develop a management structure that involved users at all stages
of development to ensure coordination and integration, and establish a common vision for the
logistics embedded diagnostics and prognostics processes. The team’s operational architecture will
define the vision and identify requirements for policy/doctrine/training, platform technology,
communications systems, and information systems as the key pieces that need coordination and synchronization. The project has also established a
management structure to ensure that all aspects of the operational architecture are considered. The
approach is designed to engage key players in the information collection and analysis process and
build consensus for the path forward to the maximum extent possible. It also maximizes EDAPS' visibility across the Army.
The EDAPS team has begun the job of synchronizing and coordinating Army diagnostics and
prognostics issues across the entire business enterprise for the entire weapons system’s lifecycle, not
just at the platform level. This includes a review of Army policy and regulations and in-depth
assessments of related initiatives. Requirements for embedded diagnostics and prognostics are being
added where appropriate to Army operational requirement documents based on the EDAPS team's recommendations.
An integrated process team, the Synchronization IPT, has been created to facilitate the process and manage the total enterprise. In
this manner a means has been made available for synchronizing policy, procedures, operations,
doctrine, training, and automation requirements. The supporting teams build on the work of the
Army Diagnostics Improvement Program (ADIP), which complements these efforts and is focused on
incorporating diagnostic sensors and read-out mechanisms into Army weapons systems. The EDAPS
process is expected to identify and document EDAPS end-to-end information requirements (including
tactical, non-tactical and strategic) for all users and develop a roadmap to describe how these
requirements should be developed to support near-term, interim and objective forces. It will also ensure
that these efforts are driven to address the information requirements for all levels of field, depot and national management
activities. Finally, it will refine and define policy, doctrine, and operational architectures to ensure
that all EDAPS future requirements are reflected in appropriate policy, doctrine, procedures,
automation and training. The Synchronization IPT is responsible for assuring that the other working
groups address the comprehensive breadth and depth of the issues involved in implementing
embedded diagnostics, condition-based maintenance, and the linkages between these processes and the broader logistics enterprise.
Summary [24]
The coordination and synchronization of embedded diagnostics and prognostics for the
Objective Force is critical to the Army Transformation because this technology impacts logistics
operations at all levels – from the maintainer to the weapons platform lifecycle managers. A wide
range of Army organizations responsible for the doctrine, policy, equipment, training, funding,
business processes, information systems and communications systems will be affected by this
technology. It will take many years and substantial investments to fully implement the Army’s vision
for self-reporting weapons platforms and support vehicles with embedded diagnostics and prognostics capabilities.
Establishing a common approach for capturing, moving, storing and using platform-based readiness information will greatly facilitate
development of the common vision for platform-focused logistics processes. Significant work
remains to be done to develop a robust logistics system around this technology. Synchronizing these
efforts is a major challenge. Although the DCS, G-4 tasked LTA to lead the synchronization effort, it
is clear that this undertaking will be successful only if the impacted organizations are directly
involved in defining the end state and developing the implementation road map. The EDAPS process
provides the coordination and synchronization needed to achieve the Army's vision of embedded
diagnostics and prognostics in support of the Objective Force and will ensure that the process is carried through to completion.
APPENDIX B
Reliability Impact on Designs
Although the major thrust in this project is based on the fact that diagnostics and prognostics
can be incorporated into the design to either determine the failure or predict when a failure is about to
occur, one of the most important aspects during development is to ensure that all aspects of reliability
are also included in the process. Since many of the requirements placed on new designs, such as
reliability, availability, repair time and supportability, are interdependent, marginal design in one area
will almost ensure that other requirements will be hard to meet. There are many reasons why systems
fail to achieve their requirements; several of the most common reasons for failure are identified in [25:1].
Because of this, there are five aspects of reliability that need to be understood and
championed by all project reliability engineers from the very beginning of the program.
Design in Reliability Early in the Development Process
The design team must understand the environment in which the system, subsystem, or component will be employed and ensure that the
requirements flow-down process to suppliers is adequate. Given this understanding, many potential
failures can be identified and eliminated very early in the development process. In addition, every
effort should be made to leverage existing field test results to support failure analysis efforts.
Efforts to eliminate failures require a commitment of technical resources. Engineers are
needed to conduct thermal and vibration analyses to address potential failure mechanisms and failure
sites. These analyses can include the use of fatigue analysis tools, finite element modeling, dynamic
simulation, heat transfer analyses, and other engineering analysis models. Industry, universities, and
government organizations have produced a large number of engineering tools that are widely used,
especially in the commercial sector. The capability exists to model and address a number of failure mechanisms before hardware is ever built.
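As a simple illustration of the kind of calculation these engineering tools perform, the following Python sketch applies Miner's cumulative-damage rule to a hypothetical vibration load spectrum. The S-N curve constants, stress levels, and cycle counts are illustrative assumptions only and do not represent any particular system.

    # Miner's-rule fatigue damage estimate (illustrative values only).
    # Assumed S-N curve of the Basquin form: N = C * S**(-m)
    C = 1.0e12   # assumed material constant
    m = 3.0      # assumed fatigue exponent

    def cycles_to_failure(stress_mpa):
        # Allowable cycles at a given stress amplitude per the assumed S-N curve.
        return C * stress_mpa ** (-m)

    # Hypothetical load spectrum: (stress amplitude in MPa, applied cycles)
    load_spectrum = [(80.0, 2.0e5), (120.0, 5.0e4), (160.0, 1.0e4)]

    # Miner's rule: damage D = sum(n_i / N_i); failure is predicted when D >= 1
    damage = sum(n / cycles_to_failure(s) for s, n in load_spectrum)
    print(f"Cumulative damage index: {damage:.2f}")

In practice such calculations are embedded in the fatigue and finite element tools mentioned above; the point here is only that the failure-mechanism physics can be evaluated analytically long before hardware exists.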
Conduct Lower Level Testing
Lower-level testing, such as highly accelerated life testing (HALT) and highly accelerated
stress screening (HASS), is critical for precipitating failures early and identifying weaknesses in the
design. Integration testing is also critical for identifying unforeseen interface issues.
Developmental testing serves as one of the last opportunities to fix remaining problems and
increase the probability of system success. Developmental testing is not only required to ensure
system reliability maturation but also to mitigate risk in meeting requirements during operational
testing. Insufficient or poorly planned developmental test strategies often result in systems failing to meet their requirements during operational testing.
Early low-level testing, along with focused higher-level testing, is key to producing products
with high reliability. Without comprehensive lower-level testing on most or all critical subassemblies,
and without significant integration and developmental testing, there is little likelihood that high levels of reliability will be achieved.
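To give a feel for the test burden that high reliability requirements imply, the following sketch computes the sample size for a zero-failure (success-run) reliability demonstration. The target reliability and confidence values are illustrative assumptions, not requirements from any program.

    import math

    def success_run_sample_size(target_reliability, confidence):
        # Units that must complete the test with zero failures to demonstrate
        # the target reliability at the stated confidence level.
        return math.ceil(math.log(1.0 - confidence) / math.log(target_reliability))

    # Illustrative targets only
    print(success_run_sample_size(0.95, 0.90))   # 45 units
    print(success_run_sample_size(0.99, 0.90))   # 230 units

The rapid growth in sample size as the target rises is one reason purely test-based demonstration is so expensive, and why failure-precipitation methods such as HALT are used early to expose weaknesses with far fewer units.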
Rely More on Engineering Design Analysis and Less on Predictions
A reliability prediction may have little or nothing to do with the actual reliability of the
product and can actually encourage poor design practices. In many cases, the person producing the
prediction may not be a direct contributor to the design team. The historic focus on the accounting of
predictions versus the engineering activities needed to eliminate failures during the design process has
significantly limited the ability to produce highly reliable products. Additionally, in some cases, an
overreliance on predictions has led to insufficient developmental testing, resulting in failure to meet operational testing requirements. High reliability is achieved by engineering it into the design, not by predicting it.
When most people think of reliability models, they think of reliability predictions; reliability
block diagrams; failure mode, effects, and criticality analysis (FMECA); fault trees; and reliability
growth. When directly used to influence the design team, or when used to manage reliability progress,
these tools can be extremely useful to focus engineering and testing efforts. However, the most
important reliability tools are the structural, thermal, fatigue, failure mechanism, and vibration models
used by the design team to ensure that they are producing a product that will have a sufficiently large
failure-free operating period. When the major focus of the system reliability program is reliability prediction rather than design analysis, the opportunity to eliminate failures during design is lost.
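For readers less familiar with these tools, the short sketch below evaluates a toy reliability block diagram consisting of two redundant units in series with a third unit. The component reliabilities are illustrative assumptions only.

    # Toy reliability block diagram: (A in parallel with B) in series with C.
    # Component mission reliabilities are illustrative assumptions.
    R_A, R_B, R_C = 0.90, 0.90, 0.95

    def series(*blocks):
        # Series blocks: every block must work, so reliabilities multiply.
        result = 1.0
        for r in blocks:
            result *= r
        return result

    def parallel(*blocks):
        # Redundant blocks: the group fails only if every branch fails.
        unreliability = 1.0
        for r in blocks:
            unreliability *= (1.0 - r)
        return 1.0 - unreliability

    R_system = series(parallel(R_A, R_B), R_C)
    print(f"Mission reliability: {R_system:.4f}")   # 0.9405

Such block-diagram arithmetic is useful for allocating requirements and tracking progress, but, as argued above, it does not by itself remove any failure mechanism from the design.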
COTS Equipment
COTS equipment represents a great opportunity to improve reliability, reduce costs, and
leverage the latest technologies. However, COTS does not imply that we abandon engineering
analyses and early testing. Thermal, vibration, fatigue, and failure mechanism modeling, combined
with early accelerated testing, can quantify and minimize the risk of COTS equipment failing in the intended operational environment.
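As one example of relating early accelerated testing back to field conditions, the following sketch computes an Arrhenius thermal acceleration factor. The activation energy and temperatures are illustrative assumptions and would have to be established for the actual failure mechanism of the COTS part in question.

    import math

    BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K

    def arrhenius_acceleration_factor(ea_ev, t_use_c, t_stress_c):
        # Ratio of time-to-failure at the use temperature to that at the
        # stress temperature for a thermally activated failure mechanism.
        t_use_k = t_use_c + 273.15
        t_stress_k = t_stress_c + 273.15
        return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_stress_k))

    # Illustrative assumption: 0.7 eV mechanism, 40 C field use, 85 C chamber
    af = arrhenius_acceleration_factor(0.7, 40.0, 85.0)
    print(f"One chamber hour at 85 C ~ {af:.0f} field hours at 40 C")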
Provide Reliability Incentives in Contracts
In most competitive procurements, the weight of reliability in the selection criteria is usually small. Contractors have to bid low in order to
be competitive. When they have to trim their programs, reliability is often one of the first areas to go.
Unless the contractor sees value in directing and resourcing the design team to achieve high
reliability, equipment will continue to be fielded with reliability values that fall far short of what the
commercial consumer typically experiences. Most suppliers have the engineering staff and technical capability to produce highly reliable equipment, but they need a contractual incentive to do so.