
Incorporating Diagnostics and Prognostics in New System Designs

By
John Zanoff III

A MASTER OF ENGINEERING REPORT

Submitted to the College of Engineering at


Texas Tech University in
Partial Fulfillment of
The Requirements for the
Degree of

MASTER OF ENGINEERING

Approved

______________________________________
Dr. A. Ertas

______________________________________
Dr. T. T. Maxwell

______________________________________
Dr. E. W. Kiesling

______________________________________
Dr. M. M. Tanik

October 16, 2004


ACKNOWLEDGEMENTS

First of all, I would like to thank my family, especially my wife, because without her support

and understanding, I would not have undertaken this endeavor to further my education. I would also

like to thank my daughter, a junior at the University of Texas at Austin, who provided me the

motivation to do the work.

I would be remiss if I did not thank some of the individuals in the class who also played a substantial part in my successful completion of this program. Terrence Chan, Schuyler Deitch, and I teamed for all of the group assignments during the year, and their youth and motivation kept me going. I only hope that I was able to repay their energy with some wisdom of my own.

I would also like to thank all of the professors and instructors who took time from their busy schedules to further my education. I would especially like to thank Dr. Ertas and Dr. Maxwell for their tireless support of the class as well as their ability to keep things moving over a long distance. My hat is off to them. Keep up the good work.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS ........................................................................................................ II
DISCLAIMER.............................................................................................................................. V
ABSTRACT................................................................................................................................. VI
LIST OF FIGURES .................................................................................................................. VII
LIST OF TABLES ...................................................................................................................VIII
CHAPTER I
INTRODUCTION......................................................................................................................... 1
CHAPTER II
BACKGROUND ........................................................................................................................... 3
2.1 The Need For Prognostics and Diagnostics in New Designs 3
2.1.1 Customer Requirements ............................................................................................... 3
2.1.2 Total Cost of Ownership .............................................................................................. 3
2.1.2.1 Long Term Savings.......................................................................................... 4
2.1.2.2 Reduced Levels of Sparing and Logistics Impact ........................................... 5
2.2 Impact on Reliability 6
2.2.1 Partial Operability ........................................................................................................ 8
CHAPTER III
DEFINING PROGNOSTICS AND DIAGNOSTICS.............................................................. 10
3.1 Understanding the Terms 10
3.1.1 Diagnostics ................................................................................................................. 10
3.1.2 Prognostics ................................................................................................................. 11
CHAPTER IV
IMPLEMENTATION ................................................................................................................ 12
4.1 Utilization of Built In Test (BIT) 12
4.2 Diagnostics 13
4.2.1 Types of Diagnostics .................................................................................................. 13
4.2.1.1 Failure Modes, Effects and Criticality Analysis (FMECA) Role in Health
Management .............................................................................................................. 13
4.2.1.2 Software Implementation of Prognostics and Diagnostics ............................ 19
4.2.1.3 Canaries ......................................................................................................... 27
4.3 Prognostics 28
4.3.1 Modeling of the System ............................................................................................. 29
4.3.2 Approaches to Implementing Prognostics Capability................................................ 32
4.3.3 Physics of Failure ....................................................................................................... 38
4.3.3.1 Identification of Failure Mechanisms ............................................................ 40
4.3.3.2 Steps for Physics-of-Failure Accelerated Life Testing.................................. 42
4.3.3.3 Problems with Accelerated Life Testing ....................................................... 43
4.4 Monitoring 43
4.4.1 In-Situ Sensor Monitoring.......................................................................................... 45
4.4.3 Maintenance ............................................................................................................... 47
4.4.3.1 Condition-based Maintenance (CBM)........................................................... 48

4.4.3.2 Maintenance Activities Prior to Failure......................................................... 52
4.4.3.3 Value of the Data Mining Process ................................................................. 53
4.4.4 Knowledge Database.................................................................................................. 55
4.4.5 Applicability to the Aviation Industry ....................................................................... 57
4.4.6 Applicability to the Medical Industry ........................................................................ 58
CHAPTER V
SUMMARY AND CONCLUSIONS ......................................................................................... 63
REFERENCES............................................................................................................................ 66
APPENDIX A
DRIVING FORCE BEHIND THE CONTENTS OF THIS MASTER’S REPORT .............. 1
Introduction 1
The Logistics Footprint [24] 2
Embedded Diagnostics and Prognostics Synchronization [24] 4
Summary [24] 8
APPENDIX B
RELIABILITY IMPACT ON DESIGNS ................................................................................... 1
Design in Reliability Early in the Development Process ...................................................... 1
Conduct Lower Level Testing............................................................................................... 2
Rely More on Engineering Design Analysis and Less on Predictions.................................. 2
Perform Engineering Analyses of Commercial-Off-The-Shelf Equipment.......................... 3
Provide Reliability Incentives in Contracts........................................................................... 3

DISCLAIMER

The opinions expressed in this report are strictly those of the author and are not necessarily those of Raytheon, Texas Tech University, or any U.S. Government agency.

An attempt has been made to acknowledge all of the people involved in this paper, but due to the sheer size of the material, someone may have been missed. This is unintentional, and it is my wish to acknowledge all sources for the material included in this paper.

ABSTRACT

As a reliability engineer for the past seventeen years, I have seen a transformation from

simple, easily repairable systems to incredibly complex designs that are almost impossible to repair,

especially in times of actual combat. The Army, in its latest weapon system development effort, has levied requirements on potential bidders that the reliability of these new systems be an order of magnitude higher for the end user, that the systems be easily and quickly repairable in the field, and that during times of actual combat they WILL NOT FAIL. At first this seems like an insurmountable task, but after some research and benchmarking with commercial companies, it is clear that this task can be realized. If some existing design and testing philosophies from other technologies are incorporated into these designs, we can meet these needs of the end user.

This report was initiated to accumulate information in order to develop a plan that puts into use the necessary requirements, design methodologies (both hardware and software), and capabilities so that we can propose and design a system or systems as reliable as the next generation of military hardware demands. This plan will encompass the needed requirements, testing, ideas and objectives for meeting those requirements. An extensive set of materials, including papers presented at various seminars and working groups, extensive company research, benchmarking and actual case studies, was used as the basis of this report, and this document has a strong potential of being adopted as a Raytheon “Best Practice” and incorporated into future development activities.

Since I am currently working on a similar document as part of a proposal, there is a certain amount of synergy to be gained by pursuing this subject matter. The result could have a very positive impact on future development within the company and, most importantly, on the ability to better support the customer or end user of the product.

LIST OF FIGURES

Figure 1: Classical Bathtub Reliability Curve [5] 8

Figure 2: FMECA Matrix 15

Figure 3: Health Management With System Design [14] 16

Figure 4: Architecture of the Prognostics & Health Management Design Tool [14] 19

Figure 5: Hierarchical dependency-graph model [1] 31

Figure 6: Experienced-Based Prognostics Approach [2] 34

Figure 7: Evolutionary Prognostics Approach [2] 35

Figure 8: Feature/AI-Based Prognostics [2] 36

Figure 9: Physics-Based Prognostics [2] 38

Figure 10: Time to Failure Distribution Relationship 41

Figure 11: Idealized bathtub reliability curves for the test circuit and the prognostic
cell [4] 46

Figure 12: Decision Tree for Pre and Post Failure Maintenance Actions 49

Figure 13: CBM Flow Chart Model [17] 51

Figure 14: Example of a Bayesian Network for Monitoring Motor Stress Factors [18] 55

Figure 15: Hierarchy of Prognostic Approaches [3] 58

Figure 16: Overview of DFS Tool [21] 60

Figure 17: Simplified Model of Caregiver Monitoring System [19] 62

LIST OF TABLES

Table 1: Caregiver Stress Events [19]. 62

Table 2: Acronym List 65

CHAPTER I
INTRODUCTION

The overwhelming desire to design and produce systems that are an order of magnitude more

reliable, maintainable and supportable than their predecessors carries with it a responsibility for

clearly defining the path to that end.

It is of utmost importance to have a clearly defined strategy of justifiable, attainable goals and

definable tasks and processes to be successful. This paper will serve to collect and define the

necessary steps to attain the most reliable system design possible and to also document the required

analyses, design criteria, tests and justification for this.

Diagnosis and prognosis are processes of assessment of a system’s health – past, present and

future – based on observed data and available knowledge about the system. Diagnosis is an

assessment about the current (and past) health of a system based on observed symptoms, and

prognosis is an assessment of the future health. While diagnostics, the science dealing with diagnosis, has existed as a discipline in medicine and in system maintenance and failure analysis, prognostics is

a relatively new area of research. Indeed, the word “prognostics” does not yet exist in English

language dictionaries. The research community developing diagnosis and prognosis techniques to aid

Condition-Based Maintenance (CBM) likely coined the term. A working definition of prognostics is

“the capability to provide early detection and isolation of precursor and/or incipient fault condition to

a component or sub-element failure condition, and to have the technology and means to manage and

predict the progression of this fault condition to component failure.” This definition includes

prognosis (i.e., the prediction of the course of a fault or ailment) only as a second stage of the

prognostic process. The preceding stage of detection and isolation could be considered a diagnosis

process. Even the traditional meaning of prognosis used in the field of medicine implies the

recognition of the presence of a disease and its character prior to performing prognosis. However, in

the maintenance context, the final step of making decisions about maintenance and mission planning

should also be included, while developing prognostics, for more convenient traceability to top-level

goals [1:1]. Due to the nature of the observed data and the available knowledge, the diagnostic and

prognostic methods are often a combination of statistical inference and machine learning methods as

well as embedded hardware, firmware, components or software to monitor, store and communicate

conditions or impending faults. The development (or selection) of appropriate methods requires

appropriate formulation of the learning and inference problems that support the goals of diagnosis and

prognosis. An important aspect of the formulation is modeling – relating the real system to its

mathematical abstraction.

Since both diagnostics and prognostics are concerned with health assessment, it is logical to

study them together. However, the decision-making goals of the two are different. Diagnosis results

are used for reactive (post failure) decisions about corrective (repair/replacement) actions; prognosis

results are used for proactive (pre-failure) decisions about preventive and/or evasive actions (CBM,

mission reconfiguration, etc.) with the economic goal of maximizing the service life of

replaceable/serviceable components while minimizing operational risk. On the other hand, in many

situations diagnosis and prognosis aid each other. Diagnostic techniques can be used to isolate

components where incipient faults have occurred based on observed degradation in system

performance. These components can then become the focus of prognostic methods [1:2] to estimate

when incipient faults would progress to critical levels causing system failure. Prognostics could then

be used to update the failure rates (reliabilities) of the system components, and in the event of a

failure these updated reliability values can be used to isolate the failed component(s) via a more

efficient troubleshooting sequence.

CHAPTER II
BACKGROUND

2.1 The Need For Prognostics and Diagnostics in New Designs

2.1.1 Customer Requirements

The increasing cost and complexity of military weapon systems over the past four decades has driven the technology to the point where these systems can hardly be repaired without extensive knowledge of the system and cumbersome test equipment. Along with this comes the heavy price tag of maintaining these items over a potential service life of more than 25 years, a cost that includes replacement items, training the maintainers, and repair of the failed items removed from the system. The commercial world has not only embraced Built-In Test (BIT) and extensive diagnostics capability; recent developments in prognostics, or “repair before fail,” have been implemented in even the most complex items and for various applications.

This technology has only recently been required for military systems, and many defense

contractors are struggling with the implementation of this technology in new designs. The ability to

repair items prior to failure, especially on a wartime footing, will provide the advantage to those whose systems do not fail over those whose systems do.

2.1.2 Total Cost of Ownership

As in any complex design, the initial or inherent reliability is at its peak during the early

phase of the operating cycle. If it is assumed that the design has been thoroughly tested and latent

defects and infant mortality are not an issue, then it will perform “as advertised” for a period of time

without any sort of failure or unscheduled maintenance action. In many cases, especially as the

complexity of the item increases, the ability to perform or meet the performance requirements suffers

from failures that were not anticipated. Car manufacturers pride themselves on the fact that their

designs are considerably more reliable than in previous years, even when taking into account that

these vehicles are considerably more complex than ones built only 20 years prior. The ability for a

product or system to operate with minimal upkeep for an extended period of time is one of the inputs

into the total cost of owning and operating the item.

The ability to predict the time to conditional or mechanical failure (on a real-time basis) is of

enormous benefit, and health management systems that can effectively implement these capabilities

offer a great opportunity in terms of reducing the overall Life Cycle Costs (LCC) [2:2] of operating

systems as well as decreasing the operations/maintenance logistics footprint.

2.1.2.1 Long Term Savings

The reduction of the overall cost of ownership of a product while in operation is at least

partially based on the required maintenance, either planned or unplanned, and upkeep to ensure that

the system is operating at full or acceptable performance. More maintenance or “down time” experienced with a particular piece of equipment means loss of revenue, reduced capability, increased overhead or “out of pocket” expenses and, most of all, a customer that is likely to be unhappy with the product. The evolution of the automotive industry over the last forty years is a perfect

example of how “unreliability” has had a negative impact on business.

During the 1960s, most Japanese-manufactured automobiles were perceived to have poor workmanship and materials, to such an extent that the American auto manufacturers were in control of the market. A transformation that occurred in the late 1960s proved to be a monumental turnaround in the industry. Japanese manufacturers began to understand what was needed to increase both the quality and reliability of their product to the point of exceeding the American automotive industry. These manufacturers were flooding the market in the early to mid 1970s with a product superior in quality, reliability and economy, and hence with an overall lower cost of ownership than that of their US-made competition. The fuel crisis was also instrumental in these

compact, fuel-efficient vehicles dominating the market. It took the Fords, Chevrolets and Chryslers a

substantial amount of time to recover from this due to their enormous infrastructure and cost of

retooling their factories in order to compete.

The fact that smaller American-made cars were rolling off the assembly lines did not deter the buying public from returning to the Nissans and Toyotas, because the US-made cars were still not very reliable and their cost of ownership outweighed the initial purchase price of the vehicle.

The American public wanted transportation that could be trusted to get them where they wanted to

go, and did not want to have to worry about whether or not it was going to get them there.

This concept of reduced maintenance, lower operating costs and reduced down time is now

being embraced within the Department of Defense. It has been shown that the more reliable a system

is, the less it will cost to operate and maintain. When diagnostics and prognostics are incorporated, the logistics tasks also benefit, because the system, based on either historical data or known failure modes, can “communicate” its health to the maintainers, reducing repair time as well as the cost of spare parts. This will be discussed in more detail in the next section.

2.1.2.2 Reduced Levels of Sparing and Logistics Impact


If a system has the ability to monitor its health and provide that information to the

maintainers, it is much easier to perform either post or, more importantly, pre failure maintenance on

the hardware. Once a failure history has been collected on a particular piece of hardware, this

knowledge can be instrumental in determining over the life of the system what needs to be purchased

on an ongoing basis to ensure that the system will continue to operate. Depending on the technology,

there are apt to be failures that occur in one particular part of the hardware more often than others, such as motors, bearings and other electro-mechanical features that have a wear-out cycle driven by friction, heat, plastic deformation, etc. Because of this, and if the hardware is monitored correctly, knowing what will fail can have a significant impact on operation and support cost over the useful life. This knowledge can also be used to determine whether the design itself needs to be evaluated, or specific portions of the system redesigned, due to unknown or unforeseen operating environments that were not known during the initial development. The bottom line is that the system needs to be designed to have the highest inherent reliability that the operating environment allows, and then monitored for any signs of premature failure or design defects that would not normally be encountered during normal operation.

2.2 Impact on Reliability

Reliability is defined as the ability of a product to perform as intended (i.e., without failure

and within specified performance limits) for a specified time, in its life cycle application

environment. The objective of reliability prediction is to support decisions related to the operation and

maintenance of the product including [4:1]:

• Reducing the output penalties to include outage repair and labor costs

• Optimization of the maintenance cycles and spares purchases and stocks

• Maintaining the effectiveness of equipment through optimized repair actions

• Helping in the design of future products, by improved safety margins and reduced

failures

• Increasing the profitability of the manufacturer by reducing latent defects and

increasing throughput yield in the factory

A reliability prediction for a product is dependent on its structural architecture, material

properties, fabrication process, and the life cycle environment. Material properties and geometry of a

product are not exactly the same for all the products coming out of a production line. These properties may change with variations in the fabrication process, affecting the reliability, and such variations can be catastrophic for a given design or process.

Health monitoring is a method of evaluating the extent of a product’s reliability in terms of

product degradation in its life cycle environment. By knowing about impending failure, based on

actual life cycle application condition, procedures can be developed to mitigate, manage or maintain

the product.

In health monitoring a product’s degradation is quantified by continuous or periodic

measurement, sensing, recording, and interpretation of physical parameters related to the product’s

life cycle environment, and by converting the measured data into some metric associated with either the

fraction of product degradation or the remaining life of the product (in terms of days, distance in

miles, cycles to failure, etc.). A product’s degradation can be assessed in terms of physical

degradation (e.g., cracks, deflection, delamination) and electrical degradation (e.g., increase in

resistance, increase in threshold voltage), or performance degradation, such as deviation of the

product’s operating parameters (e.g., electrical, mechanical, or acoustic) from expected values.

Methods employed for health monitoring of electro-mechanical attributes [4:2] are non-destructive

test (e.g., ultrasonic inspection, liquid penetrant inspection, and visual inspection) and operating

parameter monitoring (e.g., vibration monitoring, oil consumption monitoring and thermography

(infrared) monitoring).
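To make concrete the idea of converting monitored data into a degradation metric or remaining-life estimate, the following minimal sketch (Python, illustrative only and not drawn from any cited source) assumes a simple linear damage-accumulation rule in the spirit of Miner's rule; the stress levels and cycles-to-failure values are hypothetical placeholders.

    # Minimal sketch: linear damage accumulation (assumed), converting monitored
    # usage into a damage fraction and a remaining-life estimate.
    CYCLES_TO_FAILURE = {"low_stress": 1_000_000, "high_stress": 50_000}   # hypothetical

    def damage_fraction(usage_log):
        """usage_log: list of (stress_level, cycles) pairs observed so far."""
        return sum(cycles / CYCLES_TO_FAILURE[level] for level, cycles in usage_log)

    def remaining_cycles(usage_log, expected_level="low_stress"):
        """Cycles left at the expected future stress level before damage reaches 1.0."""
        remaining_fraction = max(0.0, 1.0 - damage_fraction(usage_log))
        return remaining_fraction * CYCLES_TO_FAILURE[expected_level]

    log = [("low_stress", 400_000), ("high_stress", 10_000)]
    print(f"damage so far: {damage_fraction(log):.2f}")                          # 0.60
    print(f"remaining life at low stress: {remaining_cycles(log):,.0f} cycles")  # 400,000

In practice the cycles-to-failure values would come from accelerated life testing or physics-of-failure models of the kind discussed later in this report.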

The science of reliability is based on the premise that hardware can be characterized to be at a

certain spot on the reliability “bathtub” curve. This curve depicts the life cycle of hardware based on

its maturity, technology basis, useful life and when wear out or failure is likely to occur. Figure 1 is a

depiction of the reliability bathtub curve with points annotated on it based on where in the lifecycle

the hardware is. Although the traditional constant failure rate assumption during useful life is widely accepted, more and more research is being performed in the area of Physics of Failure, which is a modeling

technique used to better determine when a system or component is likely to fail. This topic will be

discussed in more detail in a later section of this paper.

Figure 1: Classical Bathtub Reliability Curve [5]
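As an illustration of the bathtub behavior shown in Figure 1, the following minimal sketch (Python, illustrative only and not taken from [5]) approximates the curve by summing three Weibull hazard rates: a decreasing one for infant mortality, a constant one for the useful-life period, and an increasing one for wear-out. All shape and scale parameters are hypothetical.

    # Minimal sketch: bathtub-shaped hazard rate as the sum of three Weibull hazards.
    def weibull_hazard(t, beta, eta):
        """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
        return (beta / eta) * (t / eta) ** (beta - 1.0)

    def bathtub_hazard(t):
        infant = weibull_hazard(t, beta=0.5, eta=2000.0)      # early life, decreasing
        random = weibull_hazard(t, beta=1.0, eta=10000.0)     # useful life, constant
        wearout = weibull_hazard(t, beta=4.0, eta=30000.0)    # wear-out, increasing
        return infant + random + wearout

    for hours in (10, 1000, 10000, 40000):
        print(f"h({hours:>6} h) = {bathtub_hazard(hours):.2e} failures/hour")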

2.2.1 Partial Operability

There are really two concepts that need to be understood and specified during the

development process. The better known or standard specification is Mean Time Between Failures (MTBF), which signifies any failure that would or could occur on the system, thereby making the

system unable to complete its intended function. An additional requirement that can be specified may

be Mean Time Between Essential Function Failure (MTBEFF). This is a partially failed state or

“degraded” mode of operation in which, while not fully functional, the system can still perform its intended function through partial redundancy, backup systems, or increased operator workload. Examples range from one extreme, an aircraft losing one of two or more engines and still being able to maintain flight control and land safely, to the other extreme, a sensor system with multiple sensors in which, if one failed, one of the others could at least partially fulfill the role of the failed item.

Due to cost, weight, volume and power consumption constraints, the ability to incorporate

redundancy is a luxury not available in most applications, especially if the system is man-portable or airlift

capable. Because of this, it is essential to ensure that the design is the most reliable it can be based on

the technology available, and to also incorporate the ability to test the health of the system so that an

informed decision can be made about its ability to perform a specific function or task.

The goal here is two-fold: to allow the operation or use of a system to continue with sub-optimal performance after a partial failure, while still meeting the system's inherent reliability.
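The distinction between MTBF and MTBEFF can be illustrated with a small numerical sketch (Python, illustrative only): two identical sensors with an assumed constant failure rate, where either sensor alone can still perform the essential function, so the essential function is lost only when both sensors have failed.

    import math

    # Minimal sketch with a hypothetical failure rate of one failure per 5,000 hours.
    lam = 1.0 / 5000.0

    mtbf = 1.0 / (2.0 * lam)     # mean time to the first failure of either sensor
    mtbeff = 1.5 / lam           # mean time until both have failed: 1/(2*lam) + 1/lam

    def essential_reliability(t):
        """Probability the essential function survives to time t (one-of-two redundancy)."""
        r_single = math.exp(-lam * t)
        return 1.0 - (1.0 - r_single) ** 2

    print(f"MTBF   = {mtbf:.0f} hours (any failure)")
    print(f"MTBEFF = {mtbeff:.0f} hours (essential function lost)")
    print(f"R_eff(5000 h) = {essential_reliability(5000.0):.3f}")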

CHAPTER III
DEFINING PROGNOSTICS AND DIAGNOSTICS

3.1 Understanding the Terms

3.1.1 Diagnostics

Diagnosis is the process of identification of the cause(s) of a problem. The causes of interest

may not be “root” causes. Depending on the application or the level of maintenance activity,

diagnosis can imply the isolation of a faulty component, a failure mode, or a failure condition. It

involves the recognition of one or more symptoms or anomalies (i.e., that what is observed is not normal),

and their association with a ground truth. The cause could be within the system (failed component) or

be an external factor which subsequently ‘damages’ the system and prevents it from functioning

normally in the future. The objectives, job-function or capacity of the user of the diagnostic results

determines whether the external root cause, the damaged system component(s) or both need to be

addressed (whether the corrective action should focus on eliminating the external root cause,

repairing/replacing the damaged components, or both).

In many large systems instrumented with built-in sensors and diagnostic tests, the steps of

anomaly detection and root-cause isolation are distinct. The failure of one or more tests signifies

anomalies or failures, and the processing of these results to isolate the failure source constitutes the

isolation step. In some applications, however, failure detection and isolation are not separate steps.

For example [1:2], diagnostic problems are often formulated as classification problems: associate a

vector of features obtained from the system with a class corresponding either to the normal (healthy)

state or to one of the failure modes. Such approaches are effective during corrective maintenance if

the relationship between the failure modes constituting the classes and the implicated components is

obvious or implied.
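A minimal sketch of this classification formulation is given below (Python, illustrative only; the feature values and failure-mode centroids are made up). Each observed feature vector is assigned to the nearest class centroid, where the classes are the healthy state plus a set of known failure modes.

    # Minimal sketch: diagnosis as nearest-centroid classification over a feature vector
    # of (vibration in g, temperature in degC, motor current in A) -- values are hypothetical.
    def nearest_class(features, centroids):
        """Return the class whose centroid is closest (squared Euclidean distance)."""
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        return min(centroids, key=lambda label: dist2(features, centroids[label]))

    centroids = {
        "healthy":          (0.2, 70.0, 4.0),
        "bearing_spalling": (1.5, 80.0, 4.2),
        "winding_short":    (0.3, 95.0, 6.5),
    }

    observation = (1.3, 82.0, 4.1)          # hypothetical sensed feature vector
    print("Diagnosis:", nearest_class(observation, centroids))   # -> bearing_spalling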

3.1.2 Prognostics

While diagnosis or diagnostics is based on observed data and available knowledge about the

system and its environment, prognosis makes use of not only the historical data and available

knowledge, but also profiles of future usage and external factors. Both diagnosis and prognosis can be

formulated as inference problems that depend on the objectives of diagnosis and prognosis and the

nature of the available data and system knowledge. Two characteristics common to all applications

[1:2] are the incomplete or imprecise knowledge about the systems of interest, especially in the

failure space, and the uncertainty or randomness of observed data. The development of an appropriate

decision model, thus, requires consideration of these characteristics along with the objectives of the

decision problem.

CHAPTER IV
IMPLEMENTATION

4.1 Utilization of Built In Test (BIT)

The concept of Built-In Test (BIT) is another technique employed for diverse applications. BIT is a hardware-software diagnostic means to identify and locate faults. There are various types of BIT concepts employed in electronic systems [4:3]: interruptive BIT (I-BIT), periodic BIT (P-BIT) and continuous BIT (C-BIT). The concept of I-BIT is that normal equipment operation is suspended during BIT operation; such BITs are typically initiated by the operator or during a power-up process. P-BIT is the ability of the hardware to “self check” at specific times during its operation, specifically on a non-interference basis, ceasing if necessary to avoid degrading performance. The concept of C-BIT is that equipment is monitored continuously and automatically without affecting normal operation.
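The following minimal sketch (Python, illustrative only; the monitored parameter, its limits, and the read function are hypothetical) shows the flavor of a C-BIT check that runs alongside normal operation and reports a go/no-go result each cycle.

    import random

    # Minimal sketch of a continuous BIT pass on a hypothetical 28 V supply.
    def read_supply_voltage():
        """Stand-in for a real sensor read; returns a noisy nominal 28 V value."""
        return 28.0 + random.gauss(0.0, 0.8)

    def c_bit_check(lo=26.0, hi=30.0):
        """One continuous-BIT pass: returns (go, measured_value)."""
        value = read_supply_voltage()
        return (lo <= value <= hi), value

    for cycle in range(5):
        go, volts = c_bit_check()
        print(f"cycle {cycle}: {volts:5.2f} V -> {'GO' if go else 'NO-GO'}")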

BIT is a technology that has been in place for many years in military hardware. The ability to

incorporate testing at some designated level to detect failures allows for both diagnostic and repair

capability. It provides a “go/no go” indication as to the ability of piece of hardware to perform its

mission. The drawback to this technology is that it cannot predict failures, but only announce that

they have occurred. In some cases, depending on the depth to which this testing capability has been incorporated, a system or weapon could actually be failed or unable to perform its intended use, with the failure masked because it was not monitored. Although this concept was generally thought of

20 years ago as an innovative approach to improve the health monitoring of systems, the complexity

of the current technology as well as the necessity of more reliable hardware has been the driving force

in the utilization of more complex monitoring, fault isolation and maintenance avoidance. The

transition into a more robust testing scheme is the reason that enhanced hardware diagnostics are

utilized.

4.2 Diagnostics

4.2.1 Types of Diagnostics

Diagnostics is the ability of a system to process data in order to communicate to the user what

is or has failed. This information is usually in the form of a test result that shows a failure and the

most likely cause of that failure. These results are based on the data that is collected and compared

against historical database information that is gathered from previous failures, and the actual cause of

the failure.

Diagnostics can be either discrete or continuous. Discrete diagnostics are traditionally

algorithms that produce a 0 or 1 depending on whether a threshold has been exceeded. Many types of Built In Tests (BITs) can be classified as discrete diagnostics. An example of a discrete diagnostic is an

Exhaust Gas Temperature (EGT) reading that has exceeded a predetermined level.

Continuous diagnostics are algorithms [14:5] designed to observe transitional effects and

diagnose a failure mode based on the method and rate in which the effect is changing. Continuous

diagnostics are usually associated with observing the severity of failure mode symptoms. Examples

of continuous diagnostics would be a spike energy monitor for identifying low levels of bearing race

spalling or an Artificial Intelligence (AI) classifier for diagnosing that a valve is sticking. The

“Detection Confidence score (0-1) – (DDC)”, and “% false positive score (0-1) – (DFP)” can be used

to simultaneously account for true-negative and true-positive characteristics.
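The difference between the two types can be sketched as follows (Python, illustrative only; the EGT limit and spike-energy thresholds are hypothetical). The discrete diagnostic returns 0 or 1 on a threshold exceedance, while the continuous diagnostic grades the severity of the observed symptom on a 0-to-1 scale.

    # Minimal sketch: a discrete (threshold) diagnostic and a continuous (severity) diagnostic.
    def discrete_egt_diagnostic(egt_deg_c, limit=900.0):
        """BIT-style check: 1 if the exhaust gas temperature limit is exceeded, else 0."""
        return 1 if egt_deg_c > limit else 0

    def continuous_spike_energy_diagnostic(readings, onset=0.5, failed=3.0):
        """Map the latest spike-energy reading onto a 0..1 severity score."""
        latest = readings[-1]
        if latest <= onset:
            return 0.0
        if latest >= failed:
            return 1.0
        return (latest - onset) / (failed - onset)

    print(discrete_egt_diagnostic(925.0))                        # -> 1
    print(continuous_spike_energy_diagnostic([0.4, 0.9, 1.8]))   # -> 0.52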

4.2.1.1 Failure Modes, Effects and Criticality Analysis (FMECA) Role in Health
Management
Military Standard 1629, Revision A, defines the attributes of a FMECA. Figure 2 is a

depiction of how the criticality and potential for occurrence are mapped in a matrix for the determination of potential critical issues that would need to be addressed. In addition, the following information is provided to better understand the probability of occurrence and severity classification of a FMECA (a brief sketch following Figure 2 illustrates how these categories can be combined into a criticality ranking):

Severity Classifications:

• Category I - Catastrophic - A failure which may cause death or weapon system loss

(i.e. aircraft, tank, missile, ship, etc.).

• Category II - Critical - A failure which may cause severe injury, major property

damage, or major system damage which will result in a mission loss.

• Category III - Marginal - A failure which may cause minor injury, minor property

damage, and minor system damage which will result in a delay or loss of availability

or mission degradation.

• Category IV - Minor - A failure not serious enough to cause injury, property

damage, or system damage, but which will result in unscheduled maintenance or

repair.

Probability of Occurrence:

• Level A - Frequent. The high probability is defined as a probability which is equal to or greater than 0.2 of the overall system probability of failure during the defined mission

period.

• Level B - Reasonably probable. The reasonable (moderate) probability is defined as a

probability which is more than 0.1 but less than 0.2 of the overall system probability

of failure during the defined mission period.

• Level C - Occasional probability. The occasional probability is defined as a

probability, which is more than 0.01 but less than 0.1 of the overall system

probability of failure during the defined mission period.

• Level D - Remote probability. The remote probability is defined as a probability,

which is more than 0.001 but less than 0.01 of the overall system probability of

failure during the defined mission period.

• Level E - Extremely unlikely probability. The extremely unlikely probability is

defined as a probability which is less than 0.001 of the overall system probability of

failure during the defined mission period.

Figure 2: FMECA Matrix
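As a simple illustration of how the severity categories and occurrence levels above can be combined, the following sketch (Python, illustrative only; the ranking rule, threshold and failure modes are assumptions and are not taken from MIL-STD-1629A) ranks hypothetical failure modes and flags the ones that would merit health-monitoring coverage.

    # Minimal sketch: map severity (I-IV) and occurrence (A-E) into a criticality score.
    SEVERITY_RANK = {"I": 4, "II": 3, "III": 2, "IV": 1}         # I = catastrophic
    OCCURRENCE_RANK = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}   # A = frequent

    def criticality(severity, occurrence):
        """Higher score = more critical (maximum 20)."""
        return SEVERITY_RANK[severity] * OCCURRENCE_RANK[occurrence]

    failure_modes = [
        ("bearing seizure", "II", "C"),
        ("seal leak",       "III", "B"),
        ("indicator lamp",  "IV", "A"),
    ]

    for name, sev, occ in sorted(failure_modes, key=lambda fm: -criticality(fm[1], fm[2])):
        score = criticality(sev, occ)
        action = "monitor" if score >= 9 else "accept"
        print(f"{name:16s} severity {sev:3s} occurrence {occ}  score {score:2d} -> {action}")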

The application of “health” or “condition” monitoring systems serves to increase the overall

reliability of a system through judicious application of intelligent monitoring technologies. A

consistent health management philosophy integrates the results from the health monitoring system for

the purposes of optimizing operations and maintenance practices through:

• Prediction, with confidence bounds, of the Remaining Useful Life (RUL) of critical

components, and

• Isolating the root cause of failures after the failure effects have been observed.

If RUL predictions can be made, the allocation of replacement parts or refurbishment actions

can be scheduled in an optimum fashion to reduce the overall operational and maintenance logistic

footprints. Fault isolation is a critical component to maximizing system availability and minimizing

downtime through more efficient troubleshooting efforts.

Aside from general exceedence warnings/alarms, health-monitoring initiatives mostly take

place after in-field failures (and substantial costs) have been incurred.

Because an initial system FMECA is performed during the preliminary design stage, it is a

perfect link between the critical overall system failure modes and the health management system

designed to help mitigate those failure modes. Hence, a key aspect of the process links [14:2] this

traditional FMECA analysis with health management system design optimization based on failure

mode coverage and life cycle cost analysis. Figure 3 [14:2] depicts the process of utilizing a FMECA

during the design stage to enhance the overall ability to monitor the health of a system.

[Figure omitted: a design-stage flow in which the system design concept feeds FMECAs, modeling and cost benefit analysis into a virtual test bench, which in turn supports a health monitoring system and a maintenance management system in prototype/production, with continuous feedback of experience and continuous design improvement and support.]

Figure 3: Health Management With System Design [14]


FMECAs historically contain three main pieces [14:2] of information:

1. A list of failure modes for a particular component

2. The effects of each failure mode ranging from a local level to the end effect

3. The criticality of the Failure mode (I – IV), where (I) is the most critical

While this type of failure mode analysis is beneficial in getting an initial (though generally

unsubstantiated) measure of system reliability and identifying candidates for redundancy, there are

several areas where fundamental improvements can be made so that FMECAs can assist in health monitoring design. Five shortcomings [14:2] of traditional FMECAs are:

1. Does not address the precursors or symptoms to failure modes.

2. To move maintenance from reactive to proactive, it is important to focus on both system

and component level indications that the likelihood of a substantial failure mode has increased.

Failure mode symptoms that occur prior to failure are these indications. An example of failure mode

symptoms associated with a bearing would be an increase in spike energy or an increase in the oil

particulate count.

3. Does not address the sensors and sensor placement requirements to observe failure mode

symptoms or effects.

4. Does not address health management technologies for diagnosing and prognosing faults.

5. Typically focuses on subsystems independently, and not on interactions.

On a parallel path, a tabular FMECA is created that corresponds to a traditional FMECA

except it contains failure mode symptoms, as well as sensors and diagnostic/prognostic technologies.

Alternately, a system response model may be used for assessing sensor placements and observability

of simulated failure modes thus offsetting the manual burden of creating the FMECA. Finally,

maintenance tasks that address failure modes are included.

Figure 4 [14:3] provides an overview of the approach to health management system design

optimization. A basic description of each block will be given first, then details associated with each

block will follow. First, a functional block diagram of the system must be created that models the

energy flow relationships among components. This functional block diagram provides a clear vision

of how components interact with each other across subsystems.

The information from the Functional Block diagram and the tabular FMECA is automatically

combined to create a graphical health management environment that contains all of the failure mode

attributes as well as health management technologies. The graphical health management environment

is simply a sophisticated interface to a relational database. Once the graphical health management

system has been developed, attributes are assigned to the failure modes, connections, sensors and

diagnostic/prognostic technologies. The attributes are information like historical failure rates

(failures per 10^6 operating hours), replacement hardware costs, false alarm rates, etc., which are used to

generate a fitness function for assessing the benefits of the health management system configuration.

The “fitness” function criteria include system availability, reliability, and cost. Some of these

attributes must be manually determined, if known, while others, related to the attributes of the diagnostic/prognostic technologies, can be determined from independent measures of performance and

effectiveness tests or from pre-developed databases. Finally, the health management configuration is

automatically optimized from a cost/benefit standpoint using a genetic algorithm approach. The net

result is a configuration that maintains the highest system reliability to cost/benefit ratio.
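A minimal sketch of the kind of fitness function such an optimization could use is given below (Python, illustrative only; the weights, fields and candidate configurations are hypothetical and do not represent the tool described in [14]).

    # Minimal sketch: score candidate health management configurations on availability,
    # failure mode coverage and life cycle cost, as a genetic algorithm's fitness function might.
    def fitness(config, w_avail=0.4, w_cover=0.4, w_cost=0.2, cost_ceiling=1.0e6):
        """Higher is better; cost is normalized against an assumed ceiling."""
        cost_score = max(0.0, 1.0 - config["lcc_dollars"] / cost_ceiling)
        return (w_avail * config["availability"]
                + w_cover * config["failure_mode_coverage"]
                + w_cost * cost_score)

    candidates = [
        {"name": "minimal BIT",  "availability": 0.95, "failure_mode_coverage": 0.40, "lcc_dollars": 2.0e5},
        {"name": "full sensing", "availability": 0.99, "failure_mode_coverage": 0.85, "lcc_dollars": 7.5e5},
    ]

    best = max(candidates, key=fitness)
    print("Preferred configuration:", best["name"], f"(fitness = {fitness(best):.3f})")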

[Figure omitted: the design tool combines a functional block diagram and a tabular FMECA (plus maintenance data) into a PHM model and a system response model; a fitness function covering database selection, availability/maintainability, LCC and technical performance drives a system optimization that ranks health management system configurations, producing outputs such as optimal HM designs, failure mode coverage, testability and overall reliability, with HM hardware and user-supplied weights as additional inputs.]

Figure 4: Architecture of the Prognostics & Health Management Design Tool [14]

4.2.1.2 Software Implementation of Prognostics and Diagnostics


CBM is becoming more widespread within US industry and the military. The factors that

have driven an increase in the use of CBM include the need for:

• Reduced maintenance and logistics costs

• Improved equipment availability

• Protection against failure of mission critical equipment

A complete CBM system comprises a number of functional capabilities:

• Sensing and data acquisition

• Signal processing

• Condition and health assessment

• Prognostics

• Decision aiding

In addition, a Human System Interface (HSI) is required to provide user access to the system.

The implementation of a CBM system usually requires the integration of a variety of hardware and

software components. Across the range of military and industrial application of CBM, there is a broad

spectrum of system level requirements including:

• Communication and integration with legacy systems

• Protection of proprietary data and algorithms

• A need for upgradeable systems

• Limiting implementation time

In addition, due to the potential costs of incorporating these CBM systems, a determination

has to be made on the potential return on investment (ROI).

Standardization of specifications within the community of CBM users will, ideally, drive the

CBM supplier base to produce interchangeable hardware and software components. A non-

proprietary standard that is widely adopted will result in a free market for CBM components. The

potential benefits of a robust non-proprietary standard include:

• Improved ease of upgrading for system components

• A broader supplier community resulting in more technology choices

• More rapid technology development

• Reduced prices

For a particular system integration task, an open systems approach requires a set of public

component interface standards and may also require a separate set of public specifications for the

functional behavior of the components. The underlying standards of an open system may result from

the activities of a standards organization, an industry consortium team, or may be the result of market

domination by a particular product (or product architecture). Standards produced by recognized

standards organizations are called de jure standards. De facto standards are those that arise from the

market-place, including those generated by industrial consortia. Regardless of the development

history of the standards that support an open system, it is required that the standards are published,

and publicly available at a minimal cost. An example of an open de jure standard is the IEEE 802.3,

which defines medium access protocols and physical media specifications for LAN Ethernet

connections. An example of a proprietary de facto standard is the Windows OS (operating system).

Examples of open de facto standards are the UNIX OS and HTTP. An open system standard that

receives widespread market acceptance can have great benefits to consumers and also to suppliers of

these products. The emergence of the IBM PC architecture as a market-leading de facto standard

stimulated the market for both PC hardware and software suppliers. It also led to price reductions due

to market competition in the face of rapid technology advances. The emergence of a dominant

proprietary standard for PC operating systems and software (Windows) resulted in benefits to

consumers in terms of application interoperability, but arguably at the cost of increased prices and

reduced performance compared to the possibilities that an open system software standard might have

offered. In addition to commercial issues, the interchangeability of components enabled by an open

system architecture yields several technical benefits: system capability can be readily extended by

adding additional components, and system performance can be readily enhanced by adding

components with improved or upgraded capabilities.

A complete architecture for CBM systems should cover the range of functions from data

collection through the recommendation of specific maintenance actions. The key functions [23:2] that

facilitate CBM include:

• Sensing and data acquisition

• Signal processing and feature extraction

• Production of alarms or alerts

• Failure or fault diagnosis and health assessment

• Prognostics: projection of health profiles to future health or estimation of RUL

(remaining useful life)

• Decision aiding: maintenance recommendations, or evaluation of asset readiness for a

particular operational scenario

• Management and control of data flows or test sequences

• Management of historical data storage and historical data access

• System configuration management

• Human system interface

Typically, CBM system integrators will utilize a variety of commercial off the shelf (COTS)

hardware and software products (using a combination of proprietary and open standards). Due to the

difficulty in integrating products from multiple vendors, the integrator is often limited in the system

capabilities that can be readily deployed. For some applications a system developer will engineer application-specific system solutions. When user requirements drive custom solutions, a significant

part of the overall systems engineering effort is the definition and specification of system interfaces.

The use of open interface standards would significantly reduce the time required to develop and

integrate specialized system components. CBM system developers and suppliers must make decisions

about how the functional capabilities are distributed, or clustered within the system. Due to

integration difficulties in the current environment, suppliers are encouraged to design and build

components that integrate a number of CBM functions. Furthermore, proprietary interfaces are often

used to lock customers into a single source solution, especially for software components. An ideal

Open System Architecture for CBM should support both granular approaches (individual components

implement individual functions) and integrated approaches (individual components integrate a

number of CBM functions).

A given CBM architecture may limit the flexibility and performance of system

implementations if it does not take into account data flow requirements. To support the full range of

CBM data flow requirements [23:3], the architecture should support both time based and event-based

data reporting and processing. Time-based data reporting can be further categorized as periodic or aperiodic. An event-based approach supports data reporting and processing based upon the occurrence

of events (limit exceedences, state changes, etc…). A specific requirement that may be imposed on a

CBM system involves the timeliness of data reporting. Timeliness requirements may be defined

broadly as time-critical or non time-critical. The non time-critical category applies to data messages

or processing for which delays have no significant impact on the usefulness of the data or processing

result. The time-critical category implies a limited temporal validity to the data or processing result

requiring deterministic and short time delays. Two different messaging approaches [23:5] may also be

employed within a system: synchronous and asynchronous. In the synchronous model, the message

sender waits for a confirmation response from the message receiver before proceeding to its next task.

In the asynchronous model, the message sender does not wait for a response from the receiver and

closes the communication. The asynchronous model is generally more applicable to time-critical

communications.

Current PC, Networking, and Internet technologies provide a readily available, cost effective,

easily implemented communications backbone for CBM systems. These computer networks are built

over a combination of open and proprietary standards. Software technologies are rapidly developing

to support distributed software architectures over the Internet and across LANs. There is a large

potential payback associated with market acceptance of an open standard for distributed CBM system

architectures. With the ready availability of network connectivity, the largest need is in the area of

standards for distributed software component architecture.

One model for distributed computing is Web-based computing [23:6]. The Web-based model

utilizes HTTP servers that function primarily as document servers. The most common medium of

information transport on the Web is the HTML page; HTML is a format for describing the content

and appearance of a document. An alternate format for information transport over the Web is

becoming increasingly popular: XML (eXtensible Markup Language). In contrast to HTML, XML is

focused on describing information content and information relationships. XML is readily parsed into

data elements that application programs can understand and serves as an ideal means of data and

information transport over the web. A simple model for data access over the web is a smart sensor

that periodically posts new data in XML format to a Web page. Information consumers may access

that updated data directly from the Web page. HTTP servers also provide remote access to application

programs by means of the Common Gateway Interface (CGI). In this model, the interface between the

remote client and the Web server is by means of HTML pages or XML; the web server utilizes the

CGI to communicate with the application program. The web-based distributed computing model

requires that each data server have HTTP server software. With the development of compact and

embedded HTTP server software it remains a feasible approach.
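The following minimal sketch (Python, illustrative only; the element and attribute names are hypothetical) shows a smart sensor packaging a reading as XML of the kind that could be posted to a web server and parsed by any consumer.

    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    # Minimal sketch: build an XML health record for one sensor reading.
    def build_health_record(sensor_id, measurements):
        record = ET.Element("healthRecord", attrib={"sensorId": sensor_id})
        ET.SubElement(record, "timestamp").text = datetime.now(timezone.utc).isoformat()
        for name, (value, unit) in measurements.items():
            m = ET.SubElement(record, "measurement", attrib={"name": name, "unit": unit})
            m.text = str(value)
        return ET.tostring(record, encoding="unicode")

    payload = build_health_record(
        "pump-7-bearing", {"vibration": (1.3, "g"), "temperature": (82.0, "degC")}
    )
    print(payload)   # a real deployment would post this to the HTTP server for consumers to fetch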

With the growth of distributed computing [23:19], a class of software solutions is evolving

which enable tighter coupling of distributed applications and hide some of the inherent complexities

of distributed software solutions. The general term [23:5] for these software solutions is middleware.

Fundamentally, middleware allows application programs to communicate with remote application

programs as if the two programs were located on the same computer. Current middleware

technologies include: Common Object Request Broker Architecture (CORBA), Microsoft’s

Distributed Component Object Model (DCOM), SUN’s Java-Remote Method Invocation (RMI), and

Web-based Remote Procedure Call (RPC). CORBA is an open middleware standard developed and

maintained by the Object Management Group (OMG). A number of companies have CORBA based

product offerings for a variety of hardware and OS platforms. DCOM is the extension of Microsoft’s

object technology to distributed software objects; DCOM is built into the Windows 2000 operating

system and Windows NT 4.0. A number of software companies have ported DCOM to other OS

platforms, however DCOM is just one component of the complete solution for distributed computing,

which is provided by Windows 2000. The SUN solution for distributed computing uses JAVA RMI

for managing calls to distributed Java objects. JAVA RMI operates over IIOP (the Internet Inter-Orb

Protocol), the CORBA protocol for communication between distributed software components. This

allows JAVA RMI based solutions some level of interoperability with CORBA based solutions.

In order to standardize an architecture for CBM components, the first step is to assign the CBM system functions defined earlier to a set of standard software components. The software architecture has been described in terms of functional layers. Starting with sensing and data acquisition and progressing towards decision support, the general functions of the layers [23:12] are given below (a minimal interface sketch follows the layer descriptions):

Layer 1 – Sensor Module: The sensor module has been generalized to represent the software

module, which provides system access to digitized sensor or transducer data. The sensor module may

represent a specialized data acquisition module that has analog feeds from legacy sensors, or it may

collect and consolidate sensor signals from a data bus. Alternately, it might represent the software

interface to a smart sensor (e.g. IEEE 1451 compliant sensor). The sensor module is a server of

calibrated digitized sensor data records.

Layer 2 – Signal Processing: The signal processing module acquires input data from sensor

modules or from other signal processing modules and performs single and multichannel signal

transformations and CBM feature extraction. The outputs of the signal-processing layer include:

digitally filtered sensor data, frequency spectra, virtual sensor signals, and CBM features.

Layer 3 – Condition Monitor: The condition monitor acquires input data from sensor

modules, signal-processing modules, and from other condition monitors. The primary function of the

condition monitor is to compare CBM features against expected values or operational limits and

output enumerated condition indicators (e.g. level low, level normal, level high, etc). The condition

monitor also generates alerts based on defined operational limits. When appropriate data is available,

the condition monitor may generate assessments of operational context (current operational state or

operational environment). Context assessments are treated, and output, as condition indicators. The

condition monitor may schedule the reporting of the sensor, signal processing, or other condition

monitors based on condition or context indicators; in this role it acts as a test coordinator. The

condition monitor also archives data from the Signal Processing and Sensor Modules, which may be

required for downstream processing.

Layer 4 – Health Assessment: The health assessment layer acquires input data from

condition monitors or from other health assessment modules. The primary function of the health

assessment layer is to determine if the health of a monitored system, subsystem, or piece of

equipment is degraded. If the health is degraded, the health assessment layer may generate a

diagnostic record, which proposes one or more possible fault conditions with an associated

confidence. The health assessment module should take into account trends in the health history,

operational status and loading, and the maintenance history. The health assessment module should

maintain its own archive of required historical data.

Layer 5 – Prognostics: Depending on the modeling approach that is used for prognostics, the

prognostic layer may need to acquire data from any of the lower layers within the architecture. The

primary function of the prognostic layer is to project the current health state of equipment into the

future taking into account estimates of future usage profiles. The prognostics layer may report health

status at a future time, or may estimate the remaining useful life (RUL) of an asset given its projected

usage profile. Assessments of future health or RUL may have an associated diagnosis of the projected

fault condition. The prognostic module should maintain its own archive of required historical data.

Layer 6 – Decision Support: The decision support module acquires data primarily from the

health assessment and prognostics layers. The primary function of the decision support module is to

provide recommended actions and alternatives and the implications of each recommended action.

Recommendations include maintenance action schedules, modifying the operational configuration of

equipment in order to accomplish mission objectives, or modifying mission profiles to allow mission

completion. The decision support module needs to take into account operational history (including

usage and maintenance), current and future mission profiles, high-level unit objectives, and resource

constraints.

Layer 7 – Presentation: The presentation layer may access data from any of the other layers

within the architecture. Typically high-level status (health assessments, prognostic assessments, or

decision support recommendations) and alerts would be displayed, with the ability to drill down when

anomalies are reported. In many cases the presentation layer will have multiple layers of access depending on the information needs of the user. It may also be implemented as an integrated user interface that takes into account information needs of users beyond CBM.
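To make the layered data flow concrete, the following minimal Python sketch illustrates the kind of comparison a Layer 3 condition monitor performs: a CBM feature received from the signal-processing layer is mapped onto an enumerated condition indicator and, when limits are violated, an alert record. The feature name and limit values are hypothetical and are not part of any CBM standard.

```python
# Minimal sketch of a Layer 3 (condition monitor) step: compare a CBM feature
# delivered by the signal-processing layer against operational limits and emit
# an enumerated condition indicator plus an optional alert record.
# Feature names and limit values are hypothetical.

def condition_indicator(value, low_limit, high_limit):
    """Map a numeric CBM feature onto an enumerated condition indicator."""
    if value < low_limit:
        return "level_low"
    if value > high_limit:
        return "level_high"
    return "level_normal"

def evaluate_feature(name, value, low_limit, high_limit):
    """Return the indicator and, when limits are violated, an alert record."""
    indicator = condition_indicator(value, low_limit, high_limit)
    alert = None
    if indicator != "level_normal":
        alert = {"feature": name, "indicator": indicator, "value": value}
    return indicator, alert

# Example: a vibration RMS feature produced by the signal-processing layer.
indicator, alert = evaluate_feature("vibration_rms", 0.82, low_limit=0.1, high_limit=0.6)
print(indicator, alert)
```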

4.2.1.3 Canaries
The canary, or prognostic monitor, approach integrates into the life monitoring system several sacrificial sensors capable of providing a cumulative record of the life consumption extracted by environmental stresses. The approach is patterned after the caged canary that warned early coal miners of impending life threats from excessive carbon monoxide in the air, which was undetectable by the miners’ own senses.

A prognostic monitor [4:3] is a pre-calibrated semiconductor cell (circuit) that is co-located

with the actual circuit on a semiconductor device. The prognostic monitor thus experiences the same

manufacturing process and the same environmental parameters as that of the actual circuit. Hence as

long as the operational parameters are the same, the damage rate is expected to be the same for both

the circuits. By incorporating these monitors as a part of the subsystem, we ensure that the cells see the same operational environment as the product from fabrication to test to operation. This assures that any parameter that affects the product reliability will also affect the monitor, causing its failure. This

approach enables the cells to overcome the limitations of off-line tests, which are often performed to

represent an average expected performance of circuits and have no means to account for the effects of

the operational environment seen by the circuit in use.

Prognostic monitors employ accelerated and calibrated stress conditions [5:3] to increase

their rate of degradation relative to the companion functional circuits in which they are incorporated;

thereby assuring the monitor will fail before the circuit. Incorporating the monitors as part of sub-system circuits ensures that they will see the same operational environment as the system components

from fabrication, to test, to burn-in, to operation. This assures that any variation that would affect the

system reliability (e.g. process-induced or installation-induced damage, voltage transients or spikes,

temperature variations, etc.) will also affect the monitor causing its premature failure relative to that

of the circuit. Thus, traditional methods of lifetime prediction, based on conditions experienced by the integrated circuit up to system insertion and on offline testing of other components, are replaced by a

technique that takes into consideration any impact the actual operating environment may have on

system lifetime.

The purpose of the prognostic cell [5:5] is to predict circuit failure. To achieve this, the

prognostic cell must fail prior to the circuit on the same chip for all realistic operating conditions. The

prognostic cell failure distribution of a particular chip must therefore be linked to the fabrication and

operating conditions in the same way as the circuit failure distribution of that same chip. If the

complete prognostic cell failure distribution lies before the onset of the circuit failure distribution, all

prognostic cells will fail prior to circuit failure. The challenge to building a useful prognostic cell is to

ensure this type of predictive capability without severely reducing useful circuit lifetime. This

information is further refined in section 4.4.1, In-Situ Sensor Monitoring.

4.3 Prognostics

The early research on prognostics has dealt largely with specific applications or case studies.

This is expectedly so, since prognostics as an engineering problem arose from a need to promote

CBM practices for reducing costs incurred during inefficient schedule-based preventive maintenance.

Irrespective of the prognostic approaches considered (empirical or analytical), their implementation

requires reference to the application domain: either to obtain data required for training prognostic

algorithms, or to model the functional relationships based on the physical laws governing the system.

Prognostics is receiving most attention for systems consisting of mechanical and structural

components, where there is an opportunity to change current maintenance practices from scheduled-

preventive to condition-based maintenance. A large number of recent industrial research and

development efforts in prognostics have been spurred by the military seeking to change its

maintenance practices; to a lesser extent, the need for prognostics has also been recognized by other

industries relying heavily on mechanical plant equipment such as power turbines, diesel engines, and

other rotating machinery. Unlike electronic or electrical systems, mechanical systems typically fail

slowly as structural failures (faults) in mechanical parts progress slowly to a critical level, and

monitoring the growth of these failures provides an opportunity to assess degradation and re-compute

remaining component life over a period of time. The typical scenario is a slowly evolving structural

failure due to fatigue; the causes of this fatigue are repetitive stresses induced by vibration from

rotating machinery (high cycle fatigue) and stresses due to temperature cycles (low cycle fatigue). In

electronic parts, faults are rather abrupt at the component level. Due to the complexity of geometry-

dependent physical models of mechanical structures, developing failure evolution models for such

applications can quickly become computationally prohibitive or intractable, and recourse is taken to

empirical (data dependent) forecasting models. Thus, developing prognostics for mechanical systems

has become heavily dependent on the availability of reliable historical data, even where some

physical modeling is possible. The selection of the techniques too depends on the nature of the data.

Most efforts are still in their infancy, and therefore results are not easily available in the public

domain. Researchers have reported general approaches [1:3] and the specific methods (neural

networks, wavelets, fuzzy methods, etc.) used in the context of specific machinery or components, but

results pertaining to success with validation efforts are not readily (or publicly) available. One reason

that prevents public reporting of these results could be that the studies are dependent on possibly

proprietary data either supplied by the customer (system owner) or collected at significant cost.

4.3.1 Modeling of the System

Models that relate the physical system to the data observed from it are an important part of

the formulation of the diagnostic/prognostic inference (reasoning) process. As mentioned in the

previous section, several approaches are being researched and applied in the development of

diagnostics/prognostics and health management. Some of the model paradigms [1:4-5] are described

below:

Physical models – Physical models are models founded in the natural laws governing the

system operation or used to design the system, e.g., structural mechanics (properties of materials –

solid, liquid and gas), statics and dynamics of rigid bodies (e.g., finite-element models),

thermodynamics, etc. Physical models are usually developed to explain the normal behavior of a

system to facilitate system engineering, not the failure behavior. Indeed, the failure space of systems

tends to be larger than the normal functional space. Physics-based failure models need to be specially

built – usually one model for each failure mode. Needless to say, intricate knowledge possessed by

domain experts (scientists and engineers) is required, and, hence, such modeling is expensive.

Reliability models – Reliability modeling involves developing reliability distributions of

individual components, and system reliability evaluation using reliability block diagrams. The

reliability analysis of individual components consists of specification of the failure probability

distribution functions based on the statistical analysis of empirical and laboratory data. When a

system assembled using such individual components is considered, the reliability block diagrams are

used to analyze the overall system reliability using probabilistic and graph-theoretic techniques.

Judicious assumptions often need to be made pertaining to the probabilistic independence of

individual failures, and situations such as sympathetic failures, redundancy/reconfiguration, need to

be accounted for. The reliability models are helpful in engineering the overall health management

system by identifying parts of the system in need of health monitoring and diagnostics/prognostics.

Component reliabilities can be used to update periodic maintenance and inspection schedules, and

computed reliabilities of each subassembly, module, etc. can be used to efficiently isolate causes of

anomalies or failures.
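As a simple illustration of the reliability-block-diagram evaluation described above, the following sketch combines hypothetical component reliabilities under the usual independence assumption; the component values and the example configuration are invented for illustration only.

```python
# Sketch of reliability-block-diagram evaluation under the usual assumption of
# probabilistic independence: series blocks multiply reliabilities, while
# parallel (redundant) blocks multiply unreliabilities. Values are illustrative.

from functools import reduce

def series(*r):
    """System survives only if every series block survives."""
    return reduce(lambda a, b: a * b, r, 1.0)

def parallel(*r):
    """System survives if at least one redundant block survives."""
    return 1.0 - reduce(lambda a, b: a * (1.0 - b), r, 1.0)

# Pump feeding a motor through a redundant controller pair, over one mission.
r_system = series(0.99,                  # pump
                  parallel(0.95, 0.95),  # redundant controllers
                  0.98)                  # motor
print(round(r_system, 4))                # -> 0.9678
```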

Machine Learning models – Machine learning models are purely data dependent models

and require sufficient amounts of relevant historical training data to be effective. The most prominent

techniques in this class are the neural-network-based techniques. Neural networks are useful for

30
modeling phenomena that are hard to model using parametric/analytic expressions and equations, but

the downside is that they are hard to validate and, furthermore, do not enhance the basic

understanding of the system/process under study. The learning demonstrated by neural networks is

often impressive, but care is required during generalizations made using them.

Dependency models – Dependency modeling, rooted in artificial intelligence approaches,

captures cause-effect relationships. At a lower level of detail, causes can be associated with physical

components, effects with failures of components, diagnostic tests, or symptoms, and the relations between causes and effects with physical links between components or directions of energy flow.

Furthermore, a priori knowledge about occurrences of component failures or failure modes can be

specified and used in the analysis for designing appropriate diagnostics/prognostics. Figure 5 [1:6]

below is an example of a model of a subsystem hierarchy.

Figure 5: Hierarchical dependency-graph model [1]
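The following minimal sketch illustrates the kind of cause isolation such a dependency model supports: each candidate cause is linked to the effects (symptoms or test outcomes) it would produce, and causes consistent with the observed pattern are retained. The cause-effect graph and symptom names are hypothetical and far smaller than a real subsystem hierarchy.

```python
# Sketch of cause isolation on a cause-effect (dependency) model: each candidate
# cause is linked to the symptoms or test outcomes it would produce, and causes
# consistent with the observed symptom pattern are retained. The graph below is
# hypothetical and much smaller than a real subsystem hierarchy.

dependency = {
    "bearing_wear":   {"vibration_high", "temperature_high"},
    "oil_starvation": {"temperature_high", "pressure_low"},
    "sensor_fault":   {"pressure_low"},
}

def candidate_causes(observed, graph):
    """Return causes whose predicted effects cover every observed symptom."""
    return [cause for cause, effects in graph.items() if observed <= effects]

print(candidate_causes({"temperature_high", "pressure_low"}, dependency))
# -> ['oil_starvation']
```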

4.3.2 Approaches to Implementing Prognostics Capability

For a health management or CBM system to possess prognostics implies [3:4] the ability to

predict a future condition. Inherently probabilistic or uncertain in nature, prognostics can be applied

to system/component failure modes governed by material condition or by functional loss. Similar to

diagnostic algorithms, prognostic algorithms can be generic in design but specific in terms of

application. A prognostic model must have the ability to predict or forecast the future condition of a component and/or system of components, given past and current information. Within the health management system architecture, the prognostic module's function is to intelligently utilize diagnostic results, experience-based information, and statistically estimated future conditions to determine the remaining useful life or failure probability of a component or subsystem. Prognostic reasoners can

range from reliability-based to empirical feature-based to completely model-based.

Some of the information that may be required depending on the type of prognostics approach

used in the system include:

• Engineering Model and Data

• Failure History

• Past Operating Conditions

• Current Conditions

• Identified Fault Patterns

• Transitional Failure Trajectories

• Maintenance History

• System Degradation Modes

• Mechanical Failure Modes


Examples of prognostics approaches [3:4] and [2:3-5] that have been successfully applied

for different types of problems include:

1. Experience-Based Prognostics: In the case where a physical model of a subsystem or

component is absent and the sensor network is insufficient to assess condition, an experience-based prognostic model may be the only alternative. This form of prognostic model is the least

complex and requires the failure history or “by-design” recommendations of the component under

similar operation. Typically, failure and/or inspection data is compiled from legacy systems and a

Weibull distribution or other statistical distribution is fitted to the data. An example of these types of

distributions is given in Figure 6 [2:4]. Although simplistic, an experience-based prognostic distribution can be used to drive interval-based maintenance practices that can then be updated at

regular intervals. An example may be the maintenance scheduling for a low criticality component that

has little or no sensed parameters associated with it. In this case, the prognosis of when the

component will fail or degrade to an unacceptable condition must be based solely on analysis of past

experience or Original Equipment Manufacturer (OEM) recommendations. Depending on the

maintenance complexity and criticality associated with the component, the prognostics system may be

set up for a maintenance interval (i.e. replace every 1000+/-20 Effective Operating Hrs) then updated

as more data becomes available. Having an automated maintenance database is important for the

application of experience-based prognostics.

[Figure: legacy failure data and in-field inspection results feed a legacy-based Weibull formulation with maintenance-action update capability, shifting the probability density function from the legacy MTBF/MTBI toward the in-field MTBF/MTBI.]
Figure 6: Experience-Based Prognostics Approach [2]
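As an illustration of the experience-based approach, the following sketch fits a Weibull distribution to hypothetical legacy failure times and derives a conservative maintenance interval from it (here the B10 life, the time by which roughly 10% of units are expected to fail). The data values and the choice of the B10 quantile are assumptions made for illustration, not taken from the references.

```python
# Sketch of experience-based prognostics: fit a Weibull distribution to legacy
# time-to-failure data and derive a conservative maintenance interval (here the
# B10 life). The failure times, and the choice of B10, are illustrative.

import numpy as np
from scipy.stats import weibull_min

failure_hours = np.array([910, 1020, 1090, 1180, 1230, 1340, 1410, 1500])

# Fix the location parameter at zero and estimate shape (beta) and scale (eta).
beta, loc, eta = weibull_min.fit(failure_hours, floc=0)

b10_life = weibull_min.ppf(0.10, beta, loc=loc, scale=eta)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h, suggested interval ~ {b10_life:.0f} h")

# As in-field inspection results accumulate, refitting with the combined data
# updates the distribution and, with it, the recommended interval.
```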

2. Evolutionary/Statistical Trending Prognostics: An evolutionary prognostic approach

relies on gauging the proximity and rate of change of the current component condition (i.e. features)

to known performance degradation or component faults. Figure 7 [2:4] is an illustration of the

technique. Evolutionary prognostics may be implemented on systems or subsystems that experience

conditional failures such as compressor or turbine flow path degradation. Generally, evolutionary

prognostics works well for system level degradation because conditional loss is typically the result of

interaction of multiple components functioning improperly as a whole. This approach requires that

sufficient sensor information is available to assess the current condition of the system or subsystem

and relative level of uncertainty in this measurement. Furthermore, the parametric conditions that

signify known performance related faults must be identifiable. While a physical model, such as a gas

path analysis or control system simulation, is beneficial, it is not a strict requirement for this technical

approach. An alternative to the physical model is built-in “expert” knowledge of the fault condition and how it manifests itself in the measured and extracted features.

[Figure: statistical shifts of Features 1 through 3 over time between T0 and T1, with their paths tracked and predicted.]
Figure 7: Evolutionary Prognostics Approach [2]
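A minimal sketch of the trending idea, assuming a single extracted feature with a known fault threshold (both invented here), fits the recent rate of change and extrapolates it to the threshold:

```python
# Sketch of evolutionary/statistical trending prognostics: fit the recent rate
# of change of an extracted feature and extrapolate to the parametric level that
# signifies a known performance fault. Feature values and the fault threshold
# are illustrative assumptions.

import numpy as np

hours   = np.array([0, 50, 100, 150, 200, 250])
feature = np.array([1.00, 1.04, 1.09, 1.16, 1.21, 1.28])  # e.g., normalized flow-path loss
fault_threshold = 1.60                                     # level tied to a known fault

slope, intercept = np.polyfit(hours, feature, 1)           # current rate of change
hours_to_threshold = (fault_threshold - feature[-1]) / slope

print(f"degradation rate = {slope:.4f} per hour; "
      f"fault threshold reached in about {hours_to_threshold:.0f} more hours")
```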

3. Artificial Intelligence Based Prognostics: Utilizing known transitional or seeded

fault/failure degradation paths of measured/extracted feature(s) as they progress over time is another

commonly utilized prognostic approach. In this approach, neural networks or other AI techniques are

trained on features that progress through a failure. In such cases, the probability of failure as defined

by some measure of the “ground truth” is required as a-priori information as described earlier. This

“ground truth” information that is used to train the predictive network is usually obtained from

inspection data. Based on the input features and desired output prediction, the network will

automatically adjust its weights and thresholds based on the relationships between the probability of

failure curve and the correlated feature magnitudes. Figure 8 [2:4] shows an example of a neural

network after being trained on some vibration feature data sets. The difference between the neural network output and the “ground truth” probability-of-failure curve is the residual error that remains after the network parameters have been optimized to minimize it. Once trained, the neural network

architecture can be used to intelligently predict these same features progressions for a different test

under similar operating conditions.

[Figure: a weighted neural network maps input vibration features (F1, accelerometer #3 RMS; F2, accelerometer #3 wavelength/Hz) to time-from-failure prediction groupings ranging from roughly 20 hours down to 0 hours before failure.]
Figure 8: Feature/AI-Based Prognostics [2]
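The sketch below illustrates the mechanics of such a trained predictor using a small scikit-learn neural network; the feature values and probability-of-failure labels are synthetic stand-ins for the inspection-derived “ground truth” described above, not data from the cited work.

```python
# Sketch of an AI-based prognostic mapping: a small neural network is trained to
# map vibration features onto a "ground truth" probability of failure obtained
# from inspection data. The tiny synthetic data set below only illustrates the
# mechanics; a real model needs run-to-failure histories for training.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Columns: [accelerometer #3 RMS, spectral feature F2] at successive inspections.
features = np.array([[0.10, 0.02], [0.15, 0.05], [0.25, 0.12],
                     [0.40, 0.25], [0.60, 0.45], [0.85, 0.70]])
prob_of_failure = np.array([0.02, 0.05, 0.15, 0.35, 0.65, 0.95])  # from inspections

net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
net.fit(features, prob_of_failure)

# Once trained, the network predicts failure probability for a different test
# article operating under similar conditions.
print(net.predict([[0.50, 0.33]]))
```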

4. State Estimator Prognostics: State estimation techniques such as Kalman filters [2:5] or

various other tracking filters can also be implemented as a prognostic technique. In this type of

application, the minimization of error between a model and measurement is used to predict future

feature behavior. Either fixed or adaptable filter gains can be utilized (Kalman is typically adapted,

while Alpha-Beta-Gamma is fixed) within an nth-order state variable vector. For a given measured or

extracted feature f, a state vector can be constructed as shown below.

$$x = \begin{bmatrix} f & \dot{f} & \ddot{f} \end{bmatrix}^{T}$$

Then, the state transition equation is used to update these states based upon a model. A

simple Newtonian model of the relationship between the feature position, velocity and acceleration

can be used if constant acceleration is assumed. This simple kinematics equation can be expressed as

follows:

$$f(n+1) = f(n) + \dot{f}(n)\,t + \tfrac{1}{2}\,\ddot{f}(n)\,t^{2}$$

where f is again the feature and t is the time period between updates. There is an assumed

noise level on the measurements and model related to typical signal-to-noise problems and

unmodeled physics. The error covariances associated with the measurement noise vectors are typically developed from actual noise variances, while the process noise is assumed based on the kinematic

model. In the end, the tracking filter approach is used to track and smooth the features related to the

prediction of a given failure mode progression, and thus, it is used in conjunction with a diagnosis.
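A minimal sketch of a fixed-gain (alpha-beta-gamma) tracking filter built on this constant-acceleration model is given below; the gains, time step, and feature measurements are illustrative assumptions rather than values from the references.

```python
# Sketch of a fixed-gain (alpha-beta-gamma) tracking filter built on the
# constant-acceleration model above: predict the feature state forward one
# update period, then correct with the measurement residual. Gains, time step,
# and measurements are illustrative assumptions.

def abg_filter(measurements, t, alpha=0.5, beta=0.4, gamma=0.1):
    f, fdot, fddot = measurements[0], 0.0, 0.0     # feature, rate, acceleration
    smoothed = []
    for z in measurements[1:]:
        # Prediction step using the kinematic state transition.
        f_pred    = f + fdot * t + 0.5 * fddot * t ** 2
        fdot_pred = fdot + fddot * t
        # Correction step driven by the measurement residual.
        r = z - f_pred
        f     = f_pred    + alpha * r
        fdot  = fdot_pred + beta * r / t
        fddot = fddot     + 2.0 * gamma * r / t ** 2
        smoothed.append(f)
    return smoothed, fdot, fddot                   # smoothed track plus trend terms

track, rate, accel = abg_filter([1.00, 1.10, 1.25, 1.45, 1.70, 2.00], t=1.0)
print(track, rate, accel)
```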

5. Model-Based or Physics of Failure Based Prognostics: A physics-based stochastic

model is a technically comprehensive modeling approach that has been traditionally used for

component failure mode prognostics. It can be used to evaluate the distribution of remaining useful

component life as a function of uncertainties in component strength/stress or condition for a particular

fault. The results from such a model can then be used to create a neural network or probabilistic-

based autonomous system for real-time failure prognostic predictions. Other information used as

input to the prognostic model includes diagnostic results, current condition assessment data and

operational profile predictions. This knowledge rich information can be generated from multi-

sensory data fusion combined with in-field experience and maintenance information that can be

obtained from data mining processes. While the failure modes may be unique from component to

component, the physics-based methodology can be applied to many different types of mechanical

components. An example of a physical, model-based prognostic technique is shown in Figure 9 [2:5]

for a rotating blade.

[Figure: diagnostic results, experience-based information, and the expected future condition (based on historical conditions) are combined into a current and future failure prediction.]
Figure 9: Physics-Based Prognostics [2]

4.3.3 Physics of Failure

Use of physics-of-failure concepts during the design of electronic products can provide

validated, engineering models for root-cause failure mechanisms that can be of great benefit during

the design & evaluation of accelerated reliability tests. The growing popularity of the physics-of-

failure approach to electronics reliability has the potential of improving the effectiveness of

accelerated reliability testing. Two general areas of research [22:1] are needed for accelerated testing

to become more widely used. First, more research is required in the failure-mechanism models area,

especially for probabilistic failure mechanism models. Secondly, additional research is required on

the overall physics-of-failure approach to applying failure-mechanism models for planning,

conducting, and evaluating accelerated tests.

Physics of failure is an approach to design, reliability assessment, testing, screening, and

stress margins that uses knowledge of root-cause failure processes to prevent product failures through

robust design & manufacturing practices. This approach involves:

• Identifying potential failure mechanisms (chemical, electrical, physical, mechanical,

structural, or thermal processes leading to failure); failure sites; and failure modes

(which result from the activation of failure mechanisms, and are usually precipitated

as shorts, opens, or electrical deviations beyond specifications);

• Identifying the appropriate failure-mechanism models (i.e. stress-life relationships)

and their input parameters, including those associated with material characteristics,

damage properties, relevant geometry at failure sites, manufacturing flaws and

defects, and environmental and operating loads;

• Determining the variability for each design parameter when possible;

• Computing the effective reliability function, mean, median, or desired quantile;

• Accepting the design, if the estimated time-dependent reliability function meets or

exceeds the required value over the required time period.

This approach also:

• Proactively incorporates reliability into the design process by establishing a scientific

basis for evaluating new materials, structures, and electronics technologies;

• Provides information to plan tests and screens, and to determine electrical and

thermal-mechanical stress margins;

• Uses generic failure models that are as effective for new materials and structures as

they are for existing designs;

• Encourages innovative, cost effective design through the use of realistic reliability

assessment.

A central feature of the physics of failure approach [22:4] is that reliability modeling (i.e.

time-to-failure modeling) is based on root-cause failure processes or mechanisms. Failure mechanism

models explicitly address the design parameters that have been found to influence hardware reliability

strongly, including material properties, defects, and electrical, chemical, thermal and mechanical

stresses. The goal is to keep the modeling, in a particular application, as simple as feasible without

losing the cause-effect relationships that advance useful corrective action.

Many accelerated tests have been conducted without understanding which failure

mechanisms are being accelerated, how they are accelerated, or if the failure mechanisms occur under

usage conditions [22:5]. Understanding these factors is essential for conducting successful accelerated

life tests. Physics of failure aids in determining which failure mechanisms & sites to accelerate, which stress(es) to accelerate, the most effective way to apply the stresses, and the approximate

time-to-failure at the usage and accelerated conditions.
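As one concrete example of relating accelerated and usage conditions through a failure-mechanism model, the sketch below applies the Arrhenius temperature-acceleration relationship; the activation energy, temperatures, and accelerated-test life are illustrative assumptions, and other failure mechanisms require other stress-life models.

```python
# Sketch of relating accelerated-test results to usage conditions with an
# Arrhenius temperature-acceleration model, one common stress-life relationship
# for temperature-driven failure mechanisms. The activation energy, the
# temperatures, and the observed accelerated-test life are illustrative.

import math

K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor between an accelerated and a usage temperature."""
    t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

af = arrhenius_acceleration_factor(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
ttf_accelerated_hours = 1000.0          # time to failure observed in the test
print(f"acceleration factor ~ {af:.1f}, "
      f"projected usage life ~ {af * ttf_accelerated_hours:.0f} h")
```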

4.3.3.1 Identification of Failure Mechanisms


An important aspect of accelerated life testing is the identification of failure mechanisms.

Many accelerated tests (e.g. MIL-STD-883 and MIL-STD-810 tests) do not require the failure

mechanisms to be identified. If the life at the usage condition is desired, the failure mechanisms must

be identified. Failure mechanisms can occur at many sites, which must also be identified. Numerous

stresses can act on electronic equipment, but there are usually one or two stresses that correspond to a

particular failure mechanism. These stresses should be used to accelerate the failure mechanisms.

Failure mechanisms that are dormant under usage conditions may begin to become dominant at the

accelerated stress conditions. This phenomenon is called failure mechanism shifting [22:5], but can be

avoided by performing a physics of failure analysis to determine the stress limits that reduce the life of the dominant failure mechanism sufficiently without introducing new failure mechanisms.

One of the keys to successful acceleration modeling is the explicit treatment of failure

mechanisms. Separate treatment of failure mechanisms, a central feature of the physics-of-failure

approach, is recommended by leading authorities on the accelerated reliability testing of electronic

devices, since failure mechanisms can have different life distributions and acceleration models.

Consider a hypothetical microelectronic device, which is subject to failure due to three

dominant failure mechanisms acting at associated failure sites within the device. While the time to

failure of each of the failure mechanisms may be influenced by changes in several parameters,

including loads, geometries, material properties, defect magnitudes and stresses, for the purposes of

this explanation, focus will be placed on the impact of stress on time-to-failure. Figure 10 depicts the

relationship between the time-to-failure distribution of each of the three failure mechanisms, and the

stress on which each is most dependent.

[Figure: each of the three failure mechanisms has its own time-to-failure distribution and an accelerated model relating life to the stress on which it is most dependent.]
Figure 10: Time to Failure Distribution Relationship


The failure mechanisms that cause hardware failures in electronic products generally do not

have identical stress dependencies. For example, the temperature-stress dependencies of

microelectronic failure mechanisms are known to vary considerably. The relationship between time-

to-failure and stress may be elusive unless failure mechanisms receive explicit treatment. Addressing

dominant failure mechanisms & sites directly provides critical reliability insight pertaining to:

• Determination of the dominant failure mechanisms that are the weakest links in the

product and how best to improve the design and/or manufacture;

• Determination of appropriate stress levels during accelerated reliability test design;

and

• Estimation of time to failure at intended usage stresses from time-to-failure data

obtained under accelerated stresses.

Without explicit consideration of dominant failure mechanisms, insight required to design

and build reliability into electronic products, and compress reliability test time, is not available. In

order that accelerated reliability testing be practical and successful, identification and modeling of the

dominant product failure mechanisms and sites must be available from the reliability evaluation

[22:7].

4.3.3.2 Steps for Physics-of-Failure Accelerated Life Testing


From examining the literature [22:7-8], the following steps for accelerated life testing using the physics-of-failure approach were compiled:

1. A physics-of-failure analysis is performed where the likely failure mechanisms are

determined. The analysis should be based on current failure-mechanism models and good engineering

judgment. The analysis should also identify the dominant failure mechanism, what stress(es)

accelerate the failure mechanism, what parameters influence the failure mechanism, and the

approximate time-to-failure at the usage and accelerated conditions.

2. The failure mechanisms to accelerate during the test are selected. The electronic circuit

card or microelectronic package under test will have numerous failure mechanisms and the dominant

failure mechanisms should be selected based on the physics-of-failure analysis.

3. The various stresses and their limits must be studied for each failure mechanism under

consideration. Stresses will affect different failure mechanisms and therefore must be selected so that they affect the appropriate failure mechanisms and do not introduce extraneous failure mechanisms.

4. The appropriate stress parameters to be accelerated and the magnitude of the stress are

determined. Based on the physics-of-failure analysis, the stress magnitude must be selected so that

failure mechanisms dominant under usage stress are also dominant under the accelerated stress.

Failure mechanism shifting should be avoided.

5. The type of accelerated life test is determined (i.e., constant load acceleration or step stress acceleration), and the sample size at each stress level is selected.

6. The test is performed. During the test, root-cause failure analysis should be performed to

determine which failure mechanism caused the failure.

7. The test data are interpreted, which includes extrapolating the accelerated test results to

normal operating conditions. Failure modes and mechanisms should be examined to determine if they

would occur at the usage condition.

8. If more than one failure mechanism is considered, the competing risk model can be used to

combine the time-to-failure distributions of the failure mechanisms to form a composite distribution.
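A minimal sketch of such a competing-risk combination, assuming independent Weibull-distributed failure mechanisms with invented parameters, multiplies the per-mechanism survival functions to form the composite:

```python
# Sketch of the competing-risk combination: assuming independent failure
# mechanisms, the composite survival (reliability) function is the product of
# the per-mechanism survival functions. Weibull parameters are invented.

import numpy as np
from scipy.stats import weibull_min

mechanisms = [            # (shape beta, scale eta in hours)
    (2.0, 9000.0),        # e.g., a fatigue-driven mechanism
    (1.2, 15000.0),       # e.g., a corrosion-related mechanism
    (3.5, 12000.0),       # e.g., a wear-out mechanism
]

def composite_reliability(t):
    """Probability of surviving every competing failure mechanism up to time t."""
    return float(np.prod([weibull_min.sf(t, beta, scale=eta)
                          for beta, eta in mechanisms]))

for t in (2000, 5000, 8000):
    print(t, round(composite_reliability(t), 3))
```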

4.3.3.3 Problems with Accelerated Life Testing


Accelerated testing of electronic products offers a great potential for improvements in

reliability testing. Unfortunately, difficulties encountered in accelerated reliability testing have

limited its application and acceptance [22:8]. These difficulties can be traced, in part, to a lack of

information concerning the dominant failure mechanisms and sites from reliability evaluations

conducted during product design. Identification of the product's dominant failure mechanisms & sites is an essential component of successful accelerated reliability test design and evaluation.

Difficult issues associated with the evaluation of accelerated reliability tests include the

following:

• Determination of the failure mechanisms, sites and modes that will be dominant

under intended life-cycle loads;

• Assurance that the dominant failure mechanisms are, or are not, likely to occur under

intended life-cycle loads;

• Assessment of product reliability under intended usage conditions from the failure

mechanisms models.

4.4 Monitoring

Monitoring of equipment, with appropriate analytical techniques applied to the data collected,

provides an important window into the health of the machinery. It further provides guidance for

informed operation and maintenance of the equipment. The General Electric Research and

Development Center [20:1] has focused on an expanded role of equipment monitoring that applies

analytical techniques to the collected data, incorporating component design information and specific

operating experience, resulting in information that aids in decision making and operational planning.

This work has been applied at varying levels to medical imaging equipment, locomotives, and power

generation machines and aircraft engines.

Technical Approach:

A five-step approach [20:2] has been developed to evaluate machine condition and translate it

into action plans.

1. Monitoring Equipment in Operation

The first step is collecting equipment data, which includes sensor information as well as

control system events and plant operation activities. Selection and location of sensors requires an

understanding of machine dynamics. Also, sensors must have accuracy commensurate with the

analysis objectives and must be sampled at a frequency suitable for those analyses.

2. Evaluating Equipment Condition

Diagnosis can be performed by comparing the calculated performance to a baseline condition and to the expected detailed design performance. For mechanical diagnosis, baseline vibration signatures are often taken when the equipment is installed and after each major overhaul, to be used for this comparison in order to determine whether significant changes have taken place. Baseline information, corrected to reference conditions, is also applied in performance monitoring. A critical aspect of

equipment monitoring is the identification of faulty sensors. Data that is suspect must be identified so

it is not used to generate incorrect parameters, which will lead to erroneous conclusions.

3. Determining Presence of Symptoms

Determining the presence and strength of problem symptoms involves the use of advanced

signal processing and feature extraction techniques to resolve small but significant signals in the

presence of the high-noise environment of most operating situations. Merging advanced diagnostic and signal processing techniques with machine-specific knowledge creates the ability to reliably

determine the presence of symptoms.

4. Identifying Causes of Symptoms

Identification of the root cause of problems is essential for improving performance and

availability and represents the most significant engineering challenge. Simply observing that, for

example, bearing vibration levels are excessive is not enough. It is necessary to find that the cause of

a vibration problem is misalignment and not mass unbalance. Without careful diagnosis of the root

cause of a detected problem, it is easy to attempt remedies that (a) relieve symptoms temporarily

rather than effect a cure for the disease or (b) unnecessarily extend the duration of a maintenance event

by attempting ineffective remedies.

5. Proposing Corrective Actions

A wide range of corrective actions is available to remedy problems. For especially severe

problems, it may be important to change the operating state of the machine in order to avert additional

damage. For less severe problems, operating limitations may be placed on the unit to allow continued

operation until the next scheduled outage. Corrective actions often include the scheduling of

inspection and maintenance procedures for repairs. The long-term goal is to develop optimized

operational recommendations that can balance maintenance requirements for remedying diagnosed

problems with operating profiles so as to optimize availability and performance.

4.4.1 In-Situ Sensor Monitoring

The in-situ sensor approach [4:4] can be described by referring to the idealized bathtub

failure rate curves. Semiconductor reliability is often represented as an idealized bathtub curve, which

can be divided into three regions, i.e., infant mortality region, useful life region and the wear out

region (see Figure 11). The infant mortality region begins at time zero and is characterized by a high but rapidly decreasing failure rate. Most failures in this region result from defects introduced into the material structure during manufacturing, handling, or assembly. These defects may take the form

of missing metal from the side of a thin film conductor, internal cracks, and foreign inclusions. After

the infant mortality region the failure rate decreases to a lower value and remains almost constant for

a long period of time. This long period of an almost constant failure rate is known as the useful life

period. Ultimately the failure rate begins to increase as materials start degrading and wear out failures

occur at an increasing rate. This is a result of continuously increasing damage accumulation in the

product. This region of the bathtub curve is commonly referred to as the wear-out or end-of-life period. Figure 11 also shows two idealized bathtub curves, one for the actual circuit and the other for the prognostic cell. The hazard rate h(t) in the plot is defined as

$$h(t) = -\frac{1}{n_s(t)}\,\frac{d\,n_s(t)}{dt}$$

where n_s(t) is the number of surviving products at the end of time t.

Figure 11: Idealized bathtub reliability curves for the test circuit and the prognostic cell
[4]
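For illustration, the discrete form of this definition can be evaluated directly from surviving-unit counts; the counts and interval width below are invented.

```python
# Sketch of estimating the hazard rate from surviving-unit counts n_s(t) using
# the discrete form h(t) ~ [n_s(t) - n_s(t + dt)] / [n_s(t) * dt].
# The counts and interval width are illustrative.

surviving = [1000, 950, 930, 915, 905, 880, 820, 700]   # n_s at each interval start
dt = 500.0                                               # interval width in hours

hazard = [(surviving[i] - surviving[i + 1]) / (surviving[i] * dt)
          for i in range(len(surviving) - 1)]

for i, h in enumerate(hazard):
    print(f"t = {i * dt:5.0f} h : h(t) ~ {h:.2e} per hour")
```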
In the infant mortality region failures are often due to defects introduced during

manufacturing, handling, and storage. Because the cells are designed to be a part of the actual chip, defects introduced in the product affect the prognostic cells in the same way as the actual circuitry. As a result, infant mortality can also be expected for the cells. The infant mortality effect can be increased in the prognostic cells by intentionally adding defects to them. This process, known as error seeding, introduces new defects, which can interact with the defects caused during manufacturing, handling, and storage. The combined effect of both sets of defects makes the chip fail during functional testing, which can be indicative of an infant mortality failure.

Because of their accelerated failure mechanisms, the prognostic cells have higher failure rates than the actual circuit for the entire life period. Further, since the failure mechanisms are accelerated in the prognostic cells [4:5], the wear-out region occurs earlier than that of the actual circuit, which is also

depicted in Figure 11. To predict the end-of-life period, the prognostic cells must fail prior to the

actual circuit failure. As the failures of both circuits take the form of distributions, the failure distribution of the prognostic cells on a particular chip must lie before the failure distribution of the actual circuit. If the complete prognostic cell failure distribution lies before the onset of the circuit failure distribution, then all prognostic cells will fail prior to circuit failure, thereby predicting failure of the actual circuit. In other words, the failure points of the prognostic cells must be calibrated in such a

way that their failures occur before the actual circuit wear-out region.

4.4.3 Maintenance

The maintenance infrastructure for large systems (such as weapon or sensor system) is

typically structured as a hierarchical organization for effective management of resources and logistics.

The applicability of prognostics, just as the applicability of diagnostics, is, thus, tied to the

maintenance activity and requirements at the respective hierarchical layers. For example, the

maintenance activity of the defense forces is usually organized into three levels: the Organization

level (O-level or, in the Army’s case, Unit-level) is the lowest-level activity and functions in the

mission (system operation) environment; Intermediate-level (I-level) is the next higher-level, and

Depot-level (D-level) is the highest level and supports off-line functions. The main purpose of

prognostics is to anticipate and prevent critical failures and the subsequent need for corrective

maintenance [1:4]. At the O-level, preventive maintenance consists of scheduled inspections, LRU

replacement, and on-system (e.g., on-aircraft or flight-line) servicing. The advantage of prognostics

would be realized in an O-level maintenance activity if the schedules for preventive maintenance can

be updated for individual pieces of equipment and their components based on operational data and

future usage. This implies predicting time to failure of Line Replaceable Units (LRU’s), as well as the

impact of such failure on the system. For the entire system, times-to-failure need to be tracked at the

LRU and system levels. The utility of prognostics at the I- and D-levels is lower than at the O-level.

The I-level activity is primarily concerned with scheduled maintenance and repairing/servicing

LRU’s that can be serviced without having to send parts to the Depot. So, prognostics can be used to

predict parts requirements and thus support lower inventory requirements. The D-level maintenance is

concerned with specialized repairs or overhauls of LRUs as well as with inspection, servicing and

repair/replacement of Shop Replaceable Units (SRU’s). Thus, prognostics have minimal applicability

to the D-level. The types of prognostic techniques depend on the level of complexity of the

subsystem, assembly, or component. The prognostic techniques at the individual parts level (e.g.,

SRU) tend to be physics based or based on reliability model updates. The prognostic techniques at

the sub-system or assembly levels (e.g., LRU) tend to be less dependent on physical models and more

on machine learning methods since physics-based modeling of systems or assemblies consisting of

several interacting components becomes difficult.

4.4.3.1 Condition-based Maintenance (CBM)


The implementation of condition assessment, or CBM, involves many disciplines such as

failure analysis, on-line diagnostics, diagnostic data interpretation, management and communication,

follow-up corrective actions and lastly the program maintenance. One of the difficult areas in the

development of a comprehensive condition assessment program is the analysis of the probable

contributing causes of failures [16:1], and selection of the appropriate on-line diagnostic tools to

address the correct failure contributors. Figure 12 depicts the decision points for both Preventive and

Corrective maintenance.

[Figure: maintenance branches into preventive maintenance (condition based maintenance, performed scheduled, continuously, or on request, and predetermined maintenance, performed scheduled) and corrective maintenance (deferred or immediate).]
Figure 12: Decision Tree for Pre- and Post-Failure Maintenance Actions
Both condition assessment and CBM are concepts involving the application of new

technologies and techniques of equipment diagnostics while the equipment remains in full operation.

While some terminology may imply relatively new techniques, it should be borne in mind that the idea of condition-based maintenance has been around for many years. As an example, a thermal replica, used in many temperature monitoring and protective devices, addresses one of the most important contributors to electrical insulation aging: temperature.

The benefits of CBM programs lie in the elimination of many time-based maintenance tasks in exchange for maintenance tasks deemed necessary due to the actual condition of the equipment.

While the specific condition is always monitored during normal operation, its evaluation serves to

better manage the life and therefore the reliability of a specific asset. The corrective actions may take

various forms such as through changes to the equipment-operating regime or specific discrete

corrective actions to be conveniently scheduled for future planned outages [16:1].

The CBM approach constitutes a dramatic qualitative leap in managing the equipment

reliability compared to the conventional off-line diagnostics, where the condition of the equipment

often remains unknown until an outage is underway. It follows that the condition-based maintenance

approach offers reduction in the equipment downtime, improvement in the equipment reliability and

dramatic reduction of the asset operating costs. Another advantage is the deferral of planned

maintenance, hence an increase in production equipment availability.

CBM adds two enormously important dimensions to classical predictive maintenance. First,

CBM deals with the entire system as an entity. This holistic approach to maintenance represents a

major shift from the piecemeal methodologies of the past. While CBM can still be implemented “one

step at a time,” it realizes its greatest potential when applied consistently and evenly across the entire

range of system maintenance concepts. The second added dimension is the concept of ignoring or

extending maintenance intervals. Predictive Maintenance (PDM) trending techniques have been used

historically to confirm maintenance decisions that would previously have been based on expert

opinions. While this approach may often find problems not otherwise identifiable, it does little toward

reducing the cost of classical preventive maintenance programs. In fact, because of the additional

analysis required, PDM may actually increase day-to-day costs slightly for some installations. CBM

on the other hand, because of its systemic approach, usually decreases long term maintenance costs.

Consider Figure 13 [17:4] for example. After all of the various criteria are entered into the CBM

model [17:4], and the analysis is performed, the results can cause the maintenance interval to be

decreased, maintained or increased. In other words there is an actual possibility that maintenance

costs will go down based on an increased time interval between shutdowns.

[Figure: PDM techniques applied to all equipment, subjective criteria (environment, etc.), economic factors, and risk assessment feed the CBM statistical analysis model, whose output is one of three results: out of norms, decrease interval; within norms, maintain interval; well within norms, increase interval.]
Figure 13: CBM Flow Chart Model [17]
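A minimal sketch of the interval decision in this flow, using a simple z-score comparison of the latest reading against historical norms, is given below; the thresholds, parameter name, and readings are illustrative assumptions rather than values from the reference.

```python
# Sketch of the interval decision in the CBM flow: compare the latest reading
# of a monitored parameter against its historical norm via a z-score and map
# the result to a decrease/maintain/increase decision. Thresholds and readings
# are illustrative assumptions.

from statistics import mean, stdev

def interval_decision(history, latest, tight=1.0, loose=2.0):
    """Classify the latest reading against historical norms."""
    z = abs(latest - mean(history)) / stdev(history)
    if z > loose:
        return "out of norms: decrease interval"
    if z > tight:
        return "within norms: maintain interval"
    return "well within norms: increase interval"

insulation_resistance = [105, 102, 98, 101, 99, 103, 100]    # historical readings
print(interval_decision(insulation_resistance, latest=100))  # increase interval
print(interval_decision(insulation_resistance, latest=88))   # decrease interval
```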


Maintenance data and its uses are among the key differences between classical and modern

maintenance methods. In the past, maintenance results from any given interval were reviewed and

filed. Little, if any, attention was paid to comparison or trending.

Trending and statistical analysis are the fundamental building blocks of CBM. Comparing

data absolute values, and perhaps more importantly, comparing data deviations via statistical analysis

provide information never before available. Obviously, a statistically relevant database is required.

Maintenance management software (MMS) [17:6] has been available for many years. Such

software may vary slightly from one manufacturer to another, but the basic purpose and design are

similar from one package to another. Fundamental equipment information is stored — usually in a

detailed manner. Information such as size, date of purchase, ratings, cost, maintenance cycle, and

equipment specific notes are all maintained. Most MMS packages will even print out work orders

when calendar based preventive maintenance schedules dictate. Few, if any, of these packages will

store maintenance results, and none will perform the trending or statistical analysis required for full

implementation of PDM or CBM.

4.4.3.2 Maintenance Activities Prior to Failure


One of the primary objectives is to develop a maintenance approach that anticipates a

component failure as far into the future as possible. Routinely a pending failure is discovered, but with minimal advance warning. The approach to planning maintenance varies between systems or

designs but common practice includes:

1) Preventive maintenance inspection at regularly scheduled time intervals,
2) Predictive maintenance on higher-cost critical equipment, with potentially inaccurate decisions, or
3) Run to failure, i.e., maintain only when the equipment fails.

The goal is to minimize the number of unplanned repairs and to be able to recommend a

maintenance schedule that does not needlessly over inspect or wastefully replace components that still

have useful remaining life. A proactive maintenance process includes regular inspections and

component replacements based upon industry standards and the experience unique to the application

in a particular system. An asset management database [18:2] of the equipment is used to manage the

maintenance schedule. This historical database maintains a record of all the service actions noted

above. The responsible party can monitor and confirm that predictive maintenance inspections are

being accomplished for the critical equipment and view the results. For example, motors that exhibit

increased winding resistance or elevated vibration levels are judged to be in a deteriorating condition.

Historical data then comprises a record of the time intervals between failures and replacement or refurbishment of components or equipment. Best practice is to produce weekly summary reports that

are reviewed by the plant engineer so that trends may be discovered and changes in the maintenance

schedule determined.

This process is much improved over a paper based system but still requires time and effort to

be used effectively to identify developing failures. One could envision bringing data from continuous

monitoring of the critical motors to help identify developing deterioration in addition to the current

walk-around inspections now practiced.

4.4.3.3 Value of the Data Mining Process


The objective of data-mining continuous data [18:4] is to improve upon today’s practice that

depends solely upon historical service data to determine component failure-rates at a plant site. We

desire to extend the approach to improve the Weibull failure rates by sensing, in real time, stress factors that affect the health of the motor components. This is an area of active research and is

sometimes referred to as “real-time” data mining.

The goal is more than just reacting quickly to customer requests, problems, or failures; rather, it is to anticipate developing failure conditions. The concept of real-time data mining brings together the means to exploit data to maximum effect and to conduct an efficient discovery process that allows planning for the most effective action in the face of unpredictability and emerging events. Service businesses offering real-time technologies must provide more than just sensing nets; as part of the data stream they must include embedded data fusion, cleaning, and data mining services that are based in part on knowledge management integrated with proprietary analytical technologies.

These innovative business solutions must incorporate key technologies such as sensing and

acquisition engines, dynamic data assimilation (preprocessing, fusion and cleaning engines), analytic

engines, modeling engines, knowledge engines, learning engines, inference engines and visualization

dashboards. These engines facilitate a more effective plant decision-making process, reducing operations cost while extending the time between maintenance outages.

Motor prediction in industrial plant operations is a good example. Situation analysis must

project the possible alternative failures through use of both historical and streaming input data sets in

conjunction with knowledge representation. Motor data is acquired continuously, even after the

analysis starts. As real data becomes available about a particular motor, the situation analysis updates

the local model and knowledge representation. As the effects of deterioration are identified through monitoring for stress factors, the possible alternative failures are recomputed. In effect, the model [18:2] computes in parallel with the actual motor operation so as to perform multidimensional, autocorrelation-like temporal mining.

Use of a Weibull analysis tool helps to improve the planning activity and, in conjunction with continuous monitoring that operates Bayesian networks, to better identify deteriorating motor health. In this way it supports two desirable proactive maintenance activities:

1) Maintenance planning that tries to optimize when and what inspections are accomplished,

and

2) Monitoring of equipment to detect developing faults.

Bayesian networks (BN’s) may be used to represent dependency structures among motor life

[18:4] and sensed factors affecting motor health. Today these are most often used as an after-the-fact

diagnostic tool as opposed to an anticipatory prognostic tool. Next generation motor prognostic tools

should integrate sensor data with Weibull based structural equation models as derived from BN

skeletons. By this means dynamic (i.e. time-evolving) interactive and coupled processes as

represented by directed graphs can yield solutions to these complexly interacting behavioral

problems. These knowledge driven engines represent a first step toward machine learning in real time

in the plant.

Figure 14 is a representation of a simplified Bayesian network model that monitors stress factors, with the leaf nodes of the tree corresponding to symptoms that can be sensed. Probabilities are used

to distinguish between competing root causes when there is uncertainty. A BN derives all the

implications of the beliefs that are input to it; some of these will be facts that can be checked against

observations, or simply against the experience of the engineers. The power of BNs comes through

application of the cause-and-effect rules along with the Bayesian probabilities to propagate an assessment consistently. Now both the on-line continuous data and the walk-around inspection data are used as evidence to identify, in the face of uncertainty, developing component failures.

Development of a BN starts by using a Failure Modes and Effects Analysis (FMEA). Using

prior observations combined with current knowledge gives an approach to dealing with uncertainty in

maintenance intervention decisions and addressing the problems surrounding uncertainty. Error

ranges for uncertainty in the data must be created and analyzed during operations. Data assimilation

[18:5] is accomplished by applying Bayesian methods, non-linear multi-resolution de-noising, and

Monte-Carlo methods in order to determine sensitivity, analytic effectiveness, data cleaning, and data

filtering requirements.

[Figure: root-cause nodes for a failing bearing and a failing winding are linked to sensed leaf nodes such as reduced efficiency, high bearing temperature, high winding temperature, high number of starts, high vibration, high current, high negative-sequence current, high ambient temperature, and torque pulses.]
Figure 14: Example of a Bayesian Network for Monitoring Motor Stress Factors [18]
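The sketch below illustrates, for a single cause and a single symptom, the Bayes-rule update that such a network propagates across many nodes; the prior and likelihood values are illustrative assumptions, not values from the reference.

```python
# Sketch of the kind of update a Bayesian network propagates: Bayes' rule
# revises the belief that a bearing is failing once the "vibration high"
# symptom (a leaf node) is observed. All probabilities are illustrative.

def posterior(prior, p_evidence_given_true, p_evidence_given_false):
    """P(cause | evidence) for a single binary cause and binary evidence node."""
    p_evidence = (p_evidence_given_true * prior
                  + p_evidence_given_false * (1.0 - prior))
    return p_evidence_given_true * prior / p_evidence

p_bearing_failing = posterior(prior=0.05,                   # belief before evidence
                              p_evidence_given_true=0.90,   # P(vibration high | failing)
                              p_evidence_given_false=0.10)  # P(vibration high | healthy)
print(round(p_bearing_failing, 3))   # -> 0.321
```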

4.4.4 Knowledge Database

Although seemingly trivial in nature, the accumulation of failure data and the capture of sensed data are becoming a science of their own. There is almost as much research in this field as there is in the study of prognostics and diagnostics. This is extremely important because the capture of specific pieces of data pre- and post-failure is essential to either develop or update the diagnostic

history of a system and to be better able to predict what has failed so that the initiation of repair

efforts can begin. In classical terms, the initial “fault detection” must occur on a system and then the

process of “fault isolation” can begin. There are two specific areas of study in this field.

The first is knowledge fusion [26:6], which is the co-ordination of individual data reports

from a variety of sensors. It is higher level than pure ‘data fusion,’ which generally seeks to correlate

common-platform data. Knowledge fusion, for example, seeks to integrate reports from acoustic,

vibration, oil analysis, and other sources, and eventually to incorporate trend data, histories, and other

components necessary for true prognostics.

The second area is termed data mining [27:3], which is the nontrivial process of identifying

valid, novel, potentially useful, and ultimately understandable patterns in data. Within a large section

of the data mining research community, data mining has been identified with the multi-disciplinary

field called Knowledge Discovery in Databases (KDD), in which data mining is just one step in a

series of steps starting with data preparation and ending with presentation, evaluation and utilization

of mined results. The first steps in a KDD process, such as data cleaning, data integration, and data

selection, can often be performed using basic database manipulation and queries, followed by

additional high-level analysis using On-line Analytical Processing (OLAP) techniques. This initial

stage is then followed by the application of machine learning techniques to actually mine patterns in

the data. A number of data mining approaches have been studied and evaluated. These include several

classification and prediction techniques such as learning of decision rules or decision trees, neural

networks, and data clustering. Since a large part of the current maintenance data consists of nominal

variables, rule induction techniques are perhaps the most pertinent.
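As a minimal sketch of that classification step, assuming the pandas and scikit-learn libraries are available and using a made-up table of nominal maintenance records (the field names and values below are purely illustrative), a decision tree can be induced and read back as diagnostic rules.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical maintenance records with nominal (categorical) variables.
records = pd.DataFrame({
    "vibration":   ["high", "high", "low",  "low",  "high", "low"],
    "oil_debris":  ["yes",  "no",   "no",   "yes",  "yes",  "no"],
    "temperature": ["high", "norm", "norm", "high", "high", "norm"],
    "failed_part": ["bearing", "none", "none", "seal", "bearing", "none"],
})

# One-hot encode the nominal predictors, then induce a small decision tree.
X = pd.get_dummies(records.drop(columns="failed_part"))
y = records["failed_part"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned tree can be read back as human-checkable diagnostic rules.
print(export_text(tree, feature_names=list(X.columns)))

The printed tree is effectively a set of if-then rules over the nominal variables, which is why rule-induction style techniques fit this kind of maintenance data well.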

The essence of these efforts is that continuous updates and improvements based on actual

failure and repair data are validated and stored. When certain triggers or limits are reached within a

sensor or set of sensors, the diagnostic process sends an alarm or warning to either the operator or a

repair or maintenance facility so that the appropriate action can take place: repairing the failure or, in

the case of prognostics, performing a maintenance action prior to failure.
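In its simplest form, the trigger logic described above compares validated readings against stored limits and routes a warning; the short Python sketch below uses hypothetical limit names and values, with a plain print statement standing in for a real notification interface.

# Hypothetical sensor limits (names, units, and values invented for illustration).
LIMITS = {
    "bearing_temp_c": 95.0,
    "vibration_mm_s": 7.1,
    "oil_particles_ppm": 150.0,
}

def check_readings(readings, limits=LIMITS):
    """Return the list of exceeded limits so a warning can be routed
    to the operator or the maintenance facility."""
    alarms = [(name, value, limits[name])
              for name, value in readings.items()
              if name in limits and value > limits[name]]
    for name, value, limit in alarms:
        print(f"WARNING: {name} = {value} exceeds limit {limit}")
    return alarms

check_readings({"bearing_temp_c": 101.3, "vibration_mm_s": 4.2, "oil_particles_ppm": 180.0})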

4.4.5 Applicability to the Aviation Industry

Because the airline industry has always struggled with the massive overhead of maintaining

the aircraft in its fleet, the development of more efficient designs has always been a priority in the

interactions between the airlines and the major aircraft manufacturers. One of the tradeoffs that has

occurred is the ability to extend the range over which twin-engine aircraft can operate over water; this

is an extensive and costly evolution in the design process that is monitored very closely by the

governing bodies of commercial air travel. The other factor is that engine manufacturers are under

extreme pressure to design these engines for greater and greater reliability. This is the major reason

that the primary non-mechanical portion of the engine, the fuel control system, is normally a triple-

redundant design.

With the advent and study of more elaborate prognostics, various prognostics and health

monitoring technologies have been developed [3:1] that aid in the detection and classification of

developing system faults. However, these technologies have traditionally focused on fault detection

and isolation within an individual subsystem. Health management system developers are just

beginning to address the concepts of prognostics and the integration of anomaly, diagnostic and

prognostic technologies across subsystems and systems. Hence, the ability to detect and isolate

impending faults or to predict the future condition of a component or subsystem based on its current

diagnostic state and available operating data continues to be a high priority research topic not only for

the airline industry, but for many areas where a failure could have catastrophic consequences.

In general, health management technologies will observe features associated with anomalous

system behavior and relate these features to useful information about the system’s condition. In the

case of prognostics, this information relates to the condition at some future time. Inherently

probabilistic or uncertain in nature, prognostics can be applied to system/component failure modes

governed by material condition or by functional loss. Like diagnostic algorithms, prognostic

algorithms can be generic in design but specific in terms of application. Various approaches to

prognostics have been developed that range in fidelity from simple historical failure rate models to

high-fidelity physics-based models. Figure 15 [3:1] illustrates a hierarchy of prognostic approaches in

relation to their applicability and relative costs.

Figure 15: Hierarchy of Prognostic Approaches [3]
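At the low-cost end of that hierarchy, a purely statistical prognostic can be built from historical failure data alone. The Python sketch below uses a two-parameter Weibull reliability model with invented shape and scale parameters (not values from [3]) to estimate the probability that a component that has survived to its present age will also survive an upcoming mission interval.

import math

# Hypothetical Weibull parameters fitted from historical failure records.
BETA = 2.3      # shape parameter (wear-out behaviour when > 1)
ETA = 4000.0    # scale / characteristic life, operating hours

def reliability(t_hours):
    """Probability of surviving to t under the Weibull model."""
    return math.exp(-((t_hours / ETA) ** BETA))

def conditional_reliability(age_hours, mission_hours):
    """Probability of surviving a further mission given survival to the current age."""
    return reliability(age_hours + mission_hours) / reliability(age_hours)

age = 2500.0      # current accumulated operating hours
mission = 250.0   # upcoming mission length
print(f"P(survive next {mission:.0f} h | survived {age:.0f} h) = "
      f"{conditional_reliability(age, mission):.3f}")

Moving up the hierarchy, the fixed population parameters are progressively replaced by condition data for the individual asset and, ultimately, by physics-of-failure models.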

4.4.6 Applicability to the Medical Industry

The ability to monitor equipment in the medical industry, in a manner similar to what is done in

the aviation industry, has also been investigated and implemented to a large extent. Having the

capability to evaluate the health of critical and expensive medical testing equipment on a real-time

basis has been instrumental in reducing the unavailability of this equipment due to unforeseen failures

and the associated time to repair when a failure occurs.

For specific medical imaging equipment manufactured by General Electric [21:1] the service

organization is responsible for a significant portion of total revenue for a complex medical imaging

system. The traditional service delivery model developed for this purpose is highly field engineer

centric with an emphasis on building field engineer to customer relationships. In order to maximize

the service revenue, the service delivery model needs to become highly service centric with an

emphasis on driving proactive smart diagnosis. Field engineers can evolve to become more service

centric by increasing their productivity, allowing them to spend less time commuting to customer

sites and to service more customers per day. Technologies such as remote diagnostics, virtual

monitoring of machines, improved communication, and efficient online parts ordering have contributed

to this shift in the service delivery paradigm. However, field engineers are still, on average, only able to

service one major customer call per day and still respond in a reactive mode to notifications of machine

failures.

Multiple site visits and wrong part orders result from inaccurate diagnosis, which leads to

downtime for customers and higher service costs. Poor diagnostics coupled with an ad-hoc mix of

indicators designed into the system by design engineers in the development stage of a program and a

trial-and-error approach to diagnosis leads to multiple site visits by field engineers. Accurate

diagnosis [21:11] is essential for improving service.

Because of these issues, General Electric has done extensive research in the field of

serviceability of medical testing equipment that is in the hands of its customers. The very high

non-recurring cost of purchasing this equipment makes it imperative that the time the equipment is

down for maintenance be minimized. To accomplish this, a Design for Serviceability (DFS) tool that

can monitor the equipment and support its repair was developed. Figure 16 [21:7] depicts the three

main components of this tool.

[Figure 16 content: indications from system events, indications from operator observations, the Authoring and Reporting components, a ranked list of most likely failures, and field feedback of recommendations and final actions.]

Figure 16: Overview of DFS Tool [21]


The Authoring component allows design engineers to enter the service-critical information

necessary to create a Constrained Bayesian Network. The analysis component consists of a decision

engine that updates the probabilities of failure modes given a set of known observations. The

reporting component provides design engineers with a report on serviceability during new product

development. The report enables design engineers to make intelligent design decisions based on

expected payback and resources available. Once the design engineer is satisfied with the results from

the report, domain experts validate the created Bayesian model before releasing it for use by the

Diagnosing application. The service feedback is the process to bring performance of the causality

engine back to design engineers so that the information may be used to improve diagnosis and the

next generation of the product.

The DFS Tool [21:7] could be integrated into a customer's current new product development

process. Ideally, field engineers will use the Diagnosing Tool before visiting a customer site so that

parts may be ordered beforehand and the problem can be fixed in a single trip.

A Bayesian network is a technology used for handling uncertainty and has been applied to

diagnosis of mechanical and electrical systems. One reason is the ability to capture heuristic

knowledge [21:I] when no failure data is available but expert opinion is abundant, which is the case

during new product development. Online engineers and field engineers could use the Bayesian

serviceability tool to diagnose failures given observations from diverse sources and provide service

feedback for improving the diagnostic accuracy of the Bayesian model. A diagnostic model strategy

selects a model for diagnosis given a set of indicators, and a recommendation strategy presents the

field engineer with a customized service recommendation. Two authoring alternatives allow engineers

with no prior experience in Bayesian theory to easily create a causality engine.

The ability to generate a service recommendation from the observations for any diagnostic

model is essential to increasing the ability to fix a problem right the first time, which directly impacts

field engineer productivity. The Design For Six Sigma methodology is followed to measure the tool

capability in meeting this customer requirement. Unit testing is employed to generate opportunities

for measurement. Based on the amount of testing, the capability metric for the tool to generate a

service recommendation is 3.51 sigma with 19,610 defects per million opportunities [21:iv].
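For reference, a defects-per-million-opportunities figure is normally converted to a sigma level by taking the standard normal quantile of the yield and adding the conventional 1.5-sigma shift; the snippet below (assuming SciPy is available) shows that calculation for the figure quoted above. The exact convention used in [21] may differ slightly, so the result is approximate.

from scipy.stats import norm

def sigma_level(dpmo, shift=1.5):
    """Convert defects-per-million-opportunities to a (shifted) sigma level."""
    yield_fraction = 1.0 - dpmo / 1_000_000.0
    return norm.ppf(yield_fraction) + shift

# Figure quoted from the DFS study: 19,610 defects per million opportunities.
print(f"Approximate sigma level: {sigma_level(19_610):.2f}")

With the standard 1.5-sigma shift this evaluates to roughly 3.6, close to but not exactly the 3.51 reported, which suggests a slightly different shift or rounding convention in the original study.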

Another application that can utilize this technology is the ability to remotely monitor elderly

persons who live alone in their own homes. Applying sensors in certain areas of the home and

utilizing adaptive modeling makes it possible to associate sensor events with daily activities. An experiment was

carried out in this regard by General Electric’s Global Research organization [19:2]. Figure 17 [19:1]

is a simplified depiction of how the monitoring system was developed.

Figure 17: Simplified Model of Caregiver Monitoring System [19]
The initial results of this experiment have been positive, and the reduction of stress on

caregivers has been one of the most encouraging outcomes of this research. As seen in Table 1 [19:4],

the research was based on inputs from twenty-one caregivers who reported what gave them the most

stress in their day-to-day responsibility of caring for elderly individuals living alone.

Table 1: Caregiver Stress Events [19].

CHAPTER V
SUMMARY AND CONCLUSIONS

There are inherent design requirements that must be identified and understood before

diagnostics and prognostics can be effectively incorporated into a system during the development

phase.

The function of the prognostic module is to intelligently utilize diagnostic results, experience-

based information and statistically estimated future conditions to determine the remaining useful life

or failure probability of a component, module or system. A prognostic model must have the ability to

predict or forecast the future condition of a component and/or system of components given the past

and current information.

Data availability, dominant failure or degradation mode of interest, modeling and system

knowledge, accuracies required and criticality of the application are some of the variables that

determine the choice of prognostic approach. The ability to predict the time to conditional or

mechanical failure (on a real-time basis) is of enormous benefit, and health management systems that

can effectively implement the capabilities presented herein offer a great opportunity in terms of

reducing the overall Life Cycle Costs (LCC) of operating systems as well as decreasing the

operations/maintenance logistics footprint.

Recent proposal activity with the U.S. Army has dictated that competence be gained in the

application of prognostics for military systems that are either currently fielded or in near-term development. These

requirements are being backed with funding to incorporate these features. In order to accomplish this,

the following main ideas must be thoroughly thought out and understood prior to and during the early

design phase of a program:

• Prognostics and diagnostics have to be designed in from the earliest stages of

development; otherwise, it will be too late

• A thorough understanding of failure modes and testability is the precursor for

successful implementation of prognostics

• All failures encountered need to also be completely understood, documented and

included in a knowledge database that is built upon during the life of the system for

repair and maintenance recommendations

• A coordinated effort with design, systems testability and reliability engineering is

imperative for this to be successful

As has been proven time and time again in commercial industry, these techniques can be

applied to designs with great success, increased customer satisfaction, lower life cycle cost and, most

importantly, the lowest cost of ownership.

Table 2: Acronym List
Acronym Description
BIT Built In Test
BN Bayesian Network
C-BIT Continuous BIT
CBM Condition-Based Maintenance
CGI Common Gateway Interface
CORBA Common Object Request Broker Architecture
COTS Commercial Off The Shelf
DCOM Distributed Component Object Model
DFS Design for Serviceability
D-LEVEL Depot Level
EGT Exhaust Gas Temperature
FMEA Failure Modes and Effects Analysis
FMECA Failure Modes, Effects and Criticality Analysis
HIS Human System Interface
HM Health Monitoring
HTML Hypertext Mark-up Language
HTTP Hypertext Transfer Protocol
I-BIT Interruptive BIT
I-LEVEL Intermediate Level
KDD Knowledge Discovery in Databases
LAN Local Area Network
LCC Life Cycle Cost
LRU Line Replaceable Unit
MMS Maintenance Management Software
MTBEFF Mean Time Between Essential Function Failure
MTBF Mean Time Between Failure
OLAP On-Line Analytical Processing
O-LEVEL Operational Level
OMG Object Management Group
P-BIT Periodic BIT
PHM Prognostic Health Monitoring
POF Physics of Failure
RMI Remote Method Invocation
RUL Remaining Useful Life
XML eXtensible Mark-up Language

REFERENCES

[1] Mathur, A., Cavanaugh, K., Pattipati, K., Willett, P., and Galie, T., “Reasoning and
Modeling Systems in Diagnosis and Prognosis,” Proceedings of the SPIE Aerosense
Conference, Orlando, FL, April 16-20, 2001.

[2] Byington, C., Roemer, M., Galie, T., “Prognostic Enhancements to Diagnostic
Systems for Improved Condition-Based Maintenance,” IEEE Aerospace Conference,
Big Sky MT, March 2002.

[3] Byington, C., Roemer, M., Galie, T., “Prognostic Enhancements to Gas Turbine
Diagnostic Systems,” IEEE Aerospace Conference, Big Sky MT, March 2003.

[4] Satchidananda, M., Pecht, M., Goodman, D., “In-situ Sensors for Product Reliability
Monitoring,” CALCE Electronic Products and Systems Center, University of
Maryland, 2002.

[5] Goodman, D., “Prognostic Techniques for Semiconductor Failure Modes,” Ridgetop
Group, Inc., 2000.

[6] Byington, C., Safa-Bakhsh, R., “Metrics Evaluation and Tool Development for
Health and Usage Monitoring System Technology,” AHS Forum 59, Phoenix, AZ,
American Helicopter Society International, May 6-8, 2003.

[7] Kacprzynski, G., Maynard, K., "Enhancement of Physics-of-Failure Prognostic Models with
System Level Failures," IEEE Aerospace Conference, Big Sky, MT, March 2002.

[8] Byington, C., Kalgren, P., Johns, R., Beers, R., “Embedded Diagnostic/Prognostic
Reasoning and Information Continuity for Improved Avionics Maintenance,”
AUTOTESTCON 2003, Anaheim, California, 2003.

[9] Vachtsevanos, G., “Cost & Complexity Trade-off in Prognostics,” Paper Presented at
NDIA Conference on Intelligent Vehicles, Traverse City, MI, 2003.

[10] Palmer, D., Kendig, M., "Prognostics in Military-based Digital Electronic Assemblies," Paper
Presented at the Interagency Integrated Vehicle Health Management Diagnostic-Prognostic
Technical Interchange Meeting, 2003.

[11] Mortin, D., “Prognostics Overview (Usage Based),” Paper Presented During Industry
Day for potential subcontractors, USAMSAA, 2003.

[12] Kozera, M., "Brigade Combat Team Diagnostics and Prognostics," Paper Presented During
Industry Day for potential subcontractors, US Army Tank-automotive and Armaments
Command, 2003.

[13] Chelidze, D., “A Dynamical Systems Approach to Failure Prognosis,” Journal of
Vibration and Acoustics, 2003.

[14] Kacprzynski, G., Hess, A., “Health Management System Design: Development,
Simulation and Cost/Benefit Optimization,” IEEE Conference, Big Sky, MT, March
2002.

[15] Roemer, M., “Assessment of Data and Knowledge Fusion Strategies for Prognostics
and Health management,” IEEE Conference, Big Sky, MT, March 2001.

[16] Paoletti, G., "Failure Contributors of MV Electrical Equipment and Condition Assessment
Program Development," Eaton's Cutler-Hammer Performance Power Solutions, (No Date).

[17] Cadick, J., “Condition Based Maintenance…How to Get Started…,” The Cadick
Corporation, Garland, TX, 1999.

[18] Sutherland, H., Repoff, T., House, M., Flickinger, G., “Prognostics: A New Look at
Statistical Life Prediction for Condition-Based Maintenance,” GE Global Research,
February 2003.

[19] Cuddihy, P., Ganesh, M., Graichen, C., Weisenberg, J., “Remote Monitoring and
Adaptive Models for Caregiver Peace of Mind," GE Global Research, July 2003.

[20] Azzaro, S., Johnson, T., Graichen, M., “Remote Monitoring Techniques and
Diagnostics Selected Topics,” GE Research & Development Center, January 2002.

[21] Chen, C., “Bayesian Serviceability Tool for Diagnosing Complex Medical Imaging
Machines,” GE Global Research, January 2003.

[22] Stadterman, T., Hum, B., "A Physics-of-Failure Approach to Accelerated Life Testing of
Electronic Equipment," U.S. Army Materiel Systems Analysis Activity, Reliability and
Engineering Branch, 1996.

[23] Thurston, M., Lebold, M., "Standards Developments for Condition-Based Maintenance
Systems," Applied Research Laboratory, Penn State University, 2002.

[24] Various Authors, "Embedded Diagnostics and Prognostics Synchronization - Logistics
Integration Through Collaboration," U.S. Army Logistics Transformation Agency,
Ft. Belvoir, VA, 2004.

[25] Mortin, D., Yukas, S., Cushing, M., “Five Key Ways to Improve Reliability,”
Reliability Analysis Center (RAC) Journal, Rome, NY, Second Quarter 2003.

[26] Hadden, G., Bergstrom, P., Samad, T., Holt-Bennett, B., Vachtsevanos, G., Van
Dyke, J., “Application Challenges: System Health Management for Complex

Systems,” Proceedings of the 5th International Workshop on Embedded HPC
Systems and Applications (EHPC 2000), 2000.

[27] Mathur, A., "Data Mining of Aviation Data for Advancing Health Management,"
SPIE's 16th International Symposium on AeroSense, Aerospace/Defence Sensing,
Simulation and Controls, 2002.

APPENDIX A
Driving Force Behind the Contents of this Master’s Report

The attached document was taken directly from a presentation given on the topic of Army

Transformation. It outlines the needs and requirements for the technology shift into embedded

diagnostics and prognostics from the Army’s point of view to reduce the overall Operations &

Support Cost of hardware after production during its operational life. All new development for the

Army will utilize the concepts and requirements that are explained in this document. I have copied it

verbatim so that the complete message could be understood.

“Embedded Diagnostics and Prognostics Synchronization - Logistics Integration

Through Collaboration [24]”

Introduction

The Army Vision for the 21st Century is a rapidly deployable, highly mobile fighting force

with the lethality and survivability needed to achieve a decisive victory against any adversary. To

support this vision the Army’s logistics system must be versatile, agile, sustainable and affordable.

The Army Transformation is bringing about these fundamental changes in the Army’s structure,

equipment and doctrine. Additionally, while the Army’s science and technology, research and

development, and procurement investments are being focused to create and field the Objective Force

over the next 10 to 15 years, selected portions of the legacy forces are being recapitalized to bridge

the gap between today’s Army and the Objective Force. The responsibility for sustaining today’s

force and the transforming Army is the business of the Deputy Chief of Staff (DCS) G-4, Army who

is also responsible for managing the Army’s logistics footprint.

Figure excerpted from [24]

The Logistics Footprint [24]

One of the Army Transformation’s goals is to reduce the logistics footprint of combat support

and combat service support while enhancing the sustainability, deployability, readiness and reliability

of military systems. This requires new logistics processes and dramatic changes in current business

processes to support the new force. These processes are focused on the weapons systems, and must

be readiness-driven, lean, and agile. They must detect and correct problems early, allocate resources

where they are most needed, and continuously drive cost and labor out of the system. One of the key

enablers for the objective sustainment processes envisioned is to equip platforms with self-reporting,

real-time, embedded diagnostics and prognostics systems. This enabler promises to replace entire

segments of the traditional logistics support structure. Such systems would contribute directly to

several key objectives for the future Army:

• Virtual logistics situational awareness at all levels

• Proactive (versus reactive) combat logistics

• Improved readiness for weapons platforms and support equipment

• Reduced logistics footprint on the battlefield

• More effective fleet management and reduced lifecycle costs

• Reduced logistics workload on the Warfighter and crews.

Adding embedded diagnostic and prognostics capabilities to equipment and developing the

infrastructure needed to generate maximum benefit from the prognostics data represent major

challenges. The infrastructure needed to transmit, store and use the information is complex, requiring

changes to many existing and emerging communications and information systems. The potential

application to Army platforms includes vehicles, aircraft and marine craft numbering thousands of

platforms. Therefore, an implementation strategy is needed that achieves maximum benefit with the

resources available, recognizing that technology is continually evolving. This strategy should define

the following:

• When, where and how much diagnostic and prognostic capability should be

developed and installed?

• What communications medium will be used to move the information?

• What technologies do users need to move/use the information or data?

• What policy and doctrine additions/changes will be required to support the Interim

and Objective Force?

• What requirement documents will be impacted?

• What are the funding implications related to the Program Objective Memorandum

(POM)?

Figure excerpted from [24]

The Army leadership recognizes the importance of diagnostics and prognostics as a

transformation enabler and has required the consideration and planning for this technology to be

included on new and retrofitted equipment for several years. Unfortunately, funding limitations and

detailed specifications of requirements have delayed and in fact have inhibited their development and

integration. However, changing operational concepts and the emerging vision of the Objective Force

requirements now makes their integration a necessity. Furthermore, the application of this technology

is expected to contribute significantly to the Army’s Revolution in Military Logistics by improving

the Army’s supply chain management of consumables, repairables, and the end items themselves.

Embedded Diagnostics and Prognostics Synchronization [24]

There is a need to apply these embedded diagnostic and prognostic capabilities across the

entire Army, employing communications systems and modifying information systems to make use of

the new sources of information. The Army’s diagnostics and prognostics community of combat

developers, materiel developers, and logisticians have been working to achieve the Chief of Staff of

the Army’s goal of putting embedded diagnostics and prognostics on all weapons systems. This

requires that systems that have historically been developed independently be synchronized to support

an overall system of systems. Subsequently, the DCS G-4, Army directed the U.S. Army Logistics

Transformation Agency (LTA), the Army’s integrator of logistics systems and processes, to

coordinate and synchronize these efforts under a project called Embedded Diagnostics and

Prognostics Synchronization (EDAPS).

The EDAPS project is an over-arching process that coordinates a unified Army strategy

synchronizing the Army’s current diagnostics and prognostics initiatives. The G-4 tasking calls for

LTA to pull together all the key diagnostics and prognostics players from across the Army and

develop an end state that considers all the current diagnostics and prognostics pilots, programs, and

plans and integrates the current programs and initiatives. The EDAPS project objectives include the

following sub-tasks:

• Identify Interim and Objective Force business processes;

• Influence the requirements of future operational and management systems such as the

Global Combat Support System-Army, Wholesale Logistics Modernization Program,

and the Future Combat System (FCS);

• Influence the requirements of weapons systems platforms;

• Determine the best return on investments;

• Identify data requirements at all echelons;

• Identify policy and programmatic gaps and redundancies and define and then re-

engineer the operational architecture and its business processes from the platform,

through retail, and into the wholesale system; and

• Identify POM issues.

The project’s scope of work includes the legacy fleets and the transformation to the Objective

Force as outlined in emerging Army doctrine and Joint Vision 2020.

Figure excerpted from [24]

LTA formed a Synchronization Integrated Product Team (IPT), consisting of representatives

from the Army’s diagnostics and prognostics community. The team’s first order of business was to

define the operational architecture, develop a management structure that involved users at all stages

of development to ensure coordination and integration, and establish a common vision for the

logistics embedded diagnostics and prognostics processes. The team’s operational architecture will

define the vision and identify requirements for policy/doctrine/training, platform technology,

communications systems, and information systems as the key pieces that need coordination and

synchronizing. An enterprise management framework approach was selected as the proposed

management structure to ensure that all aspects of the operational architecture are considered. The

approach is designed to engage key players in the information collection and analysis process and

build consensus for the path forward to the maximum extent possible. It also maximizes EDAPS’

probability of success based on the complexity of the G-4 tasking.

The EDAPS team has begun the job of synchronizing and coordinating Army diagnostics and

prognostics issues across the entire business enterprise for the entire weapons system’s lifecycle, not

just at the platform level. This includes a review of Army policy and regulations and in depth

assessments of related initiatives. Requirements for embedded diagnostics and prognostics are being

added where appropriate to Army operational requirement documents based on the EDAPS team’s

input. Finally, a collaborative framework of inter-related working groups, coordinated through a

Synchronization IPT, has been created to facilitate the process and manage the total enterprise. In

this manner a means has been made available for synchronizing policy, procedures, operations,

doctrine, training, and automation requirements. The supporting teams build on the work of the

Army Diagnostics Improvement Program (ADIP), which compliments its efforts that are focused on

incorporating diagnostics sensors and read-out mechanisms for Army weapons systems. The EDAPS

process is expected to identify and document EDAPS end-to-end information requirements (including

tactical, non-tactical and strategic) for all users and develop a roadmap to describe how these

requirements should be developed to support near-term, interim and objective forces. It will also

identify tactical, non-tactical, operational and strategic communications requirements, primarily

driven to address the information requirements, for all levels of field, depot and national management

activities. Finally, it will refine and define policy, doctrine, and operational architectures to ensure

that all EDAPS future requirements are reflected in appropriate policy, doctrine, procedures,

automation and training. The Synchronization IPT is responsible for assuring that the other working

groups address the comprehensive breadth and depth of the issues involved in implementing

embedded diagnostics, condition-based maintenance, and the linkages between these processes and

relevant field, depot and national information systems.

Summary [24]

The coordination and synchronization of embedded diagnostics and prognostics for the

Objective Force is critical to the Army Transformation because this technology impacts logistics

operations at all levels – from the maintainer to the weapons’ platform lifecycle managers. A wide

range of Army organizations responsible for the doctrine, policy, equipment, training, funding,

business processes, information systems and communications systems will be affected by this

technology. It will take many years and substantial investments to fully implement the Army’s vision

for self-reporting weapons platforms and support vehicles with embedded diagnostics and prognostics

capabilities. The project’s development of a comprehensive operational architecture for generating,

capturing, moving, storing and using platform-based readiness information will greatly facilitate

development of the common vision for platform-focused logistics processes. Significant work

remains to be done to develop a robust logistics system around this technology. Synchronizing these

efforts is a major challenge. Although the DCS, G-4 tasked LTA to lead the synchronization effort, it

is clear that this undertaking will be successful only if the impacted organizations are directly

involved in defining the end state and developing the implementation road map. The EDAPS process

allows for this coordination and synchronization of achieving the Army’s vision of embedded

diagnostics and prognostics in support of the Objective Force and will ensure that the process is

institutionalized for the future.

APPENDIX B
Reliability Impact on Designs

Although the major thrust in this project is based on the fact that diagnostics and prognostics

can be incorporated into the design to either determine the failure or predict when a failure is about to

occur, one of the most important aspects during development is to ensure that all aspects of reliability

are also included in the process. Since many of the requirements placed on new designs are

interdependent on each other, i.e., reliability, availability, repair time and supportability, marginal

design in one area will almost ensure other requirements will be hard to meet. There are many reasons

why systems fail to achieve their requirements. However, there are several reasons [25:1] for failures

that repeatedly surface. Five of these reasons are:

• Failure to design in reliability early in the development process

• Inadequate lower-level testing

• Relying on predictions instead of conducting engineering design analysis

• Failure to perform engineering analyses of commercial off-the-shelf (COTS)

equipment

• Lack of reliability improvement incentives.

Because of this, there are five aspects of reliability that need to be understood and

championed by all project reliability engineers from the very beginning of the program.

Design in Reliability Early in the Development Process

It is critical that developers have a thorough understanding of the tactical operational

environment in which the system, subsystem, or component will be employed and ensure that the

requirements flow-down process to suppliers is adequate. Given this understanding, many potential

failures can be identified and eliminated very early in the development process. In addition, every

effort should be made to leverage existing field test results to support failure analysis efforts.

Efforts to eliminate failures require a commitment of technical resources. Engineers are

needed to conduct thermal and vibration analyses to address potential failure mechanisms and failure

sites. These analyses can include the use of fatigue analysis tools, finite element modeling, dynamic

simulation, heat transfer analyses, and other engineering analysis models. Industry, universities, and

government organizations have produced a large number of engineering tools that are widely used,

especially in the commercial sector. The capability exists to model and address a number of failure

mechanisms for everything from suspension components to circuit boards.

Conduct Lower-Level Testing

Lower-level testing, such as highly accelerated life testing (HALT) and highly accelerated

stress screening (HASS), is critical for precipitating failures early and identifying weaknesses in the

design. Integration testing is also critical for identifying unforeseen interface issues.

Developmental testing serves as one of the last opportunities to fix remaining problems and

increase the probability of system success. Developmental testing is not only required to ensure

system reliability maturation but also to mitigate risk in meeting requirements during operational

testing. Insufficient or poorly planned developmental test strategies often result in systems failure

and a repeat of operational testing.

Early low-level testing, along with focused higher-level testing, is key to producing products

with high reliability. Without comprehensive lower-level testing on most or all critical subassemblies,

and without significant integration and developmental testing, there is little likelihood that high levels

of reliability will be achieved.

Rely More on Engineering Design Analysis and Less on Predictions

A reliability prediction may have little or nothing to do with the actual reliability of the

product and can actually encourage poor design practices. In many cases, the person producing the

prediction may not be a direct contributor to the design team. The historic focus on the accounting of

predictions versus the engineering activities needed to eliminate failures during the design process has

significantly limited the ability to produce highly reliable products. Additionally, in some cases,

predictions are used as a means to bypass demonstration of contractual specifications in

developmental testing resulting in failure to meet operational testing requirements. High reliability is

not obtained through reliability predictions.

When most people think of reliability models, they think of reliability predictions; reliability

block diagrams; failure mode, effects, and criticality analysis (FMECA); fault trees; and reliability

growth. When directly used to influence the design team, or when used to manage reliability progress,

these tools can be extremely useful to focus engineering and testing efforts. However, the most

important reliability tools are the structural, thermal, fatigue, failure mechanism, and vibration models

used by the design team to ensure that they are producing a product that will have a sufficiently large

failure-free operating period. When the major focus of the system reliability program is reliability

prediction, there is a high potential for failure.
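To illustrate why a prediction by itself is a weak guarantee, the Python sketch below performs the kind of constant-failure-rate, series reliability roll-up that prediction handbooks produce; the part names and failure rates are invented. Nothing in the arithmetic says anything about the thermal, vibration, or fatigue margins that actually determine whether the hardware survives, which is exactly the limitation discussed above.

import math

# Hypothetical constant failure rates (failures per million hours) for a
# series configuration, the form a handbook-style prediction typically takes.
FAILURE_RATES_FPMH = {
    "power_supply": 12.0,
    "processor_board": 8.5,
    "sensor_suite": 20.0,
    "harness": 3.0,
}

lambda_system = sum(FAILURE_RATES_FPMH.values()) / 1e6   # failures per hour
mtbf_hours = 1.0 / lambda_system
mission_hours = 500.0
mission_reliability = math.exp(-lambda_system * mission_hours)

print(f"Predicted system MTBF: {mtbf_hours:,.0f} h")
print(f"Predicted {mission_hours:.0f} h mission reliability: {mission_reliability:.4f}")

Numbers like these should steer engineering analysis and test planning, not substitute for them.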

Perform Engineering Analyses of Commercial-Off-The-Shelf Equipment

COTS equipment represents a great opportunity to improve reliability, reduce costs, and

leverage the latest technologies. However, COTS does not imply that we abandon engineering

analyses and early testing. Thermal, vibration, fatigue, and failure mechanism modeling, combined

with early accelerated testing, can quantify and minimize the risk of COTS equipment failing in the

military operating environment. Ensuring adequate requirements flowdown to suppliers is a must.

Provide Reliability Incentives in Contracts

Even when reliability is mentioned in a statement of work or in platform specifications, the

weight of reliability in the selection criteria is usually small. Contractors have to bid low in order to

be competitive. When they have to trim their programs, reliability is often one of the first areas to go.

Unless the contractor sees value in directing and resourcing the design team to achieve high

reliability, equipment will continue to be fielded with reliability values that fall far short of what the

commercial consumer typically experiences. Most suppliers have the engineering staff and technical

know-how to produce highly reliable systems.

