The Bathtub Curve and Product Failure Behavior
For SER, the failure mode is a normal life failure. There is an average rate of occurrence but
the failures occur "at random." The failures in most cases cause only a minor deviation in
operation and are self-correcting. No repair is needed to "fix" a product subject to SER and,
in fact, no "fix" can eliminate SER effects. Only a significant design change (using an error
correcting design) can eliminate the effects of SER, but nothing can eliminate SER.
There are other cases, especially in electronic products, where a "constant" failure rate may
be appropriate (although approximate). This is the basis for MIL-HDBK-217 and other
methods to estimate system failure rates from consideration of the types and quantities of
components used. For many electronic components, wear-out is not a practical failure
mode. The time that the product is in use is significantly shorter than the time it takes to
reach wear-out modes. That leaves infant mortality and normal life failure modes as the
causes of all significant failures. As we have already observed, after some time, failures
from infant mortality defects get spread out so much that they appear to be approximately
random in time. A combination of low level infant mortality failures and some random
failures caused by operational stresses (such as power line surges) can result in a product
failure distribution that is very close to the classical normal life period. This brings up the
question of a much-misunderstood term that applies during the normal life period, MTBF.
How does MTBF fit into the equation for the exponential distribution? MTBF is the scale
parameter (usually termed eta, or η) that defines the specific model for an exponential
distribution. The equation for the density function of an exponential distribution is given by:
$$f(t) = \frac{1}{\eta} e^{-t/\eta}$$
where:
t is time and η is the MTBF, the reciprocal of the constant failure rate.
MTBF Summary
As we have seen here, it is logical that a device can have an MTBF that is much greater than
its wear-out time because MTBF is only a projection of the normal lifetime failures to a
cumulative level of 63.2%. Most, if not all, devices will have failed due to wear-out modes
long before the MTBF time.
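The 63.2% figure follows from the exponential model: with a constant failure rate, the cumulative fraction failed by time t is 1 - exp(-t/MTBF), which equals 1 - 1/e at t = MTBF. The sketch below is a minimal illustration under that assumption. The function name and the 3.5-million-hour MTBF are illustrative choices, not measured data.

```python
import math

def fraction_failed(t_hours, mtbf_hours):
    """Cumulative fraction failed by time t, assuming a constant failure rate."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)

mtbf = 3.5e6           # hours; illustrative value only
hours_per_year = 8760  # assumes continuous operation

# At t = MTBF the cumulative fraction failed is 1 - 1/e.
at_mtbf = fraction_failed(mtbf, mtbf)

# Over one year of continuous use only a small fraction fails.
one_year = fraction_failed(hours_per_year, mtbf)

print(f"failed by t = MTBF:   {at_mtbf:.1%}")
print(f"failed in first year: {one_year:.3%}")
```

An MTBF of 3.5 million hours corresponds to roughly 400 years of continuous operation, far beyond any plausible wear-out time. This is why the number is better read as a failure rate than as a lifetime.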
A major problem for many people with the term Mean Time Between Failures is that it is
expressed as "time" when it is really used to indicate failure rate during the normal life
period. To further confuse the issue, some people use the term MTBF to indicate Mean Time
Before Failure, a case when it applies to wear-out modes and really does relate to service
life. And, as I noted above, some people don't know what they are talking about and claim
service life is equal to "Mean Time Between Failures"!
Good news is that recent data sheets from some vendors (particularly those making
electronic assemblies) show MTBF under the heading of "reliability" and a separate value for
service life. In this case, the vendor has described both the expected failure rate during the
normal life period of the bathtub (e.g. "MTBF = 3.5 million hours") and the point in time at
which the product is expected to start up the wear-out part of the bathtub curve
(e.g. "service life greater than 8 years"). Bad news is that some vendors, even major
technology firms, still don't understand reliability concepts, at least as expressed in their
data sheets.
When you are specifying components for a product and want to understand how long it
might operate and what the failure rate might be during normal life, be sure to find out
what the vendor means by "MTBF." And if he thinks it means "Minimum Time Before Failure"
calculated from MIL-HDBK-217, be careful!
The wear-out period does not occur at one time for all components. The shortest-lived
component will determine the location of the wear-out time in a given product. In designing
a product, the engineer must assure that the shortest-lived component lasts long enough to
provide a useful service life. If the component is easily replaced, such as tires, replacement
may be expected and will not degrade the perception of the product's reliability. If the
component is not easily replaced and not expected to fail, failure will cause customer
dissatisfaction.
Properly applied, QALT can provide useful information from tests much shorter in length
than the expected operating time of a design. However, much care must be taken to assure
that all possible failure modes have been investigated. Running a quantitative accelerated
life test without considering all possible failure modes and their accelerating stress types
may miss a significant failure mode and invalidate the conclusions. As appropriate,
mechanics, electronics, physics and chemistry must all be considered when designing a
QALT.
Note that "MTBF" testing, using many units in parallel to shorten test times, is a popular
method of life testing. It does not apply to testing for wear-out! It can apply to testing for
normal life failures, but the results of such testing should never be extrapolated to times
longer than were used for the test itself.
Conclusion
As demonstrated in Parts One and Two of this article, the traditional bathtub curve is a
reasonable, qualitative illustration of the key kinds of failure modes that can affect a
product. Quantitative models such as the Weibull distribution can be used to assess actual
designs and determine if observed failures are decreasing, constant or increasing over time
so that appropriate actions can be taken. The exponential distribution and the related Mean
Time Between Failures (MTBF) metric are appropriate for analyzing data for a product in the
"normal life" period, which is characterized by a constant failure rate. But be careful - many
people have "imposed" a constant failure rate model on products that should be
characterized by increasing or decreasing failure rates, just because the exponential
distribution is an easy model to use.
Do not assume that a product will exhibit a constant failure rate. Use life testing and/or field
data and life distribution analysis to determine how your product behaves over its expected
lifetime. In addition to traditional life data analysis models (such as the Weibull distribution),
quantitative accelerated life testing (QALT) may be a valuable technique to better
understand failure distributions of highly reliable products with reduced testing time, in a
cost-effective manner. Without a QALT approach to testing, there is no way to accurately
assess the long-term reliability of a product in a short time. If you need to understand the
reliability of a device for a one-year use-life, a non-accelerated test of 12 units for one
month will not do it. It will only provide information on one month of use. Projection to one
year will be invalid if a wear-out mode occurs, for example, in six months. The only way to
find a wear-out mode is to test long enough to observe it, with or without a QALT approach.
When dealing with vendors and their claims of reliability for components you wish to use, be
sure you understand how they determined these figures and how well they understand the
consequences of the bathtub curve.
Weibull++
In the last several issues of Reliability HotWire, we looked at how distributions are defined
and how common reliability metrics are derived. In this issue, we will take a closer look at a
specific distribution that is widely used in life data analysis - the Weibull distribution. Named
for its inventor, Waloddi Weibull, this distribution is widely used in reliability engineering and
elsewhere due to its versatility and relative simplicity.
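The three-parameter Weibull pdf is given by:
$$f(T) = \frac{\beta}{\eta} \left( \frac{T - \gamma}{\eta} \right)^{\beta - 1} e^{-\left( \frac{T - \gamma}{\eta} \right)^{\beta}}$$
Where:
$$f(T) \geq 0, \quad T \geq 0 \text{ or } \gamma, \quad \beta > 0, \quad \eta > 0, \quad -\infty < \gamma < \infty$$
and:
β is the shape parameter, also known as the Weibull slope
η is the scale parameter
γ is the location parameter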
Frequently, the location parameter, γ, is not used, and the value for this parameter can be
set to zero. When this is the case, the pdf equation reduces to that of the two-parameter
Weibull distribution. There is also a form of the Weibull distribution known as the
one-parameter Weibull distribution. This in fact takes the same form as the two-parameter
Weibull pdf, the only difference being that the value of the shape parameter, β, is assumed
to be known beforehand. This assumption means that only the scale parameter needs to be
estimated, allowing for analysis of small data sets. It is recommended that the analyst have
a very good and justifiable estimate for β before using the one-parameter Weibull
distribution for analysis.
As was mentioned previously, the Weibull distribution is widely used in reliability and life
data analysis due to its versatility. Depending on the values of the parameters, the Weibull
distribution can be used to model a variety of life behaviors. An important aspect of the
Weibull distribution is how the values of the shape parameter, β, and the scale parameter, η,
affect such distribution characteristics as the shape of the pdf curve, the reliability and the
failure rate.
The Weibull shape parameter, β, is also known as the Weibull slope. This is because the
value of β is equal to the slope of the line in a probability plot. Different values of the shape
parameter can have marked effects on the behavior of the distribution. In fact, some values
of the shape parameter will cause the distribution equations to reduce to those of other
distributions. For example, when β = 1, the pdf of the three-parameter Weibull reduces to
that of the two-parameter exponential distribution. The parameter β is a pure number (i.e.,
it is dimensionless).
The following figure shows the effect of different values of the shape parameter, β, on the
shape of the pdf (while keeping the scale parameter, η, constant). One can see that the
shape of the pdf can take on a variety of forms based on the value of β.
Looking at the same information on a Weibull probability plot, one can easily understand
why the Weibull shape parameter is also known as the slope. The following plot shows how
the slope of the Weibull probability plot changes with β. Note that the models represented
by the three lines all have the same value of η.
Another characteristic of the distribution where the value of β has a distinct effect is the
failure rate. The following plot shows the effect of the value of β on the Weibull failure rate.
This is one of the most important aspects of the effect of β on the Weibull distribution. As is
indicated by the plot, Weibull distributions with β < 1 have a failure rate that decreases with
time, also known as infantile or early-life failures. Weibull distributions with β close to or
equal to 1 have a fairly constant failure rate, indicative of useful life or random failures.
Weibull distributions with β > 1 have a failure rate that increases with time, also known as
wear-out failures. These comprise the three sections of the classic "bathtub curve." A mixed
Weibull distribution with one subpopulation with β < 1, one subpopulation with β = 1 and
one subpopulation with β > 1 would have a failure rate plot that was identical to the bathtub
curve. An example of a bathtub curve is shown in the following chart.
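The three regimes can be checked numerically with the standard Weibull failure rate function, (beta/eta) * ((t - gamma)/eta)^(beta - 1). The sketch below is illustrative only. The function name, the scale value of 1000 hours and the sample times are assumptions chosen for the example.

```python
def weibull_failure_rate(t, beta, eta, gamma=0.0):
    """Weibull failure rate at time t (shape beta, scale eta, location gamma)."""
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)

eta = 1000.0                              # hours; illustrative scale parameter
times = [100.0, 500.0, 1000.0, 2000.0]    # hours; illustrative sample times

# beta < 1: decreasing rate, beta = 1: constant rate, beta > 1: increasing rate.
for beta in (0.5, 1.0, 3.0):
    rates = [weibull_failure_rate(t, beta, eta) for t in times]
    print(f"beta = {beta}: " + ", ".join(f"{r:.2e}" for r in rates))
```

With beta = 1 the rate is simply 1/eta at every time, which is the constant-failure-rate (exponential) case.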
Weibull Scale Parameter, η
A change in the scale parameter, η, has the same effect on the distribution as a change of
the abscissa scale. Increasing the value of η while holding β constant has the effect of
stretching out the pdf. Since the area under a pdf curve is a constant value of one, the
"peak" of the pdf curve will also decrease with the increase of η, as indicated in the following
figure.
If η is increased, while β and γ are kept the same, the distribution gets stretched out
to the right and its height decreases, while maintaining its shape and location.
If η is decreased, while β and γ are kept the same, the distribution gets pushed in
towards the left (i.e., towards its beginning or towards 0 or γ), and its height increases.
η has the same unit as T, such as hours, miles, cycles, actuations, etc.
Weibull Reliability Metrics
As was mentioned in last month's Reliability Basics, the pdf can be used to derive
commonly-used reliability metrics such as the reliability function, failure rate, mean and
median. The equations for these functions of the Weibull distribution will be presented in the
following section, without derivations for the sake of brevity and simplicity. Note that in the
rest of this section we will assume the most general form of the Weibull distribution, the
three-parameter form. The appropriate substitutions to obtain the other forms, such as the
two-parameter form where γ = 0, or the one-parameter form where β is a constant, can
easily be made.
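The Weibull reliability function is given by:
$$R(T) = e^{-\left( \frac{T - \gamma}{\eta} \right)^{\beta}}$$
The Weibull failure rate function is given by:
$$\lambda(T) = \frac{f(T)}{R(T)} = \frac{\beta}{\eta} \left( \frac{T - \gamma}{\eta} \right)^{\beta - 1}$$
The mean life of the Weibull distribution is given by:
$$\bar{T} = \gamma + \eta \, \Gamma\left( \frac{1}{\beta} + 1 \right)$$
where Γ(*) is the gamma function. The gamma function is defined as:
$$\Gamma(n) = \int_0^\infty e^{-x} x^{n-1} \, dx$$
The equation for the median life, or B50 life, for the Weibull distribution is given by:
$$\breve{T} = \gamma + \eta \left( \ln 2 \right)^{1/\beta}$$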
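The standard three-parameter Weibull expressions for reliability, mean life and median life can be evaluated with the Python standard library alone. The following is a minimal sketch. The function names and the parameter values (shape 2, scale 1000 hours, location 0) are illustrative assumptions.

```python
import math

def weibull_reliability(t, beta, eta, gamma=0.0):
    """Probability of surviving to time t."""
    return math.exp(-(((t - gamma) / eta) ** beta))

def weibull_mean(beta, eta, gamma=0.0):
    """Mean life: gamma + eta * Gamma(1/beta + 1)."""
    return gamma + eta * math.gamma(1.0 / beta + 1.0)

def weibull_median(beta, eta, gamma=0.0):
    """Median (B50) life: gamma + eta * (ln 2)^(1/beta)."""
    return gamma + eta * math.log(2.0) ** (1.0 / beta)

beta, eta = 2.0, 1000.0  # illustrative shape and scale (hours)

mean_life = weibull_mean(beta, eta)
median_life = weibull_median(beta, eta)

print(f"mean life:   {mean_life:.1f} hours")
print(f"median life: {median_life:.1f} hours")
print(f"R(median):   {weibull_reliability(median_life, beta, eta):.3f}")
```

By construction, the reliability at the median life is 0.5, and with a shape parameter of 1 the mean life equals the scale parameter, which is the exponential MTBF.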
How to use Condition Based Maintenance Strategy for Equipment Failure Prevention
Most equipment failures have no relationship to length of time in-service. Most
failures are unpredictable. But if you detect a future failure early, you can plan and
do the repair cost effectively before it becomes a breakdown.
With the introduction in the 1960s of the first Boeing jumbo jets, questions
were raised about the sense of continuing with maintenance requirements
based on the traditional bath-tub curve maintenance paradigm present at
the time.
(NOTE: There is a large question as to the accuracy of the Nolan and Heap
1978 Reliability Centered Maintenance report that first published the failure
curves. The analysis mixed up overhauled equipment with new equipment,
which are not comparable, and mixed up maintained assemblies with
unmaintained parts, which also are not comparable. Certainly Pattern F is
wrongly showing maximum failures at start-up. In reality most equipment will
go when the start button is pressed or the key turned. At zero time Pattern F
should begin much closer to the bottom left-hand corner of the plot and show
a rapid rise in failures, signifying that most operating equipment starts but
soon after start-up the failure rate rises from poor quality control during the
life cycle to that point in time. It then follows that all the other failure
patterns showing the curve starting high up the y-axis at time zero are also
suspicious as to their accuracy.)
Whenever these results have been tested by other parties, their findings
have confirmed the presence of the curves found in the original
investigations, though the proportions varied because equipment was in
different situations and used in different ways. What seems clear is that with
the equipment technologies available in the early 21st century, equipment
failures fit one of the six failure curves in Figure 1, but the proportions vary
depending on engineering decisions and operational circumstances.
Here was definite proof that most failures were not age-related, where the
equipment failed because of length of use. It meant that time-based
preventative maintenance was pointless in most cases. Age-related failures
include stress fatigue failures (e.g. shafts breaking, springs breaking, boiler
tubes leaking), erosion/corrosion failures (e.g. material erosion, metal
corrosion), wear-out failures (e.g. car tyre tread wear, packed gland leaks)
and other such failures where the length of operating time contributes to the
eventual failure.
You find time-related failures by looking at your work orders on each item of
equipment for as far back as you can and creating a Pareto Chart of its
failure history. You can also get good answers by asking your long-serving
maintainers and operators what keeps failing on each piece of equipment.
About 80% to 85% of your repair work orders will happen randomly. You
cannot predict a date when they occur. But you can detect that they have
started. It is possible to use the changed condition of the equipment to tell
when a failure is due.
Starting from new, a part properly built and installed into equipment without
any errors will operate at a particular level of performance, which ideally is
at its design requirement. As its operating life progresses, degradation
occurs. Please do not assume degradation is normal and nothing can be
done about it. This is not the case. In fact equipment failure should never
happen! The acceptance of equipment failure as normal is an expensive
misunderstanding to make.
Regardless of the reasons for degradation, the item can no longer meet its
original service requirements and its level of performance falls. By detecting
the loss-in-condition of the item you have advance warning that
degradation has started. If you can detect this change in performance level
you have a means to forecast a coming failure.
Figure 2 below represents the typical degradation process experienced by
equipment. Following a period of normal operation, where the item has been
running smoothly, a change occurs that affects its performance. The earliest
time where you can detect the degradation is called the P-point, or potential
failure point, since after this time the item can potentially fail at any time.
This change gradually, or rapidly in some cases, worsens to the point that
the equipment cannot reliably and safely deliver its service duty. At this
stage the item has functionally failed, i.e. it is not delivering its required
performance, and the location is called the functional failure point or F-point.
If it continues in operation the part will fail and the equipment will stop
working.
The most important issue is to spot the tell-tale signs early so that you have
time to plan and prepare an organised and least cost correction. Once the
equipment is broken you will have to spend whatever time, money and
resources it takes to get it back in operation fast.
When you need expert help for more accurate results, or a measured opinion
on the implications to continued operation, or the equipment is particularly
critical to your business and you do not have the necessary expertise and
skills in-house, sub-contract those specialities in at the time.
With time-random failures the simplest (but not the only) management
strategy to use is to inspect your equipment and look for evidence of
degraded conditions. You can use a continuous means of monitoring
condition by using your process computers to trend an equipment's
performance graphically (e.g. power use versus throughput, flow versus
pressure, etc), or you can introduce periodic inspections of equipment
condition through observation and data measurement (e.g. lubrication
sampling, temperature measurement, etc).
If condition monitoring is based on timed inspections, you must set the time
periods at a frequency that will let you spot the change well before the
impending failure. Figure 3 shows a frequency inspection period that will
detect the degrading performance well before the failure. The rule of thumb
is to set the condition inspection interval to one quarter the width of the P-F
period. This gives you three inspection points before failure. The second
point lets you confirm that the first point is correct and not a random
error. Once you are certain the item is degrading you plan for its replacement
or repair.
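The quarter-width rule of thumb can be written down as a small calculation. The sketch below is only an illustration. The function names and the eight-week P-F period are hypothetical, and the count is the worst case where the first inspection after the P-point falls a full interval later.

```python
import math

def inspection_interval(pf_period):
    """Rule-of-thumb inspection interval: one quarter of the P-F period."""
    return pf_period / 4.0

def guaranteed_inspections(pf_period, interval):
    """Minimum number of inspections falling strictly between the P-point
    and the F-point, whatever the timing of the first inspection after P."""
    return math.ceil(pf_period / interval) - 1

pf_weeks = 8.0  # hypothetical P-F period in weeks
interval = inspection_interval(pf_weeks)
checks = guaranteed_inspections(pf_weeks, interval)

print(f"inspect every {interval} weeks; at least {checks} inspections before failure")
```

A longer interval reduces the guaranteed count. Inspecting every half P-F period, for example, guarantees only one inspection inside the window and leaves no second reading to confirm the first.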
Having discovered the start of a failure you can prepare for its repair, or put
into place strategies and make changes in its use, to extend the time to
failure.
But doing condition based maintenance will only marginally reduce your
maintenance costs. The main thing condition monitoring does for you is to
tell you that you have a problem in time to deal with it in a low cost way. It
does not stop the problem!
There is one more step that you must now do to drastically reduce your
maintenance costs. You must remove the failure mode. Having discovered a
cause of failure through condition monitoring, you must now get rid of it, or
else it can randomly occur at any time in the future after it is repaired. Only
by removing failure modes will you significantly reduce your future
maintenance. In addition to having a way to find the potential failure point
you also need a business process that finds and removes the cause of
random failures. Random failures are stress induced by specific acts and
events. Unless you remove the cause of the incident it will happen again and
again, costing you money, time and effort at every occurrence. When you
introduce predictive maintenance into a company you also need to introduce
root cause failure removal as part of the maintenance reduction strategy.
Mike Sondalini
Managing Director
Lifetime Reliability Solutions HQ
ABSTRACT
1. INTRODUCTION
1.1 Reliability:
There are two basic aspects of reliability: reliability evaluation and
reliability improvement. The reliability evaluation of a product or process
includes the study of the different phases of the product life cycle and its
failure analysis. It can be expressed in quantitative terms. Reliability
improvement is the process of preventing all chances of failure, in other
words sealing all the loopholes in the design, manufacturing, operation and
maintenance practices involved with the particular product. To improve the
reliability of the product, all the stakeholders have to contribute throughout
its life cycle.
* Only the first method, the use of failure data, is discussed briefly below;
it is used later in this article for the reliability calculations.
RELIABILITY CALCULATION BY USE OF FAILURE DATA:
In this concept the reliability of any system is mainly affected by its failure
rate. As a rule of thumb, fewer failures result in higher reliability and
vice versa. Failure rate is the number of failures of a system or component
per unit time, for example, failures per hour. It is often denoted by the Greek
letter λ (lambda).
The mean time between failures (MTBF) is often used instead of the failure
rate in practice. MTBF is the mean time gap between two failures of any
system or equipment. The failure rate is simply the multiplicative inverse of
the MTBF (λ = 1/M). We can determine the reliability by using the following
mathematical relationship established by Weibull.[1]
$$R = e^{-T/M} = e^{-\lambda T}$$
where:
R = reliability
T = operating time (hours)
M = MTBF
λ = failure rate
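As a worked illustration of the relationship R = e^(-T/M), with hypothetical numbers that are not taken from the study, suppose M = 1000 hours and T = 100 hours:

```latex
\lambda = \frac{1}{M} = \frac{1}{1000\ \text{h}} = 0.001\ \text{failures per hour},
\qquad
R = e^{-T/M} = e^{-100/1000} = e^{-0.1} \approx 0.905
```

In other words, under these assumed values there is roughly a 90% chance of running 100 hours without a failure.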
2. EXPERIMENTAL WORK
For this experiment, failure data were collected for ten industrial process
fans in a particular cement plant over ten consecutive years. The number of
failures in each year was recorded for each individual fan. To generalize and
to establish the relationship through the data table and graphs, the failures
of all fans in each year are added together to give the total number of
failures per year. The average number of failures per year is then calculated
by dividing the total number of failures by the total number of fans. With
these data, the bathtub curve is validated in the first phase of this
experiment, the concept of the Madhabs Hat curve of Reliability is
established in the second phase, and reliability improvement methods are
discussed in the last phase.
The first part represents the failure rate of early life period which is
decreasing in nature.
The second part represents the failure rate of useful life period which is
more or less constant in nature.
The third part represents failure rate of wear out period which is
increasing in nature.
The bathtub curve does not depict the failure rate of a single item, but
describes the relative failure rate of an entire population of products over
time. The bathtub curve is generated by tracing the rate of failures of
equipment throughout its life period. As explained earlier, a set of data has
been collected for the industrial process fans, which is shown in Table-1.
We can now check the validity of the bathtub curve by using these data with
the help of MINITAB software.
The failure rate (total number of failures per year) from Table-1 has been
plotted against the year, and the resulting curve is shown in Figure 2. The
curve obtained from the study looks like a bathtub. Following the curve, we
can observe the three main parts already discussed above.
Table-1: Number of failures of each fan per year, with the total number of
failures per year (TOTAL) and the average rate of failure in nos. per year (AVG.)
YEAR  FAN-1  FAN-2  FAN-3  FAN-4  FAN-5  FAN-6  FAN-7  FAN-8  FAN-9  FAN-10  TOTAL  AVG.
1st   1      1      2      3      1      0      1      1      2      1       13     1.3
2nd   1      2      0      2      1      0      2      2      1      1       12     1.2
3rd   1      0      2      0      1      0      0      1      0      1       6      0.6
4th   1      0      1      1      1      0      0      1      2      0       7      0.7
5th   1      1      1      0      0      1      0      1      2      1       8      0.8
6th   1      0      2      0      0      0      0      1      1      2       7      0.7
7th   1      1      0      1      0      1      1      0      1      1       7      0.7
8th   1      1      1      1      2      1      0      1      0      0       8      0.8
9th   2      1      2      2      2      3      1      1      1      0       15     1.5
10th  2      2      2      1      2      0      3      0      2      2       16     1.6
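The totals and averages in Table-1 can be reproduced directly from the per-fan counts. The sketch below transcribes the table and recomputes both columns. The variable names are illustrative, and the final comparison is only a rough numerical check of the bathtub shape, not the MINITAB analysis used in the study.

```python
# Failures per fan (FAN-1 .. FAN-10) for years 1 to 10, transcribed from Table-1.
failures = [
    [1, 1, 2, 3, 1, 0, 1, 1, 2, 1],
    [1, 2, 0, 2, 1, 0, 2, 2, 1, 1],
    [1, 0, 2, 0, 1, 0, 0, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 0, 1, 2, 0],
    [1, 1, 1, 0, 0, 1, 0, 1, 2, 1],
    [1, 0, 2, 0, 0, 0, 0, 1, 1, 2],
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 2, 1, 0, 1, 0, 0],
    [2, 1, 2, 2, 2, 3, 1, 1, 1, 0],
    [2, 2, 2, 1, 2, 0, 3, 0, 2, 2],
]

# Total failures per year and average failures per fan per year.
totals = [sum(row) for row in failures]
averages = [total / len(row) for total, row in zip(totals, failures)]

for year, (total, avg) in enumerate(zip(totals, averages), start=1):
    print(f"year {year:2d}: total = {total:2d}, average = {avg:.1f}")

# Rough bathtub check: the middle years sit below both the early and late years.
middle_max = max(totals[2:8])
print(middle_max < min(totals[:2]) and middle_max < min(totals[8:]))
```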
Using the above data, the Mean Time Between Failures and the reliability for
each year have been calculated, as shown in Table 2. As explained earlier,
the reliability has been calculated by the failure data method. As with the
bathtub curve, the reliability data for all the years were entered into MINITAB
software, giving the curve shown in Figure 3. On close observation, the curve
looks like a hat, so it could be known as the Madhabs Hat curve of
Reliability. Like the bathtub curve, it has three main regions.
The first region is the early life period, or infant stage, where the
reliability shows an increasing trend, just opposite to the bathtub curve.
The second region is the useful life period, where the reliability is more or
less constant, just as in the bathtub curve.
The third region is the wear-out period, where the reliability shows a
decreasing trend, just opposite to the bathtub curve.
The hat curve of reliability will be helpful for comparison with the actual
reliability cycle of any equipment. It will also work as a guideline for
equipment life cycle performance analysis.
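The hat shape can be illustrated from the average failure rates with the relationship R = e^(-T/M). The sketch below rests on two assumptions that the article does not state: continuous operation (8760 hours per year) and a mission time T of 1000 hours. Both values are hypothetical, so the numbers will not match the study's Table 2.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: continuous operation
MISSION_HOURS = 1000.0   # assumption: hypothetical T, not stated in the study

# Average failures per fan per year for years 1 to 10.
avg_failures_per_year = [1.3, 1.2, 0.6, 0.7, 0.8, 0.7, 0.7, 0.8, 1.5, 1.6]

# MTBF in hours for each year, then reliability over the assumed mission time.
mtbf_hours = [HOURS_PER_YEAR / rate for rate in avg_failures_per_year]
reliability = [math.exp(-MISSION_HOURS / m) for m in mtbf_hours]

for year, (m, r) in enumerate(zip(mtbf_hours, reliability), start=1):
    print(f"year {year:2d}: MTBF = {m:7.0f} h, R = {r:.3f}")
```

Because reliability falls as the failure rate rises, the values are lowest in the early and late years and highest in the middle years, which is the hat profile described above.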
[Table 2: hours (T) and reliability (e^(-T/M)) for each year]
Failure can be described in many different ways. One definition is as
follows:
Failure is a deviation from the designed and assured performance level of
any equipment that creates dissatisfaction for its user.
Some individual units fail relatively early, others last until wear-out, and
some fail during the relatively long period typically called normal life.
Failures during infant mortality are highly undesirable and are always
caused by defects and mistakes such as material defects, design mistakes
and manufacturing defects. Normal life failures are normally considered to
be random cases of "stress exceeding designed strength" due to abnormal
operating conditions. Wear-out is a fact of life due to fatigue or
depreciation of materials. After the useful life period most equipment fails,
which is normal and acceptable for both the manufacturer and the customer.
A product's useful life is limited by its endurance design. A product
manufacturer must assure that all specified materials are adequately
designed to function through the intended product life cycle. There are
mainly two types of premature failure observed in any equipment:
To prevent both types of premature failure and to increase reliability, for
smooth operation of equipment, different failure prevention techniques are
adopted. These are discussed below.
Analyze all minor issues as well as big failures using proper failure
analysis techniques such as FMEA, fishbone diagrams and root-cause failure
analysis.
Always learn the lessons from past failures and follow the preventive
mode to avoid those failures in future.
Inspection attributes: vibration in mm/sec, bearing temperature, looseness of
bolts, noise, RPM.
Motor attributes: RPM, current in amperes (R, Y, B).
CONCLUSION
In the first phase of this study the bathtub curve is validated. In the second
phase a new concept, the Hat curve of Reliability, has been established;
although it is a simple clarification, it is important for the life cycle
reliability study of any equipment. In the third phase some failure prevention
methods have been discussed.
It is hoped that this study will be helpful for maintenance engineers, as
well as design engineers and reliability engineers, in achieving excellence in
their respective fields.
REFERENCES
I. http://reliabilityanalyticstoolkit.appspot.com/weibull_distribution
II. http://www.weibull.com/hotwire/issue21/hottopics21.htm
III. http://www.lifetime-reliability.com/free-articles/work-quality-assurance/defect-elimination.html
IV. https://www.dmgeventsme.com/machineryfailureprevention/