The Bathtub Curve and Product Failure Behavior

Part One - The Bathtub Curve, Infant Mortality and Burn-in
by Dennis J. Wilkins
Retired Hewlett-Packard Senior Reliability Specialist, currently a ReliaSoft
Reliability Field Consultant
This paper is adapted with permission from work done while at Hewlett-
Packard.
Reliability specialists often describe the lifetime of a population of products
using a graphical representation called the bathtub curve. The bathtub
curve consists of three periods: an infant mortality period with a decreasing
failure rate followed by a normal life period (also known as "useful life")
with a low, relatively constant failure rate and concluding with a wear-out
period that exhibits an increasing failure rate. This article provides an
overview of how infant mortality, normal life failures and wear-out modes
combine to create the overall product failure distributions. It describes
methods to reduce failures at each stage of product life and shows how
burn-in, when appropriate, can significantly reduce operational failure rate
by screening out infant mortality failures. The material will be presented in
two parts. Part One (presented in this issue) introduces the bathtub curve
and covers infant mortality and burn-in. Part Two (presented in next
month's HotWire) will address the remaining two periods of the bathtub
curve: normal life failures and end of life wear-out.
Figure 1: The Bathtub Curve
The bathtub curve, displayed in Figure 1 above, does not depict the failure
rate of a single item, but describes the relative failure rate of an entire
population of products over time. Some individual units will fail relatively
early (infant mortality failures), others (we hope most) will last until wear-
out, and some will fail during the relatively long period typically called
normal life. Failures during infant mortality are highly undesirable and are
always caused by defects and blunders: material defects, design blunders,
errors in assembly, etc. Normal life failures are normally considered to be
random cases of "stress exceeding strength." However, as we'll see, many
failures often considered normal life failures are actually infant mortality
failures. Wear-out is a fact of life due to fatigue or depletion of materials
(such as lubrication depletion in bearings). A product's useful life is limited
by its shortest-lived component. A product manufacturer must assure that
all specified materials are adequate to function through the intended
product life.
Note that the bathtub curve is typically used as a visual model to illustrate
the three key periods of product failure and not calibrated to depict a graph
of the expected behavior for a particular product family. It is rare to have
enough short-term and long-term failure information to actually model a
population of products with a calibrated bathtub curve.
Also note that the actual time periods for these three characteristic failure
distributions can vary greatly. Infant mortality does not mean "products that
fail within 90 days" or any other defined time period. Infant mortality is the
time over which the failure rate of a product is decreasing, and may last for
years. Conversely, wear-out will not always happen long after the expected
product life. It is a period when the failure rate is increasing, and has been
observed in products after just a few months of use. This, of course, is a
disaster from a warranty standpoint!
We are interested in the characteristics illustrated by the entire bathtub
curve. The infant mortality period is a time when the failure rate is
dropping, but is undesirable because a significant number of failures occur
in a short time, causing early customer dissatisfaction and warranty
expense. Theoretically, the failures during normal life occur at random but
with a relatively constant rate when measured over a long period of time.
Because these failures may incur warranty expense or create service support
costs, we want the bottom of the bathtub to be as low as possible. And we
don't want any wear-out failures to occur during the expected useful
lifetime of the product.
Infant Mortality - What Causes It and What to Do About It?
From a customer satisfaction viewpoint, infant mortalities are
unacceptable. They cause "dead-on-arrival" products and undermine
customer confidence. They are caused by defects designed into or built into
a product. Therefore, to avoid infant mortalities, the product manufacturer
must determine methods to eliminate the defects. Appropriate
specifications, adequate design tolerance and sufficient component derating
can help, and should always be used, but even the best design intent can
fail to cover all possible interactions of components in operation. In addition
to the best design approaches, stress testing should be started at the
earliest development phases and used to evaluate design weaknesses and
uncover specific assembly and materials problems. Tests like these are
called HALT (Highly Accelerated Life Test) or HAST (Highly Accelerated Stress
Test) and should be applied, with increasing stress levels as needed, until
failures are precipitated. The failures should be investigated and design
improvements should be made to improve product robustness. Such an
approach can help to eliminate design and material defects that would
otherwise show up with product failures in the field.
After manufacturing of a product begins, a stress test can still be valuable.
There are two distinct uses for stress testing in production. One purpose
(often called HASA, Highly Accelerated Stress Audit) is to identify defects
caused by assembly or material variations that can lead to failure and to
take action to remove the root causes of these defects. The other purpose
(often called burn-in) is to use stress tests as an ongoing 100% screen to
weed out defects in a product where the root causes cannot be eliminated.
The first approach, eliminating root causes, is generally the best approach
and can significantly reduce infant mortalities. It is usually most cost-
effective to run 100% stress screens only for early production, then reduce
the screen to an audit (or entirely eliminate it) as root causes are identified,
the process/design is corrected and significant problems are
removed. Unfortunately, some companies put 100% burn-in processes in
place and keep using them, addressing the symptoms rather than identifying
the root causes. They just keep scrapping and/or reworking the same
defects over and over. For most products, this is not effective from a cost
standpoint or from a reliability improvement standpoint.
There is a class of products where ongoing 100% burn-in has proven to be
effective. This is with technology that is "state-of-the-art," such as leading
edge semiconductor chips. There are bulk defects in silicon and minute
fabrication variances that cannot be designed out with the current state of
technology. These defects can cause some parts to fail very early relative to
the majority of the population. Burn-in can be an effective way to screen
out these weak parts. This will be addressed later in this article.
A Quantitative Look at Infant Mortality Failures Using the Weibull
Distribution
The Weibull distribution is a very flexible life distribution model that can be
used to characterize failure distributions in all three phases of the bathtub
curve. The basic Weibull distribution has two parameters, a shape
parameter, often termed beta (β), and a scale parameter, often termed eta
(η). The scale parameter, eta, determines the time by which a given portion
of the population (63.2%) will have failed. The shape parameter, beta, is the key
feature of the Weibull distribution that enables it to be applied to any phase
of the bathtub curve. A beta less than 1 models a failure rate that decreases
with time, as in the infant mortality period. A beta equal to 1 models a
constant failure rate, as in the normal life period. And a beta greater than 1
models an increasing failure rate, as during wear-out. There are several
ways to view this distribution, including probability plots, survival plots and
failure rate versus time plots. The bathtub curve is a failure rate vs. time
plot.
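The effect of beta on the failure rate can be checked numerically. The sketch below uses the standard two-parameter Weibull failure rate, h(t) = (beta/eta)(t/eta)^(beta - 1). The scale value of 1,000 hours and the three beta values are arbitrary choices for illustration, not figures from this article.

```python
def weibull_failure_rate(t, beta, eta):
    """Instantaneous failure (hazard) rate of a two-parameter Weibull at time t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def trend(rates):
    """Label a sequence of failure rates as decreasing, constant or increasing."""
    if rates[0] > rates[-1]:
        return "decreasing"
    if rates[0] < rates[-1]:
        return "increasing"
    return "constant"


ETA = 1000.0  # hypothetical scale parameter, in hours
TIMES = (10.0, 100.0, 1000.0)

for beta in (0.5, 1.0, 3.0):
    rates = [weibull_failure_rate(t, beta, ETA) for t in TIMES]
    print(f"beta={beta}: {trend(rates)}", [f"{r:.2e}" for r in rates])
```

A beta below 1 gives the falling left side of the bathtub, a beta of 1 gives the flat bottom, and a beta above 1 gives the rising right side.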
Typical infant mortality distributions for state-of-the-art semiconductor
chips follow a Weibull model with a beta in the range of 0.2 to 0.6. If such a
distribution is viewed in terms of failure rate versus time, it looks like the
plot in Figure 2.

Figure 2: Infant Mortality Curve - Failure Rate vs. Time


This plot shows ten years (87,600 hours) of time on the x-axis with failure
rate on the y-axis. It looks a lot like the infant mortality and normal life
portions of the bathtub curve in Figure 1, but this curve models only infant
mortality (decreasing failure rate). Dots on this plot represent failure times
typical of an infant mortality with Weibull beta = 0.2. As you can see, there
are 27 failures before one year, and only 6 failures from one to ten years.
People observing this curve, and the failure points plotted, could not be
blamed for thinking it represents both infant mortality failures (in the first
year or so), and normal life failures after that. But these are only infant
mortality failures - all the way out to ten years!
This plot shows the distribution for a beta value typical of complex, high-
density integrated circuits (VLSI or Very Large Scale Integrated circuits).
Parts such as CPUs, interface controllers and video processing chips often
exhibit this kind of failure distribution over time. A look at this plot shows
that if you could run these parts for the equivalent of three years and
discard the failed parts, the reliability of the surviving parts would be much
higher out to ten years. In fact, until a wear-out mode occurs, the reliability
would continue to improve over time. If there are mechanisms that can
produce normal life failures (theoretically a constant failure rate) mixed in
with the defects that cause the infant mortalities shown above, burn-in can
still provide significant improvement as long as the constant failure rate is
relatively low.
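The benefit of discarding early failures can be expressed as a conditional probability: of the units that survive a screening period, what fraction fails during the following mission time? The sketch below is a minimal illustration for a pure infant mortality population. The beta of 0.2 comes from the text, but the scale parameter is a hypothetical value chosen only to give failure fractions of a few percent; it is not taken from Figure 2.

```python
import math


def weibull_cdf(t, beta, eta):
    """Cumulative fraction failed by time t for a two-parameter Weibull."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def conditional_failure_prob(burn_in, mission, beta, eta):
    """Probability that a unit surviving `burn_in` fails within the next `mission` time."""
    failed_at_screen = weibull_cdf(burn_in, beta, eta)
    failed_at_end = weibull_cdf(burn_in + mission, beta, eta)
    return (failed_at_end - failed_at_screen) / (1.0 - failed_at_screen)


BETA = 0.2         # shape value quoted in the text
ETA_YEARS = 1.0e8  # hypothetical scale parameter, for illustration only

raw = weibull_cdf(10.0, BETA, ETA_YEARS)
screened = conditional_failure_prob(3.0, 10.0, BETA, ETA_YEARS)
print(f"fail within 10 years, no screen:           {raw:.2%}")
print(f"fail within 10 years after a 3-year screen: {screened:.2%}")
```

With any beta below 1, the screened figure comes out lower than the raw figure, which is the improvement the plot suggests.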
Burn-In for Leading Edge Technologies
To see how burn-in can improve the reliability of high tech parts, we'll use a
chart that looks somewhat like the failure rate vs. time curve in Figure 2,
but is more useful. This is a survival plot that directly shows how many units
from a population have survived to a given time. Figure 3 is a plot for a
typical VLSI process with a small "weak" sub-population (defective parts that
will fail as infant mortalities) and a larger sub-population of parts that will
fail randomly at a very low rate over the normal operating life. The x-axis
scale is in years of use (zero to 100 years!) and the y-axis is percent of parts
still operating to spec (starting at 100% and dropping to 50%).
Figure 3 shows that, of the failures that occur in the first 20 years (about
4%), most failures occur in the first year or so, just like we observed in the
infant mortality example above. Because there is a low level, constant
failure rate, this plot shows failures continuing for a hundred years. Of
course, there could be a wear-out mode that comes into play before a
hundred years has elapsed, but no wear-out distribution is considered here.
Electronic components, unlike mechanical assemblies, rarely have wear-out
mechanisms that are significant before many decades of operation.
Figure 3: Mixed Infant Mortality and Normal Life Survival Plot
We're not really interested in the failures much beyond ten years, so let's
look at this same model for only the first ten years. In Figure 4, we have
included sample failure points from the simulation model used to create the
plot. These enable us to view which population (infant mortality or normal
life) the failure came from.
Figure 4: Mixed Infant Mortality and Normal Life Failures
We see that the plot in Figure 4 looks like the early life and normal life
portions of the bathtub curve, and in fact includes both distributions. We
see that over 2% of the units fail in the first year, but it takes ten years for
3% to fail. In actuality, there are still "infant" mortalities occurring well
beyond ten years in this model, but at an ever-decreasing rate. In fact, in
the ten year span of this model there would be very few normal life failures.
Only two failures (~5% of all failures) in this example (large blue dots) come
from the normal life failure population. About 95% of the failures plotted
above (small red dots) are infant mortality failures! This is what the
integrated circuits (IC) industry has observed with complex solid-state
devices. Even after ten years of operation the primary failure cause for ICs
is still infant mortality. In other words, failures are still driven primarily by
defects.
In such cases, burn-in can help. In the plot above you can see that if you
could get three years of operation on this part before you shipped it, you
would have screened out over 80% (about 2.5% divided by 3%) of the parts that would
fail in ten years. So if we were to come up with a method to effectively
"age" the parts the equivalent of three years and eliminate most of the
infant mortalities, the remaining parts would be more reliable than the
original population. Of course, the parts that go through the three-year
"burn-in" would have to last an additional ten years in the field, for a total
of thirteen years. Let's see what this looks like in Figure 5.

Figure 5: Comparison of Failures from Raw and Burned-in Parts


Above, we see fourteen years of failure distribution for the original parts
(not burned-in) and eleven years of expected failure distribution for parts
that received three years of burn-in. In this example, the total cumulative
failures between three years and thirteen years for the original parts (or
from zero to ten years for burned-in parts) is about 0.6%. Without burn-in,
the first ten years would have had about 3% cumulative failures. This is
about a five times reduction in cumulative failures by using burn-in, or in
terms of a change, we would have about 2% fewer cumulative failures in ten
years with burn-in if a dominant infant mortality failure mode exists. Note
that in the first year or two, the relative improvement in reliability is even
greater. At two years, only about 0.1% failures are expected after burn-in
but almost 2.5% without burn-in; a ratio of almost 25:1!
In reality, manufacturers don't have two to three years to spend on burn-in.
They need an accelerated stress test. In the IC industry there are usually
two stresses that are used to accelerate the effective time of burn-in:
temperature and voltage. Increased temperature (relative to normal
operating temperatures) can provide an acceleration of tens of times (10x to
30x is typical). Increased voltages (relative to normal operating levels) can
provide even higher acceleration factors on many types of ICs. Combined
acceleration factors in the range of 1000:1, or more, are typical for many IC
burn-in processes. Therefore, burn-in times of tens of hours can provide
effective operating times of one to five years, significantly reducing the
proportion of parts with infant mortality defects.
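The arithmetic behind that statement is a simple multiplication. The sketch below assumes a combined acceleration factor of 1000:1, as quoted above, and a few example burn-in durations that are illustrative only.

```python
HOURS_PER_YEAR = 8760  # 24 hours per day, 365 days per year


def equivalent_years(burn_in_hours, acceleration_factor):
    """Effective field operating time, in years, delivered by an accelerated burn-in."""
    return burn_in_hours * acceleration_factor / HOURS_PER_YEAR


# Example burn-in durations (hypothetical) at the 1000:1 factor quoted in the text
for hours in (10, 24, 40):
    print(f"{hours} h at 1000:1 is about {equivalent_years(hours, 1000):.1f} years")
```

Ten to forty hours of burn-in at this factor corresponds to roughly one to five years of normal operation.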
What if we try burn-in on a product with no dominant infant mortality
problems? The survival plot for an assembly with a 1% per year "constant"
failure rate (normal life period) is shown below in Figure 6.

Figure 6: Survival Plot for Constant Failure Rate


It's pretty easy to see that burn-in for two years would find ~2% failures, but
operation for an additional two years would find another ~2%. At ten years,
we would have found about 10%. Note, the line is not really a straight line
because a constant failure rate (equivalent to the normal life part of the
bathtub) acts on the remaining population and the remaining population is
decreasing as units fail. Looking at the same burn-in conditions as in the last
example, if we were to provide three years of operation on these parts and
then use them for an additional ten years, what results would we have? The
cumulative failures of the units that passed this screen would be very close
to 9.5%. Without burn-in, the cumulative failures in ten years would be the
same, about 9.5%. There is no advantage to burn-in with a constant (normal
life) failure rate.
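This result is the "memoryless" property of the exponential distribution, and it can be verified directly. The sketch below uses the 1% per year rate from the example; the function names are illustrative.

```python
import math

RATE = 0.01  # failures per unit per year, from the 1% per year example


def exp_cdf(t, rate=RATE):
    """Cumulative fraction failed by time t under a constant failure rate."""
    return 1.0 - math.exp(-rate * t)


def exp_failure_after_screen(burn_in, mission, rate=RATE):
    """Fraction of screen survivors that fail during the following mission time."""
    survivors = 1.0 - exp_cdf(burn_in, rate)
    return (exp_cdf(burn_in + mission, rate) - exp_cdf(burn_in, rate)) / survivors


print(f"no burn-in, 10 years:      {exp_cdf(10):.2%}")                      # 9.52%
print(f"3-year burn-in, +10 years: {exp_failure_after_screen(3, 10):.2%}")  # 9.52%
```

The survivors of the screen are statistically identical to new units, so the screen only adds cost.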
It should be obvious that burn-in of an assembly that is failing due to a
wear-out failure mode (failure rate increasing with time) will actually yield
assemblies that are worse than units that did not go through burn-in. This is
simply because the probability of failure is increasing for every hour the
parts run. Adding operating time simply increases the possibility of a failure
in any future period of time!
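The three burn-in outcomes (helpful, neutral, harmful) can be seen in one expression. Assuming a two-parameter Weibull with reliability R(t) = exp[-(t/η)^β], the probability that a unit that has survived a burn-in of length T also survives a further time t is:

```latex
R(t \mid T) = \frac{R(T+t)}{R(T)}
            = \exp\!\left\{-\left[\left(\frac{T+t}{\eta}\right)^{\beta}
              - \left(\frac{T}{\eta}\right)^{\beta}\right]\right\},
\qquad
\frac{\partial}{\partial T}\left[(T+t)^{\beta} - T^{\beta}\right]
  = \beta\left[(T+t)^{\beta-1} - T^{\beta-1}\right]
```

For T > 0 the derivative is negative when β < 1, zero when β = 1 and positive when β > 1. Longer burn-in therefore raises the conditional reliability in the infant mortality case, leaves it unchanged for a constant failure rate, and lowers it for wear-out.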
Conclusion
In this issue, Part One, we have introduced the concept of the bathtub curve
and discussed issues related to the first period, infant mortality, as well as
the practices, such as burn-in, that are used to address failures of this type.
As this article demonstrates, although burn-in practices are not usually a
practical economic method of reducing infant mortality failures, burn-in has
proven to be effective for state-of-the-art semiconductors where root cause
defects cannot be eliminated. For most products, stress testing, such as
HALT/HAST should be used during design and early production phases to
precipitate failures, followed by analysis of the resulting failures and
corrective action through redesign to eliminate the root causes. In Part Two
(presented in next month's HotWire), we will examine the final two periods
of the bathtub curve: normal life failures and end of life wear-out.

The Bathtub Curve and Product Failure Behavior


Part Two - Normal Life and Wear-Out
by Dennis J. Wilkins
Retired Hewlett-Packard Senior Reliability Specialist, currently a ReliaSoft Reliability Field
Consultant
This paper is adapted with permission from work done while at Hewlett-Packard.
Introduction
Part One of this article (presented in last month's HotWire) introduced the concept of the
reliability bathtub curve. This is a graphical representation of the lifetime of a population of
products, which consists of three key periods. Part One examined the first period of the
curve, infant mortality, and also discussed issues related to burn-in, a common practice to
reduce the occurrence of this type of failure during the useful life of the product. Part Two
(presented here) will address the middle and last periods in the bathtub curve: normal life
(or "useful life") and end of life wear-out. The normal life period is characterized by a low,
relatively constant failure rate with failures that are considered to be random cases of
"stress exceeding strength." The wear-out period is characterized by an increasing failure
rate with failures that are caused by the "wear and tear" on the product over time.
Reliability Bathtub Curve Review
As described in more detail in Part One, the bathtub curve, displayed in Figure 1 below,
does not depict the failure rate of a single item. Instead, the curve describes the relative
failure rate of an entire population of products over time. Some individual units will fail
relatively early (infant mortality failures), others (we hope most) will last until wear-out,
and some will fail during the relatively long period typically called normal life. The first
period is characterized by a decreasing failure rate and consists of failures caused by defects
and blunders. The second period maintains a low and relatively constant failure rate and
consists of random failures typically caused by "stress exceeding strength." The third period
exhibits an increasing failure rate and consists of failures caused by wear-out due to fatigue
or depletion of materials.
Figure 1: The Bathtub Curve
Normal Life Period - Does it Really Exist?
Some reliability specialists like to point out that real products don't exhibit constant failure
rates. This is quite true for a mechanical part where wear-out is the primary failure mode.
And all kinds of parts, mechanical and electronic, are subject to infant mortality failures
from intrinsic defects. But there are common situations where a true random failure
potential exists.
Soft Error Rate (SER) is a fact of life for systems using solid state memory chips. And today
that includes about any electronic device, from a personal computer to a VCR, microwave
oven, digital camera or automotive control module. These errors are caused by two factors:
alpha particles and cosmic rays. These errors are random in time and transient. A bit that is
"flipped" by one of these factors will be corrected when new data is written to the same
memory cell. But if that cell is read before a new write operation takes place, the data read
will be erroneous. The effect of the error may be minor (such as a single pixel of a display
being the wrong color for one screen refresh cycle) or major (such as crashing a PC). In
business-critical computer systems, special error correcting codes are employed to prevent
SER from causing any data loss or system malfunctions. However, most electronic products
will malfunction in some way from SER.

For SER, the failure mode is a normal life failure. There is an average rate of occurrence but
the failures occur "at random." The failures in most cases cause only a minor deviation in
operation and are self-correcting. No repair is needed to "fix" a product subject to SER and,
in fact, no "fix" can eliminate SER effects. Only a significant design change (using an error
correcting design) can eliminate the effects of SER, but nothing can eliminate SER.
There are other cases, especially in electronic products, where a "constant" failure rate may
be appropriate (although approximate). This is the basis for MIL-HDBK-217 and other
methods to estimate system failure rates from consideration of the types and quantities of
components used. For many electronic components, wear-out is not a practical failure
mode. The time that the product is in use is significantly shorter than the time it takes to
reach wear-out modes. That leaves infant mortality and normal life failure modes as the
causes of all significant failures. As we have already observed, after some time, failures
from infant mortality defects get spread out so much that they appear to be approximately
random in time. A combination of low level infant mortality failures and some random
failures caused by operational stresses (such as power line surges) can result in a product
failure distribution that is very close to the classical normal life period. This brings up the
question of a much-misunderstood term that applies during the normal life period, MTBF.

MTBF - What is it?


A common term used in specifying and marketing products is MTBF, which is a vastly
misunderstood (and often misused) term. MTBF historically stands for "Mean Time Between
Failures," and as such, applies only when the underlying distribution has a constant failure
rate (e.g. an exponential distribution). In this case, MTBF is the characteristic life parameter
of the exponential distribution, as we will see below. However, use of the term MTBF is
confused by the fact that a few reliability practitioners have used it to indicate "Mean
Time Before Failure," a case where the underlying distribution may be a wear-out mode.
Further, to some practitioners the word "between" implies a repairable product while
"before" implies a non-repairable product. To make matters worse, vendors of many
products use the term MTBF without defining what they mean, sometimes with no concept
of reliability issues. In fact, the author has actually seen MTBF explained as "Minimum Time
Before Failure," a completely non-statistical and nonsensical concept.
Mean Time Before Failure (often termed Mean Time To Failure, or MTTF) describes the
average time to failure of a product, even when failure rate is increasing over time (wear-
out mode). Some units will fail before the mean life, and some will last longer. Thus, a
product specified as having an MTTF of 50,000 hours implies that some units will actually
operate longer than 50,000 hours without failure. Note: I'll use MTTF rather than Mean Time
Before Failure for the remainder of this article. When I write MTBF, I mean "Mean Time
Between Failures," as applies to the exponential distribution.
In recent years, many vendors have started using terminology such as "service life" to
describe how long their products may last in use. This is a good trend. However, while
writing this article, the author found a current data sheet that indicated "service life" using
MTBF, where the MTBF values were in excess of 500,000 hours (this would be 57 years of
24-hour-per-day operation). The products specified would not operate, non-stop, for over
50 years; wear-out modes would kill off most of these products in ten years, at most. The
vendor was confusing the normal life failure rate, often expressed as an MTBF value, with
the wear-out distribution of the product.
How does MTBF describe failure rate? It is quite simple: when the exponential distribution
applies (constant failure rate modeled by the flat, bottom of the bathtub curve), MTBF is
equal to the inverse of failure rate. For example, a product with an MTBF of 3.5 million
hours, used 24 hours per day:

MTBF = 1 / failure rate


failure rate = 1 / MTBF = 1 / 3,500,000 hours
failure rate = 0.000000286 failures / hour
failure rate = 0.000286 failures / 1000 hours
failure rate = 0.0286% / 1000 hours - and since there are 8,760 hours in a year
failure rate = 0.25% / year
Note that 3.5 million hours is 400 years. Do we expect that any of these products will
actually operate for 400 years? No! Long before 400 years of use, a wear-out mode will
become dominant and the population of products will leave the normal life period of the
bathtub and start up the wear-out curve. But during the normal life period, the "constant"
failure rate will be 0.25% per year, which can also be expressed as an MTBF of 3.5 million
hours.
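The conversion above can be reproduced in a few lines. This is only a sketch of the arithmetic; the function name is illustrative, and the annual figure uses the same rate-times-hours approximation as the text.

```python
HOURS_PER_YEAR = 8760


def failure_rate_per_hour(mtbf_hours):
    """Constant failure rate implied by an MTBF (exponential distribution only)."""
    return 1.0 / mtbf_hours


mtbf = 3_500_000  # hours, the example value from the text
rate = failure_rate_per_hour(mtbf)

print(f"{rate:.3e} failures/hour")                    # 2.857e-07 failures/hour
print(f"{rate * 1000 * 100:.4f}% per 1000 hours")     # 0.0286% per 1000 hours
print(f"{rate * HOURS_PER_YEAR * 100:.2f}% per year") # 0.25% per year
print(f"{mtbf / HOURS_PER_YEAR:.0f} years")           # 400 years
```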

How does MTBF fit into the equation for the exponential distribution? MTBF is the scale
parameter (usually termed eta or η) that defines the specific model for an exponential
distribution. The equation for the cumulative distribution function of an exponential
distribution is given by:

$$F(t) = 1 - e^{-t/\eta}$$

where:

F(t) = probability of failure by time t
η = characteristic life = MTBF (time when 63.2% cumulative failures occur)
e = 2.71828..., base of natural logs
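The 63.2% figure follows from evaluating the exponential cumulative failure function, F(t) = 1 - exp(-t/MTBF), at t = MTBF, which gives 1 - 1/e. A minimal check, reusing the 3.5 million hour example from earlier in the text:

```python
import math


def exponential_cdf(t, mtbf):
    """Cumulative fraction failed by time t under a constant failure rate of 1/mtbf."""
    return 1.0 - math.exp(-t / mtbf)


MTBF = 3_500_000  # hours, example value from the text

print(f"{exponential_cdf(MTBF, MTBF):.1%}")       # 63.2%
print(f"{exponential_cdf(10 * 8760, MTBF):.2%}")  # fraction failed after ten years of use
```

The second line shows why the MTBF time itself is rarely reached in practice: after ten years of continuous use, only a small fraction of this population has failed from the constant-rate mode.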
Note that many products with very low failure rates during "normal life" will wear out in a
few years, so that the Mean Time Before Failure (or MTTF) may be much less than Mean
Time Between Failures. Let's look at this graphically.

Figure 2: Weibull Plot for Normal Life and Wear-Out Populations


Figure 2 above shows a Weibull probability plot. This plot shows the expected cumulative
failures for a product over time, with time shown on the x-axis and cumulative failure
percentages (labeled Unreliability) shown on the y-axis. This is one of the most common
ways to view failure distributions. The solid blue line is titled "MTBF = 20 million hours" and
represents the normal life period shown as a horizontal line on the bathtub curve. It is not
horizontal here because this plot shows cumulative failures whereas the bathtub curve
shows failure rate. The MTBF of an exponential distribution is equal to the time when 63.2%
of the population of units has failed. This level is shown on the plot as a dashed black line
labeled η (eta). In this example, the extension of the "MTBF = 20 million hours" line crosses
the 63.2% level at 20 million hours on the x-axis.
The green line, on the other hand, represents a wear-out distribution as depicted on the
right side of a bathtub curve. It is not a constant failure rate distribution but a failure rate
that increases with time. Note that it crosses the 50% cumulative level at about 500,000
hours. This is a wear-out distribution with an MTTF of 500,000 hours. Note that for betas
over 3, MTTF is close to the 50% cumulative failure time - Weibull++ can calculate the
actual mean life (MTTF) and median life (50% cumulative failure time) for any Weibull
distribution. When beta = 1 (or an exponential distribution is used), the mean life will be the
same as Mean Time Between Failures.
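The relationship between mean and median life can be checked with the standard Weibull formulas, mean = eta * Gamma(1 + 1/beta) and median = eta * (ln 2)^(1/beta). The sketch below uses Python's standard library rather than Weibull++, and the beta values are arbitrary examples.

```python
import math


def weibull_mean(beta, eta):
    """Mean life (MTTF) of a two-parameter Weibull."""
    return eta * math.gamma(1.0 + 1.0 / beta)


def weibull_median(beta, eta):
    """Time at which 50% of the population has failed."""
    return eta * math.log(2.0) ** (1.0 / beta)


ETA = 1.0  # the ratio of mean to median does not depend on eta
for beta in (1.0, 2.0, 3.0, 5.0):
    ratio = weibull_mean(beta, ETA) / weibull_median(beta, ETA)
    print(f"beta={beta}: mean/median = {ratio:.3f}")
```

For beta values of 3 and above, the ratio is within about 2% of 1, while for beta = 1 the mean is about 44% larger than the median and equals eta.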
Both of these distributions (blue and green lines) apply to the same population of devices.
These devices fail primarily according to the constant failure rate model (solid blue MTBF
line) until the blue line intercepts the green line. This is when wear-out begins to have a
significant effect (a little over 100,000 hours in this example). By 500,000 hours, half of the
units will have failed and by 900,000 hours, 99% of the units will have failed. None of them
will ever reach the 20 million hour MTBF time because the wear-out mode dominates after
about 100,000 hours of operation. Note that the true overall cumulative failures will be the
sum of the two distributions shown on this plot. However, because the y-axis is a log-log
scale, the sum of the two distributions is very close to the two straight lines except around
the area where they intercept.
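The figures quoted for this example are enough to reconstruct an approximate model. The sketch below fits a Weibull through the two wear-out points given in the text (50% at 500,000 hours and 99% at 900,000 hours) and combines it with the 20 million hour MTBF line. It assumes the two failure modes act independently, so that survival probabilities multiply. That is one common way to combine competing modes and is very close to a simple sum when both fractions are small. The fitted beta and eta are derived here and are not values stated in the article.

```python
import math


def weibull_from_two_points(t1, f1, t2, f2):
    """Solve for (beta, eta) of a Weibull through two (time, fraction failed) points."""
    y1 = math.log(-math.log(1.0 - f1))
    y2 = math.log(-math.log(1.0 - f2))
    beta = (y2 - y1) / (math.log(t2) - math.log(t1))
    eta = t1 / (-math.log(1.0 - f1)) ** (1.0 / beta)
    return beta, eta


def combined_cdf(t, mtbf, beta, eta):
    """Fraction failed when a constant-rate mode and a wear-out mode compete."""
    survive_random = math.exp(-t / mtbf)
    survive_wearout = math.exp(-((t / eta) ** beta))
    return 1.0 - survive_random * survive_wearout


MTBF = 20e6  # hours, from the figure
beta, eta = weibull_from_two_points(500_000, 0.50, 900_000, 0.99)

# Time at which both modes have contributed equally (equal cumulative hazards)
crossover = (eta ** beta / MTBF) ** (1.0 / (beta - 1.0))

print(f"beta ~ {beta:.2f}, eta ~ {eta:,.0f} h, crossover ~ {crossover:,.0f} h")
for t in (10_000, 100_000, 500_000, 900_000):
    print(f"{t:>7,} h: {combined_cdf(t, MTBF, beta, eta):.3%} failed")
```

Under these assumptions the crossover lands a little above 100,000 hours, in line with the description of the plot.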

MTBF Summary
As we have seen here, it is logical that a device can have an MTBF that is much greater than
its wear-out time because MTBF is only a projection of the normal lifetime failures to a
cumulative level of 63.2%. Most, if not all, devices will have failed due to wear-out modes
long before the MTBF time.
A major problem for many people with the term Mean Time Between Failures is that it is
expressed as "time" when it is really used to indicate failure rate during the normal life
period. To further confuse the issue, some people use the term MTBF to indicate Mean Time
Before Failure, a case when it applies to wear-out modes and really does relate to service
life. And, as I noted above, some people don't know what they are talking about and claim
service life is equal to "Mean Time Between Failures"!
The good news is that recent data sheets from some vendors (particularly those making
electronic assemblies) show MTBF under the heading of "reliability" and a separate value for
service life. In this case, the vendor has described both the expected failure rate during the
normal life period of the bathtub (e.g. "MTBF = 3.5 million hours") and the point in time at
which the product is expected to start up the wear-out part of the bathtub curve
(e.g. "service life greater than 8 years"). The bad news is that some vendors, even major
technology firms, still don't understand reliability concepts, at least as expressed in their
data sheets.
When you are specifying components for a product and want to understand how long it
might operate and what the failure rate might be during normal life, be sure to find out
what the vendor means by "MTBF." And if the vendor thinks it means "Minimum Time Before Failure"
calculated from MIL-HDBK-217, be careful!

Everything Eventually Wears Out


In the long run, everything wears out. For many electronic designs, wear-out will occur after
a long, reasonable use-life. Inexpensive electronic watches, radios, televisions and other
such products usually last for years, and people are not too upset if they finally fail. There
are usually newer products with better features that they want to buy after a few years.
For many mechanical assemblies, the wear-out time will be less than the desired operational
life of the whole product and replacement of failed assemblies can be used to extend the
operational life of the product. With some items, wear-out is expected and replacement is a
normal routine. For example, inkjet cartridges run out of ink after so much ink has been
squirted. This is not normally thought of as a failure. However, if a newly replaced cartridge
runs out of ink after a short period of use, then we do consider it a failure. On the other
hand, there are mechanical and electro-mechanical devices that only last for months or
years of use in a product expected to last for decades. Relays, generators, switching
devices, engine parts and hydraulic components in aircraft are replaced on a periodic basis,
usually before they fail, to enable the aircraft to fly for many years of safe operation. Tires
and brake components are replaced several times over the period of time that the
automobile is in use.

The wear-out period does not occur at one time for all components. The shortest-lived
component will determine the location of the wear-out time in a given product. In designing
a product, the engineer must assure that the shortest-lived component lasts long enough to
provide a useful service life. If the component is easily replaced, such as tires, replacement
may be expected and will not degrade the perception of the product's reliability. If the
component is not easily replaced and not expected to fail, failure will cause customer
dissatisfaction.

In order to assess wear-out time of a component, long-term testing may be required. In
some cases, a 100% duty cycle (running tires in a road wear simulator 24 hours a day) may
provide useful lifetime testing in months. In other cases, actual product use may be 24
hours a day and there is no way to accelerate the duty cycle. High level physical stresses
may need to be applied to shorten the test time. This is an emerging technique of reliability
assessment termed QALT (Quantitative Accelerated Life Testing) that requires consideration
of the physics and engineering of the materials being tested.

Properly applied, QALT can provide useful information from tests much shorter in length
than the expected operating time of a design. However, much care must be taken to assure
that all possible failure modes have been investigated. Running a quantitative accelerated
life test without considering all possible failure modes and their accelerating stress types
may miss a significant failure mode and invalidate the conclusions. As appropriate,
mechanics, electronics, physics and chemistry must all be considered when designing a
QALT.

Note that "MTBF" testing, using many units in parallel to shorten test times, is a popular
method of life testing. It does not apply to testing for wear-out! It can apply to testing for
normal life failures, but the results of such testing should never be extrapolated to times
longer than were used for the test itself.

Conclusion
As demonstrated in Parts One and Two of this article, the traditional bathtub curve is a
reasonable, qualitative illustration of the key kinds of failure modes that can affect a
product. Quantitative models such as the Weibull distribution can be used to assess actual
designs and determine if observed failures are decreasing, constant or increasing over time
so that appropriate actions can be taken. The exponential distribution and the related Mean
Time Between Failures (MTBF) metric are appropriate for analyzing data for a product in the
"normal life" period, which is characterized by a constant failure rate. But be careful - many
people have "imposed" a constant failure rate model on products that should be
characterized by increasing or decreasing failure rates, just because the exponential
distribution is an easy model to use.
Do not assume that a product will exhibit a constant failure rate. Use life testing and/or field
data and life distribution analysis to determine how your product behaves over its expected
lifetime. In addition to traditional life data analysis models (such as the Weibull distribution),
quantitative accelerated life testing (QALT) may be a valuable technique to better
understand failure distributions of highly reliable products with reduced testing time, in a
cost-effective manner. Without a QALT approach to testing, there is no way to accurately
assess the long-term reliability of a product in a short time. If you need to understand the
reliability of a device for a one-year use-life, a non-accelerated test of 12 units for one
month will not do it. It will only provide information on one month of use. Projection to one
year will be invalid if a wear-out mode occurs, for example, in six months. The only way to
find a wear-out mode is to test long enough to observe it, with or without a QALT approach.
When dealing with vendors and their claims of reliability for components you wish to use, be
sure you understand how they determined these figures and how well they understand the
consequences of the bathtub curve.
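
The one-month test example can be made concrete with a hypothetical wear-out mode. The sketch below assumes a two-parameter Weibull wear-out distribution with a shape of 5 and a characteristic life of six months. These values are invented for illustration and are not taken from the article:

```python
import math

def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull unreliability: F(t) = 1 - exp(-(t/eta)**beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

BETA, ETA_MONTHS = 5.0, 6.0   # hypothetical wear-out mode
UNITS = 12                    # units on test, as in the example above

p_fail_one_month = weibull_cdf(1.0, BETA, ETA_MONTHS)
p_no_failures_in_test = (1.0 - p_fail_one_month) ** UNITS
p_fail_one_year = weibull_cdf(12.0, BETA, ETA_MONTHS)

print(f"P(no failures in the 12-unit, one-month test): {p_no_failures_in_test:.4f}")
print(f"Fraction failed by 12 months:                  {p_fail_one_year:.4f}")
```

With these assumed parameters the one-month test would almost certainly show no failures at all, yet essentially every unit would wear out within the year.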

Characteristics of the Weibull Distribution


Software Used

Weibull++
In the last several issues of Reliability HotWire, we looked at how distributions are defined
and how common reliability metrics are derived. In this issue, we will take a closer look at a
specific distribution that is widely used in life data analysis - the Weibull distribution. Named
for its inventor, Waloddi Weibull, this distribution is widely used in reliability engineering and
elsewhere due to its versatility and relative simplicity.

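As was discussed in February's Reliability Basics, a distribution is mathematically defined by
its pdf equation. The most general expression of the Weibull pdf is given by the
three-parameter Weibull distribution expression, or:

$$f(T) = \frac{\beta}{\eta}\left(\frac{T-\gamma}{\eta}\right)^{\beta-1} e^{-\left(\frac{T-\gamma}{\eta}\right)^{\beta}}$$

Where:

$$f(T) \geq 0, \quad T \geq 0 \text{ or } \gamma, \quad \beta > 0, \quad \eta > 0, \quad -\infty < \gamma < \infty$$

and:

β is the shape parameter, also known as the Weibull slope
η is the scale parameter
γ is the location parameter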

Frequently, the location parameter is not used, and the value for this parameter can be set
to zero. When this is the case, the pdf equation reduces to that of the two-parameter
Weibull distribution. There is also a form of the Weibull distribution known as the one-
parameter Weibull distribution. This in fact takes the same form as the two-parameter
Weibull pdf, the only difference being that the value of β is assumed to be known
beforehand. This assumption means that only the scale parameter needs to be estimated,
allowing for analysis of small data sets. It is recommended that the analyst have a very
good and justifiable estimate for β before using the one-parameter Weibull distribution for
analysis.
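
As an illustration, the Weibull pdf can be written as a small function. This is a minimal sketch using the conventional names beta (shape), eta (scale) and gamma (location). The function names and sample values are assumptions, not part of the original article:

```python
import math

def weibull_pdf(t, beta, eta, gamma=0.0):
    """Three-parameter Weibull pdf; leaving gamma at 0 gives the two-parameter form.

    In this sketch the density is taken as zero at or before gamma
    (for beta <= 1 the true value exactly at t = gamma is nonzero or unbounded).
    """
    if t <= gamma:
        return 0.0
    z = (t - gamma) / eta
    return (beta / eta) * z ** (beta - 1) * math.exp(-(z ** beta))

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule numerical integration."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# The area under a pdf should be one (checked here for beta=2, eta=100).
area = integrate(lambda t: weibull_pdf(t, beta=2.0, eta=100.0), 0.0, 1000.0)
print(f"area under pdf: {area:.4f}")

# A nonzero location parameter only shifts the curve along the time axis.
shifted = weibull_pdf(150.0, beta=2.0, eta=100.0, gamma=50.0)
unshifted = weibull_pdf(100.0, beta=2.0, eta=100.0)
```

For a one-parameter analysis, beta would simply be fixed to the assumed value and only eta estimated from the data.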

As was mentioned previously, the Weibull distribution is widely used in reliability and life
data analysis due to its versatility. Depending on the values of the parameters, the Weibull
distribution can be used to model a variety of life behaviors. An important aspect of the
Weibull distribution is how the values of the shape parameter, β, and the scale parameter, η,
affect such distribution characteristics as the shape of the pdf curve, the reliability and the
failure rate.

Weibull Shape Parameter, β

The Weibull shape parameter, β, is also known as the Weibull slope. This is because the
value of β is equal to the slope of the line in a probability plot. Different values of the shape
parameter can have marked effects on the behavior of the distribution. In fact, some values
of the shape parameter will cause the distribution equations to reduce to those of other
distributions. For example, when β = 1, the pdf of the three-parameter Weibull reduces to
that of the two-parameter exponential distribution. The parameter β is a pure number (i.e.,
it is dimensionless).

The following figure shows the effect of different values of the shape parameter, β, on the
shape of the pdf (while keeping the other parameters constant). One can see that the shape
of the pdf can take on a variety of forms based on the value of β.

Looking at the same information on a Weibull probability plot, one can easily understand
why the Weibull shape parameter is also known as the slope. The following plot shows how
the slope of the Weibull probability plot changes with β. Note that the models represented
by the three lines all have the same value of η.

Another characteristic of the distribution where the value of β has a distinct effect is the
failure rate. The following plot shows the effect of the value of β on the Weibull failure rate.
This is one of the most important aspects of the effect of β on the Weibull distribution. As is
indicated by the plot, Weibull distributions with β < 1 have a failure rate that decreases with
time, also known as infantile or early-life failures. Weibull distributions with β close to or
equal to 1 have a fairly constant failure rate, indicative of useful life or random failures.
Weibull distributions with β > 1 have a failure rate that increases with time, also known as
wear-out failures. These comprise the three sections of the classic "bathtub curve." A mixed
Weibull distribution with one subpopulation with β < 1, one subpopulation with β = 1 and
one subpopulation with β > 1 would have a failure rate plot that was identical to the bathtub
curve. An example of a bathtub curve is shown in the following chart.
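
The three failure-rate regimes can be checked numerically. The sketch below uses the two-parameter Weibull failure rate, (β/η)(t/η)^(β-1), with beta as the shape parameter and eta as the scale parameter. The sample times, the eta value and the function names are arbitrary choices for illustration:

```python
def weibull_failure_rate(t, beta, eta):
    """Two-parameter Weibull failure rate (hazard) at time t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def failure_rate_trend(beta, eta=100.0, times=(10.0, 50.0, 200.0)):
    """Classify the failure rate as decreasing, constant or increasing over the sample times."""
    rates = [weibull_failure_rate(t, beta, eta) for t in times]
    if all(abs(r - rates[0]) < 1e-12 for r in rates):
        return "constant"
    return "increasing" if rates[-1] > rates[0] else "decreasing"

for beta in (0.5, 1.0, 3.0):
    print(f"beta = {beta}: {failure_rate_trend(beta)} failure rate")
```

A shape below 1 gives the infant mortality behavior, a shape of 1 the flat bottom of the bathtub, and a shape above 1 the wear-out rise.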

Weibull Scale Parameter, η

A change in the scale parameter, η, has the same effect on the distribution as a change of
the abscissa scale. Increasing the value of η while holding β constant has the effect of
stretching out the pdf. Since the area under a pdf curve is a constant value of one, the
"peak" of the pdf curve will also decrease with the increase of η, as indicated in the following
figure.

If η is increased, while β and γ are kept the same, the distribution gets stretched out
to the right and its height decreases, while maintaining its shape and location.
If η is decreased, while β and γ are kept the same, the distribution gets pushed in
towards the left (i.e., towards its beginning or towards 0 or γ), and its height increases.
η has the same unit as T, such as hours, miles, cycles, actuations, etc.
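
The stretching effect of the scale parameter can be seen by locating the peak of the pdf for several scale values. This sketch assumes a two-parameter Weibull with shape beta > 1, for which the peak (mode) lies at eta * ((beta - 1) / beta) ** (1 / beta). The function names and the chosen values are illustrative only:

```python
import math

def weibull_pdf(t, beta, eta):
    """Two-parameter Weibull pdf for t > 0."""
    z = t / eta
    return (beta / eta) * z ** (beta - 1) * math.exp(-(z ** beta))

def pdf_peak(beta, eta):
    """Location and height of the pdf peak (valid for beta > 1)."""
    mode = eta * ((beta - 1) / beta) ** (1 / beta)
    return mode, weibull_pdf(mode, beta, eta)

peaks = {eta: pdf_peak(3.0, eta) for eta in (50.0, 100.0, 200.0)}
for eta, (mode, height) in peaks.items():
    print(f"eta = {eta:5.0f}: peak at t = {mode:6.1f}, height = {height:.5f}")
```

Doubling the scale parameter moves the peak twice as far to the right and halves its height, so the area under the curve stays at one.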

Weibull Reliability Metrics

As was mentioned in last month's Reliability Basics, the pdf can be used to derive
commonly-used reliability metrics such as the reliability function, failure rate, mean and
median. The equations for these functions of the Weibull distribution will be presented in the
following section, without derivations for the sake of brevity and simplicity. Note that in the
rest of this section we will assume the most general form of the Weibull distribution, the
three-parameter form. The appropriate substitutions to obtain the other forms, such as the
two-parameter form where γ = 0, or the one-parameter form where β is a constant, can
easily be made.

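The Weibull reliability function is given by:

$$R(T) = e^{-\left(\frac{T-\gamma}{\eta}\right)^{\beta}}$$

The Weibull failure rate function is given by:

$$\lambda(T) = \frac{f(T)}{R(T)} = \frac{\beta}{\eta}\left(\frac{T-\gamma}{\eta}\right)^{\beta-1}$$

The Weibull mean life, or MTTF, is given by:

$$\overline{T} = \gamma + \eta \cdot \Gamma\left(\frac{1}{\beta} + 1\right)$$

where Γ(*) is the gamma function. The gamma function is defined as:

$$\Gamma(n) = \int_{0}^{\infty} e^{-x} x^{n-1}\, dx$$

The equation for the median life, or B50 life, for the Weibull distribution is given by:

$$\breve{T} = \gamma + \eta \left(\ln 2\right)^{1/\beta}$$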
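
The four metrics can be sketched in code using the standard three-parameter Weibull forms, with beta as shape, eta as scale and gamma as location. The function names and the sample parameter values are assumptions made for illustration:

```python
import math

def reliability(t, beta, eta, gamma=0.0):
    """R(t) = exp(-((t - gamma) / eta) ** beta), for t >= gamma."""
    return math.exp(-(((t - gamma) / eta) ** beta))

def failure_rate(t, beta, eta, gamma=0.0):
    """lambda(t) = (beta / eta) * ((t - gamma) / eta) ** (beta - 1), for t > gamma."""
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)

def mean_life(beta, eta, gamma=0.0):
    """MTTF = gamma + eta * Gamma(1 / beta + 1)."""
    return gamma + eta * math.gamma(1.0 / beta + 1.0)

def median_life(beta, eta, gamma=0.0):
    """B50 life = gamma + eta * (ln 2) ** (1 / beta)."""
    return gamma + eta * math.log(2.0) ** (1.0 / beta)

# With beta = 1 the Weibull reduces to the exponential: MTTF equals eta.
mttf_exponential = mean_life(beta=1.0, eta=1000.0)

# By definition, half of the population survives to the median life.
b50 = median_life(beta=2.5, eta=500.0, gamma=20.0)
r_at_b50 = reliability(b50, beta=2.5, eta=500.0, gamma=20.0)

print(f"MTTF (beta=1, eta=1000): {mttf_exponential:.1f}")
print(f"B50 life: {b50:.1f}, reliability at B50: {r_at_b50:.3f}")
```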

What is a Weibull distribution?


The Weibull distribution is one of the most widely used lifetime distributions in
reliability engineering. It is a versatile distribution that can take on the characteristics of
other types of distributions, based on the value of the shape parameter, β.

Why do we use Weibull distribution?


The Weibull shape parameter, β, is also known as the Weibull slope. This is because
the value of β is equal to the slope of the line in a probability plot. Different values of the
shape parameter can have marked effects on the behavior of the distribution.
How do you calculate MTBF?
Mean time between failures (MTBF) is the predicted elapsed time between inherent
failures of a system during operation. MTBF can be calculated as the arithmetic mean
(average) time between failures of a system. The term is used in both plant and
equipment maintenance contexts.
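
A minimal sketch of the arithmetic-mean calculation, assuming the failure log records cumulative operating hours at each failure and that repair time is excluded. The data and the function name are hypothetical:

```python
def mtbf_from_failure_times(failure_hours):
    """Arithmetic mean of the gaps between consecutive failures, in operating hours."""
    times = sorted(failure_hours)
    if len(times) < 2:
        raise ValueError("need at least two failures to form an interval")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical log: cumulative operating hours at which each failure occurred
failures = [120.0, 410.0, 980.0, 1300.0]
mtbf = mtbf_from_failure_times(failures)  # gaps are 290, 570 and 320 hours
print(f"MTBF = {mtbf:.1f} operating hours")
```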

What is failure rate?


Failure rate is the frequency with which an engineered system or component fails,
expressed in failures per unit of time. It is often denoted by the Greek letter λ (lambda)
and is widely used in reliability engineering.

What is the bathtub curve?


The bathtub curve is widely used in reliability engineering. It describes a particular
form of the hazard function which comprises three parts: The first part is a decreasing
failure rate, known as early failures. The second part is a constant failure rate, known as
random failures. The third part is an increasing failure rate, known as wear-out failures. The name
is derived from the cross-sectional shape of a bathtub: steep sides and a flat bottom.

What is meant by mean time to failure?


In reliability analysis, MTTF is the average time that an item will function before it fails.
It is the mean lifetime of the item. With censored data, the arithmetic average of the
data does not provide a good measure of the center because at least some of
the failure times are unknown.

What is reliability of equipment?


Simply put, availability is a measure of the % of time the equipment is in an operable
state while reliability is a measure of how long the item performs its intended function.
Reliability is a measure of the probability that an item will perform its intended function
for a specified interval under stated conditions.

What is reliability of a product?


Product Reliability is defined as the probability that a device will perform its required
function, subjected to stated conditions, for a specific period of time. Product
Reliability is quantified as MTBF (Mean Time Between Failures) for
repairable products and MTTF (Mean Time To Failure) for non-repairable products.

What is meant by maintenance management?


Maintenance management is the process of overseeing maintenance resources so
that the organization does not experience downtime from broken equipment or waste
money on inefficient maintenance procedures. Maintenance management software
programs can assist with the process. The primary objectives of maintenance management are to
schedule work efficiently, control costs and ensure regulatory compliance.

How to use Condition Based Maintenance Strategy for Equipment Failure Prevention
Most equipment failures have no relationship to length of time in-service. Most
failures are unpredictable. But if you detect a future failure early, you can plan and
do the repair cost effectively before it becomes a breakdown.

Abstract: Condition Based Maintenance Strategy. With only about 15% to 20% of
your equipment failures being age related, and the other 80% to 85% being totally
time-random events, how can you improve the uptime of your plant and facility?
This article explains how to detect the random failures that make up the vast
majority of maintenance expense and production downtime by using simple, low
cost condition monitoring methods.

Keywords: equipment condition monitoring,
random equipment failure, equipment failure patterns

Equipment Failure Probability Curves Showing The Six Time Related Patterns of Failure

With the introduction in the 1960s of the first Boeing jumbo jets, questions
were raised about the sense of continuing with maintenance requirements
based on the traditional bath-tub curve maintenance paradigm present at
the time.

Investigations were conducted of past aircraft maintenance history. It was
found that all failures fitted one of six conditional probability (or likelihood of
occurrence) failure curves. The US Navy conducted similar investigations
and confirmed the findings of the airline industry. The six failure patterns
identified are shown in Figure 1 below. The traditional bathtub paradigm
(Pattern A) explained only 3% to 4% of failures.

(NOTE: There is a large question as to the accuracy of the Nolan and Heap
1978 Reliability Centered Maintenance report that first published the failure
curves. The analysis mixed up overhauled equipment with new equipment,
which are not comparable, and mixed up maintained assemblies with
unmaintained parts, which also are not comparable. Certainly Pattern F is
wrongly showing maximum failures at start-up. In reality most equipment will
go when the start button is pressed or the key turned. At zero time Pattern F
should begin much closer to the bottom left-hand corner of the plot and show
a rapid rise in failures, signifying that most operating equipment starts but
soon after start-up the failure rate rises from poor quality control during the
life cycle to that point in time. It then follows that all the other failure
patterns showing the curve starting high up the y-axis at time zero are also
suspicious as to their accuracy.)

Whenever these results have been tested by other parties, their findings
have confirmed the presence of the curves found in the original
investigations, though the proportions varied because equipment was in
different situations and used in different ways. What seems clear is that with
the equipment technologies available in the early 21st century, equipment
failures fit one of the six failure curves in Figure 1, but the proportions vary
depending on engineering decisions and operational circumstances.

Here was definite proof that most failures were not age-related, where the
equipment failed because of length of use. It meant that time-based
preventative maintenance was pointless in most cases. Age-related use
includes stress fatigue failures (e.g. shafts breaking, springs breaking, boiler
tubes leaking), erosion/corrosion failures (e.g. material erosion, metal
corrosion), wear-out failures (e.g. car tyre tread wear, packed gland leaks)
and other such failures where the length of operating time contributes to the
eventual failure.

Non-time related failures were unpredictable! Time in service had no
influence on 77% to 89% of the failures. This is not the same as saying that
there was no reason for the failure. There will definitely be reasons for a
failure, but you cannot predict when there will be a failure based only on the
age of the equipment. For the vast majority of equipment you need to base
maintenance on non-age related factors.

Most equipment assemblies and components eventually settle into a long
period of chance failure. About 15% to 20% of maintenance will repeat based
on age-related factors. You will see these in work requests for the same
repair again and again over a period of years.

You find time-related failures by looking at your work orders on each item of
equipment for as far back as you can and creating a Pareto Chart of its
failure history. You can also get good answers by asking your long-serving
maintainers and operators what keeps failing on each piece of equipment.
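
The Pareto analysis described above can be sketched with a simple tally. The work-order history below is invented for illustration, and the function name is hypothetical:

```python
from collections import Counter

def pareto_table(work_orders):
    """Return (failure, count, cumulative percent) rows, most frequent failure first."""
    counts = Counter(work_orders)
    total = sum(counts.values())
    rows, running = [], 0
    for failure, count in counts.most_common():
        running += count
        rows.append((failure, count, 100.0 * running / total))
    return rows

# Hypothetical repair history taken from the work orders for one pump
history = ["seal leak", "bearing", "seal leak", "coupling", "seal leak",
           "bearing", "seal leak", "impeller", "bearing", "seal leak"]

rows = pareto_table(history)
for failure, count, cumulative in rows:
    print(f"{failure:<10} {count:>3} {cumulative:6.1f}%")
```

Failures that recur at the top of such a table over a period of years are the candidates for age-related causes worth investigating first.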

About 80% to 85% of your repair work orders will happen randomly. You
cannot predict a date when they occur. But you can detect that they have
started. It is possible to use the changed condition of the equipment to tell
when a failure is due.

Equipment Condition Monitoring

Starting from new, a part properly built and installed into equipment without
any errors, will operate at a particular level of performance, which ideally is
at its design requirement. As its operating life progresses degradation
occurs. Please do not assume degradation is normal and nothing can be
done about it. This is not the case. In fact equipment failure should never
happen! The acceptance of equipment failure as normal is an expensive
misunderstanding to make.

Regardless of the reasons for degradation, the item can no longer meet its
original service requirements and its level of performance falls. By detecting
the loss-in-condition of the item you have advanced warning that
degradation has started. If you can detect this change in performance level
you have a means to forecast a coming failure.
Figure 2 below represents the typical degradation process experienced by
equipment. Following a period of normal operation, where the item has been
running smoothly, a change occurs that affects its performance. The earliest
time where you can detect the degradation is called the P-point, or potential
failure point, since after this time the item can potentially fail at any time.
This change gradually, or rapidly in some cases, worsens to the point that
the equipment cannot reliably and safely deliver its service duty. At this
stage the item has functionally failed, i.e. it is not delivering its required
performance, and the location is called the functional failure point or F-point.
If it continues in operation the part will fail and the equipment will stop
working.

By using the tell-tale evidence of changing equipment performance due to
degradation, you can detect a failure and act to address it before an
unplanned production disruption occurs. How soon you can detect the
change in performance and spot the P-point depends on the condition
monitoring technology that you use.

There are many ways to identify a change in equipment condition. Some
commonly used ones are changes in vibration, changes in power usage,
changes in operating performance, changes in temperatures, changes in
noise levels, changes in chemical composition, increase in debris content
and changes in volume of material. You can be as creative as you want in
developing ways to warn you of future problems.

The most important issue is to spot the tell-tale signs early so that you have
time to plan and prepare an organised and least cost correction. Once the
equipment is broken you will have to spend whatever time, money and
resources it takes to get it back in operation fast.

This explains why the leading companies have created a condition
monitoring technician position in their organisation and, like the Oiler and
Greaser long used to lubricate equipment and stop bearing failures, they get
the condition monitoring man out amongst the equipment looking for tell-tale
signs of coming failure. Such a person will save you a great deal of lost
production and frustration.

It is not necessary to spend vast amounts of money on oil analysis programs,
thermography cameras, state of the art vibration analysis equipment,
ultrasonic listening devices and the like. It is wise to use such technologies
selectively when accuracy of results is critical or you need a long warning
time to plan and prepare rectification for an item. But you can do a great
deal of condition monitoring of mechanical equipment accurately enough
yourself with a laser gun to tell temperature, an automotive stethoscope to
hear noise, a low-cost bearing vibration detector to note change in bearing
vibration, laboratory filter paper to separate debris in oil and a magnifying
glass and magnet to check if the debris has iron (ferrous), plus using your
own five senses.

When you need expert help for more accurate results, or a measured opinion
on the implications to continued operation, or the equipment is particularly
critical to your business and you do not have the necessary expertise and
skills in-house, sub-contract those specialities in at the time.

Condition Based Maintenance Strategy

With around 80% of equipment failures being totally unpredictable based on
the equipment's age alone, you must have an on-condition maintenance
strategy to deal with them.

The roughly 20% of failures that are usage or time-based and repetitive are
addressed by doing preventative maintenance and scheduled replacement maintenance
well before failure can occur. But non-age related failures cannot be
addressed by timed renewal-based maintenance strategies; they require
different solutions.

If you apply scheduled renewal based maintenance strategies to non-age
related condition failures you will change-out items that still have a long life
ahead of them and so waste your money, time and effort unnecessarily.

With time-random failures the simplest (but not the only) management
strategy to use is to inspect your equipment and look for evidence of
degraded conditions. You can use a continuous means of monitoring
condition by using your process computers to trend an equipment's
performance graphically (e.g. power use versus throughput, flow versus
pressure, etc), or you can introduce periodic inspections of equipment
condition through observation and data measurement (e.g. lubrication
sampling, temperature measurement, etc).

If condition monitoring is based on timed inspections, you must set the time
periods at a frequency that will let you spot the change well before the
impending failure. Figure 3 shows a frequency inspection period that will
detect the degrading performance well before the failure. The rule of thumb
is to set the condition inspection interval to one quarter the width of the P-F
period. This gives you three inspection points before failure. The second
point lets you confirm that the first point is correct and not a random
error. Once you are certain the item is degrading you plan for its replacement
or repair.
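
The quarter-of-P-F rule of thumb can be sketched as follows. The day counts are hypothetical, and the helper names are illustrative rather than part of any maintenance system:

```python
def inspection_interval(pf_interval):
    """Rule of thumb from the text: inspect every quarter of the P-F period."""
    return pf_interval / 4.0

def inspections_between(p_time, f_time, interval, first_inspection=0.0):
    """Scheduled inspection times that fall strictly after P and strictly before F.

    Assumes the inspection schedule started at or before the P-point.
    """
    next_index = int((p_time - first_inspection) // interval) + 1
    t = first_inspection + next_index * interval
    hits = []
    while t < f_time:
        hits.append(t)
        t += interval
    return hits

# Hypothetical item: degradation becomes detectable on day 100 (P-point)
# and functional failure follows 56 days later (F-point).
interval = inspection_interval(56)                     # 14 days
hits = inspections_between(100, 156, interval)
worst_case = inspections_between(98, 154, interval)    # P-point lands on an inspection day

print(f"inspect every {interval:.0f} days; inspections inside the P-F period: {hits}")
print(f"worst case number of inspections: {len(worst_case)}")
```

Even in the worst case, where the P-point coincides with an inspection day, three inspections still fall inside the P-F period, which matches the rule of thumb.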
Having discovered the start of a failure you can prepare for its repair, or put
into place strategies and make changes in its use, to extend the time to
failure.

But doing condition based maintenance will only marginally reduce your
maintenance costs. The main thing condition monitoring does for you is to
tell you that you have a problem in time to deal with it in a low cost way. It
does not stop the problem!

There is one more step that you must now do to drastically reduce your
maintenance costs. You must remove the failure mode. Having discovered a
cause of failure through condition monitoring, you must now get rid of it, or
else it can randomly occur at any time in the future after it is repaired. Only
by removing failure modes will you significantly reduce your future
maintenance. In addition to having a way to find the potential failure point
you also need a business process that finds and removes the cause of
random failures. Random failures are stress induced by specific acts and
events. Unless you remove the cause of the incident it will happen again and
again, costing you money, time and effort at every occurrence. When you
introduce predictive maintenance into a company you also need to introduce
root cause failure removal as part of the maintenance reduction strategy.


Unfortunately there are some serious traps in using condition based
maintenance that will affect your maintenance delivery and performance.
Using condition maintenance can cause you to become a reactive
organisation if you decide to rectify problems only after detecting impending
failure. Many times it is economically more sensible simply to use time based
maintenance and not care about condition because it is the more cost
effective option in the circumstances. You can read more about the dangers
of misunderstood use of condition monitoring in the PDF articles "Don't Waste
Your Time and Money with Condition Monitoring" and "A Common
Misunderstanding about Reliability Centered Maintenance".

My best regards to you,

Mike Sondalini
Managing Director
Lifetime Reliability Solutions HQ

A STUDY ON RELIABILITY, VALIDATION OF BATH TUB CURVE AND
CONCEPT OF MADHAB'S HAT CURVE OF RELIABILITY

Madhab Chandra Jena

Om Krishna Arts & Science Research Association

IshanPur, Jajpur, Odisha, INDIA.

ABSTRACT

Zero defect is one of the basic demands of the current generation.
Everybody wants a defect-free and reliable product. On the other hand, it is
a big challenge for manufacturers to fulfill the demands of their customers.
But to survive in this competitive era of globalization, an industry has to
excel in all aspects to make a quality product in a cost-effective
manner. Maintaining high reliability and minimal failures of
equipment is a good example of global manufacturing practice. Equally, it is
the biggest challenge for engineers and managers to maintain a culture of
"zero defect" and "100% reliability" of the equipment in any industry. The
first step to achieve the target is to understand the basic concept of
reliability, then to analyze and find a way. Reliability engineers are
working day and night to improve the way forward. This study is a small step
towards the big mission of "Zero Defect", the way the world is moving. In
this thesis, a set of industrial fans used in a particular cement plant has been
considered for the reliability study. Industrial process fans used
in different industries such as cement plants, steel plants, sponge iron plants,
refractory plants, etc. are considered important equipment. Any failure or
downtime associated with the process fans leads to plant stoppage and
production loss. In addition, the maintenance cost also increases. So it
is important to analyze the situation and take action to prevent
unwanted equipment failures and in turn improve reliability. In the
first phase of the study, the validity of the Bath Tub Curve has been checked with
the help of failure data analysis. In the second phase of the study, the
concept of a Hat Curve of Reliability has been established with the help of
MINITAB software, using the equipment failure data and reliability analysis.
This would be known as "Madhab's Hat Curve of Reliability". In the third
phase of the study, various methods of reliability improvement have been
discussed briefly.

1. INTRODUCTION

1.1 Reliability:

The reputation of a company is very closely related to the quality of its
products. Reliability is a major factor in determining quality. The best
quality with high reliability is a must for any product in this competitive age. On
the other hand, it is also important for maintenance engineers in any industry
to keep equipment availability at the highest level to achieve
uninterrupted manufacturing with high OEE (Overall Equipment
Effectiveness). To achieve high equipment availability, the number of
failures should be as low as possible.

In other terms, MTBF (Mean Time Between Failures) should be as high as
possible, which in turn yields higher reliability.

The overall manufacturing cost of a product, including product liability cost, is
also highly dependent on product reliability. If any equipment fails
intermittently and does not perform as per the warranty terms and conditions, the
repair or rework cost is a burden for the manufacturer. It
increases the overall cost due to poor quality, which is also known as the Cost of
Poor Quality (COPQ). It also spoils the brand image if it happens frequently with many
customers. Reliability analysis is an important tool that helps in
taking corrective actions. It is like a customer satisfaction study
with more concentration on reliability. The cost of failure or poor quality
must be taken into consideration along with the manufacturing cost to find the
overall manufacturing cost of the product. Then a calculated call must be
taken to upgrade the design and manufacturing method to a higher level.

Definition: A number of definitions of reliability already exist
in different books and journals, devised by great engineers and scientists. In
simplest terms, reliability is the probability that the equipment or system
performs the function for which it is designed. The equipment should perform as
committed by the manufacturer within its warranty period, and if it also performs
well after its warranty period, it indirectly helps to increase the
reputation of the manufacturer.

There are two basic things related to reliability: reliability evaluation
and reliability improvement. The reliability evaluation of a product or process
includes the study of the different phases of the product life cycle and its failure
analysis. It can be expressed in quantitative terms. Reliability
improvement is the process of preventing all chances of failure, in other
words, sealing all the loopholes in the design, manufacturing, operation and
maintenance practices involved with the particular product. To improve the
reliability of the product, all the stakeholders have to contribute throughout
its life cycle.

1.2. Reliability Calculations:

It is important to express reliability in quantitative terms. At the design
stage itself the design engineer must know the reliability value for which the
equipment is being designed, because all other design parameters depend on it.
Maintenance engineers also need to calculate the reliability of plant and
machinery in the reliability analysis phase, so that they can take the
necessary actions to improve the reliability level. Since reliability depends
on different factors, there are mainly four methods to calculate and quantify
it, based on those factors:

Use of failure data method

Density functions method

Reliability function method

Hazard and failure rates method

* Only the first method, the use of failure data method, is discussed briefly
below; it is used later in this article for reliability calculations.

RELIABILITY CALCULATION BY USE OF FAILURE DATA:

In this approach the reliability of any system is mainly determined by its
failure rate. As a rule of thumb, fewer failures mean higher reliability and
vice versa. Failure rate is the number of failures of a system or component
per unit time, for example, failures per hour. It is often denoted by the Greek
letter λ (lambda).

The mean time between failures (MTBF) is often used instead of the failure
rate in practice. MTBF is the mean time between two consecutive failures of a
system or equipment. The failure rate is simply the multiplicative inverse of
the MTBF (λ = 1/M, where M is the MTBF). We can determine the reliability by
using the following mathematical relationship established by Weibull [1]:

$$R = e^{-(T/M)^{\beta}}$$

Considering $\beta = 1$ (constant failure rate):

$$R = e^{-T \lambda}$$

Where T = Total time period

M = MTBF

λ = Failure rate
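
As a minimal sketch of the relationship above, the following Python function evaluates the reliability for a given time period T and MTBF M. The function name `reliability` and the sample values (1000 hours of operation, an MTBF of 5000 hours) are illustrative assumptions and are not taken from the study.

```python
import math

def reliability(total_time, mtbf, beta=1.0):
    """Return R = exp(-(T/M)**beta).

    beta = 1 corresponds to the constant-failure-rate case,
    where R = exp(-T * lambda) with lambda = 1/M.
    """
    if total_time < 0 or mtbf <= 0:
        raise ValueError("total_time must be >= 0 and mtbf must be > 0")
    return math.exp(-((total_time / mtbf) ** beta))

# Assumed example values, for illustration only.
T = 1000.0               # total time period in hours
M = 5000.0               # MTBF in hours
failure_rate = 1.0 / M   # lambda, failures per hour

r_from_mtbf = reliability(T, M)
r_from_rate = math.exp(-T * failure_rate)
print(round(r_from_mtbf, 4), round(r_from_rate, 4))
```

Both expressions give the same value when beta is 1, which is the case used in the rest of this article.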

2. EXPERIMENTAL WORK

For this experiment, failure data were collected for ten industrial process
fans in a cement plant over ten consecutive years. The number of failures in
each year was recorded for each individual fan. To generalize and to establish
the relationship through the data table and graphs, the failures in each year
were added together to find the total number of fan failures per year. The
average number of failures per fan per year was then calculated by dividing
the total number of failures by the total number of fans. With these data, the
bathtub curve is validated in the first phase of this experiment, the concept
of Madhabs Hat curve of Reliability is established in the second phase, and
reliability improvement methods are discussed in the last phase.

2.1. Validation of Bath Tub curve: [2]

The bathtub curve is widely used in reliability engineering. It shows the
trend of the failure rate of a system or equipment over its life period. The
trend looks like the cross-sectional shape of a bathtub, hence the name. An
ideal bathtub curve is shown in Figure - 1. It mainly comprises three parts:

The first part represents the failure rate of early life period which is
decreasing in nature.

The second part represents the failure rate of useful life period which is
more or less constant in nature.

The third part represents failure rate of wear out period which is
increasing in nature.

Figure - 1. Ideal Bathtub Curve [2]
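
The three parts listed above can be illustrated with the hazard rate of a Weibull distribution. The source does not state this expression; the standard two-parameter form h(t) = (β/η)(t/η)^(β-1) is assumed here, and the parameter values are arbitrary. A shape parameter β below 1 gives a decreasing rate, β equal to 1 a constant rate, and β above 1 an increasing rate.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# Assumed illustrative parameters; eta is the scale (characteristic life).
eta = 10.0
times = [1.0, 5.0, 9.0]

early = [weibull_hazard(t, 0.5, eta) for t in times]    # beta < 1: decreasing
useful = [weibull_hazard(t, 1.0, eta) for t in times]   # beta = 1: constant
wearout = [weibull_hazard(t, 3.0, eta) for t in times]  # beta > 1: increasing

for label, values in (("early life", early), ("useful life", useful), ("wear-out", wearout)):
    print(label, [round(v, 4) for v in values])
```

This is only a sketch of the ideal shape; it is not fitted to the fan data.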

The bathtub curve does not depict the failure rate of a single item, but
describes the relative failure rate of an entire population of products over
time. The bathtub curve is generated by tracing the rate of failures of
equipment throughout its life period. As explained earlier, a set of data has
been collected for the industrial process fans, which is shown in Table - 1.
We can now check the validity of the bathtub curve using these data with the
help of MINITAB software.

The failure rate (total number of failures per year) from Table - 1 has been
plotted against the year, and the resulting curve is shown in Figure - 2. The
curve obtained from the study looks like a bathtub. Following the curve, we
can observe the three parts discussed above: a high failure count in years 1
and 2 (13 and 12 failures), a roughly constant level in years 3 to 8 (6 to 8
failures per year), and a rising count in years 9 and 10 (15 and 16 failures).

No. of failures per year for each fan (FAN-1 to FAN-10). TOTAL = total no. of
failures per year; AVG. = average rate of failure (nos. per year).

YEAR  FAN-1  FAN-2  FAN-3  FAN-4  FAN-5  FAN-6  FAN-7  FAN-8  FAN-9  FAN-10  TOTAL  AVG.
1st     1      1      2      3      1      0      1      1      2      1       13    1.3
2nd     1      2      0      2      1      0      2      2      1      1       12    1.2
3rd     1      0      2      0      1      0      0      1      0      1        6    0.6
4th     1      0      1      1      1      0      0      1      2      0        7    0.7
5th     1      1      1      0      0      1      0      1      2      1        8    0.8
6th     1      0      2      0      0      0      0      1      1      2        7    0.7
7th     1      1      0      1      0      1      1      0      1      1        7    0.7
8th     1      1      1      1      2      1      0      1      0      0        8    0.8
9th     2      1      2      2      2      3      1      1      1      0       15    1.5
10th    2      2      2      1      2      0      3      0      2      2       16    1.6

Table - 1. Failure Rate of Fans


Figure - 2. Failure trend, Bath Tub curve
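
The TOTAL and AVG. columns of Table - 1 can be reproduced with a short Python sketch. The per-fan counts below are transcribed from the table; the variable names are illustrative assumptions.

```python
# Yearly failure counts for FAN-1 .. FAN-10, one row per year (Table - 1).
failures = [
    [1, 1, 2, 3, 1, 0, 1, 1, 2, 1],  # 1st year
    [1, 2, 0, 2, 1, 0, 2, 2, 1, 1],  # 2nd year
    [1, 0, 2, 0, 1, 0, 0, 1, 0, 1],  # 3rd year
    [1, 0, 1, 1, 1, 0, 0, 1, 2, 0],  # 4th year
    [1, 1, 1, 0, 0, 1, 0, 1, 2, 1],  # 5th year
    [1, 0, 2, 0, 0, 0, 0, 1, 1, 2],  # 6th year
    [1, 1, 0, 1, 0, 1, 1, 0, 1, 1],  # 7th year
    [1, 1, 1, 1, 2, 1, 0, 1, 0, 0],  # 8th year
    [2, 1, 2, 2, 2, 3, 1, 1, 1, 0],  # 9th year
    [2, 2, 2, 1, 2, 0, 3, 0, 2, 2],  # 10th year
]

# Total failures per year, and average failures per fan per year.
totals = [sum(row) for row in failures]
averages = [total / len(row) for total, row in zip(totals, failures)]

for year, (total, avg) in enumerate(zip(totals, averages), start=1):
    print(f"year {year:2d}: total = {total:2d}, average = {avg:.1f}")
```

Plotting `totals` against the year number gives the failure trend described for Figure - 2.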

2.2. Concept of Madhabs Hat curve of Reliability

Using the above data, the Mean Time Between Failures and the reliability for
each year have been calculated, as shown in Table 2. As explained earlier, the
reliability has been calculated by the failure data method. As with the
bathtub curve, the yearly reliability values were plotted in MINITAB software,
giving the curve shown in Figure - 3. On close observation the curve looks
like a hat, so it could be known as Madhabs Hat curve of Reliability. Like the
bathtub curve, it has three main regions.

The first region is the early life period or infant stage, where the
reliability shows an increasing trend, just opposite to the bathtub curve.

The second region is the useful life period, where the reliability is more or
less constant, just as in the bathtub curve.

The third region is the wear-out period, where the reliability shows a
decreasing trend, just opposite to the bathtub curve.

The hat curve of reliability will be helpful for comparison with the actual
reliability cycle of any equipment. It will also work as a guideline for
equipment life cycle performance analysis.

Table 2. Reliability of Fans.

YEAR  UP TIME IN  AVG. NO. OF   MTBF      TIME/MTBF  RELIABILITY
      HOURS (T)   FAILURES (N)  (T/N)     (T/M)      (e^(-T/M))
1st   8450        1.3           6500.00   1.3        0.272
2nd   8360        1.2           6966.67   1.2        0.301
3rd   8500        0.6           14166.67  0.6        0.548
4th   8456        0.7           12080.00  0.7        0.496
5th   8366        0.8           10457.5   0.8        0.449
6th   8499        0.7           12141.43  0.7        0.496
7th   8632        0.7           12331.43  0.7        0.496
8th   8392        0.8           10490.00  0.8        0.449
9th   8546        1.5           5697.33   1.5        0.223
10th  8593        1.6           5370.62   1.6        0.201

Figure - 3. Madhabs Hat curve of Reliability
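
The calculation behind Table 2 can be sketched in Python as follows. The up time and average failure values are transcribed from the table; the variable names are illustrative assumptions. Recomputed reliabilities may differ from the tabulated ones in the third decimal place, since several tabulated values appear to be truncated rather than rounded (for example, e^(-0.6) is about 0.5488 and is listed as 0.548).

```python
import math

# Columns T and N of Table 2.
uptime_hours = [8450, 8360, 8500, 8456, 8366, 8499, 8632, 8392, 8546, 8593]
avg_failures = [1.3, 1.2, 0.6, 0.7, 0.8, 0.7, 0.7, 0.8, 1.5, 1.6]

rows = []
for t, n in zip(uptime_hours, avg_failures):
    mtbf = t / n                 # M = T/N
    ratio = t / mtbf             # T/M
    rel = math.exp(-ratio)       # R = e^(-T/M), constant failure rate case
    rows.append((t, n, mtbf, ratio, rel))

for year, (t, n, mtbf, ratio, rel) in enumerate(rows, start=1):
    print(f"year {year:2d}: MTBF = {mtbf:9.2f} h, T/M = {ratio:.1f}, R = {rel:.4f}")
```

Since M = T/N, the ratio T/M reduces to N, so with this method the yearly reliability depends only on the average number of failures in that year.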

2.3. Failure prevention and reliability improvement [3][4]

Failure can be described in many different ways. One definition is as follows:

Failure is a deviation from the designed and assured performance level of any
equipment which creates dissatisfaction for its user.

Some individual units fail relatively early, others last until wear-out, and
some fail during the relatively long period typically called normal life.
Failures during infant mortality are highly undesirable and are always caused
by defects and mistakes such as material defects, design mistakes and
manufacturing defects. Normal life failures are normally considered to be
random cases of "stress exceeding designed strength" due to abnormal operating
conditions. Wear-out is a fact of life due to fatigue or degradation of
materials. After the useful life period most equipment fails, which is normal
and acceptable for both the manufacturer and the customer. A product's useful
life is limited by its endurance design. A product manufacturer must ensure
that all specified materials are adequately designed to function through the
intended product life cycle. There are mainly two types of premature failure
observed in any equipment:

Instantaneous or sudden failure: This type of failure mainly occurs when
stress exceeds the strength of the material.

Progressive/fatigue failure: This type of failure occurs mainly due to
improper or insufficient maintenance of equipment.

To prevent both types of premature failure and to increase reliability for
smooth operation of equipment, different failure prevention techniques are
adopted. These are discussed below.

TECHNIQUES TO PREVENT THE SUDDEN FAILURES ARE

Abnormal operating conditions should be considered at the design stage.

Proper stress analysis and metallurgical study of components should be done at
the design stage, before they are used in the equipment.

Redundancy: use parallel components wherever possible.

At the design stage, the Factor of Safety and endurance limit of components
should always be chosen on the safe side.

Proper manufacturing methods should be adopted at the manufacturing stage.

Proper methods of installation and start-up/commissioning of equipment should
be followed.
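
The benefit of redundancy mentioned in the list above can be quantified with the standard formula for components in parallel, R = 1 - (1 - R1)(1 - R2)...(1 - Rn). The sketch below assumes that the components fail independently, which the source does not state, and the component values are purely illustrative.

```python
import math

def parallel_reliability(component_reliabilities):
    """Reliability of a parallel (redundant) group of independent components.

    The group fails only if every component fails, so
    R = 1 - product of (1 - R_i).
    """
    values = list(component_reliabilities)
    if not values or any(r < 0.0 or r > 1.0 for r in values):
        raise ValueError("need at least one reliability value in [0, 1]")
    return 1.0 - math.prod(1.0 - r for r in values)

# Assumed example: one component alone versus two identical ones in parallel.
single = parallel_reliability([0.9])
duplicated = parallel_reliability([0.9, 0.9])
print(round(single, 4), round(duplicated, 4))
```

Under these assumptions, duplicating a component with reliability 0.9 raises the reliability of that function to about 0.99.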

TECHNIQUES TO PREVENT THE PROGRESSIVE FAILURES ARE

Cleaning and lubrication of the equipment should be done on a regular basis.

Lubricants should be used as per the recommendation of the OEM.

Lubricant analysis should be done on a regular basis to check the oil
contamination level, wear particles, viscosity etc.

Tightness of fasteners used in the equipment must be checked at regular
intervals.

Follow the proper maintenance procedure and adopt the methods of Reliability
Centered Maintenance (predictive and preventive maintenance) to increase the
reliability of the equipment.

The most important failure prevention technique is condition monitoring of the
equipment. Measurement and analysis of vibration, temperature, wear pattern,
bearing clearance, gear wear pattern, motor current etc. should be done on a
regular basis, and preventive actions taken if required. Some important
condition monitoring techniques are shown in Figure - 4.

According to the results of condition monitoring, the appropriate preventive
actions must be taken to prevent failures of equipment.

Alignment and balancing of equipment must be checked at regular intervals, and
the necessary actions taken against any abnormalities.

Measurement of pressure, flow and velocity of fluids should be done regularly
and the trend should be analyzed.

Avoid running the equipment above the recommended temperature, pressure,
vibration, noise etc.

Adopt the proper operating procedure with the recommended operating
parameters.

Analyze all minor issues as well as major failures using proper failure
analysis techniques such as FMEA, fishbone diagrams, root-cause failure
analysis etc.

Always learn the lessons from past failures and follow the preventive mode to
avoid those failures in future.

Modifications and continuous improvement should be carried out depending on
the requirements of the operating conditions.

Figure - 4. Condition monitoring Techniques


2.4. Daily checklist for a fan:

A sample daily checklist, prepared from work experience, is given below. It
will be helpful for maintenance engineers in tracking the trend of the
different factors that play a major role in the failure of an industrial
process fan. The trends can be analyzed on a regular basis to get important
information about abnormalities. On the basis of this information, preventive
actions can be taken to avoid failures, thus increasing the reliability of the
system.

ATTRIBUTE                     FAN                       MOTOR
                              DE SIDE     NDE SIDE      DE SIDE     NDE SIDE
                              BEARING     BEARING       BEARING     BEARING
VIBRATION IN MM/SEC
BEARING TEMPERATURE
LOOSENESS OF BOLTS
OIL LEVEL/LUBRICATION STATUS
NOISE
RPM

ATTRIBUTE                     MOTOR
                              R           Y             B
RPM
CURRENT IN AMPERE
POTENTIAL DIFFERENCE IN VOLT
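
The checklist readings could be recorded and screened for abnormalities along the following lines. This is only a sketch: the `DailyReading` structure, the choice of two attributes, the temperature unit (degrees Celsius), the readings and the limit values are all hypothetical. Real limits should come from the OEM recommendation for the specific fan.

```python
from dataclasses import dataclass

@dataclass
class DailyReading:
    """One day's checklist entry for a single bearing (hypothetical layout)."""
    day: int
    vibration_mm_s: float    # vibration in mm/sec
    bearing_temp_c: float    # bearing temperature, assumed to be in deg C

def flag_abnormal(readings, vibration_limit, temp_limit):
    """Return the days on which any reading exceeds its limit."""
    return [
        r.day
        for r in readings
        if r.vibration_mm_s > vibration_limit or r.bearing_temp_c > temp_limit
    ]

# Hypothetical readings and limits, for illustration only.
readings = [
    DailyReading(day=1, vibration_mm_s=2.1, bearing_temp_c=58.0),
    DailyReading(day=2, vibration_mm_s=2.4, bearing_temp_c=60.5),
    DailyReading(day=3, vibration_mm_s=4.9, bearing_temp_c=71.0),
]
abnormal_days = flag_abnormal(readings, vibration_limit=4.5, temp_limit=75.0)
print(abnormal_days)
```

Days returned by `flag_abnormal` are candidates for the preventive actions described above.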

CONCLUSION

In the first phase of this study the bathtub curve is validated. In the second
phase a new concept, the Hat curve of Reliability, has been established;
although it is a simple clarification, it is important for the life cycle
reliability study of any equipment. In the third phase some failure prevention
methods have been discussed.

It is hoped that this study will be helpful for maintenance engineers as well
as design engineers and reliability engineers in achieving excellence in their
respective fields.

REFERENCES

I. http://reliabilityanalyticstoolkit.appspot.com/weibull_distribution

II. http://www.weibull.com/hotwire/issue21/hottopics21.htm

III. http://www.lifetime-reliability.com/free-articles/work-quality-assurance/defect-elimination.html

IV. https://www.dmgeventsme.com/machineryfailureprevention/
