Reliability Book

TESTING METHODS AND RELIABILITY
Chapter One
1.0 Understanding the Basic Terms and Relationships Involved In Reliability Studies
iv. To apply methods for estimating the reliability of new designs and for analyzing
reliability data.
v. Throughput
Downtime for any reason reduces the system’s throughput. Downtime can be minimized
by applying predictive and preventive maintenance programs which are all reliability
engineering techniques. A well-maintained system maximizes throughput and minimizes
operating expenses.
vi. Cost Analysis
Manufacturers may take reliability data and combine it with other cost information to
illustrate the cost-effectiveness of their products. Such a life cycle cost analysis can show
that although the initial cost of a product might be higher than a competitor's, the overall
lifetime cost is lower because the product requires fewer repairs or less maintenance.
vii. Competitive Advantage
Many companies publish their predicted reliability numbers to help gain an advantage
over competitors who either do not publish their numbers or have lower numbers.
viii. Reputation
A company’s reputation is closely related to the reliability of its products. The more
reliable a product is, the more likely the company is to have a favourable reputation.
ix. Repeat Business
A concentrated effort towards improving reliability shows existing customers that a
manufacturer is serious about its product, and committed to customer satisfaction. This
type of attitude has a positive impact on future business.
x. Safety
Some product failures cause unintended or unsafe conditions leading to loss of life or
injury. Reliability engineering tools assist in identifying and minimizing safety risks.
xi. Distribution
Fewer failures and optimized maintenance imply fewer spare parts in the logistics
system. This minimizes the distribution system costs for transportation, logistics, and
storage of spare parts. It also minimizes service labour costs.
1.2.1 Reliability
Reliability is the capacity of an item (a part, system, product or service) to perform its required
function under given conditions for a stated time interval without failure. It is generally designated
by R.
From a quantitative point of view, reliability specifies the probability that no operational
interruptions will occur during a stated time interval. Reliabilities are always specified with
respect to certain conditions, called normal operating conditions. These include load, temperature,
and humidity ranges as well as operating procedures and maintenance schedules. Failure of users
to heed these conditions often results in premature failure of parts or of the complete system. For
example, using a 2 kVA electricity generating set to power electrical equipment rated at 3 kW
will cause the generating set to fail; so also, using a passenger car to tow heavy loads will
cause excess wear and tear on the drive train; again, driving over potholes or curbs often results in
untimely tyre failure.
This concept can be clearly understood by the use of an example. Suppose a test was started at
time t = 0 with N0 number of items, after a time period of t, Nf out of the original N0 items failed
and Ns survived. Then reliability R(t) is expressed at any time, t as:
R(t) = Ns/N0 = (number of items that survived)/(total number of items)    (1.1)

R(t) = (N0 − Nf)/N0 = 1 − Nf/N0    (1.2)
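Since the text itself presents no code, a minimal Python sketch of equation (1.2) may help; the function name and the test figures (1,000 items on test, 50 failed by time t) are illustrative assumptions, not values from the text:

```python
def reliability(n_total, n_failed):
    # Equation (1.2): R(t) = 1 - Nf/N0
    return 1 - n_failed / n_total

# Illustrative (assumed) test figures: 1,000 items on test, 50 failed by time t.
print(reliability(1000, 50))  # 0.95, i.e. 95% of the items survived to time t
```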
1.2.2 Failure
Failure is used to describe a situation in which an item stops performing its required function.
This includes not only instances in which the item does not function at all, but also instances in
which the item’s performance is subnormal or it functions in a way not intended. For example, a
smoke alarm might fail to respond to the presence of smoke (not operated at all), it might sound
an alarm that is too faint to provide adequate warning (subnormal performance), or it might sound
an alarm even though no smoke is present (unintended response).
Failure can be classified according to the mode, cause, effect and mechanism:
i. Mode: The mode of failure is the symptom (local effect) by which a failure is
observed; e.g. opens, shorts or drift for electronic components; brittle fracture, rupture, creep,
cracking, seizure, fatigue for mechanical components.
ii. Cause: The cause of a failure can be intrinsic, due to weaknesses in the item and/or wear-out,
or extrinsic, due to errors, misuse or mishandling during design, production, or
use. Extrinsic causes often lead to systematic failures, which are deterministic and
should be considered like defects. Defects are present at t = 0, even if they cannot
often be discovered at t = 0.
iii. Effects: The effect or consequence of a failure can be different if considered on the
item itself or at higher level. A usual classification is: non-relevant, partial, complete,
and critical failure. Since a failure can cause further failures, distinction between
primary and secondary failures is important.
iv. Mechanism: Failure mechanism is the physical, chemical, or other process resulting in
failure e.g. fatigue, corrosion, charge spreading (leakage currents) etc.
1.2.3 Mean Time Between Failures (MTBF)
MTBF is a key measure of reliability for repairable items. It can be expressed as the
elapsed time before an item fails, under the condition of a constant failure rate. MTBF can
also be explained as the expected value of time between two consecutive failures, for
repairable items. It is the inverse of the failure rate, λ, for constant failure rate systems or
items. For example, for a component with a failure rate of 5 failures per million hours, the
MTBF would be the inverse of that failure rate:

MTBF = 1/λ = 1/(5 failures/1,000,000 hours) = 200,000 hours/failure
It is important to note that failure rate, λ, is the number of failures per unit time. It can be
expressed as:

λ = 1/MTBF    (1.3)
On the other hand, if the test involves a number of identical pieces of equipment or items
operating under similar conditions in various systems, then the MTBF is given by:

MTBF = (total operating time × number of items)/(number of failures)    (1.4)

Implying that if 80 identical items are in operation for 20,000 hours, during which time 20 failures
occur and are rectified, then:

MTBF = (20,000 × 80)/20 = 80,000 hours
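The same computation can be sketched in Python; the function name is our own, and the figures are those of the example above:

```python
def mtbf(total_unit_hours, n_failures):
    # MTBF for repairable items: total unit-hours of operation / number of failures
    return total_unit_hours / n_failures

# The example above: 80 identical items, 20,000 hours each, 20 failures rectified.
print(mtbf(20_000 * 80, 20))  # 80000.0 hours
```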
MTTF is a key measure of reliability for non-repairable items such as filament lamps, fuses,
resistors, capacitors, etc. It is the mean time expected until the first failure of a piece of
equipment. The value of MTTF can be calculated from life test results, which can be obtained
by stressing a large number of components under known conditions for a period of time and
noting the number of failures. Then we have:
MTTF = 1/λ    (1.5)
And conversely, we have λ = 1/MTTF, where λ is independent of time in this case.
It is important to note that in the design of reliability testing, it is failure that provides the
information needed to improve the design, confirm design margins, or validate assumptions.
Generally, the type of failure indicates, at least in part, the root cause and the appropriate
corrective action. Keeping the types and sources of failure in mind may assist one to avoid or
prevent failure. By and large, failure may be categorized as follows:
i. Misuse Failure: This is failure due to improper use of an item or system. For instance,
when a product is strained beyond its stipulated capacity, it is said to have been
misused, and this can result in failure. For example, applying 230 V a.c. mains to an item
designed and rated for 110 V a.c. mains only is a misuse and could result in a misuse failure.
ii. Inherent Weakness Failure: Sometimes a system can fail even when it is operated within the
limits of its stipulated capacities. This is by reason of intrinsic weakness in the item itself.
iii. Sudden Failure: This is a failure that could not be anticipated, even after preceding tests had
been undertaken.
iv. Gradual Failure: In this type of failure, the system slowly draws near and exceeds the
failure threshold.
v. Partial Failure: This type of failure expresses itself by the item’s or system’s departure
from its features beyond a stated degree. However, it does not outrightly fail in its required
function.
vi. Secondary Failure: In this type of failure, the damage to the item actually affected (the
victim) is the result of the failure of another component (the instigator).
vii. Transient Failure: This type of failure is short-lived and often not tied to specific
conditions.
viii. Intermittent Failure: The failures under this class only occur under specific condition and
the product works otherwise.
The mode of failure describes the manner or form in which failure presents itself. This is
important when attempting to avoid, prevent, understand or resolve a failure.
During the design process, it is imperative to keep in mind that every element of a system can lead to
one or more of the above types of failure. On complex systems, it may be necessary to include
diagnostic routines to assist in determining the cause of the failure and where the damage
occurred.
One major determinant of the reliability of an item is the frequency of occurrence of failure of
the item. This is known as the failure rate of the item; it is expressed as the number of failures that
take place per unit time:
λ = lim(Δt→0) [Nf(t + Δt) − Nf(t)]/[Ns(t)·Δt] = [1/Ns(t)] × dNf(t)/dt    (1.6)
Example 1.1: In a life test where 4,040 items were tested, 40 of them failed. If the test period was
20000 hours, calculate the failure rate.
Solution
λ = [1/(4040 − 40)] × (40/20,000)

  = (1/4000) × (40/20,000)

λ = 0.0000005 failures per hour

Expressed as a percentage:

(1/4000) × (40/20,000) × 100 = 0.00005 percent per hour
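A short Python sketch of Example 1.1; the function name and argument order are our own choices:

```python
def failure_rate(n_tested, n_failed, test_hours):
    # Failure rate per hour based on the survivors Ns = n_tested - n_failed,
    # as computed in Example 1.1.
    survivors = n_tested - n_failed
    return (1 / survivors) * (n_failed / test_hours)

lam = failure_rate(4040, 40, 20_000)
print(lam)        # ~5e-07 failures per hour
print(lam * 100)  # ~5e-05 percent per hour
```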
There are two major classifications of failure rates we may possibly encounter in this text, namely
the hazard rate (instantaneous failure rate) and the averaged failure rate (AFR).

The hazard rate is a function that describes the probability per unit time that an item that has
survived to the beginning of the respective interval will fail in that interval. It is computed as the
number of failures per unit time in the respective interval, divided by the average number of
surviving cases at the mid-point of the interval.
h(t) = f(t)/(1 − F(t)) = f(t)/R(t) = instantaneous (conditional) failure rate    (1.7)

where f(t) is the failure density function: f(t) = −dR/dt
h(t)·Δt ≈ ΔR/R    (1.8)
This may be identified as the probability that a certain component will fail in the interval of time t
to (t +∆ t), given that it has survived up to the time t.
The quantity h(t)·Δt is a probability, a value from 0 to 1. Failure rate can be broken down in a
couple of ways. The instantaneous failure rate is the probability of failure at some specific point in
time. The hazard rate is therefore the failures per unit time when the time interval is very small at
some point in time, t. Thus, if a unit has been operating for a year, this calculation would provide
the chance of failure in the next instant of time.
This is not useful for the calculation of number of failures over that year; rather, it is used only to
evaluate the chance of a failure in the next moment.
It is also sometimes useful to define an averaged failure rate, AFR, over any interval of time (T1 to
T2) that averages the failure rate over that interval. This rate, denoted by AFR(T1, T2), is a single
number that can be used as a specification or target for the population failure rate over the interval.
If T1 is 0, it is dropped from the expression; thus, for example, AFR(40000) would be the average
failure rate for the population over the first 40,000 hours of operation. In general:

AFR(T1, T2) = [∫ h(t) dt from T1 to T2]/(T2 − T1) = [H(T2) − H(T1)]/(T2 − T1)
            = [ln R(T1) − ln R(T2)]/(T2 − T1)

and, when we set T1 = 0:

AFR(0, T) = AFR(T) = H(T)/T = −[ln R(T)]/T    (1.9)

where H(T) is the integral of the hazard rate h(t) from time zero to time T;
T is the time of interest, which defines a time period from zero to T; and
R(T) is the reliability function, or probability of successful operation, from time zero to T.
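A minimal Python sketch of the AFR formula above, assuming the reliability function R(t) is already known at the two endpoints; the sample figures are illustrative assumptions only:

```python
import math

def afr(r_t1, r_t2, t1, t2):
    # Averaged failure rate over (t1, t2): (ln R(t1) - ln R(t2)) / (t2 - t1)
    return (math.log(r_t1) - math.log(r_t2)) / (t2 - t1)

# Illustrative (assumed) values: R(0) = 1.0 and R(40,000 h) = 0.98.
print(afr(1.0, 0.98, 0, 40_000))  # ~5.05e-07 failures per hour
```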
For the constant failure rate (useful life) case,
λ = (number of failures)/(number of time units during which all items were exposed to failure).
The probability distribution of reliability for this case is the negative exponential distribution.

As already stated in equation (1.3), the reciprocal of λ (i.e. 1/λ) = T̄, the mean time between failures:

MTBF = T̄ = (number of time units during which items were exposed to failure)/(number of failures)
So, R(t) = e^(−λt)    (1.11)
Note that if a component is operated for a period equal to its MTBF, the probability of survival is:

R(t) = e^(−λt) = e^(−λ × 1/λ) = e^(−1) = 0.368 = 36.8%

(This indicates that the probability that any one particular item will survive to its calculated
MTBF is only 36.8%.)
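A quick numerical check of this 36.8% result in Python, using the 200,000-hour MTBF of the earlier 5-failures-per-million-hours component:

```python
import math

def reliability_exp(lam, t):
    # Constant failure rate reliability: R(t) = exp(-lambda * t)
    return math.exp(-lam * t)

lam = 1 / 200_000                     # the 5-failures-per-million-hours component
print(reliability_exp(lam, 200_000))  # 0.3678..., i.e. 36.8% survival at the MTBF
```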
While an individual component may not have an exponential reliability distribution, in a complex
system with many components the overall failures may appear as a series of random events, and
the system will follow an exponential distribution.
Under this case of constant failure rate, if a fixed number of identical components is tested, let
No be the original number of items, Nf(t) the number failed by time t and Ns(t) the number
surviving at time t.

Recall from equation (1.6) that:

λ = [1/Ns(t)] × dNf(t)/dt

λ = [1/(No(t) − Nf(t))] × dNf(t)/dt    (1.12)

Integrating both sides:

∫ λ dt from 0 to t = ∫ dNf(t)/(No(t) − Nf(t)) from 0 to Nf(t)

λt = −log_e[(No(t) − Nf(t))/No(t)]

∴ −λt = log_e[(No(t) − Nf(t))/No(t)]

So, e^(−λt) = 1 − Nf(t)/No(t)    (1.13)

But R(t) = Ns(t)/No(t) from equation (1.1),

and R(t) = (No − Nf)/No = 1 − Nf/No from equation (1.2).

Hence, R(t) = e^(−λt), as stated in equation (1.11).
It should be noted that the assumption of a constant failure rate is very often true of electronic
components during their useful life period.
1.5.1 Reliability and Unreliability and their Related Curves and Equations under Constant
Failure Rate
At any particular moment, a component or system is either operational or it has failed, and a
component's or system's functioning status varies as time advances. A functioning item will fail
in the long run; and if the item is a non-repairable one, it will remain in this failed state ad
infinitum. A repairable item will, however, continue in its failed status for as long as it takes for its
repair to be effected, and is restored to the working state at the completion of its repair. This
switch from a functioning to a failed state is referred to as failure; whereas the change from a
failed state to an operational state is called repair. It is assumed that the switch from a failed to a
working state is instantaneous, and that repair restores the item to spick-and-span
condition. For a repairable item, therefore, the repair-to-failure and
failure-to-repair sequence recurs continuously.
Reliability R(t) for non-repairable items therefore, can be defined as the probability that an item
will perform a defined function without failure under stated conditions for a stated period of time.
A clear comprehension of the probability concept is vital if one must understand the concept of
reliability. This is because the numerical values of reliability and unreliability are stated as a
probability from 0 to 1, without any units.
For repairable items, reliability, R(t), can be defined as the probability that the item suffers no
failures during the time interval zero to t1, given that the item was repaired to spick-and-span status
or was operational at t0.
Unreliability, Q(t), of an item is defined as the probability that the item suffers the first failure or
has failed once or greater than once during the time period zero to time t, taking into account that
it was operating or repaired to spick-and-span status at time zero. Unreliability, Q(t) can also be
expressed as the number of items failed at time t divided by the total number of samples tested.
Since an item must experience or suffer its first failure in the time interval zero to t, or stay
functional over this period, it is therefore proper to express their relationship as follows:
R(t) + Q(t) = 1

i.e. Q = 1 − R = 1 − e^(−λt) and R = e^(−λt), both of which are plotted against time in Figure 1.1.

Figure 1.1 Graphical Representations of Reliability and Unreliability

At time t = 1/λ, R(t) = 0.37 and Q(t) = 0.63.
By plotting time, t, against the fraction of the total components that failed, we obtain the graph of
the probability of failure of a component after time t hours. This is referred to as the probability of
failure curve, as shown in Figure 1.2. On the other hand, by plotting time, t, versus the fraction of
total survivors, the probability of survival curve is obtained (Figure 1.2). The probability of survival
is often expressed as reliability.
R(t) = 1 − Nf(t)/No(t) = e^(−λt),

and Ns = No(t) − Nf(t)

From equation (1.1), R(t) = Ns(t)/No(t) = e^(−λt)

∴ Ns = No·e^(−λt)    (1.18)
Equation (1.18) represents the equation of the graph of survivors versus time called the survivor
curve.
Relatedly, the equation of the graph of failure versus time can be derived from equation (1.12) as
follows:
R(t) = Ns(t)/No(t) = e^(−λt)

[No(t) − Nf(t)]/No(t) = e^(−λt)

No(t) − Nf(t) = No(t)·e^(−λt)

−Nf(t) = No(t)·e^(−λt) − No(t)

∴ Nf(t) = No(t)·[1 − e^(−λt)]

[Figure 1.2: the failure curve Nf(t) and the survivor curve Ns(t) plotted against time.]
Characteristically, a Bathtub curve is divided into three distinctive zones which are:
1.7.1 The Infant Mortality Period: This is also referred to as early failure period. In Figure 1.3
the slope from the starting point at the leftmost side to where it begins to flatten out, represents
this period. This period is characterized by a decreasing failure rate. This mode of failure occurs
during the early life of a population of units. It reveals the failure rate arising from frail components
that evaded final testing, checking and examination, but cave in to infant mortality when exposed
to normal operational pressure. The feeble units fail, leaving a population that is more robust and
hardy.
1.7.2 The constant Failure Rate Period: This period is the flat portion of the curve of Figure 1.3.
It is called the normal life or the “useful life.” Failures in this region occur at random. This
is the period dominated by chance failures. Chance failures are those failures that result from
strictly random or chance causes. Equipment is designed to operate under certain conditions and
up to certain stress levels. When these stress levels are exceeded due to random unforeseen or
unknown events, a chance failure will occur. While reliability theory and practice are concerned
with all three types of failures, their primary concern is with chance failures, since they occur during
the useful life period of the equipment. The time when a chance failure will occur cannot be
predicted; however, the likelihood or probability that one will occur during a given period of time
within the useful life can be determined by analyzing the equipment design. If the probability of
chance failure is too great, either design changes must be introduced or the operating environment
made less severe. The failure rate is lowest during this period; and the slope of the curve is
constant in this region which signifies constant failure rate. The amplitude on the bathtub curve is
at its lowest during this time.
1.7.3 The Wear-out Period: This period begins at the point where the slope begins to increase and
extends to the end of the curve of Figure 1.3. This is what takes place when units get old and
start failing progressively (i.e. with increasing failure rate).
1.8.1 The Infant Mortality Period: Failures in this region are usually caused by inherent defects
due to poor materials, workmanship or processing procedures at the molecular level, or by the
manufacturer's quality control, besides installation problems.
Some of the design techniques that should be put in place to ensure the integrity of the designs
include: “burn in” (this is stressing the devices under constant operating conditions); “power
cycling” (this stresses the devices under surges of turn-on and turn off); “temperature cycling”
(this stresses the devices mechanically and electrically over the temperature extremes);
“vibration” and “testing at the destruct limits” etc.
1.8.2 The Constant Failure Rate Period: In this region, failures are produced by chance or by
operating conditions, such as switching surges, lightning and operator
faults.
1.8.3 The Wear out Period: Failures during this period are due to old age; various components
are worn out; metals become embrittled, insulation dries out etc. Typical examples are electrolytic
capacitors drying out, fan bearing seizing up, switch mechanisms wearing out etc. Well
implemented preventive maintenance/replacement can delay the onset of this region.
During the wear-out stage of the lifecycle of components or devices, they begin to fatigue and
their expected useful life dwindles. The failure mode follows a symmetric distribution in which
most of the observations of the wear-out failures cluster around the central peak (mean), and the
probabilities for values further away from the mean taper off equally in both directions. This
is called the normal or Gaussian distribution. It is shown in Figure 1.4, which is a graph of wear-out
failures versus time. It is also referred to as the “bell curve”.
[Figure 1.4: wear-out failures versus time — a bell-shaped (Gaussian) curve centred on the mean wear-out life, M.]
It is expedient to state that the wear-out region is not secluded from the entire bathtub curve
structure described and illustrated in Figure 1.3; rather it is, to all intents and purposes, a
continuation of it. Hence, representing the bathtub curve with the wear-out failure region wholly
incorporated will bear a resemblance to Figure 1.5.
To begin with, while trying to grasp the concept of the Gaussian distribution, it is helpful for us to
figure out a very vital indicator, the standard deviation “σ”.
Consider running a wear-out test in which components 1, 2, 3, …, n are denoted by C1, C2, C3, …, Cn,
the wear-out life of component Ci is ti, and the total number of components engaged in the test is n. Then:

σ = √[Σ(ti − M)²/n]    (1.20)

where M is the mean wear-out life.
Next, it can be proved that for a Gaussian distribution of failures, there is a 0.6826 probability that
a failure will take place between a time of (M − σ) and (M + σ), i.e. in the vicinity of (M ± σ);
there is a 0.9544 probability that it will happen between (M − 2σ) and (M + 2σ); and there is a
probability of 0.9973 of occurrence between (M − 3σ) and (M + 3σ). These figures are most handy
when determining the confidence limits of wear-out failure estimates. Figure 1.6 clearly depicts
the preceding details.
[Figure 1.6: Gaussian distribution of wear-out life versus time, showing the M ± σ, M ± 2σ and M ± 3σ probability bands.]
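These three probabilities can be verified with the Gaussian error function; a minimal Python check (the helper name is our own):

```python
import math

def prob_within(k):
    # Probability that a Gaussian variate falls within M +/- k*sigma:
    # P = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))  # 1 0.6827, 2 0.9545, 3 0.9973
```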
Life test can be conducted to determine wear-out failure modes. This is accomplished by
subjecting a sample of the component to within-the-rated operating conditions (i.e. a real life
setting); and sustained for a sufficiently long test period, up to when the components fail. A close
inspection of the failed components with respect to their physical attributes and life expectancy,
rules out infant mortality and chance failures. For instance, if the life span of a sample of the
components is, perhaps, 6,000 hours, a component which survives for only 700 hours cannot possibly
be classified as a wear-out failure. The usual praxis for such premature failures is to exclude them
from the samples classified as having exhibited wear-out failure.
The mean time to wear-out, M can be calculated if the time to each wear-out failure is known.
If t1, t2, t3,…, tn, denote the time to each wear-out failure, then,
M = (t1 + t2 + t3 + … + tn)/n    (1.21)

M = (total time used for the test of all the components which failed due to wear-out)/(total number of components which failed due to wear-out)
Example 1.2
A set of 25 components were put into accelerated life test. The time for their failures, from the
beginning of the test to the end, are presented in hours as follows:
16, 82, 215, 610, 761, 784, 790, 798, 935, 28, 91, 310, 650, 767, 787, 792, 920, 51, 103, 420, 750,
780, 788, 796, 931. Calculate the mean time to wear-out failure.
Tips:
A close look at the trend of failure obviously shows that the first 3 failures may possibly be
classified as infant mortality period (early failure period). Again, looking at the next 8 failures,
they give the impression of a possible random failure period. Thus, the first 11 failures are left
out, as they do not have the likely traits of wear-out failure. However, the failures from 750 hours
to 798 hours look more probable to be part of the wear-out failure inclination.
Besides, there are 3 failures which display long life: 920, 931 and 935 hours. These three failures
should not be included either, because they visibly stand out as atypical of the test. They depict
long life, and if they were included in the calculation, the result would create a false notion of
a longer mean time to wear-out failure. So, our solution lies with the failures from 750 hours
to 798 hours, which is calculated as follows:

750 + 761 + 767 + 780 + 784 + 787 + 788 + 790 + 792 + 796 + 798 = 8,593

∴ Mean time to wear-out failure, M = 8,593/11 = 781 hours
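Example 1.2 can be reproduced in a few lines of Python; the 750-798 hour window used to isolate the wear-out cluster is the one chosen in the Tips above:

```python
times = [16, 82, 215, 610, 761, 784, 790, 798, 935, 28, 91, 310, 650, 767,
         787, 792, 920, 51, 103, 420, 750, 780, 788, 796, 931]

# Keep only the failures judged to belong to the wear-out cluster (750-798 h),
# excluding the early/random failures and the three long-life outliers.
wearout = [t for t in times if 750 <= t <= 798]
print(len(wearout), sum(wearout))          # 11 8593
print(round(sum(wearout) / len(wearout)))  # 781 hours
```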
1.9.2 How to Evaluate the Confidence Limit for the Gaussian Distribution
In a bid to determine the mean wear-out life of components, several sample tests are performed
from which a number of possible values of mean life will be obtained. These values are
considered as estimates of the actual mean. Given that the wear-out life is a typical Gaussian
distribution which clusters around the central mean wear-out life, as a consequence, the estimates
of the actual mean is also inclined to have the same Gaussian distribution clustering around the
actual mean wear-out life. There will therefore be a striking difference between the standard
deviation, σ, of the wear-out life distribution and that of the mean distribution, σM.
Standard deviation for the means distribution, σM = σ/√n    (1.22)
Where n is the number of components tested.
The shape of the normal means distribution curve is the same as the wear-out distribution density
curve shown in Figure 1.4. We can suitably project with certainty that:
i. There is a 0.6826 probability that all estimates of the actual mean fall within the ±σM range;
ii. There is a 0.9544 probability that all estimates of the actual mean fall within the ±2σM range;
iii. There is a 0.9973 probability that all estimates of the actual mean fall within the ±3σM range.
Example 1.3

To obtain the necessary parameters for the evaluation of σ, we tabulate and compute the grouped
failure data, from which Σfi(ti − M)² = 386,096 for n = 16 components. Hence:

σ = √(386,096/16) = 155 hours (approximately)

Standard deviation for the means distribution, σM = 155/√16 = 39 hours (approximately)

The upper confidence limit for the actual mean wear-out life at a 99.73% level of confidence is
therefore M + 3σM = M + 117 hours.
Example1.4
Forty machines are operated for 150 hours, one machine fails in 50 hours, another fails at 65
hours yet a third one fails at 70 hours. What is the MTBF?
Solution
37 machines ran the full 150 hours, while the three others ran for 50, 65 and 70 hours respectively,
so the total running time = (37 × 150) + 50 + 65 + 70 = 5,735 hours.

MTBF = 5,735/3 = 1,912 hours (approximately)
Example 1.5
(i) What is the reliability of the same machines from Example 1.4 at 400 hours? And at 600
hours?
Solution
λ = 1/MTBF = 1/1912 = 0.000523 failures/hour

R(t) = e^(−λt)

R(400) = e^(−0.000523 × 400) = e^(−0.2092) = 0.81
R(600) = e^(−0.000523 × 600) = e^(−0.3138) = 0.73

Thus, there is an 81% probability that the machine will run for 400 hours without failure, and a 73%
probability that it will run for 600 hours.
(ii) Suppose the machine’s performance is entirely dependent on one particular component.
Each time the component is replaced, the machine’s reliability returns to 100%. How often
should the component be replaced so that the machine’s reliability is never less than 90%?
Solution
R(t) = e^(−λt)

0.90 = e^(−0.000523t)

Taking the natural log of both sides, we have:

ln 0.90 = −0.000523t

t = 201.5 hours
The component should therefore be replaced every 201.5 hours.
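A small Python sketch of this replacement-interval calculation (the function name is our own):

```python
import math

def replacement_interval(lam, r_min):
    # Longest interval t with R(t) = exp(-lam*t) >= r_min, i.e. t = -ln(r_min)/lam
    return -math.log(r_min) / lam

print(round(replacement_interval(0.000523, 0.90), 1))  # ~201.5 hours
```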
Chapter Two
(b) The potential of the design to sustain satisfactory reliability level under intense
environmental conditions can be determined through reliability predictions. Thus,
predictions can be a means of assessing the necessity for environmental control systems.
(c) The effect of complexity on the likelihood of succeeding in the undertaking can be
assessed by performing a reliability prediction survey. The result from the survey may
establish the necessity for redundant systems, back-up systems, subsystems, assemblies
or component parts.
(d) A reliability prediction can also assist in evaluating the importance and gravity of
reported failures. Eventually, the outcome of a reliability prediction analysis can be
handy when conducting further necessary analyses such as Failure Modes, Effects and
Criticality Analysis (FMECA), Reliability Block Diagram (RBD) or Fault Tree Analysis
(FTA). The reliability predictions are used to evaluate the probabilities of the failure events
described in these alternative failure analysis models.
2.1.1 Some Basic Probability Rules in Relation to Reliability Calculation and Predictions
Remember that reliability is generally concerned with whether an item functions for a particular
time domain (or period), which is a question that can be answered as a probability. It is the
probability that an item will perform a required function without failure under stated conditions
for a stated period of time.
Therefore, in reliability prediction analysis, some basic probability rules are needed. A handful
of the relevant probability rules are discussed hereunder.
Under this rule, for independent events, just multiply the probability of the first event by that of the
second. For example, if the probability of event A is 4/9 and the probability of event B is 2/9, then the
probability of both events happening at the same time is (4/9) × (2/9) = 8/81 = 0.099
In a situation where the events are not necessarily mutually exclusive as presented in (a)
above, we can generalize the formula. For any two events A and B, the probability of A
or B is the sum of the probability of A and the probability of B minus the shared
probability of both A and B:

P(A or B) = P(A) + P(B) − P(A and B)
In a group of 98 students, 28 are freshmen and 34 are sophomores. Find the probability
that a student picked from the group at random is either a freshman or sophomore.
Solution
P(freshman) = 28/98 and P(sophomore) = 34/98. Therefore,

P(freshman or sophomore) = 28/98 + 34/98 = 62/98

This looks logical, since 62 of the 98 students are freshmen or sophomores.
In a group of 98 students, 30 are juniors, 40 are female, and 28 are female juniors. Find
the probability that a student picked from this group at random is either a junior or female.
Solution
P(junior) = 30/98, P(female) = 40/98 and P(junior and female) = 28/98.

Therefore, P(junior or female) = 30/98 + 40/98 − 28/98

P(junior or female) = 42/98

This is logical, since 30 are juniors and 40 are female, while 28 are female juniors.
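Both probability rules can be sketched in Python; the function names are our own, and the figures are those of the examples above:

```python
def p_both_independent(p_a, p_b):
    # Multiplication rule for independent events: P(A and B) = P(A) * P(B)
    return p_a * p_b

def p_either(p_a, p_b, p_a_and_b):
    # General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
    return p_a + p_b - p_a_and_b

print(round(p_both_independent(4/9, 2/9), 3))   # 0.099
print(round(p_either(30/98, 40/98, 28/98), 4))  # 0.4286, i.e. 42/98
```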
The binomial distribution formula is:

b(x, n, P) = nCx × P^x × (1 − P)^(n−x)    (2.5)

where:
n = number of trials
x = number of successes
P = probability of success on an individual trial
q = 1 − P = probability of failure
This formula can also be written in a slightly different way, because nCx = n!/[x!(n − x)!]. (This
formula applies factorials.) Therefore, we have the alternate binomial distribution formula:

P(x) = {n!/[x!(n − x)!]} × P^x × q^(n−x)    (2.6)
The first variable in the binomial formula of equation (2.5), n, stands for the number of times the
experiment runs. The second variable, P, represents the probability of one specific outcome. For
example, suppose you wanted to know the probability of getting a one on a die roll. If you were to
roll a die 20 times, the probability of rolling a one on any throw is 1/6. Roll twenty times and you
have a binomial distribution with (n = 20, P = 1/6). SUCCESS would be “roll a one” and FAILURE
would be “roll any other number”. If the outcome in question was the probability of the die landing
on an even number, the binomial distribution would then become (n = 20, P = 1/2), because the
probability of throwing an even number is one half (i.e. P = 3/6 = 1/2).
For the binomial distribution to be applicable, the following conditions must be satisfied: the
number of trials is fixed; each trial is independent of the others; each trial has only two possible
outcomes (success or failure); and the probability of success is the same on every trial.
Once it is established that the distribution is binomial, then one can employ the binomial
distribution formular to calculate the probability.
Example 2.3

A fair coin is tossed 10 times. What is the probability of getting exactly 6 heads?

Solution

n = 10, P = 0.5, x = 6

Applying b(x, n, P) = nCx × P^x × (1 − P)^(n−x):

= 10C6 × (0.5)^6 × (0.5)^(10−6)

= [10!/(6! × 4!)] × (0.5)^6 × (0.5)^4

= 210 × (0.5)^10

= 0.205078125
Example 2.4
90% of people who purchase facial cosmetics are women. If 10 purchasers of facial cosmetics are
randomly selected, find the probability that exactly 7 are women.
Solution
We shall work with the formula: P(x) = {n!/[x!(n − x)!]} × P^x × q^(n−x)
Step 1: Identify “n” from the problem. Using our sample question, n (the number of randomly
selected items) is 10
Step 2: Identify “x” from the problem. x (the number you are asked to find the probability for) is
7
Step 3: Work the first part of the formula, n!/[x!(n − x)!].
Substituting the variables: nCx = 10C7 = 10!/[(10 − 7)! × 7!] = 720/6 = 120

Step 4: Find P and q. P is the probability of success and q is the probability of failure.
P^x = 0.9^7 = 0.4782969
q^(n−x) = 0.1^(10−7) = 0.1^3 = 0.001

Step 5: Substitute into the formula:
P(x) = {n!/[x!(n − x)!]} × P^x × q^(n−x) = 120 × 0.4782969 × 0.001 ≈ 0.0574

Hence, the probability that exactly 7 of the 10 randomly selected purchasers are women is
approximately 0.057.
R(t) is the combined probability of individual parts’ reliability, where the unit contains quantity n
parts. The assumption is that if any part fails during operation, the entire system is considered to
have failed as a whole.
R(t) = ∏(i=1 to n) Ri(t)

The system failure rate is the summation of all the parts' failure rates, provided the system failure
rate is constant. Thus, the system MTBF is determined by taking the reciprocal of the summation of
the failure rates of all the parts:

MTBF = ∫ R(t) dt from 0 to ∞ = 1/[Σ(i=1 to n) λi]    (2.7)
Parts or sub-units 1, 2 and 3 are said to be operating in series if the failure of any one of them
results in failure of the combination. They can be regarded as “fault intolerant”. A block diagram
of a series reliability system is shown in Figure 2.1. This diagram only depicts how the system must
be treated from a reliability point of view, and does not represent the physical interconnection of
components.
[Figure 2.1: blocks R1, R2 and R3 connected in series.]
If Rs(t) is the reliability of a series system, and R1, R2, …, Rn represent the individual reliabilities
of the n parts or sub-units in series in the system for an equal interval of time, then the
reliability of the system is given by:

Rs(t) = R1 × R2 × … × Rn    (2.8)

The consequence of the above equation is that the combined reliability of two parts in series is
always lower than the reliability of either of its individual sub-units, giving credence to the dictum
that a chain is no stronger than its weakest link. Then again, if we have exponentially failing units,
the reliabilities of the sub-units will be R1(t) = e^(−λ1·t), R2(t) = e^(−λ2·t), …, Rn(t) = e^(−λn·t),
where λ1, λ2, …, λn are the respective failure rates, so that:

λs = λ1 + λ2 + … + λn    (2.9)
2.3.1 Assumptions made when making a Series Reliability Prediction by Adding Failure
Rates
2. The components are constant failure rate devices (or may be treated as such).
3. The components are considerably similar to those whose failure rates have been measured.
5. When the components are in the system, they have the same ambient conditions and stress level
as those under which the failure rates were measured (or calculated by extrapolation from
measured data).
Let the MTBF of a series system be Ms; recall that MTBF is equal to the reciprocal of the system
failure rate.
∴ Ms = 1/λs = 1/(λ1 + λ2 + … + λn)    (2.10)

If the series system contains n homogeneous sub-units, each having equal reliability, the system
reliability, Rs(t), will be: Rs(t) = e^(−nλt)

The system MTBF, Msn = 1/(nλ)
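A minimal Python sketch of equations (2.8) and (2.10); the function names are our own, and the sample reliabilities and failure rates are illustrative (they happen to match worked examples later in this chapter):

```python
def series_reliability(reliabilities):
    # Equation (2.8): Rs is the product of the unit reliabilities
    rs = 1.0
    for r in reliabilities:
        rs *= r
    return rs

def series_mtbf(failure_rates):
    # Equation (2.10): reciprocal of the summed failure rates
    return 1 / sum(failure_rates)

print(round(series_reliability([0.94, 0.90, 0.88, 0.81, 0.78]), 3))  # 0.47
print(round(series_mtbf([1/12_000, 1/8_000, 1/6_000])))              # 2667 hours
```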
Two parts or sub-units are said to be operating in parallel if the failure of a sub-unit leads to the
other sub-unit taking over the operations of the failed one. In other words, the system only fails
when all the sub-units fail. The system is operational if any one of the sub-units is functional.
A block diagram of a parallel system is shown in Figure 2.2.
[Figure 2.2: blocks R1, R2 and R3 connected in parallel.]
R1, R2, …, Rn are the reliabilities of the individual sub-units, and the parallel system reliability is:

Rp = 1 − (1 − R1)(1 − R2) … (1 − Rn)    (2.11)

It is important to state that if we have the condition R1 = R2 = R3 = … = Rn = R, then
Rp = 1 − (1 − R)^n.

For exponentially failing sub-units, R1 = e^(−λ1·t), R2 = e^(−λ2·t), R3 = e^(−λ3·t), …, Rn = e^(−λn·t), so:

Rp = 1 − (1 − e^(−λ1·t))(1 − e^(−λ2·t)) … (1 − e^(−λn·t))
These equations for parallel systems are clearly of complex exponential form, and it may not be
feasible to express the overall system reliability in a simple exponential form e^(−λp·t), as was done
for the series system. Thus, the system is not a constant failure rate system, even though each of its
sub-units is.
We call to mind equation (2.7), in which we expressed the MTBF as the integral of reliability, with
the limits of integration from 0 to ∞ (which, for the series case, reduces to the reciprocal of the
summed failure rates of all the parts):

MTBF = ∫ R(t) dt from 0 to ∞
For a parallel system with two sub-units, the MTBF is:

Mp2 = ∫ Rp2 dt from 0 to ∞ = ∫ (R1 + R2 − R1R2) dt from 0 to ∞

    = ∫ [e^(−λ1·t) + e^(−λ2·t) − e^(−(λ1+λ2)·t)] dt from 0 to ∞

∴ Mp2 = 1/λ1 + 1/λ2 − 1/(λ1 + λ2)    (2.15)
Similarly, for three sub-units:

Mp3 = 1/λ1 + 1/λ2 + 1/λ3 − 1/(λ1 + λ2) − 1/(λ2 + λ3) − 1/(λ1 + λ3) + 1/(λ1 + λ2 + λ3)    (2.16)

where λ1, λ2, λ3 are the respective sub-unit failure rates. Moreover, for an n sub-unit system
with each sub-unit possessing an equal failure rate λ, it can be proved that its MTBF (Mpn) is:

Mpn = 1/λ + 1/(2λ) + 1/(3λ) + … + 1/(nλ)    (2.17)
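Equations (2.15)-(2.17) can be generated mechanically by inclusion-exclusion over the sub-unit failure rates; a Python sketch (the function name is ours and the rate values are assumed for illustration):

```python
from itertools import combinations

def parallel_mtbf(failure_rates):
    # MTBF of a parallel system by inclusion-exclusion over the failure rates;
    # reproduces equations (2.15) and (2.16) for two and three sub-units.
    n = len(failure_rates)
    total = 0.0
    for k in range(1, n + 1):
        for group in combinations(failure_rates, k):
            total += (-1) ** (k + 1) / sum(group)
    return total

lam = 1e-4  # assumed equal failure rate per hour
print(parallel_mtbf([lam, lam]))       # 15000.0 = 1/lam + 1/(2*lam), as eq. (2.17)
print(parallel_mtbf([lam, lam, lam]))  # ~18333  = 1/lam + 1/(2*lam) + 1/(3*lam)
```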
Consider a series-parallel system as represented in Figure 2.3, whose reliability we are required
to find.
[Figure 2.3: blocks R1, R2, …, Rn in series, combined with blocks Ra, Rb, …, Rz in parallel.]

Figure 2.3 A Series-Parallel System
The under-listed steps are recommended for the determination of the reliability of the system:
1. Pick out the units which are in series within the system, and calculate the equivalent reliability
of the series units with this relationship:

Rs = R1 × R2 × … × Rn    (2.18)

2. Condense each of the parallel groups to one equivalent unit with equation (2.11). This
gives: Rp = 1 − (1 − Ra)(1 − Rb) … (1 − Rz)

3. Determine the product of the outcomes of step 1 and step 2 to obtain the overall system reliability,
Rsp, i.e.

Rsp = Rs × Rp    (2.19)
    = (R1 × R2 × … × Rn) × [1 − (1 − Ra)(1 − Rb) … (1 − Rz)]    (2.20)
Determining the MTBF of the series-parallel system as represented in Figure 2.3 involves
integrating the reliability expression for the series-parallel system. In other words, the MTBF will
be:

MTBF = ∫ Rsp(t) dt from 0 to ∞
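The three-step procedure above, sketched in Python; all function names and figures are our own illustrative assumptions:

```python
def series_r(reliabilities):
    # Step 1, equation (2.18): product of the series unit reliabilities
    out = 1.0
    for r in reliabilities:
        out *= r
    return out

def parallel_r(reliabilities):
    # Step 2, equation (2.11): one minus the product of the unreliabilities
    q = 1.0
    for r in reliabilities:
        q *= 1 - r
    return 1 - q

# Step 3, equations (2.19)-(2.20), with assumed illustrative figures:
r_sp = series_r([0.95, 0.90]) * parallel_r([0.80, 0.85])
print(round(r_sp, 4))  # 0.8294
```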
Example 2.1
Compare the reliability of a series system with that of a parallel system, given that
each system accommodates 5 sub-units having reliabilities of 0.94, 0.90, 0.88, 0.81 and 0.78
respectively.
Solution

For the series system: Rs = 0.94 × 0.90 × 0.88 × 0.81 × 0.78 = 0.470

For the parallel system:
Rp = 1 − (1 − 0.94)(1 − 0.90)(1 − 0.88)(1 − 0.81)(1 − 0.78)
   = 1 − (0.06 × 0.10 × 0.12 × 0.19 × 0.22)
   = 1 − 0.000030096
   = 0.999

The parallel arrangement is therefore far more reliable than the series arrangement.
Example 2.2
A communication system with a hard-wired microwave repeater unit, has a mean time to failure
of 60,000 hours. The system is functional if one channel is working and the reliability of the
switching unit is 0.98. Calculate the reliability for a 24-month functional period, utilizing a single
channel, two parallel channels, and three parallel channels.

Solution

[Figure 2.4: block diagrams of a single channel R; two parallel channels R feeding output RO through a switching unit; and three parallel channels R feeding output RO through a switching unit.]

Figure 2.4
Failure rate, λ = 1/60,000 = 1/(6 × 10^4) = 0.166 × 10^(−4) per hour

Functional period, t = 24 months = 2 × 8,760 = 17,520 hours

The reliability of a single channel, R1 = R = e^(−λt)

= e^(−0.166 × 10^(−4) × 17,520)

= e^(−0.290832)

R1 = 0.748
For two parallel channels: R2 = 1 − (1 − R)² = 2R − R²

R2 = 1.496 − 0.5595

R2 = 0.9365 ≈ 0.937

But the reliability of the switching unit is 0.98; hence, for the series-parallel merger:

R2s = 0.937 × 0.98

R2s = 0.918
For three parallel channels: R3 = 1 − (1 − R)³

= 1 − (1 − 0.748)³

= 1 − 0.0160

R3 = 0.984

With the switching unit:

R3s = 0.984 × 0.98

R3s = 0.964
Example 2.3
An electrical power system consists of three sections connected in series. The sections have mean
times between failures of 12,000 hours, 8,000hours, and 6,000hours respectively. Calculate the
MTBF of the system.
Solution
The failure rates λ1, λ2, λ3 stand for the three sections respectively; being in series, we have:

MTBF, Ms = 1/λs = 1/(λ1 + λ2 + λ3)    ...from equation (2.10)

But MTTF = 1/λ    ...from equation (1.5)

λ1 = 1/(12 × 10³) = 0.0833 × 10^(−3) per hour

λ2 = 1/(8 × 10³) = 0.125 × 10^(−3) per hour

λ3 = 1/(6 × 10³) = 0.167 × 10^(−3) per hour

Ms = 1/[(0.0833 + 0.125 + 0.167) × 10^(−3)] hours

   = 2,667 hours (approximately)
A system in which the components are arranged to give parallel reliability is said to be redundant;
there is more than one mechanism for the system functions to be carried out. In a system with full
active redundancy all but one component may fail before the system fails.
39
There are other systems with partial active redundancy, in which certain components can fail
without causing system failure but more than one component must remain operating to keep the
system operating; a typical example is a four engine aircraft that can fly on two engines but would
lose stability and control if only one engine were operating. This type of situation is known as an
N-out of M- unit network. At least N-units must function normally for the system to succeed
rather than one unit in the parallel case and all unit in the series case.
The reliability of an N-out-of-M-unit system is given by the binomial distribution, on the assumption
that each of the M units is independent and identical:

R(N/M) = Σ(i=N to M) (M choose i) × R^i × (1 − R)^(M−i)

where (M choose i) = M!/[i!(M − i)!]
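A Python sketch of the N-out-of-M formula; note that the unit reliability of 0.90 in the demonstration line is an assumed illustrative value, not one taken from the text:

```python
from math import comb

def n_out_of_m(n, m, r):
    # Reliability of an N-out-of-M system of identical, independent units:
    # sum of binomial terms for i = n .. m working units.
    return sum(comb(m, i) * r**i * (1 - r)**(m - i) for i in range(n, m + 1))

# 2-out-of-4 sub-system with an assumed unit reliability of 0.90.
print(round(n_out_of_m(2, 4, 0.90), 4))  # 0.9963
```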
Example 2.4
A complex engineering design can be described by a reliability block diagram as shown in Figure
2.5. In sub-system A, two of the four components must operate for the sub-system to function
successfully; sub-system C has true parallel reliability. Calculate the system reliability.

[Figure 2.5: sub-systems A, B and C connected in series — reliability block diagram depicting a complex engineering design.]

Solution
RA = Σ(i=2 to 4) (4 choose i) × R^i × (1 − R)^(4−i), with at least i = 2 components operating

RA = 0.998

RB = 0.97

RC = 1 − (1 − 0.85)³
   = 1 − (0.15)³
RC = 0.997

Since the three sub-systems are in series:

RS = RA × RB × RC
   = 0.998 × 0.970 × 0.997
RS = 0.965
Redundancy in a system means that there exists an alternative parallel path for the successful
operation of the system. Put in another form, it is the existence of more than one means (in an
item) for performing the required function.
Redundancy is therefore a fundamental tenet of reliability engineering that makes room for high
reliability, availability, and/or safety at equipment or system level. It is a well-known creed in
reliability engineering circles that as the complexity of a system increases, the reliability dwindles
unless compensatory measures are taken.
For instance, if we have a system with n components connected in parallel and at least K,
( 1 ≤ K ≤ n ) , components are needed for the successful operation of the system, we say that the
number of basic components is K and the remaining (n-K) components are known as redundant
components. In fact, this system is known as a K-out-of- n system, and is discussed later in this
chapter.
In this case, multiple units are connected in parallel, energized and subjected to the same load
simultaneously to perform the given function. Here load sharing is possible, yet the expected
function can be performed even if only one out of the several units is working. Figure 2.7
illustrates this concept.
[Figure 2.7: n units connected in parallel between a common input and output.]

Such an arrangement behaves as a k-out-of-n system if 1 < k < n. That is, the redundant elements
(units) are subjected to a lower load until one of the operating elements fails.
2.4.1.3 Conditional (Majority Voting) Redundant System
The output in this type of design agrees with the majority (2-out-of-3) as shown in Figure
2.8. This is typical of triplicated processor modules. It means that the system can tolerate
(mask) a single module failure only.

[Figure 2.8: three parallel modules, 1, 2 and 3, feeding a majority voter.]

Figure 2.8 Conditional (Majority Voting) Redundancy
Refer to Figure 2.6, where we have classified redundancy broadly into two categories, namely,
active redundancy and standby redundancy. By active redundancy, we mean that all the
components connected in parallel are turned on at the beginning of operation of the system, and
continue to perform until they fail. Thus, in active redundancy, all components are simultaneously
in the operating mode. We have discussed two such systems, namely, the parallel system and k-
out-of-n system, where all the components of the system are turned on at the beginning of the
operation of the system. So, all the components are simultaneously in the operating mode. On the
other hand, in standby redundancy, the components are connected in parallel but do not start
operating simultaneously from the beginning of the operation of the system. In standby
redundancy, the component(s) in operating mode is/are known as normally operating
component(s). The component(s) kept in standby or reserve mode is/are known as standby
component(s). Other than this, there is a changeover device. The function of the changeover
device is to sense the failure of a normally operating component and, in case of a failure, to bring a
standby component into the normal operating mode.
To explain the concept of standby system, let us first consider the simplest case of the two
components (say, A and B) standby system.
A switch in the standby systems can put any component into operation. In the 2-component
standby system, initially this switch is connected to component A and turns it on, as shown in
Figure 2.9. In Figure 2.9, the switch is represented by S. In this setup, component B will remain in
standby (reserve or inoperative) mode till such time as component A performs its function
successfully. As soon as component A fails, the switch senses the failure and puts the component
B into operation.
Figure 2.10: A typical n-component standby system for the case of only one component in
normally operating mode.
In the discussion so far, we have not considered a key point of the standby system: that the switch
itself may fail during the operation of a component or during the time of changeover of a
normally operating component to a standby component. Due to this feature of the standby system,
we divide the discussion of this section under the following two heads:
a) Standby system with perfect switching, and
b) Standby system with imperfect switching.
(a) Standby system with perfect switching: For an n-component standby system (with one
normally operating component), if Qi is the unreliability of component i given that components
1 to (i − 1) have failed, then the unreliability of the system is:

Q = Q1 Q2 Q3 … Qn    (2.21)

and the reliability is:

R = 1 − Q    (2.22)
In particular, in the case of one normally operating component and one standby component, the
unreliability of the system is given by
Q = Q1 Q2 (2.23)
It should be noted that the unreliability Q2 of component 2 is not the same as in the case
of a parallel system, because in a standby system component 2 is used for a shorter duration than
in the parallel case.
Example 2.5
A standby system has three components 1, 2, 3, where component 1 is normally operating and
components 2, 3 are standby components. The reliability of component 1 is 0.95. The reliability
of component 2 given that component 1 has failed is 0.96 and that of component 3 given that
components 1 and 2 have failed is 0.98. Evaluate the reliability of the system under the
assumption that the switch is perfect.
Solution
We know that for an n-component standby system (with one normally operating component), if
Qi is the unreliability of component i given that components 1 to (i − 1) have failed, then the
reliability (R) of the system is given by:

R = 1 − Q1 Q2 Q3 … Qn

In this case,

Q1 = 1 − 0.95 = 0.05, Q2 = 1 − 0.96 = 0.04, Q3 = 1 − 0.98 = 0.02

∴ R = 1 − Q1 Q2 Q3
    = 1 − (0.05)(0.04)(0.02) = 1 − 0.00004

R = 0.99996
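Example 2.5 in a few lines of Python (the function name is our own):

```python
def standby_reliability(conditional_unreliabilities):
    # Perfect switching, equations (2.21)-(2.22): R = 1 - Q1*Q2*...*Qn,
    # where each Qi is conditional on the earlier components having failed.
    q = 1.0
    for qi in conditional_unreliabilities:
        q *= qi
    return 1 - q

# Example 2.5: Q1 = 0.05, Q2 = 0.04, Q3 = 0.02
print(round(standby_reliability([0.05, 0.04, 0.02]), 5))  # 0.99996
```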
(b) Standby system with imperfect switching: For a two-component standby system in which the
switch can fail only at the time of changeover, the unreliability is:

Q = Ps QA QB + P̄s QA    (2.24)

where,
Ps = probability of successful changeover
P̄s = 1 − Ps = probability of unsuccessful changeover
QA = unreliability of component A
QB = unreliability of component B given that component A has failed
Example 2.6
Consider a two-component standby system with A as normally operating component and B as
standby component. The reliability of component A is 0.90 while the reliability of component B
given that A has failed is 0.95. Assume that the switch can fail only at the time of changeover
with a probability of failure 0.03. Evaluate the reliability of the system.
46
Solution
Recall that:
Ps = probability of successful changeover = 1 − 0.03 = 0.97
P̄s = 1 − Ps = probability of unsuccessful changeover = 0.03
QA = unreliability of component A = 1 − 0.90 = 0.10
QB = unreliability of component B given that component A has failed = 1 − 0.95 = 0.05

Then from equation (2.24), the unreliability (Q) of the two-component standby system, under the
assumption that the switch can fail only during the time of changeover, is given by:

Q = Ps QA QB + P̄s QA
  = (0.97 × 0.10 × 0.05) + (0.03 × 0.10)
  = 0.00485 + 0.003
Q = 0.00785

Hence, the reliability of the system, Rs = 1 − 0.00785 = 0.99215
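Equation (2.24) and Example 2.6 sketched in Python (the function and parameter names are our own):

```python
def standby_unreliability(ps, qa, qb):
    # Equation (2.24), switch able to fail only at changeover:
    # Q = Ps*QA*QB + (1 - Ps)*QA
    return ps * qa * qb + (1 - ps) * qa

q = standby_unreliability(ps=0.97, qa=0.10, qb=0.05)
print(round(q, 5), round(1 - q, 5))  # 0.00785 0.99215  (Example 2.6)
```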
Chapter Three
3.1 Causes, Effects and Remedies of Environmental Factors on Component/ Equipment
Failure.
The constituents of electronic equipment are electronic components or its integration with
electromechanical devices. Again, incorporation of various equipment makes up a system. It is
therefore implicit that the study of the failure of electronic components is an allusion to the failure of
both equipment and systems, since there is a critical link between them. Consequently, the failure
of a component in equipment could possibly bring about the failure of that equipment, and by
extension, the failure of all or part of the system.
Electronic components, equipment and systems are presupposed to operate in diverse climatic
conditions such as tropical, arctic, desert, high altitude, radiation, including transport hazard and
mechanical shocks and vibrations. These factors, in one way or the other, impact on the reliability
of electronic components, equipment and systems.
The causes of failure of components can be classified into two groups, namely Environmental
Stresses and Electrical Over-stress. These two main classes of stress and failure of components can
be further subdivided as follows:

3.1.1 Environmental Stresses
i) Atmospheric temperature
ii) Humidity
iii) Shock and Vibration
iv) Generated heat
v) Atmospheric pressure
vi) Wind, air and dust
vii) Electromagnetic radiation
viii) Electrostatics
3.1.2 Electrical Over-stress
i) Voltage over-stress
ii) Current over-stress
iii) Frequency variation
A. Environmental Stresses
i) Effects of Atmospheric temperature
The extreme temperatures experienced in the arctic and desert regions, and the fluctuations in
temperature noticeable within some territories on a daily basis and from season to season, subject
components to stress. This is capable of promoting mechanical failure. Literature has it that the
failure rate of components approximately doubles with every 10°C rise in temperature for a specific
applied voltage. The following are the specific effects of high temperature and temperature
fluctuation on components:
(a) Thermal ageing and Oxidation: – Loss of electrical quality/ change of electrical properties
like increase in power factor, decrease of dielectric strength and insulation failure.
(b) Physical Expansion: - Structural failure, differential expansion of different materials can
cause distortion of assemblies, rupturing of seals and wear or binding on moving parts.
(c) Loss or Change of Viscosity, Evaporation:- Loss of lubrication properties,
structural/mechanical failure (breakage or fracture, seizure)
(d) Softening/melting: - Internal temperature of equipment may approach a value where low
melting point materials such as grease, protective compounds and waxes become soft or even
begin to flow. This may lead to structural failure, physical breakdown or penetration of
sealing leading to internal electrical breakdown.
(e) Chemical decomposition: - Decomposition of organic materials increases, rubber materials
harden. This may change the initial physical or electrical constants. The ultimate cause of any
of these effects can be physical or chemical change in the material and hence variation in
characteristics of components. Excess temperature is perhaps the most destructive
environmental factor associated with electronic/electrical components and equipment. Hence,
the development of new stable materials for the improved performance of components has been a
continuous process.
In contrast, cold/arctic temperature conditions also have some negative impacts on components.
They result in physical contraction, solidification and increased viscosity. Furthermore, there
could be changes in electrical properties due to the different temperature coefficients of various
component parts such as capacitances, resistances and inductances. Additionally,
embrittlement occurs in metallic and non-metallic materials as a result of extreme cold
temperatures. This brings about loss of mechanical strength, cracking and fracture. Physical
breakdown of sealing due to shrinkage and cracking, leading to electrical breakdown, ensues.
Remedies
To redress the effects of high temperature on components, the use of appropriate heat sinks,
the provision of air vents and/or the incorporation of forced air cooling should be employed.
Furthermore, component materials with low coefficients of expansion and other low
temperature characteristics should be used maximally in systems and equipment that
are meant to function in high temperature regions.
For cold/arctic temperature zones, while the appropriate choice of materials and components for
equipment design and construction is critical, indirect heating of the equipment to control the
temperature is recommended.
ii) Effects of Humidity
A continuous film of moisture can form on exposed component surfaces within
a few seconds if the relative humidity is 100%. This will lead to insulation breakdown, change
of dielectric properties and external electrical failure like tracking, insulation flashover etc.
Furthermore, formation of fungi growth is stimulated, which also brings about degradation of
insulation. Corrosion (structural/mechanical failure) is another effect of humidity. This brings
about interference with function, internal electrical failure and change of physical or electrical
constants.
It has been proven that, for a given atmospheric moisture content, the relative humidity increases
as the temperature falls, and vice versa. Therefore, in environments where there are sudden
temperature drops, especially at night, condensation is sure to occur, leading to the
formation of water films on components.
Remedies
Only a few materials, such as silicones, polystyrene and some polymers, can stop the
formation of a continuous moisture film, but they have poor resistance to fungal growth. To
minimize the effects of humidity, therefore, insulating materials used for the casing of equipment
should be such that they do not absorb moisture, stimulate fungal growth or hold up water films.
Concisely, the remedy is to completely enfold any component, or the entire equipment, that is
exceptionally prone to humidity.
iii) Effects of Shock and Vibration
When equipment is being transported or moved from one place to another, it is
susceptible to shock, vibration, bump and drop. Structural collapse, loss of mechanical
robustness, breakage, fracture, cracking etc. are some of the possible after-effects. Other upshots
are physical breakdown of sealing, and complete disconnection or intermittent electrical contact.
Electromechanical components like relays and contactors, and heavy components such as
transformers, are more vulnerable to the negative effects of shock and vibration. On the other
hand, electronic components, which are generally small in nature, experience comparatively
little effect from shock and vibration.
Remedies
To reduce the effect of shock and vibration, design best practices are employed, utilizing anti-
vibration mountings, locking nuts, shake-proof and lock washers. Furthermore, sensitive
components can be prevented from shock and vibration by enfolding them with protective
padding materials.
iv) Effects of Generated Heat
Devices and equipment generate heat internally during operation. Semiconductor devices
carry significant current during operation and hence generate a considerable amount of heat.
This generated heat is capable of causing failure of such devices when it surpasses a particular
level of tolerance. Furthermore, the normal operating specifications of the device can as well be
affected, resulting in an undesirable increase in chemical reaction and rapid ageing.
Remedies
The impact of generated heat can be reduced by the use of a suitable heat sink, the use of forced air
cooling devices like fans, as well as making provision for adequate ventilation in the design of
such equipment. Again, thoughtful selection of components with good temperature and low
expansion attributes can also be appropriate. Lastly, generated heat can also be reduced by
the derating technique, which allows the item to function at an applied voltage, current or power
unlikely to set off unwarranted internally generated heat. The derating technique therefore lessens
the temperature rise arising from internally dissipated heat and hence shrinks the failure rate.
vi) Effects of Wind, Air and Dust
Earthly elements such as wind, air and dust undoubtedly impinge on components and
equipment operated alfresco. For instance, components or equipment operated in a coastal
environment are susceptible to saline air and its effects. Also, the circulation of dust in equipment
may instigate tracking (leakage of current) within devices, particularly in switches. This may
set off early failure. Moreover, dust admittance could trigger gradual breakdown of
insulation and build-up of contact resistance. A saline atmosphere can set off corrosion,
resulting in structural/mechanical failure such as breakage, fracture, seizure etc. Besides,
physical breakdown of sealing may as well arise, in addition to changes in initial physical or
electrical constants.
Wind on the other hand can bring about vibration, rocking and excessive movement with
consequent physical breakdown of sealing, breakage or fracture which may lead to electrical
breakdown or loss of electrical quality.
Remedies
Curtailing the effects of these air-contaminating elements will entail setting up appropriate
enfolding around the components/equipment. Again, periodic dusting of the
equipment/components should be carried out. For the coastal atmosphere, a suitable physical
protective covering should be used to shield the component from the effects of the saline
atmosphere.
vii) Effects of Electromagnetic Radiation
Electromagnetic radiation has the potential of interfering with electrical signals, components,
electronic devices and systems, creating radiation-induced effects.
Remedies
Principally, the effects of electromagnetic radiation are mitigated by providing shielding and by
insightful part selection during equipment design.
Electrostatic discharge (static electricity) can affect electronic circuit and components in
diverse ways. Some of the effects may be instantaneous while others may appear weeks or
years later. While some items in today’s workplace can store thousands of volts in
electrostatic charges, as little as 25 volts of electrostatic discharge can damage an integrated
circuit irreparably. This natural phenomenon has only become an issue since the widespread
use of solid-state electronics. All materials (insulators and conductors alike) are sources of
electrostatic discharge. The amount of electrostatic charge that can accumulate on any item is
dependent on its capacity to store a charge. Sources of electrostatic discharge are:
a) Human contact with sensitive devices (humans can only feel an electrostatic
discharge above about 4,000 V);
b) Troubleshooting electronic equipment or handling of printed circuit boards without
using an electrostatic wrist strap;
c) Placement of synthetic materials (i.e. plastic, Styrofoam, etc.) on or near electronic
equipment, and
d) Rapid movement of air near electronic equipment (including using compressed air to
blow dirt off printed circuit boards, circulating fans blowing on electronic equipment,
or using an electronic device close to an air handling system).
In all of these scenarios, the accumulation of static charges may occur without your knowledge.
Furthermore, a charged object does not necessarily have to contact the item for an electrostatic
discharge event to occur.
Environmental stress hastens the onset of wear-out by contributing to physical deterioration. The
preceding discussion on environmental stress is summarized in Table 3.1.
B. Electrical Over-stress
This is a condition in which a device, electrical circuit or component is exposed to current,
voltage or frequency beyond its maximum rated value. Each of these
possible over-stresses is discussed below:
i) Voltage Over-stress
This may occur when an electronic device is being switched on. Although this could
happen transiently, its magnitude may greatly exceed the steady-state value. This could
have an adverse effect on the device. It is expedient, therefore, to maintain the applied
voltage within an acceptable tolerance of the rated voltage in order to minimize the
failure rate. Voltage over-stress can be mitigated by designing a robust system through
derating techniques. Derating is an intentional process applied to every component of a
product to reduce the chance of a component witnessing more stress than it is capable of
withstanding.
ii) Current Over-stress
Current over-stress has similar effects on components/devices as voltage over-stress. It
could be sudden as well as transient, and so should be anticipated during design by
providing for its mitigation by means of derating techniques.
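As a simple illustration of how a derating policy can be checked at design time, the following sketch tests whether an applied stress respects a chosen derating factor. The function name, values and the 0.75 factor are illustrative assumptions, not figures from this text.

```python
# Minimal sketch of a derating check: a component passes if every applied
# stress is within the derated limit, i.e. applied <= factor * rated.

def within_derating(applied: float, rated: float, derating_factor: float) -> bool:
    """Return True if the applied stress respects the derating policy."""
    return applied <= derating_factor * rated

# Example: a part rated at 50 V with a 0.75 derating factor should not
# see more than 37.5 V in service.
print(within_derating(applied=35.0, rated=50.0, derating_factor=0.75))  # True
print(within_derating(applied=40.0, rated=50.0, derating_factor=0.75))  # False
```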
55
The majority of failures are attributable to one of the following physical or chemical phenomena.
Alloy Formation: Formation of alloys between gold, aluminium and silicon causes what is known
as ‘purple plague’ and ‘black plague’ in silicon devices.
Biological effects: Moulds and insects can cause failures. Tropical environments are particularly
attractive for moulds and insects, and electronic devices and wiring can be affected.
Chemical and electrolytic changes: Electrolytic corrosion can occur wherever potential
difference together with an ionizable film are present. The electrolytic effect causes interaction
between the salt ions and the metallic surfaces, which act as electrodes. Salt laden atmospheres
cause corrosion of contacts and connectors. Chemical and physical changes to electrolytes and
lubricants both lead to degradation failures.
Contamination: Dirt, particularly carbon or ferrous particles, causes electrical failure. The former
deposited on insulation between conductors leads to breakdown and the latter to insulation
breakdown and direct short circuits. Non-conducting material such as ash and fibrous waste can
cause open-circuit failure in contacts.
Depolymerization: This is a degrading of insulation resistance caused by a type of liquefaction in
synthetic materials.
Electrical contact failures: Failures of switch and relay contacts occur owing to weak springs,
contact arcing, spark erosion and plating wear. In addition, failures due to contamination, as
mentioned above, are possible. Printed-board connectors will fail owing to loss of contact
pressure, mechanical wear from repeated insertions and contamination.
Evaporation: Filament devices age owing to evaporation of the filament molecules.
Fatigue: This is a physical/crystalline change in metals that leads to spring failure, fracture of
structural members, etc.
Film deposition: All plugs, sockets, connectors and switches with non-precious metal surfaces
are likely to form an oxide film, which is a poor conductor. This film therefore leads to high-
resistance failures unless a self-cleaning wiping action is used.
Friction: Friction is one of the most common causes of failure in motors, switches, gears, belts,
styli, etc.
Ionization of gases: At normal atmospheric pressure a.c. voltages of approximately 300V across
gas bubbles in dielectrics give rise to ionization, which causes both electrical noise and ultimate
breakdown. This reduces to 200V at low pressure.
Ion migration: If two silver surfaces are separated by a moisture-covered insulating material then,
providing an ionizable salt is present as is usually the case, ion migration causes a silver ‘tree’
across the insulator.
Magnetic degradation: Modern magnetic materials are quite stable. However, degraded magnetic
properties do occur as a result of mechanical vibration or strong a.c. electric fields.
Mechanical stresses: Bump and vibration stresses affect switches, insulators, fuse mountings,
component lugs, printed-board tracks, etc.
Metallic effects: Metallic particles are a common cause of failure as mentioned above. Tin and
cadmium can grow ‘whiskers’, leading to noise and low-resistance failures.
Moisture gain or loss: Moisture can enter equipment through pin holes by moisture vapor
diffusion. This is accelerated by conditions of temperature cycling under high humidity. Loss of
moisture by diffusion through seals in electrolytic capacitors causes reduced capacitance.
Molecular migration: Many liquids can diffuse through insulating plastics.
Stress relaxation: Cold flow (‘creep’) occurs in metallic parts and various dielectrics under
mechanical stress. This leads to mechanical failure. This is not the same as fatigue, which is
caused by repeated movement (deformation) of a material.
Temperature cycling: This can be the cause of stress fluctuations, leading to fatigue or to
moisture build-up.
Table 3.2: Specific failure mechanisms of microelectronic devices (percentage of failures, as reported by four data sources)

Failure mechanism        Source A   Source B   Source C   Source D
Metalization                18         50         25          –
Diffusion                    1          1          9         55
Oxide                        1          4         16          –
Bond – die                  10         10          –         25
Bond – wire                  9         15          1          –
Packaging/hermeticity        5         14         10          –
Surface contamination       55          5         25         20
Cracked die                  1          1          –          –
Mechanical movement          –          –          –          –
3.2.2 Part Selection
Since hardware reliability is largely determined by the component parts, their reliability and
fitness for purpose cannot be over-emphasized. The choice often arises between standard parts
with proven performance which just meet the requirement and special parts that are totally
applicable but unproven. Consideration of design support services when selecting a component
source may be of prime importance when the application is a new design. General considerations
should be:
o Function needed and the environment in which it is to be used;
o Critical aspects of the part such as, for example, limited life, procurement time,
contribution to overall failure rate, cost, etc;
The stress ratio is defined as the ratio of applied stress to rated stress, for example, the ratio of
applied voltage to rated voltage in capacitor applications. Generally, as stress increases, failure rate
also increases, usually exponentially; conversely, as stress reduces, failure rate reduces. However,
care must be taken when applying derating as a method of improving reliability, because at very
low stress ratios failure rate may again increase.
History shows that a significant number of equipment failures arise from inadequate design
margins. The derating factors listed for electronic equipment in Table 1 should not only ensure
that components are operated well within the recommended limits of stress, but also provide in
most cases a sufficient design margin to accommodate minor variations in environmental stress,
power supply levels, transients, etc.
Failures caused by stress transients in the operational environment are in fact often due to
inadequate design margins. Test conditions seldom reproduce these transients, and failures of this
kind are, therefore, difficult to diagnose in the field. Derating can eliminate many such potential
problems.
Electronic components are in general subject to at least two stresses, an electrical stress, with
increasing tendency to breakdown due to voltage, current or power and a thermal stress due to its
own power dissipation and, in part, to the total dissipation of neighbouring components and/or the
local environment. Reducing electrical stress will indirectly reduce thermal stress and lead to
improved failure rates.
Failure rates for generic component types invariably assume that the failure rates are constant
with time and that the components are conservatively rated. Thus predictions based on component
count procedures pre-suppose that derating will be applied.
The methods described in this chapter are aimed at reducing failures by increasing design
margins, i.e. the margin of design strength over expected stress. To make an impact on the overall
system failure rate a derating policy must be applied to as many components as possible. In some
cases this may incur weight or space penalties; however, such cases should not prevent the policy
being applied as far as practicable.
General
The reliability of electronic components decreases when they are operated at high stress levels.
These stresses are primarily temperature, voltage, current and power dissipation. Heat-generating
components in particular, such as transistors, resistors, valves and transformers, are susceptible to
these stresses which result in degraded performance and accelerated failure.
The problem is one in which the materials employed in the construction of the component have
upper and lower temperature design limits, beyond which performance changes develop or
catastrophic failures occur. This problem can be brought under control by ensuring that the
component functions within its design rating. It is mostly a heat balance problem and can usually
be solved by keeping components cool enough to function reliably.
The theoretical justification for derating is discussed as a means of reducing thermal and electrical
stress, and derives two mathematical models relating component failure rates to stress conditions.
Under normal operating conditions, a component is considered to have failed when its design
parameters have changed beyond the limits of its acceptance specification due to degradation
processes. In a structurally sound component, most processes of degradation are primarily
dependent upon chemical reaction and include such phenomena as hot-spot formation, increased
carrier generation, parametric degradation, aluminium migration and gold-aluminium interdiffusion.
In 1889 Arrhenius suggested an empirical model for the rate at which chemical reactions occur at
different temperatures. The Arrhenius chemical reaction rate law is:
Chemical Reaction Rate = A exp(−E/RT)

where: A is a constant;
E is the activation energy;
R is Boltzmann’s constant (or the gas constant, depending on the units in which E is expressed);
T is the absolute temperature.
Where it is assumed that failure rate is directly proportional to chemical reaction rate, the failure
rates to be expected at different temperatures can be estimated, e.g.
λ/λ₀ = B exp(−E/RT) / B exp(−E/RT₀)

∴ λ/λ₀ = exp[(E/R)(1/T₀ − 1/T)]   (**)
The relationship between failure rate and temperature can be seen from equation **.
This approach assumes that the reaction rate of the failure mechanism is related to degradation time
and therefore to component failure rate. However, the activation energy of a particular failure
mechanism is not the same as the apparent activation energy to reach some limit of parameter
variation, since several failure mechanisms may be operating at the same time to produce one
apparent component failure activation energy.
Where a single failure mechanism is considered, a reliability analyst may attempt to determine the
apparent activation energy (E) of that failure mechanism using time to failure data from a suitable
sample. A graph of time to failure against the reciprocal of temperature, using Log/Linear graph
paper, will result in a straight line (the Arrhenius Line) when the theory applies, and the activation
energy that is sensibly constant for each failure mechanism, can then be determined from the slope of
the line.
Supposing that T₀ = 288 K (15°C) and T = 303 K (30°C), and assuming a value for the apparent
activation energy E of approximately 1 eV (say 0.92 eV), then λ = 2λ₀.
This implies a doubling of failure rate for an increase of 15°C or, conversely, a halving of failure
rate for a decrease of 15°C for the example chosen.
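As a quick numerical check of equation (**), the sketch below evaluates the failure-rate multiplier for a chosen activation energy and temperature pair. The constant and the example activation energy are assumptions for illustration (with E expressed in eV, the Boltzmann constant in eV/K is used); note that the multiplier is very sensitive to the assumed value of E: about 0.35 eV yields a doubling for a 15°C rise, while values near 1 eV yield considerably larger factors.

```python
import math

# Failure-rate multiplier from the Arrhenius relationship of equation (**):
#   lambda / lambda0 = exp[(E/k) * (1/T0 - 1/T)]
# E is in electron-volts, so k is the Boltzmann constant in eV/K.
K_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_multiplier(e_ev: float, t0_k: float, t_k: float) -> float:
    """Factor by which failure rate changes when temperature rises from t0_k to t_k."""
    return math.exp((e_ev / K_EV) * (1.0 / t0_k - 1.0 / t_k))

# Illustrative value: E of about 0.35 eV gives roughly a doubling of
# failure rate for a rise from 288 K (15 degC) to 303 K (30 degC).
print(round(arrhenius_multiplier(0.35, 288.0, 303.0), 2))  # ~2.0
```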
Typically, failure rates change by a factor of between 1.1 and 2.0 for a temperature change of 15°C,
the higher factor applying to transistors and some capacitors and the lower factor being
appropriate for resistors. In general, the relationship between failure rate and temperature is that given
by equation (**), i.e. failure rates increase exponentially as temperature increases.
Items such as capacitors that are subject to voltage stress also need to be derated to reduce failures due
to dielectric breakdown. It has been suggested that failure rate is related to dielectric stress by a 5th
Power Law, which states that the life of an item, i.e. its mean time to first failure, is inversely proportional
to the fifth power of the dielectric stress. On this model, for example, halving the dielectric stress would
extend expected life by a factor of 2⁵ = 32.
Data sources that relate failure rate to stress do not generally indicate
such drastic changes, suggesting that a fifth power law is pessimistic for many individual
component types and may not apply for stress ratios less than 0.5. In general, the relationship
between failure rate and voltage stress is that failure rate increases according to a power law as stress
increases.
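A worked illustration of the power-law relationship may help: under an n-th power law, the life multiplier obtained by reducing stress is simply the stress ratio raised to the power n. The sketch below is a minimal illustration; the values are assumptions.

```python
# Fifth Power Law sketch: mean time to first failure is inversely
# proportional to the fifth power of dielectric stress.

def life_multiplier(old_stress: float, new_stress: float, n: int = 5) -> float:
    """Factor by which expected life changes when stress moves from old to new."""
    return (old_stress / new_stress) ** n

# Halving the dielectric stress (e.g. from 1.0 to 0.5 of rated voltage)
# extends expected life by a factor of 2**5 = 32 on this model.
print(life_multiplier(1.0, 0.5))  # 32.0
```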
Resistors
Given that resistors are properly made, the two principal influences on component failure rate are
temperature and electrical stress. Derating characteristics for resistors specify a maximum stress for
these two critical parameters by limiting the power dissipated.
The power rating of resistors is dependent upon the manufacturing techniques and materials used, and
limited by a maximum hot-spot temperature. The power that can be developed in a resistor body
depends upon how effectively the dissipated energy is carried away and is therefore a function of the
local temperature and heat transfer conditions. At all temperatures above the rated temperature for the
type, resistors should be derated.
Stress Ratio = Operating Power / Rated Power = 80%
This recommended stress ratio provides a sufficient design margin in most practical cases and in
addition increases resistor stability.
For variable resistors the operating current in any part of the resistor is the critical stress
condition. The stress ratio for variable resistors is given by:
Stress Ratio = Operating Power / Rated Power = 75%
In this case the derating factor is more stringent than that for fixed resistors since variable
resistors have, in general, a higher failure rate than fixed resistor types and a greater design
margin is needed.
Power is not, however, the only quantity in which stress ratings are specified. For example, a
resistor may be rated at 300 mW dissipation in free air at 20°C, or the same type may be rated at
250 V d.c. across the resistor. In every case the ‘limiting element voltage’ specified for the type
must not be exceeded.
Semi-Conductor Devices
Transistors
Transistors can be destroyed by exceeding the manufacturer’s voltage rating even for a few
microseconds. Transient voltage spikes of comparatively small magnitude and very short duration are very
difficult to trace and can often be the reason for circuit failures that appear to have no obvious
cause. In terms of power, transistors are rated in a similar fashion to resistors, except that the limiting
hot spot occurs at the junction, and junction temperature is the most important parameter.
In practice it is essential to derate transistors to a level which ensures that the manufacturer’s
recommended junction temperature will not be exceeded. To this end, the following electrical stress
ratios are recommended as a minimum:
Stress Ratio (1) = Operating Power / Rated Power = 75%

Stress Ratio (2) = Operating Ic / Rated Ic = 90%

where Ic is the collector current. Each of these ratios must be complied with at the same time in each
particular transistor application.
Power Diodes
The following stress ratios are recommended for power diodes in order to ensure, in general, that
the limiting junction temperatures are not exceeded:
Stress Ratio (1) = Operating PIV / Rated PIV = 50%

and

Stress Ratio (2) = Operating If / Rated If = 70%
where, Stress Ratio(1) is the Peak Inverse Voltage derating factor and Stress Ratio(2) is the
Forward Current derating factor.
Small Signal Diodes
The following stress ratios are recommended for small signal diodes in order to ensure, in general,
that the limiting junction temperature is not exceeded:

Stress Ratio (1) = Operating PIV / Rated PIV = 85%

and

Stress Ratio (2) = Operating If / Rated If = 85%
Each of these stress ratios must be complied with at the same time in every application.
Transformers
The policy and principles for the derating of transformers applies also to similar devices including
inductors, chokes, magnetic amplifiers and RF coils.
Most transformer failures result from insulation breakdown and resulting short circuit, and the
overheating that follows may result in misshapen or burst containers due to expansion of the
potting or filling compound. Open circuit windings occur only occasionally.
Transformer failures are largely due to the insulation becoming brittle and losing its insulation
qualities. This is usually caused by hot-spots and is related to the operating temperature. The
operating temperature in turn is related to the power dissipation of the device and the operation
stress ratio.
Operating Temperature = Ambient + (0.15 × Rated Temperature rise on full load) + (0.85 × Rated Temperature rise × Stress Ratio)
The following electrical stress ratio is recommended for transformers and similar devices:
Stress Ratio = Operating VA Load / Rated VA Load = 80%
This stress ratio will, in most cases, ensure that the limiting hot-spot temperature is not exceeded.
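The operating-temperature estimate above can be evaluated numerically. The sketch below interprets “Rated Temp” as the rated full-load temperature rise, which is an assumption on our part; the numerical values are illustrative only.

```python
# Sketch of the transformer operating-temperature estimate:
#   operating temp = ambient + 0.15 x rated rise + 0.85 x rated rise x stress ratio

def transformer_operating_temp(ambient_c: float, rated_rise_c: float,
                               stress_ratio: float) -> float:
    """Estimated operating temperature in degrees Celsius."""
    return ambient_c + 0.15 * rated_rise_c + 0.85 * rated_rise_c * stress_ratio

# Example: 25 degC ambient, 60 degC rated full-load rise, 80% VA stress ratio.
print(transformer_operating_temp(25.0, 60.0, 0.80))  # 25 + 9 + 40.8 = 74.8 degC
```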
Capacitors
Capacitors in general do not dissipate heat in the same way as resistors, transistors or
transformers, except when they are subjected to ripple currents or pulse loads, in which case
derating does become important. However, they are subject to thermally sensitive failure modes
that depend on the materials used in their manufacture.
Some of the principal conditions associated with capacitor failure are current overload, voltage
overload, high frequency effects, high temperature, high pressure, humidity and shock. The most
important of these are voltage and temperature stress, which are the principal factors to be
derated.
Dielectric breakdown may occur after many hours of satisfactory operation and is associated with
a slowly changing physical or chemical reaction. The ultimate failure is, however, most often
associated with one abnormal electrical or temperature stress.
The recommended electrical stress ratio for all types of capacitor is:
Stress Ratio = Operating Voltage / Rated Voltage = 75%
In all cases the capacitor selected for a particular application must be carefully chosen from the
various types available to avoid misapplication. A significant number of equipment failures are
due to incorrect selection and application of capacitors.
Electrolytic capacitors are a special case: they have power factors several times higher than other
capacitor types and suffer ‘leakage’ currents which cause significant self-heating. This self-heating
tends to increase with age and can build up, causing complete failure; derating is therefore
particularly important. Non-electrolytic capacitors can be derated down to 10% of the maximum
voltage rating, though this is seldom physically practicable; however, this is not true for
electrolytic capacitors, which may exhibit increased failure rates at these low levels because a
minimum voltage is required to establish and maintain the polarisation of these types. The
principal derating parameter is ‘surge voltage’ for solid tantalum types and ‘ripple current’ for
other electrolytic types. These capacitors must not be operated below the minimum specified
voltage; they should be derated but still comply with the manufacturer’s minimum requirements.
Micro-Electronic Circuits
This heading includes Integrated Circuits, Medium Scale Integration (MSI) Circuits, Large Scale
Integration (LSI) Circuits, and Hybrid Circuits, and covers both thick and thin film technology.
In any given type of micro-electronic structure the device reliability is very strongly related to
temperature of operation, and particularly to junction temperature in applications where the power
dissipated in the device is relatively high. The heat generated must be dispersed using appropriate
metal or ceramic packaging.
The specific stress ratios are necessarily different for each type of device and the only
generalisations that can be made are that digital IC’s should be derated in terms of fan-out and
linear IC’s in terms of current. The following derating factors are recommended:
MOS Devices
The predominant cause of failure in MOS and C-MOS devices is electrical overstress; experience
with C-MOS devices under test conditions indicated that over 40% of failures arise from this
cause. The overstress can be due to mishandling since these devices are especially sensitive to
static discharge, and this has been found to be the most frequent cause of failure. Precautions to
be taken to protect expensive devices from static damage include:
MOS and C-MOS devices typically operate at supply voltages ranging from 3V to 18V. The
choice of supply voltage influences the speed of operation, because the higher the voltage the
shorter the rise and fall times of the output pulses. However, increased supply voltage also
increases the power consumption and thermal dissipation, and directly influences the failure rate.
Manufacturers’ life tests of MOS devices indicate an increase in failure rate of at least 10 times
for an increase of supply voltage from 10 to 15 volts, and it is clear that a compromise between
operating speed and reliability has to be made. Thus these devices should be derated, in terms of
supply voltage, to the lowest level consistent with the required operating speeds.
The functioning of a contact operating device like a switch or relay entails many sources of risk,
and incorrect functioning can expose adjacent circuitry to various degrees of hazard. These
devices present complex electro-mechanical failure modes; for example, a chopper type relay may
occasionally make poor contact with little effect on the overall system operation, while a ‘one-
shot’ armament relay in a missile requires little total usage but demands high reliability when it is
used. The conditions that lead to possible failure include ageing of time delay relays and gas
generation in hermetically sealed cans.
Most failure modes of relays and switches are dependent upon the cumulative number of
operations and, being electro-mechanical devices, relays and switches are subject to both
electrical and mechanical failure. Typical causes of failure are predominantly mechanical in
nature and include: mis-aligned contacts, open-circuit contacts, contaminated or pitted contacts,
loss of resilience in contact springs, and open-circuit coils.
Contact failure can result from a current surge or high sustained current. Current surges occur in
loads which include motors, lamps, heaters, capacitive input filters and other devices with low
initial impedance. These currents can cause intense heat with associated contact welding.
Transformers can present transients of many times the steady state current. At switch-on a lamp
filament can demand current up to 15 times the steady state value and motors up to 10 times.
The following electrical stress ratios for relays and switches are recommended:
Stress Ratio = Operating Contact Current / Rated Contact Current = 50%
Special circuit conditions demand a greater degree of derating and the following factors are
recommended:
Chapter Four
Maintainability is the probability that a device that has failed will be restored to operational
effectiveness within a given period of time when the maintenance action is performed in
accordance with prescribed procedures.
This is usually expressed as Mean Time To Repair (MTTR), i.e. how quickly the system can be
restored to operational effectiveness, or sometimes as the repair rate. The word “repair”
implies that we are concerned only with the time to perform corrective maintenance; in
practice, however, the time taken to carry out preventive maintenance is equally of interest. Hence,
maintainability can be examined from two perspectives: serviceability (the ease of
conducting scheduled inspection and servicing) and reparability (the ease of restoring service
after a failure).
Reliability and maintainability work together to achieve availability. The basic factors that
determine availability are mean time between failures (MTBF), mean time to repair (MTTR), and
performance before failure and after repair. For example, a highly reliable system with a
protracted MTTR may have lower availability than a less reliable system that is easier
and faster to maintain or repair. However, the user may be willing to trade off either reliability
or maintainability to accomplish higher performance. This is typical of some military armaments.
Similarly, performance may be compromised to raise reliability and reduce MTTR.
It is generally known that some equipment is really easy to maintain while other equipment can make
maintenance work stressful. This attribute is referred to as maintainability. Thus, maintainability can
be defined as the characteristic of an equipment that makes it easy to repair. In contrast,
maintenance is the activity oriented to keeping equipment running; it is divided into two major
types, corrective and preventive maintenance.
4.2.1 Corrective Maintenance: This is concerned with the repair of something that is not working
according to standards. It is mainly reactive and unplanned. Because this type of maintenance
takes place when the equipment is due to be running, it has an associated production loss which
negatively impacts the equipment’s availability.
Generally speaking, we expect that equipment will work for a reasonable amount of time
without failure. As everybody knows, some equipment is more reliable than others, and
reliability depends on complexity, quality, design, etc.
Reliability and maintainability are characteristics defined during the design stage. They are also
affected by the manufacturing process and quality control. However, they can be further improved
during their productive life using failure information and field experience to implement
modifications, although this is more difficult and less cost effective.
Reliability, maintainability and availability are all related. Their correlation can be seen in
Figure 4.1
Broadly speaking, reliability will determine how often the equipment fails. Although it is
established during the design stage, this will be an estimated value, since reliability can decrease
during the equipment’s life due to several factors such as poor maintenance, heavy work
environments (performance stress) or incorrect operating procedures.
In general, the best option is to acquire equipment with high reliability and maintainability,
but sometimes we already have the equipment installed and we have to deal with it. In those
cases, some ways to improve maintainability are:
i. Improve access to the equipment: we can modify aspects of the equipment that will allow
us to gain access faster and more easily. For example, sometimes quick-access doors can be
added, or where we have panels with too many screws, we can replace them with quick-release
fasteners.
ii. Standardize components: every time we replace a component we can install an equivalent
brand or model that we frequently use in our plant (this can also be done all at once, at
higher cost). This can be applied to thermomagnetic switches, fuses, relays, electric
motors and VFDs. In the last two cases, we can even use a model of higher capacity to
diminish the number of models in use. For example, if we have many VFDs of 110 kW
and one of 90 kW, we can analyse the possibility of replacing the latter with one of 110
kW so we use only one model. This has to be considered carefully against the available
space and other compatibility issues.
iii. Improve fault indication: this is especially useful for PLC-controlled equipment. On some
occasions, the equipment has failed and there is no fault indication, or it is vague or
erroneous. Sometimes the failure wasn’t anticipated during the design phase, so we can
program a new alarm or indication to clearly communicate the problem in the future.
In conclusion, we have seen the importance of maintainability and its effects on maintenance
and availability. It is therefore appropriate that these characteristics be taken into
consideration when buying new equipment, since poor maintainability could lead to further
costs during the life of the equipment.
Example 4.1
Consider a motor used for only eight hours a day (e.g. five days a week) for 50 weeks a year. The
hours of operation would then be 2000 hours, and the motor utilization factor on a base of 8760
hours per year would be 2000/8760 = 22.83%. On a base of 2000 hours per year, the motor
utilization factor would be 100%.
The bottom line is that this factor is applied to obtain the correct number of hours that the
motor is in use.
This factor must be applied to each individual load, with particular attention to electric
motors, which are very rarely operated at full load. In an industrial installation this factor
may be estimated on an average at 0.75 for motors.
For incandescent-lighting loads, the factor always equals 1.
For socket-outlet circuits, the factors depend entirely on the type of appliances being
supplied from the sockets concerned.
4.6 Availability
Availability can be defined as the proportion of time for which the equipment is either performing
its function or capable of performing its function. It is the probability that a system is not failed
or undergoing a repair action when it needs to be put to use. Availability answers the question:
“Is the equipment available in a working condition when it is needed?”
An item of equipment may not be very reliable, but if it can be repaired quickly when it fails, its
availability could be high. One major distinguishing factor between availability and reliability is
that availability takes repair time into account. Figure 4.2 illustrates the average value of time, t
over the operating life of the equipment.
Figure 4.2: Illustration of the Average value of time over operating life of the equipment
Mean time between failures (MTBF) is commonly used to express the overall reliability of items
of equipment and systems.
From Figure 4.2 we can see what is meant by Uptime, which is the time when the equipment is
available; and Downtime, which is the time when the equipment has failed, and so, is unavailable.
The averages of each of these are:
a) Mean Uptime, which we have already seen is known as the MTBF
b) Mean Downtime, or MDT
∴ Availability (A₀) = Uptime / Total time = Mean Uptime / (Mean Uptime + Mean Downtime)   (4.2)

A₀ = MTBF / (MTBF + MDT)   (4.3)

This is called the operational availability, A₀.
Sometimes Mean Time To Repair (MTTR) is used in this formula instead of MDT. But MTTR
may not be the same as MDT because:
i) The failure may not be noticed for some time after it has occurred.
ii) It may be decided not to repair the equipment immediately
iii) The equipment may not be put back in service as soon as it is repaired.
Whether MDT or MTTR is used, it is important that it reflects the total time for which the
equipment is unavailable for service; otherwise the calculated availability will be incorrect.
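Equation (4.3) is straightforward to apply. The sketch below uses illustrative figures (an MTBF of 1000 hours and an MDT of 10 hours); remember that MDT must capture the total unavailable time, not just the hands-on repair time.

```python
# Operational availability per equation (4.3): A0 = MTBF / (MTBF + MDT).

def operational_availability(mtbf_hours: float, mdt_hours: float) -> float:
    """Long-run fraction of time the equipment is available."""
    return mtbf_hours / (mtbf_hours + mdt_hours)

print(round(operational_availability(1000.0, 10.0), 4))  # 0.9901
```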
Example
A system operates for 100 hours as shown in the timeline of Figure 4.3. Calculate the Uptime,
Downtime and the operational availability.
The instantaneous availability function will start approaching the steady state availability
value after a time period of approximately four times the average time-to-failure. Figure 4.4
illustrates the steady state availability graphically.
Concisely, one can think of the steady-state availability as a stabilizing point at which the system’s
availability is roughly a constant value.
From the equation for operational availability, and the availability of a repairable system, which is
a function of its failure rate λ and its repair rate µ, we can derive the steady-state availability:
A(∞) = MTBF / (MTBF + MDT) = (1/λ) / (1/λ + 1/µ)   (4.5)

∴ A(∞) = µ / (µ + λ)   (4.6)
The steady-state availability reflects the long-term availability after the system “settles”. The
system availability may initially be unstable due to training/learning issues, deciding on a good
spare-parts stocking policy, deciding on the number of repair personnel, optimizing the efficiency
of repair, burn-in of the system, etc., and it could take some time before it stabilizes.
For systems that operate continuously, once the system passes an initial start-up period, the
instantaneous availability equals its steady state availability.
In many analysis cases, the period of interest is such that the startup transient is negligible and is
ignored.
Now let us consider a two-state model as shown in Figure 4.5, in which a system is either “up”
represented by 1, or “down” represented by 0.
Figure 4.5: Two-state model

The system moves from state 1 to state 0 at rate λ and from state 0 to state 1 at rate µ, where

λ = 1/MTBF and µ = 1/MDT
The instantaneous availability can be calculated as a function of time. It is the probability of being
in state 1 at time t.
P₁(t) = µ/(µ + λ) + (λ/(µ + λ)) e^(−(λ+µ)t)   (4.7)
The average availability or uptime availability is therefore the uptime percentage through time, t
and can be evaluated as:
A₀(t) = (1/t) ∫₀ᵗ P₁(u) du   (4.8)

= (1/t) ∫₀ᵗ [µ/(µ + λ) + (λ/(µ + λ)) e^(−(λ+µ)u)] du

∴ A₀(t) = µ/(µ + λ) + [λ / (t(µ + λ)²)] (1 − e^(−(λ+µ)t))   (4.9)
As time t becomes large, approaching infinity, the limit of A₀(t) is µ/(µ + λ).   (4.10)
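The behaviour of equations (4.7) and (4.10) can be demonstrated numerically. In the sketch below the rates are illustrative assumptions (λ = 1/1000 per hour, µ = 1/10 per hour); the instantaneous availability starts at 1 and decays toward the steady-state value µ/(µ + λ).

```python
import math

# Two-state model of Figure 4.5: instantaneous availability, equation (4.7).
def instantaneous_availability(lam: float, mu: float, t: float) -> float:
    """P1(t) = mu/(mu+lam) + (lam/(mu+lam)) * exp(-(lam+mu)*t)."""
    return mu / (mu + lam) + (lam / (mu + lam)) * math.exp(-(lam + mu) * t)

lam = 1.0 / 1000.0  # failure rate: one failure per 1000 h (illustrative)
mu = 1.0 / 10.0     # repair rate: 10 h mean downtime (illustrative)

for t in (0.0, 10.0, 50.0, 1000.0):
    print(t, round(instantaneous_availability(lam, mu, t), 4))

# Steady-state limit, equation (4.10):
print(round(mu / (mu + lam), 4))  # 0.9901
```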
This is the probability of the system operating and functioning at the requisite level in an ideal
environment when considering only corrective maintenance (CM). The inherent availability, AI,
ignores standby and delay times associated with preventive maintenance as well as logistic delays,
supply delays and administrative delays. Since these other causes of delay can be minimized or
eliminated, an availability value that considers only the corrective downtime is an inherent or
intrinsic property of the system. Many times, this is the type of availability that companies use to
report the availability of their products (e.g. computer servers), because they see downtime other
than actual repair time as outside their control and too unpredictable.
The corrective downtime reflects the efficiency and speed of the maintenance personnel, as well
as their expertise and training level. It also reflects characteristics that should be of importance to
the engineers who design the system, such as the complexity of necessary repairs, ergonomics
factors and whether ease of repair (maintainability) was adequately considered in the design.
i). For a one-off/non-repairable component, the inherent availability can be computed as:

AI = MTTF / (MTTF + MTTR)   (4.11)

ii). Equation (4.11) gets slightly more complicated for a repairable element. To do this, one needs
to look at the Mean Time Between Failures (MTBF) and compute as follows:

AI = MTBF / (MTBF + MTTR)   (4.12)

iii). The achieved availability, which also accounts for preventive maintenance (PM), can be computed as:

A_A = MTBM / (MTBM + M̄)   (4.13)

where MTBF = Uptime / Number of system failures;

MTBM = Uptime / (Number of system failures + Number of system-downing PMs);

and M̄ = (CM Downtime + PM Downtime) / (Number of system failures + Number of system-downing PMs)
It should be noted that system-downing PMs are PMs that cause the system to
go down or require a shutdown of the system.

A₀ = MTBF / (MTBF + MDT) = Uptime / Operating Cycle   (4.14)
Where operating cycle is the overall time period of operation being investigated and uptime is the
total time the system was functioning during the operating cycle.
For instance, if we are using equipment which has a mean time to failure (MTTF) of 81.5 years
and a mean time to repair (MTTR) of 1 hour:

MTTF in hours = 81.5 × 365 × 24 = 713,940 (this is a reliability parameter and often has a high
level of uncertainty)

Inherent availability AI = 713940 / (713940 + 1) = 713940/713941 = 99.999860%

Inherent unavailability = 1/713941 ≈ 0.000140%
Unavailability (Q) = Downtime / Total time

where Downtime = MTTR and Total time = MTBF = MTTF + MTTR

Q = MTTR / (MTTF + MTTR)

If MTTR << MTTF, then substituting,

Q ≈ MTTR/MTTF = (1/MTTF) × MTTR
If the failure rate is constant (the standard assumption for safety integrity level (SIL) verification), then

λ = 1/MTTF

∴ Q ≈ λ × MTTR
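The worked example above, together with the approximation Q ≈ λ × MTTR, can be verified with a few lines of arithmetic:

```python
# Reproducing the worked example: MTTF of 81.5 years, MTTR of 1 hour.
mttf_hours = 81.5 * 365 * 24   # = 713,940 h
mttr_hours = 1.0

a_inherent = mttf_hours / (mttf_hours + mttr_hours)
q_exact = mttr_hours / (mttf_hours + mttr_hours)

# With MTTR << MTTF the approximation Q = lambda * MTTR applies:
lam = 1.0 / mttf_hours
q_approx = lam * mttr_hours

print(f"A_I      = {a_inherent:.6%}")  # ~99.999860%
print(f"Q exact  = {q_exact:.6%}")     # ~0.000140%
print(f"Q approx = {q_approx:.6%}")    # ~0.000140%
```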
4.8 Repairability
There are various definitions of “repairability” in the literature. Some of them are:
o The ability and ease of a product to be repaired during its life cycle.
o The ability to bring a product back to working condition after failure in a reasonable amount
of time and for a reasonable price.
o The characteristics of a product that allow all or some of its parts to be separately
repaired or replaced without having to replace the entire product.
Arising from the above three definitions, there are some relevant elements and parameters that
must be included in the definition in order to give it an all-encompassing meaning which portrays
the desired outcome of the repair (bringing a product back to working condition) as well as
determining the feasibility of a repair (cost and time). Accordingly, for the purposes of this text,
repairability is defined as follows:
The characteristics of a product that allows all or some of its parts to be brought back to working
condition after failure in a reasonable amount of time for a reasonable price without having to
replace the entire product.
The first step of the repair activity is to properly identify the product model in order to retrieve
relevant repair information such as failure diagnostic guides, disassembly instructions or the
availability of spare parts. The second step is to identify the damage or failure in the
system/product. In this step, the technical repairability is assessed and the required further repair
actions, such as replacement of failed parts, are identified. To access these failed parts, complete or
partial disassembly and subsequent reassembly is required. Before putting the product back in
use, in many cases the repaired product is tested and/or reset.
Repair effort concentrates on the parts that are most likely to be repaired or replaced during the
normal service life of the product, and/or parts that are characterized by a high assumed failure rate
and/or are critical for the product to deliver the main desired function.
4.9.2 Disassembly
Diverse definitions of “disassembly” appear in the literature, as follows:
• Non-destructive taking apart of an assembled product into constituent materials and/or
components.
• A process whereby an item is taken apart in such a way that it could subsequently be
reassembled and made operational.
• A reversible process in which a product is separated into its components and/or
subassemblies by non-destructive or semi-destructive operations which only damage the
connectors/fasteners. If the product separation is irreversible, the process is called
dismantling.
We therefore define partial or complete Disassembly as:
A reversible process in which a product is separated into its parts by non-destructive operations
or semi-destructive operations which only damage the connectors/fasteners in such a way that it
could subsequently be reassembled and made operational, possibly needing new
connectors/fasteners.
4. Index for spare parts: This includes information on where to get spare parts and their
cost. Aside from the content of repair manuals, the structure of the manual and the
ease of retrieving the required information for persons performing repair operations
are also of high importance.
Improvements to system reliability will, almost without exception, increase cost. The total costs
incurred over the period of ownership of equipment are often referred to as life-cycle costs (LCC).
These can be categorized into:
a) Acquisition Cost: Capital cost plus cost of installation, transport, etc.
b) Ownership Cost: Cost of preventive and corrective maintenance and modifications.
c) Operating Cost: Cost of materials and energy
d) Administrative Cost: Cost of data acquisition and analysis
These costs will be influenced by:
i) Reliability- this determines frequency of repair, spare part requirements, and loss of
revenue (together with maintainability)
ii) Maintainability- this affects training, test equipment, downtime and manpower.
iii) Safety Factor- this affects operating efficiency, maintainability and liability costs.
Life-cycle costs (LCC) will clearly be reduced by improving reliability, maintainability and safety,
but will be increased by the activities needed to achieve them. Therefore, we need to find an
optimum set of parameters which minimizes the total cost. This concept is illustrated in Figures
4.7 and 4.8. Each curve represents cost against availability. Figure 4.7 shows the general
relationship between availability and cost.
Figure 4.7 Price and Availability Figure 4.8 Cost of Ownership and Availability
The manufacturer’s pre-delivery costs, those of design, procurement and manufacture, increase
with availability. On the other hand, the manufacturer’s after-delivery costs, those of warranty,
redesign and loss of reputation, decrease as availability improves. The total cost is shown by a
curve indicating some value of availability at which minimum cost is incurred. Price will be
related to this cost. Taking, then, the price/availability curve and plotting it again in Figure 4.8,
the user’s costs involve the addition of another curve representing losses and expense, owing to
failure, borne by the user. The result is a curve also showing an optimum availability that incurs
minimum cost. These diagrams serve to illustrate the idea that cost is minimized by finding
reliability and maintainability enhancements whose savings exceed the initial expenditure.
The following list summarizes best practice, together with recommended enhancements, for
both manual and computer-based field failure recording. Recorded field information is frequently
inadequate, and it is necessary to emphasize that failure data must contain sufficient information to
enable precise failures to be identified and failure distributions to be analysed. They must,
therefore, include:
(a) Adequate information about the symptoms and causes of failure. This is important because
predictions are only meaningful when a system-level failure is precisely defined. Thus component
failures which contribute to a defined system failure can only be identified if the failure modes are
accurately recorded. There needs to be a distinction between failures (which cause loss of system
function) and defects (which may only cause degradation of function).
(b) Detailed and accurate equipment inventories enabling each component item to be separately
identified. This is essential in providing cumulative operating times for the calculation of assumed
constant failure rates and also for obtaining individual calendar times (or operating times or
cycles) to each mode of failure and for each component item. These individual times to failure are
necessary if failure distributions are to be analysed.
(c) Identification of common cause failures by requiring the inspection of redundant units to
ascertain if failures have occurred in both (or all) units. This will provide data to enhance models.
In order to achieve this it is necessary to be able to identify that two or more failures are related to
specific field items in a redundant configuration. It is therefore important that each recorded
failure also identifies which specific item (i.e. tag number) it refers to.
(d) Intervals between common cause failures. Because common cause failures do not necessarily
occur at precisely the same instant it is desirable to be able to identify the time elapsed between
them.
(e) The effect that a ‘component part’ level failure has on failure at the system level. This will
vary according to the type of system, the level of redundancy (which may postpone system level
failure), etc.
(f) Costs of failure such as the penalty cost of system outage (e.g. loss of production) and the cost
of corrective repair effort and associated spares and other maintenance costs.
(g) The consequences in the case of safety-related failures (e.g. death, injury, environmental
damage) not so easily quantified.
(h) Consideration of whether a failure is intrinsic to the item in question or was caused by an
external factor. External factors might include: process operator error induced failure; maintenance
error induced failure; failure caused by a diagnostic replacement attempt; modification induced failure.
(i) Effective data screening to identify and correct errors and to ensure consistency. There is a cost
issue here in that effective data screening requires significant man-hours to study the field failure
returns. In the author’s experience an average of as much as one hour per field return can be
needed to enquire into the nature of a given failure and to discuss and establish the underlying
cause. Both codification and narrative are helpful to the analyst and, whilst each has its own
merits, a combination is required in practice. Modern computerized maintenance management
systems offer possibilities for classification and codification of failure modes and causes.
However, this relies on motivated and trained field technicians to input accurate and complete
data. The option to add narrative should always be available.
(j) Adequate information about the environment (e.g. weather in the case of unprotected
equipment) and operating conditions (e.g. unusual production throughput loadings).
Reporting forms must strike a balance between detailed failure information and the requirement
for a simple reporting format. A feature of a telecommunication company’s form is the use of four
identical print-through forms. The information is therefore accurately recorded four times with
minimum effort. Figure 4.10 shows the author’s recommended format, taking into account the list
of items under “Best Practices and Recommendations.”
4.12 Concept of Failure Reporting, Analysis and Corrective Action System (FRACAS)
FRACAS is a process that gives organizations a way to report, classify and analyze failures, as
well as plan corrective reactions in response to those failures. Usually, software is used to
implement a FRACAS system to help manage multiple failure reports and produce a history of
failure with corresponding corrective actions, so recorded information from those past failures can
be analyzed.
FRACAS is a closed-loop process containing the following steps:
1. Failure reporting (FR): All failures and faults related to a system, piece of equipment or
process are formally reported using a standard form known as a failure report or defect
report. The failure report should clearly identify the failed asset, symptoms of the failure,
testing conditions, operating conditions and failure time.
2. Analysis (A): Perform a root cause analysis to identify what caused the failure.
3. Corrective actions (CA): Once the cause of the failure is determined, implement and
verify corrective (or preventive) actions to prevent future occurrences of the failure. Any
changes should be formally documented to ensure standardization.
Here, “failure information” means any information deemed necessary to help determine and
resolve issues, as well as information for future tracking.
During the failure reporting stage of FRACAS, you should clearly define the type of information
to record in the incident report. Over time, as failures flow through the closed-loop FRACAS
process, more information will be collected; however, initially, as much data as possible should
be gathered on the failure and how it was detected. Failure reports should collect information such
as:
(iv)All details about the incident, including the steps that led to the incident
(v) Any corrective action that was done to fix the issue
The most important aspect of failure reporting is ensuring issues are logged in your FRACAS as
they occur in real time. To do this, all team members must have access to the FRACAS and be
able to properly navigate the system.
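To make the reporting step concrete, the sketch below shows one possible shape for a failure-report record; the field names are illustrative assumptions, not a standard FRACAS schema. The root cause and corrective action start empty and are filled in as the report moves through the closed loop.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class FailureReport:
    asset_id: str                 # the failed asset (e.g. tag number)
    symptoms: str                 # observed symptoms of the failure
    operating_conditions: str     # conditions at the time of failure
    reported_at: datetime = field(default_factory=datetime.now)
    root_cause: Optional[str] = None         # filled in during Analysis
    corrective_action: Optional[str] = None  # filled in and verified before close-out
    closed: bool = False

report = FailureReport(
    asset_id="PUMP-014",
    symptoms="Seal leak observed during start-up",
    operating_conditions="Full load, 35 degC ambient",
)
print(report.asset_id, report.closed)
```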
Step 2 – Analysis
After you've logged your failure report(s), it's time to conduct an analysis of the issue at hand.
This phase can also be customized to fit your organization's needs and help you determine how to
proceed with analyzing the issue. The analysis phase typically is done by a team lead or engineer
who fully evaluates what caused the failure and then identifies a solution.
The final step in the FRACAS is resolving the issue and closing it out. At this point, you've
determined the root cause of the failure and come up with a solution to correct it. Once you've
implemented the corrective action, your team should verify the success of the action and close out
the incident in the system. Closing out each failure is critical to ensure the closed-loop system
remains intact.
Implementing a FRACAS gives you valuable information to help identify and correct errors or
failures, past problems, defects, or process errors in a timely manner. Additional benefits include
the following:
(i) Through proper investigation of failures and appropriate corrective action, a FRACAS
also directly reduces immediate costs like factory rework and parts/materials scrap, as well
as indirect costs like customer dissatisfaction.
(iii) FRACAS offers visibility of reliability performance issues and initiates continuous
improvement processes.
(iv)Through root cause analysis, FRACAS helps expedite engineering efforts to fix issues,
which in turn leads to effective corrective actions.
(v) FRACAS provides an organization with a knowledge base of a history of problems, giving
you a precedent for numerous issues to help you avoid them in the future.
Chapter Five
At times, the word specification is used in relation to a data sheet (or spec sheet), which can be
misconstrued. A data sheet, however, explains the technical features of an item or product, usually
documented and made available by the manufacturer to assist in selecting or making proper use of
the product. It is sometimes called a performance specification. Hence, a data sheet is not a technical
specification in the sense of information regarding how to design or produce.
For example, a function generator required for use in a laboratory may have the following brief
specifications:
An “in service” or “maintain as” specification spells out the state of a single system/product after
functioning for some years, as well as the effects of wear and maintenance (i.e. changes in
configuration arising from years of operation).
Specifications are performance goals for the design team. They are important for
the following reasons:
(i) They provide clear instructions on the intent, materials, product and service.
(ii) They serve as a guideline to the quality and standards which should be applied.
(iii) Materials and manufacturers’ products can be clearly defined.
(iv) The requirements for installation, testing and handover can be identified.
(v) They may be used as a mutual agreement between a prospective buyer and a manufacturer.
For example, a buyer may wish to specify that material components or equipment should
conform to certain standards.
(vi) They are used as indicators for buyers or users of manufactured products in choosing the
most suitable item for their needs among many options or varieties.
Amplitude Stability:
i) (Sine and square waveforms) typically less than ±5% peak-peak change over the range
0.01 to 100 kHz.
ii) (Triangle waveform) typically less than 0.5% peak-peak change over the range 0.01–10
kHz and 2.5% at 100 kHz.
Purity:
Sine distortion: less than 2%, typically 0.25% at 1 kHz on the 1 kHz to 100 kHz range, and 0.7% at
100 kHz.
Rise and fall time: less than 200 ns on square wave
Linearity: typically less than 1% on triangle wave
Auxiliary Output:
Triangle wave: amplitude, 2 V peak-peak
Impedance: 600 Ω
TTL (transistor-transistor logic) wave: amplitude, 0 to +5 V (nominal); rise time, less than
100 ns
Input:
Input impedance: 10 kΩ
Power Requirements:
Line voltage: 200/250 V rms, 50 or 60 Hz
Consumption: 6 VA
Observations
A close look at the above data sheet clearly shows that no express information about the reliability
of the product is provided by the manufacturer. Most times, this is a deliberate omission by the
manufacturer in order not to inadvertently give vital information to competitors. Sometimes
it is also the result of difficulties in working out a realistic numerical reliability value. This
is because accurate reliability prediction of equipment requires minutely examining each of its
conceptual “blocks”, for which sufficient facts and figures on failure rates are needed. The
interconnectivity of these blocks makes the overall reliability of the equipment dependent on the
collective performance of the individual conceptual blocks, which can only unfold through some
rigorous calculations.
Example 1
Let us take the example of a compact changeover switch:
Electrical Specifications:
Nominal current 500 mA
Minimum current 1 mA
Nominal voltage 12 V
Minimum voltage 10 mV
Electrostatic breakdown value 5 kV
Isolation resistance ≥10,000 MΩ at 100 V d.c.
Contact resistance ≤22 Ω
Electrical life 1000 actuations (operations)
Electric strength 250 V rms, 50 Hz, 1 min
Mechanical Characteristics:
Switching action Maintained
Actuating travel 1.6 mm
Operating temperature -40°C ... +85°C
Shock resistance 50 g, 11 ms
Environmental resistance (Standard / Tropical):
- Damp heat: 4 days / 21 days
- Saline mist: 24 hours / 96 hours
Example 3
Specifications of two fixed resistors

Specification                Carbon Composition Type    Metal Oxide Type
Range                        10 Ω to 22 Ω               10 Ω to 1 MΩ
Selection tolerance          ±10%                       ±2%
Power rating                 250 mW                     500 mW
Load stability               10%                        1%
Maximum voltage              150 V                      350 V
Insulation resistance        10⁹ Ω                      10¹⁰ Ω
Proof voltage*               500 V                      1 kV
Voltage coefficient**        200 ppm/V                  10 ppm/V
Ambient temperature range    −40°C to 105°C             −55°C to 150°C
Temperature coefficient      ±120 ppm/°C                ±250 ppm/°C
Noise                        2 µV/V (at 1 kΩ) to 6 µV/V (at 10 MΩ)    0.1 µV/V
Soldering effects***         2%                         0.15%
Shelf time (1 year)⁺         5%                         0.1%
Damp heat (95% RH)⁺⁺         15% max.                   1%
Remark
* Proof voltage: This is the maximum voltage that can be applied between the resistor body and a
touching external conductor.
** Voltage coefficient: This is the negative change in resistance with respect to applied voltage
(expressed in ppm per volt). 1 ppm = 1 × 10⁻⁶.
*** Soldering effects: This is the change in resistance as a result of a standard soldering test.
⁺ Shelf time: This is the possible change in resistance, usually after one year.
⁺⁺ Damp heat: This is the change in resistance as a result of a standard high-temperature and
humidity test.
Observation
A close look at the above data sheet reveals that:
i) Different components possess diverse items of information in their specifications.
ii) Comparison of the data for the two fixed resistors shows that for high-reliability
performance, the metal oxide type would be preferred because of its wider range and
superior stability.
For any reasonable improvement in any of the specification items listed below, there must be a
trade-off with the cost of the equipment. The items include:
i) Stability
ii) Sensitivity
iii) Reduction of error range
iv) Speed of reading
v) Widening operating conditions
It is inferred, therefore, that the higher the quality of the operational features, the higher the cost
of the equipment. Hence, for the buyer of any equipment, the guiding principle should be the
purpose for which the equipment is needed. That will determine which of the above listed factors
the buyer should prioritize.
Chapter Six
6.0 Appreciating the Need for Testing; Types of Testing carried out and the purpose of
Testing
Introduction
Testing is a set of activities conducted to facilitate the discovery and/or evaluation of properties of one or more items or pieces of equipment under test. Each individual test, known as a test case, exercises a set of predefined test activities developed to drive the execution of the test item towards its objectives, including correct implementation, error identification, quality verification and other valued details. The test environment is usually designed to be identical, or as close as possible, to the anticipated operating environment.
In this chapter, the various types of tests which products undergo in order to be certified suitable for the market will be treated.
Objectives of BAT (Business Acceptance Testing)
The objective of BAT is to provide a set of working and fully tested features that are ready for production, to be tested and validated by the business.
It is recommended that the delivery team works collaboratively with the business to define a testing approach and a plan with test cases. Any defects found at this stage will be handled by the Business Test Lead, Business Analyst and Development Lead, along with support from the project manager.
Additionally, as the users pass their acceptance criteria, the business owners can be reassured that
the developers are progressing in the right direction.
little (inaccurate measurements). The bottom line, however, is that calibration is costly, but so are inaccurate instruments.
In general, there are other scenarios where more frequent calibrations could be required:
i) Before Starting and after Finishing a Major Critical Measuring Project
If you are planning a project that requires extremely accurate measurements, the instruments to be
used for that project should be sent for calibration, and then, kept in storage until the testing
begins. Likewise, after the project is completed, the equipment used should be sent for calibration.
When the results are obtained, they can be used to confirm the accuracy of the testing results for
the project.
ii) After an Incident
If the instrument receives knocks, bumps or any other kind of physical impact, or its internal overload protection is tripped, then one may want to consider sending it for recalibration to ensure its accuracy. The chances of this happening are higher in certain industries such as construction, field service and facilities maintenance.
iii) Based on individual Project Requirements
One can guarantee every project requiring electrical testing will differ in size and scope, and
therefore have different requirements for calibration. Some will require the use of certified and
calibrated test equipment, while others won’t have as stringent calibration standards. Review of
the specifications is required before the test, as the requirements might not be explicitly stated.
iv) At Quarterly or Semi-annual Periods
If one carries out critical measurements, then leaving a shorter time span between calibrations means there will be less chance of questionable test results or of the electrical test meter drifting out of calibration. Be prepared by diarizing the calibration frequency or booking calibrations in advance.
v) Internal Requirements
Often, business insurance or an awarding organization may require a valid calibration certificate. Alternatively, your organization's quality manual might stipulate the desired frequency of calibrations.
Choosing a supplier with affordable prices and quick turnaround times makes regular calibrations
a possibility.
As a final point, there are also some other factors that can influence the calibration period, such
as:
Here are the top reasons NDT is used by so many companies throughout the world:
i) Savings. The most obvious reason is that NDT is more appealing than destructive testing because it allows the material or object being examined to survive the examination unharmed, thus saving money and resources.
ii) Safety. NDT is also appealing because almost all NDT techniques (except radiographic
testing) are harmless to people.
iii) Efficiency. NDT methods allow for the thorough and relatively quick evaluation of assets,
which can be crucial for ensuring continued safety and performance on a job site.
iv) Accuracy. NDT methods have been proven accurate and predictable; both qualities you
want when it comes to maintenance procedures meant to ensure the safety of personnel
and the longevity of equipment.
There are several techniques used in NDT for the collection of various types of data, each
requiring its own kind of tools, training, and preparation.
Some of these techniques might allow for a complete volumetric inspection of an object, while
others only allow for a surface inspection. In a similar way, some NDT methods will have varying
degrees of success depending on the type of material they are used on, and some techniques—
such as Magnetic Particle NDT, for example—will only work on specific materials (i.e., those
that can be magnetized).
Definition: Visual Non-Destructive Testing is the act of collecting visual data on the status of a
material. Visual Testing is the most basic way to examine a material or object without altering it
in any way.
Visual Testing can be done with the naked eye, by inspectors visually reviewing a material or
asset. For indoor Visual Testing, inspectors use flashlights to add depth to the object being
examined. Visual Testing can also be done with an RVI (Remote Visual Inspection) tool, like a
camera. To get the camera in place, NDT inspectors may use a robot or drone, or may simply
hang it from a rope.
In general, Ultrasonic Testing uses sound waves to detect defects or imperfections on the surface of a material.
One of the most common Ultrasonic Testing methods is the pulse echo. With this technique,
inspectors introduce sounds into a material and measure the echoes (or sound reflections)
produced by imperfections on the surface of the material as they are returned to a receiver.
Radiography Testing directs radiation from a radioactive isotope or an X-ray generator through
the material being tested and onto a film or some other kind of detector. The readings from the
detector create a shadowgraph, which reveals the underlying aspects of the inspected material.
Radiography Testing can uncover aspects of a material that can be hard to detect with the naked
eye, such as alterations to its density.
Definition: Eddy Current Non-Destructive Testing is a type of electromagnetic testing that uses
measurements of the strength of electrical currents (also called eddy currents) in a magnetic field
surrounding a material in order to make determinations about the material, which may include the
locations of defects.
To conduct Eddy Current Testing, inspectors examine the flow of eddy currents in the magnetic
field surrounding a conductive material to identify interruptions caused by defects or
imperfections in the material.
To use Magnetic Particle Testing, inspectors first induce a magnetic field in a material that is
highly susceptible to magnetization. After inducing the magnetic field, the surface of the material
is then covered with iron particles, which reveal disruptions in the flow of the magnetic field.
These disruptions create visual indicators for the locations of imperfections within the material.
Definition: Acoustic Emission Non-Destructive Testing is the act of using acoustic emissions to
identify possible defects and imperfections in a material.
Inspectors conducting Acoustic Emission Tests are examining materials for bursts of acoustic
energy, also called acoustic emissions, which are caused by defects in the material. Intensity,
location, and arrival time can be examined to reveal information about possible defects within the
material.
Definition: Liquid Penetrant Non-Destructive Testing refers to the process of using a liquid to
coat a material and then looking for breaks in the liquid to identify imperfections in the material.
Inspectors conducting a Penetrant Test will first coat the material being tested with a solution that
contains a visible or fluorescent dye. Inspectors then remove any extra solution from the
material’s surface while leaving the solution in defects that “break” the material’s surface. After
this, inspectors use a developer to draw the solution out of the defects, then use ultraviolet light to
reveal imperfections (for fluorescent dyes). For regular dyes, the color shows in the contrast
between the penetrant and the developer.
Definition: Leak Non-Destructive Testing refers to the process of studying leaks in a vessel or
structure in order to identify defects in it.
Inspectors can detect leaks within a vessel using measurements taken with a pressure gauge, soap-
bubble tests, or electronic listening devices, among others.
ii) Inadequate marking: BS 2770 (86th edition, April 30, 1990) provides a specification for the pictorial marking of handling instructions for goods in transit: a set of symbols for the marking of packages to convey handling instructions without the use of specific language.
iii) Failure to treat for prevention of corrosion: various cleaning methods exist for the removal of oil, rust and miscellaneous contamination, followed by preventive treatments and coatings.
iv) Degradation of Packaging materials owing to method of storage prior to use.
v) Inadequate adjustments or padding prior to packaging, and lack of handling care during transport: this requires adequate work instructions, packing lists, training, etc.
Choosing the most appropriate packaging involves considerations of cost, availability and size, for which reason a compromise is usually sought. Crates, rigid and collapsible boxes, cartons, wallets, tri-wall wrapping, chipboard cases, sealed wrapping, fabricated and moulded spacers, corner blocks and cushions, bubble wrapping, etc. are a few of the many alternatives available to meet any particular packaging specification.
Environmental testing involving vibration and shock tests, together with climatic tests, is necessary to qualify a packaging arrangement. This work is undertaken by a number of test houses and may save large sums if it ultimately prevents damaged goods being received, since the cost of a defect rises tenfold and more once equipment has left the factory. As well as the specified environmental tests, the product should be transported over a range of typical journeys and then retested to assess the effectiveness of the proposed pack.
6.6 Preproduction Testing
Preproduction testing is meant to certify that a production model completely satisfies the technical criteria indicated in the engineering specification. This test is typically done ahead of mass production of the equipment/items, to make certain of the practicability of satisfying the specifications under usual manufacturing conditions. The test is usually carried out by an independent section of the company, such as the quality control section. Pre-production testing usually includes a performance test (within some specified limits/range), an environmental test, a reliability test, a maintainability test, a packaging and transport test (which covers shock and vibration tests) and a physical characteristics test (involving ergonomics).
A brief discussion of the above-mentioned tests embodied in the pre-production test is useful; hence their discussion below:
i) Environmental Testing
This is the type of test whereby the expected operating environmental conditions of the equipment are simulated, and the product is made to function in them. The bottom line is to determine whether the product will function as stated in the specification, uninterruptedly, without damage or diminution in functionality and serviceability. It is subjected to stated boundaries of its environment, using test equipment such as ovens and refrigerators for the simulation of diverse temperatures and humidities, and other equipment capable of simulating various degrees of shock, vibration and the like.
This proves that the equipment functions to specification (for a sustained period) and is not degraded or damaged by defined extremes of its environment. The test can cover a wide range of parameters and it is important to agree on a specification which is realistic. It is tempting, when in doubt, to widen the limits of temperature, humidity and shock in order to be extra sure of covering the likely range which the equipment will experience. The resulting cost of overdesign, even for a few degrees of temperature, may be totally unjustified.
The possibilities are numerous and include:
Electrical
o Electric fields.
o Magnetic fields.
o Radiation.
Climatic
o Temperature extremes
o Temperature cycling – internal and external may be specified.
o Humidity extremes.
o Temperature cycling at high humidity.
o Thermal shock – rapid change of temperature.
o Wind – both physical force and cooling effect.
o Wind and precipitation.
o Direct sunlight.
o Atmospheric pressure extremes.
Mechanical
o Vibration at a given frequency – a resonant search is often carried out.
o Vibration at simultaneous random frequencies – used because resonances at different
frequencies can occur simultaneously.
o Mechanical shock – bump.
o Acceleration.
Chemical and hazardous atmospheres
o Corrosive atmosphere – covers acids, alkalis, salt, greases, etc.
o Foreign bodies – ferrous, carbon, silicate, general dust, etc.
o Biological – defined growth or insect infestation.
o Reactive gases.
o Flammable atmospheres.
and reassembling. This parameter becomes essential when considering the availability of the product, which, of course, is linked with the ability to reduce downtime.
iv) Ergonomics Test
This test is meant to determine the ease of interaction between the operator/maintenance personnel and the pre-production model. It may uncover likely recurrent operator mistakes, lapses or time wasting resulting from the number and positioning of the keys, knobs, switches, etc. All these are a function of the design, and affect the operator's convenience and the accuracy with which he performs his duties.
v) Marginal Testing
This involves proving the various system functions at the extreme limits of the electrical and
mechanical parameters and includes:
Electrical
o Mains supply voltage.
o Mains supply frequency.
o Insulation limits.
o Earth testing.
o High voltage interference – radiated. Typical test apparatus consists of a spark plug,
induction coil and break contact.
o Mains-borne interference.
o Line error rate – refers to the incidence of binary bits being incorrectly transmitted in a digital system. Usually expressed as 1 in 10^n bits.
o Line noise tests – analogue circuits.
o Electrostatic discharge – e.g. 10 kV from 150 pF through 150 Ω to conductive surfaces.
o Functional load tests – loading a system with artificial traffic to simulate full utilization
(e.g. call traffic simulation in a telephone exchange).
o Input/output signal limits – limits of frequency and power.
o Output load limits – sustained voltage at maximum load current and testing that current
does not increase even if load is increased as far as a short circuit.
Mechanical
o Dimensional limits – maximum and minimum limits as per drawing.
o Pressure limits – covers hydraulic and pneumatic systems.
o Load – compressive and tensile forces and torque.
7. Assembly
8. Firmware Development (Assembly, C and/or BASIC)
9. Testing and debugging
A) Construction
i) Breadboard: This is a quick way to set up a circuit. It can be used only with through-hole components, although adapters for some surface-mount devices exist. It is not suitable for high frequencies, owing to stray capacitance and inductance, and because the long connecting traces inside the breadboard act as antennas. It is also not suitable for high-current and/or high-voltage circuits.
ii) Point-to-point wiring: More permanent than a breadboard. Components are soldered to a perforated board, typically known as Veroboard.
iii) Wire wrapping: This used to be popular in the 1970s and 1980s. It is similar to point-to-point wiring except that a special tool/gun and special sockets are used. Wire is connected to endpoints by wrapping it onto a rectangular pin instead of soldering.
iv) Printed Circuit Board (PCB): This is the most reliable method. It used to be the costliest method of prototyping, but with recent advances in the automated manufacturing of printed circuit boards, prices have decreased significantly.
B) Assembly
i) Hand assembly: Suitable for prototypes.
ii) Pick-and-place machines: Suitable for mass production. They take time to set up, and there is a setup fee. Some parts need to be purchased in reels.
D) Bugs to Expect:
i) Design error.
ii) The design is OK but there is an error in the schematic.
iii) The schematic is OK but a mistake was made in the board layout. Examples:
o Wrong package for the part
o Mirrored package for the part
o Wrong hole size
o Too thin a trace to carry the current
o Supply and/or ground does not reach all components
o Missing traces
o Packages conflict with each other
iv) Board manufacturing flaws: usually short circuits or open circuits. This is rare when the board comes from a professional board house, but it may happen.
v) Assembly errors: the wrong component was installed. For example, a 10 kΩ resistor was intended and a 10 Ω resistor was installed.
vi) The correct component was installed, but installed backwards. This typically happens with 2-pin components that have a polarity, such as diodes.
vii) Component damaged during assembly. Some resistors may crack or break during soldering due to high heat, and some ICs cannot tolerate a soldering iron's heat for longer than 10 seconds; cool the IC package during soldering if necessary.
viii) Printed circuit board traces can be lifted during soldering if the soldering iron is kept on the board too long.
6.11 Methods of Sampling Plan for Testing Large and Small Batch Quantities
Introduction
Suppose a large supermarket is supplied with large batches of pre-packed sandwiches for its food department by a catering firm, and the supermarket manager wishes to test the sandwiches so as to certify their freshness and quality. The only way she can test them is by unwrapping and tasting them. It is obvious, however, that it will no longer be possible to sell them after the test. She is therefore obliged to decide whether or not the batch is acceptable based on testing a relatively small sample of sandwiches. This is known as acceptance sampling.
Acceptance sampling may be applied where large quantities of similar items or large batches of
material are being bought or are being transferred from one part of an organization to another. Unlike
statistical process control where the purpose is to check production as it proceeds, acceptance
sampling is applied to large batches of goods which have already been produced.
It is discernible that the test on the sandwiches is a destructive test because after the test has been
carried out the sandwich is no longer saleable. Other reasons for applying acceptance sampling are
that when buying large batches of components it may be too expensive or too time-consuming to test
them all. In other cases when dealing with a well established supplier the customer may be quite
confident that the batch will be satisfactory but will still wish to test a small sample to make sure.
The characteristics of acceptance sampling are that each item tested is classified as conforming or non-conforming. (Items used to be classified as defective or non-defective, but these days no self-respecting manufacturing firm will admit to making defective items.)
A sample is taken and if it contains too many non-conforming items the batch is rejected, otherwise it
is accepted.
For this method to be effective, batches containing some non-conforming items must be acceptable. If
the only acceptable percentage of non-conforming item is zero this can only be achieved by
examining every item and removing those that are non-conforming. This is known as 100%
inspection and is not acceptance sampling. However, the definition of non-conforming may be chosen as required. For example, if the contents of jars of jam are required to be between 453 g and 461 g, it would be possible to define a jar with contents outside the range 455 g to 459 g as non-conforming. Batches containing up to, say, 5% non-conforming items could then be accepted in the knowledge that, unless there was something very unusual about the distribution, this would ensure that virtually all jars in the batch contained between 453 g and 461 g.
Example:
Suppose a mobile phone company produces mobile phones in lots of 100. To check the quality of the lots, the quality inspector of the company uses a single sampling plan with n = 15 and c = 1. Explain the procedure for implementing it.
Solution:
For implementing the single sampling plan, the quality inspector of the company randomly draws
a sample of 15 mobile phones from each lot and classifies each mobile of the sample as non-
conforming or conforming. At the end of the inspection, he/she counts the number of non-
conforming mobiles (d) found in the sample and compares it with the acceptance number (c). If d ≤ c (= 1), he/she accepts the lot, and if d > c (= 1), he/she rejects the lot under the acceptance sampling plan. Under a rectifying sampling plan, if d ≤ c (= 1), he/she accepts the lot after replacing all non-conforming mobiles found in the sample with conforming mobiles; and if d > c, he/she accepts the lot after inspecting the entire lot and replacing all non-conforming mobiles in the lot with conforming mobiles.
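The arithmetic behind this decision rule is binomial. The minimal sketch below (illustrative only; the function name is ours) computes the probability of accepting a lot for a given proportion p of non-conforming phones, treating the draws as independent, i.e. the usual binomial approximation for a sample that is small relative to the lot.

from math import comb

def p_accept_single(n, c, p):
    """P(accept) = P(d <= c), where d ~ Binomial(n, p)."""
    return sum(comb(n, d) * p**d * (1 - p)**(n - d) for d in range(c + 1))

# The plan from the example: n = 15, c = 1.
for p in (0.01, 0.05, 0.10):
    print(f"p = {p:.2f}: P(accept) = {p_accept_single(15, 1, p):.3f}")

Running this shows how the acceptance probability falls as lot quality worsens, which is exactly the operating characteristic of the plan.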
Sometimes, situations arise when it is not possible to decide whether to accept or reject the lot on
the basis of a single sample. In such situations, we use a sampling plan known as the double
sampling plan. In this plan, the decision of acceptance or rejection of a lot is taken on the basis of
two samples. A lot may be accepted immediately if the first sample is good or may be rejected if
it is bad. If the first sample is neither good nor bad, the decision is based on the evidence of the
first and second sample combined.
In this section we shall explain the concept of the double sampling plan and the procedure for
implementing it.
A sampling plan in which a decision about the acceptance or rejection of a lot is based on two
samples that have been inspected is known as a double sampling plan.
The double sampling plan is used when a clear decision about acceptance or rejection of a lot cannot be taken on the basis of a single sample. In a double sampling plan, generally, the decision of acceptance or rejection of a lot is taken on the basis of two samples. If the first sample is bad,
the lot may be rejected on the first sample and a second sample need not be drawn. If the first
sample is good, the lot may be accepted on the first sample and a second sample is not needed.
But if the first sample is neither good nor bad and there is a doubt about its results, we take a
second sample and the decision of acceptance or rejection of a lot is taken on the basis of the
evidence obtained from both the first and the second samples.
For example, suppose a buyer purchases resistors in lots of 500 from a company. To check the
quality of the lots, the buyer and the company decide that the buyer will draw two samples of
sizes 10 (first sample) and 20 (second sample) and the acceptance numbers for the plan are 1 and
3. The buyer takes the two samples and makes the decision of acceptance or rejection of the lot on the basis of the two samples. Since the decision of acceptance or rejection of the lot is taken on the basis of two samples, this is a double sampling plan.
A double sampling plan requires the specification of four quantities which are known as its
parameters. These parameters are
n1 – size of the first sample,
c1 – acceptance number for the first sample,
n2 – size of the second sample, and
c2 – acceptance number for both samples combined.
Therefore, the parameters of the double sampling plan in the above example are:
the size of the first sample (n1) = 10,
the acceptance number for the first sample (c1) = 1,
the size of the second sample (n2) = 20, and
the acceptance number for both samples combined (c2) = 3.
So far you have learnt the definition of the double sampling plan and why it is used. We now
describe the procedure for implementing it and its advantages over the single sampling plan.
Step 1: We draw a random sample (first sample) of size n1 from the lot received from the supplier or the final assembly.
Step 2: We inspect each and every unit of the sample and classify it as non-conforming or
conforming. At the end of the inspection, we count the number of non-conforming units found in
the sample. Suppose the number of non-conforming units found in the first sample is d1.
Step 3: We compare the number of non-conforming units (d1) found in the first sample with the
stated acceptance numbers c1 and c2.
Step 4: We take the decision on the basis of the first sample as follows:
If d1 ≤ c1, we accept the lot and replace all non-conforming units found in the sample by
conforming units. If d1 > c2, we accept the lot after inspecting the entire lot and replacing all non-
conforming units in the lot by conforming units. But if c1 < d1 ≤ c2, the first sample does not give a decision.
Step 5: If c1 < d1 ≤ c2, we draw a second random sample of size n2 from the lot.
Step 6: We inspect each and every unit of the second sample and count the number of non-
conforming units found in it. Suppose the number of non-conforming units found in the second
sample is d2.
Step 7: We combine the numbers of non-conforming units (d1 and d2) found in both samples and consider d1 + d2 for taking the decision about the lot on the basis of the second sample as follows: if d1 + d2 ≤ c2, we accept the lot, and if d1 + d2 > c2, we reject it.
Example:
Suppose a mobile phone company produces mobile phones in lots of 400 phones each. To check the quality of the lots, the quality inspector of the company uses a double sampling plan with n1 = 15, c1 = 1, n2 = 30, c2 = 3. Explain the procedure for implementing it under an acceptance sampling plan.
Solution:
For implementing the double sampling plan, the quality inspector of the company randomly draws a first sample of 15 mobiles from the lot and classifies each mobile of the first sample as non-conforming or conforming. At the end of the inspection, he/she counts the number of non-conforming mobiles (d1) found in the first sample and compares d1 with the acceptance numbers c1 and c2. If d1 ≤ c1 = 1, he/she accepts the lot, and if d1 > c2 = 3, he/she rejects the lot. If c1 < d1 ≤ c2, that is, if the number of non-conforming mobiles is 2 or 3, he/she draws the second sample from the lot. He/she then counts the number of non-conforming mobiles (d2) found in the second sample and compares the total number of non-conforming mobiles (d1 + d2) in both samples with the acceptance number c2. If d1 + d2 ≤ c2 = 3, he/she accepts the lot, and if d1 + d2 > c2 = 3, he/she rejects the lot.
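Under the same binomial approximation used above, the acceptance probability of this double sampling plan can be sketched as follows (function names are ours; illustrative only). The lot is accepted outright when d1 ≤ c1, and via the second sample when c1 < d1 ≤ c2 and d1 + d2 ≤ c2.

from math import comb

def pmf(n, d, p):
    """Binomial probability of exactly d non-conforming in a sample of n."""
    return comb(n, d) * p**d * (1 - p)**(n - d)

def p_accept_double(n1, c1, n2, c2, p):
    """P(accept) for a double sampling plan under a binomial model."""
    outright = sum(pmf(n1, d1, p) for d1 in range(c1 + 1))
    via_second = sum(
        pmf(n1, d1, p) * sum(pmf(n2, d2, p) for d2 in range(c2 - d1 + 1))
        for d1 in range(c1 + 1, c2 + 1)
    )
    return outright + via_second

# The plan from the example: n1 = 15, c1 = 1, n2 = 30, c2 = 3.
print(round(p_accept_double(15, 1, 30, 3, 0.05), 3))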
Example
A firm is to introduce an acceptance sampling scheme. Three alternative plans are considered.
Plan A Take sample of 50 and accept the batch if no non-conforming items are found,
otherwise reject.
Plan B Take a sample of 50 and accept the batch if 2 or fewer non-conforming items are
found.
Plan C Take a sample of 40 and accept the batch if no non-conforming items are found.
Reject the batch if 2 or more are found. If one is found, then take a further sample
of size 40. If a total of 2 or fewer (out of 80) is found, accept the batch, otherwise
reject.
a) Find the probability of acceptance for each of the plans A, B and C if batches are
submitted containing
(i) 1% non-conforming (ii) 10% non-conforming
b) Without further calculation, sketch on the same axes the operating characteristic for plans
A, B and C.
c) Show that, for batches containing 1% non-conforming, the average number of items
inspected when using plan C is similar to the number inspected when using plan A or B.
Solution:
a) Plan A: accept 0.
P(accept) = (1 − p)^50.
For p = 0.01, P(accept) = 0.99^50 = 0.605;
for p = 0.1, P(accept) = 0.9^50 = 0.005.
Plan B: accept 0, 1 or 2.
P(accept) = (1 − p)^50 + 50p(1 − p)^49 + 1225p^2(1 − p)^48, where 1225 = C(50, 2).
For p = 0.01, P(accept) = 0.605 + 0.306 + 0.076 = 0.986;
for p = 0.1, P(accept) = 0.005 + 0.029 + 0.078 = 0.112.
Plan C: accept 0 in the first sample (in which case no second sample will be taken), or 1 in the first sample and 0 in the second sample, or 1 in the first sample and 1 in the second sample. There are no other ways of accepting the batch: if 2 or more are found in the first sample the batch is immediately rejected, and if 1 is found in the first sample and 2 or more in the second (giving a total of 3 or more) the batch is rejected.
The samples are of equal size and the batch is large, so the probability of acceptance may be expressed as
P(accept) = P(0) + P(1) × P(0) + P(1) × P(1),
where P(0) = (1 − p)^40 and P(1) = 40p(1 − p)^39.
For p = 0.01, P(0) = 0.669 and P(1) = 0.270, giving P(accept) = 0.669 + 0.270 × 0.669 + 0.270^2 = 0.923;
for p = 0.1, P(0) = 0.015 and P(1) = 0.066, giving P(accept) ≈ 0.020.
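For part (b), the operating characteristics can be visualised numerically rather than sketched by hand. The following sketch (ours, using numpy and matplotlib with the formulas from part (a)) plots P(accept) against p for the three plans; it is illustrative, not part of the original solution.

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.0, 0.15, 300)           # proportion non-conforming
oc_a = (1 - p)**50                         # Plan A: accept 0 of 50
oc_b = oc_a + 50*p*(1 - p)**49 + 1225*p**2*(1 - p)**48   # Plan B: accept <= 2
p0, p1 = (1 - p)**40, 40*p*(1 - p)**39     # Plan C building blocks
oc_c = p0 + p1*p0 + p1**2                  # accept 0, or 1 then 0, or 1 then 1

for oc, label in ((oc_a, "Plan A"), (oc_b, "Plan B"), (oc_c, "Plan C")):
    plt.plot(p, oc, label=label)
plt.xlabel("Proportion non-conforming, p")
plt.ylabel("Probability of acceptance")
plt.legend()
plt.show()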
will, on average, require fewer items to be inspected than the single sampling plan. This will be true for any value of p. Against this, the double sampling plan is more complex to operate.
A double sampling plan has the following two main advantages over a single sampling plan:
i) The principal advantage of the double sampling plan over the single sampling plan is that, for the same degree of protection (i.e., the same probability of accepting a lot of a given quality), the double sampling plan may have a smaller average sample number (ASN) than that of the corresponding single sampling plan. The underlying reason is that the size (n1) of the first sample in the double sampling plan is always smaller than the sample size (n) of an equivalent single sampling plan. Thus, whenever a decision is taken on the basis of the first sample alone, the ASN of the double sampling plan is lower (a common expression for the ASN is given after this list).
ii) The double sampling plan has the psychological advantage of giving a lot a second chance. From the viewpoint of the producer/manufacturer, it may seem unfair to reject a lot on the basis of a single sample. The double sampling plan permits the decision to be made on the basis of two samples.
However, double sampling plans are costlier to administer than single sampling plans.
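For reference, on the usual assumption that the second sample, when drawn, is fully inspected, the average sample number of a double sampling plan can be written as ASN = n1 + n2 × P(c1 < d1 ≤ c2), where P(c1 < d1 ≤ c2) is the probability that the first sample proves inconclusive. For the resistor plan above (n1 = 10, n2 = 20), the ASN therefore lies between 10 and 30, depending on the quality of the lots submitted.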
Chapter Seven
It may not always be possible to simplify a system so as to increase its reliability. Nevertheless, it
is usually worthwhile to pose the question “Can the system be simplified?” because it is quite
common for over-elaborate system requirements to be specified at the initial design stage. The
question “do you really need an all-singing and all-dancing system?” can encourage a design team
to reconsider whether an ultra-sophisticated system is really needed.
7.4 Reduction in Complexity
An integrated circuit usually has a lower failure rate than the group of discrete components which it replaces, and so system reliability can often be improved by replacing a circuit using many discrete components with a single integrated circuit. Custom-built integrated circuits are usually expensive, however, and so their use will usually involve a cost penalty.
7.5 Use of Fault Tolerance
A series system as described in section 7.1 can be described as “fault intolerant” because
failure of any one component will cause total system failure.
A fault tolerant system is one in which at least some parts of the system may fail without
causing total system failure. An example of fault-tolerance is in the design of a two-engine
aircraft, which is capable of flying on one engine. However, there is only limited fault-tolerance in the aircraft; a major structural failure (for example, loss of the tail plane) will still cause the aircraft to crash.
7.6 Use of Preventive Maintenance
Preventive maintenance is aimed at preventing failures and is exemplified by the regular maintenance actions taken with cars, like checking tyre pressures and checking oil and coolant levels. It may not be easy to decide during the Design Phase exactly what impact preventive maintenance will have, although the early replacement of limited-life components brings an obvious improvement in reliability. During the Production Phase, a careful study and analysis of the failures which occur in the field may indicate how, when and where preventive maintenance should be introduced or extended.
7.7 Use of Corrective Maintenance
Corrective Maintenance (repair) is described in section 4.2. It consists of those actions which return a failed system to working order. From a reliability viewpoint, corrective maintenance chiefly affects the system availability, and so faster and more effective corrective maintenance will generally decrease the system downtime and increase the system availability.
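As an illustration, with hypothetical figures: steady-state availability is commonly expressed as A = MTBF / (MTBF + MTTR). For a system with an MTBF of 1000 hours, halving the mean time to repair from 10 hours to 5 hours raises the availability from 1000/1010 ≈ 0.990 to 1000/1005 ≈ 0.995.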
[Figure: product life-cycle stages (Customer Requirement, Preliminary design, Model, Prototype Development, Manufacturing, Service, Retired) linked to an Experience Base]
2) Manufacturing Defect;
Although the design may be free from error, defects introduced at some stage in
manufacturing may degrade it. Some common examples are:
i. Poor surface finish or sharp edges (burrs) that lead to fatigue cracks, and
ii. Decarburisation or quench cracks in heat-treated steel.
The elimination of defects in manufacturing is a key responsibility of the manufacturing engineering staff, but a strong relationship with the Research and Development function may be required to achieve it. Manufacturing errors produced by the production workforce are due to such factors as lack of proper instructions or specifications, insufficient supervision, poor working environment, unrealistic production quotas, inadequate training and poor motivation.
3) Maintenance;
Most engineering systems are designed on the assumption that they will receive adequate maintenance at specified periods. When maintenance is neglected or improperly performed, service life will suffer. Since many consumer products do not receive proper maintenance from their owners, a good design strategy is to make the products maintenance-free.
4) Exceeding Design Limits;
If the operator exceeds the limits of temperature, speed, etc. for which the equipment was designed, it is likely to fail.
5) Environmental Factors;
Subjecting equipment to environmental conditions for which it was not designed, e.g. rain,
high humidity and ice usually greatly shortens its service life.
7.11 Safety Factor and Reliability
A variety of methods are used in engineering design practice to improve reliability. We generally aim at a probability of failure (Pf) of 10^-6 for structural applications and a Pf of 10^-4 to 10^-3 for unstressed applications.
The factor of safety is defined as:
Factor of safety = maximum strength / maximum stress = S_max / σ_max
Another concept is the safety margin, defined by:
Safety margin = (minimum strength − maximum stress) / minimum strength = (S_min − σ_max) / S_min
It is often believed that the use of a safety factor greater than some preconceived magnitude, usually above 2.5, will result in no failures. Actually, for the same safety factor the probability of failure may vary from satisfactorily low to intolerably high. It is known that distributions exist in both the load (stress) requirement and the available strength. It is these distributions, as defined by mean values, standard deviations and/or other parameters, with which the designer should be concerned. The safety factor concept overlooks this variability, which may give different reliabilities for the same factor.
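This point can be illustrated with the classical stress-strength interference model, assuming (for illustration only) independent, normally distributed stress and strength; the numbers below are hypothetical. Both designs share the same mean safety factor of 2.5, yet their reliabilities differ markedly because of different scatter.

from math import sqrt
from statistics import NormalDist

def reliability(mu_strength, sd_strength, mu_stress, sd_stress):
    """R = P(strength > stress) for independent normal variables."""
    z = (mu_strength - mu_stress) / sqrt(sd_strength**2 + sd_stress**2)
    return NormalDist().cdf(z)

# Same mean safety factor (250 / 100 = 2.5) in both cases:
print(reliability(250, 10, 100, 10))   # low scatter:  R close to 1.0
print(reliability(250, 60, 100, 40))   # high scatter: R about 0.98

The second figure, roughly one failure in fifty, would be intolerable in a structural application even though the nominal safety factor is identical.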
Another critical design task is to determine the reliability degradation (if any) of the system/product due to storage, packing, transportation and handling. The equipment is subjected to these environments when initially shipped from the factory and distributed to the customer, stored as a spare, returned to the depot or supplier for maintenance, and so on. Sometimes these environments include extreme conditions of rain, sand and dust, salt spray, high and low temperature and high humidity.
In the event of degradation, either additional design provisions are needed to compensate for the reduction in reliability, or an increase in the quantity of maintenance actions will result. In either case the impact on logistics support and cost is evident.
Quite often, the reliability of a system is degraded through the performance of preventive and corrective maintenance actions. Unless extreme care is taken, maintenance-induced faults may inadvertently be introduced in the accomplishment of a maintenance action, or components may be partly damaged to the extent that subsequent system failures occur more frequently than initially anticipated. This is primarily due to carelessness on the part of individuals performing maintenance, use of the wrong tools and test equipment, failure to follow approved maintenance procedures, and so on. Thus, it is extremely important that the proper logistics support resources be applied in performing system/product maintenance if the reliability of the system is to be maintained.