Reliability Book

TESTING METHODS AND RELIABILITY
Chapter One
1.0 Understanding the Basic Terms and Relationships Involved In Reliability Studies
iv. To apply methods for estimating the reliability of new designs and for analyzing
reliability data.
v. Throughput
Downtime for any reason reduces the system’s throughput. Downtime can be minimized
by applying predictive and preventive maintenance programs which are all reliability
engineering techniques. A well-maintained system maximizes throughput and minimizes
operating expenses.
vi. Cost Analysis
Manufacturers may take reliability data and combine it with other cost information to
illustrate the cost-effectiveness of their products. Such a life cycle cost analysis can show
that although the initial cost of a product might be higher than a competitor's, the overall
lifetime cost is lower because the product requires fewer repairs or less maintenance.
vii. Competitive Advantage
Many companies publish their predicted reliability numbers to help gain an advantage
over competitors who either do not publish their numbers or have lower numbers.
viii. Reputation
A company’s reputation is closely related to the reliability of its products. The more
reliable a product is, the more likely the company is to have a favourable reputation.
ix. Repeat Business
A concentrated effort towards improving reliability shows existing customers that a
manufacturer is serious about its product, and committed to customer satisfaction. This
type of attitude has a positive impact on future business.
x. Safety
Some product failures cause unintended or unsafe conditions leading to loss of life or
injury. Reliability engineering tools assist in identifying and minimizing safety risks.
xi. Distribution
Fewer failures and optimized maintenance imply fewer spare parts in the logistics
system. This minimizes the distribution system costs for transportation, logistics, and
storage of spare parts. It also minimizes service labour costs.
1.2.1 Reliability
Reliability is the capacity of an item (a part, system, product or service) to perform its required
function under given conditions for a stated time interval without failure. It is generally designated
by R.
From a quantitative point of view, reliability specifies the probability that no operational
interruptions will occur during a stated time interval. Reliabilities are always specified with
respect to certain conditions, called normal operating conditions. These include load, temperature,
and humidity ranges as well as operating procedures and maintenance schedules. Failure of users
to heed these conditions often results in premature failure of parts or of the complete system. For
example, using a 2 kVA electricity generating set to power electrical equipment rated at 3 kW
will cause the generating set to fail; so also, using a passenger car to tow heavy loads will
cause excess wear and tear on the drive train; again, driving over potholes or curbs often results in
untimely tyre failure.
This concept can be clearly understood by the use of an example. Suppose a test was started at
time t = 0 with N0 number of items, after a time period of t, Nf out of the original N0 items failed
and Ns survived. Then reliability R(t) is expressed at any time, t as:
R(t) = Ns/N0 = (number of items that survived)/(total number of items)    (1.1)

R(t) = (N0 − Nf)/N0 = 1 − Nf/N0    (1.2)
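Since the text itself presents no code, a minimal Python sketch of equation (1.2) may help; the function name and the test figures (1,000 items on test, 50 failed by time t) are illustrative assumptions, not values from the text:

```python
def reliability(n_total, n_failed):
    # Equation (1.2): R(t) = 1 - Nf/N0
    return 1 - n_failed / n_total

# Illustrative (assumed) test figures: 1,000 items on test, 50 failed by time t.
print(reliability(1000, 50))  # 0.95, i.e. 95% of the items survived to time t
```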
1.2.2 Failure
Failure is used to describe a situation in which an item stops performing its required function.
This includes not only instances in which the item does not function at all, but also instances in
which the item’s performance is subnormal or it functions in a way not intended. For example, a
smoke alarm might fail to respond to the presence of smoke (not operated at all), it might sound
an alarm that is too faint to provide adequate warning (subnormal performance), or it might sound
an alarm even though no smoke is present (unintended response).
Failure can be classified according to the mode, cause, effect and mechanism:
i. Mode: The mode of failure is the symptom (local effect) by which a failure is
observed; e.g. opens, shorts or drift for electronic components; brittle fracture, rupture, creep,
cracking, seizure, fatigue for mechanical components.
ii. Cause: The cause of a failure can be intrinsic, due to weaknesses in the item and/or wear-out,
or extrinsic, due to errors, misuse or mishandling during design, production, or
use. Extrinsic causes often lead to systematic failures, which are deterministic and
should be considered like defects. Defects are present at t = 0, even if they cannot
often be discovered at t = 0.
iii. Effects: The effect or consequence of a failure can be different if considered on the
item itself or at higher level. A usual classification is: non-relevant, partial, complete,
and critical failure. Since a failure can cause further failures, distinction between
primary and secondary failures is important.
iv. Mechanism: Failure mechanism is the physical, chemical, or other process resulting in
failure e.g. fatigue, corrosion, charge spreading (leakage currents) etc.
1.2.3 Mean Time Between Failures (MTBF)
MTBF is a key measure of reliability for repairable items. It can be expressed as the
elapsed time before an item fails, under the condition of a constant failure rate. MTBF can
also be explained as the expected value of time between two consecutive failures, for
repairable items. It is the inverse of the failure rate, λ, for constant failure rate systems or
items. For example, for a component with a failure rate of 5 failures per million hours, the
MTBF would be the inverse of that failure rate:

MTBF = 1/λ = 1/(5 failures/1,000,000 hours) = 200,000 hours/failure
It is important to note that failure rate, λ, is the number of failures per unit time. It can be
expressed as:

λ = 1/MTBF    (1.3)
On the other hand, if the test involves a number of identical pieces of equipment or items
operating under similar conditions in various systems, then the MTBF is given by:

MTBF = (total operating time × number of items)/(number of failures)    (1.4)

Implying that if 80 identical items are in operation for 20,000 hours, during which time 20 failures
occur and are rectified, then:

MTBF = (20,000 × 80)/20 = 80,000 hours
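The same computation can be sketched in Python; the function name is our own, and the figures are those of the example above:

```python
def mtbf(total_unit_hours, n_failures):
    # MTBF for repairable items: total unit-hours of operation / number of failures
    return total_unit_hours / n_failures

# The example above: 80 identical items, 20,000 hours each, 20 failures rectified.
print(mtbf(20_000 * 80, 20))  # 80000.0 hours
```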
MTTF is a key measure of reliability for non-repairable items such as filament lamps, fuses,
resistors, capacitors, etc. It is the mean time expected until the first failure of a piece of
equipment. The value of MTTF can be calculated from life test results, which can be obtained
by stressing a large number of components under known conditions for a period of time and
noting the number of failures. Then we have:
MTTF = 1/λ    (1.5)
And conversely, we have λ = 1/MTTF, where λ is independent of time in this case.
It is important to note that in the design of reliability testing, it is failure that provides the
information needed to improve the design, confirm design margins, or validate assumptions.
Generally, the type of failure indicates, at least in part, the root cause and the appropriate
corrective action. Keeping the types and sources of failure in mind may assist one to avoid or
prevent failure. By and large, failure may be categorized as follows:
i. Misuse Failure: This is failure due to improper use of an item or system. For instance,
when a product is strained beyond its stipulated capacity, it is said to have been
misused, and this can result in failure. For example, applying 230 V a.c. mains to an item
designed and rated for 110 V a.c. mains only is a misuse and could result in a misuse failure.
ii. Inherent Weakness Failure: Sometimes a system can fail even when it is operated within the
limits of its stipulated capacities. This is by reason of intrinsic weakness in the item itself.
iii. Sudden Failure: This is a failure that could not be anticipated, even after preceding tests had
been undertaken.
iv. Gradual Failure: In this type of failure, the system slowly draws near and exceeds the
failure threshold.
v. Partial Failure: This type of failure expresses itself by the item’s or system’s departure
from its features beyond a stated degree. However, it does not outrightly fail in its required
function.
vi. Secondary Failure: In this type of failure, the damage to the item actually affected (the
victim) is the result of the failure of another component (the instigator).
vii. Transient Failure: This type of failure is short-lived and often not tied to specific
conditions.
viii. Intermittent Failure: The failures under this class only occur under specific condition and
the product works otherwise.
The mode of failure describes the manner or form in which failure presents itself. This is
important when attempting to avoid, prevent, understand or resolve a failure.
During the design process, it is imperative to keep in mind that every element of a system can lead to
one or more of the above types of failure. On complex systems, it may be necessary to include
diagnostic routines to assist in determining the cause of the failure and where the damage
occurred.
One major determinant of the reliability of an item is the frequency of occurrence of failure of
the item. This is known as the failure rate of the item; it is expressed as the number of failures that
take place per unit time:
λ = lim(Δt→0) [Nf(t + Δt) − Nf(t)]/[Ns(t)·Δt] = [1/Ns(t)] × dNf(t)/dt    (1.6)
Example 1.1: In a life test where 4,040 items were tested, 40 of them failed. If the test period was
20000 hours, calculate the failure rate.
Solution
λ = [1/(4040 − 40)] × (40/20,000)

  = (1/4000) × (40/20,000)

λ = 0.0000005 failures per hour

Expressed as a percentage:

(1/4000) × (40/20,000) × 100 = 0.00005 percent per hour
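A short Python sketch of Example 1.1; the function name and argument order are our own choices:

```python
def failure_rate(n_tested, n_failed, test_hours):
    # Failure rate per hour based on the survivors Ns = n_tested - n_failed,
    # as computed in Example 1.1.
    survivors = n_tested - n_failed
    return (1 / survivors) * (n_failed / test_hours)

lam = failure_rate(4040, 40, 20_000)
print(lam)        # ~5e-07 failures per hour
print(lam * 100)  # ~5e-05 percent per hour
```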
There are two major classifications of failure rates we may possibly encounter in this text, namely
the hazard rate (instantaneous failure rate) and the averaged failure rate (AFR).

The hazard rate is a function that describes the probability per unit time that an item that has
survived to the beginning of the respective interval will fail in that interval. It is computed as the
number of failures per unit time in the respective interval, divided by the average number of
surviving cases at the mid-point of the interval.
h(t) = f(t)/(1 − F(t)) = f(t)/R(t) = instantaneous (conditional) failure rate    (1.7)

where f(t) is the failure density function: f(t) = −dR/dt
h(t)·Δt ≈ ΔR/R    (1.8)
This may be identified as the probability that a certain component will fail in the interval of time t
to (t +∆ t), given that it has survived up to the time t.
The quantity h(t)·Δt is a probability, a value from 0 to 1. Failure rate can be broken down in a
couple of ways. The instantaneous failure rate is the probability of failure at some specific point in
time. The hazard rate is therefore the failures per unit time when the time interval is very small at
some point in time, t. Thus, if a unit has been operating for a year, this calculation would provide
the chance of failure in the next instant of time.
This is not useful for the calculation of number of failures over that year; rather, it is used only to
evaluate the chance of a failure in the next moment.
It is also sometimes useful to define an averaged failure rate, AFR, over any interval of time (T1 to
T2) that averages the failure rate over that interval. This rate, denoted by AFR(T1, T2), is a single
number that can be used as a specification or target for the population failure rate over the interval.
If T1 is 0, it is dropped from the expression; thus, for example, AFR(40000) would be the average
failure rate for the population over the first 40,000 hours of operation. In general:

AFR(T1, T2) = [∫ h(t) dt from T1 to T2]/(T2 − T1) = [H(T2) − H(T1)]/(T2 − T1)
            = [ln R(T1) − ln R(T2)]/(T2 − T1)

and, when we set T1 = 0:

AFR(0, T) = AFR(T) = H(T)/T = −[ln R(T)]/T    (1.9)

where H(T) is the integral of the hazard rate h(t) from time zero to time T;
T is the time of interest, which defines a time period from zero to T; and
R(T) is the reliability function, or probability of successful operation, from time zero to T.
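A minimal Python sketch of the AFR formula above, assuming the reliability function R(t) is already known at the two endpoints; the sample figures are illustrative assumptions only:

```python
import math

def afr(r_t1, r_t2, t1, t2):
    # Averaged failure rate over (t1, t2): (ln R(t1) - ln R(t2)) / (t2 - t1)
    return (math.log(r_t1) - math.log(r_t2)) / (t2 - t1)

# Illustrative (assumed) values: R(0) = 1.0 and R(40,000 h) = 0.98.
print(afr(1.0, 0.98, 0, 40_000))  # ~5.05e-07 failures per hour
```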
For the constant failure rate (useful life) case,
λ = (number of failures)/(number of time units during which all items were exposed to failure).
The probability distribution of reliability for this case is the negative exponential distribution.

As already stated in equation (1.3), the reciprocal of λ (i.e. 1/λ) = T̄, the mean time between failures:

MTBF = T̄ = (number of time units during which items were exposed to failure)/(number of failures)
So, R(t) = e^(−λt)    (1.11)
Note that if a component is operated for a period equal to its MTBF, the probability of survival is:

R(t) = e^(−λt) = e^(−λ × 1/λ) = e^(−1) = 0.368 = 36.8%

(This indicates that the probability that any one particular item will survive to its calculated
MTBF is only 36.8%.)
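A quick numerical check of this 36.8% result in Python, using the 200,000-hour MTBF of the earlier 5-failures-per-million-hours component:

```python
import math

def reliability_exp(lam, t):
    # Constant failure rate reliability: R(t) = exp(-lambda * t)
    return math.exp(-lam * t)

lam = 1 / 200_000                     # the 5-failures-per-million-hours component
print(reliability_exp(lam, 200_000))  # 0.3678..., i.e. 36.8% survival at the MTBF
```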
While an individual component may not have an exponential reliability distribution, in a complex
system with many components the overall failures may appear as a series of random events, and
the system will follow an exponential distribution.
Under this case of constant failure rate, if a fixed number of identical components is tested, let
No be the original number of items, Nf(t) the number failed by time t and Ns(t) the number
surviving at time t.

Recall from equation (1.6) that:

λ = [1/Ns(t)] × dNf(t)/dt

λ = [1/(No(t) − Nf(t))] × dNf(t)/dt    (1.12)

Integrating both sides:

∫ λ dt from 0 to t = ∫ dNf(t)/(No(t) − Nf(t)) from 0 to Nf(t)

λt = −log_e[(No(t) − Nf(t))/No(t)]

∴ −λt = log_e[(No(t) − Nf(t))/No(t)]

So, e^(−λt) = 1 − Nf(t)/No(t)    (1.13)

But R(t) = Ns(t)/No(t) from equation (1.1),

and R(t) = (No − Nf)/No = 1 − Nf/No from equation (1.2).

Hence, R(t) = e^(−λt), as stated in equation (1.11).
It should be noted that the assumption of a constant failure rate is very often true of electronic
components during their useful life period.
1.5.1 Reliability and Unreliability and their Related Curves and Equations under Constant
Failure Rate
At any particular moment, a component or system is either operational or it has failed, and a
component's or system's functioning status varies as time advances. A functioning item will fail
in the long run; and if the item is a non-repairable one, it will remain in this failed state ad
infinitum. A repairable item will, however, continue in its failed status for as long as it takes for its
repair to be effected, and is restored to the working state at the completion of its repair. This
switch from a functioning to a failed state is referred to as failure; whereas the change from a
failed state to an operational state is called repair. It is assumed that the switch from a failed to a
working state is instantaneous, and that repair restores the item to spick-and-span
condition. For a repairable item, therefore, the repair-to-failure and
failure-to-repair sequence recurs continuously.
Reliability R(t) for non-repairable items therefore, can be defined as the probability that an item
will perform a defined function without failure under stated conditions for a stated period of time.
A clear comprehension of the probability concept is vital if one must understand the concept of
reliability. This is because the numerical values of reliability and unreliability are stated as a
probability from 0 to 1, without any units.
For repairable items, reliability, R(t), can be defined as the probability that the item suffers no
failures during the time interval zero to t1, given that the item was repaired to spick-and-span status
or was operational at t0.
Unreliability, Q(t), of an item is defined as the probability that the item suffers the first failure or
has failed once or greater than once during the time period zero to time t, taking into account that
it was operating or repaired to spick-and-span status at time zero. Unreliability, Q(t) can also be
expressed as the number of items failed at time t divided by the total number of samples tested.
Since an item must experience or suffer its first failure in the time interval zero to t, or stay
functional over this period, it is therefore proper to express their relationship as follows:
R(t) + Q(t) = 1

i.e. Q = 1 − R = 1 − e^(−λt) and R = e^(−λt), both of which are plotted against time in Figure 1.1.

Figure 1.1 Graphical Representations of Reliability and Unreliability

At time t = 1/λ, R(t) = 0.37 and Q(t) = 0.63.
By plotting time, t, against the fraction of the total components that failed, we obtain the graph of
the probability of failure of a component after time t hours. This is referred to as the probability of
failure curve, as shown in Figure 1.2. On the other hand, by plotting time, t, versus the fraction of
total survivors, the probability of survival curve is obtained (Figure 1.2). The probability of survival
is often expressed as reliability.
R(t) = 1 − Nf(t)/No(t) = e^(−λt),

and Ns = No(t) − Nf(t)

From equation (1.1), R(t) = Ns(t)/No(t) = e^(−λt)

∴ Ns = No·e^(−λt)    (1.18)
Equation (1.18) represents the equation of the graph of survivors versus time called the survivor
curve.
Relatedly, the equation of the graph of failure versus time can be derived from equation (1.12) as
follows:
R(t) = Ns(t)/No(t) = e^(−λt)

[No(t) − Nf(t)]/No(t) = e^(−λt)

No(t) − Nf(t) = No(t)·e^(−λt)

−Nf(t) = No(t)·e^(−λt) − No(t)

∴ Nf(t) = No(t)·[1 − e^(−λt)]

[Figure 1.2: the failure curve Nf(t) and the survivor curve Ns(t) plotted against time.]
Characteristically, a Bathtub curve is divided into three distinctive zones which are:
1.7.1 The Infant Mortality Period: This is also referred to as early failure period. In Figure 1.3
the slope from the starting point at the leftmost side to where it begins to flatten out, represents
this period. This period is characterized by a decreasing failure rate. This mode of failure occurs
during the early life of a population of units. It reveals the failure rate arising from frail components
that evaded final testing, checking and examination, but cave in to infant mortality when exposed
to normal operational pressure. The feeble units fail, leaving a population that is more robust and
hardy.
1.7.2 The constant Failure Rate Period: This period is the flat portion of the curve of Figure 1.3.
It is called the normal life or the “useful life.” Failures in this region occur at random. This
is the period dominated by chance failures. Chance failures are those failures that result from
strictly random or chance causes. Equipment is designed to operate under certain conditions and
up to certain stress levels. When these stress levels are exceeded due to random unforeseen or
unknown events, a chance failure will occur. While reliability theory and practice are concerned
with all three types of failures, their primary concern is with chance failures, since they occur during
the useful life period of the equipment. The time when a chance failure will occur cannot be
predicted; however, the likelihood or probability that one will occur during a given period of time
within the useful life can be determined by analyzing the equipment design. If the probability of
chance failure is too great, either design changes must be introduced or the operating environment
made less severe. The failure rate is lowest during this period; and the slope of the curve is
constant in this region which signifies constant failure rate. The amplitude on the bathtub curve is
at its lowest during this time.
1.7.3 The Wear-out Period: This period begins at the point where the slope begins to increase and
extends to the end of the curve of Figure 1.3. This is what takes place when units get old and
start failing progressively (i.e. with increasing failure rate).
1.8.1 The Infant Mortality Period: Failures in this region are usually caused by inherent defects
due to poor materials, workmanship or processing procedures at the molecular level, or by the
manufacturer's quality control, besides installation problems.
Some of the design techniques that should be put in place to ensure the integrity of the designs
include: “burn in” (this is stressing the devices under constant operating conditions); “power
cycling” (this stresses the devices under surges of turn-on and turn off); “temperature cycling”
(this stresses the devices mechanically and electrically over the temperature extremes);
“vibration” and “testing at the destruct limits” etc.
1.8.2 The Constant Failure Rate Period: In this region, failures are produced by chance or by
operating conditions, such as switching surges, lightning and operator
faults.
1.8.3 The Wear out Period: Failures during this period are due to old age; various components
are worn out; metals become embrittled, insulation dries out etc. Typical examples are electrolytic
capacitors drying out, fan bearing seizing up, switch mechanisms wearing out etc. Well
implemented preventive maintenance/replacement can delay the onset of this region.
During the wear-out stage of the lifecycle of components or devices, they begin to fatigue and
their expected useful life dwindles. The failure mode follows a symmetric distribution in which
most of the observations of the wear-out failures cluster around the central peak (mean), and the
probabilities for values further away from the mean taper off equally in both directions. This
is called the normal or Gaussian distribution. It is shown in Figure 1.4, which is a graph of wear-out
failures versus time. It is also referred to as the “bell curve”.
[Figure 1.4: wear-out failures versus time — a bell-shaped (Gaussian) curve centred on the mean wear-out life, M.]
It is expedient to state that the wear-out region is not secluded from the entire bathtub curve
structure described and illustrated in Figure 1.3; rather it is, to all intents and purposes, a
continuation of it. Hence, representing the bathtub curve with the wear-out failure region wholly
incorporated will bear a resemblance to Figure 1.5.
To begin with, while trying to grasp the concept of the Gaussian distribution, it is helpful for us to
figure out a very vital indicator, the standard deviation “σ”.
Consider running a wear-out test in which components 1, 2, 3, …, n are denoted by C1, C2, C3, …, Cn,
the wear-out life of component Ci is ti, and the total number of components engaged in the test is n. Then:

σ = √[Σ(ti − M)²/n]    (1.20)

where M is the mean wear-out life.
Next, it can be proved that for a Gaussian distribution of failures, there is a 0.6826 probability that
a failure will take place between a time of (M − σ) and (M + σ), i.e. in the vicinity of (M ± σ);
there is a 0.9544 probability that it will happen between (M − 2σ) and (M + 2σ); and there is a
probability of 0.9973 of occurrence between (M − 3σ) and (M + 3σ). These figures are most handy
when determining the confidence limits of wear-out failure estimates. Figure 1.6 clearly depicts
the preceding details.
[Figure 1.6: Gaussian distribution of wear-out life versus time, showing the M ± σ, M ± 2σ and M ± 3σ probability bands.]
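These three probabilities can be verified with the Gaussian error function; a minimal Python check (the helper name is our own):

```python
import math

def prob_within(k):
    # Probability that a Gaussian variate falls within M +/- k*sigma:
    # P = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))  # 1 0.6827, 2 0.9545, 3 0.9973
```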
Life test can be conducted to determine wear-out failure modes. This is accomplished by
subjecting a sample of the component to within-the-rated operating conditions (i.e. a real life
setting); and sustained for a sufficiently long test period, up to when the components fail. A close
inspection of the failed components with respect to their physical attributes and life expectancy,
rules out infant mortality and chance failures. For instance, if the life span of a sample of the
components is, perhaps, 6,000 hours, a component which survives for only 700 hours cannot possibly
be classified as a wear-out failure. The usual praxis for such premature failures is to exclude them
from the samples classified as having exhibited wear-out failure.
The mean time to wear-out, M can be calculated if the time to each wear-out failure is known.
If t1, t2, t3,…, tn, denote the time to each wear-out failure, then,
M = (t1 + t2 + t3 + … + tn)/n    (1.21)

M = (total time used for the test of all the components which failed due to wear-out)/(total number of components which failed due to wear-out)
Example 1.2
A set of 25 components were put into accelerated life test. The time for their failures, from the
beginning of the test to the end, are presented in hours as follows:
16, 82, 215, 610, 761, 784, 790, 798, 935, 28, 91, 310, 650, 767, 787, 792, 920, 51, 103, 420, 750,
780, 788, 796, 931. Calculate the mean time to wear-out failure.
Tips:
A close look at the trend of failure obviously shows that the first 3 failures may possibly be
classified as infant mortality period (early failure period). Again, looking at the next 8 failures,
they give the impression of a possible random failure period. Thus, the first 11 failures are left
out, as they do not have the likely traits of wear-out failure. However, the failures from 750 hours
to 798 hours look more probable to be part of the wear-out failure inclination.
Besides, there are 3 failures which display long life: 920, 931 and 935 hours. These three failures
should not be included either, because they visibly stand out as atypical of the test. They depict
long life, and if they were included in the calculation, the result would create a false notion of
a longer mean time to wear-out failure. So, our solution lies with the failures from 750 hours
to 798 hours, which is calculated as follows:

750 + 761 + 767 + 780 + 784 + 787 + 788 + 790 + 792 + 796 + 798 = 8,593

∴ Mean time to wear-out failure, M = 8,593/11 = 781 hours
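Example 1.2 can be reproduced in a few lines of Python; the 750-798 hour window used to isolate the wear-out cluster is the one chosen in the Tips above:

```python
times = [16, 82, 215, 610, 761, 784, 790, 798, 935, 28, 91, 310, 650, 767,
         787, 792, 920, 51, 103, 420, 750, 780, 788, 796, 931]

# Keep only the failures judged to belong to the wear-out cluster (750-798 h),
# excluding the early/random failures and the three long-life outliers.
wearout = [t for t in times if 750 <= t <= 798]
print(len(wearout), sum(wearout))          # 11 8593
print(round(sum(wearout) / len(wearout)))  # 781 hours
```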
1.9.2 How to Evaluate the Confidence Limit for the Gaussian Distribution
In a bid to determine the mean wear-out life of components, several sample tests are performed
from which a number of possible values of mean life will be obtained. These values are
considered as estimates of the actual mean. Given that the wear-out life is a typical Gaussian
distribution which clusters around the central mean wear-out life, as a consequence, the estimates
of the actual mean is also inclined to have the same Gaussian distribution clustering around the
actual mean wear-out life. There will therefore be a striking difference between the standard
deviation, σ, of the wear-out life distribution and that of the mean distribution, σM.
Standard deviation for the means distribution, σM = σ/√n    (1.22)
Where n is the number of components tested.
The shape of the normal means distribution curve is the same as the wear-out distribution density
curve shown in Figure 1.4. We can suitably project with certainty that:
i. There is a 0.6826 probability that all estimates of the actual mean fall within the ±σM range;
ii. There is a 0.9544 probability that all estimates of the actual mean fall within the ±2σM range;
iii. There is a 0.9973 probability that all estimates of the actual mean fall within the ±3σM range.
Example 1.3

To obtain the necessary parameters for the evaluation of σ, we tabulate and compute the grouped
failure data, from which Σfi(ti − M)² = 386,096 for n = 16 components. Hence:

σ = √(386,096/16) = 155 hours (approximately)

Standard deviation for the means distribution, σM = 155/√16 = 39 hours (approximately)

The upper confidence limit for the actual mean wear-out life at a 99.73% level of confidence is
therefore M + 3σM = M + 117 hours.
Example1.4
Forty machines are operated for 150 hours, one machine fails in 50 hours, another fails at 65
hours yet a third one fails at 70 hours. What is the MTBF?
Solution
37 machines ran the full 150 hours, while the three others ran for 50, 65 and 70 hours respectively,
so the total running time = (37 × 150) + 50 + 65 + 70 = 5,735 hours.

MTBF = 5,735/3 = 1,912 hours (approximately)
Example 1.5
(i) What is the reliability of the same machines from Example 1.4 at 400 hours? And at 600
hours?
Solution
λ = 1/MTBF = 1/1912 = 0.000523 failures/hour

R(t) = e^(−λt)

R(400) = e^(−0.000523 × 400) = e^(−0.2092) = 0.81
R(600) = e^(−0.000523 × 600) = e^(−0.3138) = 0.73

Thus, there is an 81% probability that the machine will run for 400 hours without failure, and a 73%
probability that it will run for 600 hours.
(ii) Suppose the machine’s performance is entirely dependent on one particular component.
Each time the component is replaced, the machine’s reliability returns to 100%. How often
should the component be replaced so that the machine’s reliability is never less than 90%?
Solution
R(t) = e^(−λt)

0.90 = e^(−0.000523t)

Taking the natural log of both sides, we have:

ln 0.90 = −0.000523t

t = 201.5 hours
The component should therefore be replaced every 201.5 hours.
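A small Python sketch of this replacement-interval calculation (the function name is our own):

```python
import math

def replacement_interval(lam, r_min):
    # Longest interval t with R(t) = exp(-lam*t) >= r_min, i.e. t = -ln(r_min)/lam
    return -math.log(r_min) / lam

print(round(replacement_interval(0.000523, 0.90), 1))  # ~201.5 hours
```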
Chapter Two
(b) The potential of the design to sustain satisfactory reliability level under intense
environmental conditions can be determined through reliability predictions. Thus,
predictions can be a means of assessing the necessity for environmental control systems.
(c) The effect of complexity on the likelihood of succeeding in the undertaking can be
assessed by performing a reliability prediction survey. The result from the survey may
establish the necessity for redundant systems, back-up systems, subsystems, assemblies
or component parts.
(d) A reliability prediction can also assist in evaluating the importance and gravity of
reported failures. Eventually, the outcome of a reliability prediction analysis can be
handy when conducting further necessary analyses such as Failure Modes, Effects and
Criticality Analysis (FMECA), Reliability Block Diagram (RBD) or Fault Tree Analysis
(FTA). The reliability predictions are used to evaluate the probabilities of the failure events
described in these alternative failure analysis models.
2.1.1 Some Basic Probability Rules in Relation to Reliability Calculation and Predictions
Remember that reliability is generally concerned with whether an item functions for a particular
time domain (or period), which is a question that can be answered as a probability. It is the
probability that an item will perform a required function without failure under stated conditions
for a stated period of time.
Therefore, in reliability prediction analysis, some basic probability rules are needed. A handful
of the relevant probability rules are discussed hereunder.
Under this rule, for independent events, just multiply the probability of the first event by that of the
second. For example, if the probability of event A is 4/9 and the probability of event B is 2/9, then the
probability of both events happening at the same time is (4/9) × (2/9) = 8/81 = 0.099
In a situation where the events are not necessarily mutually exclusive as presented in (a)
above, we can generalize the formula. For any two events A and B, the probability of A
or B is the sum of the probability of A and the probability of B minus the shared
probability of both A and B:

P(A or B) = P(A) + P(B) − P(A and B)
In a group of 98 students, 28 are freshmen and 34 are sophomores. Find the probability
that a student picked from the group at random is either a freshman or sophomore.
Solution
P(freshman) = 28/98 and P(sophomore) = 34/98. Therefore,

P(freshman or sophomore) = 28/98 + 34/98 = 62/98

This looks logical, since 62 of the 98 students are freshmen or sophomores.
In a group of 98 students, 30 are juniors, 40 are female, and 28 are female juniors. Find
the probability that a student picked from this group at random is either a junior or female.
Solution
P(junior) = 30/98, P(female) = 40/98 and P(junior and female) = 28/98.

Therefore, P(junior or female) = 30/98 + 40/98 − 28/98

P(junior or female) = 42/98

This is logical, since 30 are juniors and 40 are female, while 28 are female juniors.
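Both probability rules can be sketched in Python; the function names are our own, and the figures are those of the examples above:

```python
def p_both_independent(p_a, p_b):
    # Multiplication rule for independent events: P(A and B) = P(A) * P(B)
    return p_a * p_b

def p_either(p_a, p_b, p_a_and_b):
    # General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
    return p_a + p_b - p_a_and_b

print(round(p_both_independent(4/9, 2/9), 3))   # 0.099
print(round(p_either(30/98, 40/98, 28/98), 4))  # 0.4286, i.e. 42/98
```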
The binomial distribution formula is:

b(x, n, P) = nCx × P^x × (1 − P)^(n−x)    (2.5)

where:
n = number of trials
x = number of successes
P = probability of success on an individual trial
q = 1 − P = probability of failure
This formula can also be written in a slightly different way, because nCx = n!/[x!(n − x)!]. (This
formula applies factorials.) Therefore, we have the alternate binomial distribution formula:

P(x) = {n!/[x!(n − x)!]} × P^x × q^(n−x)    (2.6)
The first variable in the binomial formula of equation (2.5), n, stands for the number of times the
experiment runs. The second variable, P, represents the probability of one specific outcome. For
example, suppose you wanted to know the probability of getting a one on a die roll. If you were to
roll a die 20 times, the probability of rolling a one on any throw is 1/6. Roll twenty times and you
have a binomial distribution with (n = 20, P = 1/6). SUCCESS would be “roll a one” and FAILURE
would be “roll any other number”. If the outcome in question was the probability of the die landing
on an even number, the binomial distribution would then become (n = 20, P = 1/2), because the
probability of throwing an even number is one half (i.e. P = 3/6 = 1/2).
For the binomial distribution to be applicable, the following conditions must be satisfied: the
number of trials is fixed; each trial is independent of the others; each trial has only two possible
outcomes (success or failure); and the probability of success is the same on every trial.
Once it is established that the distribution is binomial, then one can employ the binomial
distribution formular to calculate the probability.
Example 2.3

A fair coin is tossed 10 times. What is the probability of getting exactly 6 heads?

Solution

n = 10, P = 0.5, x = 6

Applying b(x, n, P) = nCx × P^x × (1 − P)^(n−x):

= 10C6 × (0.5)^6 × (0.5)^(10−6)

= [10!/(6! × 4!)] × (0.5)^6 × (0.5)^4

= 210 × (0.5)^10

= 0.205078125
Example 2.4
90% of people who purchase facial cosmetics are women. If 10 purchasers of facial cosmetics are
randomly selected, find the probability that exactly 7 are women.
Solution
We shall work with the formula: P(x) = {n!/[x!(n − x)!]} × P^x × q^(n−x)
Step 1: Identify “n” from the problem. Using our sample question, n (the number of randomly
selected items) is 10
Step 2: Identify “x” from the problem. x (the number you are asked to find the probability for) is
7
Step 3: Work the first part of the formula, n!/[x!(n − x)!].
Substituting the variables: nCx = 10C7 = 10!/[(10 − 7)! × 7!] = 720/6 = 120

Step 4: Find P and q. P is the probability of success and q is the probability of failure.
P^x = 0.9^7 = 0.4782969
q^(n−x) = 0.1^(10−7) = 0.1^3 = 0.001

Step 5: Substitute into the formula:
P(x) = {n!/[x!(n − x)!]} × P^x × q^(n−x) = 120 × 0.4782969 × 0.001 ≈ 0.0574

Hence, the probability that exactly 7 of the 10 randomly selected purchasers are women is
approximately 0.057.
R(t) is the combined probability of individual parts’ reliability, where the unit contains quantity n
parts. The assumption is that if any part fails during operation, the entire system is considered to
have failed as a whole.
R(t) = ∏(i=1 to n) Ri(t)

The system failure rate is the summation of all the parts' failure rates, provided the system failure
rate is constant. Thus, the system MTBF is determined by taking the reciprocal of the summation of
the failure rates of all the parts:

MTBF = ∫ R(t) dt from 0 to ∞ = 1/[Σ(i=1 to n) λi]    (2.7)
Parts or sub-units 1, 2 and 3 are said to be operating in series if the failure of any one of them
results in failure of the combination. They can be regarded as “fault intolerant”. A block diagram
of a series reliability system is shown in Figure 2.1. This diagram only depicts how the system must
be treated from a reliability point of view, and does not represent the physical interconnection of
components.
[Figure 2.1: blocks R1, R2 and R3 connected in series.]
If Rs(t) is the reliability of a series system, and R1, R2, …, Rn represent the individual reliabilities
of the n parts or sub-units in series in the system for an equal interval of time, then the
reliability of the system is given by:

Rs(t) = R1 × R2 × … × Rn    (2.8)

The consequence of the above equation is that the combined reliability of two parts in series is
always lower than the reliability of either of its individual sub-units, giving credence to the dictum
that a chain is no stronger than its weakest link. Then again, if we have exponentially failing units,
the reliabilities of the sub-units will be R1(t) = e^(−λ1·t), R2(t) = e^(−λ2·t), …, Rn(t) = e^(−λn·t),
where λ1, λ2, …, λn are the respective failure rates, so that:

λs = λ1 + λ2 + … + λn    (2.9)
2.3.1 Assumptions made when making a Series Reliability Prediction by Adding Failure
Rates
2. The components are constant failure rate devices (or may be treated as such).
3. The components are considerably similar to those whose failure rates have been measured.
5. When the components are in the system, they have the same ambient conditions and stress level
as those under which the failure rates were measured (or calculated by extrapolation from
measured data).
Let the MTBF of a series system be Ms; recall that MTBF is equal to the reciprocal of the system
failure rate.
∴ Ms = 1/λs = 1/(λ1 + λ2 + … + λn)    (2.10)

If the series system contains n homogeneous sub-units, each having equal reliability, the system
reliability, Rs(t), will be: Rs(t) = e^(−nλt)

The system MTBF, Msn = 1/(nλ)
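A minimal Python sketch of equations (2.8) and (2.10); the function names are our own, and the sample reliabilities and failure rates are illustrative (they happen to match worked examples later in this chapter):

```python
def series_reliability(reliabilities):
    # Equation (2.8): Rs is the product of the unit reliabilities
    rs = 1.0
    for r in reliabilities:
        rs *= r
    return rs

def series_mtbf(failure_rates):
    # Equation (2.10): reciprocal of the summed failure rates
    return 1 / sum(failure_rates)

print(round(series_reliability([0.94, 0.90, 0.88, 0.81, 0.78]), 3))  # 0.47
print(round(series_mtbf([1/12_000, 1/8_000, 1/6_000])))              # 2667 hours
```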
Two parts or sub-units are said to be operating in parallel if the failure of a sub-unit leads to the
other sub-unit taking over the operations of the failed one. In other words, the system only fails
when all the sub-units fail. The system is operational if any one of the sub-units is functional.
A block diagram of a parallel system is shown in Figure 2.2.
[Figure 2.2: blocks R1, R2 and R3 connected in parallel.]
R1, R2, …, Rn are the reliabilities of the individual sub-units, and the parallel system reliability is:

Rp = 1 − (1 − R1)(1 − R2) … (1 − Rn)    (2.11)

It is important to state that if we have the condition R1 = R2 = R3 = … = Rn = R, then
Rp = 1 − (1 − R)^n.

For exponentially failing sub-units, R1 = e^(−λ1·t), R2 = e^(−λ2·t), R3 = e^(−λ3·t), …, Rn = e^(−λn·t), so:

Rp = 1 − (1 − e^(−λ1·t))(1 − e^(−λ2·t)) … (1 − e^(−λn·t))
These equations for parallel systems are clearly of complex exponential form, and it may not be
feasible to express the overall system reliability in a simple exponential form e^(−λp·t), as was done
for the series system. Thus, the system is not a constant failure rate system, even though each of its
sub-units is.
We call to mind equation (2.7), in which we expressed the MTBF as the integral of reliability, with
the limits of integration from 0 to ∞ (which, for the series case, reduces to the reciprocal of the
summed failure rates of all the parts):

MTBF = ∫ R(t) dt from 0 to ∞
For a parallel system with two sub-units, the MTBF is:

Mp2 = ∫ Rp2 dt from 0 to ∞ = ∫ (R1 + R2 − R1R2) dt from 0 to ∞

    = ∫ [e^(−λ1·t) + e^(−λ2·t) − e^(−(λ1+λ2)·t)] dt from 0 to ∞

∴ Mp2 = 1/λ1 + 1/λ2 − 1/(λ1 + λ2)    (2.15)
Similarly, for three sub-units:

Mp3 = 1/λ1 + 1/λ2 + 1/λ3 − 1/(λ1 + λ2) − 1/(λ2 + λ3) − 1/(λ1 + λ3) + 1/(λ1 + λ2 + λ3)    (2.16)

where λ1, λ2, λ3 are the respective sub-unit failure rates. Moreover, for an n sub-unit system
with each sub-unit possessing an equal failure rate λ, it can be proved that its MTBF (Mpn) is:

Mpn = 1/λ + 1/(2λ) + 1/(3λ) + … + 1/(nλ)    (2.17)
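Equations (2.15)-(2.17) can be generated mechanically by inclusion-exclusion over the sub-unit failure rates; a Python sketch (the function name is ours and the rate values are assumed for illustration):

```python
from itertools import combinations

def parallel_mtbf(failure_rates):
    # MTBF of a parallel system by inclusion-exclusion over the failure rates;
    # reproduces equations (2.15) and (2.16) for two and three sub-units.
    n = len(failure_rates)
    total = 0.0
    for k in range(1, n + 1):
        for group in combinations(failure_rates, k):
            total += (-1) ** (k + 1) / sum(group)
    return total

lam = 1e-4  # assumed equal failure rate per hour
print(parallel_mtbf([lam, lam]))       # 15000.0 = 1/lam + 1/(2*lam), as eq. (2.17)
print(parallel_mtbf([lam, lam, lam]))  # ~18333  = 1/lam + 1/(2*lam) + 1/(3*lam)
```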
Consider a series-parallel system as represented in Figure 2.3, whose reliability we are required
to find.
[Figure 2.3: blocks R1, R2, …, Rn in series, combined with blocks Ra, Rb, …, Rz in parallel.]

Figure 2.3 A Series-Parallel System
The under-listed steps are recommended for the determination of the reliability of the system:
1. Pick out the units which are in series within the system, and calculate the equivalent reliability
of the series units with this relationship:

Rs = R1 × R2 × … × Rn    (2.18)

2. Condense each of the parallel groups to one equivalent unit with equation (2.11). This
gives: Rp = 1 − (1 − Ra)(1 − Rb) … (1 − Rz)

3. Determine the product of the outcomes of step 1 and step 2 to obtain the overall system reliability,
Rsp, i.e.

Rsp = Rs × Rp    (2.19)
    = (R1 × R2 × … × Rn) × [1 − (1 − Ra)(1 − Rb) … (1 − Rz)]    (2.20)
Determining the MTBF of the series-parallel system as represented in Figure 2.3 involves
integrating the reliability expression for the series-parallel system. In other words, the MTBF will
be:

MTBF = ∫ Rsp(t) dt from 0 to ∞
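The three-step procedure above, sketched in Python; all function names and figures are our own illustrative assumptions:

```python
def series_r(reliabilities):
    # Step 1, equation (2.18): product of the series unit reliabilities
    out = 1.0
    for r in reliabilities:
        out *= r
    return out

def parallel_r(reliabilities):
    # Step 2, equation (2.11): one minus the product of the unreliabilities
    q = 1.0
    for r in reliabilities:
        q *= 1 - r
    return 1 - q

# Step 3, equations (2.19)-(2.20), with assumed illustrative figures:
r_sp = series_r([0.95, 0.90]) * parallel_r([0.80, 0.85])
print(round(r_sp, 4))  # 0.8294
```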
Example 2.1
Compare the reliability of a series system with that of a parallel system, given that
each system accommodates 5 sub-units having reliabilities of 0.94, 0.90, 0.88, 0.81 and 0.78
respectively.
Solution

For the series system: Rs = 0.94 × 0.90 × 0.88 × 0.81 × 0.78 = 0.470

For the parallel system:
Rp = 1 − (1 − 0.94)(1 − 0.90)(1 − 0.88)(1 − 0.81)(1 − 0.78)
   = 1 − (0.06 × 0.10 × 0.12 × 0.19 × 0.22)
   = 1 − 0.000030096
   = 0.999

The parallel arrangement is therefore far more reliable than the series arrangement.
Example 2.2
A communication system with a hard-wired microwave repeater unit, has a mean time to failure
of 60,000 hours. The system is functional if one channel is working and the reliability of the
switching unit is 0.98. Calculate the reliability for a 24-month functional period, utilizing a single
channel, two parallel channels, and three parallel channels.

Solution

[Figure 2.4: block diagrams of a single channel R; two parallel channels R feeding output RO through a switching unit; and three parallel channels R feeding output RO through a switching unit.]

Figure 2.4
Failure rate, λ = 1/60,000 = 1/(6 × 10^4) = 0.166 × 10^(−4) per hour

Functional period, t = 24 months = 2 × 8,760 = 17,520 hours

The reliability of a single channel, R1 = R = e^(−λt)

= e^(−0.166 × 10^(−4) × 17,520)

= e^(−0.290832)

R1 = 0.748
For two parallel channels: R2 = 1 − (1 − R)² = 2R − R²

R2 = 1.496 − 0.5595

R2 = 0.9365 ≈ 0.937

But the reliability of the switching unit is 0.98; hence, for the series-parallel merger:

R2s = 0.937 × 0.98

R2s = 0.918
For three parallel channels: R3 = 1 − (1 − R)³

= 1 − (1 − 0.748)³

= 1 − 0.0160

R3 = 0.984

With the switching unit:

R3s = 0.984 × 0.98

R3s = 0.964
Example 2.3
An electrical power system consists of three sections connected in series. The sections have mean
times between failures of 12,000 hours, 8,000hours, and 6,000hours respectively. Calculate the
MTBF of the system.
Solution
The failure rates λ1, λ2, λ3 stand for the three sections respectively; being in series, we have:

MTBF, Ms = 1/λs = 1/(λ1 + λ2 + λ3)    ...from equation (2.10)

But MTTF = 1/λ    ...from equation (1.5)

λ1 = 1/(12 × 10³) = 0.0833 × 10^(−3) per hour

λ2 = 1/(8 × 10³) = 0.125 × 10^(−3) per hour

λ3 = 1/(6 × 10³) = 0.167 × 10^(−3) per hour

Ms = 1/[(0.0833 + 0.125 + 0.167) × 10^(−3)] hours

   = 2,667 hours (approximately)
A system in which the components are arranged to give parallel reliability is said to be redundant;
there is more than one mechanism for the system functions to be carried out. In a system with full
active redundancy all but one component may fail before the system fails.
39
There are other systems with partial active redundancy, in which certain components can fail
without causing system failure but more than one component must remain operating to keep the
system operating; a typical example is a four engine aircraft that can fly on two engines but would
lose stability and control if only one engine were operating. This type of situation is known as an
N-out of M- unit network. At least N-units must function normally for the system to succeed
rather than one unit in the parallel case and all unit in the series case.
The reliability of an N-out-of-M-unit system is given by the binomial distribution, on the assumption
that each of the M units is independent and identical:

R(N/M) = Σ(i=N to M) (M choose i) × R^i × (1 − R)^(M−i)

where (M choose i) = M!/[i!(M − i)!]
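A Python sketch of the N-out-of-M formula; note that the unit reliability of 0.90 in the demonstration line is an assumed illustrative value, not one taken from the text:

```python
from math import comb

def n_out_of_m(n, m, r):
    # Reliability of an N-out-of-M system of identical, independent units:
    # sum of binomial terms for i = n .. m working units.
    return sum(comb(m, i) * r**i * (1 - r)**(m - i) for i in range(n, m + 1))

# 2-out-of-4 sub-system with an assumed unit reliability of 0.90.
print(round(n_out_of_m(2, 4, 0.90), 4))  # 0.9963
```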
Example 2.4
A complex engineering design can be described by a reliability block diagram as shown in Figure
2.5. In sub-system A, two of the four components must operate for the sub-system to function
successfully; sub-system C has true parallel reliability. Calculate the system reliability.

[Figure 2.5: sub-systems A, B and C connected in series — reliability block diagram depicting a complex engineering design.]

Solution
RA = Σ(i=2 to 4) (4 choose i) × R^i × (1 − R)^(4−i), with at least i = 2 components operating

RA = 0.998

RB = 0.97

RC = 1 − (1 − 0.85)³
   = 1 − (0.15)³
RC = 0.997

Since the three sub-systems are in series:

RS = RA × RB × RC
   = 0.998 × 0.970 × 0.997
RS = 0.965
Redundancy in a system means that there exists an alternative parallel path for the successful
operation of the system. Put in another form, it is the existence of more than one means (in an
item) for performing the required function.
Redundancy is therefore a fundamental tenet of reliability engineering that makes room for high
reliability, availability, and/or safety at equipment or system level. It is a well-known creed in
reliability engineering circles that as the complexity of a system increases, the reliability dwindles
unless compensatory measures are taken.
For instance, if we have a system with n components connected in parallel and at least K,
( 1 ≤ K ≤ n ) , components are needed for the successful operation of the system, we say that the
number of basic components is K and the remaining (n-K) components are known as redundant
components. In fact, this system is known as a K-out-of- n system, and is discussed later in this
chapter.
In this case, multiple units are connected in parallel, energized and subjected to the same load
simultaneously to perform the given function. Here load sharing is possible, yet the expected
function can be performed even if only one out of the several units is working. Figure 2.7
illustrates this concept.
[Figure 2.7: n units connected in parallel between a common input and output.]

Such an arrangement behaves as a k-out-of-n system if 1 < k < n. That is, the redundant elements
(units) are subjected to a lower load until one of the operating elements fails.
2.4.1.3 Conditional (Majority Voting) Redundant System
The output in this type of design agrees with the majority (2-out-of-3) as shown in Figure
2.8. This is typical of triplicated processor modules. It means that the system can tolerate
(mask) a single module failure only.

[Figure 2.8: three parallel modules, 1, 2 and 3, feeding a majority voter.]

Figure 2.8 Conditional (Majority Voting) Redundancy
Refer to Figure 2.6, where we have classified redundancy broadly into two categories, namely,
active redundancy and standby redundancy. By active redundancy, we mean that all the
components connected in parallel are turned on at the beginning of operation of the system, and
continue to perform until they fail. Thus, in active redundancy, all components are simultaneously
in the operating mode. We have discussed two such systems, namely, the parallel system and k-
out-of-n system, where all the components of the system are turned on at the beginning of the
operation of the system. So, all the components are simultaneously in the operating mode. On the
other hand, in standby redundancy, the components are connected in parallel but do not start
operating simultaneously from the beginning of the operation of the system. In standby
redundancy, the component(s) in operating mode is/are known as normally operating
component(s). The component(s) kept in standby or reserve mode is/are known as standby
component(s). Other than this, there is a changeover device. The function of the changeover
device is to sense the failure of a normally operating component and, in case of a failure, to bring a
standby component into the normal operating mode.
To explain the concept of standby system, let us first consider the simplest case of the two
components (say, A and B) standby system.
A switch in the standby systems can put any component into operation. In the 2-component
standby system, initially this switch is connected to component A and turns it on, as shown in
Figure 2.9. In Figure 2.9, the switch is represented by S. In this setup, component B will remain in
standby (reserve or inoperative) mode till such time as component A performs its function
successfully. As soon as component A fails, the switch senses the failure and puts the component
B into operation.
Figure 2.10: A typical n-component standby system for the case of only one component in
normally operating mode.
In the discussion so far, we have not considered a key point of the standby system: that the switch
itself may fail during the operation of a component or during the time of changeover of a
normally operating component to a standby component. Due to this feature of the standby system,
we divide the discussion of this section under the following two heads:
a) Standby system with perfect switching, and
b) Standby system with imperfect switching.
(a) Standby system with perfect switching: For an n-component standby system (with one
normally operating component), if Qi is the unreliability of component i given that components
1 to (i − 1) have failed, then the unreliability of the system is:

Q = Q1 Q2 Q3 … Qn    (2.21)

and the reliability is:

R = 1 − Q    (2.22)
In particular, in the case of one normally operating component and one standby component, the
unreliability of the system is given by
Q = Q1 Q2 (2.23)
It should be noted that the unreliability Q2 of component 2 is not the same as in the case
of a parallel system, because in a standby system component 2 is used for a shorter duration than
in the parallel case.
Example 2.5
A standby system has three components 1, 2, 3, where component 1 is normally operating and
components 2, 3 are standby components. The reliability of component 1 is 0.95. The reliability
of component 2 given that component 1 has failed is 0.96 and that of component 3 given that
components 1 and 2 have failed is 0.98. Evaluate the reliability of the system under the
assumption that the switch is perfect.
Solution
We know that for an n-component standby system (with one normally operating component), if
Qi is the unreliability of component i given that components 1 to (i − 1) have failed, then the
reliability (R) of the system is given by:

R = 1 − Q1 Q2 Q3 … Qn

In this case,

Q1 = 1 − 0.95 = 0.05, Q2 = 1 − 0.96 = 0.04, Q3 = 1 − 0.98 = 0.02

∴ R = 1 − Q1 Q2 Q3
    = 1 − (0.05)(0.04)(0.02) = 1 − 0.00004

R = 0.99996
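Example 2.5 in a few lines of Python (the function name is our own):

```python
def standby_reliability(conditional_unreliabilities):
    # Perfect switching, equations (2.21)-(2.22): R = 1 - Q1*Q2*...*Qn,
    # where each Qi is conditional on the earlier components having failed.
    q = 1.0
    for qi in conditional_unreliabilities:
        q *= qi
    return 1 - q

# Example 2.5: Q1 = 0.05, Q2 = 0.04, Q3 = 0.02
print(round(standby_reliability([0.05, 0.04, 0.02]), 5))  # 0.99996
```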
(b) Standby system with imperfect switching: For a two-component standby system in which the
switch can fail only at the time of changeover, the unreliability is:

Q = Ps QA QB + P̄s QA    (2.24)

where,
Ps = probability of successful changeover
P̄s = 1 − Ps = probability of unsuccessful changeover
QA = unreliability of component A
QB = unreliability of component B given that component A has failed
Example 2.6
Consider a two-component standby system with A as normally operating component and B as
standby component. The reliability of component A is 0.90 while the reliability of component B
given that A has failed is 0.95. Assume that the switch can fail only at the time of changeover
with a probability of failure 0.03. Evaluate the reliability of the system.
46
Solution
Recall that:
Ps = probability of successful changeover = 1 − 0.03 = 0.97
P̄s = 1 − Ps = probability of unsuccessful changeover = 0.03
QA = unreliability of component A = 1 − 0.90 = 0.10
QB = unreliability of component B given that component A has failed = 1 − 0.95 = 0.05

Then from equation (2.24), the unreliability (Q) of the two-component standby system, under the
assumption that the switch can fail only during the time of changeover, is given by:

Q = Ps QA QB + P̄s QA
  = (0.97 × 0.10 × 0.05) + (0.03 × 0.10)
  = 0.00485 + 0.003
Q = 0.00785

Hence, the reliability of the system, Rs = 1 − 0.00785 = 0.99215
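Equation (2.24) and Example 2.6 sketched in Python (the function and parameter names are our own):

```python
def standby_unreliability(ps, qa, qb):
    # Equation (2.24), switch able to fail only at changeover:
    # Q = Ps*QA*QB + (1 - Ps)*QA
    return ps * qa * qb + (1 - ps) * qa

q = standby_unreliability(ps=0.97, qa=0.10, qb=0.05)
print(round(q, 5), round(1 - q, 5))  # 0.00785 0.99215  (Example 2.6)
```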
Chapter Three
3.1 Causes, Effects and Remedies of Environmental Factors on Component/ Equipment
Failure.
The constituents of electronic equipment are electronic components or its integration with
electromechanical devices. Again, incorporation of various equipment makes up a system. It is
therefore implicit that the study of the failure of electronic components is an allusion to the failure of
both equipment and systems, since there is a critical link between them. Consequently, the failure
of a component in equipment could possibly bring about the failure of that equipment, and by
extension, the failure of all or part of the system.
Electronic components, equipment and systems are presupposed to operate in diverse climatic
conditions such as tropical, arctic, desert, high altitude, radiation, including transport hazard and
mechanical shocks and vibrations. These factors, in one way or the other, impact on the reliability
of electronic components, equipment and systems.
The causes of failure of components can be classified into two groups, namely Environmental
Stresses and Electrical Over-stress. These two main classes of stress and failure of components can
be further subdivided as follows:

3.1.1 Environmental Stresses
i) Atmospheric temperature
ii) Humidity
iii) Shock and Vibration
iv) Generated heat
v) Atmospheric pressure
vi) Wind, air and dust
vii) Electromagnetic radiation
viii) Electrostatics
3.1.2 Electrical Over-stress
i) Voltage over-stress
ii) Current over-stress
iii) Frequency variation
A. Environmental Stresses
i) Effects of Atmospheric temperature
The extreme temperatures experienced in the arctic and desert regions, and the fluctuations in
temperature noticeable within some territories on a daily basis and from season to season, subject
components to stress. This is capable of promoting mechanical failure. Literature has it that the
failure rate of components approximately doubles with every 10°C rise in temperature for a specific
applied voltage. The following are the specific effects of high temperature and temperature
fluctuation on components:
(a) Thermal ageing and Oxidation: – Loss of electrical quality/ change of electrical properties
like increase in power factor, decrease of dielectric strength and insulation failure.
(b) Physical Expansion: - Structural failure, differential expansion of different materials can
cause distortion of assemblies, rupturing of seals and wear or binding on moving parts.
(c) Loss or Change of Viscosity, Evaporation:- Loss of lubrication properties,
structural/mechanical failure (breakage or fracture, seizure)
(d) Softening/melting: - Internal temperature of equipment may approach a value where low
melting point materials such as grease, protective compounds and waxes become soft or even
begin to flow. This may lead to structural failure, physical breakdown or penetration of
sealing leading to internal electrical breakdown.
(e) Chemical decomposition: - Decomposition of organic materials increases, rubber materials
harden. This may change the initial physical or electrical constants. The ultimate cause of any
of these effects can be physical or chemical change in the material and hence variation in
characteristics of components. Excess temperature is perhaps the most destructive
environmental factor associated with electronic/electrical components and equipment. Hence,
the development of new stable materials for the improved performance of components has been a
continuous process.
In contrast, cold/arctic temperature conditions also have some negative impacts on components.
They result in physical contraction, solidification and increased viscosity. Furthermore, there
could be changes in electrical properties due to the different temperature coefficients of various
component parts such as capacitances, resistances and inductances. Additionally,
embrittlement occurs in metallic and non-metallic materials as a result of extreme cold
temperatures. This brings about loss of mechanical strength, cracking and fracture. Physical
breakdown of sealing due to shrinkage and cracking, leading to electrical breakdown, ensues.
Remedies
To redress the effects of high temperature on components, the use of appropriate heat sinks,
the provision of air vents and/or the incorporation of forced air cooling should be employed.
Furthermore, component materials with low coefficients of expansion and other low
temperature characteristics should be used maximally in systems and equipment that
are meant to function in high temperature regions.
For cold/arctic temperature zones, while the appropriate choice of materials and components for
equipment design and construction is critical, indirect heating of the equipment to control the
temperature is recommended.
ii) Effects of Humidity
A continuous film of moisture can form on exposed component surfaces within
a few seconds if the relative humidity is 100%. This will lead to insulation breakdown, change
of dielectric properties and external electrical failure like tracking, insulation flashover etc.
Furthermore, formation of fungi growth is stimulated, which also brings about degradation of
insulation. Corrosion (structural/mechanical failure) is another effect of humidity. This brings
about interference with function, internal electrical failure and change of physical or electrical
constants.
It has been proven that, for a given atmospheric moisture content, the relative humidity increases
as the temperature falls, and vice versa. Therefore, in environments where there are sudden
temperature drops, especially at night, condensation is sure to occur, leading to the
formation of water films on components.
Remedies
Only a few materials, such as silicones, polystyrene and some polymers, can stop the
formation of a continuous moisture film, but they have poor resistance to fungal growth. To
minimize the effects of humidity, therefore, insulating materials used for the casing of equipment
should be such that they do not absorb moisture, stimulate fungal growth or hold up water films.
Concisely, the remedy is to completely enfold any component, or the entire equipment, that is
exceptionally prone to humidity.
iii) Effects of Shock and Vibration
When equipment is being transported or moved from one place to another, it is
susceptible to shock, vibration, bump and drop. Structural collapse, loss of mechanical
robustness, breakage, fracture, cracking etc. are some of the possible after-effects. Other upshots
are physical breakdown of sealing, and complete disconnection or intermittent electrical contact.
Electromechanical components like relays and contactors, and heavy components such as
transformers, are more vulnerable to the negative effects of shock and vibration. On the other
hand, electronic components, which are generally small in nature, experience comparatively
little effect from shock and vibration.
Remedies
To reduce the effect of shock and vibration, design best practices are employed, utilizing anti-
vibration mountings, locking nuts, shake-proof and lock washers. Furthermore, sensitive
components can be prevented from shock and vibration by enfolding them with protective
padding materials.
iv) Effects of Generated Heat
Devices and equipment generate heat internally during operation. Semiconductor devices
carry significant current during operation and hence generate a considerable amount of heat.
This generated heat is capable of causing failure of such devices when it surpasses a particular
level of tolerance. Furthermore, the normal operating specifications of the device can as well be
affected, resulting in an undesirable increase in chemical reaction and rapid ageing.
Remedies
The impact of generated heat can be reduced by the use of a suitable heat sink, the use of forced air
cooling devices like fans, as well as making provision for adequate ventilation in the design of
such equipment. Again, thoughtful selection of components with good temperature and low
expansion attributes can also be appropriate. Lastly, generated heat can also be reduced by
the derating technique, which allows the item to function at an applied voltage, current or power
unlikely to set off unwarranted internally generated heat. The derating technique therefore lessens
the temperature rise arising from internally dissipated heat and hence shrinks the failure rate.
vi) Effects of Wind, Air and Dust
Earthly elements such as wind, air and dust undoubtedly impinge on components and
equipment operated alfresco. For instance, components or equipment operated in a coastal
environment are susceptible to saline air and its effects. Also, the circulation of dust in equipment
may instigate tracking (leakage of current) within devices, particularly in switches. This may
set off early failure. Moreover, dust admittance could trigger gradual breakdown of
insulation and build-up of contact resistance. A saline atmosphere can set off corrosion,
resulting in structural/mechanical failure such as breakage, fracture, seizure etc. Besides,
physical breakdown of sealing may as well arise, in addition to changes in initial physical or
electrical constants.
Wind on the other hand can bring about vibration, rocking and excessive movement with
consequent physical breakdown of sealing, breakage or fracture which may lead to electrical
breakdown or loss of electrical quality.
Remedies
Curtailing the effects of these air-contaminating elements will entail setting up appropriate
enfolding around the components/equipment. Again, periodic dusting of the
equipment/components should be carried out. For the coastal atmosphere, a suitable physical
protective covering should be used to shield the component from the effects of the saline
atmosphere.
vii) Effects of Electromagnetic Radiation
Electromagnetic radiation has the potential of interfering with electrical signals, components,
electronic devices and systems, creating radiation-induced effects.
Remedies
Principally, the effects of electromagnetic radiation are mitigated by providing shielding and by
insightful part selection during equipment design.
Electrostatic discharge (static electricity) can affect electronic circuit and components in
diverse ways. Some of the effects may be instantaneous while others may appear weeks or
years later. While some items in today’s workplace can store thousands of volts in
electrostatic charges, as little as 25 volts of electrostatic discharge can damage an integrated
circuit irreparably. This natural phenomenon has only become an issue since the widespread
use of solid-state electronics. All materials (insulators and conductors alike) are sources of
electrostatic discharge. The amount of electrostatic charge that can accumulate on any item is
dependent on its capacity to store a charge. Sources of electrostatic discharge are:
a) Human contact with sensitive devices (humans can only feel an electrostatic
discharge above about 4,000 V);
b) Troubleshooting electronic equipment or handling of printed circuit boards without
using an electrostatic wrist strap;
c) Placement of synthetic materials (i.e. plastic, Styrofoam, etc.) on or near electronic
equipment, and
d) Rapid movement of air near electronic equipment (including using compressed air to
blow dirt off printed circuit boards, circulating fans blowing on electronic equipment,
or using an electronic device close to an air handling system).
In all of these scenarios, the accumulation of static charges may occur without your knowledge.
Furthermore, a charged object does not necessarily have to contact the item for an electrostatic
discharge event to occur.
Environmental stress hastens the onset of wear-out by contributing to physical deterioration. The
preceding discussion on environmental stress is summarized in Table 3.1.
B. Electrical Over-stress
This is a condition in which a device, electrical circuit or component is exposed to current,
voltage or frequency beyond its maximum rated value. Each of these
possible over-stresses is discussed below:
i) Voltage Over-stress
This may occur when an electronic device is being switched on. Although this could
happen transiently, its magnitude may greatly exceed the steady-state value. This could
have an adverse effect on the device. It is expedient, therefore, to maintain the applied
voltage within an acceptable tolerance of the rated voltage in order to minimize the
failure rate. Voltage over-stress can be mitigated by designing a robust system through
derating techniques. Derating is an intentional process applied to every component of a
product to reduce the chance of a component witnessing more stress than it is capable of
withstanding.
ii) Current Over-stress
Current over-stress has similar effects on components/devices as voltage over-stress. It
could be sudden as well as transient, and so should be anticipated during design by
providing for its mitigation by means of derating techniques.
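As a simple illustration of how a derating policy can be checked at design time, the following sketch tests whether an applied stress respects a chosen derating factor. The function name, values and the 0.75 factor are illustrative assumptions, not figures from this text.

```python
# Minimal sketch of a derating check: a component passes if every applied
# stress is within the derated limit, i.e. applied <= factor * rated.

def within_derating(applied: float, rated: float, derating_factor: float) -> bool:
    """Return True if the applied stress respects the derating policy."""
    return applied <= derating_factor * rated

# Example: a part rated at 50 V with a 0.75 derating factor should not
# see more than 37.5 V in service.
print(within_derating(applied=35.0, rated=50.0, derating_factor=0.75))  # True
print(within_derating(applied=40.0, rated=50.0, derating_factor=0.75))  # False
```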
55
The majority of failures are attributable to one of the following physical or chemical phenomena.
Alloy Formation: Formation of alloys between gold, aluminium and silicon causes what is known
as ‘purple plague’ and ‘black plague’ in silicon devices.
Biological effects: Moulds and insects can cause failures. Tropical environments are particularly
attractive for moulds and insects, and electronic devices and wiring can be affected.
Chemical and electrolytic changes: Electrolytic corrosion can occur wherever potential
difference together with an ionizable film are present. The electrolytic effect causes interaction
between the salt ions and the metallic surfaces, which act as electrodes. Salt laden atmospheres
cause corrosion of contacts and connectors. Chemical and physical changes to electrolytes and
lubricants both lead to degradation failures.
Contamination: Dirt, particularly carbon or ferrous particles, causes electrical failure. The former
deposited on insulation between conductors leads to breakdown and the latter to insulation
breakdown and direct short circuits. Non-conducting material such as ash and fibrous waste can
cause open-circuit failure in contacts.
Depolymerization: This is a degrading of insulation resistance caused by a type of liquefaction in
synthetic materials.
Electrical contact failures: Failures of switch and relay contacts occur owing to weak springs,
contact arcing, spark erosion and plating wear. In addition, failures due to contamination, as
mentioned above, are possible. Printed-board connectors will fail owing to loss of contact
pressure, mechanical wear from repeated insertions and contamination.
Evaporation: Filament devices age owing to evaporation of the filament molecules.
Fatigue: This is a physical/crystalline change in metals that leads to spring failure, fracture of
structural members, etc.
Film deposition: All plugs, sockets, connectors and switches with non-precious metal surfaces
are likely to form an oxide film, which is a poor conductor. This film therefore leads to high-
resistance failures unless a self-cleaning wiping action is used.
Friction: Friction is one of the most common causes of failure in motors, switches, gears, belts,
styli, etc.
Ionization of gases: At normal atmospheric pressure a.c. voltages of approximately 300V across
gas bubbles in dielectrics give rise to ionization, which causes both electrical noise and ultimate
breakdown. This reduces to 200V at low pressure.
Ion migration: If two silver surfaces are separated by a moisture-covered insulating material then,
providing an ionizable salt is present as is usually the case, ion migration causes a silver ‘tree’
across the insulator.
Magnetic degradation: Modern magnetic materials are quite stable. However, degraded magnetic
properties do occur as a result of mechanical vibration or strong a.c. electric fields.
Mechanical stresses: Bump and vibration stresses affect switches, insulators, fuse mountings,
component lugs, printed-board tracks, etc.
Metallic effects: Metallic particles are a common cause of failure as mentioned above. Tin and
cadmium can grow ‘whiskers’, leading to noise and low-resistance failures.
Moisture gain or loss: Moisture can enter equipment through pin holes by moisture vapor
diffusion. This is accelerated by conditions of temperature cycling under high humidity. Loss of
moisture by diffusion through seals in electrolytic capacitors causes reduced capacitance.
Molecular migration: Many liquids can diffuse through insulating plastics.
Stress relaxation: Cold flow (‘creep’) occurs in metallic parts and various dielectrics under
mechanical stress. This leads to mechanical failure. This is not the same as fatigue, which is
caused by repeated movement (deformation) of a material.
Temperature cycling: This can be the cause of stress fluctuations, leading to fatigue or to
moisture build-up.
Table 3.2: Specific failure mechanisms of microelectronic devices (percentage of failures, as reported by four data sources)

Failure mechanism        Source A   Source B   Source C   Source D
Metalization                18         50         25          –
Diffusion                    1          1          9         55
Oxide                        1          4         16          –
Bond – die                  10         10          –         25
Bond – wire                  9         15          1          –
Packaging/hermeticity        5         14         10          –
Surface contamination       55          5         25         20
Cracked die                  1          1          –          –
Mechanical movement          –          –          –          –
3.2.2 Part Selection
Since hardware reliability is largely determined by the component parts, their reliability and
fitness for purpose cannot be over-emphasized. The choice often arises between standard parts
with proven performance which just meet the requirement and special parts that are totally
applicable but unproven. Consideration of design support services when selecting a component
source may be of prime importance when the application is a new design. General considerations
should be:
o Function needed and the environment in which it is to be used;
o Critical aspects of the part such as, for example, limited life, procurement time,
contribution to overall failure rate, cost, etc;
The stress ratio is defined as the ratio of applied stress to rated stress, for example, the ratio of
applied voltage to rated voltage in capacitor applications. Generally, as stress increases, failure rate
also increases, usually exponentially; conversely, as stress reduces, failure rate reduces. However,
care must be taken when applying derating as a method of improving reliability, because at very
low stress ratios failure rate may again increase.
History shows that a significant number of equipment failures arise from inadequate design
margins. The derating factors listed for electronic equipment in Table 1 should not only ensure
that components are operated well within the recommended limits of stress, but also provide in
most cases a sufficient design margin to accommodate minor variations in environmental stress,
power supply levels, transients, etc.
Failures caused by stress transients in the operational environment are in fact often due to
inadequate design margins. Test conditions seldom reproduce these transients, and failures of this
kind are, therefore, difficult to diagnose in the field. Derating can eliminate many such potential
problems.
Electronic components are in general subject to at least two stresses, an electrical stress, with
increasing tendency to breakdown due to voltage, current or power and a thermal stress due to its
own power dissipation and, in part, to the total dissipation of neighbouring components and/or the
local environment. Reducing electrical stress will indirectly reduce thermal stress and lead to
improved failure rates.
Failure rates for generic component types invariably assume that the failure rates are constant
with time and that the components are conservatively rated. Thus predictions based on component
count procedures pre-suppose that derating will be applied.
The methods described in this chapter are aimed at reducing failures by increasing design
margins, i.e. the margin of design strength over expected stress. To make an impact on the overall
system failure rate a derating policy must be applied to as many components as possible. In some
cases this may incur weight or space penalties; however, such cases should not prevent the policy
being applied as far as practicable.
General
The reliability of electronic components decreases when they are operated at high stress levels.
These stresses are primarily temperature, voltage, current and power dissipation. Heat-generating
components in particular, such as transistors, resistors, valves and transformers, are susceptible to
these stresses which result in degraded performance and accelerated failure.
The problem is one in which the materials employed in the construction of the component have
upper and lower temperature design limits, beyond which performance changes develop or
catastrophic failures occur. This problem can be brought under control by ensuring that the
component functions within its design rating. It is mostly a heat balance problem and can usually
be solved by keeping components cool enough to function reliably.
The theoretical justification for derating is discussed as a means of reducing thermal and electrical
stress, and derives two mathematical models relating component failure rates to stress conditions.
Under normal operating conditions, a component is considered to have failed when its design
parameters have changed beyond the limits of its acceptance specification due to degradation
processes. In a structurally sound component, most processes of degradation are primarily
dependent upon chemical reaction and include such phenomena as hot-spot formation, increased
carrier generation, parametric degradation, aluminium migration and gold-aluminium interdiffusion.
In 1889 Arrhenius suggested an empirical model for the rate at which chemical reactions occur at
different temperatures. The Arrhenius chemical reaction rate law is:
Chemical Reaction Rate = A exp(−E/RT)

where: A is a constant;
E is the activation energy;
R is Boltzmann’s constant (or the gas constant, depending on the units in which E is expressed);
T is the absolute temperature.
Where it is assumed that failure rate is directly proportional to chemical reaction rate, the failure
rates to be expected at different temperatures can be estimated, e.g.
λ/λ₀ = B exp(−E/RT) / B exp(−E/RT₀)

∴ λ/λ₀ = exp[(E/R)(1/T₀ − 1/T)]   (**)
The relationship between failure rate and temperature can be seen from equation **.
This approach assumes that the reaction rate of the failure mechanism is related to degradation time
and therefore to component failure rate. However, the activation energy of a particular failure
mechanism is not the same as the apparent activation energy to reach some limit of parameter
variation, since several failure mechanisms may be operating at the same time to produce one
apparent component failure activation energy.
Where a single failure mechanism is considered, a reliability analyst may attempt to determine the
apparent activation energy (E) of that failure mechanism using time to failure data from a suitable
sample. A graph of time to failure against the reciprocal of temperature, using Log/Linear graph
paper, will result in a straight line (the Arrhenius Line) when the theory applies, and the activation
energy that is sensibly constant for each failure mechanism, can then be determined from the slope of
the line.
Supposing that T₀ = 288 K (15°C) and T = 303 K (30°C), and assuming a value for the apparent
activation energy E of approximately 1 eV (say 0.92 eV), then λ = 2λ₀.
This implies a doubling of failure rate for an increase of 15°C or, conversely, a halving of failure
rate for a decrease of 15°C for the example chosen.
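As a quick numerical check of equation (**), the sketch below evaluates the failure-rate multiplier for a chosen activation energy and temperature pair. The constant and the example activation energy are assumptions for illustration (with E expressed in eV, the Boltzmann constant in eV/K is used); note that the multiplier is very sensitive to the assumed value of E: about 0.35 eV yields a doubling for a 15°C rise, while values near 1 eV yield considerably larger factors.

```python
import math

# Failure-rate multiplier from the Arrhenius relationship of equation (**):
#   lambda / lambda0 = exp[(E/k) * (1/T0 - 1/T)]
# E is in electron-volts, so k is the Boltzmann constant in eV/K.
K_EV = 8.617e-5  # Boltzmann constant, eV/K

def arrhenius_multiplier(e_ev: float, t0_k: float, t_k: float) -> float:
    """Factor by which failure rate changes when temperature rises from t0_k to t_k."""
    return math.exp((e_ev / K_EV) * (1.0 / t0_k - 1.0 / t_k))

# Illustrative value: E of about 0.35 eV gives roughly a doubling of
# failure rate for a rise from 288 K (15 degC) to 303 K (30 degC).
print(round(arrhenius_multiplier(0.35, 288.0, 303.0), 2))  # ~2.0
```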
Typically, failure rates change by a factor of between 1.1 and 2.0 for a temperature change of 15°C,
the higher factor applying to transistors and some capacitors and the lower factor being
appropriate for resistors. In general, the relationship between failure rate and temperature is that given
by equation (**), i.e. failure rates increase exponentially as temperature increases.
Items such as capacitors that are subject to voltage stress also need to be derated to reduce failures due
to dielectric breakdown. It has been suggested that failure rate is related to dielectric stress by a 5th
Power Law, which states that the life of an item, i.e. its mean time to first failure, is inversely proportional
to the fifth power of the dielectric stress. On this model, for example, halving the dielectric stress would
extend expected life by a factor of 2⁵ = 32.
Data sources that relate failure rate to stress do not generally indicate
such drastic changes, suggesting that a fifth power law is pessimistic for many individual
component types and may not apply for stress ratios less than 0.5. In general, the relationship
between failure rate and voltage stress is that failure rate increases according to a power law as stress
increases.
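A worked illustration of the power-law relationship may help: under an n-th power law, the life multiplier obtained by reducing stress is simply the stress ratio raised to the power n. The sketch below is a minimal illustration; the values are assumptions.

```python
# Fifth Power Law sketch: mean time to first failure is inversely
# proportional to the fifth power of dielectric stress.

def life_multiplier(old_stress: float, new_stress: float, n: int = 5) -> float:
    """Factor by which expected life changes when stress moves from old to new."""
    return (old_stress / new_stress) ** n

# Halving the dielectric stress (e.g. from 1.0 to 0.5 of rated voltage)
# extends expected life by a factor of 2**5 = 32 on this model.
print(life_multiplier(1.0, 0.5))  # 32.0
```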
Resistors
Given that resistors are properly made, the two principal influences on component failure rate are
temperature and electrical stress. Derating characteristics for resistors specify a maximum stress for
these two critical parameters by limiting the power dissipated.
The power rating of resistors is dependent upon the manufacturing techniques and materials used, and
limited by a maximum hot-spot temperature. The power that can be developed in a resistor body
depends upon how effectively the dissipated energy is carried away and is therefore a function of the
local temperature and heat transfer conditions. At all temperatures above the rated temperature for the
type, resistors should be derated.
Stress Ratio = Operating Power / Rated Power = 80%
This recommended stress ratio provides a sufficient design margin in most practical cases and in
addition increases resistor stability.
For variable resistors the operating current in any part of the resistor is the critical stress
condition. The stress ratio for variable resistors is given by:
Stress Ratio = Operating Power / Rated Power = 75%
In this case the derating factor is more stringent than that for fixed resistors since variable
resistors have, in general, a higher failure rate than fixed resistor types and a greater design
margin is needed.
Power is not, however, the only quantity in which stress ratings are specified. For example, a
resistor may be rated at 300 mW dissipation in free air at 20°C, or the same type may be rated at
250 V d.c. across the resistor. In every case the ‘limiting element voltage’ specified for the type
must not be exceeded.
Semi-Conductor Devices
Transistors
Transistors can be destroyed by exceeding the manufacturer’s voltage rating even for a few
microseconds. Transient voltage spikes of comparatively small magnitude and very short duration are very
difficult to trace and can often be the reason for circuit failures that appear to have no obvious
cause. In terms of power, transistors are rated in a similar fashion to resistors, except that the limiting
hot spot occurs at the junction, and junction temperature is the most important parameter.
In practice it is essential to derate transistors to a level which ensures that the manufacturer’s
recommended junction temperature will not be exceeded. To this end, the following electrical stress
ratios are recommended as a minimum:
Stress Ratio (1) = Operating Power / Rated Power = 75%

Stress Ratio (2) = Operating Ic / Rated Ic = 90%

where Ic is the collector current. Each of these ratios must be complied with at the same time in each
particular transistor application.
Power Diodes
The following stress ratios are recommended for power diodes in order to ensure, in general, that
the limiting junction temperatures are not exceeded:
Stress Ratio (1) = Operating PIV / Rated PIV = 50%

and

Stress Ratio (2) = Operating If / Rated If = 70%
where, Stress Ratio(1) is the Peak Inverse Voltage derating factor and Stress Ratio(2) is the
Forward Current derating factor.
Small Signal Diodes
The following stress ratios are recommended for small signal diodes in order to ensure, in general,
that the limiting junction temperature is not exceeded:

Stress Ratio (1) = Operating PIV / Rated PIV = 85%

and

Stress Ratio (2) = Operating If / Rated If = 85%
Each of these stress ratios must be complied with at the same time in every application.
Transformers
The policy and principles for the derating of transformers applies also to similar devices including
inductors, chokes, magnetic amplifiers and RF coils.
Most transformer failures result from insulation breakdown and resulting short circuit, and the
overheating that follows may result in misshapen or burst containers due to expansion of the
potting or filling compound. Open circuit windings occur only occasionally.
Transformer failures are largely due to the insulation becoming brittle and losing its insulation
qualities. This is usually caused by hot-spots and is related to the operating temperature. The
operating temperature in turn is related to the power dissipation of the device and the operation
stress ratio.
Operating Temperature = Ambient + (0.15 × Rated Temperature rise on full load) + (0.85 × Rated Temperature rise × Stress Ratio)
The following electrical stress ratio is recommended for transformers and similar devices:
Stress Ratio = Operating VA Load / Rated VA Load = 80%
This stress ratio will, in most cases, ensure that the limiting hot-spot temperature is not exceeded.
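The operating-temperature estimate above can be evaluated numerically. The sketch below interprets “Rated Temp” as the rated full-load temperature rise, which is an assumption on our part; the numerical values are illustrative only.

```python
# Sketch of the transformer operating-temperature estimate:
#   operating temp = ambient + 0.15 x rated rise + 0.85 x rated rise x stress ratio

def transformer_operating_temp(ambient_c: float, rated_rise_c: float,
                               stress_ratio: float) -> float:
    """Estimated operating temperature in degrees Celsius."""
    return ambient_c + 0.15 * rated_rise_c + 0.85 * rated_rise_c * stress_ratio

# Example: 25 degC ambient, 60 degC rated full-load rise, 80% VA stress ratio.
print(transformer_operating_temp(25.0, 60.0, 0.80))  # 25 + 9 + 40.8 = 74.8 degC
```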
Capacitors
Capacitors in general do not dissipate heat in the same way as resistors, transistors or
transformers, except when they are subjected to ripple currents or pulse loads, in which case
derating does become important. However, they are subject to thermally sensitive failure modes
that depend on the materials used in their manufacture.
Some of the principal conditions associated with capacitor failure are current overload, voltage
overload, high frequency effects, high temperature, high pressure, humidity and shock. The most
important of these are voltage and temperature stress, which are the principal factors to be
derated.
Dielectric breakdown may occur after many hours of satisfactory operation and is associated with
a slowly changing physical or chemical reaction. The ultimate failure is, however, most often
associated with one abnormal electrical or temperature stress.
The recommended electrical stress ratio for all types of capacitor is:
Stress Ratio = Operating Voltage / Rated Voltage = 75%
In all cases the capacitor selected for a particular application must be carefully chosen from the
various types available to avoid misapplication. A significant number of equipment failures are
due to incorrect selection and application of capacitors.
Electrolytic capacitors are a special case: they have power factors several times higher than other
capacitor types and suffer ‘leakage’ currents which cause significant self-heating. This self-heating
tends to increase with age and can build up, causing complete failure; derating is therefore
particularly important. Non-electrolytic capacitors can be derated down to 10% of the maximum
voltage rating, though this is seldom physically practicable; however, this is not true for
electrolytic capacitors, which may exhibit increased failure rates at these low levels because a
minimum voltage is required to establish and maintain the polarisation of these types. The
principal derating parameter is ‘surge voltage’ for solid tantalum types and ‘ripple current’ for
other electrolytic types. These capacitors must not be operated below the minimum specified
voltage; they should be derated but still comply with the manufacturer’s minimum requirements.
Micro-Electronic Circuits
This heading includes Integrated Circuits, Medium Scale Integration (MSI) Circuits, Large Scale
Integration (LSI) Circuits, and Hybrid Circuits, and covers both thick and thin film technology.
In any given type of micro-electronic structure the device reliability is very strongly related to
temperature of operation, and particularly to junction temperature in applications where the power
dissipated in the device is relatively high. The heat generated must be dispersed using appropriate
metal or ceramic packaging.
The specific stress ratios are necessarily different for each type of device and the only
generalisations that can be made are that digital IC’s should be derated in terms of fan-out and
linear IC’s in terms of current. The following derating factors are recommended:
MOS Devices
The predominant cause of failure in MOS and C-MOS devices is electrical overstress; experience
with C-MOS devices under test conditions indicated that over 40% of failures arise from this
cause. The overstress can be due to mishandling since these devices are especially sensitive to
static discharge, and this has been found to be the most frequent cause of failure. Precautions to
be taken to protect expensive devices from static damage include:
MOS and C-MOS devices typically operate at supply voltages ranging from 3V to 18V. The
choice of supply voltage influences the speed of operation, because the higher the voltage the
shorter the rise and fall times of the output pulses. However, increased supply voltage also
increases the power consumption and thermal dissipation, and directly influences the failure rate.
Manufacturers’ life tests of MOS devices indicate an increase in failure rate of at least 10 times
for an increase of supply voltage from 10 to 15 volts, and it is clear that a compromise between
operating speed and reliability has to be made. Thus these devices should be derated, in terms of
supply voltage, to the lowest level consistent with the required operating speeds.
The functioning of a contact operating device like a switch or relay entails many sources of risk,
and incorrect functioning can expose adjacent circuitry to various degrees of hazard. These
devices present complex electro-mechanical failure modes; for example, a chopper type relay may
occasionally make poor contact with little effect on the overall system operation, while a ‘one-
shot’ armament relay in a missile requires little total usage but demands high reliability when it is
used. The conditions that lead to possible failure include ageing of time delay relays and gas
generation in hermetically sealed cans.
Most failure modes of relays and switches are dependent upon the cumulative number of
operations and, being electro-mechanical devices, relays and switches are subject to both
electrical and mechanical failure. Typical causes of failure are predominantly mechanical in
nature and include: mis-aligned contacts, open-circuit contacts, contaminated or pitted contacts,
loss of resilience in contact springs, and open-circuit coils.
Contact failure can result from a current surge or high sustained current. Current surges occur in
loads which include motors, lamps, heaters, capacitive input filters and other devices with low
initial impedance. These currents can cause intense heat with associated contact welding.
Transformers can present transients of many times the steady state current. At switch-on a lamp
filament can demand current up to 15 times the steady state value and motors up to 10 times.
The following electrical stress ratios for relays and switches are recommended:
Stress Ratio = Operating Contact Current / Rated Contact Current = 50%
Special circuit conditions demand a greater degree of derating and the following factors are
recommended:
Chapter Four
Maintainability is the probability that a device that has failed will be restored to operational
effectiveness within a given period of time when the maintenance action is performed in
accordance with prescribed procedures.
This is usually expressed as Mean Time To Repair (MTTR), i.e. how quickly the system can be
restored to operational effectiveness, or sometimes as the repair rate. The word “repair”
implies that we are concerned only with the time to perform corrective maintenance; in
practice, however, the time taken to carry out preventive maintenance is equally of interest. Hence,
maintainability can be examined from two perspectives: serviceability (the ease of
conducting scheduled inspection and servicing) and reparability (the ease of restoring service
after a failure).
Reliability and maintainability work together to achieve availability. The basic factors that
determine availability are mean time between failures (MTBF), mean time to repair (MTTR), and
performance before failure and after repair. For example, a highly reliable system with a
protracted MTTR may have lower availability than a less reliable system that is easier
and faster to maintain or repair. However, the user may be willing to trade off either reliability
or maintainability to accomplish higher performance. This is typical of some military armaments.
Similarly, performance may be compromised to raise reliability and reduce MTTR.
It is generally known that some equipment is really easy to maintain while other equipment can make
maintenance work stressful. This attribute is referred to as maintainability. Thus, maintainability can
be defined as the characteristic of an equipment that makes it easy to repair. In contrast,
maintenance is the activity oriented to keeping equipment running; it is divided into two major
types, corrective and preventive maintenance.
4.2.1 Corrective Maintenance: This is concerned with the repair of something that is not working
according to standards. It is mainly reactive and unplanned. Because this type of maintenance
takes place when the equipment is due to be running, it has an associated production loss which
negatively impacts the equipment’s availability.
Generally speaking, we expect that equipment will work for a reasonable amount of time
without failure. As everybody knows, some equipment is more reliable than others, and
reliability depends on complexity, quality, design, etc.
Reliability and maintainability are characteristics defined during the design stage. They are also
affected by the manufacturing process and quality control. However, they can be further improved
during their productive life using failure information and field experience to implement
modifications, although this is more difficult and less cost effective.
Reliability, maintainability and availability are all related. Their correlation can be seen in
Figure 4.1
Broadly speaking, reliability will determine how often the equipment fails. Although it is
established during the design stage, this will be an estimated value, since reliability can decrease
during the equipment’s life due to several factors such as poor maintenance, heavy work
environments (performance stress) or incorrect operating procedures.
In general, the best option is to acquire equipment with high reliability and maintainability,
but sometimes we already have the equipment installed and we have to deal with it. In those
cases, some ways to improve maintainability are:
i. Improve access to the equipment: we can modify aspects of the equipment that will allow
us to gain access faster and more easily. For example, sometimes quick-access doors can be
added, or where we have panels with too many screws, we can replace them with quick-release
fasteners.
ii. Standardize components: every time we replace a component we can install an equivalent
brand or model that we frequently use in our plant (this can also be done all at once, at
higher cost). This can be applied to thermomagnetic switches, fuses, relays, electric
motors and VFDs. In the last two cases, we can even use a model of higher capacity to
diminish the number of models in use. For example, if we have many VFDs of 110 kW
and one of 90 kW, we can analyse the possibility of replacing the latter with one of 110
kW so we use only one model. This has to be considered carefully against the available
space and other compatibility issues.
iii. Improve fault indication: this is especially useful for PLC-controlled equipment. On some
occasions, the equipment has failed and there is no fault indication, or it is vague or
erroneous. Sometimes the failure wasn’t anticipated during the design phase, so we can
program a new alarm or indication to clearly communicate the problem in the future.
In conclusion, we have seen the importance of maintainability and its effects on maintenance
and availability. It is therefore appropriate that these characteristics be taken into
consideration when buying new equipment, since poor maintainability could lead to further
costs during the life of the equipment.
Example 4.1
Consider a motor used for only eight hours a day (e.g. five days a week) for 50 weeks a year. The
hours of operation would then be 2000 hours, and the motor utilization factor on a base of 8760
hours per year would be 2000/8760 = 22.83%. On a base of 2000 hours per year, the motor
utilization factor would be 100%.
The bottom line is that this factor is applied to obtain the correct number of hours that the
motor is in use.
This factor must be applied to each individual load, with particular attention to electric
motors, which are very rarely operated at full load. In an industrial installation this factor
may be estimated on an average at 0.75 for motors.
For incandescent-lighting loads, the factor always equals 1.
For socket-outlet circuits, the factors depend entirely on the type of appliances being
supplied from the sockets concerned.
4.6 Availability
Availability can be defined as the proportion of time for which the equipment is either performing
its function or capable of performing its function. It is the probability that a system is not failed
or undergoing a repair action when it needs to be put to use. Availability answers the question:
“Is the equipment available in a working condition when it is needed?”
An item of equipment may not be very reliable, but if it can be repaired quickly when it fails, its
availability could be high. One major distinguishing factor between availability and reliability is
that availability takes repair time into account. Figure 4.2 illustrates the average value of time, t
over the operating life of the equipment.
Figure 4.2: Illustration of the Average value of time over operating life of the equipment
Mean time between failures (MTBF) is commonly used to express the overall reliability of items
of equipment and systems.
From Figure 4.2 we can see what is meant by Uptime, which is the time when the equipment is
available; and Downtime, which is the time when the equipment has failed, and so, is unavailable.
The averages of each of these are:
a) Mean Uptime, which we have already seen is known as the MTBF
b) Mean Downtime, or MDT
∴ Availability (A₀) = Uptime / Total time = Mean Uptime / (Mean Uptime + Mean Downtime)   (4.2)

A₀ = MTBF / (MTBF + MDT)   (4.3)

This is called the operational availability, A₀.
Sometimes Mean Time To Repair (MTTR) is used in this formula instead of MDT. But MTTR
may not be the same as MDT because:
i) The failure may not be noticed for some time after it has occurred.
ii) It may be decided not to repair the equipment immediately
iii) The equipment may not be put back in service as soon as it is repaired.
Whether MDT or MTTR is used, it is important that it reflects the total time for which the
equipment is unavailable for service; otherwise the calculated availability will be incorrect.
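Equation (4.3) is straightforward to apply. The sketch below uses illustrative figures (an MTBF of 1000 hours and an MDT of 10 hours); remember that MDT must capture the total unavailable time, not just the hands-on repair time.

```python
# Operational availability per equation (4.3): A0 = MTBF / (MTBF + MDT).

def operational_availability(mtbf_hours: float, mdt_hours: float) -> float:
    """Long-run fraction of time the equipment is available."""
    return mtbf_hours / (mtbf_hours + mdt_hours)

print(round(operational_availability(1000.0, 10.0), 4))  # 0.9901
```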
Example
A system operates for 100 hours as shown in the timeline of Figure 4.3. Calculate the Uptime,
Downtime and the operational availability.
The instantaneous availability function will start approaching the steady state availability
value after a time period of approximately four times the average time-to-failure. Figure 4.4
illustrates the steady state availability graphically.
Concisely, one can think of the steady-state availability as a stabilizing point at which the system’s
availability is roughly a constant value.
From the equation for operational availability, and the availability of a repairable system, which is
a function of its failure rate λ and its repair rate µ, we can derive the steady-state availability:
A(∞) = MTBF / (MTBF + MDT) = (1/λ) / (1/λ + 1/µ)   (4.5)

∴ A(∞) = µ / (µ + λ)   (4.6)
The steady-state availability reflects the long-term availability after the system “settles”. The
system availability may initially be unstable due to training/learning issues, deciding on a good
spare-parts stocking policy, deciding on the number of repair personnel, optimizing the efficiency
of repair, burn-in of the system, etc., and it could take some time before it stabilizes.
For systems that operate continuously, once the system passes an initial start-up period, the
instantaneous availability equals its steady state availability.
In many analysis cases, the period of interest is such that the startup transient is negligible and is
ignored.
Now let us consider a two-state model as shown in Figure 4.5, in which a system is either “up”
represented by 1, or “down” represented by 0.
Figure 4.5: Two-state model

The system moves from state 1 to state 0 at rate λ and from state 0 to state 1 at rate µ, where

λ = 1/MTBF and µ = 1/MDT
The instantaneous availability can be calculated as a function of time. It is the probability of being
in state 1 at time t.
P₁(t) = µ/(µ + λ) + (λ/(µ + λ)) e^(−(λ+µ)t)   (4.7)
The average availability or uptime availability is therefore the uptime percentage through time, t
and can be evaluated as:
A₀(t) = (1/t) ∫₀ᵗ P₁(u) du   (4.8)

= (1/t) ∫₀ᵗ [µ/(µ + λ) + (λ/(µ + λ)) e^(−(λ+µ)u)] du

∴ A₀(t) = µ/(µ + λ) + [λ / (t(µ + λ)²)] (1 − e^(−(λ+µ)t))   (4.9)
As time t becomes large, approaching infinity, the limit of A₀(t) is µ/(µ + λ).   (4.10)
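The behaviour of equations (4.7) and (4.10) can be demonstrated numerically. In the sketch below the rates are illustrative assumptions (λ = 1/1000 per hour, µ = 1/10 per hour); the instantaneous availability starts at 1 and decays toward the steady-state value µ/(µ + λ).

```python
import math

# Two-state model of Figure 4.5: instantaneous availability, equation (4.7).
def instantaneous_availability(lam: float, mu: float, t: float) -> float:
    """P1(t) = mu/(mu+lam) + (lam/(mu+lam)) * exp(-(lam+mu)*t)."""
    return mu / (mu + lam) + (lam / (mu + lam)) * math.exp(-(lam + mu) * t)

lam = 1.0 / 1000.0  # failure rate: one failure per 1000 h (illustrative)
mu = 1.0 / 10.0     # repair rate: 10 h mean downtime (illustrative)

for t in (0.0, 10.0, 50.0, 1000.0):
    print(t, round(instantaneous_availability(lam, mu, t), 4))

# Steady-state limit, equation (4.10):
print(round(mu / (mu + lam), 4))  # 0.9901
```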
This is the probability of the system operating and functioning at the requisite level in an ideal
environment when considering only corrective maintenance (CM). The inherent availability, AI,
ignores standby and delay times associated with preventive maintenance as well as logistic delays,
supply delays and administrative delays. Since these other causes of delay can be minimized or
eliminated, an availability value that considers only the corrective downtime is an inherent or
intrinsic property of the system. Many times, this is the type of availability that companies use to
report the availability of their products (e.g. computer servers), because they see downtime other
than actual repair time as outside their control and too unpredictable.
The corrective downtime reflects the efficiency and speed of the maintenance personnel, as well
as their expertise and training level. It also reflects characteristics that should be of importance to
the engineers who design the system, such as the complexity of necessary repairs, ergonomics
factors and whether ease of repair (maintainability) was adequately considered in the design.
i). For a one-off/non-repairable component, the inherent availability can be computed as:

AI = MTTF / (MTTF + MTTR)   (4.11)

ii). Equation (4.11) gets slightly more complicated for a repairable element. To do this, one needs
to look at the Mean Time Between Failures (MTBF) and compute as follows:

AI = MTBF / (MTBF + MTTR)   (4.12)

iii). The achieved availability, which also accounts for preventive maintenance (PM), can be computed as:

A_A = MTBM / (MTBM + M̄)   (4.13)

where MTBF = Uptime / Number of system failures;

MTBM = Uptime / (Number of system failures + Number of system-downing PMs);

and M̄ = (CM Downtime + PM Downtime) / (Number of system failures + Number of system-downing PMs)
It should be noted that system-downing PMs are PMs that cause the system to
go down or require a shutdown of the system.

A₀ = MTBF / (MTBF + MDT) = Uptime / Operating Cycle   (4.14)
Where operating cycle is the overall time period of operation being investigated and uptime is the
total time the system was functioning during the operating cycle.
For instance, if we are using equipment which has a mean time to failure (MTTF) of 81.5 years
and a mean time to repair (MTTR) of 1 hour:

MTTF in hours = 81.5 × 365 × 24 = 713,940 (this is a reliability parameter and often has a high
level of uncertainty)

Inherent availability AI = 713940 / (713940 + 1) = 713940/713941 = 99.999860%

Inherent unavailability = 1/713941 ≈ 0.000140%
Unavailability (Q) = Downtime / Total time

where Downtime = MTTR and Total time = MTBF = MTTF + MTTR

Q = MTTR / (MTTF + MTTR)

If MTTR << MTTF, then substituting,

Q ≈ MTTR/MTTF = (1/MTTF) × MTTR
If the failure rate is constant (the standard assumption for safety integrity level (SIL) verification), then

λ = 1/MTTF

∴ Q ≈ λ × MTTR
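The worked example above, together with the approximation Q ≈ λ × MTTR, can be verified with a few lines of arithmetic:

```python
# Reproducing the worked example: MTTF of 81.5 years, MTTR of 1 hour.
mttf_hours = 81.5 * 365 * 24   # = 713,940 h
mttr_hours = 1.0

a_inherent = mttf_hours / (mttf_hours + mttr_hours)
q_exact = mttr_hours / (mttf_hours + mttr_hours)

# With MTTR << MTTF the approximation Q = lambda * MTTR applies:
lam = 1.0 / mttf_hours
q_approx = lam * mttr_hours

print(f"A_I      = {a_inherent:.6%}")  # ~99.999860%
print(f"Q exact  = {q_exact:.6%}")     # ~0.000140%
print(f"Q approx = {q_approx:.6%}")    # ~0.000140%
```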
4.8 Repairability
There are various definitions of “repairability” in the literature. Some of them are:
o The ability and ease of a product to be repaired during its life cycle.
o The ability to bring a product back to working condition after failure in a reasonable amount
of time and for a reasonable price.
o The characteristics of a product that allow all or some of its parts to be separately
repaired or replaced without having to replace the entire product.
Arising from the above three definitions, there are some relevant elements and parameters that
must be included in the definition in order to give it an all-encompassing meaning which portrays
the desired outcome of the repair (bringing a product back to working condition) as well as
determining the feasibility of a repair (cost and time). Accordingly, for the purposes of this text,
repairability is defined as follows:
The characteristics of a product that allows all or some of its parts to be brought back to working
condition after failure in a reasonable amount of time for a reasonable price without having to
replace the entire product.
The first step of the repair activity is to properly identify the product model in order to retrieve
relevant repair information such as failure diagnostic guides, disassembly instructions or the
availability of spare parts. The second step is to identify the damage or failure in the
system/product. In this step, the technical repairability is assessed and the required further repair
actions, such as replacement of failed parts, are identified. To access these failed parts, complete or
partial disassembly and subsequent reassembly is required. Before putting the product back in
use, in many cases the repaired product is tested and/or reset.
Repair effort concentrates on the parts that are most likely to be repaired or replaced during the
normal service life of the product, and/or parts that are characterized by a high assumed failure rate
and/or are critical for the product to deliver the main desired function.
4.9.2 Disassembly
Diverse definitions of “disassembly” appear in the literature, as follows:
• Non-destructive taking apart of an assembled product into constituent materials and/or
components.
• A process whereby an item is taken apart in such a way that it could subsequently be
reassembled and made operational.
• A reversible process in which a product is separated into its components and/or
subassemblies by non-destructive or semi-destructive operations which only damage the
connectors/fasteners. If the product separation is irreversible, the process is called
dismantling.
We therefore define partial or complete Disassembly as:
A reversible process in which a product is separated into its parts by non-destructive operations
or semi-destructive operations which only damage the connectors/fasteners in such a way that it
could subsequently be reassembled and made operational, possibly needing new
connectors/fasteners.
4. Index for spare parts: This includes information on where to get spare parts and their
cost. Aside from the content of repair manuals, the structure of the manual and the
ease of retrieving the required information for persons performing repair operations
are also of high importance.
Improvements to system reliability will, almost without exception, increase cost. The total costs
incurred over the period of ownership of equipment are often referred to as life-cycle costs (LCC).
These can be categorized into:
a) Acquisition Cost: Capital cost plus cost of installation, transport, etc.
b) Ownership Cost: Cost of preventive and corrective maintenance and modifications.
c) Operating Cost: Cost of materials and energy
d) Administrative Cost: Cost of data acquisition and analysis
These costs will be influenced by:
i) Reliability- this determines frequency of repair, spare part requirements, and loss of
revenue (together with maintainability)
ii) Maintainability- this affects training, test equipment, downtime and manpower.
iii) Safety Factor- this affects operating efficiency, maintainability and liability costs.
Life-cycle costs (LCC) will clearly be reduced by improving reliability, maintainability and safety,
but will be increased by the activities needed to achieve them. Therefore, we need to find an
optimum set of parameters which minimizes the total cost. This concept is illustrated in Figures
4.7 and 4.8. Each curve represents cost against availability. Figure 4.7 shows the general
relationship between availability and cost.
Figure 4.7 Price and Availability Figure 4.8 Cost of Ownership and Availability
The manufacturer’s pre-delivery costs, those of design, procurement and manufacture, increase
with availability. On the other hand, the manufacturer’s after-delivery costs, those of warranty,
redesign and loss of reputation, decrease as availability improves. The total cost is shown by a
curve indicating some value of availability at which minimum cost is incurred. Price will be
related to this cost. Taking, then, the price/availability curve and plotting it again in Figure 4.8,
the user’s costs involve the addition of another curve representing losses and expense, owing to
failure, borne by the user. The result is a curve also showing an optimum availability that incurs
minimum cost. These diagrams serve to illustrate the idea that cost is minimized by finding
reliability and maintainability enhancements whose savings exceed the initial expenditure.
The following list summarizes best practice, together with recommended enhancements, for
both manual and computer-based field failure recording. Recorded field information is frequently
inadequate, and it is necessary to emphasize that failure data must contain sufficient information to
enable precise failures to be identified and failure distributions to be analysed. They must,
therefore, include:
(a) Adequate information about the symptoms and causes of failure. This is important because
predictions are only meaningful when a system-level failure is precisely defined. Thus component
failures which contribute to a defined system failure can only be identified if the failure modes are
accurately recorded. There needs to be a distinction between failures (which cause loss of system
function) and defects (which may only cause degradation of function).
(b) Detailed and accurate equipment inventories enabling each component item to be separately
identified. This is essential in providing cumulative operating times for the calculation of assumed
constant failure rates and also for obtaining individual calendar times (or operating times or
cycles) to each mode of failure and for each component item. These individual times to failure are
necessary if failure distributions are to be analysed.
(c) Identification of common cause failures by requiring the inspection of redundant units to
ascertain if failures have occurred in both (or all) units. This will provide data to enhance models.
In order to achieve this it is necessary to be able to identify that two or more failures are related to
specific field items in a redundant configuration. It is therefore important that each recorded
failure also identifies which specific item (i.e. tag number) it refers to.
(d) Intervals between common cause failures. Because common cause failures do not necessarily
occur at precisely the same instant it is desirable to be able to identify the time elapsed between
them.
(e) The effect that a ‘component part’ level failure has on failure at the system level. This will
vary according to the type of system, the level of redundancy (which may postpone system level
failure), etc.
(f) Costs of failure such as the penalty cost of system outage (e.g. loss of production) and the cost
of corrective repair effort and associated spares and other maintenance costs.
(g) The consequences in the case of safety-related failures (e.g. death, injury, environmental
damage) not so easily quantified.
(h) Consideration of whether a failure is intrinsic to the item in question or was caused by an
external factor. External factors might include: process operator error induced failure; maintenance
error induced failure; failure caused by a diagnostic replacement attempt; modification induced failure.
(i) Effective data screening to identify and correct errors and to ensure consistency. There is a cost
issue here in that effective data screening requires significant man-hours to study the field failure
returns. In the author’s experience an average of as much as one hour per field return can be
needed to enquire into the nature of a given failure and to discuss and establish the underlying
cause. Both codification and narrative are helpful to the analyst and, whilst each has its own
merits, a combination is required in practice. Modern computerized maintenance management
systems offer possibilities for classification and codification of failure modes and causes.
However, this relies on motivated and trained field technicians to input accurate and complete
data. The option to add narrative should always be available.
(j) Adequate information about the environment (e.g. weather in the case of unprotected
equipment) and operating conditions (e.g. unusual production throughput loadings).
Reporting forms must strike a balance between detailed failure information and the requirement
for a simple reporting format. A feature of a telecommunication company’s form is the use of four
identical print-through forms. The information is therefore accurately recorded four times with
minimum effort. Figure 4.10 shows the author’s recommended format, taking into account the list
of items under “Best Practices and Recommendations.”
4.12 Concept of Failure Reporting, Analysis and Corrective Action System (FRACAS)
FRACAS is a process that gives organizations a way to report, classify and analyze failures, as
well as plan corrective reactions in response to those failures. Usually, software is used to
implement a FRACAS system to help manage multiple failure reports and produce a history of
failure with corresponding corrective actions, so recorded information from those past failures can
be analyzed.
FRACAS is a closed-loop process containing the following steps:
1. Failure reporting (FR): All failures and faults related to a system, piece of equipment or
process are formally reported using a standard form known as a failure report or defect
report. The failure report should clearly identify the failed asset, symptoms of the failure,
testing conditions, operating conditions and failure time.
2. Analysis (A): Perform a root cause analysis to identify what caused the failure.
3. Corrective actions (CA): Once the cause of the failure is determined, implement and
verify corrective (or preventive) actions to prevent future occurrences of the failure. Any
changes should be formally documented to ensure standardization.
Here, “failure information” means any information deemed necessary to help determine and
resolve issues, as well as information for future tracking.
During the failure reporting stage of FRACAS, you should clearly define the type of information
to record in the incident report. Over time, as failures flow through the closed-loop FRACAS
process, more information will be collected; however, initially, as much data as possible should
be gathered on the failure and how it was detected. Failure reports should collect information such
as:
(iv)All details about the incident, including the steps that led to the incident
(v) Any corrective action that was done to fix the issue
The most important aspect of failure reporting is ensuring issues are logged in your FRACAS as
they occur in real time. To do this, all team members must have access to the FRACAS and be
able to properly navigate the system.
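To make the reporting step concrete, the sketch below shows one possible shape for a failure-report record; the field names are illustrative assumptions, not a standard FRACAS schema. The root cause and corrective action start empty and are filled in as the report moves through the closed loop.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class FailureReport:
    asset_id: str                 # the failed asset (e.g. tag number)
    symptoms: str                 # observed symptoms of the failure
    operating_conditions: str     # conditions at the time of failure
    reported_at: datetime = field(default_factory=datetime.now)
    root_cause: Optional[str] = None         # filled in during Analysis
    corrective_action: Optional[str] = None  # filled in and verified before close-out
    closed: bool = False

report = FailureReport(
    asset_id="PUMP-014",
    symptoms="Seal leak observed during start-up",
    operating_conditions="Full load, 35 degC ambient",
)
print(report.asset_id, report.closed)
```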
Step 2 – Analysis
After you've logged your failure report(s), it's time to conduct an analysis of the issue at hand.
This phase can also be customized to fit your organization's needs and help you determine how to
proceed with analyzing the issue. The analysis phase typically is done by a team lead or engineer
who fully evaluates what caused the failure and then identifies a solution.
The final step in the FRACAS is resolving the issue and closing it out. At this point, you've
determined the root cause of the failure and come up with a solution to correct it. Once you've
implemented the corrective action, your team should verify the success of the action and close out
the incident in the system. Closing out each failure is critical to ensure the closed-loop system
remains intact.
Implementing a FRACAS gives you valuable information to help identify and correct errors or
failures, past problems, defects, or process errors in a timely manner. Additional benefits include
the following:
(i) Through proper investigation of failures and appropriate corrective action, a FRACAS
also directly reduces immediate costs like factory rework and parts/materials scrap, as well
as indirect costs like customer dissatisfaction.
(iii) FRACAS offers visibility of reliability performance issues and initiates continuous
improvement processes.
(iv)Through root cause analysis, FRACAS helps expedite engineering efforts to fix issues,
which in turn leads to effective corrective actions.
(v) FRACAS provides an organization with a knowledge base of a history of problems, giving
you a precedent for numerous issues to help you avoid them in the future.
Chapter Five
At times, the word specification is used in relation to a data sheet (or spec sheet), which can be
misconstrued. A data sheet, however, explains the technical features of an item or product, usually
documented and made available by the manufacturer to assist in selecting or making proper use of
the product. It is sometimes called a performance specification. Hence, a data sheet is not a technical
specification in the sense of information regarding how to design or produce.
For example, a function generator required for use in a laboratory may have the following brief
specifications:
An “in service” or “maintain as” specification spells out the state of a single system/product after
functioning for some years, as well as the effects of wear and maintenance (i.e. changes in
configuration arising from years of operation).
Specifications are performance goals for the design team. They are important for
the following reasons:
(i) They provide clear instructions on the intent, materials, product and service.
(ii) They serve as a guideline to the quality and standards which should be applied.
(iii) Materials and manufacturers’ products can be clearly defined.
(iv) The requirements for installation, testing and handover can be identified.
(v) They may be used as a mutual agreement between a prospective buyer and a manufacturer.
For example, a buyer may wish to specify that material components or equipment should
conform to certain standards.
(vi) They are used as indicators for buyers or users of manufactured products in choosing the
most suitable item for their needs among many options or varieties.
Amplitude Stability:
i) (Sine and square waveforms) typically less than ±5% peak-peak change over the range
0.01 to 100 kHz.
ii) (Triangle waveform) typically less than 0.5% peak-peak change over the range 0.01–10
kHz and 2.5% at 100 kHz.
Purity:
Sine distortion: less than 2%, typically 0.25% at 1 kHz on the 1 kHz to 100 kHz range, and 0.7% at
100 kHz.
Rise and fall time: less than 200 ns on square wave
Linearity: typically less than 1% on triangle wave
Auxiliary Output:
Triangle wave: amplitude, 2 V peak-peak
Impedance: 600 Ω
TTL (transistor-transistor logic) wave: amplitude, 0 to +5 V (nominal); rise time, less than
100 ns
Input:
Input impedance: 10 kΩ
Power Requirements:
Line voltage: 200/250 V rms, 50 or 60 Hz
Consumption: 6 VA
Observations
A close look at the above data sheet clearly shows that no express information about the reliability
of the product is provided by the manufacturer. Most times, this is a deliberate omission by the
manufacturer in order not to inadvertently give vital information to competitors. Sometimes
it is also the result of difficulties in working out a realistic numerical reliability value. This
is because accurate reliability prediction of equipment requires minutely examining each of its
conceptual “blocks”, for which sufficient facts and figures on failure rates are needed. The
interconnectivity of these blocks makes the overall reliability of the equipment dependent on the
collective performance of the individual conceptual blocks, which can only unfold through some
rigorous calculations.
Example 1
Let us take the example of a compact changeover switch:
Electrical Specifications:
Nominal current 500 mA
Minimum current 1 mA
Nominal voltage 12 V
Minimum voltage 10 mV
Electrostatic breakdown value 5 kV
Isolation resistance ≥10,000 MΩ at 100 V d.c.
Contact resistance ≤22 Ω
Electrical life 1000 actuations (operations)
Electric strength 250 V rms, 50 Hz, 1 min
Mechanical Characteristics:
Switching action Maintained
Actuating travel 1.6 mm
Operating temperature -40°C ... +85°C
Shock resistance 50 g, 11 ms
Environmental resistance (Standard / Tropical):
- Damp heat: 4 days / 21 days
- Saline mist: 24 hours / 96 hours
Example 3
Specifications of two fixed resistors

Specification                Carbon Composition Type    Metal Oxide Type
Range                        10 Ω to 22 Ω               10 Ω to 1 MΩ
Selection tolerance          ±10%                       ±2%
Power rating                 250 mW                     500 mW
Load stability               10%                        1%
Maximum voltage              150 V                      350 V
Insulation resistance        10⁹ Ω                      10¹⁰ Ω
Proof voltage*               500 V                      1 kV
Voltage coefficient**        200 ppm/V                  10 ppm/V
Ambient temperature range    −40°C to 105°C             −55°C to 150°C
Temperature coefficient      ±120 ppm/°C                ±250 ppm/°C
Noise                        2 µV/V (at 1 kΩ) to 6 µV/V (at 10 MΩ)    0.1 µV/V
Soldering effects***         2%                         0.15%
Shelf time (1 year)⁺         5%                         0.1%
Damp heat (95% RH)⁺⁺         15% max.                   1%
Remark
* Proof voltage: This is the maximum voltage that can be applied between the resistor body and a
touching external conductor.
** Voltage coefficient: This is the negative change in resistance with respect to applied voltage
(expressed in ppm per volt). 1 ppm = 1 × 10⁻⁶.
*** Soldering effects: This is the change in resistance as a result of a standard soldering test.
⁺ Shelf time: This is the possible change in resistance, usually after one year.
⁺⁺ Damp heat: This is the change in resistance as a result of a standard high-temperature and
humidity test.
Observation
A close look at the above data sheet reveals that:
i) Different components possess diverse items of information in their specifications.
ii) Comparison of the data for the two fixed resistors shows that for high-reliability
performance, the metal oxide type would be preferred because of its wider range and
superior stability.
For any reasonable improvement in any of the specification items listed below, there must be a
trade-off with the cost of the equipment. The items include:
i) Stability
ii) Sensitivity
iii) Reduction of error range
iv) Speed of reading
v) Widening operating conditions
It is inferred, therefore, that the higher the quality of the operational features, the higher the cost
of the equipment. Hence, for the buyer of any equipment, the guiding principle should be the
purpose for which the equipment is needed. That will determine which of the above listed factors
the buyer should prioritize.
Chapter Six
6.0 Appreciating the Need for Testing; Types of Testing carried out and the purpose of
Testing
Introduction
Testing is a set of activities conducted to facilitate the discovery and/or evaluation of properties of one or more items or pieces of equipment under test. Each individual test, known as a test case, exercises a set of predefined test activities developed to drive the execution of the test item towards its objectives, including correct implementation, error identification, quality verification and other valued details. The test environment is usually designed to be identical, or as close as possible, to the anticipated operating environment.
In this chapter, the various types of tests which products undergo in order to be certified suitable for the market will be treated.
Objectives of BAT (Business Acceptance Testing)
The objective of BAT is to provide a set of working and fully tested features that are ready for production, to be tested and validated by the business.
It is recommended that the delivery team works collaboratively with the business to define a testing approach and a plan with test cases. Any defects found at this stage will be handled by the Business Test Lead, Business Analyst and Development Lead, along with support from the project manager.
Additionally, as the users pass their acceptance criteria, the business owners can be reassured that
the developers are progressing in the right direction.
little (inaccurate measurements). The bottom line, however, is that calibration is costly, but so are inaccurate instruments.
In general, there are other scenarios where more frequent calibrations could be required:
i) Before Starting and after Finishing a Major Critical Measuring Project
If you are planning a project that requires extremely accurate measurements, the instruments to be
used for that project should be sent for calibration, and then, kept in storage until the testing
begins. Likewise, after the project is completed, the equipment used should be sent for calibration.
When the results are obtained, they can be used to confirm the accuracy of the testing results for
the project.
ii) After an Incident
If the instrument receives knocks, bumps or any other kind of physical impact, or its internal overload protection is tripped, then one may want to consider sending it for recalibration to ensure its accuracy. The chances of this happening are higher in certain industries such as construction, field service and facilities maintenance.
iii) Based on individual Project Requirements
One can guarantee every project requiring electrical testing will differ in size and scope, and
therefore have different requirements for calibration. Some will require the use of certified and
calibrated test equipment, while others won’t have as stringent calibration standards. Review of
the specifications is required before the test, as the requirements might not be explicitly stated.
iv) At Quarterly or Semi-annual Periods
If one carries out critical measurements, then leaving a shorter time span between calibrations means there will be less chance of questionable test results or of the electrical test meter drifting out of calibration. Be prepared by diarizing the calibration frequency or booking calibrations in advance.
v) Internal Requirements
Often, business insurance or an awarding organization may require a valid calibration certificate. Alternatively, your organization's quality manual might stipulate the desired frequency of calibrations.
Choosing a supplier with affordable prices and quick turnaround times makes regular calibrations
a possibility.
As a final point, there are also some other factors that can influence the calibration period, such
as:
Here are the top reasons NDT is used by so many companies throughout the world:
i) Savings. The most obvious reason is that NDT is more appealing than destructive testing because it allows the material or object being examined to survive the examination unharmed, thus saving money and resources.
ii) Safety. NDT is also appealing because almost all NDT techniques (except radiographic
testing) are harmless to people.
iii) Efficiency. NDT methods allow for the thorough and relatively quick evaluation of assets,
which can be crucial for ensuring continued safety and performance on a job site.
iv) Accuracy. NDT methods have been proven accurate and predictable; both qualities you
want when it comes to maintenance procedures meant to ensure the safety of personnel
and the longevity of equipment.
There are several techniques used in NDT for the collection of various types of data, each
requiring its own kind of tools, training, and preparation.
Some of these techniques might allow for a complete volumetric inspection of an object, while
others only allow for a surface inspection. In a similar way, some NDT methods will have varying
degrees of success depending on the type of material they are used on, and some techniques—
such as Magnetic Particle NDT, for example—will only work on specific materials (i.e., those
that can be magnetized).
Definition: Visual Non-Destructive Testing is the act of collecting visual data on the status of a
material. Visual Testing is the most basic way to examine a material or object without altering it
in any way.
Visual Testing can be done with the naked eye, by inspectors visually reviewing a material or
asset. For indoor Visual Testing, inspectors use flashlights to add depth to the object being
examined. Visual Testing can also be done with an RVI (Remote Visual Inspection) tool, like a
camera. To get the camera in place, NDT inspectors may use a robot or drone, or may simply
hang it from a rope.
In general, Ultrasonic Testing uses sound waves to detect defects or imperfections on the surface of a material.
One of the most common Ultrasonic Testing methods is the pulse echo. With this technique,
inspectors introduce sounds into a material and measure the echoes (or sound reflections)
produced by imperfections on the surface of the material as they are returned to a receiver.
Radiography Testing directs radiation from a radioactive isotope or an X-ray generator through
the material being tested and onto a film or some other kind of detector. The readings from the
detector create a shadowgraph, which reveals the underlying aspects of the inspected material.
Radiography Testing can uncover aspects of a material that can be hard to detect with the naked
eye, such as alterations to its density.
Definition: Eddy Current Non-Destructive Testing is a type of electromagnetic testing that uses
measurements of the strength of electrical currents (also called eddy currents) in a magnetic field
surrounding a material in order to make determinations about the material, which may include the
locations of defects.
To conduct Eddy Current Testing, inspectors examine the flow of eddy currents in the magnetic
field surrounding a conductive material to identify interruptions caused by defects or
imperfections in the material.
To use Magnetic Particle Testing, inspectors first induce a magnetic field in a material that is
highly susceptible to magnetization. After inducing the magnetic field, the surface of the material
is then covered with iron particles, which reveal disruptions in the flow of the magnetic field.
These disruptions create visual indicators for the locations of imperfections within the material.
Definition: Acoustic Emission Non-Destructive Testing is the act of using acoustic emissions to
identify possible defects and imperfections in a material.
Inspectors conducting Acoustic Emission Tests are examining materials for bursts of acoustic
energy, also called acoustic emissions, which are caused by defects in the material. Intensity,
location, and arrival time can be examined to reveal information about possible defects within the
material.
Definition: Liquid Penetrant Non-Destructive Testing refers to the process of using a liquid to
coat a material and then looking for breaks in the liquid to identify imperfections in the material.
Inspectors conducting a Penetrant Test will first coat the material being tested with a solution that
contains a visible or fluorescent dye. Inspectors then remove any extra solution from the
material’s surface while leaving the solution in defects that “break” the material’s surface. After
this, inspectors use a developer to draw the solution out of the defects, then use ultraviolet light to
reveal imperfections (for fluorescent dyes). For regular dyes, the color shows in the contrast
between the penetrant and the developer.
Definition: Leak Non-Destructive Testing refers to the process of studying leaks in a vessel or
structure in order to identify defects in it.
Inspectors can detect leaks within a vessel using measurements taken with a pressure gauge, soap-
bubble tests, or electronic listening devices, among others.
ii) Inadequate marking: BS 2770 (86th edition, April 30, 1990) provides a specification for the pictorial marking of handling instructions for goods in transit: a set of symbols for the marking of packages to convey handling instructions without the use of specific language.
iii) Failure to treat for prevention of corrosion: various cleaning methods exist for the removal of oil, rust and miscellaneous contamination, followed by preventive treatments and coatings.
iv) Degradation of Packaging materials owing to method of storage prior to use.
v) Inadequate adjustments or padding prior to packaging, and lack of handling care during transport: this requires adequate work instructions, packing lists, training, etc.
Choosing the most appropriate packaging involves considerations of cost, availability and size, for which reason a compromise is usually sought. Crates, rigid and collapsible boxes, cartons, wallets, tri-wall wrapping, chipboard cases, sealed wrapping, fabricated and moulded spacers, corner blocks and cushions, bubble wrapping, etc. are a few of the many alternatives available to meet any particular packaging specification.
Environmental testing involving vibration and shock tests, together with climatic tests, is necessary to qualify a packaging arrangement. This work is undertaken by a number of test houses and may save large sums if it ultimately prevents damaged goods being received, since the cost of a defect rises tenfold and more once equipment has left the factory. As well as the specified environmental tests, the product should be transported over a range of typical journeys and then retested to assess the effectiveness of the proposed pack.
6.6 Preproduction Testing
Preproduction testing is meant to certify that a production model completely satisfies the technical criteria indicated in the engineering specification. This test is typically done ahead of mass production of the equipment/items, to make certain of the practicability of satisfying the specifications under usual manufacturing conditions. The test is usually carried out by an independent section of the company, such as the quality control section. Pre-production testing usually includes a performance test (within some specified limits/range), an environmental test, a reliability test, a maintainability test, a packaging and transport test (which covers shock and vibration tests) and a physical characteristics test (involving ergonomics).
A brief discussion of the above-mentioned tests embodied in the pre-production test is useful; hence their discussion below:
i) Environmental Testing
This is the type of test whereby the expected operating environmental conditions of the equipment are simulated, and the product is made to function in them. The bottom line is to determine whether the product will function as stated in the specification, uninterruptedly, without damage or diminution in functionality and serviceability. It is subjected to stated boundaries of its environment, using test equipment such as ovens and refrigerators for the simulation of diverse temperatures and humidities, and other equipment capable of simulating various degrees of shock, vibration and the like.
This proves that the equipment functions to specification (for a sustained period) and is not degraded or damaged by defined extremes of its environment. The test can cover a wide range of parameters and it is important to agree on a specification which is realistic. It is tempting, when in doubt, to widen the limits of temperature, humidity and shock in order to be extra sure of covering the likely range which the equipment will experience. The resulting cost of overdesign, even for a few degrees of temperature, may be totally unjustified.
The possibilities are numerous and include:
Electrical
o Electric fields.
o Magnetic fields.
o Radiation.
Climatic
o Temperature extremes
o Temperature cycling – internal and external may be specified.
o Humidity extremes.
o Temperature cycling at high humidity.
o Thermal shock – rapid change of temperature.
o Wind – both physical force and cooling effect.
o Wind and precipitation.
o Direct sunlight.
o Atmospheric pressure extremes.
Mechanical
o Vibration at a given frequency – a resonant search is often carried out.
o Vibration at simultaneous random frequencies – used because resonances at different
frequencies can occur simultaneously.
o Mechanical shock – bump.
o Acceleration.
Chemical and hazardous atmospheres
o Corrosive atmosphere – covers acids, alkalis, salt, greases, etc.
o Foreign bodies – ferrous, carbon, silicate, general dust, etc.
o Biological – defined growth or insect infestation.
o Reactive gases.
o Flammable atmospheres.
and reassembling. This parameter becomes essential when considering the availability of the product, which, of course, is linked with the ability to reduce downtime.
iv) Ergonomics Test
This test is meant to determine the ease of interaction between the operator/maintenance personnel and the pre-production model. It may uncover likely recurrent operator mistakes, lapses or time wasting resulting from the number and positioning of the keys, knobs, switches, etc. All these are a function of the design, and affect the operator's convenience and the accuracy with which he performs his duties.
v) Marginal Testing
This involves proving the various system functions at the extreme limits of the electrical and
mechanical parameters and includes:
Electrical
o Mains supply voltage.
o Mains supply frequency.
o Insulation limits.
o Earth testing.
o High voltage interference – radiated. Typical test apparatus consists of a spark plug,
induction coil and break contact.
o Mains-borne interference.
o Line error rate – refers to the incidence of binary bits being incorrectly transmitted in a digital system. Usually expressed as 1 in 10^n bits.
o Line noise tests – analogue circuits.
o Electrostatic discharge – e.g. 10 kV from 150 pF through 150 Ω to conductive surfaces.
o Functional load tests – loading a system with artificial traffic to simulate full utilization
(e.g. call traffic simulation in a telephone exchange).
o Input/output signal limits – limits of frequency and power.
o Output load limits – sustained voltage at maximum load current and testing that current
does not increase even if load is increased as far as a short circuit.
Mechanical
o Dimensional limits – maximum and minimum limits as per drawing.
o Pressure limits – covers hydraulic and pneumatic systems.
o Load – compressive and tensile forces and torque.
7. Assembly
8. Firmware Development (Assembly, C and/or BASIC)
9. Testing and debugging
A) Construction
i) Breadboard: This is a quick way to set up a circuit. It can be used only with through-hole components, although adapters for some surface-mount devices exist. It is not suitable for high frequencies, owing to stray capacitance and inductance, and because the long connecting traces inside the breadboard act as antennas. It is also not suitable for high-current and/or high-voltage circuits.
ii) Point-to-point wiring: More permanent than a breadboard. Components are soldered to a perforated board, typically known as Veroboard.
iii) Wire wrapping: This used to be popular in the 1970s and 1980s. It is similar to point-to-point wiring except that a special tool/gun and special sockets are used. Wire is connected to endpoints by wrapping it onto a rectangular pin instead of soldering.
iv) Printed Circuit Board (PCB): This is the most reliable method. It used to be the costliest method of prototyping, but with recent advances in the automated manufacturing of printed circuit boards, prices have decreased significantly.
B) Assembly
i) Hand assembly: Suitable for prototypes.
ii) Pick-and-place machines: Suitable for mass production. They take time to set up, and there is a setup fee. Some parts need to be purchased in reels.
D) Bugs to Expect:
i) Design error.
ii) The design is OK but there is an error in the schematic.
iii) The schematic is OK but a mistake was made in the board layout. Examples:
o Wrong package for the part
o Mirrored package for the part
o Wrong hole size
o Too thin a trace to carry the current
o Supply and/or ground does not reach all components
o Missing traces
o Packages conflict with each other
iv) Board manufacturing flaws: usually short circuits or open circuits. This is rare when the board comes from a professional board house, but it may happen.
v) Assembly errors: the wrong component was installed. For example, a 10 kΩ resistor was intended and a 10 Ω resistor was installed.
vi) The correct component was installed, but installed backwards. This typically happens with 2-pin components that have a polarity, such as diodes.
vii) Component damaged during assembly. Some resistors may crack or break during soldering due to high heat, and some ICs cannot tolerate a soldering iron's heat for longer than 10 seconds; cool the IC package during soldering if necessary.
viii) Printed circuit board traces can be lifted during soldering if the soldering iron is kept on the board too long.
6.11 Methods of Sampling Plan for Testing Large and Small Batch Quantities
Introduction
Suppose a large supermarket is supplied with large batches of pre-packed sandwiches for its food department by a catering firm, and the supermarket manager wishes to test the sandwiches so as to certify their freshness and quality. The only way she can test them is by unwrapping and tasting them. It is obvious, however, that it will no longer be possible to sell them after the test. She is therefore obliged to decide whether or not the batch is acceptable based on testing a relatively small sample of sandwiches. This is known as acceptance sampling.
Acceptance sampling may be applied where large quantities of similar items or large batches of
material are being bought or are being transferred from one part of an organization to another. Unlike
statistical process control where the purpose is to check production as it proceeds, acceptance
sampling is applied to large batches of goods which have already been produced.
It is discernible that the test on the sandwiches is a destructive test because after the test has been
carried out the sandwich is no longer saleable. Other reasons for applying acceptance sampling are
that when buying large batches of components it may be too expensive or too time-consuming to test
them all. In other cases when dealing with a well established supplier the customer may be quite
confident that the batch will be satisfactory but will still wish to test a small sample to make sure.
The characteristics of acceptance sampling are that each item tested is classified as conforming or non-conforming. (Items used to be classified as defective or non-defective, but these days no self-respecting manufacturing firm will admit to making defective items.)
A sample is taken and if it contains too many non-conforming items the batch is rejected, otherwise it
is accepted.
For this method to be effective, batches containing some non-conforming items must be acceptable. If
the only acceptable percentage of non-conforming item is zero this can only be achieved by
examining every item and removing those that are non-conforming. This is known as 100%
inspection and is not acceptance sampling. However, the definition of non-conforming may be chosen as required. For example, if the contents of jars of jam are required to be between 453 g and 461 g, it would be possible to define a jar with contents outside the range 455 g to 459 g as non-conforming. Batches containing up to, say, 5% non-conforming items could then be accepted in the knowledge that, unless there was something very unusual about the distribution, this would ensure that virtually all jars in the batch contained between 453 g and 461 g.
Example:
Suppose a mobile phone company produces mobile phones in lots of 100. To check the quality of the lots, the quality inspector of the company uses a single sampling plan with n = 15 and c = 1. Explain the procedure for implementing it.
Solution:
For implementing the single sampling plan, the quality inspector of the company randomly draws
a sample of 15 mobile phones from each lot and classifies each mobile of the sample as non-
conforming or conforming. At the end of the inspection, he/she counts the number of non-
conforming mobiles (d) found in the sample and compares it with the acceptance number (c). If d ≤ c (= 1), he/she accepts the lot, and if d > c (= 1), he/she rejects the lot under the acceptance sampling plan. Under a rectifying sampling plan, if d ≤ c (= 1), he/she accepts the lot after replacing all non-conforming mobiles found in the sample with conforming mobiles; and if d > c, he/she accepts the lot after inspecting the entire lot and replacing all non-conforming mobiles in the lot with conforming mobiles.
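The arithmetic behind this decision rule is binomial. The minimal sketch below (illustrative only; the function name is ours) computes the probability of accepting a lot for a given proportion p of non-conforming phones, treating the draws as independent, i.e. the usual binomial approximation for a sample that is small relative to the lot.

from math import comb

def p_accept_single(n, c, p):
    """P(accept) = P(d <= c), where d ~ Binomial(n, p)."""
    return sum(comb(n, d) * p**d * (1 - p)**(n - d) for d in range(c + 1))

# The plan from the example: n = 15, c = 1.
for p in (0.01, 0.05, 0.10):
    print(f"p = {p:.2f}: P(accept) = {p_accept_single(15, 1, p):.3f}")

Running this shows how the acceptance probability falls as lot quality worsens, which is exactly the operating characteristic of the plan.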
Sometimes, situations arise when it is not possible to decide whether to accept or reject the lot on
the basis of a single sample. In such situations, we use a sampling plan known as the double
sampling plan. In this plan, the decision of acceptance or rejection of a lot is taken on the basis of
two samples. A lot may be accepted immediately if the first sample is good or may be rejected if
it is bad. If the first sample is neither good nor bad, the decision is based on the evidence of the
first and second sample combined.
In this section we shall explain the concept of the double sampling plan and the procedure for
implementing it.
A sampling plan in which a decision about the acceptance or rejection of a lot is based on two
samples that have been inspected is known as a double sampling plan.
The double sampling plan is used when a clear decision about acceptance or rejection of a lot cannot be taken on the basis of a single sample. In a double sampling plan, generally, the decision of acceptance or rejection of a lot is taken on the basis of two samples. If the first sample is bad,
the lot may be rejected on the first sample and a second sample need not be drawn. If the first
sample is good, the lot may be accepted on the first sample and a second sample is not needed.
But if the first sample is neither good nor bad and there is a doubt about its results, we take a
second sample and the decision of acceptance or rejection of a lot is taken on the basis of the
evidence obtained from both the first and the second samples.
For example, suppose a buyer purchases resistors in lots of 500 from a company. To check the
quality of the lots, the buyer and the company decide that the buyer will draw two samples of
sizes 10 (first sample) and 20 (second sample) and the acceptance numbers for the plan are 1 and
3. The buyer takes the two samples and makes the decision of acceptance or rejection of the lot on the basis of the two samples. Since the decision of acceptance or rejection of the lot is taken on the basis of two samples, this is a double sampling plan.
A double sampling plan requires the specification of four quantities which are known as its
parameters. These parameters are
n1 – size of the first sample,
c1 – acceptance number for the first sample,
n2 – size of the second sample, and
c2 – acceptance number for both samples combined.
Therefore, the parameters of the double sampling plan in the above example are:
the size of the first sample (n1) = 10,
the acceptance number for the first sample (c1) = 1,
the size of the second sample (n2) = 20, and
the acceptance number for both samples combined (c2) = 3.
So far you have learnt the definition of the double sampling plan and why it is used. We now
describe the procedure for implementing it and its advantages over the single sampling plan.
Step 1: We draw a random sample (first sample) of size n1 from the lot received from the supplier or the final assembly.
Step 2: We inspect each and every unit of the sample and classify it as non-conforming or
conforming. At the end of the inspection, we count the number of non-conforming units found in
the sample. Suppose the number of non-conforming units found in the first sample is d1.
Step 3: We compare the number of non-conforming units (d1) found in the first sample with the
stated acceptance numbers c1 and c2.
Step 4: We take the decision on the basis of the first sample as follows:
If d1 ≤ c1, we accept the lot and replace all non-conforming units found in the sample by
conforming units. If d1 > c2, we accept the lot after inspecting the entire lot and replacing all non-
conforming units in the lot by conforming units. But if c1 < d1 ≤ c2, the first sample does not give a decision.
Step 5: If c1 < d1 ≤ c2, we draw a second random sample of size n2 from the lot.
Step 6: We inspect each and every unit of the second sample and count the number of non-
conforming units found in it. Suppose the number of non-conforming units found in the second
sample is d2.
Step 7: We combine the numbers of non-conforming units (d1 and d2) found in both samples and consider d1 + d2 for taking the decision about the lot on the basis of the second sample as follows: if d1 + d2 ≤ c2, we accept the lot, and if d1 + d2 > c2, we reject it.
Example:
Suppose a mobile phone company produces mobile phones in lots of 400 phones each. To check the quality of the lots, the quality inspector of the company uses a double sampling plan with n1 = 15, c1 = 1, n2 = 30, c2 = 3. Explain the procedure for implementing it under an acceptance sampling plan.
Solution:
For implementing the double sampling plan, the quality inspector of the company randomly draws a first sample of 15 mobiles from the lot and classifies each mobile of the first sample as non-conforming or conforming. At the end of the inspection, he/she counts the number of non-conforming mobiles (d1) found in the first sample and compares d1 with the acceptance numbers c1 and c2. If d1 ≤ c1 = 1, he/she accepts the lot, and if d1 > c2 = 3, he/she rejects the lot. If c1 < d1 ≤ c2, that is, if the number of non-conforming mobiles is 2 or 3, he/she draws the second sample from the lot. He/she then counts the number of non-conforming mobiles (d2) found in the second sample and compares the total number of non-conforming mobiles (d1 + d2) in both samples with the acceptance number c2. If d1 + d2 ≤ c2 = 3, he/she accepts the lot, and if d1 + d2 > c2 = 3, he/she rejects the lot.
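Under the same binomial approximation used above, the acceptance probability of this double sampling plan can be sketched as follows (function names are ours; illustrative only). The lot is accepted outright when d1 ≤ c1, and via the second sample when c1 < d1 ≤ c2 and d1 + d2 ≤ c2.

from math import comb

def pmf(n, d, p):
    """Binomial probability of exactly d non-conforming in a sample of n."""
    return comb(n, d) * p**d * (1 - p)**(n - d)

def p_accept_double(n1, c1, n2, c2, p):
    """P(accept) for a double sampling plan under a binomial model."""
    outright = sum(pmf(n1, d1, p) for d1 in range(c1 + 1))
    via_second = sum(
        pmf(n1, d1, p) * sum(pmf(n2, d2, p) for d2 in range(c2 - d1 + 1))
        for d1 in range(c1 + 1, c2 + 1)
    )
    return outright + via_second

# The plan from the example: n1 = 15, c1 = 1, n2 = 30, c2 = 3.
print(round(p_accept_double(15, 1, 30, 3, 0.05), 3))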
Example
A firm is to introduce an acceptance sampling scheme. Three alternative plans are considered.
Plan A Take sample of 50 and accept the batch if no non-conforming items are found,
otherwise reject.
Plan B Take a sample of 50 and accept the batch if 2 or fewer non-conforming items are
found.
Plan C Take a sample of 40 and accept the batch if no non-conforming items are found.
Reject the batch if 2 or more are found. If one is found, then take a further sample
of size 40. If a total of 2 or fewer (out of 80) is found, accept the batch, otherwise
reject.
a) Find the probability of acceptance for each of the plans A, B and C if batches are
submitted containing
(i) 1% non-conforming (ii) 10% non-conforming
b) Without further calculation, sketch on the same axes the operating characteristic for plans
A, B and C.
c) Show that, for batches containing 1% non-conforming, the average number of items
inspected when using plan C is similar to the number inspected when using plan A or B.
Solution:
a) Plan A: accept 0.
P(accept) = (1 − p)^50.
For p = 0.01, P(accept) = 0.99^50 = 0.605;
for p = 0.1, P(accept) = 0.9^50 = 0.005.
Plan B: accept 0, 1 or 2.
P(accept) = (1 − p)^50 + 50p(1 − p)^49 + 1225p^2(1 − p)^48, where 1225 = C(50, 2).
For p = 0.01, P(accept) = 0.605 + 0.306 + 0.076 = 0.986;
for p = 0.1, P(accept) = 0.005 + 0.029 + 0.078 = 0.112.
Plan C: accept 0 in the first sample (in which case no second sample will be taken), or 1 in the first sample and 0 in the second sample, or 1 in the first sample and 1 in the second sample. There are no other ways of accepting the batch: if 2 or more are found in the first sample the batch is immediately rejected, and if 1 is found in the first sample and 2 or more in the second (giving a total of 3 or more) the batch is rejected.
The samples are of equal size and the batch is large, so the probability of acceptance may be expressed as
P(accept) = P(0) + P(1) × P(0) + P(1) × P(1),
where P(0) = (1 − p)^40 and P(1) = 40p(1 − p)^39.
For p = 0.01, P(0) = 0.669 and P(1) = 0.270, giving P(accept) = 0.669 + 0.270 × 0.669 + 0.270^2 = 0.923;
for p = 0.1, P(0) = 0.015 and P(1) = 0.066, giving P(accept) ≈ 0.020.
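For part (b), the operating characteristics can be visualised numerically rather than sketched by hand. The following sketch (ours, using numpy and matplotlib with the formulas from part (a)) plots P(accept) against p for the three plans; it is illustrative, not part of the original solution.

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.0, 0.15, 300)           # proportion non-conforming
oc_a = (1 - p)**50                         # Plan A: accept 0 of 50
oc_b = oc_a + 50*p*(1 - p)**49 + 1225*p**2*(1 - p)**48   # Plan B: accept <= 2
p0, p1 = (1 - p)**40, 40*p*(1 - p)**39     # Plan C building blocks
oc_c = p0 + p1*p0 + p1**2                  # accept 0, or 1 then 0, or 1 then 1

for oc, label in ((oc_a, "Plan A"), (oc_b, "Plan B"), (oc_c, "Plan C")):
    plt.plot(p, oc, label=label)
plt.xlabel("Proportion non-conforming, p")
plt.ylabel("Probability of acceptance")
plt.legend()
plt.show()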
will, on average, require fewer items to be inspected than the single sampling plan. This will be true for any value of p. Against this, the double sampling plan is more complex to operate.
A double sampling plan has the following two main advantages over a single sampling plan:
i) The principal advantage of the double sampling plan over the single sampling plan is that, for the same degree of protection (i.e., the same probability of accepting a lot of a given quality), the double sampling plan may have a smaller average sample number (ASN) than that of the corresponding single sampling plan. The underlying reason is that the size (n1) of the first sample in the double sampling plan is always smaller than the sample size (n) of an equivalent single sampling plan. Thus, whenever a decision is taken on the basis of the first sample alone, the ASN of the double sampling plan is lower (a common expression for the ASN is given after this list).
ii) The double sampling plan has the psychological advantage of giving a lot a second chance. From the viewpoint of the producer/manufacturer, it may seem unfair to reject a lot on the basis of a single sample. The double sampling plan permits the decision to be made on the basis of two samples.
However, double sampling plans are costlier to administer than single sampling plans.
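For reference, on the usual assumption that the second sample, when drawn, is fully inspected, the average sample number of a double sampling plan can be written as ASN = n1 + n2 × P(c1 < d1 ≤ c2), where P(c1 < d1 ≤ c2) is the probability that the first sample proves inconclusive. For the resistor plan above (n1 = 10, n2 = 20), the ASN therefore lies between 10 and 30, depending on the quality of the lots submitted.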
Chapter Seven
It may not always be possible to simplify a system so as to increase its reliability. Nevertheless, it
is usually worthwhile to pose the question “Can the system be simplified?” because it is quite
common for over-elaborate system requirements to be specified at the initial design stage. The
question “do you really need an all-singing and all-dancing system?” can encourage a design team
to reconsider whether an ultra-sophisticated system is really needed.
7.4 Reduction in Complexity
An integrated circuit usually has a lower failure rate than the group of discrete components which it replaces, and so system reliability can often be improved by replacing a circuit using many discrete components with a single integrated circuit. Custom-built integrated circuits are usually expensive, however, and so their use will usually involve a cost penalty.
7.5 Use of Fault Tolerance
A series system as described in section 7.1 can be described as “fault intolerant” because
failure of any one component will cause total system failure.
A fault tolerant system is one in which at least some parts of the system may fail without
causing total system failure. An example of fault-tolerance is in the design of a two-engine
aircraft, which is capable of flying on one engine. However, there is only limited fault-tolerance in the aircraft; a major structural failure (for example, loss of the tail plane) will still cause the aircraft to crash.
7.6 Use of Preventive Maintenance
Preventive maintenance is aimed at preventing failures and is exemplified by the regular maintenance actions taken with cars, like checking tyre pressures and checking oil and coolant levels. It may not be easy to decide during the Design Phase exactly what impact preventive maintenance will have, although the early replacement of limited-life components brings an obvious improvement in reliability. During the Production Phase, a careful study and analysis of the failures which occur in the field may indicate how, when and where preventive maintenance should be introduced or extended.
7.7 Use of Corrective Maintenance
Corrective Maintenance (repair) is described in section 4.2. It consists of those actions which return a failed system to working order. From a reliability viewpoint, corrective maintenance chiefly affects the system availability, and so faster and more effective corrective maintenance will generally decrease the system downtime and increase the system availability.
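As an illustration, with hypothetical figures: steady-state availability is commonly expressed as A = MTBF / (MTBF + MTTR). For a system with an MTBF of 1000 hours, halving the mean time to repair from 10 hours to 5 hours raises the availability from 1000/1010 ≈ 0.990 to 1000/1005 ≈ 0.995.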
[Figure: product life-cycle stages (Customer Requirement, Preliminary design, Model, Prototype Development, Manufacturing, Service, Retired) linked to an Experience Base]
2) Manufacturing Defect;
Although the design may be free from error, defects introduced at some stage in
manufacturing may degrade it. Some common examples are:
i. Poor surface finish or sharp edges (burrs) that lead to fatigue cracks, and
ii. Decarburisation or quench cracks in heat-treated steel.
The elimination of defects in manufacturing is a key responsibility of the manufacturing engineering staff, but a strong relationship with the Research and Development function may be required to achieve it. Manufacturing errors produced by the production workforce are due to such factors as lack of proper instructions or specifications, insufficient supervision, poor working environment, unrealistic production quotas, inadequate training and poor motivation.
3) Maintenance;
Most engineering systems are designed on the assumption that they will receive adequate maintenance at specified periods. When maintenance is neglected or improperly performed, service life will suffer. Since many consumer products do not receive proper maintenance from their owners, a good design strategy is to make the products maintenance-free.
4) Exceeding Design Limits;
If the operator exceeds the limits of temperature, speed, etc. for which the equipment was designed, it is likely to fail.
5) Environmental Factors;
Subjecting equipment to environmental conditions for which it was not designed, e.g. rain,
high humidity and ice usually greatly shortens its service life.
7.11 Safety Factor and Reliability
A variety of methods are used in engineering design practice to improve reliability. We generally aim at a probability of failure (Pf) of 10^-6 for structural applications and a Pf of 10^-4 to 10^-3 for unstressed applications.
The factor of safety is defined as:
Factor of safety = maximum strength / maximum stress = S_max / σ_max
Another concept is the safety margin, defined by:
Safety margin = (minimum strength − maximum stress) / minimum strength = (S_min − σ_max) / S_min
It is often believed that the use of a safety factor greater than some preconceived magnitude, usually above 2.5, will result in no failures. Actually, for the same safety factor the probability of failure may vary from satisfactorily low to intolerably high. It is known that distributions exist in both the load (stress) requirement and the available strength. It is these distributions, as defined by mean values, standard deviations and/or other parameters, with which the designer should be concerned. The safety factor concept overlooks this variability, which may give different reliabilities for the same factor.
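This point can be illustrated with the classical stress-strength interference model, assuming (for illustration only) independent, normally distributed stress and strength; the numbers below are hypothetical. Both designs share the same mean safety factor of 2.5, yet their reliabilities differ markedly because of different scatter.

from math import sqrt
from statistics import NormalDist

def reliability(mu_strength, sd_strength, mu_stress, sd_stress):
    """R = P(strength > stress) for independent normal variables."""
    z = (mu_strength - mu_stress) / sqrt(sd_strength**2 + sd_stress**2)
    return NormalDist().cdf(z)

# Same mean safety factor (250 / 100 = 2.5) in both cases:
print(reliability(250, 10, 100, 10))   # low scatter:  R close to 1.0
print(reliability(250, 60, 100, 40))   # high scatter: R about 0.98

The second figure, roughly one failure in fifty, would be intolerable in a structural application even though the nominal safety factor is identical.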
Another critical design task is to determine the reliability degradation (if any) of the system/product due to storage, packing, transportation and handling. The equipment is subjected to these environments when initially shipped from the factory and distributed to the customer, stored as a spare, returned to the depot or supplier for maintenance, and so on. Sometimes these environments include extreme conditions of rain, sand and dust, salt spray, high and low temperature and high humidity.
In the event of degradation, either additional design provisions are needed to compensate for the reduction in reliability, or an increase in the quantity of maintenance actions will result. In either case the impact on logistics support and cost is evident.
Quite often, the reliability of a system is degraded through the performance of preventive and corrective maintenance actions. Unless extreme care is taken, maintenance-induced faults may inadvertently be introduced in the accomplishment of a maintenance action, or components may be partly damaged to the extent that subsequent system failures occur more frequently than initially anticipated. This is primarily due to carelessness on the part of individuals performing maintenance, use of the wrong tools and test equipment, failure to follow approved maintenance procedures, and so on. Thus, it is extremely important that the proper logistics support resources be applied in performing system/product maintenance if the reliability of the system is to be maintained.