CPE 515 - Reliability and Maintainability
AND MAINTENANCE
(CPE 506)
Engr. Prof. Abdulfattah A. Aboaba
Associate Prof.
Computer Engineering Department
UNIVERSITY OF
MAIDUGURI
1
Learning Objectives
3
A Motivating Availability
Example
Consider a simple example of an online
brokerage which is in the process of
designing its site and selecting the
components that will be used in its design.
4
The site is used by users to get quotes on
stocks and mutual funds, manage portfolios,
conduct risk analysis, and to place orders to
trade stocks and mutual funds.
In the securities trading business, Web
service availability is a key QoS metric.
If customers are denied access to the trading
services, they may incur financial losses and
the trading company may be liable for these
losses.
5
Trading site architecture
[Figure: trading site architecture, showing the Internet, elements labelled R, and a load balancer]
7
– The company is considering two types of
boxes: highly reliable, expensive, high-end
servers with hot-swappable CPU boards
and disks, as well as less expensive, less
reliable, low-end servers.
– Management wants to answer the following
questions:
• What is the least expensive configuration that
meets the 99.99% availability requirement?
• All low-end servers, all high-end servers, or a
mix of low-end and high-end servers?
8
RELATIONSHIP BETWEEN
FAILURE RATE AND MTBF
• Failure Rate - The frequency with which a
system fails, expressed in failures per hour.
– It is denoted by λ.
– It varies over the life of the system.
– It is the reciprocal of MTBF,
that is, MTBF = 1/λ.
• Where the failure rate is assumed to be
constant, the MTBF may be reported instead.
This holds only in the flat region of
the bathtub curve, that is, the useful life region.
9
DEFINITIONS OF MTBF
• MTBF - Mean time between failures (MTBF) is the
predicted elapsed time between inherent failures of a
REPAIRABLE system/device during operation, or the
arithmetic mean (average) time between failures of a
REPAIRABLE system/subsystem/device.
• MTBF is the sum of length of the operational periods
divided by the number of observed failures.
• MTBF – The average operating time expected
between failures in a population of identical
components.
10
DEFINITIONS OF MTTF
• MTTF – Mean time to failure (MTTF) is the
predicted elapsed time to failure of a
NON-REPAIRABLE system/device during operation, or the
arithmetic mean (average) time to failure of a
NON-REPAIRABLE system/subsystem/device.
• MTTF is the length of time a device is expected to
last in operation before it fails.
• MTTF – The average operating time expected before
failure of a component which is not repaired or
replaced. Simply put, it is the average time to failure
of ‘n’ units: the sum of the ‘n’ individual unit times to
failure divided by ‘n’.
11
DEFINITIONS OF MTTR
• MTTR - Mean time to repair (MTTR) is a basic
measure of the maintainability of repairable items. It
represents the average time required to repair a
failed component or device. It is the total
corrective maintenance time divided by the total number of
corrective maintenance actions during a given period
of time. In fault-tolerant design, MTTR is usually
considered to also include the time the fault is latent,
that is, the time from when the failure occurs until it is
detected. Another meaning of MTTR is Mean Time to
Recovery.
• MTTR does not include lead time for parts not
readily available, or administrative or logistic downtime.
RELATIONSHIP BETWEEN MTBF,
MTTF, AND MTTR
13
Relationship between MTTF,
MTTR, and MTBF
[Figure: timeline of alternating up, down, and up periods, with the MTBF interval marked]
14
VARIATION of MTBF
17
REGIONS OF THE BATHTUB
CURVE
Explanation of the three regions and the
observed failure rate curve (mapping the three
curves together).
• Region 1 – Decreasing failure rate or early
failure
• Region 2 – Constant failure rate – Random
failure
• Region 3 – Increasing failure rate – Wear out
failure
18
Failure Density Function (FDF)
19
Reasons of System Failure
• To categorize different types of failure,
three dimensions are considered:
– Duration, Effect, and Scope
20
• Duration of the failures :
– Permanent failures:
• A system stops working and there is no possibility of
repairing or replacing it. (e.g., unmanned space ship)
– Recoverable failures:
• The system is placed back in operation after a fault is
recovered. (e.g., Web site inaccessibility due to its
connection to the internet being down)
– Transient failures:
• Characterized by having a very short duration and may
not require major recovery actions. (e.g., problems that
can be solved by resetting network routers or rebooting
servers.)
21
• Effect of the failures:
– Functional failures:
• The system does not operate according to its functional
specifications. (e.g., an online bookstore failing to display
information about a book even though it is in the catalog)
– Performance failures:
• Even though the system may be executing the requested
functions correctly, they are not executed in a timely
fashion. (e.g., A search engine that presents very
accurate results to requests for search but takes more
than a minute on average to process each request)
22
• Scope of the failure:
– Partial failures:
• Some of the services provided by the computer system
become unavailable, while others can still be used.
(e.g., The services that allow customers to bid in an
online auction site may become unavailable due to the
failure of the servers that process these types of
requests, while customers may still be able to see
existing bids)
– Total failures:
• Characterized by a complete disruption of all services
offered by the computer system. (e.g., power outages
could cause a Web site to go down completely)
23
MATHEMATICAL
DERIVATIONS – R(t)
24
MATHEMATICAL DERIVATIONS –
MTBF, MTTF, AND FAILURE RATE
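Relationship Between MTBF, MTTF & Failure Rate
$$m = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = \left[-\frac{1}{\lambda}e^{-\lambda t}\right]_0^\infty$$
$$= -\frac{1}{\lambda}\left(e^{-\infty} - e^{-0}\right) = -\frac{1}{\lambda}(0 - 1) = \frac{1}{\lambda}$$
$$\therefore \text{MTBF}\,(m) = \frac{1}{\lambda} \quad \text{for repairable items}$$
$$\text{MTTF}\,(m) = \frac{1}{\lambda}$$
$$\therefore \frac{1}{\text{MTBF}} = \lambda, \quad \text{hence } \lambda = \frac{1}{m}$$
$$R(t) = e^{-\lambda t} = e^{-\frac{1}{m}t} = e^{-t/m}$$
Therefore, after 1 MTBF (that is, t = m), the probability of survival, or reliability, is $R(t) = e^{-1} \approx 0.37$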
26
MATHEMATICAL DERIVATIONS –
MTBF, MTTF, AND FAILURE RATE
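Then $R(T) = e^{-\lambda T}$
And since $e^{-\lambda T}$ could be expressed as $e^{-\lambda \times 1 \times T} = R(1 \times T)$:
$$\therefore R(1 \times T) = e^{-\lambda \times 1 \times T} = \left(e^{-\lambda \times 1}\right)^{T} = \left(e^{-\lambda}\right)^{T}$$
$$\text{Hence } R(1 \times T) = e^{-\lambda \times 1 \times T} = R(1)^{T} = e^{-\lambda T}$$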
28
MATHEMATICAL DERIVATIONS – Rn(t)
Reliability of Multiple Items
$$R_n(t) = e^{-n\lambda t}$$
$$R_n(T) = e^{-n\lambda T} = \left(e^{-\lambda}\right)^{nT}$$
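The constant-failure-rate relationships above can be checked numerically. The following is a minimal sketch, assuming a hypothetical MTBF of 1,000 hours and a hypothetical mission time of 50 hours; the function name `reliability` is illustrative and not part of the course material.

```python
import math


def reliability(failure_rate: float, t: float, n_items: int = 1) -> float:
    """R_n(t) = exp(-n * lambda * t) for n identical items that must all survive."""
    return math.exp(-n_items * failure_rate * t)


mtbf = 1000.0        # hypothetical MTBF in hours (assumption)
lam = 1.0 / mtbf     # constant failure rate, lambda = 1 / MTBF

r_one_mtbf = reliability(lam, mtbf)             # reliability after one MTBF, about 0.37
r_unit = reliability(lam, 1.0)                  # R(1), unit mission time
r_from_unit = r_unit ** 50                      # R(T) = R(1)^T with T = 50
r_direct = reliability(lam, 50.0)               # R(T) computed directly
r_three_items = reliability(lam, 50.0, n_items=3)  # R_n(T) with n = 3

print(f"R(m) = {r_one_mtbf:.4f}")
print(f"R(1)^50 = {r_from_unit:.6f}, R(50) = {r_direct:.6f}")
print(f"R_3(50) = {r_three_items:.6f}")
```

The two ways of computing R(50) agree, which is the point of the R(T) = R(1)^T identity.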
29
DETERMINATION OF FAILURE RATE
Example: A maintenance engineer decided to determine the failure rate of BSC equipment
having 100 identical units. Each of the units was tested for 100 hrs, out of
which 60 completed the test without failure, 20 failed after 800 hrs and rest failed
Solution
Total failures = 40
Total operating hours = 86,000
$$\text{Estimated failure rate} = \frac{\text{Total failures}}{\text{Total operating hours}} = \frac{40}{86{,}000} = 465 \times 10^{-6} \text{ failures per hour}$$
0.999 for unit mission time (T=1), therefore, R = R(1) = 0.999. Determine each
Solution
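From $R(T) = e^{-\lambda_s T} = \left(e^{-\lambda_s}\right)^{T} = R(1)^{T}$
$$\therefore R(T) = R(1)^{T}, \quad \text{where } R(1) = e^{-\lambda_s}$$
$$R(1) = e^{-\lambda_s} = e^{-10\lambda} = \left(e^{-\lambda}\right)^{10} = R_i(1)^{10} = 0.999$$
$$\therefore R_i(1) = 0.999^{1/10}$$
$$-\lambda T = \ln R_i(T)$$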
32
EXAMPLES
Time (t) ×100 hrs   0   2   4   8   10   14   15
Failure             0   1   1   1   1    1    1
Occurrence (n)      0   1   2   3   4    5    6
Continuous tests were conducted on an electronics component, and faults, which
were repaired, occurred as in the table. Calculate the mean time between failures
(MTBF).
Solution:
$$\text{MTBF}\,(m) = \frac{t_n - t_0}{n} = \frac{15 - 0}{6} = \frac{t_n}{n} = \frac{15}{6} \times 10^2 \text{ hrs} = 250 \text{ hrs}$$
𝑡0 → 𝑡1 𝑡1 → 𝑡2 𝑡2 → 𝑡3 𝑡3 → 𝑡4 𝑡4 → 𝑡5 𝑡5 → 𝑡6
0 0.5 0.75 0.75 1 1.25
Solution
$$\sum_{i=1}^{n} f_i t_i$$
where n is the total number of failure occurrences, f is the number of failures, and t is the time (in hours) of failure
occurrence. Note that there are three failure occurrences, that is, at 170 hrs, at 150
hrs, and at 100 hrs.
$$= 5 \times 170 + 5 \times 150 + 10 \times 100 = 2{,}600 \text{ hrs}$$
34
EXAMPLES
a) Total test hours without failure is ST,
where S is the number of items that survived the duration of the test, and T is the duration of the
test (mission time). This refers to those that made it to the end.
$$ST = 30 \times 200 = 6{,}000 \text{ hrs}$$
b) Total survival hours is
$$\sum_{i=1}^{n} f_i t_i + ST$$
c)
$$\text{MTTF} = \frac{\sum_{i=1}^{n} f_i t_i}{\text{Total failed items}} = \frac{10 \times 100 + 5 \times 170 + 5 \times 150}{10 + 5 + 5} = \frac{2{,}600}{20} = 130 \text{ hrs/failure}$$
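The bookkeeping in parts (a) to (c) can be sketched in a few lines. This is only an illustration using the figures quoted in the example (failures at 100, 150 and 170 hours, 30 survivors of a 200-hour test); the variable names are assumptions.

```python
# (number of items failed, time of failure in hours), as quoted in the example
failures = [(10, 100), (5, 170), (5, 150)]
survivors = 30       # S: items that completed the test
test_hours = 200     # T: duration of the test

failed_hours = sum(f * t for f, t in failures)   # sum of f_i * t_i
survivor_hours = survivors * test_hours          # S * T
total_survival_hours = failed_hours + survivor_hours
n_failed = sum(f for f, _ in failures)

# MTTF as defined in part (c): operating hours of failed items per failed item
mttf = failed_hours / n_failed

print(failed_hours, survivor_hours, total_survival_hours, n_failed, mttf)
```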
35
EXAMPLES
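Example: Five identical RAM cards were tested until failure, as shown in the table below.
Determine the MTTF (m).
Failure            0    1    1    2    1
Time (1000 hrs)    0    10   22   45   50
                   t0   t1   t2   t3   t4
Solution
$$\text{MTTF} = \frac{\sum_{i=1}^{n} f_i t_i}{\text{No. of components}} = \frac{(1 \times 10 + 1 \times 22 + 2 \times 45 + 1 \times 50) \times 1000}{5} = \frac{172{,}000}{5} = 34{,}400 \text{ hrs}$$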
36
EXAMPLES
Example: One thousand similar components, each with a constant failure rate of 5% per $10^3$ hours,
are put on test together. Calculate the time that elapses before the failure of the following numbers of
components: (i) 100 and (ii) 500.
Solution
$$\text{Reliability } R = e^{-\lambda t} = 1 - \frac{N_f}{N_t}$$
where $N_f$ and $N_t$ are the number of failed items and the total number of items,
respectively.
$$\lambda = \% \times \frac{1}{R_d} = \frac{5}{100} \times \frac{1}{1000} = 5 \times 10^{-5} = 0.00005$$
In another way, it is
$$\lambda = \frac{\text{percentage}}{\text{rating duration}} = \frac{0.05}{1000} = 0.00005$$
37
EXAMPLES
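If t = time required before failure of the component(s):
(a) $e^{-5 \times 10^{-5}\, t_1} = 1 - \frac{100}{1000}$
$$-5 \times 10^{-5}\, t_1 = \ln 0.90$$
$$t_1 = \frac{\ln 0.90}{-5 \times 10^{-5}} = 2107 \text{ hours}$$
(b) $e^{-5 \times 10^{-5}\, t_2} = 1 - \frac{500}{1000}$
$$t_2 = 13{,}863 \text{ hours}$$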
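The same calculation can be expressed as a small helper that solves $e^{-\lambda t} = 1 - N_f/N_t$ for t. This is a sketch under the example's constant-failure-rate assumption; the function name is hypothetical.

```python
import math


def time_until_failed(failure_rate: float, n_failed: int, n_total: int) -> float:
    """Solve exp(-lambda * t) = 1 - n_failed / n_total for t."""
    surviving_fraction = 1.0 - n_failed / n_total
    return -math.log(surviving_fraction) / failure_rate


lam = 0.05 / 1000    # 5% per 1000 hours, i.e. 5e-5 per hour

t1 = time_until_failed(lam, 100, 1000)   # time until 100 of 1000 have failed
t2 = time_until_failed(lam, 500, 1000)   # time until 500 of 1000 have failed

print(f"t1 = {t1:.0f} hours, t2 = {t2:.0f} hours")
```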
38
EXAMPLES
Example: In a test involving 2030 components lasting for mission time 500 hours, 32
components failed, calculate the failure rate
Solution
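$$\lambda = \frac{1}{N_s} \times \frac{N_f}{M_t}$$
in which $N_s$, $N_f$, and $M_t$ (or T) are the number of surviving components, the number of failed
components, and the mission time, respectively.
$$\text{Therefore, } \lambda = \frac{1}{1998} \times \frac{32}{500} = 3.2 \times 10^{-5}$$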
39
EXAMPLES
Example: Ten computer devices were tested for 500 hours, during which 8
survived the test. Determine (i) the MTBF, (ii) the failure rate, (iii) the reliability, and (iv) the
unreliability.
Solution
$$\text{MTBF} = \frac{N_t}{N_f} \times T = \frac{10}{2} \times 500 = 2{,}500 \text{ hrs/failure}$$
$$\text{Failure rate } (\lambda) = \frac{1}{\text{MTBF}} = \frac{1}{2500} = 0.0004 \text{ failures/hour}$$
$$\text{Reliability } (R) = e^{-t/m} = e^{-500/2500} = e^{-0.2} \approx 0.82$$
$$\text{Unreliability } U = 1 - R = 1 - 0.82 = 0.18$$
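A quick numerical check of this example, following the same formulas step by step. This is a sketch only; the variable names are illustrative.

```python
import math

n_total = 10        # devices on test
n_failed = 2        # 10 tested, 8 survived
test_hours = 500    # T

mtbf = n_total / n_failed * test_hours        # (Nt / Nf) * T
failure_rate = 1.0 / mtbf                     # lambda = 1 / MTBF
reliability = math.exp(-test_hours / mtbf)    # R = exp(-t / m)
unreliability = 1.0 - reliability             # U = 1 - R

print(f"MTBF = {mtbf:.0f} hrs, lambda = {failure_rate:.4f} per hr")
print(f"R = {reliability:.4f}, U = {unreliability:.4f}")
```

Note that $e^{-0.2}$ evaluates to about 0.8187, slightly above the observed surviving fraction of 8/10.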
40
MATHEMATICAL DERIVATIONS – Binomial
Distribution
41
Availability Basics
• Availability:
– The fraction of time that a component (or system)
is operational.
• Consider the notion that a component (or
system) alternates through periods in which it
is operational – the up periods – and periods
in which it is down – the down periods.
• Mean Time To Failure (MTTF)
– The average time it takes for a system to fail.
42
Availability Summary
Computer systems tend to be labeled by the number of “9”s in
the availability. For example a “five-9’s” system has an
availability of 99.999%. Computer system classification
according to their availability is shown below:
Availability Class   Availability   Unavailability (min/year)   System Type
1                    90.0%          52,560                      Unmanaged
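The unavailability column follows directly from the availability figure. A minimal sketch, assuming a 365-day year (525,600 minutes); the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected unavailability in minutes per year for a given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for availability in (90.0, 99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability}% -> {minutes:.2f} min/year")
```

For 90.0% this reproduces the 52,560 min/year in the table row above, and a "five-9's" system is allowed a little over five minutes of downtime per year.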
43
Expression for the availability
of a system
• The following state transition diagram can be used to show that
the system can be in one of two states: up and down.
[Figure: two-state transition diagram with states Up and Down]
• The system fails, i.e., goes from up to down, with a rate λ.
• It gets repaired, i.e., goes from down to up, with a rate μ.
44
MATHEMATICAL DERIVATION -
Availability
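• Writing the failure rate λ and the repair rate μ in terms of the MTTF and MTTR we get,
$$\lambda = \frac{1}{\text{MTTF}} \quad \text{and} \quad \mu = \frac{1}{\text{MTTR}}$$
$$\lambda\, p_{up} = \mu\, p_{down}$$
• Here $p_{up}$ and $p_{down}$ are the probabilities that the system is up and
down, respectively.
• Thus,
$$p_{down} = \frac{\lambda}{\mu}\, p_{up}$$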
45
MATHEMATICAL DERIVATION -
Availability
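• The availability A of a system is simply $p_{up}$.
• We also know that $p_{up} + p_{down} = 1$.
• Therefore, with failure rate λ and repair rate μ,
$$p_{up} + \frac{\lambda}{\mu}\, p_{up} = 1$$
$$\text{or, } p_{up}\left(1 + \frac{\lambda}{\mu}\right) = 1$$
$$\text{or, } p_{up}\,\frac{\mu + \lambda}{\mu} = 1$$
$$\text{or, } p_{up} = \frac{\mu}{\lambda + \mu}$$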
46
MATHEMATICAL DERIVATION - Availability
• Therefore,
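$$A = p_{up} = \frac{\mu}{\lambda + \mu} = \frac{\frac{1}{\text{MTTR}}}{\frac{1}{\text{MTTR}} + \frac{1}{\text{MTTF}}} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$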
47
MATHEMATICAL DERIVATION - Availability
48
Availability Example
Here the MTTF is 20 days (20 × 24 × 60 = 28,800 minutes) and
the MTTR is 10 minutes.
Therefore the availability is given by
A = MTTF/(MTTF + MTTR) = 28,800/(28,800 + 10) = 99.965%
If the system administrator were able to cut the reboot time to 20% of its current value,
the availability would be A = 28,800/(28,800 + 10 × 0.2) = 99.993%
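The availability formula lends itself to a quick sensitivity check. The sketch below uses the MTTF from the example and a few hypothetical repair times (the 5-minute and 1-minute values are assumptions chosen for illustration).

```python
def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)


mttf_minutes = 20 * 24 * 60          # 20 days expressed in minutes
baseline = availability(mttf_minutes, 10)

print(f"MTTR = 10 min -> A = {baseline:.5%}")
for mttr_minutes in (5, 1):          # hypothetical shorter repair times
    a = availability(mttf_minutes, mttr_minutes)
    print(f"MTTR = {mttr_minutes} min -> A = {a:.5%}")
```

Because MTTR appears only in the denominator, every reduction in repair time raises availability, but with diminishing absolute gains once MTTR is small relative to MTTF.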
49
The Reliability of Systems of
Components
• Q. What is the reliability of the system as a function of the
reliability of the components used to build the system?
50
REDUNDANCY
51
Forms and Functions of Redundancy
• FORMS
• Hardware Redundancy – DMR & TMR
• Information – Error Detection & Correction Methods
• Time – Including transient fault detection methods
such as Alternating Logic
• Software – Such as N-version programming
• FUNCTIONS
• Passive Redundancy – Use of excess capacity to prevent
failure (e.g., extra strength in bridge cabling)
• Active Redundancy – Maintaining performance by
monitoring and use of voting logic when a failure is
perceived
52
53
Redundancy Techniques
55
Reliability of Redundant System and
Linear Regression Model
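$$R_x = \sum_{j=0}^{n-1} {}^{n}C_j\,(1 - R)^{n} R^{j}$$
$$\alpha = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i x_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
$$\beta = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$$
Where
x = Independent variable
y = Dependent variable
n = Number of observations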
56
SERIES SYSTEM
R1 R2 … Rn
57
• Assume that the n components fail independently
(failure of one component does not affect any other component).
• Probability theory says that the probability of an
event expressed as the intersection of independent events (all n
components are operational) is the product of the probabilities of
the independent events. Thus,
$$R_S = r_1 \times r_2 \times \cdots \times r_n = \prod_{i=1}^{n} r_i$$
• Implications: each reliability value, $r_i$, is a probability and
therefore $r_i \le 1$.
• Therefore, as more components are added in series, the system
reliability will decrease.
• In the special case when all components have the same reliability
r, we get
$$R_S = r^n$$
58
• A Web site has a Web server (WS), an application server (AS), and
a database server (DS) in series. Let rWS, rAS, and rDB be the
reliabilities of these components and assume their values are
rWS = 0.9, rAS = 0.95, and rDB = 0.99.
• Management wants to replace the database server with a highly
reliable and expensive model that is advertised as having a 0.999
reliability. Is it a wise decision?
– The reliability of the site with the current database server is
Rsite = rWS × rAS × rDB = 0.9 × 0.95 × 0.99 = 0.84645
– With the new database server, the reliability of the site would be
RnewDBsite = rWS × rAS × rnewDB = 0.9 × 0.95 × 0.999 = 0.854145
– If, instead of the database server, the Web server (the most unreliable
component of the system) is replaced by a new one with r = 0.95, the
reliability of the site will be
RnewWSsite = rnewWS × rAS × rDB = 0.95 × 0.95 × 0.99 = 0.89348
Thus it is evident that replacing the most unreliable component has a more
pronounced effect in terms of improving overall system reliability.
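The comparison can be reproduced with a one-line series formula. A minimal sketch, using the reliabilities from the example; the function name is hypothetical and `math.prod` requires Python 3.8 or later.

```python
import math


def series_reliability(reliabilities):
    """Reliability of independent components in series: the product of the r_i."""
    return math.prod(reliabilities)


current = series_reliability([0.90, 0.95, 0.99])    # WS, AS, DB as installed
new_db = series_reliability([0.90, 0.95, 0.999])    # upgrade the database server
new_ws = series_reliability([0.95, 0.95, 0.99])     # upgrade the Web server

print(f"current = {current:.5f}")
print(f"new DB  = {new_db:.5f} (gain {new_db - current:.5f})")
print(f"new WS  = {new_ws:.5f} (gain {new_ws - current:.5f})")
```

The gain from upgrading the Web server is several times larger than the gain from upgrading the database server, because the weakest component dominates the product.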
59
Reliability of a series of identical
components
[Figure: system reliability versus number of components (0 to 50) for component reliabilities 0.9, 0.95, and 0.99]
60
PARALLEL SYSTEMS
R1
61
• For a parallel system to fail, all n components must be down.
• The probability that component i is down is simply $(1 - r_i)$.
• So, assuming independence of failures between components, we
get,
$$R_p = 1 - \Pr[\text{all components are down}] = 1 - (1 - r_1)(1 - r_2)\cdots(1 - r_n) = 1 - \prod_{i=1}^{n}(1 - r_i)$$
• In the special case when all components have the same reliability r,
we get,
$$R_p = 1 - (1 - r)^n$$
62
Reliability of identical components in
parallel
[Figure: system reliability versus number of parallel components (0 to 10) for component reliabilities 0.9, 0.95, and 0.99]
63
• A search engine site wants to achieve a site reliability of 99.999% using a cluster of
very cheap and unreliable Web servers. A cluster is a parallel combination of a
number of servers. Each has a reliability of 85%. How many servers should be
used in the cluster?
From the equation $R_p = 1 - (1 - r)^n$ we know that
$$0.99999 = 1 - (1 - 0.85)^n = 1 - 0.15^n$$
So,
$$0.15^n = 1 - 0.99999 = 0.00001$$
If we apply logarithms to both sides of the above equation and take into consideration
that n must be an integer, we get
$$n = \left\lceil \frac{\ln 0.00001}{\ln 0.15} \right\rceil = \lceil 6.069 \rceil = 7$$
Thus, seven unreliable Web servers can provide a high level of reliability when used in
parallel.
• Thus, we can generalize that
the minimum number ($n_{min}$) of servers with reliability r needed to build a parallel system
with reliability $R_p$ is given by
$$n_{min} = \left\lceil \frac{\ln(1 - R_p)}{\ln(1 - r)} \right\rceil$$
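The sizing rule can be sketched as a short function and checked against the parallel-reliability formula. The function names are hypothetical; rounding up with `math.ceil` reflects the requirement that n be an integer.

```python
import math


def parallel_reliability(r: float, n: int) -> float:
    """Reliability of n identical, independent components in parallel."""
    return 1.0 - (1.0 - r) ** n


def min_parallel_servers(r: float, target: float) -> int:
    """Smallest n such that 1 - (1 - r)^n reaches the target reliability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - r))


n_min = min_parallel_servers(0.85, 0.99999)
print(f"n_min = {n_min}")
print(f"R_p with {n_min - 1} servers = {parallel_reliability(0.85, n_min - 1):.7f}")
print(f"R_p with {n_min} servers = {parallel_reliability(0.85, n_min):.7f}")
```

Six servers fall just short of 99.999%, which is why the fractional result 6.069 must be rounded up rather than to the nearest integer.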
64
EXAMPLE
A Web site has a three-tier architecture composed of a layer of Web servers, a
layer of application servers, and a layer of database servers. Their respective
reliabilities are 0.99, 0.999, and 0.9999. 60% of the requests only use services
from the Web server layer. The remaining 40% use the application server layer;
16% of these need services from the Web server layer as well, while the remaining
84% need services from both the Web server and
database server layers as well. What is the site availability?
SOLUTION
• For the 60% of the requests that only use the Web servers, the site
availability is 0.99.
• The requests that only need Web and application layer services make up
(1 − 0.6) × (1 − 0.84) = 0.4 × 0.16 = 6.4% of the total; for these, the availability
is 0.99 × 0.999 = 0.98901.
• Finally, the requests that need to use all three layers make up
(1 − 0.6) × 0.84 = 33.6% of the total; for these, the site availability is
0.99 × 0.999 × 0.9999 = 0.9889111.
Therefore, the average site availability is 0.6 × 0.99 + 0.064 × 0.98901 + 0.336 ×
0.9889111 = 0.98957077
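The weighted average above can be sketched as a sum over request classes, each with its share of traffic and the series availability of the layers it touches. The variable names are illustrative.

```python
web, app, db = 0.99, 0.999, 0.9999   # layer reliabilities from the example

# (share of all requests, availability seen by that class of request)
request_classes = [
    (0.60, web),                     # Web layer only
    (0.40 * 0.16, web * app),        # Web + application layers
    (0.40 * 0.84, web * app * db),   # all three layers
]

total_share = sum(share for share, _ in request_classes)
site_availability = sum(share * avail for share, avail in request_classes)

print(f"shares sum to {total_share:.3f}")
print(f"site availability = {site_availability:.8f}")
```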
65
Reliability & MTBF of a
Series System
67
Reliability, MTBF of a
Parallel System
70
Reliability, MTBF of a
Series-Parallel System
[Figure: a series chain of blocks RA, RB, ..., RN and a parallel group of blocks RA, RB, ..., RN]
$$R_s = R_A \times R_B \times \cdots \times R_N$$
$$R_p = 1 - (1 - R_a)(1 - R_b)\cdots(1 - R_n)$$
$$R_{sp} = R_s \times R_p = \left(R_A R_B \cdots R_N\right)\left[1 - (1 - R_a)(1 - R_b)\cdots(1 - R_n)\right]$$
71
EXAMPLES
72
EXAMPLES
A satellite communication system has an in-built microwave repeater unit having an MTTF of
40,000 hours. The link is functional if one channel is working, and the reliability of the
switching unit is 0.95. Calculate the reliability for (i) a one-year operating period and (ii) a two-year
operating period, using (a) a single channel, (b) two parallel channels, and (c) three parallel
channels. Assume sub-system components to be of equal reliability.
Solution
R
R sw
R
73
EXAMPLES
74
75
76
EXAMPLES
Calculate the MTTF of a series-parallel system with two components in parallel and in series
with a single component, all of equal failure rate represented by λ. Also solve for (λ − 1)
failure rate.
Solution
$$R_p = 1 - (1 - R)(1 - R) = 2R - R^2$$
$$R_s = R$$
$$R_T = R\,(2R - R^2) = 2R^2 - R^3$$
$$M_T = \int_0^\infty R_T\,dt = \int_0^\infty \left(2R^2 - R^3\right)dt = \int_0^\infty \left(2e^{-2\lambda t} - e^{-3\lambda t}\right)dt$$
$$M_T = \frac{1}{\lambda} - \frac{1}{3\lambda} = \frac{2}{3\lambda}$$
78
EXAMPLES:RELIABILITY OF m of n
(MAJORITY VOTING) SYSTEM
[Figure: units U1, U2, U3, U4, ..., Un feeding an m-out-of-n Majority Voting Gate (MVG)]
79
Alternatively, the Binomial Distribution may be used. The Binomial Distribution is derived from the
Binomial Expression, from which the terms relating to the question(s) are selected.
For instance, the terms relating to 2-out-of-4, $R_{2-4}$, are $R^4 + 4R^3Q + 6R^2Q^2$, from the full
expression $(R + Q)^4 = R^4 + 4R^3Q + 6R^2Q^2 + 4RQ^3 + Q^4$
$$R_{2-4} = R^4 + 4R^3(1 - R) + 6R^2(1 - R)^2$$
$$= R^4 + 4R^3(1 - R) + 6R^2(1 - 2R + R^2)$$
Furthermore, we can subtract the other terms from the Binomial expression, since
in a system the sum of reliability and unreliability is 1, that is, $(R + Q) = 1$.
Hence:
$$(R + Q)^4 = 1 = R^4 + 4R^3Q + 6R^2Q^2 + 4RQ^3 + Q^4$$
80
However, if the unit reliabilities and unreliabilities are not the same,
that is, $R_1 \ne R_2 \ne R_3$ and likewise $Q_1 \ne Q_2 \ne Q_3$, then for a three-unit system the system
reliability and unreliability satisfy $(R_1 + Q_1)(R_2 + Q_2)(R_3 + Q_3) = 1$
This gives
$$R_1R_2R_3 + R_1R_2Q_3 + R_1R_3Q_2 + R_2R_3Q_1 + R_1Q_2Q_3 + R_2Q_1Q_3 + R_3Q_1Q_2 + Q_1Q_2Q_3 = 1$$
R1 R2 Q3 = Probability of successful operation of the first two and the failure of the third one
R1 R3 Q2 = Probability of successful operation of the first and third and the failure of the second
one
R1 Q2 Q3 = Probability of successful operation of the first one and the failure of the second and
third
...
Q1 Q2 Q3 = Probability of failure of all units
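When the units have different reliabilities, the expansion can be evaluated by enumerating every up/down combination and summing the probabilities of the combinations that meet the requirement. The sketch below does this for a hypothetical three-unit system (the reliabilities 0.90, 0.95 and 0.99 are assumed values, not taken from the slides).

```python
from itertools import product


def k_out_of_n_unequal(reliabilities, k):
    """Probability that at least k of the independent units are working."""
    total = 0.0
    # Each state is a tuple of booleans: True = unit working, False = unit failed.
    for state in product([True, False], repeat=len(reliabilities)):
        if sum(state) >= k:
            probability = 1.0
            for works, r in zip(state, reliabilities):
                probability *= r if works else (1.0 - r)   # R_i or Q_i
            total += probability
    return total


units = [0.90, 0.95, 0.99]   # hypothetical R1, R2, R3

all_states = k_out_of_n_unequal(units, 0)   # every term of the expansion
at_least_two = k_out_of_n_unequal(units, 2)
all_three = k_out_of_n_unequal(units, 3)    # the R1 R2 R3 term only

print(all_states, at_least_two, all_three)
```

Enumeration costs $2^n$ terms, so it is only practical for small systems, but it mirrors the term-by-term reading of the expansion exactly.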
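[Figure: standby redundancy arrangement of units]
$$R_{pa(n)} = R(t) = e^{-\lambda t}\left[1 + \lambda t + \frac{\lambda^2 t^2}{2!} + \cdots + \frac{\lambda^{n-1} t^{n-1}}{(n-1)!}\right]$$
For two units, out of which one is on standby, the expression is reduced to
$$R_{pa(2)} = e^{-\lambda t}\,(1 + \lambda t)$$
For two units with unequal failure rates $\lambda_1$ and $\lambda_2$:
$$R_{pa(2)} = \frac{\lambda_2 R_1 - \lambda_1 R_2}{\lambda_2 - \lambda_1} = \frac{\lambda_2 e^{-\lambda_1 t} - \lambda_1 e^{-\lambda_2 t}}{\lambda_2 - \lambda_1}$$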
EXAMPLES
The reliability of an aircraft engine during flight is 99%. Determine the probability
of a successful flight if the aircraft can complete the flight on at least two of its four
engines.
Solution
If R stands for reliability and Q stands for unreliability, then $(R + Q)^n = 1$.
Since the aircraft has four engines, the combined reliability and unreliability of the four
engines is $(R + Q)^4$.
By Binomial expansion,
$$(R + Q)^4 = R^4 + 4R^3Q + 6R^2Q^2 + 4RQ^3 + Q^4$$
83
EXAMPLES
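From this expansion, the term responsible for all four engines working is $R^4$, that is,
$$\left(\frac{99\%}{100}\right)^4 = 0.99^4 = 0.96$$
For at least three engines working, the terms are $R^4 + 4R^3Q$, that is,
$$0.99^4 + 4(0.99)^3 \times (1 - 0.99)^1 = 0.9606 + 0.0388 = 0.9994$$
For at least two engines working, the terms are
$$R^4 + 4R^3Q + 6R^2Q^2 = 0.99^4 + 4(0.99)^3(1 - 0.99) + 6(0.99)^2(1 - 0.99)^2 = 0.999996$$
For all 4 out of 4, using the complement form:
$$R_{4-4} = 1 - \sum_{i=0}^{4-1} {}^{4}C_i\, R^i (1 - R)^{4-i}$$
$$= 1 - \left[{}^4C_0 R^0 (1-R)^{4-0} + {}^4C_1 R^1 (1-R)^{4-1} + {}^4C_2 R^2 (1-R)^{4-2} + {}^4C_3 R^3 (1-R)^{4-3}\right]$$
$$= 1 - \left[1 \times 1 \times 0.01^4 + 4 \times 0.99 \times 0.01^3 + 6 \times 0.99^2 \times 0.01^2 + 4 \times 0.99^3 \times 0.01^1\right]$$
= 0.96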
85
For 3 out of 4:
$$R_{3-4} = 1 - \sum_{i=0}^{3-1} {}^{n}C_i\, R^i (1 - R)^{n-i}$$
$$= 1 - \left[{}^4C_0 R^0 (1-R)^{4-0} + {}^4C_1 R^1 (1-R)^{4-1} + {}^4C_2 R^2 (1-R)^{4-2}\right]$$
$$= 1 - \left[1 \times 1 \times 0.01^4 + 4 \times 0.99 \times 0.01^3 + 6 \times 0.99^2 \times 0.01^2\right]$$
= 0.9994
$$R_{2-4} = 1 - \sum_{i=0}^{2-1} {}^{n}C_i\, R^i (1 - R)^{n-i}$$
$$= 1 - \left[{}^4C_0 R^0 (1-R)^{4-0} + {}^4C_1 R^1 (1-R)^{4-1}\right]$$
$$= 1 - \left[1 \times 1 \times 0.01^4 + 4 \times 0.99 \times 0.01^3\right]$$
= 0.999996
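The complement form used above generalises to any m-out-of-n system of identical units. A minimal sketch, with a hypothetical function name, using `math.comb` (Python 3.8 or later) for the binomial coefficients:

```python
from math import comb


def k_out_of_n(k: int, n: int, r: float) -> float:
    """R = 1 - sum_{i=0}^{k-1} C(n, i) * r^i * (1 - r)^(n - i)."""
    q = 1.0 - r
    # Probability that fewer than k units are working (system failure).
    failing = sum(comb(n, i) * r**i * q ** (n - i) for i in range(k))
    return 1.0 - failing


engine_reliability = 0.99
for k in (4, 3, 2):
    print(f"{k}-out-of-4: {k_out_of_n(k, 4, engine_reliability):.6f}")
```

Relaxing the requirement from four engines to two raises the probability of a successful flight from about 0.96 to about 0.999996.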
Determine the system reliability if reliabilities and unreliabilities of the components are
[Figure: block diagram with R1 and R2, three R3 units with an m/n voter, R4, and RSW, grouped as RA, RB, RC, RD]
$$R_s = R_A \times R_B \times R_C \times R_D = R_A \times R_B \times R_4 \times R_{SW}$$
$$R_A = 1 - \left[(1 - R_1)(1 - R_2)\right]$$
$$R_A = 1 - (1 - 0.89)(1 - 0.93) = 1 - (0.11)(0.07) = 0.9923$$
Since $R_B$ is an m/n (majority voting) subsystem, where m = 2 and n = 3,
$$R_B = 1 - \sum_{i=0}^{m-1} {}^{n}C_i\, R^i (1 - R)^{n-i} = 1 - \sum_{i=0}^{2-1} {}^{n}C_i\, R^i (1 - R)^{n-i}$$
$$= 1 - \left[{}^3C_0 (R_3)^0 (1 - R_3)^3 + {}^3C_1 (R_3)^1 (1 - R_3)^2\right]$$
$$= 1 - \left[(1 - R_3)^3 + 3R_3(1 - R_3)^2\right]$$
$$R_B = 1 - \left[(1 - 0.95)^3 + 3(0.95)(1 - 0.95)^2\right]$$
$$= 1 - \left[1.25 \times 10^{-4} + 7.125 \times 10^{-3}\right] = 1 - 7.25 \times 10^{-3} = 0.99275$$
Alternative Method
The terms responsible for a two-out-of-three system are $R^3 + 3R^2Q = R^3 + 3R^2(1 - R)$
$$R_C = R_4 = 0.97$$
Multiply the values of $R_A$ to $R_D$.
EXAMPLES
• Three computer systems were connected in a standby
redundancy configuration for the control of a nuclear plant such
that one works at a time until it is predicted to fail below a
certain reliability threshold, at which point another takes over
instantaneously. The failure rate of each system is 0.01
failures per hour. Given a mission time of 50 hours, determine the
overall system reliability using (i) a single computer, (ii) one
standby computer, and (iii) two standby computers.
90
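Solution
$$R_n(t) = e^{-\lambda t} \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!}$$
For a single computer (n = 1):
$$R_1(t) = e^{-0.01 \times 50} = e^{-0.5} \approx 0.6065$$
For one standby computer (n = 2):
$$R_2(t) = e^{-0.01 \times 50}\,(1 + 0.01 \times 50) \approx 0.9098$$
For 3 systems (one working and the others on standby):
$$R_3(t) = e^{-\lambda t}\left(1 + \lambda t + \frac{\lambda^2 t^2}{2!}\right) = e^{-0.01 \times 50}\left(1 + 0.01 \times 50 + \frac{0.01^2 \times 50^2}{2}\right) \approx 0.9856$$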
91
EXAMPLE
A printing company has four presses, one operating and three on standby.
Each press has an identical failure rate where the MTBF is 50 operating
hours. The company has received a rush order requiring 75 hours of
continuous time on a press. If a standby is utilized whenever the online press
fails, determine the probability of there being continuous printing support
while the order is being processed.
92
The time between failures of the four-unit standby system has a gamma distribution with $\lambda = \frac{1}{\text{MTBF}} = \frac{1}{m}$
$$R_4(75) = e^{-\lambda t} \sum_{i=0}^{3} \frac{(\lambda t)^i}{i!} = e^{-\frac{1}{50} \times 75}\left(1 + \frac{3}{2} + \frac{9}{8} + \frac{27}{48}\right) = 0.9344$$
Note that the time at which the nth failure is observed is the sum of n identical and
independent exponentially distributed times. Therefore, the MTTF for the printing press system is 50 hrs ×
4 = 200 hrs.
$$\frac{4}{1 / 50 \text{ hrs}} = 200 \text{ hrs}$$
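The standby formula $R_n(t) = e^{-\lambda t}\sum_{i=0}^{n-1} (\lambda t)^i / i!$ can be sketched as a short function and applied to both standby examples. This assumes identical units and perfect, instantaneous switching, as the examples do; the function name is hypothetical.

```python
import math


def standby_reliability(failure_rate: float, t: float, n_units: int) -> float:
    """Reliability of n identical units, one active and the rest on cold standby."""
    x = failure_rate * t
    return math.exp(-x) * sum(x**i / math.factorial(i) for i in range(n_units))


# Control computers: lambda = 0.01 per hour, mission time 50 hours
for n in (1, 2, 3):
    print(f"{n} computer(s): R = {standby_reliability(0.01, 50, n):.4f}")

# Printing presses: MTBF = 50 hours, 75-hour order, four presses
press = standby_reliability(1 / 50, 75, 4)
print(f"four presses: R = {press:.4f}")
```

Each added standby unit contributes one more term of the series, so the improvement per unit shrinks as n grows.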
93
Computer Maintainability
97
Cyclic Redundancy Check
• A cyclic redundancy check (CRC) is an error-detecting code
commonly used in digital networks and storage devices to detect
accidental changes to raw data. Blocks of data entering these
systems get a short check value attached, based on the remainder
of a polynomial division of their contents; on retrieval the calculation
is repeated, and corrective action can be taken against presumed
data corruption if the check values do not match.
• CRCs are so called because the check (data verification) value is
a redundancy (it expands the message without adding
information) and the algorithm is based on cyclic codes. CRCs
are popular because they are simple to implement in binary
hardware, easy to analyze mathematically, and particularly good
at detecting common errors caused by noise in transmission
channels. Because the check value has a fixed length, the
function that generates it is occasionally used as a hash function.
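The attach-and-recheck cycle described above can be illustrated with a small bitwise CRC. The sketch below assumes an 8-bit CRC with generator polynomial $x^8 + x^2 + x + 1$ (0x07), zero initial value and no final XOR; these parameters and the message are chosen for illustration only, and real protocols specify their own.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8: remainder of the message polynomial divided by the generator."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                # Top bit set: shift out and subtract (XOR) the generator.
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


message = b"reliability"                  # hypothetical payload
check_value = crc8(message)
frame = message + bytes([check_value])    # data block with check value attached

# On retrieval the calculation is repeated over the whole frame;
# with these parameters an intact frame leaves a zero remainder.
intact_ok = crc8(frame) == 0

# Flip one bit to simulate corruption in the channel.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
corrupted_ok = crc8(corrupted) == 0

print(f"check value = {check_value:#04x}, intact: {intact_ok}, corrupted: {corrupted_ok}")
```

The intact frame passes and the corrupted frame is rejected; a single flipped bit is always detected by a generator polynomial with more than one term.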
98
Maintenance Types
• Generally speaking, there are three types of maintenance in
use:
• Preventive maintenance, where equipment is maintained before
breakdown occurs. This type of maintenance has many
different variations and is the subject of ongoing research to
determine the best and most efficient way to maintain equipment.
Recent studies have shown that preventive maintenance is
effective in preventing age-related failures of the equipment.
For random failure patterns, which amount to 80% of the failure
patterns, condition monitoring proves to be effective.
• Corrective maintenance, where equipment is maintained after
breakdown. This maintenance is often the most expensive
because worn equipment can damage other parts and cause
multiple damages.
• Operational maintenance, where equipment is maintained while in
use.
Maintenance Types
• Preventive / Predictive maintenance
• Preventive maintenance is maintenance performed in an
attempt to avoid failures, unnecessary production loss and
safety violations.
• The effectiveness of preventive maintenance depends on the
RCM analysis on which it was based, and the ground rules used
for cost-effectiveness.
• Preventive maintenance (PM) has the following meanings:
• The care and servicing by personnel for the purpose of
maintaining equipment and facilities in satisfactory operating
condition by providing for systematic inspection, detection, and
correction of incipient failures either before they occur or before
they develop into major defects.
• Maintenance, including tests, measurements, adjustments, and
parts replacement, performed specifically to prevent faults from
occurring.
100
• The primary goal of maintenance is to avoid or mitigate the consequences of
failure of equipment. This may be by preventing the failure before it actually
occurs which Planned Maintenance and Condition Based Maintenance help to
achieve. It is designed to preserve and restore equipment reliability by replacing
worn components before they actually fail. Preventive maintenance activities
include partial or complete overhauls at specified periods, oil changes,
lubrication and so on. In addition, workers can record equipment deterioration so
they know to replace or repair worn parts before they cause system failure. The
ideal preventive maintenance program would prevent all equipment failure
before it occurs.
• There is a controversy of sorts regarding the propriety of the usage
“preventative.”
Subgroups
• Preventive maintenance can be described as maintenance of equipment or
systems before fault occurs. It can be divided into two subgroups:
• planned maintenance and
• condition-based maintenance.
The main difference between the subgroups is how the maintenance time, that is, the
moment when maintenance should be performed, is determined.
101
While preventive maintenance is generally considered to be worthwhile, there are risks such
as equipment failure or human error involved when performing preventive maintenance, just
as in any maintenance operation. Preventive maintenance as scheduled overhaul or scheduled
replacement provides two of the three proactive failure management policies available to the
maintenance engineer. Common methods of determining what preventive (or other) failure
management policies should be applied are: OEM recommendations, requirements of codes
and legislation within a jurisdiction, what an “expert” thinks ought to be done, the
maintenance that’s already done to similar equipment, and, most importantly, measured values
and performance indications. In a nutshell:
•Preventive maintenance is conducted to keep equipment working and/or extend the life of
the equipment.
104
Reliability centered maintenance
Reliability centered maintenance is an engineering framework that enables the definition of
a complete maintenance regime. It regards maintenance as the means to maintain the
functions a user may require of machinery in a defined operating context. As a discipline it
enables machinery stakeholders to monitor, assess, predict and generally understand the
working of their physical assets. This is embodied in the initial part of the RCM process
which is to identify the operating context of the machinery, and write a Failure Mode Effects
and Criticality Analysis (FMECA). The second part of the analysis is to apply the “RCM
logic”, which helps determine the appropriate maintenance tasks for the identified failure
modes in the FMECA. Once the logic is complete for all elements in the FMECA, the
resulting list of maintenance is “packaged”, so that the periodicities of the tasks are
rationalised to be called up in work packages; it is important not to destroy applicability of
maintenance in this phase. Lastly, RCM is kept live throughout the “in-service” life of
machinery, where the effectiveness of the maintenance is kept under constant review and
adjusted in light of the experience gained.
105
Difference Between Preventive and
Predictive Maintenance
Predictive maintenance tends to include direct measurement of the item, for example, an infrared picture of
a circuit board to determine hot spots, while Preventive Maintenance includes the evaluation of particles
in suspension in a lubricant, and sound and vibration analysis of a machine.
•Examples
•An individual bought an incandescent light bulb. The manufacturing company mentioned that the life
span of the bulb is 3 years. Just before the 3 years, the individual decided to replace the bulb with a new
one. This is called preventive maintenance.
•On the other hand, the individual has the opportunity to observe the bulb operation daily. After two
years, the bulb starts flickering. The individual predicts at that time that the bulb is going to fail very
soon and decides to change it for a new one. This is called predictive maintenance.
•The individual ignores the flickering bulb and only goes out to buy another replacement light bulb
when the current one fails. This is called corrective maintenance.
•Applications are used to help complete these tasks such as Flame Task
106
Maintenance, repair, and operations
Maintenance, repair and operations (MRO) or maintenance, repair, and overhaul involves fixing any sort of
mechanical, plumbing or electrical device should it become out of order or broken (known as repair,
unscheduled, or casualty maintenance). It also includes performing routine actions which keep the device in
working order (known as scheduled maintenance) or prevent trouble from arising (preventive maintenance).
MRO may be defined as, “All actions which have the objective of retaining or restoring an item in or to a
state in which it can perform its required function. The actions include the combination of all technical and
corresponding administrative, managerial, and supervision actions.”
•MRO operations can be categorised by whether the product remains the property of the customer, i.e. a
service is being offered, or whether the product is bought by the reprocessing organisation and sold to any
customer wishing to make the purchase (Guadette, 2002). In the former case it may be a backshop operation
within a larger organization or smaller operation.
•The former of these represents a closed loop supply chain and usually has the scope of maintenance, repair
or overhaul of the product. The latter of the categorisations is an open loop supply chain and is typified by
refurbishment and remanufacture. The main characteristic of the closed loop system is that the demand for a
product is matched with the supply of a used product. Neglecting asset write-offs and exceptional activities
the total population of the product between the customer and the service provider remains constant.
107
Reliability and Maintainability of Public
Utilities in Nigeria
Seminar Title – 2015/2016 and 2016/2017 Session
• Electricity
• Water supply
• Road Network
• Rail Network
• Aviation/Air Services
• Water Transportation Services
• Television Services
• Radio Services
• Telephone Services
• Internet/IT Services
• Hospital Infrastructure Services
• School Infrastructure Services
108