CPE 515 - Reliability and Maintainability

Uploaded by

Samuel jidayi
Copyright
© © All Rights Reserved

COMPUTER RELIABILITY

AND MAINTENANCE
(CPE 506)
Engr. Prof. Abdulfattah A. Aboaba
Associate Prof.
Computer Engineering Department

UNIVERSITY OF
MAIDUGURI
1
Learning Objectives

• Define reliability and availability in the context of computer systems.
• Define maintainability in the context of computer systems.
• Provide a quantitative approach to understand and compute reliability and availability metrics.
2
RELIABILITY

3
A Motivating Availability
Example
Consider a simple example of an online
brokerage which is in the process of
designing its site and selecting the
components that will be used in its design.

The main consideration here is the site availability, which has to be at least 99.99% (“four 9’s”) according to a management decision.

4
The site is used by users to get quotes on
stocks and mutual funds, manage portfolios,
conduct risk analysis, and to place orders to
trade stocks and mutual funds.
In the securities trading business, Web service availability is a key QoS metric.
If customers are denied access to the trading services, they may incur financial losses and the trading company may be liable for these losses.

5
Trading site architecture
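[Figure: Trading site architecture. Requests from the Internet pass through routers (R) to a load balancer, which distributes them to the Web servers; the Web servers are backed by the database servers.]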

6
– The trading site architecture is composed of a load balancer that distributes the incoming requests to one of n_WS Web servers.
– The servers are all implemented using the same type of hardware and software.
– At the back-end, n_DS database servers are used to store all the persistent data needed to support customer trading transactions.
– The database is fully replicated at each of the n_DS database servers to increase availability and distribute the load.

7
– The company is considering two types of
boxes: highly reliable, expensive, high-end
servers with hot-swappable CPU boards
and disks, as well as less expensive, less
reliable, low-end servers.
– Management wants to answer the following
questions:
• What is the least expensive configuration that
meets the 99.99% availability requirement?
• All low-end servers, all high-end servers, or a
mix of low-end and high-end servers?

8
RELATIONSHIP BETWEEN
FAILURE RATE AND MTBF
• Failure Rate - The frequency with which a system fails, expressed in failures per hour.
It is denoted by λ.
It varies over the life of the system.
When it is constant, it is the reciprocal of the MTBF, that is, MTBF = 1/λ.
• Where the failure rate is assumed to be constant, the MTBF may be reported instead. This is valid only in the flat region of the BATHTUB curve, that is, the useful-life region.
9
DEFINITIONS OF MTBF
• MTBF - Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a REPAIRABLE system/device during operation, or the arithmetic mean (average) time between failures of a REPAIRABLE system/subsystem/device.
• MTBF is the sum of the lengths of the operational periods divided by the number of observed failures.
• MTBF – The average operating time expected between failures in a population of identical components.

10
DEFINITIONS OF MTTF
• MTTF – Mean time to failure (MTTF) is the predicted elapsed time to the failure of a NON-REPAIRABLE system/device during operation, or the arithmetic mean (average) time to failure of a NON-REPAIRABLE system/subsystem/device.
• MTTF is the length of time a device is expected to last in operation before it fails.
• MTTF – The average operating time expected before failure of a component which is not repaired or replaced. Simply put, it is the average time to failure of n units: the sum of the n individual unit times to failure divided by n.
11
DEFINITIONS OF MTTR
• MTTR - Mean time to repair (MTTR) is a basic measure of the maintainability of repairable items. It represents the average time required to repair a failed component or device. It is the total corrective maintenance time divided by the total number of corrective maintenance actions during a given period of time. In fault-tolerant design, MTTR is usually considered to also include the time the fault is latent, that is, the time from when the failure occurs until it is detected. Another meaning of MTTR is Mean Time to Recovery.
• MTTR does not include lead time for parts not readily available, or administrative or logistic downtime.
12
RELATIONSHIP BETWEEN MTBF,
MTTF, AND MTTR

13
Relationship between MTTF,
MTTR, and MTBF

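[Figure: timeline of alternating up and down periods. Each up period lasts MTTF on average and each down period lasts MTTR on average. The MTBF spans the interval from the n-th failure to the (n+1)-th failure, that is, one down period plus the following up period, so MTBF = MTTR + MTTF.]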

14
VARIATION of MTBF

• Mean Time Between System Abort (MTBSA)
• Mean Time Between Critical Failure (MTBCF)
• Mean Time Between Unscheduled Removal (MTBUR)
• Mean Time Between Failure that is Dangerous (MTBFD)
15
FAILURE RATE
• A failure is declared when a system does not meet its desired objectives, or cannot meet the required performance or availability.
• Additive Failure Rate – This refers to the failure rate of a complex system that comprises many units or components. It is the sum of the individual failure rates of its components or subsystems, which gives the “total system failure rate”.
16
THE BATHTUB CURVE

17
REGIONS OF THE BATHTUB
CURVE
Explanation of the three regions and the observed failure rate curve (the three curves mapped together):
• Region 1 – Decreasing failure rate or early
failure
• Region 2 – Constant failure rate – Random
failure
• Region 3 – Increasing failure rate – Wear out
failure

18
Failure Density Function (FDF)

19
Reasons of System Failure
• To categorize different types of failure,
three dimensions are considered:
– Duration, Effect, and Scope

20
• Duration of the failures :
– Permanent failures:
• A system stops working and there is no possibility of
repairing or replacing it. (e.g., unmanned space ship)
– Recoverable failures:
• The system is placed back in operation after a fault is
recovered. (e.g., Web site inaccessibility due to its
connection to the internet being down)
– Transient failures:
• Characterized by having a very short duration and may
not require major recovery actions. (e.g., problems that
can be solved by resetting network routers or rebooting
servers.)

21
• Effect of the failures:
– Functional failures:
• The system does not operate according to its functional
specifications. (e.g., an online bookstore failing to display
information about a book even though it is in the catalog)
– Performance failures:
• Even though the system may be executing the requested
functions correctly, they are not executed in a timely
fashion. (e.g., A search engine that presents very
accurate results to requests for search but takes more
than a minute on average to process each request)

22
• Scope of the failure:
– Partial failures:
• Some of the services provided by the computer system become unavailable, while others can still be used.
(e.g., The services that allow customers to bid in an
online auction site may become unavailable due to the
failure of the servers that process these types of
requests, while customers may still be able to see
existing bids)
– Total failures:
• Characterized by a complete disruption of all services
offered by the computer system. (e.g., power outages
could cause a Web site to go down completely)

23
MATHEMATICAL
DERIVATIONS – R(t)

24
MATHEMATICAL DERIVATIONS –
MTBF, MTTF, AND FAILURE RATE
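Relationship Between MTBF, MTTF & Failure Rate

From the general expression

\[ m = \int_0^{\infty} R(t)\,dt \]

Substituting \(e^{-\lambda t}\) in place of R(t)

\[ m = \int_0^{\infty} e^{-\lambda t}\,dt = \left[-\frac{1}{\lambda}e^{-\lambda t}\right]_0^{\infty} = -\frac{1}{\lambda}\left(e^{-\infty} - e^{-0}\right) = -\frac{1}{\lambda}(0 - 1) = \frac{1}{\lambda} \]

\[ \therefore \text{MTBF}\ (m) = \frac{1}{\lambda} \quad \text{for repairable items} \]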

25
MATHEMATICAL DERIVATIONS –
MTBF, MTTF, AND FAILURE RATE

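And for non-repairable items

\[ \text{MTTF}\ (m) = \frac{1}{\lambda} \]

\[ \therefore \frac{1}{\text{MTBF}} = \frac{1}{m} = \lambda, \quad \text{hence } \lambda = \frac{1}{m} \]

Substitute this into the R(t) equation for a constant failure rate λ

\[ R(t) = e^{-\lambda t} = e^{-\frac{1}{m}t} = e^{-t/m} \]

Therefore, after one MTBF (that is, t = m), the probability of survival, or reliability, is R(m) = e^{-1} ≈ 0.37.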
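The exponential reliability law above can be checked numerically. The following is a minimal sketch, assuming a constant failure rate; the function name `reliability` and the MTBF value of 1000 hours are illustrative choices, not values from the slides.

```python
import math

def reliability(t, mtbf):
    """R(t) = exp(-t / m) for a constant failure rate lambda = 1 / m."""
    return math.exp(-t / mtbf)

m = 1000.0  # hypothetical MTBF in hours, chosen only for illustration
for t in (0.1 * m, 0.5 * m, m, 2 * m):
    # At t = m the value is exp(-1), roughly 0.37
    print(f"t = {t:7.1f} h  R(t) = {reliability(t, m):.4f}")
```

Whatever the MTBF, the reliability at t = m is always e^{-1}, so only about 37% of units are expected to survive one MTBF.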
26
MATHEMATICAL DERIVATIONS –
MTBF, MTTF, AND FAILURE RATE

Reliability of an item after time t, t ≥ 1

Recall \(R(t) = e^{-\lambda t}\), where t is the mission time or test duration.

For t = 1, \(R(t) = e^{-\lambda t}\) becomes \(R(1) = e^{-\lambda}\).

If a time t > 1 is represented by T, then

\[ R(T) = e^{-\lambda T} \]

since \(e^{-\lambda t} = e^{-\lambda T}\) for t = T > 1.

27
MATHEMATICAL DERIVATIONS –
MTBF, MTTF, AND FAILURE RATE

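And since \(e^{-\lambda T}\) can be expressed as \(e^{-\lambda \times 1 \times T} = R(1 \times T)\),

\[ \therefore R(1 \times T) = e^{-\lambda \times 1 \times T} = \left(e^{-\lambda \times 1}\right)^{T} = \left(e^{-\lambda}\right)^{T} \]

\[ \text{Hence } R(1 \times T) = e^{-\lambda \times 1 \times T} = [R(1)]^{T} = e^{-\lambda T} \]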
28
MATHEMATICAL DERIVATIONS – Rn(t)
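Reliability of Multiple Items

Since the reliability of an item is given as

\[ R(t) = e^{-\lambda t} \equiv R_i(t) = e^{-i\lambda t}, \quad \text{where } i = 1 \text{ stands for one item} \]

the reliability of a number of identical components in series, represented by n, will be

\[ R_n(t) = e^{-n\lambda t} \]

Hence, if the mission time T is greater than 1,

\[ R_n(T) = e^{-n\lambda T} = \left(e^{-\lambda}\right)^{nT} \]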

29
DETERMINATION OF FAILURE RATE
Example: A maintenance engineer decided to determine the failure rate of BSC equipment having 100 identical units. Each unit was tested for 1000 hrs, of which 60 completed the test without failure, 20 failed after 800 hrs, and the remaining 20 failed after 500 hrs. Estimate the failure rate.

Solution

Total failures = 40

Total operating hours = 60 × 1000 + 20 × 800 + 20 × 500
= 60,000 + 16,000 + 10,000
= 86,000

\[ \text{Estimated failure rate} = \frac{\text{Total failures}}{\text{Total operating hours}} = \frac{40}{86{,}000} \approx 465 \times 10^{-6}\ \text{failures/hour} \]

Meaning 465 failures for every million hours of operation.
30
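The same estimate can be reproduced in a few lines. This is a sketch only: the tuple layout `(units, hours_run, failed)` and the function name `failure_rate` are assumptions made for illustration, and the 1000-hour test duration is the one used in the solution's arithmetic.

```python
def failure_rate(groups):
    """Estimate lambda = total failures / total operating hours.

    groups: list of (units, hours_run, failed) tuples, where `failed`
    says whether the units in that group failed at `hours_run`.
    """
    total_hours = sum(units * hours for units, hours, _ in groups)
    total_failures = sum(units for units, _, failed in groups if failed)
    return total_failures / total_hours

test_data = [
    (60, 1000, False),  # survived the full test
    (20, 800, True),    # failed after 800 hrs
    (20, 500, True),    # failed after 500 hrs
]
lam = failure_rate(test_data)
print(f"{lam * 1e6:.0f} failures per million hours")
```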


DETERMINATION OF RELIABILITY AND
UNRELIABILITY
Example: Given a system of 10 identical components in series with system reliability R = 0.999 for unit mission time (T = 1), so that R = R(1) = 0.999, determine each component's reliability, unreliability, and failure rate.

Solution

From

\[ R(T) = e^{-\lambda_s T} = \left(e^{-\lambda_s}\right)^{T} = [R(1)]^{T} \]

For unit mission time T = 1

\[ \therefore R(T) = R(1) = e^{-\lambda_s} \]

Hence, for 10 identical components

\[ R(1) = e^{-\lambda_s} = e^{-10\lambda} = \left(e^{-\lambda}\right)^{10} = [R_i(1)]^{10} = 0.999 \]

\[ \therefore R_i(1) = 0.999^{1/10} \]

(s = series, i = identical)
31


DETERMINATION OF FAILURE RATE
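Component unreliability: \(U_i(1) = 1 - R_i(1) = 1 - 0.999^{1/10}\)

Component failure rate λ

For a single component, \(R_i(T) = e^{-\lambda T}\), so

\[ \ln[R_i(T)] = \ln\left[e^{-\lambda T}\right] \]

\[ -\lambda T = \ln R_i(T) \]

\[ \therefore \lambda = -\frac{\ln R_i(T)}{T} = -\frac{\ln\left(0.999^{1/10}\right)}{1} \]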
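To put numbers on these expressions, here is a short sketch that evaluates the component reliability, unreliability, and failure rate for the 10-component series example. The helper name `component_from_series` is hypothetical, and the code assumes identical, independent components with a constant failure rate.

```python
import math

def component_from_series(system_reliability, n, mission_time=1.0):
    """Back out per-component figures from a series system of n identical parts."""
    r_i = system_reliability ** (1.0 / n)     # R_i = R^(1/n)
    u_i = 1.0 - r_i                           # U_i = 1 - R_i
    lam = -math.log(r_i) / mission_time       # lambda = -ln(R_i) / T
    return r_i, u_i, lam

r_i, u_i, lam = component_from_series(0.999, 10)
print(f"R_i = {r_i:.6f}, U_i = {u_i:.3e}, lambda = {lam:.3e} per unit time")
```

Because the component reliability is so close to 1, the unreliability and the failure rate come out nearly equal, both close to 1 × 10^-4 per unit mission time.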

32
EXAMPLES
Time t (×100 hrs)   0   2   4   8   10   14   15
Failure             0   1   1   1   1    1    1
Occurrence (n)      0   1   2   3   4    5    6

Continuous tests were conducted on an electronic component, and faults, which were repaired, occurred as shown in the table. Calculate the mean time between failures (MTBF).
Solution:

\[ \text{MTBF}\ (m) = \frac{t_n - t_0}{n} = \frac{15 - 0}{6} \times 10^{2}\ \text{hrs} = 250\ \text{hrs} \]

If the repair time (in hours) at each failure is recorded as follows

Interval      t0→t1   t1→t2   t2→t3   t3→t4   t4→t5   t5→t6
Repair time   0       0.5     0.75    0.75    1       1.25

Calculate the mean time to repair of that component.

\[ \text{MTTR} = \frac{0.5 + 0.75 + 0.75 + 1 + 1.25}{5} = \frac{4.25}{5} = 0.85\ \text{hrs} \]
33
EXAMPLES
Example: Fifty (50) hard disk drives were tested for a period of 200 hrs. Thirty of them survived without failure, 5 failed at 170 hrs, another 5 at 150 hrs, and 10 at 100 hrs.
Determine

a) Total test hours before failure for the failed components
b) Total test hours without failure
c) Total survival hours
d) MTTF
e) MTBF

Solution

a) Total test hours before failure for the failed components is

\[ \sum_{i=1}^{n} f_i t_i \]

where n is the number of failure occurrences, f_i is the number of failures at the i-th occurrence, and t_i is the time (in hours) of that occurrence. Note that there are three failure occurrences, that is, at 170 hrs, at 150 hrs, and at 100 hrs.

\[ = 5 \times 170 + 5 \times 150 + 10 \times 100 = 2{,}600\ \text{hrs} \]
34
34
EXAMPLES
b) Total test hours without failure is ST,
where S is the number of items that survived the duration of the test, and T is the duration of the test (mission time). This refers to those that made it to the end.
= 30 × 200 = 6,000 hrs
c) Total survival hours is

\[ \sum_{i=1}^{n} f_i t_i + ST = 2{,}600 + 6{,}000 = 8{,}600\ \text{hrs} \]

d)
\[ \text{MTTF} = \frac{\sum_{i=1}^{n} f_i t_i}{\text{Total failed items}} = \frac{10 \times 100 + 5 \times 170 + 5 \times 150}{10 + 5 + 5} = \frac{2{,}600}{20} = 130\ \text{hrs/failure} \]

e)
\[ \text{MTBF} = \frac{\text{Total survival hours}}{\text{Total number failed}} = \frac{8{,}600}{20} = 430\ \text{hrs/failure} \]
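The five quantities in this example follow mechanically from the raw test data, so they are easy to recompute. The sketch below uses the definitions given in the solution; the function name `life_test_summary` and the dictionary keys are illustrative assumptions.

```python
def life_test_summary(failures, survivors, test_hours):
    """Summarise a fixed-duration life test.

    failures: list of (count, hours_at_failure) pairs.
    survivors: number of units still working at the end of the test.
    """
    failed_hours = sum(count * hours for count, hours in failures)
    survivor_hours = survivors * test_hours
    n_failed = sum(count for count, _ in failures)
    total_hours = failed_hours + survivor_hours
    return {
        "failed_hours": failed_hours,       # sum of f_i * t_i
        "survivor_hours": survivor_hours,   # S * T
        "total_hours": total_hours,
        "mttf": failed_hours / n_failed,
        "mtbf": total_hours / n_failed,
    }

summary = life_test_summary([(5, 170), (5, 150), (10, 100)],
                            survivors=30, test_hours=200)
for name, value in summary.items():
    print(f"{name}: {value}")
```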

35
EXAMPLES
Example: Five identical RAM cards were tested until failure, as shown in the table below. Determine the MTTF (m).

Failure           0     1     1     2     1
Time (1000 hrs)   0     10    22    45    50
                  t0    t1    t2    t3    t4

Solution

\[ \text{MTTF} = \frac{\sum_{i=1}^{n} f_i t_i}{\text{Number of components}} = \frac{1 \times 10{,}000 + 1 \times 22{,}000 + 2 \times 45{,}000 + 1 \times 50{,}000}{5} = \frac{172{,}000}{5} \]

MTTF = 34,400 hrs
36
EXAMPLES
Example: One thousand similar components, each with a constant failure rate of 5% per 10^3 hours, are put on test together. Calculate the time that elapses before the failure of the following numbers of components: (i) 100 and (ii) 500.

Solution

\[ \text{Reliability } R = e^{-\lambda t} = 1 - \frac{N_f}{N_t} \]

where N_f and N_t are the number of failed items and the total number of items, respectively.

\[ \lambda = \% \times \frac{1}{R_d} \]

where % = 5% = 5/100 and R_d is the rating duration, which is 1000 hours in this case.

\[ \lambda = \frac{5}{100} \times \frac{1}{1000} = 5 \times 10^{-5} = 0.00005\ \text{per hour} \]

Put another way,

\[ \lambda = \frac{\text{percentage}}{\text{rating duration}} = \frac{0.05}{1000} = 0.00005 \]

37
EXAMPLES
If t = time required before failure of component(s), then

t_1 = time required before failure of 100 components, and
t_2 = time required before failure of 500 components

N_t = 1000, N_f1 = 100, and N_f2 = 500

(a)
\[ e^{-5 \times 10^{-5}\, t_1} = 1 - \frac{100}{1000} \]
\[ -5 \times 10^{-5}\, t_1 = \ln 0.90 \]
\[ t_1 = \frac{\ln 0.90}{-5 \times 10^{-5}} \approx 2107\ \text{hours} \]

(b)
\[ e^{-5 \times 10^{-5}\, t_2} = 1 - \frac{500}{1000} \]
\[ -5 \times 10^{-5}\, t_2 = \ln 0.5 \]
\[ t_2 \approx 13{,}863\ \text{hours} \]
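Solving the exponential law for t gives a one-line formula, t = -ln(1 - N_f/N_t)/λ. The sketch below applies it to both parts of the example; the function name `time_until_failures` is an illustrative choice.

```python
import math

def time_until_failures(n_failed, n_total, lam):
    """Solve exp(-lam * t) = 1 - n_failed / n_total for t."""
    surviving_fraction = 1.0 - n_failed / n_total
    return -math.log(surviving_fraction) / lam

lam = 0.05 / 1000  # 5% per 1000 hours, i.e. 5e-5 per hour
t1 = time_until_failures(100, 1000, lam)
t2 = time_until_failures(500, 1000, lam)
print(f"t1 = {t1:.0f} hours, t2 = {t2:.0f} hours")
```

The second result is the half-life of the population, ln 2 / λ, since half of the components have failed by then.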

38
EXAMPLES
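Example: In a test involving 2030 components lasting for a mission time of 500 hours, 32 components failed. Calculate the failure rate.

Solution

\[ \lambda = \frac{1}{N_s} \times \frac{N_f}{M_t} \]

in which N_s, N_f, and M_t (or T) are the number of surviving components, the number of failed components, and the mission time, respectively.

\[ \text{Therefore, } \lambda = \frac{1}{1998} \times \frac{32}{500} = 3.2 \times 10^{-5}\ \text{failures/hour} \]

And in terms of percentage per hour

\[ \lambda = \frac{N_f}{N_s} \times \frac{1}{T} \times 100 = \frac{32}{1998} \times \frac{1}{500} \times 100 = 3.20 \times 10^{-3}\ \text{percent/hour} \]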

39
EXAMPLES
Example: Ten computer devices were tested for 500 hours, during which 8 survived the test. Determine (i) the MTBF, (ii) the failure rate, (iii) the reliability, and (iv) the unreliability.

Solution

\[ \text{MTBF} = \frac{N_t}{N_f} \times T = \frac{10}{2} \times 500 = 2{,}500\ \text{hrs/failure} \]
\[ \text{Failure rate } (\lambda) = \frac{1}{\text{MTBF}} = \frac{1}{2500} = 0.0004\ \text{failures/hour} \]
\[ \text{Reliability } (R) = e^{-t/m} = e^{-500/2500} = e^{-0.2} \approx 0.82 \]
\[ \text{Unreliability } U = 1 - R \approx 1 - 0.82 = 0.18 \]

40
MATHEMATICAL DERIVATIONS – Binomial
Distribution

41
Availability Basics
• Availability:
– The fraction of time that a component (or system)
is operational.
• Consider the notion that a component (or
system) alternates through periods in which it
is operational – the up periods – and periods
in which it is down – the down periods.
• Mean Time To Failure (MTTF)
– The average time it takes for a system to fail.

42
Availability Summary
Computer systems tend to be labeled by the number of “9”s in their availability. For example, a “five-9’s” system has an availability of 99.999%. The classification of computer systems according to their availability is shown below:

Availability Class   Availability   Unavailability (min/year)   System Type
1                    90.0%          52,560                      Unmanaged
2                    99.0%          5,256                       Managed
3                    99.9%          526                         Well-Managed
4                    99.99%         52.6                        Fault-Tolerant
5                    99.999%        5.3                         Highly-Available
6                    99.9999%       0.53                        Very Highly-Available
7                    99.99999%      0.053                       Ultra Available
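The unavailability column is simply the unavailable fraction multiplied by the number of minutes in a year. A minimal sketch, assuming a 365-day year (the constant and function names are illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(availability_percent):
    """Expected yearly downtime for a given availability percentage."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR

for nines, availability in enumerate(
        [90.0, 99.0, 99.9, 99.99, 99.999, 99.9999, 99.99999], start=1):
    minutes = downtime_minutes_per_year(availability)
    print(f"class {nines}: {availability}% -> {minutes:.3f} min/year")
```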

43
Expression for the availability
of a system
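• The following state transition diagram can be used to show that the system can be in one of two states: up and down

[Figure: two-state transition diagram with states Up and Down; the transition from Up to Down has rate λ and the transition from Down to Up has rate μ.]

• The system fails, i.e., goes from up to down, with a rate λ
• It gets repaired, i.e., goes from down to up, with a rate μ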

44
MATHEMATICAL DERIVATION -
Availability
• Writing these rates in terms of the MTTF and MTTR, we get

\[ \lambda = \frac{1}{\text{MTTF}} \quad \text{and} \quad \mu = \frac{1}{\text{MTTR}} \]

• Using the flow-in-flow-out principle, we can write

\[ \lambda \times p_{up} = \mu \times p_{down} \]

• Here p_up and p_down are the probabilities that the system is up and down, respectively.

• Thus,

\[ p_{down} = \frac{\lambda}{\mu}\, p_{up} \]

45
MATHEMATICAL DERIVATION -
Availability
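• The availability A of a system is simply p_up
• We also know that p_up + p_down = 1
• Therefore,

\[ p_{up} + \frac{\lambda}{\mu}\, p_{up} = 1 \]
\[ \text{or, } p_{up}\left(1 + \frac{\lambda}{\mu}\right) = 1 \]
\[ \text{or, } p_{up}\left(\frac{\mu + \lambda}{\mu}\right) = 1 \]
\[ \text{or, } p_{up} = \frac{\mu}{\lambda + \mu} \]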

46
MATHEMATICAL DERIVATION - Availability

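• Therefore,

\[ A = p_{up} = \frac{\mu}{\lambda + \mu} = \frac{\frac{1}{\text{MTTR}}}{\frac{1}{\text{MTTR}} + \frac{1}{\text{MTTF}}} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \]

• And, the system unavailability is simply

\[ U = p_{down} = \frac{\lambda}{\lambda + \mu} = \frac{\frac{1}{\text{MTTF}}}{\frac{1}{\text{MTTR}} + \frac{1}{\text{MTTF}}} = \frac{\text{MTTR}}{\text{MTTF} + \text{MTTR}} = \frac{\text{MTTR}}{\text{MTBF}} \]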

47
MATHEMATICAL DERIVATION - Availability

• In most systems of interest, it takes significantly longer for the system to fail than to be repaired:
MTTF >> MTTR

Thus, the unavailability can be approximated as

U ≈ MTTR/MTTF
Consider a Web site composed of two Web servers, one
application server, and one database server. Suppose that
historical data shows that the application server machine is
rebooted every twenty days on average. Assuming that the
system administrator takes 10 minutes to reboot the machine,
what is the application server availability?

48
Availability Example
Here the MTTF is 20 days (20 × 24 × 60 = 28,800 minutes) and the MTTR is 10 minutes.
Therefore the availability is given by
A = MTTF/(MTTF + MTTR) = 28,800/(28,800 + 10) = 99.965%

If the system administrator were able to cut the reboot time by 20%, to 8 minutes, the availability would be A = 28,800/(28,800 + 10 × 0.8) = 99.972%

To achieve the same availability (99.972%) with the original MTTR of 10 minutes, the MTTF would have to be increased to about 35,704 minutes, i.e., a 24% increase.

This indicates the importance of reducing the time to recovery to improve the availability of a system.
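The comparison between shortening repairs and lengthening time to failure can be explored with two small functions. This is a sketch under the two-state model A = MTTF/(MTTF + MTTR); the names `availability` and `required_mttf` are illustrative, and the 8-minute repair time assumes the reboot time is reduced by 20%.

```python
def availability(mttf, mttr):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def required_mttf(target_availability, mttr):
    """Invert A = MTTF / (MTTF + MTTR) to get the MTTF needed for a target A."""
    return target_availability * mttr / (1.0 - target_availability)

mttf = 20 * 24 * 60                 # 20 days in minutes
a_now = availability(mttf, 10)      # 10-minute reboot
a_fast = availability(mttf, 8)      # reboot time cut by 20%
print(f"A (10 min) = {a_now:.5%}, A (8 min) = {a_fast:.5%}")
print(f"MTTF for 99.972% at 10 min MTTR: {required_mttf(0.99972, 10):.0f} minutes")
```

The 35,704-minute figure comes from the availability rounded to 99.972%. Using the unrounded availability for an 8-minute repair, `required_mttf(a_fast, 10)` gives 36,000 minutes, a 25% increase, so the conclusion is the same either way.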

49
The Reliability of Systems of
Components
• Q. What is the reliability of the system as a function of the
reliability of the components used to build the system?

• We’ll consider two cases:
– Components connected in series
– Components connected in parallel

• An example of a serial system is a Web site that has a Web server connected to an application server, which is then connected to a database server, each on its own dedicated machine.

50
REDUNDANCY

• Redundancy - The duplication of critical components or functions of a system with the intention of increasing the reliability of the system, in the form of a backup or fail-safe.
• DMR - Dual Modular Redundancy
• TMR - Triple Modular Redundancy, for safeguarding critical systems
• Redundancy is also associated with Majority Voting or Voting Logic. Thus all the redundant components/subsystems would have to fail, one after the other, before the system fails.

51
Forms and Functions of Redundancy
• FORMS
• Hardware Redundancy – DMR & TMR
• Information – Error Detection & Correction Methods
• Time – Including transient fault detection methods such as Alternate Logic
• Software – Such as N-version programming
• FUNCTIONS
• Passive Redundancy – Use of excess capacity to prevent failure (e.g., extra strength in bridge cabling)
• Active Redundancy – Maintaining performance by monitoring and the use of voting logic when a failure is perceived

52
53
Redundancy Techniques

• Active – All units work simultaneously, but one is enough to make the system function. This is akin to a parallel system in electric circuitry. Examples include security systems, life support systems, diversity reception in long-distance radio transmission, and over-reinforcement in concrete structures.
• Passive (Standby) – The units alternate. One is energized at a time and, in case it fails or exhibits reduced reliability, an alternate one takes over. Examples: power supply to critical installations such as operating theaters, security monitoring systems, telecommunication facilities, etc.
54
Redundancy Techniques

55
Reliability of Redundant System and
Linear Regression Model
𝑛−1

𝑅𝑥 = 𝑛𝐶𝑗 (1 − 𝑅)𝑛 𝑅𝑗
𝑗 =0
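\[ \alpha = \frac{\sum_{i=1}^{n} x_i^{2} \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n\sum_{i=1}^{n} x_i^{2} - \left(\sum_{i=1}^{n} x_i\right)^{2}} \]

\[ \beta = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^{2} - \left(\sum_{i=1}^{n} x_i\right)^{2}} \]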

Where

x= Independent variable

y= Dependent variable

n= Number of observation
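For the linear regression model y = α + βx, the intercept α and slope β can be computed directly from the sums of x, y, x², and xy. The sketch below is one way to do this with the standard least-squares formulas; the function name `fit_line` and the sample data are made up for illustration and do not come from the slides.

```python
def fit_line(xs, ys):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_xx - sum_x ** 2
    alpha = (sum_xx * sum_y - sum_x * sum_xy) / denominator
    beta = (n * sum_xy - sum_x * sum_y) / denominator
    return alpha, beta

# Illustrative data lying exactly on y = 1 + 2x
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"alpha = {alpha}, beta = {beta}")
```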

56
SERIES SYSTEM

R1 R2 … Rn

Reliability block diagram for a serial system


• Inside each box in the diagram are the reliabilities r1,…rn of the n
components.
• To compute the reliability, Rs, of the series system we need,
– To know the probability that the entire system is operational when
needed.
– All n components must be operational for the system to be
operational.

57
• Assume that the n components fail independently (the failure of one component does not affect any other component).
• Probability theory says that the probability of an event expressed as the intersection of independent events (all n components are operational) is the product of the probabilities of the independent events. Thus,

\[ R_S = r_1 \times r_2 \times \cdots \times r_n = \prod_{i=1}^{n} r_i \]

• Implications: each reliability value r_i is a probability and therefore r_i ≤ 1.
• Therefore, as more components are added in series, the system reliability will decrease.
• In the special case when all components have the same reliability r, we get

\[ R_S = r^{n} \]
58
• A Web site has a Web server (WS), an application server (AS), and a database server (DS) in series. Let r_WS, r_AS, and r_DB be the reliabilities of these components and assume their values are r_WS = 0.9, r_AS = 0.95, and r_DB = 0.99.
• Management wants to replace the database server with a highly reliable and expensive model that is advertised as having a 0.999 reliability. Is it a wise decision?
– The reliability of the site with the current database server is
R_site = r_WS × r_AS × r_DB = 0.9 × 0.95 × 0.99 = 0.84645

– The reliability of the site with the new database server is
R_newDBsite = r_WS × r_AS × r_newDB = 0.9 × 0.95 × 0.999 ≈ 0.85415

If, instead of the database server, the Web server (the most unreliable component of the system) is replaced by a new one with r = 0.95, the reliability of the site will be
R_newWSsite = r_newWS × r_AS × r_DB = 0.95 × 0.95 × 0.99 ≈ 0.89348

Thus it is evident that replacing the most unreliable component has a more pronounced effect in terms of improving overall system reliability.
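The comparison between the two upgrade options can be reproduced with a product over the component reliabilities. A minimal sketch, assuming independent failures; the function name `series_reliability` is illustrative.

```python
import math

def series_reliability(reliabilities):
    """Reliability of independent components in series: the product of r_i."""
    return math.prod(reliabilities)

current = series_reliability([0.9, 0.95, 0.99])     # WS, AS, DB
new_db = series_reliability([0.9, 0.95, 0.999])     # upgrade the database server
new_ws = series_reliability([0.95, 0.95, 0.99])     # upgrade the Web server

print(f"current site:   {current:.5f}")
print(f"new DB server:  {new_db:.5f}  (gain {new_db - current:.5f})")
print(f"new Web server: {new_ws:.5f}  (gain {new_ws - current:.5f})")
```

The gain from improving the weakest component is several times larger than the gain from improving the strongest one, which is the point of the example.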
59
[Figure: Reliability of a series of identical components. Reliability (vertical axis) versus number of components (horizontal axis, 0 to 50), with one curve for each component reliability: 0.9, 0.95, and 0.99.]

60
PARALLEL SYSTEMS
[Figure: Reliability block diagram for a parallel system, with components R1, R2, ..., Rn connected in parallel.]

• Using components in parallel is one of the most common ways to use redundancy.
• The reliability of the parallel system, R_p, is the probability that it is in operation when needed.
• This probability is equal to one minus the probability that the system is not in operation.

61
• For this to happen, all n components must be down.
• The probability that component i is down is simply (1 − r_i).
• So, assuming independence of failures between components, we get

\[ R_p = 1 - \Pr[\text{all components are down}] = 1 - (1 - r_1) \times (1 - r_2) \times \cdots \times (1 - r_n) = 1 - \prod_{i=1}^{n} (1 - r_i) \]

• In the special case when all components have the same reliability r, we get

\[ R_p = 1 - (1 - r)^{n} \]

• Thus, as we increase the number of components, system reliability grows very fast, as the following plot shows.

62
[Figure: Reliability of identical components in parallel. Reliability (vertical axis) versus number of parallel components (horizontal axis, 0 to 10), with one curve for each component reliability: 0.9, 0.95, and 0.99.]

63
• A search engine site wants to achieve a site reliability of 99.999% using a cluster of very cheap and unreliable Web servers. A cluster is a parallel combination of a number of servers. Each has a reliability of 85%. How many servers should be used in the cluster?
From the equation \(R_p = 1 - (1 - r)^{n}\) we know that
0.99999 = 1 − (1 − 0.85)^n = 1 − 0.15^n
So,
0.15^n = 1 − 0.99999 = 0.00001
If we apply logarithms to both sides of the above equation and take into consideration that n must be an integer, we get
n = ⌈ln 0.00001 / ln 0.15⌉ = ⌈6.069⌉ = 7
Thus, seven unreliable Web servers can provide a high level of reliability when used in parallel.
• Thus, we can generalize that the minimum number (n_min) of servers with reliability r needed to build a parallel system with reliability R_p is given by

\[ n_{min} = \left\lceil \frac{\ln(1 - R_p)}{\ln(1 - r)} \right\rceil \]
64
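The sizing rule for a parallel cluster is easy to apply and to sanity-check in code. The sketch below assumes identical servers that fail independently; the function names `parallel_reliability` and `min_servers` are illustrative.

```python
import math

def parallel_reliability(r, n):
    """Reliability of n identical, independent components in parallel."""
    return 1.0 - (1.0 - r) ** n

def min_servers(r, target):
    """Smallest n with 1 - (1 - r)^n >= target, via the logarithm formula."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - r))

n = min_servers(0.85, 0.99999)
print(f"servers needed: {n}")
print(f"reliability with {n - 1}: {parallel_reliability(0.85, n - 1):.7f}")
print(f"reliability with {n}: {parallel_reliability(0.85, n):.7f}")
```

Checking the cluster one server smaller confirms that the ceiling is needed: six servers fall just short of the 99.999% target.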
EXAMPLE
A Web site has a three-tier architecture composed of a layer of Web servers, a layer of application servers, and a layer of database servers. Their respective reliabilities are 0.99, 0.999, and 0.9999. 60% of the requests only use services from the Web server layer. The remaining 40% also use the application server layer: 16% of these need services from the Web and application server layers only, while the remaining 84% also need services from the database server layer. What is the site availability?
SOLUTION
• For the 60% of the requests that only use the Web servers, the site availability is 99%.
• For the (1 − 0.6) × (1 − 0.84) = 6.4% of the requests that only need Web and application layer services, the availability is 0.99 × 0.999 = 0.98901.
• Finally, for the (1 − 0.6) × 0.84 = 33.6% of the requests that need to use all three layers, the site availability is 0.99 × 0.999 × 0.9999 = 0.9889111.
Therefore, the average site availability is: 0.6 × 0.99 + 0.064 × 0.98901 + 0.336 × 0.9889111 = 0.98957077
65
Reliability & MTBF of a
Series System

66
Reliability & MTBF of a
Series System

67
Reliability, MTBF of a
Parallel System

68
Reliability, MTBF of a
Parallel System

69
Reliability, MTBF of a
Parallel System

70
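Reliability, MTBF of a
Series-Parallel System

[Figure: series-parallel system combining a series chain of blocks R_A, R_B, ..., R_N with a parallel group of blocks R_a, R_b, ..., R_n.]

\[ R_s = R_A \times R_B \times \cdots \times R_N \]

\[ R_p = 1 - (1 - R_a)(1 - R_b) \cdots (1 - R_n) \]

\[ R_{sp} = R_s \times R_p = (R_A R_B \cdots R_N)\left[1 - (1 - R_a)(1 - R_b) \cdots (1 - R_n)\right] \]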

71
EXAMPLES

72
EXAMPLES
A satellite communication system has an in-built microwave repeater unit with an MTTF of 40,000 hours. The link is functional if one channel is working, and the reliability of the switching unit is 0.95. Calculate the reliability for (i) a one-year operating period and (ii) a two-year operating period, using (a) a single channel, (b) two parallel channels, and (c) three parallel channels. Assume sub-system components to be of equal reliability.

Solution

[Figure: two channels of reliability R in parallel, in series with a switching unit of reliability R_sw.]

73
EXAMPLES

74
75
76
EXAMPLES

Calculate the MTTF of a series-parallel system with two components in parallel and in series with a single component, all of equal failure rate represented by λ. Also solve for (λ − 1) failure rate.

Solution

\[ R_p = 1 - (1 - R)(1 - R) = 2R - R^{2} \]

\[ R_s = R \]

\[ R_T = R\,(2R - R^{2}) = 2R^{2} - R^{3} \]

77
EXAMPLES

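MTTF of the system is M_T

\[ M_T = \int_0^{\infty} R_T\,dt = \int_0^{\infty} \left(2R^{2} - R^{3}\right)dt = \int_0^{\infty} \left(2e^{-2\lambda t} - e^{-3\lambda t}\right)dt \]

\[ M_T = \frac{1}{\lambda} - \frac{1}{3\lambda} = \frac{2}{3\lambda} \]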
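The closed form 2/(3λ) can be cross-checked by integrating the system reliability numerically. The sketch below uses a plain trapezoidal rule; the failure rate of 0.001 per hour, the integration horizon, and the function names are arbitrary choices made for illustration.

```python
import math

def system_reliability(t, lam):
    """R_T(t) = 2R^2 - R^3 with R = exp(-lam * t)."""
    r = math.exp(-lam * t)
    return 2 * r**2 - r**3

def mttf_numeric(lam, horizon_multiple=40.0, steps=200_000):
    """Trapezoidal integral of R_T(t) from 0 to horizon_multiple / lam."""
    upper = horizon_multiple / lam
    h = upper / steps
    total = 0.5 * (system_reliability(0.0, lam) + system_reliability(upper, lam))
    for k in range(1, steps):
        total += system_reliability(k * h, lam)
    return total * h

lam = 0.001  # hypothetical failure rate per hour
print(f"numeric MTTF = {mttf_numeric(lam):.3f} h")
print(f"closed form  = {2 / (3 * lam):.3f} h")
```

Note that this system's MTTF, 2/(3λ), is lower than the 1/λ of a single component: the series element outweighs the benefit of the parallel pair.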

78
EXAMPLES: RELIABILITY OF m of n
(MAJORITY VOTING) SYSTEM

[Figure: units U1, U2, U3, U4, ..., Un feeding an m/n Majority Voting Gate (MVG).]

The reliability of an m-of-n system with n components of equal reliability is the Binomial Reliability Function (BRF).

\[ R_{m-n} = 1 - \sum_{i=0}^{m-1} {}^{n}C_{i}\, R^{i} (1 - R)^{n-i} \]

79
Alternatively, the Binomial Distribution may be used. The terms relating to the question are selected from the Binomial expansion. For instance, the terms relating to 2-out-of-4, R_{2−4}, are R⁴ + 4R³Q + 6R²Q², taken from the full expansion

\[ (R + Q)^{4} = R^{4} + 4R^{3}Q + 6R^{2}Q^{2} + 4RQ^{3} + Q^{4} \]

\[ \therefore R_{2-4} = R^{4} + 4R^{3}Q + 6R^{2}Q^{2} \]

and in terms of R

\[ R_{2-4} = R^{4} + 4R^{3}(1 - R) + 6R^{2}(1 - R)^{2} = R^{4} + 4R^{3} - 4R^{4} + 6R^{2}\left(1 - 2R + R^{2}\right) \]

\[ R_{2-4} = 3R^{4} - 8R^{3} + 6R^{2} \]

Furthermore, we can instead subtract the other terms from the Binomial expansion. Since the sum of reliability and unreliability is 1, that is (R + Q) = 1,

\[ (R + Q)^{4} = 1 = R^{4} + 4R^{3}Q + 6R^{2}Q^{2} + 4RQ^{3} + Q^{4} \]

\[ \therefore R^{4} + 4R^{3}Q + 6R^{2}Q^{2} = 1 - \left(4RQ^{3} + Q^{4}\right) \]

80
However, if the component reliabilities and unreliabilities are not the same, that is R1 ≠ R2 ≠ R3 and likewise Q1 ≠ Q2 ≠ Q3, then for a three-unit system

\[ (R_1 + Q_1)(R_2 + Q_2)(R_3 + Q_3) = 1 \]

This gives

\[ R_1R_2R_3 + R_1R_2Q_3 + R_1R_3Q_2 + R_2R_3Q_1 + R_1Q_2Q_3 + R_2Q_1Q_3 + R_3Q_1Q_2 + Q_1Q_2Q_3 = 1 \]

R1 R2 R3 = Probability of successful operation of all units

R1 R2 Q3 = Probability of successful operation of the first two and the failure of the third one

R1 R3 Q2 = Probability of successful operation of the first and third and the failure of the second one

R1 Q2 Q3 = Probability of successful operation of the first one and the failure of the second and third

.
.
.
Q1 Q2 Q3 = Probability of failure of all units

Hence the reliability of 2-of-3 is

\[ R_{2-3} = R_1R_2R_3 + R_1R_2Q_3 + R_1R_3Q_2 + R_2R_3Q_1 \]
81
PASSIVE REDUNDANCY (STANDBY) SYSTEM

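[Figure: standby arrangement of units, with one unit operating and the others on standby.]

The reliability of a standby (passive redundancy) system with n units is given by the first n terms of the Poisson expression.

\[ R_{pa(n)} = R(t) = e^{-\lambda t}\left[1 + \lambda t + \frac{\lambda^{2}t^{2}}{2!} + \cdots + \frac{\lambda^{n-1}t^{n-1}}{(n-1)!}\right] \]

For two units, one of which is on standby, the expression reduces to

\[ R_{pa(2)} = e^{-\lambda t}(1 + \lambda t) \]

For two units with different failure rates

\[ R_{pa(2)} = \frac{\lambda_2 R_1 - \lambda_1 R_2}{\lambda_2 - \lambda_1} = \frac{\lambda_2 e^{-\lambda_1 t} - \lambda_1 e^{-\lambda_2 t}}{\lambda_2 - \lambda_1} \]
82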
EXAMPLES
The reliability of an aircraft engine during flight is 99%. Determine the reliability of a successful flight if the aircraft can complete the flight on at least two of its four engines.

Solution

There are three possibilities for a successful flight

(i) All four engines are working

(ii) Three of the engines are working

(iii) Only two of the engines are working

If R stands for reliability and Q stands for unreliability, then (R + Q)^n = 1.

Since the aircraft has four engines, the combined reliability and unreliability of the four engines is (R + Q)^4.

By Binomial expansion

\[ (R + Q)^{4} = R^{4} + 4R^{3}Q + 6R^{2}Q^{2} + 4RQ^{3} + Q^{4} \]
83
EXAMPLES
From this expansion, the term for all four engines working is R⁴; that is, the reliability of the aircraft if all four engines must work is:

\[ R^{4} = \left(\frac{99}{100}\right)^{4} = 0.99^{4} \approx 0.9606 \]

The terms for at least three out of four engines are R⁴ + 4R³Q, that is

\[ 0.99^{4} + 4(0.99)^{3}(1 - 0.99) = 0.960596 + 0.038812 \approx 0.9994 \]

The terms for at least two out of four engines are

\[ R^{4} + 4R^{3}Q + 6R^{2}Q^{2} = 0.99^{4} + 4(0.99)^{3}(1 - 0.99) + 6(0.99)^{2}(1 - 0.99)^{2} \]

\[ = 0.960596 + 0.038812 + 0.000588 = 0.999996 \]


84
EXAMPLES
Alternative Method

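Binomial Reliability Function (BRF).

\[ R_{m-n} = 1 - \sum_{i=0}^{m-1} {}^{n}C_{i}\, R^{i}(1 - R)^{n-i} \]

For all 4 (4 out of 4)

\[ R_{4-4} = 1 - \sum_{i=0}^{4-1} {}^{n}C_{i}\, R^{i}(1 - R)^{n-i} \]

\[ = 1 - \left[{}^{4}C_{0} R^{0}(1 - R)^{4-0} + {}^{4}C_{1} R^{1}(1 - R)^{4-1} + {}^{4}C_{2} R^{2}(1 - R)^{4-2} + {}^{4}C_{3} R^{3}(1 - R)^{4-3}\right] \]

\[ = 1 - \left[1 \times 1 \times 0.01^{4} + 4 \times 0.99 \times 0.01^{3} + 6 \times 0.99^{2} \times 0.01^{2} + 4 \times 0.99^{3} \times 0.01^{1}\right] \]

\[ = 1 - \left[1 \times 10^{-8} + 3.96 \times 10^{-6} + 5.88 \times 10^{-4} + 0.0388\right] \]

\[ \approx 0.96 \]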
85
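For 3 out of 4

\[ R_{3-4} = 1 - \sum_{i=0}^{3-1} {}^{n}C_{i}\, R^{i}(1 - R)^{n-i} \]

\[ = 1 - \left[{}^{4}C_{0} R^{0}(1 - R)^{4-0} + {}^{4}C_{1} R^{1}(1 - R)^{4-1} + {}^{4}C_{2} R^{2}(1 - R)^{4-2}\right] \]

\[ = 1 - \left[1 \times 1 \times 0.01^{4} + 4 \times 0.99 \times 0.01^{3} + 6 \times 0.99^{2} \times 0.01^{2}\right] \]

\[ = 1 - \left[1 \times 10^{-8} + 3.96 \times 10^{-6} + 5.88 \times 10^{-4}\right] \]

\[ = 0.9994 \]

For 2 out of 4

\[ R_{2-4} = 1 - \sum_{i=0}^{2-1} {}^{n}C_{i}\, R^{i}(1 - R)^{n-i} \]

\[ = 1 - \left[{}^{4}C_{0} R^{0}(1 - R)^{4-0} + {}^{4}C_{1} R^{1}(1 - R)^{4-1}\right] \]

\[ = 1 - \left[1 \times 1 \times 0.01^{4} + 4 \times 0.99 \times 0.01^{3}\right] \]

\[ = 1 - \left[1 \times 10^{-8} + 3.96 \times 10^{-6}\right] \]

\[ = 0.999996 \]
86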
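The Binomial Reliability Function is straightforward to evaluate for any m and n. The following sketch implements it for identical, independent units; the function name `k_of_n_reliability` is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def k_of_n_reliability(k, n, r):
    """Probability that at least k of n identical units (reliability r) work.

    Computed as 1 minus the probability that fewer than k units work.
    """
    q = 1.0 - r
    return 1.0 - sum(comb(n, i) * r**i * q**(n - i) for i in range(k))

for k in (4, 3, 2):
    print(f"{k}-out-of-4 at r = 0.99: {k_of_n_reliability(k, 4, 0.99):.6f}")
```

Relaxing the requirement from four engines to three, and then to two, raises the flight reliability sharply with each step.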
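Determine the system reliability if the reliabilities and unreliabilities of the components are

R1 = 0.89, Q2 = 0.07, R3 = 0.95, R4 = 0.97, and Q_SW = 0.01, and m = 2.

[Figure: four blocks in series. Block A is R1 in parallel with R2; block B is an m/n majority voting group of three R3 units; block C is R4; block D is the switch R_SW.]

\[ R_s = R_A \times R_B \times R_C \times R_D = R_A \times R_B \times R_4 \times R_{SW} \]

\[ R_A = 1 - (1 - R_1)(1 - R_2) \]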

87
\[ R_A = 1 - (1 - 0.89)(1 - 0.93) = 1 - (0.11)(0.07) = 0.9923 \]
Since R_B is an m/n (majority voting) subsystem, where m = 2

\[ R_B = 1 - \sum_{i=0}^{m-1} {}^{n}C_{i}\, R^{i}(1 - R)^{n-i} = 1 - \sum_{i=0}^{2-1} {}^{n}C_{i}\, R^{i}(1 - R)^{n-i} \]

\[ = 1 - \left[{}^{3}C_{0}\,(R_3)^{0}(1 - R_3)^{3} + {}^{3}C_{1}\,(R_3)^{1}(1 - R_3)^{2}\right] \]

Since ³C₀ = 1, ³C₁ = 3, and R⁰ = 1

\[ = 1 - \left[(1 - R_3)^{3} + 3R_3(1 - R_3)^{2}\right] \]

\[ R_B = 1 - \left[(1 - 0.95)^{3} + 3(0.95)(1 - 0.95)^{2}\right] \]

\[ = 1 - \left[1.25 \times 10^{-4} + 7.125 \times 10^{-3}\right] = 1 - 7.25 \times 10^{-3} = 0.99275 \]
88
Alternative Method

The terms for a two-out-of-three system are R³ + 3R²Q = R³ + 3R²(1 − R)

= 0.95³ + 3(0.95²)(1 − 0.95) = 0.857375 + 0.135375 = 0.99275

R_C = R_4 = 0.97

R_D = R_SW = 1 − Q_SW = 1 − 0.01 = 0.99

Recall that R_2 = 1 − Q_2 = 1 − 0.07 = 0.93

Multiply the values of R_A to R_D

∴ R_S = 0.9923 × 0.99275 × 0.97 × 0.99 = 0.946
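The whole calculation can be composed from two small building blocks, one for parallel groups and one for m-of-n voting groups. The sketch below assumes the structure used in the numerical work: block A is R1 in parallel with R2, block B is a 2-out-of-3 group of R3 units, and these are in series with R4 and the switch. The function names are illustrative.

```python
from math import comb

def parallel(*reliabilities):
    """Reliability of independent components in parallel."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail

def k_of_n(k, n, r):
    """At least k of n identical units (reliability r) must work."""
    return 1.0 - sum(comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k))

r1, r2, r3, r4 = 0.89, 1 - 0.07, 0.95, 0.97
r_sw = 1 - 0.01

r_a = parallel(r1, r2)      # block A
r_b = k_of_n(2, 3, r3)      # block B, majority voting with m = 2
r_s = r_a * r_b * r4 * r_sw # series combination of A, B, R4, and the switch

print(f"R_A = {r_a:.4f}, R_B = {r_b:.5f}, R_S = {r_s:.3f}")
```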

89
EXAMPLES
• Three computer systems were connected in a standby redundancy configuration for the control of a nuclear plant, such that one works at a time until it is predicted to fall below a certain reliability threshold, at which point another takes over instantaneously. The failure rate of each system is 0.01 failures per hour. Given a mission time of 50 hours, determine the overall system reliability using (i) a single computer, (ii) one standby computer, and (iii) two standby computers.

Solution

(i) For a single computer

$$R(t) = e^{-\lambda t} = e^{-0.01 \times 50} = 0.606$$

(ii) For one standby

$$R_n(t) = e^{-\lambda t} \sum_{i=0}^{n-1} \frac{(\lambda t)^{i}}{i!}$$

$$R_2(t) = e^{-\lambda t} (1 + \lambda t)$$ for 2 systems (one working and the other on standby)

$$R_2(50) = e^{-0.01 \times 50} (1 + 0.01 \times 50) = 0.910$$

(iii) For two standby computers

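$$R_n(t) = e^{-\lambda t} \sum_{i=0}^{n-1} \frac{(\lambda t)^{i}}{i!}$$

$$R_3(t) = e^{-\lambda t} \left( 1 + \lambda t + \frac{(\lambda t)^{2}}{2!} \right)$$ for 3 systems (one working and the others on standby)

$$R_3(50) = e^{-0.01 \times 50} \left( 1 + 0.01 \times 50 + \frac{(0.01 \times 50)^{2}}{2} \right) = 0.985$$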
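The three cases in this solution differ only in the number of terms kept in the sum, which a short function makes explicit. The sketch assumes what the formula itself assumes: identical units with a constant failure rate, perfect and instantaneous switching, and standby units that do not fail while idle. The name `standby_reliability` is chosen for this example only.

```python
from math import exp, factorial


def standby_reliability(n_units: int, failure_rate: float, t: float) -> float:
    """Reliability at time t of n_units identical units, one active at a time."""
    lam_t = failure_rate * t
    # The system survives if fewer than n_units failures occur by time t.
    return exp(-lam_t) * sum(lam_t**i / factorial(i) for i in range(n_units))


if __name__ == "__main__":
    for n_units in (1, 2, 3):
        r = standby_reliability(n_units, failure_rate=0.01, t=50)
        print(f"{n_units} unit(s): R(50) = {r:.4f}")
```

Each additional standby unit adds one more term to the sum, which is why the gain from the second standby is smaller than the gain from the first.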

EXAMPLE
A printing company has four presses, one operating and three on standby.
Each press has an identical failure rate, with an MTBF of 50 operating
hours. The company has secured a rush order requiring 75 hours of
continuous time on a press. If a standby press is brought in whenever the online press
fails, determine the probability of there being continuous printing support
while the order is being processed.

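The time to failure of the four-unit standby system has a gamma distribution with $\lambda = 1/\text{MTBF} = 1/50$ failures per hour.

$$R_4(75) = e^{-\lambda t} \sum_{i=0}^{3} \frac{(\lambda t)^{i}}{i!}$$

$$= e^{-\frac{1}{50} \times 75} \left( 1 + \frac{3}{2} + \frac{9}{8} + \frac{27}{48} \right)$$

$$= 0.9344$$

Note that the time at which the nth failure is observed is the sum of n independent and identically distributed exponential times. Therefore, the MTTF for the four-press system is 50 hrs x 4 = 200 hrs:

$$\text{MTTF} = \frac{n}{\lambda} = \frac{4}{\frac{1}{50\ \text{hrs}}} = 200\ \text{hrs}$$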
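The fractions 3/2, 9/8 and 27/48 in the bracket come directly from the terms (λt)^i / i! with λt = 75/50. The sketch below reproduces them with exact rational arithmetic and then evaluates the reliability and MTTF. The function names are illustrative assumptions, and the model is the same idealised standby model used in the solution.

```python
from fractions import Fraction
from math import exp, factorial


def standby_terms(n_units: int, lam_t: Fraction) -> list:
    """Exact terms (lam_t)^i / i! for i = 0 .. n_units - 1."""
    return [lam_t**i / factorial(i) for i in range(n_units)]


def standby_reliability_exact_terms(n_units: int, mtbf: int, t: int) -> float:
    """Reliability of an n-unit standby system with per-unit MTBF."""
    lam_t = Fraction(t, mtbf)                 # lambda * t with lambda = 1 / MTBF
    bracket = sum(standby_terms(n_units, lam_t))
    return exp(-float(lam_t)) * float(bracket)


def standby_mttf(n_units: int, mtbf: int) -> int:
    """Mean of a sum of n_units exponential lifetimes, each with mean MTBF."""
    return n_units * mtbf


if __name__ == "__main__":
    print(standby_terms(4, Fraction(75, 50)))
    print(round(standby_reliability_exact_terms(4, 50, 75), 4))
    print(standby_mttf(4, 50))
```

Note that `Fraction` reduces 27/48 to 9/16 when printing; the two are the same number.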

Computer Maintainability

• Serviceability or maintainability is the simplicity and
speed with which a system can be repaired or
maintained; if the time to repair a failed system
increases, then availability will decrease. It includes
various methods of easily diagnosing the system
when problems arise. Early detection of faults can
decrease or avoid system downtime. For example,
some enterprise systems can automatically call a
service center without human intervention when the
system experiences a system fault. The traditional
focus has been on making the correct repairs with as
little disruption to normal operations as possible.
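The link between repair time and availability stated above can be illustrated with the usual steady-state relation A = MTBF / (MTBF + MTTR). The following is a minimal sketch; the function name and the sample MTBF and MTTR figures are illustrative assumptions, not values from these notes.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 1000.0  # hypothetical mean time between failures, in hours
    for mttr in (1.0, 10.0, 50.0):  # hypothetical repair times, in hours
        print(f"MTTR = {mttr:5.1f} h -> availability = {availability(mtbf, mttr):.4f}")
```

With the failure behaviour held fixed, every increase in MTTR lowers the result, which is why faster diagnosis and repair raise availability.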
Computer Maintainability
Maintainability is the ease with which a product can be
maintained in order to:
• isolate defects or their cause,
• correct defects or their cause,
• repair or replace faulty or worn out components
without having to replace still-working parts,
• prevent unexpected breakdowns,
• maximize a product’s useful life,
• maximize efficiency, reliability, and safety,
• meet new requirements,
• make future maintenance easier, or
• cope with a changed environment.
Computer Maintainability
In some cases, maintainability involves a system of continuous
improvement - learning from the past in order to improve the ability
to maintain systems, or improve reliability of systems based on
maintenance experience.
In telecommunication and several other engineering fields, the term
maintainability has the following meanings:
• A characteristic of design and installation, expressed as the
probability that an item will be retained in or restored to a
specified condition within a given period of time, when the
maintenance is performed in accordance with prescribed
procedures and resources.
• The ease with which maintenance of a functional unit can be
performed in accordance with prescribed requirements.
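The first definition above treats maintainability as a probability of restoration within a given time. One common way to put a number on it, assuming repair times are exponentially distributed with mean MTTR (a modelling assumption that these notes do not state), is M(t) = 1 - exp(-t / MTTR). The sketch below uses that assumption; the function name and sample values are hypothetical.

```python
from math import exp


def maintainability(t_hours: float, mttr_hours: float) -> float:
    """Probability that a repair is completed within t_hours.

    Assumes exponentially distributed repair times with mean mttr_hours.
    """
    return 1.0 - exp(-t_hours / mttr_hours)


if __name__ == "__main__":
    mttr = 2.0  # hypothetical mean time to repair, in hours
    for t in (1.0, 2.0, 4.0, 8.0):
        print(f"P(repair within {t:.0f} h) = {maintainability(t, mttr):.3f}")
```

Under this assumption only about 63% of repairs finish within one MTTR, so a time limit for restoration usually has to be set well above the mean repair time.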

Cyclic Redundancy Check
• A cyclic redundancy check (CRC) is an error-detecting code
commonly used in digital networks and storage devices to detect
accidental changes to raw data. Blocks of data entering these
systems get a short check value attached, based on the remainder
of a polynomial division of their contents; on retrieval the calculation
is repeated, and corrective action can be taken against presumed
data corruption if the check values do not match.
• CRCs are so called because the check (data verification) value is
a redundancy (it expands the message without adding
information) and the algorithm is based on cyclic codes. CRCs
are popular because they are simple to implement in binary
hardware, easy to analyze mathematically, and particularly good
at detecting common errors caused by noise in transmission
channels. Because the check value has a fixed length, the
function that generates it is occasionally used as a hash function.
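The attach-then-recheck procedure described above can be sketched with a bit-by-bit CRC-32. This example uses the widely deployed reflected polynomial 0xEDB88320 (the variant used by Ethernet and zip); the slides do not name a specific CRC, so that choice, the 4-byte little-endian framing, and the helper names are assumptions of this sketch.

```python
def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC-32 (reflected form, polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF                      # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):                # process one bit at a time
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # standard final inversion


def attach_check_value(block: bytes) -> bytes:
    """Append the 4-byte check value to a data block."""
    return block + crc32_bitwise(block).to_bytes(4, "little")


def is_intact(frame: bytes) -> bool:
    """Recompute the CRC on retrieval and compare with the stored value."""
    block, stored = frame[:-4], int.from_bytes(frame[-4:], "little")
    return crc32_bitwise(block) == stored


if __name__ == "__main__":
    frame = attach_check_value(b"reliability")
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]   # flip one bit
    print(is_intact(frame), is_intact(corrupted))
```

A production system would normally use a table-driven or hardware implementation; the bitwise loop is shown because it mirrors the polynomial division most directly.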
Maintenance Types
• Generally speaking, there are three types of maintenance in
use:
• Preventive maintenance, where equipment is maintained before
breakdown occurs. This type of maintenance has many
different variations and is the subject of ongoing research to
determine the best and most efficient way to maintain equipment.
Recent studies have shown that preventive maintenance is
effective in preventing age-related failures of the equipment.
For random failure patterns, which amount to 80% of the failure
patterns, condition monitoring proves to be effective.
• Corrective maintenance, where equipment is maintained after
breakdown. This maintenance is often the most expensive
because worn equipment can damage other parts and cause
multiple damages.
• Operational maintenance, where equipment is maintained while
in use.
Maintenance Types
• Preventive / Predictive maintenance
• Preventive maintenance is maintenance performed in an
attempt to avoid failures, unnecessary production loss and
safety violations.
• The effectiveness of preventive maintenance depends on the
RCM analysis on which it is based, and on the ground rules used
for cost-effectiveness.
• Preventive maintenance (PM) has the following meanings:
• The care and servicing by personnel for the purpose of
maintaining equipment and facilities in satisfactory operating
condition by providing for systematic inspection, detection, and
correction of incipient failures either before they occur or before
they develop into major defects.
• Maintenance, including tests, measurements, adjustments, and
parts replacement, performed specifically to prevent faults from
occurring.
• The primary goal of maintenance is to avoid or mitigate the consequences of
equipment failure. This may be done by preventing the failure before it actually
occurs, which planned maintenance and condition-based maintenance help to
achieve. Preventive maintenance is designed to preserve and restore equipment reliability by replacing
worn components before they actually fail. Preventive maintenance activities
include partial or complete overhauls at specified periods, oil changes,
lubrication and so on. In addition, workers can record equipment deterioration so
they know to replace or repair worn parts before they cause system failure. The
ideal preventive maintenance program would prevent all equipment failure
before it occurs.
• There is a controversy of sorts regarding the propriety of the usage
“preventative.”
Subgroups
• Preventive maintenance can be described as maintenance of equipment or
systems before fault occurs. It can be divided into two subgroups:
• planned maintenance and
• condition-based maintenance.
The main difference between the subgroups is how the maintenance time is determined, that is,
the moment when maintenance should be performed.
While preventive maintenance is generally considered to be worthwhile, there are risks such
as equipment failure or human error involved when performing preventive maintenance, just
as in any maintenance operation. Preventive maintenance as scheduled overhaul or scheduled
replacement provides two of the three proactive failure management policies available to the
maintenance engineer. Common methods of determining which preventive (or other) failure
management policies should be applied are: OEM recommendations; requirements of codes
and legislation within a jurisdiction; what an “expert” thinks ought to be done; the
maintenance that is already done on similar equipment; and, most important, measured values
and performance indications. In a nutshell:

•Preventive maintenance is conducted to keep equipment working and/or extend the life of
the equipment.

•Corrective maintenance, sometimes called “repair,” is conducted to get equipment working
again.
Corrective / Breakdown Maintenance

Corrective maintenance can be defined as the maintenance which
is required when an item has failed or worn out, to bring it back to
working order. Corrective maintenance is carried out on all items
where the consequences of failure or wearing out are not
significant and the cost of this maintenance is not greater than
that of preventive maintenance.
• Corrective maintenance is the program focused on the regular
tasks that keep all the critical machinery and systems
in optimum operating condition. The major objectives of the
program are to: 1. Eliminate breakdowns 2. Eliminate deviations
3. Eliminate unnecessary repairs 4. Optimize all the critical
planned systems
• Corrective maintenance is probably the most commonly used
approach, but it is easy to see its limitations. When equipment
fails, it often leads to downtime in production. In most cases, this
is a costly business. Also, if the equipment needs to be replaced,
the cost of replacing it alone can be substantial. It is also
important to consider health, safety and environment (HSE)
issues related to malfunctioning equipment.
Operational maintenance

Operational maintenance is a kind of preventive
maintenance performed by the operator of a piece of equipment or
machine, or by other personnel who are not highly technical. It is
carried out while the machine is functioning as an operational
unit. It entails minor maintenance of equipment using
procedures that do not require detailed technical knowledge:
cleaning, servicing, preserving, lubricating, adjusting
and parts replacement.
• The purposes of operational maintenance are: (a) to
make the operator aware of the state or readiness
of the equipment, and (b) to reduce the delays from
having to wait for a qualified technician, who is kept
available for more complicated work.

Reliability centered maintenance
Reliability centered maintenance is an engineering framework that enables the definition of
a complete maintenance regime. It regards maintenance as the means to maintain the
functions a user may require of machinery in a defined operating context. As a discipline it
enables machinery stakeholders to monitor, assess, predict and generally understand the
working of their physical assets. This is embodied in the initial part of the RCM process
which is to identify the operating context of the machinery, and write a Failure Mode Effects
and Criticality Analysis (FMECA). The second part of the analysis is to apply the “RCM
logic”, which helps determine the appropriate maintenance tasks for the identified failure
modes in the FMECA. Once the logic is complete for all elements in the FMECA, the
resulting list of maintenance is “packaged”, so that the periodicities of the tasks are
rationalised to be called up in work packages; it is important not to destroy applicability of
maintenance in this phase. Lastly, RCM is kept live throughout the “in-service” life of
machinery, where the effectiveness of the maintenance is kept under constant review and
adjusted in light of the experience gained.
Difference Between Preventive and
Predictive Maintenance

Predictive maintenance tends to include direct measurement of the item, for example an infrared picture of
a circuit board to determine hot spots, while preventive maintenance includes the evaluation of particles
in suspension in a lubricant and sound and vibration analysis of a machine.

•Examples

•An individual bought an incandescent light bulb. The manufacturing company mentioned that the life
span of the bulb is 3 years. Just before the 3 years, the individual decided to replace the bulb with a new
one. This is called preventive maintenance.

•On the other hand, the individual has the opportunity to observe the bulb operation daily. After two
years, the bulb starts flickering. The individual predicts at that time that the bulb is going to fail very
soon and decides to change it for a new one. This is called predictive maintenance.

•The individual ignores the flickering bulb and only goes out to buy another replacement light bulb
when the current one fails. This is called corrective maintenance.
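The three light-bulb scenarios above differ only in what triggers the replacement: elapsed rated life, an observed symptom, or actual failure. A deliberately simplified sketch of that decision is given below; the enum, the function name and the two boolean inputs are hypothetical and only mirror the bulb example.

```python
from enum import Enum


class MaintenanceType(Enum):
    PREVENTIVE = "preventive"
    PREDICTIVE = "predictive"
    CORRECTIVE = "corrective"


def classify_replacement(has_failed: bool, symptom_observed: bool) -> MaintenanceType:
    """Classify a replacement by what triggered it."""
    if has_failed:
        # Replaced only after the item stopped working.
        return MaintenanceType.CORRECTIVE
    if symptom_observed:
        # Replaced because a measured or observed condition (e.g. flickering)
        # suggested that failure was near.
        return MaintenanceType.PREDICTIVE
    # Replaced on schedule (e.g. just before the rated 3-year life ends).
    return MaintenanceType.PREVENTIVE


if __name__ == "__main__":
    print(classify_replacement(has_failed=False, symptom_observed=False).value)
    print(classify_replacement(has_failed=False, symptom_observed=True).value)
    print(classify_replacement(has_failed=True, symptom_observed=True).value)
```

Real maintenance planning weighs cost and consequence as well, so this mapping is a teaching aid rather than a decision rule.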

•Applications such as Flame Task are used to help complete these tasks.
Maintenance, repair, and operations
Maintenance, repair and operations (MRO) or maintenance, repair, and overhaul involves fixing any sort of
mechanical, plumbing or electrical device should it become out of order or broken (known as repair,
unscheduled, or casualty maintenance). It also includes performing routine actions which keep the device in
working order (known as scheduled maintenance) or prevent trouble from arising (preventive maintenance).
MRO may be defined as, “All actions which have the objective of retaining or restoring an item in or to a
state in which it can perform its required function. The actions include the combination of all technical and
corresponding administrative, managerial, and supervision actions.”

•MRO operations can be categorised by whether the product remains the property of the customer, i.e. a
service is being offered, or whether the product is bought by the reprocessing organisation and sold to any
customer wishing to make the purchase (Guadette, 2002). In the former case it may be a backshop operation
within a larger organization or smaller operation.

•The former of these represents a closed loop supply chain and usually has the scope of maintenance, repair
or overhaul of the product. The latter of the categorisations is an open loop supply chain and is typified by
refurbishment and remanufacture. The main characteristic of the closed loop system is that the demand for a
product is matched with the supply of a used product. Neglecting asset write-offs and exceptional activities
the total population of the product between the customer and the service provider remains constant.
Reliability and Maintainability of Public
Utilities in Nigeria
Seminar Titles – 2015/2016 and 2016/2017 Sessions

• Electricity
• Water supply
• Road Network
• Rail Network
• Aviation/Air Services
• Water Transportation Services
• Television Services
• Radio Services
• Telephone Services
• Internet/IT Services
• Hospital Infrastructure Services
• School Infrastructure Services