Quick_algorithms_to_calculate_mean_time (2)
ABSTRACT
Calculation of reliability and availability is a necessary part of every fault tolerant system
design. Conventional methods, which use the calculus of probability, become impractical when there
is dependent failure, repair, or standby operation in the system. Markov models
also become hard to solve when the system is composed of many elements or is
repairable. However, two alternative measures, mean time to
failure in place of reliability and steady state availability in place of availability, are easily obtained from Markov
graphs. In this paper, we introduce an algorithm for the calculation of mean time to
failure, and develop a method to measure steady state availability together with its
corresponding algorithm.
Key Words:
Mean time to failure, steady state availability, Markov graphs, algorithm, measurement
1 INTRODUCTION
A system that functions correctly even while some faults exist in it is called a
fault tolerant system. Reasons to build fault tolerant systems include harsh
environments, novice users, high repair costs, and large systems which must always
be kept up. Adding redundant components or functions is the most common approach to
obtaining fault tolerant systems. When designing a fault tolerant system, several features
need to be evaluated and traded off against each other: cost,
weight, volume, reliability, and availability. Reliability is the probability of no failure in a
given operating period, while availability is the probability that the system is up at any
point in time. Availability is generally measured to find the effect of repair on a system.
Calculation of reliability and availability is a necessary part of any fault tolerant design
process.
1 RNDr. Moslem Amiri; Masaryk University, FI, B202; [email protected]
2 prof. Ing. Václav Přenosil, CSc.; Masaryk University, FI, B406; [email protected]
2. Calculus of probability is used to express the reliability of the whole system in
terms of the probability of success of its units;
3. Every unit’s probability of success is calculated using failure rate models;
4. Combining steps 2 and 3, system reliability is found.
In this method, element reliabilities are calculated directly. When there is dependent
failure, repair, or standby operation in the system, this method becomes complicated. An
alternative approach that overcomes this complexity is the use of Markov models. The Markov
model approach works well when the failure hazards, $z(t)$, and repair hazards, $w(t)$, are
constant. Throughout this paper, we consider only constant failure and repair hazards
for simplicity. However, the algorithms and formulations explained here for the calculation
of mean time to failure and steady state availability can be easily extended to
time-dependent hazards.
2 MARKOV MODELS
In this model, each component is assumed to be in one of two conditions at any time:
good or bad. A combination of the conditions of all the components composing the system, with each
component being either good or bad, is called a state. First, all mutually exclusive
states of the system should be defined. The states of the system at time zero are called the
“initial states” and the states representing final or equilibrium conditions are the “final states.”
Hence, in Markov models it is possible to appoint not only the state in which all the
components are operational, but any other state, as the initial state. In practice, however,
it is almost always the case that all the components are operational at time zero. Starting
from the initial state, in time $\Delta t$ there can be a transition to every possible state with one
more bad component. These chains of transitions spread between states which differ only
in the condition of one component in time $\Delta t$. For every state, there should be as many
transitions as the number of good components in that state in time $\Delta t$. The probability of
a transition in time $\Delta t$ from one state to another is $z(t)\Delta t$, where $z(t)$ is the hazard rate
associated with this transition.
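The state and transition structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: encoding a state as an integer bitmask (bit `c` set means component `c` is bad) and the function name `failure_transitions` are choices made here, not notation from the paper, and repair transitions are left out.

```python
def failure_transitions(n):
    """Map every state of an n-component system to its failure successors.

    Assumption: a state is an integer bitmask; bit c set means component c
    is bad. Each good component contributes exactly one outgoing transition.
    """
    graph = {}
    for state in range(2 ** n):
        good = [c for c in range(n) if not state & (1 << c)]
        # One transition per good component: that component becomes bad.
        graph[state] = [state | (1 << c) for c in good]
    return graph


graph = failure_transitions(3)
print(graph[0])  # [1, 2, 4]
print(graph[1])  # [3, 5]
print(graph[7])  # []
```

With this encoding, the number of outgoing transitions of a state equals its number of good components, and the all-bad state has none.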
For a system composed of a single non-repairable component $x_1$, the two possible states
are $s_0 = x_1$ (component is good) and $s_1 = \bar{x}_1$ (component is bad). A state transition table
along with the Markov graph for this system is shown in Fig. 1. In this system, the
probability of being in state $s_0$ at time $t + \Delta t$ is:
1. probability that system is in state $s_0$ at time $t$ (i.e., $P_{s_0}(t)$) and there is no failure
                   Final States
Initial States     $s_0$                  $s_1$
$s_0$              $1 - z(t)\Delta t$     $z(t)\Delta t$
$s_1$              $0$                    $1$
(a) (b)
Figure 1 State transition table (a) and Markov graph (b) for a single non-repairable
component
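\[ \frac{P_{s_0}(t + \Delta t) - P_{s_0}(t)}{\Delta t} = -z(t)\,P_{s_0}(t) \tag{3} \]
\[ \frac{P_{s_1}(t + \Delta t) - P_{s_1}(t)}{\Delta t} = z(t)\,P_{s_0}(t) \tag{4} \]
When $\Delta t$ approaches its limit (goes to zero), (3) and (4) will yield:
\[ \frac{dP_{s_0}(t)}{dt} = -z(t)\,P_{s_0}(t) \tag{5} \]
\[ \frac{dP_{s_1}(t)}{dt} = z(t)\,P_{s_0}(t) \tag{6} \]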
A simple method to obtain (5) and (6) directly from the Markov graph in Fig. 1 (b) is to [1]:
1. change any unity to zero and any $\Delta t$ to one in the Markov graph;
2. equate the derivative of the probability of any node to the sum of the branches coming into
that node.
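The two-step rule can be illustrated with a minimal Python sketch. The representation is an assumption made for this example: each branch probability is written as `unity + rate * dt` and stored as the pair `(unity, rate)`, and the name `derivative_coefficients` is hypothetical. The hazard is taken as a constant `lam`.

```python
def derivative_coefficients(branches):
    """Apply the two-step rule to a Markov graph.

    `branches` maps (source, target) to (unity, rate), meaning the branch
    probability is unity + rate * dt. Step 1 (unity -> 0, dt -> 1) leaves
    only `rate`; step 2 collects the branches entering each node, so that
    dP_target/dt = sum(coefficient * P_source).
    """
    equations = {}
    for (source, target), (_unity, rate) in branches.items():
        if rate != 0:
            equations.setdefault(target, {})[source] = rate
    return equations


lam = 0.5  # assumed constant hazard, z(t) = lam
single = {
    ("s0", "s0"): (1, -lam),  # self-loop: 1 - z(t) dt
    ("s0", "s1"): (0, lam),   # failure branch: z(t) dt
    ("s1", "s1"): (1, 0.0),   # absorbing state: probability 1
}
eqs = derivative_coefficients(single)
print(eqs)  # {'s0': {'s0': -0.5}, 's1': {'s0': 0.5}}
```

The result reads as $dP_{s_0}/dt = -0.5\,P_{s_0}$ and $dP_{s_1}/dt = 0.5\,P_{s_0}$, which is the form of (5) and (6) for a constant hazard.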
Formulation of Markov models always leads to a set of first order differential equations.
Solving these differential equations becomes complicated if there are a large number of
components in the system, or if repair is added to the system. An alternative method is to
use Laplace transforms. The direct Laplace transform is an easy process which is in fact a
simple replacement. However, the inverse transform requires partial fraction expansion for
exact results, which is difficult to calculate, especially when repair is also
considered in the design. Although reliability and availability are obtained only if the
inverse transform is done, it is still possible to extract partial information from the
transformed equations and thereby avoid the inverse Laplace transform. These easily
obtained quantities are mean time to failure (MTTF) and steady state availability
(SSA). MTTF and SSA can be used as alternatives to reliability and availability when
several systems should be compared. If failure hazards are constant ($z(t) = \lambda$), the
transform pairs shown in Tab. 1 are sufficient to develop and calculate MTTF and SSA.
No.   $f(t)$                              $L\{f(t)\} = F(s) = f^*(s)$
1     $\dfrac{df(t)}{dt}$                 $sF(s) - f(0)$
2     $\int_0^t f(\tau)\,d\tau$           $\dfrac{F(s)}{s}$
3     $\lim_{t \to \infty} f(t)$          $\lim_{s \to 0} sF(s)$
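The MTTF is the integral of the reliability function:
\[ \text{MTTF} = \int_0^{\infty} R(t)\,dt = \lim_{t \to \infty} \int_0^{t} R(\tau)\,d\tau \tag{7} \]
Applying theorem 2 of Tab. 1 to this integral gives:
\[ L\left\{ \int_0^{t} R(\tau)\,d\tau \right\} = \frac{R^*(s)}{s} \tag{8} \]
And applying theorem 3 of Tab. 1 when $f(t) = \int_0^{t} R(\tau)\,d\tau$ will give:
\[ \text{MTTF} = \lim_{t \to \infty} \int_0^{t} R(\tau)\,d\tau = \lim_{s \to 0} s\,L\left\{ \int_0^{t} R(\tau)\,d\tau \right\} = \lim_{s \to 0} s\,\frac{R^*(s)}{s} \tag{9} \]
that is, $\text{MTTF} = \lim_{s \to 0} R^*(s)$.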
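Applying the Laplace transform to (5) and (6) with $z(t) = \lambda$, $P_{s_0}(0) = 1$, and $P_{s_1}(0) = 0$, the MTTF for a single non-repairable system can be found
as follows:
\[ \frac{dP_{s_0}(t)}{dt} = -\lambda P_{s_0}(t) \;\Rightarrow\; sP_{s_0}(s) - 1 = -\lambda P_{s_0}(s) \;\Rightarrow\; P_{s_0}(s) = \frac{1}{s + \lambda} \tag{11} \]
\[ \frac{dP_{s_1}(t)}{dt} = \lambda P_{s_0}(t) \;\Rightarrow\; sP_{s_1}(s) = \lambda P_{s_0}(s) \;\Rightarrow\; P_{s_1}(s) = \frac{\lambda}{s(s + \lambda)} \tag{12} \]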
A one-component system is reliable when there is no failure (i.e., the system is in state
$s_0$). Hence:
\[ R^*(s) = P_{s_0}(s) = \frac{1}{s + \lambda} \;\Rightarrow\; \text{MTTF} = \lim_{s \to 0} R^*(s) = \frac{1}{\lambda} \tag{13} \]
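As a numerical sanity check of this result, the sketch below integrates $R(t) = e^{-\lambda t}$ with the trapezoidal rule and compares it with $1/\lambda$. The function name `mttf_by_integration`, the finite horizon, and the value of `lam` are assumptions chosen for illustration only.

```python
import math


def mttf_by_integration(reliability, horizon, steps=200_000):
    """Trapezoidal approximation of the integral of R(t) over [0, horizon]."""
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h


lam = 0.2  # assumed constant failure hazard
numeric = mttf_by_integration(lambda t: math.exp(-lam * t), horizon=200.0)
print(abs(numeric - 1 / lam) < 1e-4)  # True
```

The horizon is finite, so the tail of the integral is truncated; for the values used here that tail is negligible.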
At this point, we design an algorithm for the calculation of MTTF for systems with no repair.
In order to illustrate the algorithms explained in this article, we use a 3-component
system as an example. The Markov graph of such a system with no repair is shown in Fig. 2.
Laplace transformed probabilities of some sample nodes in Fig. 2 are developed here:
\[ P_{s_0}(s) = \frac{1}{s + \lambda_{01} + \lambda_{02} + \lambda_{04}} \tag{14} \]
\[ P_{s_1}(s) = \frac{\lambda_{01} P_{s_0}(s)}{s + \lambda_{13} + \lambda_{15}} \tag{15} \]
Figure 2 Markov graph of a 3-component system with no repair
The Markov graph in Fig. 2 can be viewed as a unidirectional graph where the various
permutations of a fixed number of failed elements out of the whole number of elements
stand on one specific layer. Fig. 3 illustrates such a graph, which we call a transition graph.
To measure MTTF for a $k$-out-of-$n$ system, the transformed probabilities of the nodes on
layers $0$ to $n - k$ (in order) should be calculated and summed up while the variable $s$
approaches zero. The pseudo-code in Tab. 2 explains this procedure.
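Table 2 MTTF measurement of a system with no repair
P_s0 ← 1 / (sum of failure rates of outgoing branches of s0)
MTTF ← P_s0
if k = n //series configuration
    output MTTF
else
    for layer L ← 1 to n − k do
        for every state si on layer L do
            P_si(L) ← Σ_j (failure rate of incoming branch from sj × P_sj(L−1)) / (sum of failure rates of outgoing branches of si)
            MTTF ← MTTF + P_si(L)
    output MTTF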
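The layer-by-layer procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each component `c` has one constant failure rate `rates[c]` regardless of the current state, encodes states as bitmasks of failed components, and uses the hypothetical name `mttf_no_repair`.

```python
def mttf_no_repair(rates, k):
    """MTTF of a non-repairable k-out-of-n system with constant failure rates.

    Assumption: rates[c] is the failure rate of component c in every state.
    States are bitmasks of failed components, so layer L holds the states
    with exactly L bits set.
    """
    n = len(rates)
    p = {0: 1.0 / sum(rates)}  # Eq. (14) with s -> 0
    mttf = p[0]
    for layer in range(1, n - k + 1):
        for state in range(2 ** n):
            if bin(state).count("1") != layer:
                continue
            failed = [c for c in range(n) if state & (1 << c)]
            # Each predecessor on the previous layer lacks exactly one failure.
            inflow = sum(rates[c] * p[state ^ (1 << c)] for c in failed)
            outflow = sum(rates[c] for c in range(n) if c not in failed)
            p[state] = inflow / outflow  # Eq. (15) with s -> 0
            mttf += p[state]
    return mttf


print(round(mttf_no_repair([1.0, 1.0, 1.0], k=1), 4))  # 1.8333
```

For three identical components with unit failure rate, the 1-out-of-3 result is $1/3 + 1/2 + 1 = 11/6$, and the series case ($k = n$) reduces to the probability of layer 0 alone.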
When the components are repairable, the transformed probabilities of the nodes cannot be
found directly. Instead, solving for the probabilities results in a system of linear
equations. Viewing this linear system as a matrix equation, $Ax = b$, the matrix $A$ and the
column vectors $x$ and $b$ need to be obtained from the transition graph of the system. For
example, $A$, $x$, and $b$ for the 1-out-of-3 system of Fig. 4 are shown in Tab. 3.
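Table 3 Matrices required for a system of linear equations of a 1-out-of-3 system shown
in Fig. 4
\[
A = \begin{array}{c|ccccccc}
 & P_{s_0}(s) & P_{s_1}(s) & P_{s_2}(s) & P_{s_4}(s) & P_{s_3}(s) & P_{s_5}(s) & P_{s_6}(s) \\ \hline
P_{s_0}(s) & (\lambda_{01}+\lambda_{02}+\lambda_{04}) & -\mu_{10} & -\mu_{20} & -\mu_{40} & 0 & 0 & 0 \\
P_{s_1}(s) & -\lambda_{01} & (\lambda_{13}+\lambda_{15}+\mu_{10}) & 0 & 0 & -\mu_{31} & -\mu_{51} & 0 \\
P_{s_2}(s) & -\lambda_{02} & 0 & (\lambda_{23}+\lambda_{26}+\mu_{20}) & 0 & -\mu_{32} & 0 & -\mu_{62} \\
P_{s_4}(s) & -\lambda_{04} & 0 & 0 & (\lambda_{45}+\lambda_{46}+\mu_{40}) & 0 & -\mu_{54} & -\mu_{64} \\
P_{s_3}(s) & 0 & -\lambda_{13} & -\lambda_{23} & 0 & (\lambda_{37}+\mu_{31}+\mu_{32}) & 0 & 0 \\
P_{s_5}(s) & 0 & -\lambda_{15} & 0 & -\lambda_{45} & 0 & (\lambda_{57}+\mu_{51}+\mu_{54}) & 0 \\
P_{s_6}(s) & 0 & 0 & -\lambda_{26} & -\lambda_{46} & 0 & 0 & (\lambda_{67}+\mu_{62}+\mu_{64})
\end{array}
\]
\[
x = \begin{bmatrix} P_{s_0}(s) \\ P_{s_1}(s) \\ P_{s_2}(s) \\ P_{s_4}(s) \\ P_{s_3}(s) \\ P_{s_5}(s) \\ P_{s_6}(s) \end{bmatrix}
\qquad
b = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]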
The pseudo-code in Tab. 4 explains the procedure to measure MTTF for a $k$-out-of-$n$
system with repair.
Figure 4 Transition graph of a 3-component repairable system for MTTF measurement
if k 1
    Remove all the branches from layer n − k + 1 to layer n − k
endif
MTTF ← Σ_i x[i, 1]
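The linear-system procedure for repairable systems can be sketched as below. It is an illustration under assumptions that go beyond the paper: every component `c` has constant, state-independent rates `lam[c]` and `mu[c]`, each bad component is repaired independently, states are bitmasks of failed components, and `solve` and `mttf_with_repair` are hypothetical helper names. Failed states (more than $n - k$ failures) are left out of the system, which makes them absorbing.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mttf_with_repair(lam, mu, k):
    """MTTF of a repairable k-out-of-n system, with s already set to zero."""
    n = len(lam)
    up = [s for s in range(2 ** n) if bin(s).count("1") <= n - k]
    index = {s: i for i, s in enumerate(up)}
    a = [[0.0] * len(up) for _ in up]
    for s in up:
        i = index[s]
        for c in range(n):
            bit = 1 << c
            if s & bit:  # component c is bad: repair branch s -> s ^ bit
                a[i][i] += mu[c]
                a[index[s ^ bit]][i] -= mu[c]
            else:        # component c is good: failure branch s -> s | bit
                a[i][i] += lam[c]
                if (s | bit) in index:  # branches into failed states only leave
                    a[index[s | bit]][i] -= lam[c]
    b = [1.0 if s == 0 else 0.0 for s in up]  # system starts in s0
    return sum(solve(a, b))


print(round(mttf_with_repair([1.0, 1.0], [1.0, 1.0], k=1), 6))  # 2.0
```

For two identical components in parallel the sketch reproduces the closed form $(3\lambda + \mu)/(2\lambda^2)$, and with all repair rates set to zero it falls back to the non-repairable result.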
An important difference between reliability $R(t)$ and availability $A(t)$ is their steady state
behavior; as time approaches infinity, $R(t)$ approaches zero while $A(t)$ approaches some
fixed value. Fig. 5 shows this behavior for a system which is initially good.
In this paper, we develop an easy method to calculate SSA without the need to calculate
the inverse Laplace transform for the nodes of the Markov graph. $A(t)$ is the probability that
an item is up at any point in time, i.e.:
\[ A(t) = \frac{\text{Up time}(t)}{\text{Up time}(t) + \text{Down time}(t)} \tag{17} \]
Although it seems that integrating $A(t)$ from zero to infinity gives infinity,
the fraction in (17) cancels the common factors which make $A(t)$ approach infinity. $A(t)$
can be thought of as a power signal, while $R(t)$ is an energy signal. Hence, the
relationship between SSA and $A(t)$ can be written as:
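\[ \text{SSA} = \int_0^{\infty} A(t)\,dt = \lim_{t \to \infty} \int_0^{t} A(\tau)\,d\tau \tag{18} \]
Applying theorem 2 of Tab. 1 gives:
\[ L\left\{ \int_0^{t} A(\tau)\,d\tau \right\} = \frac{A^*(s)}{s} \tag{19} \]
And applying theorem 3 of Tab. 1 when $f(t) = \int_0^{t} A(\tau)\,d\tau$ will give:
\[ \text{SSA} = \lim_{t \to \infty} \int_0^{t} A(\tau)\,d\tau = \lim_{s \to 0} s\,L\left\{ \int_0^{t} A(\tau)\,d\tau \right\} = \lim_{s \to 0} s\,\frac{A^*(s)}{s} \tag{20} \]
Hence:
\[ \text{SSA} = \lim_{s \to 0} A^*(s) = \lim_{s \to 0} \frac{\text{Up time}(s)}{\text{Up time}(s) + \text{Down time}(s)} \tag{21} \]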
In the procedure of calculating SSA, solving for probabilities will result in a system of
linear equations, similar to the calculation of MTTF for repairable systems. However, in
this case, all the repair hazard branches of the system should be considered, and the
matrix equation should contain all the states. Hence, the processing of the linear system for
SSA calculation is independent of the value $k$ for a $k$-out-of-$n$ system. For example, $A$,
$x$, and $b$ for a 3-component system (whose graph is shown in Fig. 6) are shown in Tab. 5.
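\[
A = \begin{array}{c|ccc}
 & \cdots & P_{s_6}(s) & P_{s_7}(s) \\ \hline
P_{s_0}(s) & \cdots & 0 & 0 \\
P_{s_1}(s) & \cdots & 0 & 0 \\
P_{s_2}(s) & \cdots & -\mu_{62} & 0 \\
P_{s_4}(s) & \cdots & -\mu_{64} & 0 \\
P_{s_3}(s) & \cdots & 0 & -\mu_{73} \\
P_{s_5}(s) & \cdots & 0 & -\mu_{75} \\
P_{s_6}(s) & \cdots & (\mu_{62}+\mu_{64}+\lambda_{67}+s) & -\mu_{76} \\
P_{s_7}(s) & \cdots & -\lambda_{67} & (\mu_{73}+\mu_{75}+\mu_{76}+s)
\end{array}
\]
\[
x = \begin{bmatrix} P_{s_0}(s) \\ P_{s_1}(s) \\ P_{s_2}(s) \\ P_{s_4}(s) \\ P_{s_3}(s) \\ P_{s_5}(s) \\ P_{s_6}(s) \\ P_{s_7}(s) \end{bmatrix}
\qquad
b = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]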
As seen in matrix $A$ in Tab. 5, the $s$ variables are not replaced by zero; otherwise the
probabilities of the nodes would equal infinity (which is expected, since SSA is a fixed
nonzero value extending to infinity). If the fraction formula in (21) is used, the common
product factors (which include $s$ and make the individual probabilities equal infinity) cancel
between the numerator and denominator of the fraction. Hence, in the calculation of the
probabilities of the nodes in the system of linear equations, the product factors need not
be expanded, since they will later be substituted into Eq. (21) and cancelled.
The pseudo-code in Tab. 6 explains the procedure to measure SSA for a $k$-out-of-$n$
system.
Figure 6 Transition graph of a 3-component system for SSA measurement
SSA ← lim_{s→0} [Up time(s) / (Up time(s) + Down time(s))]  //common product factors should be
                                                              //cancelled before setting s to 0
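The SSA procedure can be approximated numerically as sketched below. Note the substitution: instead of cancelling the common factors symbolically, this sketch evaluates the ratio in Eq. (21) exactly (with rational arithmetic) at a small positive `s`, which only approximates the limit. It also assumes constant per-component rates `lam[c]` and `mu[c]`, independent repair of every bad component, bitmask-encoded states, and the hypothetical names `solve` and `ssa`.

```python
from fractions import Fraction


def solve(a, b):
    """Gauss-Jordan elimination; exact when the entries are Fractions."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ssa(lam, mu, k, s=Fraction(1, 10**9)):
    """Approximate SSA of a k-out-of-n system by evaluating Eq. (21) at small s."""
    n = len(lam)
    size = 2 ** n
    a = [[Fraction(0)] * size for _ in range(size)]
    for state in range(size):
        a[state][state] += s
        for c in range(n):
            bit = 1 << c
            # Bad components are repaired, good components fail.
            rate = Fraction(mu[c] if state & bit else lam[c])
            a[state][state] += rate        # branch leaving `state`
            a[state ^ bit][state] -= rate  # same branch entering its target
    b = [Fraction(1)] + [Fraction(0)] * (size - 1)  # system starts in s0
    p = solve(a, b)
    up = sum(p[st] for st in range(size) if bin(st).count("1") <= n - k)
    down = sum(p) - up
    return up / (up + down)


print(round(float(ssa([1], [9], k=1)), 6))  # 0.9
```

All $2^n$ states and all repair branches are kept, so the linear system itself does not depend on $k$; only the split into up and down states does.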
The general form of a Markov graph contains $2^n$ states for an $n$-component system.
However, it is possible to reduce the number of nodes in a Markov graph and,
consequently, the complexity of the graph. Those nodes on one specific layer of
the graph which have equal hazard rates on their outgoing branches can be collapsed
into one node. The hazard rates of the incoming branches to these collapsed nodes need
not be equal.
In an ordinary parallel system, all the components, including the primary and the backups,
start operating at time zero and all can fail. If there are $m$ independent components in the
system, each with failure rate $\lambda$, the failure rates of the branches in going from
one layer to the next, starting from layer zero, will be $m\lambda$, $(m-1)\lambda$, $(m-2)\lambda$, …. This is
simply the sum of the failure rates of the outgoing branches of one node on every
layer (it could be any node on the layer) in this system’s general Markov model
equivalent. However, if the repair rate in the general model is the same for every
component and equal to $\mu$, it does not follow the same rule as the failure rate.
The repair rate depends on the number of repairmen present and whether they work
cooperatively. If $\mu$ is the repair rate for one repairman, $k\mu$ is the repair rate for $k$
repairmen. Fig. 7 depicts the collapsed transition graph for a 3-component system. The
algorithms explained in section 2 apply to these collapsed graphs as well.
Figure 7 Collapsed transition graph of a 3-component system for MTTF and SSA
measurements; the failure rate of every component is $\lambda$ and $k$ repairmen are present
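For the non-repairable case, the collapsed graph is a single chain, so the MTTF algorithm reduces to a short sum. The sketch below assumes identical components with failure rate `lam` and no repair; the name `collapsed_mttf` is hypothetical. Layer `i` has one outgoing branch of rate `(m - i) * lam`, so its transformed probability at $s = 0$ is `1 / ((m - i) * lam)`.

```python
from fractions import Fraction


def collapsed_mttf(m, k, lam):
    """MTTF of a non-repairable k-out-of-m parallel system, collapsed graph.

    Sums the transformed probabilities (at s = 0) of layers 0 to m - k.
    """
    return sum(Fraction(1) / ((m - i) * lam) for i in range(m - k + 1))


print(collapsed_mttf(3, 1, Fraction(1)))  # 11/6
```

The value for three components agrees with the result obtained from the full eight-state graph with identical rates, which is what collapsing the equal-rate nodes of a layer is meant to preserve.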
The weak point of an ordinary parallel system is that the backup components can fail
even before they are put on-line. This happens because all the components in such a
system are energized at time zero. An improvement is to energize the primary component
only. Thus, as soon as the primary component fails, one backup is energized and put
on-line. Contrary to parallel systems, in which every component can fail, in standby
systems only one component (the active one) can fail at a time (of course, this is true only
if the components do not fail when they are not energized). Fig. 8 shows the collapsed
transition graph for a 3-component standby system whose components each have the failure
rate $\lambda$. As can be seen, the transition probability is $\lambda \Delta t$ for every branch. For
standby systems, it is not possible to illustrate the $k$-out-of-$n$ system for various values of $k$ in one
graph, and the algorithms explained in section 2 do not directly apply to the collapsed
graphs of these systems. However, the fact that the exposure time for standby elements does
not start until the on-line component has failed simplifies the measurement of the features
of standby systems. For example, for $n$ identical non-repairable standby elements with
failure rate $\lambda$ each, the system succeeds if there are $n - 1$ or fewer failures. The
reliability of this system can simply be calculated by the Poisson distribution [2]:
\[ R(t) = e^{-\lambda t} \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} \tag{22} \]
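Eq. (22) translates directly into a few lines of Python. The function name `standby_reliability` and the sample values are illustrative assumptions; the formula itself is the Poisson sum above.

```python
import math


def standby_reliability(n, lam, t):
    """Eq. (22): probability of at most n - 1 failures in time t."""
    x = lam * t
    return math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(n))


# With a single element the sum has one term and R(t) reduces to exp(-lam * t).
print(standby_reliability(1, 0.5, 2.0) == math.exp(-1.0))  # True
```

Adding standby elements adds positive terms to the sum, so for fixed $\lambda$ and $t > 0$ the reliability increases with $n$.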
ACKNOWLEDGEMENT
The work presented in this paper has been supported under research project SPECTRUM,
No. TA01011383 by Technology Agency of the Czech Republic.
REFERENCES