1030 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 9, SEPTEMBER 1987

Processor Tradeoffs in Distributed Real-Time Systems
C. M. KRISHNA, MEMBER, IEEE, KANG G. SHIN, SENIOR MEMBER, IEEE, AND INDERPAL S. BHANDARI

Abstract: Optimizing the design of real-time distributed systems is important since the systems are frequently critical to life. This optimization is a difficult problem, and heuristics and designer judgment are called for in the process. The chief cause of the difficulty is the large number of parameters under the designer's control which impact performance and life-cycle cost. We study the interplay between the more important parameters in this paper using two objective measures, i.e., the mean cost and the probability of dynamic failure in [6], [10]. Among these are the processor burn-in time and processor replacement policy. A central feature of this work is a look at how the application requirements affect the optimality of the distributed systems; indeed, the application requirements are an integral part of the analysis.

Index Terms: Burn-in time, hazard rate, mean cost, probability of dynamic failure, processor tradeoffs, real-time multiprocessor and control, replacement policy.

I. INTRODUCTION

THE objective of all computer design is to optimize a performance criterion subject to constraints imposed on the life-cycle cost, the application, and the characteristics of the available components. Unfortunately, 1) it usually results in a mixed integer-and-noninteger nonlinear programming problem for which no efficient solutions exist, and 2) in most instances it is impossible to specify a single performance metric that captures fully the various facets of "performance." So, designers typically have to proceed heuristically, using sensitivity analyses or branch-and-bound techniques and rule-of-thumb performance criteria. In any case, approximate (but robust) performance figures are all that is usually required for general-purpose systems. Things are very different with distributed computers for real-time control applications such as nuclear reactors and aircraft. (We term these real-time computers.) Such machines are becoming increasingly important in practice.

Real-time computers work in the feedback loop of the system they control, i.e., the controlled system. They derive their input from sensors and the operator(s), and their output is used to control actuators or to update displays. Because the control function is well-defined, so is the set of tasks that the computer has to run. These tasks may be run cyclically in an indefinite loop, or may be triggered by some event in the controlled system or the environment. A task trigger can thus be produced by the operator, the environment, the controlled system, or any combination of these.

The differences between computers for such applications and general-purpose machines include the following.

* Reliability specifications are stringent and so an approximation to the probability of failure is acceptable only if it can be shown to be an upper bound.
* All executed tasks belong to predefined task classes, and the interarrival time distribution and service requirement of these classes are known to a fair accuracy.
* There are now performance measures available for real-time computers which are specific to the application at hand, and which express performance in application-specific terms [6], [10].

The real-time computer is a "black box" from the point of view of the controlled system. It is exemplified entirely by the speed and precision of its output. Thus, the computer response time can be used as a parameter with which to characterize computer performance.¹ If the response time is greater than some given value, catastrophic system failure can occur. A task is called a hard real-time task (or is said to be critical) if there is a finite hard deadline associated with it. The hard deadline is the maximum controller (computer) think time, which depends on the dynamics of the controlled system and its environment.² Since the operating environment is stochastic, the hard deadline is a random variable. (See [10] for a detailed account of this.) We denote the hard deadline of a task in class $i$ by $t_{di}$.

When system failure occurs due to a deadline being missed, dynamic failure is said to have happened. The measure of reliability is the probability of dynamic failure, denoted by $P_{dyn}$.

In [6] we define the "cost" associated with a computer response time of $t$ to the control task $i$ to be some function $f_i(t)$³ if the response time is less than the hard deadline, and infinity otherwise. Naturally, we assume that the functions $f_i(\cdot)$ are nondecreasing and continuous. All this is formalized by defining a cost function

\[
C_i(t) =
\begin{cases}
f_i(t) & \text{if } 0 \le t \le t_{di} \\
\infty & \text{otherwise.}
\end{cases}
\]

Manuscript received August 25, 1985; revised December 17, 1986. This work was supported in part by NASA under the Grant 1-296 and the ONR under Contract N0014-85-K-0122. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the view of the funding agencies.
C. M. Krishna is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003.
K. G. Shin is with the Real Time Computing Laboratory (RTCL), Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109.
I. S. Bhandari is with the Department of Electrical and Computer Engineering, Carnegie-Mellon University, Pittsburgh, PA 15213.
IEEE Log Number 8715299.

¹ Response time is infinite by definition if an incorrect value is put out by the real-time computer.
² To avoid needless duplication, we allow noncritical tasks to have $t_{di}$ defined for them, only these are infinite.
³ We showed in [10] how to derive this function.
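The cost function defined in the Introduction can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `control_cost` and `quadratic_cost` are hypothetical, and the quadratic form of the finite cost $f_i$ is chosen only because it is nondecreasing and continuous on $t \ge 0$, as the text requires.

```python
import math


def control_cost(t, hard_deadline, finite_cost):
    """Cost of a response time t for one task class.

    Returns finite_cost(t) when 0 <= t <= hard_deadline and infinity
    otherwise, mirroring the piecewise cost function in the text.
    """
    if t < 0:
        raise ValueError("response time must be non-negative")
    if t <= hard_deadline:
        return finite_cost(t)
    return math.inf


def quadratic_cost(t):
    """Hypothetical finite cost f_i(t); nondecreasing and continuous for t >= 0."""
    return 0.5 * t * t


if __name__ == "__main__":
    # A response inside the deadline incurs the finite cost.
    print(control_cost(1.0, 2.0, quadratic_cost))
    # A response past the deadline is a dynamic failure: infinite cost.
    print(control_cost(3.0, 2.0, quadratic_cost))
```

A noncritical task can be represented in this sketch by passing `math.inf` as the hard deadline, which matches the convention of footnote 2.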

0018-9340/87/0900-1030$01.00 © 1987 IEEE


KRISHNA et al.: PROCESSOR TRADEOFFS IN REAL-TIME SYSTEMS 1031

The cost function expresses the fact that since a positive computer delay aggravates latent system instability, the magnitude of the control called for will rise with the computer response delay. The cost function is an expression of cost in physical terms such as energy, fuel, etc., and is the extra cost of control, in these terms, that is incurred because of the nonzero nature of the computer response delay. Costs incurred add up; if we denote by $r_{ij}$ the cost incurred in relation to version $V_{ij}$ of task $i$, the total cost of control is $\sum_i \sum_j r_{ij}$. The mean cost (MC) is the mean total cost of control incurred over a mission lifetime, given that dynamic failure has not occurred. See the Appendix for a list of terms associated with real-time control systems. In [6], we discuss the cost function in detail, and in [10], we provide a detailed case study of its determination and use aboard aircraft when the task is to control the deflection of the elevator during landing. So far as this present discussion is concerned, there are only two quantities of interest: the probability of dynamic failure and the mean cost. In other words, we shall analyze certain processor tradeoffs using these two objective measures.

We can now restate our optimization problem to make it specific to real-time computers: find a design which minimizes the mean cost for a particular application over a total working lifespan of a given number of missions, subject to the constraints of a) life-cycle cost, b) the interarrival and service time distributions for the various classes of tasks to be executed, c) the characteristics of the available components, and d) the maximum acceptable probability of dynamic failure.

While this problem is now more definite and formal, it is just as impossible to solve exactly and efficiently as the one with which we began. So we must study tradeoffs between various aspects of the problem, and use human intuition and judgment to arrive at a quasi-optimal design. In this paper, we study several important tradeoffs which occur in one potentially important type of distributed real-time computer.

We study the impact of the application requirements (expressed through the task loading, the finite cost functions, and hard deadline distributions for the various classes of tasks) on computer performance (expressed through $P_{dyn}$ and MC). And we consider the impact of the processor replacement policy and processor burn-in times (i.e., the time for which a processor is made to execute in a test setting) on computer performance. Finally, we calculate the number of processors that must be replaced under any replacement policy and burn-in time, thus computing an important factor in the life-cycle cost.

In order to be realistic, exponentially distributed service times are not assumed in our analysis, and an exact analytical solution to the models we study is as yet unknown. Analytical methods will be used to obtain upper and lower bounds to the reliability: to do this using simulation is impractical since $P_{dyn}$ for critical real-time applications must be of the order of $10^{-5}$ or (usually) much less.

This paper is organized as follows. In Section II, we describe the architecture studied, and our modeling assumptions. Section III contains analyses such as of the impact of processor replacement strategies and the burn-in time on the probability of dynamic failure and mean cost. Section IV has some detailed numerical examples. The solution is numerically obtained, tabulated, and interpreted in the context of real-time application. We conclude with Section V. The Appendix contains a list of definitions.

II. SYSTEM MODELS AND ASSUMPTIONS

A. System Model

We consider a potentially very popular class of computer architectures, which is similar to the continuously reconfigurable multimicroprocessor flight control system (CM2FCS) structure proposed by the United States Air Force [8]. (See Fig. 1 for an architectural block diagram.)

This class of computers shares the following characteristics.

Processors work together in triads, and are synchronized. Faults are masked by voting when a failure is located, and the injured triad is either purged of its faulty element and brought up to strength with the introduction of a spare (if a spare exists), or the triad is disbanded and its working processors classified as spares.

The processors each have their own private memory. In this class of architectures, each private memory contains a copy of the operating system and all the applications software. In certain applications, where there are only a small number of programs to be executed, this is possible. This can be justified by the fact that on-line loading of both program code and data requires too much time to meet real-time constraints, and the continuous increase in capacity and drop in cost of memory make such an arrangement economically feasible.

Tasks enter the system and must be scheduled. The job of the scheduler is taken up by one of the triads.

When a task arrives, it is queued at the scheduler until a processor triad becomes free. It is then allocated, on a first-come-first-served basis, to that triad. When a free triad in the system receives a task, it can begin executing it immediately.

Fig. 1. Schematic of system architecture. [Block diagram: three processors, each with its own memory, attached to buses with bus terminator circuits.]

B. Assumptions

We make the following assumptions:

A1. All processors are identical.
A2. Tasks belong to well-defined task classes. There is no restriction on the distribution of the task service time, except that service times are stochastically independent.
A3. When a task is completed, it is voted on and the results are used to control an actuator or update a display.
A4. There is no intertask communication during task execution. That is, tasks communicate with each other at the beginning for input and at the end of execution for output, but not during execution.
A5. Tasks arrive according to a Poisson process.
A6. The computer system operates in missions. A mission is a continuous interval of time during which the computer performs its function. We treat the case where no repair is possible during a mission; repairs must be conducted between missions.

All these assumptions are realistic for computers in charge of real-time control. By A4, the influence of the interconnection network on system reliability can easily be computed using combinatorial arguments and the result amalgamated with a study of the influence of processor reliability, and so we confine ourselves here to studying processor-related performance issues.

In this work, we make none of the assumptions of exponentially distributed service and failure times which, while permitting an easy analysis, would limit the applicability of the work. If either of these parameters is exponentially distributed, then the analysis is greatly simplified, as we shall show.

III. TRADEOFF ANALYSIS

In this section, we derive formulas for the sensitivity of computer performance to the hard deadlines, the cost functions, the amount of component burn-in (if any) used, and the processor replacement policy.

It is not difficult to qualitatively describe the nature of these sensitivities. If the hard deadlines are smaller, meaning that there is less time allowed for the computer to complete execution, an increased premium is placed on processor speed, and the balance between failure caused by massive hardware failure (static failure) and that due to missing a deadline for other causes (nonstatic failure) shifts toward making the latter the dominant component. Similarly, if the finite cost functions are great, the importance of processor speed increases.

Component burn-in is important because in many cases the hazard rate follows a bathtub curve: a high rate at first, then dropping to a quasi-steady-state value, and finally rising again. The first and last segments reflect, respectively, the latent faults present upon processor manufacture and the effects of aging. The sensitivity of the mean cost and the probability of dynamic failure to burn-in is determined by the slope of the bathtub curve, and the extent to which burn-in should be applied depends upon this slope as well as on the contribution of burn-in to the life-cycle cost. In cases where the hazard rate is constant, denoting an exponential failure law, the optimal burn-in period is clearly zero.

The processor replacement policy followed is also a function of the failure process. If the processors follow a more-or-less exponential failure law, the optimal policy is to replace a processor only after it has failed. If the failure law is highly nonexponential, it is useful to replace processors after they have reached a certain age. We will denote by $P_t$ the policy under which, at the end of a mission, processors older than $t$ and those which have failed are replaced.

Our analysis will proceed as follows. Both the probability of dynamic failure and the mean cost accrued over a mission are calculated as functions of the response time distribution for the various tasks of the system, the distribution of the sojourn time of the system in its various states (i.e., number of functional triads), together with the cost functions and hard deadlines for the given set of task classes. The response time distribution in turn depends on the task interarrival and service time distributions. The latter distributions are given as part of the application description. The sojourn time distribution is calculated as a function of the processor replacement policy, the processor failure law, and the burn-in period.

In Section III-A, we obtain formulas for $P_{dyn}$ and MC as a function of the state transition epoch and task waiting time distributions in any given mission. In Section III-B, an approximation to the waiting time distribution is presented. In Section III-C, we derive an expression for the state transition epoch distribution as a function of the distribution of the age of the processors at the start of any given mission. In Section III-D this age distribution is derived as a function of the burn-in time, $t_{burn}$, and the processor replacement policy.

A. Calculating $P_{dyn}$ and MC

Both $P_{dyn}$ and MC are calculated with respect to the mission number, say $l$. All times are relative to the start of the mission. Let $\pi_{fail}(t, \eta_{n_1}, \eta_{n_1 - 1}, \ldots, \eta_1)$ be the probability of dynamic failure by time $t$ if the system leaves state $i$ at time $\eta_i$ during mission $l$. Define $\eta_0 = m$, where $m$ is the mission lifetime. Define a state transition function

\[
\phi(t, \eta_{n_1}, \eta_{n_1 - 1}, \ldots, \eta_1) =
\begin{cases}
n_1 & \text{if } t < \eta_{n_1} \\
i & \text{if } \eta_{i+1} \le t < \eta_i \\
0 & \text{if } t \ge \eta_1,
\end{cases}
\]

and let $W(\cdot \mid r)$ be the steady-state waiting time distribution if the system is in a state of $r$ triads. Let $F_{Wd}(\cdot)$ be the probability distribution function of the waiting-time deadline, which is the maximum waiting time permitted if the task is to finish service by its hard deadline. If $w_i$, $x_i$, and $t_{di}$ represent the waiting time, service time, and deadline of some task in class $i$, clearly, $w_i \le t_{di} - x_i$ if that task is not to miss its deadline. From this, we easily derive the following result.

\[
F_{Wd}(w) = \sum_{i=1}^{r} \Pr\{ w \ge t_{di} - x_i \mid \text{task is of class } i \} \, \Pr\{ \text{task is of class } i \}
= \sum_{i=1}^{r} \frac{\lambda_i}{\lambda} \int_0^{\infty} F_{di}(x + w) \, dB_i(x),
\]

Fig. 2. Example mission profile. [Time line running from mission commencement to mission end.]

Here, $B_i(\cdot)$ and $F_{di}(\cdot)$ are the service time and hard deadline distributions, respectively, for class $i$.
The waiting-time deadline distribution is as simple as it is because the arrival process is Poisson. $F_{Wd}(\cdot)$ is independent of the current system state because the service time and hard deadline are independent of the current state. Let
Pfail(t nn I i by 7^ji = max {0, j7i+ I, , - t}, where #i
00

)= {i - W(w(2-l)} dFWd(w), replacing is


0
distributed according to W(- Ii), lfail(m, 71nl- 1 * ,7^1) is '7nl
where we suppress the arguments of i for notational conven- an upper bound to the probability of failure when the system
ience. Pfail dt is the probability in [t, t + dt) of missing a hard leaves state i at m.
deadline if the system leaves state i after qi time units into the The joint density function of the ih, and density function, g,,
mission. In the event that W(wI - 1) does not exist (which of the transformed mission must now be computed.
will occur when the system is oversaturated at state 0 - 1) An examination of the formula defining 7 shows it depends
set W(wj'/i - 1) = 0 for all w. (The "- 1" arises because on i + , and 17;. We can, therefore, define function h such that
one triad acts only as a scheduler.) A i , )
Clearly, 9107np 77nl -I t11ni 7nl -I 9
'7l)
Tfajl(ts 1nlw 1nl 1 X all) )7nl-1)~~~~~~~~~~~11nl
- hnl(tlnllt1nl) 7 hnl-1(?1nl-llt7nls lni9 1(nl7nl
'nl

{ fail ( , 71 n I Xt
-q I Pfail ( t, 31 n ls,(iX|?2s7
So, the probability of failure during mission / is given From the definition of q it follows that
(approximately) by

Im=0o rn l =0o~m)nl = 1ni ...(^n =7n)


v0 m
772 -q I t q enl qn lnl if fini
lrfail(m, lr1ni 'nl - 1, 771))g,(1nl,, 7nl
9 - 1 X 71i)
where w(- i) = W'(| i). We assume W is differentiable. If
dr/1 c1y2 d... d, M(m) dM(I) not, minor changes are required to be made: the integral will
become a summation, and w will be the probability mass
where g,(X01,s 71nl-l '@Bs t1) represents the joint density function. The same remark applies to the formulas below. If
function of time i, i - 1, * , n1 and M(*) is the distribution > 0, then for i < n
'1
of mission lifetimes. The approximations inherent in this
derivation are a) the steady-state values of waiting-time (o if i<
distribution are used, and b) it is assumed that the number of
operational system triads (i.e., the- system state) does not hi(-q^i 77+ I, 11) | _) w(Q i) dt frj=ri
change between the time a task enters the scheduler queue, and
when it achieves service.
In the event that the assumption in b) above is not
acceptable, we can obtain an upper bound to the failure and
probability as follows. Fig. 2 illustrates an example mission
profile. The system starts with four triads, and is down to two A ('± ° if = )d
triads by the time the mission ends. The system is in state 4 h| w( i if 1 =.
I+
until -q4, in state 3 in [74, '03), and in state 2 beyond that. n-
Let hi4 denote arrival time of the first task which arrived
when the system state was four triads, but which arrived at the We can now uncondition 9's g on the and condition on m, the
server to find the system state changed. Similarly for cU3. (We lifetime of the Ith mission. Denote this function by
are assuming that there are such tasks; if there are not, the (Recall that all our calculations are w.r.t. the lth mission.)
Iin).
result will still be an upper bound of the failure probability.) -
Let X4() denote the distribution of
(4 -
A
74 -in
. Clearly, t4 I i,
must be less than or equal to the waiting time of the given task, m m rm
or it will not arrive at the server to find the system state =| i
changed. So, x4(t)-. W(t14), vt > 0, i.e., (4 is stochasti- ____° Nnl-l:ni 1l
cally less than or equal to a random variable distributed *g(nSnll 1lRl n-, ,71
according to W(~ 14) [9]. Similarly, (3 iS stochastically less n- ,
than or equal to a random variable distributed according to *g/(X1n, ?lni-~ ., 71i) *dv7 d'12 *... l
W(S13).
Therefore, if we generalized and transform the mission, where g1(*) is derived in Section Ill-C below. The probability
of dynamic failure over mission 1 is then upper bounded by


fm ;m m (nl+n
Jm0Jini= iXn -lI fil 1=7?2
nl X0 cb

7rf3nl(m,
-n''J
\-1 2)Pp

*g1(f1n ni-1, 1II|m) \ 0

dijl, dch2 ... dchnj dM(m), nl+n2- \

where, as before, M(*) is the distribution function for the


length of a mission. FAMLTR
Deriving the mean cost accrued over mission / is much
easier. Since the probability of dynamic failure over a mission (i + I)p i5
must be kept small, (typically much less than 10-5), we can
simply add up the contribution of each task without accounting x
for the possibility that the system may have failed prior to its
arrival.
The contribution to the mean cost of an arrival in [,j, jp/
is upper bounded by

C(i - )= - l 0#()f (t)f I- Fdj(t) I dt 4 A

where Oi(t) _ | w(Oa)dB#(t - a) and we may recall that Fig. 3. Markov model for special case.
fi(t) is the (finite) cost function for task class i and Bi(*) is the
service time distribution.
Therefore, the contribution to the mean cost of an arrival at ( i )
time t into mission 1, I'(t) = ,j=_ C(j)u1(j, tlm), where Ti (t)=- {tiLp+ | ( , dFwd( )$ 7ri(t)
ul(j, ti m) is the conditional probability that t E [j, ru ) if ° I

the lifetime of the lth mission is m. u can be computed directly + (i + )lp 7ri 1(t), 3n c i< 3n, + n2
from the sojourn time distribution g( m).
The mean cost of the mission is, therefore, upper bounded
by ,fail (t)E { a ( j d) ()
oo m
XA o
4f(t) dt dM(m). 7ri(t) + nUp1rp(),
The probability of dynamic failure and mean cost over the where 7rf0i1(t) = probability of having failed by time t, and
f(0) = p, bai0, vi 3n, + n2 Then, the
e fi

entire life-cycle of, say, v missions, can now be easily


calculated. probability of dynamic failure over mission number / is given
Special Case: Exponential Failure Laws: This is a very
important special case, and it lends itself particularly nicely to
by Pdyn 1fail(t)dM(t)
Implicit in these calculations is the assumption that the hard
analysis. If we assume that the processors fail with rate /.p, it is deadline is much smaller than the mission lifetime or the
easy to see that an approximate value for Pdyn can be obtained sojourn time in the various states. This is almost invariably the
using the Markov model in Fig. 3. Let n - 2 be the state at case in practice.
which the system is saturated. Then, defining ct(j, t) _lXI The mean cost is similarly easy to derive. The contribution
- W(tij - I)}, the probability of having i processors to the mean cost of an arrival which finds the system with j
functioning at time t by iri(t), the following equations can be working triads is approximately
written. 4 have
C([l/3j),
and we, therefore,

1r3n+tl(t- {(3ni+n2)ttp 3n14-nl r m

Mean Cost (MC)= S | | Xw1r(t)


+ ae(n1, t ) dFvd (t )} w3n l+02 (t)
*C(K3 K dt
dM(m).
⁴ Recall that one triad is used only as a scheduler.
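For the exponential special case, the state probabilities of a Markov model of this kind can be integrated numerically. The sketch below is a rough illustration under loud assumptions, not the model of Fig. 3 itself: it tracks only the number of working processors, uses a static failure rate of $i\mu_p$ out of the state with $i$ working processors, and takes the state-dependent dynamic-failure (deadline-miss) rate as a caller-supplied hypothetical function `dyn_rate(i)`. The function name, the forward-Euler scheme, and the step count are likewise choices made for the example.

```python
def markov_state_probabilities(n_proc, mu_p, dyn_rate, horizon, steps=20000):
    """Forward-Euler integration of a pure-death chain with an absorbing
    dynamic-failure state.

    n_proc    number of processors working at time 0
    mu_p      per-processor failure rate (exponential failure law)
    dyn_rate  callable i -> rate of dynamic failure with i working processors
              (hypothetical stand-in for the deadline-miss term)
    horizon   length of the time interval to integrate over
    Returns (pi, failed): pi[i] is the probability of i working processors
    at the horizon, failed is the probability of dynamic failure by then.
    """
    dt = horizon / steps
    pi = [0.0] * (n_proc + 1)
    pi[n_proc] = 1.0
    failed = 0.0
    for _ in range(steps):
        nxt = pi[:]
        for i in range(n_proc + 1):
            static_out = i * mu_p * pi[i] * dt       # one more processor fails
            dynamic_out = dyn_rate(i) * pi[i] * dt   # a hard deadline is missed
            nxt[i] -= static_out + dynamic_out
            if i > 0:
                nxt[i - 1] += static_out
            failed += dynamic_out
        pi = nxt
    return pi, failed


if __name__ == "__main__":
    pi, failed = markov_state_probabilities(5, 0.01, lambda i: 0.001, 10.0)
    print(pi, failed)
```

With `dyn_rate` identically zero the sketch reduces to independent exponential lifetimes, which gives a simple sanity check: the probability that all processors survive should be close to $e^{-n \mu_p t}$.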

B. Steady-State Waiting Time Distribution Since the service time distribution of each task is known, it
If there are n triads, the system can be modeled as an M/G/ is now a simple matter to calculate the response time
n queue. The waiting time distribution of such queues is as yet distribution by using the service time with the waiting time
unknown, so approximations must be sought. distributions.
We employ the approach of Arjas and Lehtonen [1]. The
M!G/n queue is approximated by two E,/G/1 queues: one C. Probability Density Function of m
yields the upper bound, and the other the lower bound of the In this subsection, we compute the joint density of the i

response time distribution. under the following assumptions.


The idea is simple and its proof in [1] is elegant. Divide the 1) Processor failures occur independently, and a processor
arriving tasks, in their order of arrival, into groups of size n. ages only during a mission, not between missions.
Represent each such group by a single task, whose service 2) Mission lifetimes are IID.
time requirement is the maximum of all service times in its 3)3) Processors
Possor are are replaced
re urupon failure at ththe end of the
grou. the M/G/n
group. ReplaceRelacethe system with
IGInsystm wih anE,/G1 system
an En!GI1 sytem current mission.
whose single server is identical to any of the n identical servers It
ti
iswtons
ot osdrn supinIframmn.B
GIn sthem!G/n eresponsetime
in the MIGIn queue.th The response time distribution of thetheEnl/
distribution
dIstrIbution ofstem.
ofanuo
E
assuming that a processor ages only during a mission, and that
Stmem
Gilponse i
the
tie ofthe MGIn sstem
respose
anupperboundto between missions (when, presumably, the system is powered
down) processors do not fail, we have made it possible to
The lower bound is similarly obtained: instead of setting the hand themssios as fifh wereback-to-bac, ban th
service time to the group maximum, set it to the group the p ontof
from the if tem fire itoes notmat
from point of view of system failure, it does not matter
minimum.
We now introduce some additional notation. Let C, denote long the actual idle period between missions is. This
the /th task to arrive, tsom the time between arrivals of C, and assumption is valid in most instances since an unpowered unit
tends to age very slowly.
C l, x,, the service time of C, and
C,,
pservic
the
the probability
time
distribution
ofC',
of urn by
u+ x,,
an u,, =

Lm.
t, 1. Denote When we say that a processor's age is r, we mean,
therefore, that the processor has seen r seconds of service. Let
tt
isinta
t mission h ytmlae
n I°1 beeteeohi
Because of the assumption of Poisson arrivals, '~~~ the waiting h Ith
,
the epoch in the that the system leaves state i
time distribution is independent of the class that C, belongs to, (note that all these times are relative to the time of starting the
and the service time distribution of C, is independent of the .
mission). For convenience, we shall suppress the superscript.
arrival process. C, belongs to class i with probability X1i/X Clearly, we can write
where X = Er Xi. Define B(-) - Y,r (Xi/X)Bi(*). Let bl,
b2, b, be a set of independently identically distributed . =h (l)
(IID) random variables with distribution B(-). Let B(u)(.) and g(1 t1nI n, , 1)h ni)
B(')(-) denote the distribution of b(u) = max { b, * *, bn and * h . -
b(') = min {b,, , bn, respectively. n
LetA(t) - Xe`t((Xt)n-I/(n - 1)!) which represents the w h iy
distributi'on lof7=
Of En ti.
distribution t, Then, by the result of Arjas and
ofArjasand
whr
ha1( ni h rbbiiyta h ytmlae
state a in [a, ta + di1), given that it entered state a at %a+ 1 (in
Lehtonen [1], the waiting time distribution of tasks in the mission 1). For a < nI we get h(1)tiaI a+i) d? = Pr {sojourn
original M/G/n queue which represents the computer, is a
upper-bounded by an En/GI 1 queue with arrival and service dtim IZ }
distributions A( ) and B(u)(.), respectively, and lower
bounded by another En/G/1 queue with arrival and service _ Pr {Xand Y Z}
distributions A (*) and B(')( ), respectively.
Recalling that Lm( ) is the distribution of ur, Lindley's (3a+2 (,7 1)}12{1 )}3
a(
equation [4] immediately yields \\2/
X / 3aa d03l(na)
W M+( +ly)= Lm (y -w) d Wm (w)
0
V1J1-
1 010a
where WI(w) = 1 V w 2 0. If E[umr] < 1 v m, there exists a and
limit such that limm,o Wm(y) = W(y).
Implementing this iteration directly on a computer is, in hn(')1 - (3n;+ n2 N ,(0
general, fraught with formidable difficulties. If the Laplace 2 ) { X )
transform of L,?7 is rational, spectral factorization followed by ~3 n/3X dfSl(7n)
a Laplace inversion [4] can be used to obtain an exact solution. * { 1-&6(0, t7) ] V l y1l ,(;1nl)
If it is not rational, then we must seek approximate methods.
One such method is due to Kingman [5]. Let L*(s) denote where
the Laplace transform of L. Then, defining s0 = max {s~> X = two out of the 3a + 2 processors fail in less than 7}
OIL *( -s) c 1 }, we have the following upper bound to the - Na+ 1 seconds after state a was entered,
waiting time distribution: Y = one further failure occurs in [m1a- na+l tI?- fla +1
W(y n) = 1 - e-soY. + da1) after state a was entered,
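Kingman's bound from the steady-state waiting time subsection can be evaluated numerically once the distribution of service time minus interarrival time is fixed. The sketch below is a minimal illustration under stated assumptions, not the authors' procedure: group service time is taken to be deterministic with value `b`, group interarrival times are Erlang with `n` stages of rate `lam` (matching the $E_n/G/1$ approximation), and the decay rate $s_0$ is found by bisection. The function names are hypothetical.

```python
import math


def kingman_decay_rate(lam, n, b, tol=1e-12):
    """Largest s0 >= 0 with E[exp(s0 * (b - T))] <= 1, T ~ Erlang(n, lam).

    For deterministic service b and Erlang-n interarrivals the condition is
    s*b + n*log(lam / (lam + s)) <= 0, which has a unique positive root
    when the queue is stable (b < n / lam).
    """
    if b <= 0:
        raise ValueError("service time must be positive")
    if b >= n / lam:
        raise ValueError("queue is not stable: need b < n / lam")

    def log_mgf(s):
        return s * b + n * math.log(lam / (lam + s))

    # Bracket the positive root: log_mgf is negative just above 0 and convex.
    hi = 1.0
    while log_mgf(hi) <= 0.0:
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if log_mgf(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo


def waiting_time_cdf_bound(y, s0):
    """Kingman-type bound 1 - exp(-s0 * y) on the waiting time distribution."""
    return 1.0 - math.exp(-s0 * y)


if __name__ == "__main__":
    s0 = kingman_decay_rate(lam=2.0, n=3, b=1.0)
    print(s0, waiting_time_cdf_bound(5.0, s0))
```

Because the bound guarantees that the probability of waiting longer than $y$ is at most $e^{-s_0 y}$, using it in place of the true waiting time distribution errs on the pessimistic side, which is the direction the reliability analysis needs.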

Z = state $a$ was entered at $\eta_{a+1}$.

z()oP(X, t) agel,Jx) d b

(31 + t) - 01)

11 - 1(71)

P(I+tburn, 2) if I + t
P(Q1, =2)- 1-P(0, (I tburn) 1 burn-
. otherwise,
0 otherwise,

where $p(t, \tau)$ is the probability of a processor failing after operating (or burning-in) for a time in $[t, t + \tau)$. The function $g_l(\eta_{n_1}, \ldots, \eta_1)$ is only valid for $\eta_i < T_l$, the lifetime of mission $l$.

Special Case: Processor Failure Time Exponentially Distributed: Let the failure rate be $\mu_p$, then $p(t, \tau) = 1 - e^{-\mu_p \tau}$. The equations for $g_l$ and $h(\cdot)$ will still remain valid, with the simplification that

6{(j, t) = j3(t) 1 - e-pt.

D. Calculating the Age Density Function

Let $\mathrm{age}_{i,t}(l \mid t_1, \ldots, t_{i-1})$ be the density function of the processor age under policy $P_t$ at the beginning of mission $i$, and conditioned on mission $j$ ending at $t_j$, $j = 1, \ldots, i - 1$. Then $\mathrm{age}_{i,t}$ can be derived recursively as follows.

$$\mathrm{age}_{i+1,t}(0 \mid t_1, \ldots, t_i) = \Pr\{\text{age of processor at } t_i > t, \text{ or processor has failed in } [t_{i-1}, t_i]\}$$

$$= \int_0^\infty p(l, T_i)\, \mathrm{age}_{i,t}(l \mid t_1, \ldots, t_{i-1})\, dl + \int_t^\infty \{1 - p(l - T_i, T_i)\}\, \mathrm{age}_{i,t}(l - T_i \mid t_1, \ldots, t_{i-1})\, dl,$$

where $T_i = t_i - t_{i-1}$. The first term is the probability that the processor fails during mission $i$, and the second term is the probability that it does not fail during mission $i$, but has seen more than $t$ seconds of service and must, therefore, be replaced.

If $l > 0$, we get

$$\mathrm{age}_{i+1,t}(l \mid t_1, \ldots, t_i) = \begin{cases} \{1 - p(l - T_i, T_i)\}\, \mathrm{age}_{i,t}(l - T_i \mid t_1, \ldots, t_{i-1}) & \text{if } T_i < l < t \\ 0 & \text{otherwise.} \end{cases}$$

The initial condition for the recursion is trivially obtained. Take $\mathrm{age}_{0,t}(0) = 1$, $\mathrm{age}_{0,t}(l) = 0\ \forall\, l > 0$. We create a dummy mission number zero which ends at time zero with all processors new.

To find the unconditional density function of the age, we need the following integration.

$$\mathrm{age}_{i,t}(l) = \int_{x_1 = 0}^{\infty} \int_{x_2 = x_1}^{\infty} \cdots \int_{x_{i-1} = x_{i-2}}^{\infty} \mathrm{age}_{i,t}(l \mid x_1, x_1 + x_2, \ldots, x_1 + \cdots + x_{i-1})\, dM(x_{i-1}) \cdots dM(x_1).$$

IV. NUMERICAL EXAMPLE

In this section, we present examples of the dependencies and tradeoffs among the various parameters under the control of the designer, and consider the effect of the application requirements on the performance of the computer system. To ensure clarity, we consider the individual tradeoffs and dependencies separately.

In [11] and [12], we have considered the impact of the application requirements on the system performance in detail. The influence of the application is felt through the task loading, the cost functions, and the hard deadline distributions. To illustrate this, we have treated in [11] and [12] an example sufficiently idealized to remove any extraneous factors. This is the number-power tradeoff, first considered, in a different context, by King and Mitrani [3]. Assuming that the product of the number of triads and the power⁵ of the individual triads (which is termed the number-power product) is constrained to be a constant, and considering the effects of varying the processor number, one can gain useful insights into the impact of the application requirements on the optimal structure. In particular, one can analyze the tradeoffs between the processor speed and processor redundancy in distributed real-time systems.

⁵ Power is the instruction processing rate.

Since processors can only be replaced at the end of missions, the mission length greatly affects the probability of dynamic failure and the mean cost. We illustrate this in Section IV-A by an example with two task classes, each of deterministic service time requirements.

When processor failure laws are not memoryless, burn-in and replacement policy are important. We give an example of the influence of both of these on the probability of dynamic failure, the mean cost, and the expected number of processors to be replaced over the system life, in Section IV-B.

A. Influence of Mission Lifetime

The overall useful life of the system in this example is 960 h. We study the effects of breaking this period up into $n$ missions, each of length $960/n$, for $n = 2, \ldots, 16$. The system parameters are: $n_1 = 11$, $n_2 = 0$, task arrival rate is 300, and there are two tasks, 1 and 2, with $\lambda_1 = \lambda_2 = 150$, and the service time being 0.001 and 0.002, respectively. The hard deadlines are 0.099 and 0.098, respectively. For the purposes of our example, the finite cost function is taken as equal to the waiting time. The processor hazard rate is shown in Fig. 4.

As might be expected, the probability of dynamic failure drops as the number of missions increases, and the missions become correspondingly shorter. This effectively measures the sensitivity of the system to the interrepair periods. One has to pay for this in terms of a greater number of processors

[Plot: processor hazard rate versus processor age.]

Fig. 4. Processor hazard rate.

TABLE I
INFLUENCE OF MISSION LIFETIME

No. of     Mission                                Mean No. of
Missions   Length     $P_{dyn}$       M.C.     Processors Replaced
   2       480.0      9.99 × 10^-1    25.39         32.33
   3       320.0      9.94 × 10^-1    49.38         59.49
   4       240.0      7.86 × 10^-1    53.81         78.54
   5       192.0      2.81 × 10^-1    40.60         90.73
   6       160.0      5.98 × 10^-2    31.31         98.31
   7       137.143    1.39 × 10^-2    27.02        103.13
   8       120.0      4.94 × 10^-3    25.06        106.28
   9       106.67     2.71 × 10^-3    24.05        108.49
  10        96.0      6.39 × 10^-4    22.30        112.10
  11        87.27     3.68 × 10^-5    20.36        117.27
  12        80.0      1.67 × 10^-?    19.16        121.01
  13        73.85     1.52 × 10^-7    18.38        124.14
  14        68.5      5.82 × 10^-8    17.86        126.49
  15        64.0      4.16 × 10^-8    17.51        128.22
  16        60.0      3.48 × 10^-5    17.28        129.77
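
The discussion of Table I speaks of the marginal decrease in the probability of dynamic failure and in the mean cost per additional processor replaced. As a rough illustration only, the sketch below computes such ratios from a few adjacent rows of Table I whose entries are clearly legible. The row selection and the helper name `marginal_ratios` are choices made for this example and are not part of the original analysis.

```python
# Rows of Table I: (number of missions, P_dyn, mean cost, mean processors replaced).
# Only a few rows with clearly legible entries are used (an illustrative choice).
ROWS = [
    (5, 2.81e-1, 40.60, 90.73),
    (6, 5.98e-2, 31.31, 98.31),
    (7, 1.39e-2, 27.02, 103.13),
    (8, 4.94e-3, 25.06, 106.28),
]


def marginal_ratios(prev, cur):
    """Return (drop in P_dyn, drop in mean cost) per extra processor replaced.

    `marginal_ratios` is a hypothetical helper written for this sketch.
    """
    extra_replaced = cur[3] - prev[3]
    return ((prev[1] - cur[1]) / extra_replaced,
            (prev[2] - cur[2]) / extra_replaced)


if __name__ == "__main__":
    # Compare each row with the one before it.
    for prev, cur in zip(ROWS, ROWS[1:]):
        p_ratio, cost_ratio = marginal_ratios(prev, cur)
        print(f"{prev[0]} -> {cur[0]} missions: "
              f"P_dyn drop per processor = {p_ratio:.3e}, "
              f"mean-cost drop per processor = {cost_ratio:.3f}")
```

Ratios of this kind express what each additional replaced processor buys in reliability and in mean cost when the missions are shortened.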

replaced. This is captured in the third and fourth columns of Table I, which are the ratio of the marginal decrease in the probability of dynamic failure and the mean cost, respectively, to the marginal increase in the mean number of processors replaced.

Notice that the mean cost for the system with two missions is very low. The reason is that the mean cost is computed on condition that the computer does not fail, and when the failure probability becomes sufficiently close to one, this causes an actual decrease in the mean cost.

B. Influence of Burn-In Time

The task mix used here is the same as in the previous one. $\lambda_1 = \lambda_2 = 150$, the length of each mission is 170 h, the deadlines and service times are as before, the replacement strategy is $P_\infty$, and the processor hazard rate is given by

$$\text{hazard rate at age } t = \begin{cases} 0.02\,e^{-0.02995 t} & \text{if } 0 < t < 100 \\ 0.001 & \text{if } 100 \le t \le 2100 \\ 0.02\,e^{0.02995 (t - 2200)} & \text{if } 2100 < t < 2200 \\ 0.02 & \text{if } t \ge 2200. \end{cases}$$

We consider the failure probabilities over a period of 15 missions, each of length 170 h.

Table II contains the numerical results of the dependence on $t_{burn}$. As the burn-in period rises, $P_{dyn}$ initially drops, as does the mean cost. Above a burn-in period of 35, however, an increase in $t_{burn}$ causes an increase in $P_{dyn}$, i.e., deterioration of

TABLE II
INFLUENCE OF BURN-IN TIME

Burn-In                               Mean No. of
Time      $P_{dyn}$        M.C.    Processors Replaced        A             B
   0      1.93 × 10^-8     50.38        127.78                -             -
   5      3.38 × 10^-9     47.93        118.42          0.32 × 10^-8      1.87
  10      1.03 × 10^-9     46.17        110.84          0.47 × 10^-9      1.51
  15      3.75 × 10^-10    44.87        104.52          0.13 × 10^-9      1.26
  20      1.56 × 10^-10    43.91         99.29          0.44 × 10^-10     1.05
  26      7.36 × 10^-11    43.05         94.21          0.14 × 10^-10     1.02
  30      6.23 × 10^-11    42.61         91.40          0.28 × 10^-11     0.56
  35      7.75 × 10^-11    42.17         88.44         -0.30 × 10^-11     0.59
  40      1.28 × 10^-10    41.83         85.99         -0.10 × 10^-10     0.49
  45      2.37 × 10^-10    41.57         83.96         -0.22 × 10^-10     0.41
  50      4.51 × 10^-10    41.38         82.28         -0.43 × 10^-10     0.34
  55      8.65 × 10^-9     41.23         80.93         -0.16 × 10^-8      0.27

$$A = \frac{P_{dyn}\ \text{of previous row} - P_{dyn}\ \text{of present row}}{t_{burn}\ \text{of present row} - t_{burn}\ \text{of previous row}}$$

$$B = \frac{\text{MC of previous row} - \text{MC of present row}}{t_{burn}\ \text{of present row} - t_{burn}\ \text{of previous row}}$$
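
To make the burn-in tradeoff behind Table II more concrete, the following minimal sketch evaluates the piecewise hazard rate stated for this example and converts it into a conditional failure probability. It assumes the standard reliability identity $p(t, \tau) = 1 - \exp\{-\int_t^{t+\tau} \text{hazard}(u)\, du\}$, which is taken here to correspond to the paper's $p(t, \tau)$; the function names and the midpoint-rule integration are choices made for this illustration, not the authors' method.

```python
import math


def hazard(t):
    """Piecewise hazard rate of the burn-in example; t is processor age in hours."""
    if t < 100:
        return 0.02 * math.exp(-0.02995 * t)           # early-life (infant mortality) region
    if t <= 2100:
        return 0.001                                   # constant useful-life region
    if t < 2200:
        return 0.02 * math.exp(0.02995 * (t - 2200))   # wear-out rise
    return 0.02                                        # aged plateau


def cumulative_hazard(a, b, steps=20000):
    """Integrate the hazard rate over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(hazard(a + (k + 0.5) * width) for k in range(steps)) * width


def p_fail(age, tau):
    """Probability that a processor of the given age fails within the next tau hours.

    Assumption: this standard identity matches the paper's p(t, tau).
    """
    return 1.0 - math.exp(-cumulative_hazard(age, age + tau))


if __name__ == "__main__":
    # Failure probability over one 170 h mission for several burn-in times.
    for t_burn in (0.0, 30.0, 55.0):
        print(f"t_burn = {t_burn:5.1f} h: p(t_burn, 170) = {p_fail(t_burn, 170.0):.4f}")
```

Under these assumptions, a burned-in processor starts its first mission beyond part of the early high-hazard region, while every hour of burn-in also moves the wear-out region (beyond 2100 h of age) closer to the missions.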

reliability. This is because the burn-in time is large enough to bring the aging-caused rise of the $P_{dyn}$ within the useful life of the system. The rightmost two columns of Table II give the ratio of the improvement in $P_{dyn}$ and the mean cost, respectively, due to burn-in, and the corresponding value of the $t_{burn}$ used. Such ratios, which indicate a return for the burn-in, can be used in the system optimization mentioned in Section II.

C. Influence of Processor Replacement Strategy

We assume $t_{burn} = 4$, the processor hazard rate as in Fig. 4, the task mix as in Section IV-B, and 16 missions of length 60 h each. Our numerical results are contained in Table III. Rather than indicating replacement policy by processor age, we have chosen the alternative of indicating it by the number of missions undergone. Since the mission lifetimes in this example are deterministic, this is just another way of specifying processor age. As before, we provide sensitivity ratios for both the $P_{dyn}$ and the mean cost.

V. CONCLUSION

It has been argued [7] that computer performance is only meaningful in the context of its application. When computers are designed for specific applications, the needs of the application can be formally embedded within the computer performance analysis. This results, as we have shown in this paper, in precise and quantitative tradeoffs, indicating how changes in computer parameters affect the capability of the computer to satisfy the demands of the application.

Many extensions of this work are possible: the bottleneck is likely to be the queueing analysis. Similar analyses can be carried out for other architectures. One obvious candidate is the FTMP-type structure [2]. Analyzing this would require an approximation to the waiting time distribution in GI/G/n queues with blocking, and this is a difficult problem.

APPENDIX
DEFINITIONS

A. Terminology Used for Real-Time Control Systems

Given below are the definitions of frequently used terms.

* Mission: A continuous interval of time during which the computer performs its function. We treat the case where no repair is possible during the mission: it is only after a mission has been completed that the system can be repaired and processors replaced.
* Task Trigger: The initiation of a task. A task trigger can be produced by a timer, a prespecified combination of controlled system states, by the operator, or any combination of the above. Triggers produced by timers are open-loop triggers, those produced by the controlled system state combinations are closed-loop.
* Critical Task: The system response time for any version of a critical task must be less than a preset finite deadline if catastrophic failure is to be averted.
* Hard Deadline: The hard deadline is a maximum computer think time allowed to keep the controlled system within a "safe" region (see [10] for more on this). The hard deadlines of critical tasks of class $i$ are denoted by $t_{d_i}$. Hard deadlines are generally random variables, characterized by a distribution function $F_{d_i}(\cdot)$.
* Static Failure: When so massive a set of permanent

TABLE III
INFLUENCE OF PROCESSOR REPLACEMENT STRATEGY

Max. Age                                 Mean No. of
(in Missions)   $P_{dyn}$       M.C.   Processors Replaced        A            B
    1*          0.00            14.66       528.00                -            -
    2*          0.00            14.66       267.13                0            0
    3           1.29 × 10^-2    15.16       176.68          1.29 × 10^-2     0.50
    4           3.24 × 10^-9    16.46       142.85          3.24 × 10^-9     1.30
    5           3.24 × 10^-9    16.98       127.02                0          0.52
    6           3.24 × 10^-9    17.21       121.01                0          0.23
    7           3.24 × 10^-9    17.33       117.82                0          0.12
    8           3.25 × 10^-9    17.39       116.37          1 × 10^-11       0.06
    9           3.25 × 10^-9    17.42       115.53                0          0.03
   10           3.25 × 10^-9    17.43       115.13                0          0.01
   11           3.25 × 10^-9    17.44       114.96                0          0.01
   12           3.25 × 10^-9    17.45       114.87                0          0.01
   13           3.25 × 10^-9    17.45       114.81                0          0.00
   14           3.25 × 10^-9    17.45       114.78                0          0.00
   15           3.25 × 10^-9    17.45       114.78                0          0.00
   16           3.25 × 10^-9    17.45       114.78                0          0.00
   17           3.25 × 10^-9    17.45       114.77                0          0.00

* too low to calculate under the precision used.

A = $P_{dyn}$ of present row - $P_{dyn}$ of previous row

B = MC of present row - MC of previous row
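
Table III indexes the replacement policy by the maximum processor age in missions. As a sanity check on its first row, the sketch below counts the replacements that the age limit alone would trigger, ignoring failures entirely. It assumes 33 processors (11 triads of three, no spares, i.e., the Section IV-A values $n_1 = 11$, $n_2 = 0$; that these carry over to this example is an assumption) and the 16 missions stated for Section IV-C. The function name is hypothetical.

```python
def age_driven_replacements(max_age, missions=16, processors=33):
    """Count replacements caused by the age limit alone (failures are ignored).

    Assumptions for this sketch: 33 processors (11 triads x 3, no spares) and
    a processor is replaced at the end of the mission in which its age,
    counted in missions, reaches max_age.
    """
    replaced = 0
    ages = [0] * processors
    for _ in range(missions):
        ages = [age + 1 for age in ages]    # every processor serves one more mission
        for index, age in enumerate(ages):
            if age >= max_age:              # age limit reached: swap in a new processor
                replaced += 1
                ages[index] = 0
    return replaced


if __name__ == "__main__":
    for limit in (1, 2, 17):
        print(limit, age_driven_replacements(limit))
```

With an age limit of one mission, every processor is replaced after every mission whether or not it has failed, so the count is 33 × 16 = 528, which agrees with the first row of Table III. For larger limits the sketch is not comparable to the table, because failures also cause replacements and reset processor ages.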

hardware failures have occurred that it is impossible for the computer to perform its duties, static failure is said to result. The onset of static failure is typified by a utilization demand of greater than unity.
* Dynamic Failure: When the deadlines for one or more critical tasks have been violated, dynamic failure is said to have occurred. Note that dynamic failure subsumes static failure. The probability of dynamic failure is denoted by $P_{dyn}$.
* Burn-in Time: The time for which processors are "burnt-in" by being made to execute in a test setting. Burn-in tends to remove latent manufacturing defects.
* Processor Replacement Strategy: The processor replacement strategy $P_t$ is the policy according to which a processor is replaced at the end of a mission during which it has failed or has seen more than $t$ hours of service, whichever is the lesser.

B. List of Symbols Used

$\mathrm{age}_{i,t}(\cdot)$   Density function of the age of a processor at the beginning of mission $i$ when replacement policy $P_t$ is used.
$B_i(\cdot)$   Service time distribution for tasks of class $i$.
$f_i$   Finite cost function for tasks of class $i$.
$F_{d_i}$   Distribution of hard deadlines for tasks of class $i$.
$F_{wd}(\cdot)$   Distribution of waiting time deadline.
$\eta_i$   Epoch when the system leaves state $i$ (i.e., degrades to $i - 1$ working triads) during mission $l$. Actually, it has to be $\eta_i^l$ to indicate its dependence on the mission $l$, but the superscript $l$ is dropped for convenience.
$g_l(\eta_{n_1}, \ldots, \eta_1)$   Joint density function of time $\eta_i$, which is when the system leaves state $i$, $i = 1, \ldots, n_1$, for mission $l$.
$M(\cdot)$   Distribution function for the length of a mission.
$M^{(i)}$   $i$-fold convolution of $M$.
$n_1$   Number of triads in the system at the beginning of a mission.
$n_2$   Number of spares in the system at the beginning of a mission.
$p(t, \tau)$   Probability of a processor failing after operating (or burning-in) for a time in $[t, t + \tau)$.
$q_i$   Probability of a processor having to be replaced at the end of the $i$th mission.
$r$   Number of task classes.
$T_i$   Random variable denoting the length of the $i$th mission.
$t_i$   $\sum_{j=1}^{i} T_j$.
$W(\cdot \mid j)$   Distribution function of the steady-state waiting time of a task when the system has $j$ triads operational.
$\lambda_i$   Input intensity for tasks of class $i$, and $\lambda = \sum_{i=1}^{r} \lambda_i$.
$\pi_{i,l}(t, \eta_{n_1}, \ldots, \eta_i)$   Probability of the system failing $t$ units into the current mission when it leaves state $i$ at $\eta_i$.

REFERENCES

[1] E. Arjas and T. Lehtonen, "Approximating many server queues by means of single server queues," Math. Oper. Res., vol. 3, p. 205, 1978.
[2] A. L. Hopkins et al., "FTMP-A highly reliable fault-tolerant multiprocessor for aircraft," Proc. IEEE, vol. 66, pp. 1221-1239, Oct. 1978.
[3] P. J. B. King and I. Mitrani, "The effect of breakdown on the performance of multiprocessor systems," in Proc. Performance '81. Amsterdam, The Netherlands: North-Holland, 1981, pp. 201-211.
[4] L. Kleinrock, Queueing Systems: Vol. I. New York: Wiley, 1975.
[5] L. Kleinrock, Queueing Systems: Vol. II. New York: Wiley, 1976.
[6] C. M. Krishna and K. G. Shin, "Performance measures for multiprocessor controllers," in Performance '83. Amsterdam, The Netherlands: North-Holland, 1983, pp. 229-250.
[7] C. M. Krishna, K. G. Shin, and Y.-H. Lee, "Optimization criteria for checkpoint placement," Commun. ACM, vol. 27, pp. 1008-1012, Oct. 1984.
[8] S. L. Maher and S. J. Larimer, "Continuous reconfiguration in a multimicroprocessor flight control system," in Proc. NATO AGARD Conf. Tactical Airborne Distributed Comput. Networks, Roros, Norway, 1981.
[9] S. M. Ross, Stochastic Processes. New York: Wiley, 1983.
[10] K. G. Shin, C. M. Krishna, and Y.-H. Lee, "A unified method for evaluating real-time computer controllers and its application," IEEE Trans. Automat. Contr., vol. AC-30, pp. 357-366, Apr. 1985.
[11] K. G. Shin and C. M. Krishna, "The processor number-power tradeoff in a class of multiprocessors," in Proc. 5th Int. Conf. Distributed Comput. Syst., Denver, CO, May 1985, pp. 321-328.
[12] K. G. Shin and C. M. Krishna, "New performance measures for design and evaluation of real-time multiprocessors," Comput. Syst. Sci. Eng., vol. 1, pp. 179-191, Oct. 1986.

C. M. Krishna (S'78-M'84) received the B.Tech. degree from the Indian Institute of Technology, Delhi, in 1979, the M.S. degree from Rensselaer Polytechnic Institute, Troy, NY, in 1980, and the Ph.D. degree from the University of Michigan, Ann Arbor, in 1984, all in electrical engineering.
Since September 1984, he has been on the faculty of the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst. His research interests include reliability modeling, queueing and scheduling theory, and distributed architectures and operating systems.

Kang G. Shin (S'75-M'78-SM'83) received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea in 1970, and the M.S. and Ph.D. degrees in electrical engineering from Cornell University, Ithaca, NY, in 1976 and 1978, respectively.
From 1970 to 1972 he served in the Korean Army as an ROTC Officer and from 1972 to 1974 he was on the research staff of the Korea Institute of Science and Technology, Seoul, Korea, working on the design of VHF/UHF communication systems. From 1978 to 1982 he was an Assistant Professor at Rensselaer Polytechnic Institute, Troy, NY. He was also a Visiting Scientist at the U.S. Air Force Flight Dynamics Laboratory in Summer 1979 and at Bell Laboratories, Holmdel, NJ, in Summer 1980. Since September 1982, he has been with the Department of Electrical Engineering and Computer Science at The University of Michigan, Ann Arbor, MI, where he is currently a Professor. He has been very active and authored/coauthored over 100 technical papers in the areas of distributed fault-tolerant real-time computing, computer architecture, and robotics and automation. As an initial phase of validation of architectures and analytic results, he and his students are currently building a 19-node hexagonal mesh real-time system at the Real-Time Computing Laboratory (RTCL), The University of Michigan.
Dr. Shin is a member of the Association for Computing Machinery, Sigma Xi, and Phi Kappa Phi. He was the Program Chairman of the 1986 IEEE Real-Time Systems Symposium and has served as the Guest Editor of the special issue of IEEE TRANSACTIONS ON COMPUTERS on Real-Time Systems, August 1987.

Inderpal S. Bhandari received the M.S. degree in electrical and computer engineering from the University of Massachusetts, Amherst, in 1985 and the B.Tech. degree in electrical and electronics engineering from the Birla Institute of Technology and Science, Pilani, India.
He is currently a doctoral candidate in electrical and computer engineering, Carnegie-Mellon University, Pittsburgh, PA. His research areas are artificial intelligence, computer-aided design, and distributed systems.
