Network Survivability Modeling

This document summarizes research on modeling network survivability. It defines network survivability as a network's ability to continue providing services that meet performance requirements even when failures occur. The paper develops both simulation and analytical models to evaluate network survivability when virtual connections in the network experience link or node failures. These models are applied to small example networks as well as real-sized networks to analyze the transient performance after a failure, such as packet loss rate and delay. The results show good agreement between the simulation and analytical modeling approaches.

Computer Networks 53 (2009) 1215–1234

Contents lists available at ScienceDirect

Computer Networks
journal homepage: www.elsevier.com/locate/comnet

Network survivability modeling


Poul E. Heegaard a,*, Kishor S. Trivedi b
a Norwegian University of Science and Technology (NTNU), Department of Telematics, 7491 Trondheim, Norway
b Pratt School of Engineering, Duke University, Durham, NC, USA

Article info

Article history: Available online 5 March 2009

Keywords: Survivability; End-to-end performance; Analytical models; Simulation

Abstract

Critical services in a telecommunication network should be continuously provided even when undesirable events like sabotage, natural disasters, or network failures happen. It is essential to provide virtual connections between peering nodes with certain performance guarantees such as minimum throughput, maximum delay or loss. The design, construction and management of virtual connections, network infrastructures and service platforms aim at meeting such requirements.

In this paper we consider the network's ability to survive major and minor failures in network infrastructure and service platforms that are caused by undesired events that might be external or internal. Survive means that the services provided comply with the requirement also in presence of failures. The network survivability is quantified as defined by the ANSI T1A1.2 committee which is the transient performance from the instant an undesirable event occurs until steady state with an acceptable performance level is attained.

The assessment of the survivability of a network with virtual connections exposed to link or node failures is addressed in this paper. We have developed both simulation and analytic models to cross-validate our assumptions. In order to avoid state space explosion while addressing large networks we decompose our models first in space by studying the nodes independently and then in time by decoupling our analytic performance and recovery models which gives us a closed form solution. The modeling approaches are applied to both small and real-sized network examples. Three different scenarios have been defined, including single link failure, hurricane disaster, and instabilities in a large block of the system (transient common failure).

The results show very good correspondence between the transient loss and delay performance in our simulations and in the analytic approximations.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Our society depends on a wide variety of telecommunication services to support our demands for everything from pure entertainment to commerce, banking and life critical services. Critical services such as transport traffic control systems, and emergency and financial services, have stringent QoS requirements for both performance tolerance and service dependability. Non-critical services like entertainment will typically have less strict requirements. Ideally, component or subsystem failures should be imperceptible to the users, and it should make no difference whether the service impairment is caused by attacks, accident or failure. However, it is far too expensive to build a network that retains unchanged QoS level in case of a major outage. Then preference should be given to the most critical services and most users will accept that non-critical services are temporarily shut down, while critical services are degraded or at best unaffected.

* Corresponding author. Tel./fax: +47 99286858.
E-mail addresses: [email protected] (P.E. Heegaard), kst@ee.duke.edu (K.S. Trivedi).
1389-1286/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.comnet.2009.02.014

A variety of threats, like attacks, accidents, and failures, may cause minor or major service degradations in the telecommunication services and network. Attack threats have been given a lot of attention after the terrorist attacks in New York 2001, Madrid 2004 and London 2005. Countries consider their telecommunication infrastructure a part of their national critical infrastructure that needs to be protected [1]. The protection should also cover natural disasters such as flooding, earthquake, thunderstorms, etc. that may have huge impact on the infrastructure. Finally, through regulations and service level agreements (SLA) the government and customers must give sufficient incentives to the operators and equipment vendors to take all the necessary precautions to avoid major network outages, such as the famous AT&T's frame relay network outage [2] that lasted for up to 26 hours and included the failure of mission-critical bank transactions, or the more recent major outage in an NTT network in Japan [3], where close to 4000 Cisco routers went down for about seven hours disconnecting millions of broadband Internet users across eastern Japan.

In a telecommunication network it is essential to provide virtual connections between peering nodes with certain performance guarantees such as minimum throughput, maximum delay or loss, and at the same time ensure an overall good utilisation of the network resources. Management of paths in virtual connections must be designed to handle (i) slow and quick changes in traffic load and pattern, (ii) short and long lived minor node, link or server outages, and (iii) short and long lived major outages where a large number of nodes or links are simultaneously down, or a critical server in service platform is down. The design, construction and management of network infrastructure and service platforms are very important and extremely challenging tasks. A combination of different approaches is taken: (i) prevent the causes of the failures, (ii) design the network to ensure that sufficient diversity and spare capacity is built in to partly tolerate loss of capacity, and (iii) develop and configure proactive and reactive traffic management techniques and protocols to enable restoration of services so they continue to provide the requested service level. A large variety of management techniques exist and are under development to meet these requirements. They apply to different network layers, use pre-planned or reactive techniques, and utilize various setup methods with different resource utilization on local or global operational domain and scope of repair. An excellent classification of recovery techniques and current state of the art are given in [4].

Both behavioral as well as the structural aspects of the system need to be taken into account while modeling the (transient) performance of the virtual connection management. This means that the model must capture how the performance of the virtual connection is affected by routing and rerouting, by traffic load variations, by changes in network capacities, and by different service requirements. Structural dependability models typically focus on the probabilities of terminal connectivity, while behavioral models, e.g. as proposed in [5], take the network dynamics into account and provide steady state service availability. Combining structural and behavior aspects is typically done using Markov models or queuing network models for performance analysis. Further, combined study of performance and dependability is carried out by Markov reward type models.

In this paper we consider the network's ability to survive major and minor failures in network infrastructure and service platforms that are caused by undesired events that might be external (environment, natural disasters, accident, malicious human attacks and electronic attacks), or internal (traffic congestion, link/node failure and repair, different failure modes). Survive means that the services provided comply with the requirements even in the presence of failures and the term survivability emphasizes that the network design and evaluation should take all threats into account and that services must be differentiated with respect to performance and dependability requirements. Particular attention is given to the transient behavior after an undesired event when analyzing the system. In this paper we describe a decomposed analytic model, as introduced in [6], for the transient loss ratio (probability) and expected number in system, and cross-validate this against simulations of two small network examples. Furthermore, the paper describes how the model may quantify the network survivability by the transient one-way delay distribution of a set of virtual connections in a real-sized network example. This was first introduced in [7].

In Section 2 the network survivability is defined and informally discussed in the context of virtual connection management. Section 3 provides the steps in network survivability modeling, including modeling of the recovery phases after a failure (Section 3.3), performance modeling of expected loss ratio, throughput, and delay and distribution (Section 3.2), and the decomposition techniques that allow us to deal with state explosion in real-sized networks (Section 3.4). A series of results are included in Section 4 to validate the decomposition techniques, and demonstrate the applicability on real-sized networks with various failure scenarios. Finally, some closing remarks are given in Section 5.

2. Survivability

Management of virtual connections must be designed to preplan for the expected and react to the unexpected. Evaluation of different schemes is important. However, it is not obvious how to evaluate the management schemes, not only because the size and complexity of the problem is a huge modeling challenge, but also because the definition, metrics, and quantification methods of survivability are anything but clear. Different frameworks have been proposed and applied under different scenarios [8–11].

Survivable system and survivable network have been designed and evaluated in the literature for many years [12–14]. The many definitions of survivability can be summarized as

Survivability is the system's ability to continuously deliver services in compliance with the given requirements in the presence of failures and other undesired events.
In the literature we find that the services are everything from unspecified [15,16], very general like "mission" [14], to more specific such as "connected logical links" [17,18]. The requirements are very general like "fulfill its mission, in a timely manner" [14], "complies with its survivability specification" [12], "committed QoS continuously" [13], "provide service continuity" [15], very unclear "essential functions are still available" [19], or closely linked to the application area "logical links remain connected" [17]. Finally, the definitions of survivability cover a broad range of undesired events and their effects. Often the definitions only refer to failures [15,20] or failure scenarios [18,13] without any reference to the events that caused it. In [14] they explicitly indicate that "attacks, failures, and accidents" may cause the system failure and the service degradation.

In this paper our objective is to quantify the survivability of virtual connections in telecommunication networks. For this we define the (i) service to be the virtual connection between specific peering nodes in the network, (ii) requirement is the maximum packet loss probability and end-to-end delay of non-lost packets in the virtual connections, and (iii) undesired events are link and node failures caused by attacks, accidents, and software and hardware failures. To quantify we use the definition given by ANSI T1A1 [22]:

Survivability quantification. The measure of interest $M$ has the value $m_0$ just before a failure occurs. The survivability behavior can be depicted by the following attributes: $m_a$ is the value of $M$ just after the failure occurs; $m_u$ is the maximum difference between the value of $M$ and $m_a$ after the failure; $m_r$ is the restored value of $M$ after some time $t_r$; and $t_R$ is the relaxation time for the system to restore the value of $M$.

These attributes are illustrated in Fig. 1. The measure of interest $M$ will in this paper be performance metrics like the loss probability and the delay distribution of non-lost packets. Specifically, the transient system behavior immediately after the occurrence of a failure can be analyzed under our proposed approach. Early related work can be found in [23,24] and more recent work in [21,25,26].

Fig. 1. Survivability after first failure [21]. (Plot of blocking probability, annotated with $m_0$, $m_a$, $m_u$, $m_r$, $t_r$, $t_R$ and the failure instant.)

3. Network survivability modeling

The network survivability models in this paper consider networks exposed to undesired events that cause links and nodes to fail, which are typically followed by a sudden change in the availability of network resources such as bandwidth of the transmission links, queuing positions (memory), and processor capacity. Gradually the resources are restored through rerouting and by restoration of the failed links and nodes, which results in restored performance.

First part of this section outlines the principle of survivability modeling, followed by details of performance models in Section 3.2, which are constructed to evaluate expected loss ratio, throughput, and delay (mean and distribution). In Section 3.3 two approaches are described to model the phased failure propagation and recovery based on either knowledge of propagation processes and recovery mechanisms [6] or based on their actual traces [7]. The phased recovery model in Section 3.3.1 captures the rerouting process while Section 3.3.2 traces the changes in the routing probabilities observed after a failure. The composite survivability models suffer from state explosion when addressing real-sized networks and hence in Section 3.4 efficient time and space model decomposition techniques are introduced. In Section 3.5 the model scalability and corresponding assumptions are addressed.

3.1. Survivability model approach

The survivability model does not consider the frequency of undesired events because the focus is: given that an undesired event has occurred what is the nature of performance degradation just after such event until the system stabilizes again. The survivability models are constructed by combining continuous time Markov chain (CTMC) performance models with models of the different failure propagation and recovery phases in the system. Fig. 2 illustrates the modeling principle where the failure propagation and recovery are modeled as a sequence of phases with each phase being a state in a CTMC model, and the transitions are caused by events like failure detection, rerouting completed, etc. In each state the system is assumed to be in a performance wise steady-state with unchanged operational conditions. For a short period immediately after a state change this is obviously not the case but as will be demonstrated later this transient effect is negligible under the assumption made in this paper (see Section 3.5 and the experiments in Section 4). In each state the performance metrics expected loss, throughput, and delay (mean and distribution), are obtained by reward measures from a continuous time Markov chain (CTMC) model of the resource utilization.

At time $t$ a (set of) undesired events are assumed to take place whence the transient period of interest begins. A
change in the system state is triggered and the evolution of the system is followed through stages of failure propagation, detection, recovery and restoration/repair. Observe that we do not need to know the frequency of undesired events, e.g., the time till failure, because the failure is forced or triggered (dashed line in the figure).

Fig. 2. Sequence of failure propagation and recovery. (Diagram labels: normal; undesired events; undesired event detected; new events detected; reconfigure/reroute; repair/restore; grouped into failure propagation and recovery, with performance models 1 to k attached to the phases.)

3.2. Performance models

3.2.1. Network performance model

The network is a graph $\vec{G} = (v, e)$ where $v$ is the set of nodes and $e$ is the set of links. The single- or multi-path routing of a virtual connection between source node $s$ and destination $d$ reduces $\vec{G}$ to a directed graph $\vec{G}_{[s,d]}$. The network model is Markovian with Poisson external arrivals and exponential service time distribution with an FCFS service discipline at each node and/or link. The routing between node $i$ and $j$ is stochastic with time-independent probability $r_{ij}$. Depending upon whether the node processing or the link transmission is the bottleneck in the packet forwarding, a node or link centric approach is taken:

- node centric: if the node processing of each packet is the bottleneck, each node is modeled as an independent $M/M/1$ queue.
- link centric: if the link transmission is the bottleneck, each link (or the corresponding network interface) is an $M/M/n$-queue where $n$ is the number of transmission channels (propagation times are either included in, or added to, the link service times).
- node and link centric: if the bottleneck is fluctuating between node processing and link transmission, each node and link are modeled separately.

In Fig. 3 an illustration of the three possible network performance model views is given.

Each packet occupies one server or buffer position in the node or on a transmission interface. The state $x_i$ in the CTMC is the number of packets in state $i$. The state variable $i$ refers to a unique enumeration of nodes and links that are included in the model. In the following section we assume a node centric model view, i.e. the state variable $i$ corresponds to node $i$. Then the node capacity $n_i$ is the number of buffer positions. With finite $n_i$, the arrival rate $\Gamma_i$ is obtained by solving the linear system of traffic equations [27]

$$\Gamma_i^{(v)} = \gamma_i^{(v)} + \sum_{j \neq i} r_{ji}^{(v)} \Gamma_j^{(v)} \left(1 - p_j(n_j)\right), \quad \text{for } i = 1, \ldots, n, \tag{1}$$

where $\gamma_i^{(v)}$ is the arrival rate of external traffic to node $i$ for the virtual connection $v$ and $p_j(n_j)$ is the steady state probability that node $j$ will reject an incoming packet. In the case of infinite $n_j$ the $p_j(n_j) = 0, \forall j$, and the model is an open BCMP-type queuing network [28]. A single path virtual connection is modeled as a special case where each hop has a single link $j$ with $r_{ij} = 1$ and 0 for all other links. The total external traffic $\gamma_i$ to node $i$ is $\gamma_i = 0$ for all $i \neq s$ and $\gamma_s = \gamma$. The total arrival rate to node $i$ is

$$\Gamma_i = \sum_{v=1}^{VC} \Gamma_i^{(v)}. \tag{2}$$

3.2.2. Performance measures: expected values

The measure of interest, $M$, from Section 2, is obtained as reward measures from a continuous time Markov chain (CTMC) model of the resource utilization. Assume that each node is an $M/M/1/n_i$-queue. With $\Gamma_i$ and $\mu_i$ independent of the state $x_i$ and let $\rho_i = \Gamma_i/\mu_i$, then the steady state probability $p_i(x_i)$ of the state $x_i$ in node $i$ has the following closed form solution [29] (if $n_i \to \infty$ then $\rho_i^{n_i} \to 0$)

$$p_i(x_i) = \frac{1 - \rho_i}{1 - \rho_i^{n_i + 1}} \rho_i^{x_i}. \tag{3}$$

The performance metrics are expected loss, throughput, and delay (mean and distribution). The transient loss rate is denoted by $L(t)$ at time $t$, the loss ratio (aka probability) $l(t)$ at time $t$, the number of packets in the system $N(t)$ at time $t$, and the mean end-to-end delay of packets that are not lost in the virtual connection, $D(t)$. In the survivability models in the following sections the composite CTMC state $(y, \vec{x})$ consists of the phase $y$ and $\vec{x} = (x_1, \ldots, x_n)$, where $x_i$ is the number of packets in node $i$, see Section 3.3 for definition and modeling of phases. The reward rates are assigned to node $i$ as follows

1. For computation of the loss rate:

$$f_{L_i}(t, y, x_i) = \begin{cases} \Gamma_i(y) & \text{if } (x_i = n_i) \text{ at time } t, \\ 0 & \text{otherwise.} \end{cases}$$

2. For computation of the mean number of packets:

$$f_{N_i}(t, y, x_i) = x_i,$$
where $n_i$ is the maximum number of packets and $\Gamma_i(y)$ is the arrival rate to node $i$ in phase $y$. Once the transient probabilities, $p(t, y, \vec{x})$, are obtained from the composite CTMC models then the performance metrics become

Fig. 3. Node and/or link centric models of a 4 node example. (Panels: (i) Node centric model; (ii) Link centric model; (iii) Node and Link centric model; nodes 1 to 4 and links [1,2], [1,3], [2,4], [3,4].)

1. Expected total loss rate at time $t$:

$$E[L(t)] = \sum_{y=I}^{IV} \sum_{i=1}^{n} \sum_{x_i=0}^{n_i} f_{L_i}(t, y, x_i)\, p(t, y, \vec{x}). \tag{4}$$

2. Expected total loss probability at time $t$:

$$E[l(t)] = E[L(t)]/\gamma. \tag{5}$$

3. Expected total number of packets in the network at time $t$:

$$E[N(t)] = \sum_{y=I}^{IV} \sum_{i=1}^{n} \sum_{x_i=0}^{n_i} f_{N_i}(t, y, x_i)\, p(t, y, \vec{x}). \tag{6}$$

4. Expected total delay of non-lost packets at time $t$:

$$E[D(t)] = E[N(t)]/\left(\gamma (1 - E[l(t)])\right). \tag{7}$$

3.2.3. Performance measures: delay distribution

To evaluate the one-way delay distribution of packets in a virtual connection we use the response time block approach described in [29,30]. The basic idea is to use the knowledge of the one-way delay (aka response time in [30]) distribution of a single network node to construct a CTMC where the one-way delay distribution for a virtual connection between two peering points is equal to the time to absorption in this CTMC. The routing of each virtual connection $v$ is given in a routing matrix, $R^{(v)} = \{r_{ij}^{(v)}\}$ where $r_{ij}^{(v)}$ is the probability of routing packets from node $i$ to $j$ for virtual connection $v$, $v = 1, \ldots, VC$. In link state routing schemes (e.g. IS–IS [31], OSPF [32]) the virtual connection follows the shortest path, typically according to the minimum number of hops or least cost based on static metrics. In stochastic routing schemes the virtual connections are routed along multiple paths, which gives load sharing. Stochastic routing is not in use in today's networks but is under consideration in traffic engineered MPLS [33] and swarm-based routing schemes [34,35].

The block approach that is described in this section assumes an open Markovian network of $M/M/1$ queues. Approximations for other Markovian, and non-Markovian networks also exist, see [30] for details. In this section we extend the block method to apply to networks with multiple virtual connections. The method is presented in four steps.

1. Calculate the arrival rates to each node in the queuing network for virtual connection $v$ by solving the linear system of traffic equations in (1).
2. Create a CTMC with states $S = \{S_f, S_l, \cup_{i=1}^{n} S_i\}$ where $S_f$ is the absorbing state for any virtual connection, $S_l$ is the loss state where packets are routed if $r_{ij}^{(v)} > 0$ while the link $[i,j]$ or node $j$ is down, and $S_i$ is the state of node $i$. The CTMC has $n + 2$ states where $n$ is the number of nodes in the network. In Fig. 4 the CTMC for an eleven node network example with two virtual connections is given.
3. The delay distribution in node $i$ can be represented by an exponential distribution with rate $\mu_i - \Gamma_i$ given that $\Gamma_i < \mu_i$, where $\Gamma_i$ is the net arrival rate over all virtual connections and $\mu_i$ is the service rate of node $i$. The generator matrix $Q^{(v)}$ of the CTMC above is constructed by
   (a) The routing from node $i$ to $j$ for virtual connection $v$ is $Q^{(v)}_{S_i S_j} = (\mu_i - \Gamma_i)\, r_{ij}^{(v)}$.
   (b) The exit from the network at the destination $i = d_v$ of the virtual connection $v$ is $Q^{(v)}_{S_i S_f} = (\mu_i - \Gamma_i)\, r_{if}^{(v)}$.
   (c) If $r_{ij}^{(v)} > 0$ and the link $[i,j]$ or node $j$ is down then the rate of transition to the loss state is $Q^{(v)}_{S_i S_l} = (\mu_i - \Gamma_i)\, r_{ij}^{(v)}$ and $Q^{(v)}_{S_i S_j} = 0$.
   (d) All other entries are 0 except the diagonals, $Q^{(v)}_{S_i S_i} = -\sum_{S_j \in S,\, S_j \neq S_i} Q^{(v)}_{S_i S_j}$.
4. The cumulative delay distribution function for virtual connection $v$ is equal to the probability $P_{Q^{(v)}}(S_f, u)$ of being in the absorbing state $S_f$ at time $u$ under the generator matrix $Q^{(v)}$. The probabilities $P_{Q^{(v)}}(s, u)$, $s \in S$, are found by solving ($P_{Q^{(v)}}(u) = \{P_{Q^{(v)}}(s, u)\}$)

$$\frac{d}{du} P_{Q^{(v)}}(u) = P_{Q^{(v)}}(u) \cdot Q^{(v)}.$$

The initial conditions are $P_{Q^{(v)}}(s, 0) = 1$ if $s = s_v$ is the source node $s_v$ of virtual connection $v$, and $P_{Q^{(v)}}(s, 0) = 0$ otherwise. Observe that if packets are lost the packet delay distribution is defective, i.e. $P_{Q^{(v)}}(S_f, u) < 1$ because $P_{Q^{(v)}}(S_l, u) > 0$ when $u \to \infty$ [29].

Fig. 4. CTMC specification for the delay block approach in a network example with two virtual connections.

The computations above are exact in the case of an open product form network that is feed-forward, and all paths are overtake-free [29]. As an example the virtual connection VC1 in Fig. 4 is overtake-free while VC2 has a reconvergent path and hence is not overtake-free.

With non-exponential interarrival or service time assumptions the approach is approximate. An example is included in Section 4.2.1 to illustrate the approximation. More details on this method and generalization to $M/M/c/b$-queues and non-Markovian networks are found in [30].

3.3. Failure propagation and recovery model

This section describes two approaches taken to model the phased failure propagation and recovery based on either the knowledge of propagation processes and recovery mechanisms [6], or based on the actual traces [7].

3.3.1. Phased recovery model

In [6] a phased recovery model of the rerouting and restoration was introduced. The idea is to describe the sequence of events as a CTMC model with four phases where each phase represents different stages in the routing matrix update. This requires knowledge of the detection mechanism and (re)routing process. The phased recovery
model describes the "cycle" starting from an undesired event that causes one or multiple links or nodes to fail, and until the system is back to the state just before this event. This can be modeled by phases where each phase may have a different set of available resources for the virtual connections, represented by (possibly) phase-dependent stationary routing probabilities $\{r_{ij}(y)\}$ with corresponding phase-dependent arrival rates $\Gamma_i(y)$. A similar approach was taken in [24]. But in the cited paper the recovery was not considered whereas here we model both rerouting and restoration of the network resources that brings the system back to fault free operation. In Fig. 5 the life cycle of the failure and rerouting is described in four phases, $y = I, \ldots, IV$. The dotted (blue) lines [1] in the figure illustrate which phases of the more general model from Fig. 2 are included.

Fig. 5. Phased recovery model of rerouting and restoration. (Same sequence as Fig. 2, with the phases labelled IV, I, II, III.)

[1] For interpretation of color in Figs. 2 and 20, the reader is referred to the web version of this article.

Phase I: Immediately after the failure the rerouting is activated but it takes some time before the rerouting is effective. Meanwhile, the packets are routed according to the original routing scheme, $r_{ij}(I) = r_{ij}(IV)$, except for the failed node $i$ and link $[i,j]$ where $r_{ij}(I) = 0$, i.e. no packets are fed forward from these nodes and links. The rerouting time is assumed to be exponentially distributed with rate $\alpha_d$.

Phase II: When the rerouting is effective the link or node is still failed. The packets are routed according to a new routing scheme and will avoid these failed links or nodes (if possible). After the exponentially distributed repair time with rate $\tau$, the system enters phase III.

Phase III: On completion of repair the system returns to failure free state but the routing is yet to change. After the exponentially distributed rerouting time with rate $\alpha_u$, the system is back to normal routing in phase IV. Phase II and III may have identical routing probabilities with $r_{ij}(II) = r_{ij}(III)$. In this case the two states can be lumped together to reduce state space and the model will be semi-Markovian.

Phase IV: After the routing information is restored the network operates in fault free mode, which is an absorbing state for the purpose of survivability analysis [21,26].

This model should not be taken to imply that only one failure event at a time can occur, since an event can be any combination of multiple simultaneous node and link failures. Furthermore, the phased recovery model can easily be modified to refine the rerouting phases to also model gradual changes in the routing probabilities or multiple steps in the virtual connection management scheme (see tracing of phases in the following section), or to model other failure modes like intermittent link failures (see example in Section 4.2.4).

In the phased recovery model in Fig. 5 the system starts in phase I and steps through all phases before it returns to phase IV. The transient probabilities $p(t, y)$ at time $t$ of the four phases $y = I, \ldots, IV$ can be obtained in closed form by the convolution integration approach [29]

$$\begin{aligned}
p(t, I) &= e^{-t\alpha_d}, \\
p(t, II) &= \frac{\alpha_d}{\alpha_d - \tau}\left(e^{-t\tau} - e^{-t\alpha_d}\right), \\
p(t, III) &= \frac{\alpha_d \tau}{\alpha_d - \tau}\left(\frac{e^{-t\alpha_d} - e^{-t\alpha_u}}{\alpha_d - \alpha_u} - \frac{e^{-t\alpha_u} - e^{-t\tau}}{\alpha_u - \tau}\right), \\
p(t, IV) &= 1 - p(t, I) - p(t, II) - p(t, III)
\end{aligned} \tag{8}$$

under the assumption $\alpha_d \neq \tau \neq \alpha_u$. For the case $\alpha_d = \alpha_u$ as in the examples in Section 4.1.1 the solution is even simpler.

As an example of how performance models relate to the phased recovery model, consider the virtual connection 1 (VC1) in Fig. 4. In Fig. 6 link [2,10] fails. The failed link is part of VC1 and will cause 100% packet loss at this stage. As shown in Fig. 7, after some time, provided that an alternative path exists, the routing protocol has obtained a new path for VC1 that does not contain the failed link. The packet loss is then 0% again because buffers are infinite.
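The closed-form phase probabilities of Eq. (8) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names and the numeric rates are hypothetical, and the sketch assumes the three rates are pairwise distinct, since the expression for phase III divides by all three pairwise differences. A forward-Euler integration of the same four-phase chain (I to II with rate alpha_d, II to III with rate tau, III to IV with rate alpha_u) is included only as a cross-check.

```python
import math


def phase_probabilities(t, alpha_d, tau, alpha_u):
    """Closed-form p(t, y) for phases I..IV as in Eq. (8).

    Assumes alpha_d, tau and alpha_u are pairwise distinct.
    """
    if len({alpha_d, tau, alpha_u}) < 3:
        raise ValueError("closed form assumes pairwise distinct rates")
    e_d = math.exp(-t * alpha_d)
    e_r = math.exp(-t * tau)
    e_u = math.exp(-t * alpha_u)
    p1 = e_d
    p2 = alpha_d / (alpha_d - tau) * (e_r - e_d)
    p3 = (alpha_d * tau / (alpha_d - tau)) * (
        (e_d - e_u) / (alpha_d - alpha_u) - (e_u - e_r) / (alpha_u - tau)
    )
    p4 = 1.0 - p1 - p2 - p3
    return p1, p2, p3, p4


def phase_probabilities_euler(t, alpha_d, tau, alpha_u, steps=20000):
    """Forward-Euler integration of the same chain, used only as a cross-check."""
    h = t / steps
    p1, p2, p3, p4 = 1.0, 0.0, 0.0, 0.0  # the system starts in phase I
    for _ in range(steps):
        d1 = -alpha_d * p1
        d2 = alpha_d * p1 - tau * p2
        d3 = tau * p2 - alpha_u * p3
        d4 = alpha_u * p3
        p1, p2, p3, p4 = p1 + h * d1, p2 + h * d2, p3 + h * d3, p4 + h * d4
    return p1, p2, p3, p4


# Illustrative (hypothetical) rates: detection 1.0, repair 0.5, routing update 2.0
closed = phase_probabilities(2.0, alpha_d=1.0, tau=0.5, alpha_u=2.0)
numeric = phase_probabilities_euler(2.0, alpha_d=1.0, tau=0.5, alpha_u=2.0)
```

With these rates the closed form and the numerical integration should agree to within the Euler step error, and the probability of phase IV should approach 1 as t grows, consistent with phase IV being absorbing.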
Fig. 6. CTMC of the network example with two virtual connections when link [2,10] has failed and all packets are lost.

Fig. 7. CTMC of the network example with two virtual connections when link [2,10] has failed and packets are rerouted.

3.3.2. Tracing of recovery phases

In [7] the phased recovery is modeled by simply monitoring and recording the routing matrix, $R^{(v,p)}$, for virtual connection $v$ ($v = 1, \ldots, VC$), at phase $p$ ($p = 1, \ldots, PH$) at different time instances starting from the time $t = 0$ of the undesired event. The routing matrices can be recorded either from an operational network or from simulations. In Section 4 we use $R^{(v,p)}$ obtained from ns-2 simulation studies of a swarm-based path management system [36] as an example. The recovery model consists of $PH$ phases where each phase represents a network condition. Let $S_p = \{1, \ldots, PH\}$ be the state in the phased recovery model. The transient solution $P(S_p, t)$, $p = 1, \ldots, PH$, is obtained by solving the following equations

$$
\frac{d}{dt} P(S_p, t) = -\alpha_p P(S_p, t) + \alpha_{p-1} P(S_{p-1}, t), \tag{9}
$$

where each phase time is exponentially distributed with rate $\alpha_p$ ($\alpha_0 = 0$), and initial condition $P(S_1, 0) = 1$ and $P(S_p, 0) = 0$ for all $p > 1$.

3.4. Decomposed survivability models

3.4.1. Space decomposed model

It is challenging to obtain the transient probabilities, $\pi(t, y, \vec{x})$, because the state space becomes huge as the network size increases. We apply a space decomposition approximation and model the transient behavior in each node separately. The global probabilities are approximated
by the product $\pi(t, y, \vec{x}) = \pi_1(t, y, x_1) \cdots \pi_n(t, y, x_n)$ where $\pi_i(t, y, x_i)$ is the transient probability of $x_i$ packets in node $i$ at time $t$ in phase $y$. This is akin to a product-form solution of Jackson [27] or BCMP networks [28] as described in Section 3.2.1. The space decomposition splits the survivability model into independent node (and/or link) models and obtains the arrival rates $\Gamma_i$ to node $i$ by solving the set of traffic equations in (1). The node dynamics depend on whether a link connected to this node, or the node itself, has failed or not. To give an example, consider the 4 node case in Fig. 9 where node $j = 2$ has failed. For this case the following two CTMC models describe the failed node ($j = 2$) and the non-failed nodes ($i = 1, 3, 4$):

1. CTMC model for the node that has failed (see Fig. 8b). Immediately after the undesired event all the packets that are sent to node $j$ are lost, hence all transitions lead to state $(\mathrm{I}, n_j)$ where all resources are unavailable and no packets will be served.
2. CTMC model for the non-failed nodes (see Fig. 8a). Immediately after the undesired event the network state is changed from $(\mathrm{IV}, x_i)$ to $(\mathrm{I}, x_i)$. This means that no packets are lost but the arrival rates $\Gamma_i(\mathrm{IV})$ are changed to $\Gamma_i(\mathrm{I})$. For some nodes the arrival rates are unchanged, but for the nodes that used to receive packets from node $j$ the arrival rate is reduced. In phase II the rerouting is completed and the arrival rate is changed to $\Gamma_i(\mathrm{II})$ depending on the position of $i$ relative to $j$ and on the routing probabilities given by Eq. (1).

In the CTMC model of the failed node $j$ the $\pi_j(t, y, x_j)$ is obtained with the initial condition $\pi_j(0, \mathrm{I}, n_j) = 1$, while for the non-failed nodes ($i \neq j$) the $\pi_i(t, y, x_i)$ is obtained with the initial condition $\pi_i(0, \mathrm{I}, x_i) = \pi_i(x_i)$. The $\pi_i(x_i)$ are the steady-state probabilities in (3) from Section 3.2.1. Finally, the global state probabilities are obtained by the product-form approximation

$$
\pi(t, y, \vec{x}) \approx \prod_{i=1}^{n} \pi_i(t, y, x_i). \tag{10}
$$

There is no easy way to obtain closed form solutions of $\pi_i(t, y, x_i)$ from the models in Fig. 8a and b. Numerical solutions can, however, be obtained by means of tools like SHARPE [37,38] and SPNP [39] for rather large systems. As the size of the network node model increases, caused either by a more sophisticated recovery model or by increasing the number of buffers $n_i$, the solution becomes very resource demanding and slow.

Fig. 8. CTMCs in failure state.

3.4.2. Time–space decomposed model

The space decomposition improves the model scalability significantly. However, even the CTMC model of a single node (link) might be too complex for a symbolic closed form solution, and too large for a numerical solution. This section proposes time decomposition [40,41] where we assume steady-state performance in each state, which requires only a transient solution of the phased recovery model.

The time decomposition is a decoupling of the performance and recovery models. This means that the steady-state probabilities in the performance models and the transient solution of the phased recovery model are obtained separately and independently of each other, and $\pi_i(t, y, x_i) \approx \pi(t, y) \cdot \pi_i(x_i, y)$. The transient probabilities $\pi(t, y)$ are from (8) while the steady-state probabilities $\pi_i(x_i, y)$ are the $\pi_i(x_i)$ from (3) for different phase-dependent arrival rates $\Gamma_i(y)$ and a phase-dependent routing matrix. Observe that we assume steady-state performance in each phase. The approximation is good when, upon a phase change, the steady-state performance in the new phase is reached quickly compared to the duration of the phase. This is the case in our network models when $(\Gamma_i, \mu_i) \gg (\alpha_d, \tau, \alpha_u)$. The quality of the time decomposed Markov model approximation depends on the degree of coupling between the "performance block" and "phase block" in our Markov matrix [40,42,43], as justified by the discussion of typical system parameters in Section 3.5 and validated by numerical results in Section 4.1.1. Then, the global state probabilities are obtained by approximation using both the space decomposition from (10) and the time and space decomposition

$$
\pi(t, y, \vec{x}) \approx \prod_{i=1}^{n} \pi_i(t, y, x_i) \approx \pi(t, y) \cdot \prod_{i=1}^{n} \pi_i(x_i, y). \tag{11}
$$

When the phased recovery model is as simple as the example in Fig. 5 or is based on actual traces as in Section 3.3.2 we have a closed form solution. This enables efficient evaluation of very large networks with large, and even infinite, server and buffer capacities, $n_i$. With a much more complex phased recovery model where a closed form transient solution is hard or impossible, the time decomposition approach is still advantageous since the numerical solution is significantly faster compared to the node model in Section 3.4.1 because the number of states is reduced.

The time decomposition allows the performance metrics to be obtained by independent determination of the performance $P_{Q^{(v,p)}}(S, u)$ of each virtual connection at each phase, with the generator matrix $Q^{(v,p)}$ from Section 3.2.3, and the transient probabilities $P(S_p, t)$ from (9). This reduces the computational complexity significantly and allows more phases to be included in the recovery model, while at the same time the number of nodes and links in the network may be increased.

3.5. Model scalability and assumptions

The main purpose of the approximations proposed in this paper is to reduce the computational effort of obtaining transient solutions in large network models without an undue loss in accuracy. The assumptions made are discussed below and the accuracy is demonstrated by two small network models in Section 4.1.1.

The underlying CTMC model of the exact stochastic reward net has a state space that is proportional to $\prod_{i=1}^{n} n_i \cdot n_p$ where $n_p$ is the number of phases and $n$ is the number of network components, i.e. nodes and/or links. The space decomposed CTMC model will reduce the state space of the transient solution to $\sum_{i=1}^{n} n_i \cdot n_p$, while for the time–space decomposed model a transient model with only $n_p$ states needs to be solved.

The simulation model is considered to be attractive when the analytical model fails due to too restrictive modeling assumptions or intractable or inefficient solutions. The strong side of the simulation approach is that an arbitrary level of detail can be applied to suit the study of interest. In many cases it is easy to change the modeling assumptions and to make the model as close to reality as required. However, it is important to point out that the efficiency of the simulation approach depends on the network size and the stiffness of the model, i.e., the ratio between the packet arrival and departure rates, and the repair and rerouting rates [44]. In addition, simulations become inefficient when the events of importance to the metrics of interest are infrequent, e.g., when packet losses are very rare. Then, numerical solutions might become less computer intensive compared to simulation unless some rare event simulation technique like importance sampling, RESTART or importance splitting [45] can be applied.

The routing probabilities in the examples are taken from ns-2 simulations but might be taken from real routers, or we may make a phase type model of the updates of the routing probabilities.

In the network models in this paper we have assumed that external packet arrivals to the source of a virtual connection form a Poisson process. If a bursty arrival process is required we may use an MMPP or MAP type process [29], which will produce larger CTMC models. The packet service time distribution is assumed to be an exponential distribution. The service time distribution is influenced by a combination of the packet size distribution, the aggregation level, and the header processing time. Non-exponential empirical distributions are observed, for instance in [46] where the empirical distribution of single router service time was fitted to a Weibull distribution. A phase-type approximation of the Weibull distribution will once again give larger CTMC models.

In our space decomposition model we assume independence between the network nodes. Independence is not a fully realistic assumption but a good approximation in networks with low loss probability and with high aggregation, i.e., with multiplexing of a large number of connections.

We have decoupled the performance and recovery models. We assume that in each phase we have steady-state performance. This approximation is good when the steady-state performance in a phase is reached quickly after the change of phase compared to the expected duration of the phase. As a rule of thumb, the approximation is good if there is at least two orders of magnitude difference between the time granularity of the events in the performance model and in the recovery model. E.g. in medium loaded (30–50%) high capacity networks (100 Mbit/s–10 Gbit/s) you will observe 3–300 packets/ms, while the routing, rerouting and repair (at IP level) time is in the order of hundreds of ms. This means a few hundred to several thousand packets are expected in each phase.

The phase time distribution in the recovery model is for simplicity assumed to be exponentially distributed but any phase type distribution applies. General distributions can be accommodated using semi-Markov models [37].
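The time–space decomposition of Section 3.4.2 can be illustrated for a single node. The sketch below is a hedged Python illustration, not the authors' code. It solves the phase chain of (9) by uniformization (a standard numerical technique swapped in here because the closed form (8) is undefined when $\alpha_d = \alpha_u$) and weights per-phase steady-state blocking probabilities with the resulting phase probabilities. Assumptions: the standard M/M/1/n blocking formula stands in for the steady-state result (3), $n_i$ is read as the total capacity of the node, the arrival rates are those listed for node 4 in Table 1 of Section 4.1.1, and all function names are hypothetical. The quantity computed is a node-level blocking probability, not the virtual-connection metric of (5).

```python
import math

def phase_probabilities(rates, t, tol=1e-12, max_terms=100_000):
    """Transient probabilities of the chain 1 -> 2 -> ... -> last phase
    (Eq. (9)) by uniformization. rates[p] is the exit rate of phase p;
    the last phase is absorbing. The system starts in the first phase."""
    m = len(rates) + 1
    lam = max(rates)                                  # uniformization rate
    stay = [1.0 - r / lam for r in rates] + [1.0]     # self-loop probabilities
    move = [r / lam for r in rates]                   # step to the next phase
    vec = [1.0] + [0.0] * (m - 1)                     # embedded chain after k jumps
    weight = math.exp(-lam * t)                       # Poisson weight for k = 0
    result = [weight * v for v in vec]
    acc, k = weight, 0
    while 1.0 - acc > tol and k < max_terms:
        k += 1
        vec = [stay[0] * vec[0]] + [
            stay[p] * vec[p] + move[p - 1] * vec[p - 1] for p in range(1, m)
        ]
        weight *= lam * t / k
        acc += weight
        result = [res + weight * v for res, v in zip(result, vec)]
    return result

def mm1n_loss(arrival, service, capacity):
    """Blocking probability of an M/M/1/n queue holding at most `capacity` packets."""
    rho = arrival / service
    if rho == 1.0:
        return 1.0 / (capacity + 1)
    return (1.0 - rho) * rho**capacity / (1.0 - rho ** (capacity + 1))

def node_loss(t, rates, arrivals, service, capacity):
    """Time-decomposed blocking probability: phase probabilities at time t
    weighted with the steady-state blocking probability of each phase."""
    probs = phase_probabilities(rates, t)
    return sum(p * mm1n_loss(a, service, capacity) for p, a in zip(probs, arrivals))

RATES = [0.01, 0.001, 0.01]                  # alpha_d, tau, alpha_u (Section 4.1.1)
NODE4_ARRIVALS = [31.0, 69.1, 69.1, 77.9]    # Gamma_4 in phases I, II, III, IV (Table 1)
loss_200 = node_loss(200.0, RATES, NODE4_ARRIVALS, service=100.0, capacity=4)
```

Here `t` is the time elapsed since the undesired event. Only a four-state transient model is solved, independent of the buffer size, which is the scalability argument made in Section 3.5.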
The simulation model assumptions are the same as the assumptions in the SRN model, in order to cross-validate the two and to validate the following model decomposition approximations. The assumptions in the simulation model are easily relaxed, e.g., by changing to a general arrival process or general service time distributions.

4. Experimental results

In this section several communication networks with a set of virtual connections are used to demonstrate the use of the survivability quantification approach outlined in the previous sections. The analytic evaluations of the stochastic reward net are conducted both in SHARPE [37,38] and SPNP [39], while the CTMC models are numerically solved in SHARPE and the closed form solution is proof-checked in Mathematica.²

There are two simulation models, one process-oriented model implemented using the programming language SIMULA [47] with the DEMOS (discrete event simulation on SIMULA [48]) class library, and another model implemented in the network simulator ns-2 [49]. The DEMOS simulator is customized to validate the decomposition assumptions by implementing the survivability models without the node independence assumptions and including all details in the transient period under changing conditions. The ns-2 simulator is used for validating the analytic model against a real-sized network model with detailed protocol behaviors. E.g. it includes a detailed description of IP packet forwarding and link state and swarm routing, and has non-exponential service times since each packet has a fixed size that is unchanged through the network.

In this paper, the overall packet load in the network is not increased after a failure, only increased on selected nodes and links due to rerouting of packets. An overall increase of packet load is relevant for transport and link protocols with retransmission and can easily be added to the performance models of each phase. What is more difficult is to model congestion-aware protocols that will decrease the packet load (as observed in TCP). This can be done in the same manner as the load increase as long as we can average the decrease over many connections and slow-start window cycles.

4.1. Validation of decomposition

4.1.1. Network examples

The decomposition of the analytical model presented in the previous sections is cross-validated with a comprehensive analytical model without decomposition, and with a simulation model. Two small networks with 4 and 10 nodes have been defined for this purpose.

Four node network. The first example is a network with $n = 4$ nodes as depicted in Fig. 9. The performance of the virtual connection between $s = 1$ and $d = 4$ is evaluated after the failure of node 2 at time $t = 500$. Each node $i$ is an M/M/1/$n_i$ system with the parameters given in Table 1. The parameters in the phased recovery model are $\alpha_d = \alpha_u = 0.01$ and $\tau = 0.001$.

Fig. 9. Network example with 4 nodes.

Table 1
Parameters in network with 4 nodes.

$i$   $n_i$   $\mu_i$   $\Gamma_i(\mathrm{IV})$   $\Gamma_i(\mathrm{I})$   $\Gamma_i(\mathrm{II})$   $\Gamma_i(\mathrm{III})$
1     10      100.0     80.0     80.0     80.0     80.0
2     8       100.0     46.9     0.0      0.0      0.0
3     10      100.0     31.2     31.2     78.1     78.1
4     4       100.0     77.9     31.0     69.1     69.1

Ten node network. The second example is a network with $n = 10$ nodes. The directed graph $\vec{G}_{[1,10]}$ for routing virtual connections between $s = 1$ and $d = 10$ is depicted in Fig. 10. The performance of the virtual connection is evaluated after the failure of node 4 at time $t = 500$. Again each node is an M/M/1/$n_i$ system with the parameters given in Table 2. The parameters in the phased recovery model are $\alpha_d = \alpha_u = 0.01$ and $\tau = 0.001$.

Fig. 10. Network example with 10 nodes.

4.1.2. Exact network survivability models

Two different modeling approaches have been applied to determine the exact transient probabilities, $\pi(t, y, \vec{x})$, and the four performance metrics from Section 3.2.2. Both the simulation and stochastic reward net (SRN) models assume exponentially distributed inter-event times and time-independent but phase-dependent routing probabilities. This allows cross-validation and comparison of their solution efficiency. The reason for the use of SRN is to simplify the tedious and error-prone task of CTMC construction.

² http://reference.wolfram.com/mathematica/guide/Mathematica.html
Table 2
Parameters in network with 10 nodes.

$i$   $n_i$   $\mu_i$   $\Gamma_i(\mathrm{IV})$   $\Gamma_i(\mathrm{I})$   $\Gamma_i(\mathrm{II})$   $\Gamma_i(\mathrm{III})$
1     50      100.0     80.0     80.0     80.0     80.0
2     50      100.0     33.6     33.6     62.2     62.2
3     50      100.0     26.5     26.5     49.2     49.2
4     50      100.0     47.8     0.0      0.0      0.0
5     50      100.0     7.1      7.1      13.1     13.1
6     50      100.0     16.7     16.7     30.8     30.8
7     50      100.0     4.5      4.5      14.3     14.3
8     50      100.0     35.0     21.1     45.1     45.1
9     50      100.0     11.0     11.0     34.9     34.9
10    50      100.0     80.0     32.2     80.0     80.0

4.1.2.1. Stochastic reward net (SRN) model. This is a powerful paradigm for modeling and evaluation of the performance of networks. Fig. 11a shows the SRN model of the 4 node network of Fig. 9 that is used as a case study in Section 4.1.3. The packets are tokens that are generated by the timed transition "arrival" into the place "InQ1". If there are fewer than $n_1$ tokens in place "Node1" the immediate transition "Q1" is enabled. If not, the "loss1" transition is enabled; upon its firing, the token is removed and a packet loss is counted. The same structure is replicated for each node. The routing is determined by probabilities on the immediate transitions out of the place "OutQ1". The rerouting, failure and repair cycle is modeled at the top of the SRN model in Fig. 11a, where the tokens in the "phase y" places constrain the token passing of the failed node (Node 2 in this example) through guard functions or inhibitor arcs as illustrated in the figure.

The performance metrics in the SRN model are obtained by assigning reward rates that depend on the markings in different places. However, to obtain the analytical transient probabilities $\pi(t, y, \vec{x})$ from the SRN model a complete multidimensional CTMC model is generated and solved. This is computationally demanding and the cost increases exponentially both with the number of buffer positions and with the number of places. Decomposition of the SRN model as proposed in [50] utilizes near-independence between nodes. A similar approach is taken as the first step in Section 3.4.1 to avoid the largeness problem.

4.1.2.2. Simulation model. The process-oriented discrete event simulation model is shown in Fig. 11b. The source process generates packets at the ingress router. The handling of a packet in a node is modeled as a node process that describes the packet life cycle, which may be interrupted by a failure process at instances of undesired events. In each node, when the "InBuffer" contains at least one packet (token) the node proceeds to the next step and checks if the number of tokens exceeds the maximum buffer size and if the node is currently working. Then the packet is served and sent to the next node by a random selection among the currently available buffers. If no buffers are available due to failure and all routing probabilities are 0, then the packet is counted as lost. The failure is modeled to the right in the figure. At the instant of a failure (phase I) the "working" attribute of the "Node j" process is set to "false". After the rerouting time (indicated as a double lined rectangle) all routing probabilities into the failed node are set to 0 (phase II). After repair and rerouting (phase III), the routing probabilities are restored to their initial values and the "working" attribute is set to "true" again (phase IV).

The performance metrics in the simulator are obtained by a measurement process that reads counters at regular time intervals $\Delta t$. Let $c_{r,i,t}$ be the value of the counter in node $i$ at time $t$ in simulation replication $r$. The average at time $t$ over $R$ transient simulation replications is estimated by

$$
\bar{C}_t = \frac{1}{R} \sum_{r=0}^{R} \sum_{i=1}^{n} \frac{c_{r,i,t} - c_{r,i,t-\Delta t}}{\Delta t}. \tag{12}
$$

The average packet loss rate $\bar{L}_t$ and number of packets $\bar{N}_t$ are obtained by (12) where the counter $c_{r,i,t}$ is the number of losses and the number of packets in node $i$, respectively. The packet loss probability is estimated by $l_t = \bar{L}_t / \gamma$ and the average delay of served packets by $\bar{D}_t = \bar{N}_t / (\gamma (1 - l_t))$. The latter is biased because it is the ratio of two estimators.

Fig. 11. Complete model.
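The counter-based estimator in (12) and the derived loss probability and delay estimates can be sketched as follows. This is a minimal Python illustration under stated assumptions: the nested-list layout `counters[r][i][k]` (replication, node, sampling instant), the function names and the toy numbers are hypothetical and not taken from the paper's simulator, the average is taken over the replications supplied, and `gamma` denotes the offered packet rate.

```python
def average_rate(counters, k, dt):
    """Eq. (12): counter increment per unit time over the interval ending
    at sample index k, summed over nodes and averaged over replications."""
    total = 0.0
    for replication in counters:           # loop over simulation replications
        for node_series in replication:    # loop over nodes
            total += (node_series[k] - node_series[k - 1]) / dt
    return total / len(counters)

def loss_probability(loss_rate, gamma):
    """Estimated loss probability: loss rate divided by the offered rate."""
    return loss_rate / gamma

def mean_delay(mean_packets, gamma, loss_prob):
    """Little's law applied to the packets that are actually served."""
    return mean_packets / (gamma * (1.0 - loss_prob))

# Toy cumulative loss counters: 2 replications, 2 nodes, 3 sampling instants.
counters = [
    [[0, 4, 10], [0, 1, 3]],
    [[0, 6, 12], [0, 3, 5]],
]
loss_rate = average_rate(counters, k=2, dt=1.0)
l_t = loss_probability(loss_rate, gamma=80.0)
d_t = mean_delay(mean_packets=4.5, gamma=80.0, loss_prob=l_t)
```

Because `d_t` divides one estimate by another, it inherits the ratio-estimator bias noted above; averaging per-replication ratios instead would trade that bias for higher variance.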


Confidence intervals are also computed using well-known methods.

4.1.3. Four node network

The 4 node network example is studied by simulations and all three analytical approaches. The estimated performance metrics from $R = 90$ simulation replications (Simulations) are compared against the analytical values of the stochastic reward net model solved by SPNP (SRN model). The loss probability and the average number of packets in the system at different times $t$ are shown in Fig. 12a and b, respectively. The space and time–space decomposed models are indistinguishable so they are represented by only one curve (Decomposed CTMC model).

The main observations from the experiments are that the simulations and SRN models show a perfect fit, as expected since the modeling assumptions are identical although the modeling approaches are different, and that the decomposed CTMC models capture the transient performance very well. The time–space decomposed model with the closed form solution is a significant simplification that enables studies of large networks with many, and even infinite, buffers. The results show that when the transient performance is dominated by impairments in a single node this decomposed, product form approximation is a viable approach.

4.1.4. Ten node network

The 10 node network example is studied by simulations and one analytic approach. The estimated performance metrics from 90 simulation replications (Simulations) are compared with the analytical values of the time–space decomposed model (Decomposed CTMC model). The results are given in Fig. 13a and b, for the loss probability and average number of packets in the system at different times $t$, respectively. The results show that the closed form solution from Section 3.4.2 captures the transient performance very well.

Fig. 13a also includes a "rerouting model" which is $r(t) = q(1,4)\, e^{-\alpha_d t}$, i.e. the probability that a packet is lost in the failed node at time $t$ after the instant of failure. The $q(1,4) = r_{14} + r_{12} r_{23} r_{34}$ is the probability of visiting node 4 when starting from node 1 in the directed graph $\vec{G}_{[1,10]}$ in Fig. 10. The $r_{ij}$ is the routing probability, i.e. the probability of sending a packet from node $i$ to node $j$, given that the packet leaves node $i$. The results in Fig. 13a show almost perfect overlap between $r(t)$ and $l(t)$ from (5), which means that with very low steady-state packet loss probability, the transient loss probability is dominated by $q(1,4)$ with decay rate $\alpha_d$, the reciprocal of the expected rerouting time $1/\alpha_d$. The same is not observed in the 4 node network because there the steady-state loss probability is not negligible.

Fig. 12. Performance in 4 node network. [Plots versus time since the undesired event; curves: Decomposed CTMC model, SRN model, Simulations.]

Fig. 13. Performance in 10 node network. [Plots versus time since the undesired event; curves: Decomposed CTMC model, Rerouting model, Simulations.]

4.2. Real-sized network cases

In order to demonstrate the feasibility of the survivability quantification approach outlined in the previous sections, the performance of 20 virtual connections in a real-sized network with three different failure scenarios is considered.

4.2.1. Network and failure scenarios

The network scenario used in our experiments is a fictitious research intranet between five hospitals located in five major Norwegian cities: Oslo, Bergen, Trondheim, Stavanger and Tromsø. The intranet provides a transport service for high definition audio and video streaming to support cooperative surgical procedures. A fully meshed network between the five cities requires ten bi-directional or 20 uni-directional virtual connections. The research network is supposed to be implemented over the non-commercial Internet in Norway, which is provided and operated by Uninett (www.uninett.no). The network topology with 58 nodes and 81 duplex links is shown in Fig. 14. The following failure scenarios are defined (see Fig. 14 for reference).

1. Single link failure: the link [4,5] fails and the traffic is rerouted via node 33 or 19 and 46.
2. Hurricane: six nodes and nine links fail as a result of a hurricane on the coastline and a significant rerouting needs to take place for all but three VCs.
3. Transient common failure: 60 network interfaces, i.e., 30 links, are affected by a common transient failure that takes down and brings up the interfaces on average every 0.5 s.

The routing matrices $R^{(v,p)}$ used in the comparisons are obtained from ns-2 simulations of the failure scenarios. The routing matrix is sampled at different stages in the failure, rerouting and restoration process represented by phases. The first phase is immediately after the link failure, followed by five phases where the routing matrix changes as rerouting and link restoration take place. The stochastic routing scheme applied in the experiments is a distributed,

Fig. 14. Simulated network with 58 nodes and 20 virtual connections. Three different failure scenarios are considered: (i) single link failure, (ii) hurricane, (iii) common transient failure.
autonomous, swarm-based approach called cross entropy ant system (CEAS) [6,51]. The routing matrix was fed into a simplified simulation implemented in DEMOS/Simula and in the analytic approach described in this paper.

Fig. 15. Loss probability $L(t)$ at time $t$ after the failure.

Fig. 16. Delay distribution at time $t$ after the failure; all exponential.

Fig. 17. Delay distribution at $t = 0.1$ after the failure; non-exponential.


4.2.2. Single link failure

4.2.2.1. Simulation and analytical results. The results in Fig. 15 show the simulated and analytical loss probabilities using routing matrices $R^{(v,p)}$ with CEAS routing. The simulation results from 90 replications are plotted for different times $t$ since the undesired event occurred, with the average and standard error. The loss probabilities $L(t)$ from the simulations and the analytical solution show very good correspondence.

In Fig. 16 the delay distribution $D(t, u)$ of non-lost packets is given at $t = 0.1$ and 50, comparing 30 simulation replications with analytic results assuming all exponential distributions. The delay distributions obtained from the simulations and the analytic model show very good correspondence for this case. This means that the scenario is close to meeting the assumptions given in Section 3.2.3, in particular that the virtual connections are approximately overtake-free. In Fig. 17a the service time is Pareto and in Fig. 17b the interarrival time is Pareto. By changing the service and interarrival times it is observed that the approximation is less accurate and that we should consider non-Markovian approximations.

Solving the analytic model is rather efficient. As an example, the scenario in this paper was implemented in Mathematica and solved in less than ten seconds, while the simulation took almost four minutes per replication, or almost two hours to produce 30 replications. The simulation cost will increase rapidly as the number of simulated packets increases unless we take recourse to fluid-flow approximations.

Fig. 18. Comparison between link state routing and swarm-based routing of loss probabilities.

Fig. 19. Link state routing versus swarm-based routing.

4.2.2.2. Comparisons of two path management methods. To demonstrate the applicability of the method presented in this paper, link state (LS) routing and cross entropy ant system (CEAS) are compared. In Fig. 18 the loss probabilities are compared. Observe that the loss probability is higher just after a failure for LS because temporarily six out of 20 virtual connections are disconnected and all packets are lost until rerouting takes effect, while for the stochastic routing scheme in CEAS alternative paths typically exist


so even though the performance is significantly reduced, The average delay values obtained by simulation and
neither of the virtual connections have 100% loss probabil- analytical model match well for all VCs even though they
ity in the critical period but are still connected. are differently influenced by the failures. The analytical
In Fig. 19 the complementary delay distributions Dðt;  uÞ model takes routing probabilities from the simulator as in-
in LS and CEAS are compared. The relation between LS and put, which means that the network conditions are the
CEAS changes as the time t since the undesired event elapses. Just after the failure CEAS is clearly better for the reasons explained above, while as the network is restored the delay of LS with shortest path routing is shorter than that of CEAS with stochastic routing, as can be seen by comparing the two complementary delay distributions, $\bar{D}_{LS}(t; u) \le \bar{D}_{CEAS}(t; u)$ for all u at some time t after the failure. In the example the network is lightly loaded, while in a heavily loaded network the delay of the shortest path routing will increase more than that of the stochastic routing, which will have alternative paths for load sharing.

4.2.3. Hurricane
In the hurricane failure scenario a large block of links and nodes fails. In Fig. 14 this is indicated as case (ii). When the failure occurs, all but three virtual connections need to be rerouted, and most of them will experience a significant increase in the delay after rerouting is completed. In Fig. 20 the average delay as a function of time for a selection of four VCs is shown. The failure occurs at time t = 40 and is repaired at time t = 80. The dotted (blue) lines are results from ns-2 simulations, while the solid (green) line is the delay obtained by the analytical model with the routing probabilities from the simulations as input.
same in the simulator and the analytical model at the time instant when the routing probabilities are sampled. The simulation and analytical results show very good correspondence, which demonstrates that the transient effect after changes in conditions is short-lived compared to the phase sojourn times. This means that the time decomposition is a good approximation. Furthermore, it also means that the node independence assumption taken in the space decomposition in the analytical model works well.

4.2.4. Transient common failure
In the transient common failure scenario a large block of network interfaces (and corresponding links) is affected by high-frequency, short-lived failures. In Fig. 14 this is indicated as case (iii). When the failure period starts (at t = 40), all but three virtual connections are affected and a large number of reroutings take place (every 0.5 s) until the failure period ends at time t = 80. In Fig. 21 the average delay as a function of time for a selection of four VCs is shown.
Again, the average values obtained by simulation and by the analytical model match fairly well and confirm the validity of the time and space decomposition in the analytical model.
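The time and space decomposition discussed in these scenarios can be illustrated with a minimal sketch. It assumes, purely for illustration, that every node on a virtual connection's route behaves as an independent M/M/1 queue, so that the mean end-to-end delay within a phase is the sum of 1/(mu - lambda) over the hops, and that the route is constant within each phase (before the failure, between the failure at t = 40 and the repair at t = 80, and after the repair). The hop counts and the arrival and service rates are hypothetical and are not taken from the paper; the names `Phase`, `path_delay` and `mean_delay_at` are likewise illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phase:
    """One phase of the time decomposition (hypothetical values only)."""
    start: float                     # time at which the phase begins
    hops: List[Tuple[float, float]]  # (arrival rate, service rate) per node on the route


def mm1_sojourn(lam: float, mu: float) -> float:
    """Mean sojourn time of a stable M/M/1 queue (assumed node model)."""
    if lam >= mu:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    return 1.0 / (mu - lam)


def path_delay(hops: List[Tuple[float, float]]) -> float:
    """Space decomposition: nodes are treated as independent, so means add up."""
    return sum(mm1_sojourn(lam, mu) for lam, mu in hops)


def mean_delay_at(t: float, phases: List[Phase]) -> float:
    """Time decomposition: use the stationary delay of the phase active at time t."""
    active = phases[0]
    for phase in sorted(phases, key=lambda p: p.start):
        if phase.start <= t:
            active = phase
    return path_delay(active.hops)


# Hypothetical virtual connection: 3 hops normally, 6 hops while rerouted.
PHASES = [
    Phase(start=0.0, hops=[(200.0, 1000.0)] * 3),   # before the failure
    Phase(start=40.0, hops=[(200.0, 1000.0)] * 6),  # failure period, rerouted
    Phase(start=80.0, hops=[(200.0, 1000.0)] * 3),  # after the repair
]

if __name__ == "__main__":
    for t in (10.0, 60.0, 100.0):
        print(f"t = {t:5.1f}  mean delay = {mean_delay_at(t, PHASES):.5f}")
```

The sketch ignores the transient between phases, which is the approximation that the time decomposition makes; it is only reasonable when that transient is short compared to the phase sojourn times.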

Fig. 20. Delay for a selection of virtual connections in the hurricane case. (Four panels: VC1, VC7, VC13 and VC17; average delay from 0 to 0.010 plotted against time from 0 to 120.)
Fig. 21. Delay for a selection of virtual connections in the common transient failure case. (Four panels: VC1, VC7, VC13 and VC17; average delay from 0 to 0.010 plotted against time from 0 to 120.)
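The comparison of complementary delay distributions used in Section 4.2.2, where shortest path routing is compared with stochastic routing, can be sketched numerically. The example below is a simplified illustration and not the paper's model: it assumes that every hop has an independent, exponentially distributed sojourn time with the same rate, so that the end-to-end delay on an n-hop route is Erlang distributed, and that stochastic routing spreads traffic over routes of different length with fixed probabilities. The rate, the route lengths and the routing probabilities are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def erlang_ccdf(u: float, stages: int, rate: float) -> float:
    """P(delay > u) for a sum of `stages` i.i.d. exponential(rate) sojourn times."""
    x = rate * u
    return math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(stages))


def routing_ccdf(u: float, routes: Sequence[Tuple[float, int]], rate: float) -> float:
    """Complementary delay distribution for routes given as (probability, hop count)."""
    return sum(prob * erlang_ccdf(u, hops, rate) for prob, hops in routes)


def never_exceeds(a: Callable[[float], float], b: Callable[[float], float],
                  grid: List[float], tol: float = 1e-12) -> bool:
    """Check a(u) <= b(u) at every grid point u."""
    return all(a(u) <= b(u) + tol for u in grid)


RATE = 800.0                          # hypothetical per-hop rate (mu - lambda)
LS_ROUTES = [(1.0, 3)]                # shortest path: always the 3-hop route
CEAS_ROUTES = [(0.7, 3), (0.3, 5)]    # stochastic: 30% of traffic on a 5-hop route
GRID = [i * 0.0005 for i in range(41)]  # delay thresholds u from 0 to 0.02


def ls_ccdf(u: float) -> float:
    return routing_ccdf(u, LS_ROUTES, RATE)


def ceas_ccdf(u: float) -> float:
    return routing_ccdf(u, CEAS_ROUTES, RATE)


if __name__ == "__main__":
    print("LS delay CCDF <= CEAS delay CCDF on the grid:",
          never_exceeds(ls_ccdf, ceas_ccdf, GRID))
```

Under these assumptions the inequality holds at every threshold because a route with more identical hops has a stochastically larger delay. In a heavily loaded network the per-hop rates would differ between the two routing schemes, and the ordering could change.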

5. Closing remarks

The time–space decomposed analytical model for network survivability quantification that is outlined in this paper is a significant simplification that enables us to model huge networks with large and even infinite buffers. The model is applied to estimate the expected loss ratio, throughput and delay (mean and distribution) of virtual connections in a communication network. This quantifies the survivability of a network exposed to different failure scenarios. We have cross-validated our analytical and simulation models, and checked the approximations of the decomposed analytical models. The results from the survivability studies show that when the transient performance impairment is dominated by a failure event, the decomposed, product form approximation is a viable approach. The analytical model is also cross-validated against simulations of real-sized networks, where the routing probabilities are imported from ns-2 simulations before and after a failure, rerouting and repair. The routing probabilities can also be obtained from operational networks. The simulation and analytical results show very good correspondence when the times in the network are exponentially distributed. In case of non-exponential times a semi-Markov approach should be taken to improve the approximation.
In addition, the comparison between shortest path routing (link state) and stochastic routing (here represented by swarm-based routing) should be extended, e.g., to investigate how they compare when the traffic load increases and when overload might occur and cause rerouting, or with traffic differentiation with priority. The failure and repair cycle outlined in this paper was considered to be a natural first approach, but extensions such as cascading effects, failure propagation, and imperfect coverage and repair will be considered. In addition, in real-sized networks there is a huge number of combinations of outages, dependent on the cause of failure (the threat modeled). A series of more extensive failure scenarios with major outages is also included, which confirms the applicability of the time and space decomposition in the analytical model.
Future work includes extensions of the analytical model to improve correspondence in cases with non-exponential interarrival and service times through semi-Markov modeling.
In the phased recovery model this is relatively easy as it will generally be an acyclic model. SHARPE can easily solve such models (assuming that state sojourn time distributions are Coxian) [37]. Non-exponential interarrival and service time distributions can be modeled by an extended queueing network analyzer (QNA) [52] for steady state analysis, and the distribution of the end-to-end delay can be computed with the extension outlined in [30]. Modeling of complex rerouting and recovery strategy descriptions will be developed as an alternative to importing routing probabilities from the simulations. Extensive simulations are ongoing and planned with different network topologies, new failure scenarios with inclusion of failure propagation, and MPLS-TE traffic differentiation.

Acknowledgements

The authors would like to thank Dr. Otto J. Wittner at the Q2S center, Norwegian University of Science and Technology (NTNU), for providing the ns-2 simulation results that have been used to compare the analytical survivability models with simulations. Trivedi's research was supported in part by the US National Science Foundation under grant NSF-CNS-08-31325.

References

[1] National strategy for the physical protection of critical infrastructures and key assets (Online). Available from: <http://www.dhs.gov/xlibrary/assets/Physical_Strategy.pdf> (Accessed 16.2.2009).
[2] The risks digest, vol. 19(72), May 1998.
[3] J. Duffy, Cisco routers caused major outage in japan: report (Online). Available from: <http://www.networkworld.com/news/2007/051607-cisco-routers-major-outage-japan.html> (Accessed 16.2.2009).
[4] P. Cholda, A. Mykkeltveit, B. Helvik, O. Wittner, A. Jajszczyk, A survey of resilience differentiation frameworks in communication networks, Communication Surveys Tutorials 9 (4) (2007) 2–30.
[5] Q. Gan, B.E. Helvik, Dependability modelling and analysis of networks as taking routing and traffic into account, in: Proceedings of the Second EuroNGI Conference on Next Generation Internet Design and Engineering, IEEE, Valencia, Spain, 2006 (3–5 April).
[6] P.E. Heegaard, K.S. Trivedi, Survivability quantification of communication services, in: The 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Anchorage, Alaska, USA, 2008 (June 24–27).
[7] P.E. Heegaard, K.S. Trivedi, Survivability quantification of real-sized networks including end-to-end delay distributions, in: The Third International Conference on Systems and Networks Communications (ICSNC), Sliema, Malta, 2008 (October 26–31).
[8] R.J. Ellison, D.A. Fischer, R.C. Linger, H.F. Lipson, T. Longstaff, N.R. Mead, Survivable network systems: an emerging discipline, CMU/SEI, Tech. Rep. CMU/SEI-97-TR-013, 1997 (November).
[9] J. Knight, K. Sullivan, On the definition of survivability, Department of Computer Science, University of Virginia, Tech. Rep. CS-00-33, 2000 (December).
[10] S.C. Liew, K.W. Lu, A framework for characterizing disaster-based network survivability, IEEE Journal on Selected Areas in Communications 12 (1) (1994) 52–58.
[11] A. Zolfaghari, F.J. Kaudel, Framework for network survivability performance, IEEE Journal on Selected Areas in Communications 12 (1) (1994) 46–51.
[12] J. Knight, E.A. Strunk, K.J. Sullivan, Towards a rigorous definition of information system survivability, in: DISCEX, Washington DC, 2003.
[13] B. Jäger, J. Doucette, D. Tipper, Network survivability, in: Y. Qian, J. Joshi, D. Tipper, P. Krishnamurthy (Eds.), Information Assurance Dependability and Security in Networked Systems, Elsevier, 2007.
[14] N.R. Mead, R.J. Ellison, R.C. Linger, T. Longstaff, J. McHugh, Survivable Network Analysis Method, CMU/SEI, Tech. Rep. 013, 2000.
[15] M. Pioro, D. Medhi, Routing, Flow and Capacity Design in Communication and Computer Networks, Ser. ISBN 0125571895. Morgan Kaufmann Publishers, 2004.
[16] L. Guo, A new and improved algorithm for dynamic survivable routing in optical WDM networks, Computer Communications 30 (6) (2007) 1419–1423.
[17] E. Modiano, A. Narula-Tam, Designing survivable networks using effective routing and wavelength assignment (rwa), Optical Fiber Communication Conference and Exhibit, OFC 2001, vol. 2 (2001) TuG5-1–TuG5-3.
[18] Y. Zhu, R. Lin, Dynamic survivable routing in wdm networks with shared risk link groups, in: Network Architectures, Management, and Applications III, Shanghai, China, 2005 (7–10 November).
[19] M.S. Deutsch, R.R. Willis, Software quality engineering: a total technical and management approach, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[20] E. Mannie, D. Papadimitriou, Recovery (protection and restoration) Terminology for Generalized Multi-protocol Label Switching, IETF, Tech. Rep. IETF RFC-4427, 2006 (March).
[21] Y. Liu, K.S. Trivedi, Survivability quantification: the analytical modeling approach, International Journal of Performability Engineering 2 (1) (2006) 29–44. http://www.ee.duke.edu/kst/surv/IoJP.pdf.
[22] ANSI T1A1.2 Working Group on Network Survivability Performance, Technical Report on Enhanced Network Survivability Performance, ANSI, Tech. Rep. TR No. 68, 2001 (February).
[23] D.-Y. Chen, S. Garg, K.S. Trivedi, Network survivability performance evaluation: a quantitative approach with applications in wireless ad-hoc networks, in: ACM International Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM'02), ACM, Atlanta, GA, 2002 (September).
[24] C.-Y. Wang, D. Logothetis, K.S. Trivedi, I. Viniotis, Transient behavior of ATM networks under overloads, in: IEEE INFOCOM'96, IEEE, San Francisco, CA, 1996, pp. 978–985 (March).
[25] L. Cloth, B.R. Haverkort, Model checking for survivability, in: Proceedings of the Second International Conference on the Quantitative Evaluation of Systems (QEST'05) on The Quantitative Evaluation of Systems, IEEE Computer Society, Washington, DC, USA, 2005, pp. 145–154.
[26] Y. Liu, V.B. Mendiratta, K.S. Trivedi, Survivability Analysis of Telephone Access Network, in: ISSRE '04: Proceedings of the 15th International Symposium on Software Reliability Engineering, IEEE Computer Society, Washington, DC, USA, 2004, pp. 367–378.
[27] J.R. Jackson, Networks of waiting lines, Operations Research 5 (4) (1957) 518–521 (August).
[28] F. Baskett, K.M. Chandy, R.R. Muntz, F.G. Palacios, Open, closed, and mixed networks of queues with different classes of customers, Journal of ACM 22 (2) (1975) 248–260.
[29] K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, second ed., John Wiley and Sons, 2001. ISBN Number 0-471-33341-7.
[30] M. Grottke, V. Mainkar, K.S. Trivedi, S. Woolet, Response time distributions in networks of queues, in: R. Boucherie, N.V. Dijk (Eds.), Queueing Networks: A Fundamental Approach, Springer-Verlag, in press.
[31] R. Callon, Use of OSI IS–IS for Routing in TCP/IP and Dual Environments, Tech. Rep. IETF RFC-1195, 1990 (December).
[32] J. Moy, OSPF version 2, IETF, Tech. Rep. IETF RFC-2328, 1998 (April).
[33] D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, J. McManus, Requirements for Traffic Engineering over MPLS, IETF, Tech. Rep. IETF RFC-2702, September 1999.
[34] G.D. Caro, F. Ducatelle, L.M. Gambardella, AntHocNet: an adaptive nature-inspired algorithm for routing in mobile ad hoc networks, European Transactions on Telecommunications; Special Issue on Self Organization in Mobile Networking 16 (5) (2005) 443–455.
[35] P.E. Heegaard, O. Wittner, B.E. Helvik, Self-managed virtual path management in dynamic networks, in: O. Babaoglu, M. Jelasity, A. Montresor, A. van Moorsel, M. van Steen (Eds.), Self-* Properties in Complex Information Systems, Ser. Lecture Notes in Computer Science, LNCS, vol. 3460, Springer-Verlag, 2005, pp. 417–432.
[36] P.E. Heegaard, B.E. Helvik, O.J. Wittner, The cross entropy ant system for network path management, Telektronikk 104 (1) (2008) 19–40.
[37] R.A. Sahner, K.S. Trivedi, A. Puliafito, Performance and Reliability Analysis of Computer System: An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, 1996.
[38] C. Hirel, R.A. Sahner, X. Zang, K.S. Trivedi, Reliability and Performability Modeling Using SHARPE 2000, in: TOOLS '00: Proceedings of the 11th International Conference on Computer Performance Evaluation: Modelling Techniques and Tools, Springer-Verlag, 2000, pp. 345–349.
[39] G. Ciardo, A. Blakemore, P.F. Chimento, J.K. Muppala, K.S. Trivedi, Automated generation and analysis of Markov reward models using stochastic reward nets, in: C. Meyer, R. Plemmons (Eds.), Linear Algebra, Markov Chains and Queuing Models, vol. 48, Springer, 1996, pp. 145–191.
[40] J. Meyer, On evaluating the performability of degradable computing systems, IEEE Transactions on Computers C-29 (8) (1980) 720–731 (August).
[41] B.R. Haverkort, R. Marie, G. Rubino, K. Trivedi, Performability Modelling, Wiley, 2001.
[42] P.J. Courtois, Decomposability: Queueing and Computer System Applications, Academic Press, New York, 1977.
[43] A. Bobbio, K.S. Trivedi, An aggregation technique for the transient analysis of Stiff Markov chains, IEEE Transaction of Computer 35 (9) (1986) 803–814.
[44] V. Paxson, S. Floyd, Why we do not know how to simulate the internet, in: WSC '97: Proceedings of the 29th Conference on Winter Simulation, IEEE Computer Society, Washington, DC, USA, 1997, pp. 1037–1044.
[45] S. Juneja, P. Shahabuddin, Rare event simulation techniques: an introduction and recent advances, in: S.G. Henderson, B.L. Nelson (Eds.), Simulation, Ser. Handbooks in Operations Research and Management Science, Elsevier, Amsterdam, The Netherlands, 2006, pp. 291–350 (Chapter 11).
[46] D. Papagiannaki, S. Moon, C. Fraleigh, P. Thiran, F. Tobagi, C. Diot, Analysis of measured single-hop delay from an operational backbone network, in: IEEE Infocom, 2002.
[47] B. Kirkerud, Object-Oriented Programming with SIMULA, Addison Wesley, 1989.
[48] G. Birtwistle, Demos – A System for Discrete event Modelling on Simulation, 1997 (Online). Available from: <http://www.dcs.shef.ac.uk/graham/research/demos.pdf> (Accessed 16.2.2009).
[49] DARPA:VINT project, UCB/LBNL/VINT Network Simulator – ns (version 2), <http://www.isi.edu/nsnam/ns/>.
[50] G. Ciardo, K.S. Trivedi, A Decomposition Approach for Stochastic Reward Net Models, Performance Evaluation 18 (1) (1993) 37–59.
[51] B.E. Helvik, O. Wittner, Using the cross entropy method to guide/govern mobile agent's path finding in networks, in: Proceedings of Third International Workshop on Mobile Agents for Telecommunication Applications, Springer-Verlag, 2001 (August 14–16).
[52] W. Whitt, The queueing network analyzer, Bell System Technical Journal 62 (9) (1983) 2817–2843 (November).

Poul E. Heegaard received his Siv.ing. (M.S.E.E. in '89) and his Dr. Ing. (Ph.D. in '98) degrees from the University of Trondheim (now NTNU). His research interests cover performance, dependability and survivability evaluation of communication systems. Special interests are rare event simulation techniques, IP network monitoring and modeling, and distributed, autonomous and adaptive management and routing in communication networks and services. Heegaard was a Research Scientist and Senior Scientist at SINTEF Telecom and Informatics (1989–1999) and has been a Senior Research Scientist at Telenor R&I since 1999, where he currently holds a 20% position. Since 2006 Heegaard has been an Associate Professor at the Department of Telematics, Norwegian University of Science and Technology (NTNU), where he is the coordinator of the Network Research area. In the academic year 2007/08 he was a visiting professor at Duke University, Durham, NC.

Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has also published two other books, entitled Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers, and Queueing Networks and Markov Chains, published by John Wiley. He is a Fellow of the Institute of Electrical and Electronics Engineers. He is a Golden Core Member of IEEE Computer Society. He has published over 420 articles and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE Transactions on Dependable and Secure Computing, Journal of Risk and Reliability, International Journal of Performability Engineering and International Journal of Quality and Safety Engineering. He is the recipient of the IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation. His research interests are in reliability, availability, performance, performability and survivability modeling of computer and communication systems. He works closely with industry in carrying out reliability/availability analysis, providing short courses on reliability, availability and performability modeling, and in the development and dissemination of software packages such as SHARPE and SPNP.
