IEEE COMMUNICATIONS LETTERS, VOL. 16, NO. 8, AUGUST 2012, PAGES 1328-1331, © 2012 IEEE

Catastrophic Failures in a Backbone Network


Jane M. Simmons, Fellow, IEEE

Abstract—When developing a backbone network protection scheme, one important consideration is the number of concurrent link failures against which protection should be provided. A carrier typically is concerned with links going down due to independent failure events such as fiber cuts or amplifier failures. When considering just these factors, a carrier has little incentive to offer protection against more than two concurrent link failures. However, when catastrophic events are also considered, it is shown that it is worthwhile to protect mission-critical connections against up to three concurrent link failures. It is also shown that a newly developed connection setup protocol can be effective in rapidly recovering from multiple link failures.

Index Terms—Catastrophic failures, correlated failures, dynamic networking, multiple failures, protection, restoration.

I. INTRODUCTION

PROVIDING the proper amount of protection in a network is necessary to meet customer expectations for service while judiciously allocating network capacity and cost. Network customers typically specify the desired availability for a given connection when negotiating the service level agreement (SLA) with the carrier, where availability is defined as the probability that a connection is in a working state at a given instant of time. The carrier analyzes the expected failure rates and associated repair rates in the network, and determines an appropriate protection mechanism to meet the SLA.

One of the most significant failures to consider is that of a total link failure. (A link is the fiber that runs between two network nodes, where the network nodes are the sites at which traffic is sourced, terminated, and switched.) A link failure typically brings down all connections traversing that link. This disruptive effect is magnified when a network suffers multiple concurrent link failures.

Carriers must determine the likelihood of multiple concurrent link failures and design the protection scheme accordingly to meet their contractual SLAs. In calculating the availability of a connection, SLAs frequently do not include downtime caused by catastrophic events such as earthquakes, hurricanes, tornadoes, or terrorist attacks or other hostile acts. Under these stipulations, the downtime that matters is that due to equipment failure, random fiber cuts (e.g., due to a backhoe), or maintenance events. When link failures are limited to such events, it does not make sense for a carrier to allocate resources to protect against more than two concurrent link failures. Furthermore, depending on the network topology and availability targets, providing protection against just a single link failure may be sufficient for many connections.

However, there is likely to be a small subset of traffic that can be considered mission critical, where any downtime, regardless of the source, is potentially harmful; e.g., traffic vital for national defense. For this type of traffic, one also needs to consider catastrophic events not typically covered by an SLA, even though the occurrence of such events may be rare. When the effects of such events are included in the analysis, using a simple correlated failure model, we show that providing protection against up to three concurrent link failures can be warranted. As was shown in [1], designing this level of protection for a small subset of the traffic can be accomplished with relatively little extra spare capacity.

There has been much research on protection from multiple failures (e.g., [2], [3]); however, the bulk of the work considers only two concurrent failures and/or only random link failures. One purpose of this paper is to examine how the consideration of catastrophic failures yields different results as to the number of concurrent failures that deserve attention in designing a network protection scheme.

Section II presents the three backbone networks that were studied, and the assumptions regarding failure and repair rates. Section III looks at the likelihood of multiple link failures, both with and without catastrophic events being modeled. It also illustrates the limitations of a protection scheme based solely on pre-calculated link-diverse paths when concurrent link failures occur. Section IV proposes a more dynamic protection scheme based on a newly developed rapid connection setup protocol, which improves upon the performance.

Note that multiple concurrent link failures do not imply that the failures occur simultaneously. It is more likely that one link fails, and prior to this link being repaired, another link fails. Additionally, note that providing protection against catastrophic failures is different from providing protection against failures of shared-risk link groups (SRLGs). SRLGs in backbone networks typically consist of a small number of links where the associated fiber partially resides in the same conduit, such that the links are vulnerable to a single cut. SRLGs can be taken into account when determining diverse paths for protection (e.g., [4], [5]). In contrast, during a catastrophe, any of the links within a geographic area may fail, which is more challenging to address.

Manuscript received March 30, 2012. The associate editor coordinating the review of this letter and approving it for publication was X. Cao.
J. M. Simmons is with Monarch Network Architects, Holmdel, NJ, USA.

II. MODELING ASSUMPTIONS

A. Network Topologies

To examine the likelihood of multiple concurrent link failures in a backbone network, three networks were studied. The largest of the networks, shown in Fig. 1(a), is the baseline continental United States network used in the Core Optical Networks (CORONET) program [1] [6] [7]. The network is composed of 75 nodes and 99 links. The average nodal

Fig. 1. Network topologies studied. Node locations are similar to those of existing carriers; however, none of the networks represents an actual carrier network.

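The link-diverse path counts of the kind reported in Table I can be computed as the maximum number of edge-disjoint paths between a node pair, which equals the max-flow in a unit-capacity graph. A minimal sketch of this computation, on a toy topology rather than the networks of Fig. 1:

```python
import collections

def edge_disjoint_path_count(links, src, dst):
    """Count link-diverse (edge-disjoint) paths between two nodes
    via unit-capacity max-flow (Ford-Fulkerson with BFS)."""
    # Residual capacities: 1 unit in each direction of each undirected link.
    cap = collections.defaultdict(lambda: collections.defaultdict(int))
    for u, v in links:
        cap[u][v] += 1
        cap[v][u] += 1
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = {src: None}
        queue = collections.deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        # Push one unit of flow along the path found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

# Toy 5-node ring: two edge-disjoint paths exist between A and C.
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]
print(edge_disjoint_path_count(ring, "A", "C"))  # 2
```

Running this for every source/destination pair yields a connectivity tally like Table I.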
degree of 2.6 is in line with that of most US backbone networks. This network topology was specifically designed to be capable of providing a high degree of protection. For example, four completely link-diverse cross-continental paths exist in this network, which is not a common feature in US carrier networks. The second network [8], shown in Fig. 1(b), is somewhat more representative of current carrier networks. This network has 60 nodes and 77 links, and provides three link-diverse cross-continental paths. Finally, the third network [8], shown in Fig. 1(c), is representative of a relatively small carrier, with 30 nodes and 36 links, and two link-diverse cross-continental paths. Table I shows the connectivity statistics for the source/destination pairs in each network. For example, 575 of the source/destination pairs in Network 1 have only three link-diverse paths between them.

TABLE I
NETWORK CONNECTIVITY STATISTICS

                            Network 1   Network 2   Network 3
Only 2 link-diverse paths      2145        1493        427
Only 3 link-diverse paths       575         271          8
Only 4 link-diverse paths        55           5          0
Only 5 link-diverse paths         0           1          0

B. Failure Rates and Repair Rates

The most common cause of link failures is a fiber cut. We assume that the fiber-cut rate is 2 cuts per 1000 miles per year [9] [10], and that the time to repair a fiber cut is uniformly distributed between 6 and 10 hours. Network links are also susceptible to optical amplifier failures. It is assumed that optical amplifiers have a FIT (failures in 10^9 hours) rate of 2000, with the repair time uniformly distributed between 3 and 5 hours.

There are also likely to be planned maintenance events, but we assume that a carrier can exert control over when these occur. We also assume that optical switch failures (other than the nodal amplifiers), and phenomena such as PMD (polarization mode dispersion) degradation bringing down a link, are relatively infrequent events. In addition to link failures, individual connections are vulnerable to component failures, most notably transponder failures. However, carriers typically employ 1:M transponder protection, which minimizes the need for re-routing the affected connection. Transponder failures are not included in our model.

Once a link has failed, it is assumed that path-based restoration is invoked to restore any failed connection that requires protection. The connection is rerouted on a new path from source to destination, without needing to first isolate where the failure has occurred. This is important in networks with optical bypass, where fault isolation can be slower [8]. (Segment-based protection has also been proposed to protect against multiple failures, as it can typically recover from one failure per segment. However, with geographically correlated failures, it would not provide a significant benefit over path-based protection, as the failures are likely to occur in the same segment.)

In the study, we first assume that 1+N shortest link-diverse paths are pre-calculated for each source/destination node pair, where 1+N is the number of such paths that exist between the two nodes. This provides protection against any combination of N link failures. The protection mechanism can be either dedicated or shared; in either case, for simplicity, we refer to the scheme as 1+N protection.

C. Catastrophic Failures

We model catastrophic failures assuming correlated link failures, which is reasonable, though somewhat arbitrary. (Correlated link failures were also assumed in [11].) We assume that a catastrophe hits, on average, one node of Network 1 each year; the rate is correspondingly lower for Networks 2 and 3, which have fewer nodes. Each node is assumed to have an equal probability of being afflicted. With probability 5%, we assume that the catastrophe results in the whole node failing, which is modeled as all of its incident links failing. For the remaining catastrophes, we assume that each link incident to the afflicted node fails independently with probability 35%. Furthermore, any non-incident link that passes within 35 km of the afflicted node is assumed to fail with probability 10%. (Thus, we are modeling a catastrophic failure over a geographic area of radius 35 km.) These assumptions resulted in an average of approximately one failed link per catastrophe. It is further assumed that the links fail any time from the onset of the catastrophe to 30 minutes later, with uniform probability (the assumption of 30 minutes is not critical; the salient point is that the failures are not simultaneous). The time to repair a link that has failed due to a catastrophe is uniformly distributed between one and three days. If multiple links fail due to a catastrophe, they are repaired independently.

III. RESULTS

Two sets of simulations were run for each network. In the first set, only fiber cuts and equipment failures were considered; in the second set, catastrophic failures were added to the model. Each simulation run modeled a duration of 100,000 years, which resulted in the variance of the statistics being less than 0.1% of the mean. For each network, the number of failed links in the network was tracked, along with the number of failed paths for each possible source/destination pair.

TABLE II
AVERAGE TIME PER YEAR WITH N FAILED NETWORK LINKS

                                       Network 1   Network 2   Network 3
1 Failed Link     No Catastrophes      404 hours   361 hours   263 hours
in Network        With Catastrophes    425 hours   378 hours   272 hours
2 Failed Links    No Catastrophes       10 hours     8 hours     4 hours
in Network        With Catastrophes     19 hours    15 hours     7 hours
3 Failed Links    No Catastrophes      0.1 hours   0.1 hours  0.04 hours
in Network        With Catastrophes      3 hours     2 hours   0.6 hours

Table II shows the average amount of time per year that there are N concurrent link failures in the network, for N ranging from 1 to 3, both with and without catastrophes. When catastrophes are not considered, providing 1+3 protection provides virtually no benefit. Thus, protecting against three failures would not be a judicious use of resources if meeting an SLA were the main objective. Providing 1+2 protection may be beneficial. Almost half of the source/destination pairs with at least three diverse paths would need 1+2 protection to achieve 99.999% (i.e., platinum-level) availability, although none would need it if the availability target were only 99.99%.

When catastrophes are added to the model, the fraction of time with three concurrent link failures increases by an order of magnitude. For those source/destination pairs with four link-diverse paths, roughly 70% of them need 1+3 protection to meet 99.999% availability. Given the rarity of catastrophes, per-annum averages may not be the most relevant statistic. For mission-critical connections, one needs to consider the ramifications given that a catastrophe has occurred. The duration of the three-failed-path event for those source/destination pairs with four diverse paths averages approximately 20 hours. Providing a fourth path so that the service can remain working during this length of time could be crucial.

A number of source/destination pairs cannot meet 99.999% availability even with 1+N protection. When catastrophes are not modeled, this percentage is 50%, 60%, and 75% for Networks 1 through 3, respectively. With catastrophes, the percentages are 90%, 90%, and 99%. Most of these vulnerable source/destination pairs have just two diverse paths between the endpoints. Given this vulnerability, it is worthwhile considering alternative protection schemes, as described next.

IV. PROTECTION USING RAPID CONNECTION SETUP

It is known that incorporating a dynamic aspect into the restoration process, where restoration paths are computed after failures occur, results in higher availability when dealing with multiple failures [2] [3] [12] [13]. We consider a protection scheme where two diverse paths are pre-calculated for each connection to enable immediate recovery from a first failure. If both of these pre-calculated paths fail, then another path is dynamically searched for at the time of the second failure by issuing a new connection request. If this path subsequently fails, another connection request is issued. (This is in contrast to schemes such as in [12], where the path to use for protection against a second failure is calculated after the first failure occurs. Such a scheme is viable for the more highly connected network considered in [12], where at least three link-diverse paths exist for most source/destination pairs.)

The drawback of dynamically searching for a protection path at the time of failure has previously been the relatively slow speed of recovery. However, using the newly developed connection setup protocol of [1], which enables a connection to be set up in less than 100 ms, dynamism is a viable alternative that can meet stringent restoration time requirements. This protocol relies on sending probes over potential paths, where the set of paths to probe is calculated by a Path Computation Element (PCE). Under no failures, the path set is periodically recomputed to take into account current network utilization levels. In addition, in the protection scheme proposed here, whenever a link fails (or is restored), the PCE re-computes the path set in preparation for the next failure. It is assumed that there is enough time between failures to allow for this calculation. Thus, the path set that is calculated upon the ith failure in the network is used as the paths to probe when (and if) the (i + 1)st failure occurs.

The PCE cannot predict where the (i + 1)st failure will occur; thus, probes need to be sent on a number of paths to increase the probability that a new path will be found. Under no failures, on the order of three probes may be sent, where the destination chooses the best path to use based on cost and available resources. Under failure conditions, more probes would be sent, to better ensure at least one makes it through to the destination. The destination does not need to wait for all probes to arrive before selecting the new path. Also, in contrast with many schemes that incorporate a dynamic aspect, e.g., [14], there is no need for the fault to be isolated prior to initiating recovery, resulting in more rapid recovery times.

The main advantage that the connection setup protocol of [1] has over other setup protocols, e.g., GMPLS (Generalized Multi-Protocol Label Switching), is that it more effectively deals with simultaneous connection requests, where contention for resources may arise. This is especially important when used for protection, as there are likely multiple failed connections issuing a setup request. (There are unlikely to be a large number, though, because the dynamic aspect is only invoked when a connection suffers multiple failures and that connection requires high availability.)
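The probe-and-select behavior can be sketched as follows. This is a simplification under stated assumptions: the candidate path set, the failed-link map, and the cost ordering are placeholders, not the actual protocol of [1]:

```python
def select_path(candidate_paths, failed_links, max_probes=5):
    """Emulate probe-based setup: send probes on up to max_probes
    pre-computed candidate paths (assumed ordered by cost by a PCE),
    and accept the first probe whose path traverses no failed link."""
    for path in candidate_paths[:max_probes]:
        links = set(zip(path, path[1:]))
        links |= {(b, a) for a, b in links}   # links are undirected
        if not (links & failed_links):
            return path                        # probe reached destination
    return None                                # no surviving path probed

# Hypothetical path set for one source/destination pair.
paths = [["A", "B", "C"], ["A", "D", "C"], ["A", "E", "D", "C"]]
down = {("A", "B"), ("B", "A"), ("A", "D"), ("D", "A")}
print(select_path(paths, down))  # ['A', 'E', 'D', 'C']
```

In the real protocol, contention for wavelengths and cross-connect resources is resolved by the probes themselves; here only the link-diversity aspect is modeled.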

The simulations were rerun using this 1+1+Dynamic protection scheme rather than the pre-calculated 1+N protection
scheme. A limit of five probes was arbitrarily imposed for
the dynamic setup process (to limit the control-plane traffic).
The paths to probe for each source/destination pair are based
on eliminating all links that are down at the time of path
computation plus one or more links on the current path. In the
case without catastrophes, the percentage of source/destination
pairs not meeting 99.999% availability is reduced to 0% in
Networks 1 and 2, and 1% in Network 3. (If three probes
were used instead of five, this percentage would increase
to 6% for Network 3.) When catastrophes are considered,
80 to 90% of the source/destination pairs still do not meet
99.999% availability, mainly due to failures that result in there
being no possible path between source and destination (i.e., no
protection scheme could recover from such failures). However,
dynamic protection still provides benefits. As compared to pre-
calculated 1+N protection, the total downtime decreases by
roughly 60% in all three networks. Thus, adding dynamism is
beneficial, even for a sparse network such as Network 3.
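To relate per-year downtime to the availability targets used above: 99.999% availability permits only about 5.3 minutes of downtime per year, and 99.99% about 53 minutes. A small helper makes the conversion explicit (the figures below are illustrative, not simulation outputs):

```python
HOURS_PER_YEAR = 8760.0

def availability(downtime_hours_per_year):
    """Fraction of the year a connection is in a working state."""
    return 1.0 - downtime_hours_per_year / HOURS_PER_YEAR

def allowed_downtime_minutes(target):
    """Maximum downtime per year (minutes) for an availability target."""
    return (1.0 - target) * HOURS_PER_YEAR * 60.0

# Five nines leaves roughly 5.26 minutes of downtime per year.
print(round(allowed_downtime_minutes(0.99999), 2))  # 5.26
```

This is why the multi-hour outage durations associated with catastrophes dominate the availability of connections that lack a surviving protection path.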

V. CONCLUSION
It was shown that the number of concurrent link failures
to protect against depends on whether or not catastrophes are
considered. With catastrophes, three concurrent link failures
warrant attention, especially for mission-critical applications.
It was also shown that a new connection setup protocol makes
dynamic protection a practical option for multiple failures.

REFERENCES
[1] A. L. Chiu et al., "Architectures and protocols for capacity efficient, highly dynamic and highly resilient core networks," J. Opt. Commun. Netw., vol. 4, no. 1, pp. 1-14, Jan. 2012.
[2] D. A. Schupke, A. Autenrieth, and T. Fischer, "Survivability of multiple fiber duct failures," Third International Workshop on the Design of Reliable Communication Networks (DRCN), Budapest, Oct. 2001.
[3] S. Kim and S. S. Lumetta, "Evaluation of protection reconfiguration for multiple failures in WDM mesh networks," Proc. OFC 2003, Atlanta, GA, Mar. 23-28, 2003, vol. 1, pp. 210-211.
[4] R. Bhandari, Survivable Networks: Algorithms for Diverse Routing. Boston, MA: Kluwer Academic Publishers, 1999.
[5] D. Xu, Y. Xiong, C. Qiao, and G. Li, "Trap avoidance and protection schemes in networks with shared risk link groups," IEEE/OSA J. Lightw. Technol., vol. 21, no. 11, pp. 2683-2693, Nov. 2003.
[6] A. A. M. Saleh, "Dynamic multi-terabit core optical networks: architecture, protocols, control and management (CORONET)," DARPA BAA 06-29, Proposer Information Pamphlet.
[7] Sample Optical Network Topology Files. Available: http://www.monarchna.com/topology.html
[8] J. M. Simmons, Optical Network Design and Planning. New York, NY: Springer, 2008.
[9] R. Feuerstein, "Interconnecting the Cyberinfrastructure," CyberInfrastructure 2005, Lincoln, NE, Aug. 15-16, 2005.
[10] B. Manseur and J. Leung, "Comparative analysis of network reliability and optical reach," Proc. NFOEC 2003, Orlando, FL, Sept. 7-11, 2003.
[11] H.-W. Lee, E. Modiano, and K. Lee, "Diverse routing in networks with probabilistic failures," IEEE/ACM Trans. Netw., vol. 18, no. 6, pp. 1895-1907, Dec. 2010.
[12] J. Zhang, K. Zhu, and B. Mukherjee, "Backup reprovisioning to remedy the effect of multiple link failures in WDM mesh networks," IEEE J. Sel. Areas Commun., vol. 24, no. 8, pp. 57-67, Aug. 2006.
[13] Y. Li et al., "Availability analytical model for permanent dedicated path protection in WDM networks," IEEE Commun. Lett., vol. 16, no. 1, pp. 95-97, Jan. 2012.
[14] Y. Sone, W. Imajuku, and M. Jinno, "Multiple failure recovery of optical paths using GMPLS based restoration scheme escalation," Proc. OFC/NFOEC 2007, Anaheim, CA, Mar. 25-29, 2007.
