CatastrophicFailures CommLetters Aug2012 Simmons W PDF
Abstract—When developing a backbone network protection scheme, one important consideration is the number of concurrent link failures against which protection should be provided. A carrier typically is concerned with links going down due to independent failure events such as fiber cuts or amplifier failures. When considering just these factors, a carrier has little incentive to offer protection against more than two concurrent link failures. However, when catastrophic events are also considered, it is shown that it is worthwhile to protect mission-critical connections against up to three concurrent link failures. It is also shown that a newly developed connection setup protocol can be effective in rapidly recovering from multiple link failures.

Index Terms—Catastrophic failures, correlated failures, dynamic networking, multiple failures, protection, restoration.

I. INTRODUCTION

However, there is likely to be a small subset of traffic that can be considered mission critical, where any downtime, regardless of the source, is potentially harmful; e.g., traffic vital for national defense. For this type of traffic, one also needs to consider catastrophic events not typically covered by an SLA, even though the occurrence of such events may be rare. When the effects of such events are included in the analysis, using a simple correlated failure model, we show that providing protection against up to three concurrent link failures can be warranted. As was shown in [1], designing this level of protection for a small subset of the traffic can be accomplished with relatively little extra spare capacity.

There has been much research already on protection from multiple failures (e.g., [2], [3]); however, the bulk of the work considers only two concurrent failures and/or only random link failures. One purpose of this paper is to examine how the
Fig. 1. Network topologies studied. Node locations are similar to those of existing carriers; however, none of the networks represents an actual carrier network.
degree of 2.6 is in line with that of most US backbone networks. This network topology was specifically designed to be capable of providing a high degree of protection. For example, four completely link-diverse cross-continental paths exist in this network, which is not a common feature in US carrier networks. The second network [8], shown in Fig. 1(b), is somewhat more representative of current carrier networks. This network has 60 nodes and 77 links, and provides three link-diverse cross-continental paths. Finally, the third network [8], shown in Fig. 1(c), is representative of a relatively small carrier, with 30 nodes and 36 links, and two link-diverse cross-continental paths. Table I shows the connectivity statistics for the source/destination pairs in each network. For example, 575 of the source/destination pairs in Network 1 have only three link-diverse paths between them.

TABLE I
NETWORK CONNECTIVITY STATISTICS

                            Network 1   Network 2   Network 3
Only 2 link-diverse paths      2145        1493         427
Only 3 link-diverse paths       575         271           8
Only 4 link-diverse paths        55           5           0
Only 5 link-diverse paths         0           1           0

B. Failure Rates and Repair Rates

The most common cause of link failures is a fiber cut. We assume that the fiber-cut rate is 2 cuts per 1000 miles per year [9], [10], and that the time to repair a fiber cut is uniformly distributed between 6 and 10 hours. Network links are also susceptible to optical amplifier failures. It is assumed that optical amplifiers have a FIT (failures in 10^9 hours) rate of 2000, with the repair time uniformly distributed between 3 and 5 hours.

There are also likely to be planned maintenance events, but we assume that a carrier can exert control over when these occur. We also assume that optical switch failures (other than the nodal amplifiers), and phenomena such as PMD (polarization mode dispersion) degradation bringing down a link, are relatively infrequent events. In addition to link failures, individual connections are vulnerable to component failures, most notably transponder failures. However, carriers typically employ 1:M transponder protection, which minimizes the need for re-routing the affected connection. Transponder failures are not included in our model.

Once a link has failed, it is assumed that path-based restoration is invoked to restore any failed connection that requires protection. The connection is rerouted on a new path from source to destination, without needing to first isolate where the failure has occurred. This is important in networks with optical bypass, where fault isolation can be slower [8]. (Segment-based protection has also been proposed to protect against multiple failures, as it can typically recover from one failure per segment. However, with geographically correlated failures, it would not provide a significant benefit over path-based protection, as the failures are likely to occur in the same segment.)

In the study, we first assume that 1+N shortest link-diverse paths are pre-calculated for each source/destination node pair, where 1+N is the number of such paths that exist between the two nodes. This provides protection against any combination of N link failures. The protection mechanism can be either dedicated or shared; in either case, for simplicity, we refer to the scheme as 1+N protection.

C. Catastrophic Failures

We model catastrophic failures assuming correlated link failures, which is reasonable, though somewhat arbitrary. (Correlated link failures were also assumed in [11].) We assume that a catastrophe hits, on average, one node of Network 1 each year; the rate is correspondingly lower for Networks 2 and 3, which have fewer nodes. Each node is assumed to have an equal probability of being afflicted. With probability 5%, we assume that the catastrophe results in the whole node failing, which is modeled as all of its incident links failing. For the remaining catastrophes, we assume that each link incident to the afflicted node fails independently with probability 35%. Furthermore, any non-incident link that passes within 35 km of the afflicted node is assumed to fail with probability 10%. (Thus, we are modeling a catastrophic failure over a geographic area of radius 35 km.) These assumptions resulted in an average of approximately one failed link per catastrophe. It is further assumed that the links fail any time from the onset of the catastrophe to 30 minutes later, with uniform probability (the assumption of 30 minutes is not critical; the salient point is that the failures are not simultaneous). The time to repair a link that has failed due to a catastrophe is uniformly distributed between one and three days. If multiple links fail due to a catastrophe, they are repaired independently.
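The correlated-failure model above can be sketched in a few lines. The probabilities and time windows below are taken from the text; the topology inputs (node list, incident-link and nearby-link maps) are hypothetical placeholders, not any of the studied networks.

```python
import random

# Parameters from the catastrophe model of Section II-C.
P_NODE_FAILS = 0.05       # whole afflicted node fails (all incident links down)
P_INCIDENT = 0.35         # otherwise, each incident link fails independently
P_NEARBY = 0.10           # non-incident links passing within 35 km of the node
ONSET_WINDOW_MIN = 30.0   # failures occur uniformly within 30 min of onset
REPAIR_DAYS = (1.0, 3.0)  # per-link repair time, uniform over one to three days

def draw_catastrophe(nodes, incident_links, nearby_links):
    """Sample one catastrophe: returns (link, onset_minutes, repair_days) tuples."""
    node = random.choice(nodes)  # every node is equally likely to be afflicted
    if random.random() < P_NODE_FAILS:
        failed = list(incident_links[node])  # whole node down
    else:
        failed = [l for l in incident_links[node] if random.random() < P_INCIDENT]
    # Non-incident links within the 35-km radius may also fail.
    failed += [l for l in nearby_links[node] if random.random() < P_NEARBY]
    # Failures are staggered over the onset window; repairs are independent.
    return [(l, random.uniform(0.0, ONSET_WINDOW_MIN),
             random.uniform(*REPAIR_DAYS)) for l in failed]
```

Sampling this once per simulated year (for Network 1) reproduces the catastrophe arrival rate assumed in the study.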
IEEE COMMUNICATIONS LETTERS, VOL. 16, NO. 8, AUGUST 2012, PAGES 1328-1331. © 2012 IEEE
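Before the results, it may help to see roughly where figures like those in Table II come from. The sketch below is not the paper's 100,000-year simulation: it is a simpler steady-state approximation that assumes independent links, fiber cuts only (at the 2 cuts per 1000 miles per year rate of Section II-B), and exhaustive enumeration of failure combinations, which is fine only for small networks. The link mileages passed in are placeholders.

```python
from itertools import combinations

HOURS_PER_YEAR = 8760.0
CUT_RATE_PER_1000MI = 2.0  # fiber cuts per 1000 miles per year [9], [10]
MEAN_REPAIR_H = 8.0        # repair uniform(6, 10) hours -> mean of 8 hours

def link_unavailability(miles):
    """Stationary probability that one link is down (fiber cuts only).
    Mean up-time is approximated by the mean time between cuts."""
    cuts_per_year = CUT_RATE_PER_1000MI * miles / 1000.0
    mean_up_h = HOURS_PER_YEAR / cuts_per_year
    return MEAN_REPAIR_H / (mean_up_h + MEAN_REPAIR_H)

def hours_with_n_failed(link_miles, n):
    """Expected hours per year with exactly n links down, links independent.
    Exact Poisson-binomial sum via enumeration (exponential in link count)."""
    u = [link_unavailability(m) for m in link_miles]
    p = 0.0
    for down in combinations(range(len(u)), n):
        q = 1.0
        for i, ui in enumerate(u):
            q *= ui if i in down else (1.0 - ui)
        p += q
    return HOURS_PER_YEAR * p
```

Even this crude approximation shows why triple failures are negligible without catastrophes: per-link unavailability is on the order of 10^-3, so triple-failure probabilities scale with its cube.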
III. RESULTS

Two sets of simulations were run for each network. In the first set, only fiber cuts and equipment failures were considered; in the second set, catastrophic failures were added to the model. Each simulation run modeled a duration of 100,000 years, which resulted in the variance of the statistics being less than 0.1% of the mean. For each network, the number of failed links in the network was tracked, along with the number of failed paths for each possible source/destination pair.

Table II shows the average amount of time per year that there are N concurrent link failures in the network, for N ranging from 1 to 3, both with and without catastrophes.

TABLE II
AVERAGE TIME PER YEAR WITH N FAILED NETWORK LINKS

                                       Network 1   Network 2   Network 3
1 Failed Link      No Catastrophes     404 hours   361 hours   263 hours
in Network         With Catastrophes   425 hours   378 hours   272 hours
2 Failed Links     No Catastrophes      10 hours     8 hours     4 hours
in Network         With Catastrophes    19 hours    15 hours     7 hours
3 Failed Links     No Catastrophes     0.1 hours   0.1 hours  0.04 hours
in Network         With Catastrophes     3 hours     2 hours   0.6 hours

When catastrophes are not considered, providing 1+3 protection provides virtually no benefit. Thus, protecting against three failures would not be a judicious use of resources if meeting an SLA were the main objective. Providing 1+2 protection may be beneficial. Almost half of the source/destination pairs with at least three diverse paths would need 1+2 protection to achieve 99.999% (i.e., platinum-level) availability, although none would need it if the availability target were only 99.99%.

When catastrophes are added to the model, the fraction of time with three concurrent link failures increases by an order of magnitude. For those source/destination pairs with four link-diverse paths, roughly 70% of them need 1+3 protection to meet 99.999% availability. Given the rarity of catastrophes, per-annum averages may not be the most relevant statistic. For mission-critical connections, one needs to consider the ramifications given that a catastrophe has occurred. The duration of the three-failed-path event for those source/destination pairs with four diverse paths averages approximately 20 hours. Providing a fourth path such that the service can remain working during this length of time could be crucial.

A number of source/destination pairs cannot meet 99.999% availability even with 1+N protection. When catastrophes are not modeled, the percentages are 50%, 60%, and 75% for Networks 1 through 3, respectively. With catastrophes, the percentages are 90%, 90%, and 99%. Most of these vulnerable source/destination pairs have just two diverse paths between the endpoints. Given this vulnerability, it is worthwhile considering alternative protection schemes, as described next.

IV. PROTECTION USING RAPID CONNECTION SETUP

It is known that incorporating a dynamic aspect into the restoration process, where restoration paths are computed after failures occur, results in higher availability when dealing with multiple failures [2], [3], [12], [13]. We consider a protection scheme where two diverse paths are pre-calculated for each connection to enable immediate recovery from a first failure. If both of these pre-calculated paths fail, then another path is dynamically searched for at the time of the second failure by issuing a new connection request. If this path subsequently fails, another connection request is issued. (This is in contrast to schemes such as in [12], where the path to use for protection against a second failure is calculated after the first failure occurs. Such a scheme is viable for the more highly connected network considered in [12], where at least three link-diverse paths exist for most source/destination pairs.)

The drawback of dynamically searching for a protection path at the time of failure has previously been the relatively slow speed of recovery. However, using the newly developed connection setup protocol of [1], which enables a connection to be set up in less than 100 ms, dynamism is a viable alternative that can meet stringent restoration time requirements. This protocol relies on sending probes over potential paths, where the set of paths to probe is calculated by a Path Computation Element (PCE). Under no failures, the path set is periodically recomputed to take into account current network utilization levels. In addition, in the protection scheme proposed here, whenever a link fails (or is restored), the PCE re-computes the path set in preparation for the next failure. It is assumed that there is enough time between failures to allow for this calculation. Thus, the path set that is calculated upon the ith failure in the network is used as the paths to probe when (and if) the (i+1)st failure occurs.

The PCE cannot predict where the (i+1)st failure will occur; thus, probes need to be sent on a number of paths to increase the probability that a new path will be found. Under no failures, on the order of three probes may be sent, where the destination chooses the best path to use based on cost and available resources. Under failure conditions, more probes would be sent, to better ensure that at least one makes it through to the destination. The destination does not need to wait for all probes to arrive before selecting the new path. Also, in contrast with many schemes that incorporate a dynamic aspect, e.g., [14], there is no need for the fault to be isolated prior to initiating recovery, resulting in more rapid recovery times.

The main advantage that the connection setup protocol of [1] has over other setup protocols, e.g., GMPLS (Generalized Multi-Protocol Label Switching), is that it more effectively deals with simultaneous connection requests, where contention for resources may arise. This is especially important when used for protection, as there are likely multiple failed connections issuing a setup request. (There are unlikely to be a large number, though, because the dynamic aspect is only invoked when a connection suffers multiple failures and that connection requires high availability.)
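The probe-and-select step of this dynamic scheme can be illustrated with a small sketch. This is not the protocol of [1]: the names (Path, select_restoration_path) and the simple min-cost selection rule are illustrative assumptions, standing in for the PCE-computed candidate set, the probe launch, and the destination's choice.

```python
from dataclasses import dataclass

@dataclass
class Path:
    links: frozenset  # ids of the links the path traverses
    cost: float       # e.g., distance- or utilization-based cost

def select_restoration_path(candidates, failed_links, max_probes=3):
    """Probe up to max_probes candidates that avoid all failed links;
    the destination (modeled here as a min over arrived probes) picks
    the cheapest. Returns None if no candidate survives the failures."""
    surviving = [p for p in candidates if not (p.links & failed_links)]
    probed = sorted(surviving, key=lambda p: p.cost)[:max_probes]
    return probed[0] if probed else None
```

Under failure conditions, a larger max_probes mirrors the text's point that more probes are sent to better ensure at least one reaches the destination; no fault isolation is needed, since probes over failed links simply never arrive.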
V. CONCLUSION
It was shown that the number of concurrent link failures
to protect against depends on whether or not catastrophes are
considered. With catastrophes, three concurrent link failures
warrant attention, especially for mission-critical applications.
It was also shown that a new connection setup protocol makes
dynamic protection a practical option for multiple failures.
REFERENCES

[1] A. L. Chiu et al., "Architectures and protocols for capacity efficient, highly dynamic and highly resilient core networks," J. Opt. Commun. Netw., vol. 4, no. 1, pp. 1–14, Jan. 2012.
[2] D. A. Schupke, A. Autenrieth, and T. Fischer, "Survivability of multiple fiber duct failures," in Proc. Third International Workshop on the Design of Reliable Communication Networks (DRCN), Budapest, Oct. 2001.
[3] S. Kim and S. S. Lumetta, "Evaluation of protection reconfiguration for multiple failures in WDM mesh networks," in Proc. OFC 2003, Atlanta, GA, Mar. 23-28, 2003, vol. 1, pp. 210–211.
[4] R. Bhandari, Survivable Networks: Algorithms for Diverse Routing. Boston, MA: Kluwer Academic Publishers, 1999.
[5] D. Xu, Y. Xiong, C. Qiao, and G. Li, "Trap avoidance and protection schemes in networks with shared risk link groups," IEEE/OSA J. Lightw. Technol., vol. 21, no. 11, pp. 2683–2693, Nov. 2003.
[6] A. A. M. Saleh, "Dynamic multi-terabit core optical networks: architecture, protocols, control and management (CORONET)," DARPA BAA 06-29, Proposer Information Pamphlet.
[7] Sample Optical Network Topology Files. Available: http://www.monarchna.com/topology.html
[8] J. M. Simmons, Optical Network Design and Planning. New York, NY: Springer, 2008.
[9] R. Feuerstein, "Interconnecting the Cyberinfrastructure," CyberInfrastructure 2005, Lincoln, NE, Aug. 15-16, 2005.
[10] B. Manseur and J. Leung, "Comparative analysis of network reliability and optical reach," in Proc. NFOEC 2003, Orlando, FL, Sept. 7-11, 2003.
[11] H.-W. Lee, E. Modiano, and K. Lee, "Diverse routing in networks with probabilistic failures," IEEE/ACM Trans. Netw., vol. 18, no. 6, pp. 1895–1907, Dec. 2010.
[12] J. Zhang, K. Zhu, and B. Mukherjee, "Backup reprovisioning to remedy the effect of multiple link failures in WDM mesh networks," IEEE J. Sel. Areas Commun., vol. 24, no. 8, pp. 57–67, Aug. 2006.
[13] Y. Li et al., "Availability analytical model for permanent dedicated path protection in WDM networks," IEEE Commun. Lett., vol. 16, no. 1, pp. 95–97, Jan. 2012.
[14] Y. Sone, W. Imajuku, and M. Jinno, "Multiple failure recovery of optical paths using GMPLS based restoration scheme escalation," in Proc. OFC/NFOEC 2007, Anaheim, CA, Mar. 25-29, 2007.