Accumulated-Downtime-Oriented Restoration Strategy With Service Differentiation in Survivable WDM Mesh Networks
Lei Song and Biswanath Mukherjee
Abstract: Telecommunications service providers (SPs) should place survivability expectations by guaranteeing a maximal allowed system downtime for service-level-agreement (SLA)-differentiated services. Furthermore, SPs should continuously focus on utilizing network resources effectively, considering the bounded network capacity and the growth of future data traffic. In order to improve differentiated service availabilities and achieve high resource efficiency, we present a novel restoration scheme that jointly considers the accumulated downtime and SLA requirements of faulty connections. While most past related works have focused on providing statistical guarantees on availability when a connection is provisioned, our current approach recognizes that, after a connection has been in existence, it could be ahead of (or behind) its performance guarantee based on what network outages it might have experienced, so the resources allocated to it may be revised judiciously. When a link failure occurs, two sets of faulty connections are examined: (a) connections whose primary or restoration path is disrupted by the failure and (b) connections that are in the down state due to some previous failures (which have not been repaired yet). An affected connection is switched to its precomputed or an alternate restoration path, if necessary, when its accumulated downtime plus the link repair time will exceed its SLA requirement. The scheme provides differentiated restoration to existing connections upon a link failure in order to satisfy the connections' availability requirements. We also propose an upgraded version of the scheme that incorporates both excess capacity and resource preemption. Given the network capacities and the current network state, including routing information for all existing connections, a faulty connection is restored to its restoration path as long as there is enough excess capacity along the path. Otherwise, when protection switching of a high-SLA connection fails due to limited bandwidth on some link(s), it preempts restoration capacity on each such link from a low-SLA connection if both disrupted connections share the same restoration capacity and the availability requirement of the low-SLA connection is not violated. Finally, we report simulation results for
a large carrier-scale network to show the computational performance of our proposed algorithm. The results demonstrate that the algorithm achieves a high availability satisfaction rate and good resource utilization, and greatly reduces protection-switching overhead.

Index Terms: Optical network; WDM; Survivability; Restoration; Protection; Outage; System downtime; Excess capacity; Availability; Differentiated services; Protection switching.
I. INTRODUCTION
Incorporating survivability capabilities into fiber-optic communication has been, and continues to be, an issue of major importance for wavelength-division multiplexing (WDM) networks. Currently, the fastest OC-768 optical channel operates at a data rate of 40 Gbps, and an optical fiber can support up to 160 or even 320 such channels [1]. Obviously, a single fiber outage, even for only a few seconds, means huge data and revenue loss for both network carriers and their customers. A modern network needs to be survivable and resilient to different network failures [2]. In order to improve network survivability, two types of fault-recovery mechanisms, protection and restoration, have been proposed and studied thoroughly for traffic recovery against network failures, e.g., a fiber cut [3]. In a protection approach, a primary and backup path pair is computed and reserved in advance when a connection is set up. Most protection-oriented provisioning algorithms aim to
improve performance metrics such as network blocking and backup-bandwidth sharing efficiency (if shared-path protection is applied). Protection achieves fast recovery against network failures. However, it suffers from low capacity efficiency because backup bandwidth is pure overhead during normal network operation. By comparison, in a restoration scheme, when a failure occurs, a restoration path and a free wavelength along the path have to be discovered dynamically for an interrupted connection [4-6]. To make a restoration scheme cost-effective, optimization techniques need to be applied when selecting the restorable connections, such that network restorability can be improved with minimum restoration-switching overhead [7-12].

However, economic issues in terms of both capital expenditure (CapEx) and operational expenditure (OpEx) have become key factors for next-generation telecom network development and scalability [13]. Redundant bandwidth for network resilience can be a large overhead for network operators and thus needs to be minimized, given limited resources and budgets. It is desirable and timely for a service provider (SP) to utilize bandwidth cost-effectively, considering the growth of future data traffic. In practice, customers have different availability requirements defined in their service-level agreements (SLAs) [14]. For example, a virtual private network (VPN) customer (usually a large company or organization) may require five-9s availability, which translates into no more than 5 minutes of system outage per year, while a school customer can typically tolerate 4-5 hours of downtime during its 6-month service subscription, which corresponds to an availability expectation of about 0.999. The SP only needs to provision a service just above its SLA threshold when network resources are limited, in order to minimize network cost. For example, a connection can temporarily stay in the down state when an outage occurs, as long as its availability requirement is still respected. SPs tend to place survivability expectations on guaranteeing a maximal allowed system downtime for SLA-differentiated services. Furthermore, SPs are good at modeling and managing system downtime [15].

When a failure occurs, two considerations can be exploited for an affected connection (which loses its primary or restoration path due to the failure): 1) the accumulated downtime, i.e., the outage history of the connection; and 2) the time to repair the failure, i.e., the mean time to repair (MTTR) [16]. Upon a failure arrival, the SP can wisely pick affected connections for restoration only if necessary, based on a connection's accumulated downtime, its availability requirement, and the MTTR. An affected connection is switched to a pre-computed restoration path
(determined and reserved before a failure occurs) or to an alternate restoration path (computed afterward if resources are available) when its availability would otherwise be violated. The accumulated-downtime-aware restoration approach may achieve a significant reduction in restoration-resource usage as well as protection-switching overhead.

A. Related Work
End-to-end restoration schemes in survivable WDM mesh networks have been gaining increasing interest in the literature [10,17-19]. A comparative analysis is conducted in [10] to quantitatively investigate the benefit of end-to-end restoration schemes in terms of resource cost. Based on the analysis, an integer linear program (ILP) formulation is proposed, which jointly considers capacity and flow-assignment optimization. In [17], a dynamic RBR (restoration based on rerouting) scheme with WDM-layer re-configurability is presented for ATM-over-WDM networks. The RBR scheme utilizes existing working resources to efficiently restore applications affected by network failures. In [18], the authors propose both ILP and heuristic algorithms to achieve a 100% restoration guarantee against single node or link failures for protected connections. Restoration capacity is shared between demands as long as their primaries are link disjoint. Reference [19] evaluates existing restoration techniques and proposes a restoration framework, which consists of multiple ILP objective functions and three algorithms. The algorithms are developed for restoration routing and wavelength assignment in optical WDM mesh networks and to measure the performance of the objective functions. Numerical results show that the framework can achieve a near-optimal solution in terms of wavelength usage, load balancing, etc. In particular, a recent protection scheme exploiting a connection's outage history can be found in [20]. The scheme keeps track of the accumulated outage time of a connection and protects it with a restoration path before the connection's SLA is violated. Our study focuses on a restoration approach instead of protection, because of its better resource efficiency.

B. Our Contributions
Our work extends the end-to-end restoration scheme to different availability requirements against multiple concurrent failures [21] in survivable WDM mesh networks. Some initial work has been done in [22], which proposes to switch a faulty connection to its pre-allocated restoration path only when its accumulated downtime plus the predicted MTTR will violate its SLA. However, as we argue in this paper, an SP would favor restoring an affected connection as long as resources are available, even if its cumulative
downtime is below its SLA requirement. Furthermore, resource preemption can be triggered when high-SLA and low-SLA services compete for shared bandwidth, in order to maximize the number of restorable connections and therefore achieve higher net profit. Our work investigates the benefits of accumulated-downtime-oriented restoration based on all suitable factors (SLA requirement, network capacity, knowledge of connection outage history, and revenue awareness) instead of cumulative downtime alone. We also give a thorough overview of previous work on end-to-end path restoration in survivable WDM mesh networks. Most of our work focuses on an overall study of service restoration with accumulated-downtime information in order to improve performance in terms of (1) restoration overbuild, (2) availability satisfaction, (3) restoration-switching overhead, (4) restorable connections, and (5) net profit.

We develop a novel accumulated-downtime-aware restoration algorithm for connection recovery against link failures. It examines two sets of faulty connections: (a) connections that are disrupted by a link failure, i.e., connections that lose their primary or restoration paths [1], and (b) connections that are in the down state because of some previous failures (that have not been repaired yet). Upon a link-failure occurrence, an affected connection is switched to its preassigned restoration path or an alternate path (if resources are available) only when its accumulated downtime plus the repair time will exceed its SLA requirement. The goal is to achieve minimum-capacity restoration [23].

Furthermore, we propose an upgraded version of the restoration scheme by considering a trade-off between network capacity and availability [24]. The upgraded version incorporates both excess-capacity and resource-preemption concepts into the scheme. Given the knowledge of current capacity-usage information and the network state, including routing information for all existing connections, a faulty connection is switched to its restoration path as long as there is enough capacity along the path. When protection switching of a high-SLA connection fails due to limited bandwidth on some links along the restoration path, it preempts restoration capacity on each out-of-bandwidth link from a low-SLA connection, if both disrupted connections share the same restoration capacity and the availability requirement of the low-SLA connection will not be violated. We demonstrate the effectiveness of both proposed algorithms and evaluate their computational performance through illustrative examples.

We note that, if network capacity is available, it would be preferable to exploit the holding time of a
connection to better decide when to restore a faulty connection. In this paper, our schemes aim at meeting the total downtime requirement at the end of a connection; we do not consider the connection holding time as a factor for selecting restorable connections. However, one of our recent works [25] presents a novel provisioning and protection scheme that exploits the holding time of a connection to redefine service-availability targets. If a connection has not been affected by failures so far, its availability target can be redefined and opportunely decreased, as long as the SLA target provided to the customer is still respected. In that work, we have shown that taking advantage of the holding time and of the history of a connection yields relevant improvements in resource utilization and availability guarantee. While most past related works have focused on providing statistical guarantees on availability when a connection is provisioned, our current approach recognizes that, after a connection has been in existence, it could be ahead of (or behind) its performance guarantee based on what network outages it might have experienced; thus the resources allocated to it may be revised judiciously.
C. Economic Considerations
Network cost and revenue are related to bandwidth utilization in pre-planned routing and in service recovery from failures. Two major criteria for evaluating restoration schemes are the number of restorable connections (and the corresponding net profit obtained by serving the connections) and resource utilization (and the corresponding cost). An efficient restoration scheme is very attractive if it supports appropriate levels of service restorability and wisely allocates bandwidth to the most profitable services by considering possible revenues and penalties based on SLAs. Usually, an SP charges high prices for high-quality services, which thus require more resources for routing and restoration against network failures. Penalty prices are specified based on different SLA levels and will be paid to a customer at a certain rate (e.g., up to 40 percent of the service price) for an SLA violation [26]. Given a price model with SLA-differentiated service and penalty prices, our proposed restoration approaches can be evaluated against the goal of maximal net profit (revenue minus penalty). In particular, the upgraded version of our proposed restoration scheme tries to preempt resources from low-SLA connections to support high-SLA connections and obtain maximal revenue for both categories, as long as the restoration constraints of both are respected. Extension of our study on the topic of revenue by bandwidth adjustment and preemption is an open problem for future research.
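To make the profit objective concrete, the following minimal Python sketch illustrates the revenue-minus-penalty accounting described above; the prices, the helper name, and the 40% penalty rate are illustrative assumptions rather than a price model from this paper:

```python
# Illustrative net-profit accounting under an SLA price model with
# penalties; the 40% penalty rate follows the example figure cited from
# [26], and all other values are assumptions for illustration.

def net_profit(service_price: float, sla_violated: bool,
               penalty_rate: float = 0.40) -> float:
    """Net profit = revenue minus penalty; a penalty (here 40% of the
    service price) is paid only when the SLA is violated."""
    penalty = service_price * penalty_rate if sla_violated else 0.0
    return service_price - penalty

# Keeping a high-SLA service within its SLA preserves its full revenue,
# while keeping a preempted low-SLA service within its own SLA avoids
# paying any penalty on it.
print(net_profit(100.0, sla_violated=False))  # 100.0
print(net_profit(100.0, sla_violated=True))   # 60.0
```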
D. Outline of the Work
The remainder of this paper is organized as follows. Section II addresses the accumulated-downtime-aware restoration architecture. Section III presents the fundamental constraints used in the study. Section IV proposes a novel restoration algorithm, called Accumulated-Downtime-Oriented Restoration (ADORE), based on the different availabilities and accumulated downtimes of connections. An upgraded version of the restoration algorithm is presented and analyzed in Section V. We show the effectiveness of the algorithms through illustrative examples and performance analysis in Section VI. Section VII concludes this study.

II. ACCUMULATED-DOWNTIME-ORIENTED RESTORATION ARCHITECTURE
Any network component may fail, and our focus here is on a mesh-based optical network. Among all failures, fiber cuts are the major contributors. A fiber cut usually occurs due to construction equipment ploughing through buried cable or natural disasters such as earthquakes. Other network facilities, e.g., an optical cross-connect (OXC), amplifier, or regenerator, may also fail. Reference [16] shows that, in a practical backbone network, the typical MTTR is between 4 and 12 hours for a fiber cut and 6 hours for a node. The average fiber-cut rate is 0.2 to 2 times per year per 100 kilometers of fiber length [16,27]. Usually, nodes are significantly more reliable and are therefore considered fault-free in many survivability studies as well as in this study. We focus only on link failures [28]. In general, SPs are very good at locating and repairing link failures [15,29]. Therefore, it is reasonable to represent the time to repair a link failure, or MTTR, as a known-in-advance value when modeling and analyzing a restoration scheme.

Availability represents the portion of time that a service is in the normal operating state during the total service time. A connection c's availability A(c) is often defined as

A(c) = [TotalServiceTime(c) - MaxAllowedDowntime(c)] / TotalServiceTime(c),   (1)

where TotalServiceTime(c) is the service duration of c and MaxAllowedDowntime(c) represents the maximal allowed system outage time for c (usually defined in its SLA). A penalty is paid by the SP to the customer if MaxAllowedDowntime(c) (and hence the SLA) is violated. Different customers have different MaxAllowedDowntime expectations during their service periods.
A customer with a 12-month service holding time and A(c) requirements of 2-9s through 5-9s can tolerate a MaxAllowedDowntime(c) of at most 3.65 days, 8.76 hours, 53 minutes, and 5.3 minutes, respectively. When a link failure affects a low-SLA customer (e.g., 2-9s or 3-9s), the SP may keep it in the down state while still satisfying its availability requirement, as long as its accumulated downtime plus the failure repair time does not exceed its SLA. The saved spare capacity can be assigned to serve connections with more stringent SLAs, especially when a low-SLA connection has not been subjected to any system outage before. Thus, recomputing restoration bandwidth achieves better capacity utilization and improves connection availabilities under bounded link-bandwidth constraints. We consider a novel restoration scheme against link failures based on the knowledge of connections' accumulated downtime and SLA differentiation, in order to achieve a high availability satisfaction rate as well as minimum capacity usage and cost, given the current network capacity.
III. FUNDAMENTAL CONSTRAINTS
We assume the same conditions as in [30] when investigating connection availabilities in a telecom mesh network, e.g., an optical WDM mesh network. In addition, the restoration scheme observes the following constraints:
1) Reverting constraint: a disrupted connection is switched to its pre-computed restoration path when a link failure occurs on its primary path, or to an alternate restoration path when both its primary and pre-allocated restoration paths fail. After the failure is repaired, the traffic is switched back to the primary path again. This policy is called reverting, and we assume a reverting model in this study.
2) Capacity constraint: the number of wavelengths on a link is bounded.
3) Link-disjoint constraint: a connection's primary and restoration paths are link disjoint.
4) Path availability constraint: a path is available if and only if all of its links are available and have sufficient spare capacity.
5) Primary path constraint: a connection is carried on its primary path in the normal network state.
6) Restoration path constraint: when a link failure occurs on a connection's primary path, it is switched to a restoration path (determined before a link failure occurs or afterward) only if needed, in order to satisfy its SLA requirement.
7) Restoration-capacity sharing constraint: multiple connections can share a wavelength among their restoration paths as long as their primary
paths are link disjoint. We apply a link-vector technique [31] in computing restoration paths.
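As a concrete illustration of the restoration-capacity sharing constraint (constraint 7), the following minimal Python sketch tests whether two connections may share a backup wavelength; the set-based path representation and function name are assumptions for illustration, and the actual bookkeeping in this paper relies on the link-vector technique of [31]:

```python
# Restoration-capacity sharing test (constraint 7), assuming each primary
# path is represented simply as a set of link IDs. Two connections may
# share a backup wavelength on a common restoration link only if no single
# link failure can disrupt both primaries at once.

def can_share_backup_wavelength(primary_a: set, primary_b: set) -> bool:
    return primary_a.isdisjoint(primary_b)

# Example: link-disjoint primaries {1, 2, 3} and {4, 5} may share backup
# capacity; {1, 2, 3} and {3, 6} may not, since a cut of link 3 would
# fail both connections simultaneously.
assert can_share_backup_wavelength({1, 2, 3}, {4, 5})
assert not can_share_backup_wavelength({1, 2, 3}, {3, 6})
```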
IV. ACCUMULATED-DOWNTIME-ORIENTED RESTORATION (ADORE): PROPOSED RESTORATION APPROACH

A. Problem Statement
The following information about topology, network state, and failures is given:
1) The network topology is a directed graph G = (V, E, W), where V is the set of nodes, E is the set of bidirectional fibers (referred to as links), and W: E -> Z+ specifies the number of wavelengths on each link (where Z+ denotes the set of positive integers). Also, we assume that each node in the graph has full wavelength-conversion capability.
2) An incoming connection request is c = (s, d, A, h), where s is the source, d is the destination, A is the SLA availability requirement (e.g., 2-9s, 5-9s, etc.), and h is the holding time of connection c. Under a dynamic traffic pattern, a path that has been set up between two nodes to satisfy a connection request is taken down after a period of time h, called the connection's holding time. Each connection c occupies one wavelength. A connection is provisioned with a link-disjoint primary and restoration path pair.
3) An incoming link failure is f = (e, hf(e)), where e is the index of the failed link and hf(e) is the failure holding time of link e.

B. Notations
The following notations are used in the restoration algorithm:
- p(c): primary path of connection c.
- r(c): pre-allocated restoration path of connection c.
- r̃(c): an alternate restoration path for connection c, determined after a link failure occurs if r(c) is unavailable.
- S(c): state of connection c. A connection can be in the NORMAL, RESTORED, or DOWN state.
- T_downtime(c): maximal allowed downtime of connection c.
- A(c): c's availability requirement.
- Da(c): accumulated downtime of connection c.
- hf(e): failure holding time of link e.
- Qp(e): set of connections whose primary paths traverse failed link e.
- Qr(e): set of connections whose restoration paths traverse failed link e.
- Qr*(e): set of connections that are using their restoration paths due to some previous failure(s) and whose restoration paths traverse failed link e.
- ε: a small number, e.g., 10^-5, which is used to differentiate between the costs of sharable and nonsharable links when computing the restoration path of a connection.
- f(e): number of free wavelengths on link e.
- C(e'): cost of a candidate link e' when computing the restoration path of a connection.

Given the holding time h(c) of a dynamic connection c and c's SLA availability requirement A(c), the maximal allowed downtime T_downtime(c) can be computed as

T_downtime(c) = h(c) * (1 - A(c)).   (2)
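As a quick check of Eq. (2), the following Python sketch (the helper names and the approximate 730-hour month are our assumptions) reproduces the maximal allowed downtimes quoted in Section II for a 12-month holding time:

```python
# Sketch of Eq. (2): a connection's maximal allowed downtime is its holding
# time scaled by (1 - A(c)). Helper names and the month length are assumed.

HOURS_PER_MONTH = 730.0  # 8760 hours/year divided by 12, an approximation

def t_downtime_hours(holding_time_months: float, availability: float) -> float:
    """T_downtime(c) = h(c) * (1 - A(c)), with h(c) expressed in months."""
    return holding_time_months * HOURS_PER_MONTH * (1.0 - availability)

# Reproduces the Section II figures for a 12-month holding time:
# 2-9s -> 87.6 h (3.65 days), 3-9s -> 8.76 h, 4-9s -> ~53 min, 5-9s -> ~5.3 min.
for nines in (2, 3, 4, 5):
    availability = 1.0 - 10.0 ** (-nines)
    print(f"{nines}-9s: {t_downtime_hours(12, availability):.4f} h")
```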
C. Primary-Restoration Path-Pair Routing Procedure
In the normal network state, an incoming connection c is provisioned with a shortest link-disjoint primary and restoration path pair. The shorter path is used as the primary path, and the longer one as the restoration path. c is blocked if no such path pair is found. c's state is set to NORMAL. Connection c's restoration path is calculated with a link-vector technique based on the following link-cost function (please see [30] for details):

C(e') = ∞,      if e' ∈ p(c), or f(e') = 0 and θ_e(e') = B(e');
C(e') = 0,      if e' ∉ p(c) and θ_e(e') < B(e');
C(e') = 1 + ε,  otherwise,   (3)

where, in the link-vector scheme, (a) θ_e(e') is the number of backup wavelengths reserved on link e' to protect connections whose primary paths traverse link e and backup paths traverse link e'; and (b) B(e') is the number of backup wavelengths that have been reserved on link e' and computed according to the link-vector method.
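The Python sketch below illustrates one way to implement the link-cost function of Eq. (3); the parameter names are our own, and the case values follow the equation above, so the whole sketch should be read as an assumption-laden illustration rather than the paper's exact procedure:

```python
# One possible implementation of the link-cost function in Eq. (3) for
# computing r(c); all parameter names are assumptions, not the paper's
# notation.

INF = float("inf")
EPS = 1e-5  # separates sharable from nonsharable links, as defined above

def link_cost(on_primary: bool, free_wavelengths: int,
              theta: int, reserved_backup: int) -> float:
    """Cost C(e') of candidate link e' for the restoration path of c.

    theta           -- backup wavelengths on e' already protecting primaries
                       that traverse the considered link e (theta_e(e'))
    reserved_backup -- backup wavelengths reserved on e' in total (B(e'))
    """
    if on_primary or (free_wavelengths == 0 and theta == reserved_backup):
        return INF        # e' is on p(c), or has no free or sharable capacity
    if theta < reserved_backup:
        return 0.0        # an already-reserved backup wavelength is sharable
    return 1.0 + EPS      # a new (free) wavelength must be reserved on e'
```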
D. Restoration Algorithm
We now propose and study the properties of a novel dynamic restoration algorithm, called Accumulated-Downtime-Oriented Restoration (ADORE), which uses accumulated-downtime knowledge and service differentiation; a preliminary version was presented in [22]. A formal specification of ADORE is shown in Algorithm 1. ADORE examines all connections that lose their primary or restoration paths due to a failure, as well as connections that are in the DOWN state because of previous not-yet-repaired link failure(s). When a link failure occurs, a disrupted connection c is switched to its restoration path only if its
accumulated downtime Da(c) plus hf(e) will exceed its allowed accumulated downtime T_downtime(c). Otherwise, the connection stays in the DOWN state. Upon a link-failure repair, a faulty connection is reverted to its primary path if the path is fixed. If the connection was using its restoration path, we release the capacity along the path. In Algorithm 1, please also note the following:
1) When a connection c's primary path p(c) fails and its restoration path r(c) is unavailable due to a previous link failure, ADORE tries to find an alternate path r̃(c) that is link disjoint with p(c). r̃(c) is calculated based on the same link-cost function as r(c), with failed links removed, given the current network-state information and the remaining capacity on each link.
2) A connection c is in the DOWN state when either of the following two conditions is satisfied: a) Da(c) + hf(e) <= T_downtime(c) (e.g., Step 2(i) or Step 4(ii)); or b) Da(c) + hf(e) > T_downtime(c) and network resources are unavailable, e.g., the link bandwidth limit is reached (e.g., Step 2(iv) or Step 4(v)).
3) A connection is reverted back to its primary path as soon as a failure on the primary is repaired, in order to improve network survivability.

V. EXTENSION OF ADORE
ADORE applies to the scenario in which bandwidth cost is a key concern for an SP, given limited network capacity. On the other hand, excess network resources (e.g., enough bandwidth on links) are required in order to achieve a high availability satisfaction rate [24]. If a network has excess capacity [32], we can further consider restoring connections as long as resources are available. Low-SLA connections are then kept in the DOWN state only when SLA-differentiated connections are contending for limited restoration capacity on a link. The scheme maintains reasonably small accumulated downtime for connections, in order to improve restorability and better satisfy their SLA requirements against link failures. When protection switching of a high-SLA connection fails due to limited bandwidth on some link(s) along the restoration path, it can preempt restoration resources from a low-SLA connection on each out-of-bandwidth link, if both faulty connections share the same restoration capacity and the availability requirement of the low-SLA connection will not be violated after the preemption.
Algorithm 1 Accumulated-Downtime-Oriented Restoration (ADORE) Algorithm
Input: 1) G = (V, E, W); 2) the current network state, including routing information for all existing connections; and 3) a failure occurrence on link e with holding time hf(e).
Output: Each affected connection c is switched to r(c) or r̃(c), if needed, to satisfy c's SLA requirement; otherwise, c stays in the DOWN state.
Upon failure arrival (i.e., failure occurrence) on link e:
1. Set link e as unavailable.
2. For each connection c ∈ Qp(e), do the following:
   i. If Da(c) + hf(e) <= T_downtime(c), set S(c) = DOWN and Da(c) = Da(c) + hf(e).
   ii. Else, if r(c) is available, restore c to r(c) and set S(c) = RESTORED.
   iii. Else, if r̃(c) is available, restore c to r̃(c) and set S(c) = RESTORED.
   iv. Else, set S(c) = DOWN and Da(c) = Da(c) + hf(e).
3. For each connection c ∈ Qr(e), set r(c) as unavailable.
4. For each connection c ∈ Qr*(e), do the following:
   i. Release bandwidth along r(c) (if c was using r(c)) or r̃(c) (if c was using r̃(c)).
   ii. If Da(c) + hf(e) <= T_downtime(c), set S(c) = DOWN and Da(c) = Da(c) + hf(e).
   iii. Else, if c was using r(c) and r̃(c) is available, restore c to r̃(c) and set S(c) = RESTORED.
   iv. Else, if c was using r̃(c) and r(c) is available, restore c to r(c) and set S(c) = RESTORED.
   v. Else, set S(c) = DOWN and Da(c) = Da(c) + hf(e).
   vi. For each connection cx that shares a restoration wavelength with c on link e and S(cx) = DOWN: if Da(cx) + hf(e) > T_downtime(cx) and r(cx) or r̃(cx) is available, restore cx to r(cx) or r̃(cx) and set S(cx) = RESTORED.
Upon failure departure (i.e., failure repair) on link e:
5. For each connection c ∈ Qp(e), do the following:
   i. If p(c) is available, revert c back to p(c) and do the following:
      a. If S(c) = RESTORED and c was using r(c), release bandwidth along r(c).
      b. Else, if S(c) = RESTORED and c was using r̃(c), release bandwidth along r̃(c).
      c. For each connection cx that shares a restoration wavelength with c on link e and S(cx) = DOWN, do the same operation as Step 4(vi) of "Upon failure arrival on link e".
      d. Set S(c) = NORMAL.
   ii. Else, go to the next connection.
6. For each connection c ∈ Qr(e): if S(c) = DOWN and Da(c) + hf(e) > T_downtime(c), do the following:
   i. If r(c) is available, restore c to r(c) and set S(c) = RESTORED.
   ii. Else, if r̃(c) is available, restore c to r̃(c) and set S(c) = RESTORED.
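The following Python sketch condenses the core failure-arrival decision of Steps 2(i)-2(iv) in Algorithm 1; the Conn structure and the availability flags passed in are assumptions for illustration, with the routing and capacity checks stubbed out:

```python
# Condensed sketch of ADORE's rule for c in Qp(e) (Algorithm 1, Steps
# 2(i)-2(iv)); data structure and flag names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Conn:
    state: str         # "NORMAL", "RESTORED", or "DOWN" (S(c) in the text)
    Da: float          # accumulated downtime so far, in hours
    T_downtime: float  # maximal allowed downtime, in hours

def on_primary_failure(c: Conn, hf_e: float,
                       r_available: bool, r_alt_available: bool) -> None:
    """Restore c only if riding out this repair time would violate its SLA;
    otherwise keep it down and save the restoration capacity."""
    if c.Da + hf_e <= c.T_downtime:   # Step 2(i): c can absorb the outage
        c.state = "DOWN"
        c.Da += hf_e
    elif r_available:                 # Step 2(ii): switch to r(c)
        c.state = "RESTORED"
    elif r_alt_available:             # Step 2(iii): switch to alternate path
        c.state = "RESTORED"
    else:                             # Step 2(iv): no capacity left
        c.state = "DOWN"
        c.Da += hf_e
```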
Algorithm 2 Accumulated-Downtime-Oriented Restoration with Preemption (ADORE-P) Algorithm
Input: Same as Algorithm 1.
Output: Same as Algorithm 1.
Upon failure arrival (i.e., failure occurrence) on link e:
1. Set link e as unavailable.
2. For each connection c ∈ Qp(e), do the following:
   i. If r(c) is available, restore c to r(c) and set S(c) = RESTORED.
   ii. Else, if r(c) is unavailable due to limited bandwidth on some link(s) l1, l2, ..., li, ... along r(c), preempt cxi on each li if cxi ∈ Qr*(li), AP(cxi) < AP(c), and Da(cxi) + hf(e) <= T_downtime(cxi). If preemption succeeds on all links l1, l2, ..., li, ..., restore c to r(c) and set S(c) = RESTORED. Release bandwidth along the links of each r(cxi). Set S(cxi) = DOWN and Da(cxi) = Da(cxi) + hf(e).
   iii. Else, if an alternate path r̃(c) is available, restore c to r̃(c) and set S(c) = RESTORED.
   iv. Else, set S(c) = DOWN and Da(c) = Da(c) + hf(e).
3. For each connection c ∈ Qr(e), set r(c) as unavailable.
4. For each connection c ∈ Qr*(e), do the following:
   i. Release bandwidth along r(c) (if c was using r(c)) or r̃(c) (if c was using r̃(c)).
   ii-iv. Same as Steps 4(iii)-4(v) in Algorithm 1.
   v. For each connection cx that shares restoration links with c and S(cx) = DOWN: if r(cx) or r̃(cx) is available, restore cx to r(cx) or r̃(cx) and set S(cx) = RESTORED.
Upon failure departure (i.e., failure repair) on link e:
5. For each connection c ∈ Qp(e), do the following:
   i. If p(c) is available, revert c back to p(c) and do the following:
      a. If S(c) = RESTORED and c was using r(c), release bandwidth along r(c).
      b. Else, if S(c) = RESTORED and c was using r̃(c), release bandwidth along r̃(c).
      c. Same as Step 5(c) in Algorithm 1 (except change 4(vi) to 4(v)).
      d. Set S(c) = NORMAL.
   ii. Else, go to the next connection.
6. For each connection c ∈ Qr(e), do the following:
   i. If r(c) is available and S(c) = DOWN, restore c to r(c) and set S(c) = RESTORED.
   ii. Else, if r(c) is unavailable due to limited bandwidth on some link(s) l1, l2, ..., li, ..., preempt cxi on each li if cxi ∈ Qr*(li), AP(cxi) < AP(c), and Da(cxi) + hf(e) <= T_downtime(cxi). If preemption succeeds on all links, restore c to r(c) and set S(c) = RESTORED. Release bandwidth along the links of each r(cxi). Set S(cxi) = DOWN and Da(cxi) = Da(cxi) + hf(e).
We define the availability priority (AP) to evaluate a connection's resource-grabbing priority when SLA-differentiated connections compete for shared restoration capacity. The AP is based on both the SLA and the residual allowed downtime of a connection. Obviously, a connection with a stringent SLA should have high priority in resource preemption; therefore, the SLA is a key factor in deciding the APs of connections. In addition, a connection's residual allowed downtime (derived from its T_downtime and accumulated downtime) is another important feature for deciding the APs among connections. When a link e fails, the AP(c) of a faulty connection c that contends for shared restoration capacity on some link is computed as

AP(c) = -log10(1 - A(c)) + α * (T_downtime(c) - Da(c)) / hf(e).   (4)

Note that we normalize a connection c's availability A(c) to -log10(1 - A(c)), so that a connection with high A(c) is the preferred candidate to compete for shared restoration bandwidth on a link. The second term, (T_downtime(c) - Da(c)) / hf(e), is a tie-breaker when two connections have the same SLA. The ratio of a connection's residual allowed downtime (i.e., T_downtime(c) - Da(c)) to link e's failure holding time hf(e) reflects the relative resource-grabbing urgency when multiple connections have failed and compete for a shared restoration wavelength on e. The weight α is assigned a small value, e.g., 10^-3, such that the normalized SLA has higher priority and is thus the more important factor in the AP decision.

In Algorithm 2, we propose an upgraded version of ADORE called Accumulated-Downtime-Oriented Restoration with Preemption (ADORE-P). When a link e fails, an interrupted connection first tries to grab its pre-allocated restoration path. If the remaining capacity is not enough on some restoration link(s), we check all out-of-bandwidth links l1, l2, ..., li, .... The restoration wavelength on link li can be preempted if there exists a connection cxi ∈ Qr*(li) that has a lower AP(cxi) than that of the faulty connection and whose accumulated downtime Da(cxi) plus the repair time hf(e) does not exceed T_downtime(cxi). The rest of the restoration bandwidth along cxi's restoration path is then released for future use. The objective of ADORE-P is to maximize restoration capability (restore as many connections as possible against link failures), while ADORE aims to minimize the capacity requirements and thus the cost, given link-bandwidth constraints. The ADORE-P scheme can also provide economic advantages over ADORE, since it favors services with high restorability requirements and therefore obtains higher revenue and lower penalty per service.
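A minimal Python sketch of the AP computation in Eq. (4) follows; the symbol α for the small weight, the helper names, and the numeric example are our assumptions for illustration, not values from the paper's simulations:

```python
import math

ALPHA = 1e-3  # assumed small weight so the normalized SLA term dominates

def availability_priority(A: float, T_downtime: float,
                          Da: float, hf_e: float) -> float:
    """AP(c) per Eq. (4): -log10(1 - A(c)) + alpha*(T_downtime(c) - Da(c))/hf(e)."""
    return -math.log10(1.0 - A) + ALPHA * (T_downtime - Da) / hf_e

# Example with an 8-hour repair time: a fresh 5-9s connection (T_downtime
# ~5.3 min = 0.0876 h over 12 months) outranks a fresh 2-9s one
# (T_downtime 87.6 h), so the 5-9s connection wins a contended shared
# restoration wavelength and may preempt the 2-9s connection.
hf_e = 8.0
assert availability_priority(0.99999, 0.0876, 0.0, hf_e) > \
       availability_priority(0.99, 87.6, 0.0, hf_e)
```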
VI. ILLUSTRATIVE NUMERICAL EXAMPLES
The efficiency of ADORE and ADORE-P is validated through performance comparison with a traditional shared-path-protected recovery scheme [33], which is called REST in this study. In REST, a connection is switched to a pre-defined restoration path or an alternate restoration path when a link failure occurs on its primary path.

Figure 1 shows a large carrier-scale network topology used in this study, with each link's length (in kilometers) marked. Each (bidirectional) fiber link has 32 wavelengths in each direction. All the nodes have wavelength converters. We simulate 100,000 connection arrivals and departures assuming a uniform traffic pattern.¹ We evaluate three important performance criteria for the proposed restoration schemes: availability satisfaction rate, capacity efficiency, and protection-switching overhead. We show results for a representative network load of 120 Erlangs, which is a medium-load situation in our example scenario. The performance results are similar for other network loads; ADORE and ADORE-P outperform REST even more under light-to-medium loads than under heavy loads. The following assumptions are made:
Connection Assumptions:
1) Connection requests are randomly generated and uniformly distributed among all node pairs.
2) Each connection needs the full wavelength capacity.
3) Connection arrivals follow a Poisson process.
4) Connection holding times are uniformly distributed over {1, 2, ..., 12} months.
5) SLA requirements of connections are uniformly distributed over {2-9s, 3-9s, 4-9s, 5-9s}.
Failure Assumptions:
1) Failures are random and uniform among all links.
2) Failures (fiber cuts) occur between 0.2 and 2 times per 100 kilometers per year, which is typical in backbone networks [16,27].
3) The number of failures is normalized to the FIT (failure-in-time) representation (i.e., the number of failures in 10^9 hours); thus, the values are between 24,000 and 200,000 FIT/100 km.
4) The MTTR of a link failure is 8 hours [16].
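Failure assumption 3) above normalizes fiber-cut rates to FIT values; the arithmetic, sketched below in Python, matches the quoted 24,000-200,000 FIT/100 km range up to rounding:

```python
# Convert a fiber-cut rate (cuts per year per 100 km) to FIT, the number
# of failures per 10^9 hours. Rounding explains the small gap between the
# computed values and the 24,000-200,000 FIT/100 km figures in the text.

HOURS_PER_YEAR = 8760

def cuts_per_year_to_fit(cuts_per_year: float) -> float:
    return cuts_per_year / HOURS_PER_YEAR * 1e9

print(f"{cuts_per_year_to_fit(0.2):,.0f} FIT/100 km")  # ~22,831 (quoted as 24,000)
print(f"{cuts_per_year_to_fit(2.0):,.0f} FIT/100 km")  # ~228,311 (quoted as 200,000)
```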
¹A provisioning and restoration design requires knowledge of the traffic pattern and load between the various node pairs in a network. A real-world traffic pattern may be hubbed or non-uniform, although most state-of-the-art research on network performance has been carried out under the assumption of a uniform traffic distribution. This has an impact on resource allocation and network performance. If reasonable information is available about the traffic non-uniformity pattern, i.e., the fraction of the total traffic load that goes through one node, and if we assume that the remaining traffic is still uniformly distributed between all other node pairs, our restoration schemes can also be applied to reduce restoration overhead and improve the availability satisfaction rate. In general, if information (or more information) is available about traffic distribution, loads, and failure scenarios in our networks, a restoration scheme can be utilized to perform efficient restoration-path selection for improved resource utilization and availability-satisfaction performance.
Fig. 1. Large carrier-scale network topology used in this study, with each link's length marked in kilometers.
A. Availability Satisfaction Rate
Figure 2 shows the availability satisfaction rates (ASRs) [1] for ADORE, ADORE-P, and REST with different FIT values for SLA-differentiated connections. The availability satisfaction rate is the portion of connections whose availability requirements are guaranteed upon failure occurrences. As expected, ADORE and ADORE-P perform much better than REST. In particular, at a load of 120 Erlangs, which is a moderate-load scenario in our example, ADORE achieves 100% ASR for connections with 2-9s and 3-9s SLAs and about 95% ASR for connections with 4-9s and 5-9s SLAs, because it intelligently restores connections based on accumulated-downtime information in order to satisfy their SLA requirements. ADORE-P outperforms ADORE for 3-9s, 4-9s, and 5-9s connections (about 99% for 5-9s, about 98% for 4-9s, and 100% for 3-9s), because ADORE-P restores an interrupted connection whenever resources are available, so many connections maintain reasonably small accumulated downtimes. Furthermore, ADORE-P allows a high-SLA connection to preempt restoration bandwidth from a low-SLA connection while still meeting the availability requirements of both connections. We note that our proposed schemes favor connections with high SLAs (4-9s and 5-9s) even under low fiber-cut rates. For 24,000 FIT/100 km (0.2 cuts per 100 kilometers per year), ADORE-P and ADORE achieve a 100% restoration guarantee, while REST provides only 98.9% and 98.6% ASR for 5-9s and 4-9s connections, respectively. An SLA violation incurs a penalty that reduces an SP's revenue as well as a customer's loyalty. Hence, our proposed restoration schemes are very desirable for SPs who want to ensure that each service leads to improved customer satisfaction. The performance-comparison results indicate that connection selection is a key factor in guaranteeing availability against link failures.
Fig. 2. Availability satisfaction rates of ADORE, ADORE-P, and REST for SLA-differentiated connections, panels (a)-(d), versus link failure rate in FIT/100 km (network load = 120 Erlangs).
In REST, the ASRs of connections with 4-9s and 5-9s SLAs are very similar because the T_downtime values of both SLA levels (5 minutes/year for 5-9s and 53 minutes/year for 4-9s, respectively) are so small that both receive similar treatment for restoration, considering an 8-hour MTTR.

B. Restoration Overbuild
Another important metric for evaluating resource utilization in restoration schemes is restoration overbuild, defined as the amount of bandwidth used by successful restorations divided by the total primary bandwidth in a network [34]. Restoration overbuild indicates the extra resources needed to implement a restoration scheme and helps an SP gain insight into the bandwidth overhead for restoration. Obviously, a lower restoration overbuild means better resource utilization.
1) Different FIT Values: Figure 3 illustrates the restoration overbuilds for ADORE, ADORE-P, and REST with different FIT values when the network load is 120 Erlangs. The results show that ADORE achieves up to 17.4% improvement in restoration overbuild compared to REST. The differentiated restoration service based on accumulated downtime and SLA requirements contributes to the reduction, especially because some connections may be kept in the down state without restoration-bandwidth allocation while still meeting their availability requirements. ADORE-P incurs a higher restoration overbuild than ADORE but still achieves better performance than REST. Recall that ADORE-P aims to restore as many disrupted connections as possible, whenever links have enough remaining capacity or when a connection
can grab bandwidth from its lower-SLA peers on its restoration links. Therefore, the ADORE-P scheme achieves more successful restorations but consumes more capacity than ADORE. We also note that the restoration overbuilds of ADORE, ADORE-P, and REST do not change much when the FIT value is high. This is mainly because the available bandwidth resource in a network is limited and does not change, although more connections need to be restored and compete for the limited bandwidth when failure arrivals get heavier.

Fig. 3. Restoration overbuild of ADORE, ADORE-P, and REST versus link failure rate in FIT/100 km (network load = 120 Erlangs).

2) Different Network Loads: In Fig. 4, we show the restoration overbuilds for ADORE, ADORE-P, and REST with 24,000 FIT/100 km under different network loads. The FIT value reflects 0.2 failures per 100 kilometers per year [27]. The performance results are similar for other FIT values. Again, ADORE performs much better than REST. ADORE-P consumes more restoration capacity than ADORE in order to maximize restoration capability and therefore has a higher overbuild value. The restoration overbuilds of ADORE, ADORE-P, and REST decrease when the network load is heavy. This is mainly because more connections need to be provisioned and tend to share restoration capacity when the network load is heavy.

Fig. 4. Restoration overbuild of ADORE, ADORE-P, and REST versus network load in Erlangs (24,000 FIT/100 km).

C. Protection-Switching Overhead
Figure 5 summarizes the number of protection switchings (i.e., the number of activated restoration paths) for ADORE, ADORE-P, and REST with different FIT values when the network load is 120 Erlangs. The numbers of switchings for the three algorithms in Fig. 5 are normalized by the number of switchings for REST. The performance results are similar for other network loads. ADORE reduces protection-switching overhead by up to 62% compared to REST because ADORE wisely restores faulty connections when
a link fails. ADORE efficiently uses network resources while still satisfying the different SLA requirements of most connections. ADORE-P enables more restoration paths (and therefore more protection-switching overhead) than ADORE and REST because it switches a faulty connection to its restoration path as long as resources are available. Moreover, it activates protection paths for high-SLA connections under conditions of resource preemption.

Fig. 5. Number of protection switchings (normalized to REST) for ADORE, ADORE-P, and REST versus link failure rate in FIT/100 km (network load = 120 Erlangs).

VII. CONCLUSION
In this paper, we proposed a novel restoration algorithm, called Accumulated-Downtime-Oriented Restoration (ADORE), with accumulated-downtime awareness for optical mesh networks. The algorithm wisely determines the set of restorable connections, in order to minimize restoration-capacity requirements as well as to improve availability-guarantee capability. When a link failure occurs, an affected connection is switched to a restoration path, if necessary, only when its accumulated downtime plus the MTTR would violate its SLA requirement. Otherwise, the connection can remain temporarily in the down state. The significant contribution of this algorithm is that it captures the merits of both resource efficiency and SLA guarantee, by applying restoration-capacity sharing and intelligently selecting restorable connections based on their accumulated downtime and SLA requirements. We also presented an upgraded version of the algorithm, ADORE-P. In ADORE-P, a faulty connection is switched to its restoration path when there is sufficient network capacity or when it can preempt restoration capacity from its lower-SLA peers. The upgraded version focuses on maximizing the number of restorable connections, given link-bandwidth constraints. We demonstrated through simulation experiments that ADORE achieves significantly better
performance on three important criteria (availability satisfaction rate, resource efficiency, and protection-switching overhead) compared to a general restoration scheme, REST, which is unaware of accumulated-downtime information. ADORE-P shows a trade-off between restoration capability on one side and restoration-bandwidth usage and protection-switching overhead on the other. To conclude, the two algorithms provide differentiated, accumulated-downtime-oriented restoration services to existing connections upon a link failure, in order to achieve optimal connection recovery and provide extra gain by maximizing restorable services.

ACKNOWLEDGMENTS
This work has been supported in part by the National Science Foundation (NSF) under grants ANI-02-07864 and CNS-05-20190. A short version of this paper was presented at the OFC'08 conference in San Diego, CA, in February 2008.

REFERENCES
[1] B. Mukherjee, Optical WDM Networks, Springer, Feb. 2006.
[2] M. Tornatore, G. Maier, and A. Pattavina, "Availability design of optical transport networks," IEEE J. Sel. Areas Commun., vol. 23, no. 8, pp. 1520-1532, Aug. 2005.
[3] W. D. Grover and D. Tipper, "Design and operation of survivable networks," J. Network Syst. Manage., vol. 13, no. 1, pp. 7-11, Mar. 2005.
[4] R. Ramamurthy, Z. Bogdanowicz, S. Samieian, D. Saha, B. Rajagopalan, S. Sengupta, S. Chaudhuri, and K. Bala, "Capacity performance of dynamic provisioning in optical networks," J. Lightwave Technol., vol. 19, no. 1, pp. 40-48, Jan. 2001.
[5] R. Bartos and M. Raman, "A heuristic approach to service restoration in MPLS networks," in Proc. IEEE ICC 2001, vol. 1, pp. 117-121, June 2001.
[6] S. S. Lumetta and M. Medard, "Towards a deeper understanding of link restoration algorithms for mesh networks," in Proc. IEEE INFOCOM'01, vol. 1, pp. 367-375, May 2001.
[7] J. Zhang and B. Mukherjee, "Review of fault management in WDM mesh networks: basic concepts and research challenges," IEEE Network, vol. 18, no. 2, pp. 41-48, Mar./Apr. 2004.
[8] J. Doucette, "Advances on design and analysis of mesh-restorable networks," Ph.D. dissertation, University of Alberta, Edmonton, AB, Canada, Dec. 2004.
[9] C. Ou, H. Zang, N. K. Singhal, K. Zhu, L. H. Sahasrabuddhe, R. A. MacDonald, and B. Mukherjee, "Sub-path protection for scalability and fast recovery in optical WDM mesh networks," IEEE J. Sel. Areas Commun., vol. 22, no. 9, pp. 1859-1875, Nov. 2004.
[10] K. Murakami and H. S. Kim, "Comparative study on restoration schemes of survivable ATM networks," in Proc. IEEE INFOCOM'97, vol. 1, pp. 7-11, Apr. 1997.
[11] A. Fumagalli and L. Valcarenghi, "IP restoration vs. WDM protection," IEEE Network, vol. 14, no. 6, pp. 34-41, Nov./Dec. 2000.
[12] G. Ellinas, E. Bouillet, R. Ramamurthy, J. Labourdette, S. Chaudhuri, and K. Bala, "Routing and restoration architectures in mesh optical networks," SPIE Optical Networks Mag., vol. 4, no. 1, pp. 91-106, Jan./Feb. 2003.
[13] A. Lord and M. Wade, "Techno-economics issues in future telecom networks," in Proc. IEEE OFC 2007, Anaheim, CA, Mar. 2007.
[14] M. Tacca, A. Fumagalli, A. Paradisi, E. Unghvary, K. Gadhiraju, S. Lakshmanan, S. M. Rossi, A. deCampos Sachs, and D. S. Shah, "Differentiated reliability in optical networks: theoretical and practical results," J. Lightwave Technol., vol. 21, no. 11, pp. 2576-2586, Nov. 2003.
[15] D. A. Kimber, X. Zhang, P. H. Franklin, and E. J. Bauer, "Modeling planned downtime," Bell Labs Tech. J., pp. 7-19, Nov. 2006.
[16] "Network availability in meshed transport networks," Technology White Paper, Alcatel-Lucent, May 2007. http://www1.alcatel-lucent.com/com/en/appcontent/opgss/Net_Avail_Meshed_twp_tcm228-1283621635.pdf.
[17] L. Kant, D. Hsing, and T. Wu, "Modeling and simulation study of the survivability performance of ATM-based restoration strategies for the next generation high-speed networks," in Proc. 8th Int. Conf. Computer Communications and Networks, Oct. 1999, pp. 469-473.
[18] M. Sridharan, M. V. Salapaka, and A. K. Somani, "A practical approach to operating survivable WDM networks," IEEE J. Sel. Areas Commun., vol. 20, no. 1, pp. 34-46, Jan. 2002.
[19] S. Tak and E. K. Park, "Modeling and performance study of restoration framework in WDM optical networks," Computer Networks, vol. 49, no. 2, pp. 217-242, Oct. 2005.
[20] O. Gerstel and G. Sasaki, "Meeting SLAs by design: a protection scheme with memory," in Proc. IEEE OFC 2007, Anaheim, CA, Mar. 2007.
[21] D. A. Schupke, A. Autenrieth, and T. Fisher, "Survivability of multiple fiber duct failures," in Proc. Design of Reliable Communication Networks, Oct. 2001, pp. 219-231.
[22] L. Song and B. Mukherjee, "Accumulated-downtime-aware restoration approach for dynamic SLA-differentiated services in survivable mesh networks," in Proc. IEEE OFC 2008, San Diego, CA, Feb. 2008.
[23] B. T. Doshi, S. Dravida, P. Harshavardhana, O. Hauser, and Y. Wang, "Optical network design and restoration," Bell Labs Tech. J., vol. 4, no. 1, pp. 58-84, Jan./Mar. 1999.
[24] M. Tornatore, G. Maier, and A. Pattavina, "Capacity versus availability trade-offs for availability-based routing," J. Opt. Netw., vol. 5, no. 11, pp. 858-869, Nov. 2006.
[25] M. Tornatore, D. Lucerna, L. Song, B. Mukherjee, and A. Pattavina, "Dynamic SLA redefinition for shared-path-protected connections with known duration," in Proc. IEEE OFC 2008, San Diego, CA, Feb. 2008.
[26] J. L. Hellerstein, K. Katircioglu, and M. Surendra, "An on-line, business-oriented optimization of performance and availability for utility computing," IEEE J. Sel. Areas Commun., vol. 23, pp. 2013-2021, Oct. 2005.
[27] R. C. Menendez and J. W. Gannett, "Efficient, fault-tolerant all-optical multicast networks via network coding," Telcordia Research Paper, Feb. 25, 2008. http://www.research.telcordia.com/papers/jgannett/20080225_Menendez_Gannett_Optical_Network_Coding.pdf.
[28] J. Jereb, T. Jakab, and F. Unghvary, "Availability analysis of multi-layer optical networks," SPIE Optical Networks Mag., vol. 3, no. 2, pp. 84-95, Mar./Apr. 2002.
[29] E. J. Bauer and P. H. Franklin, "Framework for availability characterization by analyzing outage durations," Bell Labs Tech. J., pp. 39-46, Nov. 2006.
[30] L. Song, J. Zhang, and B. Mukherjee, "Dynamic provisioning with availability guarantee for differentiated services in survivable mesh networks," IEEE J. Sel. Areas Commun., vol. 25, no. 3, pp. 35-43, Apr. 2007.
[31] G. Mohan, C. S. R. Murthy, and A. K. Somani, "Efficient algorithms for routing dependable connections in WDM optical networks," IEEE/ACM Trans. Networking, vol. 9, pp. 553-566, Oct. 2001.
[32] M. Batayneh, S. Rai, S. Sarkar, and B. Mukherjee, "Efficient
management of a network's excess capacity: a traffic-engineering approach," in Proc. European Conf. Optical Communication (ECOC 2007), Berlin, Sept. 2007.
[33] S. Ramamurthy, L. Sahasrabuddhe, and B. Mukherjee, "Survivable WDM mesh networks," J. Lightwave Technol., vol. 21, no. 4, pp. 870-883, Apr. 2003.
[34] F. Yu, R. K. Sinha, D. Wang, G. Li, J. Strand, R. Doverspike, C. Kalmanek, and B. Cortez, "Improving restoration success in mesh optical networks," J. Opt. Netw., vol. 3, no. 1, pp. 25-37, Dec. 2003.

Lei Song (S'95-M'08) received the B.S. and M.S. degrees in computer science from Beijing University of Posts and Telecommunications, Beijing, China, in 1995 and 1998, respectively, and the Ph.D. degree from the University of California, Davis, USA, in November 2007. She was with the Research and Development Department at China Telecom Co. Ltd., Shanghai, China, from April 1998 to August 2001. She is currently working in the Platform Engineering Department at Yahoo! Inc., Sunnyvale, CA, USA. Her research interests include performance analysis of survivability in modern telecom networks, protection and restoration of traffic in optical WDM networks, network traffic analysis and modeling, and next-generation networking.

Biswanath Mukherjee (S'82-M'87-F'07) received the B.Tech. (Hons) degree from the Indian Institute of Technology, Kharagpur, India, in 1980 and the Ph.D. degree from the University of Washington, Seattle, in June 1987. At Washington, he held a GTE Teaching Fellowship and a General Electric Foundation Fellowship. In July 1987, he joined the University of California, Davis, where he has been a Professor of Computer Science since July 1995 (and currently holds the Child Family Endowed Chair Professorship) and served as Chairman of the Department of Computer Science from September 1997 to June 2000. He is the winner of the 2004 Distinguished Graduate Mentoring Award at UC Davis. Two Ph.D. dissertations (by Dr. Laxman Sahasrabuddhe and Dr. Keyao Zhu), which were supervised by Professor Mukherjee, won the 2000 and 2004 UC Davis College of Engineering Distinguished Dissertation Awards. To date, he has graduated nearly 31 Ph.D. students, with almost the same number of M.S. students. Currently, he supervises the research of nearly 20 scholars, mainly Ph.D. students, including visiting research scientists in his laboratory. Dr. Mukherjee is a co-winner of paper awards presented at the 1991 and 1994 National Computer Security Conferences. He serves or has served on the editorial boards of IEEE/ACM Transactions on Networking, IEEE Network, ACM/Baltzer Wireless Information Networks (WINET), the Journal of High-Speed Networks, Photonic Network Communications, Optical Network Magazine, and Optical Switching and Networking. He served as Editor-at-Large for optical networking and communications for the IEEE Communications Society, as the Technical Program Chair of the IEEE INFOCOM'96 conference, and as the Chairman of the IEEE Communications Society's Optical Networking Technical Committee (ONTC) during 2003-2005. Dr. Mukherjee is the author of the textbook Optical WDM Networks, published by Springer in January 2006. Earlier, he authored the textbook Optical Communication Networks, published by McGraw-Hill in 1997, a book which received the Association of American Publishers, Inc.'s 1997 Honorable Mention in Computer Science award. He is a Member of the Board of Directors of IPLocks, Inc., a Silicon Valley startup company. He has consulted for and served on the Technical Advisory Board (TAB) of a number of startup companies in optical networking. His current TAB appointments include Teknovus, Intelligent Fiber Optic Systems, and LookAhead Decisions Inc. (LDI). He is a Fellow of the IEEE. Dr. Mukherjee's research interests include lightwave networks, network security, and wireless networks.