
Survivable Network Design

David Tipper
Graduate Telecommunications and Networking Program
University of Pittsburgh
Telcom 2110 Slides 12

Motivation
• Communication networks need to be survivable
• Communication Networks are Critical Infrastructure (CI) (PCCIP 1996): the systems,
assets, and services upon which society and the
economy depend
• Communication infrastructure often considered
most important CI due to reliance on it by other
infrastructures
– banking and finance, government services
– power grid SCADA, etc.
• Increasing Impact and Rate of Failures
– Increased bandwidth of links (WDM technology in
fiber optic network)
– Increased societal dependence
– Multiple network operators and vendor equipment

Causes of Network Outages
• According to Sprint, a link outage occurs in the IP backbone every 30 minutes
on average
• Accidents
– cable cuts, car wrecks, etc.
– According to AT&T: 4.39 cable cuts / year / 1000 km
• Human errors
– incorrect maintenance, installation
• Environmental hazards
– fire, flood, etc.
• Sabotage
– physical, electronic
• Operational disruptions
– scheduled upgrades, maintenance, power outages
• Hardware/Software failures
– Line card failure, faulty laser, software crash, etc.

Backbone Failures

[Pie chart] Causes of backbone failures (Source: University of Michigan, 2000):
• Link failures: 36% (note the time to recover from a layer 1 failure)
• Router operations (software upgrade, hardware upgrade, configuration errors, congestion): 32%
• Router failures (software failures, hardware failures, DoS attacks): 23%
• Other/unknown: 9%

Network Survivability
• Definition
– Ability of the network to support the committed Quality of
Services (QoS) continuously in the presence of various
failure scenarios
– Includes performance as well as availability
• Survivability Components
– Analysis: understand failures and system functionality after failures

– Design: adopt network procedures and architecture to prevent and


minimize the impact of failures/attacks on network services.
– Goal: maintain service for certain scenarios at reasonable cost
• Self-healing network

Survivable Network Design


• Three steps towards a survivable network
1. Prevention:
– Robust equipment and architecture (e.g., backup power supplies)
– Security (physical, electronic), Intrusion detection, etc.
2. Topology Design and Capacity Allocation
 Design network with enough resources in appropriate topology
 Spare capacity allocation – to recover from failure
3. Network Management and traffic restoration procedures
 Detect the failure, and reroute traffic around the failure using the
redundant capacity

Survivability – Basic Concepts
• Working path and Backup path (recovery path):
• Working path: carry traffic under normal
operation
• Backup path: an alternate path to carry the traffic
in case of failures

3 4 W orking route

Backup route
Backup
route

D CS

Custom er
1 X 2

A B

Survivability – Basic Concepts


– To survive against a network failure
– working path and backup path must be disjoint
– So that both paths are not lost at the same time
• Disjoint = ? (depending on a failure scenario)
– Link disjoint
– Node disjoint
– (Shared Risk Link Group) SRLG disjoint

[Figure] Link-disjoint vs. node-disjoint working (AP) and backup (BP) paths between source and destination.

Shared Risk Link Group
(SRLG)
[Figure] Logical intent vs. actual routing: logically diverse links may ride the same physical cables.
• Two fiber cables share the same duct or other common physical structure (such as a bridge crossing)
• The two cables can fail simultaneously
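
Disjointness can be checked mechanically. Below is a minimal sketch using networkx's disjoint-path routines on a hypothetical six-node topology; SRLG-disjointness would additionally require that the two paths share no common risk group.

```python
# A minimal sketch of finding disjoint working/backup paths with networkx.
# The 6-node topology and the SRLG mapping are hypothetical.
import networkx as nx
from networkx.algorithms.connectivity import edge_disjoint_paths, node_disjoint_paths

G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 6), (1, 4), (4, 5), (5, 6), (2, 5)])

# Paths sharing no link (survive any single link failure on either path)
for path in edge_disjoint_paths(G, 1, 6):
    print("link-disjoint:", path)

# Paths sharing no intermediate node (also survive node failures)
for path in node_disjoint_paths(G, 1, 6):
    print("node-disjoint:", path)

# For SRLG-disjointness, map each link to its shared risk groups and
# reject path pairs whose links intersect in any group (sketch only):
srlg = {(1, 2): {"duct-7"}, (1, 4): {"duct-7"}}  # hypothetical shared duct
```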

Classification of
Survivability Techniques

• Path-based (Global) versus Link-based (Local)


• Failure Dependent vs. Failure Independent
• Protection versus Restoration
• Dedicated-Backup versus Shared- Backup Capacity
• Ring versus Mesh topology
• Dual homing
• P-cycle

Path-based versus Link-based

• Path-based Scheme (Global)


– Disjoint alternate routes are provided between source
and destination node

[Figure] Path-based recovery in a six-node network: the backup path is disjoint from the working path between source and destination.

Path-based versus Link-based

• Link-based Scheme (Local)


– Alternate routes are provided between the end nodes of the failed link
– Can have a backhaul situation, which wastes bandwidth

[Figure] Link-based recovery in the same six-node network: the backup path routes around only the failed link.

Partial Path Scheme

• Partial Path Scheme


– Alternate routes are from the upstream node to
destination node or from the downstream node to
source node

[Figure] Partial path recovery: backup paths run from the upstream node of the failure to the destination, or from the downstream node back toward the source.

Path-based versus Link-based

             Bandwidth efficient   Simpler   Faster recovery
Path-based          yes
Link-based                           yes          yes
What Does Survivability Get You?

[Figure] Working path (WP) and backup path (BP) between source S and destination D.

• $A_i$ is the availability of link $i$
• Availability of a connection between S and D:

$A_{\text{no-protection}} = \prod_{i \in WP} A_i$

$A_{\text{protection}} = \prod_{i \in WP} A_i + \prod_{i \in BP} A_i - \prod_{i \in WP \cup BP} A_i$

• Given $A_i = 0.998297$: $A_{\text{no-protection}} = 0.996597$, $A_{\text{protection}} = 0.999983$
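
To make the arithmetic concrete, here is a minimal sketch of the formulas above. The per-link availability is the slide's; the path lengths (a 2-link working path, a 3-link backup path) are assumptions chosen to reproduce the slide's numbers.

```python
# A minimal sketch of the availability formulas above.
from math import prod

A = 0.998297                 # per-link availability (from the slide)
WP = [A] * 2                 # working path links (assumed 2 hops)
BP = [A] * 3                 # backup path links (assumed 3 hops)

a_wp = prod(WP)              # A_no-protection: every WP link must be up
a_bp = prod(BP)
# Inclusion-exclusion: the connection is up if WP or BP is up. With
# disjoint paths, prod over WP union BP equals a_wp * a_bp, so this is
# the same as 1 - (1 - a_wp) * (1 - a_bp).
a_prot = a_wp + a_bp - a_wp * a_bp

print(f"no protection:  {a_wp:.6f}")    # ~0.996597
print(f"1+1 protection: {a_prot:.6f}")  # ~0.999983
```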

Failure Dependent vs. Failure Independent

• Failure Dependent – the backup path depends on which


device fails – need a set of paths one for each failure case
• Failure Independent – the backup path is link- and node-disjoint
with the working path – one backup path per working path
• Example:
[Figure] Working path 1–2–3 with a failure-dependent backup path for a link 2-3 failure, another failure-dependent backup path for a link 1-2 failure, and a single failure-independent backup path disjoint from the working path.

Protection versus Restoration
• When to establish the backup paths?
• Protection
– Backup paths are fully set up before a failure occurs
– When a failure occurs, no additional signaling is needed to establish
the backup path
– Faster recovery time
• Restoration
– Backup paths are established after a failure occurs
– More flexible with regard to failure scenarios
• backup paths are set up after the location of the failure is known
– More capacity efficient
• due to its shared-backup nature
• utilizes any spare capacity available in the network
– But cannot guarantee 100% restorability after failures

Protection
• Protection Variants
– 1+1 Protection (dedicated protection)
• Traffic is duplicated and transmitted over both working and backup
paths
– Fastest recovery speed, but not bandwidth efficient
– 1:1 Protection (dedicated protection with extra traffic)
• During normal operation (failure free), traffic is transmitted only
over the working path; the backup path can be used to carry extra
(low priority) traffic → better bandwidth utilization
• When the working path fails, the extra traffic is preempted, and traffic
is switched to the backup path

[Figure] Working path (WP) and backup path (BP) between source and destination.

Protection
– 1:N Protection (shared recovery with extra
traffic)
• One protection entity for N working entities
[Figure] 1:N automatic protection switching (APS): one protection channel shared by working channels 1…n between Node 1 and Node 2.

– M:N Protection (M ≤ N)
• M protection entities for N working entities
– Self Healing Rings are a form of Protection

Link Redundancy

• Link Bundling – simultaneous physical connections
– A single link failure does not affect forwarding
– Load is redistributed among the other members of the bundle (see the sketch below)
• Parallel Link Technologies
– MLPPP – T1/E1 link aggregation
– 802.3ad – Ethernet link aggregation
– SONET/SDH aggregation
– Multi-Link Frame Relay
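
As a rough illustration of the load-redistribution behavior, here is a minimal sketch in which flows are hashed over the active bundle members. Flow IDs and member names are hypothetical; real implementations hash packet header fields in hardware, often with consistent hashing so that fewer flows move.

```python
# A minimal sketch of rehashing flows across a link bundle after one
# member fails: only the mapping changes, forwarding continues.
from zlib import crc32

def pick_member(flow_id: str, members: list[str]) -> str:
    # Hash the flow onto one of the currently active member links.
    return members[crc32(flow_id.encode()) % len(members)]

bundle = ["T1-a", "T1-b", "T1-c", "T1-d"]
flows = [f"flow-{i}" for i in range(8)]

before = {f: pick_member(f, bundle) for f in flows}
bundle.remove("T1-b")                    # one member link fails
after = {f: pick_member(f, bundle) for f in flows}

for f in flows:
    note = "moved" if before[f] != after[f] else ""
    print(f, before[f], "->", after[f], note)
```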

Types of Self-healing Rings

[Figure] Two ring architectures built from add-drop multiplexers (ADMs), each with a working ring and a protection ring:
• 1:1 Uni-directional self-healing ring (USHR)
• 1:1 Bi-directional self-healing ring (BSHR)

Dedicated versus Shared - Backup


• Dedicated-Backup Capacity
– Backup resource can be used only by a particular working path
• Shared-Backup
Shared Backup Capacity
– Backup resource between several working paths can be shared
– Rule: backup resource can be shared only when corresponding
working paths are not expected to fail at the same time
– More capacity efficient
[Figure] WP1 (5 units of traffic) and WP2 (10 units) have backup paths BP1 and BP2 that both traverse link 5-7. On link 5-7: dedicated spare capacity = 15 units; shared spare capacity = 10 units.
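
A minimal sketch of the sharing rule on link 5-7, using the figure's numbers: a dedicated reservation sums the backup demands, while sharing takes the maximum, because the disjoint working paths cannot fail together under a single-link-failure assumption.

```python
# Dedicated vs. shared spare capacity on one link (figure's numbers).
backups_on_link = {      # backup paths crossing link 5-7
    "BP1": 5,            # protects WP1 (5 units)
    "BP2": 10,           # protects WP2 (10 units)
}

dedicated = sum(backups_on_link.values())  # each backup gets its own spare
shared = max(backups_on_link.values())     # disjoint WPs: one failure at a time

print(f"dedicated spare on link 5-7: {dedicated} units")  # 15
print(f"shared spare on link 5-7:    {shared} units")     # 10
```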


Ring vs Mesh Architectures


Advantages of Rings:
• More cost efficient at low traffic volumes
• Fast protection switching, some capacity
sharing
Advantages of Mesh:
• More cost efficient at high traffic volumes
• Facilitates capacity and cost efficient mesh
restoration
• More flexible channel re-configuration

P Cycles
Protection (P) Cycle
– Closed cycles are pre-configured in the mesh network
– Affected traffic is rerouted along these cycles
– A large network will have a number of p-cycles

[Figure] (a) A pre-configured cycle; (b) a link on the cycle fails; (c) a link not on the cycle fails; (d) another link not on the cycle fails.

P-Cycles: Basics
• For meshed networks
• Pre-reserved protection paths (before failure)
• Based on cycles, like rings
• Also protects straddling failures, unlike rings
• Local protection action, adjacent to the failure (on the
order of some 10 milliseconds)
• Shared capacity
• “pre-configured protection cycles” → p-cycles
• Developed at TRLabs / University of Alberta

P-Cycles: Basics

• A single p-cycle in a network protects two kinds of spans:
• 9 “on-cycle” spans (1 protection path each)
• 8 “straddling” spans (2 protection paths each)

Restoration using p-cycles

A. Form the spare capacity into a particular set of pre-connected cycles.
• If span i on the cycle fails, p-cycle j provides one unit of restoration capacity: 1 restoration path, BLSR-like (the $x_{i,j} = 1$ case).
• If span i off the p-cycle (straddling) fails, p-cycle j provides two units of restoration capacity: 2 restoration paths, mesh-like (the $x_{i,j} = 2$ case).

Mesh Survivability Techniques

• Protection
– Dedicated-backup protection
• Path-based
• Link-based
– Shared-backup protection
• Path-based
• Link-based
– P-cycle
• Restoration
– Path-based restoration
– Link-based restoration

Survivability Technique Metrics


• Scope of failure coverage
– single link failure, single node/link failure, multiple failures, etc.
• Recovery time
– 50 ms in a SONET ring
• Backup capacity requirement (redundancy R = amount of spare capacity / amount of working capacity)
• Guaranteed vs. non-guaranteed bandwidth
• Reordering and duplication
– switching between WP and BP
• Additive latency and jitter
– quality of backup path, backup path length, congestion on backup path
• State overhead
• Scalability
• Signaling requirements
• Notion of resilience classes (QoR)
– Different level of connection availability, restorability and recovery time

Transport Survivability
• Number of techniques exist
– APS
– Multi-homing (with or without trunk diversity)
– Link restoration
– Path restoration
– Self healing rings
– p-cycles
• See a mixture of techniques in real networks
• Usually little or no survivability at the far edge (CPE – last mile)
• Edges are multi-homed to MAN or WAN

[Figure] Access networks multi-homed to the core.

Dual/Multi-homing Topologies
• Dual-homing
– Customer host is connected to two switch hubs
– Traffic may be split between the primary and secondary paths connecting to the hubs
– Each path serves as a backup for the other
• Multi-homing
– Customer host is connected to more than two switch hubs
– Greater protection against a failure

[Figure] Dual-homing topology vs. multi-homing topology: a customer host connected to two or more switches.

Dual-homing in Telephone
Network
[Figure] SDH/SONET facility protection in the transmission network. Diverse Class 4 toll switch locations keep the radius of damage small; multiple routes between offices keep the radius of service loss small; the Class 5 local network is dual-homed into the toll network.

Resilient Edge Connectivity


• Multi-Homing for resilient Internet and IP-VPN
connectivity
• Solves link failure and ISP node failure problems
• What about failure of customer edge router?

[Figure] A customer edge router multi-homed to two ISP routers.
Virtual Router Redundancy Protocol

• Redundant default gateways: VRRP (RFC 2338)
• Multiple routers on the subnet negotiate who will be “Master” and own the Virtual Router IP Address
• All other routers are backups; backup priority is configurable
• The Master sends periodic hellos to communicate alive state
• Hosts are preconfigured with the Virtual Router IP address as the default gateway for traffic exiting the LAN

[Figure] A Master and a backup router, each with a GE interface into an Ethernet switch serving several hosts.
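
A minimal sketch of the election rule (per the RFC's semantics: highest priority wins, ties broken by the higher interface address); router names, priorities, and addresses are hypothetical.

```python
# VRRP master election sketch: priority first, then higher IP address.
from ipaddress import ip_address

routers = [
    {"name": "R1", "priority": 110, "ip": "10.0.0.2"},
    {"name": "R2", "priority": 100, "ip": "10.0.0.3"},
]

def elect_master(candidates):
    return max(candidates,
               key=lambda r: (r["priority"], int(ip_address(r["ip"]))))

master = elect_master(routers)
print("master:", master["name"])                    # R1 (higher priority)

# If the master's hellos stop, the surviving backups re-elect:
alive = [r for r in routers if r["name"] != master["name"]]
print("new master:", elect_master(alive)["name"])   # R2
```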

Dual-homing in Data Network

[Figure] A customer edge (CE) router dual-homed to two provider edge (PE) routers.

Implementation
• Multi-layered:
– Demand Topology
– Logical Transport Topology
– Fiber/Optical Topology
• Can implement survivability
techniques at each layer
• Need to consider
– Failure propagation
– Alarm Setting
– Speed of recovery
– Cost
– Management
– Traffic Grooming
– Etc.

Traffic Restoration Capabilities


• A survivability scheme and spare capacity do not accomplish
restoration by themselves; they must be used in conjunction with dynamic
restoration techniques.

• The failure must be detected and path rearrangement performed, given that
there is enough spare capacity in the network.

• For example, a dual-homing approach guarantees surviving
connectivity, but it does not by itself restore the circuits/connections.

• Network management procedures are needed to perform path
rearrangement.

Steps in Traffic Recovery

Reconfiguration process: Detection → Notification → Fault Isolation / Identification → Path Selection → Rerouting

Repair process: Repair → Normalization

IP Survivability Options

• Several techniques to improve survivability in


IP networks
• IP layer
– adjust link weights and timers for faster failure recovery
– prestore second-shortest paths, etc.
• Adopt Optical Transport techniques from Telco
operators (survivable rings, APS, path
restoration, etc.)
• MPLS logical layer restoration

IP Dynamic Routing

[Figure] A link failure between San Francisco and New York is flooded, and a new path is computed.

• OSPF or IS-IS computes the path
• If a link or node fails, a new path is computed
• Response times: typically a few seconds
– Can be tuned to ~1000s of milliseconds
– According to Sprint data, usually ~7 seconds to recover
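
A minimal sketch of this recompute step, using networkx's Dijkstra as a stand-in for the OSPF/IS-IS SPF run; the topology and link weights are hypothetical.

```python
# Link-state re-routing after a failure: prune the link, re-run SPF.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("SF", "Denver", 10), ("Denver", "NY", 10),
    ("SF", "Dallas", 12), ("Dallas", "NY", 12),
])

print(nx.shortest_path(G, "SF", "NY", weight="weight"))  # via Denver

# The failure is flooded; every router removes the link and recomputes.
G.remove_edge("SF", "Denver")
print(nx.shortest_path(G, "SF", "NY", weight="weight"))  # via Dallas
```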

Backup Label Switched Paths

[Figure] A primary LSP and a backup LSP between San Francisco and New York; on failure, an error is signaled to the head end.

• Primary (working) LSP and backup LSPs are established a priori
• If the primary fails: signal to the head end, use the backup
• Faster response, but requires wide-area signaling

MPLS Fast Reroute

• Increasing demand for “APS-like” redundancy
– MPLS resilience to link/node failures
– Control-plane protection required
– Avoid the cost of SONET APS protection

[Figure] A detour around a failed LSR on the primary LSP.

• Solution: MPLS Fast Reroute
– RSVP extensions define Fast Reroute
– LSPs can be set up, a priori, to back up:
• one LSP across a link and optionally the next node, or
• all LSPs across a particular link

1:1 Protection
• For each LSP, for each node
– Set up one LSP as backup
– Merge into primary LSP further downstream
– Backs up link and downstream node

1:1 LSP Protection

[Figure] When a link fails, traffic uses the detour LSP and is merged back into the primary LSP downstream.

1:N Link Protection


• For each link, for each neighbor
– Set up one detour LSP to backup the link as a whole
– Uses LSP Hierarchy to backup all LSPs which were
using failed link
[Figure] Multiple primary LSPs on the same link are backed up by one detour LSP for the link.

1:N Link Protection

[Figure] When the link fails, the primary LSPs are multiplexed over one detour LSP and demultiplexed at the next node.
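
A minimal sketch of the multiplexing step using LSP hierarchy (label stacking): the point of local repair pushes one detour label onto every affected LSP, and the node at the far end of the detour pops it. All label values are hypothetical.

```python
# 1:N link protection via label stacking (sketch).
DETOUR_LABEL = 9000

def plr_reroute(packet):
    """Point of local repair: push the detour label (LSP hierarchy)."""
    packet["labels"].insert(0, DETOUR_LABEL)
    return packet

def detour_egress(packet):
    """Merge node: pop the detour label, demultiplexing the inner LSPs."""
    assert packet["labels"][0] == DETOUR_LABEL
    packet["labels"].pop(0)
    return packet

# Three primary LSPs that all crossed the failed link:
packets = [{"lsp": name, "labels": [label]}
           for name, label in [("LSP-A", 101), ("LSP-B", 102), ("LSP-C", 103)]]

for p in packets:
    p = detour_egress(plr_reroute(p))
    print(p["lsp"], "inner label after merge:", p["labels"][0])
```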

1:N Link and Node Protection


• For each link
– For each node 2 hops away
• Detour LSP backs up link & intermediate node
• Uses LSP Hierarchy to backup all LSPs to that node
• If there are two 2-hop paths to that node, setup two
detour LSPs
– For each node 1 hop away
• Detour LSP backs up LSPs ending at that node

MPLS Fast Reroute

• Provides fast recovery for LSP failure


– Based on a priori setup of detour LSPs
– (e.g., ~5 milliseconds for tens of LSPs with 1:1)
• There are significant tradeoffs between the
approaches
– Number of LSPs required
– Whether node failures are protected
– Ability to reserve resources for backup LSPs
– Optimality of routes

Summary of MPLS Methods


• End-to-End disjoint backup LSP – one per working
LSP in the network
• MPLS Fast Re-Route
– 1:1 LSP link or link + node protection
– 1:N Link protection
– 1:N Link plus node protection
• All of these are interoperable based on IETF standards
• Sink Trees are under study
• Does MPLS solve all the problems?
– Can’t recover from IP Layer Failure
– Doesn’t provide protection of layer 1 customers
– Fault Propagation Issue

Multilayer Networks
• WAN networks have multiple technology layers
• Converging toward IP/MPLS/WDM
• Multiple Layers present several survivability challenges
• Coordination of recovery actions at different layers
– Which layer is responsible for fault recovery?
• Spare Capacity Allocation (SCA)
– How to prevent over allocation, when each layer provides spare resources?
• Failure Propagation
– Lower layer failure can affect multiple higher layer links!
[Figure] MPLS connections riding over WDM physical paths: a single lower-layer failure can take down multiple MPLS-layer links.

Optimization Based Design


• In implementing the chosen survivability
technique (e.g., link protection, p-cycles) at a
particular layer (e.g., optical), optimization
techniques are usually adopted.
• First design working network and working/active
paths
• Then determine survivability design (often called
spare capacity network design)
• Examples in ITU Planning document
• Consider shared backup path protection

Spare Capacity Allocation
• Single Layer Spare Capacity Allocation (SCA) Problem
– given working paths and network (or virtual network) topology
– provision spare capacity and find backup routes for fault tolerance
– Goal: minimum spare capacity or cost
• Matrix based formulation*
– P path link incident matrix, Q backup link incident matrix
– Relate to spare provision matrix G, and spare capacity reservation s
– Assume path restoration with disjoint backup routes
– Shared backup path protection for any single link failure

[Figure] A six-node network with a working path and a disjoint backup path; link l fails.

* Y. Liu, D. Tipper, and P. Siripongwutikorn, “Approximating Optimal Spare Capacity Allocation by Successive Survivable Routing,” IEEE/ACM Transactions on Networking, vol. 13, no. 1, pp. 198-211, Feb. 2005.

Matrix model for SCA

• P is the working path link incidence matrix; Q is the backup path link incidence matrix
• Working and backup paths are related through the spare provision matrix $G = Q^T P$
• $g_{ij}$ = spare capacity needed on link i when link j fails
• From G, find the spare capacity allocation as the row-wise maximum: $s_i = \max_j g_{ij}$

[Figure] A 5-node, 7-link example network with 10 flows (one per node pair), showing s, G, $Q^T$, and P. An example: 1. link 2 fails; 2. flows 3 and 4 are affected; 3. their backup paths come up; 4. spare BW = 2 on link …
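
A minimal numpy sketch of the matrix machinery, on a tiny hypothetical instance (3 flows, 5 links) rather than the slide's example:

```python
# Spare provision matrix for shared backup path protection against
# single-link failures: G = Q^T diag(m) P, s = row-wise max of G.
import numpy as np

# Rows = flows, columns = links; 1 if the path uses the link.
P = np.array([[1, 0, 0, 0, 0],     # working paths
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0]])
Q = np.array([[0, 0, 0, 1, 1],     # backup paths (disjoint from own WP)
              [0, 0, 0, 1, 1],
              [0, 1, 0, 0, 1]])
m = np.array([5, 10, 4])           # traffic demand of each flow

G = Q.T @ np.diag(m) @ P           # g_ij: spare on link i if link j fails
s = G.max(axis=1)                  # shared reservation: worst single failure

print(G)
print("shared reservation s    =", s)              # e.g. [0 4 0 10 10]
print("dedicated would need    =", (Q.T @ m))      # sums, e.g. 15 and 19
```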

Optimization model for link failures

$\min_{Q,s} \; S = e^T s$ — total spare capacity
s.t.
$s \ge G$ (column-wise) — enough spare capacity on each link for every failure
$G = Q^T M P$ — calculation of the spare provision matrix
$P + Q \le 1$ — link-disjoint backup paths
$Q B^T = D \pmod{2}$ — flow conservation of backup paths
$Q$ is a binary matrix

Decision variables: Q, s → integer programming
Given: M – traffic demand matrix; P – working path link incidence matrix; B and D – node-link and flow-node incidence matrices
A Mixed Integer Programming problem: NP-hard.
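
For flavor, here is a minimal sketch of this program in PuLP on a hypothetical four-node instance. It simplifies the slide's formulation: arcs are directed, ordinary flow conservation replaces the mod-2 node-link constraint, and single-arc failures stand in for undirected link failures.

```python
# Shared-backup spare capacity allocation as a small integer program.
import pulp

arcs = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3),
        (4, 1), (1, 4), (1, 3), (3, 1)]
nodes = {1, 2, 3, 4}
flows = {0: (1, 3)}              # flow r: (source, destination)
demand = {0: 1}                  # m_r
working = {0: [(1, 3)]}          # given working path arcs (rows of P)

prob = pulp.LpProblem("SCA", pulp.LpMinimize)
q = pulp.LpVariable.dicts("q", (list(flows), arcs), cat="Binary")
s = pulp.LpVariable.dicts("s", arcs, lowBound=0)

# Objective: total spare capacity, S = e^T s
prob += pulp.lpSum(s[a] for a in arcs)

# s >= G: spare on arc i must cover every single-arc failure j
for i in arcs:
    for j in arcs:
        prob += s[i] >= pulp.lpSum(
            demand[r] * q[r][i] for r in flows if j in working[r])

# P + Q <= 1: the backup avoids the flow's own working arcs (both ways)
for r in flows:
    for (u, v) in working[r]:
        prob += q[r][(u, v)] == 0
        prob += q[r][(v, u)] == 0

# Flow conservation: the q variables trace one backup path per flow
for r, (src, dst) in flows.items():
    for n in nodes:
        rhs = 1 if n == src else (-1 if n == dst else 0)
        prob += (pulp.lpSum(q[r][a] for a in arcs if a[0] == n)
                 - pulp.lpSum(q[r][a] for a in arcs if a[1] == n)) == rhs

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({a: s[a].value() for a in arcs if s[a].value()})  # spare per arc
```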

Heuristic Solution Algorithm

• Successive survivable routing (SSR) algorithm*
– Decompose the multi-commodity flow into multiple single flows
– Goal: each flow seeks a backup path requiring minimal
additional spare capacity
– Use a shortest path algorithm for each flow to
• route link-disjoint backup paths
• using the spare provision matrix G to calculate the
link cost – the incremental spare reservation $v_r$
• Flows successively update their backup paths,
hence: successive survivable routing (SSR)
• Randomly order flows for successive updating
• Fast computation; finds near-optimal solutions

* Apparatus and Method for Spare Capacity Allocation, Y. Liu and D. Tipper,
U.S. Patent 6,744,727 B2, June 1, 2004; presented in IEEE/ACM Transactions on Networking, Feb. 2005.

SSR flowchart for flow r

• On the source node of flow r:
1. Given $p_r$ and $q_r$ – the working and backup path vectors
2. Periodically update G and s – the spare provision matrix and spare capacity reservation vector
3. Calculate $v_r$ – the incremental spare reservations, used as link costs
4. Update $q_r$ using $v_r$
5. Update s and G
• Stop after no backup path update occurs anywhere in the network
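
A minimal sketch of one SSR sweep, with networkx's shortest path standing in for the backup-route computation; the topology, flows, and the exact form of the incremental cost $v_r$ are simplified assumptions (real SSR iterates sweeps in random flow order until no backup path changes).

```python
# One successive-survivable-routing pass under shared backup path
# protection for single-link failures.
import networkx as nx
import numpy as np

links = [(1, 2), (2, 3), (1, 4), (4, 3), (2, 4)]
idx = {e: k for k, e in enumerate(links)}
flows = {0: ([(1, 2), (2, 3)], 5),     # flow r: (working path links, demand)
         1: ([(1, 4), (4, 3)], 10)}

Q = np.zeros((len(flows), len(links)))  # backup incidence (starts empty)

def spare_matrix():
    P = np.zeros_like(Q)
    for r, (wp, m) in flows.items():
        for e in wp:
            P[r, idx[e]] = m            # fold demand into P (i.e. M P)
    return Q.T @ P                      # G: spare on link i if link j fails

for r, (wp, m) in flows.items():        # one sweep over the flows
    G, s = spare_matrix(), spare_matrix().max(axis=1)
    g = nx.Graph()
    for e in links:
        if e in wp:
            continue                    # enforce link-disjointness
        # v_r: extra spare needed on e if this flow's backup used it
        trial = G[idx[e]].copy()
        for f in wp:
            trial[idx[f]] += m
        g.add_edge(*e, weight=max(trial.max() - s[idx[e]], 0) + 1e-6)
    path = nx.shortest_path(g, wp[0][0], wp[-1][1], weight="weight")
    Q[r] = 0
    for u, v in zip(path, path[1:]):
        Q[r, idx[(u, v)] if (u, v) in idx else idx[(v, u)]] = 1

print(spare_matrix().max(axis=1))       # resulting spare reservation s
```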

Numerical comparison

• Compare different algorithms and bounds


– RAFT: Resource aggregation fault tolerance
– SPI: Sharing with partial information
– SR: Survivable routing (SSR without iteration)
– SSR : Successive survivable routing
– SA: Simulated annealing
– BB: Branch and bound on a path-flow model – optimal
– LP: Linear programming lower bound
• Metrics:
– % Redundancy = spare capacity/working capacity,
– execution time

Experiment networks

[Figure] Five test network topologies of increasing size.

• Network node degree ranges from 2.31 to 4.4
• A balanced mesh load case is considered

Redundancy versus Time on Network 3

[Plot] % redundancy vs. computation time (seconds, log scale) on Network 3 for RAFT, SPI, SR, SSR, SA, BB, and the LP lower bound.

• SSR, SR, and SPI each have 64 random cases with different flow orders, giving a range of solutions
• Time is the sum of the time to compute all 64 cases
• RAFT: fast, but worse solutions; SPI: near-optimal solutions, fast; SR and SSR: better solutions, fast; SA and BB: better solutions, but slow and not scalable; the LP lower bound is infeasible

State of the Art

• Survivable Network Design
– Important in WAN backbones
• Basic approach
– Given a particular technology (e.g., WDM, MPLS, etc.), assume
• a traffic restoration scheme (e.g., failure-independent path restoration)
• a failure scenario (e.g., any single link failure)
– Determine the least-cost survivable network design using
optimization formulations with heuristic solutions
• Many tradeoffs identified and studied
– Protection vs. Restoration
– Reactive vs. Proactive
– Shared vs. Dedicated
– Link vs. Path vs. Rings, etc.
– Failure Dependent vs. Failure Independent
– Etc.
