Disaster Recovery and Business Continuity: Dr. Pranita Upadhyaya
DR and BCP motivation
WTC, 9/11 terror attacks
Basel II
– An international business standard
– A series of recommendations on banking laws and regulations
e-commerce, e-banking, e-government booming
Disaster aftermaths
Most companies that experience a major disaster are no longer in business within 5 years!
- The US Bureau of Labor -
Revenue loss
Brand image damage
Customers leave
How Disasters Affect Businesses
• Direct damage to facilities and equipment
• Transportation infrastructure damage
– Delays deliveries, supplies, customers, employees going to work
• Communications outages
• Utilities outages
Classification of Disasters
Disasters
• Natural
• Man-made
– Non-intentional
– Intentional
9 major threats to Data Center
Cooling system down
Power system down
Radioactive contamination
Terror (including cyber terror)
Telecom network cut off
Huge human resources vacuum
Earthquake
Flood
Fire
How BCP and DRP Support Security
• BCP (Business Continuity Planning) and DRP (Disaster Recovery Planning)
• Security pillars: C-I-A
– Confidentiality
– Integrity
– Availability
• BCP and DRP directly support availability
BCP and DRP Differences and Similarities
• BCP
– Activities required to ensure the continuation of critical business processes in an organization
– Alternate personnel, equipment, and facilities
– Often includes non-IT aspects of business
• DRP
– Assessment, salvage, repair, and eventual restoration of damaged facilities and systems
– Often focuses on IT systems
Industry Standards Supporting BCP and DRP
• NIST 800-34
– Contingency Planning Guide for Information Technology Systems
– Seven-step process for BCP and DRP projects
– From U.S. National Institute of Standards and Technology
• NFPA 1600
– Standard on Disaster / Emergency Management and Business Continuity Programs
– From U.S. National Fire Protection Association
Benefits of BCP and DRP Planning
• Reduced risk
• Process improvements
• Improved organizational maturity
• Improved availability and reliability
• Marketplace advantage
The Role of Prevention
• Not prevention of the disaster itself
– Prevention of surprise and disorganized response
• Reduction in impact of a disaster
– Better equipment bracing
– Better fire detection and suppression
– Contingency plans that provide [near] continuous operation of critical business processes
– Prevention of extended periods of downtime
What is Disaster Recovery?
DR : The planned process of restoring systems, data, and infrastructure required to support key ongoing business operations.
A DR plan : a proactive measure to minimize a company’s downtime during sudden emergencies
An unforeseen event : fire, flood, earthquake, etc.
Benefits from DR center
Significantly reduces the impact of sales, financial, and customer losses during unforeseen interruptions to business operations
Types of DR sites
• Hot standby
– Ideal for: Mission-critical applications, high business impact activities
– Pros: Almost instant failover, full data integrity, little to no impact to business operations, guaranteed recovery timeframe
– Cons: Long setup process, high cost, higher administrative burden
– Average recovery: 10 seconds ~ 2 minutes
• Warm standby
– Ideal for: Mission-critical applications, medium-to-high business impact activities
– Pros: Fast failover, little data loss, small-to-medium impact to business operations, guaranteed recovery timeframe
– Cons: Long setup process, medium-to-high cost, medium administrative burden
– Average recovery: 10 ~ 45 minutes
• Cold standby
– Ideal for: Non-mission-critical applications, low business impact activities
– Pros: Low initial cost, guaranteed equipment availability
– Cons: Unpredictable recovery time, tedious restoration process, potentially large impact to business operations
– Average recovery: 4 hours ~ 2 days
• Offsite data backup
– Ideal for: Non-mission-critical applications, very ...
– Pros: Flexible, inexpensive, secure
– Cons: Very long recovery time, must first configure application environment and then restore ...
– Average recovery: 18 hours ~ 8 days
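The average recovery figures above can be turned into a simple selection rule. The sketch below is only an illustration: it assumes the upper end of each "average recovery" range is the worst case, and that cost rises from offsite backup to hot standby as the pros and cons suggest. The function name `cheapest_site_for` is hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Upper end of each "average recovery" range from the DR site types above.
# Ordered from least to most expensive (an assumption based on the pros/cons).
SITE_TYPES = [
    ("Offsite data backup", timedelta(days=8)),
    ("Cold standby", timedelta(days=2)),
    ("Warm standby", timedelta(minutes=45)),
    ("Hot standby", timedelta(minutes=2)),
]


def cheapest_site_for(rto: timedelta) -> Optional[str]:
    """Return the least expensive site type whose worst-case recovery fits the RTO."""
    for name, worst_case in SITE_TYPES:
        if worst_case <= rto:
            return name
    return None  # even a hot standby does not fit this RTO


print(cheapest_site_for(timedelta(hours=3)))   # a 3-hour RTO
print(cheapest_site_for(timedelta(days=10)))   # a very relaxed RTO
```

A real selection would also weigh data loss, setup time, and administrative burden, which this sketch ignores.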
DR components
DR center infrastructure
DR Solution implementation
DR planning
DR – infrastructure construction
Data center design considerations
Operational reliability
Quick changes, including additions and rapid expansions
Online status monitoring
Life cycle management
Customer access
Physical security
Rapid detection, identification and resolution of faults
A modern data center infrastructure management (DCIM) solution
– provides data center visualization,
– provides robust reporting and analytics,
– becomes the central source of truth for changes being made in the data center
Considerations for DR site selection
Engineering Plan & Space design
Critical Building Systems
Case : DR site selection - distance
US : 40 miles (64 km, outside the area affected by the same hurricane)
Japan : on a different tectonic plate, in a different seismic activity zone
EU : 5~10 km (against bombing attack)
Korea : similar to the situation in the EU, usually 30+ km away
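A candidate DR site can be checked against these distance guidelines with a great-circle distance calculation. The following is a minimal sketch, assuming sites are given as (latitude, longitude) pairs in degrees and using the upper end of the EU range. The names `MIN_SEPARATION_KM`, `haversine_km`, and `far_enough` are hypothetical, and the Japan rule (tectonic plates) is not a distance rule, so it is left out.

```python
import math

# Minimum separation in km, taken from the regional figures above.
# The EU entry uses the upper end of the 5~10 km range (an assumption).
MIN_SEPARATION_KM = {"US": 64.0, "EU": 10.0, "Korea": 30.0}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_enough(region, primary, dr_site):
    """True if the DR site is at least the regional minimum away from the primary."""
    return haversine_km(*primary, *dr_site) >= MIN_SEPARATION_KM[region]


# Illustrative coordinates only: roughly Seoul and Daejeon.
print(far_enough("Korea", (37.5665, 126.9780), (36.3504, 127.3845)))
```

Distance is only one side of the trade-off: a site that is too far away is harder to manage and slower to reach.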
DR site selection - distance
Figure: curves labeled "disaster" and "manageability, responsiveness" plotted against distance. Where is the optimum point?
Site evaluation factors : ASSES
• Availability
– Backup, redundancy
– 24*7 operation
• Stability
– Natural disasters
– Potential man-made disasters
• Security
• Efficiency
– Maintenance
– High-quality equipment
• Scalability
– Physical scalability
– Functional scalability
Other figure labels: survivability, IT resources, economics
General DR plan
Figure: primary processing location and backup processing location
DR Solution implementation
DRS implementation
Business impact & system → Define DR requirements → DR solution → Implementation methodology → Implementing DRS → DRP
DR requirements
Identify the functional areas that MUST be recovered during an emergency
Define the Recovery Time Objective (RTO)
- “How much downtime (if any) can be tolerated?”
Define the Recovery Point Objective (RPO)
- “How much data (if any) can you afford to lose?”
In addition,
Define the Recovery Access Objective (RAO), and
the Recovery Scope Objective (RSO)
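The two questions behind RTO and RPO can be checked mechanically once a backup schedule and a recovery estimate are known. This is a minimal sketch, assuming periodic backups (so the worst-case data loss equals one full backup interval). The function names and the sample figures are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses everything written since the previous one."""
    return backup_interval


def meets_objectives(backup_interval, estimated_recovery, rpo, rto):
    """Compare a backup schedule and a recovery estimate with the RPO and RTO."""
    return {
        "rpo_met": worst_case_data_loss(backup_interval) <= rpo,
        "rto_met": estimated_recovery <= rto,
    }


# Hypothetical case: nightly backups, 2-hour recovery, RPO 1 hour, RTO 3 hours.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    estimated_recovery=timedelta(hours=2),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=3),
)
print(result)
```

In this hypothetical case the recovery estimate fits the RTO, but nightly backups cannot satisfy a 1-hour RPO, which points toward more frequent copies or journaling.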
Recovery Access Objective (RAO)
– A subcomponent of RTO that measures the time it takes for the network to re-establish connectivity of users, customers, and partners with the applications at the alternate site once the primary site has been disrupted
– It identifies the point in time at which the users that were connected to applications and services running on one data center have access to the same applications and services running at an alternate data center.
RPO/RTO vs. cost
Figure: cost (up to high) plotted against time, with mirroring (copy database) and log journaling marked
DR solution selection
Figure: DR solutions plotted by data loss (no loss, little loss, loss after backup) against recovery time
Recovery time by availability class
• Continuous availability: 0~1 hour
• High availability: 1~6 hours
• Improved availability: 6~24 hours
• Traditional availability: 24~48 hours
Solutions shown: GDPS/PPRC, GDPS/XRC, SRDF, PPRC, XRC, RR/400, electronic journaling, IRC, SOS, remote tape, remote DASD
Legend
– IRC : intermittent remote copy
– SOS : standby operating system
– PPRC : peer-to-peer remote copy
– XRC : extended remote copy
– Electronic journaling : dual transaction logging
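The recovery-time bands in the figure can be read as a lookup from a required recovery time to an availability class. The sketch below assumes the four class labels line up with the four time ranges in the order shown, and that a boundary value (for example exactly 6 hours) belongs to the faster class. The function name is hypothetical.

```python
import bisect

# Upper bound of each recovery-time band, in hours, from the figure.
UPPER_BOUNDS_HOURS = [1, 6, 24, 48]
CLASSES = [
    "Continuous availability",
    "High availability",
    "Improved availability",
    "Traditional availability",
]


def availability_class(recovery_hours: float) -> str:
    """Map a required recovery time to the availability class of its band."""
    if recovery_hours < 0:
        raise ValueError("recovery time cannot be negative")
    index = bisect.bisect_left(UPPER_BOUNDS_HOURS, recovery_hours)
    if index == len(CLASSES):
        raise ValueError("recovery times beyond 48 hours are outside the figure")
    return CLASSES[index]


print(availability_class(3))   # falls in the 1~6 hours band
print(availability_class(30))  # falls in the 24~48 hours band
```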
Business Continuity Planning
Creating a BCP
Is an on-going process, not a project with a
beginning and an end
• Creating, testing, maintaining, and updating
• “Critical” business functions may evolve
The BCP team must include both business and
IT personnel
Requires the support of senior management
BCP phases
1. Project management & initiation
2. Business Impact Analysis (BIA)
3. Recovery strategies
4. Plan design & development
5. Testing, maintenance, awareness, training
I - Project management & initiation
Establish need (risk analysis)
Get management support
Establish team (functional, technical, BCC – Business
Continuity Coordinator)
Create work plan (scope, goals, methods, timeline)
Initial report to management
Obtain management approval to proceed
II - Business Impact Analysis (BIA)
Goal: obtain formal agreement with senior management
on the MTD for each time-critical business resource
MTD – maximum tolerable downtime, also known as
MAO (Maximum Allowable Outage)
Quantifies loss due to business outage (financial, extra
cost of recovery, embarrassment)
Does not estimate the probability of kinds of incidents,
only quantifies the consequences
II - BIA phases
Choose information gathering methods (surveys,
interviews, software tools)
Select interviewees
Customize questionnaire
Analyze information
Identify time-critical business functions
Assign MTDs
Rank critical business functions by MTDs
Report recovery options
Obtain management approval
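The "assign MTDs" and "rank critical business functions by MTDs" steps amount to a sort with the shortest tolerable downtime first. A minimal sketch follows. The business functions and MTD values are hypothetical examples, not data from a real BIA.

```python
from datetime import timedelta

# Hypothetical output of the "Assign MTDs" step.
mtds = {
    "Payroll": timedelta(days=7),
    "Online order processing": timedelta(hours=4),
    "Customer support desk": timedelta(days=1),
}


def rank_by_mtd(mtds):
    """Return (function, MTD) pairs, most time-critical (shortest MTD) first."""
    return sorted(mtds.items(), key=lambda item: item[1])


for rank, (function, mtd) in enumerate(rank_by_mtd(mtds), start=1):
    print(rank, function, mtd)
```

The resulting order is what the recovery strategies in the next phase are built on, since they are based on MTDs.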
III – Recovery strategies
Recovery strategies are based on MTDs
Predefined
Management-approved
Different technical strategies
Different costs and benefits
How to choose?
Careful cost-benefit analysis
Driven by business requirements
Strategies should address recovery of:
•Business operations
•Facilities & supplies
•Users (workers and end-users)
•Network, data center, telecommunications (technical)
•Data (off-site backups of data and applications)
IV – BCP development / implementation
Detailed plan for recovery
•Business & service recovery plans
•Maintenance
•Awareness & training
•Testing
Sample plan phases
•Initial disaster response
•Resume critical business operations
•Resume non-critical business operations
•Restoration (return to primary site)
•Interacting with external groups (customers, media,
emergency responders)
V – BCP final phase
Testing
•Until it’s tested, you don’t have a plan
•Testing types: Structured walk-through, Checklist, Simulation,
Parallel, Full interruption.
Maintenance
•Fix problems found in testing
•Implement change management
•Audit and address audit findings
Awareness / Training
•BCP team is probably the DR team
•BCP training must be on-going, part of corporate culture
DR planning
Disaster recovery plan
DRP
– is a subset of BCP (business continuity planning), and
– should include planning for resumption of applications, data, hardware, communications (such as networking), and other IT infrastructure.
Body of DR plan
Communication plan
Pre-disaster actions
Case : DR plan
Figure: recovery flow between the main center and the DR center
RTO : 3 hours
Steps shown: restore data, recover DB & task, recover N/W, consistency check, resume business
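A recovery flow like this one can be checked against its RTO by adding up the step durations. The sketch below is only illustrative: the durations are hypothetical, and it assumes DB/task recovery and network recovery run in parallel, which the figure suggests but does not state.

```python
RTO_MINUTES = 180  # RTO of 3 hours from the case

# Hypothetical step durations in minutes (not from the source).
restore_data = 60
recover_db_and_tasks = 45
recover_network = 30
consistency_check = 20


def total_recovery_minutes(restore, db_and_tasks, network, consistency):
    """Total elapsed time when DB/task and network recovery overlap.

    Only the longer of the two parallel steps adds to the elapsed time.
    """
    return restore + max(db_and_tasks, network) + consistency


total = total_recovery_minutes(
    restore_data, recover_db_and_tasks, recover_network, consistency_check
)
print(f"Estimated recovery: {total} min, within RTO: {total <= RTO_MINUTES}")
```

If the steps had to run strictly one after another, the two recovery durations would be added instead of taking the maximum.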