Session 14-15 PDF

The document discusses a power outage at British Airways that led to the cancellation of hundreds of flights and cost the airline about $102 million in lost revenue and expenses. It notes that human error is often blamed to hide issues like underinvestment in IT infrastructure. The outage caused major damage to the servers British Airways uses for online check-in and baggage handling, grounding flights for two days. Experts say airlines often underfund maintenance of aging IT systems not designed for customer-facing use.

Uploaded by

rajesh shekar

Session 14 & 15

A total of 115 British Airways flights, or 13 percent of services, were cancelled on Sunday while 311 services,
or 35 percent, were delayed, according to FlightAware, a Houston-based plane-tracking service. The carrier
scrapped a combined 418 flights at Heathrow and Gatwick airport, south of London, on Saturday and 568 were
delayed, the research company said. British Airways has declined to specify figures for flights or customers
affected.
British Airways Incident
✓ “Human Error” is all too often used by firms to hide a multitude of datacentre design
and training flaws, caused by years of underinvestment in their server farms.
✓ Switching off the power supply to the servers was the main cause.
✓ It caused major damage to the servers the airline uses to run its online check-in,
baggage handling and customer contact systems, resulting in flights from Heathrow
and Gatwick being grounded for the best part of two days.
✓ “Management decisions about budget and cost and spending have not allowed these
facilities to be upgraded over time to keep up with the demand and criticality of these
systems.”
✓ In the airline industry in particular, flight operators are under mounting pressure to cut
costs in the face of growing competition from budget carriers, said Kirby, and the
upkeep of their IT estates can be the first thing to suffer.
✓ “A lot of the systems the airlines use have been around since the late 1970s and they
weren’t really [designed] for client-facing systems. They were for internal use,” he said.
British Airways Incident
British Airways owner IAG SA said a power outage that led to the cancellation of hundreds
of flights last month probably cost it about 80 million pounds ($102 million) in lost
revenue and the expense of accommodating, re-booking and compensating thousands of
passengers.

Likely cost: about $200 million

After 6 weeks:
'Total chaos' at Heathrow as British Airways computers crash yet again
Holidaymakers are facing huge delays after British Airways systems at two of the UK's
busiest airports crashed once more.
Business Continuity Management (BCM)
✓ Business Continuity Management is a holistic management process that
identifies potential impacts that threaten an organization and provides a
framework for building resilience and capability for an effective response
that safeguards the interests of its key stakeholders, reputation, brand and
value-creating activities.
✓ Business continuity means maintaining the uninterrupted availability of
all key business resources required to support essential business
activities.
BC and DR - Definitions
✓ Business Continuity - Overall continuation of business functions during
an emergency event.
✓ Disaster Recovery – Recovery of the systems, applications and processing
capabilities
Why BCP and DRP?

✓ Data corruption
✓ Component failure
✓ Application failure
✓ User error
✓ Maintenance
✓ Site outage


BCP and DRP
There is a fair amount of confusion in the terminology.
Business Continuity Plan
✓ Prepared at Business level
✓ Includes IT
✓ Covers all relevant functions
✓ Considers all aspects such as:
▪ Communication to external agencies
▪ Communication to customers and suppliers
▪ Quick response to correct misinformation
▪ Handling increased vulnerability during emergencies
▪ Maintaining controls, security and integrity
▪ Avoiding making the situation worse than it is
BCP - Definition
A documented, tested, rehearsed plan to minimize financial losses to the
institution, serve customers with minimal disruptions, and mitigate the
negative effects of disruptions on business operations.

What is BCP for?


To continue the essential services to key stakeholders when the organization
faces:
▪ catastrophic events such as floods, earthquakes, or acts of terrorism
▪ accidents or sabotage
▪ outages due to an application error, hardware or network failures
BCP – Team structure

Business Continuity Committee (Management Authorization)
▪ BCP Team Leader
▪ BCP Spokesperson
▪ Internal Auditor
▪ Execution Teams:
  ▪ Emergency Action Team
  ▪ Damage Asst. & Salvage Team
  ▪ Relocation Team
  ▪ Admin, Security & Support Team
  ▪ IT Operations Team
BCP – Documentation
Documentation should cover:
✓ Risk Management
✓ Environmental Management
✓ Emergency Management
✓ Crisis Management
✓ IT Disaster Recovery
✓ Knowledge Management
✓ Facility Management
✓ Human Management
✓ Supply Chain Management
✓ Security and Privacy
✓ Health and Safety
✓ Communications PR
Scope: enterprise business process, people and technology


BCP - Process
✓ Initiated and Supported by Top Management
✓ Assess risks and vulnerabilities
✓ Actions to protect people, environment, assets
✓ Actions to contain and prevent further damage
✓ Business Impact Analysis
✓ Identify the essential activities that must continue during emergencies
and level of service targeted
✓ Identify all resources needed to provide such services:
Place, People, Data, Facilities (Security, food, water, IT, communication,
transportation), Raw Materials and other equipment (as necessary), Prior Permission
from relevant authorities, Service Provider support, Contact lists, Authorisation,
access and escalation procedures, Budgets
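
The Business Impact Analysis step in the list above can be illustrated with a short sketch. This is a minimal Python illustration, not a prescribed method: the `Activity` fields, the `essential_activities` helper, the 24-hour threshold and every figure are hypothetical and chosen only for illustration. The activity names echo the British Airways systems mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    loss_per_hour: float        # hypothetical estimate of loss while the activity is down
    max_tolerable_hours: float  # hypothetical longest outage the business can accept


def essential_activities(activities, threshold_hours):
    """Keep activities that cannot stay down longer than the threshold,
    most time-critical first (ties broken by higher hourly loss)."""
    essential = [a for a in activities if a.max_tolerable_hours <= threshold_hours]
    return sorted(essential, key=lambda a: (a.max_tolerable_hours, -a.loss_per_hour))


# Illustrative figures only; none of these numbers come from the source.
activities = [
    Activity("online check-in", 50_000, 2),
    Activity("baggage handling", 80_000, 1),
    Activity("customer contact", 20_000, 4),
    Activity("internal reporting", 1_000, 72),
]

ranked = essential_activities(activities, threshold_hours=24)
print([a.name for a in ranked])
# ['baggage handling', 'online check-in', 'customer contact']
```

The ranked list is one possible input for deciding which services must continue during an emergency and which resources to reserve for them.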
BCP - Process
✓ Identify who among the available Top Management will invoke the BCP
during emergencies
✓ Define the communication process to stakeholders
✓ Arrange for the people, premises, IT facilities etc. to be available when
needed
✓ Train people
✓ Test the facilities, remedy weaknesses
✓ Document the process in a brief document
✓ Have the document audited externally and complete the audit actions
✓ Make a senior business person accountable for the ongoing preparedness of BCP
arrangements
BCM Compliance Standards
✓ Standards in Business Continuity
▪ ISO 22301
▪ FFIEC
▪ NIST 800
▪ NFPA 1600
▪ SEC
▪ FISMA
▪ FINRA
▪ Supply Chain Resilience Leadership Council
✓ Measure compliance in these BCM dimensions
▪ Program Administration
▪ Crisis Management
▪ Business Recovery
▪ IT Disaster Recovery
▪ Fire & Life Safety
▪ Supply Chain Risk Management
▪ Third Party Management
BCP & DRP - Differences
Business Continuity Plan (BCP)
✓ Focused on recovery of individual business processes, departments, functions,
facilities etc. (revenue, production and operational management)
✓ Recovery Time Objective (RTO) is typically measured in days or weeks, sometimes
months
✓ Active business and IT participation
✓ Recovery addresses people, process, and support technologies required to continue
the business
✓ Continuity plans are usually by process, department, function and/or facility
Disaster Recovery Plan (DRP)
✓ Focused on recovery of Enterprise IT applications and supporting infrastructure
(support the business)
✓ Recovery Time Objective (RTO) is typically measured in minutes or hours, sometimes
days
✓ Active IT participation with little to no business participation during an event
✓ Recovery addresses enterprise data center/computing, facility and support staff
needs
✓ Recovery plans are usually by application suite, platform and/or data center facility
DRP – IT Component of BCP
Owned by IT
Consistent with rest of BCP
Objectives :
Recovery Time Objective (RTO) – maximum permissible outage time
Recovery Point Objective (RPO) – maximum permissible data loss, measured as the time between the last recoverable copy of the data and the outage
Facilities :
Cold Site: A facility that is environmentally conditioned, but devoid of any equipment.
Hot Site: It is an alternate facility having workspace for the personnel, fully equipped
with all resources and stand-by computer facilities
Mirrored site: It is identical in all aspects to the primary site, right down to the
information availability. It is equivalent to having a redundant site in normal times and
is naturally the most expensive option.
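
The two objectives above can be checked against the timeline of an incident or DR test. The sketch below is a minimal Python illustration under stated assumptions: the `RecoveryObjectives` class, the `meets_objectives` helper and the 4-hour / 15-minute targets are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # maximum permissible outage time
    rpo: timedelta  # maximum permissible window of lost data


def meets_objectives(objectives, outage_start, last_good_copy, service_restored):
    """Return (rto_met, rpo_met) for one incident or DR test."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_copy
    return downtime <= objectives.rto, data_loss_window <= objectives.rpo


# Hypothetical targets and timings, for illustration only.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
outage = datetime(2024, 1, 1, 10, 0)
result = meets_objectives(
    objectives,
    outage_start=outage,
    last_good_copy=outage - timedelta(minutes=10),
    service_restored=outage + timedelta(hours=5),
)
print(result)  # (False, True)
```

In this example the last good copy was 10 minutes old, so the RPO is met, but service took 5 hours to return, so the RTO is missed.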
Data Recovery – Facilities
✓ Conventional Backup
✓ RAID
✓ Remote Journaling
✓ Electronic Vaulting – transmits data electronically and automatically
creates the backup offsite.
✓ Disk Replication (Mirroring, Shadowing) – data is kept on both the primary
server and the replicated server, giving an up-to-date copy of the data and an
excellent RPO
✓ Clustering – solution for high availability; uses a secondary server to
provide access to applications and data when the primary server fails.
Cost of RTO, RPO
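[Figure: Recovery Objectives. A timeline centred on the disruption: data loss (Recovery Point Objective) runs from weeks down to seconds on the left, and downtime (Recovery Time Objective) runs from seconds up to weeks on the right. Techniques shown on the RPO side: Mirroring / Replication, Backup, Vaulting. Techniques shown on the RTO side: Clustering, Restore from Disk, Restore from Tape.]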


Testing
✓ Methods: Tabletop, simulation, full dress rehearsal
✓ Scenarios to test
✓ Take care to avoid creating confusion and panic
✓ Measure and document the tests
✓ Identify weaknesses and improvements
✓ Implement improvement actions
Cloud DR as a service
✓ Migrating entire IT operations, or only the DR solution, to the
cloud, and replicating or moving data to the cloud,
brings significant cost savings and lowers recovery
times.
✓ Cloud resources can shrink and grow in response to demand.
‘Replication Mode’ requires fewer resources and incurs
low cost.
✓ When a business disruption occurs, the system enters
‘Failover Mode’, which requires more resources; these
scale smoothly without requiring large upfront
investments.
✓ Cloud computing removes the need for matching hardware
between the primary datacenter and the cloud.
✓ Cloud server start-up can be easily automated and
managed.
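
The switch between ‘Replication Mode’ and ‘Failover Mode’ described above can be sketched as a small state machine. This is a hedged Python illustration only: the `CloudDRSite` class, the server counts and the rule of failing over after three consecutive failed health checks are assumptions made for the example, not features of any particular cloud service.

```python
from enum import Enum


class Mode(Enum):
    REPLICATION = "replication"  # low-cost standby that only receives data
    FAILOVER = "failover"        # scaled-up site serving production traffic


class CloudDRSite:
    def __init__(self, standby_servers=1, full_servers=8, failures_to_fail_over=3):
        # All three defaults are hypothetical values for illustration.
        self.mode = Mode.REPLICATION
        self.standby_servers = standby_servers
        self.full_servers = full_servers
        self.failures_to_fail_over = failures_to_fail_over
        self.consecutive_failures = 0

    @property
    def running_servers(self):
        """Servers needed in the current mode."""
        return self.full_servers if self.mode is Mode.FAILOVER else self.standby_servers

    def record_health_check(self, primary_healthy):
        """Count consecutive failed checks of the primary site and fail over
        once the threshold is reached. Failing back is never automatic."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_to_fail_over:
                self.mode = Mode.FAILOVER
        return self.mode

    def fail_back(self):
        """Controlled return to replication once the primary site is ready."""
        self.mode = Mode.REPLICATION
        self.consecutive_failures = 0


site = CloudDRSite()
for healthy in [True, False, False, False]:
    site.record_health_check(healthy)
print(site.mode.value, site.running_servers)  # failover 8
```

Requiring several consecutive failures before failing over is a design choice to avoid switching sites on a single missed check; fail-back is kept as an explicit call so the return to the original site stays a deliberate decision.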
Platform Recovery Strategies
DRP Document Format
1. Introduction
2. Business Impact Analysis - including a sample impact matrix
3. DRP Organization Responsibilities pre- and post-disaster - DRP / BCP checklist
4. Backup Strategy for Data Centers, Departmental File Servers, Wireless Network servers, Data
at Outsourced Sites, Desktops (in office and "at home"), Laptops and mobiles
5. Recovery Strategy including approach (e.g. log file transfers, data back-up, spare servers,
dedicated DataComm lines), escalation plan process, decision points.
6. DR Facility to be set up / upgraded (provide details of location, capacity, data communication,
data centre facilities)
7. User Area readiness – Contact Point, Tech Support, Alternate DataComm lines, User Access Set-up
8. Accountability to decide on invoking DRP Actions (Person 1/2/3/4)
9. Disaster Recovery Procedures in a tabular format:
Sequence, Action, Who, When, Verification of correct completion, Communication
DRP Document Format
10. Technical Appendix
11. Communication Process including necessary phone numbers and contact points
12. Role of Outsourced Support Providers (IT Support, Communication, Security, Transport)

Keeping the DRP in readiness:


a) Accountability and resources
b) Alternate members when primary contact is not available
c) Additional resources to be requisitioned when needed
d) Distribution of the Disaster Recovery Plan – who is informed ahead of time
e) Training of the Disaster Recovery Team
f) Testing of the Disaster Recovery Plan
g) Evaluation of the Disaster Recovery Plan Tests and Rectification
h) Maintenance of the Disaster Recovery Plan
Recovery Action when Disaster strikes
1. Protection to People, Environment, Assets
2. Actions to Contain or eliminate further damage
3. Call the Support Organisations
4. Inform Senior Management
5. DR Incident manager and other roles for DRP clarified
6. Communicate to end users, customers, suppliers, employees, service providers and
support functions – what to do till normalcy returns
7. Recovery Strategy clarified to all, including approach (e.g. log file transfers, data
back-up, spare servers, dedicated DataComm lines), escalation plan process and
decision points.
8. Disaster Recovery Procedures in a tabular format
9. Sequence, Action, Who, When, Verification of correct completion, Communication
Recovery Action when Disaster strikes
10. User Area Actions initiated – Contact Point, Tech Support, Alternate DataComm lines,
User Access Set-up
11. Technical Information shared for reference
12. Correctness of recovery checked and reviewed (Audit team involved, if possible)
13. Select users asked to test pre-defined software options
14. Communicate to all to resume operations
15. Separate team to work on Recovering the Original site
16. Test Original site for readiness
17. Plan and Communicate Transition back to original site
18. Perform transition to Original site
19. Bring the DR site to readiness for any new emergencies
20. Document observations and learning
Planning and Setting up a DR system
✓ Set the RPO, RTO Objectives based on Business need
✓ Decide DR Location
✓ Decide Recovery strategy
✓ Plan for growth of processing, disk space and communication needs
✓ Decide on data (log files) movement from “Production” to DR Site
✓ Set up the data centre; install servers, disk space and communication lines
✓ Test the data movement and recovery up to RPO
✓ Set up the data backup routine at the DR Centre
✓ Synchronize “Production” and DR System
✓ Initiate log file movement
Recovery Actions
Subject: Protect people / environment / assets
▪ Action: Evacuate building
▪ Who does: Data Centre Manager
▪ When: Immediately on noticing flood/fire/building collapse
▪ Who else involved: Anyone on site, preferably trained
▪ Remarks/Guidelines: Speed of action is vital

Subject: Contain further damage
▪ Action: First aid, fire-fighting, stop power/water flow, call civil authorities
▪ Who does: Data Centre Manager
▪ When: After protecting people
▪ Who else involved: Trained personnel on first aid, fire fighting
▪ Remarks/Guidelines: Call ambulance / doctor at site immediately
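
The recovery actions above lend themselves to a simple structured record, so that a team can see which step comes next and which steps have been verified. The following Python sketch is illustrative only: the `RecoveryStep` fields mirror the columns used in this deck, while the `verified` flag and the `next_pending_step` helper are hypothetical additions.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    sequence: int
    subject: str
    action: str
    who: str
    when: str
    verified: bool = False  # set once correct completion has been checked


def next_pending_step(steps):
    """Return the lowest-sequence step not yet verified, or None if all are done."""
    for step in sorted(steps, key=lambda s: s.sequence):
        if not step.verified:
            return step
    return None


steps = [
    RecoveryStep(2, "Contain further damage",
                 "First aid, fire-fighting, stop power/water flow",
                 "Data Centre Manager", "After protecting people"),
    RecoveryStep(1, "Protect people / environment / assets",
                 "Evacuate building",
                 "Data Centre Manager", "Immediately on noticing the event"),
]

print(next_pending_step(steps).subject)  # Protect people / environment / assets
steps[1].verified = True
print(next_pending_step(steps).subject)  # Contain further damage
```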
Updating BCP/DRP
✓ Make one person accountable
✓ Update BCP/DRP when changes happen in business or IT
✓ Have the updates audited by an external party
Disaster “Recovery”
✓ Getting back to business as usual (BAU)
✓ Site readiness
✓ People readiness
✓ Data consistency or gaps
✓ Involve Internal Audit teams
✓ Preserve data back-ups at key points
▪ May be needed in investigation, insurance etc.
✓ Controlled move back to usual site
Closure:
▪ Document the incidents and actions
▪ Document learnings
▪ Document gaps, close gaps
Incident Management
An incident is an unplanned interruption to an IT service or a reduction in the quality of an IT
service.
Incident Management - Activities
✓ Identification - the incident is detected or reported.
✓ Registration - the incident is registered in an ICM System.
✓ Categorization - the incident is categorized by priority, SLA etc.
✓ Communication
✓ Prevention of further damage.
✓ Seek help
✓ Prioritization - the incident is prioritized for better utilization of the
resources and the Support Staff time.
✓ Diagnosis - the full symptoms of the incident are revealed.
Incident Management - Activities
✓ Escalation - the incident is escalated if the Support Staff need support from other
organizational units.
✓ Investigation and diagnosis - if no existing solution from the past can be
found, the incident is investigated and the root cause found.
✓ Resolution and recovery - once the solution is found, the incident is
resolved.
✓ Rectify the effect of the problem.
✓ Put in place temporary or permanent fix; if temporary, initiate permanent
fix.
✓ Incident closure - the registry entry of the incident in the ICM System is
closed by providing the end-status of the incident.
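
The incident management activities listed above form a lifecycle that an ICM System has to track. Below is a minimal Python sketch of that lifecycle as an ordered set of stages. The `Stage` names follow the activities in this deck, but the `Incident` class, the rule that stages only move forward, and the choice to treat escalation and investigation as optional are assumptions made for illustration.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    REGISTERED = 2
    CATEGORIZED = 3
    PRIORITIZED = 4
    DIAGNOSED = 5
    ESCALATED = 6
    INVESTIGATED = 7
    RESOLVED = 8
    CLOSED = 9


# Assumption: these stages may be skipped, e.g. when a known solution exists.
OPTIONAL_STAGES = {Stage.ESCALATED, Stage.INVESTIGATED}


class Incident:
    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]

    def advance(self, new_stage):
        """Move forward to new_stage, refusing to go back or skip a mandatory stage."""
        if new_stage.value <= self.stage.value:
            raise ValueError(f"cannot move from {self.stage.name} to {new_stage.name}")
        skipped = [s for s in Stage if self.stage.value < s.value < new_stage.value]
        mandatory = [s.name for s in skipped if s not in OPTIONAL_STAGES]
        if mandatory:
            raise ValueError(f"cannot skip {', '.join(mandatory)}")
        self.stage = new_stage
        self.history.append(new_stage)


incident = Incident("online check-in unavailable")
for stage in [Stage.REGISTERED, Stage.CATEGORIZED, Stage.PRIORITIZED,
              Stage.DIAGNOSED, Stage.RESOLVED, Stage.CLOSED]:
    incident.advance(stage)
print(incident.stage.name)  # CLOSED
```

Keeping the full `history` gives the closure step a record of which activities were actually carried out for the incident.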
Incident Manager - Responsibilities
✓ Understand the incident/fault.
✓ Gather sufficient information to start an analysis.
✓ Maintain a general overview of the incident including communication.
✓ Understand the functionality of multiple areas.
✓ Obtain guidance on priorities for the teams starting the immediate, urgent,
unexpected recovery work.
Incident Closure
✓ Root cause analysis.
✓ Read-across to other parts of the system and actions to prevent recurrence.
✓ “How could the event have been anticipated or detected early?”.
✓ Training or HR actions as identified.
✓ Update documentation and processes suitably.
