Module - 4 Introduction To Business Continuity, Backup and Archive
MODULE – 4
INTRODUCTION TO BUSINESS CONTINUITY, BACKUP AND
ARCHIVE
INTRODUCTION TO BUSINESS CONTINUITY
Business Continuity (BC):
Business continuity (BC) is an integrated and enterprise wide process that includes all activities
(internal and external to IT) that a business must perform to mitigate the impact of planned and
unplanned downtime.
BC entails preparing for, responding to, and recovering from a system outage that adversely
affects business operations. It involves proactive measures, such as business impact analysis, risk
assessments, deployment of BC technology solutions (backup and replication), and reactive
measures, such as disaster recovery and restart, to be invoked in the event of a failure.
The goal of a BC solution is to ensure the “information availability” required to conduct vital
business operations.
Information Availability:
Information availability (IA) refers to the ability of the infrastructure to function according to
business expectations during its specified time of operation. Information availability ensures that
people (employees, customers, suppliers, and partners) can access information whenever they
need it. Information availability can be defined in terms of:
1. Reliability,
2. Accessibility
3. Timeliness.
1. Reliability: This reflects a component’s ability to function without failure, under stated
conditions, for a specified amount of time.
2. Accessibility: This is the state within which the required information is accessible at the
right place, to the right user. The period of time during which the system is in an
accessible state is termed system uptime; when it is not accessible, it is termed system
downtime.
3. Timeliness: This defines the exact moment or the time window (a particular time of the
day, week, month, and/or year, as specified) during which information must be accessible.
As illustrated in Fig 3.1 above, the majority of outages are planned. Planned outages are
expected and scheduled, but still cause data to be unavailable.
Consequences of Downtime
Information unavailability or downtime results in loss of productivity, loss of revenue,
poor financial performance, and damage to reputation.
Loss of productivity includes reduced output per unit of labor, equipment, and capital.
Loss of revenue includes direct loss, compensatory payments, future revenue loss, billing
loss, and investment loss.
Poor financial performance affects revenue recognition, cash flow, discounts, payment
guarantees, credit rating, and stock price.
Damages to reputations may result in a loss of confidence or credibility with customers,
suppliers, financial markets, banks, and business partners.
An important metric, average cost of downtime per hour, provides a key estimate in
determining the appropriate BC solutions. It is calculated as follows:
Average cost of downtime per hour = average productivity loss per hour +
average revenue loss per hour
Where:
Productivity loss per hour = (total salaries and benefits of all employees per week)
/(average number of working hours per week)
Average revenue loss per hour = (total revenue of an organization per week)
/(average number of hours per week that an organization is open for business)
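As a worked illustration of the formulas above, consider the following sketch. The weekly figures are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical inputs to the downtime-cost formulas above.
salaries_per_week = 500_000.0      # total salaries and benefits of all employees per week
working_hours_per_week = 40.0      # average number of working hours per week
revenue_per_week = 2_000_000.0     # total revenue of the organization per week
open_hours_per_week = 168.0        # hours per week the business is open (24x7 here)

# Productivity loss per hour = weekly salaries / weekly working hours
productivity_loss_per_hour = salaries_per_week / working_hours_per_week

# Average revenue loss per hour = weekly revenue / weekly open hours
revenue_loss_per_hour = revenue_per_week / open_hours_per_week

# Average cost of downtime per hour = productivity loss + revenue loss
avg_cost_of_downtime_per_hour = productivity_loss_per_hour + revenue_loss_per_hour

print(f"Average cost of downtime per hour: ${avg_cost_of_downtime_per_hour:,.2f}")
```

With these inputs, productivity loss is $12,500/hour and revenue loss is about $11,904.76/hour, giving roughly $24,404.76 per hour of downtime.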
Information availability (IA) relies on the availability of physical and virtual components
of a data center. Failure of these components might disrupt IA. A failure is the
termination of a component’s capability to perform a required function. The component’s
capability can be restored by performing an external corrective action, such as a manual
reboot, a repair, or replacement of the failed component(s).
Proactive risk analysis performed as part of the BC planning process considers the
component failure rate and average repair time, which are measured by MTBF and
MTTR:
Mean Time Between Failure (MTBF): It is the average time available for a system or
component to perform its normal operations between failures.
Mean Time To Repair (MTTR): It is the average time required to repair a failed
component. MTTR includes the total time required to do the following activities:
Detect the fault, mobilize the maintenance team, diagnose the fault, obtain the spare
parts, repair, test, and restore the data.
Fig 3.2 illustrates the various information availability metrics that represent system uptime
and downtime.
IA is the time period that a system is in a condition to perform its intended function upon
demand. It can be expressed in terms of system uptime and downtime and measured as the
amount or percentage of system uptime:
IA = system uptime / (system uptime + system downtime)
In terms of MTBF and MTTR, IA could also be expressed as
IA = MTBF / (MTBF + MTTR)
Uptime per year is based on the exact timeliness requirements of the service. This calculation
leads to the number-of-9s representation for availability metrics.
Table 3-1 lists the approximate amount of downtime allowed for a service to achieve certain
levels of 9s availability. For example, a service that is said to be “five 9s available” is available
for 99.999 percent of the scheduled time in a year (24 × 365).
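The availability formulas above can be sketched in code. The MTBF/MTTR values below are illustrative, and a 24 × 365 schedule is assumed:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """IA = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def allowed_downtime_hours_per_year(ia: float) -> float:
    """Downtime allowed per year for a given availability level,
    assuming a 24x365 (8760-hour) schedule."""
    return (1 - ia) * 24 * 365

# A component that runs 999 hours between failures and takes
# 1 hour to repair is "three 9s" available.
print(availability(mtbf_hours=999, mttr_hours=1))          # 0.999

# "Five 9s" (99.999%) allows roughly 0.0876 hours (about 5.26 minutes) per year.
print(allowed_downtime_hours_per_year(0.99999))
```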
BC Terminology
This section defines common terms related to BC operations which are used in this module to
explain advanced concepts:
Disaster recovery: This is the coordinated process of restoring systems, data, and the
infrastructure required to support key ongoing business operations in the event of a disaster.
It is the process of restoring a previous copy of the data and applying logs or other necessary
processes to that copy to bring it to a known point of consistency. Once all recoveries are
completed, the data is validated to ensure that it is correct.
Disaster restart: This is the process of restarting business operations with mirrored
consistent copies of data and applications.
Recovery-Point Objective (RPO): This is the point in time to which systems and data must
be recovered after an outage. It defines the amount of data loss that a business can endure. A
large RPO signifies high tolerance to information loss in a business. Based on the RPO,
organizations plan for the minimum frequency with which a backup or replica must be made.
For example, if the RPO is six hours, backups or replicas must be made at least once in 6
hours. Fig 3.3 (a) shows various RPOs and their corresponding ideal recovery strategies. An
organization can plan for an appropriate BC technology solution on the basis of the RPO it
sets. For example:
RPO of 24 hours: Backups are created on an offsite tape drive every midnight. The
corresponding recovery strategy is to restore data from the set of last backup tapes.
Recovery-Time Objective (RTO): The time within which systems and applications must be
recovered after an outage. It defines the amount of downtime that a business can endure and
survive. Businesses can optimize disaster recovery plans after defining the RTO for a given
system. For example, if the RTO is two hours, then use a disk backup because it enables a
faster restore than a tape backup. However, for an RTO of one week, tape backup will likely
meet requirements. Some examples of RTOs and the recovery strategies to ensure data
availability are listed below (refer to Fig 3.3 (b)):
RTO of 72 hours: Restore from backup tapes at a cold site.
RTO of 12 hours: Restore from tapes at a hot site.
RTO of a few hours: Use a data vault to a hot site.
RTO of a few seconds: Cluster production servers with bidirectional mirroring, enabling
the applications to run at both sites simultaneously.
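The RPO and RTO examples above can be sketched as simple planning helpers. The thresholds below simply encode the example strategies listed in the text; they are illustrative, not prescriptive:

```python
def min_backup_frequency_hours(rpo_hours: float) -> float:
    # A backup or replica must be made at least once per RPO interval.
    return rpo_hours

def recovery_strategy(rto_hours: float) -> str:
    # Thresholds taken from the example RTOs above.
    if rto_hours >= 72:
        return "restore from backup tapes at a cold site"
    if rto_hours >= 12:
        return "restore from tapes at a hot site"
    if rto_hours >= 1:
        return "use a data vault to a hot site"
    return "clustered servers with bidirectional mirroring"

print(min_backup_frequency_hours(6))   # RPO of 6 hours: back up at least every 6 hours
print(recovery_strategy(12))           # restore from tapes at a hot site
```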
The BC planning lifecycle includes five stages, shown below (Fig 3.4):
1. Establishing objectives
Determine BC requirements.
Estimate the scope and budget to achieve requirements.
Select a BC team that includes subject matter experts from all areas of the business.
Create BC policies.
2. Analyzing
Collect information on data profiles, business processes, infrastructure support,
dependencies, and frequency of using business infrastructure.
Identify critical business needs and assign recovery priorities.
Create a risk analysis for critical areas and mitigation strategies.
Conduct a Business Impact Analysis (BIA).
Create a cost and benefit analysis based on the consequences of data unavailability.
3. Designing and developing
Define the team structure and assign individual roles and responsibilities. For example,
different teams are formed for activities such as emergency response, damage assessment,
and infrastructure and application recovery.
Design data protection strategies and develop infrastructure.
Develop contingency scenarios.
Develop emergency response procedures.
Detail recovery and restart procedures.
4. Implementing
Implement risk management and mitigation procedures that include backup, replication, and
management of resources.
Prepare the disaster recovery sites that can be utilized if a disaster affects the primary data
center.
Implement redundancy for every resource in a data center to avoid single points of
failure.
5. Training, testing, assessing, and maintaining
Train the employees who are responsible for backup and replication of business-critical data
on a regular basis or whenever there is a modification in the BC plan
Train employees on emergency response procedures when disasters are declared.
Train the recovery team on recovery procedures based on contingency scenarios.
Perform damage assessment processes and review recovery plans.
Test the BC plan regularly to evaluate its performance and identify its limitations.
Assess the performance reports and identify limitations.
Ms. Mamatha A, Assistant Professor, Dept of CSE, SVIT 8
Storage Area Networks(18CS822) Module-4
Update the BC plans and recovery/restart procedures to reflect regular changes within the
data center.
Failure Analysis
Single Point of Failure
A single point of failure refers to the failure of a component that can terminate the
availability of the entire system or IT service.
Fig 3.5 depicts a system setup in which an application, running on a VM, provides an
interface to the client and performs I/O operations.
The client is connected to the server through an IP network, the server is connected to
the storage array through a FC connection, an HBA installed at the server sends or
receives data to and from a storage array, and an FC switch connects the HBA to the
storage port
In a setup where each component must function as required to ensure data
availability, the failure of a single physical or virtual component causes the failure of the
entire data center or an application, resulting in disruption of business operations.
In this example, failure of a hypervisor can affect all the running VMs and the virtual
network, which are hosted on it.
There can be several similar single points of failure identified in this example. A VM, a
hypervisor, an HBA/NIC on the server, the physical server, the IP network, the FC switch,
the storage array ports, or even the storage array could be a potential single point of
failure. To avoid single points of failure, it is essential to implement a fault-tolerant
mechanism.
Data centers follow stringent guidelines to implement fault tolerance for uninterrupted
information availability. Careful analysis is performed to eliminate every single point of
failure.
The example shown in Fig 3.6 represents all enhancements of the system shown in
Fig 3.5 in the infrastructure to mitigate single points of failure:
Configuration of redundant HBAs at a server to mitigate single HBA failure
Configuration of NIC (network interface card) teaming at a server allows protection
against single physical NIC failure. It allows grouping of two or more physical NICs
and treating them as a single logical device. NIC teaming eliminates the single point
of failure associated with a single physical NIC.
Configuration of redundant switches to account for a switch failure
Configuration of multiple storage array ports to mitigate a port failure
RAID and hot spare configuration to ensure continuous operation in the event of disk
failure
Implementation of a redundant storage array at a remote site to mitigate local site
failure
Implementing server (or compute) clustering, a fault-tolerance mechanism whereby
two or more servers in a cluster access the same set of data volumes. Clustered
servers exchange a heartbeat to inform each other about their health. If one of the
servers or hypervisors fails, the other server or hypervisor can take up the workload.
Implementing a VM Fault Tolerance mechanism ensures BC in the event of a server
failure. This technique creates duplicate copies of each VM on another server so that
when a VM failure is detected, the duplicate VM can be used for failover. The two
VMs are kept in synchronization with each other in order to perform successful
failover.
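The heartbeat-based failover described above can be sketched as follows. The timeout value and server names are hypothetical:

```python
# Minimal sketch of cluster heartbeat monitoring: servers exchange heartbeats,
# and a server that stops heartbeating within the timeout is considered failed,
# so its workload fails over to the surviving servers.
def alive_servers(last_heartbeat: dict, now: float, timeout: float = 5.0) -> list:
    """Return the servers still considered healthy at time `now`."""
    return [s for s, t in last_heartbeat.items() if now - t <= timeout]

heartbeats = {"server-a": 100.0, "server-b": 97.0}
print(alive_servers(heartbeats, now=101.0))   # both servers healthy

heartbeats["server-a"] = 90.0                 # server-a stopped heartbeating
print(alive_servers(heartbeats, now=101.0))   # workload fails over to server-b
```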
Multipathing Software
Configuration of multiple paths increases the data availability through path failover. If
servers are configured with one I/O path to the data, there will be no access to the data if
that path fails. Redundant paths prevent the path from becoming a single point of failure.
Multiple paths to data also improve I/O performance through load sharing and maximize
server, storage, and data path utilization.
In practice, merely configuring multiple paths does not serve the purpose. Even with
multiple paths, if one path fails, I/O will not reroute unless the system recognizes that it
has an alternate path.
Multipathing software provides the functionality to recognize and utilize alternate I/O
paths to the data. It also manages load balancing by distributing I/Os across all available,
active paths.
In a virtual environment, multipathing is enabled either by using the hypervisor’s built-in
capability or by running a third-party software module, added to the hypervisor.
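A minimal sketch of the two multipathing functions just described, load balancing and path failover, assuming hypothetical path names:

```python
import itertools

class Multipath:
    """Toy model of multipathing: round-robin I/O across active paths,
    with failover when a path is marked failed."""

    def __init__(self, paths):
        self.active = list(paths)

    def fail_path(self, path):
        # Failover: stop routing I/O down the failed path.
        self.active.remove(path)
        if not self.active:
            raise RuntimeError("no path to data: all paths have failed")

    def issue_io(self, n_ios):
        # Load balancing: distribute I/Os across all active paths.
        rr = itertools.cycle(self.active)
        return [next(rr) for _ in range(n_ios)]

mp = Multipath(["hba0->portA", "hba1->portB"])
print(mp.issue_io(4))          # I/O alternates between the two paths
mp.fail_path("hba0->portA")
print(mp.issue_io(2))          # all I/O rerouted to the surviving path
```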
BC Technology Solutions
After analyzing the business impact of an outage, designing appropriate solutions to recover
from a failure is the next important activity. One or more copies of the original data are
maintained using any of the following strategies, so that data can be recovered and business
operations can be restarted using an alternate copy:
1. Backup: Data backup is a predominant method of ensuring data availability. The
frequency of backup is determined based on RPO, RTO, and the frequency of data
changes.
Backup Purpose
Backups are performed to serve three purposes: disaster recovery, operational recovery,
and archival. These are discussed in the following sections.
Disaster Recovery
Backups are performed to address disaster recovery needs.
The backup copies are used for restoring data at an alternate site when the primary site
is incapacitated due to a disaster. Based on RPO and RTO requirements, organizations
use different backup strategies for disaster recovery.
When a tape-based backup method is used as a disaster recovery strategy, the backup
tape media is shipped and stored at an offsite location. These tapes can be recalled for
restoration at the disaster recovery site.
Organizations with stringent RPO and RTO requirements use remote replication
technology to replicate data to a disaster recovery site. Organizations can bring
production systems online in a relatively short period of time if a disaster occurs.
Operational Recovery
Data in the production environment changes with every business transaction and
operation.
Operational recovery is the use of backups to restore data if data loss or logical
corruption occurs during routine processing.
Backup Granularity
• Backup granularity depends on business needs and the required RTO/RPO.
• Based on the granularity, backups can be categorized as full, incremental and cumulative
(differential).
• Most organizations use a combination of these three backup types to meet their backup and
recovery requirements.
• The below figure shows the different backup granularity levels.
Full Backup
• Full backup is a backup of the complete data on the production volumes, copied to a secondary
storage device.
• It provides a faster recovery but requires more storage space and also takes more time to
back up.
Incremental Backup
• Incremental backup copies the data that has changed since the last full or incremental
backup, whichever has occurred more recently.
• This is much faster than a full backup (because the volume of data backed up is restricted to
the changed data only) but takes longer to restore.
Cumulative Backup
• Cumulative backup copies the data that has changed since the last full backup.
• This method takes longer than an incremental backup but is faster to restore.
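The three granularity levels can be sketched as set computations over per-file modification times. File names and timestamps below are hypothetical:

```python
def full_backup(files):
    # Full backup copies everything.
    return set(files)

def incremental_backup(files, mtimes, last_backup_time):
    # Copies data changed since the last full OR incremental backup,
    # whichever occurred more recently.
    return {f for f in files if mtimes[f] > last_backup_time}

def cumulative_backup(files, mtimes, last_full_time):
    # Copies data changed since the last FULL backup,
    # regardless of any incrementals taken in between.
    return {f for f in files if mtimes[f] > last_full_time}

files = ["f1", "f2", "f3"]
mtimes = {"f1": 1, "f2": 5, "f3": 9}   # hypothetical modification times

print(full_backup(files))                     # all files
print(incremental_backup(files, mtimes, 8))   # only f3 changed since time 8
print(cumulative_backup(files, mtimes, 4))    # f2 and f3 changed since the full at time 4
```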
The below figure shows an example of restoring data from incremental backup. In this
example, a full backup is performed on Monday evening, and incremental backups are taken
each evening thereafter.
On Tuesday, a new file (File 4 in the figure) is added, and no other files have changed.
Consequently, only File 4 is copied during the incremental backup performed on Tuesday
evening.
On Wednesday, no new files are added, but File 3 has been modified. Therefore, only the
modified File 3 is copied during the incremental backup on Wednesday evening.
Similarly, the incremental backup on Thursday copies only File 5.
On Friday morning, there is data corruption, which requires data restoration from the
backup.
The first step toward data restoration is restoring all data from the full backup of Monday
evening. The next step is applying the incremental backups of Tuesday, Wednesday, and
Thursday.
In this manner, data can be successfully recovered to its previous state, as it existed on
Thursday evening.
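The restore sequence just described can be sketched as: apply the full backup first, then each incremental in chronological order. The backup contents below mirror the example in the text:

```python
def restore_from_incrementals(full, incrementals):
    """Restore: start from the full backup, then apply each incremental in order."""
    data = dict(full)            # Monday's full backup
    for inc in incrementals:     # Tuesday, Wednesday, Thursday, ...
        data.update(inc)         # later copies overwrite earlier ones
    return data

full = {"File1": "mon", "File2": "mon", "File3": "mon"}
incs = [
    {"File4": "tue"},            # Tuesday: File 4 added
    {"File3": "wed"},            # Wednesday: File 3 modified
    {"File5": "thu"},            # Thursday: File 5 added
]
restored = restore_from_incrementals(full, incs)
print(restored)   # the data as it existed on Thursday evening
```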
The below figure shows an example of restoring data from cumulative backup. In this
example, a full backup is taken on Monday evening, and cumulative backups are taken each
evening thereafter.
On Tuesday, File 4 is added; the cumulative backup performed on Tuesday evening copies
File 4. On Wednesday, File 5 is added, so the cumulative backup on Wednesday evening
copies both File 4 and File 5, because these files have been added since the last
full backup.
Similarly, on Thursday, File 6 is added. Therefore, the cumulative backup on Thursday
evening copies all three files: File 4, File 5, and File 6.
On Friday morning, data corruption occurs that requires data restoration using backup
copies.
The first step in restoring data is to restore all the data from the full backup of Monday
evening. The next step is to apply only the latest cumulative backup, which is taken on
Thursday evening.
In this way, the production data can be recovered faster because it needs only two copies of
data: the last full backup and the latest cumulative backup.
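A minimal sketch of that cumulative restore, using backup contents that mirror the example in the text:

```python
def restore_from_cumulative(full, latest_cumulative):
    """Restore: only two copies are needed, the last full backup
    plus the latest cumulative backup."""
    data = dict(full)                # Monday's full backup
    data.update(latest_cumulative)   # Thursday's cumulative backup
    return data

full = {"File1": "mon", "File2": "mon", "File3": "mon"}
thursday_cumulative = {"File4": "thu", "File5": "thu", "File6": "thu"}
print(restore_from_cumulative(full, thursday_cumulative))
```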
Backup Methods
Hot backup and cold backup are the two methods deployed for backup. They are
based on the state of the application when the backup is performed.
In a hot backup, the application is up and running, with users accessing their data
during the backup process. This method of backup is also referred to as an online
backup.
In a cold backup, the application is not active (it is shut down) during the backup process;
this is also called an offline backup.
The hot backup of online production data becomes more challenging because data is
actively used and changed.
An open file is locked by the operating system and is not backed up during the backup
process. In such situations, an open file agent is required to back up the open file.
In database environments, the use of open file agents is not enough, because the agent
should also support a consistent backup of all the database components.
For example, a database is composed of many files of varying sizes occupying several
file systems. To ensure a consistent database backup, all files need to be backed up in
the same state. That does not necessarily mean that all files need to be backed up at the
same time, but they all must be synchronized so that the database can be restored with
consistency.
The disadvantage associated with a hot backup is that the agents usually affect the
overall application performance.
Backup Architecture
• A backup system commonly uses the client-server architecture with a backup server and multiple
backup clients.
• The below figure illustrates the backup architecture.
• The backup server manages the backup operations and maintains the backup catalog, which
contains information about the backup configuration and backup metadata.
• Backup configuration contains information about when to run backups, which client data to be
backed up, and so on, and the backup metadata contains information about the backed up data.
• The role of a backup client is to gather the data that is to be backed up and send it to the storage
node. It also sends the tracking information to the backup server.
• The storage node is responsible for writing the data to the backup device. (In a backup
environment, a storage node is a host that controls backup devices.)
• The storage node also sends tracking information to the backup server. In many cases, the storage
node is integrated with the backup server, and both are hosted on the same physical platform.
• A backup device is attached directly or through a network to the storage node’s host platform.
Some backup architectures refer to the storage node as the media server because it manages the
storage device.
• Backup software provides reporting capabilities based on the backup catalog and the log files.
These reports include information, such as the amount of data backed up, the number of
completed and incomplete backups, and the types of errors that might have occurred.
• When a backup operation is initiated, significant network communication takes place between the
different components of a backup infrastructure.
• The backup operation is typically initiated by a server, but it can also be initiated by a client.
• The backup server initiates the backup process for different clients based on the backup schedule
configured for them.
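A minimal sketch of the catalog the backup server maintains, split into backup configuration and backup metadata as described above. Field names are illustrative, not from any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class BackupConfig:
    """Backup configuration: when to run backups and which client data to back up."""
    client: str
    schedule: str        # e.g. "daily 00:00"
    paths: list          # client paths to back up

@dataclass
class BackupCatalog:
    """Maintained by the backup server: configuration plus backup metadata."""
    configs: list = field(default_factory=list)
    metadata: list = field(default_factory=list)

    def record(self, client, files, storage_node):
        # Clients and storage nodes send tracking information to the backup
        # server, which stores it as backup metadata.
        self.metadata.append(
            {"client": client, "files": files, "storage_node": storage_node}
        )

catalog = BackupCatalog()
catalog.configs.append(BackupConfig("client-01", "daily 00:00", ["/data"]))
catalog.record("client-01", ["/data/db.bak"], "storage-node-1")
print(len(catalog.metadata))   # one backed-up data set recorded
```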
Backup Topologies
Three basic topologies are used in a backup environment:
1. Direct-attached backup
2. LAN-based backup
3. SAN-based backup
A mixed topology is also used, combining LAN-based and SAN-based topologies.
In a direct-attached backup, a backup device is attached directly to the client. Only
the metadata is sent to the backup server through the LAN. This configuration frees the
LAN from backup traffic.
In the example shown in Fig 3.7, the backup device is directly attached and dedicated to the
backup client. As the environment grows, however, there will be a need for central
management of all backup devices and to share the resources to optimize costs. An
appropriate solution is to share the backup devices among multiple servers. Network-
based topologies (LAN-based and SAN-based) provide the solution to optimize the
utilization of backup devices.
In LAN-based backup, the clients, backup server, storage node, and backup device are
connected to the LAN (see Fig 3.8). The data to be backed up is transferred from the
backup client (source), to the backup device (destination) over the LAN, which may
affect network performance.
This impact can be minimized by adopting a number of measures, such as configuring
separate networks for backup and installing dedicated storage nodes for some
application servers.
In SAN-based backup (also known as LAN-free backup), the clients and the backup device
are attached to the SAN, so backup data traffic is restricted to the SAN and only the backup
metadata is transported over the LAN (see Fig 3.9).
The emergence of low-cost disks as a backup medium has enabled disk arrays to be
attached to the SAN and used as backup devices. A tape backup of these data backups
on the disks can be created and shipped offsite for disaster recovery and long-term
retention.
The mixed topology uses both the LAN-based and SAN-based topologies, as shown in Fig
3.10. This topology might be implemented for several reasons, including cost, server
location, reduction in administrative overhead, and performance considerations.