MODULE – 3
BACKUP, ARCHIVE, AND REPLICATION
Business continuity (BC) is an integrated and enterprise-wide process that includes all activities (internal and external to IT) that a business must perform to mitigate the impact of planned and unplanned downtime.
BC entails preparing for, responding to, and recovering from a system outage that adversely affects business operations. It involves proactive measures, such as business impact analysis, risk assessments, and deployment of BC technology solutions (backup and replication), and reactive measures, such as disaster recovery and restart, to be invoked in the event of a failure.
The goal of a BC solution is to ensure the "information availability" required to conduct vital business operations.
Information availability (IA) refers to the ability of the infrastructure to function according to business expectations during its specified time of operation. Information availability ensures that people (employees, customers, suppliers, and partners) can access information whenever they need it. Information availability can be defined in terms of:
1. Reliability
2. Accessibility
3. Timeliness
1. Reliability: This reflects a component’s ability to function without failure, under stated
conditions, for a specified amount of time.
2. Accessibility: This is the state within which the required information is accessible at the
right place, to the right user. The period of time during which the system is in an
accessible state is termed system uptime; when it is not accessible it is termed system
downtime.
3. Timeliness: Defines the exact moment or the time window (a particular time of the day,
week, month, and/or year as specified) during which information must be accessible. For
example, if online access to an application is required between 8:00 am and 10:00 pm
each day, any disruptions to data availability outside of this time slot are not considered
to affect timeliness.
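As a tiny illustrative check of the 8:00 am to 10:00 pm example above (a hypothetical helper, not from the text):

# Hypothetical helper for the timeliness example: an outage affects
# timeliness only if it falls inside the required access window (8:00-22:00).

def affects_timeliness(outage_hour: int, window=(8, 22)) -> bool:
    start, end = window
    return start <= outage_hour < end

print(affects_timeliness(23))  # False: outside the 8 am - 10 pm window
print(affects_timeliness(9))   # True: disrupts required availability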
➢ Planned outages include installation/integration/maintenance of new hardware, software upgrades or patches, taking backups, application and data restores, facility operations (renovation and construction), and refresh/migration of the testing to the production environment.
➢ Unplanned outages include failure caused by database corruption, component failure,
and human errors.
pe
➢ Disasters (natural or man-made), such as flood, fire, earthquake, and contamination, are another type of incident that may cause data unavailability.
Fig 3.1: Disruptors of data availability
As illustrated in Fig 3.1 above, the majority of outages are planned. Planned outages are
expected and scheduled, but still cause data to be unavailable.
➢ Downtime results in loss of productivity, loss of revenue, and loss of goodwill among customers, suppliers, financial markets, banks, and business partners.
➢ An important metric, average cost of downtime per hour, provides a key estimate in determining the appropriate BC solutions. It is calculated as follows:
Average cost of downtime per hour = average productivity loss per hour + average revenue loss per hour
Where:
Average productivity loss per hour = (total salaries and benefits of all employees per week) / (average number of working hours per week)
Average revenue loss per hour = (total revenue of an organization per week) / (average number of hours per week that an organization is open for business)
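As a quick worked example, the two component formulas can be applied directly. The following minimal Python sketch uses purely hypothetical figures (they are not from the text) to show the arithmetic:

# Worked example of the downtime-cost formula above.
# All figures are illustrative assumptions, not values from the text.

weekly_salaries_and_benefits = 250_000.0  # total for all employees ($ per week)
weekly_working_hours = 40.0               # average working hours per week
weekly_revenue = 1_000_000.0              # total revenue ($ per week)
weekly_open_hours = 60.0                  # hours per week open for business

productivity_loss_per_hour = weekly_salaries_and_benefits / weekly_working_hours
revenue_loss_per_hour = weekly_revenue / weekly_open_hours

avg_cost_of_downtime_per_hour = productivity_loss_per_hour + revenue_loss_per_hour
# 250000/40 = 6250.00 and 1000000/60 = 16666.67, so about $22,916.67 per hour
print(f"Average cost of downtime per hour: ${avg_cost_of_downtime_per_hour:,.2f}")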
➢ Information availability (IA) relies on the availability of physical and virtual components
of a data center. Failure of these components might disrupt IA. A failure is the
termination of a component’s capability to perform a required function. The component’s
capability can be restored by performing an external corrective action, such as a manual
reboot, a repair, or replacement of the failed component(s).
➢ Proactive risk analysis performed as part of the BC planning process considers the
component failure rate and average repair time, which are measured by MTBF and
MTTR:
→ Mean Time Between Failure (MTBF): It is the average time available for a system or
component to perform its normal operations between failures.
→ Mean Time To Repair (MTTR): It is the average time required to repair a failed
component. MTTR includes the total time required to do the following activities:
Detect the fault, mobilize the maintenance team, diagnose the fault, obtain the spare
parts, repair, test, and restore the data.
Fig 3.2 illustrates the various information availability metrics that represent system uptime
and downtime.
Fig 3.2: Information availability metrics
IA is the time period during which a system is in a condition to perform its intended function upon demand. It can be expressed in terms of system uptime and downtime and measured as the amount or percentage of system uptime:
IA = system uptime / (system uptime + system downtime)
In terms of MTBF and MTTR, IA can also be expressed as:
IA = MTBF / (MTBF + MTTR)
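A minimal sketch of this calculation (with assumed MTBF and MTTR values, not figures from the text):

# Computing information availability (IA) from MTBF and MTTR.
# The numbers are illustrative assumptions.

mtbf_hours = 999.0  # mean time between failures
mttr_hours = 1.0    # mean time to repair

ia = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"IA = {ia:.4f} ({ia * 100:.2f}% uptime)")  # IA = 0.9990 (99.90% uptime)

# Annual downtime implied by this availability level:
hours_per_year = 24 * 365
print(f"Expected downtime: {(1 - ia) * hours_per_year:.1f} hours/year")  # ~8.8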
3.1.2 BC Terminology
This section defines common terms related to BC operations that are used in this module to explain advanced concepts:
➢ Disaster recovery: This is the coordinated process of restoring systems, data, and the infrastructure required to support key ongoing business operations in the event of a disaster. It is the process of restoring a previous copy of the data and applying logs or other necessary processes to that copy to bring it to a known point of consistency. Once all recoveries are completed, the data is validated to ensure that it is correct.
➢ Disaster restart: This is the process of restarting business operations with mirrored consistent copies of data and applications.
➢ Recovery-Point Objective (RPO): This is the point in time to which systems and data must be recovered after an outage. It defines the amount of data loss that a business can endure. A large RPO signifies high tolerance to information loss in a business. Based on the RPO, organizations plan for the minimum frequency with which a backup or replica must be made. For example, if the RPO is six hours, backups or replicas must be made at least once every six hours. Fig 3.3 (a) shows various RPOs and their corresponding ideal recovery strategies. An organization can plan for an appropriate BC technology solution on the basis of the RPO it sets. For example:
→ RPO of 24 hours: Backups are created on an offsite tape drive every midnight. The corresponding recovery strategy is to restore data from the set of last backup tapes.
→ RPO of 1 hour: Database logs are shipped to the remote site every hour. The corresponding recovery strategy is to recover the database to the point of the last log shipment.
→ RPO in the order of minutes: Data is mirrored asynchronously to a remote site.
→ Near zero RPO: Mission-critical data is mirrored synchronously to a remote site.
Fig 3.3: Strategies to meet RPO and RTO targets
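The RPO examples above reduce to a simple rule: the worst-case data loss equals the time elapsed since the last backup or replication cycle, so the copy interval must not exceed the RPO. A small illustrative sketch (hypothetical values, not from the text):

# The worst-case data loss after an outage is the interval since the last
# backup/replica, so meeting an RPO requires a copy at least every RPO hours.

def meets_rpo(copy_interval_hours: float, rpo_hours: float) -> bool:
    """Return True if the backup/replication schedule satisfies the RPO."""
    return copy_interval_hours <= rpo_hours

print(meets_rpo(copy_interval_hours=6, rpo_hours=6))    # True
print(meets_rpo(copy_interval_hours=24, rpo_hours=6))   # False: up to 24h lost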
➢ Recovery-Time Objective (RTO): This is the time within which systems and applications must be recovered after an outage. It defines the amount of downtime that a business can endure and survive. Businesses can optimize disaster recovery plans after defining the RTO for a given system. For example, if the RTO is two hours, a disk backup should be used because it enables a faster restore than a tape backup. However, for an RTO of one week, a tape backup will likely meet the requirements. Some examples of RTOs and the recovery strategies to ensure data availability are shown in Fig 3.3 (b).
The BC planning lifecycle includes five stages shown below (Fig 3.4):
Fig 3.4: BC planning lifecycle
Several activities are performed at each stage of the BC planning lifecycle, including the
following key activities:
1. Establishing objectives
→ Determine BC requirements.
→ Estimate the scope and budget to achieve requirements.
→ Select a BC team by considering subject matter experts from all areas of the business,
whether internal or external.
→ Create BC policies.
2. Analyzing
→ Collect information on data profiles, business processes, infrastructure support,
dependencies, and frequency of using business infrastructure.
→ Identify critical business needs and assign recovery priorities.
→ Conduct a risk analysis for critical areas and develop mitigation strategies.
→ Conduct a Business Impact Analysis (BIA).
→ Create a cost and benefit analysis based on the consequences of data unavailability.
3. Designing and developing
→ Define the team structure and assign individual roles and responsibilities. For example, different teams are formed for activities such as emergency response, damage assessment, and infrastructure and application recovery.
→ Design data protection strategies and develop infrastructure.
→ Develop contingency scenarios.
→ Develop emergency response procedures.
→ Detail recovery and restart procedures.
4. Implementing
→ Implement risk management and mitigation procedures that include backup, replication, and management of resources.
→ Prepare the disaster recovery sites that can be utilized if a disaster affects the primary data center.
→ Implement redundancy for every resource in a data center to avoid single points of failure.
5. Training, testing, assessing, and maintaining
→ Train the employees who are responsible for backup and replication of business-critical data on a regular basis or whenever there is a modification in the BC plan.
→ Train employees on emergency response procedures when disasters are declared.
→ Train the recovery team on recovery procedures based on contingency scenarios.
→ Perform damage assessment processes and review recovery plans.
→ Test the BC plan regularly to evaluate its performance and identify its limitations.
→ Assess the performance reports and identify limitations.
→ Update the BC plans and recovery/restart procedures to reflect regular changes within the
data center.
➢ A single point of failure refers to the failure of a component that can terminate the
availability of the entire system or IT service.
➢ Fig 3.5 depicts a system setup in which an application, running on a VM, provides an
interface to the client and performs I/O operations.
➢ The client is connected to the server through an IP network, the server is connected to the storage array through an FC connection, an HBA installed at the server sends or receives data to and from the storage array, and an FC switch connects the HBA to the storage port.
➢ In a setup where each component must function as required to ensure data availability, the failure of a single physical or virtual component causes the failure of the entire data center or an application, resulting in disruption of business operations.
➢ In this example, failure of a hypervisor can affect all the running VMs and the virtual network hosted on it. The hypervisor, an HBA/NIC on the server, the physical server, the IP network, the FC switch, the storage array ports, or even the storage array could each be a potential single point of failure. To avoid single points of failure, it is essential to implement a fault-tolerant mechanism.
➢ Data centers follow stringent guidelines to implement fault tolerance for uninterrupted
information availability. Careful analysis is performed to eliminate every single point of
failure.
➢ The example shown in Fig 3.6 represents the enhancements made to the infrastructure of the system shown in Fig 3.5 to mitigate single points of failure:
• Configuration of redundant HBAs at a server to mitigate single HBA failure
• Configuration of NIC (network interface card) teaming at a server allows protection against single physical NIC failure. It allows grouping of two or more physical NICs and treating them as a single logical device. NIC teaming eliminates the single point of failure associated with a single physical NIC.
• Configuration of redundant switches to account for a switch failure
• Configuration of multiple storage array ports to mitigate a port failure
• RAID and hot spare configuration to ensure continuous operation in the event of disk failure
• Implementing server clustering, a fault-tolerance mechanism whereby two or more servers in a cluster access the same set of data volumes. Clustered servers exchange a heartbeat to inform each other about their health (see the heartbeat sketch below). If one of the servers or hypervisors fails, the other server or hypervisor can take up the workload.
• Implementing a VM Fault Tolerance mechanism ensures BC in the event of a server
failure. This technique creates duplicate copies of each VM on another server so that
when a VM failure is detected, the duplicate VM can be used for failover. The two
VMs are kept in synchronization with each other in order to perform successful
failover.
Fig 3.6: Resolving single points of failure
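As referenced in the clustering bullet above, here is a minimal, hypothetical sketch of the heartbeat idea (not modeled on any specific cluster product): each node tracks when it last heard from its peer and takes over the workload if the peer stays silent past a timeout.

# Minimal, hypothetical heartbeat/failover sketch (no real cluster product).
# A standby node records when it last heard from its peer; if the peer is
# silent longer than the timeout, the standby takes over the workload.

import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before declaring the peer failed

class ClusterNode:
    def __init__(self, name: str):
        self.name = name
        self.last_peer_heartbeat = time.monotonic()
        self.active = False  # True once this node owns the workload

    def receive_heartbeat(self) -> None:
        # Called whenever a heartbeat message arrives from the peer.
        self.last_peer_heartbeat = time.monotonic()

    def check_peer(self) -> None:
        # Fail over if the peer has been silent longer than the timeout.
        silence = time.monotonic() - self.last_peer_heartbeat
        if silence > HEARTBEAT_TIMEOUT and not self.active:
            self.active = True
            print(f"{self.name}: peer silent {silence:.1f}s -- taking over workload")

standby = ClusterNode("server-B")
standby.receive_heartbeat()  # peer is healthy, timestamp refreshed
standby.check_peer()         # within the timeout, so no failover occurs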
➢ Configuring multiple paths between a server and its storage enables the data to take an alternative path if the primary path fails. Redundant paths prevent a path from becoming a single point of failure.
➢ Multiple paths to data also improve I/O performance through load sharing and maximize server, storage, and data availability.
2. Storage array-based replication (local): Data can be replicated locally within a storage array so that an alternate copy can be used if the original copy is lost or corruption occurs.
3. Storage array-based replication (remote): Data in a storage array can be replicated
to another storage array located at a remote site. If the storage array is lost due to a
pe
disaster, business operations can be started from the remote storage array.
➢ Infrequently accessed data can be archived and moved to secondary storage for the long term to meet regulatory requirements. This reduces the amount of data to be backed up and the time required to back up the data.
3.2.1 Backup Purpose
Backups are performed to serve three purposes: disaster recovery, operational recovery, and archival. These are discussed in the following sections.
➢ Backups support disaster recovery: when a tape-based backup is used as the disaster recovery strategy, the backup tape media is shipped and stored at an offsite location. These tapes can be recalled for restoration at the disaster recovery site.
➢ Organizations with stringent RPO and RTO requirements use remote replication
technology to replicate data to a disaster recovery site. Organizations can bring
production systems online in a relatively short period of time if a disaster occurs.
3.2.2 Backup Methods
➢ Hot backup and cold backup are the two methods deployed for backup. They are based on the state of the application when the backup is performed.
➢ In a hot backup, the application is up and running, with users accessing their data during the backup process. This method of backup is also referred to as an online backup.
➢ In a cold backup, the application is not active, or is shut down, during the backup process; this method is also called an offline backup.
➢ The hot backup of online production data becomes more challenging because data is
actively used and changed.
➢ An open file is locked by the operating system and is not backed up during the backup process. In such situations, an open file agent is required to back up the open file.
➢ In database environments, the use of open file agents is not enough, because the agent
should also support a consistent backup of all the database components.
➢ For example, a database is composed of many files of varying sizes occupying several
file systems. To ensure a consistent database backup, all files need to be backed up in
the same state. That does not necessarily mean that all files need to be backed up at the
same time, but they all must be synchronized so that the database can be restored with
consistency.
➢ The disadvantage associated with a hot backup is that the agents usually affect the
overall application performance.
➢ File attributes, such as permissions and ownership, and other metadata also need to be backed up. These attributes are as important as the data itself and must be backed up for consistency.
➢ Backup of the boot sector and partition layout information is also critical for successful recovery.
➢ In a disaster recovery environment, bare-metal recovery (BMR) refers to a backup in which all metadata, system information, and application configurations are appropriately backed up for a full system recovery. BMR builds the base system, which includes partitioning, the file system layout, the operating system, the applications, and all the relevant configurations. BMR recovers the base system first, before starting the recovery of data files. Some BMR technologies can recover a server onto dissimilar hardware.
➢ As a backup environment grows, there is a need for centralized management of all backup devices and for sharing resources to optimize costs. An appropriate solution is to share the backup devices among multiple servers. Network-based topologies (LAN-based and SAN-based) provide the solution to optimize the utilization of backup devices.
➢ In LAN-based backup, the clients, backup server, storage node, and backup device are connected to the LAN (see Fig 3.8). The data to be backed up is transferred from the backup client (source) to the backup device (destination) over the LAN, which may affect network performance.
➢ This impact can be minimized by adopting a number of measures, such as configuring
separate networks for backup and installing dedicated storage nodes for some
application servers.
Fig 3.8: LAN-based backup topology
➢ The SAN-based backup is also known as LAN-free backup. Fig 3.9 illustrates a SAN-based backup. The SAN-based backup topology is the most appropriate solution when a backup device needs to be shared among the clients. In this case, the backup device and clients are attached to the SAN.
➢ In the example shown in Fig 3.9, a client sends the data to be backed up to the backup device over the SAN. Therefore, the backup data traffic is restricted to the SAN, and only the backup metadata is transported over the LAN. Because the volume of metadata is insignificant compared to the production data, LAN performance is not degraded in this configuration.
➢ The emergence of low-cost disks as a backup medium has enabled disk arrays to be attached to the SAN and used as backup devices. A tape backup of these data backups on the disks can be created and shipped offsite for disaster recovery and long-term retention.
Fig 3.10: Mixed backup topology
3.2.4 Backup Technologies
➢ A wide range of technology solutions are currently available for backup targets.
➢ Tapes and disks are the two most commonly used backup media. Virtual tape libraries use disks as the backup medium, emulating tapes and providing enhanced backup and recovery capabilities.
3.2.4.1 Backup to Tape
➢ Tapes, a low-cost technology, are used extensively for backup. Tape drives are used to read/write data from/to a tape cartridge. Tape drives are referred to as sequential, or linear, access devices because the data is written or read sequentially.
➢ A tape cartridge is composed of magnetic tapes in a plastic enclosure.
➢ Tape mounting is the process of inserting a tape cartridge into a tape drive. The tape drive has motorized controls to move the magnetic tape around, enabling the head to read or write data.
➢ Several types of tape cartridges are available. They vary in size, capacity, shape,
number of reels, density, tape length, tape thickness, tape tracks, and supported speed.
Fig 3.11: Tape library
➢ Another type of slot called a mail or import/export slot is used to add or remove tapes
from the library without opening the access doors (Fig 3.11 Front View) because
opening the access doors causes a library to go offline.
Ex
➢ In addition, each physical component in a tape library has an individual element address
that is used as an addressing mechanism for moving tapes around the library.
➢ When a backup process starts, the robotic arm is instructed to load a tape into a tape drive. This adds a delay that depends on the type of hardware used; it generally takes 5 to 10 seconds to mount a tape. After the tape is mounted, additional time is spent positioning the heads and validating the header information. This total time is called load to ready time, and it can vary from several seconds to minutes.
➢ The tape drive receives backup data and stores the data in its internal buffer. This backup data is then written to the tape in blocks. During this process, it is best to ensure that the tape drive is kept busy continuously to prevent gaps between the blocks. This is accomplished by the buffering and speed adjustment features of the tape drive.
➢ Many times, even the buffering and speed adjustment features of a tape drive fail to prevent the gaps, causing the "shoe shining effect," or "backhitching." This is the repeated back-and-forth motion a tape drive makes when there is an interruption in the backup data stream. This repeated back-and-forth motion not only causes a degradation of service, but also excessive wear and tear to tapes.
➢ When the tape operation finishes, the tape rewinds to the starting position and is unmounted. The robotic arm is then instructed to move the unmounted tape back to its slot.
➢ When a restore is initiated, the backup software identifies the tape that holds the required data, and the robotic arm is instructed to move the tape from its slot to a tape drive. If the required tape is not found in the tape library, the backup software displays a message instructing the operator to manually insert the required tape into the tape library.
➢ When a file or a group of files needs to be restored, the tape must move sequentially to the beginning of the data before it can start reading. This process can take a significant amount of time, especially if the required files are recorded at the end of the tape.
➢ Modern tape devices have an indexing mechanism that enables a tape to be fast
forwarded to a location near the required data.
Limitations of Tape
➢ Tapes must be stored in locations with a controlled environment to ensure preservation
of the media and prevent data corruption.
➢ Data access in a tape is sequential, which can slow backup and recovery operations.
➢ Physical transportation of the tapes to offsite locations also adds management overhead.
➢ Restoring from disk is considerably faster than restoring from a tape, which took 108 minutes for the same environment in one comparison.
➢ Recovering from a full backup copy stored on disk and kept onsite provides the fastest recovery solution. Using disk enables the creation of full backups more frequently, which in turn improves RPO and RTO.
➢ Backup to disk does not offer any inherent offsite capability and depends on other technologies, such as local and remote replication.
➢ Some backup products also require additional modules and licenses to support backup to disk, which may also require additional configuration steps, including creation of RAID groups and file system tuning. These activities are not usually performed by a backup administrator.
➢ As in a physical tape library, a tape mount process also starts in a virtual tape library. However, unlike a physical tape library, where this process involves some mechanical delays, in a virtual tape library it is almost instantaneous. Even the load to ready time is much less than in a physical tape library.
➢ After the virtual tape is mounted and the tape drive is positioned, the virtual tape is ready to be used, and backup data can be written to it. Unlike a physical tape library, the virtual tape library is not constrained by the shoe shining effect.
➢ When the operation is complete, the backup software issues a rewind command and the virtual tape is unmounted; this process is also almost instantaneous.
➢ A virtual tape library requires less administration because it is preconfigured by the manufacturer.
➢ However, a virtual tape library is generally used only for backup purposes. In a backup-
to-disk environment, the disk systems are used for both production and backup data.
Table 3.2: Backup targets comparison
3.2.5 Data Deduplication for Backup
➢ Data deduplication is the process of identifying and eliminating redundant data. When duplicate data is detected during backup, the data is discarded and only a pointer is created to refer to the copy of the data that is already backed up.
➢ Data deduplication helps to reduce the storage requirement for backup, shorten the backup window, and reduce the network burden. It also helps to store more backups on disk and retain the data on disk for a longer time.
➢ There are two methods of deduplication: file level and subfile level.
➢ The differences exist in the amount of data reduction each method produces and the
time each approach takes to determine the unique content.
➢ File-level deduplication (also called single-instance storage) detects and removes
redundant copies of identical files. It enables storing only one copy of the file; the
subsequent copies are replaced with a pointer that points to the original file.
➢ File-level deduplication is simple and fast but does not address the problem of
duplicate content inside the files. For example, two 10-MB PowerPoint presentations
with a difference in just the title page are not considered as duplicate files, and each
file will be stored separately.
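A minimal sketch of file-level deduplication (single-instance storage) as described above, assuming a simple hash-indexed store (a hypothetical layout, not any specific product): identical file contents are stored once, and later copies become pointers.

# Sketch of file-level deduplication (single-instance storage).
# Identical file contents hash to the same digest, so only the first copy is
# stored; subsequent copies are recorded as pointers. Hypothetical store layout.

import hashlib

store: dict[str, bytes] = {}   # digest -> stored file content
catalog: dict[str, str] = {}   # file name -> digest (the "pointer")

def backup_file(name: str, content: bytes) -> None:
    digest = hashlib.sha256(content).hexdigest()
    if digest not in store:
        store[digest] = content  # first copy: store the actual data
    catalog[name] = digest       # every copy: store only a pointer

backup_file("deck_v1.pptx", b"slides..." * 1000)
backup_file("deck_copy.pptx", b"slides..." * 1000)  # duplicate: nothing new stored
print(len(store), "stored object(s) for", len(catalog), "file(s)")  # 1 object, 2 files

Note that, as the text points out, two presentations differing only in the title page hash differently and would both be stored in full; that is the gap subfile deduplication addresses.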
➢ Subfile deduplication breaks a file into smaller segments and detects redundant data within and across files. Segments can be of fixed length or variable length.
➢ In variable-length segment deduplication, if there is a change in a segment, the boundary for only that segment is adjusted, leaving the remaining segments unchanged. This method vastly improves the ability to find duplicate data segments compared to fixed-block deduplication.
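To illustrate the variable-length idea, here is a hedged sketch of content-defined chunking; a simple byte-sum condition stands in for a real rolling hash (such as Rabin fingerprinting), so treat it as illustrative only. Because boundaries are derived from the content itself, a small edit near the front shifts only nearby segment boundaries, and the unchanged segments still deduplicate.

# Illustrative content-defined chunking for subfile deduplication.
# Chunk boundaries depend on the content, so an edit shifts only nearby
# boundaries and unchanged segments still deduplicate.

import hashlib
import random

WINDOW, MASK = 16, 0x3F   # expect a boundary roughly every 64 bytes

def chunks(data: bytes):
    start = 0
    for i in range(WINDOW, len(data)):
        if sum(data[i - WINDOW:i]) & MASK == 0:   # content-defined boundary
            yield data[start:i]
            start = i
    yield data[start:]

def dedup_segments(data: bytes, store: set[str]) -> int:
    """Add unique segment hashes to the store; return bytes newly stored."""
    new_bytes = 0
    for seg in chunks(data):
        digest = hashlib.sha256(seg).hexdigest()
        if digest not in store:
            store.add(digest)
            new_bytes += len(seg)
    return new_bytes

random.seed(42)
original = bytes(random.randrange(256) for _ in range(8192))
edited = b"TITLE CHANGED" + original[13:]   # small edit near the front

store: set[str] = set()
print(dedup_segments(original, store))   # first backup stores everything (8192)
print(dedup_segments(edited, store))     # only segments near the edit are new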
3.2.5.2 Data Deduplication Implementation
Deduplication for backup can happen at the data source or the backup target.
Source-Based Data Deduplication
➢ Source-based data deduplication eliminates redundant data at the source (the backup client), which reduces the amount of data sent over the network during backup processes. It provides the benefit of a shorter backup window and requires less network bandwidth. There is also a substantial reduction in the capacity required to store the backup images.
➢ Fig 3.15 shows source-based data deduplication.
➢ Source-based deduplication increases the overhead on the backup client, which
impacts the performance of the backup and application running on the client.
➢ Source-based deduplication might also require a change of backup software if the existing backup software does not support it.
Target-Based Data Deduplication
➢ Target-based data deduplication occurs at the backup device, which offloads the backup client from the deduplication process.
➢ Fig 3.16 shows target-based data deduplication.
➢ In this case, the backup client sends the data to the backup device, and the data is deduplicated at the backup device, either immediately (inline) or at a scheduled time (post-process).
x
➢ Because deduplication occurs at the target, all the backup data needs to be transferred over the network, which increases network bandwidth requirements. Target-based data deduplication does not require any changes in the existing backup software.
➢ Inline deduplication performs deduplication on the backup data before it is stored on the backup device. Hence, this method reduces the storage capacity needed for the backup.
➢ Inline deduplication introduces overhead in the form of the time required to identify and remove duplication in the data, so this method is best suited for an environment with a large backup window.
➢ Post-process deduplication enables the backup data to be stored or written on the
backup device first and then deduplicated later.
➢ This method is suitable for situations with tighter backup windows. However, post-
process deduplication requires more storage capacity to store the backup images
before they are deduplicated.
3.2.6 Backup in Virtualized Environments
➢ There are two approaches for performing a backup in a virtualized environment: the traditional backup approach and the image-based backup approach.
➢ In the traditional backup approach, a backup agent is installed either on the virtual machine (VM) or on the hypervisor.
➢ Fig 3.17 shows the traditional VM backup approach.
➢ If the backup agent is installed on a VM, the VM appears as a physical server to the agent. The backup agent installed on the VM backs up the VM data to the backup device. The agent does not capture VM files, such as the virtual BIOS file, VM swap file, logs, and configuration files. Therefore, for a VM restore, a user needs to manually re-create the VM and then restore data onto it.
➢ If the backup agent is installed on the hypervisor, the VMs appear as a set of files to the agent. So, VM files can be backed up by performing a file system backup from the hypervisor. This approach is relatively simple because it requires having the agent just on the hypervisor instead of on all the VMs.
➢ The traditional backup method can cause high CPU utilization on the server being
backed up.
➢ The backup should therefore be performed when the server resources are idle or during a period of low activity on the network.
➢ Enough resources should also be allocated to manage the backup on each server when a large number of VMs are in the environment.
➢ Image-based backup operates at the hypervisor level and essentially takes a snapshot of the VM.
➢ It creates a copy of the guest OS and all the data associated with it (a snapshot of the VM disk files), including the VM state and application configurations. The backup is saved as a single file called an "image," and this image is mounted on a separate physical machine, the proxy server, which acts as a backup client.
➢ The backup software then backs up these image files normally (see Fig 3.18).
➢ This effectively offloads the backup processing from the hypervisor and transfers the load to the proxy server, thereby reducing the impact on VMs running on the hypervisor.
➢ Image-based backup enables quick restoration of a VM.