Compellent Best Practices With Site Recovery Manager
Revisions
Date             Description
August 2011      Initial release
November 2011
December 2011
March 2012
July 2012        Added warning
October 2012     Updated diagrams
October 2012
April 2013
July 2013
December 2013    Added Selectable Replay, QoS, user accounts, and revert to snapshot details
August 2014
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND
TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF
ANY KIND.
© 2013 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express
written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. VMware and Site Recovery Manager are
trademarks of VMware Corporation in the U.S. and other countries. Other trademarks and trade names may be used in
this document to refer to either the entities claiming the marks and names or their products. Dell disclaims any
proprietary interest in the marks and names of others.
Table of contents
Revisions
Executive summary
1 Introduction
5.9 Using Application and Data Consistent Frozen Replays with SRM
8.1 Overview
8.2 Reprotection
8.3 Failback
Conclusion
Executive summary
Datacenter consolidation by way of x86 virtualization is a trend which has gained tremendous momentum
and offers many benefits. Although the physical nature of a server is transformed once it is virtualized, the
necessity for data protection remains. Virtualization opens the door to new and flexible opportunities in data
protection, data recovery, replication, and business continuity. This document focuses on best practices
for providing an automated disaster recovery solution for virtualized workloads using Dell Compellent
Storage Center, Replay Manager, Array Based Replication, and VMware vSphere Site Recovery Manager
with varying levels of consistency.
Introduction
1.1
1.2
Forced Recovery for vSphere Replication - The Forced Recovery feature is now available to both
vSphere Replication and Array Based Replication.
vSphere Replication Decoupling - vSphere Replication was introduced as a feature bundled in SRM
5.0. In SRM 5.1, vSphere Replication is decoupled from SRM and is included in the vSphere
Essentials Plus and above platform bundle.
Relaxed Licensing - VMware added SRM support for the Essentials Plus tier of vSphere licensing.
Storage DRS and Storage vMotion - SRM 5.5 adds support for Storage DRS and Storage vMotion within
a consistency group for array based replication.
Setup Prerequisites
2.1 Enterprise Manager
Compellent Enterprise Manager Version 6.4.1 or greater is required for SRM 5.5. The SRA makes calls
directly to the Enterprise Manager Data Collector to manipulate the storage in order to carry out SRM
requested tasks. It is recommended to have the latest version of the Enterprise Manager Data Collector
installed to ensure compatibility with SRM and the SRA.
2.2 Storage Center
Site Recovery Manager using Array Based Replication requires two Compellent Storage Center systems
replicating between each other, with Remote Data Instant Replay (replication) licensed and operational
between the sites. SRM using vSphere Replication can leverage any storage certified for use with vSphere,
including Dell Compellent Storage Center.
2.3 VMware vSphere
Compatible versions of VMware Site Recovery Manager, vCenter Server, and vSphere and/or ESXi/ESX 3.5
hosts are required. Please check the latest Site Recovery Manager Compatibility Matrix for the versions of
software required for SRM to function. As of the release of SRM 5.1, vSphere Essentials Plus licensing is
required.
2.4
3.1
3.2
migrations can be performed with just one Enterprise Manager Data Collector server. However, once there
are active virtual machines running simultaneously at both the primary and secondary sites, the site with
the Enterprise Manager Data Collector will not be adequately protected by SRM. Enterprise Manager Data
Collector Servers are placed at each site so that either site can fail.
3.3
[Figure: Protected Site and Recovery Site architecture. Labels: vCenter, Site Recovery Manager, and Databases (software) at each site; Protected Cluster and Recovery Cluster; vSphere Replication; infrastructure storage at each site (VMFS/RDM/NFS; Fibre Channel/FCoE; Compellent/EqualLogic/PowerVault).]
3.4
[Figure: Active Site A and Active Site B architecture. Labels: vCenter, Site Recovery Manager, and Databases (software) at each site; Cluster A and Cluster B; vSphere Replication; infrastructure storage at each site (VMFS/RDM/NFS; Fibre Channel/FCoE; Compellent/EqualLogic/PowerVault).]
4.1
4.2
Keep in mind that each Data Collector, whether primary or remote, maintains its own user access
database. In a typical active/DR site configuration a single Enterprise Manager Data Collector server is
configured and deployed managing both source and destination site arrays. A single set of credentials is
needed to register that Data Collector server as an array manager for both sites.
In a typical active/active site configuration, two Enterprise Manager Data Collector servers are configured
and deployed with one managing the source site array and one managing the destination site array. Two
sets of credentials are needed to register each respective Data Collector as an array manager in SRM.
4.3
The sra-admin account used to access Enterprise Manager can now be used for configuring the Storage
Center credentials within SRM Array Manager configuration.
Note that each Enterprise Manager Data Collector and Storage Center maintains its own user account
database. In an active/active SRM site configuration, two sets of credentials will be used to configure the
SRM array managers.
4.4
1.
For convenience, it is automatically initiated at the end of the Create Replication Wizard:
4.5
4.6
second Data Collector is installed and configured as a Remote Data Collector. Restore points are saved
on the Primary Data Collector and replicated to the Remote Data Collector at one-minute intervals.
Replications must be created and Restore Points saved before the volume can be protected by SRM. Non-replicated volumes won't be discovered as a device by SRM and thus cannot be protected.
4.7
The vmware-dr.xml file is located in a directory named config within the Site Recovery Manager
installation folder, whose location varies depending on the operating system and SRM version. For
example:
C:\Program Files\VMware\VMware vCenter Site Recovery Manager\config\vmware-dr.xml
Configuring Replication
Storage Center replication in coordination with Site Recovery Manager can provide a robust disaster
recovery solution. Since each replication method affects recovery differently, choosing the correct
method to meet business requirements is important. Here is a brief summary of the different options.
5.1
5.2
5.3
In an asynchronous replication, the I/O needs only be committed and acknowledged to the
source system, so the data can be transferred to the destination in a nonconcurrent timeframe.
There are two different methods to determine when data is transferred to the destination:
o By replay schedule - The replay schedule dictates how often data is sent to the
destination. When each replay is taken, the Storage Center determines which blocks have
changed since the last replay (the delta changes), and then transfers them to the
destination. Depending on the rate of change and the bandwidth, it is entirely possible for
the replications to fall behind, so it is important to monitor them to verify that the
recovery point objective (RPO) can be met.
o Replicating the active replay - With this method, the data is transferred near real-time to
the destination, usually requiring more bandwidth than if the system were just replicating
the delta changes in the replays. As each block of data is written on the source volume, it
is committed, acknowledged to the host, and then transferred to the destination as fast as
possible. Keep in mind that the replications can still fall behind if the rate of change exceeds
available bandwidth.
Asynchronous replications usually have more flexible bandwidth requirements making this the
most common replication method.
One of the benefits of an asynchronous replication is that the replays are transferred to the
destination volume, allowing for check-points at the source system as well as the destination
system.
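The relationship between rate of change, bandwidth, and the replay schedule can be estimated with simple arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, it assumes the full link bandwidth is available to replication, and it ignores protocol overhead, compression, and QoS throttling.

```python
def transfer_seconds(changed_gib: float, link_mbps: float) -> float:
    """Estimate the time to send changed_gib (GiB) over a link of link_mbps (megabits/s).

    Assumption: the whole link is available to replication and there is no
    protocol overhead, so this is a best-case figure.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    changed_bits = changed_gib * 2**30 * 8
    return changed_bits / (link_mbps * 1_000_000)


def replay_keeps_up(changed_gib_per_replay: float,
                    replay_interval_s: float,
                    link_mbps: float) -> bool:
    """Return True if the delta of one replay can be sent before the next replay is taken."""
    return transfer_seconds(changed_gib_per_replay, link_mbps) <= replay_interval_s


if __name__ == "__main__":
    # Hourly replays over a 100 Mb/s link (example figures only).
    for delta_gib in (10, 50):
        seconds = transfer_seconds(delta_gib, 100)
        status = "keeps up" if replay_keeps_up(delta_gib, 3600, 100) else "falls behind"
        print(f"{delta_gib} GiB delta: {seconds:.0f} s to transfer, {status}")
```

If the estimated transfer time regularly approaches the replay interval, the replication is at risk of falling behind and the RPO should be re-evaluated.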
The data is replicated real-time to the destination. In a synchronous replication, an I/O must be
committed on both systems before an acknowledgment is sent back to the host. This limits the
type of links that can be used, since they need to be highly available with low latencies. High
latencies across the link will slow down access times on the source volume.
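The effect of link latency on a synchronous replication can be illustrated with a simplified model. The Python sketch below is an assumption for illustration, not a description of Storage Center internals: it supposes the write is committed locally and forwarded to the destination in parallel, and that the host acknowledgment waits for the slower of the two paths. The function name and parameters are hypothetical.

```python
def sync_write_latency_ms(local_commit_ms: float,
                          link_round_trip_ms: float,
                          remote_commit_ms: float) -> float:
    """Simplified host-visible write latency for a synchronous replication.

    Assumed model: the local commit and the remote commit (including the link
    round trip) proceed in parallel, and the acknowledgment to the host is
    sent only after both have completed.
    """
    remote_path_ms = link_round_trip_ms + remote_commit_ms
    return max(local_commit_ms, remote_path_ms)


if __name__ == "__main__":
    # Example figures only: a low-latency link versus a long-distance link.
    print(sync_write_latency_ms(0.5, 2.0, 0.5))
    print(sync_write_latency_ms(0.5, 40.0, 0.5))
```

Under this model, the link round trip quickly dominates the write latency seen by the host, which is why synchronous replication limits the type of links that can be used.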
QoS Definitions
Site Recovery Manager supports replication from a source to a destination, and then reversing that
direction of replication after recovery and reprotect plans have been invoked. For consistency and to
prevent unexpected replication latency, it is recommended to maintain consistent QoS definitions on each
Storage Center in a replication pair. Inconsistent QoS definitions between sites may also cause reprotect
or failback workflows to fail.
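One way to verify that QoS definitions match is to compare them side by side before running a reprotect. The Python sketch below assumes the definitions from each Storage Center have already been exported into dictionaries by some other means; the attribute names (link_speed_mbps, bandwidth_limited) and the definition name are hypothetical placeholders, not actual Storage Center field names.

```python
def qos_mismatches(site_a: dict, site_b: dict) -> dict:
    """Return QoS definitions that differ between two sites.

    Each argument maps a QoS definition name to a dictionary of its settings.
    The result maps each mismatched name to a (site_a_value, site_b_value)
    tuple; a value of None means the definition is missing at that site.
    """
    mismatches = {}
    for name in sorted(set(site_a) | set(site_b)):
        a_def = site_a.get(name)
        b_def = site_b.get(name)
        if a_def != b_def:
            mismatches[name] = (a_def, b_def)
    return mismatches


if __name__ == "__main__":
    # Hypothetical exported definitions for the two Storage Centers in a replication pair.
    primary = {"Repl-QoS": {"link_speed_mbps": 1000, "bandwidth_limited": False}}
    secondary = {"Repl-QoS": {"link_speed_mbps": 100, "bandwidth_limited": False}}
    for name, (a_def, b_def) in qos_mismatches(primary, secondary).items():
        print(f"QoS definition {name} differs: {a_def} vs {b_def}")
```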
5.4
5.5
Live Volume replications add an additional abstraction layer to the replication allowing mapping of
the same volume through multiple Storage Center systems.
Live Volume replications are not supported with SRM. The two features are mutually exclusive, and
SRM may be confused by a volume that is actively mapped at two different sites.
A. Once a replay is taken of the source volume, the delta changes begin transferring to the
destination immediately. The consistency state of the data within this replay is dependent on
whether or not the application had the awareness to quiesce the data before the replay was taken.
B. During a Recovery Plan test, a new replay is taken of the destination volume. This is done per the
VMware SRM specification to capture the latest data that has arrived at the DR site. Of course this
means that the consistency of the data is dependent on whether or not the previous replay was
completely transferred. For example using the figure above:
a. If the 4:00 pm replay taken at the primary site was application consistent, but only 75%
of that replay's data had been transferred at the time the SRM Recovery Replay was
taken, then the data is considered incomplete.
5.6
A. As writes are committed to the source volume, they are near simultaneously transferred to the
destination and stored in the Active Replay. (See figure above) Keep in mind that consistent
Replays can still be taken of the source volume, and the check points will be transferred to the
destination volume when replicating the active replay.
B. During a Recovery Plan test, a new replay is taken of the destination volume which locks in all the
data that has been transferred up to that point in time. Data stored within the Active Replay will
most likely be crash consistent. For example, using the figure above:
a. Although the 3:00 pm replay taken at the primary site was application consistent, at the
time the SRM Recovery Replay was taken, the data must still be deemed crash consistent
because it is highly probable that writes into the Active Replay occurred after the 3:00 pm
replay was taken.
b. If the application is unable to recover the crash consistent data captured in this replay,
then manual steps would be required to present the last known consistent replay (such as
the 3:00 pm frozen replay) back to the host.
c. The only time that application data can be considered consistent while replicating the
active replay is if ALL I/O has ceased to the source volume, followed by a complete
synchronization of the data from the source to the destination.
C. Once the SRM Recovery Replay has been taken, a view volume is created from that Replay.
D. The View volume is then presented to the ESX(i) host(s) at the DR site for SRM to begin test
execution of the Recovery Plan.
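The consistency rules described for the two asynchronous methods can be summarized as a small decision function. The Python sketch below is a hypothetical illustration of that reasoning, not SRA code: the class, function, and state names are invented for this example, and it assumes that a fully transferred replay which was not quiesced by the application is crash consistent.

```python
from dataclasses import dataclass


@dataclass
class SourceReplay:
    label: str                    # for example "4:00 pm"
    application_consistent: bool  # True if the application quiesced its data first
    fraction_transferred: float   # 0.0 to 1.0 of the replay data at the destination


def recovery_replay_state(latest: SourceReplay,
                          replicating_active_replay: bool,
                          io_ceased_and_synced: bool = False) -> str:
    """Classify the data captured by the SRM Recovery Replay at the destination."""
    if replicating_active_replay:
        # Writes keep arriving in the Active Replay after the last frozen replay,
        # so the data is crash consistent unless all source I/O has stopped and
        # the destination is fully synchronized.
        return "consistent" if io_ceased_and_synced else "crash consistent"
    if latest.fraction_transferred < 1.0:
        return "incomplete"
    return "application consistent" if latest.application_consistent else "crash consistent"


if __name__ == "__main__":
    print(recovery_replay_state(SourceReplay("4:00 pm", True, 0.75), False))
    print(recovery_replay_state(SourceReplay("3:00 pm", True, 1.0), True))
```

The first call mirrors the 75% transferred example for replication by replay schedule, and the second mirrors the 3:00 pm example for replication of the active replay.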
During an actual Disaster Recovery or Planned Migration execution, a View volume from a Replay is not
mounted to the remote vSphere hosts. Instead, the destination volume itself is mounted to the remote
hosts. This change in behavior from SRM 4.x is to facilitate the Reprotect and Failback features in SRM 5.x.
5.7
Due to various factors such as rate of change, link bandwidth, volume size, and replication QoS, it is
possible for multiple volumes in a backup set to finish replicating frozen replays at different times. In this
scenario, SRM will not use replays which have not been completely replicated. This means that if the plan
is executed before the frozen replay replications have completely transferred all of the data (as seen in the
figure above), the Volume A replay from 4pm will be mounted, the Volume B replay from 3pm will be
mounted, and the Volume C replay from 3pm will be mounted. If all three volumes are part of a
consistency group and in use for a tiered application or database application, then, depending on the
application, volumes presented from different points in time may cause problems with the application. If
this happens, manual intervention may be required to present the previous set of consistent replays back
to the host (for example, the 3:00 pm replay from volumes A, B, and C may need to be forced).
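The scenario above can be modeled to find the most recent point in time that is fully replicated on every volume. The Python sketch below is illustrative: the function names are hypothetical, replay times are written as zero-padded 24-hour strings so that they sort correctly, and the transfer fractions for the in-flight replays are made-up values.

```python
from typing import Dict, Optional

# Maps a replay time ("15:00") to the fraction of that replay at the destination.
Replays = Dict[str, float]


def latest_complete_replay(replays: Replays) -> Optional[str]:
    """Return the newest replay that has been completely replicated, if any."""
    complete = [time for time, fraction in replays.items() if fraction >= 1.0]
    return max(complete) if complete else None


def latest_common_replay(volumes: Dict[str, Replays]) -> Optional[str]:
    """Return the newest replay time that is complete on every volume, if any."""
    complete_sets = [
        {time for time, fraction in replays.items() if fraction >= 1.0}
        for replays in volumes.values()
    ]
    if not complete_sets:
        return None
    common = set.intersection(*complete_sets)
    return max(common) if common else None


if __name__ == "__main__":
    volumes = {
        "Volume A": {"15:00": 1.0, "16:00": 1.0},
        "Volume B": {"15:00": 1.0, "16:00": 0.6},  # 4 pm replay still transferring
        "Volume C": {"15:00": 1.0, "16:00": 0.3},  # 4 pm replay still transferring
    }
    for name, replays in volumes.items():
        print(name, "would be presented from", latest_complete_replay(replays))
    print("Latest point consistent across all volumes:", latest_common_replay(volumes))
```

With these example values, Volume A would be presented from 4 pm while Volumes B and C would be presented from 3 pm; the latest point common to all three is 3 pm, which is the set that may need to be presented manually.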
5.8
1. Always use Active Replay (default) - Uses the Active Replay (current, unfrozen state of the data
transferred to the destination).
2. Use Active Replay If Replicating Active Replay - Uses the Active Replay if Replicate Active Replay is
enabled for the replication; otherwise the last frozen Replay is used.
3. Always use Last Frozen Replay - Uses the most current frozen Replay that has been 100%
transferred to the destination.
4. Use Restore Point Settings - Uses the pre-configured settings for the restore point of the
replication. If Use Active Replay is not selected, then the last frozen Replay is used. This option
allows granular selection of SRM Selectable Replay configuration on a per volume replication
basis. The default Use Active Replay configuration for each restore point is a cleared checkbox,
which translates to Use Last Frozen Replay for the restore point.
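The four options can be expressed as a small decision function. The Python sketch below only restates the selection logic described in the list; the option identifiers and function name are hypothetical and do not correspond to Enterprise Manager settings or API names.

```python
def replay_used_by_srm(option: str,
                       replicate_active_replay: bool,
                       restore_point_use_active: bool = False) -> str:
    """Return "active" or "last_frozen" for a given Selectable Replay option.

    option is one of the hypothetical identifiers:
      "always_active", "active_if_replicating_active",
      "always_last_frozen", "restore_point_settings".
    restore_point_use_active defaults to False, matching the cleared
    Use Active Replay checkbox described for restore points.
    """
    if option == "always_active":
        return "active"
    if option == "active_if_replicating_active":
        return "active" if replicate_active_replay else "last_frozen"
    if option == "always_last_frozen":
        return "last_frozen"
    if option == "restore_point_settings":
        return "active" if restore_point_use_active else "last_frozen"
    raise ValueError(f"unknown option: {option}")


if __name__ == "__main__":
    for opt in ("always_active", "active_if_replicating_active",
                "always_last_frozen", "restore_point_settings"):
        print(opt, "->", replay_used_by_srm(opt, replicate_active_replay=False))
```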
The Enterprise Manager default is to use the Active Replay (the current unfrozen state) of the volume for
SRM purposes. The SRM Selectable Replay feature integrates only with certain SRM actions and in other
actions it will be ignored. The table below outlines each scenario.
Table 1
5.9
SRM Action
Recovery Type
Planned Migration
No
Disaster Recovery
N/A
using the vSphere Client plug-in to create a Replay of a datastore and using the workflow feature to
Quiesce file systems (if available). Using either of these methods results in a vSphere snapshot with both
parent and delta disks frozen in the Replay being replicated to the destination site.
Two things are important to recognize here:
1. The VM is replicated to the destination site in a vSphere snapshot state and should be dealt with in
one way or another to prevent the VM from running continuously over a long period of time in a
vSphere snapshot state.
2. The application and data consistency is contained within the frozen parent virtual disk and crash
consistent data is contained within the delta virtual disk.
When SRM recovery plan workflow is carried out, SRM registers the VM into inventory at the destination
site and powers on the VM with no special attention given to the current snapshot state of the VM. This
ultimately means that in this case SRM will power on the VM using the delta resulting in recovery from a
crash consistent state. In order to recover the VM from the frozen parent disk with application and data
consistency, before the VM is powered on it must be reverted to the previous snapshot using the vSphere
Snapshot Manager. Once this is done the snapshot can be deleted (closed) and the VM can be powered
on. This process ensures the VM is powered on from its frozen parent disk and the delta disk along with
the crash consistent data in it is destroyed.
Carrying out the process above manually at large scale will quickly erode efforts made toward meeting the
recovery plan's Recovery Time Objective (RTO), and it is not the best use of SRM. In such instances,
a more efficient and consistent solution is to script the snapshot management process
using PowerShell and have that process carried out as a Pre or potentially a Post-power on step for the
VM. Custom Recovery Tasks are discussed in the next section.
5.10
See Appendix A for examples of CompCU and PowerShell scripts that can be used within Site Recovery
Manager.
6.1
The Protected Site Array Managers and the Recovery Site Array Managers must both be configured so that
they can be paired. Depending on the architecture, a single Enterprise Manager Data Collector can be
added for both sites, or an Enterprise Manager Data Collector and Remote Data Collector model can be
deployed.
For example, in a single data collector configuration, the Protected Site Array Manager should specify the
recovery site data collector. Likewise, the Recovery Site Array Manager should specify the same data
collector residing at the same physical site.
6.2
6.3
By clicking on the Refresh link on this screen, the SRA will re-query the Enterprise Manager Data Collector
to obtain the new replicated virtual machine volume information.
6.4
After the Placeholder datastore is created, creating protection groups follows the same general process
as in previous versions of SRM. Replicated datastore volumes are the foundation which protection groups
are built upon. Immediately after the protection group is created, virtual machines residing on the
datastore or datastores in that protection group are protected. Legacy rules apply in that once a VM is
protected, it is essentially pinned to the datastore or datastores its .vmx and .vmdk files reside on.
Moving files belonging to a virtual machine is not supported with SRM and will result in the VM no longer
being protected or replicated from its original datastore or datastores. From a design and operational
standpoint, this means that automated Storage DRS (SDRS) and Storage vMotion cannot be used with SRM
protected VMs. Dell Compellent Storage Center arrays offer Dynamic Block Architecture and Data
Progression which provides automated sub-LUN tiering for virtual machines without interfering with SRM
protection groups.
Do not enable VMware Storage IO Control (SIOC) on datastores protected by Site Recovery Manager.
SIOC can prevent datastore unmounts leading to the failure of a Planned Migration. For more information,
see Storage IO Control and SRM Planned Migration at http://blogs.vmware.com/vsphere/2012/06/sioc-and-srm.html as well as VMware KB articles 2008507, 1037393, and 2004605.
6.5
In addition, consider adding Prompts and/or SRM server-side Commands to the recovery plan to help
ensure all data has been replicated before the subsequent Storage section is executed.
For example, the Compellent CompCU utility or Compellent PowerShell scripts could be integrated into
the recovery plan to take current replays of all the volumes to make sure the most recent data has been
replicated. An example CompCU script can be found in Appendix A of this document. When the recovery
plan executes, it will wait at an added Pause step. However, recovery plan execution will not pause at a
Command on SRM Server step.
7.1
7.2
As a safety precaution, a warning message will appear and must be acknowledged to execute a live plan.
8.1 Overview
After virtual machines are migrated from one site to another using either the Disaster Recovery or Planned
Migration features in SRM, they are in an active running state on the network at the alternate site.
However, at this point they are vulnerable to a site failure with no SRM protection. This was true in
previous versions of SRM and is true today in SRM 5.x. Previous versions of SRM required a manual
reprotection of the virtual machines at the recovery site. SRM 5.x automates the reprotect process and
prepares the virtual machines for failback.
8.2 Reprotection
Once protected virtual machines are migrated or failed over to the secondary site through disaster
recovery, they are unprotected. Immediately following the migration of a protection group, SRM 5.x can
automatically reprotect the virtual machines. It does this by performing a series of steps.
During a Reprotect, SRM commands the SRA to reverse the direction of storage replication for each of the
datastores/volumes in the protection group. The protection group, which was
originally set up at the primary site, is migrated to the secondary site. Placeholder VMs, which were
originally set up at the secondary site, are now created at the opposite site (which can now be considered
the new recovery site) on its respective placeholder datastore.
8.3 Failback
Failback is no more than an SRM term which describes the ability to perform a subsequent Disaster
Recovery or Planned Migration after a recovery and reprotect have already been successfully performed.
The benefit that Failback brings in SRM 5.x is the automated ability to move back and forth between sites
with minimal effort. This can facilitate a number of use cases including the ability to perform production
processing live in the disaster recovery site, resource balancing, and improved disaster recovery
infrastructure ROI.
Conclusion
VMware vSphere, Site Recovery Manager, and Dell Compellent Storage Center combine to provide a
highly available business platform for automated disaster recovery with the best possible RTO and RPO, as
well as planned migrations for your virtualized datacenter.
10
This PowerShell script connects to a Storage Center with a host name of sc12.techsol.local, using a
username of srmadmin and a password of mmm, and takes a replay of lun40 with a replay expiration set
to 1 day. It is easiest to run this PowerShell script from the Compellent Storage Center Command Set
Shell, which automatically loads Compellent's Compellent.StorageCenter.PSSnapin snap-in. However, this
script can be run from any PowerShell prompt provided the Compellent.StorageCenter.PSSnapin snap-in
is manually loaded or loaded as part of the PowerShell profile.
11
11.1
11.2
VMware Resources