Nutanix Solution Note
Version 1.0
April 2015
The Nutanix distributed software architecture runs a virtual storage controller (Controller VM or CVM) on each
Nutanix node or host on the Virtual Computing Platform, forming a distributed system. All nodes actively work
together to aggregate storage resources into a single global pool that every node can use. The storage
resources are managed by the Nutanix Distributed File System (NDFS) to ensure that data and system integrity are
preserved in the event of a node, disk, application, or hypervisor software failure. NDFS also delivers data
protection and high availability functionality that keeps critical data and VMs protected and applications running.
Figure 1: Nutanix solution for data protection and disaster recovery covers all aspects of availability
A snapshot is an evolution of the traditional backup process. It is created when the storage system creates a full or
virtual copy of the metadata or the index of the stored data. This is different from traditional backup solutions,
which create separate copies of the stored data. Because snapshots only need to copy the metadata or index at the
time they are taken, they can be near instantaneous, have little performance impact and require little incremental
space. IT organizations can therefore take snapshot-based backups more frequently and improve their recovery point objectives.
Backup vendors and analysts have acknowledged the shift to snapshots as a viable option for backup and
recovery.
However, not all snapshot implementations are created equal. Each implementation has different storage
requirements and imposes different restrictions on its use. The preferred
implementation is redirect-on-write (ROW). In this method, any updates to existing protected data are
redirected to a new location, so none of the existing data in snapshots needs to be copied or moved. As a result,
ROW snapshots do not suffer the performance impact of the alternative copy-on-write implementation, which must
copy the original data before each overwrite. That performance impact limits the applicability of copy-on-write
snapshots for primary data.
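The redirect-on-write behavior described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the Nutanix implementation: the class name RowVolume, the in-memory append-only store, and the block map are all hypothetical.

```python
class RowVolume:
    """Toy redirect-on-write volume (hypothetical, for illustration only)."""

    def __init__(self):
        self.store = []       # append-only list standing in for physical locations
        self.block_map = {}   # logical block number -> index into self.store
        self.snapshots = {}   # snapshot name -> frozen copy of a block map

    def write(self, block_no, data):
        # Redirect the update to a new location; existing data is never overwritten.
        self.store.append(data)
        self.block_map[block_no] = len(self.store) - 1

    def snapshot(self, name):
        # Only the metadata (the block map) is copied; no data blocks are moved.
        self.snapshots[name] = dict(self.block_map)

    def read(self, block_no, snapshot=None):
        mapping = self.block_map if snapshot is None else self.snapshots[snapshot]
        return self.store[mapping[block_no]]


vol = RowVolume()
vol.write(0, b"alpha")
vol.write(1, b"beta")
vol.snapshot("snap1")
vol.write(0, b"ALPHA")  # redirected to a new location; snap1 still maps to b"alpha"
```

Reading block 0 from the active volume returns the new data, while reading it through "snap1" returns the original, and taking the snapshot copied only the map.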
Another consideration when implementing snapshots is the granularity of data that can be protected. This
determines the space overhead of the snapshots taken. Smaller block sizes result in increased sharing of data
between snapshots and greater space efficiency. With large blocks, a change to a small portion of a block would
create a full new block with mostly duplicate data, causing the snapshot size to be much larger than the amount of
data changed.
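The effect of block size on snapshot overhead can be checked with a little arithmetic. The sketch below assumes that any change dirties the whole block containing it; the function name and the 4 KB and 1 MB block sizes are illustrative choices, not figures from this document.

```python
def snapshot_growth(changed_offsets, block_size):
    """Bytes of new blocks needed if every changed byte dirties its whole block."""
    dirty_blocks = {offset // block_size for offset in changed_offsets}
    return len(dirty_blocks) * block_size


KB = 1024
# A single 4 KB change at the start of the disk, expressed as the byte offsets it touches.
changed = range(0, 4 * KB)

small_blocks = snapshot_growth(changed, 4 * KB)     # one 4 KB block is rewritten
large_blocks = snapshot_growth(changed, 1024 * KB)  # one 1 MB block is rewritten
```

With these assumed sizes, the same 4 KB change costs 4 KB of new snapshot space with small blocks and 1 MB with large blocks.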
The last aspect that needs to be considered for snapshot design is the unit of data that can be protected and
restored by the storage system. Traditional storage deployments typically operate at the storage object or
volume/LUN level with little to no understanding of what is stored in those containers. In a virtualized
environment, this results in a simultaneous snapshot of tens to hundreds of VMs, each with varying change rates.
Consequently, it puts the burden on the administrators to map the different VMs to the storage objects such as
LUNs or volumes. This results in additional steps and greater system complexity, especially when recovering
individual VMs. In the traditional approach, snapshot schedules can only be set at a LUN or a volume level,
leading to practices such as creating one LUN per VM as a workaround in order to create individualized snapshot
VM schedules.
An alternative to this method is taking a VM-centric approach to storage and data protection. In this scenario,
storage understands and operates at the virtual disk or VM-level. So snapshots are taken at the VM-level and
administrators can set schedules and retention periods at the VM-level to meet service levels. Recovery is simple
as administrators can restore individual VMs without dealing with the underlying storage objects.
This brings us to the snapshot implementation on the Virtual Computing Platform. Nutanix OS implements
redirect-on-write, VM-granular snapshots. When a snapshot of a VM is initially taken on the Nutanix Virtual
Computing Platform, the system creates a read-only, zero-space clone of the metadata (i.e., the index to the data) and
makes the underlying VM data immutable (read-only); no VM data or virtual disks are actually copied or moved.
The result is a read-only copy of the VM that can be accessed in the same way as its active counterpart. Nutanix
snapshots take only a few seconds to create, eliminating application and VM backup windows.
From an efficiency standpoint, Nutanix snapshots can be taken with byte-level resolution. This byte-incremental
implementation means that only the changed data is captured between successive snapshots. For even greater
efficiency, all the data stored on the Virtual Computing Platform including the snapshot can be compressed and
deduplicated. Although savings vary with the specific workload, typical deployments have seen
anywhere from a 25% to a 75% reduction in the amount of space
needed.
The VM-granular snapshots can be set to be either crash consistent or VM-consistent and can be scheduled on an
hourly, daily, weekly or monthly basis depending on the Recovery Point Objectives (RPO) and retention needs.
The choice between taking crash-consistent or VM-consistent snapshots should be based on recovery needs.
Crash-consistent snapshots are instantaneous and are sufficient for workloads able to recover from operating
system (OS) or VM crashes. Stateless applications such as web servers are best protected through crash-consistent
snapshots. The alternative, VM-consistent snapshots, take advantage of host frameworks and services such as
Microsoft Volume Shadow Copy Service (VSS) to quiesce the VM and supported applications, bringing them
into a known, consistent state. In the case of VMware running Microsoft Windows guests, VSS support is
provided by VMware Tools running in the guest OS. Using deep integration between the Nutanix Virtual
Computing Platform and VMware vSphere, VMware Tools is called to quiesce the OS and supported
applications such as Microsoft Exchange and SQL Server before the Virtual Computing Platform takes a
VM-consistent snapshot of the VM.
Additionally, multiple VMs can be grouped together in a Nutanix protection domain enabling them to be operated
upon as a single entity with the same RPO. This is useful when trying to protect complex applications such as
Microsoft SQL Server-based applications or Microsoft Exchange. The main advantage of using a protection
domain approach of grouping VMs versus the traditional SAN approach of consolidating different VMs on to a
single LUN is VM portability. VMs can be moved between different protection domains on a Nutanix Virtual
Computing Platform without the need for any data to be moved or copied. For traditional SANs, changing a VM’s
SLA will most likely require migrating the VM to another LUN or volume.
Because of the unique NDFS design leveraging a shared nothing distributed approach to metadata, there is no
upper limit to the number of snapshots that can be taken with the Nutanix Virtual Computing Platform. This
scalable approach eliminates the need for separate storage systems for backup and long term archiving, as the
VM snapshots are stored across the entire cluster that makes up Nutanix Virtual Computing Platform.
Keeping Data Optimized
Nutanix Virtual Computing Platform runs a distributed data management service in the background. The
MapReduce-based service called Curator is responsible for executing tasks such as metadata optimization,
garbage collection of deleted VMs, data reduction, tiering, consistency checking, and rebalancing to optimize
data across nodes and flash/disks with minimal impact to performance.
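The protection domain grouping described in this section can be pictured as a metadata-only structure. The sketch below is a hypothetical model, not the Nutanix data structure: the class ProtectionDomain, its fields, the function move_vm, and the VM names are all assumptions made for illustration.

```python
class ProtectionDomain:
    """Hypothetical model: a named group of VMs sharing one RPO."""

    def __init__(self, name, rpo_hours, vms=()):
        self.name = name
        self.rpo_hours = rpo_hours
        self.vms = set(vms)   # membership only; no VM data lives here


def move_vm(vm, source, target):
    """Reassign a VM to another domain; only membership metadata changes."""
    source.vms.remove(vm)
    target.vms.add(vm)


gold = ProtectionDomain("gold", rpo_hours=1, vms={"sql-01", "exch-01"})
silver = ProtectionDomain("silver", rpo_hours=24, vms={"web-01"})
move_vm("web-01", silver, gold)   # web-01 now inherits the 1-hour RPO
```

Because membership is just a set of VM names in this model, changing a VM's service level touches no stored data, which mirrors the portability argument made for protection domains over shared LUNs.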
Nutanix snapshot technology forms the basis of a unique set of functionality and ecosystem for high availability
and disaster-recovery. The first feature that builds on the Nutanix snapshot capability is VM-granular cloning.
Cloning can be used for a variety of reasons including deployment and recovery. Integration with the
virtualization stack with functionality such as VMware vStorage APIs for Array Integration (VAAI) and VMware
View Composer API for Array Integration (VCAI) enables administrators to simplify VM deployment using
integrated cloning. For the purpose of this document, the discussion will focus on recovering VMs.
The Virtual Computing Platform enables user-driven recovery of individual VMs from snapshots. This is done by
either replacing the existing active VM with the snapshot copy or by creating a separate clone of a snapshot
preserving the active VM. Depending on the snapshot's settings, the recovered VM will be either crash-consistent or
VM-consistent upon recovery.
If needed, administrators can create a clone of a Nutanix VM-granular snapshot for the purpose of recovering a
single file without taking up additional space. Compared to a traditional LUN/volume based approach, a VM-
granular snapshot approach eliminates the need for first recovering the storage object (LUN/volume) and then
identifying and mounting the VM, and recovering the file.
A single management console is used for managing storage, compute, backup and DR. From within Nutanix
Prism, administrators can set up Cloud Connect, back up workloads to a public cloud or a remote site, browse
protected items, perform quick recoveries, and change protection schedules. When a
VPC is used to connect to the public cloud, all of the nodes participate in replication, so replication does not
impact the running workloads.
Data that is sent across the WAN can be compressed, and the granularity of what is sent is at the byte level. If
32KB of data is changed, Nutanix will send 32KB of data. If only 4KB of data has changed, then only 4KB of data
is sent.
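The byte-level granularity described above amounts to sending only the runs of bytes that differ between two points in time. The following sketch shows one simple way to compute such runs; the function changed_ranges and the naive byte-by-byte comparison are assumptions for illustration and say nothing about how NDFS tracks changes internally.

```python
def changed_ranges(old, new):
    """Return (offset, data) runs where `new` differs from `old`.

    Assumes both images have the same length.
    """
    if len(old) != len(new):
        raise ValueError("this sketch assumes equal-length images")
    runs, start = [], None
    for i in range(len(new)):
        if old[i] != new[i]:
            if start is None:
                start = i                      # a changed run begins here
        elif start is not None:
            runs.append((start, new[start:i]))  # the run ended at i
            start = None
    if start is not None:
        runs.append((start, new[start:]))       # run extends to the end
    return runs


KB = 1024
base = bytes(64 * KB)
updated = bytearray(base)
updated[8 * KB:12 * KB] = b"\x01" * (4 * KB)   # change exactly 4 KB
delta = changed_ranges(base, bytes(updated))
payload = sum(len(chunk) for _, chunk in delta)
```

In this example the delta is a single run starting at offset 8 KB, and the payload to send is 4 KB rather than the full 64 KB image.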
Remote Replication
Nutanix VM-granular snapshots also make it possible to efficiently replicate individual virtual machines from a
primary Virtual Computing Platform to one or more secondary Nutanix clusters across different sites. By
supporting a fan-out and fan-in or multi-way model for replication, the Virtual Computing Platform can create
a flexible multi-master virtualization environment for backup and disaster recovery. Deployments supporting
numerous remote and branch offices can benefit from a flexible deployment model.
Since the software-defined replication functionality builds on VM-granular snapshots, policies for replication are
also set at the individual protection domain level rather than at the LUN/volume level. Only byte-level
changes between snapshots of individual VMs are sent over the network to the remote cluster. NDFS also allows
a host other than the one serving I/O on the active virtual disk to do the work of calculating
the changed blocks, eliminating performance bottlenecks for critical VMs and their corresponding hosts. As a
result, all nodes in the cluster can participate in replication.
To make the most out of WAN connectivity, the data can be deduplicated and compressed before it is sent across
the WAN. First, the fingerprints of changed blocks for individual VMs are sent from the primary system to the
different destinations. Each destination system reports back which unique blocks it still needs, and the primary
system then sends only those blocks. Deduplicating data sent to remote sites can cut
the bandwidth required by as much as 75% versus host-based full-copy backup solutions.
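The fingerprint exchange described above can be sketched in a few lines. This is a simplified model under stated assumptions: the use of SHA-1 as the fingerprint, the RemoteSite class, and the replicate function are hypothetical names chosen for illustration, and the real wire protocol is not described in this document.

```python
import hashlib


def fingerprint(block):
    # SHA-1 is an assumption here; the document does not name the hash used.
    return hashlib.sha1(block).hexdigest()


class RemoteSite:
    """Hypothetical destination that stores blocks keyed by fingerprint."""

    def __init__(self):
        self.blocks = {}

    def missing(self, fingerprints):
        # Step 2: report back only the fingerprints not already held.
        return [fp for fp in fingerprints if fp not in self.blocks]

    def receive(self, blocks):
        for block in blocks:
            self.blocks[fingerprint(block)] = block


def replicate(changed_blocks, remote):
    """Send fingerprints first, then only the blocks the remote asks for."""
    by_fp = {fingerprint(b): b for b in changed_blocks}  # step 1: fingerprints
    needed = remote.missing(list(by_fp))
    remote.receive([by_fp[fp] for fp in needed])          # step 3: unique blocks
    return len(needed)


remote = RemoteSite()
remote.receive([b"os-image-block"])   # already present at the destination
sent = replicate([b"os-image-block", b"new-db-page", b"new-db-page"], remote)
```

Of the three changed blocks in this example, only one crosses the WAN: one is already at the destination and one is a duplicate within the batch.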
Metro Availability
Metro Availability synchronously replicates data to another site, ensuring that a real-time copy of data exists at a
different location. In the event of a disaster or planned maintenance, virtual machines (VMs) can fail over from a
primary site to a secondary site, guaranteeing near 100% uptime for applications.
Metro Availability is a continuous availability solution that provides a global file system namespace across a
“stretched” container between Nutanix clusters. The stretched container is supported by synchronous storage
replication across independent Nutanix clusters across different sites. Synchronous replication is enabled at the
container level, and all virtual machines and files stored within that container are replicated synchronously to
another Nutanix cluster.
Containers have two primary roles while enabled for Metro Availability: Active and Standby. Active containers
replicate data synchronously to Standby containers. The Active and Standby containers are mounted to their
respective hypervisor hosts using the same datastore name, which effectively spans the datastore across both
clusters and sites. With a stretched datastore across both Nutanix clusters, a single hypervisor cluster can be
created and common clustering features, like VMware vMotion and VMware High Availability, can be used to
manage the environment.
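The defining property of synchronous replication between an Active and a Standby container is that a write is acknowledged only after both sites hold it. The sketch below models that ordering; the classes Container and StretchedContainer and their methods are hypothetical, and error handling in a real system would be considerably more involved.

```python
class Container:
    """Hypothetical container holding files at one site."""

    def __init__(self, site):
        self.site = site
        self.files = {}

    def persist(self, path, data):
        self.files[path] = data
        return True


class StretchedContainer:
    """Hypothetical pairing of an Active and a Standby container."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def write(self, path, data):
        # Synchronous replication: both sites must persist before the ack.
        ok_active = self.active.persist(path, data)
        ok_standby = self.standby.persist(path, data)
        if not (ok_active and ok_standby):
            raise IOError("write not acknowledged: a site failed to persist")
        return "ack"


site_a = Container("site-a")
site_b = Container("site-b")
datastore = StretchedContainer(active=site_a, standby=site_b)
result = datastore.write("vm1/disk.vmdk", b"data")
```

After the acknowledged write, both sites hold identical contents, which is what allows a VM to restart at the secondary site without data loss in this model.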
Metro Availability is supported in conjunction with existing Nutanix data management features including
compression, deduplication and tiering. Metro Availability also allows compression to be enabled for the
synchronous replication traffic between the Nutanix clusters. The compression of replication traffic is enabled
when creating the remote site configuration and will help reduce the total bandwidth required to maintain the
synchronous relationship.
With Metro Availability, hypervisor related high availability or clustering technologies typically used within
datacenters can now be leveraged across datacenters. This type of configuration is commonly referred to as a
stretched cluster and helps to minimize downtime during unplanned outages. Metro Availability also supports the
migration of virtual machines across sites using technologies such as vMotion. This enables zero downtime while
transitioning workloads between datacenters.
Setup and management are simple and intuitive, and are done from within the Prism UI. They can also be automated using
REST APIs in larger environments. The simplicity and ease of management are unparalleled, and for the first time
enterprises will have a modern consumer-grade management experience when it comes to disaster recovery and
high availability.
Additionally, with support for vStorage APIs for Data Protection (VADP) and application-level consistent
snapshots that leverage Volume Shadow Copy Service (VSS), Nutanix backup and DR capabilities fully integrate
with third-party tools, such as Symantec NetBackup, Commvault Simpana, and Veeam.
Figure 7: Nutanix Prism APIs and PowerShell cmdlets enable runbook automation for failover
Nutanix Prism APIs and PowerShell cmdlets can also be used to automate workflows using snapshots and
replication for backup and disaster recovery through scripting languages or workflow engines. The Prism APIs can
also be used to create an automated runbook for failover, automatically registering the VMs at the DR site in
VMware vCenter and powering them on. For example, a custom script that uses the Prism APIs can trigger the
Virtual Computing Platform to take and replicate a snapshot of the group of critical VMs making up an
order-entry system, based on the number of transactions being executed.
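The transaction-driven example above can be sketched as a small trigger. The code deliberately makes no Prism API calls: the snapshot action is passed in as a callable, and the names make_transaction_trigger, take_and_replicate_snapshot, and the threshold of 1000 are assumptions for illustration. In practice the callable would wrap whatever Prism REST request or PowerShell cmdlet the environment uses.

```python
def make_transaction_trigger(threshold, take_and_replicate_snapshot):
    """Return a recorder that fires the snapshot action every `threshold` transactions.

    `take_and_replicate_snapshot` is a hypothetical callable supplied by the
    caller; it stands in for the real API request.
    """
    pending = 0

    def record(transactions):
        nonlocal pending
        pending += transactions
        if pending >= threshold:
            take_and_replicate_snapshot()
            pending = 0          # start counting toward the next snapshot
            return True
        return False

    return record


calls = []
record = make_transaction_trigger(1000, lambda: calls.append("order-entry"))
results = [record(400), record(400), record(400)]
```

Here the third batch pushes the running count past the threshold, so the snapshot action fires once and the counter resets.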