IDC WP Data-Deduplication
Laura DuBois
May 2009
EXECUTIVE SUMMARY
The demand for data deduplication in both midsize and enterprise environments is
escalating as firms look for ways to keep pace with the near doubling of storage
growth annually. This growth is fueled by new applications, the proliferation of
virtualization, creation of electronic document stores and document sharing, use of
Web 2.0 technologies, and the retention or preservation of digital records. With
constrained IT budgets, the need to curb growth is heightened as firms look to reduce
capital and operating costs. From a physical perspective, many datacenter managers
are also dealing with limited infrastructure in terms of power, cooling, and floor space.
Deduplication is a technology that not only improves storage efficiency, and thus reduces cost, but also relieves pressure on physically constrained datacenters.
Driving down cost. Deduplication offers resource efficiency and cost savings that include reductions in datacenter power, cooling, and floor space demands as well as in storage capacity, network bandwidth, and IT staff time.
Table 1 outlines the myriad of backup challenges that exist and how deduplication
can address them. It also identifies the deduplication approach best suited to address
each challenge.
What Deduplication Is
Different from single-instance storage (SIS), which deduplicates data at the file or
object level, data deduplication is most often associated with subfile deduplication
processes. Subfile deduplication examines a file and breaks it up into "chunks."
These smaller chunks are then evaluated for the occurrence of redundant data
content across multiple systems and locations. Deduplication is also different from
compression, which reduces the footprint of a single object rather than across files or
pieces of a file. However, deduplicated data can also be compressed for further
space savings.
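To make the mechanics concrete, the sketch below shows a minimal subfile deduplication loop in Python. It is purely illustrative: the fixed 8KB chunk size, SHA-256 hashing, and in-memory chunk store are assumptions for the example, not a description of any shipping product.

import hashlib

CHUNK_SIZE = 8 * 1024                    # 8KB chunks; real engines tune or vary this

chunk_store = {}                         # hash -> chunk bytes (the deduplicated pool)

def dedupe_file(path):
    """Return the file's 'recipe': the ordered list of chunk hashes."""
    recipe = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in chunk_store:    # new, unique content
                chunk_store[digest] = chunk  # stored exactly once
            recipe.append(digest)            # a duplicate costs only a hash entry
    return recipe

def restore_file(recipe, out_path):
    """Reassemble the original file from its chunk recipe."""
    with open(out_path, "wb") as f:
        for digest in recipe:
            f.write(chunk_store[digest])

Restoring a file is simply a matter of replaying its recipe against the chunk store; a second copy of the same file adds only hash entries, not data.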
Backup data deduplication can occur at the source or target. An example of source-
side deduplication would be reducing the size of backup data at the client (e.g.,
Exchange or file server) so that only unique subfile data is sent across the network
during the backup process. An example of target-side deduplication would be
reducing the size of backup data after it crosses the network when it reaches a
deduplication appliance. Deduplication at the source provides network bandwidth,
backup window, and storage savings. Deduplication at the target provides storage
savings, works with existing backup software, and can reduce the network impact,
although it requires a hardware appliance at every location. Where deduplication is implemented not only determines the benefits realized but also affects implementation time and cost. Firms should evaluate their current backup problems and map these challenges to the different deduplication approaches (refer back to Table 1).
Source-side Deduplication
Performing deduplication at the source (client) provides an extended set of benefits
beyond capacity optimization. It also means significantly less data is sent, thus
relieving congested virtual/physical infrastructure and LAN/WAN links. Because only
new or changed subfile data segments are sent, the amount of data moved is
significantly reduced, enabling extremely fast daily full backups. The incremental
overhead on the client CPU to perform source deduplication can be up to 15%, but
the backup completes much faster than traditional methods. The overall impact of
source deduplication is actually much less than that of traditional agents over a
seven-day period. Environments with very large databases or databases with high
daily change rates may want to consider a target-side solution instead.
Target-side Deduplication
Performing deduplication at the target optimizes backup disk storage capacity since
only new, unique subfile data is stored to disk. However, redundant backup data is still sent to the deduplication target by traditional backup software, so the backup window itself is not shortened. A critical factor to consider with a target-side approach is whether the system can keep pace with backup window requirements and whether inline or post-process deduplication is warranted for a particular workload. (Refer to the following section for more on inline versus post-process deduplication.)
There are two different approaches available today for determining when the
deduplication process occurs: inline or post-process. Some suppliers are also working
on a third approach called hybrid or adaptive deduplication. Inline deduplication
eliminates redundant data before it is written to disk so that a disk staging area is not
needed. Post-process deduplication analyzes and reduces data after it has been
stored to disk, so it needs a full-capacity staging area upon which to start a
deduplication process. In selecting an approach, organizations need to weigh backup speed against disk capacity requirements.
An inline process is more capacity efficient, and there is no lag time for a
deduplication process to begin. For large-capacity environments with backup window
considerations, post-process deduplication gives precedence to completing the
backup but requires greater initial storage capacity. These approaches can mean a
trade-off in performance and capacity requirements. A third approach, still in the
developmental stages, is called hybrid or adaptive deduplication. This method of
deduplication gives precedence to an inline approach until a performance threshold is
reached and then automatically switches to a post-process approach, tuning the
method for the current workload in the environment. Some leading solutions offer
policy-based deduplication that allows for customer configuration of deduplication to
occur either immediately or on a schedule or to be disabled based on characteristics
of a data set. For example, smaller data sets and unstructured data can be set for
immediate deduplication and large backup jobs configured for post-processing.
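A minimal sketch of this kind of policy logic appears below. The thresholds, data-set attributes, and mode names are illustrative assumptions only; actual policy engines expose different controls.

from dataclasses import dataclass

INLINE_SIZE_LIMIT = 500 * 2**30          # assumed threshold: ~500GB

@dataclass
class BackupJob:
    name: str
    size_bytes: int
    structured: bool                     # e.g., large database dumps
    dedupes_well: bool                   # False for pre-compressed or encrypted data

def choose_dedupe_mode(job: BackupJob) -> str:
    if not job.dedupes_well:
        return "disabled"                # skip data types that will not reduce
    if job.structured or job.size_bytes > INLINE_SIZE_LIMIT:
        return "post-process"            # finish the backup first, deduplicate later
    return "inline"                      # small, unstructured data: deduplicate immediately

print(choose_dedupe_mode(BackupJob("file-shares", 200 * 2**30, False, True)))   # inline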
Another factor impacting the deduplication ratio is whether the deduplication engine can recognize a particular data format (a particular backup application, Microsoft Exchange data, etc.). Detecting the format of the data requires understanding where application-specific metadata is injected into a stream. The deduplication engine can then tune the chunk size to the natural boundaries of that format, potentially yielding greater deduplication.
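Variable-length, content-defined chunking is the general technique behind this kind of boundary tuning. The toy sketch below cuts a chunk wherever a rolling hash of the data hits a chosen pattern; the hash function and parameters are simplified assumptions, not a production algorithm.

WINDOW = 48                              # minimum bytes before a cut is allowed
MASK = 0x1FFF                            # cut when hash & MASK == 0 (~8KB average)

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF   # toy rolling hash
        if i - start + 1 >= WINDOW and (rolling & MASK) == 0:
            yield (start, i + 1)
            start, rolling = i + 1, 0
    if start < len(data):
        yield (start, len(data))

Because boundaries are derived from the data itself, an insertion early in a file shifts only the chunks around the edit rather than every chunk after it, which preserves deduplication ratios.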
FIGURE 1
Deduplication Adoption (n = 300): 31.7% of respondents have deployed software-based deduplication, and 18.0% are considering the technology in the next 12 months.
Source: IDC's Remote Branch Special Study, 2009
8. Seeding and migration. While deduplication is great for reducing the storage
and/or transmission of redundant data, it does require an initial baseline or first
backup to be established. For edge to core deduplication, users need to consider
how to create this baseline over bandwidth-constrained links. Most vendors offer
some form of seeding service to quickly create this baseline, either through a
bulk deduplication-aware replication process with systems side by side or by
using a series of tapes from a last full backup and restoring them locally into a
deduplication system. With storage refreshes occurring on a three- to five-year cycle, other considerations include how a migration is done and how disruptive it will be to an existing environment.
9. Vendor selection. Vendors make many claims and statements with regard to
their deduplication approach. IDC research shows that not all deduplication
products generally available actually work as advertised. Firms should consider
how long a particular deduplication-enabled product has been shipping.
EMC'S PORTFOLIO OF DEDUPLICATION-ENABLED SOLUTIONS
EMC offers a broad range of deduplication-enabled products to assist customers with
driving down IT costs and accelerating backup efficiency. Backup deduplication
solutions include EMC Avamar, which provides a source-side approach to
deduplication; EMC Disk Library, which offers a target-side approach to deduplication;
and EMC NetWorker, which can be deployed with either a source-side or target-side
approach, or both. Additionally, although not included in the scope of this paper, EMC
offers a deduplication solution for primary storage and backup data with its network-
attached storage EMC Celerra system and a disk archive deduplication solution with
its Centera product line.
EMC Avamar
The Avamar agent keeps track of files that are new or have changed. The agent does
not need to walk the entire file system tree to identify new or changed data and will
check local cache for those files first. Upon identification, the agent will break the new
or changed files into subfile variable-length data segments and assign a hash value
(unique ID) to each segment. The agent will then communicate with the Avamar
server to determine if the hash is unique or already exists. If the data segment is new,
it will be sent across the LAN/WAN during the daily full backup.
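This exchange can be pictured with the schematic below. It is a simplification for illustration only: the real Avamar protocol, caching behavior, and segment handling differ, and the class and function names are invented for the example.

import hashlib

class DedupeServer:
    """Stands in for the server-side hash index (greatly simplified)."""
    def __init__(self):
        self.segments = {}                         # hash -> segment bytes

    def missing(self, hashes):
        return [h for h in hashes if h not in self.segments]

    def store(self, segment):
        self.segments[hashlib.sha256(segment).hexdigest()] = segment

def source_side_backup(client_segments, server):
    hashes = [hashlib.sha256(s).hexdigest() for s in client_segments]
    needed = set(server.missing(hashes))           # one lightweight round-trip
    sent = 0
    for segment, h in zip(client_segments, hashes):
        if h in needed:                            # only unique data crosses the wire
            server.store(segment)
            needed.discard(h)
            sent += len(segment)
    return sent                                    # bytes actually transmitted

server = DedupeServer()
print(source_side_backup([b"segment-A", b"segment-B"], server))   # day 1: sends both
print(source_side_backup([b"segment-A", b"segment-C"], server))   # day 2: sends only C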
Avamar backup and recovery solutions provide source-side and global deduplication, making them ideal for firms that are:
Improving their remote branch office backups to gain fast, daily full backups;
centralized management; improved reliability; secure replication; and reduced
backup traffic over congested WAN links
Seeking to curb data growth, backup windows, and network traffic for backup of
local NAS and file server environments
Avamar software. For smaller remote offices, the Avamar software agent can be
deployed on the systems to be protected (clients) with no additional local
hardware required.
Avamar Data Store. This scalable, all-in-one solution includes Avamar software
preinstalled and preconfigured on EMC hardware for simplified ordering,
deployment, and service.
Avamar Virtual Edition for VMware. An industry first, this configuration enables
an Avamar server to be deployed as a virtual appliance on an existing ESX
Server, leveraging the attached resources and disk storage.
The Avamar grid uses a redundant array of independent nodes (RAIN) configuration
for built-in fault tolerance and high availability across the grid and eliminates single
points of failure. Avamar distributes its internal index across Avamar nodes for
reliability, load balancing, and scalability. Also, Avamar automatically verifies every day that backup data is fully recoverable, and the Avamar server checks its own integrity.
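One simple way to picture a distributed index of this kind is hash partitioning, sketched below. The node count and modulo placement are assumptions for illustration; the actual Avamar distribution scheme is not described here.

N_NODES = 4                                        # grid size is an arbitrary assumption

node_indexes = [dict() for _ in range(N_NODES)]    # per-node map: hash -> segment location

def owner_node(segment_hash: str) -> int:
    """Map a hex segment hash to the node that owns its index entry."""
    return int(segment_hash, 16) % N_NODES

def index_insert(segment_hash: str, location: str):
    node_indexes[owner_node(segment_hash)][segment_hash] = location

def index_lookup(segment_hash: str):
    return node_indexes[owner_node(segment_hash)].get(segment_hash)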
For VMware environments, Avamar offers several deployment options:
Avamar agent in guest OS. An Avamar agent inside each guest OS provides a
backup approach that is an order of magnitude more efficient than traditional
agent backup approaches. Lightweight Avamar agents reduce backup data at the
guest, reducing network requirements and contention for shared CPU, NIC, disk,
and memory resources. Because only new or unique subfile data is backed up,
Avamar enables fast daily full backups.
Avamar for VCB backup. An Avamar agent running on the VCB proxy server
backs up only unique data and offloads the processing for the guest machines.
Deduplication occurs within and across VMDK files and supports VCB file- and
image-level backup. Avamar's efficient replication enables VMDK files to be
quickly transferred across the WAN in support of disaster recovery objectives.
Avamar agent on ESX console. An Avamar agent on the ESX console can
deduplicate within and across VMDK files. This method provides an image-level
backup and restore option, without a dependency on VMware VCB or shared
storage. However, it does not provide for file-level restore.
EMC Disk Library
The EMC Disk Library (DL) family offers policy-based deduplication with its 1500,
3000, and 4000 series systems. The EMC Disk Library 1500 and 3000 provide LAN-
based backup to disk with deduplication included. The DL1500 is designed for
midsize customers that want improved performance, longer onsite retention, and
lower replication costs. The DL1500 begins at 4TB of usable capacity and expands to
36TB, with a sustained backup ingest rate of 0.72TB/hour with immediate data
deduplication — or up to 0.84TB/hour when the deduplication process is deferred.
The DL3000 begins at 8TB of usable capacity and expands to 148TB, with a
sustained backup ingest rate of 1.44TB/hour with immediate data deduplication. With
both the DL1500 and DL3000, policy-based deduplication is included with the system.
Unlike the DL1500 and DL3000 models, the DL4000 provides deduplication via an add-on hardware option for new and installed DL4000 Virtual Tape Library systems. Firms
can deploy it to reduce capacity requirements for backup to disk and reduce network
traffic for replication between datacenters.
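These ingest rates translate directly into backup window estimates. The quick calculation below uses the DL1500 figures quoted above; the 10TB nightly data set is an illustrative assumption.

nightly_backup_tb = 10.0                           # illustrative data set size

hours_inline   = nightly_backup_tb / 0.72          # DL1500, immediate deduplication
hours_deferred = nightly_backup_tb / 0.84          # DL1500, deferred deduplication

print(f"inline:   {hours_inline:.1f} hours")       # ~13.9 hours
print(f"deferred: {hours_deferred:.1f} hours")     # ~11.9 hours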
EMC Disk Library deduplication is ideal for datacenter, large storage volume, and
highly changing database environments looking to introduce disk for backup. Firms
using Disk Library deduplication are:
Seeking to curb large volume data growth for backup to existing EMC Virtual
Tape Library environments
Introducing both disk and deduplication into an existing EMC Disk Library
environment
Seeking to replace tape with disk for backup with little disruption to current
backup operations
The Disk Library systems use a target-side deduplication method. The same
deduplication capability works across the entire Disk Library family, providing block-
level, variable-length, hash-based deduplication at the target. The Disk Library deduplication uses "application sensing filters" that detect the format of the data stream and understand where application-specific metadata is injected into it. The filter places markers around this metadata and sifts it out for greater deduplication impact.
For data types that do not deduplicate well, the capability can be disabled. For
deduplication-enabled remote replication for disaster recovery purposes, replication
can be configured by system, application, directory level, or virtual tape cartridge. The
Disk Library deduplication index is clustered into branches, and similar objects are
grouped into buckets for efficient index lookups while minimizing disk I/O. Hardware
compression provides another level of storage optimization.
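The sketch below illustrates the general shape of such a branch-and-bucket index. The fan-out, hash widths, and grouping by hash prefix are assumptions for the example; EMC's actual grouping criteria are not detailed here. The idea is that related hashes land in the same bucket, so one disk read can answer a run of lookups.

index = {}                                         # (branch, bucket) -> {hash: disk_location}

def locate(segment_hash: int):
    """Look up a 64-bit segment hash with at most one bucket read."""
    branch = (segment_hash >> 56) & 0xFF           # top bits select a branch
    bucket = (segment_hash >> 48) & 0xFF           # next bits select a bucket
    return index.get((branch, bucket), {}).get(segment_hash)

def insert(segment_hash: int, disk_location: str):
    branch = (segment_hash >> 56) & 0xFF
    bucket = (segment_hash >> 48) & 0xFF
    index.setdefault((branch, bucket), {})[segment_hash] = disk_location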
EMC NetWorker
NetWorker-enabled deduplication is ideal for firms that are:
Seeking to curb large volume data growth for existing NetWorker environments
Deploying a new backup to disk strategy for improved recovery that still requires
the use of physical tape for archival or long-term needs
Introducing both disk and deduplication into an existing EMC Disk Library
environment
The deduplication approach within the NetWorker application has advanced the
market in terms of its integration of deduplication with a traditional backup application.
The NetWorker client software for both nondeduplicating and deduplication-aware
backups is a single agent. Source deduplication capabilities have been fully
integrated, minimizing deployment and maintenance effort. The NetWorker console can
manage and monitor both types of backups — traditional and deduplication. For
NetWorker customers that want the benefits of deduplication, there is no additional
client-side cost.
Unlike other offerings, NetWorker has no incremental software SKUs or pricing for
deduplication integration. NetWorker customers can add the appropriate
deduplication engine into the backup environment, either the Avamar or the EMC
Disk Library back-end solution. One of the benefits of using NetWorker-enabled
deduplication is the support for physical tape, ensuring that users who continue to
have a tape requirement can meet the need within the same application. Another
benefit of using deduplication within the backup application is the correct provisioning
and sequencing of encryption and compression. NetWorker gives firms the benefits of deduplication without disrupting their current backup environment.
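The sequencing point is worth making concrete: data must be deduplicated and compressed while still in plaintext form, because encrypted output looks random and neither dedupes nor compresses. The sketch below fixes that order; zlib and the single-byte XOR "cipher" are stand-ins for illustration, not what NetWorker actually uses.

import hashlib, zlib

def protect_segment(segment: bytes, seen: set, key: int):
    """Return the bytes to store, or None if the segment is a duplicate.

    key is assumed to be 0-255 for the toy XOR cipher below.
    """
    digest = hashlib.sha256(segment).digest()
    if digest in seen:                             # 1) deduplicate first, on plaintext
        return None
    seen.add(digest)
    compressed = zlib.compress(segment)            # 2) then compress
    return bytes(b ^ key for b in compressed)      # 3) encrypt last (toy XOR stand-in)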
CONCLUSION
Deduplication technology can accelerate backup efficiency and drive down IT costs.
Firms are deploying different types of deduplication-enabled solutions to address a
myriad of cost and operational challenges with the growing volume of backup data.
IDC finds that deduplication is a core, must-have feature for a variety of storage
solutions to address these challenges. EMC as a vendor is well-positioned to address
these long-standing problems, offering a range of solutions for a variety of
environments and use cases to meet customer demand for the technology over the
next five years.
External Publication of IDC Information and Data — Any IDC information that is to be
used in advertising, press releases, or promotional materials requires prior written
approval from the appropriate IDC Vice President or Country Manager. A draft of the
proposed document should accompany any such request. IDC reserves the right to
deny approval of external usage for any reason.