05 - Simpana® Deduplication
Simpana® v9 offers a variety of deduplication-based features that drastically change the way data protection is
conducted. Client Side Deduplication can greatly reduce network usage, DASH Full can significantly reduce the
time of synthetic full backups, and DASH Copy can greatly reduce the time it takes to copy backups to off-site
disk storage. Additionally, SILO storage can copy deduplicated data to tape while keeping it in its deduplicated state. This
chapter details how deduplication works and how to best configure and manage deduplicated storage.
Important! This section provides guidelines based on current CommVault best practices.
It is strongly recommended you check with CommVault for any updated guidelines and
best practices as they may have changed since the writing of this book.
In the following diagram, five blocks are being written to disk. Three of the blocks are
exactly the same. With deduplication, the result will be three blocks written to disk. The
two unique blocks will be written individually but the three duplicate blocks will only be
written once.
[Diagram: unique and duplicate data blocks being written to disk storage]
Deduplication Features
Simpana v9 software has a unique set of deduplication features that are not available with third party
deduplication solutions. If third party solutions are being used, most of the Simpana features will not be available.
Efficient use of storage media. Each Deduplication Storage Policy will deduplicate all data blocks
written to the policy. An optional global deduplication storage policy can be used to write blocks from
multiple storage policies through a single deduplicated policy, so that data blocks shared across those
policies are stored only once on disk storage.
Efficient use of network bandwidth. Client Side Deduplication can be used to deduplicate block data
before it leaves the client. This will greatly reduce network bandwidth requirements after the first
successful full backup is completed. From that point forward only changed blocks will be sent over the
network.
Significantly faster Synthetic Full operations. DASH Full is a read optimized Synthetic Full operation
which does not require traditional full backups to be performed. Once the first full is completed, changed
blocks are protected during incremental or differential backups. A DASH Full will run in place of a
traditional full or synthetic full. This operation does not require movement of data. It will simply update
indexing information and the deduplication database signifying that a full backup has been completed.
This will significantly reduce the time it takes to perform full backups.
Significantly faster auxiliary copy operations to disk storage. DASH Copy operations are optimized
auxiliary copy jobs that require only changed blocks to be sent to a second disk target. Once the initial
full auxiliary copy has been performed only changed blocks will be sent during auxiliary copy jobs. This
is an ideal solution for off-site copies to secondary disaster recovery facilities since it does not require
high bandwidth requirements.
Efficient use of tape media using SILO. SILO storage allows data to be copied to tape in its
deduplicated form. In other words, the data will not be rehydrated. During normal auxiliary copy
operations rehydration of the data is required, meaning the deduplicated data is read from the
disk, expanded back into its original form and then written to tape. A SILO operation is not an
auxiliary copy; it is a backup of the disk volume folders in the CommVault disk library. The SILO
operation simply copies the folders to tape and the data remains in its deduplicated form.
Resilient indexing and restorability. The deduplication database that is used to check signature hashes
for deduplication purposes is not required during restore operations. Instead the standard indexing
methodology is used. This includes using the index cache, an optional index cache server, and index files
written at the conclusion of the job. This resiliency ensures the restorability of deduplicated data even
under the worst case scenarios.
Many people look at deduplication as the most effective method of moving and storing data. Curiously
there is one part of the data movement process that is missing from the above features list. Though
deduplication can greatly improve data protection and data storage it will actually have a negative effect
on the restorability of the data. When data needs to be restored, the Media Agent will recreate all chunk
data and send it to the restore destination. Since deduplication leads to data fragmentation, reading the
data block from disk will be slower than if the data was not deduplicated.
A solution that the Simpana software has supported for many versions, and that improves backup
performance, reduces storage requirements and improves restore performance as well, is data archiving.
Archiving removes the data from the production storage leaving a stub file in its place. The stub is then
backed up during normal data protection jobs which greatly reduces backup windows and storage
requirements. If the system needs to be restored, the stubs are restored instead of the actual data. As
long as the path to the archived file is available when the user opens the stub, it will automatically recall
the file.
One emerging technology which is ideal for this type of solution is Cloud Storage. Archiving data into
the cloud moves the data off-site which provides for a sound disaster recovery solution and gives users
the ability to retrieve the data from the cloud by opening the stub file.
[Diagram: deduplication components - Storage Policy, data blocks, signature hash, Media Agent, Deduplication Database (DDB), disk storage, and an optional client side signature cache]
Storage Policy
All deduplication activity is centrally managed through a storage policy. Configuration settings are defined in the
policy, the location of the deduplication database is set through the policy, and the disk library which will be used
is also defined in the policy. Storage policy configurations will be covered in great detail later in this chapter.
The block size that will be used is determined in the Advanced tab of the Storage Policy Properties. CommVault
recommends using the default value of 128 KB, but the value can range from 32 KB to 512 KB. Higher block sizes
are recommended for large databases. Determining the best block size will be covered later in this chapter.
Deduplication can be configured for Storage Side Deduplication or Client (source) Side Deduplication.
Depending on how deduplication is configured, the process will work as follows:
1. Storage Side Deduplication. Once the signature hash is generated on the block, the block and the hash
are both sent to the Media Agent. The Media Agent, with a local or remotely hosted deduplication
database (DDB), will compare the hash against the database. If the hash does not exist, the block is
unique; it will be written to disk storage and the hash will be logged in the database. If the hash already
exists in the database, the block already exists on disk; the block and hash will be discarded but the
metadata of the data being protected will still be written to the disk library.
Storage Side Deduplication will send the hash and block to the Media Agent. The hash
is checked in the deduplication database to determine if the block is duplicate or unique.
2. Client Side Deduplication. Once the signature is generated on the block, only the hash will be sent to
the Media Agent. The Media Agent, with a local or remotely hosted deduplication database, will compare
the hash against the database. If the hash does not exist, the block is unique; the Media Agent will
request the block to be sent from the Client and will then write the data to disk. If the hash already
exists in the database, the block already exists on disk; the Media Agent will inform the Client to
discard the block and only metadata will be written to the disk library.
a. Client Side Disk Cache. An optional configuration for low bandwidth environments is the
client side disk cache, which maintains a local cache of signatures for deduplicated data. Each subclient
will maintain its own cache. The signature is first compared against the local cache. If the hash exists,
the block is discarded. If the hash does not exist in the local cache, it is sent to the Media Agent.
If the hash does not exist in the DDB, the Media Agent will request the block to be sent from the
Client. Both the local cache and the deduplication database will then be updated with the new
hash. If the block does exist, the Media Agent will request that the block be discarded.
Client Side Deduplication only sends the hash to the Media Agent to compare in the
dedupe database. This will greatly reduce network traffic when backing up redundant
data blocks. Once the first full backup is completed, only changed blocks will be sent
over the network.
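The flow described above can be illustrated with a short sketch. The following Python fragment is a minimal, hypothetical model of Client Side Deduplication (the class and function names, the MD5 signature, and the in-memory structures are assumptions for illustration, not CommVault code); Storage Side Deduplication follows the same comparison logic except that the block always travels to the Media Agent along with its hash.

    import hashlib

    BLOCK_SIZE = 128 * 1024  # 128 KB, the recommended default block factor

    class DedupeDatabase:
        """Stands in for the DDB hosted by (or near) the Media Agent."""
        def __init__(self):
            self.known_hashes = set()

        def is_duplicate(self, signature):
            return signature in self.known_hashes

        def record(self, signature):
            self.known_hashes.add(signature)

    def client_side_backup(blocks, ddb, disk_library, local_cache=None):
        """Send only unique blocks over the 'network' to the Media Agent."""
        for block in blocks:
            signature = hashlib.md5(block).hexdigest()  # signature hash per block
            # Optional client side disk cache: duplicates resolved on the client
            # without any network round trip at all.
            if local_cache is not None and signature in local_cache:
                continue
            if ddb.is_duplicate(signature):
                # Hash already in the DDB: the Media Agent tells the client to
                # discard the block; only metadata reaches the disk library.
                pass
            else:
                # Unique block: the Media Agent requests it and writes it to disk.
                ddb.record(signature)
                disk_library.append(block)
            if local_cache is not None:
                local_cache.add(signature)

    # Example: the second backup of identical data sends nothing new to disk.
    ddb, library, cache = DedupeDatabase(), [], set()
    data = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE]
    client_side_backup(data, ddb, library, cache)
    client_side_backup(data, ddb, library, cache)
    print(len(library))  # 2 unique blocks stored, duplicates discarded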
Components of Deduplication
Careful planning must be done before setting up deduplication in an environment. Poor configuration can lead to
scalability issues which may force a redesign of the environment, which in turn can result in loss of deduplicated
storage.
Example: Consider an initial design strategy using a 32 KB block size configured in the storage policy and a two
year retention policy for data. The deduplication database, which contains all signature hashes, can maintain up to
750 million records. The smaller block size results in more unique blocks and therefore a larger database. You
realize you are approaching the upper size limit of the database so you change the block size to 128 KB. 32 KB
blocks cannot be deduplicated against 128 KB blocks, so all the 32 KB blocks will remain on disk based on the two year
retention. The 128 KB blocks will also be written to disk with a two year retention, resulting in duplicate data within
the disk library.
This section will break down all the critical configuration concerns and best practices for an effective design
strategy.
Important! This section provides guidelines based on current CommVault best practices.
It is strongly recommended you check with CommVault for any updated guidelines and
best practices as they may have changed since the writing of this book.
Deduplication Database
The deduplication database is the primary component of Simpana's deduplication process. It maintains all
signature hash records for a deduplicated storage policy. Each storage policy will have its own deduplication
database. Optionally, a global deduplication storage policy can be used to link multiple storage policies to a single
deduplication database by associating storage policy copies to a global deduplication storage policy.
The deduplication database currently can scale from 500 to 750 million records. This equates to up to 90 Terabytes
of data stored within the disk library and up to 900 Terabytes of protected production data. It is important to
note that the 900 TB is not source size but the amount of data that is backed up over time. For example, if 200 TB
of data is being protected and retained for 28 days using weekly full and daily incremental backups, the total
amount of protected data would be 800 TB (200 TB per cycle multiplied by 4 cycles, since a full is being
performed every seven days). These estimations are based on a 128 KB block size and may be higher or lower
depending on the number of unique blocks and the deduplication ratio being attained.
A deduplication database can handle up to 50 concurrent connections with up to 10 active threads at any given
time. The database structure uses a primary and secondary table. Unique blocks are committed to the primary
table and use 152 bytes per entry. Duplicate entries are registered in the secondary table and use 48 bytes per
entry.
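The sizing figures above lend themselves to simple back-of-the-envelope arithmetic. The following sketch uses the numbers quoted above (152 bytes per primary entry, 48 bytes per secondary entry, roughly 750 million records per database); the helper names and the duplicate-entry count in the example are assumptions for illustration only.

    # Back-of-the-envelope sizing based on the figures quoted above:
    # 152 bytes per primary (unique) entry and 48 bytes per secondary (duplicate)
    # entry, roughly 750 million records per deduplication database.
    def protected_data_tb(per_cycle_tb, cycles_retained):
        # e.g. 200 TB per cycle retained for 4 weekly cycles -> 800 TB protected
        return per_cycle_tb * cycles_retained

    def unique_blocks_for_store(store_size_tb, block_size_kb=128):
        # Unique 128 KB blocks needed to fill a store of the given size.
        return int(store_size_tb * (1024 ** 3) / block_size_kb)

    def ddb_size_gb(unique_entries, duplicate_entries):
        return (unique_entries * 152 + duplicate_entries * 48) / (1024 ** 3)

    print(protected_data_tb(200, 4))    # 800 TB of protected data
    print(unique_blocks_for_store(90))  # ~755 million unique blocks in a 90 TB store
    print(round(ddb_size_gb(750_000_000, 2_000_000_000), 1))  # ~195.6 GB (duplicate count is illustrative)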
The deduplication database is initially configured during the storage policy creation. It
can be specified for secondary copies or moved by going to the policy copy properties,
Deduplication tab, Store Information.
A Media Agent can also host multiple deduplication databases. It is recommended that a single Media Agent
not have more than two databases open at any given point in time. If a Media Agent has more than two
databases, consider staggering the schedules of protection operations, if possible, for the best performance.
Point in time backups are configured in the Advanced tab of the Deduplication settings. The Store Availability
Option allows you to set recovery points for the backup of the deduplication database. The recommended setting
is to create recovery points every 24 hours. If recovery points are not created and a dedupe database becomes
corrupt the store will be sealed and a new store and database will be created. This would result in all data blocks
being resent to the disk library which will have a negative impact on network performance and deduplication
ratio.
Prior to Simpana v9 SP4 the method to protect the deduplication database was to create
automatic recovery points. The default value for recovery point creation is eight hours.
If this method is going to be used it is recommended to set the recovery points to 24
hours. This method for protecting the deduplication database will require data
protection jobs to pause for the entire time of the database backup process.
By default the database backups are written to the same location where the dedupe database resides. It is
strongly recommended that you change this location. The location can be set with the following registry key:
SIDBBackupPath. Consult CommVault's online documentation for more information and the latest
guidance on configuring this method for protecting the dedupe database.
Using the File System iDataAgent to protect the deduplication database allows the periodic backup of the
database without requiring jobs to pause for the duration of the backup. This is done by configuring a Read Only
File System iDataAgent on the Media Agent hosting the dedupe database. A special subclient is then added with
the DDB Subclient option selected. When using the point in time recovery points option, jobs will be paused for
the entire time it takes to protect the database. Using the File System iDataAgent, jobs will only pause until VSS
properly quiesces the database, which should take no more than two minutes. Due to this factor it is recommended
to use this method to protect the database.
Deduplication Store
Each storage policy copy configured with a deduplication database will have its own deduplication store. Quite
simply a deduplication store is a group of folders used to write deduplicated data to disk. Each store will be
completely self-contained. Data blocks from one store cannot be written to another store and data blocks in one
store cannot be referenced from a different deduplication database for another store. This means that the more
independent deduplication storage policies you have, the more duplicate data will exist in disk storage.
Example: If you had three storage policies each with their own deduplication database
and database store you would have three databases and three sets of folders. This
results in duplicate blocks existing in each of the three stores which will reduce
deduplication efficiency. In some situations this may be desirable being that dissimilar
data types such as file system and database data may not deduplicate well against each
other.
[Diagram: Storage Policy A, Storage Policy B and Storage Policy C, each writing its own data blocks to a separate deduplication store]
Performance of the deduplication database is the primary factor in store sealing. If block lookups take too long,
data protection jobs will slow substantially. Sealing the store will start a new deduplication database, which will
result in faster data protection jobs but will diminish deduplication ratios. One reason block lookups become slow is
that the database has grown too large. If the environment is designed and scaled appropriately this should not be a
problem. Another reason is that the deduplication database is being stored on slow disks or accessed using inefficient
protocols such as NFS, CIFS or iSCSI. CommVault recommends using high speed dedicated disks directly
attached to the Media Agent. Designing a scalable deduplication solution will be discussed later in this chapter.
For organizations with large amounts of data protected through a single storage policy and deduplication store it
may become necessary to seal the store. Since there is an upper limit to the deduplication database, sealing the
store may occasionally be required. This will depend on how long the data is being retained, since blocks
pruned from disk will also have their signature records pruned from the deduplication database. If large amounts of
data need to be managed and the defined retention rules will require more than 750 million records, the best
solution to prevent the sealing of stores would be to use multiple deduplicated storage policies. Each policy will
have its own store, resulting in multiple smaller deduplication databases.
Another reason to seal a store is when using SILO storage to tape. A store can be sealed when it grows to a
certain size, after a defined number of days, or through time intervals such as once a quarter. If you wanted to
have a set of tapes representing a fiscal quarter, then sealing the store every three months would result in all
volume folders being sealed and placed in SILO storage isolating the data based on the fiscal quarter.
Sealing stores and SILO storage will be discussed in the Deduplication Strategies section of this chapter.
The Deduplication store can be sealed and a new one created by any one of the following criteria:
Create a new store every n number of days.
Create a new store every n number of terabytes.
Create a new store every n number of months from a specific starting date.
Deduplication Store Settings are configured in the Store Information tab in the
Deduplication settings within the storage policy copy.
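A minimal sketch of how those three criteria could be evaluated is shown below; the function and parameter names are assumptions for illustration and do not correspond to CommVault setting names.

    from datetime import date

    def should_seal(store_created, store_size_tb, today,
                    max_days=None, max_tb=None,
                    month_interval=None, interval_start=None):
        """Return True when any of the three sealing criteria above is met."""
        if max_days is not None and (today - store_created).days >= max_days:
            return True   # create a new store every n number of days
        if max_tb is not None and store_size_tb >= max_tb:
            return True   # create a new store every n number of terabytes
        if month_interval is not None and interval_start is not None:
            months = (today.year - interval_start.year) * 12 + (today.month - interval_start.month)
            if months and months % month_interval == 0:
                return True   # every n months from a specific starting date
        return False

    # Example: seal quarterly so each fiscal quarter lands on its own SILO tapes.
    print(should_seal(date(2011, 1, 1), 40, date(2011, 4, 1),
                      month_interval=3, interval_start=date(2011, 1, 1)))  # True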
It is important to note that as each file is read into memory the 128 KB buffer is reset. Files will not be combined
to meet the 128 KB buffer size requirement. This is a big advantage in achieving dedupe efficiency. Consider the
same exact file on 10 different servers. If we always tried to fill the 128 KB buffer each machine would use
different data and the hashes would always be different. By resetting the buffer with each file, each of the 10
machines would generate the same hash for the file.
Appliance based deduplication devices are not content aware since the backup software writes data in large
chunks. This is why they need to use considerably smaller block sizes and realign blocks as they are written to the
appliance. By using these methods, they can achieve comparable deduplication ratios to the Simpana method, but
the overhead and cost is significantly higher.
In the following diagram two files are being deduplicated with a 128 KB block size. The
first file has three 128 KB segments and a trailing segment of 32 KB. The first three
segments are hashed at 128 KB and the last segment is hashed at 32 KB. The buffer is
then reset for the next file to be read into memory. This will align all the blocks of the
new file so signature hashes will always be consistent.
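A small sketch can make the buffer-reset behavior concrete. The following Python fragment hashes each file independently in 128 KB segments and contrasts it with a hypothetical packed-stream approach; the function names and the use of MD5 are assumptions for illustration only.

    import hashlib

    BLOCK_SIZE = 128 * 1024  # 128 KB block factor

    def file_signatures(data):
        """Hash one file in 128 KB segments; the buffer is reset at the file
        boundary, so a trailing segment (here 32 KB) is hashed on its own."""
        return [hashlib.md5(data[i:i + BLOCK_SIZE]).hexdigest()
                for i in range(0, len(data), BLOCK_SIZE)]

    def stream_signatures(files):
        """Hypothetical non-content-aware approach: pack the files into one
        stream before chunking, so block boundaries depend on preceding data."""
        return file_signatures(b"".join(files))

    shared_file = b"x" * (3 * BLOCK_SIZE + 32 * 1024)  # three 128 KB segments plus a 32 KB tail
    server1 = [b"a" * 1000, shared_file]
    server2 = [b"b" * 5000, shared_file]

    # Per-file hashing: both servers produce identical signatures for the shared
    # file, so its blocks deduplicate against each other.
    print(len(file_signatures(shared_file)))  # 4 segments: three at 128 KB, one at 32 KB
    # Packed-stream hashing: the shared file is shifted by whatever precedes it,
    # so the two servers generate different signatures for the same content.
    print(stream_signatures(server1) == stream_signatures(server2))  # False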
The block size recommendations depend on the data type that is being protected and the size of the data. The
current recommended setting for file and virtual machine data is 128 KB block size. This provides the best
balance for deduplication ratio, performance and scalability. Though the block size can be set as low as 32 KB,
deduplication efficiency only improves marginally and is therefore not recommended.
For databases the recommended block size is from 128 KB to 512 KB depending on the size. For large database
servers such as Oracle, which may perform application-level compression, deduplication ratios may be
compromised. It is strongly recommended to consult with CommVault Professional Services in designing a
proper protection strategy for large databases.
For large data stores, especially media repositories, consider setting a higher block size (256 KB or greater). For
media types the potential for duplicate data blocks will be minimal; the deduplication savings will be noticed when
backing up the same data over time. Using a higher block size also allows more data to be stored with a smaller
deduplication database.
Setting the Deduplication block factor in the Advanced tab of the storage policy
properties.
Consider the size of the deduplication database as well as the deduplication store when factoring block size. As a
general guideline, setting the block size to 64 KB results in half the sizing capability of 128 KB, while setting it to
256 KB yields twice the capacity. Since CommVault deduplication is content aware, a smaller block size may or
may not give you a deduplication advantage, but it will definitely limit the scale of how much data can be stored.
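As a rough illustration of this guideline, the following sketch scales the maximum store size linearly with the block factor for a fixed record limit; the 750 million record figure is the one quoted earlier in this chapter and the function name is an assumption.

    RECORD_LIMIT = 750_000_000  # upper record limit quoted for one deduplication database

    def max_store_tb(block_size_kb, record_limit=RECORD_LIMIT):
        return record_limit * block_size_kb / (1024 ** 3)  # KB -> TB

    for kb in (32, 64, 128, 256, 512):
        print(f"{kb:>3} KB block factor -> roughly {max_store_tb(kb):.0f} TB per store")
    # 64 KB yields about half the capacity of 128 KB (~90 TB); 256 KB roughly doubles it.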
Why 128KB?
Many competitors use significantly smaller block sizes, as low as 8 KB. The reason for this is
simple… a better deduplication ratio... Well, sort of. The truth is the ratio will be about the same,
since the appliance is not content aware whereas the CommVault software is. So why does
CommVault recommend the higher block size? Competitors who usually sell appliance based
deduplication solutions with steep price tags use a lower block size, which actually results in only a
marginal gain in space savings considering most data in modern datacenters is quite large.
Unfortunately there are some severe disadvantages to this. First off, records for those blocks must be
maintained in a database. A smaller block size results in a much larger database, which limits the size
of disks that the appliance can support. CommVault software can scale significantly higher, up to 90
Terabytes per database.
Even more important is the aspect of fragmentation. The nature of deduplication and referencing
blocks in different areas of a disk leads to data fragmentation. This can significantly decrease restore
and auxiliary copy performance. The higher block size recommended by CommVault makes restores
and copying data much faster.
The main aspect is price and scalability. With relatively inexpensive enterprise class disk arrays you
can save significant money over dedicated deduplication appliances. If you start running out of disk
space, just add more space to the existing library and deduplication will be preserved.
Considering advanced deduplication features such as DASH Full, DASH Copy, and SILO tape
storage, the Simpana deduplication solution is a powerful tool.
Prior to SP4 the block will still be hashed down to a minimum size of 16 KB. This number can be further reduced
down to 4 KB using the SignatureMinFallbackDataSize registry key. This key should be added to any Client or
Media Agent performing signature generation. Check with Online Documentation for complete instructions and
current best practices for configuring this value and how to deploy this key to Clients and Media Agents.
In the Settings tab of Deduplication configuration in the storage policy copy the option Do not deduplicate
against objects older than can be configured to limit the life of a block on disk. This can be used to periodically
refresh data blocks on the deduplicated disk storage. This will decrease the potential for bad blocks affecting
restorability of data. Prior to Simpana v9 SP3 this option was enabled and configured for 365 days. As of SP3 this
option is disabled by default.
Careful consideration should be taken in configuring this option. By enabling this setting, periodically when you
refresh blocks on disk the deduplication ratio will suffer significantly until the old blocks are pruned. If you are
performing Client Side Deduplication, setting this option will periodically require ALL blocks that have
previously been deduplicated to be retransmitted over the network. The best solution in this case is to ensure you
have high quality enterprise class disks and ALWAYS make additional copies of data to other disk and/or tape
locations. If you are concerned with the health of data blocks on disk you can enable the Do not deduplicate
against objects older than option with the understanding that all blocks will need to be re-sent at the specific time
interval you specify.
For Oracle databases advanced table compression is available which may result in dissimilar hashes being
generated each time the database is backed up. This can negate deduplication completely. Careful consideration
should be given to which compression methods should be used. Though the Oracle compression is extremely
efficient it may not always be the best solution when using deduplicated storage. CommVault strongly
recommends consulting with professional services when deploying CommVault software to protect large Oracle
databases.
If encryption is going to be used with deduplication a dedicated storage policy MUST be used for encrypted data.
Mixing encrypted and non-encrypted data will result in data not being able to be restored. This is due to the fact
that an unencrypted file referencing an encrypted block will not be able to access the encrypted block.
In the following illustration dedicated storage policies, Media Agents, and disk libraries
are used to protect large amounts of data.
For large databases and other data types such as media files, higher block sizes (256k +) can be set to allow for
higher scalability. The following diagram illustrates three storage policies:
The following diagram illustrates different data types using separate storage policies to
protect and manage deduplicated data. Note that object level protection for SharePoint
and Exchange is using the same storage policy as the file system backups. Since there
will be similar data at the object level a better deduplication ratio can be attained. It is
also common to have similar retention requirements with these data types so combining
them into the same storage policy makes sense from deduplication and management
aspects.
Example: Three storage policies are required to manage specific data based on independent retention policies.
The three storage policy primary copies can be linked to a global deduplication database so that data blocks shared
across the three policies are stored only once. Each block will remain in the deduplication store until the longest
retention for that block has been reached.
The following diagram illustrates a global deduplication storage policy linking primary
copies from three different storage policies into a single store. Each primary copy can
have separate retention and manage different subclients but they will all share the same
data path and deduplication store.
Global Deduplication storage policies are recommended in the following situations:
Data that exists in separate physical locations and is being consolidated into a single location.
Like data types that deduplicate well against each other but have different retention requirements.
Global Deduplication storage policies are not recommended in the following situations:
Using multiple Media Agents where the deduplication database is centrally located on one of them, requiring
network communication for the other Media Agents to compare signatures. Note: for small environments
this deployment method could be used but will degrade performance.
When backing up large amounts of data since a single database can only scale to 750 million entries. In
this case multiple dedicated storage policies are recommended.
It is important to note that associating or not associating a storage policy copy with a global deduplication policy
can only be done at the creation of the policy copy. Once the copy is created it will either be part of a global
policy or it won't. By using a global dedupe policy for the initial storage policy primary copy that will protect
data, any additional policies that are later required can also be linked to the global dedupe policy. Using this method
will result in better deduplication ratios and provide more flexibility for defining retention policies or
consolidating remote location data to a central policy (which will be discussed next). The main caveat when using
this method is to ensure that your deduplication infrastructure will be able to scale as your protection needs grow.
In the following illustration three remote sites are locally performing backups to disk.
The data is being copied to the main data center using a global dedupe policy associated
with the secondary copy.
Global Deduplication for small data size with different retention needs
For small environments that do not contain a large amount of data but different retention settings are required,
multiple storage policy Primary Copies can be associated with a global deduplication storage policy. This should
be used for small environments with the data path defined through a single Media Agent.
SILO Storage
Consider all the data that is protected within one fiscal quarter within an organization. Traditionally a quarter end
backup would be preserved for long term retention. Let's assume that quarter end backup of all data requires 10
LTO 5 tapes. Unfortunately with this strategy the only data that could be recovered would be what existed at the
time of the quarter end backup. Anything deleted prior to the backup within the specific quarter would be
unrecoverable unless it existed in a prior quarter end backup. This results in a single point in time that data can be
recovered. Now let's consider those same 10 tapes containing every backup that existed within the entire quarter.
Now any point in time within the entire quarter can be recovered. That is what SILO storage can do.
SILO storage allows deduplicated data to be copied to tape without rehydrating the data. This means the same
deduplication ratio that is achieved on disk can also be achieved to tape. As data on disk storage gets older the
data can be pruned to make space available for new data. This allows disk retention to be extended out for very
long periods of time by moving older data to tape.
The following diagram illustrates full volume folders being copied to SILO storage.
Active volumes will not be placed in the SILO storage until they are marked full.
By copying volume folders to tape, space can be reclaimed on disk for new data to be written. This does require
some careful planning and configuration.
When a storage policy deduplication copy is enabled for SILO storage a SILO backup
set will be created on the CommServe server. This will be used to schedule and copy
folders that qualify for SILO storage.
Hardware encryption can also be used for LTO4 and LTO5 drives that support encryption. Enabling hardware
encryption is configured in the SILO data path properties in the storage policy Copy.
Note: This section assumes a basic understanding of backup sets, subclients, and encryption configuration. If you
are unfamiliar with these concepts it is strongly recommended that you attend a CommVault Administration
instructor led training course.
It is important to note at this point that SILO storage is less of a disaster recovery solution and more of a data
preservation solution. From the original release of the SILO feature in Simpana v8, it has received some negative
feedback. One reason is that our competitors placed a lot of negative spin on this feature since they had
no comparable solution. The other is a misunderstanding of Service Level Agreements. SLA policies usually specify
that the older data gets, the longer the time to recover will be. SILO storage is not an option to recover data from
last week; it is a feature to recover data from last year or five years ago. Understanding this concept places SILO
storage into proper perspective. This feature is for long term preservation of data, allowing point in time
restores within a time period with considerably less storage than traditional tape storage methods.
The SILO recovery process works as follows:
1. The CommVault administrator performs a browse operation to restore a folder from eight months ago.
2. If the volume folders are still on disk the recovery operation will proceed normally.
3. If the volume folders are not on disk the recovery operation will go into a waiting state.
4. A SILO recovery operation will start and all volume folders required for the restore will be staged back
to the disk library.
5. Once all volume folders have been staged, the recovery operation will run.
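The staging logic in the steps above can be sketched as follows; the function and variable names are assumptions for illustration only.

    def restore_from_silo(required_volumes, disk_library, silo_tapes, destination):
        """Stage any missing volume folders from SILO tape, then run the restore."""
        missing = [v for v in required_volumes if v not in disk_library]
        if missing:
            # The recovery operation waits while a SILO restore stages the
            # volume folders back to the disk library.
            for volume in missing:
                disk_library[volume] = silo_tapes[volume]
        # Once every required volume folder is on disk, the restore proceeds normally.
        for volume in required_volumes:
            destination.append(disk_library[volume])
        return destination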
To ensure adequate space for SILO staging operations a disk library mount path can optionally be dedicated to
SILO restore operations. To do this, in the Mount Path Properties General tab select the option Reserve space
for SILO restores.
The procedure is straightforward, and as long as the SILO tapes are available the recovery operation is fully
automated and requires no special intervention by the CommVault administrator.
Along with configuring Client Side Deduplication in the Client Properties, a Client Side Disk Cache can be
created. Each subclient will contain its own disk cache which will hold signatures for data blocks related to the
subclient. The default cache size is 4 GB and it can be increased up to 32 GB. The Client Side Disk Cache is
recommended for slow networks such as WAN backups. For networks that are 1 Gbps or faster, using this
option will not improve backup performance.
Another Client option is Enable Variable Content Alignment. Enabling this option will read block data and
align the blocks to correspond to prior data blocks that have been deduplicated. By aligning the content prior to
performing the hash process, better deduplication ratios may be attained. This will however require more
processing power on the Client. Since Simpana deduplication is content aware, enabling this option will not
provide better deduplication for average file data. This option is only recommended for large file system data
such as database dumps or PST files with low incremental rates of change.
DASH Full
A DASH Full backup is a read optimized synthetic full backup job. A traditional synthetic full backup is designed
to synthesize a full backup by using data from prior backup jobs to generate a new full backup. This method will
not move any data from the production server. Traditionally the synthetic full would read the data back to the
Media Agent and then write the data to new locations on the disk library. With deduplication, when the data is
read to the Media Agent during a synthetic full, signatures will be generated and compared in the deduplication
database. Since the block was just read from the library, there would always be a signature match in the
DDB and the data blocks would be discarded. To avoid the read operation altogether, a DASH Full can be used
in place of a traditional synthetic full.
A DASH Full operation will simply update the index files and deduplication database to signify that a full backup
has been performed. No data blocks are actually read from the disk library back to the Media Agent. Once the
DASH Full is complete a new cycle will begin. This DASH Full acts like a normal full and any older cycles
eligible for pruning can be deleted during the next data aging operation.
The option to enable DASH Full operations is configured in the Advanced tab in the
Deduplication section of the Storage Policy Primary Copy.
Once this option is enabled, schedule data protection jobs to use Synthetic Full backups. Depending on the
configuration in the storage policy settings, either a traditional synthetic full or a DASH Full will be used.
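Conceptually, a DASH Full only adds references and index entries for blocks that already exist in the store. The following sketch models that idea; the data structures and names are assumptions for illustration, not the actual DASH Full implementation.

    def dash_full(previous_jobs, ddb_references, index):
        """Build a new 'full' from the last full plus subsequent incrementals
        without reading or moving any data blocks."""
        new_full = []
        for job in previous_jobs:
            for signature in job["block_signatures"]:
                # Add a reference to a block that is already in the store.
                ddb_references[signature] = ddb_references.get(signature, 0) + 1
                new_full.append(signature)  # index entry points at the existing block
        index.append({"type": "synthetic_full", "blocks": new_full})
        return index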
DASH Copy
A DASH Copy is an optimized auxiliary copy operation which only transmits unique blocks from the source
library to the destination library. It can be thought of as an intelligent replication which is ideal for consolidating
data from remote sites to a central data center and backups to DR sites. It has several advantages over traditional
replication methods:
DASH Copies are auxiliary copy operations so they can be scheduled to run at optimal time periods
when network bandwidth is readily available. Traditional replication would replicate data blocks as they
arrive at the source.
Not all data on the source disk needs to be copied to the target disk. Using the subclient associations of
the secondary copy, only the data required to be copied would be selected. Traditional replication would
require all data on the source to be replicated to the destination.
Different retention values can be set to each copy. Traditional replication would use the same retention
settings for both the source and target.
DASH Copy is more resilient in that if the source disk data becomes corrupt the target is still aware of
all data blocks existing on the disk. This means after the source disk is repopulated with data blocks,
duplicate blocks will not be sent to the target, only changed blocks. Traditional replication would require
the entire replication process to start over if the source data became corrupt.
DASH Copy is similar to Client Side Deduplication except that with DASH Copy, both the source and the
destination are Media Agents. This is why Client Side Deduplication and DASH Copy operations are sometimes
referred to as Source Side Deduplication. Once the initial full auxiliary copy is performed, only changed blocks
will be transmitted from that point forward.
DASH Copy has two additional options: Disk Read Optimized Copy and Network Optimized Copy. Again, this
is similar to the Client Side configuration. Disk Read Optimized will transmit the signature hash to the target Media
Agent, where it will be compared against the DDB to determine if the block needs to be sent. Network Optimized
will use a cache on the source Media Agent to compare the signature and determine if the hash exists, resulting in
less network traffic.
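The difference between the two options can be sketched as a simple decision flow; the names below are assumptions for illustration and not CommVault APIs.

    import hashlib

    def dash_copy(source_blocks, destination_ddb, destination_library,
                  network_optimized=False, source_cache=None):
        """Copy only blocks the destination does not already have."""
        for block in source_blocks:
            signature = hashlib.md5(block).hexdigest()
            if network_optimized and source_cache is not None and signature in source_cache:
                continue   # resolved on the source Media Agent; no hash or block sent
            if signature not in destination_ddb:
                # Disk Read Optimized: the hash is checked at the target; only
                # unique blocks cross the network link.
                destination_ddb.add(signature)
                destination_library.append(block)
            if source_cache is not None:
                source_cache.add(signature)
        return destination_library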
Example 1: A large media repository used to frequently edit and recompile videos requires protection. Once the
files are finalized they are written to a separate repository and deleted from the source production location. In this
case there are two primary issues:
1. Media files and other binary based data types do not deduplicate well. The savings for this type of data are
seen when performing full backups over time. Since the files will be deleted from the production
location once they are finalized, subsequent full backups will not provide much disk space savings.
2. Since the media files are being edited the binary data blocks will be constantly changing. This may
greatly reduce the space savings when subsequent full backups of the same data are performed.
In this scenario the processing of data blocks to generate signatures probably will not be worth the deduplication
results. Backing up the data to disk or tape for short term disaster recovery or using hardware based snapshots and
Simpana SnapProtect feature would be a better solution in this case.
Example 2: A database is being regularly protected by performing nightly full backups and transaction log
backups every 15 minutes. Though the database should deduplicate well since it is being protected nightly, the
transaction logs will not. In this scenario making use of a separate log storage policy with a non-deduplicated disk
target would provide better backup and recovery performance. For Microsoft SQL iDataAgents a separate log
storage policy can be configured in the SQL subclient. For other database types an Incremental Storage Policy
can be used. Log storage policies are discussed in more detail in the Additional Storage Policies chapter.
It should be noted that the memory requirements for the Media Agent are due to the deduplication processes
requiring significant memory and not the database itself. Though the database may grow to 100+ GB in size, the
deduplication processes will only load specific portions of the database into memory as needed. As
deduplication jobs run, the processes will use more memory the longer they are in operation. As a result, the
longer the jobs run the more efficient the overall process will be. Because of this factor, it is recommended that
there always be running jobs requiring deduplication processes during a protection window. If no jobs are
running then the process will terminate. This will require deduplication processes to restart when new jobs are run,
which will result in slower performance. Testing has shown that it can take up to an hour for deduplication to
reach its peak performance. This performance aspect will be noted in the tables listed in this section. The greater
the amount of data being moved, the higher the throughput will be.
Note: The specified requirements are as of the printing of this book. Please consult with CommVault for updated
deduplication recommendations.
SIDB2 Utility
The SIDB2 utility tool can be used to simulate the operations of a deduplication database. This tool should be
used to test the disk location where the deduplication database will be stored to ensure performance is adequate.
For complete instructions on using this utility refer to CommVault online documentation.
Note: The sizing charts provided here are based on an adequately scaled environment. If CommVault best
practice guidelines are not followed results can be significantly less than what is presented here.
The first chart illustrates database size and maximum store size for protecting 10 TB of
data using a 128 KB block factor.
Deduplication assumptions:
Base full reduction: 60%
Subsequent full reduction: 95%
Incremental reduction: 60%
Data Type          Cycle Frequency   Cycles Retained   Block Factor   Storage Policies Required   Dedupe Database Size   Max Store Size   Throughput (TB/hr)
Database           Weekly            4                 128 KB         1                           45 GB                  96 TB            0.5
File / messages    Weekly            4                 128 KB         1                           46 GB                  96 TB            0.5
Virtual machines   Weekly            4                 128 KB         1                           42 GB                  96 TB            0.5
The second chart illustrates the same deduplication characteristics but with 50 TB of
data. This is to demonstrate the scalability of Simpana deduplication. In this case
separating data into different storage policies provides greater scalability. These results
show three storage policies capable of scaling beyond 150 TB of production data.
Deduplication assumptions:
Base full reduction: 60%
Subsequent full reduction: 95%
Incremental reduction: 60%
Data Type          Cycle Frequency   Cycles Retained   Block Factor   Storage Policies Required   Dedupe Database Size   Max Store Size   Throughput (TB/hr)
Database           Weekly            4                 128 KB         1                           223 GB                 96 TB            2.25
File / messages    Weekly            4                 128 KB         1                           232 GB                 96 TB            2.5
Virtual machines   Weekly            4                 128 KB         1                           216 GB                 96 TB            2.5
The following chart illustrates two storage policy designs, one using 128 KB block size
and the other using 256 KB. Note the greater scalability of the deduplication store by
using the higher block size. Managing 100 TB of data using the 128 KB block size, two
storage policies would be required. With 256 KB only 1 policy is required with a single
deduplication store scaling to almost 200 TB.
Deduplication assumptions:
Base full reduction: 60%
Subsequent full reduction: 95%
Incremental reduction: 60%
Data Type   Cycle Frequency   Cycles Retained   Block Factor   Storage Policies Required   Dedupe Database Size   Max Store Size   Throughput (TB/hr)
Database    Weekly            4                 128 KB         2                           447 GB                 96 TB            4.5
Database    Weekly            4                 256 KB         1                           223 GB                 193 TB           4.5
The following table illustrates media files being retained for 4 weeks with an initial
backup size of 100 TB and daily incremental change rate of 100 GB. A low base
reduction rate for fulls and incrementals is assumed due to the data type being
protected.
[Table: media file data set (storage policy copy) with columns for full backup size, incremental job size, incremental-to-full percentage, retained backup jobs per cycle, and base full, subsequent full and incremental reduction per job]
Due to the data type being protected, using small block sizes will not provide additional space savings. By
setting a higher block size, the deduplication database and store can scale out significantly higher, providing better
scalability and performance.
The following table illustrates the scaling capabilities of the dedupe store and database
when using various block sizes. In this case the 512 KB block size provides a scale of
close to 400 TB while keeping the dedupe database at 32 GB.
General Guidelines
Carefully plan your environment before implementing deduplication policies.
Consider current protection and future growth into your storage policy design. Scale your deduplication
solution accordingly so the deduplication infrastructure can scale with your environment.
Once a storage policy has been created the option to use a global dedupe policy cannot be modified.
When using encryption use dedicated policies for encrypted data and other policies for non-encrypted
data.
Not all data should be deduplicated. Consider a non-deduplicated policy for certain data types.
Non-deduplicated data should be stored in a separate disk library. This will ensure accurate
deduplication statistics which can assist in estimating future disk requirements.
Deduplication Database
Ensure there is adequate disk space for the deduplication database.
Use dedicated dedupe databases with local disk access on each Media Agent.
Use high speed SCSI disks in a RAID 0, 5, 10, or 50 configuration.
Ensure the deduplication database is properly protected.
Do NOT back up the deduplication database to the same location where the active database resides.
Deduplication Store
Only seal deduplication stores when databases grow too large or when using SILO storage.
When using SILO storage consider sealing stores at specific time intervals e.g. monthly or quarterly to
consolidate the time period to tape media.
For WAN backups you can seed active stores to reduce the data blocks that must be retransmitted when a
store is sealed. Use the Use Store Priming option with Source-Side Deduplication to seed new
active stores with data blocks from sealed stores.
Performance
Use DASH Full backup operations to greatly increase performance for full data protection operations.
Use DASH Copy for auxiliary copy jobs to greatly increase auxiliary copy performance.
Ensure the deduplication database is on high speed SCSI disks.
Ensure Media Agents hosting a dedupe database have enough memory (at least 32GB).
Global Deduplication
Global deduplication is not a be-all and end-all solution and should not be used in every situation.
Consider using global dedupe policies as a base for other object level policy copies. This will provide
greater flexibility in defining retention policies when protecting object data.
Use global deduplication storage policies to consolidate remote office backup data in one location.
Use this feature when like data types (File data and or virtual machine data) need to be managed by
different storage policies but in the same disk library.
SILO storage
SILO storage is for long term data preservation and not short term disaster recovery.
Recovery time will be longer if data is in a tape SILO, so for short term, fast data recovery use traditional
auxiliary copy operations.