
Chapter 5

Simpana® Deduplication


Simpana® v9 offers a variety of deduplication-based features that drastically change the way data protection is
conducted. Client Side Deduplication can greatly reduce network usage, DASH Full can significantly reduce the
time of synthetic full backups, and DASH Copy can greatly reduce the time it takes to copy backups to off-site
disk storage. Additionally, SILO storage can copy deduplicated data to tape while still in its deduplicated state. This
chapter details how deduplication works and how best to configure and manage deduplicated storage.

Important! This section provides guidelines based on current CommVault best practices.
It is strongly recommended you check with CommVault for any updated guidelines and
best practices as they may have changed since the writing of this book.

The Concept of Deduplication


Duplicate data blocks exist both within individual files and across multiple files. One example could be a DLL file that exists
on 10 different servers; the blocks in those files may be exactly the same. Another example would be a Word
document that five different users have. Some of the users modified the document, which results in some
blocks in the file changing while other blocks remain the same. One additional example could be a database file
containing white space, which results in duplicate blocks since they contain no data. Traditionally, all of this data
would be redundantly stored on disk or tape, requiring a significant amount of space to protect. With disk storage
and Simpana® deduplication, each of those blocks only needs to be stored once.

In the following diagram, five blocks are being written to disk. Three of the blocks are
exactly the same. With deduplication, the result will be three blocks written to disk. The
two unique blocks will be written individually but the three duplicate blocks will only be
written once.

[Diagram: data blocks, consisting of unique blocks and duplicate blocks, written to disk storage]
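
The result in the diagram can be sketched in a few lines of Python. The block contents, the dictionary used as a "store", and the choice of SHA-512 are purely illustrative stand-ins, not CommVault data structures.

```python
import hashlib

# Five incoming data blocks; three carry identical content.
blocks = [b"unique-1", b"duplicate", b"duplicate", b"unique-2", b"duplicate"]

store = {}        # signature hash -> block actually written to disk storage
references = []   # per-block metadata: which stored block each one points to

for block in blocks:
    signature = hashlib.sha512(block).hexdigest()
    if signature not in store:
        store[signature] = block     # unique block: written to disk once
    references.append(signature)     # duplicate blocks only add a reference

print(f"incoming blocks: {len(blocks)}, blocks written to disk: {len(store)}")
# -> incoming blocks: 5, blocks written to disk: 3
```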


Deduplication Features
Simpana v9 software has a unique set of deduplication features that are not available with third party
deduplication solutions. If third party solutions are being used, most of the Simpana features will not be available.

Simpana v9 deduplication features:

• Efficient use of storage media. Each deduplication storage policy will deduplicate all data blocks
written to the policy. An optional global deduplication storage policy can be used to write blocks from
multiple storage policies through a single deduplicated store, so that blocks common to multiple policies
are stored only once on disk.

• Efficient use of network bandwidth. Client Side Deduplication can be used to deduplicate block data
before it leaves the client. This will greatly reduce network bandwidth requirements after the first
successful full backup is completed. From that point forward only changed blocks will be sent over the
network.

• Significantly faster synthetic full operations. DASH Full is a read-optimized synthetic full operation
which does not require traditional full backups to be performed. Once the first full is completed, changed
blocks are protected during incremental or differential backups. A DASH Full will run in place of a
traditional full or synthetic full. This operation does not require movement of data; it simply updates
indexing information and the deduplication database to signify that a full backup has been completed.
This will significantly reduce the time it takes to perform full backups.

• Significantly faster auxiliary copy operations to disk storage. DASH Copy operations are optimized
auxiliary copy jobs that require only changed blocks to be sent to a second disk target. Once the initial
full auxiliary copy has been performed, only changed blocks will be sent during auxiliary copy jobs. This
is an ideal solution for off-site copies to secondary disaster recovery facilities since it does not require
high network bandwidth.

• Efficient use of tape media using SILO. SILO storage allows data to be copied to tape in its
deduplicated form; in other words, the data is not rehydrated. During normal auxiliary copy
operations rehydration of the data is required, meaning the deduplicated data is read from
disk, expanded back into its original (compressed) form, and then written to tape. A SILO operation is not an
auxiliary copy; it is a backup of the disk volume folders in the CommVault disk library. The SILO
operation simply copies the folders to tape and the data remains in its deduplicated form.

• Resilient indexing and restorability. The deduplication database that is used to check signature hashes
for deduplication purposes is not required during restore operations. Instead the standard indexing
methodology is used. This includes using the index cache, an optional index cache server, and index files
written at the conclusion of the job. This resiliency ensures the restorability of deduplicated data even
under the worst case scenarios.


Is Deduplication the best solution for data movement and storage?

Many people look at deduplication as the most effective method of moving and storing data. Curiously,
there is one part of the data movement process that is missing from the feature list above. Though
deduplication can greatly improve data protection and data storage, it can actually have a negative effect
on restore performance. When data needs to be restored, the Media Agent must recreate all chunk
data and send it to the restore destination. Since deduplication leads to data fragmentation, reading the
data blocks from disk will be slower than if the data was not deduplicated.

A solution that the Simpana software has supported for many versions, and that improves backup
performance, reduces storage requirements, and improves restore performance, is data archiving.
Archiving removes the data from the production storage, leaving a stub file in its place. The stub is then
backed up during normal data protection jobs, which greatly reduces backup windows and storage
requirements. If the system needs to be restored, the stubs are restored instead of the actual data. As
long as the path to the archived file is available, opening the stub will automatically recall
the file.

One emerging technology which is ideal for this type of solution is Cloud Storage. Archiving data into
the cloud moves the data off-site which provides for a sound disaster recovery solution and gives users
the ability to retrieve the data from the cloud by opening the stub file.


The Simpana® Deduplication Process


The deduplication process contains the following key components:

• Storage Policy
• Data Blocks
• Signature Hash
• Media Agent
• Deduplication Database (DDB)
• Disk storage
• Optional client side signature cache

Storage Policy
All deduplication activity is centrally managed through a storage policy. Configuration settings are defined in the
policy, the location of the deduplication database is set through the policy, and the disk library which will be used
is also defined in the policy. Storage policy configurations will be covered in great detail later in this chapter.

Data blocks and Signature Generation


When CommVault protects data, the data is sent to the Simpana agent from the file system or application the
agent is responsible for protecting. Even though the data may be files or application data, the agent processes the data
in blocks. The deduplication process starts by performing a calculation to generate a signature hash. This is a
512-bit value that uniquely represents the data within the block. The hash is then used to determine whether the
block already exists in storage.

The block size that will be used is determined in the Storage Policy Properties on the Advanced tab. CommVault
recommends using the default value of 128k, but the value ranges from 32k to 512k. Higher block sizes are
recommended for large databases. Determining the best block size will be covered later in this chapter.
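
A rough sketch of the signature generation step might look like the following. SHA-512 is used here only because it produces a 512-bit value, and the function name and the optional zlib compression are illustrative assumptions, not the actual Simpana implementation.

```python
import hashlib
import zlib

BLOCK_SIZE = 128 * 1024   # 128 KB default; configurable from 32 KB to 512 KB

def generate_signatures(path, block_size=BLOCK_SIZE, compress=True):
    """Yield a 512-bit signature hash for each block of a file."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            if compress:
                block = zlib.compress(block)   # compression happens before hashing
            # SHA-512 stands in for the 512-bit signature described above
            yield hashlib.sha512(block).digest()
```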

Signature Hash Comparison


The block signature hash is used to determine if the block exists in storage by comparing the hash against other
hashes in the Deduplication Database. By default, signature hashes are generated on the Client. This is preferred
since the processing of block signatures can be distributed to many different systems. This is required when using
Simpana Client Side Deduplication. For underpowered Clients that will not be using Client Side Deduplication, a
subclient can be optionally configured to generate signatures on the Media Agent.

Deduplication can be configured for Storage Side Deduplication or Client (source) Side Deduplication.
Depending on how deduplication is configured, the process will work as follows:

1. Storage Side Deduplication. Once the signature hash is generated on the block, the block and the hash
are both sent to the Media Agent. The Media Agent, with a local or remotely hosted deduplication
database (DDB), will compare the hash against the database. If the hash does not exist, the block is
unique; it will be written to disk storage and the hash will be logged in the database. If
the hash already exists in the database, the block already exists on disk; the block and hash
will be discarded but the metadata of the data being protected will still be written to the disk library.


Storage Side Deduplication will send the hash and block to the Media Agent. The hash
is checked in the deduplication database to determine if the block is duplicate or unique.

2. Client Side Deduplication. Once the signature is generated on the block, only the hash is sent to
the Media Agent. The Media Agent, with a local or remotely hosted deduplication database, will compare
the hash against the database. If the hash does not exist, the block is unique; the Media Agent
will request the block to be sent from the Client and will then write the data to
disk. If the hash already exists in the database, the block already exists on disk; the Media
Agent will inform the Client to discard the block and only metadata will be written to the disk library.

a. Client Side Disk Cache. An optional configuration for low bandwidth environments is the
client side disk cache, which maintains a local signature cache on the Client. Each subclient
maintains its own cache. The signature is first compared against the local cache; if the hash exists,
the block is discarded. If the hash does not exist in the local cache, it is sent to the Media Agent.
If the hash does not exist in the DDB, the Media Agent will request the block from the
Client, and both the local cache and the deduplication database will be updated with the new
hash. If the block does exist, the Media Agent will instruct the Client to discard it.


Client Side Deduplication only sends the hash to the Media Agent to compare in the
dedupe database. This will greatly reduce network traffic when backing up redundant
data blocks. Once the first full backup is completed, only changed blocks will be sent
over the network.
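
The client-side decision path described above can be sketched as follows. The MediaAgent class, its methods, and the set-based caches are toy stand-ins invented for illustration; they are not CommVault APIs.

```python
class MediaAgent:
    """Toy stand-in for a Media Agent hosting a deduplication database (DDB)."""
    def __init__(self):
        self.ddb = set()   # signature hashes of blocks already on disk
        self.disk = {}     # signature -> block data in the disk library

    def ddb_contains(self, signature):
        return signature in self.ddb

    def write_block(self, signature, block):
        self.ddb.add(signature)
        self.disk[signature] = block


def client_side_dedupe(signature, block, local_cache, media_agent):
    """Decide what happens to one block under Client Side Deduplication."""
    if signature in local_cache:                # optional client side disk cache hit
        return "discarded (local cache hit)"
    if media_agent.ddb_contains(signature):     # only the hash crossed the network
        local_cache.add(signature)
        return "discarded (duplicate, metadata only)"
    media_agent.write_block(signature, block)   # unique: block is requested and written
    local_cache.add(signature)
    return "written to disk"
```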

Components of Deduplication
Careful planning is required before setting up deduplication in an environment. Poor configuration can lead to
scalability issues that may force a redesign of the environment, which in turn can result in a loss of
deduplication storage savings.

Example: Consider an initial design strategy using a 32KB block size configured in the storage policy and a two
year retention policy for data. The deduplication database, which contains all signature hashes, can maintain up to
750 million records. The smaller block size results in more unique blocks, and therefore a larger database. You
realize you are approaching the upper size limit of the database, so you change the block size to 128KB. 32KB
blocks cannot be deduplicated against 128KB blocks, so all the existing 32KB blocks will remain on disk for the two year
retention. The new 128KB blocks will also be written to disk with a two year retention, resulting in duplicate data within
the disk library.

This section will break down all the critical configuration concerns and best practices for an effective design
strategy.

Important! This section provides guidelines based on current CommVault best practices.
It is strongly recommended you check with CommVault for any updated guidelines and
best practices as they may have changed since the writing of this book.


Deduplication Database
The deduplication database is the primary component of Simpana's deduplication process. It maintains all
signature hash records for a deduplicated storage policy. Each storage policy will have its own deduplication
database. Optionally, a global deduplication storage policy can be used to link multiple storage policies to a single
deduplication database by associating storage policy copies with the global deduplication storage policy.

The deduplication database can currently scale from 500 to 750 million records. This equates to up to 90 Terabytes
of data stored within the disk library and up to 900 Terabytes of protected data. It is important to
note that the 900 TB is not the source size but the amount of data that is backed up over time. For example, if 200 TB
of data is being protected and retained for 28 days using weekly full and daily incremental backups, the total
amount of protected data would be 800 TB (200 TB per cycle multiplied by 4 cycles, since a full is being
performed every seven days). These estimates are based on a 128k block size and may be higher or lower
depending on the number of unique blocks and the deduplication ratio being attained.

A deduplication database can handle up to 50 concurrent connections with up to 10 active threads at any given
time. The database structure uses a primary and secondary table. Unique blocks are committed to the primary
table and use 152 bytes per entry. Duplicate entries are registered in the secondary table and use 48 bytes per
entry.
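
The figures above lend themselves to simple back-of-the-envelope arithmetic, sketched below. This is illustrative arithmetic derived only from the numbers quoted in this section, not an official CommVault sizing tool.

```python
TB = 1024 ** 4

# Protected data over time: 200 TB per cycle, weekly fulls retained for 28 days.
protected_tb = 200 * 4                                      # -> 800 TB of protected data

# Record ceiling and per-entry size quoted above.
max_records   = 750_000_000
primary_bytes = 152   # unique-block entries; duplicate entries use 48 bytes each
                      # in the secondary table (not modeled here)

primary_table_gb = max_records * primary_bytes / 1024**3    # ~106 GB if every record were unique
store_tb = max_records * 128 * 1024 / TB                     # ~89 TB on disk at a 128k block size

print(protected_tb, round(primary_table_gb), round(store_tb))
```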

Location of the Deduplication database


CommVault recommends locating the deduplication database locally on the Media Agent moving the data, on either
direct-attached or SAN-attached disks. They should be high speed SCSI disks in a RAID 0, 10, or 50
configuration. The faster the disk performance, the more efficient the data protection and deduplication process
will be. Fibre Channel is preferred over iSCSI due to the TCP/IP overhead involved in iSCSI data processing.
There should be adequate space for the estimated size of the deduplication database. Not having enough space for
the database can cause the database to become corrupt.

The deduplication database is initially configured during the storage policy creation. It
can be specified for secondary copies or moved by going to the policy copy properties,
Deduplication tab, Store Information.


Requirements for Deduplication Database


A Media Agent hosting a deduplication database should be a 64-bit server with at least 32 GB of RAM. When a
deduplication database is open (protection jobs are running that require the database), the system processes will
gradually increase memory usage as time and jobs progress. This results in performance increasing the
longer the database remains open. A recommended best practice is to ensure jobs requiring the deduplication
database remain running throughout the data protection windows. If there are no jobs running, the database will
close. A subsequent job requiring the database will reopen it and the system processes will start over.
This will result in slower performance until the processes reach peak memory usage and performance.

A Media Agent can also host multiple deduplication databases. It is recommended that a single Media Agent
should not have more than two databases open at any given point in time. If a Media Agent has more than two
databases, consider staggering protection schedules, if possible, for the best performance.

Protecting the Deduplication database


Note: Best practice guidelines and methods for protecting the deduplication database may have changed since the
writing of this book. Check with CommVault for current best practices.

There are two methods to protect the deduplication database:


1. Point in time backups, which require all operations to pause until the backup is complete.
2. Using a File System iDataAgent and VSS to protect the database, which does not require jobs to pause.
This feature is available as of Simpana v9 SP3b and is the preferred method to protect the dedupe
database.

Point in time backups are configured in the Advanced tab of the Deduplication settings. The Store Availability
Option allows you to set recovery points for the backup of the deduplication database. The recommended setting
is to create recovery points every 24 hours. If recovery points are not created and a dedupe database becomes
corrupt the store will be sealed and a new store and database will be created. This would result in all data blocks
being resent to the disk library which will have a negative impact on network performance and deduplication
ratio.

Prior to Simpana v9 SP4 the method to protect the deduplication database was to create
automatic recovery points. The default value for recovery point creation is eight hours.
If this method is going to be used it is recommended to set the recovery points to 24
hours. This method for protecting the deduplication database will require data
protection jobs to pause for the entire time of the database backup process.


By default, database backups are written to the same location where the dedupe database resides. It is
strongly recommended that you change this location. The location can be set with the following registry key:
SIDBBackupPath. Consult CommVault's online documentation for more information and the latest
guidance on configuring this method for protecting the dedupe database.

Using the File System iDataAgent to protect the deduplication database allows the periodic backup of the
database without requiring jobs to pause. This is done by configuring a Read Only File System iDataAgent on the
Media Agent hosting the dedupe database. A special subclient is then added with the DDB Subclient option
selected. When using the point in time recovery points option jobs will be paused for the entire time it takes to
protect the database. Using the file system iDataAgent, jobs will only pause until VSS properly quiesces the
database which should take no more than two minutes. Due to this factor it is recommended to use this method to
protect the database.

The preferred method to protect the deduplication database is with a Read-Only File System iDataAgent
installed on the Media Agent hosting the database. A subclient is configured
as a DDB subclient, which will automatically define the databases as the subclient contents.
Backups can then be scheduled as frequently as desired for database protection.


Deduplication Store
Each storage policy copy configured with a deduplication database will have its own deduplication store. Quite
simply a deduplication store is a group of folders used to write deduplicated data to disk. Each store will be
completely self-contained. Data blocks from one store cannot be written to another store and data blocks in one
store cannot be referenced from a different deduplication database for another store. This means that the more
independent deduplication storage policies you have, the more duplicate data will exist in disk storage.

Example: If you had three storage policies, each with its own deduplication database
and deduplication store, you would have three databases and three sets of folders. This
results in duplicate blocks existing in each of the three stores, which will reduce
deduplication efficiency. In some situations this may be desirable, since dissimilar
data types such as file system and database data may not deduplicate well against each
other.
[Diagram: Storage Policies A, B, and C each writing data blocks to their own store, with duplicate block data existing across the multiple stores]


Sealing the Store


In previous versions of the Simpana software it was necessary to occasionally seal the deduplication store. This
process would close the deduplication database and start a new one. This was required due to size limitations of
the database itself. When the new database was started, any blocks that had been written to the sealed store would
have to be written to disk again in the new store. This affected deduplication ratios since duplicate blocks
would exist in each store. In Simpana v9 this is no longer needed since the database can scale significantly. The
default store size settings are disabled, which means the store will never seal.

Performance of the deduplication database is the primary factor in store sealing. If block lookups take too long,
data protection jobs will slow substantially. Sealing the store will start a new deduplication database, which will
result in faster data protection jobs but will diminish deduplication ratios. One cause of slow block lookups is
that the database has grown too large. If the environment is designed and scaled appropriately this should not be a
problem. Another cause is that the deduplication database is being stored on slow disks or accessed using inefficient
protocols such as NFS, CIFS or iSCSI. CommVault recommends using high speed dedicated disks directly
attached to the Media Agent. Designing a scalable deduplication solution will be discussed later in this chapter.

For organizations with large amounts of data protected through a single storage policy and deduplication store, it
may become necessary to seal the store. Since there is an upper limit to the deduplication database, sealing the
store may occasionally be required. This will depend on how long the data is being retained, since blocks
pruned from disk will also have their signature records pruned from the deduplication database. If large amounts of
data need to be managed and the defined retention rules will require more than 750 million records, the best
solution to prevent the sealing of stores is to use multiple deduplicated storage policies. Each policy will
have its own store, resulting in multiple smaller deduplication databases.

Another reason to seal a store is when using SILO storage to tape. A store can be sealed when it grows to a
certain size, after a defined number of days, or through time intervals such as once a quarter. If you wanted to
have a set of tapes representing a fiscal quarter, then sealing the store every three months would result in all
volume folders being sealed and placed in SILO storage isolating the data based on the fiscal quarter.

Sealing stores and SILO storage will be discussed in the Deduplication Strategies section of this chapter.

The Deduplication store can be sealed and a new one created by any one of the following criteria:
• Create a new store every n number of days.
• Create a new store every n number of terabytes.
• Create a new store every n number of months from a specific starting date.

Deduplication Store Settings are configured in the Store Information tab in the
Deduplication settings within the storage policy copy.


Deduplication Block Size


As application or file data is read into memory, it is optionally compressed and then hashed. The hash is
compared against the deduplication database to determine whether the block already exists. If the hash exists then the
block is a duplicate; if not, it is unique. It is important to understand how data blocks within files and
applications are addressed in order to best configure deduplication.

Content Aware Block Deduplication


Simpana's ability to be aware of the content that is being deduplicated allows blocks to be better aligned when
deduplication takes place. When a file is read into memory it is compressed into 128KB blocks by default. A hash
is generated on each compressed block, which is used for deduplication. But not all files are 128 KB in size and not
all files are evenly divisible by 128 KB. If a compressed file is smaller than 128 KB it will be hashed in its entirety,
down to a minimum size of 4KB. For larger files that have a trailing segment smaller than 128 KB, that
segment will also be hashed in its entirety, down to 4 KB.

It is important to note that as each file is read into memory the 128 KB buffer is reset. Files will not be combined
to meet the 128 KB buffer size requirement. This is a big advantage in achieving dedupe efficiency. Consider the
same exact file on 10 different servers. If we always tried to fill the 128 KB buffer each machine would use
different data and the hashes would always be different. By resetting the buffer with each file, each of the 10
machines would generate the same hash for the file.

Appliance based deduplication devices are not content aware since the backup software writes data in large
chunks. This is why they need to use considerably smaller block sizes and realign blocks as they are written to the
appliance. By using these methods, they can achieve comparable deduplication ratios to the Simpana method, but
the overhead and cost is significantly higher.

In the following diagram two files are being deduplicated with a 128 KB block size. The
first file has three 128 KB segments and a trailing segment of 32 KB. The first three
segments are hashed at 128 KB and the last segment is hashed at 32 KB. The buffer is
then reset for the next file to be read into memory. This aligns all the blocks of the
new file so signature hashes will always be consistent.
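
A sketch of the per-file chunking described above is shown below. The function is illustrative only, and SHA-512 again stands in for the signature hash.

```python
import hashlib

BLOCK_SIZE = 128 * 1024   # 128 KB default block size

def content_aware_signatures(file_bytes):
    """Hash one file in 128 KB segments; the buffer resets for every file."""
    signatures = []
    for offset in range(0, len(file_bytes), BLOCK_SIZE):
        segment = file_bytes[offset:offset + BLOCK_SIZE]
        # A trailing segment smaller than 128 KB (down to the 4 KB minimum)
        # is hashed in its entirety rather than padded with the next file's data.
        signatures.append(hashlib.sha512(segment).digest())
    return signatures

# The exact same file on ten different servers yields identical signatures,
# because no buffer is ever shared between files.
file_a = b"x" * (3 * BLOCK_SIZE + 32 * 1024)   # three 128 KB segments plus a 32 KB tail
assert len(content_aware_signatures(file_a)) == 4
```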


Block Size Recommendations


Block size is configured in the storage policy Properties under the Advanced tab. When using a global
deduplication storage policy, all other storage policy copies that are associated with the global policy must use the
same block size. If you change the block size for the storage policy the deduplication database will be closed, the
store will be sealed and a new database and store will be created.

The block size recommendations depend on the data type that is being protected and the size of the data. The
current recommended setting for file and virtual machine data is 128 KB block size. This provides the best
balance for deduplication ratio, performance and scalability. Though the block size can be set as low as 32 KB,
deduplication efficiency only improves marginally and is therefore not recommended.

For databases the recommended block size is from 128 KB to 512 KB, depending on the size. For large database
servers such as Oracle, which may perform application level compression, deduplication ratios may be
compromised. It is strongly recommended to consult with CommVault Professional Services when designing a
proper protection strategy for large databases.

For large data stores, especially media repositories, consider setting a higher block size (256 KB or more). For media
data types the potential for duplicate data blocks is minimal, so deduplication savings will mostly come from
backing up the same data over time. A higher block size also allows more data to be
stored with a smaller deduplication database.

Setting the Deduplication block factor in the Advanced tab of the storage policy
properties.

Consider the size of the deduplication database as well as the deduplication store when factoring block size. As a
general guideline, setting the block size to 64k results in half the sizing capacity of 128k, while setting the
block size to 256k yields twice the capacity. Since CommVault deduplication is content aware, a
smaller block size may or may not give you a deduplication advantage, but it will definitely limit the scale of how
much data can be stored.
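
The scaling relationship can be illustrated with a few lines of arithmetic, assuming the 750 million record ceiling quoted earlier; this is illustrative only.

```python
TB = 1024 ** 4
MAX_RECORDS = 750_000_000

for block_kb in (32, 64, 128, 256, 512):
    capacity_tb = MAX_RECORDS * block_kb * 1024 / TB
    print(f"{block_kb:>3} KB blocks -> roughly {capacity_tb:,.0f} TB in the store")

# 64 KB yields roughly half the capacity of 128 KB; 256 KB roughly doubles it.
```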


Why 128KB?

Many competitors will use significantly smaller block sizes, as low as 8k. The reason for this is
simple… a better deduplication ratio... Well, sort of. The truth is the ratio will be about the same,
since the appliance is not content aware where the CommVault software is. So why does
CommVault recommend the higher block size? Competitors who usually sell appliance based
deduplication solutions with steep price tags use a lower block size, which actually results in only a
marginal gain in space savings considering most data in modern datacenters is quite large.
Unfortunately there are some severe disadvantages to this. First off, records for those blocks must be
maintained in a database. A smaller block size results in a much larger database, which limits the size
of disk that the appliance can support. CommVault software can scale significantly higher, up to 90
Terabytes per database.

Even more important is the aspect of fragmentation. The nature of deduplication and referencing
blocks in different areas of a disk leads to data fragmentation. This can significantly decrease restore
and auxiliary copy performance. The higher block size recommended by CommVault makes restores
and copying data much faster.

The final aspect is price and scalability. With relatively inexpensive enterprise class disk arrays you
can save significant money over dedicated deduplication appliances. If you start running out of disk space,
more disks can be added to the existing library, so deduplication will be preserved.
Considering advanced deduplication features such as DASH Full, DASH Copy, and SILO tape
storage, the Simpana deduplication solution is a powerful tool.

Setting Minimum Block Size


The 128 KB block size is based on the idea that most files will be larger than 128 KB. Those files will be divided
into 128 KB blocks and each block will be hashed. Simpana v9 SP4 is configured to hash and deduplicate blocks
as small as 4 KB. Since it is unlikely that a file will be evenly divisible by 128 KB, file remainders will be hashed
down to a 4 KB size. Files smaller than 128 KB will be hashed and deduplicated in their entirety.

Prior to SP4, blocks were hashed down to a minimum size of 16 KB. This minimum can be further reduced
to 4 KB using the SignatureMinFallbackDataSize registry key. This key should be added to any Client or
Media Agent performing signature generation. Check the Online Documentation for complete instructions and
current best practices for configuring this value and for deploying the key to Clients and Media Agents.


Life and Health of Data Blocks


Traditionally, when data was moved to protected storage, multiple copies of the data would exist over time. Take
for example four cycles of data written to tape. A single unmodified file would exist in four different locations.
With deduplication, four cycles of data written to disk would result in one copy of that file. In addition, blocks of
that file may be referenced by other files on the disk. Where a bad block on a tape would require you to get the
file from a different tape, with deduplication a bad block would render that file, and any other files referencing
the block, unrecoverable.

In the Settings tab of Deduplication configuration in the storage policy copy the option Do not deduplicate
against objects older than can be configured to limit the life of a block on disk. This can be used to periodically
refresh data blocks on the deduplicated disk storage. This will decrease the potential for bad blocks affecting
restorability of data. Prior to Simpana v9 SP3 this option was enabled and configured for 365 days. As of SP3 this
option is disabled by default.

Careful consideration should be taken in configuring this option. With this setting enabled, each time blocks on
disk are refreshed the deduplication ratio will suffer significantly until the old blocks are pruned. If you are
performing Client Side Deduplication, setting this option will periodically require ALL blocks that have
previously been deduplicated to be retransmitted over the network. The best solution in this case is to ensure you
have high quality enterprise class disks and ALWAYS make additional copies of data to other disk and/or tape
locations. If you are concerned with the health of data blocks on disk you can enable the Do not deduplicate
against objects older than option, with the understanding that all blocks will need to be re-sent at the time
interval you specify.

Block Compression with Deduplication


By default, when a storage policy is configured to use deduplication, compression is automatically enabled for the
storage policy copy. This setting will override the subclient's compression settings. For most data types
compression is recommended. The process works by compressing the block and then generating a signature hash
on the compressed block. It is important to note that using Simpana compression for data blocks ensures the
compressed blocks always result in the same signature. If other compression methods are used then the signatures
will differ and deduplication ratios will suffer.

Compression and databases


Most database applications will perform compression on the data before handing it off to the Simpana database
agent. In this case using Simpana compression is not advised as it can cause the data to expand. With some
database applications it can expand considerably. CommVault strongly recommends using either the application
compression or Simpana compression. If application compression is going to be used, best practice is to use a
dedicated storage policy for the application and disable compression in the deduplication settings.

For Oracle databases advanced table compression is available which may result in dissimilar hashes being
generated each time the database is backed up. This can negate deduplication completely. Careful consideration
should be given to which compression methods should be used. Though the Oracle compression is extremely
efficient it may not always be the best solution when using deduplicated storage. CommVault strongly
recommends consulting with professional services when deploying CommVault software to protect large Oracle
databases.


Encrypting deduplicated data


A unique feature of the Simpana software is the ability to encrypt deduplicated data. The software accomplishes
this by encrypting the data after the signature hash has been generated. The full process, including compression,
would be: Compress, Hash, Encrypt. The traditional method of deduplicating encrypted data would be to hash the
encrypted data. Every time the block was encrypted, a different hash would be generated, so it is not
possible to achieve efficient deduplication ratios using that method. Since the Simpana software hashes the block
prior to encryption, the hash will always be consistent even if the encryption key changes, resulting in efficient
deduplication ratios.
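
The ordering can be illustrated with the following sketch. zlib and a toy XOR cipher stand in for the real compression and encryption algorithms; the only point being made is that the signature is taken before encryption, so it stays stable even when keys change.

```python
import hashlib
import zlib

def protect_block(block, key):
    """Compress -> Hash -> Encrypt, in that order (illustrative only)."""
    compressed = zlib.compress(block)
    signature = hashlib.sha512(compressed).digest()   # hash is taken BEFORE encryption
    encrypted = bytes(b ^ key[i % len(key)] for i, b in enumerate(compressed))  # toy cipher
    return signature, encrypted

block = b"some application data " * 100
sig_with_key1, _ = protect_block(block, b"key-one")
sig_with_key2, _ = protect_block(block, b"key-two")
assert sig_with_key1 == sig_with_key2   # same signature even though the encryption key changed
```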

If encryption is going to be used with deduplication a dedicated storage policy MUST be used for encrypted data.
Mixing encrypted and non-encrypted data will result in data not being able to be restored. This is due to the fact
that an unencrypted file referencing an encrypted block will not be able to access the encrypted block.

Resiliency & Recovery of Deduplicated Data


One key aspect of Simpana deduplication is the ability to recover data even in the event of database corruption or
loss of the index cache. The deduplication database is only used for signature comparisons during backup and
data aging – not restore, so if the database becomes corrupt data can still be recovered. As data blocks are
deduplicated and written to disk, the index cache is updated with pointers to a block being written to disk. If the
block already exists in storage then the index cache will be updated with pointers to the location of the block
being referenced. At the conclusion of the data protection job the index cache will be written to media as well. In
the event that the index is unavailable in the cache, the index will be recovered from media. This resiliency in the
deduplication process allows for recovery even in the worst case disaster scenario.

In the following illustration, index information is maintained in three locations: the Index
Cache, the Index Cache Server, and the disk library where the job is located.


Storage Policy Elements


Deduplication is centrally managed through storage policies. Each policy can maintain its own deduplication
settings or can be linked to a global deduplication storage policy. Which method is used for configuring storage
policies will depend on the type of data and your environment. This section will explain the elements of a
Deduplication storage policy and when dedicated policies should be used and when global policies should be
used.

Dedicated Deduplication Storage Policy


A dedicated deduplication storage policy will consist of one library, one deduplication database, and one or more
Media Agents. For scalability purposes, using a dedicated deduplication policy allows for the efficient movement
of very large amounts of data. Dedicated policies are also recommended to separate data types that do not
deduplicate well against each other such as database and file system data.

Large Scale Deduplication Solution


For large amounts of data CommVault recommends using dedicated storage policies, libraries and Media Agents.
This method allows for greater scalability by building out your storage infrastructure to accommodate current and
future data growth. In the following diagram, there are three Media Agents with disk libraries dedicated to each
Media Agent. Although data blocks will not deduplicate between the different libraries, for scalability and
performance reasons this is the recommended method for protecting large amounts of data.

In the following illustration dedicated storage policies, Media Agents, and disk libraries
are used to protect large amounts of data.


Deduplicating Dissimilar Data Types


Certain data types do not deduplicate well against others, such as database data against file system data. In addition,
applications such as SQL and Oracle will perform application level compression, whereas file
data and VM data may not. Because of this it is recommended to use different storage policies to protect different
data types.

For large databases and other data types such as media files, higher block sizes (256k or more) can be set to allow for
higher scalability. The following diagram illustrates three storage policies:

• Storage Policy A is using a 512k block size for SQL databases.
• Storage Policy B is using a 128k block size for files, SharePoint documents, and Exchange messages.
• Storage Policy C is using a 256k block size for Exchange database data.

The following diagram illustrates different data types using separate storage policies to
protect and manage deduplicated data. Note that object level protection for SharePoint
and Exchange is using the same storage policy as the file system backups. Since there
will be similar data at the object level a better deduplication ratio can be attained. It is
also common to have similar retention requirements with these data types so combining
them into the same storage policy makes sense from deduplication and management
aspects.


Global Deduplication Storage Policy


Global Deduplication storage policies work by linking storage policy copies to a single deduplication database
and store. This allows data to be managed independently by a specific storage policy while maintaining a more
efficient deduplication ratio. Each policy can manage specific content and independently manage retention and
additional copies. This provides for efficient deduplication ratios while providing scalability and flexibility for
different data protection requirements.

Example: Three storage policies are required to manage specific data based on independent retention policies.
The three storage policy primary copies can be linked to a global deduplication database so that data blocks across the
three policies are only stored once. Each block will remain in the deduplication store until the longest retention for
that data block has been reached.

The following diagram illustrates a global deduplication storage policy linking primary
copies from three different storage policies into a single store. Each primary copy can
have separate retention and manage different subclients but they will all share the same
data path and deduplication store.
[Diagram: the Primary Copies of Storage Policies A, B, and C are linked to a single Global Deduplication Policy; duplicate block data across the three policies is deduplicated and stored only once in the store]


Global Deduplication storage policies are useful in the following situations:

• Data that exists in separate physical locations and is being consolidated into a single location.

• Like data types that deduplicate well against each other but have different retention requirements.

Global Deduplication storage policies are not recommended in the following situations:

• Using multiple Media Agents where the deduplication database is centrally located on one of them,
requiring network communication for the other Media Agents to compare signatures. Note: for small
environments this deployment method could be used, but it will degrade performance.

• When backing up large amounts of data, since a single database can only scale to 750 million entries. In
this case multiple dedicated storage policies are recommended.

Global Deduplication for Base Storage Policy Design


If you are planning a new storage policy architecture and you are unsure of how many policies will be needed,
using a global deduplication policy as your base store could provide better deduplication ratios as your
environment changes and grows. Even if only one storage policy will initially be used, consider linking the primary
copy to a global deduplication policy. This is best used when protecting object data or virtual machines. This use
of a global dedupe policy would not apply to databases, even if the same DB application is being used, as
deduplication efficiency will not be realized and the result would just be a bigger deduplication database.

It is important to note that associating or not associating a storage policy copy with a global deduplication policy
can only be done at the creation of the policy copy. Once the copy is created it will either be part of a global
policy or it won't. By using the global dedupe policy for the initial storage policy primary copy that will protect
data, additional policies, if required, can also be linked to the global dedupe policy. Using this method
will result in better deduplication ratios and provide more flexibility for defining retention policies or
consolidating remote location data to a central policy (which will be discussed next). The main caveat when using
this method is to ensure that your deduplication infrastructure will be able to scale as your protection needs grow.


Global Deduplication for consolidating multiple remote sites


Global Deduplication storage policies were designed specifically to address remote site backups, where backups
are performed locally at each site and then, using DASH Copy operations, the data is copied to a central data
center location. Since duplicate blocks may exist at each of the sites, using a global deduplication storage policy
associated with a secondary copy will use a single deduplication database and a single store to consolidate data
blocks from all remote locations.

In the following illustration three remote sites are locally performing backups to disk.
The data is being copied to the main data center using a global dedupe policy associated
with the secondary copy.


Global Deduplication for small data size with different retention needs
For small environments that do not contain a large amount of data but require different retention settings,
multiple storage policy Primary Copies can be associated with a global deduplication storage policy. This should
be used for small environments with the data path defined through a single Media Agent.

In the following illustration 10 Terabytes of data is being backed up through a single
Media Agent to a disk library. Three storage policies are defined with varying retention
and the Primary Copies are each associated with the same global deduplication storage
policy.

SILO Storage
Consider all the data that is protected within one fiscal quarter within an organization. Traditionally, a quarter end
backup would be preserved for long term retention. Let's assume that the quarter end backup of all data requires 10
LTO 5 tapes. Unfortunately, with this strategy the only data that could be recovered would be what existed at the
time of the quarter end backup. Anything deleted prior to the backup within the specific quarter would be
unrecoverable unless it existed in a prior quarter end backup. This results in a single point in time from which data can be
recovered. Now consider those same 10 tapes containing every backup that existed within the entire quarter.
Now any point in time within the entire quarter can be recovered. That is what SILO storage can do.


SILO storage allows deduplicated data to be copied to tape without rehydrating the data. This means the same
deduplication ratio that is achieved on disk can also be achieved to tape. As data on disk storage gets older the
data can be pruned to make space available for new data. This allows disk retention to be extended out for very
long periods of time by moving older data to tape.

How SILO works


Data blocks are written to volume folders in disk storage. These folders make up the deduplication store. The
folders have a maximum size; once it is reached, the folder is marked as full. New folders are then created
for new blocks being written. The default volume folder size for a SILO enabled copy is 512 MB. This value can
be set in the Control Panel, in the Media Management applet: the SILO Archive Configuration setting
Approximate Dedup disk volume size in MB for SILO enabled copy specifies the volume folder size. It is
strongly recommended to use the default 512 MB value. For a SILO enabled storage policy, once a folder is
marked full it can then be copied to tape. What this really is doing is backing up the backup.
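
The volume folder lifecycle described above might be sketched as follows; the class and function names are invented for illustration and do not correspond to CommVault objects.

```python
SILO_FOLDER_SIZE = 512 * 1024 * 1024   # default 512 MB per volume folder

class VolumeFolder:
    def __init__(self, name):
        self.name = name
        self.bytes_used = 0
        self.full = False

def write_block(folders, block_size):
    """Write a block into the active folder, rolling over to a new one when it fills."""
    active = folders[-1]
    if active.bytes_used + block_size > SILO_FOLDER_SIZE:
        active.full = True                            # a full folder now qualifies for SILO
        active = VolumeFolder(f"V_{len(folders) + 1}")
        folders.append(active)
    active.bytes_used += block_size

def silo_candidates(folders):
    """Only folders marked full are backed up to tape by the SILO operation."""
    return [f for f in folders if f.full]

folders = [VolumeFolder("V_1")]
for _ in range(10_000):
    write_block(folders, 128 * 1024)   # two folders fill completely, a third stays active
print(len(silo_candidates(folders)), "folder(s) ready for SILO copy")
```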

The following diagram illustrates full volume folders being copied to SILO storage.
Active volumes will not be placed in the SILO storage until they are marked full.

By copying volume folders to tape, space can be reclaimed on disk for new data to be written. This does require
some careful planning and configuration.


How volume folders are moved to SILO Storage


When a storage policy is enabled for SILO storage, an On Demand Backup Set is created in the File System
iDataAgent on the CommServe server. The On Demand Backup Set determines which volume folders have
been marked full and backs them up to tape each time a SILO operation runs. Within the backup set a Default
Subclient is used to schedule the SILO operations. Just like an ordinary data protection operation, right
click the subclient and select Backup. The SILO backup will always be a full backup operation and will use the On
Demand Backup Set to determine which folders will be copied to SILO storage.

When a storage policy deduplication copy is enabled for SILO storage a SILO backup
set will be created on the CommServe server. This will be used to schedule and copy
folders that qualify for SILO storage.

Encrypting SILO data to tape


Since the SILO exists as a backup set and subclient, inline encryption options can be configured to perform
encryption as the SILO job runs. The encryption settings would be configured in the same way an ordinary
subclient encryption would be configured.

Hardware encryption can also be used for LTO4 and LTO5 drives that support encryption. Enabling hardware
encryption is configured in the SILO data path properties in the storage policy Copy.

Note: This section assumes a basic understanding of backup sets, subclients, and encryption configuration. If you
are unfamiliar with these concepts it is strongly recommended to attend a CommVault Administration instructor
led training course.

SILO storage recovery process


In a traditional recovery from tape, the tape is mounted in a drive and the data is recovered directly back to the
recovery location. With SILO to tape, the data must first be staged to disk before it can be recovered.
Each volume folder that contains data blocks for the restore must be staged to the disk library for the recovery
operation to complete. Since block level deduplication results in blocks in different locations being referenced
by the data, multiple volume folders may be needed for a single recovery operation. This can result in slower
restore performance.


It is important to note at this point that SILO storage is less of a disaster recovery solution and more of a data
preservation solution. From the original release of the SILO feature in Simpana v8, it has received some negative
feedback. One reason is that competitors placed a lot of negative spin on this feature since they had
no comparable solution. The other is a misunderstanding of Service Level Agreements. SLA policies usually specify
that the older data gets, the longer the time to recover will be. SILO storage is not an option for recovering data from
last week; it is a feature for recovering data from last year or five years ago. Understanding this concept places SILO
storage into proper perspective. This feature is for long term preservation of data, allowing point in time
restores within a time period with considerably less storage than traditional tape storage methods.

How the Process works


Let's assume we are using deduplication and SILO storage. Our primary storage policy copy has a retention of two
years. We choose to seal the deduplication store every quarter. We will have one active store and at least one
cached store on disk. This means we can perform point in time recovery of data for a period of six months from
disk. We will also be using space management, with disk thresholds configured so that if we reach 85% of disk
capacity we will prune cached volumes. If there is enough disk storage available we might be able to keep 9 – 12
months of data on disk. Beyond that point the data will need to be pulled from the tape SILO. We could define
our SLA for up to 6 months to be 2 hours. From 6 months to 1 year the SLA will be 2-4 hours. Beyond that point
the SLA will be 4+ hours.

The recovery process will work as follows:

1. The CommVault administrator performs a browse operation to restore a folder from eight months ago.

2. If the volume folders are still on disk the recovery operation will proceed normally.

3. If the volume folders are not on disk the recovery operation will go into a waiting state.

4. A SILO recovery operation will start and all volume folders required for the restore will be staged back
to the disk library.

5. Once all volume folders have been staged, the recovery operation will run.

To ensure adequate space for SILO staging operations a disk library mount path can optionally be dedicated to
SILO restore operations. To do this, in the Mount Path Properties General tab select the option Reserve space
for SILO restores.

The procedure is straightforward, and as long as the SILO tapes are available the recovery operation is fully
automated and requires no special intervention by the CommVault administrator.


Data Movement with Deduplication

Client Side Deduplication


It is highly recommended in Simpana v9 to use Client Side Deduplication, which will greatly reduce the network
bandwidth required to move data over time. By default, Client Side Deduplication is enabled in the storage policy
creation wizard. If it is enabled in the policy it will automatically be used by all subclients associated with the
policy. If it is not enabled in the policy, it can be enabled in the Client Side Deduplication tab of the Client
Properties.

Along with configuring Client Side Deduplication in the Client Properties, a Client Side Disk Cache can be
created. Each subclient will maintain its own disk cache holding signatures for data blocks related to that
subclient. The default cache size is 4 GB and can be increased up to 32 GB. The Client Side Disk Cache is
recommended for slow networks such as WAN backups; for networks of 1 Gbps or faster, using this
option will not improve backup performance.
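
The following sketch illustrates the general client side deduplication flow described above: a signature is generated
for each block, checked against the local disk cache first, and only then against the deduplication database on the
Media Agent, so that only unknown blocks cross the network. The block size, hash algorithm, and data structures
are illustrative assumptions, not CommVault's actual implementation.

    # Minimal sketch of client side deduplication with a local signature cache.
    # The block size, hash choice, and structures are illustrative assumptions.
    import hashlib

    BLOCK_SIZE = 128 * 1024          # assumed 128 KB deduplication block factor

    def backup(data, client_cache, ddb, disk_library):
        """Transmit only blocks whose signatures are unknown to the DDB."""
        blocks_sent = 0
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            signature = hashlib.sha512(block).hexdigest()   # signature hash for the block

            # A hit in the local disk cache avoids even the signature lookup
            # across the network (the benefit on slow WAN links).
            if signature in client_cache:
                continue
            client_cache.add(signature)

            # Otherwise the signature is compared in the deduplication database
            # on the Media Agent; only unique blocks are actually transmitted.
            if signature not in ddb:
                ddb.add(signature)
                disk_library[signature] = block
                blocks_sent += 1
        return blocks_sent

    ddb, library, cache = set(), {}, set()
    payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE   # three duplicate blocks, one unique
    print(backup(payload, cache, ddb, library))            # 2 blocks cross the network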

Another Client option is Enable Variable Content Alignment. Enabling this option will read block data and
align the blocks to correspond to prior data blocks that have been deduplicated. By aligning the content prior to
performing the hash process, better deduplication ratios may be attained. This will however require more
processing power on the Client. Since Simpana deduplication is content aware, enabling this option will not
provide better deduplication for average file data. This option is only recommended for large file system data
such as database dumps or PST files with low incremental rates of change.

DASH Full
A DASH Full backup is a read optimized synthetic full backup job. A traditional synthetic full backup is designed
to synthesize a full backup by using data from prior backup jobs to generate a new full backup. This method does
not move any data from the production server. Traditionally the synthetic full would read the data back to the
Media Agent and then write the data to new locations on the disk library. With deduplication, when the data is
read to the Media Agent during a synthetic full, signatures are generated and compared in the deduplication
database. Because each block was just read from the library, there will always be a signature match in the
DDB and the data blocks will be discarded. To avoid the read operation altogether, a DASH Full can be used
in place of a traditional synthetic full.

A DASH Full operation will simply update the index files and deduplication database to signify that a full backup
has been performed. No data blocks are actually read from the disk library back to the Media Agent. Once the
DASH Full is complete a new cycle will begin. This DASH Full acts like a normal full and any older cycles
eligible for pruning can be deleted during the next data aging operation.
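
Conceptually, a DASH Full only adds new references in the index and deduplication database for blocks that
already exist in the store. The following sketch illustrates that idea with hypothetical structures; it is not
CommVault code.

    # Conceptual sketch of a DASH Full: a new full backup is synthesized by
    # recording additional references to blocks already in the store, without
    # reading any block data back from the disk library. Structures are
    # illustrative only and do not represent CommVault's implementation.

    def dash_full(prior_cycle_signatures, ddb_reference_counts):
        """Return the new full as a list of signature references."""
        new_full = []
        for signature in prior_cycle_signatures:
            # No block data is read or written; the index and DDB simply gain
            # another reference so the new cycle protects the same blocks.
            ddb_reference_counts[signature] = ddb_reference_counts.get(signature, 0) + 1
            new_full.append(signature)
        return new_full

    references = {"sig-a": 1, "sig-b": 1}
    print(dash_full(["sig-a", "sig-b"], references))   # ['sig-a', 'sig-b']
    print(references)                                  # {'sig-a': 2, 'sig-b': 2}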

The option to enable DASH Full operations is configured in the Advanced tab in the
Deduplication section of the Storage Policy Primary Copy.

Once this option is enabled, schedule data protection jobs to use Synthetic Full backups. Depending on the
configuration in the storage policy settings, either a traditional synthetic full or a DASH Full will be used.

DASH Copy
A DASH Copy is an optimized auxiliary copy operation which only transmits unique blocks from the source
library to the destination library. It can be thought of as an intelligent replication, ideal for consolidating
data from remote sites to a central data center and for copying backups to DR sites. It has several advantages over
traditional replication methods:

 DASH Copies are auxiliary copy operations so they can be scheduled to run at optimal time periods
when network bandwidth is readily available. Traditional replication would replicate data blocks as they
arrive at the source.

 Not all data on the source disk needs to be copied to the target disk. Using the subclient associations of
the secondary copy, only the data required to be copied would be selected. Traditional replication would
require all data on the source to be replicated to the destination.

 Different retention values can be set for each copy. Traditional replication would use the same retention
settings for both the source and target.

 DASH Copy is more resilient in that if the source disk data becomes corrupt the target is still aware of
all data blocks existing on the disk. This means after the source disk is repopulated with data blocks,
duplicate blocks will not be sent to the target, only changed blocks. Traditional replication would require
the entire replication process to start over if the source data became corrupt.

DASH Copy is similar to Client Side Deduplication, except that with DASH Copy both the source and the
destination are Media Agents. This is why Client Side Deduplication and DASH Copy operations are sometimes
referred to as Source Side Deduplication. Once the initial full auxiliary copy is performed, only changed blocks
will be transmitted from that point forward.

DASH Copy has two additional options: Disk Read Optimized Copy and Network Optimized Copy. Again, this
is similar to the Client Side configuration. Disk Read Optimized will transmit the signature hash to the target Media
Agent, which compares the hash against its DDB to determine whether the block needs to be sent. Network Optimized
will use a cache on the source Media Agent to compare the signature and determine whether the hash already exists,
resulting in less network traffic.
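
The sketch below contrasts the two modes under simplified assumptions: in Disk Read Optimized mode every
signature lookup is sent to the target Media Agent, while in Network Optimized mode a cache on the source
Media Agent filters out signatures it has already resolved. The structures and names are illustrative only.

    # Rough sketch contrasting the two DASH Copy options. The structures, and
    # the idea of counting signature lookups that cross the network, are
    # simplifying assumptions for illustration only.

    def dash_copy(source_blocks, target_ddb, target_library, source_cache=None):
        """Copy only blocks the target does not already hold.

        source_cache=None  -> Disk Read Optimized: every signature is sent to
                              the target Media Agent, which checks its own DDB.
        source_cache=set() -> Network Optimized: signatures already resolved on
                              the source Media Agent are filtered out locally.
        """
        lookups_over_network = 0
        for signature, block in source_blocks.items():
            if source_cache is not None and signature in source_cache:
                continue                          # resolved locally, nothing transmitted
            lookups_over_network += 1             # signature comparison happens on the target
            if source_cache is not None:
                source_cache.add(signature)
            if signature not in target_ddb:
                target_ddb.add(signature)
                target_library[signature] = block  # only unique blocks are transmitted
        return lookups_over_network

    target_ddb, target_library, source_cache = set(), {}, set()
    job = {"sig-1": b"x", "sig-2": b"y"}
    print(dash_copy(job, target_ddb, target_library, source_cache))   # 2 lookups on the first pass
    print(dash_copy(job, target_ddb, target_library, source_cache))   # 0 lookups on the repeat pass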

Seeding Deduplicated Disk Libraries


For low bandwidth networks, seeding a disk library can be performed to greatly reduce the data required to be
sent over the network. This is done by temporarily placing a disk library at the source location. This library can be
an external USB drive or regular disk storage. The data can be copied to the temporary disk library and then
relocated to the destination location. These procedures require several detailed steps and it is recommended to
consult with CommVault Professional Services for assistance. Seeding disk libraries can be used for Client Side
Deduplication, DASH Full and DASH Copy operations.

Deduplication Considerations when Using Secondary Tape Copies


Many of the recommendations regarding deduplication and storage policy design focus on scalability and
performance. A key aspect to take into account with this approach is that if multiple storage policies are
being used based on block factor and scalability recommendations, secondary copies to tape will require different
media for each storage policy secondary copy.

When Not to Use Deduplication


The best use of deduplication is when data is redundantly backed up over time or when duplicate data blocks exist
across an infrastructure. In some cases neither of these situations exists. Deduplication may then have a
negative impact: the processing required to generate signatures will not be worth the effort since the data
will not deduplicate well, and the data protection operations will take longer.

Example 1: A large media repository used to frequently edit and recompile videos requires protection. Once the
files are finalized they are written to a separate repository and deleted from the source production location. In this
case there are two primary issues:

1. Media files and other binary based data types do not deduplicate well. The savings for this type of data is
seen when performing full backups over time. Since the files will be deleted from the production
location once they are finalized, subsequent full backups will not provide much disk space savings.

2. Since the media files are being edited the binary data blocks will be constantly changing. This may
greatly reduce the space savings when subsequent full backups of the same data are performed.

In this scenario the processing of data blocks to generate signatures probably will not be worth the deduplication
results. Backing up the data to disk or tape for short term disaster recovery, or using hardware based snapshots
with the Simpana SnapProtect feature, would be a better solution in this case.

Example 2: A database is being regularly protected by performing nightly full backups and transaction log
backups every 15 minutes. Though the database should deduplicate well since it is being protected nightly, the
transaction logs will not. In this scenario making use of a separate log storage policy with a non-deduplicated disk
target would provide better backup and recovery performance. For Microsoft SQL iDataAgents a separate log
storage policy can be configured in the SQL subclient. For other database types an Incremental Storage Policy
can be used. Log storage policies are discussed in more detail in the Additional Storage Policies chapter.

Designing a Scalable Deduplication Solution


The proper implementation of a deduplication solution is essential to provide:

 Efficient use of disk space
 Performance
 Scalability

Recommended Hardware Requirements


To properly design a scalable, high performance deduplication solution there are several hardware requirements
that should be considered. The Media Agent should be a Windows or Linux 64 bit operating system with at least
32 GB of RAM. For the storage location of the deduplication database it is recommended to use locally attached
storage. 10k SCSI drives configured in a RAID 0 provide the best read/write performance. The disks can also be
located on a Fibre attached SAN. iSCSI is NOT recommended as the IP overhead will degrade performance.

It should be noted that the memory requirements for the Media Agent are due to the deduplication processes
requiring significant memory, not the database itself. Though the database may grow to 100+ GB in size, the
deduplication processes will only load specific portions of the database into memory as it is needed. As
deduplication jobs run, the processes will use more memory the longer they are in operation. As a result, the
longer the jobs run the more efficient the overall process will be. Because of this factor, it is recommended that
there are always running jobs requiring deduplication processes during a protection window. If no jobs are
running the processes will terminate, which will require the deduplication processes to restart when new jobs are
run and result in slower performance. Testing has shown that it can take up to an hour for deduplication to
reach its peak performance. This performance aspect is noted in the tables listed in this section. The greater
the amount of data being moved, the higher the throughput will be.

Note: The specified requirements are as of the printing of this book. Please consult with CommVault for updated
deduplication recommendations.

SIDB2 Utility
The SIDB2 utility tool can be used to simulate the operations of a deduplication database. This tool should be
used to test the disk location where the deduplication database will be stored to ensure performance is adequate.
For complete instructions on using this utility refer to CommVault online documentation.

Basic Deduplication Database and Store Sizing


The following charts illustrate various sizing requirements for deduplication databases and deduplication stores.
These charts are based on the CommVault deduplication calculator.

Note: The sizing charts provided here are based on an adequately scaled environment. If CommVault best
practice guidelines are not followed results can be significantly less than what is presented here.

The first chart illustrates database size and maximum store size for protecting 10 TB of
data using a 128 KB block factor.

10 TB Full with 10% incremental change rate

Deduplication assumptions:
  Base full reduction: 60%
  Subsequent full reduction: 95%
  Incremental reduction: 60%

Data Type          Cycle       Cycles     Block    Storage     Dedupe     Max Store   Throughput
                   Frequency   Retained   Factor   Policies    Database   Size        (TB / Hour)
                                                   Required    Size
Database           Weekly      4          128 KB   1           45 GB      96 TB       0.5
File / messages    Weekly      4          128 KB   1           46 GB      96 TB       0.5
Virtual machines   Weekly      4          128 KB   1           42 GB      96 TB       0.5
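
As a rough sanity check of the chart above, the stated reduction rates can be applied directly to the 10 TB example.
The calculation below is a simplification that assumes the reductions behave as simple percentages with weekly
fulls, daily incrementals, and four retained cycles; the actual CommVault deduplication calculator should be used
for real sizing.

    # Back-of-the-envelope store consumption for the 10 TB example above,
    # assuming the listed reduction rates apply as simple percentages.
    # This is a simplification of the CommVault deduplication calculator.

    full_tb                = 10.0
    incremental_change     = 0.10    # 10% daily change rate
    cycles_retained        = 4       # weekly fulls, four cycles kept
    incrementals_per_cycle = 6       # one incremental per remaining day of the week

    base_full_reduction = 0.60
    seq_full_reduction  = 0.95
    incr_reduction      = 0.60

    base_full    = full_tb * (1 - base_full_reduction)                        # first full on disk
    later_fulls  = full_tb * (1 - seq_full_reduction) * (cycles_retained - 1) # deduplicated fulls
    incrementals = (full_tb * incremental_change) * (1 - incr_reduction) \
                   * incrementals_per_cycle * cycles_retained

    print("Estimated store consumption: %.1f TB" % (base_full + later_fulls + incrementals))
    # Roughly 15 TB consumed, well under the 96 TB maximum store size shown in
    # the chart for a single 128 KB deduplication database.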

The second chart illustrates the same deduplication characteristics but with 50 TB of
data. This is to demonstrate the scalability of Simpana deduplication. In this case
separating data into different storage policies provides greater scalability. These results
show three storage policies capable of scaling beyond 150 TB of production data.

50 TB Full with 10% incremental change rate

Deduplication assumptions:
  Base full reduction: 60%
  Subsequent full reduction: 95%
  Incremental reduction: 60%

Data Type          Cycle       Cycles     Block    Storage     Dedupe     Max Store   Throughput
                   Frequency   Retained   Factor   Policies    Database   Size        (TB / Hour)
                                                   Required    Size
Database           Weekly      4          128 KB   1           223 GB     96 TB       2.25
File / messages    Weekly      4          128 KB   1           232 GB     96 TB       2.5
Virtual machines   Weekly      4          128 KB   1           216 GB     96 TB       2.5

Deduplication Database and Store Sizing for Database Data

The following chart illustrates two storage policy designs, one using 128 KB block size
and the other using 256 KB. Note the greater scalability of the deduplication store by
using the higher block size. To manage 100 TB of data using the 128 KB block size, two
storage policies would be required; with 256 KB only one policy is required, with a single
deduplication store scaling to almost 200 TB.

100 TB Full with 10% incremental change rate

Deduplication assumptions:
  Base full reduction: 60%
  Subsequent full reduction: 95%
  Incremental reduction: 60%

Data Type   Cycle       Cycles     Block    Storage     Dedupe     Max Store   Throughput
            Frequency   Retained   Factor   Policies    Database   Size        (TB / Hour)
                                            Required    Size
Database    Weekly      4          128 KB   2           447 GB     96 TB       4.5
Database    Weekly      4          256 KB   1           223 GB     193 TB      4.5

Scaling for Large Media Repositories


The following scenario looks at protecting large data stores containing data that does not deduplicate well. Media
files and other binary data types do not deduplicate well when compared to other data, even within the same data
set. As a result, disk space savings are only realized by performing multiple full backups over time. In this scenario
the primary advantage of using deduplication is not space savings but rather the ability to use DASH Full
backups. Once the initial full backup is complete, only incremental backups will be run from that point forward.
Data will be consolidated on a monthly basis by performing DASH Full backup operations.

The following table illustrates media files being retained for 4 weeks with an initial
backup size of 100 TB and daily incremental change rate of 100 GB. A low base
reduction rate for fulls and incrementals is assumed due to the data type being
protected.

Data Set Type   Full Backup   Incremental       Inc:Full   Total Retained      Base Full   Seq-Full    Incremental
(SP Copy)       Size          Backup Job Size   Backup %   Backup Jobs/Cycle   Reduction   Reduction   Reduction
                                                                               (/job)      (/job)      (/job)
Media Files     100.000 TB    0.100 TB          0.1%       111.1 TB            -10.0%      -97.5%      -10.0%
Media Files     100.000 TB    0.100 TB          0.1%       111.1 TB            -10.0%      -97.5%      -10.0%
Media Files     100.000 TB    0.100 TB          0.1%       111.1 TB            -10.0%      -97.5%      -10.0%

Due to the data type being protected, using small block sizes will not provide additional space savings. By
setting a higher block size, the deduplication database and store can scale significantly higher, providing better
scalability and performance.

The following table illustrates the scaling capabilities of the dedupe store and database
when using various block sizes. In this case the 512 KB block size scales to close to
400 TB while keeping the dedupe database at 32 GB.

Data Set Type   Recovery SLA      Dedupe Segment   Disk Store Size /    Dedupe Store    Store %    DDB Size   DDB Max      Suggested Number of
(SP Copy)       (Weeks of Fast    Size             Retention Period     Ratio / Store   Saved                 Store Size   Storage Policies
                Restore)
Media Files     4 wks             128 KB           99.81 TB             1.1 : 1         -10%       127 GB     96.6 TB      2.0
Media Files     4 wks             256 KB           99.81 TB             1.1 : 1         -10%       63 GB      193.1 TB     1.0
Media Files     4 wks             512 KB           99.81 TB             1.1 : 1         -10%       32 GB      386.2 TB     1.0

Best Practices When Using Deduplication


Note: This section provides guidance based on current CommVault best practices. Check with
CommVault for any updated guidance regarding deduplication best practices.

General Guidelines
 Carefully plan your environment before implementing deduplication policies.
 Factor current protection requirements and future growth into your storage policy design. Scale your
deduplication solution accordingly so the deduplication infrastructure can grow with your environment.
 Once a storage policy has been created the option to use a global dedupe policy cannot be modified.
 When using encryption use dedicated policies for encrypted data and other policies for non-encrypted
data.
 Not all data should be deduplicated. Consider a non-deduplicated policy for certain data types.
 Non-deduplicated data should be stored in a separate disk library. This will ensure accurate
deduplication statistics which can assist in estimating future disk requirements.

Deduplication Database
 Ensure there is adequate disk space for the deduplication database.
 Use dedicated dedupe databases with local disk access on each Media Agent.
 Use high speed SCSI disks in a RAID 0, 5, 10, or 50 configuration.
 Ensure the deduplication database is properly protected.
 Do NOT back up the deduplication database to the same location where the active database resides.

Disk Library Considerations


 It is recommended to use dedicated disk libraries for each Media Agent.
 If using a shared disk library with multiple Media Agents use NAS disk storage as opposed to SAN.
 Disk libraries should be divided into 2-4 TB mount paths.
 Use network paths as opposed to drive letters. Drive letters will limit the total number of mount paths
that can be added.

GridStor Technology Considerations


 For backup and restore performance in large environments, it is not recommended to use GridStor
Round Robin load balancing.
 If you choose to use the GridStor feature for data protection resiliency configure the GridStor feature in
a shared disk library configuration to Failover as opposed to Round Robin.
 Do NOT use GridStor Round Robin option when using a shared disk library in a SAN environment.

Deduplication Store
 Only seal deduplication stores when databases grow too large or when using SILO storage.
 When using SILO storage consider sealing stores at specific time intervals e.g. monthly or quarterly to
consolidate the time period to tape media.
 For WAN backups you can seed active stores to reduce data blocks that must be retransmitted when a
store is sealed. Use the option Use Store Priming option with Source-Side Deduplication to seed new
active stores with data blocks from sealed stores.

Block Size & Block Processing


 Use the recommended 128 KB block size for all object level and virtual machine data protection jobs.
 For large databases use a 256 KB or higher block setting. Consult with Professional Services on the best
data protection approach for very large databases.
 Use compression for object level and virtual machine data protection jobs.
 For database applications that perform their own compression do NOT use CommVault compression.
 Use the Variable Content Alignment option when backing up large database dump files using the
Simpana File System iDataAgent.

Performance
 Use DASH Full backup operations to greatly increase performance for full data protection operations.
 Use DASH Copy for auxiliary copy jobs to greatly increase auxiliary copy performance.
 Ensure the deduplication database is on high speed SCSI disks.
 Ensure Media Agents hosting a dedupe database have enough memory (at least 32 GB).

Global Deduplication
 Global deduplication is not a be-all and end-all solution and should not be used in every case.
 Consider using global dedupe policies as a base for other object level policy copies. This will provide
greater flexibility in defining retention policies when protecting object data.
 Use global deduplication storage policies to consolidate remote office backup data in one location.
 Use this feature when like data types (file data and/or virtual machine data) need to be managed by
different storage policies but stored in the same disk library.

SILO storage
 SILO storage is for long term data preservation and not short term disaster recovery.
 Recovery time will be longer if data is in a tape SILO, so for short term, fast data recovery use traditional
auxiliary copy operations.
