Understanding Data Deduplication
Example
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email
platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space.
With data deduplication, only one instance of the attachment is actually stored; each subsequent
instance is just referenced back to the one saved copy, reducing the storage and bandwidth demand
to only 1 MB.
Technological Classification
The practical benefits of this technology depend upon various factors, chiefly where the
deduplication is performed (at the source or at the target) and when it is performed (inline, as the
data is written, or post-process, after it has been stored).

Target-based deduplication acts on the data after it has been moved to the backup or archive target,
so it requires no changes on the clients but does nothing to reduce the data sent over the network.
Source-based deduplication, on the contrary, acts on the data at the source before it is moved: a
deduplication-aware backup agent is installed on the client, which backs up only unique data. The
result is improved bandwidth and storage utilization, but this imposes additional computational
load on the backup client.

Post-process deduplication acts asynchronously on data that has already been stored; its advantages
and disadvantages are essentially the reverse of those of inline deduplication, which removes
duplicates synchronously as the data arrives.
File vs Sub-file Level Deduplication
The duplicate-removal algorithm can be applied at the full-file or sub-file level. Full-file duplicates
can be eliminated easily by calculating a single checksum of the complete file data and comparing it
against the checksums of already backed-up files. This is simple and fast, but the extent of
deduplication is limited, because it does not address duplicate content found inside different files
or datasets (e.g. emails).
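As a rough sketch of how file-level (single-instance) deduplication can work, the following Python snippet hashes each whole file and stores only one copy per checksum. The store layout and the backup_file helper are illustrative assumptions, not any particular product's implementation.

```python
import hashlib
import os
import shutil

def sha256_of_file(path: str) -> str:
    """Compute a single SHA-256 checksum over the complete file data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_file(path: str, store_dir: str, index: dict) -> str:
    """Store the file only if its checksum has not been seen before.

    Returns the checksum, which acts as a pointer to the single stored copy.
    """
    checksum = sha256_of_file(path)
    if checksum not in index:
        stored_path = os.path.join(store_dir, checksum)
        shutil.copyfile(path, stored_path)   # first (and only) physical copy
        index[checksum] = stored_path
    return checksum                          # duplicates just reference this
```

A backup catalogue would then map each original file name to the returned checksum, so 100 identical 1 MB attachments consume roughly 1 MB of store space plus 100 small references.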
Sub-file-level deduplication breaks the file into smaller fixed-size or variable-size blocks and then
uses a standard hash-based algorithm to find identical blocks.
For example, the same data may be present at different offsets in two different datasets; in other
words, the block boundaries of identical data may differ. This is very common when some bytes are
inserted into a file: when the changed file is processed again and divided into fixed-length blocks,
every block from the insertion point onward appears to have changed. Therefore, two datasets that
differ only slightly are likely to share very few identical fixed-length blocks.
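The boundary-shift effect is easy to demonstrate. In the sketch below, inserting a single byte near the start of a buffer changes the hash of every fixed-length block, so a fixed-block deduplicator sees the two versions as almost entirely different data; the block size and sample data are made up for illustration.

```python
import hashlib
import os

def fixed_block_hashes(data: bytes, block_size: int = 4096) -> list[str]:
    """Split data into fixed-length blocks and hash each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

original = os.urandom(256 * 1024)                   # 256 KB of sample data
modified = original[:10] + b"X" + original[10:]     # insert one byte near the front

old_hashes = set(fixed_block_hashes(original))
new_hashes = set(fixed_block_hashes(modified))
shared = len(old_hashes & new_hashes)
print(f"blocks unchanged after a 1-byte insert: {shared} of {len(old_hashes)}")
```

On random data like this, the count of unchanged blocks drops to zero even though the two buffers differ by only one byte.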
Variable-Length Data Segment technology divides the data stream into variable length data
segments using a methodology that can find the same block boundaries in different locations and
contexts. This allows the boundaries to “float” within the data stream so that changes in one part of
the dataset have little or no impact on the boundaries in other locations of the dataset.
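One common way to realise such floating boundaries is content-defined chunking with a rolling hash: a boundary is declared wherever a hash of the last few bytes matches a chosen pattern, so the cut points depend on content rather than offset. The sketch below uses a simple byte-sum rolling window and arbitrary window, mask, and size limits purely for illustration; production systems use more robust schemes such as Rabin fingerprinting.

```python
def variable_length_chunks(data: bytes,
                           window: int = 48,
                           mask: int = 0x0FFF,
                           min_size: int = 2048,
                           max_size: int = 16384) -> list[bytes]:
    """Split data at content-defined boundaries.

    A boundary is declared when a rolling sum over the last `window` bytes,
    masked to 12 bits, equals zero (about one candidate boundary per 4 KB of
    random data). Because the decision depends only on local content, bytes
    inserted early in the stream do not shift boundaries downstream: they
    quickly realign with the boundaries of the original data.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]      # keep the sum over the window
        size = i - start + 1
        at_boundary = size >= min_size and (rolling & mask) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Running two slightly different versions of a file through such a chunker typically yields mostly identical chunks, which is what lets the deduplicator skip the unchanged regions.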
ROI Benefits
Each organization generates data at its own rate. The extent of savings depends upon, but is not
directly proportional to, the number of applications and end users generating the data. Beyond that,
the overall deduplication savings depend largely on how the data is managed and backed up.
If you employ smart choices in backup and data management processes, you might not need data deduplication.
But if you keep all of your inactive and unimportant data on your production storage systems, and use backup
software that forces you to perform repetitive full backups of all that static data, then data deduplication can
provide you with a huge benefit.
The basic idea behind data deduplication is to store just one copy of any data object, and place pointers to the
single copy wherever duplicates are eliminated. Some solutions do this at a file level, so that the files have to be
exactly the same to be deduplicated. This is often called single-instance storage (SIS). Other solutions
deduplicate data at a fixed or variable block length. IBM’s solutions use a blended approach based on the size of
the data—file-based for smaller files, and variable block for larger files.
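A blended policy of that kind can be expressed as a simple dispatcher: small files are deduplicated whole, larger files are cut into variable-length chunks first. The 1 MB threshold and the function names below are illustrative assumptions, not IBM's actual parameters.

```python
import hashlib

SMALL_FILE_LIMIT = 1024 * 1024   # hypothetical threshold: 1 MB

def dedup_units(data: bytes, chunker) -> list[bytes]:
    """Return the units that will be hashed and stored for this file.

    Small files are treated as a single unit (file-level deduplication);
    larger files are split by the supplied variable-length chunker.
    """
    if len(data) <= SMALL_FILE_LIMIT:
        return [data]
    return chunker(data)

def signatures(units: list[bytes]) -> list[str]:
    """Hash each unit; these signatures are what the dedup index compares."""
    return [hashlib.sha256(u).hexdigest() for u in units]
```

Here `chunker` stands for any content-defined chunking function, for example the variable-length sketch shown earlier.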
Most deduplication solutions run a checksum algorithm against the selected data to create a hash signature, then
check to see if that signature has ever been seen before. If it has, the data is discarded and a pointer to the
already stored data is put in its place. A small number of high-end solutions perform a complete byte-level
differential comparison of the data to remove all potential for “data collisions,” where two distinct data blocks may
share the same hash signature.
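A minimal sketch of that lookup, assuming an in-memory index keyed by the hash signature; the optional byte-for-byte comparison shown here stands in for the fuller differential check that some high-end systems perform to rule out hash collisions.

```python
import hashlib

class DedupStore:
    """Store each unique block once, keyed by its hash signature."""

    def __init__(self, verify_bytes: bool = False):
        self.blocks: dict[str, bytes] = {}   # signature -> stored block
        self.verify_bytes = verify_bytes

    def put(self, block: bytes) -> str:
        """Return a signature that acts as a pointer to the stored block."""
        signature = hashlib.sha256(block).hexdigest()
        existing = self.blocks.get(signature)
        if existing is None:
            self.blocks[signature] = block        # first copy: keep the data
        elif self.verify_bytes and existing != block:
            # Extremely unlikely with SHA-256, but a byte-level comparison is
            # how collision-averse systems guarantee correctness.
            raise ValueError("hash collision detected for " + signature)
        return signature                          # duplicates: pointer only

    def get(self, signature: str) -> bytes:
        return self.blocks[signature]
```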
Data deduplication can and does occur at many points in the data creation and management life cycle. In general,
these points of deduplication can be broken into source-side, where the data is created, and target-side, where it
is stored and managed. Backup applications, for example, can perform source-side deduplication by
not re-transferring data that has previously been backed up, saving LAN and WAN bandwidth.
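The bandwidth saving comes from a simple exchange: the client sends only the signatures of the blocks it wants to back up, the server answers with the ones it has never seen, and only those blocks cross the LAN or WAN. The sketch below models that negotiation in-process; the class and method names are invented for illustration and do not correspond to any particular backup product.

```python
import hashlib

def signature(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class BackupServer:
    """Target side: keeps the deduplicated block store."""

    def __init__(self):
        self.store: dict[str, bytes] = {}

    def missing(self, signatures: list[str]) -> set[str]:
        """Report which signatures the server has never seen."""
        return {s for s in signatures if s not in self.store}

    def receive(self, blocks: dict[str, bytes]) -> None:
        self.store.update(blocks)

def source_side_backup(blocks: list[bytes], server: BackupServer) -> int:
    """Send only blocks the server does not already have; return bytes sent."""
    sigs = [signature(b) for b in blocks]
    wanted = server.missing(sigs)
    payload = {s: b for s, b in zip(sigs, blocks) if s in wanted}
    server.receive(payload)
    return sum(len(b) for b in payload.values())   # bandwidth actually used
```

On a second full backup of unchanged data, `wanted` comes back empty and essentially nothing but signatures travels over the network.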
On the target side, the most popular use of deduplication is in virtual tape libraries, or VTLs. These disk-based
systems emulate tape libraries and drives, but apply deduplication to store equivalent amounts of data on disk
very cost-effectively while providing performance advantages over tape. Performing deduplication on tape-based
systems is considered to be a bad idea, given the portable nature of tapes and the need to recycle them over
time; it would be very difficult to guarantee that you maintain the original data for all of the pointers that are out
there.