Understanding Data Deduplication
Example
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email
platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space.
With data deduplication, only one instance of the attachment is actually stored; each subsequent
instance is just referenced back to the one saved copy, reducing the storage and bandwidth demand
to only 1 MB.
Technological Classification
The practical benefits of this technology depend upon various factors, chiefly where the
deduplication is performed (at the source or at the target) and when it is performed (inline, as the
data is written, or post-process, after it has been stored).

Target-based deduplication acts on the data after it has been moved to the backup or archive target,
so it requires no changes on the clients but does nothing to reduce the data sent over the network.
Source-based deduplication, on the contrary, acts on the data at the source before it is moved: a
deduplication-aware backup agent is installed on the client, which backs up only unique data. The
result is improved bandwidth and storage utilization, but this imposes additional computational
load on the backup client.

Post-process deduplication acts asynchronously on data that has already been stored; its advantages
and disadvantages are essentially the reverse of those of inline deduplication, which removes
duplicates synchronously as the data arrives.
File vs Sub-file Level Deduplication
The duplicate-removal algorithm can be applied at the full-file or sub-file level. Full-file duplicates
can be eliminated easily by calculating a single checksum of the complete file data and comparing it
against the checksums of already backed-up files. This is simple and fast, but the extent of
deduplication is limited, because it does not address duplicate content found inside different files
or datasets (e.g. emails).
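As a rough sketch of how file-level (single-instance) deduplication can work, the following Python snippet hashes each whole file and stores only one copy per checksum. The store layout and the backup_file helper are illustrative assumptions, not any particular product's implementation.

```python
import hashlib
import os
import shutil

def sha256_of_file(path: str) -> str:
    """Compute a single SHA-256 checksum over the complete file data."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_file(path: str, store_dir: str, index: dict) -> str:
    """Store the file only if its checksum has not been seen before.

    Returns the checksum, which acts as a pointer to the single stored copy.
    """
    checksum = sha256_of_file(path)
    if checksum not in index:
        stored_path = os.path.join(store_dir, checksum)
        shutil.copyfile(path, stored_path)   # first (and only) physical copy
        index[checksum] = stored_path
    return checksum                          # duplicates just reference this
```

A backup catalogue would then map each original file name to the returned checksum, so 100 identical 1 MB attachments consume roughly 1 MB of store space plus 100 small references.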
Sub-file-level deduplication breaks the file into smaller fixed-size or variable-size blocks and then
uses a standard hash-based algorithm to find identical blocks.
For example, the same data may be present at different offsets in two different datasets; in other
words, the block boundaries of identical data may differ. This is very common when some bytes are
inserted into a file: when the changed file is processed again and divided into fixed-length blocks,
every block from the insertion point onward appears to have changed. Therefore, two datasets that
differ only slightly are likely to share very few identical fixed-length blocks.
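The boundary-shift effect is easy to demonstrate. In the sketch below, inserting a single byte near the start of a buffer changes the hash of every fixed-length block, so a fixed-block deduplicator sees the two versions as almost entirely different data; the block size and sample data are made up for illustration.

```python
import hashlib
import os

def fixed_block_hashes(data: bytes, block_size: int = 4096) -> list[str]:
    """Split data into fixed-length blocks and hash each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

original = os.urandom(256 * 1024)                   # 256 KB of sample data
modified = original[:10] + b"X" + original[10:]     # insert one byte near the front

old_hashes = set(fixed_block_hashes(original))
new_hashes = set(fixed_block_hashes(modified))
shared = len(old_hashes & new_hashes)
print(f"blocks unchanged after a 1-byte insert: {shared} of {len(old_hashes)}")
```

On random data like this, the count of unchanged blocks drops to zero even though the two buffers differ by only one byte.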
Variable-Length Data Segment technology divides the data stream into variable length data
segments using a methodology that can find the same block boundaries in different locations and
contexts. This allows the boundaries to “float” within the data stream so that changes in one part of
the dataset have little or no impact on the boundaries in other locations of the dataset.
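One common way to realise such floating boundaries is content-defined chunking with a rolling hash: a boundary is declared wherever a hash of the last few bytes matches a chosen pattern, so the cut points depend on content rather than offset. The sketch below uses a simple byte-sum rolling window and arbitrary window, mask, and size limits purely for illustration; production systems use more robust schemes such as Rabin fingerprinting.

```python
def variable_length_chunks(data: bytes,
                           window: int = 48,
                           mask: int = 0x0FFF,
                           min_size: int = 2048,
                           max_size: int = 16384) -> list[bytes]:
    """Split data at content-defined boundaries.

    A boundary is declared when a rolling sum over the last `window` bytes,
    masked to 12 bits, equals zero (about one candidate boundary per 4 KB of
    random data). Because the decision depends only on local content, bytes
    inserted early in the stream do not shift boundaries downstream: they
    quickly realign with the boundaries of the original data.
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]      # keep the sum over the window
        size = i - start + 1
        at_boundary = size >= min_size and (rolling & mask) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Running two slightly different versions of a file through such a chunker typically yields mostly identical chunks, which is what lets the deduplicator skip the unchanged regions.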
ROI Benefits
Each organization generates data at its own rate. The extent of savings depends upon, but is not
directly proportional to, the number of applications and end users generating the data. Beyond that,
the overall deduplication savings depend largely on how the data is managed and backed up.
If you employ smart choices in backup and data management processes, you might not need data deduplication.
But if you keep all of your inactive and unimportant data on your production storage systems, and use backup
software that forces you to perform repetitive full backups of all that static data, then data deduplication can
provide you with a huge benefit.
The basic idea behind data deduplication is to store just one copy of any data object, and place pointers to the
single copy wherever duplicates are eliminated. Some solutions do this at a file level, so that the files have to be
exactly the same to be deduplicated. This is often called single-instance storage (SIS). Other solutions
deduplicate data at a fixed or variable block length. IBM’s solutions use a blended approach based on the size of
the data—file-based for smaller files, and variable block for larger files.
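A blended policy of that kind can be expressed as a simple dispatcher: small files are deduplicated whole, larger files are cut into variable-length chunks first. The 1 MB threshold and the function names below are illustrative assumptions, not IBM's actual parameters.

```python
import hashlib

SMALL_FILE_LIMIT = 1024 * 1024   # hypothetical threshold: 1 MB

def dedup_units(data: bytes, chunker) -> list[bytes]:
    """Return the units that will be hashed and stored for this file.

    Small files are treated as a single unit (file-level deduplication);
    larger files are split by the supplied variable-length chunker.
    """
    if len(data) <= SMALL_FILE_LIMIT:
        return [data]
    return chunker(data)

def signatures(units: list[bytes]) -> list[str]:
    """Hash each unit; these signatures are what the dedup index compares."""
    return [hashlib.sha256(u).hexdigest() for u in units]
```

Here `chunker` stands for any content-defined chunking function, for example the variable-length sketch shown earlier.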
Most deduplication solutions run a checksum algorithm against the selected data to create a hash signature, then
check to see if that signature has ever been seen before. If it has, the data is discarded and a pointer to the
already stored data is put in its place. A small number of high-end solutions perform a complete byte-level
differential comparison of the data to remove all potential for “data collisions,” where two distinct data blocks may
share the same hash signature.
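A minimal sketch of that lookup, assuming an in-memory index keyed by the hash signature; the optional byte-for-byte comparison shown here stands in for the fuller differential check that some high-end systems perform to rule out hash collisions.

```python
import hashlib

class DedupStore:
    """Store each unique block once, keyed by its hash signature."""

    def __init__(self, verify_bytes: bool = False):
        self.blocks: dict[str, bytes] = {}   # signature -> stored block
        self.verify_bytes = verify_bytes

    def put(self, block: bytes) -> str:
        """Return a signature that acts as a pointer to the stored block."""
        signature = hashlib.sha256(block).hexdigest()
        existing = self.blocks.get(signature)
        if existing is None:
            self.blocks[signature] = block        # first copy: keep the data
        elif self.verify_bytes and existing != block:
            # Extremely unlikely with SHA-256, but a byte-level comparison is
            # how collision-averse systems guarantee correctness.
            raise ValueError("hash collision detected for " + signature)
        return signature                          # duplicates: pointer only

    def get(self, signature: str) -> bytes:
        return self.blocks[signature]
```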
Data deduplication can and does occur at many points in the data creation and management life cycle. In general,
these points of deduplication can be broken into source-side, where the data is created, and target-side, where it
is stored and managed. Backup applications, for example, can perform source-side deduplication by
not re-transferring data that has previously been backed up, saving LAN and WAN bandwidth.
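The bandwidth saving comes from a simple exchange: the client sends only the signatures of the blocks it wants to back up, the server answers with the ones it has never seen, and only those blocks cross the LAN or WAN. The sketch below models that negotiation in-process; the class and method names are invented for illustration and do not correspond to any particular backup product.

```python
import hashlib

def signature(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class BackupServer:
    """Target side: keeps the deduplicated block store."""

    def __init__(self):
        self.store: dict[str, bytes] = {}

    def missing(self, signatures: list[str]) -> set[str]:
        """Report which signatures the server has never seen."""
        return {s for s in signatures if s not in self.store}

    def receive(self, blocks: dict[str, bytes]) -> None:
        self.store.update(blocks)

def source_side_backup(blocks: list[bytes], server: BackupServer) -> int:
    """Send only blocks the server does not already have; return bytes sent."""
    sigs = [signature(b) for b in blocks]
    wanted = server.missing(sigs)
    payload = {s: b for s, b in zip(sigs, blocks) if s in wanted}
    server.receive(payload)
    return sum(len(b) for b in payload.values())   # bandwidth actually used
```

On a second full backup of unchanged data, `wanted` comes back empty and essentially nothing but signatures travels over the network.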
On the target side, the most popular use of deduplication is in virtual tape libraries, or VTLs. These disk-based
systems emulate tape libraries and drives, but apply deduplication to store equivalent amounts of data on disk
very cost-effectively while providing performance advantages over tape. Performing deduplication on tape-based
systems is considered to be a bad idea, given the portable nature of tapes and the need to recycle them over
time; it would be very difficult to guarantee that you maintain the original data for all of the pointers that are out
there.