Douglis

This document proposes and evaluates an approach called delta-encoding via resemblance detection (DERD) to reduce storage and transmission costs of files, web pages, and emails. DERD identifies pairs of objects in a collection that have overlapping content so their differences can be compressed as deltas, without needing an a priori relationship between the objects. The evaluation finds DERD can improve on simple compression by up to 2x depending on workload, and a small fraction of objects account for large savings. Parameters like the amount of feature overlap required for an effective delta influence the results.

Uploaded by

Reid Lindsay

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views14 pages

Douglis

Uploaded by

Reid Lindsay

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Application-specific Delta-encoding via Resemblance Detection

Fred Douglis Arun Iyengar

IBM T. J. Watson Research Center IBM T. J. Watson Research Center
Hawthorne, NY 10532 Hawthorne, NY 10532
[email protected] [email protected]

Abstract via a slow link, techniques which minimize bandwidth

Many objects, such as files, electronic messages, and required for updates are highly desirable. However, in
web pages, contain overlapping content. Numerous past each of these environments, it is not always possible to
research projects have observed that one can compress identify an appropriate base version to take advantage of
one object relative to another one by computing the delta-encoding.
differences between the two, but these delta-encoding Our work therefore addresses a domain in which there
systems have almost invariably required knowledge of are very many objects with arbitrary overlap among dif-
a specific relationship between them—most commonly, ferent pairs of objects, and the relationships between
two versions using the same name at different points in these pairs are not known a priori. If one can identify
time. We consider cases in which this relationship is de- which pairs are suitable candidates, delta-encoding can
termined dynamically, by efficiently determining when reduce the size of one relative to another, thereby reduc-
a sufficient resemblance exists between two objects in a ing storage or transmission costs in exchange for com-
relatively large collection. We look at specific examples putation. We consider several application domains for
of this technique, namely web pages, email, and files this technique, which we refer to as delta-encoding via
in a file system, and evaluate the potential data reduc- resemblance detection, or DERD: web traffic, email, and
tion and the factors that influence this reduction. We files in a file system.
find that delta-encoding using this resemblance detec- We defer additional discussion of our research until
tion technique can improve on simple compression by after a more detailed discussion of delta-encoding and
up to a factor of two, depending on workload, and that resemblance detection, which appears in the following
a small fraction of objects can potentially account for a subsection. After that, the next section describes the
large portion of these savings. framework of our analysis in greater detail, including
the metrics we consider. Section 3 presents the various
1 Introduction datasets we used. Section 4 describes the experiments,
Delta-encoding is the act of compressing a data ob- and Section 5 provides the results of these experiments.
ject, such as a file or web page, relative to another ob- Section 6 discusses the resource usage issues that would
ject [1, 13]. Usually there is a temporal relationship be- arise in a practical implementation of DERD. Section 7
tween the two objects: the latter object exists, and when surveys related work, and Section 8 summarizes and de-
it is subsequently modified, the changes can be repre- scribes possible future work.
sented in a small fraction of the size of the entire object.
There is often also a naming relationship between the
1.1 Background
objects, since a modified file can have the same name It is difficult to describe our approach without provid-
as the original copy. In these cases, identifying the base ing a general overview of both delta-encoding and re-
version against which to compute a delta is straightfor- semblance detection. We cover enough of each of these
ward. areas here to set the stage for combining the two, then
Delta-encoding is particularly attractive for situations return to a more comprehensive comparison with related
where information is being updated across a network work toward the end of the paper.
with limited bandwidth. For example, web sites are of- Deltas are useful for reducing resource requirements,
ten replicated both for higher performance and availabil- and existing applications of deltas generally fall into two
ity. The bandwidth between the replicas can be lim- categories: storage and networking. For storage, when
ited. Another example would be replicated mail systems. one already stores a base version of a file, subsequent
Electronic mail systems often allow clients to replicate versions can be represented by changes. This lowers
copies of mail messages locally. Clients may be con- storage demands within file systems (the Revision Con-
nected to the network via phone lines with limited band- trol System (RCS) [25] is a longstanding example of
width. For an email client connected to a mail server this), backup-restore systems [1], and similar environ-
ments. the amount of overlap among features necessary to
Over a network, transmitting data that are already
known to the recipient can be avoided. The most com- get a sufficiently small delta,
the number of files with similar overlap necessary
mon approach in this case is to work from a common
base version known to the sender and recipient, com- to get close to the “best” delta,
selection of delta-encoding algorithms and param-
pute the delta, and transmit it. This technique has been
applied to web traffic [16], IP-level network commu- eters to those algorithms,
whether delta-encoding the contents of specially
nication [24], and other domains. An extension to the formatted files such as Zip files in an application-
traditional web delta-encoding approach is to select the
base version by finding similar, rather than identical, specific method is beneficial,
and other metrics.
URLs [7].
What if one wishes to find a similar file based on con- 1.3 Summary of Results
tent rather than name, among a large collection of files? We have found that the benefits of application-specific
Manber devised a method for extracting features of files deltas vary depending on the mix of content types. For
based on their contents, in order to find files with over- example, HTML and email messages display a great
lapping content efficiently [14]. He computed hashes of deal of redundancy across large datasets, resulting in
overlapping sequences of bytes (also known as “shin- deltas that are significantly smaller than simply com-
gles”), then looked for how many of these hashes were pressing the data, while mail attachments are often dom-
shared by different files. Manber indicated that cluster- inated by non-textual data that do not lend themselves
ing similar files for improved compression would be an to the technique. A few large files can contribute much
application of this technique. Broder used a similar ap- of the total savings if they are particularly amenable to
proach but used a deterministic sampling of the hash val- delta-encoding. Application-specific techniques, such as
ues to dramatically reduce the amount of data needed delta-encoding an unzipped version of a zip or gzip file
for each file [5, 6]. With his approach, a subset of fea- and then zipping the result, can significantly improve
tures of a file is used to represent the file, and if two results for a particular file, but unless an entire dataset
files share many of those features in common, there is consists of such files, overall results improve by just a
a high probability of significant content in common as couple of percent.
well. A common use for this technique is to suppress Numerous parameters can be varied in assessing the
near-duplicates in search engine results [6], and varia- benefits of deltas in this context, and we have evalu-
tions of the technique have been used in link-level dupli- ated several. The results do not appear to be sensitive
cate suppression [24] and file systems [8, 17, 20]. to the size of shingles or the delta-encoding algorithm,
Because the shingling technique has seen so much use within reason. The extent of the match of the number
in the systems community of late, we refrain from pro- of features is a good predictor of the delta size. Perhaps
viding a detailed description of it. Briefly, it uses Ra- most importantly, when multiple files match the same
bin fingerprints [21] to compute a hash of consecutive number of features, there is minimal difference between
bytes; the key properties of Rabin fingerprints are that the best delta—the smallest delta obtained across all the
they are efficient to compute over a sliding window, and files—and the average delta. The latter two results sug-
they are uniformly distributed over all possible values. gest that while it is beneficial to determine the file(s)
Thus, Broder’s approach of selecting the fingerprints with the maximal number of matching features, only one
with the smallest values effectively selects “random” delta need be computed. This is crucial because find-
features in a deterministic fashion, and two documents ing matching features, given a precomputed database of
with many features in common overall would hopefully the features of other files and the dynamically computed
have many of these features in common. feature set of the file being delta-encoded, is far more
efficient than computing an actual delta.
1.2 Goals
As Manber suggested, one can use the features of doc- 2 Framework
uments to identify when files overlap and then delta- This section describes our approach to the problem of
encode pairs of overlapping files to save space or band- delta-encoding with resemblance detection in greater de-
width. One goal of this work was to assess whether this tail. We discuss the types of data we considered and the
technique is generally applicable, and if not, to identify way in which we evaluate the potential benefits of DERD.
some specific instances in which it is applicable. A sec-
ond goal was to evaluate a number of the parameters 2.1 Types of Data
used in this process, such as:
the size of a shingle,
In the past, delta-encoding has been used for many types
of data in numerous environments. Our interest has fo-
cused on data that are located “together,” meaning that ing each object to remove internal redundancy is trivial.
they belong to a single user, or they reside on a single We analyzed several datasets: the contents of /usr on
server. Earlier work has demonstrated the potential ben- a Redhat Linux 7.1 PC, totaling nearly 2 Gbytes of data;
efits of deltas when the same object is modified over the contents of a user’s MH mail repository, with each
time, whereas we consider different objects that exist at message stored in a separate file (possibly including one
the same time. Thus far, we have analyzed web data or more MIME attachments) totaling 566 Mbytes of
(primarily HTML), email, and a file system. data; and the contents of several users’ Lotus Notes mail,
In a Research Report [10] coauthored with Kiem- with message bodies and attachments separated into dis-
Phong Vo of AT&T Labs, we previously argued that one tinct files. Section 3 describes the datasets in detail.
could use Broder’s technique for efficiently selecting
features of objects to determine dynamically a suitable 2.2 Evaluation Metrics and Practical Con-
candidate to serve as the base for HTTP delta-encoding. siderations
This would be an extension to the proposed standard de- As noted above, size reduction is the crucial determin-
scribed in a recent RFC [15]. The report described a pos- ing factor for the success of our proposal. This reduc-
sible protocol but gave no statistics to support the utility tion must be considered not only relative to the origi-
of the idea in practice. In the case of individual web nal content, but relative to the size of the content using
clients, objects must be large enough to justify the added traditional compression tools such as gzip. Considering
overheads of transmitting their features, comparing the that reconstructing the original requires the reference file
features on a client, possibly computing a new delta- to be available, one might favor a compressed version
encoding on the fly in response to the client’s request, over a delta-encoded version if the former is marginally
and reconstructing the page on the client. Beyond that larger.
proposal, similarity among different web pages could be Furthermore, the effect of the reduction is dependent
used for efficient distribution of new pages to caches in a on the environment:
content distribution network (CDN), or other replicas; in If an individual file is encoded, either as a delta
this case, by transmitting many pages at once, overheads
or simple compression, and then stored on disk or
could be minimized. We have estimated the best-case
some other block-based medium, the gain is not ex-
benefits for a web-based DERD system, by downloading
actly the number of bytes by which the file is re-
numerous pages from several sites at a single point in
duced. Instead, it is a function of the number of
time, and then comparing each page against the others.
blocks taken up by the file before and after en-
In practice, not all the other pages would be cached by
coding. For instance, if every file is rounded to
an individual client, though they might be cached by a
the nearest 4-Kbyte boundary, then shrinking a file
CDN if they are not completely dynamic.
from 4097 bytes to 4095 bytes actually saves 1
In parallel with assessing the overlap of content on
block, i.e. 4096 bytes. More typically, a file might
real web sites, we identified the overlap of content in
be encoded but still use the same number of blocks
email and other local file system content as an appro-
priate application domain. At any instant, all the files on disk.
Similarly, reducing traffic over a network has low
are available, so in theory any file could be represented
marginal benefits if the same number of packets is
as a delta from one or more other files. As new files
used; however, if the number of round-trips in com-
are created, they could be encoded against all earlier
munication can be decreased, the improvement in
stored files, especially a previous version of the same file
should it exist. If a “live” file system uses this approach, response time is more significant.
If many files are encoded together, such as a full
it must use techniques such as copy-on-write and ref-
backup or web server replication, then the benefits
erence counting to ensure that the base version against
are more directly related to the actual per-file gains,
which a delta was computed is not modified or deleted
since rounding effects are amortized over the entire
until the delta itself is no longer needed. The same ap-
dataset.
proach could be used to efficiently back up a file sys-
tem: rather than delta-encoding updates in an incremen- There are other evaluation metrics of interest, includ-
tal backup, the entire file system would be compressed ing:
by identifying where similarity exists. Computation There are overheads due to computing
None of these techniques would be useful without sig- the features for each file, comparing the features
nificant reduction in file sizes, so the primary focus of of the candidate and stored files, and encoding a
this study is to evaluate those reductions. Like the ear- delta once a base version is selected. Since there
lier study of deltas in HTTP [16], we consider regular has been extensive research in making both delta-
compression as a basis for comparison, since compress- encoding [1] and resemblance detection [5, 6] ef-
ficient even in enormous datasets such as Inter- pressed version of the file. For ex-
net search engines, and because our prototype is ample, we made a copy of the Redhat
geared toward assessing space reduction benefits 7.1 /usr/share/dict/words (409,276
rather than speed, we do not report timings in this bytes, 45,424 one-word lines) and changed
paper. However, we discuss performance issues in line six from abandon to xyzzy. We
general terms in Section 6. call the copy words1. Both words and
Space overheads A system that is selecting a base ver- words1 generated gzipped files of about
sion given a set of features must be able to com- 131 Kbytes, with a difference of just four
pare those features to a large set of existing files. bytes in size. Encoding the differences
The overhead per file may be from 50-800 bytes de- between the uncompressed words1 and
pending on how much information is stored, which words, using vcdiff, represented the differ-
in turn affects the quality of the comparison [6]. ences in just 79 bytes. In stark contrast, delta-
Execution parameters There are a number of run-time encoding words1.gz against words.gz
parameters that can affect the performance and/or generated about 93 Kbytes.
effectiveness of the system. We consider the fol- Therefore, delta-encoding two compressed
lowing: files by encoding their uncompressed versions
Size and number of features Shingling a file cre- and compressing the result (if needed) has the
ates an enormous number of fingerprints, potential for significant gains. Since zip can
or features, representing sequences of data. store an arbitrarily large number of files and
Broder’s technique selects a “small” number directories as a single compressed file, com-
of them, where “small” is parameterizable [5]. paring its contents individually and zip-ing
We evaluated the sensitivity of the results to the results into a single zip file can have simi-
this parameter. We also can require a minimal lar benefits. One might assume that tar need
fraction of features to match before comput- not be handled specially, since it concatenates
ing a delta, to see if the poorer matches still its input without compression. We find below
demonstrate benefits. Finally, the number of that this hypothesis is incorrect for the three
bytes used to create a single feature can vary. delta-encoding programs we tried. For all
Best matches If multiple files match the same these datatypes, however, the overall effects
number of features, an exhaustive computa- depend on the mix of data: in practice, the
tion could determine which base file produces number and size of compressed files that can
the smallest delta. In fact, a file matching benefit from this approach may be dwarfed by
fewer features could produce a smaller delta all the other data.
than one matching more features. However, Delta-encoding algorithm and parameters
in practice, one would want to consider as There are a few possible delta-encoding
few base versions as possible. While it was programs. We did not find significant dif-
not possible to perform an exhaustive search ferences in output sizes among the available
within large datasets, we sampled several files programs; therefore, following the approach
with an equal number of matching features to of delta-encoding in HTTP [16], we report
determine whether there is a significant vari- numbers using Korn and Vo’s vcdiff [13].
ance among candidate base files. Delta-encoding versus compression We vary a
There is also an interaction between the parameter that specifies how much smaller a
number of features and the quality of the delta must be than simply compressing a file
match. If more features are compared, then before the delta is used. If no delta is small
different base files can be distinguished more enough, of the files used as potential base
finely, possibly resulting in a smaller delta. versions, the compressed version is used in-
Lastly, some files may produce particu- stead. We use vcdiff for compression (delta-
larly large savings relative to an entire dataset, encoding a file against /dev/null), due
while others may contribute relatively little. to historical reasons. Its data reduction is
Assuming files are sorted by the savings from comparable to gzip, though typically slightly
encoding them, we analyze how many files worse.
need be delta-encoded to produce a given Identical files When an identical file appears mul-
fraction of the total benefit. tiple times in a dataset, it can be trivially en-
Unzip-Rezip A small change to a file can re- coded against another instance through the
sult in significant differences in a com- use of hash functions such as MD5. Past stud-
ies have investigated the prevalence of mirrors 3.2 File Data
on the web [4] and techniques for suppressing
We used two types of file data, which are summarized
duplicate payloads [12]. We chose to suppress
in Table 2. First, we scanned the entire /usr direc-
duplicates from consideration in our analysis,
tory in a nearly unmodified Redhat Linux 7.1 distribu-
since they are trivially handled through other
tion, totaling just under 2 Gbytes of data in over 100K
means, except when a file contained in a zip
files. Second, we examined email from several users
archive is duplicated (since two zip files may
and in several formats. Much of our analysis used over
have many identical files and some changed
500 Mbytes of one user’s UNIX-based email, which is
content, and our unzip-rezip procedure would
stored individually in separate files by the MH mail sys-
match up the identical files).
tem. The remaining data came from Lotus Notes, which
stores message bodies and attachments as separate ob-
3 Datasets jects in a flat-file document database. We studied the
We separate our analyses into two types of data: web attachments of five users and the message bodies of two.
pages and files in a file system. We lump email into the
latter category, since in general we expect the benefits 4 Experiments
to be greater for static encoding (space reduction) than As described in Section 2.2, we varied a number of pa-
network transmission. Note that not all the datasets we rameters in the delta-encoding and resemblance detec-
analyzed are discussed further in this paper, but we in- tion process. Our general goals were to determine how
clude them in the tables to give a sense of the variability much more data could be eliminated by using deltas
of the results. rather than just compression, and how sensitive that re-
sult would be to this set of parameters. In particular, we
3.1 Web Data wanted to estimate the minimal work a system might do
Ideally, to analyze the benefits of DERD for the web, to get a reasonable benefit (i.e., the point of diminishing
one would study a live implementation over an extended returns).
time, and/or use full content traces to simulate an imple- In general, we fixed the parameters to a common set.
mentation. The latter approach was used effectively to We then varied each parameter to evaluate its effect.
study delta-encoding based on identical URLs [16], but Table 3 lists these parameters, with a brief description
such traces are difficult to obtain. of each one, the default value in boldface, and other
Instead, we used the w3get program to download a tested parameters. The parameters are clustered into two
small set of root web pages, and recursively the pages sets: the first controls the pass over the data to compute
linked from them, up to two levels. We specifically ex- the features, and the second controls the comparison of
cluded file suffixes that suggested image data, such as those features and computation of the deltas.
JPG and GIF, focusing instead on the base pages. This is In some cases, due to space constraints, we do not
partly because delta-encoding has already been demon- present additional details about variations in parameters
strated to be ineffective across two different image files, that did not significantly affect results; these are denoted
even having the same name [16], and partly because im- by italic text. Additional descriptions of many of the
ages change more slowly than HTML [9] and are more parameters were given above in Section 2.2. Note that
likely to be cached in the first place. min features ratio is special, in that it is possi-
While periodic downloads of specific web pages have ble to compute the savings for each number of matching
been used in the past to evaluate delta-encoding [13], features and then compute a cumulative benefit for each
cross-page comparisons require a single snapshot of a number of matches in a later stage, as demonstrated in
large number of pages. We believe these pages, and the Section 5.1.
results obtained from them, demonstrate a high degree
of overlap in content between pages on the same site; 4.1 Implementation Details
this has been observed in other research due to the high Most of the work to encode differences based on simi-
use of “templates” for creating dynamic pages [3, 23]. larity is performed by a pair of Perl scripts. One of these
Table 1 lists the sites accessed, all between 24-26 July recursively descends over a set of directories and invokes
2002, with the number of pages and total size. Note a Java program to compute the features. Each computa-
that in the case of Yahoo!, the download was aborted af- tion is a separate invocation of Java, though that could be
ter about 27 Mbytes were downloaded, as that offered optimized. Once a file’s features have been computed,
sufficient data to perform an analysis, and it was un- they are cached in a separate file.
clear how much additional data would be retrieved if left The other script takes the precomputed set of file-
unchecked. names and features, and for each file determines which
Name Files From... Files Size (Mbytes) Delta% Comp%
Yahoo yahoo.com 3,755 27.55 8 34
IBM ibm.com 177 3.21 19 36
Masters masters.com 192 3.19 9 35
CNN cnn.com 73 2.53 15 29
Wimbledon wimbledon.com 190 2.40 10 35

Table 1: Web datasets evaluated. Delta and compression percentages refer to the size of the encoded dataset relative to the
original.

Files Size
Name Files From... Delta% Comp%
Included Excluded (Mbytes)
/usr /usr 102,932 1,250 1,964.16 36 45
MH one user’s MH directory 87,005 565.69 34 54
User1 Bod User 1’s Notes mail bodies 3,097 5.97 29 60
User1 Att User 1’s Notes mail attach. 189 81.29 71 75
User2 Bod User 2’s Notes mail bodies 445 1.18 42 56
User2 Att User 2’s Notes mail attach. 1,078 417.35 32 37
User3 Att User 3’s Notes mail attach. 140 36.18 52 61
User4 Att User 4’s Notes mail attach. 1,982 991.45 53 66

Table 2: File datasets evaluated. Excluded files are explained in the text. Delta and compression percentages refer to the size of
the encoded dataset relative to the original.

other files have the maximum number of matching fea- diff compression (delta-encoding against /dev/null,
tures. Currently this is done by identifying which fea- comparable to gzip). Table 4 is an example of this out-
tures a file has, and incrementing counters for all other put. The rows at the top show dissimilar files, where
files with a given feature in common, using the value deltas made no difference, while the rows at the bottom
of the feature as a hash key. This records the most had the greatest similarity and the smallest deltas. The

features in common at any point, . After all fea- BestDelta and AvgDelta columns show that, in general,
tures are processed, any files that have at least one fea- there was at most a 1% difference in size (relative to the
ture in common are sorted by the number of matching original file) between the best of up to ten matching files
features. Typically, only the files that match exactly and the average of all ten. This characteristic was com-
features are considered as base versions, up to the mon to all the datasets. Correspondingly, in all the fig-
max comparisons parameter, but if the best matches ures, the curves for the savings for delta-encoding depict
fail to produce a small enough delta, poorer matches are the average cases.
considered until the maximum is reached. There are
methods to optimize this comparison by precomputing There are two apparent anomalies in Table 4 worth
the overlap of files, as well as through estimation [22], noting. First, there is a substantial jump in size at
which we intend to integrate at a later date. the complete 30/30 features match, despite a consistent
number of files, showing a much higher average file size.
Delta-encoding is performed by one of a set of pro-
This is skewed by a large number of nearly identical
grams, all written in C. Once a pair of files has been so
files, resulting from form letters attaching manuscripts
encoded, the size of the output is cached. Occasionally,
for review; if each manuscript was sent to three persons
the delta-encoding program might generate a delta that
and the features in the large common data were all se-
is larger than the compressed file, or even larger than the
lected by the minimization process, they all match in
original file. In those cases, the minimum of the other
every feature. (This is a desirable behavior, but may
values is used.
not be typical of all datasets.) Second, the files with
For a given dataset, the results are reported by list- 0–2 out of 30 features matching have a dramatically
ing how many files have a maximum features match worse compression ratio than the other data. We be-
for a given number of features, with statistics aggre- lieve these are attributable to types of data that neither
gated over those files: the original size, the size of the match other files to a great extent nor exhibit particu-
delta-encoded output, and the size of the output using vc- larly good compressibility from internally repeated text
Processing Parameter Description Values
Stage
Number of bytes in a fingerprinted
shingle size 20, 30
shingle
num features Number of features compared 30, 100
Minimum size of an individual file to
min size 128, 512 bytes
Preprocessing include in statistics
Should zip files be unzipped before yes, no
unzip
comparison
Should gz files be unzipped before yes, no
gunzip
comparison
Whether encoding A against B pre-
static files web=no, files=yes
cludes encoding B against A
program Program to perform delta-encoding vcdiff
Whether to compare against all files, no, yes
exhaustive search
or just best matches
Maximum number of files to compare
Encoding max comparisons against, with equal maximal matching 10, 1, 5
features
What fraction of features must match 0-1 (cumulative
min features ratio
to compute a delta? distribution)
What is the maximum size of a delta,
improvement 25%, 50%, 75%,
relative to simple compression, for it
threshold 100%
to be used?

Table 3: Parameters evaluated. Boldface represents defaults, and italics represent evaluated cases not reported here.

Matches Files Size (Mbytes) BestDelta (%) AvgDelta (%) Compressed (%)
0 230 4.37 65 65 65
1 2634 95.09 64 65 65
2 3308 63.87 58 58 60
3 3927 30.86 39 40 45
4 4284 32.53 31 32 39
5 4710 22.86 35 36 46
...
27 294 2.85 4 4 46
28 227 3.09 2 2 44
29 174 9.39 0 0 43
30 224 91.38 0 0 48
All 87005 565.69 34 34 54

Table 4: Delta-encoding and compression results for the MH directory. Percentages are relative to original size, e.g. 34% means
deltas save about two-thirds of the original size. Boldfaced numbers are explained in the text. This table corresponds to the
graphical results in Figure 1.

strings. MIME-encoded compressed data would have files, we create a special ZIPDIR directory, into which
this attribute, when the same compressed file does not the contents are unzipped before features are calculated.
appear in multiple messages. We assume there are no additional benefits to compres-
sion, since zip has already taken care of that. For deltas,
To analyze the benefits of unzipping files, encoding we delta-encode each file in this directory, storing the
them, and zipping the results, we take two approaches. results in a second temporary directory, and then zip the
Zip files can contain entire directory hierarchies, while results. For gzip files, we gunzip the files, compute
gzip files compress just one file. Therefore, for zip
the features, and discard the uncompressed output. Each tained using a particular technique such as compression
time we delta-encode a gzipped file, either as the refer-
or delta-encoding is if all files with at least maxi-
ence or the version, we uncompress it on the fly (the mal matching features are encoded. For instance, the Y-
most recent uncompressed version file is then cached value of the point on the Compressed curve with X-value
and reused for each encoding). Section 5.4 discusses the 15 is the percent of the total data size obtained if all files
added benefits of these two approaches. matching at least one other file in at least 15 features are
In some cases, the features for all the files in a single compressed. Figure 1(b) shows that the most benefit is
dataset, with other run-time state, resulted in a virtual derived from including all files, even with zero matches,
memory image that exceeded the 512 Mbytes of physical although in those cases these benefits come from com-
memory on the machine performing the comparisons— pression rather than deltas—recall that the size of a delta
this is an artifact of our Perl-based prototype, and not is never larger than delta-encoding it against the empty
inherent to the methodology, as evidenced by the scale file, i.e., compressing it.
of the search engines that use resemblance detection to Figure 2(a) shows the cumulative benefits of deltas
suppress duplicates [6]. For the usr and MH datasets, we and compression for two of the static datasets: usr, and
preprocessed the data to separate them into manageable the MH data. Figure 2(b) does the same for two of the
subdirectories, then merged the results. This would re- web datasets, IBM and Yahoo. Both graphs are limited
sult in files in different partitions not being compared: to two datasets in order to avoid cluttering them with
for example, a file in Mail/conferences would many overlapping lines, but the bottom-line savings for
not be compared against a file in Mail/projects. the other datasets were reported in Table 2 and Table 1,
In general, spatial locality would suggest that the best respectively. In each, the different datasets show differ-
matches for a file in Mail/conferences would be ent benefits, due to the amount of data being compared
found in Mail/conferences. (We subsequently and the nature of the contents. Specifically, the graphs
validated this theory by rerunning the script on all MH have very different shapes because many more files in
directories at once, using a more capable machine, with the web datasets have high degrees of overlap.
no significant difference in the overall benefits.) Also,
since partitions were based on subdirectories of a single 5.2 Contributions of Large Files
root such as /usr, it also would result in some partitions The graphs presented thus far have emphasized the effect
having too few files to perform meaningful comparisons; of statistics such as the number of features that match.
we skipped any subdirectories with fewer than 100 files, Another consideration is the skew in the savings: do a
resulting in a small fraction of files being omitted (listed small number of files contribute most of the benefits of
in Table 2). delta-encoding? In the case of the MH dataset, such a
skew was suggested by the statistics in Table 4, which
5 Results showed 91 of the 566 Mbytes matching in all 30 features
Here we present our analyses. We start with overall ben- and delta-encoding to virtually nothing.
efits for different types of data, then describe how vary- We visualize an answer to this question by consider-
ing certain parameters impacts the results. ing every file in a particular dataset, sorting by the most
bytes saved for any delta obtained for it, and plotting the
5.1 Overall Benefits cumulative distribution of the savings as a function of
Our overall goal is to reduce file sizes and to evaluate the original files. Figure 3(a) plots the cumulative sav-
how sensitive this reduction is to different data types, ings of the MH dataset (as a fraction of the original data)
the amount of effort expended, and other considerations. against the fraction of files used to produce those sav-
Table 4 gives a sense of these results, in tabular form, ings or the fraction of bytes in those files. In each case
for a dataset that is particularly conducive to this ap- the savings for DERD and strict compression are shown
proach; Figure 1 shows the same data graphically. Fig- as separate curves. Finally, points are plotted on a log-
ure 1(a) plots compressed sizes and delta-encoded sizes, log scale to emphasize the differences at small values,
as well as the original total file sizes, against the num- and note that the Comp by byte% curve starts at just
ber of matching features. For each possible number of
over 2% on the -axis.
matching features from 0–30, we plot the total data of The results for this dataset clearly show significant
files having that number of matching features as their skew. For example, for deltas, 1% of the files account
maximum match. As we expected, the more features for 38% of the total 65% saved; encoding 25% of the
match, the smaller the delta size. The cumulative ef- bytes will save 22% of the data. Compression also shows
fect is shown in Figure 1(b). In this graph (as well as some skew, since some files are extremely compressible.
several subsequent ones with the same label on the X- If one compressed the best files containing 25% of the

axis), a point ( , ) shows that the total data size ob- bytes, one would save 17% of the data. This degree of
MH Sizes by Feature MH Cumulative Sizes, 30 features
100 100

Relative Size (%)

Total Size (MB)

10
60

40
1
Original 20
Compressed Compressed
DERD DERD
0.1 0
0 5 10 15 20 25 30 0 5 10 15 20 25 30
Maximum Number of Matching Features Matching Features (M >= X)

(a) Total data sizes for the original dataset, using compression, (b) Cumulative benefits. The y-axis shows the relative size, in
and using DERD , for individual numbers of matching features. percent, of compressing or delta-encoding each file. A point on
Most of the data match very few features in any other file, or the x-axis shows the benefit from performing this on all files that
match all the features. The y-axis is on a log scale. match at least that many features.

Figure 1: Effect of matching features, for the MH data. These figures graphically depict the the data in Table 4.

Cumulative Sizes, multiple static datasets Cumulative Sizes, multiple web datasets
100 100
IBM comp
IBM DERD
Yahoo comp
Yahoo DERD
80 80
Relative Size (%)

Relative Size (%)

60 60

40 40

/usr comp
20 /usr DERD 20
MH comp
MH DERD
0 0
0 5 10 15 20 25 30 0 5 10 15 20 25 30
Matching Features (M >= X) Matching Features (M >= X)

(a) Static datasets. (b) Web datasets.

Figure 2: Effect of matching features, cumulative, for several datasets.

skew suggests that heuristics for intelligently selecting imum blocksize, typical for many UNIX systems with
a subset of potential delta-encoded pairs, or compressed fragmented file blocks, reduces the total possible ben-
files, could be quite beneficial. efit of delta-encoding from around 66% (assuming no
rounding) to 61%, but a 4-Kbyte blocksize brings the
5.3 Effects of File Blocking benefit down to 40% since so many messages are smaller
Section 2.2 referred to an impact on size reduction from than 4 Kbytes.
rounding to fixed block sizes. In some workloads, such
as file backups, this is a non-issue, but in others it can
5.4 Handling Compressed and Tarred
have a moderate impact for small blocks and a substan- Files
tial impact for large ones. Section 2.2 provided a justification for comparing the
Figure 3(b) shows how varying the blocksize affects uncompressed versions of zip and gzip files, as well as
overall savings for the MH dataset. Like Figure 3(a), it a hypothesis that tar files would not need special treat-
plots the cumulative savings sorted by contribution, but ment. For some workloads this is irrelevant, since for
it accounts for block rounding effects. A 1-Kbyte min- example the MH repository stored all messages with full
Cumulative Savings by Contribution Cumulative DERD Savings by Contribution and Block Size
100 100
90
80
70
10

Savings (%)

Savings (%)
60
50
40
1
30
DERD by File%
Comp by File% 20 No Blocking
DERD by Byte% 10 1Kbyte Blocks
Comp by Byte% 4Kbyte Blocks
0.1
0.01 0.1 1 10 100 10 20 30 40 50 60 70 80 90 100
Percent of original files/bytes to obtain savings Percent of original bytes to obtain savings

(a) Relative savings as a function of cumulative files, by count (b) Relative savings assuming no file blocking, or rounding to
or by bytes. Plotted on a log-log scale. 1-Kbyte or 4-Kbyte units.

Figure 3: Cumulative savings from MH files, sorted in order of contribution to total savings.

bodies, uncompressed. An attachment might contain used. There are reasons why this might not be desir-
MIME-encoded compressed files, but these would be able, such as a web server using a cached compressed
part of the single file being examined, and one would version rather than computing a specialized delta for a
have to be more sophisticated about extracting these at- given request. As another example, consider a file sys-
tachments. In fact, there was no single workload in our tem backup that would require both a base file and a
study with large numbers of both zip and gzip files, and delta to be retrieved before producing a saved file: if
overall benefits from including this feature were only 1- the compressed version were 25% larger than the delta,
2% of the original data size in any dataset. For example, it would consume that extra storage, but restoring the file
the User4 Attach workload, which had the most zip would involve retrieving 125% of the delta’s size rather
files, only saved an additional 2% over the case without than the delta and a base version that would undoubtedly
special handling. Even though the zip files themselves be much larger than that 25%.
were reduced by about a third, overall storage was dom- We varied the threshold for using a delta to be 25–
inated by other file types. 100% of the compressed size, in increments of 25%.
We expected directly delta-encoding one tar file Figure 4 shows the result of this experiment on the MH
against a similar tar file to generate a small delta if in- dataset. There is a dramatic increase in the relative size
dividual files had much overlap, but this was not the of the delta-encoded data at higher numbers of match-
case in some limited experiments. Vcdiff generated a ing features, because in some cases, there is no longer a
delta about the size of the original gzipped tar file, and usable match at a given level. The most interesting met-
two other delta programs used within IBM performed ric is the overall savings if all files are included, since
similarly. We tried a sample test, using two email tar that no longer suffers from this shift; the relative size in-
file attachments unpacked into two directories, and then creases from about 35% to about 45% as the threshold is
using DERD to encode all files in the two directories. reduced.
We selected the delta-encoded and compressed sizes of
the individual files in the smaller of the tar files, and 5.6 Shingle Size
found delta-encoding saved 85% of the bytes, compared
Unlike some of the other parameters, the choice of shin-
to 71% for simple compression of individual files and
gle size—within reason—seems to have minimal effect
79% when the entire tar file was compressed as a whole.
on overall performance. As an example, Figure 5 shows
Depending on how this extends to an entire workload,
how the size reduction varies when using shingle sizes
just as with zip and gzip, these savings may not justify
of 20 versus 30 bytes. If all files are encoded, even for
the added effort.
minimal matches, the total size reduction is about the
same. If a higher value of min features ratio is
5.5 Deltas versus Compression
used, the 20-byte shingles produce smaller deltas for the
By default, our experiments assumed that if a delta is same threshold within a reasonable range (10-15 of 30
at all smaller than just using compression, the delta is features matching).
MH DERD by Feature and Threshold
the “traditional” method [6]. For one such meta-feature
100 to match, all of some subset of the regular features must
match exactly, suggesting a higher degree of overlap
80 than we felt would be appropriate for DERD.
Relative Size (%)

60 6 Resource Usage
A system using our techniques to efficiently delta en-
40
Minimum size relative
to compression
code files and web documents could compute features
20
25% for objects when it first becomes aware of them. The
50%
75% cost for determining features is not that high, and it
100%
0 could be amortized over time. The system could also be
0 5 10 15 20 25 30
tuned to perform delta-encoding when space is the criti-
Matching Features (M >= X)
cal resource and to store things in a conventional manner
Figure 4: Effect of limiting the use of deltas to a fraction of when CPU resources are the bottleneck.
the compressed file, for the MH dataset. Using 30 features of 4 bytes apiece, the space over-
head per file is around 120 bytes. For large files, this is
insignificant. Once the features for a file have been de-
100
MH DERD by Feature and Shingle Size termined, it requires operations to determine the
maximum number of matching features with existing
files where is the total number of files. However, to
80
get a reasonably good number of matching features, it
Relative Size (%)

60
is not always necessary to examine features for all of the
existing files. A reasonable number of matching features
40 can often be determined by only examining a fraction of
the objects when the number of objects is large. That
20 way, the number of comparisons needed for performing
30 bytes/shingle efficient delta-encoding can be bounded.
20 bytes/shingle
0 Delta-encoding itself has been made extremely effi-
0 5 10 15 20 25 30
Matching Features (M >= X) cient [1], and it should not usually be a bottleneck except
in extremely high-bandwidth environments. Early work
Figure 5: Effect of varying the shingle size between 20 and demonstrated its feasibility on wireless networks [11]
30 bytes, for the MH dataset. and showed that processors an order of magnitude
slower than current machines could support deltas over
HTTP over network speeds up to about T3 speeds [16].
5.7 Number of Features
More recent systems like rsync [26] and LBFS [17], and
The number of features used for comparisons represents the inclusion of the Ajtai delta-encoding work in a com-
a tradeoff between accuracy of resemblance detection mercial backup system, also support the argument that
and computation and storage overheads. In the extreme DERD will not be limited by the delta-encoding band-
case, one could use Manber’s approach of computing width.
and comparing every feature, and have an excellent es-
timate of the overlap between any two files. The other 7 Related Work
extreme is to use no resemblance detection at all or have Mogul, et al., analyzed the potential benefits of compres-
just a handful of features. Since we have found a fair sion and delta-encoding in the context of HTTP [16].
amount of discrimination using our default of 30 fea- They found that delta-encoding could dramatically re-
tures, we have not considered fewer features than that, duce network traffic in cases where a client and server
but we did compute the savings for the MH dataset from shared a past version of a web page, termed a “delta-
using 100 features instead of 30. The results were virtu- eligible” response. When a delta was available, it re-
ally indistinguishable in the two cases—leading to the duced network bandwidth requirements by about an or-
conclusion that 30 features are preferable, due to the der of magnitude. However, in the traces evaluated in
lower costs of storing and comparing a given number that study, responses were delta-eligible only a small
of features. fraction of the time: 10% in one trace and 30% in the
Broder has described a way to store the features even other, but the one with 30% excluded binary data such
more compactly, such as 48 bytes per file, by treating the as images. On the other hand, most resources were com-
features as aggregates of multiple features computed in pressible, and they estimated that compressing those re-
sources dynamically would still offer significant savings tire payload is identical to an earlier payload [12], or
in bandwidth and end-to-end transfer times—factors of when a particular region of a file has not changed. Exam-
2-3 improvement in size were typical. ples of system taking this approach include rsync [26], a
Later, Chan and Woo devised a method to increase the popular protocol for remote file copying, and the Low-
frequency of delta-eligible responses by comparing re- bandwidth File System (LBFS) [17]. However, there are
sources to other cached resources with similar URLs [7]. applications for which identifying an appropriate base
Their assumption was that resources “near” each other version is difficult and the available redundancy is ig-
on a server would have pieces in common, something nored. For instance, LBFS exploits similarities not only
they then validated experimentally. They also described between different versions of the same file but across
an algorithm for comparing a file against several other files. To identify similar files, it hashes the contents
files, rather than the one-on-one comparison typically of blocks of data, where a block boundary is (usu-
performed in this context. However, they did not ex- ally) defined by a subset of features—like the Spring &
plain how a server would select the particular related Wetherall approach, except that the features determine
resources in practice, assuming that it has no specific block boundaries rather than indices for the data being
knowledge of a client’s cache. We believe there is an im- compared. Variable block boundaries allow a change
plicit assumption that this approach is in fact limited to within one block not to affect neighboring blocks. (The
“personal proxies” with exact knowledge of the client’s Venti archival system [20] and the Pastiche peer-to-peer
cache [11, 2], in which case it has limited applicability. backup system [8] are two more recent examples of the
Ouyang, et al., similarly clustered related web pages use of content-defined blocks to identify duplicate con-
by URL, and tried to select the best base version for tent; we use LBFS here as the “canonical” example of
a given cluster by computing deltas from a small sam- the technique.)
ple [18]. While they were not focused on a caching Similarly, it is not always possible to ensure that both
context, and are more similar to the general applications sides of a network connection share a single common
described herein, they did not initially use the more ef- base version. Rsync allows the two communicating par-
ficient resemblance detection methods of Manber and ties to ascertain dynamically which blocks of a file are
Broder to best select the base versions. Subsequently, already contained in a version of the file on the receiving
they applied resemblance detection techniques to scale side.
the technique to larger collections [19]. This work, LBFS and rsync are well suited to compressing large
roughly concurrent with our own, is similar in its gen- files with long sequences of unchanged bytes, but if the
eral approach. However, the largest dataset they ana- granularity of change is finer than their block bound-
lyzed was just over 20,000 web pages, and they did not aries, they get no benefit. Most delta-encoding algo-
consider other types of data such as email. Another pos- rithms remove redundancy if it is large enough to amor-
sibly significant distinction is that they used shingle sizes tize the overhead of the pointers and other meta-data that
of only 4 bytes, whereas we used 20-30 bytes. (We did identify the redundancy. A resemblance detection pro-
not obtain this paper in time to repeat our analyses with cedure should therefore be suited to the delta-encoding
such a small shingle size.) algorithm, and the size and contents of the data. Our
Spring and Weatherall [24] essentially generalized work demonstrates that fine-grained deltas work well in
Chan and Woo’s work by applying it to all data sent a variety of environments, but a head-to-head compari-
over a specific communication channel, and using re- son with LBFS and rsync in these environments will help
semblance detection to detect duplicate sequences in a determine which approach is best in which context.
collection of data. This was done by computing finger-
prints of shingles, selecting those with a predetermined 8 Conclusions and Future Work
number of zeroes in the low-order bits (deterministically Delta-encoding has been used in a number of applica-
selecting a fraction of features), and scanning before and tions, but it has been limited to two general contexts: en-
after the matching shingle to find the longest duplicate coding a file against an earlier version of the same file, or
data sequence. Like Chan and Woo’s work, this sys- encoding against other files (or data blocks) where both
tem worked only with a close coupling between clients sides of a communication channel have a consistent view
and servers, so both sides would know what redundant of the cached data. We have generalized this approach in
data existed in the client. In addition, the communica- the web context to use features of web content to iden-
tion channel approach requires a separate cache of pack- tify appropriate base versions, and quantified the poten-
ets exchanged in the past, which may compete with the tial reductions in transfer sizes of such a system. We
browser cache and other applications for resources. have also extended Manber’s use of this technique on a
In some cases, the suppression of redundancy is at a single server [14], and quantified potential benefits in a
very coarse level, for instance identifying when an en- general file system and specific to email.
For web content, we have found substantial overlap is a LotusScript guru extraordinaire. Ziv Bar-Yossef,
among pages on a single site. This is consistent with Sridhar Rajagopalan, and Lakshmish Ramaswamy pro-
Chan and Woo [7], Ouyang, et al. [19], and recent work vided code for computing features. Several people have
on automatic detection of common fragments within permitted us to analyze their data, including Lisa Amini,
pages [23]. For the five web datasets we considered, Frank Eskesen and Andy Walter. Ramesh Agarwal, An-
deltas reduced the total size of the dataset to 8–19% of drei Broder, Ron Fagin, Chris Howson, Ray Jennings,
the original data, compared to 29–36% using compres- Jason LaVoie, Srini Seshan, John Tracey, and Andrew
sion. For files and email, there was much more variabil- Tridgell have provided helpful comments on some of the
ity, and the overall benefits are not as dramatic, but they ideas presented in this paper and/or earlier drafts of this
are significant: two of the largest datasets reduced the paper. Finally, we thank the anonymous reviewers and
overall storage needs by 10–20% beyond compression. our shepherd, Darrell Long, for their advice and feed-
There was significant skew in at least one dataset, with back.
a small fraction of files accounting for a large portion of
the savings. Factors such as shingle size and the number References
of features compared do not dramatically affect these re-
[1] M. Ajtai, R. Burns, R. Fagin, D. Long, and
sults. Given a particular number of maximal matching
L. Stockmeyer. Compactly encoding unstructured
features, there is not a wide variation across base files in
input with differential compression. Journal of the
the size of the resulting deltas.
ACM, 49(3):318–367, May 2002.
A new file will often be created by making a small
number of changes to an older file; the new file may [2] Gaurav Banga, Fred Douglis, and Michael Rabi-
even have the same name as the old file. In these cases, novich. Optimistic deltas for WWW latency re-
the new file can often be delta-encoded from the old file duction. In Proceedings of 1997 USENIX Techni-
with minimal overhead. For the most part, our datasets cal Conference, pages 289–303, January 1997.
did not consider these scenarios. For situations where [3] Ziv Bar-Yossef and Sridhar Rajagopalan. Template
this type of update is prevalent, the benefits from delta- detection via data mining and its applications. In
encoding are likely to be higher. Proceedings of the Eleventh International Confer-
Now that we have demonstrated the potential savings ence on World Wide Web, pages 580–591. ACM
of DERD, in the abstract, we would like to implement Press, 2002.
underlying systems using this technology. The smaller
deltas for web data suggest that an obvious approach is [4] K. Bharat and A. Broder. Mirror, mirror on the
to integrate DERD into a web server and/or cache, and web: A study of host pairs with replicated content.
then use a live system over time. However, supporting In Proceedings of the 8th International World Wide
resemblance-based deltas in HTTP involves extra over- Web Conference, pages 501–512, May 1999.
heads and protocol support [10] that do not affect other [5] Andrei Z. Broder. On the resemblance and con-
applications such as backups. We are also interested in tainment of documents. In Compression and Com-
methods to reduce storage and network costs in email plexity of Sequences (SEQUENCES’97), 1997.
systems, and hope to implement our approach in com-
monly used mail platforms. As the system scales to [6] Andrei Z. Broder. Identifying and filtering near-
larger datasets, we can add heuristics for more efficient duplicate documents. In Combinatorial Pattern
resemblance detection and feature computation. We Matching, 11th Annual Symposium, pages 1–10,
can also evaluate additional application-specific meth- June 2000.
ods, such as encoding individual elements of tar files, [7] Mun Choon Chan and Thomas Y. C. Woo. Cache-
and compare the various delta-based approaches against based compaction: A new technique for optimizing
other systems such as LBFS and rsync in greater depth. web transfer. In Proceedings of Infocom’99, pages
117–125, 1999.
Acknowledgments
[8] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche:
Kiem-Phong Vo jointly developed the idea of web-based
Making backup cheap and easy. In Proceedings
DERD , resulting in a research report [10] from which
of the 5th Symposium on Operating Systems De-
a small amount of the text in this manuscript has been
sign and Implementation (OSDI’02), pages 285–
taken. Andrei Broder has been extremely helpful in
298. USENIX, December 2002.
understanding the intricacies of resemblance detection,
Randal Burns and Kiem-Phong Vo have similarly been [9] Fred Douglis, Anja Feldmann, Balachander Krish-
helpful in providing and helping us to understand their namurthy, and Jeffrey Mogul. Rate of change and
delta-encoding software packages, and Laurence Marks other metrics: a live study of the World Wide Web.
In Proceedings of the Symposium on Internet Tech- Research in Computing Technology, Harvard Uni-
nologies and Systems, pages 147–158. USENIX, versity, 1981.
December 1997. [22] Sridhar Rajagopalan, 2002. Personal Communica-
[10] Fred Douglis, Arun K. Iyengar, and Kiem-Phong tion.
Vo. Dynamic suppression of similarity in the web: [23] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu,
a case for deployable detection mechanisms. Tech- and Fred Douglis. Techniques for efficient de-
nical Report RC22514, IBM Research, July 2002. tection of fragments in web pages. Manuscript,
[11] Barron C. Housel and David B. Lindquist. Web- November 2002.
Express: A system for optimizing Web browsing
[24] Neil T. Spring and David Wetherall. A protocol-
in a wireless environment. In Proceedings of the
independent technique for eliminating redundant
Second Annual International Conference on Mo-
network traffic. In Proceedings of ACM SIG-
bile Computing and Networking, pages 108–116.
COMM, August 2000.
ACM, November 1996.
[25] W. Tichy. RCS: a system for version control.
[12] Terence Kelly and Jeffrey Mogul. Aliasing on the
Software—Practice & Experience, 15(7):637–654,
World Wide Web: Prevalence and Performance
July 1985.
Implications. In Proceedings of the 11th Interna-
tional World Wide Web Conference, May 2002. [26] Andrew Tridgell. Efficient Algorithms for Sorting
and Synchronization. PhD thesis, Australian Na-
[13] David G. Korn and Kiem-Phong Vo. Engineering a
tional University, 1999.
differencing and compression data format. In Pro-
ceedings of the 2002 Usenix Conference. USENIX
Association, June 2002.
[14] U. Manber. Finding similar files in a large file sys-
tem. In Proceedings of the USENIX Winter 1994
Technical Conference, pages 1–10, January 1994.
[15] J. Mogul, B. Krishnamurthy, F. Douglis, A. Feld-
mann, Y. Goland, A. van Hoff, and D. Hellerstein.
Delta encoding in HTTP, January 2002. RFC 3229.
[16] Jeffrey Mogul, Fred Douglis, Anja Feldmann, and
Balachander Krishnamurthy. Potential benefits of
delta-encoding and data compression for HTTP. In
Proceedings of ACM SIGCOMM’97 Conference,
pages 181–194, September 1997.
[17] Athicha Muthitacharoen, Benjie Chen, and David
Mazieres. A low-bandwidth network file system.
In Symposium on Operating Systems Principles,
pages 174–187, 2001.
[18] Zan Ouyang, Nasir Memon, and Torsten Suel. Us-
ing delta encoding for compressing related web
pages. In Data Compression Conference, page
507, March 2001. Poster.
[19] Zan Ouyang, Nasir Memon, Torsten Suel, and
Dimitre Trendafilov. Cluster-based delta compres-
sion of a collection of files. In International Con-
ference on Web Information Systems Engineering
(WISE), December 2002.
[20] S. Quinlan and S. Dorward. Venti: a new approach
to archival storage. In Proceedings of the First
USENIX Conference on File and Storage Tech-
nologies, Monterey,CA, 2002.
[21] Michael O. Rabin. Fingerprinting by random poly-
nomials. Technical Report TR-15-81, Center for