Table 1: Web datasets evaluated. Delta and compression percentages refer to the size of the encoded dataset relative to the
original.
Name        From...                      Files      Files      Size       Delta%  Comp%
                                         Included   Excluded   (Mbytes)
/usr        /usr                         102,932    1,250      1,964.16   36      45
MH          one user’s MH directory      87,005                565.69     34      54
User1 Bod   User 1’s Notes mail bodies   3,097                 5.97       29      60
User1 Att   User 1’s Notes mail attach.  189                   81.29      71      75
User2 Bod   User 2’s Notes mail bodies   445                   1.18       42      56
User2 Att   User 2’s Notes mail attach.  1,078                 417.35     32      37
User3 Att   User 3’s Notes mail attach.  140                   36.18      52      61
User4 Att   User 4’s Notes mail attach.  1,982                 991.45     53      66
Table 2: File datasets evaluated. Excluded files are explained in the text. Delta and compression percentages refer to the size of the encoded dataset relative to the original.
other files have the maximum number of matching features. Currently this is done by identifying which features a file has, and incrementing counters for all other files with a given feature in common, using the value of the feature as a hash key. This records the largest number of features any file has in common with the target at any point. After all features are processed, any files that have at least one feature in common are sorted by the number of matching features. Typically, only the files that match exactly this maximum number of features are considered as base versions, up to the max comparisons parameter, but if the best matches fail to produce a small enough delta, poorer matches are considered until the maximum is reached. There are methods to optimize this comparison by precomputing the overlap of files, as well as through estimation [22], which we intend to integrate at a later date.
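As an illustration of this counting step, the following minimal Python sketch (not the Perl prototype itself; the file names, feature values, and the max comparisons cutoff are placeholders) ranks candidate base versions by the number of shared features:

    from collections import defaultdict

    def rank_candidates(features_by_file, target, max_comparisons=10):
        """Count, for every other file, how many features it shares with `target`,
        using each feature value as a hash key, then return the best candidates."""
        # Invert: feature value -> files containing that feature.
        files_by_feature = defaultdict(set)
        for name, feats in features_by_file.items():
            for f in feats:
                files_by_feature[f].add(name)

        # Increment a per-file counter for every feature shared with the target.
        shared = defaultdict(int)
        for f in features_by_file[target]:
            for other in files_by_feature[f]:
                if other != target:
                    shared[other] += 1

        # Sort by number of matching features; keep only the top candidates,
        # mirroring the max comparisons parameter of Table 3.
        ranked = sorted(shared.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:max_comparisons]

    # Hypothetical example: features are already-computed fingerprint values.
    files = {"a.txt": {1, 2, 3, 4}, "b.txt": {2, 3, 4, 9}, "c.txt": {7, 8}}
    print(rank_candidates(files, "a.txt"))   # [('b.txt', 3)]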
Delta-encoding is performed by one of a set of programs, all written in C. Once a pair of files has been so encoded, the size of the output is cached. Occasionally, the delta-encoding program might generate a delta that is larger than the compressed file, or even larger than the original file. In those cases, the minimum of the other values is used.

For a given dataset, the results are reported by listing how many files have a maximum features match for a given number of features, with statistics aggregated over those files: the original size, the size of the delta-encoded output, and the size of the output using vcdiff compression (delta-encoding against /dev/null, comparable to gzip). Table 4 is an example of this output. The rows at the top show dissimilar files, where deltas made no difference, while the rows at the bottom had the greatest similarity and the smallest deltas. The BestDelta and AvgDelta columns show that, in general, there was at most a 1% difference in size (relative to the original file) between the best of up to ten matching files and the average of all ten. This characteristic was common to all the datasets. Correspondingly, in all the figures, the curves for the savings for delta-encoding depict the average cases.

There are two apparent anomalies in Table 4 worth noting. First, there is a substantial jump in size at the complete 30/30 features match, despite a consistent number of files, showing a much higher average file size. This is skewed by a large number of nearly identical files, resulting from form letters attaching manuscripts for review; if each manuscript was sent to three persons and the features in the large common data were all selected by the minimization process, they all match in every feature. (This is a desirable behavior, but may not be typical of all datasets.) Second, the files with 0–2 out of 30 features matching have a dramatically worse compression ratio than the other data. We believe these are attributable to types of data that neither match other files to a great extent nor exhibit particularly good compressibility from internally repeated text strings. MIME-encoded compressed data would have this attribute, when the same compressed file does not appear in multiple messages.
Processing     Parameter              Description                                         Values
Stage
Preprocessing  shingle size           Number of bytes in a fingerprinted shingle          20, 30
               num features           Number of features compared                         30, 100
               min size               Minimum size of an individual file to include       128, 512 bytes
                                      in statistics
               unzip                  Should zip files be unzipped before comparison      yes, no
               gunzip                 Should gz files be unzipped before comparison       yes, no
Encoding       static files           Whether encoding A against B precludes              web=no, files=yes
                                      encoding B against A
               program                Program to perform delta-encoding                   vcdiff
               exhaustive search      Whether to compare against all files, or just       no, yes
                                      best matches
               max comparisons        Maximum number of files to compare against,         10, 1, 5
                                      with equal maximal matching features
               min features ratio     What fraction of features must match to             0-1 (cumulative
                                      compute a delta?                                    distribution)
               improvement threshold  What is the maximum size of a delta, relative       25%, 50%, 75%, 100%
                                      to simple compression, for it to be used?
Table 3: Parameters evaluated. Boldface represents defaults, and italics represent evaluated cases not reported here.
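For concreteness, the shingle size and num features parameters can be read in terms of a feature computation along the following lines. This is only a hedged sketch: it uses a salted SHA-1 and a simple minimum-value selection as stand-ins for the Rabin-fingerprint and Broder-style feature selection the paper builds on, so the details are illustrative rather than the authors' actual scheme.

    import hashlib

    def compute_features(data: bytes, shingle_size: int = 30, num_features: int = 30):
        """Illustrative feature computation: fingerprint every overlapping shingle
        and keep, for each of num_features salted hashes, the smallest value seen.
        A simplified stand-in for the paper's fingerprinting scheme, not real code."""
        shingles = {data[i:i + shingle_size]
                    for i in range(max(len(data) - shingle_size + 1, 1))}
        features = []
        for k in range(num_features):
            # Salt the hash differently per feature to emulate independent selections.
            features.append(min(
                int.from_bytes(hashlib.sha1(bytes([k]) + s).digest()[:4], "big")
                for s in shingles))
        return features

    print(compute_features(b"the quick brown fox jumps over the lazy dog"))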
Matches Files Size (Mbytes) BestDelta (%) AvgDelta (%) Compressed (%)
0 230 4.37 65 65 65
1 2634 95.09 64 65 65
2 3308 63.87 58 58 60
3 3927 30.86 39 40 45
4 4284 32.53 31 32 39
5 4710 22.86 35 36 46
...
27 294 2.85 4 4 46
28 227 3.09 2 2 44
29 174 9.39 0 0 43
30 224 91.38 0 0 48
All 87005 565.69 34 34 54
Table 4: Delta-encoding and compression results for the MH directory. Percentages are relative to original size, e.g. 34% means
deltas save about two-thirds of the original size. Boldfaced numbers are explained in the text. This table corresponds to the
graphical results in Figure 1.
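The aggregation behind a table like this can be sketched as follows (field names are hypothetical; per the methodology above, a delta larger than the compressed or original file is never counted):

    from collections import defaultdict

    def summarize(results):
        """results: list of dicts with keys 'matches', 'original', 'best_delta',
        'avg_delta', 'compressed' (sizes in bytes). Groups files by their maximum
        number of matching features and reports percentages as in Table 4."""
        buckets = defaultdict(lambda: {"files": 0, "orig": 0, "best": 0, "avg": 0, "comp": 0})
        for r in results:
            b = buckets[r["matches"]]
            b["files"] += 1
            b["orig"] += r["original"]
            # A delta larger than the compressed or original file is never used.
            b["best"] += min(r["best_delta"], r["compressed"], r["original"])
            b["avg"] += min(r["avg_delta"], r["compressed"], r["original"])
            b["comp"] += min(r["compressed"], r["original"])
        for m in sorted(buckets):
            b = buckets[m]
            print(m, b["files"], b["orig"],
                  round(100 * b["best"] / b["orig"]),
                  round(100 * b["avg"] / b["orig"]),
                  round(100 * b["comp"] / b["orig"]))

    summarize([{"matches": 30, "original": 1000, "best_delta": 5,
                "avg_delta": 8, "compressed": 480}])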
To analyze the benefits of unzipping files, encoding them, and zipping the results, we take two approaches. Zip files can contain entire directory hierarchies, while gzip files compress just one file. Therefore, for zip files, we create a special ZIPDIR directory, into which the contents are unzipped before features are calculated. We assume there are no additional benefits to compression, since zip has already taken care of that. For deltas, we delta-encode each file in this directory, storing the results in a second temporary directory, and then zip the results. For gzip files, we gunzip the files, compute the features, and discard the uncompressed output. Each time we delta-encode a gzipped file, either as the reference or the version, we uncompress it on the fly (the most recent uncompressed version file is then cached and reused for each encoding). Section 5.4 discusses the added benefits of these two approaches.
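A minimal sketch of these two mechanisms, using Python's standard zipfile and gzip modules (the ZIPDIR name comes from the text; everything else, including the helper names, is our illustration):

    import gzip
    import tempfile
    import zipfile
    from pathlib import Path

    def unzip_for_features(zip_path: str) -> Path:
        """Expand a zip archive into a scratch ZIPDIR so features (and later deltas)
        are computed on the member files rather than on the compressed archive."""
        zipdir = Path(tempfile.mkdtemp(prefix="ZIPDIR_"))
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(zipdir)
        return zipdir

    def gunzip_for_features(gz_path: str) -> bytes:
        """Uncompress a gzip file on the fly; the caller computes features on the
        returned bytes and discards (or caches) the uncompressed data."""
        with gzip.open(gz_path, "rb") as f:
            return f.read()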
In some cases, the features for all the files in a single dataset, with other run-time state, resulted in a virtual memory image that exceeded the 512 Mbytes of physical memory on the machine performing the comparisons—this is an artifact of our Perl-based prototype, and not inherent to the methodology, as evidenced by the scale of the search engines that use resemblance detection to suppress duplicates [6]. For the usr and MH datasets, we preprocessed the data to separate them into manageable subdirectories, then merged the results. This would result in files in different partitions not being compared: for example, a file in Mail/conferences would not be compared against a file in Mail/projects. In general, spatial locality would suggest that the best matches for a file in Mail/conferences would be found in Mail/conferences. (We subsequently validated this theory by rerunning the script on all MH directories at once, using a more capable machine, with no significant difference in the overall benefits.) Also, since partitions were based on subdirectories of a single root such as /usr, this approach would result in some partitions having too few files to perform meaningful comparisons; we skipped any subdirectories with fewer than 100 files, resulting in a small fraction of files being omitted (listed in Table 2).

5 Results

Here we present our analyses. We start with overall benefits for different types of data, then describe how varying certain parameters impacts the results.

5.1 Overall Benefits

Our overall goal is to reduce file sizes and to evaluate how sensitive this reduction is to different data types, the amount of effort expended, and other considerations. Table 4 gives a sense of these results, in tabular form, for a dataset that is particularly conducive to this approach; Figure 1 shows the same data graphically. Figure 1(a) plots compressed sizes and delta-encoded sizes, as well as the original total file sizes, against the number of matching features. For each possible number of matching features from 0–30, we plot the total data of files having that number of matching features as their maximum match. As we expected, the more features match, the smaller the delta size. The cumulative effect is shown in Figure 1(b). In this graph (as well as several subsequent ones with the same label on the X-axis), a point (x, y) shows that the total data size obtained using a particular technique such as compression or delta-encoding is y if all files with at least x maximal matching features are encoded. For instance, the Y-value of the point on the Compressed curve with X-value 15 is the percent of the total data size obtained if all files matching at least one other file in at least 15 features are compressed. Figure 1(b) shows that the most benefit is derived from including all files, even with zero matches, although in those cases these benefits come from compression rather than deltas—recall that the size of a delta is never larger than delta-encoding it against the empty file, i.e., compressing it.

Figure 2(a) shows the cumulative benefits of deltas and compression for two of the static datasets: usr, and the MH data. Figure 2(b) does the same for two of the web datasets, IBM and Yahoo. Both graphs are limited to two datasets in order to avoid cluttering them with many overlapping lines, but the bottom-line savings for the other datasets were reported in Table 2 and Table 1, respectively. In each, the different datasets show different benefits, due to the amount of data being compared and the nature of the contents. Specifically, the graphs have very different shapes because many more files in the web datasets have high degrees of overlap.

5.2 Contributions of Large Files

The graphs presented thus far have emphasized the effect of statistics such as the number of features that match. Another consideration is the skew in the savings: do a small number of files contribute most of the benefits of delta-encoding? In the case of the MH dataset, such a skew was suggested by the statistics in Table 4, which showed 91 of the 566 Mbytes matching in all 30 features and delta-encoding to virtually nothing.

We visualize an answer to this question by considering every file in a particular dataset, sorting by the most bytes saved for any delta obtained for it, and plotting the cumulative distribution of the savings as a function of the original files. Figure 3(a) plots the cumulative savings of the MH dataset (as a fraction of the original data) against the fraction of files used to produce those savings or the fraction of bytes in those files. In each case the savings for DERD and strict compression are shown as separate curves. Finally, points are plotted on a log-log scale to emphasize the differences at small values, and note that the Comp by byte% curve starts at just over 2% on the x-axis.

The results for this dataset clearly show significant skew. For example, for deltas, 1% of the files account for 38% of the total 65% saved; encoding 25% of the bytes will save 22% of the data. Compression also shows some skew, since some files are extremely compressible. If one compressed the best files containing 25% of the bytes, one would save 17% of the data. This degree of skew suggests that heuristics for intelligently selecting a subset of potential delta-encoded pairs, or compressed files, could be quite beneficial.
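The skew curves of Figure 3(a) follow from a simple sort over per-file savings; a sketch (the input sizes here are hypothetical):

    def cumulative_savings(files):
        """files: list of (original_size, encoded_size) pairs, where encoded_size is
        the best delta (or compressed) size for that file. Returns points of the
        cumulative-savings curve, ordered by how much each file contributes."""
        total = sum(orig for orig, _ in files)
        # Sort by bytes saved, largest contribution first.
        ordered = sorted(files, key=lambda p: p[0] - p[1], reverse=True)
        points, saved = [], 0
        for i, (orig, enc) in enumerate(ordered, start=1):
            saved += orig - enc
            points.append((100.0 * i / len(files), 100.0 * saved / total))
        return points

    print(cumulative_savings([(1000, 10), (500, 400), (200, 190)]))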
(a) Total data sizes for the original dataset, using compression, and using DERD, for individual numbers of matching features. Most of the data match very few features in any other file, or match all the features. The y-axis is on a log scale.
(b) Cumulative benefits. The y-axis shows the relative size, in percent, of compressing or delta-encoding each file. A point on the x-axis shows the benefit from performing this on all files that match at least that many features.
Figure 1: Effect of matching features, for the MH data. These figures graphically depict the data in Table 4.
[Figure 2 (plots): cumulative sizes for (a) the static datasets /usr and MH and (b) the web datasets IBM and Yahoo, each showing compressed and DERD curves of relative size (%) against matching features (M >= X).]
5.3 Effects of File Blocking

Section 2.2 referred to an impact on size reduction from rounding to fixed block sizes. In some workloads, such as file backups, this is a non-issue, but in others it can have a moderate impact for small blocks and a substantial impact for large ones.

Figure 3(b) shows how varying the blocksize affects overall savings for the MH dataset. Like Figure 3(a), it plots the cumulative savings sorted by contribution, but it accounts for block rounding effects. A 1-Kbyte minimum blocksize, typical for many UNIX systems with fragmented file blocks, reduces the total possible benefit of delta-encoding from around 66% (assuming no rounding) to 61%, but a 4-Kbyte blocksize brings the benefit down to 40% since so many messages are smaller than 4 Kbytes.
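Accounting for block rounding is a one-line ceiling computation; a sketch with the 1-Kbyte and 4-Kbyte sizes evaluated above:

    def rounded_size(nbytes: int, blocksize: int) -> int:
        """Size actually consumed on disk when storage is allocated in fixed blocks."""
        if blocksize <= 1:
            return nbytes                    # no blocking
        blocks = -(-nbytes // blocksize)     # ceiling division
        return blocks * blocksize

    # A 100-byte delta still occupies a full block, shrinking the realized savings.
    print(rounded_size(100, 1024), rounded_size(100, 4096))   # 1024 4096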
5.4 Handling Compressed and Tarred Files

Section 2.2 provided a justification for comparing the uncompressed versions of zip and gzip files, as well as a hypothesis that tar files would not need special treatment. For some workloads this is irrelevant, since for example the MH repository stored all messages with full bodies, uncompressed.
(a) Relative savings as a function of cumulative files, by count or by bytes. Plotted on a log-log scale.
(b) Relative savings assuming no file blocking, or rounding to 1-Kbyte or 4-Kbyte units.
Figure 3: Cumulative savings from MH files, sorted in order of contribution to total savings.
An attachment might contain MIME-encoded compressed files, but these would be part of the single file being examined, and one would have to be more sophisticated about extracting these attachments. In fact, there was no single workload in our study with large numbers of both zip and gzip files, and overall benefits from including this feature were only 1-2% of the original data size in any dataset. For example, the User4 Attach workload, which had the most zip files, only saved an additional 2% over the case without special handling. Even though the zip files themselves were reduced by about a third, overall storage was dominated by other file types.

We expected directly delta-encoding one tar file against a similar tar file to generate a small delta if individual files had much overlap, but this was not the case in some limited experiments. Vcdiff generated a delta about the size of the original gzipped tar file, and two other delta programs used within IBM performed similarly. We tried a sample test, using two email tar file attachments unpacked into two directories, and then using DERD to encode all files in the two directories. We selected the delta-encoded and compressed sizes of the individual files in the smaller of the tar files, and found delta-encoding saved 85% of the bytes, compared to 71% for simple compression of individual files and 79% when the entire tar file was compressed as a whole. Depending on how this extends to an entire workload, just as with zip and gzip, these savings may not justify the added effort.

5.5 Deltas versus Compression

By default, our experiments assumed that if a delta is at all smaller than just using compression, the delta is used. There are reasons why this might not be desirable, such as a web server using a cached compressed version rather than computing a specialized delta for a given request. As another example, consider a file system backup that would require both a base file and a delta to be retrieved before producing a saved file: if the compressed version were 25% larger than the delta, it would consume that extra storage, but restoring the file would involve retrieving 125% of the delta’s size rather than the delta and a base version that would undoubtedly be much larger than that 25%.

We varied the threshold for using a delta to be 25–100% of the compressed size, in increments of 25%. Figure 4 shows the result of this experiment on the MH dataset. There is a dramatic increase in the relative size of the delta-encoded data at higher numbers of matching features, because in some cases, there is no longer a usable match at a given level. The most interesting metric is the overall savings if all files are included, since that no longer suffers from this shift; the relative size increases from about 35% to about 45% as the threshold is reduced.
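The threshold policy evaluated above is simple to state; a sketch (the function and variable names are ours):

    def choose_encoding(delta_size: int, compressed_size: int, threshold: float):
        """Use the delta only if it is no larger than threshold * compressed size.
        threshold=1.0 is the default 'any smaller delta wins' policy, while lower
        values (0.75, 0.5, 0.25) fall back to compression more often."""
        if delta_size <= threshold * compressed_size:
            return "delta", delta_size
        return "compressed", compressed_size

    print(choose_encoding(80, 100, 1.0))    # ('delta', 80)
    print(choose_encoding(80, 100, 0.5))    # ('compressed', 100)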
5.6 Shingle Size

Unlike some of the other parameters, the choice of shingle size—within reason—seems to have minimal effect on overall performance. As an example, Figure 5 shows how the size reduction varies when using shingle sizes of 20 versus 30 bytes. If all files are encoded, even for minimal matches, the total size reduction is about the same. If a higher value of min features ratio is used, the 20-byte shingles produce smaller deltas for the same threshold within a reasonable range (10-15 of 30 features matching).
Figure 4: Effect of limiting the use of deltas to a fraction of the compressed file, for the MH dataset.

Figure 5: Effect of varying the shingle size between 20 and 30 bytes, for the MH dataset.

5.7 Number of Features

The number of features used for comparisons represents a tradeoff between accuracy of resemblance detection and computation and storage overheads. In the extreme case, one could use Manber’s approach of computing and comparing every feature, and have an excellent estimate of the overlap between any two files. The other extreme is to use no resemblance detection at all or have just a handful of features. Since we have found a fair amount of discrimination using our default of 30 features, we have not considered fewer features than that, but we did compute the savings for the MH dataset from using 100 features instead of 30. The results were virtually indistinguishable in the two cases—leading to the conclusion that 30 features are preferable, due to the lower costs of storing and comparing a given number of features.

Broder has described a way to store the features even more compactly, such as 48 bytes per file, by treating the features as aggregates of multiple features computed in the “traditional” method [6]. For one such meta-feature to match, all of some subset of the regular features must match exactly, suggesting a higher degree of overlap than we felt would be appropriate for DERD.

6 Resource Usage

A system using our techniques to efficiently delta-encode files and web documents could compute features for objects when it first becomes aware of them. The cost for determining features is not that high, and it could be amortized over time. The system could also be tuned to perform delta-encoding when space is the critical resource and to store things in a conventional manner when CPU resources are the bottleneck.

Using 30 features of 4 bytes apiece, the space overhead per file is around 120 bytes. For large files, this is insignificant. Once the features for a file have been determined, it requires O(n) operations to determine the maximum number of matching features with existing files, where n is the total number of files. However, to get a reasonably good number of matching features, it is not always necessary to examine features for all of the existing files. A reasonable number of matching features can often be determined by only examining a fraction of the objects when the number of objects is large. That way, the number of comparisons needed for performing efficient delta-encoding can be bounded.
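One way such bounding could look in practice is to sample candidates before counting matching features; the sampling strategy below is our illustration rather than a mechanism the paper specifies:

    import random

    def bounded_best_match(target_features, candidates, budget=1000):
        """Compare the target's features against at most `budget` randomly chosen
        files instead of all n of them, trading a little match quality for a
        bounded number of comparisons."""
        names = list(candidates)
        if len(names) > budget:
            names = random.sample(names, budget)
        best_name, best_score = None, -1
        for name in names:
            score = len(set(target_features) & set(candidates[name]))
            if score > best_score:
                best_name, best_score = name, score
        return best_name, best_score

    # 30 features of 4 bytes apiece is roughly 120 bytes of metadata per file.
    print(bounded_best_match([1, 2, 3], {"a": [1, 2, 9], "b": [7, 8]}, budget=10))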
Delta-encoding itself has been made extremely efficient [1], and it should not usually be a bottleneck except in extremely high-bandwidth environments. Early work demonstrated its feasibility on wireless networks [11] and showed that processors an order of magnitude slower than current machines could support deltas over HTTP over network speeds up to about T3 speeds [16]. More recent systems like rsync [26] and LBFS [17], and the inclusion of the Ajtai delta-encoding work in a commercial backup system, also support the argument that DERD will not be limited by the delta-encoding bandwidth.
7 Related Work

Mogul, et al., analyzed the potential benefits of compression and delta-encoding in the context of HTTP [16]. They found that delta-encoding could dramatically reduce network traffic in cases where a client and server shared a past version of a web page, termed a "delta-eligible" response. When a delta was available, it reduced network bandwidth requirements by about an order of magnitude. However, in the traces evaluated in that study, responses were delta-eligible only a small fraction of the time: 10% in one trace and 30% in the other, but the one with 30% excluded binary data such as images. On the other hand, most resources were compressible, and they estimated that compressing those resources dynamically would still offer significant savings in bandwidth and end-to-end transfer times—factors of 2-3 improvement in size were typical.
Later, Chan and Woo devised a method to increase the frequency of delta-eligible responses by comparing resources to other cached resources with similar URLs [7]. Their assumption was that resources "near" each other on a server would have pieces in common, something they then validated experimentally. They also described an algorithm for comparing a file against several other files, rather than the one-on-one comparison typically performed in this context. However, they did not explain how a server would select the particular related resources in practice, assuming that it has no specific knowledge of a client’s cache. We believe there is an implicit assumption that this approach is in fact limited to "personal proxies" with exact knowledge of the client’s cache [11, 2], in which case it has limited applicability.

Ouyang, et al., similarly clustered related web pages by URL, and tried to select the best base version for a given cluster by computing deltas from a small sample [18]. While they were not focused on a caching context, and are more similar to the general applications described herein, they did not initially use the more efficient resemblance detection methods of Manber and Broder to best select the base versions. Subsequently, they applied resemblance detection techniques to scale the technique to larger collections [19]. This work, roughly concurrent with our own, is similar in its general approach. However, the largest dataset they analyzed was just over 20,000 web pages, and they did not consider other types of data such as email. Another possibly significant distinction is that they used shingle sizes of only 4 bytes, whereas we used 20-30 bytes. (We did not obtain this paper in time to repeat our analyses with such a small shingle size.)

Spring and Wetherall [24] essentially generalized Chan and Woo's work by applying it to all data sent over a specific communication channel, and using resemblance detection to detect duplicate sequences in a collection of data. This was done by computing fingerprints of shingles, selecting those with a predetermined number of zeroes in the low-order bits (deterministically selecting a fraction of features), and scanning before and after the matching shingle to find the longest duplicate data sequence. Like Chan and Woo's work, this system worked only with a close coupling between clients and servers, so both sides would know what redundant data existed in the client. In addition, the communication channel approach requires a separate cache of packets exchanged in the past, which may compete with the browser cache and other applications for resources.

In some cases, the suppression of redundancy is at a very coarse level, for instance identifying when an entire payload is identical to an earlier payload [12], or when a particular region of a file has not changed. Examples of systems taking this approach include rsync [26], a popular protocol for remote file copying, and the Low-bandwidth File System (LBFS) [17]. However, there are applications for which identifying an appropriate base version is difficult and the available redundancy is ignored. For instance, LBFS exploits similarities not only between different versions of the same file but across files. To identify similar files, it hashes the contents of blocks of data, where a block boundary is (usually) defined by a subset of features—like the Spring & Wetherall approach, except that the features determine block boundaries rather than indices for the data being compared. Variable block boundaries allow a change within one block not to affect neighboring blocks. (The Venti archival system [20] and the Pastiche peer-to-peer backup system [8] are two more recent examples of the use of content-defined blocks to identify duplicate content; we use LBFS here as the "canonical" example of the technique.)
ple [18]. While they were not focused on a caching Similarly, it is not always possible to ensure that both
context, and are more similar to the general applications sides of a network connection share a single common
described herein, they did not initially use the more ef- base version. Rsync allows the two communicating par-
ficient resemblance detection methods of Manber and ties to ascertain dynamically which blocks of a file are
Broder to best select the base versions. Subsequently, already contained in a version of the file on the receiving
they applied resemblance detection techniques to scale side.
the technique to larger collections [19]. This work, LBFS and rsync are well suited to compressing large
roughly concurrent with our own, is similar in its gen- files with long sequences of unchanged bytes, but if the
eral approach. However, the largest dataset they ana- granularity of change is finer than their block bound-
lyzed was just over 20,000 web pages, and they did not aries, they get no benefit. Most delta-encoding algo-
consider other types of data such as email. Another pos- rithms remove redundancy if it is large enough to amor-
sibly significant distinction is that they used shingle sizes tize the overhead of the pointers and other meta-data that
of only 4 bytes, whereas we used 20-30 bytes. (We did identify the redundancy. A resemblance detection pro-
not obtain this paper in time to repeat our analyses with cedure should therefore be suited to the delta-encoding
such a small shingle size.) algorithm, and the size and contents of the data. Our
Spring and Weatherall [24] essentially generalized work demonstrates that fine-grained deltas work well in
Chan and Woo’s work by applying it to all data sent a variety of environments, but a head-to-head compari-
over a specific communication channel, and using re- son with LBFS and rsync in these environments will help
semblance detection to detect duplicate sequences in a determine which approach is best in which context.
collection of data. This was done by computing finger-
prints of shingles, selecting those with a predetermined 8 Conclusions and Future Work
number of zeroes in the low-order bits (deterministically Delta-encoding has been used in a number of applica-
selecting a fraction of features), and scanning before and tions, but it has been limited to two general contexts: en-
after the matching shingle to find the longest duplicate coding a file against an earlier version of the same file, or
data sequence. Like Chan and Woo’s work, this sys- encoding against other files (or data blocks) where both
tem worked only with a close coupling between clients sides of a communication channel have a consistent view
and servers, so both sides would know what redundant of the cached data. We have generalized this approach in
data existed in the client. In addition, the communica- the web context to use features of web content to iden-
tion channel approach requires a separate cache of pack- tify appropriate base versions, and quantified the poten-
ets exchanged in the past, which may compete with the tial reductions in transfer sizes of such a system. We
browser cache and other applications for resources. have also extended Manber’s use of this technique on a
In some cases, the suppression of redundancy is at a single server [14], and quantified potential benefits in a
very coarse level, for instance identifying when an en- general file system and specific to email.
For web content, we have found substantial overlap among pages on a single site. This is consistent with Chan and Woo [7], Ouyang, et al. [19], and recent work on automatic detection of common fragments within pages [23]. For the five web datasets we considered, deltas reduced the total size of the dataset to 8–19% of the original data, compared to 29–36% using compression. For files and email, there was much more variability, and the overall benefits are not as dramatic, but they are significant: two of the largest datasets reduced the overall storage needs by 10–20% beyond compression. There was significant skew in at least one dataset, with a small fraction of files accounting for a large portion of the savings. Factors such as shingle size and the number of features compared do not dramatically affect these results. Given a particular number of maximal matching features, there is not a wide variation across base files in the size of the resulting deltas.

A new file will often be created by making a small number of changes to an older file; the new file may even have the same name as the old file. In these cases, the new file can often be delta-encoded from the old file with minimal overhead. For the most part, our datasets did not consider these scenarios. For situations where this type of update is prevalent, the benefits from delta-encoding are likely to be higher.

Now that we have demonstrated the potential savings of DERD, in the abstract, we would like to implement underlying systems using this technology. The smaller deltas for web data suggest that an obvious approach is to integrate DERD into a web server and/or cache, and then use a live system over time. However, supporting resemblance-based deltas in HTTP involves extra overheads and protocol support [10] that do not affect other applications such as backups. We are also interested in methods to reduce storage and network costs in email systems, and hope to implement our approach in commonly used mail platforms. As the system scales to larger datasets, we can add heuristics for more efficient resemblance detection and feature computation. We can also evaluate additional application-specific methods, such as encoding individual elements of tar files, and compare the various delta-based approaches against other systems such as LBFS and rsync in greater depth.

Acknowledgments

Kiem-Phong Vo jointly developed the idea of web-based DERD, resulting in a research report [10] from which a small amount of the text in this manuscript has been taken. Andrei Broder has been extremely helpful in understanding the intricacies of resemblance detection, Randal Burns and Kiem-Phong Vo have similarly been helpful in providing and helping us to understand their delta-encoding software packages, and Laurence Marks is a LotusScript guru extraordinaire. Ziv Bar-Yossef, Sridhar Rajagopalan, and Lakshmish Ramaswamy provided code for computing features. Several people have permitted us to analyze their data, including Lisa Amini, Frank Eskesen and Andy Walter. Ramesh Agarwal, Andrei Broder, Ron Fagin, Chris Howson, Ray Jennings, Jason LaVoie, Srini Seshan, John Tracey, and Andrew Tridgell have provided helpful comments on some of the ideas presented in this paper and/or earlier drafts of this paper. Finally, we thank the anonymous reviewers and our shepherd, Darrell Long, for their advice and feedback.

References

[1] M. Ajtai, R. Burns, R. Fagin, D. Long, and L. Stockmeyer. Compactly encoding unstructured input with differential compression. Journal of the ACM, 49(3):318–367, May 2002.

[2] Gaurav Banga, Fred Douglis, and Michael Rabinovich. Optimistic deltas for WWW latency reduction. In Proceedings of 1997 USENIX Technical Conference, pages 289–303, January 1997.

[3] Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In Proceedings of the Eleventh International Conference on World Wide Web, pages 580–591. ACM Press, 2002.

[4] K. Bharat and A. Broder. Mirror, mirror on the web: A study of host pairs with replicated content. In Proceedings of the 8th International World Wide Web Conference, pages 501–512, May 1999.

[5] Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES’97), 1997.

[6] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, pages 1–10, June 2000.

[7] Mun Choon Chan and Thomas Y. C. Woo. Cache-based compaction: A new technique for optimizing web transfer. In Proceedings of Infocom’99, pages 117–125, 1999.

[8] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI’02), pages 285–298. USENIX, December 2002.

[9] Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, and Jeffrey Mogul. Rate of change and other metrics: a live study of the World Wide Web. In Proceedings of the Symposium on Internet Technologies and Systems, pages 147–158. USENIX, December 1997.
[10] Fred Douglis, Arun K. Iyengar, and Kiem-Phong Vo. Dynamic suppression of similarity in the web: a case for deployable detection mechanisms. Technical Report RC22514, IBM Research, July 2002.

[11] Barron C. Housel and David B. Lindquist. WebExpress: A system for optimizing Web browsing in a wireless environment. In Proceedings of the Second Annual International Conference on Mobile Computing and Networking, pages 108–116. ACM, November 1996.

[12] Terence Kelly and Jeffrey Mogul. Aliasing on the World Wide Web: Prevalence and Performance Implications. In Proceedings of the 11th International World Wide Web Conference, May 2002.

[13] David G. Korn and Kiem-Phong Vo. Engineering a differencing and compression data format. In Proceedings of the 2002 Usenix Conference. USENIX Association, June 2002.

[14] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1–10, January 1994.

[15] J. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann, Y. Goland, A. van Hoff, and D. Hellerstein. Delta encoding in HTTP, January 2002. RFC 3229.

[16] Jeffrey Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP. In Proceedings of ACM SIGCOMM’97 Conference, pages 181–194, September 1997.

[17] Athicha Muthitacharoen, Benjie Chen, and David Mazieres. A low-bandwidth network file system. In Symposium on Operating Systems Principles, pages 174–187, 2001.

[18] Zan Ouyang, Nasir Memon, and Torsten Suel. Using delta encoding for compressing related web pages. In Data Compression Conference, page 507, March 2001. Poster.

[19] Zan Ouyang, Nasir Memon, Torsten Suel, and Dimitre Trendafilov. Cluster-based delta compression of a collection of files. In International Conference on Web Information Systems Engineering (WISE), December 2002.

[20] S. Quinlan and S. Dorward. Venti: a new approach to archival storage. In Proceedings of the First USENIX Conference on File and Storage Technologies, Monterey, CA, 2002.

[21] Michael O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.

[22] Sridhar Rajagopalan, 2002. Personal Communication.

[23] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, and Fred Douglis. Techniques for efficient detection of fragments in web pages. Manuscript, November 2002.

[24] Neil T. Spring and David Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of ACM SIGCOMM, August 2000.

[25] W. Tichy. RCS: a system for version control. Software—Practice & Experience, 15(7):637–654, July 1985.

[26] Andrew Tridgell. Efficient Algorithms for Sorting and Synchronization. PhD thesis, Australian National University, 1999.