
Cluster-Based Delta Compression of a Collection of Files

Zan Ouyang Nasir Memon Torsten Suel Dimitre Trendafilov

CIS Department
Polytechnic University
Brooklyn, NY 11201

Abstract

Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. In this paper, we study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.

1 Introduction

Delta compressors are software tools for compactly encoding the differences between two files or strings in order to reduce communication or storage costs. Examples of such tools are the diff and bdiff utilities for computing edit sequences between two files, and the more recent xdelta [16], vdelta [12], vcdiff [15], and zdelta [26] tools that compute highly compressed representations of file differences. These tools have a number of applications in various networking and storage scenarios; see [21] for a more detailed discussion. In a communication scenario, they typically exploit the fact that the sender and receiver both possess a reference file that is similar to the transmitted file; thus transmitting only the difference (or delta) between the two files requires a significantly smaller number of bits. In storage applications such as version control systems, deltas are often orders of magnitude smaller than the compressed target file.

Delta compression techniques have also been studied in detail in the context of the World Wide Web, where consecutive versions of a web page often differ only slightly [8, 19] and pages on the same site share a lot of common HTML structure [5]. In particular, work in [2, 5, 7, 11, 18] considers possible improvements to HTTP caching based on sending a delta with respect to a previous version of the page, or another similar page, that is already located in a client or proxy cache.

In this paper, we study the use of delta compression in a slightly different scenario. While in most other applications, delta compression is performed with respect to a previous version of the same file, or some other easy to identify reference file, we are interested in using delta compression to better compress large collections of files where it is not obvious at all how to efficiently identify appropriate reference and target files. Our approach is based on a reduction to the optimum branching problem in graph theory and the use of recently proposed clustering techniques for finding similar files.

We focus on collections of web pages from several sites. Applications that we have in mind are efficient downloading and storage of collections of web pages for off-line browsing, and improved archiving of massive terabyte web collections such as the Internet Archive (see http://archive.org). However, the techniques we study are applicable to other scenarios as well, and might lead to new general-purpose tools for exchanging collections of files that improve over the currently used zip and tar/gzip tools.

This project was supported by a grant from Intel Corporation, and by the Wireless Internet Center for Advanced Technology (WICAT) at Polytechnic University. Torsten Suel was also supported by NSF CAREER Award NSF CCR-0093400.

1.1 Contributions of this Paper

In this paper, we study the problem of compressing collections of files, with a focus on collections of web pages, with varying degrees of similarity among the files. Our approach is based on using an efficient delta compressor, in particular the zdelta compressor [26], to achieve significantly better compression than that obtained by compressing each file individually or by using tools such as tar and gzip on the collection. Our main contributions are:

- The problem of obtaining optimal compression of a collection of files, given a specific delta compressor, can be solved by finding an optimal branching on a directed graph whose nodes correspond to the files and whose edges correspond to the ordered pairs of files. We implement this algorithm and show that it can achieve significantly better compression than current tools. On the other hand, the algorithm quickly becomes inefficient as the collection size grows beyond a few hundred files, due to its quadratic complexity.

- We present a general framework, called cluster-based delta compression, for efficiently computing near-optimal delta encoding schemes on large collections of files. The framework combines the branching approach with two recently proposed hash-based techniques for clustering files by similarity [3, 10, 14, 17].

- Within this framework, we evaluate a number of different algorithms and heuristics in terms of compression and running time. Our results show that compression very close to that achieved by the optimal branching algorithm can be achieved in time that is within a small multiplicative factor of the time needed by tools such as gzip.

We also note three limitations of our study. First, our results are still preliminary and we expect additional improvements in running time and compression over the results in this paper. In particular, we believe we can narrow the gap between the speed of gzip and our best algorithms. Secondly, we restrict ourselves to the case where each target file is compressed with respect to a single reference file. Additional significant improvements in compression might be achievable by using more than one reference file, at the cost of additional algorithmic complexity. Finally, we only consider the problem of compressing and uncompressing an entire collection, and do not allow individual files to be added to or retrieved from the collection.

The rest of this paper is organized as follows. The next subsection lists related work. In Section 2 we discuss the problem of compressing a collection of files using delta compression, and describe an optimal algorithm based on computing a maximum weight branching in a directed graph. Section 3 provides our framework, called cluster-based delta compression, and outlines several approaches under this framework. In Section 4, we present our experimental results. Finally, Section 5 provides some open questions and concluding remarks.

1.2 Related Work

For an overview of delta compression techniques and applications, see [21]. Delta compression techniques were originally introduced in the context of version control systems; see [12, 25] for a discussion. Among the main delta compression algorithms in use today are diff and vdelta [12]. Using diff to find the difference between two files and then applying gzip to compress the difference is a simple and widely used way to perform delta compression, but it does not provide good compression on files that are only slightly similar. vdelta, on the other hand, is a relatively new technique that integrates both data compression and data differencing. It is a refinement of Tichy's block-move algorithm [24] that generalizes the well-known Lempel-Ziv technique [27] to delta compression. In our work, we use the zdelta compressor, which was shown to achieve good compression and running time in [26].

The issue of appropriate distance measures between files and strings has been studied extensively, and many different measures have been proposed. We note that diff is related to the symmetric edit distance measure, while vdelta and other recent Lempel-Ziv type delta compressors such as xdelta [16], vcdiff [15], and zdelta [26] are related to the copy distance between two files. Recent work in [6] studies a measure called LZ distance that is closely related to the performance of Lempel-Ziv type compression schemes. We also refer to [6] and the references therein for work on protocols for estimating file similarities over a communication link.

Fast algorithms for the optimum branching problem are described in [4, 22]. While we are not aware of previous work that uses optimum branchings to compress collections of files, there are two previous applications that are quite similar. In particular, Tate [23] uses optimum branchings to find an optimal scheme for compressing multispectral images, while Adler and Mitzenmacher [1] use it to compress the graph structure of the World Wide Web. Adler and Mitzenmacher [1] also show that a natural extension of the branching problem to hypergraphs that can be used to model delta compression with two or more reference files is NP-complete, indicating that an efficient optimal solution is unlikely.

We use two types of hash-based clustering techniques in our work: a technique with quadratic complexity called min-wise independent hashing, proposed by Broder in [3] (see also Manber and Wu [17] for a similar technique), and a very recent nearly linear time technique called locality-sensitive hashing, proposed by Indyk and Motwani in [14] and applied to web documents in [10].

2 Delta Compression Based on Optimum Branchings

Delta compressors such as vcdiff or zdelta provide an efficient way to encode the difference between two similar files. However, given a collection of files, we are faced with the problem of succinctly representing the entire collection through appropriate delta encodings between target and reference files. We observe that the problem of finding an optimal encoding scheme for a collection of files through pairwise deltas can be reduced to that of computing an optimum branching of an appropriately constructed weighted directed graph G.

2.1 Problem Reduction

Formally, a branching of a directed graph is defined as a set of edges B such that (1) B contains at most one incoming edge for each node, and (2) B does not contain a cycle. Given a weighted directed graph, a maximum branching is a branching of maximum total edge weight. Given a collection of n files, we construct a complete directed graph G where each node corresponds to a file and each directed edge from node i to node j has a corresponding weight w(i, j) that represents the reduction (in bytes) obtained by delta-compressing file j with respect to file i. In addition to these nodes, the graph includes an extra null node corresponding to the empty file that is used to model the compression savings if a file is compressed by itself (using, e.g., zlib, or zdelta with an empty reference file).

Given the above formulation, it is not difficult to see that a maximum branching of the graph gives us an optimal delta encoding scheme for a collection of files. Condition (1) in the definition of a branching expresses the constraint that each file is compressed with respect to only one other file. The second condition ensures that there are no cyclical dependencies that would prevent us from decompressing the collection. Finally, given the manner in which the weights have been assigned, a maximum branching results in a compression scheme with optimal benefit over the uncompressed case.

Figure 1 shows the weighted directed graph formed by a collection of four files. In the example, node 0 is the null node, while nodes 1, 2, 3, and 4 represent the four files. The weights on the edges from node 0 to nodes 1, 2, 3, and 4 are the compression savings obtained when the target files are compressed by themselves. The weights on all other edges represent the compression savings when the target file is compressed using the other endpoint as a reference file. The optimal branching then selects, for each file, the incoming edge along which it is compressed: a file whose selected edge comes from the null node is compressed by itself, and every other file is compressed by computing a delta with respect to the file at the tail of its selected edge.

Figure 1. Example of a directed and weighted complete graph for a collection of four files; the edge weights give the compression savings in bytes, and the optimal branching consists of one incoming edge for each file.
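To make the reduction concrete, here is a minimal Python sketch (illustrative only, not the authors' C implementation): it builds the weighted graph of Section 2.1, using zlib-compressed sizes as a crude stand-in for zdelta, and assumes NetworkX's maximum_branching routine for the branching step. All helper names are ours.

    import zlib
    import networkx as nx

    def zlib_size(data: bytes) -> int:
        return len(zlib.compress(data, 9))

    def approx_delta_size(reference: bytes, target: bytes) -> int:
        # Crude stand-in for zdelta: size of the target when compressed in the
        # context of the reference. A real implementation would call zdelta.
        return max(zlib_size(reference + target) - zlib_size(reference), 0)

    def optimal_plan(files: dict) -> list:
        """Return (reference, target) edges; reference '<null>' means self-compression."""
        g = nx.DiGraph()
        g.add_node("<null>")  # the empty reference file (node 0 in Figure 1)
        for name, data in files.items():
            # Benefit of compressing the file by itself, relative to storing it raw.
            g.add_edge("<null>", name, weight=len(data) - zlib_size(data))
        for ref, ref_data in files.items():
            for tgt, tgt_data in files.items():
                if ref != tgt:
                    benefit = len(tgt_data) - approx_delta_size(ref_data, tgt_data)
                    g.add_edge(ref, tgt, weight=benefit)
        # Maximum-weight branching: at most one incoming edge per file, no cycles.
        branching = nx.maximum_branching(g, attr="weight")
        return list(branching.edges())

Since the selected edges contain no cycles, decompression can proceed in topological order, restoring each reference file before the targets that were delta-compressed against it.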

2.2 Experimental Results

We implemented delta compression based on the optimal branching algorithm described in [4, 22], which for dense graphs takes time proportional to the number of edges. Table 1 shows compression results on several collections of web pages that we collected by crawling a limited number of pages from each site using a breadth-first crawler.

    data set       pages   average size   cat+gzip ratio   optimal branching ratio
    CBC              530          23 KB             5.83                     10.01
    CBSNews          218          44 KB             5.06                     15.42
    USAToday         344          25 KB             6.30                     18.64
    CSmonitor        388          43 KB             5.06                     17.31
    Ebay             100          23 KB             6.78                     10.90
    Thomas-dist      105          27 KB             6.39                      9.73
    all sites       1685          29 KB             5.53                     12.36

Table 1. Compression ratios for some collections of files.

The results indicate that the optimum branching approach can give significant improvements in compression over using cat or tar followed by gzip, outperforming them by a factor of roughly 1.5 to 3 in compression ratio. However, the major problem with the optimum branching approach is that it becomes very inefficient as soon as the number of files grows beyond a few dozen. For the cbc.ca data set with 530 pages, it took more than an hour to perform the computation, while multiple hours were needed for the set with all sites.

Figure 2 plots the running time in seconds of the optimal branching algorithm for different numbers of files, using a set of files from the gcc software distribution also used in [12, 26]. Time is plotted on a logarithmic scale to accommodate two curves: the time spent on computing the edge weights (upper curve), and the time spent on the actual branching computation after the weights of the graph have been determined using calls to zdelta (lower curve). While both curves grow quadratically, the vast majority of the time is spent on computing appropriate edge weights for the graph G, and only a tiny amount is spent on the actual branching computation afterwards.

Figure 2. Running time of the optimal branching algorithm, in seconds on a logarithmic scale, for 100 to 800 files; the upper curve shows the time for computing the complete weighted graph, the lower curve the time for the optimum branching computation itself.

Thus, in order to compress larger collections of pages, we need to find techniques that avoid computing the exact weights of all edges in the complete graph G. In the next sections, we study such techniques based on clustering of pages and pruning and approximation of edges. We note that another limitation of the branching approach is that it does not support the efficient retrieval of individual files from a compressed collection, or the addition of new files to the collection. This is a problem in some applications that require interactive access, and we do not address it in this paper.

3 Cluster-Based Delta Compression

As shown in the previous section, delta compression techniques have the potential for significantly improved compression of collections of files. However, the optimal algorithm based on maximum branching quickly becomes a bottleneck as we increase the collection size, mainly due to the quadratic number of pairwise delta compression computations that have to be performed. In this section, we describe a basic framework, called Cluster-Based Delta Compression, for efficiently computing near-optimal delta compression schemes on larger collections of files.

3.1 Basic Framework

We first describe the general approach, which leads to several different algorithms that we implemented. In a nutshell, the basic idea is to prune the complete graph G into a sparse subgraph G', and then find the best delta encoding scheme within this subgraph. More precisely, we have the following general steps:

(1) Collection Analysis: Perform a clustering computation that identifies pairs of files that are very similar and thus good candidates for delta compression. Build a sparse directed subgraph G' containing only edges between these similar pairs.

(2) Assigning Weights: Compute or estimate appropriate edge weights for G'.

(3) Maximum Branching: Perform a maximum branching computation on G' to determine a good delta encoding.

The assignment of weights in the second step can be done either precisely, by performing a delta compression across each remaining edge, or approximately, e.g., by using estimates of file similarity produced during the document analysis in the first step. Note that if the weights are computed precisely by a delta compressor and the resulting compressed files are saved, then the actual delta compression after the last step consists of simply removing the files corresponding to unused edges (assuming sufficient disk space).

The primary challenge is Step (1), where we need to efficiently identify a small subset of file pairs that give good delta compression. We will solve this problem by using two sets of known techniques for document clustering, one set proposed by Broder [3] and Manber and Wu [17], and one set proposed by Indyk and Motwani [14] and applied to document clustering by Haveliwala, Gionis, and Indyk [10]. These techniques were developed in the context of identifying near-duplicate web pages and finding closely related pages on the web. While these problems are clearly closely related to our scenario, there are also a number of differences that make it nontrivial to apply the techniques to delta compression.
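The three steps can be combined into a short driver, sketched below under the same assumptions as before; find_similar_pairs and estimate_weight are hypothetical stand-ins for the clustering techniques of Sections 3.3 and 3.4 and for exact or estimated edge weights.

    import networkx as nx

    def cluster_based_plan(files, find_similar_pairs, estimate_weight):
        """Cluster-based delta compression framework (sketch).

        files: dict mapping file name to contents (bytes).
        find_similar_pairs(files): candidate (reference, target) pairs, Step (1).
        estimate_weight(ref_data, tgt_data): exact or estimated benefit, Step (2).
        """
        g = nx.DiGraph()
        g.add_node("<null>")
        for name, data in files.items():
            # Always allow self-compression against the empty reference file.
            g.add_edge("<null>", name, weight=estimate_weight(b"", data))
        # Step (1): collection analysis produces the sparse subgraph G'.
        for ref, tgt in find_similar_pairs(files):
            # Step (2): assign weights only to the surviving edges.
            g.add_edge(ref, tgt, weight=estimate_weight(files[ref], files[tgt]))
        # Step (3): maximum branching on the pruned graph G'.
        branching = nx.maximum_branching(g, attr="weight")
        return list(branching.edges())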

3.2 File Similarity Measures

The compression performance of a delta compressor on a pair of files depends on many details, such as the precise locations and lengths of the matches, the internal compressibility of the target file, the windowing mechanism, and the performance of the internal Huffman coder. A number of formal measures of file similarity, such as edit distance (with or without block moves), copy distance, or LZ distance [6], have been proposed that provide reasonable approximations; see [6, 21] for a discussion. However, even these simplified measures are not easy to compute with, and thus the clustering techniques in [3, 17, 10] that we use are based on two even simpler similarity measures, which we refer to as shingle intersection and shingle containment.

Formally, for a file F and an integer q, we define the shingle set (or q-gram set) S_q(F) of F as the multiset of substrings of length q (called shingles) that occur in F. Given two files F and F', we define the shingle intersection of F and F' as |S_q(F) ∩ S_q(F')| / |S_q(F) ∪ S_q(F')|. We define the shingle containment of F with respect to F' as |S_q(F) ∩ S_q(F')| / |S_q(F)|. (Note that shingle containment is not symmetric.)

Thus, both of these measures assign higher similarity scores to files that share a lot of short substrings, and intuitively we should expect a correlation between the delta compressibility of two files and these similarity measures. In fact, a relationship between shingle intersection and the edit distance measure can be easily derived: given two files F and F' within edit distance d, and a shingle size q, each edit operation can affect at most q shingles, so the two shingle sets can differ in only a bounded number of shingles, which yields a lower bound on the shingle intersection in terms of d and q. We refer to [9] for a proof and a similar result for the case of edit distance with block moves. A similar relationship can also be derived between shingle containment and copy distance. Thus, shingle intersection and shingle containment are related to the edit distance and copy distance measures, which have been used as models for the corresponding classes of edit-based and copy-based delta compression schemes.

While the above discussion supports the use of the shingle-based similarity measures in our scenario, in practice the relationship between these measures and the achieved delta compression ratio is quite noisy. Moreover, for efficiency reasons we will only approximate these measures, introducing additional potential for error.

3.3 Clustering Using Min-Wise Independent Hashing

We now describe the first set of techniques, called min-wise independent hashing, that was proposed by Broder in [3]. (A similar technique is described by Manber and Wu in [17].) The simple idea in this technique is to approximate the shingle similarity measures by sampling a small subset of shingles from each file. However, in order to obtain a good estimate, the samples are not drawn independently from each file, but they are obtained in a coordinated fashion using a common set of random hash functions that map shingles of length q to integer values. We then select in each file the smallest hash values obtained this way.

We refer the reader to [3] for a detailed analysis. Note that there are a number of different choices that can be made in implementing these schemes:

- Choice of hash functions: We used a class of simple linear hash functions analyzed by Indyk in [13] and also used in [10].

- Sample sizes: One option is to use a fixed number of samples from each file, independent of file size. Alternatively, we could sample at a constant rate, resulting in sample sizes that are proportional to file sizes.

- One or several hash functions: One way to select k samples from a file is to use k hash functions, and include the minimum value under each hash function in the sample. Alternatively, we could select one random hash function, and select the k smallest values under this hash function. We selected the second method as it is significantly more efficient, requiring only one hash function computation for each shingle.

- Shingle size: We used one fixed shingle size for the results reported here. (We also experimented with other shingle sizes but achieved slightly worse results.)

After selecting the sample, we estimate the shingle intersection or shingle containment measures by intersecting the samples of every pair of files. Thus, this phase takes time quadratic in the number of files. Finally, we decide which edges to include in the sparse graph G'. There are two independent choices to be made here:

- Similarity measure: We can use either intersection or containment as our measure.

- Threshold versus best neighbors: We could keep all edges above a certain similarity threshold in the graph. Or, for each file, we could keep the k most promising incoming edges, for some constant k, i.e., the edges coming from the k nearest neighbors w.r.t. the estimated similarity measure.

A detailed discussion of the various implementation choices outlined here and their impact on running time and compression is given in the experimental section.

The total running time for the clustering step using min-wise independent hashing is thus roughly O(n·s + n²·k), where n is the number of files, s the (average) size of each file, and k the (average) size of each sample. (If several different hash functions are used, then an additional factor equal to their number has to be added to the first term.) The main advantage over the optimal algorithm is that for each edge, instead of performing a delta compression step between two files of size s (several kilobytes), we perform a simpler computation between two samples of some small size k. This results in a significant speedup over the optimal algorithm in practice, although the algorithm will eventually become inefficient due to the quadratic complexity.

3.4 Clustering Using Locality-Sensitive Hashing

The second set of techniques, proposed by Indyk and Motwani [14] and applied to document clustering by Haveliwala, Gionis, and Indyk [10], is an extension of the first set that results in an almost linear running time. In particular, these techniques avoid the pairwise comparison between all files by performing a number of sorting steps on specially designed hash signatures that can directly identify similar files.

The first step of the technique is identical to that of the min-wise independent hashing technique for fixed sample size. That is, we select from each file a fixed number k of min-wise independent hash values, using k different random hash functions. For a file F, let h_i(F) be the value selected by the i-th hash function. The main idea, called locality-sensitive hashing, is to now use these hash values to construct file signatures that consist of the concatenation of m of the hash values (e.g., for m = 4 we concatenate four of the hash values into one longer signature). If two files agree on their signature, then this is strong evidence that their intersection is above some threshold. It can be formally shown that by repeating this process a number of times that depends on m and the chosen threshold, we will find most pairs of files with shingle intersection above the threshold, while avoiding most of the pairs below the threshold. For a more formal description of this technique we refer to [10].

The resulting algorithm consists of the following steps:

(1) Sampling: Extract a fixed number k of hash values h_1(F), ..., h_k(F) from each file F in the collection, using k different hash functions.

(2) Locality-sensitive hashing: Repeat the following several times:

(a) Randomly select m indexes i_1, ..., i_m from {1, ..., k}.

(b) For each file F, construct a signature by concatenating the hash values h_{i_1}(F) to h_{i_m}(F).

(c) Sort all resulting signatures, and scan the sorted list to find all pairs of files whose signatures are identical.

(d) For each such pair, add edges in both directions to G'.

Thus, the running time of this method is nearly linear in the collection size, with the sample size k, the signature length m, and the number of repetitions appearing as small constant factors that depend on the choice of parameters. We discuss parameter settings and their consequences in detail in the experimental section.

We note two limitations. First, the above implementation only identifies the pairs that are above a given fixed similarity threshold. Thus, it does not allow us to determine the k best neighbors for each node, and it does not provide a good estimate of the precise similarity of a pair (i.e., whether it is significantly or only slightly above the threshold). Second, the method is based on shingle intersection, and not shingle containment. Addressing these limitations is an issue for future work.
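The steps above can be sketched as follows (illustrative Python, not the authors' implementation): seeded blake2b hashes stand in for the k random hash functions, and the parameter defaults are placeholders rather than the values used in the experiments.

    import hashlib
    import random
    from itertools import groupby

    def shingles(data: bytes, q: int = 4) -> set:
        return {data[i:i + q] for i in range(max(len(data) - q + 1, 1))}

    def min_hash_vector(data: bytes, k: int, q: int = 4) -> list:
        # One min-wise hash value per hash function; the seed indexes the function.
        vec = []
        for seed in range(k):
            key = seed.to_bytes(8, "big")
            vec.append(min(hashlib.blake2b(s, key=key, digest_size=8).digest()
                           for s in shingles(data, q)))
        return vec

    def lsh_candidate_pairs(files: dict, k: int = 20, m: int = 4, rounds: int = 10) -> set:
        vectors = {name: min_hash_vector(data, k) for name, data in files.items()}
        pairs = set()
        for _ in range(rounds):
            idx = random.sample(range(k), m)                         # step (a)
            sigs = sorted((tuple(vectors[name][i] for i in idx), name)
                          for name in files)                         # steps (b), (c)
            for _, group in groupby(sigs, key=lambda t: t[0]):
                names = [name for _, name in group]
                for a in names:                                      # step (d)
                    for b in names:
                        if a != b:
                            pairs.add((a, b))
        return pairs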

4 Experimental Evaluation

In this section, we perform an experimental evaluation of several cluster-based compression schemes that we implemented based on the framework from the previous section. We first introduce the algorithms and the experimental setup. In Subsection 4.2 we show that naive methods based on thresholds do not give good results. The next three subsections look at different techniques that resolve this problem, and finally Subsection 4.6 presents results for our best two algorithms on a larger data set. Due to space constraints and the large number of options, we can only give a selection of our results. We refer the reader to [20] for a more complete evaluation.

4.1 Algorithms

We implemented a number of different algorithms and variants. In particular, we have the following options:

- Basic scheme: MH vs. LSH.

- Number of hash functions: single hash vs. multiple hash.

- Sample size: fixed size vs. fixed rate.

- Similarity measure: intersection vs. containment.

- Edge pruning rule: threshold vs. best neighbors vs. heuristics.

- Edge weight: exact vs. estimated.

We note that not every combination of these choices makes sense. For example, our LSH implementations do not support containment or best neighbors, and require a fixed sample size. On the other hand, we did not observe any benefit in using multiple hash functions in the MH scheme, and thus assume a single hash function for this case. We note that in our implementations, all samples were treated as sets, rather than multi-sets, so a frequently occurring string is represented at most once. (Intuitively, this seems appropriate given our goal of modeling delta compression performance.)

All algorithms were implemented in C and compiled using gcc 2.95.2 under Solaris 7. Experiments were run on a Sun Enterprise E450 server with two UltraSparc CPUs at 400 MHz. Only one CPU was used in the experiments, and data was read from a single SCSI disk. We note that the large amount of memory and fast disk minimize the impact of I/O on the running times. We used two data sets:

- The medium data set consists of the union of the six web page collections from Section 2, with a total of 1685 files.

- The large data set consists of HTML pages crawled from the poly.edu domain, with a total size of 257.8 MB. The pages were crawled in a breadth-first crawl that attempted to fetch all pages reachable from the www.poly.edu homepage, subject to certain pruning rules to avoid dynamically generated content and cgi scripts.

4.2 Threshold-Based Methods

The first experiments that we present look at the performance of MH and LSH techniques that try to identify and retain all edges that are above a certain similarity threshold. In Table 2 we look at the optimum branching method and at three different algorithms that use a fixed threshold to select edges that are considered similar, for different thresholds. For each method, we show the number of similar edges, the number of edges included in the final branching, and the total improvement obtained by the method as compared to compressing each file individually using zlib. The results demonstrate a fundamental problem that arises in these threshold-based methods: for high thresholds, the vast majority of edges is eliminated, but the resulting branching is of poor quality compared to the optimal one. For low thresholds, we obtain compression close to the optimal, but the number of similar edges is very high; this is a problem since the number of edges included in G' determines the cost of the subsequent computation. (For example, if we compute the exact weight of each edge above a threshold, then we have to perform one call to zdelta for each remaining edge, i.e., several hundred thousand calls at the lower thresholds in Table 2.) Unfortunately, these numbers indicate that there is no real "sweet spot" for the threshold that gives both a small number of similar edges and good compression on this data set.

    alg.           smp. size   threshold   remaining edges   branching size   benefit over zlib
    optimal                                        2,782,224            1667             6980935
    MH intersect   100               20%             357,961            1616             6953569
                                     40%             154,533            1434             6601218
                                     60%              43,289             988             5326760
                                     80%               2,629             265             1372123
    MH intersect                     20%             391,682            1641             6961645
                                     40%             165,563            1481             6665907
                                     60%              42,474            1060             5450312
                                     80%               4,022             368             1621910
    MH contain                       20%           1,258,272            1658             6977748
                                     40%             463,213            1638             6943999
                                     60%             225,675            1550             6724167
                                     80%              79,404            1074             5016699

Table 2. Number of remaining edges, number of edges in the final branching, and compression benefit for threshold-based clustering schemes for different sampling techniques and threshold values (column 3).

We note that this result is not due to the precision of the sampling-based methods, and it also holds for threshold-based LSH algorithms. A simplified explanation for this is that data sets contain different clusters of various similarity, and a low threshold will keep these clusters intact as dense graphs with many edges, while a high threshold will disconnect too many of these clusters, resulting in inferior compression. This leads us to study several techniques for overcoming this problem:

- Best neighbors: By retaining only the k best incoming edges for each node according to the MH algorithm, we can keep the number of edges in G' bounded by k·n (see the sketch after this list).

- Estimating weights: Another way to improve the efficiency of threshold-based MH algorithms is to directly use the similarity estimate provided by the MH schemes as the edge weight in the subsequent branching.

- Pruning heuristics: We have also experimented with heuristics for decreasing the number of edges in LSH algorithms, described further below.
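As a sketch of the first two of these ideas (hypothetical helper names; the inputs are the min-hash samples of Section 3.3), the following keeps only the k most promising incoming edges for each file and reuses the estimated containment directly as the edge weight:

    import heapq
    import networkx as nx

    def best_neighbor_graph(samples: dict, k: int = 4) -> nx.DiGraph:
        """Build the pruned graph G' with at most k incoming edges per file.

        samples: file name -> min-hash sample set (see Section 3.3).
        The estimated containment is used both for pruning and as the edge weight.
        """
        g = nx.DiGraph()
        names = list(samples)
        for tgt in names:
            scored = []
            for ref in names:
                if ref == tgt:
                    continue
                common = samples[tgt] & samples[ref]
                score = len(common) / len(samples[tgt]) if samples[tgt] else 0.0
                scored.append((score, ref))
            # Keep only the k most promising incoming edges for this target.
            for score, ref in heapq.nlargest(k, scored):
                g.add_edge(ref, tgt, weight=score)
        return g

The resulting graph has at most k incoming edges per node, so the subsequent weighting and branching steps touch at most k·n edges.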

In summary, using a fixed threshold followed by an optimal branching on the remaining edges does not result in a very good trade-off between compression and running time.

4.3 Using Best Neighbors

We now look at the case where we limit the number of remaining edges in the MH algorithm by keeping only the k most similar edges into each node, as proposed above. Clearly, this limits the total number of edges in G' to k·n, thus reducing the cost of the subsequent computations.

Table 3 shows the running times of the various phases and the compression benefit as a function of the number k of neighbors and the sampling rate. The clustering time of course depends heavily on the sampling rate; thus one should use the smallest sampling rate that gives reasonable compression, and we do not observe any significant impact on compression up to a rate of 1/128 for the file sizes we have. The time for computing the weights of the graph G' grows (approximately) linearly with k. The compression benefit also grows with k, but even for very small k we get results that are within a few percent of the maximum benefit. As in all our results, the time for the actual branching computation on G' is negligible.

    sampling rate   k   cluster time   weighing time   br. time   benefit over zlib
    1/2             1        1198.25           51.44       0.02             6137816
                    2        1201.27           84.17       0.02             6693921
                    4        1198.00          149.99       0.04             6879510
                    8        1198.91          287.31       0.09             6937119
    1/128           1          40.52           47.77       0.02             6124913
                    2          40.65           82.88       0.03             6604095
                    4          40.57          149.06       0.03             6774854
                    8          40.82          283.57       0.09             6883487

Table 3. Running time (in seconds) and compression benefit for k-neighbor schemes.

4.4 Estimated Weights

By using the containment measure values computed by the MH clustering as the weights of the remaining edges in G', we can further decrease the running time, as shown in Table 4. The time for building the weighted graph is now essentially reduced to zero. However, we have an extra step at the end where we perform the actual compression across the selected edges, which is independent of k and has the same cost as computing the exact weights for k = 1. Looking at the achieved benefit, we see that for the larger values of k we are within about 7% of the optimum, at a total cost of less than 90 seconds (versus several hours for the optimum branching).

    k   cluster time   branching time   zdelta time   benefit over zlib
    1          39.26             0.02          45.63             6115888
    2          39.35             0.02          48.49             6408702
    4          39.35             0.02          48.14             6464221
    8          39.40             0.06          49.63             6503158

Table 4. Running time (in seconds) and compression benefit for k-neighbor schemes with sampling rate 1/128 and estimated edge weights.

    threshold    edges   branching size   benefit over zlib
    20%         28,866             1640             6689872
    40%          8,421             1612             6242688
    60%          6,316             1538             5426000
    80%          2,527             1483             4945364

Table 5. Number of remaining edges and compression benefit for the LSH scheme with the pruning heuristic.

4.5 LSH Pruning Heuristic

For LSH algorithms, we experimented with a simple heuristic for reducing the number of remaining edges where, after the sorting of the file signatures, we only keep a subset of the edges in the case where more than two files have identical signatures. In particular, instead of building a complete graph on these files, we connect them by a simple linear chain of directed edges. This somewhat arbitrary heuristic (which actually started out as a bug) results in significantly decreased running time at only a slight cost in compression, as shown in Table 5. We are currently looking at other more principled approaches to thinning out tightly connected clusters of edges.
4.6 Best Results for Large Data Set

Finally, we present the results of the best schemes identified above on the large data set crawled from the poly.edu domain. We note that our results are still somewhat preliminary and can probably be significantly improved by some optimizations. We were unable to compute the optimum branching on this set due to its size. The MH algorithm used the best-neighbor scheme with estimated edge weights, while the LSH algorithm used a fixed threshold and the pruning heuristic from the previous subsection. For MH, a large fraction of the running time is spent on the clustering, which scales quadratically with the number of files and thus eventually becomes a bottleneck, with much of the remainder spent on the final compression step. For LSH, the majority of the time is spent on computing the exact weights of the remaining edges, while the rest is spent on the clustering.

    algorithm      running time (s)   size
    uncompressed                      257.8 MB
    zlib                       73.9    42.3 MB
    cat+gzip                   79.5    30.5 MB
    best MH                   996.3    23.7 MB
    best LSH                  800.0    21.7 MB

Table 6. Comparison of the best MH and LSH schemes to zlib and cat+gzip.

5 Concluding Remarks

In this paper, we have investigated the problem of using delta compression to obtain a compact representation of a cluster of files. As described, the problem of optimally encoding a collection using delta compression based on a single reference file can be reduced to the problem of computing a maximum weight branching. However, while providing superior compression, this algorithm does not scale to larger collections, motivating us to propose a faster cluster-based delta compression framework. We studied several file clustering heuristics and performed extensive experimental comparisons. Our preliminary results show that significant compression improvements can be obtained over tar+gzip at moderate additional computational cost.

Many open questions remain. First, some additional optimizations are possible that should lead to improvements in compression and running time, including faster sampling and better pruning heuristics for LSH methods. Second, the cluster-based framework we have proposed uses only pairwise deltas, that is, each file is compressed with respect to only a single reference file. It has been shown [5] that multiple reference files can result in significant improvements in compression, and in fact this is already partially exploited by tar+gzip with its 32 KB window on small files. As discussed, a polynomial-time optimal solution for multiple reference files is unlikely, and even finding schemes that work well in practice is challenging. Our final goal is to create general-purpose tools for distributing file collections that improve significantly over tar+gzip.

In related work, we are also studying how to apply delta compression techniques to a large web repository (similar to the Internet Archive at http://www.archive.org) that can store billions of pages on a network of workstations. Note that in this scenario, fast insertions and lookups are crucial, and significant changes in the approach are necessary. An early prototype of the system is currently being evaluated.

References

[1] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In Proc. of the IEEE Data Compression Conference (DCC), March 2001.

[2] G. Banga, F. Douglis, and M. Rabinovich. Optimistic deltas for WWW latency reduction. In 1997 USENIX Annual Technical Conference, Anaheim, CA, pages 289–303, January 1997.

[3] A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES'97), pages 21–29. IEEE Computer Society, 1997.

[4] P. Camerini, L. Fratta, and F. Maffioli. A note on finding optimum branchings. Networks, 9:309–312, 1979.

[5] M. Chan and T. Woo. Cache-based compaction: A new technique for optimizing web transfer. In Proc. of INFOCOM'99, March 1999.

[6] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin. Communication complexity of document exchange. In Proc. of the ACM–SIAM Symp. on Discrete Algorithms, January 2000.

[7] M. Delco and M. Ionescu. xProxy: A transparent caching and delta transfer system for web objects. Unpublished manuscript, May 2000.

[8] F. Douglis, A. Feldmann, B. Krishnamurthy, and J. Mogul. Rate of change and other metrics: a live study of the World Wide Web. In Proc. of the USENIX Symp. on Internet Technologies and Systems (ITS-97), pages 147–158, Berkeley, December 8–11, 1997. USENIX Association.

[9] L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, December 2001.

[10] T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, Dallas, TX, May 2000.

[11] B. Housel and D. Lindquist. WebExpress: A system for optimizing web browsing in a wireless environment. In Proc. of the 2nd ACM Conf. on Mobile Computing and Networking, pages 108–116, November 1996.

[12] J. Hunt, K.-P. Vo, and W. Tichy. Delta algorithms: An empirical analysis. ACM Transactions on Software Engineering and Methodology, 7, 1998.

[13] P. Indyk. A small approximately min-wise independent family of hash functions. In Proc. of the 10th Symp. on Discrete Algorithms, January 1999.

[14] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of the 30th ACM Symp. on Theory of Computing, pages 604–612, May 1998.

[15] D. Korn and K.-P. Vo. Engineering a differencing and compression data format. In Proceedings of the Usenix Annual Technical Conference, pages 219–228, June 2002.

[16] J. MacDonald. File system support for delta compression. MS Thesis, UC Berkeley, May 2000.

[17] U. Manber and S. Wu. GLIMPSE: A tool to search through entire file systems. In Proc. of the 1994 Winter USENIX Conference, pages 23–32, January 1994.

[18] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP. In Proc. of the ACM SIGCOMM Conference, pages 181–196, 1997.

[19] Z. Ouyang, N. Memon, and T. Suel. Delta encoding of related web pages. In Proc. of the IEEE Data Compression Conference (DCC), March 2001.

[20] Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov. Cluster-based delta compression of a collection of files. Technical Report TR-CIS-2002-05, Polytechnic University, CIS Department, October 2002.

[21] T. Suel and N. Memon. Algorithms for delta compression and remote file synchronization. In Khalid Sayood, editor, Lossless Compression Handbook. Academic Press, 2002. To appear.

[22] R. Tarjan. Finding optimum branchings. Networks, 7:25–35, 1977.

[23] S. Tate. Band ordering in lossless compression of multispectral images. IEEE Transactions on Computers, 46(45):211–320, 1997.

[24] W. Tichy. The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4):309–321, November 1984.

[25] W. Tichy. RCS: A system for version control. Software - Practice and Experience, 15, July 1985.

[26] D. Trendafilov, N. Memon, and T. Suel. zdelta: a simple delta compression tool. Technical Report TR-CIS-2002-02, Polytechnic University, CIS Department, June 2002.

[27] J. Ziv and A. Lempel. A universal algorithm for data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

