Cluster-Based Delta Compression of a Collection of Files

CIS Department
Polytechnic University
Brooklyn, NY 11201
1.1 Contributions of this Paper

In this paper, we study the problem of compressing collections of files, with focus on collections of web pages, with varying degrees of similarity among the files. Our approach is based on using an efficient delta compressor, in particular the zdelta compressor [26], to achieve significantly better compression than that obtained by compressing each file individually or by using tools such as tar and gzip on the collection. Our main contributions are:

- The problem of obtaining optimal compression of a collection of files, given a specific delta compressor, [...]

[...] graph. Section 3 provides our framework called cluster-based delta compression and outlines several approaches under this framework. In Section 4, we present our experimental results. Finally, Section 5 provides some open questions and concluding remarks.

1.2 Related Work

For an overview of delta compression techniques and applications, see [21]. Delta compression techniques were originally introduced in the context of version control systems; see [12, 25] for a discussion. Among the main delta compression algorithms in use today are diff and [...] a very recent nearly linear time technique called locality-sensitive hashing [...]
[...] Each directed edge (v_i, v_j) has a corresponding weight that represents the reduction (in bytes) obtained by delta-compressing file f_j with respect to file f_i. In addition to these nodes, the graph contains a null node v_0 (corresponding to an empty reference file).

Given the above formulation it is not difficult to see that a maximum branching of the graph gives us an optimal delta encoding scheme for a collection of files. Condition (1) in the definition of a branching expresses the constraint that each file is compressed with respect to only one other file. The second condition ensures that there are no cyclical dependencies that would prevent us from decompressing the collection. Finally, given the manner in which the weights have been assigned, a maximum branching results in a compression scheme with optimal benefit over the uncompressed case.

Figure 1 shows the weighted directed graph formed by a collection of four files. In the example, node v_0 is the null node, while the other four nodes represent the four files. The weights on the edges from the null node to the file nodes are the compression savings obtained when the target files are compressed by themselves. The weights for all other edges represent compression savings when one file is delta-compressed with respect to another. The resulting sequence for compression is that one file is compressed by itself, two files are compressed by computing a delta with respect to [...], and the remaining file is compressed by computing a delta with respect to [...].
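To make the formulation above concrete, here is a minimal sketch that builds such a weighted graph for a small collection and computes a maximum weight branching on it. It assumes the networkx library for the branching computation; since zdelta has no standard Python binding, the edge weights are approximated with a zlib-based trick rather than actual zdelta calls, so the numbers are only a rough stand-in for the real benefits.

```python
# Illustrative sketch only: networkx provides the branching computation, and
# the zlib trick below stands in for measuring real zdelta output sizes.
import zlib
import networkx as nx

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data))

def approx_delta_size(source: bytes, target: bytes) -> int:
    # Rough proxy for the size of a delta of target with respect to source.
    return max(0, compressed_size(source + target) - compressed_size(source))

def optimal_plan(files: dict[str, bytes]) -> list[tuple[str, str]]:
    g = nx.DiGraph()
    g.add_node("v0")  # null node: compressing against an empty reference file
    for name, data in files.items():
        # Benefit of compressing the file by itself.
        g.add_edge("v0", name, weight=len(data) - compressed_size(data))
    for src, sdata in files.items():
        for tgt, tdata in files.items():
            if src != tgt:
                # Reduction in bytes from delta-compressing tgt w.r.t. src.
                g.add_edge(src, tgt,
                           weight=len(tdata) - approx_delta_size(sdata, tdata))
    # Conditions (1) and (2) of a branching (one reference per file, no
    # cycles) are enforced by the maximum branching computation itself.
    branching = nx.maximum_branching(g, attr="weight")
    return list(branching.edges())
```

Each returned edge (r, f) says that file f should be delta-compressed with respect to reference r, or stored by itself when r is the null node v0.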
2.2 Experimental Results

We implemented delta compression based on the optimal branching algorithm described in [4, 22], which for dense graphs takes time proportional to the number of edges. Table 1 shows compression results and times on several collections of web pages that we collected by crawling a limited number of pages from each site using a breadth-first crawler.

Table 1. Compression ratios for some collections of files.

The results indicate that the optimum branching approach can give significant improvements in compression over using cat or tar followed by gzip, outperforming them by a factor of [...] to [...].
However, the major problem with the optimum branching approach is that it becomes very inefficient as soon as the number of files grows beyond a few dozen. For the cbc.ca data set with [...] pages, it took more than an hour to perform the computation, while multiple hours were needed for the set with all sites.

Figure 2 plots the running time in seconds of the optimal branching algorithm for different numbers of files, using a set of files from the gcc software distribution also used in [12, 26]. Time is plotted on a logarithmic scale to accommodate two curves: the time spent on computing the edge weights (upper curve), and the time spent on the actual branching computation after the weights of the graph have been determined using calls to zdelta (lower curve). While both curves grow quadratically, the vast majority of the time is spent on computing appropriate edge weights for the graph, and only a tiny amount is spent on the actual branching computation afterwards.
Figure 2. Running time in seconds (logarithmic scale) versus number of files, for computing the complete graph and for the optimum branching computation.

Thus, we need to find techniques that avoid computing the exact weights of all edges in the complete graph. In the next sections, we study such techniques based on clustering of pages and pruning and approximation of edges. We note that another limitation of the branching approach is that it does not support the efficient retrieval of individual files from a compressed collection, or the addition of new files to the collection. This is a problem in some applications that require interactive access, and we do not address it in this paper.

3 Cluster-Based Delta Compression

As shown in the previous section, delta compression techniques have the potential for significantly improved compression of collections of files. However, the optimal algorithm based on maximum branching quickly becomes a bottleneck as we increase the collection size, mainly due to the quadratic number of pairwise delta compression computations that have to be performed. In this section, we describe a basic framework, called Cluster-Based Delta Compression, for efficiently computing near-optimal delta compression schemes on larger collections of files.

3.1 Basic Framework

We first describe the general approach, which leads to several different algorithms that we implemented. In a nutshell, the basic idea is to prune the complete graph into a sparse subgraph, and then find the best delta encoding scheme within this subgraph. More precisely, we have the following general steps (a minimal code sketch of this overall flow follows the list):

(1) Clustering: Identify pairs of files that are similar and thus good candidates for delta compression. Build a sparse directed subgraph containing only edges between these similar pairs.

(2) Assigning Weights: Compute or estimate appropriate edge weights for the sparse subgraph.

(3) Maximum Branching: Perform a maximum branching computation on the subgraph to determine a good delta encoding.
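The following is a minimal sketch of this three-step control flow, under the same assumptions as the earlier sketch (the networkx library for the branching step). The helpers similar_pairs and estimate_weight are hypothetical stand-ins for the clustering and weighting components developed in the rest of this section.

```python
import zlib
import networkx as nx

def cluster_based_plan(files: dict[str, bytes], similar_pairs, estimate_weight):
    g = nx.DiGraph()
    g.add_node("v0")  # null node (empty reference file)
    for name, data in files.items():
        # Compressing a file by itself is always available as a fallback.
        g.add_edge("v0", name, weight=len(data) - len(zlib.compress(data)))
    # Step (1): clustering -- only edges between similar pairs are added.
    for src, tgt in similar_pairs(files):
        # Step (2): compute or estimate the weight of each retained edge.
        g.add_edge(src, tgt, weight=estimate_weight(files[src], files[tgt]))
    # Step (3): maximum branching on the sparse graph picks at most one
    # reference per file and avoids cyclic dependencies.
    return nx.maximum_branching(g, attr="weight")
```

Any of the clustering and weighting variants discussed below can be plugged in through these two parameters.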
If the compressed files produced during the weight computation are saved, then the actual delta compression after the last step consists of simply removing the files corresponding to unused edges (assuming sufficient disk space).

The primary challenge is Step (1), where we need to efficiently identify a small subset of file pairs that give good delta compression. We will solve this problem by using two sets of known techniques for document clustering, one set proposed by Broder [3] and Manber and Wu [17], and one set proposed by Indyk and Motwani [14] and applied to document clustering by Haveliwala, Gionis, and Indyk [10]. These techniques were developed in the context of identifying near-duplicate web pages and finding closely related pages on the web. While these problems are clearly closely related to our scenario, there are also a number of differences that make it nontrivial to apply the techniques to delta compression, and in the following we discuss these issues.
3.2 File Similarity Measures

The compression performance of a delta compressor on a pair of files depends on many details, such as the precise locations and lengths of the matches, the internal compressibility of the target file, the windowing mechanism, and the performance of the internal Huffman coder. A number of formal measures of file similarity, such as edit distance (with or without block moves), copy distance, or LZ distance [6], have been proposed that provide reasonable approximations; see [6, 21] for a discussion. However, even these simplified measures are not easy to compute with, and thus the clustering techniques in [3, 17, 10] that we use are based on two even simpler similarity measures, which we refer to as shingle intersection and shingle containment.

Formally, for a file f and an integer w, we define the shingle set (or w-gram set) S_w(f) of f as the multiset of substrings of length w (called shingles) that occur in f. We define the shingle intersection of two files f and f' as |S_w(f) ∩ S_w(f')| / |S_w(f) ∪ S_w(f')|, and the shingle containment of f in f' as |S_w(f) ∩ S_w(f')| / |S_w(f)|. Then, for two files f and f' and a shingle size w, the shingle intersection can be bounded in terms of the edit distance between the files. We refer to [9] for a proof and a similar result for the case of edit distance with block moves. A similar relationship can also be derived between shingle containment and copy distances. Thus, shingle intersection and shingle containment are related to the edit distance and copy distance measures, which have been used as models for the corresponding classes of edit-based and copy-based delta compression schemes.

While the above discussion supports the use of the shingle-based similarity measures in our scenario, in practice the relationship between these measures and the achieved delta compression ratio is quite noisy. Moreover, for efficiency reasons we will only approximate these measures, introducing additional potential for error.
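As a concrete illustration of these two measures, the sketch below computes shingle sets and the intersection and containment ratios directly. It uses Python sets (rather than multisets) as a simplification, and the shingle size w is left as a parameter since the value used in the experiments is not reproduced here.

```python
def shingles(data: bytes, w: int) -> set[bytes]:
    # S_w(f): all substrings of length w ("shingles") occurring in the file.
    return {data[i:i + w] for i in range(max(0, len(data) - w + 1))}

def shingle_intersection(f1: bytes, f2: bytes, w: int) -> float:
    s1, s2 = shingles(f1, w), shingles(f2, w)
    return len(s1 & s2) / max(1, len(s1 | s2))

def shingle_containment(f1: bytes, f2: bytes, w: int) -> float:
    # Fraction of f1's shingles that also appear in f2.
    s1, s2 = shingles(f1, w), shingles(f2, w)
    return len(s1 & s2) / max(1, len(s1))
```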
3.3 Clustering Using Min-Wise Independent Hashing

We now describe the first set of techniques, called min-wise independent hashing, that was proposed by Broder in [3]. (A similar technique is described by Manber and Wu in [17].) The simple idea in this technique is to approximate the shingle similarity measures by sampling a small subset of shingles from each file. However, in order to obtain a good estimate, the samples are not drawn independently from each file, but they are obtained in a coordinated fashion using a common set of random hash functions that map shingles of length w to integer values. We then select in each file the smallest hash values obtained this way (a code sketch of this coordinated sampling is given at the end of this subsection).

We refer the reader to [3] for a detailed analysis. Note that there are a number of different choices that can be made in implementing these schemes:

Choice of hash functions: We used a class of simple linear hash functions analyzed by Indyk in [13] and also used in [10].

Sample sizes: One option is to use a fixed number of samples, say [...] or [...], from each file, independent of the file size. Another option is to sample shingles at a constant rate, say [...] or [...], resulting in sample sizes that grow with the file size.

Shingle size: We used a shingle size of [...] bytes in the results reported here. (We also experimented with [...] but achieved slightly worse results.)

After selecting the sample, we estimate the shingle intersection or shingle containment measures by intersecting the samples of every pair of files. Thus, this phase takes time quadratic in the number of files. Finally, we decide which edges to include in the sparse graph. There are two independent choices to be made here:

Similarity measure: We can use either intersection or containment as our measure.

Threshold versus best neighbors: We could keep all edges above a certain similarity threshold, say [...], in the graph. Or, for each file, we could keep the k most promising incoming edges, for some constant k, i.e., the edges coming from the k nearest neighbors w.r.t. the estimated similarity measure.

A detailed discussion of the various implementation choices outlined here and their impact on running time and compression is given in the experimental section.
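The sketch below illustrates the constant-rate variant of this coordinated sampling: every file is sampled with the same randomly chosen hash function, and the measures are then estimated by intersecting the samples. The hash family and modulus are illustrative assumptions of this sketch (the paper uses the linear hash functions of [13], which are not reproduced here).

```python
import random

M = (1 << 61) - 1  # large prime modulus, chosen here only for illustration

def sample_shingles(data: bytes, w: int, rate: float, seed: int = 0) -> set[int]:
    rng = random.Random(seed)          # same seed => same hash for every file
    a, b = rng.randrange(1, M), rng.randrange(M)
    sample = set()
    for i in range(len(data) - w + 1):
        h = (a * int.from_bytes(data[i:i + w], "big") + b) % M
        if h < rate * M:               # coordinated, rate-based sampling
            sample.add(h)
    return sample

def estimated_intersection(s1: set[int], s2: set[int]) -> float:
    return len(s1 & s2) / max(1, len(s1 | s2))

def estimated_containment(s1: set[int], s2: set[int]) -> float:
    return len(s1 & s2) / max(1, len(s1))
```

Because all files are sampled with the same hash function, a shingle shared by two files is either sampled in both or in neither, which is what makes intersecting the samples a meaningful estimate.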
The total running time for the clustering step using min-wise independent hashing is thus roughly the cost of hashing the shingles of all files plus the cost, quadratic in the number of files, of intersecting the samples of all pairs,¹ where the first term is determined by the number of files and the (average) size of each file. [...]

¹ If more than one hash function is used, then an additional factor (the number of hash functions) has to be added to the first term.

[...] The second set of clustering techniques, based on locality-sensitive hashing [14], builds a short signature for each file (a code sketch follows the two steps below):

(a) Randomly select a number of indexes into the per-file samples.

(b) For each file, construct a signature by concatenating the hash values at the selected indexes.
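The following sketch shows one standard way such signatures can be used to propose candidate pairs: files whose signatures collide are treated as similar. It assumes a fixed-size min-hash sample per file; the parameter names and the repetition loop are assumptions of this sketch rather than details taken from the paper.

```python
import random
from collections import defaultdict

def lsh_candidate_pairs(samples: dict[str, list[int]], positions_per_signature: int,
                        repetitions: int, seed: int = 0) -> set[tuple[str, str]]:
    rng = random.Random(seed)
    k = len(next(iter(samples.values())))   # fixed number of hash values per file
    pairs = set()
    for _ in range(repetitions):
        # (a) Randomly select index positions into the fixed-size samples.
        idx = rng.sample(range(k), positions_per_signature)
        buckets = defaultdict(list)
        for name, sample in samples.items():
            # (b) The signature is the concatenation of the selected values.
            signature = tuple(sample[i] for i in idx)
            buckets[signature].append(name)
        # Files with identical signatures become candidate similar pairs.
        # (This forms a complete subgraph on each group; Section 4.5 replaces
        # it with a simple chain when a group is large.)
        for group in buckets.values():
            for a in group:
                for b in group:
                    if a != b:
                        pairs.add((a, b))
    return pairs
```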
[...] The main implementation choices can be summarized as follows:

Similarity measure: intersection vs. containment.

Edge pruning rule: threshold vs. best neighbors vs. heuristics.
Edge weight: exact vs. estimated.

We note that not every combination of these choices makes sense. For example, our LSH implementations do not support containment or best neighbors, and require a fixed sample size. On the other hand, we did not observe any benefit in using multiple hash functions in the MH scheme, and thus assume a single hash function for this case. [...]
4 Experimental Results

4.1 Experimental Setup

[...] using gcc 2.95.2 under Solaris 7. Experiments were run on an E450 Sun Enterprise server, with two UltraSparc CPUs at 400 MHz and [...] GB of RAM. Only one CPU was used in the experiments, and data was read from a single SCSI disk. We note that the large amount of memory and the fast disk minimize the impact of I/O on the running times. We used two data sets:

- The medium data set consists of the union of the six web page collections from Section 2, with [...] files and a total size of [...].

- The large data set consists of [...] pages crawled from the poly.edu domain, with a total size of [...]. The pages were crawled in a breadth-first crawl that attempted to fetch all pages reachable from the www.poly.edu homepage, subject to certain pruning rules to avoid dynamically generated content and cgi scripts.
4.2 Threshold-Based Methods

The first experiments that we present look at the performance of MH and LSH techniques that try to identify and retain all edges that are above a certain similarity threshold. In Table 2 we look at the optimum branching method and at three different algorithms that use a fixed threshold to select edges that are considered similar, for different thresholds. For each method, we show the number of remaining edges, the number of edges in the final branching, and the compression benefit over zlib.
alg.          smp size  thr   remaining edges  branching size  benefit over zlib
optimal                          2,782,224          1667           6980935
MH intersect  100       20%        357,961          1616           6953569
                        40%        154,533          1434           6601218
                        60%         43,289           988           5326760
                        80%          2,629           265           1372123
MH contain              40%        463,213          1638           6943999
                        60%        225,675          1550           6724167
                        80%         79,404          1074           5016699

Table 2. Number of remaining edges, number of edges in the final branching, and compression benefit for threshold-based clustering schemes for different sampling techniques and threshold values (column 3).

Unfortunately, these numbers indicate that there is no real "sweet spot" for the threshold that gives both a small number of similar edges and good compression on this data set. We note that this result is not due to the precision of the sampling-based methods, and it also holds for threshold-based LSH algorithms. A simplified explanation for this is that data sets contain different clusters of various similarity, and a low threshold will keep these clusters intact as dense graphs with many edges, while a high threshold will disconnect too many of these clusters, resulting in inferior compression. This leads us to study several techniques for overcoming this problem (the edge pruning rules are sketched in code after the list):

Best neighbors: By retaining only the k best incoming edges for each node according to the MH algorithm, we can keep the number of edges in the subgraph bounded by k times the number of files.

[...]
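Below is a small sketch of the two pruning rules being compared here, assuming the pairwise similarities have already been estimated (for example by the min-wise hashing scheme above); the dictionary layout and function names are choices of this sketch.

```python
from collections import defaultdict

def prune_by_threshold(sims: dict[tuple[str, str], float], threshold: float):
    # Keep every edge whose estimated similarity reaches the threshold.
    return [edge for edge, s in sims.items() if s >= threshold]

def prune_by_best_neighbors(sims: dict[tuple[str, str], float], k: int):
    incoming = defaultdict(list)
    for (src, tgt), s in sims.items():
        incoming[tgt].append((s, src))
    kept = []
    for tgt, candidates in incoming.items():
        # Keep the k most promising incoming edges per file, which bounds the
        # total number of edges by k times the number of files.
        for s, src in sorted(candidates, reverse=True)[:k]:
            kept.append((src, tgt))
    return kept
```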
[...]

In summary, using a fixed threshold followed by an optimal branching on the remaining edges does not result in a very good trade-off between compression and running time.

4.3 Using Best Neighbors

We now look at the case where we limit the number of remaining edges in the MH algorithm by keeping only the k most similar edges into each node, as proposed above. Clearly, this limits the total number of edges in the subgraph to k times the number of files. Table 3 shows the clustering time, the time for computing the edge weights, the branching time, and the compression benefit as a function of the number of neighbors k and the sampling rate. The clustering time of course depends heavily on the sampling rate; thus one should use the smallest sampling rate that gives reasonable compression, and we do not observe any significant impact on compression up to a rate of [...] for the file sizes we have. The time for computing the weights of the graph grows (approximately) linearly with k. The compression rate grows with k, but even for very small k, such as [...], we get results that are within [...] of the maximum benefit. As in all our results, the time for the actual branching computation on the subgraph is negligible.

sample size  k  cluster time  weighing time  branching time  benefit over zlib
1/2          1     1198.25        51.44           0.02            6137816
             2     1201.27        84.17           0.02            6693921
             4     1198.00       149.99           0.04            6879510
             8     1198.91       287.31           0.09            6937119
1/128        1       40.52        47.77           0.02            6124913
             2       40.65        82.88           0.03            6604095
             4       40.57       149.06           0.03            6774854
             8       40.82       283.57           0.09            6883487

Table 3. Running time and compression benefit for k-neighbor schemes.
4.4 Estimated Weights

By using the containment measure values computed by the MH clustering as the weights of the remaining edges in the subgraph, we can further decrease the running time, as shown in Table 4. The time for building the weighted graph is now essentially reduced to zero. However, we have an extra step at the end where we perform the actual compression across the chosen edges, which is independent of k and has the same cost as computing the exact weights for [...]. Looking at the achieved benefit we see that for [...] we are within about [...] of the optimum, at a total cost of less than [...] seconds (versus about [...] seconds for standard zlib and [...]).

k  cluster time  branching time  zdelta time  benefit over zlib
1      39.26          0.02           45.63        6115888
2      39.35          0.02           48.49        6408702
4      39.35          0.02           48.14        6464221
8      39.40          0.06           49.63        6503158

Table 4. Running time and compression benefit for k-neighbor schemes with sampling rate [...] and estimated edge weights.

4.5 LSH Pruning Heuristic

For LSH algorithms, we experimented with a simple heuristic for reducing the number of remaining edges where, after the sorting of the file signatures, we only keep a subset of the edges in the case where more than [...] files have identical signatures. In particular, instead of building a complete graph on these files, we connect these files by a simple linear chain of directed edges (sketched in code below). This somewhat arbitrary heuristic (which actually started out as a bug) results in significantly decreased running time at only a slight cost in compression, as shown in Table 5. We are currently looking at other more principled approaches to thinning out tightly connected clusters of edges.

threshold  edges    branching size  benefit over zlib
20%        28,866        1640           6689872
40%         8,421        1612           6242688
60%         6,316        1538           5426000
80%         2,527        1483           4945364

Table 5. Number of remaining edges and compression benefit for LSH scheme with pruning heuristic.
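A minimal sketch of this chain heuristic, assuming each file already has an LSH signature as above; the group-size cutoff parameter is an assumption of the sketch.

```python
from collections import defaultdict

def chain_heuristic_edges(signatures: dict[str, tuple], max_group: int):
    groups = defaultdict(list)
    for name, sig in signatures.items():
        groups[sig].append(name)
    edges = []
    for group in groups.values():
        if len(group) > max_group:
            # Large group of identical signatures: connect by a linear chain,
            # adding only len(group) - 1 directed edges instead of a complete
            # subgraph with a quadratic number of edges.
            edges.extend(zip(group, group[1:]))
        else:
            edges.extend((a, b) for a in group for b in group if a != b)
    return edges
```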
4.6 Best Results for Large Data Set

Finally, we present the results of the best schemes identified above on the large data set of [...] pages from the poly.edu domain. We note that our results are still somewhat preliminary and can probably be significantly improved by some optimizations. We were unable to compute the optimum branching on this set due to its size.
The MH algorithm used [...] neighbors and estimated edge weights, while the LSH algorithm used a threshold of [...] and the pruning heuristic from the previous subsection. For MH, about [...] of the running time is spent on [...]

algorithm     running time    size
uncompressed                  257.8 MB
zlib              73.9         42.3 MB
cat+gzip          79.5         30.5 MB
best MH          996.3         23.7 MB
best LSH         800.0         21.7 MB

Table 6. Comparison of best MH and LSH schemes to zlib and cat+gzip.
[...]

5 Concluding Remarks

[...] compression, and in fact this is already partially exploited by tar+gzip with its window on small files. As discussed, a polynomial-time optimal solution for multiple reference files is unlikely, and even finding schemes that work well in practice is challenging. Our final goal is to create general purpose tools for distributing file collections that improve significantly over tar+gzip.

In related work, we are also studying how to apply delta compression techniques to a large web repository⁴ that can store billions of pages on a network of workstations. Note that in this scenario, fast insertions and lookups are crucial, and significant changes in the approach are necessary. An early prototype of the system is currently being evaluated.

⁴ Similar to the Internet Archive at http://www.archive.org.

References

[1] M. Adler and M. Mitzenmacher. Towards compressing web graphs. In Proc. of the IEEE Data Compression Conference (DCC), March 2001.

[2] G. Banga, F. Douglis, and M. Rabinovich. Optimistic deltas for WWW latency reduction. In Proc. of the 1997 USENIX Annual Technical Conference, pages 289–303, January 1997.

[...]

[9] [...] Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4):28–34, December 2001.

[10] T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, Dallas, TX, May 2000.

[11] B. Housel and D. Lindquist. WebExpress: A system for optimizing web browsing in a wireless environment. In Proc. of the 2nd ACM Conf. on Mobile Computing and Networking, pages 108–116, November 1996.
[12] J. Hunt, K. P. Vo, and W. Tichy. Delta algorithms: An empirical analysis. ACM Transactions on Software Engineering and Methodology, 7, 1998.

[13] P. Indyk. A small approximately min-wise independent family of hash functions. In Proc. of the 10th Symp. on Discrete Algorithms, January 1999.

[14] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. of the 30th ACM Symp. on Theory of Computing, pages 604–612, May 1998.

[...]

[25] W. Tichy. RCS: A system for version control. Software - Practice and Experience, 15, July 1985.

[26] D. Trendafilov, N. Memon, and T. Suel. zdelta: a simple delta compression tool. Technical Report TR-CIS-2002-02, Polytechnic University, CIS Department, June 2002.

[27] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.