
Scalability for Clustering Algorithms Revisited

Fredrik Farnstrom, Computer Science and Engineering, Lund Institute of Technology, Sweden ([email protected])
James Lewis, Computer Science and Engineering, University of California, San Diego (jlewis@cs.ucsd.edu)
Charles Elkan, Computer Science and Engineering, University of California, San Diego (elkan@cs.ucsd.edu)

ABSTRACT

This paper presents a simple new algorithm that performs k-means clustering in one scan of a dataset, while using a buffer for points from the dataset of fixed size. Experiments show that the new method is several times faster than standard k-means, and that it produces clusterings of equal or almost equal quality. The new method is a simplification of an algorithm due to Bradley, Fayyad and Reina that uses several data compression techniques in an attempt to improve speed and clustering quality. Unfortunately, the overhead of these techniques makes the original algorithm several times slower than standard k-means on materialized datasets, even though standard k-means scans a dataset multiple times. Also, lesion studies show that the compression techniques do not improve clustering quality. All results hold for 400 megabyte synthetic datasets and for a dataset created from the real-world data used in the 1998 KDD data mining contest. All algorithm implementations and experiments are designed so that results generalize to datasets of many gigabytes and larger.

1. INTRODUCTION

Clustering is the task of grouping together similar items in a dataset. Similar data items can be seen as being generated from the same component of a mixture of probability distributions. The clustering problem is to determine the parameters of the mixture distribution that generated a set of observed data items, where for each item its component is an unobserved feature.

The k-means algorithm is a heuristic solution to the clustering problem based on the assumption that data points are drawn from a fixed number k of spherical Gaussian distributions. The algorithm is an iterative process of assigning cluster memberships and re-estimating cluster parameters. It terminates when the data points no longer change membership due to changes in the re-estimated cluster parameters.

Under the assumption that datasets tend to be small, research on clustering algorithms has traditionally focused on improving the quality of clusterings [4]. However, many datasets now are large and cannot fit into main memory. Scanning a dataset stored on disk or tape repeatedly is time-consuming, but the standard k-means algorithm typically requires many iterations over a dataset to converge to a solution, with each element needing to be accessed on each iteration. Therefore, considerable recent research has focused on designing clustering algorithms that use only one pass over a dataset [9; 6]. These methods all assume that only a portion of the dataset can reside in memory, and require only a single pass through the dataset.

The starting point of this paper is a single pass k-means algorithm proposed by Bradley, Fayyad, and Reina [1]. This method uses several types of compression to limit memory usage. However, the compression techniques make the algorithm complicated. We investigate the tradeoffs involved by comparing several variants of the algorithm of Bradley et al. experimentally with a simple new single pass k-means method. Our overall conclusion is that the simple method is superior in speed, and at least equal in the quality of clusterings produced.

2. SINGLE PASS K-MEANS ALGORITHMS

The algorithm of Bradley et al. [1] is intended to increase the scalability of k-means clustering for large datasets. The central idea is to use a buffer where points from the dataset are saved in compressed form. First, the means of the clusters are initialized, as with standard k-means. Then, all available space in the buffer is filled with points from the dataset. The current model is updated on the buffer contents in the usual way. The buffer contents are then compressed in two steps.

The first step, called primary compression, finds and discards points that are unlikely ever to move to a different cluster. There are two methods to do this. The first method measures the Mahalanobis distance from each point to the cluster mean it is associated with, and discards a point if it is within a certain radius. For the second method, confidence intervals are computed for each cluster mean. Then, for each point, a worst case scenario is created by perturbing the cluster means within the confidence intervals. The cluster mean that is associated with the point is moved away from the point, and the cluster means of all other clusters are moved towards the point. If the point is still closest to the same cluster mean after the perturbations, then it is deemed unlikely ever to change cluster membership.

Points that are unlikely to change membership are removed from the buffer, and are placed in a discard set. Each of the main clusters has a discard set, represented by the sufficient statistics for all points belonging to that cluster that have been removed.
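To make the two primary compression tests concrete, here is a minimal C++ sketch of one way they could be implemented. This is our illustration rather than the authors' code: the structure and function names are invented, the Mahalanobis distance assumes a diagonal covariance with nonzero per-dimension standard deviations, and the confidence intervals are represented only by a precomputed per-dimension half-width.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Sketch of the two primary compression tests (our illustration). Each
    // cluster is summarized by a mean, a per-dimension standard deviation,
    // and a confidence-interval half-width on the mean in each dimension.
    struct Cluster {
        std::vector<double> mean;
        std::vector<double> stddev;
        std::vector<double> ciHalf;
    };

    // Squared Mahalanobis distance under a diagonal covariance assumption.
    double mahalanobis2(const std::vector<double>& x, const Cluster& c) {
        double d2 = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) {
            double z = (x[j] - c.mean[j]) / c.stddev[j];
            d2 += z * z;
        }
        return d2;
    }

    // Test 1: discard x if it lies within a chosen Mahalanobis radius of its own cluster.
    bool withinDiscardRadius(const std::vector<double>& x, const Cluster& own, double radius) {
        return mahalanobis2(x, own) <= radius * radius;
    }

    // Test 2: worst-case perturbation. Move the owning mean away from x and every
    // other mean towards x, each by its confidence-interval half-width per dimension,
    // and check whether x is still closest (Euclidean) to its current cluster.
    bool stillClosestAfterPerturbation(const std::vector<double>& x,
                                       const std::vector<Cluster>& clusters,
                                       std::size_t ownIndex) {
        auto perturbedDist2 = [&](const Cluster& c, bool isOwn) {
            double d2 = 0.0;
            for (std::size_t j = 0; j < x.size(); ++j) {
                double diff = std::fabs(x[j] - c.mean[j]);
                double shift = c.ciHalf[j];
                // Own mean moves away from the point; other means move towards it.
                double adjusted = isOwn ? diff + shift : std::fmax(diff - shift, 0.0);
                d2 += adjusted * adjusted;
            }
            return d2;
        };
        double ownDist2 = perturbedDist2(clusters[ownIndex], true);
        for (std::size_t i = 0; i < clusters.size(); ++i)
            if (i != ownIndex && perturbedDist2(clusters[i], false) < ownDist2)
                return false;
        return true;
    }

A point passing either test would be folded into its cluster's discard set and removed from the buffer.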



On the remaining points in the buffer, another k-means clustering is performed, with a larger number of clusters than for the main clustering. This phase is called secondary compression. The aim is to save buffer space by storing some auxiliary clusters instead of individual points. In order to replace points in the buffer by a secondary cluster, the cluster must satisfy a tightness criterion, meaning that its standard deviation in each dimension must be below a certain threshold β. Secondary clusters are combined using hierarchical agglomerative clustering [7], as long as the combined clusters satisfy the tightness criterion.

After primary and secondary compression, the space in the buffer that has become available is filled with new points, and the whole procedure is repeated. The algorithm ends after one scan of the dataset, or if the centers of the main clusters do not change significantly as more points are added.

2.1 Implementation issues

We have coded a new C++ implementation of the algorithm of Bradley et al. All the algorithms we compare experimentally are implemented as variants of the same code. The platform for our experiments is a dual 450 MHz Pentium II workstation with 256 megabytes of main memory, running Linux. Our program is not multithreaded, so only one of the processors is directly used in the experiments. The program is compiled with all optimizations turned on. All datasets are stored on disk as Linux binary files. Regardless of the size of any dataset, each pass of each algorithm reads the dataset afresh from disk. Therefore, our experimental conclusions generalize to very large datasets.

Some details of the implementation of their algorithm are not given by Bradley et al. For each primary cluster, a Mahalanobis radius must be determined that causes a certain fraction p of buffer points in that cluster to be discarded. Our implementation computes the distance between each buffer point and the cluster it is assigned to. For each cluster, the list of distances is sorted. Then it is easy to find a radius such that a certain fraction of points is discarded. However, sorting can change the time complexity of the whole algorithm. It may be possible to determine each Mahalanobis radius more efficiently, especially when the fraction of discarded points is small.

Our implementation stores the sufficient statistics (sum of elements, squared sum of elements, number of points) as well as the mean and standard deviation in each dimension of all main and secondary clusters. Means are stored so that the distance between old and new means (the new mean is computed from the sum of the elements) can be computed when doing k-means clustering. Standard deviations are stored to speed up primary compression. Representing one cluster uses four times as much space as one data point. Therefore, if a secondary cluster contains four or fewer points, the points themselves are retained instead of a representation of the cluster.

For our purposes, the sufficient statistics of a cluster are two vectors, Sum and SumSq, and one integer, n. The vectors store the sum and the sum of squares of the elements of the points in the cluster, and the integer records the number of points in the cluster. From these statistics, the mean and variance along each dimension can be calculated. Let the sufficient statistics of a cluster A be (Sum(A), SumSq(A), n(A)). If a point x is added to the cluster, the sufficient statistics are updated as follows:

  Sum_j(A) := Sum_j(A) + x_j
  SumSq_j(A) := SumSq_j(A) + x_j^2
  n(A) := n(A) + 1.

If clusters A and B are merged, the sufficient statistics for the resulting cluster C are

  Sum(C) = Sum(A) + Sum(B)
  SumSq(C) = SumSq(A) + SumSq(B)
  n(C) = n(A) + n(B).

2.2 A simple single pass k-means method

A special case of the algorithm of Bradley et al., not mentioned in their paper, would be when all points in the buffer are discarded each time. This algorithm is:

1. Randomly initialize cluster means. Let each cluster have a discard set in the buffer that keeps track of the sufficient statistics for all points from previous iterations.

2. Fill the buffer with points.

3. Perform iterations of k-means on the points and discard sets in the buffer, until convergence. For this clustering, each discard set is treated like a regular point placed at the mean of the discard set, but weighted with the number of points in the discard set.

4. For each cluster, update the sufficient statistics of the discard set with the points assigned to the cluster. Remove all points from the buffer.

5. If the dataset is exhausted, then finish. Otherwise, repeat from Step 2.

This algorithm is called the simple single pass k-means method. Compared to the more complicated algorithm above, it does much less computation each time the buffer is filled, and the whole buffer can be filled with new points at every fill. Following Bradley et al. [1], if a cluster ever becomes empty, it is reinitialized with the point in the buffer that is most distant from the centers of all other clusters. However, with a large dataset and a small number of clusters, reinitialization is almost never necessary.

Like the more complicated algorithm above, the simple method uses only one scan over the dataset and a fixed size buffer. It also satisfies all the other desiderata listed by Bradley et al. [1]: incremental production of better results given additional data, ease of stopping and resuming execution, and ability to use many different database scan modes, including forward-only scanning over a database view that is never materialized completely.
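The following C++ sketch puts the five steps together with the sufficient statistics of Section 2.1. It is a minimal illustration under our own simplifications (buffer fills passed in as in-memory chunks, a fixed number of k-means iterations per fill, squared Euclidean assignment, no reinitialization of empty clusters), not the implementation used in the experiments.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Sufficient statistics of a discard set: Sum, SumSq, and the point count n.
    struct DiscardSet {
        std::vector<double> sum, sumSq;
        long n = 0;
        explicit DiscardSet(std::size_t d) : sum(d, 0.0), sumSq(d, 0.0) {}
        void add(const std::vector<double>& x) {   // Sum_j += x_j, SumSq_j += x_j^2, n += 1
            for (std::size_t j = 0; j < x.size(); ++j) { sum[j] += x[j]; sumSq[j] += x[j] * x[j]; }
            ++n;
        }
    };

    // Index of the nearest mean under squared Euclidean distance.
    static std::size_t nearest(const std::vector<double>& x,
                               const std::vector<std::vector<double>>& means) {
        std::size_t best = 0;
        double bestD = std::numeric_limits<double>::max();
        for (std::size_t c = 0; c < means.size(); ++c) {
            double d = 0.0;
            for (std::size_t j = 0; j < x.size(); ++j) { double t = x[j] - means[c][j]; d += t * t; }
            if (d < bestD) { bestD = d; best = c; }
        }
        return best;
    }

    // Simple single pass k-means: cluster each buffer fill together with the
    // weighted discard sets, fold the buffer into the discard sets, and refill.
    // 'fills' stands in for reading successive chunks of the dataset from disk.
    std::vector<std::vector<double>> simpleSinglePassKMeans(
            std::vector<std::vector<double>> means,                      // step 1: initial means
            const std::vector<std::vector<std::vector<double>>>& fills,  // successive buffer fills
            int itersPerFill) {
        const std::size_t k = means.size(), d = means[0].size();
        std::vector<DiscardSet> discard(k, DiscardSet(d));

        for (const auto& buffer : fills) {                   // step 2: fill the buffer
            for (int it = 0; it < itersPerFill; ++it) {      // step 3: k-means on buffer + discard sets
                std::vector<std::vector<double>> newSum(k, std::vector<double>(d, 0.0));
                std::vector<double> weight(k, 0.0);
                for (std::size_t c = 0; c < k; ++c) {        // each discard set acts as one point
                    if (discard[c].n == 0) continue;         // placed at its mean, weighted by n
                    std::vector<double> m(d);
                    for (std::size_t j = 0; j < d; ++j) m[j] = discard[c].sum[j] / discard[c].n;
                    std::size_t a = nearest(m, means);
                    for (std::size_t j = 0; j < d; ++j) newSum[a][j] += discard[c].sum[j];
                    weight[a] += discard[c].n;
                }
                for (const auto& x : buffer) {
                    std::size_t a = nearest(x, means);
                    for (std::size_t j = 0; j < d; ++j) newSum[a][j] += x[j];
                    weight[a] += 1.0;
                }
                for (std::size_t c = 0; c < k; ++c)
                    if (weight[c] > 0.0)
                        for (std::size_t j = 0; j < d; ++j) means[c][j] = newSum[c][j] / weight[c];
            }
            for (const auto& x : buffer)                     // step 4: fold buffer into discard sets
                discard[nearest(x, means)].add(x);
        }                                                    // step 5: repeat until data exhausted
        return means;
    }

A production version would stream each fill from disk, iterate until the means stop changing, and reinitialize any cluster that becomes empty, as described above.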



3. LESION EXPERIMENTS

To evaluate the contribution of each of the data compression methods, we report the results of experimental lesion studies. Comparisons are made between four variants of the algorithm of Bradley et al., the standard k-means algorithm, and the simple single pass k-means algorithm described above.

Three variants involve adding one of the data compression methods described above to the previous variant. The first variant uses none of the data compression techniques. This variant runs until convergence on the first fill of the buffer, and then stops. This variant is similar to clustering on a small random sample of the dataset. In the second variant, the first primary compression technique is used. This involves moving to the discard set each point within a certain Mahalanobis distance from its associated cluster mean. In the third variant, the second primary compression technique is added. Confidence intervals are used to discard data points deemed unlikely ever to change cluster membership. The fourth variant includes the data compression technique of determining secondary clusters. All parameter settings used in the experiments reported here are shown in Table 1.

  Parameter                              Value
  Confidence level for cluster means     95%
  Max std. dev. for tight clusters (β)   1.5
  Number of secondary clusters           20
  Fraction of points discarded (p)       20%

  Table 1: Parameter settings used for the lesion studies of the k-means algorithm of Bradley et al.

3.1 Synthetic datasets

The lesion experiments use synthetic datasets. Using artificial data allows the clusters found by each algorithm to be compared with known true probability distribution components. In each synthetic dataset, points are drawn from a mixture of a fixed number of Gaussian distributions. Each Gaussian is assigned a random weight that determines the probability of generating a data point from that component. Following Bradley et al. [1], the mean and variance of each Gaussian are uniformly sampled, for each dimension, from the intervals [-5, 5] and [0.7, 1.5] respectively.

In order to measure the accuracy of a clustering, the true cluster means must be compared with the estimated cluster means. The problem of discovering which true cluster mean corresponds to which estimated mean must be solved. If the number of clusters k is small, then it is possible to use the one of the k! permutations that yields the highest accuracy. We do this, and for this reason the number of clusters k = 5 is small in our experiments. Results with a much larger number of clusters might be different.

The synthetic datasets have 100 dimensions and 1,000,000 data points. They are stored on disk in 400 megabyte files. This size is chosen to guarantee that the operating system cannot buffer a dataset in main memory. Except for the standard k-means algorithm, each clustering algorithm uses a limited buffer large enough to contain approximately 1% of the data points.

The experiments use 30 different synthetic datasets. For each dataset each algorithm generates five different clusterings from different initial conditions. The best of these five models is retained for a comparison of the accuracy of the algorithms. The best of five runs is used because k-means algorithms are known to be sensitive to how cluster centers are initialized. In applications where a good clustering is wanted, it is therefore natural to use the best of several runs.

In general one of two different situations occurs with each clustering. Either one cluster mean in the model is close to each true Gaussian center, or two cluster means in the model are trapped near the same center. As we measure the distance between the true and the estimated cluster means, if a center is trapped then the distance measure will be much larger than otherwise. Therefore, the cluster quality in Figure 1 is based only on datasets for which every algorithm produced at least one clustering where no center is trapped.

Figure 1: The graph shows the mean sum of the distances between the estimated and true cluster means, for synthetic datasets of 1,000,000 points, 100 dimensions, and five clusters. The algorithms are random sampling k-means (R1), single pass k-means with the first primary compression technique only (S1--), with both primary compression techniques (S1-), with primary and secondary compression (S1), the simple single pass k-means method (N1), and the standard k-means algorithm operating on the whole dataset (K). Error bars show standard errors.

3.2 Lesion experiment results

Figure 1 shows that even the simplest single pass algorithm achieves the same clustering quality as the full k-means method. Random sampling k-means is less accurate because it uses only 1% of the total data points.

A clustering where no centers are trapped is highly desirable. Therefore, we also measure the fraction of clusterings where no centers are trapped, counting clusterings from all five random initial conditions. We call this fraction the reliability of an algorithm. Surprisingly, Figure 2 shows that the single pass algorithms are more reliable than the standard k-means algorithm, and this difference is statistically significant.

Throughout this paper, the difference between x and y is called statistically significant if x + s_x < y - s_y or y + s_y < x - s_x, where s_x and s_y are the standard errors of x and y respectively. If x is the mean of n observations then its standard error is the standard deviation of the n observations divided by √n. For the special case where x is a proportion, its standard error is √(x(1 - x)/n). If n is sufficiently large then a Gaussian approximation is valid, so the null hypothesis that the true values of x and y are the same can be rejected with confidence p < 0.05, if x + s_x < y - s_y or y + s_y < x - s_x.



Numerical p values from specific statistical tests are not reported because their precision could be misleading, since the assumptions on which standard tests are based are often not valid when comparing performance metrics for data mining methods [3].

Figure 2: The graph shows the reliability of the different algorithms on the synthetic datasets. Reliability is defined as the fraction of all runs where no centers are trapped. Error bars show standard errors.

Figure 3: The graph shows the average running time of each k-means algorithm variant. Error bars show standard errors.

Figure 2 shows surprisingly that the standard k-means algorithm is not significantly more reliable than random sampling k-means. This fact indicates that the standard algorithm has difficulty escaping from a bad initialization, regardless of how many data points are available. Similarly, the more complicated single pass methods are not more reliable than the simple single pass method. This fact indicates that the more complicated methods do not have any improved ability to escape from a bad initialization.

The average running time of each algorithm is shown in Figure 3. Reported times are averages over 135 runs for each algorithm. The full algorithm of Bradley et al., identified as S1 in Figure 3, is about four times slower than the standard k-means algorithm, while the simple single pass method is about 40% faster.

With the method of Bradley et al., each additional data compression technique allows more points to be discarded from the buffer. Doing so should make the algorithm run faster, because then fewer refills of the buffer are needed. A balance must be maintained between the time taken to identify points to discard and the speedup gained from discarding those points. Figure 3 shows that compression based on confidence interval perturbation causes a net decrease in speed, while compression based on secondary clustering is beneficial.

4. EXPERIMENTS WITH REAL DATA

In order to experiment with real-world data, the dataset from the 1998 KDD (Knowledge Discovery and Data Mining Conference) contest is used. This dataset contains information about people who have made charitable donations in response to direct mailing requests. In principle, clustering can be used to identify groups of donors who can be targeted with specialized solicitations in order to maximize donation profits.

The dataset contains 95412 records, each of which has 481 fields. We take a subset of these fields and code each record as a real-valued vector. Numerical fields (e.g. amounts of past donations, income, age) are directly represented by a single element in the vector. Date values (e.g. donation dates, date of birth) are stored as the number of months from a fixed date. Fields with discrete values, such as an income category, are converted into several binary elements. Each vector has 56 elements in total, of which 18 are binary. To give equal weight to each feature, each feature is normalized to have zero mean and unit variance. The records in the original KDD dataset are converted to this format and saved to a binary file of about 21.4 megabytes. As mentioned in Section 2.1, the implementation of the standard k-means algorithm reads the dataset from disk at each iteration, even though the dataset is small enough to be saved in memory.
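As an illustration of this kind of preprocessing, the following C++ sketch codes a record with one numerical field, one date field, and one five-valued categorical field, and then z-scores every feature. The field names, the reference date, and the category count are hypothetical; this is not the exact coding used for the contest data.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical raw record with the three kinds of fields described above.
    struct RawRecord {
        double lastDonationAmount;        // numerical field: copied directly
        int donationYear, donationMonth;  // date field: coded as months since a fixed date
        int incomeCategory;               // discrete field, categories 0..4: one binary element each
    };

    // Code one record as a real-valued vector (1 numeric + 1 date + 5 binary = 7 elements here).
    std::vector<double> encode(const RawRecord& r) {
        std::vector<double> v;
        v.push_back(r.lastDonationAmount);
        v.push_back((r.donationYear - 1990) * 12.0 + (r.donationMonth - 1));  // months since Jan 1990 (assumed)
        for (int c = 0; c < 5; ++c) v.push_back(r.incomeCategory == c ? 1.0 : 0.0);
        return v;
    }

    // Normalize every feature to zero mean and unit variance across the dataset.
    void zScore(std::vector<std::vector<double>>& data) {
        const std::size_t n = data.size(), d = data[0].size();
        for (std::size_t j = 0; j < d; ++j) {
            double mean = 0.0, var = 0.0;
            for (const auto& x : data) mean += x[j];
            mean /= n;
            for (const auto& x : data) var += (x[j] - mean) * (x[j] - mean);
            var /= n;
            double sd = std::sqrt(var);
            for (auto& x : data)
                x[j] = (sd > 0.0) ? (x[j] - mean) / sd : 0.0;  // constant features map to zero
        }
    }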



The purpose of this experiment is to compare the running time and clustering quality of standard k-means, operating on the whole dataset or on samples, the algorithm of Bradley et al. using all types of compression, and the simple single pass method. Experiments are performed with samples and buffers of 10% and 1% of the size of the whole dataset. The number of clusters is always 10.

First, the dataset is randomly reordered. Then it is clustered five times by each algorithm, each time with different randomly chosen initial conditions. All algorithms use the same five initial conditions. The quality of each clustering is measured as the sum of the squared distances between each point and the cluster mean it is associated with. Of the five clusterings for each algorithm, the one with the best quality is used. As above, the best of five is chosen because k-means algorithms are highly sensitive to initial conditions. The whole procedure is repeated 52 times with different random orderings of the dataset.

It is difficult to discover good parameter values for the algorithm of Bradley et al., especially for the parameters that control the number of points removed by secondary compression. The values used here are given in Table 2. Note that it is difficult for a secondary cluster to have standard deviation β < 1.1 in every dimension, even though the whole dataset is normalized to have standard deviation 1.0 in each dimension.

  Parameter                              Value
  Confidence level for cluster means     95%
  Max std. dev. for tight clusters (β)   1.1
  Number of secondary clusters           40
  Fraction of points discarded (p)       20%

  Table 2: Parameter settings used for the algorithm of Bradley et al. with the KDD dataset.

Figure 4 shows the average quality of the best of five clusterings, for each algorithm. Random sampling k-means operating on a 1% sample performs much worse than all other methods. Standard k-means performs best, followed by the simple single pass method using a buffer of size 1%, followed by the algorithm of Bradley et al. All differences mentioned here are statistically significant.

Figure 4: The graph shows the sum of the squared distances between each point in the dataset and the cluster mean it is associated with, on the KDD contest dataset of 95,412 points with 10 clusters. The algorithms are due to Bradley et al. (S10 and S1), the simple single pass method (N10 and N1), random sampling k-means (R10 and R1), and standard k-means working on the whole dataset (K). Algorithms with names ending in 10 use a buffer or sample of size 10% of the whole dataset, while those with names ending with 1 use a 1% buffer or sample. Error bars show standard errors.

There is no "true" clustering of the KDD dataset that can be used to define reliability in a way similar to how reliability is defined for the synthetic datasets. Therefore, the reliability of an algorithm is defined here to be the fraction of all clusterings that have a quality measure of less than 3.9 × 10^6. This number is chosen somewhat arbitrarily based on Figure 4 as a threshold for what constitutes an acceptable clustering. A reliable algorithm is one that is less sensitive to how cluster centers are initialized, and that produces a good clustering more often.

Figure 5 shows that the standard k-means method and the simple single pass method with a buffer of size 1% are the most reliable. All other methods are statistically significantly less reliable. It is surprising that the simple single pass algorithm using a buffer of size 1% of the entire dataset outperforms the same method using a 10% buffer. Similar results were found by Bradley et al. [1] when they varied the buffer size used by their algorithm. The reason why a smaller buffer can be better remains to be discovered.

Figure 5: The graph shows the reliability of each different algorithm on the KDD contest dataset, defined as the fraction of clusterings having a distortion less than 3.9 × 10^6. Error bars show standard errors.

Figure 6 shows the average running time of each method. Compared to the standard k-means method, the algorithm of Bradley et al. is over four times slower, while the simple single pass method is over five times faster.

Figure 6: The graph shows the average time taken by each method to perform one clustering of the KDD dataset. Error bars show standard errors.

5. COMPUTATIONAL COMPLEXITY

In the discussion here of the asymptotic efficiency of the algorithms, we use the following notation:

  m    number of k-means passes over entire dataset
  m'   number of k-means passes over one buffer refill
  d    number of dimensions
  n    number of data points
  b    size of buffer, as fraction of n
  r    number of buffer refills
  k    number of main clusters
  k2   number of secondary clusters
  m2   number of passes for each secondary clustering.

The time complexity of the standard k-means algorithm is O(nkdm), where empirically m grows very slowly with n, k, and d.

For the simple single pass k-means algorithm, the time complexity of clustering the buffer contents once is O(nbkdm'). Because the buffer is emptied completely before each refill, the number of refills is 1/b, so the time complexity of clustering the whole dataset is O(nbkdm' · 1/b) = O(nkdm'). Interestingly, m' tends to be less than m because clustering is performed over fewer data points than for standard k-means. In fact, m' tends towards one for large datasets, because when the model has stabilized, new points are simply placed in the nearest cluster. This observation is true for all the single pass algorithms.

The complicated nature of the method of Bradley et al. makes it difficult to analyze. The main clustering takes O(nbkdm') time per fill. Measuring the Mahalanobis distance to the closest cluster for the points in the buffer is an O(nbd) operation. Finding the discard radius for all main clusters takes O(nb log nb) time if sorting is used; the worst case is when essentially all points belong to one cluster. The total time complexity of the first method of primary compression is thus O(nb(d + log nb)).
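The sorting step is not essential: the discard radius for a cluster is just the p-quantile of its points' Mahalanobis distances, which a selection algorithm finds in expected linear time, so the radius-finding step would no longer dominate the O(nbd) distance computation. A possible refinement (ours, not part of the paper's implementation) in C++:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Pick the discard radius for one cluster as the p-quantile of the Mahalanobis
    // distances of its buffer points, using selection instead of a full sort.
    // std::nth_element runs in expected linear time in the number of distances.
    double discardRadius(std::vector<double> distances, double fraction) {
        if (distances.empty()) return 0.0;
        std::size_t idx = static_cast<std::size_t>(fraction * distances.size());
        if (idx >= distances.size()) idx = distances.size() - 1;
        std::nth_element(distances.begin(), distances.begin() + idx, distances.end());
        return distances[idx];  // points with distance <= this radius are discarded
    }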



The second method of primary compression, where the cluster means are perturbed, has time complexity O(nbkd). In the secondary compression phase, m2 passes with k2 clusters are performed over the points in each fill of the buffer, giving this phase O(nbk2dm2) complexity for one fill of the buffer. Then, hierarchical agglomerative clustering is performed on the k2 clusters. This can be done with O(k2^2 d) time and space complexity [7].

The steps described above must be repeated r times to scan through the whole dataset. Typically r > 1/b since the whole buffer cannot be filled at each refill. So, the algorithm of Bradley et al. has a total time complexity of

  O(r (nbkdm' + nb(d + log nb) + nbkd + nbk2dm2 + k2^2 d)).

In general k2 > k and m2 > m', so the total time complexity is O(nbrk2dm2). An assumption here is that the clustering is not stopped until the whole dataset has been processed. This assumption is true in all our experiments.

The time, memory, and disk I/O complexities of the three algorithms are summarized in Table 3. The simple single pass algorithm is superior asymptotically in both time and space complexity to the algorithm of Bradley et al.

  Algorithm           Time        Space       I/O
  Standard            nkdm        kd          ndm
  Bradley et al.      nbrk2dm2    nbd + k2d   nd
  Simple single pass  nkdm'       nbd         nd

  Table 3: Order of magnitude time, memory, and disk input/output complexity for different k-means algorithms.

6. DISCUSSION

The main positive result of this paper is that a simple single pass k-means algorithm, with a buffer of size 1% of the input dataset, can produce clusterings of almost the same quality as the standard multiple pass k-means method, while being several times faster.

Being faster than the standard k-means algorithm is not a trivial accomplishment, because the standard algorithm is already quite scalable. Its running time is close to linear in the size of the input dataset, since the number of passes required is empirically almost independent of the size of the dataset. In addition, at each pass the dataset is scanned sequentially, so a good operating system and disk array can easily provide access to the dataset with high bandwidth.

Although it is called scalable, the algorithm of Bradley et al. is much slower in our experiments than the standard k-means method. Bradley et al. did not report this fact because their paper contains no comparisons with standard k-means, and no running times. Moreover, the paper gives no measures of statistical significance for differences in clustering quality between algorithms, and the largest dataset used in the paper whose size can be computed from information in the paper occupies only 10 megabytes when stored as a floating point binary file. The operating system of any modern workstation can cache a dataset of this size in main memory.

We may not have found optimal settings for the parameters of the algorithm of Bradley et al. However, we have searched informally for good parameter settings. In general, algorithms that have many parameters with few guidelines about how to choose values for them are difficult to use effectively.

Compared to the standard k-means algorithm, the method of Bradley et al. is slower on the KDD dataset than on the synthetic datasets. The opposite is true for the simple single pass method: it is relatively faster on the KDD dataset. The reason is that the KDD dataset has clusters that are separated less well, and the method of Bradley et al. is sensitive to clusters not being separated well. The standard algorithm requires 13 passes on average to converge on the KDD dataset, but only 3.3 passes on the synthetic datasets. As explained in Section 5, the number m' of iterations per refill of the buffer tends to 1 for the simple algorithm for all datasets. But for the method of Bradley et al., the number m2 of iterations in the secondary clustering for each refill of the buffer may remain high.

With all k-means algorithms, both single pass and multiple pass, it is possible to update several clusterings in parallel, where each clustering starts from different initial conditions. We did not do so for our experiments. If we did so, the average running time per clustering of all methods would presumably decrease. There is no reason to think that the relative speeds of the methods would change.

If a dataset to be clustered does not already exist as a single table in a relational database or as a flat file, then materializing it can be expensive. Materializing a dataset may be especially expensive if it consists of a join of tables in a distributed or heterogeneous data warehouse. In this case, all single pass clustering methods can be faster than the standard k-means algorithm. However, the ranking of different single pass methods according to speed is likely to be still the same.



The results of this paper are complementary to those of Pelleg and Moore [8], who show how to use a sophisticated data structure to increase the speed of k-means clustering for datasets of low dimensionality (d < 8). Our simple single pass method is effective regardless of dimensionality. The results here are also complementary to those of Guha, Mishra, Motwani, and O'Callaghan [5], who present single pass clustering algorithms that are guaranteed to achieve clusterings with quality within a constant factor of optimal.

We have not tested other single pass clustering algorithms, notably the BIRCH method [9]. The authors of BIRCH have shown convincingly that it is faster than k-means on large datasets. A comparison of the simple single pass method of this paper with BIRCH would be interesting. Also, all the single pass methods discussed in this paper can be extended to apply to other iterative clustering approaches, and in particular to expectation maximization (EM) [2]. It would be interesting to repeat the experiments of this paper in the EM context.

Acknowledgments: The authors are grateful to Nina Mishra and Bin Zhang of Hewlett Packard Laboratories and to the anonymous referees for valuable comments.

7. REFERENCES

[1] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 9-15. AAAI Press, 1998.

[2] P. Bradley, U. Fayyad, and C. Reina. Scaling EM (expectation maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research, Redmond, WA, November 1998.

[3] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924, 1998.

[4] V. Ganti, J. Gehrke, and R. Ramakrishnan. Mining very large databases. Computer, 32(8):38-45, 1999.

[5] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000. To appear.

[6] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 73-84. ACM, 1998.

[7] M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods. Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA, February 1998.

[8] D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1999.

[9] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 103-114. ACM, 1996.
