2000 - Scalability For Clustering Algorithms Revisited
Three variants involve adding one of the data compression methods described above to the previous variant. The first variant uses none of the data compression techniques. This variant runs until convergence on the first fill of the buffer, and then stops. This variant is similar to clustering on a small random sample of the dataset. In the second variant, the first primary compression technique is used. This involves moving to the discard set each point within a certain Mahalanobis distance from its associated cluster mean. In the third variant, the second primary compression technique is added. Confidence intervals are used to discard data
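As a concrete illustration, the Mahalanobis-based primary compression step described above can be sketched as follows. The threshold r and the diagonal-covariance assumption are illustrative choices for this sketch, not the exact settings of the implementation studied here.

```python
import numpy as np

def primary_compress(points, mean, variances, r=2.0):
    """Split points into a discard set and a retained set.

    A point is moved to the discard set when its Mahalanobis
    distance to its associated cluster mean is below the
    threshold r.  A diagonal covariance (per-dimension
    variances) is assumed, as is common in scalable k-means
    implementations.
    """
    # Squared Mahalanobis distance under a diagonal covariance:
    # d^2(x) = sum_i (x_i - mu_i)^2 / sigma_i^2
    d2 = np.sum((points - mean) ** 2 / variances, axis=1)
    close = d2 <= r ** 2
    return points[close], points[~close]  # (discard, keep)
```

Points in the discard set would then be summarized by sufficient statistics rather than kept individually in the buffer.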
Figure 2: The graph shows the reliability of the different algorithms on the synthetic datasets. Reliability is defined as the fraction of all runs where no centers are trapped. Error bars show standard errors.

Figure 3: The graph shows the average running time of each k-means algorithm variant. Error bars show standard errors.
Statistical tests are not reported because their precision could be misleading, since the assumptions on which standard tests are based are often not valid when comparing performance metrics for data mining methods [3].

Figure 2 shows, surprisingly, that the standard k-means algorithm is not significantly more reliable than random sampling k-means. This fact indicates that the standard algorithm has difficulty escaping from a bad initialization, regardless of how many data points are available. Similarly, the more complicated single pass methods are not more reliable than the simple single pass method. This fact indicates that the more complicated methods do not have any improved ability to escape from a bad initialization.

The average running time of each algorithm is shown in Figure 3. Reported times are averages over 135 runs for each algorithm. The full algorithm of Bradley et al., identified as S1 in Figure 3, is about four times slower than the standard k-means algorithm, while the simple single pass method is about 40% faster.

With the method of Bradley et al., each additional data compression technique allows more points to be discarded from the buffer. Doing so should make the algorithm run faster, because then fewer refills of the buffer are needed. A balance must be maintained between the time taken to identify points to discard and the speedup gained from discarding those points. Figure 3 shows that compression based on confidence interval perturbation causes a net decrease in speed, while compression based on secondary clustering is beneficial.

4. EXPERIMENTS WITH REAL DATA

In order to experiment with real-world data, the dataset from the 1998 KDD (Knowledge Discovery and Data Mining Conference) contest is used. This dataset contains information about people who have made charitable donations in response to direct mailing requests. In principle, clustering can be used to identify groups of donors who can be targeted with specialized solicitations in order to maximize donation profits.

The dataset contains 95412 records, each of which has 481 fields. We take a subset of these fields and code each record as a real-valued vector. Numerical fields (e.g. amounts of past donations, income, age) are directly represented by a single element in the vector. Date values (e.g. donation dates, date of birth) are stored as the number of months from a fixed date. Fields with discrete values, such as an income category, are converted into several binary elements. Each vector has 56 elements in total, of which 18 are binary. To give equal weight to each feature, each feature is normalized to have zero mean and unit variance. The records in the original KDD dataset are converted to this format and saved to a binary file of about 21.4 megabytes. As mentioned in Section 2.1, the implementation of the standard k-means algorithm reads the dataset from disk at each iteration, even though the dataset is small enough to be saved in memory.

The purpose of this experiment is to compare the running time and clustering quality of standard k-means, operating on the whole dataset or on samples, the algorithm of Bradley et al. using all types of compression, and the simple single pass method. Experiments are performed with samples and buffers of 10% and 1% of the size of the whole dataset. The number of clusters is always 10.

First, the dataset is randomly reordered. Then it is clustered five times by each algorithm, each time with different randomly chosen initial conditions. All algorithms use the same five initial conditions. The quality of each clustering is measured as the sum of the squared distances between each point and the cluster mean it is associated with. Of the five clusterings for each algorithm, the one with the best quality is used. As above, the best of five is chosen because k-means algorithms are highly sensitive to initial conditions. The whole procedure is repeated 52 times with different random orderings of the dataset.

It is difficult to discover good parameter values for the algorithm of Bradley et al., especially for the parameters that control the number of points removed by secondary compression. The values used here are given in Table 2. Note that it is difficult for a secondary cluster to have standard
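The record coding scheme above (numeric fields copied directly, dates stored as month counts from a fixed date, discrete fields expanded into binary indicators, then z-score normalization) might be sketched as follows. The field names and the fixed reference date are hypothetical, chosen only for illustration.

```python
import numpy as np

FIXED_DATE = (1998, 6)  # hypothetical reference (year, month)

def months_since(year, month, fixed=FIXED_DATE):
    """Code a date as the number of months before a fixed date."""
    return (fixed[0] - year) * 12 + (fixed[1] - month)

def one_hot(value, categories):
    """Convert a discrete field into several binary elements."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode(record, income_categories):
    """Code one record as a real-valued vector."""
    vec = [record["donation_amount"], record["age"]]    # numeric fields
    vec.append(months_since(*record["last_donation"]))  # date field
    vec.extend(one_hot(record["income"], income_categories))
    return vec

def normalize(matrix):
    """Give each feature zero mean and unit variance.

    Assumes no feature is constant across the dataset.
    """
    m = np.asarray(matrix, dtype=float)
    return (m - m.mean(axis=0)) / m.std(axis=0)
```

On the real KDD-98 data the selected subset of fields would yield 56 elements per vector, of which 18 are binary.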
[Figures 5 and 6 appear here as bar charts over the algorithms S10, S1, N10, N1, R10, R1, and K.]
Figure 5: The graph shows the reliability of each different algorithm on the KDD contest dataset, defined as the fraction of clusterings having a distortion less than 3.9 • 10~. Error bars show standard errors.

Figure 6: The graph shows the average time taken by each method to perform one clustering of the KDD dataset. Error bars show standard errors.
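The distortion measure used throughout these experiments, together with the best-of-five selection over random initializations, can be sketched as follows. Here run_kmeans stands in for any of the k-means variants compared above; it is a caller-supplied function, not part of the paper's implementation.

```python
import numpy as np

def distortion(points, centers, labels):
    """Sum of squared distances between each point and the
    cluster mean it is associated with."""
    return float(np.sum((points - centers[labels]) ** 2))

def best_of_five(points, run_kmeans, k=10, trials=5, seed=0):
    """Run k-means `trials` times from different random initial
    conditions and keep the clustering with the lowest
    distortion, since k-means algorithms are highly sensitive
    to initial conditions."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(trials):
        # Initial centers: k distinct points chosen at random.
        init = points[rng.choice(len(points), size=k, replace=False)]
        centers, labels = run_kmeans(points, init)
        d = distortion(points, centers, labels)
        if best is None or d < best[0]:
            best = (d, centers, labels)
    return best
```

In the experiments above this whole procedure is itself repeated over many random reorderings of the dataset, with all algorithms sharing the same five initial conditions per repetition.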