Many types of data analysis, such as the interpretation of Landsat images
discussed in the accompanying article, involve datasets so large that
their direct manipulation is impractical. Some method of data compres-
sion or consolidation must first be applied to reduce the size of the dataset without
losing the essential character of the data. All consolidation methods sacrifice
some detail; the most desirable methods are computationally efficient and yield re-
sults that are—at least for practical applications—representative of the original
data. Here we introduce several widely used algorithms that consolidate data by
clustering, or grouping, and then present a new method, the continuous k-means
algorithm,* developed at the Laboratory specifically for clustering large datasets.
Test scores are an example of one-dimensional data; each data point represents a
single measured quantity. Multidimensional data can include any number of mea-
surable attributes; a biologist might use four attributes of duck bills (four-dimen-
sional data: size, straightness, thickness, and color) to sort a large set of ducks
into several species. Each independent characteristic, or measurement, is one di-
mension. The consolidation of large, multidimensional datasets is the main purpose of the field of cluster analysis.

* The continuous k-means algorithm is part of a patented application for improving both the processing speed and the appearance of color video displays. The application is commercially available for Macintosh computers under the names Fast Eddie (1992) and Planet Color (1993) by Paradigm Concepts, Inc., Santa Fe, NM. This software was developed by Vance Faber, Mark O. Mundt, Jeffrey S. Saltzman, and James M. White.

Figure 1. Clustering Test Scores. The figure illustrates an arbitrary partitioning of 20 test scores into 5 non-overlapping clusters (dashed lines), corresponding to the 5 letter grades F, D, C, B, and A. The reference points (means) are indicated in red. The 20 scores are 47, 52, 53, 56, 57, 59, 61, 65, 67, 68, 70, 71, 73, 75, 77, 79, 82, 83, 87, and 97.

We will describe several clustering methods below. In all of these methods the desired number of clusters k is specified beforehand. The reference point zi for cluster i is usually the centroid of the cluster. In the case of one-dimensional data, such as the test scores, the centroid is the arithmetic average of the values of the points in a cluster. For multidimensional data, where each data point has several components, the centroid will have the same number of components, and each component will be the arithmetic average of the corresponding components of all the data points in the cluster.
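In code, the centroid computation just described is a one-line reduction over the points of a cluster. The following minimal sketch (NumPy is used here purely for convenience; the function name is ours) applies to data of any dimension.

```python
import numpy as np

def centroid(points):
    """Centroid of a cluster: the component-wise arithmetic average.

    points has shape (number of points, number of dimensions); for
    one-dimensional data such as test scores the second dimension is 1.
    """
    return np.asarray(points, dtype=float).mean(axis=0)

# One-dimensional example using four of the test scores from Figure 1.
print(centroid([[52], [53], [56], [57]]))   # -> [54.5]
```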
Perhaps the simplest and oldest automated clustering method is to combine data
points into clusters in a pairwise fashion until the points have been condensed into
the desired number of clusters; this type of agglomerative algorithm is found in
many off-the-shelf statistics packages. Figure 2 illustrates the method applied to
the set of test scores given in Figure 1.
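As a rough illustration of the agglomerative procedure, the sketch below repeatedly merges the two clusters whose centroids are closest until only k clusters remain. The merging criterion (distance between cluster centroids) is one common choice assumed here; the article does not spell out the exact rule used in Figure 2.

```python
# Rough sketch of pairwise agglomerative clustering for one-dimensional data.
# Each cluster is a list of scores; at every step the two clusters whose
# centroids are closest are merged, until only k clusters remain.

def agglomerate(scores, k):
    clusters = [[s] for s in scores]                  # start with one cluster per point
    while len(clusters) > k:
        best = None                                   # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = sum(clusters[a]) / len(clusters[a])
                cb = sum(clusters[b]) / len(clusters[b])
                d = abs(ca - cb)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]       # merge the closest pair
        del clusters[b]
    return clusters

scores = [47, 52, 53, 56, 57, 59, 61, 65, 67, 68,
          70, 71, 73, 75, 77, 79, 82, 83, 87, 97]
print(agglomerate(scores, 5))
```

Note that every merge rescans all remaining pairs of clusters, which is precisely the computational inefficiency discussed below.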
There are two major drawbacks to this algorithm. First—and absolutely prohibi-
tive for the analysis of large datasets—the method is computationally inefficient.
Each step of the procedure requires calculation of the distance between every pos-
sible pair of data points and comparison of all the distances. The second difficulty
is connected to a more fundamental problem in cluster analysis: Although the al-
gorithm will always produce the desired number of clusters, the centroids of these
clusters may not be particularly representative of the data.
A standard way to quantify how well the reference points represent the data is the error measure E, the total squared distance of the data points from their reference points:

E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \| x_{ij} - z_i \|^2,

where xij is the jth point in the ith cluster, zi is the reference point of the ith cluster, and ni is the number of points in that cluster. The notation ||xij - zi|| stands for the distance between xij and zi. Hence, the error measure E indicates the overall spread of data points about their reference points. To achieve a representative clustering, E should be as small as possible.
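Written directly from this definition, a computation of E for a given partitioning might look like the following sketch (NumPy is assumed for the array arithmetic; the function name is ours).

```python
import numpy as np

def error_measure(clusters, refs):
    """Error measure E: total squared distance of points from their reference points."""
    E = 0.0
    for points, z in zip(clusters, refs):          # cluster i with reference point z_i
        diff = np.asarray(points, dtype=float) - np.asarray(z, dtype=float)
        E += float((diff ** 2).sum())              # sum of ||x_ij - z_i||^2 over the cluster
    return E
```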
Figure 2. Agglomerative Clustering of the Test Scores. The figure traces the pairwise combination of the 20 test scores of Figure 1 through steps S1 to S15; the value recorded at each step is the reference point of the newly merged cluster (S1 52.5, S2 56.5, S3 67.5, S4 70.5, S5 82.5, S6 60, S7 74, S8 78, S9 66.25, S10 58.25, S11 72.25, S12 80.25, S13 49.75, S14 69.25, S15 83.75).
When clustering is done for the purpose of data reduction, as in the case of the
Landsat images, the goal is not to find the best partitioning. We merely want a
reasonable consolidation of N data points into k clusters, and, if necessary, some
efficient way to improve the quality of the initial partitioning. For that purpose,
there is a family of iterative-partitioning algorithms that is far superior to the ag-
glomerative algorithm described above.
Iterative algorithms begin with a set of k reference points whose initial values are
usually chosen by the user. First, the data points are partitioned into k clusters: A
data point x becomes a member of cluster i if zi is the reference point closest to x.
The positions of the reference points and the assignment of the data points to clus-
ters are then adjusted during successive iterations. Iterative algorithms are thus
similar to fitting routines, which begin with an initial “guess” for each fitted parameter and then optimize its value. Algorithms within this family differ in the
details of generating and adjusting the partitions. Three members of this family
are discussed here: Lloyd’s algorithm, the standard k-means algorithm, and a con-
tinuous k-means algorithm first described in 1967 by J. MacQueen and recently
developed for general use at Los Alamos.
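For concreteness, here is a minimal sketch of the iterative-partitioning scheme in the batch form usually associated with Lloyd's algorithm: every iteration reassigns all the data points and only then moves the reference points. It is an illustration under those assumptions, not the article's implementation; the function and parameter names are ours.

```python
import numpy as np

def lloyd(data, refs, max_iterations=100):
    """Batch iterative partitioning (Lloyd's algorithm).

    data: array (N, d) of data points; refs: initial reference points, array (k, d).
    """
    data = np.asarray(data, dtype=float)
    refs = np.asarray(refs, dtype=float).copy()
    for _ in range(max_iterations):
        # Partition: assign each data point to the cluster of its nearest reference point.
        labels = ((data[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # Update: move each reference point to the centroid of its cluster.
        new_refs = np.array([data[labels == i].mean(axis=0) if np.any(labels == i) else refs[i]
                             for i in range(len(refs))])
        if np.allclose(new_refs, refs):
            break                                  # the partitioning is stable
        refs = new_refs
    return refs, labels
```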
For Lloyd’s and other iterative algorithms, improvement of the partitioning and
convergence of the error measure E to a local minimum are often quite fast, even
when the initial reference points are badly chosen. However, unlike guesses for
parameters in simple fitting routines, slightly different initial partitionings general-
ly do not produce the same set of final clusters. A final partitioning will be better
than the initial choice, but it will not necessarily be the best possible partitioning.
For many applications, this is not a significant problem. For example, the differ-
ences between Landsat images made from the original data and those made from
the clustered data are seldom visible even to trained analysts, so small differences
in the clustered data are even less important. In such cases, the judgment of the
analyst is the best guide as to whether a clustering method yields reasonable results.
The standard k-means algorithm differs from Lloyd's in its more efficient use of information at every step. The setup for both algorithms is the same: Reference points are chosen and all the data points are assigned to clusters. As with Lloyd's, the k-means algorithm then uses the cluster centroids as reference points in subsequent partitionings, but the centroids are adjusted both during and after each partitioning. For data point x in cluster i, if the centroid zi is the nearest reference point, no adjustments are made and the algorithm proceeds to the next data point. However, if the centroid zj of cluster j is the reference point closest to data point x, then x is reassigned to cluster j, the centroids of the “losing” cluster i (minus point x) and the “gaining” cluster j (plus point x) are recomputed, and the reference points zi and zj are moved to their new centroids. After each step, every one of the k reference points is a centroid, or mean, hence the name “k-means.” An example of clustering using the standard k-means algorithm is shown in Figure 3.

Figure 3. Clustering by the Standard k-Means Algorithm. The diagrams show results during two iterations in the partitioning of nine two-dimensional data points into two well-separated clusters, using the standard k-means algorithm. Points in cluster 1 are shown in red and points in cluster 2 in black; data points are denoted by open circles and reference points by filled circles. Clusters are indicated by dashed lines. Note that the iteration converges quickly to the correct clustering, even for this bad initial choice of the two reference points. (a) Setup: Reference point 1 (filled red circle) and reference point 2 (filled black circle) are chosen arbitrarily. All data points (open circles) are then partitioned into two clusters: each data point is assigned to cluster 1 or cluster 2, depending on whether it is closer to reference point 1 or reference point 2, respectively. (b) Results of first iteration: Next, each reference point is moved to the centroid of its cluster. Then each data point is considered in the sequence shown. If the reference point closest to the data point belongs to the other cluster, the data point is reassigned to that other cluster, and both cluster centroids are recomputed. (c) Results of second iteration: During the second iteration, the process in Figure 3(b) is performed again for every data point. The resulting partition is stable; it will not change with any further iteration.
There are a number of variants of the k-means algorithm. In some versions, the
error measure E is evaluated at each step, and a data point is reassigned to a dif-
ferent cluster only if that reassignment decreases E. In MacQueen’s original paper
on the k-means method, the centroid update (assign data point to cluster, recom-
pute the centroid, move the reference point to the centroid) is applied at each step
in the initial partitioning, as well as during the iterations. In all of these cases, the
standard k-means algorithm requires about the same amount of computation for a
single pass through all the data points, or one iteration, as does Lloyd’s algorithm.
However, the k-means algorithm, because it constantly updates the clusters, is un-
likely to require as many iterations as the less efficient Lloyd’s algorithm and is
therefore considerably faster.
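A sketch of the single-point update that distinguishes the standard k-means algorithm might look like the following: a data point is reassigned as soon as another centroid is found to be closer, and the two affected centroids are immediately recomputed. The incremental centroid formulas are a standard shortcut assumed here, not necessarily the article's implementation, and the function name is ours.

```python
import numpy as np

def kmeans_pass(data, labels, refs, counts):
    """One pass of the standard k-means update over all the data points.

    data:   array (N, d) of data points
    labels: current cluster index of each data point, array (N,)
    refs:   current reference points (cluster centroids), float array (k, d)
    counts: current number of points in each cluster, array (k,)
    labels, refs, and counts are modified in place.
    """
    for idx, x in enumerate(data):
        i = labels[idx]
        j = int(((refs - x) ** 2).sum(axis=1).argmin())   # nearest reference point
        if j == i or counts[i] == 1:
            continue                      # no adjustment (and never empty a cluster)
        # Reassign x from the "losing" cluster i to the "gaining" cluster j
        # and move both centroids to their new positions.
        refs[i] = (refs[i] * counts[i] - x) / (counts[i] - 1)
        refs[j] = (refs[j] * counts[j] + x) / (counts[j] + 1)
        counts[i] -= 1
        counts[j] += 1
        labels[idx] = j
```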
The continuous k-means algorithm is faster than the standard version and thus ex-
tends the size of the datasets that can be clustered. It differs from the standard
version in how the initial reference points are chosen and how data points are se-
lected for the updating process.
In the standard algorithm the initial reference points are chosen more or less arbi-
trarily. In the continuous algorithm reference points are chosen as a random sam-
ple from the whole population of data points. If the sample is sufficiently large,
the distribution of these initial reference points should reflect the distribution of
points in the entire set. If the whole set of points is densest in Region 7, for ex-
ample, then the sample should also be densest in Region 7. When this process is
applied to Landsat data, it effectively puts more cluster centroids (and the best
color resolution) where there are more data points.
Another difference between the standard and continuous k-means algorithms is the
way the data points are treated. During each complete iteration, the standard algo-
rithm examines all the data points in sequence. In contrast, the continuous algo-
rithm examines only a random sample of data points. If the dataset is very large
and the sample is representative of the dataset, the algorithm should converge
much more quickly than an algorithm that examines every point in sequence. In
fact, the continuous algorithm adopts MacQueen’s method of updating the cen-
troids during the initial partitioning, when the data points are first assigned to clus-
ters. Convergence is usually fast enough so that a second pass through the data
points is not needed.
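Combining the two differences, a single continuous k-means pass can be sketched as follows. The sample sizes and the random-number generator are illustrative choices rather than values taken from the article, and the function name is ours.

```python
import numpy as np

def continuous_kmeans(data, k, sample_size, rng=None):
    """Sketch of a single continuous k-means pass.

    The k reference points are drawn as a random sample of the data, so their
    distribution mirrors the density of the full dataset; the centroids are then
    updated, one sampled data point at a time, in MacQueen's fashion.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    refs = data[rng.choice(len(data), size=k, replace=False)].copy()
    counts = np.ones(k)                       # each reference point starts as its own cluster
    for x in data[rng.choice(len(data), size=sample_size, replace=False)]:
        j = int(((refs - x) ** 2).sum(axis=1).argmin())   # nearest reference point
        counts[j] += 1
        refs[j] += (x - refs[j]) / counts[j]  # running-mean (centroid) update
    return refs
```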
The theoretical basis for this use of random samples is MacQueen's formulation of the clustering problem for continuously distributed data. In that formulation the error measure associated with cluster i, whose points occupy the region Ri of the data space, is

E_i = \int_{x \in R_i} \rho(x)\, \| x - z_i \|^2 \, dx,
where ρ (x) is the probability density function, a continuous function defined over
the space, and the total error measure E is given by the sum of the Ei’s. In Mac-
Queen’s concept of the algorithm, a very large set of discrete data points can be
thought of as a large sample—and thus a good estimate—of the continuous proba-
bility density ρ(x). It then becomes apparent that a random sample of the dataset
can also be a good estimate of ρ(x). Such a sample yields a representative set of
cluster centroids and a reasonable estimate of the error measure without using all
the points in the original dataset.
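The same reasoning gives a cheap estimate of the error measure itself: compute the error over a random sample of m points and scale it by N/m. The sketch below assumes that scaling; it is an illustration, not a formula given in the article.

```python
import numpy as np

def estimate_error(data, refs, m, rng=None):
    """Estimate the error measure E from a random sample of m data points."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    refs = np.asarray(refs, dtype=float)
    sample = data[rng.choice(len(data), size=m, replace=False)]
    # Squared distance from each sampled point to its nearest reference point,
    # scaled up from the m sampled points to all N points.
    d2 = ((sample[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum()) * len(data) / m
```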
The computer time can be further reduced by making the individual steps in the
algorithm more efficient. A substantial fraction of the computation time required
by any of these clustering algorithms is typically spent in finding the reference
point closest to a particular data point. In a “brute-force” method, the distances
from a given data point to all of the reference points must be calculated and com-
pared. More elegant methods of “point location” avoid much of this time-consum-
ing process by reducing the number of reference points that must be considered—
but some computational time must be spent to create data structures. Such
structures range from particular orderings of reference points, to “trees” in which
reference points are organized into categories. A tree structure allows one to elim-
inate entire categories of reference points from the distance calculations. The con-
tinuous k-means algorithm uses a tree method to cluster three-dimensional data,
such as pixel colors on a video screen. When applied to seven-dimensional Land-
sat data, the algorithm uses single-axis boundarizing, which orders the reference
points along the direction of maximum variation. In either method only a few
points need be considered when calculating and comparing distances. The choice
of a particular method will depend on the number of dimensions of the dataset.
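As a concrete illustration of the single-axis idea, the following sketch orders the reference points along the coordinate of greatest variance and scans outward from a query point, pruning candidates once the projected gap alone exceeds the best full distance found so far. It is written in the spirit of the Friedman, Baskett, and Shustek method, not as the Laboratory's implementation; the class and method names are ours.

```python
import numpy as np
from bisect import bisect_left

class SingleAxisLocator:
    """Nearest-reference-point search using a single-axis ordering.

    Reference points are sorted along the axis of maximum variance; the
    projected distance along that axis is a lower bound on the true distance,
    so the outward scan can stop early.
    """
    def __init__(self, refs):
        self.refs = np.asarray(refs, dtype=float)
        self.axis = int(np.argmax(self.refs.var(axis=0)))   # axis of maximum variation
        self.order = np.argsort(self.refs[:, self.axis])
        self.keys = self.refs[self.order, self.axis]         # sorted projections

    def nearest(self, x):
        x = np.asarray(x, dtype=float)
        start = bisect_left(self.keys, x[self.axis])
        best_idx, best_d2 = -1, np.inf
        lo, hi = start - 1, start
        # Scan outward from the insertion position, always taking the side
        # whose projected gap to the query point is smaller.
        while lo >= 0 or hi < len(self.keys):
            d_lo = x[self.axis] - self.keys[lo] if lo >= 0 else np.inf
            d_hi = self.keys[hi] - x[self.axis] if hi < len(self.keys) else np.inf
            if d_lo <= d_hi:
                gap, j = d_lo, lo
                lo -= 1
            else:
                gap, j = d_hi, hi
                hi += 1
            if gap * gap > best_d2:
                break                         # no closer reference point can remain
            d2 = float(np.sum((self.refs[self.order[j]] - x) ** 2))
            if d2 < best_d2:
                best_idx, best_d2 = int(self.order[j]), d2
        return best_idx
```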
Further Reading
James M. White, Vance Faber, and Jeffrey S. Saltzman. 1992. Digital color representation. U.S.
Patent Number 5,130,701.
Stuart P. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory
IT-28: 129–137.
Edward Forgy. 1965. Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics 21: 768.
J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Volume I,
Statistics. Edited by Lucien M. Le Cam and Jerzy Neyman. University of California Press.
Jerome H. Friedman, Forest Baskett, and Leonard J. Shustek. 1975. An algorithm for finding nearest neighbors. IEEE Transactions on Computers C-24: 1000–1006. [Single-axis boundarizing, dimensionality.]
Jerome H. Friedman, Jon Louis Bentley, and Raphael Ari Finkel. 1977. An algorithm for finding best
matches in logarithmic expected time. ACM Transactions on Mathematical Software 3: 209–226.
[Tree methods.]
Helmuth Späth. 1980. Cluster Analysis Algorithms for Data Reduction and Classification of Objects.
Halsted Press.
Anil K. Jain and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice Hall.