
Chapter 9

Clustering

CURE Algorithm

This algorithm, called CURE (Clustering Using REpresentatives), assumes a Euclidean space. However, it
does not assume anything about the shape of clusters; they need not be normally distributed, and can
have strange bends, S-shapes, or even rings. Instead of representing clusters by their centroids, it
uses a collection of representative points, as the name implies.

Figure 9.1: Two clusters, one surrounding the other

Example: Figure 9.1 is an illustration of two clusters. The inner cluster is an ordinary circle, while the
second is a ring around the circle. This arrangement is not completely pathological. A creature from
another galaxy might look at our solar system and observe that the objects cluster into an inner circle
(the planets) and an outer ring (the Kuiper belt), with little in between.

Initialization in CURE

We begin the CURE algorithm with the following steps (sketched in code after the list):

1. Take a small sample of the data and cluster it in main memory. In principle, any clustering
method could be used, but as CURE is designed to handle oddly shaped clusters, it is often
advisable to use a hierarchical method in which clusters are merged when they have a close pair
of points.
2. Select a small set of points from each cluster to be representative points. These points should be
chosen to be as far from one another as possible.
3. Move each of the representative points a fixed fraction of the distance between its location and
the centroid of its cluster. Perhaps 20% is a good fraction to choose. Note that this step requires
a Euclidean space, since otherwise, there might not be any notion of a line between two points.
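
The three steps might be sketched in Python as follows. This is a minimal illustration rather than a full CURE implementation: it assumes the sample is an (n, d) NumPy array, uses SciPy's single-linkage hierarchical clustering for step 1 and a greedy farthest-point heuristic for step 2, and the parameter names num_reps and shrink are illustrative choices, not part of the original description.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cure_initialize(sample, num_clusters, num_reps=10, shrink=0.2):
    # Step 1: hierarchically cluster the in-memory sample; "single"
    # linkage merges clusters that have a close pair of points.
    labels = fcluster(linkage(sample, method="single"),
                      t=num_clusters, criterion="maxclust")
    representatives = {}
    for c in np.unique(labels):
        points = sample[labels == c]
        centroid = points.mean(axis=0)
        # Step 2: greedily pick points as far from one another as
        # possible, starting from the point farthest from the centroid.
        reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
        while len(reps) < min(num_reps, len(points)):
            dist = np.min([np.linalg.norm(points - r, axis=1) for r in reps],
                          axis=0)
            reps.append(points[np.argmax(dist)])
        # Step 3: move each representative a fixed fraction (here 20%)
        # of the way toward the centroid of its cluster.
        representatives[c] = [r + shrink * (centroid - r) for r in reps]
    return representatives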

Example: We could use a hierarchical clustering algorithm on a sample of the data from Fig. 9.1. If we
took as the distance between clusters the shortest distance between any pair of points, one from each
cluster, then we would correctly find the two clusters. That is, pieces of the ring would stick together,
and pieces of the inner circle would stick together, but pieces of the ring would always be far from
pieces of the circle. Note that if we used the rule that the distance between clusters was the distance
between their centroids, then we might not get the intuitively correct result. The reason is that the
centroids of both clusters are at the center of the diagram.

For the second step, we pick the representative points. If the sample from which the clusters are
constructed is large enough, we can count on a cluster’s sample points at greatest distance from one
another lying on the boundary of the cluster. Figure 9.2 suggests what our initial selection of sample
points might look like.

Figure 9.2: Select representative points from each cluster, as far from one another as possible

Finally, we move the representative points a fixed fraction of the distance from their true location
toward the centroid of the cluster. Note that in Fig. 9.2 both clusters have their centroid in the same
place: the center of the inner circle. Thus, the representative points from the circle move inside the
cluster, as was intended. Points on the outer edge of the ring also move into their cluster, but points on
the ring's inner edge move outside the cluster.

Completion of the CURE Algorithm

The next phase of CURE is to merge two clusters if they have a pair of representative points, one from
each cluster, that are sufficiently close. The user may pick the distance that defines “close.” This merging
step can repeat, until there are no more sufficiently close clusters.
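
A sketch of this merging rule, assuming each cluster is summarized by a list of its (moved) representative points as NumPy arrays, and close_threshold is the user-chosen distance that defines "close":

import numpy as np

def should_merge(reps_a, reps_b, close_threshold):
    # Merge if some pair of representatives, one from each cluster,
    # lies within the user-chosen distance.
    return any(np.linalg.norm(a - b) <= close_threshold
               for a in reps_a for b in reps_b)

def merge_pass(clusters, close_threshold):
    # Repeat merging until no pair of clusters is sufficiently close.
    # (A fuller implementation would recompute and re-shrink the
    # representatives of each merged cluster.)
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if should_merge(clusters[i], clusters[j], close_threshold):
                    clusters[i] = clusters[i] + clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters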

Example: The situation of Fig. 9.3 serves as a useful illustration. There is some argument that the ring
and circle should really be merged, because their centroids are the same. For instance, if the gap
between the ring and circle were much smaller, it might well be argued that combining the points of the
ring and circle into a single cluster reflected the true state of affairs. After all, the rings of Saturn
have narrow gaps between them, but it is reasonable to visualize the rings as a single object, rather than
several concentric objects. In the case of Fig. 9.3, the choice of

1. The fraction of the distance to the centroid that we move the representative points and
2. How far apart representative points of two clusters need to be to avoid merger

together determine whether we regard Fig. 9.1 as one cluster or two.

The last step of CURE is point assignment. Each point p is brought from secondary storage and
compared with the representative points. We assign p to the cluster of the representative point that is
closest to p.
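
Point assignment then reduces to a nearest-representative search. As a sketch, with clusters mapping each cluster id to its list of representative points:

import numpy as np

def assign_point(p, clusters):
    # Return the id of the cluster whose nearest representative
    # point is closest to p.
    return min(clusters,
               key=lambda c: min(np.linalg.norm(p - r) for r in clusters[c]))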

Example: In our running example, points within the ring will surely be closer to one of the ring’s
representative points than to any representative point of the circle. Likewise, points within the circle will
surely be closest to a representative point of the circle. An outlier – a point within neither the ring nor the
circle – will be assigned to the ring if it is outside the ring. If the outlier is between the ring and the
circle, it will be assigned to one or the other, somewhat favoring the ring because its representative
points have been moved toward the circle.

Stream-Computing

The Stream-Computing Model

We assume that each stream element is a point in some space. The sliding window consists of the most
recent N points. Our goal is to precluster subsets of the points in the stream, so that we may quickly
answer queries of the form “what are the clusters of the last m points?” for any m ≤ N. There are many
variants of this query, depending on what we assume about what constitutes a cluster. For instance, we
may use a k-means approach, where we are really asking that the last m points be partitioned into
exactly k clusters.

We make no restriction regarding the space in which the points of the stream live. It may be a Euclidean
space, in which case the answer to the query is the centroids of the selected clusters. The space may be
non-Euclidean, in which case the answer is the clustroids of the selected clusters, where any of the
definitions for “clustroid” may be used.

The problem is considerably easier if we assume that all stream elements are chosen with statistics that
do not vary along the stream. Then, a sample of the stream is good enough to estimate the clusters, and
we can in effect ignore the stream after a while. However, the stream model normally assumes that the
statistics of the stream elements vary with time. For example, the centroids of the clusters may
migrate slowly as time goes on, or clusters may expand, contract, divide, or merge.

A Stream-Clustering Algorithm

In this section, we shall present a greatly simplified version of an algorithm referred to as BDMO (for the
authors, B. Babcock, M. Datar, R. Motwani, and L. O’Callaghan). The true version of the algorithm
involves much more complex structures, which are designed to provide performance guarantees in the
worst case.

The BDMO Algorithm builds on the methodology for counting ones in a stream from Section 4.6. Here are
the key similarities and differences:

 As in that algorithm, the points of the stream are partitioned into, and summarized by, buckets
whose sizes are a power of two. Here, the size of a bucket is the number of points it represents,
rather than the number of stream elements that are 1.
 As before, the sizes of buckets obey the restriction that there are one or two of each size, up to
some limit. However, we do not assume that the sequence of allowable bucket sizes starts with
1. Rather, they are required only to form a sequence where each size is twice the previous size,
e.g., 3, 6, 12, 24, . . . .
 Bucket sizes are again restrained to be nondecreasing as we go back in time. As in Section 4.6,
we can conclude that there will be O(log N) buckets.
 The contents of a bucket consist of the following (see the sketch after this list):
o The size of the bucket.
o The timestamp of the bucket, that is, the most recent point that contributes to the
bucket. As in Section 4.6, timestamps can be recorded modulo N.
o A collection of records that represent the clusters into which the points of that bucket
have been partitioned. These records contain:
 The number of points in the cluster.
 The centroid or clustroid of the cluster.
 Any other parameters necessary to enable us to merge clusters and maintain
approximations to the full set of parameters for the merged cluster.
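
One plausible layout for these records is sketched below. The extra field stands in for the unspecified "other parameters" (for instance, a sum of squared distances), which the text leaves open; this is one assumed representation, not the structure used by the full BDMO algorithm.

from dataclasses import dataclass, field

@dataclass
class ClusterRecord:
    count: int        # number of points in the cluster
    center: tuple     # centroid (Euclidean) or clustroid (non-Euclidean)
    extra: dict = field(default_factory=dict)  # parameters used in merging

@dataclass
class Bucket:
    size: int         # number of points the bucket represents
    timestamp: int    # time of the most recent point (modulo N)
    clusters: list = field(default_factory=list)  # ClusterRecord objects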

Initializing Buckets

Our smallest bucket size will be p, a power of 2. Thus, every p stream elements, we create a new bucket,
with the most recent p points. The timestamp for this bucket is the timestamp of the most recent point
in the bucket. We may leave each point in a cluster by itself, or we may perform a clustering of these
points according to whatever clustering strategy we have chosen. For instance, if we choose a k-means
algorithm, then (assuming k < p) we cluster the points into k clusters by some algorithm.

Whatever method we use to cluster initially, we assume it is possible to compute the centroids or
clustroids for the clusters and count the points in each cluster. This information becomes part of the
record for each cluster. We also compute whatever other parameters for the clusters will be needed in
the merging process.
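
Bucket creation might then look like the following sketch, reusing the Bucket and ClusterRecord classes above. Here cluster_fn stands for whatever initial clustering strategy was chosen (k-means on the p points, or one singleton cluster per point) and is an assumed callable returning (centroid, member_points) pairs:

def new_bucket(recent_points, now, cluster_fn):
    # Called once every p stream elements, on the most recent p points.
    records = [ClusterRecord(count=len(members), center=tuple(centroid))
               for centroid, members in cluster_fn(recent_points)]
    # The bucket's timestamp is that of its most recent point.
    return Bucket(size=len(recent_points), timestamp=now, clusters=records)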

Merging Buckets

Following the strategy from Section 4.6, whenever we create a new bucket, we need to review the
sequence of buckets. First, if some bucket has a timestamp that is more than N time units prior to the
current time, then nothing of that bucket is in the window, and we may drop it from the list. Second, we
may have created three buckets of size p, in which case we must merge the oldest two of the three. The
merger may create two buckets of size 2p, in which case we may have to merge buckets of increasing
sizes, recursively.
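
This bookkeeping might be sketched as follows, with buckets kept newest-first and timestamps treated as plain integers (ignoring the modulo-N recording for clarity); merge_buckets, the per-pair merge, is sketched after the numbered list below:

def maintain(buckets, now, N, p):
    # Drop any bucket that has slid entirely out of the window.
    buckets[:] = [b for b in buckets if now - b.timestamp < N]
    # Whenever three buckets share a size, merge the oldest two into
    # one of twice that size; the merge may cascade to larger sizes.
    size = p
    while True:
        same = [b for b in buckets if b.size == size]
        if len(same) < 3:
            break
        newer, oldest = same[-2], same[-1]   # newest-first ordering
        merged = merge_buckets(newer, oldest)
        buckets[buckets.index(newer)] = merged
        buckets.remove(oldest)
        size *= 2
    return buckets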

To merge two consecutive buckets, we need to do several things, sketched in code after the list:

1. The size of the merged bucket is twice the common size of the two buckets being merged.
2. The timestamp for the merged bucket is the timestamp of the more recent of the two
consecutive buckets.
3. We must consider whether to merge clusters, and if so, we need to compute the parameters of
the merged clusters. We shall elaborate on this part of the algorithm by considering several
examples of criteria for merging and ways to estimate the needed parameters.
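
Putting these items together, a minimal sketch of merge_buckets. Item 3 is left deliberately simple here: the sketch pools the cluster records and merges any two whose centroids fall within an assumed merge_threshold, which is just one plausible criterion among the several the text goes on to discuss.

import numpy as np

def merge_clusters(c1, c2):
    # The merged centroid is the count-weighted average of the two.
    n = c1.count + c2.count
    center = tuple((c1.count * np.array(c1.center) +
                    c2.count * np.array(c2.center)) / n)
    return ClusterRecord(count=n, center=center)

def merge_buckets(newer, older, merge_threshold=1.0):
    # merge_threshold is an assumed, illustrative parameter.
    bucket = Bucket(size=newer.size + older.size,   # twice the common size
                    timestamp=newer.timestamp,      # more recent of the two
                    clusters=list(newer.clusters) + list(older.clusters))
    # Greedily merge pooled clusters whose centroids are close.
    done = False
    while not done:
        done = True
        cs = bucket.clusters
        for i in range(len(cs)):
            for j in range(i + 1, len(cs)):
                if np.linalg.norm(np.array(cs[i].center) -
                                  np.array(cs[j].center)) <= merge_threshold:
                    cs[i] = merge_clusters(cs[i], cs[j])
                    del cs[j]
                    done = False
                    break
            if not done:
                break
    return bucket

Any of the other merging criteria discussed next would replace the centroid-distance test in the same place.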

Answering Queries

We assume a query is a request for the clusters of the most recent m points in the stream, where m ≤ N.
Because of the strategy we have adopted of combining buckets as we go back in time, we may not be
able to find a set of buckets that covers exactly the last m points. However, if we choose the smallest set
of buckets that cover the last m points, we shall include in these buckets no more than the last 2m
points. We shall produce, as answer to the query, the centroids or clustroids of all the points in the
selected buckets. In order for the result to be a good approximation to the clusters for exactly the last m
points, we must assume that the points m + 1 through 2m (counting back from the most recent) will not
have radically different statistics from the most recent m points. However, if the statistics vary too
rapidly, recall from Section 4.6.6 that a more complex bucketing scheme can guarantee that we can find
buckets covering at most the last m(1 + ε) points, for any ε > 0.

Having selected the desired buckets, we pool all their clusters. We then use some methodology for
deciding which clusters to merge.
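
A sketch of the whole query path, with buckets again kept newest-first: it takes the smallest prefix of buckets covering at least m points (never more than the last 2m, by the doubling rule), then pools and merges their clusters. Reusing merge_buckets from the previous sketch is one simple option for the final merging step.

def answer_query(buckets, m):
    # Select the smallest set of most-recent buckets covering >= m points.
    chosen, covered = [], 0
    for b in buckets:                 # newest-first
        chosen.append(b)
        covered += b.size
        if covered >= m:
            break
    # Pool the clusters of all chosen buckets and merge close ones.
    pooled = chosen[0]
    for b in chosen[1:]:
        pooled = merge_buckets(pooled, b)
    return [c.center for c in pooled.clusters]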
