Clustering Class

Cluster analysis is a method of grouping data objects into clusters based on their similarity, with applications in various fields such as marketing and spatial data analysis. It involves different clustering methods, including partitioning, hierarchical, and density-based approaches, and relies on distance measures to evaluate the closeness of data points. The quality of clustering is determined by intra-class and inter-class similarities, and challenges arise in high-dimensional spaces and in determining the optimal number of clusters.

Cluster Analysis

• What is Cluster Analysis?


• Types of Data in Cluster Analysis
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Cluster Evaluation
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary
The Problem of Clustering
• Given a set of points, with a notion of
distance between points, group the points
into some number of clusters, so that
members of a cluster are in some sense
as close to each other as possible.
What is Cluster Analysis?
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized.]
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no
predefined classes
• Typical applications
– As a stand-alone tool to get insight into data
distribution
– As a preprocessing step for other algorithms
General Applications of
Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature
spaces
– detect spatial clusters and explain them in spatial data
mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar
access patterns
Examples of Clustering
Applications
• Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be
clustered along continental faults
Notion of a Cluster can be
Ambiguous

How many clusters? [Figure: the same set of points can plausibly be seen as two, four, or six clusters.]


Problems With Clustering
• Clustering in two dimensions looks easy.
• Clustering small amounts of data looks
easy.
• And in most cases, looks are not
deceiving.
• Clustering becomes hard with large dimension
and with non-Euclidean distances.
The Curse of Dimensionality
• Many applications involve not 2, but 10 or
10,000 dimensions.
• High-dimensional spaces look different:
almost all pairs of points are at about the
same distance.
– Assuming random points within a bounding
box, e.g., values between 0 and 1 in each
dimension.
What Is Good Clustering?
• A good clustering method will produce high quality
clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
• The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between hierarchical and
partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
Partitional Clustering

[Figure: the original points and a partitional clustering of them.]

Hierarchical Clustering

[Figure: traditional and non-traditional hierarchical clusterings of points p1–p4, each shown with its dendrogram.]
Distance Measures
• Each clustering problem is based on
some kind of “distance” between points.
• Two major classes of distance measure:
1. Euclidean
2. Non-Euclidean
Euclidean Vs. Non-
Euclidean
• A Euclidean space has some number of
real-valued dimensions and “dense”
points.
– There is a notion of “average” of two points.
– A Euclidean distance is based on the
locations of points in such a space.
• A Non-Euclidean distance is based on
properties of points, but not their “location”
in a space.
Axioms of a Distance
Measure
• d is a distance measure if it is a
function from pairs of points to reals such
that:
1. d(x,y) ≥ 0.
2. d(x,y) = 0 iff x = y.
3. d(x,y) = d(y,x).
4. d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality).
Some Euclidean Distances
• L2 norm : d(x,y) = square root of the sum of
the squares of the differences between x
and y in each dimension.
– The most common notion of “distance.”
• L1 norm : sum of the differences in each
dimension.
– Manhattan distance = distance if you had to
travel along coordinates only.
Examples of Euclidean Distances
For x = (5,5) and y = (9,8):
L2 norm: dist(x,y) = √(4² + 3²) = 5
L1 norm: dist(x,y) = 4 + 3 = 7
Another Euclidean Distance
• L∞ norm : d(x,y) = the maximum of the
differences between x and y in any
dimension.
• Note: the maximum is the limit, as n goes
to ∞, of what you get by taking the nth
power of the differences, summing, and
taking the nth root.
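To make these norms concrete, here is a minimal Python sketch (my own illustration, not from the slides) computing the L2, L1, and L∞ distances for the example points x = (5,5) and y = (9,8) above:

# Illustrative sketch: L2, L1, and L-infinity distances between two points
# given as equal-length tuples.
def l2(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def linf(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l2(x, y), l1(x, y), linf(x, y))   # 5.0  7  4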
Non-Euclidean Distances

• Jaccard distance for sets = 1 minus the
ratio of sizes of intersection and union.
• Cosine distance = angle between
vectors from the origin to the points in
question.
• Edit distance = number of inserts and
deletes to change one string into
another.
 As with CDs we have a choice when we
think of documents as sets of words or
shingles:
▪ Sets as vectors: Measure similarity by the
cosine distance
▪ Sets as sets: Measure similarity by the
Jaccard distance
▪ Sets as points: Measure similarity by
Euclidean distance
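As a rough illustration (the helper names below are my own, not part of any library), the Jaccard and cosine distances can be computed as follows:

import math

def jaccard_distance(a, b):
    # a, b are Python sets; 1 minus |intersection| / |union|
    return 1 - len(a & b) / len(a | b)

def cosine_distance(x, y):
    # x, y are equal-length numeric vectors; returns the angle in radians
    dot = sum(p * q for p, q in zip(x, y))
    norms = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
print(cosine_distance((1, 0), (0, 1)))          # ~1.5708 (90 degrees)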

 Hierarchical:
▪ Agglomerative (bottom up):
▪ Initially, each point is a cluster
▪ Repeatedly combine the two
“nearest” clusters into one
▪ Divisive (top down):
▪ Start with one cluster and recursively split it

 Point assignment:
▪ Maintain a set of clusters
▪ Points belong to “nearest” cluster
 Key operation:
Repeatedly combine
two nearest clusters

 Three important questions:


▪ 1) How do you represent a cluster of more
than one point?
▪ 2) How do you determine the “nearness” of
clusters?
▪ 3) When to stop combining clusters?

 Key operation: Repeatedly combine two
nearest clusters
 (1) How to represent a cluster of many points?
▪ Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
 Euclidean case: each cluster has a
centroid = average of its (data)points
 (2) How to determine “nearness” of clusters?
▪ Measure cluster distances by distances of centroids
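A minimal sketch of this Euclidean, centroid-based agglomerative scheme (my own illustration, not an efficient implementation; points are numeric tuples and merging stops at k clusters):

def centroid(cluster):
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerative(points, k):
    clusters = [[p] for p in points]              # initially, each point is a cluster
    while len(clusters) > k:
        best = None                               # find the pair with nearest centroids
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist2(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)   # merge the nearest pair
    return clusters

print(agglomerative([(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)], 2))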

[Figure: data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); successive merges produce centroids (x) at (1,1), (1.5,1.5), (4.5,0.5), (4.7,1.3), shown with the corresponding dendrogram.]
What about the Non-Euclidean case?
 The only “locations” we can talk about are the
points themselves
▪ i.e., there is no “average” of two points

 Approach 1:
▪ (1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
▪ (2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
 (1) How to represent a cluster of many points?
clustroid = point “closest” to other points
 Possible meanings of “closest”:
▪ Smallest maximum distance to other points
▪ Smallest average distance to other points
▪ Smallest sum of squares of distances to other points
▪ For distance metric d, the clustroid c of cluster C minimizes Σx∈C d(x,c)²
[Figure: a cluster on 3 datapoints, showing the centroid (X) and the clustroid.]
Centroid: the average of all (data)points in the cluster; the centroid is an “artificial” point.
Clustroid: an existing (data)point that is “closest” to all other points in the cluster.
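A small sketch of the sum-of-squares choice of clustroid, for an arbitrary distance function d (the toy string distance in the example is my own, purely for illustration):

def clustroid(cluster, d):
    # the point c in the cluster minimizing the sum of d(x, c)^2 over all x
    return min(cluster, key=lambda c: sum(d(x, c) ** 2 for x in cluster))

# Toy example: "distance" = difference in string length
print(clustroid(["ab", "abc", "abcdef"], lambda s, t: abs(len(s) - len(t))))   # 'abc'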
 (2) How do you determine the “nearness” of
clusters?
▪ Approach 2:
Intercluster distance = minimum of the distances
between any two points, one from each cluster
▪ Approach 3:
Pick a notion of “cohesion” of clusters, e.g.,
maximum distance from the clustroid
▪ Merge clusters whose union is most cohesive

 Approach 3.1: Use the diameter of the
merged cluster = maximum distance between
points in the cluster
 Approach 3.2: Use the average distance
between points in the cluster
 Approach 3.3: Use a density-based approach
▪ Take the diameter or avg. distance, e.g., and divide
by the number of points in the cluster

 (3) When to stop clustering?
 Stop when we have K clusters
 Stop if diameter/radius of cluster that results from
the best merger exceeds a threshold.
 Stop if density is below some threshold
▪ Density – number of cluster points per unit volume of the
cluster (number of cluster points divided by some power of
the diameter or radius)
 Stop if evidence suggests that merging will produce a bad
cluster
▪ Sudden increase in cluster diameter.

 Naïve implementation of hierarchical
clustering:
▪ At each step, compute pairwise distances between
all pairs of clusters, then merge the closest pair
▪ O(N³)
 Careful implementation using a priority queue
can reduce time to O(N² log N)
▪ Still too expensive for really big datasets that do
not fit in memory

The K-Means Clustering
Method
• Given k, the k-means algorithm is implemented in four
steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean point, of
the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2; stop when there are no more new assignments
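A compact Python sketch of this iteration (Euclidean case; it seeds with k random points rather than a random partition, and all helper names are my own):

import random

def _mean(cluster):
    return tuple(sum(p[i] for p in cluster) / len(cluster)
                 for i in range(len(cluster[0])))

def _d2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)                  # arbitrary initial seed points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # assign to nearest seed point
            clusters[min(range(k), key=lambda j: _d2(p, centers[j]))].append(p)
        new_centers = [_mean(c) if c else centers[i]    # recompute the centroids
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                      # no more new assignments
            break
        centers = new_centers
    return centers, clusters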
The K-Means Clustering
Method
[Figure: K-means with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; repeat updating and reassigning until the assignments stop changing.]
 Assumes Euclidean space/distance

 Start by picking k, the number of clusters

 Initialize clusters by picking one point per


cluster
▪ Example: Pick one point at random, then k-1
other points, each as far away as possible from
the previous points
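A sketch of that initialization heuristic (pick one point at random, then repeatedly take the point farthest from everything chosen so far); the names are mine:

import random

def farthest_first_init(points, k):
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centers = [random.choice(points)]                   # first center at random
    while len(centers) < k:
        # next center: the point whose nearest chosen center is farthest away
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    return centers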

 1) For each point, place it in the cluster whose
current centroid it is nearest

 2) After all points are assigned, update the


locations of centroids of the k clusters

 3) Reassign all points to their closest centroid


▪ Sometimes moves points between clusters

 Repeat 2 and 3 until convergence


▪ Convergence: Points don’t move between clusters
and centroids stabilize
[Figures: the data points (x) and centroids, showing the clusters after round 1, after round 2, and at the end.]
How to select k?
 Try different k, looking at the change in the
average distance to centroid as k increases
 Average falls rapidly until right k, then
changes little
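One way to apply this heuristic, reusing the kmeans() and _d2() helpers from the earlier sketch (the toy data below is purely illustrative):

def avg_distance_to_centroid(points, k):
    centers, clusters = kmeans(points, k)
    total = sum(_d2(p, centers[i]) ** 0.5
                for i, cluster in enumerate(clusters) for p in cluster)
    return total / len(points)

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]   # toy points
for k in range(1, 5):
    print(k, round(avg_distance_to_centroid(data, k), 3))       # look for the elbow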

[Figure: average distance to centroid vs. k; the curve falls rapidly and then flattens at the best value of k.]
[Figure: k too small – too few clusters, many long distances to the centroid.]
[Figure: k just right – distances to the centroid are rather short.]
[Figure: k too large – too many clusters, little improvement in the average distance.]
Extension of k-means to large data
 BFR [Bradley-Fayyad-Reina] is a
variant of k-means designed to
handle very large (disk-resident) data sets

 Assumes that clusters are normally distributed


around a centroid in a Euclidean space
▪ Standard deviations in different
dimensions may vary
▪ Clusters are axis-aligned ellipses
 Efficient way to summarize clusters
(want memory required O(clusters) and not O(data))
 Points are read from disk one main-memory-
full at a time
 Most points from previous memory loads
are summarized by simple statistics
 To begin, from the initial load we select the
initial k centroids by some sensible approach:
▪ Take k random points
▪ Take a small random sample and cluster optimally
▪ Take a sample; pick a random point, and then
k–1 more points, each as far from the previously
selected points as possible
3 sets of points which we keep track of:
 Discard set (DS):
▪ Points close enough to a centroid to be
summarized
 Compression set (CS):
▪ Groups of points that are close together but
not close to any existing centroid
▪ These points are summarized, but not
assigned to a cluster
 Retained set (RS):
▪ Isolated points waiting to be assigned to a
compression set
[Figure: a cluster and its centroid, whose points are in the DS; compressed sets, whose points are in the CS; and isolated points in the RS.]

Discard set (DS): Close enough to a centroid to be summarized
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
For each cluster, the discard set (DS) is
summarized by:
 The number of points, N
 The vector SUM, whose ith component is
the sum of the coordinates of the points in
the ith dimension
 The vector SUMSQ: ith component = sum of
squares of coordinates in ith dimension

[Figure: a cluster and its centroid; all its points are in the DS.]
 2d + 1 values represent any size cluster
▪ d = number of dimensions
 Average in each dimension (the centroid)
can be calculated as SUMi / N
▪ SUMi = ith component of SUM
 Variance of a cluster’s discard set in
dimension i is: (SUMSQi / N) – (SUMi / N)²
▪ And standard deviation is the square root of that
 Next step: Actual clustering
Note: Dropping the “axis-aligned” clusters assumption would require
storing full covariance matrix to summarize the cluster. So, instead of
SUMSQ being a d-dim vector, it would be a d x d matrix, which is too big!
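A sketch of the (N, SUM, SUMSQ) summary of one discard set, with the centroid and per-dimension variance derived exactly as above (the class name is my own):

class DSSummary:
    def __init__(self, d):
        self.n = 0
        self.sum = [0.0] * d       # per-dimension sum of coordinates
        self.sumsq = [0.0] * d     # per-dimension sum of squared coordinates

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # per-dimension variance: SUMSQ_i / N - (SUM_i / N)^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]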
Processing the “Memory-Load” of points (1):
 1) Find those points that are “sufficiently
close” to a cluster centroid and add those
points to that cluster and the DS
▪ These points are so close to the centroid that
they can be summarized and then discarded
 2) Use any main-memory clustering algorithm
to cluster the remaining points and the old RS
▪ Clusters go to the CS; outlying points to the RS
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
Processing the “Memory-Load” of points (2):
 3) DS set: Adjust statistics of the clusters to
account for the new points
▪ Add Ns, SUMs, SUMSQs
 4) Consider merging compressed sets in the CS
 5) If this is the last round, merge all compressed
sets in the CS and all RS points into their nearest
cluster
Discard set (DS): Close enough to a centroid to be summarized.
Compression set (CS): Summarized, but not assigned to a cluster
Retained set (RS): Isolated points
[Figure: as before – DS points near their centroids, compressed sets in the CS, and isolated RS points.]
 Q1) How do we decide if a point is “close
enough” to a cluster that we will add the
point to that cluster?

 Q2) How do we decide whether two


compressed sets (CS) deserve to be
combined into one?

 Q1) We need a way to decide whether to put
a new point into a cluster (and discard)

 BFR suggests two ways:


▪ The Mahalanobis distance is less than a threshold
▪ High likelihood of the point belonging to
currently nearest centroid

 Normalized Euclidean distance from centroid

 For point (x1, …, xd) and centroid (c1, …, cd)


1. Normalize in each dimension: yi = (xi − ci) / σi
2. Take the sum of the squares of the yi
3. Take the square root

d(x, c) = √( Σi=1..d ((xi − ci) / σi)² )

σi … standard deviation of points in
the cluster in the ith dimension
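A direct sketch of this normalized distance (my own names; the acceptance rule in the comment follows the next slide and is only an example):

import math

def mahalanobis(point, centroid, sigma):
    # sigma[i] = standard deviation of the cluster in dimension i
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, sigma)))

# Example acceptance rule: add the point to the cluster if
#   mahalanobis(p, c, sigma) < 2 * math.sqrt(d)   (about 2 standard deviations)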
 If clusters are normally distributed in d
dimensions, then after transformation, one
standard deviation = √d
▪ i.e., 68% of the points of the cluster will
have a Mahalanobis distance < √d

 Accept a point for a cluster if


its M.D. is < some threshold,
e.g. 2 standard deviations

Q2) Should 2 CS subclusters be combined?
 Compute the variance of the combined
subcluster
▪ N, SUM, and SUMSQ allow us to make that
calculation quickly
 Combine if the combined variance is
below some threshold

 Many alternatives: Treat dimensions


differently, consider density
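A sketch of this test: the combined per-dimension variance can be computed directly from the two (N, SUM, SUMSQ) summaries, so no points need to be revisited (the threshold handling is my own simplification):

def combined_variance(n1, sum1, sumsq1, n2, sum2, sumsq2):
    # per-dimension variance of the union of two summarized subclusters
    n = n1 + n2
    return [(q1 + q2) / n - ((s1 + s2) / n) ** 2
            for s1, q1, s2, q2 in zip(sum1, sumsq1, sum2, sumsq2)]

def should_merge(n1, sum1, sumsq1, n2, sum2, sumsq2, threshold):
    # merge if the combined variance is below the threshold in every dimension
    return all(v < threshold
               for v in combined_variance(n1, sum1, sumsq1, n2, sum2, sumsq2))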

Extension of k-means to clusters
of arbitrary shapes
 Problem with BFR/k-means:
▪ Assumes clusters are normally
distributed in each dimension
▪ And axes are fixed – ellipses at
an angle are not OK

 CURE (Clustering Using REpresentatives):


▪ Assumes a Euclidean distance
▪ Allows clusters to assume any shape
▪ Uses a collection of representative
points to represent clusters
[Figure: points labeled e and h in a salary vs. age plot, forming two natural clusters.]
2 Pass algorithm. Pass 1:
 0) Pick a random sample of points that fit in
main memory
 1) Initial clusters:
▪ Cluster these points hierarchically – group
nearest points/clusters
 2) Pick representative points:
▪ For each cluster, pick a sample of points, as
dispersed as possible
▪ From the sample, pick representatives by moving
them (say) 20% toward the centroid of the cluster
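A rough sketch of step 2 (picking dispersed representatives for one cluster and shrinking them toward the centroid); the function and parameter names are mine:

def pick_representatives(cluster, m=4, shrink=0.2):
    d = len(cluster[0])
    cen = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    reps = [max(cluster, key=lambda p: d2(p, cen))]        # start far from the centroid
    while len(reps) < min(m, len(cluster)):
        # next representative: the point farthest from those already chosen
        reps.append(max(cluster, key=lambda p: min(d2(p, r) for r in reps)))
    # move each representative (say) 20% of the way toward the centroid
    return [tuple(x + shrink * (c - x) for x, c in zip(p, cen)) for p in reps]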
[Figure: the salary vs. age points after initial hierarchical clustering.]
[Figure: pick (say) 4 remote points for each cluster.]
[Figure: move the picked points (say) 20% toward the centroid.]
Pass 2:
 Now, rescan the whole dataset and
visit each point p in the data set

 Place it in the “closest cluster”


▪ Normal definition of “closest”:
Find the closest representative to p and
assign it to representative’s cluster
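A one-function sketch of pass 2 (the cluster ids and dict layout are my own choices):

def assign(point, reps_by_cluster):
    # reps_by_cluster: dict mapping cluster id -> list of representative points
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(reps_by_cluster,
               key=lambda cid: min(d2(point, r) for r in reps_by_cluster[cid]))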

 Clustering: Given a set of points, with a notion
of distance between points, group the points
into some number of clusters
 Algorithms:
▪ Agglomerative hierarchical clustering:
▪ Centroid and clustroid
▪ k-means:
▪ Initialization, picking k
▪ BFR
▪ CURE
Dealing With a Non-
Euclidean Space
• Problem: clusters cannot be represented by
centroids.
• Why? Because the “average” of “points” might
not be a point in the space.
• Best substitute: the clustroid = point in the
cluster that minimizes the sum of the squares
of distances to the points in the cluster.
Representing Clusters in Non-
Euclidean Spaces
• Recall BFR represents a Euclidean cluster
by N, SUM, and SUMSQ.
• A non-Euclidean cluster is represented by:
– N.
– The clustroid.
– Sum of the squares of the distances from
clustroid to all points in the cluster.
The GRGPF Algorithm
• From Ganti et al.
• Works for non-Euclidean distances.
• Works for massive (disk-resident) data.
• Hierarchical clustering.
• Clusters are grouped into a tree of disk
blocks (like a B-tree or R-tree).
Information Retained About
a Cluster
1. N, clustroid, SUMSQ.
2. The p points closest to the clustroid, and
their values of SUMSQ.
3. The p points of the cluster that are
furthest away from the clustroid, and
their SUMSQ’s.
At Interior Nodes of the Tree

• Interior nodes have samples of the


clustroids of the clusters found at
descendant leaves of this node.
• Try to keep clusters on one leaf block
close, descendants of a level-1 node
close, etc.
• Interior part of tree kept in main memory.
Picture of the Tree
[Figure: the interior of the tree, holding samples of clustroids, is kept in main memory; the leaves, holding cluster data, are on disk.]
Initialization

• Take a main-memory sample of points.


• Organize them into clusters
hierarchically.
• Build the initial tree, with level-1 interior
nodes representing clusters of clusters,
and so on.
• All other points are inserted into this tree.
Inserting Points
• Start at the root.
• At each interior node, visit one or more
children that have sample clustroids near
the inserted point.
• At the leaves, insert the point into the
cluster with the nearest clustroid.
Updating Cluster Data
• Suppose we add point X to a cluster.
• Increase count N by 1.
• For each of the 2p + 1 points Y whose
SUMSQ is stored, add d(X,Y)².
• Estimate SUMSQ for X.
Estimating SUMSQ(X )
• If C is the clustroid, SUMSQ(X ) is, by the
CoD (curse-of-dimensionality) assumption:
• N·d(X,C)² + SUMSQ(C)
– Based on assumption that vector from X to C
is perpendicular to vectors from C to all the
other nodes of the cluster.
• This value may allow X to replace one of
the closest or furthest nodes.
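As a tiny sketch of this estimate (the names are mine; d is any distance function):

def estimate_sumsq(x, clustroid_point, n, sumsq_clustroid, d):
    # N * d(X, C)^2 + SUMSQ(C), per the curse-of-dimensionality argument above
    return n * d(x, clustroid_point) ** 2 + sumsq_clustroid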
Possible Modification to
Cluster Data
• There may be a new clustroid --- one of
the p closest points --- because of the
addition of X.
• Eventually, the clustroid may migrate out
of the p closest points, and the entire
representation of the cluster needs to be
recomputed.
Splitting and Merging
Clusters
• Maintain a threshold for the radius of a
cluster = √(SUMSQ/N ).
• Split a cluster whose radius is too large.
• Adding clusters may overflow leaf
blocks, and require splits of blocks up
the tree.
– Splitting is similar to a B-tree.
– But try to keep locality of clusters.
Splitting and Merging --- (2)
• The problem case is when we have split
so much that the tree no longer fits in main
memory.
• Raise the threshold on radius and merge
clusters that are sufficiently close.
Merging Clusters
• Suppose there are nearby clusters with
clustroids C and D, and we want to
consider merging them.
• Assume that the clustroid of the combined
cluster will be one of the p furthest points
from the clustroid of one of those clusters.
Merging --- (2)
• Compute SUMSQ(X ) [from the cluster of
C ] for the combined cluster by summing:
1. SUMSQ(X ) from its own cluster.
2. SUMSQ(D ) + N·[d(X,C)² + d(C,D)²].
Uses the CoD to reason that the distance from X
to each point in the other cluster goes to C,
makes a right angle to D, and another right angle
to the point.
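The same idea as a small sketch (names are mine): a candidate clustroid X combines its own SUMSQ with the estimate for the other cluster.

def merged_sumsq(sumsq_x_own, sumsq_d, n_other, d_xc, d_cd):
    # SUMSQ(X) from X's own cluster, plus SUMSQ(D) + N * [d(X,C)^2 + d(C,D)^2]
    return sumsq_x_own + sumsq_d + n_other * (d_xc ** 2 + d_cd ** 2)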
Merging --- Concluded
• Pick as the clustroid for the combined
cluster that point with the least SUMSQ.
• But if this SUMSQ is too large, do not
merge clusters.
• Hope you get enough mergers to fit the
tree in main memory.
Clustering a Stream (New Topic)
• Assume points enter in a stream.
• Maintain a sliding window of points.
• Queries ask for clusters of points within
some suffix of the window.
• Only important issue: where are the
cluster centroids.
– There is no notion of “all the points” in a
stream.
BDMO Approach
• BDMO = Babcock, Datar, Motwani,
O’Callaghan.
• k –means based.
• Can use less than O(N ) space for
windows of size N.
• Generalizes trick of DGIM: buckets of
increasing “weight.”
Recall DGIM
• Maintains a sequence of buckets B1, B2, …
• Buckets have timestamps (most recent
stream element in bucket).
• Sizes of buckets nondecreasing.
– In DGIM size = power of 2.
• Either 1 or 2 of each size.
Alternative Combining Rule
• Instead of “combine the 2nd and 3rd of any
one size” we could say:
• “Combine Bi+1 and Bi if size(Bi+1 ∪Bi) <
size(Bi-1 ∪Bi-2 ∪… ∪B1).”
– If Bi+1, Bi, and Bi-1 are the same size, inequality
must hold (almost).
– If Bi-1 is smaller, it cannot hold.
Buckets for Clustering
• In place of “size” (number of 1’s) we use
(an approximation to) the sum of the
distances from all points to the centroid of
their cluster.
• Merge consecutive buckets if the “size” of
the merged bucket is less than the sum of
the sizes of all later buckets.
Consequence of Merge
Rule
• In a stable list of buckets, any two
consecutive buckets are “bigger” than
all smaller buckets.
• Thus, “sizes” grow exponentially.
• If there is a limit on total “size,” then the
number of buckets is O(log N ).
• N = window size.
– E.g., all points are in a fixed hypercube.
Outline of Algorithm

1. What do buckets look like?


– Clusters at various levels, represented by
centroids.
2. How do we merge buckets?
– Keep # of clusters at each level small.
3. What happens when we query?
– Final clustering of all clusters of all
relevant buckets.
Organization of Buckets

• Each bucket consists of clusters at


some number of levels.
– 4 levels in our examples.
• Clusters represented by:
1. Location of centroid.
2. Weight = number of points in the cluster.
3. Cost = upper bound on sum of distances
from member points to centroid.
Processing Buckets --- (1)

• Actions determined by N (window size)


and k (desired number of clusters).
• Also uses a tuning parameter τ for
which we use 1/4 to simplify.
– 1/τ is the number of levels of clusters.
Processing Buckets --- (2)

• Initialize a new bucket with k new


points.
– Each is a cluster at level 0.
• If the timestamp of the oldest bucket is
outside the window, delete that bucket.
Level-0 Clusters
• A single point p is represented by (p,
1, 0).
• That is:
1. A point is its own centroid.
2. The cluster has one point.
3. The sum of distances to the centroid is 0.
Merging Buckets --- (1)
• Needed in two situations:
1. We have to process a query, which requires
us to (temporarily) merge some tail of the
bucket sequence.
2. We have just added a new (most recent)
bucket and we need to check the rule about
two consecutive buckets being “bigger” than
all that follow.
Merging Buckets --- (2)

• Step 1: Take the union of the clusters at


each level.
• Step 2: If the number of clusters (points)
at level 0 is now more than N^(1/4), cluster
them into k clusters.
– These become clusters at level 1.
• Steps 3,…: Repeat, going up the levels,
if needed.
Representing New Clusters

• Centroid = weighted average of


centroids of component clusters.
• Weight = sum of weights.
• Cost = sum over all component
clusters of:
1. Cost of component cluster.
2. Weight of component times distance from
its centroid to new centroid.
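A sketch of this merge rule for (centroid, weight, cost) triples; the numeric check at the end reproduces the example that follows (names are mine):

import math

def merge_clusters(components):
    # components: list of (centroid, weight, cost) triples
    total_w = sum(w for _, w, _ in components)
    dims = len(components[0][0])
    new_c = tuple(sum(c[i] * w for c, w, _ in components) / total_w
                  for i in range(dims))
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    new_cost = sum(cost + w * dist(c, new_c) for c, w, cost in components)
    return new_c, total_w, new_cost

print(merge_clusters([((12, 12), 5, 0.0), ((3, 3), 10, 0.0), ((18, -2), 15, 0.0)]))
# new centroid (12.0, 2.0), weight 30 – matching the example below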
Example: New Centroid
[Figure: three component clusters with weights 5, 10, 15 and centroids (12,12), (3,3), (18,−2); their weighted average gives the new centroid (12,2).]
Example: New Costs
[Figure: for each component cluster, its old cost plus its weight times the distance from its centroid to the new centroid (12,2) is added; the result is an upper bound on the true cost.]
Queries
• Find all the buckets within the range of the
query.
– The last bucket may be only partially within
the range.
• Cluster all clusters at all levels into k
clusters.
• Return the k centroids.
Error in Estimation

• Goal is to pick the k centroids that minimize


the true cost (sum of distances from each
point to its centroid).
• Since recorded “costs” are inexact, there
can be a factor of 2 error at each level.
• Additional error because some of last
bucket may not belong.
– But fraction of spurious points is small (why?).
Effect of Cost-Errors
1. Alter when buckets get combined.
Not really important.
2. Produce suboptimal clustering at any
stage of the algorithm.
The real measure of how bad the output is.
Speedup of Algorithm
• As given, algorithm is slow.
– Each new bucket causes O(log N ) bucket-
merger problems.
• A faster version allows the first bucket to
have not k, but N^(1/2) (or in general N^(2τ))
points.
– A number of consequences, including slower
queries, more space.
