Clustering Analysis
The goal of clustering: intra-cluster distances are minimized, while inter-cluster distances are maximized.
Understanding: group related objects, e.g., group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Discovered Clusters                                                        Industry Group
1  Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN,                        Technology1-DOWN
   Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN,
   INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down,
   Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2  ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN,           Technology2-DOWN
   Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,
   Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Summarization: reduce the size of large data sets (e.g., clustering precipitation data in Australia).
(Figure: John Snow's map of the London cholera outbreak, 1854.)
How many clusters? The notion of a cluster can be ambiguous: the same set of points can be grouped in different ways (for example, into six clusters).
Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
(Figure: 3 well-separated clusters)
Center-based
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster.
The center of a cluster is often a centroid, the point that minimizes the sum of distances to all the points in the cluster, or a medoid, the most “representative” point of the cluster.
(Figure: 4 center-based clusters)
Contiguous Cluster (Nearest neighbor or
Transitive)
A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster.
(Figure: 8 contiguous clusters)
Density-based
A cluster is a dense region of points that is separated from other regions of high density by regions of low density.
Used when the clusters are irregular or intertwined, and when noise
and outliers are present.
(Figure: 6 density-based clusters)
Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept.
(Figure: 2 overlapping circles)
Clustering as an optimization problem
Finds clusters that minimize or maximize an objective
function.
A brute-force approach: enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters using the given objective function. (This is NP-hard.)
Can have global or local objectives.
▪ Hierarchical clustering algorithms typically have local objectives
▪ Partitional algorithms typically have global objectives
A variation of the global objective function approach is to fit
the data to a parameterized model.
▪ The parameters for the model are determined from the data, and
they determine the clustering
▪ E.g., mixture models assume that the data is a ‘mixture’ of a number of statistical distributions.
Clustering algorithms covered:
K-means and its variants
Hierarchical clustering
DBSCAN
Partitional clustering approach
Each cluster is associated with a centroid
(center point)
Each point is assigned to the cluster with
the closest centroid
Number of clusters, K, must be specified
The objective is to minimize the sum of squared distances (SSE) of the points to their respective centroids.
Problem: Given a set X of n points in a d-dimensional space and an integer K, group the points into K clusters C = {C_1, C_2, …, C_K} so that the cost

Cost(C) = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(x, c_i)

is minimized, where c_i is the centroid of the points in cluster C_i.
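The standard (Lloyd's) algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. A minimal R sketch of this iteration (function and variable names are illustrative, and empty clusters are not handled):

# Lloyd's algorithm sketch: X is an n x d numeric matrix, K the number of clusters
simple_kmeans = function(X, K, iters = 100) {
  centroids = X[sample(nrow(X), K), , drop = FALSE]       # random initial centroids
  for (it in 1:iters) {
    # assignment step: index of the closest centroid for every point
    d = as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
    cl = max.col(-d)
    # update step: each centroid becomes the mean of its assigned points
    new.centroids = t(sapply(1:K, function(k) colMeans(X[cl == k, , drop = FALSE])))
    if (all(abs(new.centroids - centroids) < 1e-8)) break  # centroids stopped moving
    centroids = new.centroids
  }
  list(cluster = cl, centers = centroids)
}
res = simple_kmeans(as.matrix(iris[, -5]), K = 3)   # e.g., on the iris measurements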
(Figures: K-means on a sample two-dimensional data set, plotted in the x-y plane: the original points and the clusterings produced at successive iterations, Iteration 1 through 5.)
Because the result depends on the initial centroids, do multiple runs and select the clustering with the smallest error.
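In R, the kmeans() function supports this directly through its nstart argument; the value 25 below is just an illustration:

# run 25 random initializations and keep the solution with the
# smallest total within-cluster sum of squares
km = kmeans(iris[, -5], centers = 3, nstart = 25)
km$tot.withinss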
Hierarchical clustering, divisive strategy:
▪ Start with one, all-inclusive cluster
▪ At each step, split a cluster until each cluster contains a single point (or until there are k clusters)
(Figure: a set of nested clusters over points 1-6 and the corresponding dendrogram.)
Hierarchical clustering does not require assuming any particular number of clusters: any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level (see the sketch below).
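For example, with base R's hclust() the dendrogram can be cut either into a fixed number of clusters or at a chosen height (the data set and values here are only illustrative):

# cut a dendrogram by number of clusters or by merge height
hc = hclust(dist(iris[, -5]))   # default complete-linkage hierarchy
cutree(hc, k = 4)               # exactly 4 clusters
cutree(hc, h = 2.5)             # all clusters formed below height 2.5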
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: the current clusters C1-C5 and their proximity matrix.)
The question is “How do we update the proximity matrix?”
(Figure: after the merge, the proximity matrix gains a row and column for the new cluster C2 ∪ C5; the entries marked “?” must be recomputed.)
How do we define the proximity (similarity) between two clusters?
(Figure: proximity matrix over points p1, p2, p3, p4, p5, …)
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function
– Ward’s Method uses squared error
Another way to view the processing of a hierarchical algorithm is that we create links between the elements in order of increasing distance.
MIN (single link) merges two clusters as soon as a single pair of elements, one from each cluster, is linked.
MAX (complete link) merges two clusters only when all pairs of elements between the two clusters have been linked.
Example: distance matrix for six points.

      1     2     3     4     5     6
 1    0    .24   .22   .37   .34   .23
 2   .24    0    .15   .20   .14   .25
 3   .22   .15    0    .15   .28   .11
 4   .37   .20   .15    0    .29   .22
 5   .34   .14   .28   .29    0    .39
 6   .23   .25   .11   .22   .39    0

(Figure: nested clusters and the corresponding dendrogram, with leaves ordered 3, 6, 4, 1, 2, 5.)
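To reproduce such dendrograms, the distance matrix above can be passed to hclust() via as.dist(); a small sketch comparing single, complete, and group-average linkage on the same six points:

# the 6 x 6 distance matrix from the table above
m = matrix(c(0,.24,.22,.37,.34,.23,
             .24,0,.15,.20,.14,.25,
             .22,.15,0,.15,.28,.11,
             .37,.20,.15,0,.29,.22,
             .34,.14,.28,.29,0,.39,
             .23,.25,.11,.22,.39,0), nrow = 6, byrow = TRUE)
d = as.dist(m)
par(mfrow = c(1, 3))
plot(hclust(d, method = "single"),   main = "MIN (single link)")
plot(hclust(d, method = "complete"), main = "MAX (complete link)")
plot(hclust(d, method = "average"),  main = "Group average")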
(Figure: original points and the resulting two clusters.)
Group average: the proximity of two clusters is the average of the pairwise proximities between points in the two clusters:

proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\; p_j \in Cluster_j} \mathrm{proximity}(p_i, p_j)}{|Cluster_i| \cdot |Cluster_j|}
(Figure: the same distance matrix clustered with group average: nested clusters and the corresponding dendrogram.)
Group average is a compromise between single and complete link.
Strengths: less susceptible to noise and outliers.
Limitations: biased towards globular clusters.
Ward's Method: the similarity of two clusters is based on the increase in squared error (SSE) when the two clusters are merged.
Similar to group average if the distance between points is the squared distance.
Less susceptible to noise and outliers, but biased towards globular clusters.
Hierarchical analogue of K-means; can be used to initialize K-means (see the sketch below).
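A minimal sketch of that idea in R, using Ward linkage to pick starting centroids for kmeans() (the data set and cluster count are illustrative):

# use Ward's method to obtain initial centroids for K-means
dat = iris[, -5]
hc.ward = hclust(dist(dat), method = "ward.D2")              # Ward linkage in base R
init = cutree(hc.ward, k = 3)                                # preliminary cluster labels
centers = aggregate(dat, by = list(init), FUN = mean)[, -1]  # mean of each Ward cluster
km = kmeans(dat, centers = as.matrix(centers))               # K-means started from those centroids
table(km$cluster, init)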
(Figure: hierarchical clusterings of the same point set produced by MIN, MAX, Group Average, and Ward's Method, shown as nested clusters.)
Hierarchical clustering requires O(N²) space, since it stores the proximity matrix, where N is the number of points.
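For example, storing a dense double-precision proximity matrix for N = 100,000 points already takes about 10^10 entries × 8 bytes ≈ 80 GB of memory.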
DBSCAN:
Density at point p: number of points within a circle of radius Eps
Dense Region: A circle of radius Eps that contains at least
MinPts points
Characterization of points
A point is a core point if it has more than a
specified number of points (MinPts) within Eps
▪ These points are in the interior of a dense region (the interior of a cluster)
A point is a border point if it is not a core point, but is within Eps of a core point
A point is a noise point if it is neither a core point nor a border point
(Figure: parameter selection for the example data: Eps ~ 7-10, MinPts = 4.)
When DBSCAN works well (original points vs. the clusters found with MinPts = 4, Eps = 9.75):
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN does not work well (original points vs. the clusters found with MinPts = 4, Eps = 9.92):
• Varying densities
• High-dimensional data
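A common heuristic for choosing Eps is to sort the distance from every point to its k-th nearest neighbor (k close to MinPts) and pick a value near the 'knee' of that curve. A minimal sketch, assuming the separate dbscan R package is installed (this is not the fpc package used in the code below):

# k-distance plot heuristic for choosing Eps
library(dbscan)
kNNdistplot(iris[, -5], k = 5)   # sorted distances to each point's 5th nearest neighbor
abline(h = 0.42, lty = 2)        # a candidate Eps near the knee of the curve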
PAM, CLARANS: solutions for the k-medoids problem
BIRCH: constructs a hierarchical tree that acts as a summary of the data, and then clusters the leaves
MST: clustering using the Minimum Spanning Tree
ROCK: clustering categorical data by neighbor and link analysis
LIMBO, COOLCAT: clustering categorical data using information-theoretic tools
CURE: a hierarchical algorithm that uses a different representation of the clusters
CHAMELEON: a hierarchical algorithm that uses closeness and interconnectivity for merging
In order to understand our data, we will assume that there is a
generative process (a model) that creates/describes the data, and
we will try to find the model that best fits the data.
Models of different complexity can be defined, but we will assume
that our model is a distribution from which data points are sampled
Example: the data is the height of all people in the US.
We model the height distribution as a normal (Gaussian) distribution with parameters Θ = (μ, σ).

For a single value x we have

P(x \mid \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

For all values X = \{x_1, \dots, x_n\}, assuming they are drawn independently,

P(X \mid \Theta) = \prod_{i=1}^{n} P(x_i \mid \Theta)
We want to estimate the parameters Θ that maximize the likelihood of the data.
Now suppose the data contain a mixture of two populations (e.g., US and Chinese heights). Once we have the parameters

Θ = (μ_U, σ_U, π_U, μ_C, σ_C, π_C)

we can estimate the membership probabilities P(U \mid x) and P(C \mid x) for each point x. This is the probability that point x belongs to the US or the Chinese population (cluster):

P(U \mid x) = \frac{P(x \mid U)\,\pi_U}{P(x \mid U)\,\pi_U + P(x \mid C)\,\pi_C}, \qquad P(C \mid x) = 1 - P(U \mid x)

where π_U and π_C are the fractions of the population in each group.
The Expectation-Maximization (EM) algorithm:
Initialize the parameters in Θ to some random values.
Repeat until convergence:
E-Step: Given the current parameters Θ, estimate the membership probabilities P(U \mid x_i) and P(C \mid x_i).
M-Step: Compute the parameter values that (in expectation) maximize the data likelihood.
The M-Step updates are the MLE estimates we would obtain if the membership probabilities were fixed:

\pi_C = \frac{1}{n}\sum_{i=1}^{n} P(C \mid x_i), \qquad \pi_U = \frac{1}{n}\sum_{i=1}^{n} P(U \mid x_i)
(the fraction of the population in C and in U)

\mu_C = \frac{\sum_{i=1}^{n} P(C \mid x_i)\, x_i}{\sum_{i=1}^{n} P(C \mid x_i)}, \qquad \sigma_C^2 = \frac{\sum_{i=1}^{n} P(C \mid x_i)\,(x_i - \mu_C)^2}{\sum_{i=1}^{n} P(C \mid x_i)}

and similarly for μ_U and σ_U².
Relationship to K-means:
E-Step: assignment of points to clusters. K-means makes a hard assignment; EM makes a soft assignment.
M-Step: computation of centroids. K-means assumes a common fixed variance (spherical clusters); EM can use a different variance for each cluster or each dimension (ellipsoid clusters).
If the variance is fixed, then both minimize the same error function.
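As a concrete illustration of the E-step and M-step above, a minimal EM loop for a two-component one-dimensional Gaussian mixture can be written directly in R (the synthetic 'height' data, starting values, and iteration count below are made up for illustration):

# minimal EM for a mixture of two 1-D Gaussians
set.seed(1)
x = c(rnorm(200, mean = 178, sd = 8), rnorm(200, mean = 165, sd = 7))  # synthetic heights

mu = c(150, 190); sigma = c(10, 10); p = c(0.5, 0.5)   # initial parameter guesses
for (iter in 1:100) {
  # E-step: membership probabilities P(k | x_i) via Bayes' rule
  d1 = p[1] * dnorm(x, mu[1], sigma[1])
  d2 = p[2] * dnorm(x, mu[2], sigma[2])
  g1 = d1 / (d1 + d2); g2 = 1 - g1
  # M-step: weighted MLE estimates of the mixture parameters
  p = c(mean(g1), mean(g2))
  mu = c(sum(g1 * x) / sum(g1), sum(g2 * x) / sum(g2))
  sigma = c(sqrt(sum(g1 * (x - mu[1])^2) / sum(g1)),
            sqrt(sum(g2 * (x - mu[2])^2) / sum(g2)))
}
round(c(mu = mu, sigma = sigma, weight = p), 2)   # estimated means, std devs, mixing weights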
# K-means
iris2 = iris
iris2$Species = NULL                          # drop the class label, keep the 4 numeric attributes
(kmeans.result = kmeans(iris2, 3))            # K = 3 clusters; print the result
table(iris$Species, kmeans.result$cluster)    # compare clusters with the true species
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
# plot cluster centers
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2)
# Hierarchical Clustering
idx = sample(1:dim(iris)[1], 40)               # random sample of 40 rows for a readable dendrogram
irisSample = iris[idx, ]
irisSample$Species = NULL
hc = hclust(dist(irisSample), method = "ave")  # "ave" = group-average linkage
plot(hc, hang = -1, labels = iris$Species[idx])
# cut tree into 3 clusters
rect.hclust(hc, k = 3)                         # draw rectangles around the 3 clusters
groups = cutree(hc, k = 3)                     # cluster membership for each sampled row
# DBSCAN
library(fpc)
iris2 = iris[-5]                        # remove class tags
ds = dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class labels (cluster 0 = noise points)
table(ds$cluster, iris$Species)
plot(ds, iris2)                         # scatter-plot matrix colored by cluster
plot(ds, iris2[c(1, 4)])                # Sepal.Length vs Petal.Width only
plotcluster(iris2, ds$cluster)          # projection that best separates the clusters
# create a new dataset for labeling
set.seed(435)
idx = sample(1:nrow(iris), 10)
newData = iris[idx,-5]
newData = newData + matrix(runif(10*4, min=0, max=0.2), nrow=10, ncol=4)
# label new data
myPred = predict(ds, iris2, newData)
# plot result
plot(iris2[c(1,4)], col=1+ds$cluster)
points(newData[c(1,4)], pch="*", col=1+myPred, cex=3)
# check cluster labels
table(myPred, iris$Species[idx])