DS3 Partitional Clustering

The document provides an overview of partitional clustering analysis techniques. It discusses key concepts such as k-means, k-medoids, fuzzy c-means, and Gaussian mixture clustering. It also covers cluster evaluation metrics and different types of clusters, including well-separated, center-based, contiguous, and density-based clusters. The document comes from lecture notes on data science and aims to introduce students to basic concepts and algorithms for partitional clustering.

Partitional Clustering Analysis

Lecture Notes for Chapter 3

Data Science and Decision Support (DSDS)


By
Prof. Chih-Hsuan (Jason) Wang

Department of IEM, NYCU, Hsinchu, Taiwan


Basic Concepts and Algorithms

 3.1 Overview
 3.2 K-means
 3.3 K-medoids
 3.4 Fuzzy C means (FCM)
 3.5 Gaussian Mixture Clustering (GMC)
 3.6 Agglomerative Hierarchical Clustering
 3.7 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
 3.8 Cluster Evaluation


3.1 Overview: What is Cluster Analysis?

 Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]


Applications of Cluster Analysis
 Understanding
– Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

Discovered clusters of stocks (by price movement) and their industry groups:
1. Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN : Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN : Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN : Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP : Oil-UP

 Summarization
– Image clustering can reduce the size of large data sets
– Text clustering can help readers identify similar documents
– Customer clustering can help marketers conduct segmentation

[Figure: clustering precipitation in Australia]


What is NOT Cluster Analysis?

 Supervised classification
– Have class label information

 Simple segmentation
– Dividing students into different registration groups alphabetically,
by last name

 Results of a query
– Groupings are a result of an external specification

 Graph partitioning
– Some mutual relevance and synergy, but areas are not identical



Notion of a Cluster can be Ambiguous

[Figure: the same set of points grouped as two, four, or six clusters; how many clusters?]


Types of Clusterings

 A clustering is a set of clusters

 Important distinction between hierarchical and


partitional sets of clusters

 Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

 Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree



Partitional Clustering

[Figure: original points and a partitional clustering of those points]


Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1-p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]


Other Distinctions Between Sets of Clusters

 Exclusive versus non-exclusive


– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or ‘border’ points
 Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1 (FCM)
– Weights must sum to 1
– Probabilistic clustering has similar characteristics (GMC)
 Partial versus complete
– In some cases, we only want to cluster some of the data
 Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities


Types of Clusters

 Well-separated clusters

 Center-based clusters

 Contiguous clusters

 Density-based clusters

 Property or Conceptual

 Described by an Objective Function


Types of Clusters: Well-Separated

 Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or
more similar) to every other point in the cluster than to any point not
in the cluster.

3 well-separated clusters



Types of Clusters: Center-Based

 Center-based
– A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
– The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point of a
cluster

4 center-based clusters



Types of Clusters: Contiguity-Based

 Contiguous Cluster (Nearest neighbor or Transitive)


– A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster.

8 contiguous clusters



Types of Clusters: Density-Based

 Density-based
– A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise
and outliers are present.

6 density-based clusters



Types of Clusters: Conceptual Clusters

 Shared Property or Conceptual Clusters


– Finds clusters that share some common property or represent a particular concept.

2 Overlapping Circles



Types of Clusters: Objective Function

 Clusters Defined by an Objective Function


– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters using the given objective function (NP-hard).
– Can have global or local objectives.
 Hierarchical clustering algorithms typically have local objectives
 Partitional algorithms typically have global objectives
– A variation of the global objective function approach is to fit the data to a
parameterized model.
 Parameters for the model are determined from the data.
 Mixture models assume that the data is a ‘mixture' of a number of statistical
distributions.



Types of Clusters: Objective Function

 Map the clustering problem to a different domain and


solve a related problem in that domain
– Proximity matrix defines a weighted graph, where the
nodes are the points being clustered, and the weighted
edges represent the proximities between points

– Clustering is equivalent to breaking the graph into


connected components, one for each cluster.

– Want to minimize the edge weight between clusters and


maximize the edge weight within clusters
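As a rough illustration of this graph view (not from the lecture notes), the sketch below builds a proximity graph by keeping only edges shorter than an assumed distance threshold and then reads clusters off as connected components; the data, the threshold value, and the variable names are all hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((100, 2))                 # hypothetical 2-D data

D = squareform(pdist(X))                 # proximity (distance) matrix
adjacency = csr_matrix(D < 0.1)          # keep only edges below an assumed threshold
n_clusters, labels = connected_components(adjacency, directed=False)
```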



Characteristics of the Input Data Are Important

 Type of proximity or density measure


– This is a derived measure, but central to clustering
 Sparseness
– Dictates type of similarity
– Adds to efficiency
 Attribute type
– Dictates type of similarity
 Data type
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
 Dimensionality
 Noise and Outliers
 Type of Distribution



Partitional Clustering Algorithms

 K-means (means represent centroids)

 K-medoids (a medoid is the most representative data point of a cluster, the one with the minimal sum of distances to the other points in the cluster)

 Fuzzy C means (FCM)

 Gaussian Mixture Clustering (GMC)



K-means Clustering (steps)

 Partitional clustering approach


 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified
 The basic algorithm is very simple
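A minimal sketch of these steps in NumPy (not the lecture's own code; the function name, random initialization, and convergence test are illustrative choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids stop moving
            break
        centroids = new_centroids
    return labels, centroids
```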



K-means Clustering – Details

 Initial centroids are often chosen randomly.


– Clusters produced vary from one run to another.
 The centroid is (typically) the average of the points in the cluster.
 ‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
 K-means will converge for common similarity measures
mentioned above.
 Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to ‘Until relatively few points
change clusters’
 Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes



Two different K-means Clusterings
[Figure: the same set of original points clustered two ways by K-means, an optimal clustering and a sub-optimal clustering]
Importance of Choosing Initial Centroids

[Figure: K-means on a sample data set, final assignment after iteration 6 for one choice of initial centroids]


Importance of Choosing Initial Centroids
[Figure: snapshots of iterations 1-6 of K-means for this choice of initial centroids]


Evaluating K-means Clusters (cluster validity)

 Most common measure is Sum of Squared Error (SSE)


– For each point, the error is the distance to the nearest cluster centroid
– To get SSE, we square these errors and sum them:

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

– x is a data point in cluster Ci and mi is the representative point for


cluster Ci
 It can be shown that mi corresponds to the center (mean) of the cluster
– Given two clusters, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
 A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
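As a companion to the formula above, here is a small hypothetical helper (not from the notes) that computes SSE from the labels and centroids returned by a K-means run such as the sketch earlier:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to the centroid of its assigned cluster."""
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(len(centroids)))
```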



Importance of Choosing Initial Centroids …

[Figure: K-means on the same data, final assignment after iteration 5 for another choice of initial centroids]


Importance of Choosing Initial Centroids

[Figure: snapshots of iterations 1-5 of K-means for this choice of initial centroids]


10 Clusters Example
[Figure: 10-cluster example, result after iteration 4]

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: 10-cluster example, snapshots of iterations 1-4]

Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: 10-cluster example, result after iteration 4]

Starting with some pairs of clusters having three initial centroids, while others have only one.


10 Clusters Example
[Figure: 10-cluster example, snapshots of iterations 1-4]

Starting with some pairs of clusters having three initial centroids, while others have only one.


Solutions to Initial Centroids Problem

 Multiple runs
– Helps, but probability is not on your side
 Sample and use hierarchical clustering to determine
initial centroids (two-step clustering)
 Select more than k initial centroids and then select
among these initial centroids
– Select most widely separated
 Postprocessing
 Bisecting K-means
– Not as susceptible to initialization issues
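As one hedged example of the "multiple runs" idea (assuming scikit-learn is available; the data here are synthetic), K-means can be restarted several times and the run with the lowest SSE kept:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # synthetic example data

# n_init random restarts; scikit-learn keeps the run with the lowest SSE (inertia_)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)        # SSE of the best run
labels = km.labels_
```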



Updating Centers Incrementally

 In the basic K-means algorithm, centroids are updated


after all points are assigned to a centroid

 An alternative is to update the centroids after each


assignment (incremental approach)
– Each assignment updates zero or two centroids
– More expensive
– Introduces an order dependency
– Never get an empty cluster
– Can use “weights” to change the impact
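A rough sketch of the incremental idea (not from the notes): each point is assigned to its nearest centroid, and that centroid is immediately moved toward the point with a running-mean update. The per-centroid counts used as weights here are an illustrative simplification.

```python
import numpy as np

def incremental_pass(X, centroids, counts):
    """One pass over X, updating a centroid (float array) as soon as a point is assigned to it."""
    for x in X:
        j = np.linalg.norm(centroids - x, axis=1).argmin()    # nearest centroid
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]        # running-mean update
    return centroids, counts
```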



Pre-processing and Post-processing

 Pre-processing
– Normalize the data
– Eliminate outliers

 Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low
SSE
– Can use these steps during the clustering process
 ISODATA
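For the "normalize the data" step, one common choice (an assumption here, not prescribed by the notes) is to standardize the features so that no single attribute dominates the distance computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).random((200, 3)) * [1, 100, 1000]   # attributes on very different scales
X_scaled = StandardScaler().fit_transform(X)                     # zero mean, unit variance per attribute
```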



Limitations of K-means

 K-means has problems when clusters are of differing


– Sizes
– Densities
– Non-globular shapes

 K-means has problems when the data contains outliers


(centroids are significantly influenced by outliers).



Limitations of K-means: Differing Sizes

[Figure: original points (left) and the K-means result with 3 clusters (right)]


Limitations of K-means: Differing Density

[Figure: original points (left) and the K-means result with 3 clusters (right)]


Limitations of K-means: Non-globular Shapes

[Figure: original points (left) and the K-means result with 2 clusters (right)]


Overcoming K-means Limitations

[Figure: original points (left) and K-means with many clusters (right)]

One solution is to use many clusters: this finds parts of the natural clusters, which then need to be put back together.
Overcoming K-means Limitations

[Figure: original points (left) and K-means with many clusters (right)]


Overcoming K-means Limitations

[Figure: original points (left) and K-means with many clusters (right)]


3.3 K-medoids
• K-means and K-medoids algorithms are partitional (breaking the dataset up
into groups)

• Both of them attempt to minimize the distance between points labeled to be


in a cluster and a point assigned as the center of that cluster

• K-medoids chooses data points as centers (medoids) and can be used with arbitrary distances, while K-means only minimizes squared Euclidean distances.

• K-medoids is more robust to noise and outliers than K-means because it


minimizes a sum of pairwise dissimilarities instead of a sum of squared
Euclidean distances.

• A medoid can be defined as the object of a cluster whose average


dissimilarity to all the objects in the cluster is minimal: it is a most centrally
located point in the cluster (the most representative data point).
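Below is a minimal sketch of the simple alternating (Voronoi-iteration) variant of K-medoids on a precomputed dissimilarity matrix; it is an illustration only, and the classic PAM algorithm instead uses a more expensive swap-based search.

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Alternating K-medoids on an (n, n) dissimilarity matrix D (NumPy array)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)              # assign points to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # new medoid = member with minimal total dissimilarity to its cluster
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids
```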
3.4 Fuzzy c means (fuzzy version of K-means)


FCM can be viewed as a fuzzy version of K means

E: error (objective) function to be minimized; with fuzzifier m it takes the standard FCM form

E = \sum_{i=1}^{c} \sum_{j=1}^{N} W_{ij}^{m} \, \lVert x_j - c_i \rVert^2

Wij denotes the degree of membership (possibility) of data point xj in cluster ci
3.4 Fuzzy c means
After minimizing the error objective, the update equations are:

C_i = \frac{\sum_{j} W_{ij}^{m}\, x_j}{\sum_{j} W_{ij}^{m}}

W_{ij} = \frac{\lVert x_j - c_i \rVert^{-2/(m-1)}}{\sum_{k=1}^{c} \lVert x_j - c_k \rVert^{-2/(m-1)}}

m is usually set to 2; the larger the distance between data point j and center i, the smaller the membership of data point j in cluster i

Details of FCM:
1. Initialize the membership (possibility) matrix Wij
2. Calculate the centroids Ci
3. Update the memberships Wij
4. Repeat steps 2 and 3 until convergence
5. Output the final results
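A compact NumPy sketch of these steps (illustrative only; the function and variable names are mine, and m defaults to 2 as above):

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means: returns the membership matrix W (c, n) and the centers (c, d)."""
    rng = np.random.default_rng(seed)
    W = rng.random((c, len(X)))
    W /= W.sum(axis=0)                                    # memberships of each point sum to 1
    for _ in range(n_iter):
        Wm = W ** m
        centers = Wm @ X / Wm.sum(axis=1, keepdims=True)  # update of Ci (weighted means)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        W_new = d ** (-2.0 / (m - 1.0))                   # update of Wij
        W_new /= W_new.sum(axis=0)
        if np.abs(W_new - W).max() < tol:                 # step 4: repeat until convergence
            return W_new, centers
        W = W_new
    return W, centers
```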



3.5 GMC (Applications of Bayes theory)

Posterior probability of cluster i given data point x_j (Bayes' theorem):

P(C_i \mid x_j) = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{k} P(x_j \mid C_k)\, P(C_k)}

where P(C_i) is the prior (mixing) probability of cluster i and P(x_j \mid C_i) is the Gaussian likelihood.


3.5 GMC (EM: expectation maximization)

Mahalanobis distance
3.5 GMC (steps)

• Incorporates the dependences (covariances) among features into the clustering
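As a hedged illustration (assuming scikit-learn; the synthetic data and parameter choices are mine), a Gaussian mixture with full covariance matrices is fitted by EM and yields both hard and soft (posterior) assignments:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture   # EM-based Gaussian mixture clustering

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# full covariance matrices capture the dependences among features (elliptical clusters)
gmc = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
labels = gmc.predict(X)         # hard assignment: most probable component
resp = gmc.predict_proba(X)     # soft assignment: posterior probability per component
```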


How to know the number of clusters?

 DB index (Davies–Bouldin)
 FS index (Fukuyama and Sugeno)
 XB index (Xie and Beni)
 PC (partition coefficient)
 PE (partition entropy)
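Of these, the Davies-Bouldin index is directly available in scikit-learn (lower is better); a hedged sketch of scanning candidate cluster counts with it (synthetic data, illustrative parameters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score   # DB index: lower values are better

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# evaluate the DB index for several candidate numbers of clusters
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))
```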



How to know the number of segments

[Figure: a cluster-validity index balances the separation between groups i and j against the compactness within group i; in this example the index selects C = 3]


How to know the number of clusters?



Different distance measures

• Elliptical data pattern

Elliptical data patterns imply dependent input features, which motivates distance measures such as the Mahalanobis distance.
