Clustering Part1

- Clustering is an unsupervised learning technique that groups similar data points together. It assigns data points to clusters such that points within a cluster are as close as possible to each other and as far as possible from points in other clusters.
- The number of clusters depends on the selected features and distance metric. Different features and metrics can result in different clusterings of the same data.
- K-means clustering aims to partition data into K clusters by minimizing the within-cluster sum of squares. It works by assigning data points to the closest cluster mean and recalculating the means repeatedly until convergence. Selecting the appropriate number of clusters K can be challenging.


Clustering

What is Clustering?
• Attach a label to each observation (data point) in a set
• This is a form of "unsupervised learning"
• Clustering is also called "grouping"
• Intuitively, you want to assign the same label to data points that are "close" to each other
But, how many clusters do we have?
Example 1 and Example 2 (figures): e.g., clustering the students in our school of engineering
Application Example
Geochemical study of an impacted fluvial system
• Sediments collected along the Rapel Fluvial System in Central Chile
• Samples are analyzed using chemical and mineralogical analysis methods
• Feature vectors are built
• Clustering is applied
• Examples are shown on the map (figure)
Clustering depends on the selected features

• The number of clusters (i.e., the clustering) depends on what features are selected.

Processing pipeline (figure, translated from Spanish):

    Real World → Sensing (s) → Preprocessing (p) → Feature Extraction/Computation (x ∈ ℜⁿ) → Clustering (c)

    n: number of features
But, also on the distance metric

• Thus, after features are selected, clustering algorithms rely on a distance metric between data points (feature vectors)
• It is sometimes said that, for clustering, the distance metric is more important than the clustering algorithm
Distances: Quantitative Variables

Data point:
    xi = [xi1 … xip]^T
Some examples: Euclidean distance, squared Euclidean distance, Manhattan (city-block) distance
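As a minimal sketch (not from the slides), the two most common quantitative distance metrics can be computed directly from a pair of feature vectors:

```python
# Sketch of two standard distance metrics between feature vectors:
# Euclidean (straight-line) and Manhattan (city-block).
import math

def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (city-block) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

xi = [0.0, 0.0]
xj = [3.0, 4.0]
print(euclidean(xi, xj))  # 5.0
print(manhattan(xi, xj))  # 7.0
```

Which metric is chosen changes which points count as "close", and therefore the resulting clustering.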
Distances: Ordinal and Categorical Variables

• Ordinal variables can be mapped into (0, 1), after which a quantitative metric can be applied:

    (k − 1/2) / M,   k = 1, 2, …, M

  where M is the number of ordered levels.
• For categorical variables, distances must be specified by the user between each pair of categories.
But, in some cases using distances can be tricky
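The ordinal mapping above is short enough to sketch directly; the variable name and levels below are illustrative assumptions:

```python
# Sketch of the ordinal encoding from the slide: level k of an ordinal
# variable with M levels is mapped to (k - 1/2) / M, which places every
# level strictly inside the open interval (0, 1).
def ordinal_to_unit(k, M):
    """Map ordinal level k (1..M) into (0, 1)."""
    return (k - 0.5) / M

levels = ["low", "medium", "high"]  # hypothetical variable with M = 3 levels
encoded = {name: ordinal_to_unit(k, len(levels))
           for k, name in enumerate(levels, start=1)}
print(encoded)  # middle level lands exactly at 0.5
```

After this mapping, the quantitative metrics from the previous slide can be applied to ordinal features as well.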
K-means Overview

• "K" stands for the number of clusters; it is typically a user input to the algorithm, although some criteria can be used to estimate K automatically
• It is an approximation to an NP-hard combinatorial optimization problem
• The K-means algorithm is iterative in nature
• It converges, but only to a local minimum
• It works only for numerical data
• It is easy to implement
K-means: Setup

• x1, …, xN are data points or vectors of observations
• Each observation (vector xi) is assigned to one and only one cluster
• C(i) denotes the cluster number of the ith observation
• Dissimilarity measure: Euclidean distance metric
• K-means minimizes the within-cluster point scatter:

    W(C) = (1/2) ∑k=1..K ∑C(i)=k ∑C(j)=k ||xi − xj||²  =  ∑k=1..K Nk ∑C(i)=k ||xi − mk||²   (Exercise)

where

mk is the mean vector of the kth cluster

Nk is the number of observations in the kth cluster
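The two forms of the within-cluster scatter above are equal, and the identity is easy to check numerically; the cluster below is a made-up example:

```python
# Numerical check (a sketch, not from the slides) of the identity
#   (1/2) * sum over pairs in a cluster of ||xi - xj||^2
#     = Nk * sum over the cluster of ||xi - mk||^2
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

cluster = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]   # hypothetical cluster
Nk = len(cluster)
mk = [sum(col) / Nk for col in zip(*cluster)]     # cluster mean vector

pairwise = 0.5 * sum(sq_dist(xi, xj) for xi in cluster for xj in cluster)
to_mean = Nk * sum(sq_dist(xi, mk) for xi in cluster)

print(abs(pairwise - to_mean) < 1e-9)  # True
```

This is why minimizing pairwise within-cluster distances is the same as pulling each point toward its cluster mean.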


Within and Between Cluster Criteria
Let's consider the total point scatter for a set of N data points:

    T = (1/2) ∑i=1..N ∑j=1..N d(xi, xj)

where d(xi, xj) is the distance between two points. T can be re-written as:

    T = (1/2) ∑k=1..K ∑C(i)=k ( ∑C(j)=k d(xi, xj) + ∑C(j)≠k d(xi, xj) )
      = W(C) + B(C)

where

    W(C) = (1/2) ∑k=1..K ∑C(i)=k ∑C(j)=k d(xi, xj)     (within-cluster scatter)
    B(C) = (1/2) ∑k=1..K ∑C(i)=k ∑C(j)≠k d(xi, xj)     (between-cluster scatter)

If d is the squared Euclidean distance, then

    W(C) = ∑k=1..K Nk ∑C(i)=k ||xi − mk||²

and

    B(C) = ∑k=1..K Nk ||mk − m̄||²   (Exercise)

where m̄ is the grand mean. Since T is fixed for a given data set, minimizing W(C) is equivalent to maximizing B(C).
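The decomposition T = W(C) + B(C) follows directly from splitting each pairwise distance by whether the two points share a cluster, which can be verified on a small made-up example:

```python
# Numerical check (a sketch) that total scatter splits into within-cluster
# plus between-cluster scatter, T = W(C) + B(C), using the pairwise
# definitions with squared Euclidean distance.
def d(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

points = [[0.0, 0.0], [1.0, 1.0], [5.0, 4.0], [6.0, 5.0]]  # toy data
C = [1, 1, 2, 2]                                            # assignment C(i)
N = len(points)

T = 0.5 * sum(d(points[i], points[j]) for i in range(N) for j in range(N))
W = 0.5 * sum(d(points[i], points[j]) for i in range(N) for j in range(N)
              if C[i] == C[j])
B = 0.5 * sum(d(points[i], points[j]) for i in range(N) for j in range(N)
              if C[i] != C[j])

print(abs(T - (W + B)) < 1e-9)  # True
```

Since T does not depend on the assignment C, any assignment that lowers W must raise B by the same amount.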
K-means Algorithm

• For a given cluster assignment C of the data points, compute the cluster means mk:

    mk = ( ∑i:C(i)=k xi ) / Nk,   k = 1, …, K

• For the current set of cluster means, assign each observation to its nearest mean:

    C(i) = arg min(1≤k≤K) ||xi − mk||²,   i = 1, …, N

• Iterate the above two steps until convergence
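The two alternating steps above can be sketched in a few lines of plain Python; the data points, K, and the initial means below are illustrative assumptions:

```python
# Minimal sketch of the two-step K-means iteration: assign each point to
# the closest mean, then recompute each mean, until nothing changes.
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, means, n_iter=100):
    """Alternate assignment and mean-update steps until convergence."""
    for _ in range(n_iter):
        # Step 1: assign each point to the closest current mean
        C = [min(range(len(means)), key=lambda k: sq_dist(x, means[k]))
             for x in points]
        # Step 2: recompute each mean from the points assigned to it
        new_means = []
        for k in range(len(means)):
            members = [x for x, c in zip(points, C) if c == k]
            if members:
                new_means.append([sum(col) / len(members)
                                  for col in zip(*members)])
            else:
                new_means.append(means[k])   # keep an empty cluster's mean
        if new_means == means:               # converged (a local minimum)
            break
        means = new_means
    return C, means

points = [[0.0, 0.0], [0.5, 0.5], [5.0, 5.0], [5.5, 4.5]]
C, means = kmeans(points, [[0.0, 0.0], [5.0, 5.0]])
print(C)      # [0, 0, 1, 1]
print(means)  # [[0.25, 0.25], [5.25, 4.75]]
```

Note that the result depends on the initial means: different starting points can converge to different local minima of W(C).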


K-means example 1
K-means example 2
Selecting the Number of Clusters

• For a given data distribution, k-means (and in general any clustering algorithm) can find k clusters for any k (1 < k < N).
• It is not easy to choose k!
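One common heuristic for choosing k, sketched here as an assumption (the slides only note that such criteria exist), is the "elbow" method: run K-means for several values of k and look for the k after which the within-cluster scatter W(C) stops dropping sharply.

```python
# Elbow-method sketch: W(C) always decreases as K grows, so we look for
# the K where the decrease levels off (here, K = 2 for two clear groups).
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_inertia(points, K, n_iter=50):
    """Basic K-means (first K points as initial means); returns the
    final within-cluster sum of squared distances to the means."""
    means = [list(p) for p in points[:K]]
    for _ in range(n_iter):
        C = [min(range(K), key=lambda k: sq_dist(x, means[k]))
             for x in points]
        for k in range(K):
            members = [x for x, c in zip(points, C) if c == k]
            if members:
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return sum(sq_dist(x, means[c]) for x, c in zip(points, C))

# Two well-separated groups: W(C) drops steeply from K=1 to K=2,
# then only marginally afterwards -- the "elbow" is at K=2.
points = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
          [9.0, 9.0], [9.2, 9.1], [9.1, 9.3]]
for K in (1, 2, 3):
    print(K, round(kmeans_inertia(points, K), 3))
```

The simple first-K-points initialization is an assumption for brevity; practical implementations use random or k-means++-style initialization with several restarts.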
Fuzzy c-means
