DEU CSC5045 Intelligent System Applications Using Fuzzy - 4: Clustering

The document discusses different clustering techniques, including hierarchical clustering, k-means clustering, and measures of similarity used in clustering such as distance functions and metrics. It provides examples of calculating distances between data points and categorical variables, and describes hierarchical, partitioning, density-based, grid-based, and model-based clustering methods.


Clustering

Measure of Similarity,
Hierarchical Clustering,
K-means Clustering

References:
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques.
Larose, D. T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining.
Tan, P., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining.
Bramer, M. (2007). Principles of Data Mining.
Birant, D. (2012). Lecture Notes.
Vahaplar, A. (2012). Lecture Notes.
Clustering
• Clustering is the process of grouping a set of physical or abstract
unlabelled objects into classes of similar objects.
• A Cluster is a collection of data objects that are similar to one
another within the same cluster, and dissimilar to the objects in
other clusters.
• Clustering is an important human activity:
o Distinguishing animals and plants, male and female, cars and buses, etc.

• Goals:
o Detecting natural groups in data,
o Creating homogeneous classes,
o Data reduction, outlier detection.
Clustering
• Measuring similarity or measuring dissimilarity?
• A distance measure d(x, y) used to calculate the difference between two objects should have the properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle Inequality)
Clustering
• Distance Functions
• Euclidean Distance: $d_{Euc}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
• Manhattan (City Block) Distance: $d_{Man}(x, y) = \sum_i |x_i - y_i|$
• Minkowski Distance: $d_{Min}(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$
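A minimal Python sketch of these three distance functions (illustrative only, not part of the original slides; the value of p for the Minkowski distance is chosen arbitrarily here):

```python
# Sketch of the three distance functions, using only the standard library.

def euclidean(x, y):
    # d_Euc(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def manhattan(x, y):
    # d_Man(x, y) = sum_i |x_i - y_i|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def minkowski(x, y, p=3):
    # d_Min(x, y) = (sum_i |x_i - y_i|^p)^(1/p); p = 2 gives Euclidean, p = 1 Manhattan
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

print(euclidean((0, 2), (2, 0)))       # 2.828...
print(manhattan((0, 2), (2, 0)))       # 4
print(minkowski((0, 2), (2, 0), p=2))  # same as Euclidean: 2.828...
```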
Clustering
• Distance Measure
• Example:

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

[Figure: the four points plotted in the (x, y) plane.]

Euclidean Distance Matrix:
      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0

Manhattan Distance Matrix:
      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0
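The two matrices can be reproduced with SciPy's pairwise-distance routine; the snippet below is a sketch assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

print(np.round(cdist(points, points, metric="euclidean"), 3))  # Euclidean distance matrix
print(cdist(points, points, metric="cityblock"))               # Manhattan distance matrix
```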


Clustering
• Problems in distance measure
o Different ranges in data
• Normalization (min-max, Z-score, etc)
o Categorical variables
Clustering
• Find the distance between Ali and Ayşe, Ali and Veli, and Ayşe and Veli.

Name   Age   Weight   Eye colour
Ali    22    65       Black
Ayşe   19    52       Hazel
Veli   23    60       Black

Variable   Age   Weight
Min        18    50
Max        30    85
Clustering
• Find the distance between Ali and Ayşe, Ali and Veli, and Ayşe and Veli (min-max normalized values in parentheses).

Name   Age         Weight      Eye colour
Ali    22 (0.33)   65 (0.43)   Black
Ayşe   19 (0.08)   52 (0.06)   Hazel
Veli   23 (0.42)   60 (0.29)   Black

Variable   Age   Weight
Min        18    50
Max        30    85

d      Ali     Ayşe    Veli
Ali    0       1.096   0.165
Ayşe   1.096   0       1.079
Veli   0.165   1.079   0
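A sketch of how these distances can be reproduced in Python. The reading of the slide assumed here: numeric attributes are min-max normalized, the eye-colour attribute contributes 0 when equal and 1 when different, and the components are combined with Euclidean distance; the English attribute values are translations of the Turkish originals:

```python
def min_max(value, lo, hi):
    # Min-max normalization to [0, 1]
    return (value - lo) / (hi - lo)

people = {
    "Ali":  (22, 65, "Black"),
    "Ayse": (19, 52, "Hazel"),
    "Veli": (23, 60, "Black"),
}
AGE_MIN, AGE_MAX = 18, 30
WEIGHT_MIN, WEIGHT_MAX = 50, 85

def distance(a, b):
    age_a, w_a, eye_a = people[a]
    age_b, w_b, eye_b = people[b]
    d_age = min_max(age_a, AGE_MIN, AGE_MAX) - min_max(age_b, AGE_MIN, AGE_MAX)
    d_w   = min_max(w_a, WEIGHT_MIN, WEIGHT_MAX) - min_max(w_b, WEIGHT_MIN, WEIGHT_MAX)
    d_eye = 0.0 if eye_a == eye_b else 1.0      # simple mismatch for the categorical attribute
    return (d_age ** 2 + d_w ** 2 + d_eye ** 2) ** 0.5

print(round(distance("Ali", "Ayse"), 3))   # ~1.096
print(round(distance("Ali", "Veli"), 3))   # ~0.165
print(round(distance("Ayse", "Veli"), 3))  # ~1.079
```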
Clustering
• Distance measure for categorical variables
• Binary data (0/1, presence/absence, Yes/No)
• Jaccard's Distance

Contingency table for objects i and j:

                    Object j
             1      0      sum
Object i  1  a      b      a + b
          0  c      d      c + d
        sum  a + c  b + d  p

d(i, j) = (b + c) / (a + b + c)
Example for Clustering Categorical Data

• Find the Jaccard's distance between Apple and Banana.

Feature of Fruit    Sphere shape   Sweet   Sour   Crunchy
Object i = Apple    Yes            Yes     Yes    Yes
Object j = Banana   No             Yes     No     No

(a = 1, b = 3, c = 0, d = 0)

d(i, j) = (b + c) / (a + b + c) = (3 + 0) / (1 + 3 + 0) = 3/4 = 0.75
Example for Clustering Categorical Data

• Who are the most likely to have a similar disease?


Name Fever Cough Test-1 Test-2 Test-3 Test-4
Jack Y N P N N N
Mary Y N P N P N
Jim Y P N N N N
Let the values Y and P be set to 1, and the value N be set to 0

Binary vectors: Jack = (1, 0, 1, 0, 0, 0), Mary = (1, 0, 1, 0, 1, 0), Jim = (1, 1, 0, 0, 0, 0).

Using d(i, j) = (b + c) / (a + b + c):

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Result: Jim and Mary are unlikely to have a similar disease.

Jack and Mary are the most likely to have a similar disease.
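A short sketch that reproduces these Jaccard distances from the binary vectors (Y and P mapped to 1, N mapped to 0):

```python
def jaccard_distance(i, j):
    # d(i, j) = (b + c) / (a + b + c), ignoring 0/0 matches (d)
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```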
Clustering Methods
• Hierarchical Methods
o AGNES, DIANA, BIRCH, Fuzzy Joint Points (FJP), ...

• Partitioning Methods
o K-Means, K-Medoids, Fuzzy c-Means, ...

• Density-Based Methods
o DBSCAN, OPTICS, Fuzzy Joint Points (FJP), ...

• Grid-Based Methods
o STING, WaveCluster, CLIQUE ...

• Model-Based Methods
o COBWEB, CLASSIT, SOM (Self-Organizing Feature Maps) ...
Hierarchical Clustering
• A tree-like cluster structure (dendrogram)
• Agglomerative methods
o Each item is a tiny cluster of its own at the beginning,
o The two closest clusters are merged at each step,
o At the end, all items are in one cluster.

• Divisive methods
o All items are in one cluster at the beginning,
o The most dissimilar cluster is separated at each step,
o At the end, each record represents its own cluster.
Hierarchical Clustering
• Measuring distance between clusters in Hierarchical Clustering
• Single linkage,
o the nearest-neighbor approach,
o based on the minimum distance between any record in two clusters

• Complete linkage,
o the farthest-neighbor approach,
o based on the maximum distance between any record in two clusters.

• Average linkage
o is designed to reduce the dependence of the cluster-linkage criterion on extreme values, such as the most similar or dissimilar records.
o The criterion is the average distance of all the records in cluster A from all the records in cluster B.
Hierarchical Clustering
• Single link: smallest distance between an element in one cluster and an element in the other.
• Complete link: largest distance between an element in one cluster and an element in the other.
• Average link: average distance between an element in one cluster and an element in the other.
Single-Linkage Clustering - Example
Dataset: 2,5,9,15,16,18,25,33,33,45
Complete-Linkage Clustering - Example
• Dataset: 2,5,9,15,16,18,25,33,33,45
Average-Linkage Clustering - Example
• Dataset: 2, 5, 9, 15, 16, 18, 25, 33, 33, 45

The average distance between clusters A and B:
$d_{avg}(A, B) = \frac{\sum_{x \in A} \sum_{y \in B} d(x, y)}{|A| \cdot |B|}$
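A sketch of all three linkage strategies applied to this dataset using SciPy's hierarchical-clustering routines (assumes SciPy is installed; cutting the tree into 3 clusters is an arbitrary choice for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([2, 5, 9, 15, 16, 18, 25, 33, 33, 45], dtype=float).reshape(-1, 1)

for method in ("single", "complete", "average"):
    Z = linkage(data, method=method)                  # agglomerative merge history (the dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, labels)
```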
How the Clusters are Merged?

[Figure: nested cluster views and dendrograms of the same six points (1-6) under single link, complete link, and average link; the vertical axis of each dendrogram shows the distance at which clusters are merged.]
Hierarchical Clustering
• Single Linkage
o Can handle non-elliptical shapes
o Sensitive to noise and outliers

• Complete Linkage
o Less sensitive to noise and outliers
o Tends to break large clusters and to form more compact, globular clusters

• Average Linkage
o Less sensitive to noise and outliers
o Tends to form more compact, globular clusters (similar to complete
linkage)
Hierarchical Clustering
• Advantages
o Does not require the number of clusters
o Easy to implement
o Fast and less complex

• Disadvantages
o Need to know where to cut the tree
o Sensitivity to noise and outliers
o Difficulty handling different sized clusters and convex shapes
o Tend to break large clusters
Partition Based Clustering
• Aims to construct a partition of a database D of n objects into a set
of k clusters such that the sum of squared distances is minimized.
• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion e.g. minimize SSE.
Partition Based Clustering
• Within-Cluster Variation (WCV):

$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$

• Between-Cluster Variation (BCV), e.g. for two clusters:

BCV = d(c1, c2)

• Maximize the between-cluster variation with respect to the within-cluster variation:

$\frac{BCV}{WCV} = \frac{d(c_1, c_2)}{SSE}$
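A small sketch of these two quantities in Python (the toy points, labels, and centers below are made up for illustration):

```python
import numpy as np

def sse(points, labels, centers):
    # SSE = sum over clusters i, sum over p in C_i, of d(p, c_i)^2
    return sum(np.sum((points[labels == i] - c) ** 2) for i, c in enumerate(centers))

def bcv_over_wcv(points, labels, centers):
    bcv = np.linalg.norm(centers[0] - centers[1])  # BCV = d(c1, c2) for two clusters
    return bcv / sse(points, labels, centers)

pts = np.array([[1., 1.], [1., 2.], [5., 5.], [6., 5.]])
lab = np.array([0, 0, 1, 1])
cen = np.array([[1., 1.5], [5.5, 5.]])
print(sse(pts, lab, cen))                      # 1.0
print(round(bcv_over_wcv(pts, lab, cen), 2))   # ~5.7
```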
Partition Based Clustering

• k-means Clustering
• is an algorithm to cluster n objects, based on their attributes, into k partitions, k < n.
• Step 1: Ask for k,
• Step 2: Randomly assign k points as the initial cluster centers,
• Step 3: For each data point, find the nearest cluster center and assign the point to that cluster,
• Step 4: For each of the k clusters, find the new cluster center,
• Step 5: Repeat Steps 3-4 until
o Centers do not move,
o No data point changes cluster,
o The desired SSE is obtained.
• Step 1: let k be 2.
• Step 2: Randomly assign initial cluster centers; let c1 = (1, 1) and c2 = (2, 1).
• Step 3 (first pass): for each record, find the nearest cluster center (c1 = (1, 1), c2 = (2, 1)).

[Table: distance of each data point to c1 and c2, and the resulting cluster assignment.]

$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2 = 2^2 + 2.24^2 + 2.83^2 + 3.61^2 + 1^2 + 2.24^2 + 0^2 + 0^2 = 36.08$

$\frac{BCV}{WCV} = \frac{d(c_1, c_2)}{SSE} = \frac{1}{36.08} \approx 0.0278$

• We expect this ratio to increase with successive passes.
• Step 4: For each of the k clusters, find the cluster centroid and update the location of each cluster center to the new value of the centroid.

new c1 = ((1 + 1 + 1)/3, (3 + 2 + 1)/3) = (1, 2)
new c2 = ((3 + 4 + 5 + 4 + 2)/5, (3 + 3 + 3 + 2 + 1)/5) = (3.6, 2.4)
• Step 5: repeat Steps 3 and 4 until convergence.
• Step 3 (second pass): update cluster centers to c1 = (1, 2) and c2 = (3.6, 2.4). Calculate the distances between each point and the updated cluster centers.

[Table: distance of each data point to c1 and c2, and the resulting cluster assignment.]

$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2 = 1^2 + 0.85^2 + 0.72^2 + 1.52^2 + 0^2 + 0.57^2 + 1^2 + 1.41^2 = 7.88$

$\frac{BCV}{WCV} = \frac{d(c_1, c_2)}{SSE} = \frac{2.63}{7.88} \approx 0.3338$
• Step 4 (second pass): For each of the k clusters, find the cluster centroid and update the location of each cluster center to the new value of the centroid.

new c1 = ((1 + 1 + 1 + 2)/4, (3 + 2 + 1 + 1)/4) = (1.25, 1.75)
new c2 = ((3 + 4 + 5 + 4)/4, (3 + 3 + 3 + 2)/4) = (4, 2.75)

• Step 5: repeat Steps 3 and 4 until convergence.
• Step 3 (third pass): update cluster centers to c1 = (1.25, 1.75) and c2 = (4, 2.75). Calculate the distances between each point and the updated cluster centers.

[Table: distance of each data point to c1 and c2, and the resulting cluster assignment.]

$SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2 = 6.25$

$\frac{BCV}{WCV} = \frac{d(c_1, c_2)}{SSE} = \frac{2.93}{6.25} \approx 0.4688$
• Step 4 (third pass): For each of the k clusters, find the cluster centroid and update the location of each cluster center to the new value of the centroid. Since no records have shifted cluster membership, the cluster centroids also remain unchanged.
• Step 5: Repeat Steps 3 and 4 until convergence or termination. Since the centroids remain unchanged, the algorithm terminates.
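The worked example can be reproduced with a few lines of NumPy. The eight data points below are not listed explicitly on the slides; they are inferred from the centroid sums above, so treat them as an assumption:

```python
import numpy as np

# Points inferred from the centroid computations in the worked example (assumption).
X = np.array([[1., 3.], [1., 2.], [1., 1.], [2., 1.],
              [3., 3.], [4., 3.], [5., 3.], [4., 2.]])
centers = np.array([[1., 1.], [2., 1.]])   # initial centers from Step 2

for _ in range(10):
    # Step 3: assign each point to the nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Step 4: recompute each center as the mean of its cluster
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centers, centers):   # Step 5: stop when the centers stop moving
        break
    centers = new_centers

print(centers)                      # ~[[1.25, 1.75], [4.0, 2.75]]
print((d.min(axis=1) ** 2).sum())   # final SSE ~6.25
```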
K-means example, step 1
• Pick 3 initial cluster centers k1, k2, k3 (randomly).
[Figure: data points in the (X, Y) plane with the three initial centers; the same plot is updated in the following steps.]

K-means example, step 2
• Assign each point to the closest cluster center.

K-means example, step 3
• Move each cluster center to the mean of each cluster.

K-means example, step 4
• Reassign the points that are now closest to a different cluster center.
• Q: Which points are reassigned?

K-means example, step 4 (continued)
• A: three points are reassigned.

K-means example, step 4b
• Re-compute the cluster means.

K-means example, step 5
• Move the cluster centers to the cluster means.
k-means Clustering
• Strength:
o Relatively efficient and fast: O(tkn)
o Easy to understand
o Often terminates at a local optimum

• Weakness
o Applicable only when mean is defined, then what about categorical data?
o Need to specify k, the number of clusters, in advance
o Unable to handle noisy data and outliers
o Not suitable to discover clusters with non-convex shapes
o Result can vary significantly depending on initial choice of centroids
o Total steps can vary depending on initial choice of centroids
k-means Clustering types
• Alternatives
• K-medians – instead of the mean, use the median of each cluster
o Mean of 1, 3, 5, 7, 1009 is 205
o Median of 1, 3, 5, 7, 1009 is 5

• K-modes – to cluster categorical data by using modes instead of means for clusters.
• K-medoids
o A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal (see the sketch after this list).
o PAM (Partitioning Around Medoids) Algorithm

• Fuzzy c-means
o a method of clustering which allows one piece of data to belong to two or more clusters.
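A minimal sketch of the medoid idea (the cluster points are made up; note how the outlier pulls the mean but not the medoid):

```python
import numpy as np

def medoid(points):
    # Pairwise Euclidean distances; the medoid is the row with the smallest total distance.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[d.sum(axis=1).argmin()]

cluster = np.array([[0., 0.], [1., 0.], [2., 0.], [10., 0.]])
print(cluster.mean(axis=0))  # [3.25 0.]  -- the mean is dragged toward the outlier
print(medoid(cluster))       # [1. 0.]    -- the medoid stays inside the dense region
```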
Fuzzy c-Means Clustering
• Step 1: Ask for k,
• Step 2: Randomly assign k points as the initial cluster centers,
• Step 3: For each data point, find its membership degree to each cluster according to the following formula:

$u_{ij} = \left[ \sum_{k=1}^{K} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)} \right]^{-1}$

where m > 1, K is the number of clusters, and u_{ij} is the membership degree of x_i to the j-th cluster,
• Step 4: For each of the k clusters, find the new cluster center as the membership-weighted mean of the data points,
• Step 5: Repeat Steps 3-4 until the desired SSE is obtained.
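A compact fuzzy c-means sketch following the steps above (assumes NumPy; the membership and centre updates are the standard FCM formulas, and the sample data and m = 2 are illustrative):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]          # Step 2: random initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Step 3: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        # Step 4: c_j = sum_i u_ij^m x_i / sum_i u_ij^m  (membership-weighted mean)
        um = u ** m
        new_centers = (um.T @ X) / um.sum(axis=0)[:, None]
        if np.allclose(new_centers, centers):                       # Step 5: stop when converged
            break
        centers = new_centers
    return centers, u

X = np.array([[1., 1.], [1.5, 2.], [3., 4.], [5., 7.], [3.5, 5.], [4.5, 5.], [3.5, 4.5]])
centers, u = fuzzy_c_means(X, c=2)
print(centers)          # two fuzzy cluster centers
print(np.round(u, 2))   # membership of each point in each cluster
```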


Density Based Clustering
• Clustering based on density (local cluster criterion), such as density-
connected points.
• Major features:
o Discover clusters of arbitrary shape
o Handle noise
o One scan
o Need density parameters as termination condition

• Several interesting studies:
o DBSCAN: Ester et al. (KDD’96)
o OPTICS: Ankerst et al. (SIGMOD’99)
o DENCLUE: Hinneburg & Keim (KDD’98)
DBSCAN

• Density-Based Spatial Clustering of Applications with Noise.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has more than a specified number of points (MinPts) within Eps
• These are points that are at the interior of a cluster
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
– A noise point is any point that is not a core point or a border point.
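A sketch of DBSCAN with scikit-learn, recovering the core/border/noise split described above (assumes scikit-learn is installed; the data, Eps, and MinPts values are illustrative, not the ones from the slides' figures):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(30, 2)),   # dense blob 1
    rng.normal(loc=(3.0, 3.0), scale=0.3, size=(30, 2)),   # dense blob 2
    [[10.0, 10.0]],                                        # isolated point -> noise
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)    # Eps and MinPts
labels = db.labels_                           # cluster labels; -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # core points
border_mask = (labels != -1) & ~core_mask     # clustered but not core -> border points

print("clusters found:", len(set(labels) - {-1}))
print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", int((labels == -1).sum()))
```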
DBSCAN: Core, Border, and Noise Points
[Figure: example points labelled as core, border, and noise.]

DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points

DBSCAN: Core, Border and Noise Points
[Figure: original points and their point types (core, border, noise) for Eps = 10, MinPts = 4.]
When DBSCAN Works Well
[Figure: original points and the resulting clusters.]
• Resistant to noise
• Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well
[Figure: original points and the resulting clusters for (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92).]
• Varying densities
• High-dimensional data
Fuzzy Joint Points Clustering (FJP)
[Figure slides: graphical illustration of the Fuzzy Joint Points (FJP) clustering method, including a "max" selection step.]
Model Based Methods
• Attempt to optimize the fit between the given data and some mathematical model
• Use statistical functions
