DEU CSC5045 Intelligent System Applications Using Fuzzy - 4+clustering
Measure of Similarity,
Hierarchical Clustering,
K-means Clustering
References:
Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques.
Larose, D. T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining.
Tan, P., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining.
Bramer, M. (2007). Principles of Data Mining.
Birant, D. (2012). Lecture Notes.
Vahaplar, A. (2012). Lecture Notes.
Clustering
• Clustering is the process of grouping a set of physical or abstract
unlabelled objects into classes of similar objects.
• A Cluster is a collection of data objects that are similar to one
another within the same cluster, and dissimilar to the objects in
other clusters.
• Clustering is an important human activity:
o Distinguishing animals and plants, male and female, cars and buses, etc.
• Goals:
o Detecting natural groups in data,
o Creating homogeneous classes,
o Data reduction, outlier detection.
Clustering
• Measuring Similarity
or measuring dissimilarity?
• A distance measure d(x, y), used to calculate the difference between two
objects, should have the properties:
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z. (Triangle inequality)
Clustering
• Distance Functions
• Euclidean Distance:
  d_Euc(x, y) = sqrt( Σ_i (x_i − y_i)² )
• Manhattan Distance:
  d_Man(x, y) = Σ_i |x_i − y_i|
• Minkowski Distance:
  d_Min(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)
Clustering
• Distance Measure
• Example:
Clustering
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Euclidean distance matrix:
        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Manhattan distance matrix:
        p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0
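The distance functions above can be sketched in a few lines of Python; the calls below reproduce entries of the example matrices for p1–p4 (the function names are illustrative, not from the slides).

```python
def minkowski(x, y, p):
    """General Minkowski distance; p = 1 gives Manhattan, p = 2 Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)

def manhattan(x, y):
    return minkowski(x, y, 1)

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

# Reproduce entries of the distance matrices above
print(round(euclidean(points["p1"], points["p2"]), 3))  # 2.828
print(round(manhattan(points["p1"], points["p4"]), 3))  # 6.0
```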
• Dissimilarity between binary attributes: contingency table for objects i and j

                    Object j
                  1     0    sum
  Object i   1    a     b    a+b
             0    c     d    c+d
            sum  a+c   b+d    p

• Asymmetric binary dissimilarity (negative matches d are ignored):
  d(i, j) = (b + c) / (a + b + c)

Example for Clustering Categorical Data
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Jack and Mary are the most likely to have a similar disease.
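A short sketch of the asymmetric binary dissimilarity, using the patient records from Han & Kamber that this example is based on (1 = positive/yes, 0 = negative/no; the attribute vectors are taken from that textbook, not from these slides):

```python
records = {
    #        fever cough test1 test2 test3 test4
    "Jack": (1, 0, 1, 0, 0, 0),
    "Mary": (1, 0, 1, 0, 1, 0),
    "Jim":  (1, 1, 0, 0, 0, 0),
}

def binary_dissimilarity(i, j):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    # d (both attributes 0) is ignored: the attributes are asymmetric
    return (b + c) / (a + b + c)

print(round(binary_dissimilarity(records["Jack"], records["Mary"]), 2))  # 0.33
print(round(binary_dissimilarity(records["Jack"], records["Jim"]), 2))   # 0.67
print(round(binary_dissimilarity(records["Jim"], records["Mary"]), 2))   # 0.75
```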
Clustering Methods
• Hierarchical Methods
o AGNES, DIANA, BIRCH, Fuzzy Joint Points (FJP), ...
• Partitioning Methods
o K-Means, K-Medoids, Fuzzy c-Means, ...
• Density-Based Methods
o DBSCAN, OPTICS, Fuzzy Joint Points (FJP), ...
• Grid-Based Methods
o STING, WaveCluster, CLIQUE ...
• Model-Based Methods
o COBWEB, CLASSIT, SOM (Self-Organizing Feature Maps) ...
Hierarchical Clustering
• A tree-like cluster structure (dendrogram)
• Agglomerative methods
o Each item starts as a tiny cluster of its own,
o The two closest clusters are merged at each step,
o At the end, all items are in one cluster.
• Divisive methods
o All items start in one cluster,
o The most dissimilar cluster is split off at each step,
o At the end, each record represents its own cluster.
Hierarchical Clustering
• Measuring distance between clusters in Hierarchical Clustering
• Single linkage,
o the nearest-neighbor approach,
o based on the minimum distance between any record in two clusters
• Complete linkage,
o the farthest-neighbor approach,
o based on the maximum distance between any record in two clusters.
• Average linkage,
o is designed to reduce the dependence of the cluster-linkage criterion on
extreme values, such as the most similar or dissimilar records.
o the criterion is the average distance of all the records in cluster A from all
the records in cluster B.
Hierarchical Clustering
• Single link: the smallest distance between an element in one cluster
  and an element in the other:
  d_min(A, B) = min { d(x, y) : x ∈ A, y ∈ B }
• Average link: the average distance over all cross-cluster pairs:
  d_avg(A, B) = ( Σ_{x ∈ A} Σ_{y ∈ B} d(x, y) ) / (|A| · |B|)
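A minimal sketch of agglomerative clustering with single linkage, run on the p1–p4 example points used earlier in these notes; each step merges the two clusters whose closest members are nearest (the helper names are illustrative):

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def single_link(A, B):
    """Single linkage: minimum distance between any pair across clusters."""
    return min(euclidean(x, y) for x in A for y in B)

def agglomerate(points):
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under single linkage
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append(round(single_link(clusters[i], clusters[j]), 3))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return merges  # merge heights, in order, as on a dendrogram's vertical axis

print(agglomerate([(0, 2), (2, 0), (3, 1), (5, 1)]))  # [1.414, 2.0, 2.828]
```

The returned merge heights are exactly the levels at which a dendrogram for these four points would join its branches.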
How the Clusters are Merged?
[Figure: merge sequences for six points (1–6) and the resulting dendrograms
under different linkage criteria; one surviving panel label reads "Average
Link". The vertical axis shows the distance at which each merge occurs.]
Hierarchical Clustering
• Single Linkage
o Can handle non-elliptical shapes
o Sensitive to noise and outliers
• Complete Linkage
o Less sensitive to noise and outliers
o Tends to break large clusters and to form more compact, globular clusters
• Average Linkage
o Less sensitive to noise and outliers
o Tends to form more compact, globular clusters (similar to complete
linkage)
Hierarchical Clustering
• Advantages
o Does not require the number of clusters in advance
o Easy to implement
o Fast and less complex
• Disadvantages
o Need to know where to cut the tree
o Sensitivity to noise and outliers
o Difficulty handling different sized clusters and convex shapes
o Tend to break large clusters
Partition Based Clustering
• Aims to construct a partition of a database D of n objects into a set
of k clusters such that the sum of squared distances is minimized.
• Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion e.g. minimize SSE.
Partition Based Clustering
• Within-Cluster Variation (WCV): the sum of squared distances from each
  point to its cluster center; WCV = SSE = Σ_k Σ_{x ∈ C_k} d(x, c_k)²
• Between-Cluster Variation (BCV): the distance between cluster centers,
  BCV = d(c_1, c_2)
• A good partition has small WCV relative to BCV.
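The two quantities can be sketched directly, assuming squared Euclidean distance and an illustrative two-cluster partition (the data below are made up for the demonstration):

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def wcv(clusters):
    """Within-cluster variation: sum of squared distances to each centroid (SSE)."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def bcv(clusters):
    """Between-cluster variation: distance between the two cluster centers."""
    c1, c2 = centroid(clusters[0]), centroid(clusters[1])
    return sq_dist(c1, c2) ** 0.5

clusters = [[(1, 1), (1, 2), (1, 3)], [(4, 2), (4, 3), (5, 3)]]
print(round(wcv(clusters), 3), round(bcv(clusters), 3))
```

A tight, well-separated partition like this one keeps WCV small while BCV stays large.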
Partition Based Clustering
• k-means Clustering
• is an algorithm to cluster n objects based on attributes into k
partitions, k < n
• Step 1: Ask the user for k, the number of clusters,
• Step 2: Randomly assign k points as the initial cluster centers,
• Step 3: For each data point, find the nearest cluster center and
assign it to that cluster,
• Step 4: For each of the k clusters, find the new cluster center (the mean
of its points),
• Step 5: Repeat Steps 3–4 until
o Centers do not move,
o No data point changes cluster, or
o The desired SSE is obtained.
Step 1: let k be 2.
Step 2: Randomly assign initial cluster centers, let c1 = (1,1) and c2 = (2,1).
Data points: (1,3), (3,3), (4,3), (5,3), (1,2), (4,2), (1,1), (2,1).
Step 3 (first pass): for each record, find the nearest cluster center.
Cluster 1 gets (1,3), (1,2), (1,1); cluster 2 gets the remaining five points.
Step 4: recompute the centers as the cluster means:
  new c1 = (1, 2), new c2 = (3.6, 2.4)
Step 5: repeat Steps 3 and 4 until convergence.
Step 3 (second pass): with the updated centers c1 = (1,2) and c2 = (3.6,2.4),
recalculate the distance from each point to each center and reassign.
Point (2,1) is now closer to c1 and switches clusters.
Step 4: recompute the centers:
  new c1 = (1.25, 1.75), new c2 = (4, 2.75)
A third pass changes no assignment, so the algorithm stops.
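The loop above can be sketched end-to-end; the code below runs k-means on eight points consistent with the centers computed in the worked example, starting from the same initial centers c1 = (1,1), c2 = (2,1):

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, centers):
    # Assumes 2-D points and that no cluster ever becomes empty,
    # which holds for this small example.
    while True:
        # Step 3: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(p[i] for p in c) / len(c) for i in range(2)) for c in clusters
        ]
        if new_centers == centers:  # Step 5: stop when centers no longer move
            return centers
        centers = new_centers

points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
print(kmeans(points, [(1, 1), (2, 1)]))  # [(1.25, 1.75), (4.0, 2.75)]
```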
K-means illustrated on a two-dimensional data set with k = 3 (centers k1, k2, k3):
• Step 1: pick 3 initial cluster centers (randomly).
• Step 2: assign each point to the closest cluster center.
• Step 3: move each cluster center to the mean of its cluster.
• Step 4: reassign the points that are now closest to a different center.
  Q: Which points are reassigned? A: three points change cluster.
• Step 4b: re-compute the cluster means.
• Step 5: move the cluster centers to the cluster means; repeat until no
  point changes cluster.
[Figure: scatter plots of each step, showing centers k1, k2, k3 moving.]
k-means Clustering
• Strength:
o Relatively efficient and fast: O(tkn), where n = # objects, k = # clusters, t = # iterations
o Easy to understand
o Often terminates at a local optimum
• Weakness
o Applicable only when mean is defined, then what about categorical data?
o Need to specify k, the number of clusters, in advance
o Unable to handle noisy data and outliers
o Not suitable to discover clusters with non-convex shapes
o Result can vary significantly depending on initial choice of centroids
o Total steps can vary depending on initial choice of centroids
k-means Clustering types
• Alternatives
• K-medians – instead of the mean, use the median of each cluster
o Mean of 1, 3, 5, 7, 1009 is 205
o Median of 1, 3, 5, 7, 1009 is 5
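The outlier 1009 drags the mean far from the bulk of the data while the median stays put, which is why k-medians is more robust to outliers; this can be checked with the standard library:

```python
import statistics

data = [1, 3, 5, 7, 1009]
# The single outlier dominates the mean but leaves the median untouched
print(statistics.mean(data), statistics.median(data))
```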
• Fuzzy c-means
o a method of clustering which allows one piece of data to belong to two or more
clusters.
Fuzzy c-Means Clustering
• Step 1: Ask the user for c, the number of clusters,
• Step 2: Randomly assign c points as the initial cluster centers,
• Step 3: For each data point x_i, compute its membership degree to each
cluster j according to the following formula:
  u_ij = 1 / Σ_{k=1..c} ( ||x_i − c_j|| / ||x_i − c_k|| )^(2/(m−1))
where m > 1 is the fuzzifier.
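The standard fuzzy c-means membership update (Step 3) can be sketched for two fixed cluster centers, with the common default fuzzifier m = 2; the point, centers, and names below are illustrative:

```python
def dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def memberships(point, centers, m=2):
    # Assumes the point does not coincide with any center (division by zero)
    degs = []
    for cj in centers:
        denom = sum(
            (dist(point, cj) / dist(point, ck)) ** (2 / (m - 1)) for ck in centers
        )
        degs.append(1 / denom)
    return degs

centers = [(1, 2), (4, 3)]
u = memberships((2, 2), centers)
print([round(x, 3) for x in u])  # [0.833, 0.167]
print(sum(u))                    # degrees for one point sum to (approximately) 1
```

Unlike k-means, every point belongs to every cluster to some degree; the point (2,2) leans strongly toward the nearer center (1,2).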
When DBSCAN Works Well
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
• Varying densities
• High-dimensional data
[Figure: the original points and the DBSCAN results for (MinPts = 4,
Eps = 9.75) and (MinPts = 4, Eps = 9.92).]
Fuzzy Joint Points Clustering (FJP)
[Figures: step-by-step illustration of the FJP method.]
Model Based Methods
• Attempt to optimize the fit between the given data and some mathematical model
• Based on statistical functions