IT3080 Lecture04 2023
Partitioning Method
Hierarchical Method
Density-based Method
Fuzzy Clustering
Model-Based Method
CLUSTERING METHODS – PARTITIONING METHODS
Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty in discovering
clusters of arbitrary shapes.
CLUSTERING METHODS – DENSITY-BASED METHODS
Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing a given cluster as long as the density (the number of
objects or data points) in the “neighborhood” exceeds some threshold.
For example, for each data point within a given cluster, the neighborhood of a given radius has
to contain at least a minimum number of points.
Such a method can be used to filter out noise or outliers and discover clusters of arbitrary shape.
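As a minimal sketch of this neighborhood-density idea, scikit-learn's DBSCAN grows a cluster wherever the eps-radius neighborhood of a point contains at least min_samples points, and labels everything else as noise (-1). The data and parameter values here are hypothetical choices for illustration, not from the slides:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense blobs plus uniformly scattered noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),    # dense region 1
               rng.normal(3.0, 0.3, (100, 2)),    # dense region 2
               rng.uniform(-2.0, 5.0, (20, 2))])  # sparse noise

# eps = neighborhood radius; min_samples = minimum number of points the
# neighborhood must contain for the cluster to keep growing.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}),
      "| points labelled noise:", int((labels == -1).sum()))
```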
Choosing variables
Similarity and dissimilarity measurement
Standardization
Weights and thresholds
CHOOSING VARIABLES
Select relevant variables
Example: identifying which types of drivers are at high risk of insurance claims.
Relevant variables: age, penalties, marital status.
Irrelevant variables: height and weight of the vehicle.
Including a variable such as the height or weight of the vehicle may adversely
affect the outcome of the categorization because it is not relevant to the
problem.
The fewer the variables, the better, as long as they adequately address the problem.
SIMILARITY AND DISSIMILARITY MEASUREMENT
Euclidean distance
How can the distance between two points in a 2D space be calculated?
Pythagoras' theorem can be used.
A general form: the distance between two points
$A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$
SIMILARITY AND DISSIMILARITY MEASUREMENT (CONTD.)
In a 3D space and, in general, in an n-dimensional space:
$d(A, B) = d(B, A) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}$
In cluster analysis, the distance between two points in the same cluster is known as the within-cluster distance.
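As a minimal sketch, the general form above can be computed directly; the function name and example points are illustrative:

```python
import numpy as np

def euclidean_distance(a, b):
    """d(A, B) = sqrt(sum_i (a_i - b_i)^2); symmetric, so d(A, B) == d(B, A)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Example with two 3D points: sqrt(3^2 + 4^2 + 0^2) = 5.0
print(euclidean_distance([1, 2, 3], [4, 6, 3]))
```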
STANDARDIZATION
When different variables are represented in different
dimensions (units), standardization of the variables might be
required.
The standardization of an attribute involves two steps:
calculate the difference between the value of the attribute and the mean
of that attribute across all samples, and
divide the difference by the attribute's standard deviation
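A minimal sketch of these two steps, applied column-wise to a samples-by-attributes matrix (the function name and data are illustrative):

```python
import numpy as np

def standardize(X):
    """Z-score each attribute (column): subtract its mean, divide by its std."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example: age in years and income in dollars end up on comparable scales.
X = np.array([[25, 30_000], [35, 60_000], [45, 90_000]])
print(standardize(X))
```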
K-MEANS ALGORITHM
Given an input k, which denotes the number of expected clusters, k
centers or centroids are defined, which in turn define the k
partitions.
The centroid is (typically) the mean of the points in the cluster.
‘Closeness’ is measured by Euclidean distance, cosine similarity,
correlation, etc.
Initial centroids are often chosen randomly.
WHAT IS A CENTROID?
A centroid is the mean position of a group of points
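For instance (with hypothetical points), the centroid is simply the coordinate-wise mean:

```python
import numpy as np

points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])  # hypothetical cluster
centroid = points.mean(axis=0)  # mean of each coordinate
print(centroid)                 # [3. 2.]
```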
K-MEANS ALGORITHM
Based on these centers (centroids), the algorithm assigns each object to
its nearest centroid, thus building a partition, and then re-computes the
new centers as the means of the identified members.
This process is iterated until the assignments, and hence the centroids,
stop changing, yielding a stable partition.
Hence, the accuracy of the centroids is the key for the partition-
based clustering algorithm to be successful.
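A minimal from-scratch sketch of this assign/recompute loop; the initialization, stopping rule, and names are illustrative choices, not the only options:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal k-means: pick k random points as the initial centroids, then
    alternate (1) assigning each point to its nearest centroid and
    (2) recomputing each centroid as the mean of its members."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # stable, so converged
            break
        centroids = new_centroids
    return labels, centroids

# Example on hypothetical 2D data with two obvious groups.
X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.8, 5.2]])
labels, centroids = k_means(X, k=2)
print(labels, centroids, sep="\n")
```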
HOW THE CLUSTERS ARE COMPUTED
[Figure: scatter plots of the data over k-means iterations 1–6, showing points re-assigned and centroids re-computed until the clusters stabilize]
DENDROGRAM: HIERARCHICAL CLUSTERING
[Figure: six sample points (1–6) with their single-link and complete-link dendrograms]
EXAMPLE - SINGLE-LINK
[Figures: step-by-step single-link merges of points 1–6, with the dendrogram growing at each step; final leaf order 3 6 2 5 4 1]
EXAMPLE - COMPLETE-LINK
[Figures: step-by-step complete-link merges of points 1–6, with the dendrogram growing at each step; final leaf order 3 6 4 1 2 5]
HIERARCHICAL CLUSTERING: COMPLETE-LINK
[Figure: the final complete-link clustering of points 1–6 with the full dendrogram; leaf order 3 6 4 1 2 5]
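Dendrograms like these can be reproduced with SciPy's hierarchical clustering; the six 2D coordinates below are hypothetical stand-ins for the plotted points:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical 2D coordinates for the six points labelled 1-6.
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

for method in ("single", "complete"):
    Z = linkage(points, method=method)        # pairwise merge history
    dendrogram(Z, labels=[1, 2, 3, 4, 5, 6])  # draw the tree
    plt.title(f"{method}-link dendrogram")
    plt.show()
```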
DEMO
THANK YOU