Chapter 7 Clustering Exercises v2 20230112
Exercises
Exercise 1
Theoretical Questions
- In k-means the central point is calculated as the mean of all points in the cluster.
This center is usually not a real data point in the training set.
- In k-medoids the central point is the point located most centrally in the
cluster, i.e. the one with the smallest total distance to all other points in the
cluster. This central point is called the medoid and is a real data point in the
training set.
- If the mean of a data set cannot be calculated, then k-means cannot be used.
In this case, k-medoids can be used instead.
- This is the case for categorical data, e.g. where an attribute contains strings. We
cannot calculate a mean value for it. We could define a distance function, e.g. 0 for
equal strings and 1 otherwise, but there is still no mean value (see the sketch after
this list).
- Within-Cluster Variation and Between-Cluster Variation:
Can these two quality measures be used to determine the k in k-means? Why?
- Local optimum: Typically it is a drawback, because the main goal is to find a global
optimum, for which we can guarantee that all other points in the search space are no
better. In a local optimum we can guarantee this only for a neighbourhood around
that point, which can be very small.
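As a small illustration of the categorical case above, here is a sketch in plain Python (the function names are made up for this example): with the 0/1 distance a mean cannot be defined, but a medoid, i.e. the point with the smallest total distance to all other points, can still be chosen.

```python
# Sketch: a medoid exists for categorical data even though a mean does not.
def categorical_distance(a: str, b: str) -> float:
    """0 for equal strings, 1 otherwise; a mean of strings is undefined."""
    return 0.0 if a == b else 1.0

def medoid(points):
    """Return the point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(categorical_distance(p, q) for q in points))

colors = ["red", "blue", "red", "green", "red"]
print(medoid(colors))  # "red": it has the smallest total 0/1 distance to the others
```

With this distance, the medoid is simply the most frequent value, which is exactly the kind of representative that k-medoids can still compute while k-means cannot.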
x 3 6 8 1 2 2 6 6 7 7 8 8
y 5 2 3 5 4 6 1 8 3 6 1 7
• Use the first 3 data tuples as initial cluster centroids
k = 3
[Figures: scatter plots (x and y from 0 to 10) of the cluster assignments at the start (iteration 0) and after iterations 1–5.]
- The k-means algorithm needs 3 iterations.
[Figure: Voronoi regions of the final cluster centroids.]
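A sketch of this run, assuming scikit-learn is available: the first three data tuples are passed as the initial centroids and k = 3. scikit-learn's iteration counting and stopping criterion may differ slightly from the slides, so treat the printed iteration number as illustrative.

```python
# Sketch: k-means on the 12 points above, using the first 3 tuples as initial centroids.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[3, 5], [6, 2], [8, 3], [1, 5], [2, 4], [2, 6],
              [6, 1], [6, 8], [7, 3], [7, 6], [8, 1], [8, 7]], dtype=float)

init_centroids = X[:3]                        # first 3 data tuples
km = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)

print("labels:    ", km.labels_)
print("centroids:\n", km.cluster_centers_)
print("iterations:", km.n_iter_)              # counting convention may differ from the slides
```

The regions in which a new point would be assigned to each final centroid are exactly the Voronoi regions of the centroids shown in the last figure.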
x 3 3 4 4 5 6 7 7 8 9 1 2 2 3 4 5 5 6 7 7
y 1 2 2 3 3 4 4 6 5 7 3 4 5 6 6 7 8 8 8 9
- Which problem arises when this dataset is clustered with the k-means algorithm
with k = 2 clusters?
- Describe the difference between the desired behaviour and the result of the
k-means algorithm.
- Which clustering algorithm could better fit the data and why?
[Figures: scatter plots of the dataset and of the k-means result with k = 2.]
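A sketch for experimenting with these questions, assuming scikit-learn: it shows how k-means with k = 2 partitions the points and, for comparison, what a density-based clustering produces on the same data. The DBSCAN parameters below are an illustrative assumption, not the official solution, and sklearn's min_samples counts the point itself.

```python
# Sketch: k-means with k=2 versus a density-based clustering on the 20 points above.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.array([[3, 1], [3, 2], [4, 2], [4, 3], [5, 3], [6, 4], [7, 4], [7, 6], [8, 5], [9, 7],
              [1, 3], [2, 4], [2, 5], [3, 6], [4, 6], [5, 7], [5, 8], [6, 8], [7, 8], [7, 9]],
             dtype=float)

print("k-means, k=2:", KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
print("DBSCAN:      ", DBSCAN(eps=1.5, min_samples=2).fit_predict(X))  # label -1 = noise
```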
- Aim: Find the parameters for the normal distributions and how much
each normal distribution contributes to the data.
x 2 12 16 25 29 45
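A sketch of the EM idea for the aim above, assuming scikit-learn and assuming that the 1-D data line above is the data set meant here; the number of normal distributions (2) is also an assumption for illustration. EM returns the means, variances and mixing weights, i.e. how much each normal distribution contributes to the data.

```python
# Sketch: fit a mixture of normal distributions to the 1-D data with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

x = np.array([2, 12, 16, 25, 29, 45], dtype=float).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)   # 2 components assumed
print("means:    ", gmm.means_.ravel())
print("variances:", gmm.covariances_.ravel())
print("weights:  ", gmm.weights_)    # contribution of each normal distribution
```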
Centroid Method
[Figure: dendrogram; the cluster distances at the merge steps are 4, 4, 12, 17 and 28.2, and each merge produces a new centroid, which serves as the cluster representative.]
Single Linkage
[Figure: dendrogram with the cluster distances; there is no cluster representative.]
Complete Linkage
[Figure: dendrogram with the cluster distances; there is no cluster representative.]
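A sketch for reproducing the three dendrograms, assuming SciPy is available: scipy.cluster.hierarchy.linkage supports the centroid, single and complete methods, and the distance of each merge step is stored in the third column of the linkage matrix.

```python
# Sketch: hierarchical clustering of the 1-D data with three linkage methods.
import numpy as np
from scipy.cluster.hierarchy import linkage

x = np.array([2, 12, 16, 25, 29, 45], dtype=float).reshape(-1, 1)

for method in ("centroid", "single", "complete"):
    Z = linkage(x, method=method)
    # Column 2 of the linkage matrix holds the distance at each merge step.
    print(method, "merge distances:", np.round(Z[:, 2], 1))
```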
x 3 3 4 4 5 6 7 7 8 9 1 2 2 3 4 5 5 6 7 7
y 1 2 2 3 3 4 4 6 5 7 3 4 5 6 6 7 8 8 8 9
ε = 1.2
[Figure: scatter plot of the dataset with the points labelled A–T.]
ε = 1.2, MinPts = 2
Let's look for core points. Is MinPts = 2 satisfied?
[Figure: ε-neighbourhoods drawn around the points. All points in a red circle are core points; all points in a grey circle are noise points, because no other point lies within their ε-circle, so MinPts is not satisfied.]
Final Solution (ε = 1.2):
Start from any point, until you find a core point; then expand the cluster from there.
[Figure: the resulting DBSCAN clusters for ε = 1.2.]
ε = 1.5
Core points are just C and R.
[Figure: ε-neighbourhoods for ε = 1.5; only C and R contain the required number of points.]
Final Solution (ε = 1.5):
From point C directly density-reachable are: A, B, D and E.
From point R directly density-reachable are: P, Q, S and T.
[Figure: the two resulting clusters; the remaining points are not density-connected to a core point.]
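A sketch for checking the ε = 1.2 case, assuming scikit-learn: note that sklearn's min_samples counts the point itself, so min_samples = 2 here means "at least one other point within ε", which may or may not match the lecture's MinPts convention. The mapping of the labels A–T to the column order of the data table is also an assumption.

```python
# Sketch: DBSCAN with eps=1.2 and min_samples=2 on the 20 labelled points.
import numpy as np
from sklearn.cluster import DBSCAN

point_names = list("ABCDEFGHIJKLMNOPQRST")   # assumed A..T order = column order of the table
X = np.array([[3, 1], [3, 2], [4, 2], [4, 3], [5, 3], [6, 4], [7, 4], [7, 6], [8, 5], [9, 7],
              [1, 3], [2, 4], [2, 5], [3, 6], [4, 6], [5, 7], [5, 8], [6, 8], [7, 8], [7, 9]],
             dtype=float)

db = DBSCAN(eps=1.2, min_samples=2).fit(X)
for name, cluster in zip(point_names, db.labels_):
    print(name, "noise" if cluster == -1 else f"cluster {cluster}")
```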
- Find the clusters, using the DBSCAN algorithm with Euclidean distance and
MinPoints = 2, using ε = 1.5, ε = 2.1, and ε = 10.
- Which objects are not covered by each clustering solution?
ε = 1.5
Outliers = A, B, E, G
Clusters = {C; D}, {F; H}
ε = 2.1
Outlier = B
Clusters = {A; C; D}, {E; F; G; H}
ε = 10
Outliers = none
Cluster = {A; B; C; D; E; F; G; H}
Plot latitude and longitude in a view (OSM Map or Scatter Plot) and use that to
help you visually optimize k.
k = 3: Avg. Silhouette Coeff. = 0.6
k = 6: Avg. Silhouette Coeff. = 0.571
k = 10: Avg. Silhouette Coeff. = 0.496
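A sketch for computing these values, assuming scikit-learn and pandas; the file name and the column names 'latitude' and 'longitude' are assumptions and have to be adapted to the actual data set of the exercise.

```python
# Sketch: average silhouette coefficient for several k on latitude/longitude data.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("locations.csv")                  # assumed file with latitude/longitude
X = df[["latitude", "longitude"]].to_numpy()

for k in (3, 6, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k = {k}: avg. silhouette coeff. = {silhouette_score(X, labels):.3f}")
```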