
Clustering: Exercises

Exercise 1
Theoretical Questions


Theoretical Questions

1. What is the difference between k-means and k-medoids?
   Give a typical data definition where one of the two is not applicable.
2. Name two quality measures for k-means / k-medoids clustering.
   Include a short definition and explanation of both.
3. Can these two quality measures be used to determine the k in k-means? Why?
4. The Expectation Maximization algorithm has the drawback that it can
   converge to a local maximum of the likelihood:
   a. Why is it a drawback?
   b. Why is a local maximum bad?
   c. Why can the local maximum happen?



1. k-Means and k-Medoids

What is the difference between k-means and k-medoids? Give a typical data definition where one of the two is not applicable.

- In k-means the central point is calculated as the mean of all points in the cluster. This center is usually not a real data point in the training set.
- In k-medoids the central point is the one point, located most centrally in the cluster, that has the smallest total distance to all other points in the cluster. This central point is called the medoid and is a real data point in the training set.
- If the mean of a data set cannot be calculated, then k-means cannot be used. In this case, k-medoids can be used.
- This is the case for categorical data, e.g. where an attribute contains strings. We cannot calculate a mean value, but we can define a distance function, e.g. 0 for equal strings and 1 otherwise.

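The following is a minimal sketch of this idea in Python: a medoid can be computed for strings under the 0/1 distance from the slide, even though no mean exists. The data and helper names are illustrative, not part of the exercise.

```python
def distance(a, b):
    """0 for equal strings, 1 otherwise: a valid distance, although no mean exists."""
    return 0 if a == b else 1

def medoid(cluster):
    """The medoid is the cluster member with the smallest total distance to all others."""
    return min(cluster, key=lambda x: sum(distance(x, y) for y in cluster))

colors = ["red", "red", "blue", "red", "green"]
print(medoid(colors))  # "red" is the most central value under the 0/1 distance
```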


2. Cluster Quality Measures

Name two quality measures for k-means / k-medoids clustering. Include a short definition and explanation of both.

Centroid $\mu_i$: mean vector of all objects in cluster $C_i$.

- Within-cluster variation (compactness): $WC(C) = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, \mu_i)^2$
- Between-cluster variation (separation): $BC(C) = \sum_{i=1}^{k} \sum_{j=i+1}^{k} d(\mu_i, \mu_j)^2$
- Clustering quality (one possible measure): $Q(C) = BC(C) / WC(C)$, i.e. compact, well-separated clusters give a high value.
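A short sketch computing both measures for a given partition; the points and labels are illustrative placeholders, and the computation follows the definitions above.

```python
import numpy as np

# Illustrative data: 6 points, 2 clusters (labels are placeholders).
points = np.array([[3., 5], [6, 2], [8, 3], [1, 5], [2, 4], [2, 6]])
labels = np.array([0, 1, 1, 0, 0, 0])
k = labels.max() + 1

centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])

# Within-cluster variation: squared distances of points to their own centroid.
wc = sum(((points[labels == i] - centroids[i]) ** 2).sum() for i in range(k))

# Between-cluster variation: squared distances between pairs of centroids.
bc = sum(((centroids[i] - centroids[j]) ** 2).sum()
         for i in range(k) for j in range(i + 1, k))

print(wc, bc, bc / wc)  # higher bc / lower wc -> better clustering
```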


2. Silhouette Coefficient for an Object

Name two quality measures for k-means / k-medoids clustering. Include a short definition and explanation of both.

The silhouette coefficient [Kaufman & Rousseeuw 1990] measures the quality of a clustering:

- $a(x)$: distance of object $x$ to its cluster representative
- $b(x)$: distance of object $x$ to the representative of the "second-best" cluster
- Silhouette of $x$: $s(x) = \dfrac{b(x) - a(x)}{\max\{a(x), b(x)\}}$, with $s(x) \in [-1, 1]$
- The average silhouette is calculated for each cluster, then averaged over all clusters.
- Good clusters → high silhouette coefficient.
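In practice the silhouette can be computed directly, for example with scikit-learn; note that scikit-learn implements the original Kaufman & Rousseeuw definition based on average distances to the own and to the nearest other cluster, rather than distances to representatives. The data below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.array([[3., 5], [6, 2], [8, 3], [1, 5], [2, 4], [2, 6],
              [6, 1], [6, 8], [7, 3], [7, 6], [8, 1], [8, 7]])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_samples(X, labels))  # s(x) for every object
print(silhouette_score(X, labels))    # average over all objects
```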


2. Separation Index

Name two quality measures for k-means / k-medoids clustering. Include a short definition and explanation of both.

- The separation index is designed to identify compact and well-separated clusters:

$$D = \min_{i=1,\dots,C} \; \min_{j=i+1,\dots,C} \left( \frac{\min_{x \in C_i,\, y \in C_j} d(x, y)}{\max_{k=1,\dots,C} \operatorname{diam}(C_k)} \right)$$

- where $\operatorname{diam}(C_k) = \max_{x, y \in C_k} d(x, y)$ expresses the extent of cluster $C_k$.
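A direct, brute-force sketch of this index (a Dunn-type index) on illustrative data:

```python
import numpy as np

def separation_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    clusters = [points[labels == c] for c in np.unique(labels)]
    diam = max(max(np.linalg.norm(x - y) for x in c for y in c) for c in clusters)
    d_min = min(np.linalg.norm(x - y)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for x in ci for y in cj)
    return d_min / diam

X = np.array([[1., 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
print(separation_index(X, np.array([0, 0, 0, 1, 1, 1])))  # large -> well separated
```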


3. Quality Measures and k

Can these two quality measures be used to determine the k in k-means? Why?

- Both measures can be used to determine k. However, we have to keep in mind that both of them correlate negatively with the value of k.
- The more you increase k, the smaller the measure gets, until k equals the number of training points.
- They can definitely be used to compare different runs with a similar k.

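A sketch of such a scan over k on synthetic data: the within-cluster variation (inertia) shrinks monotonically as k grows, whereas the average silhouette typically peaks near the true number of clusters. Dataset and parameters are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```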


4. Local maximum drawback

The Expectation Maximization algorithm has the drawback that it can converge to a local maximum of the likelihood:
a. Why is it a drawback?
b. Why is a local maximum bad?
c. Why can the local maximum happen?

- As the name Expectation Maximization indicates, our goal is to find a maximum of the clustering quality (the likelihood).
- Local: it is a drawback because the main goal is to find a global maximum, for which we can assure that all other points in the parameter space are smaller or equal. In a local maximum we can assure this only for an area around it, which can be very small.
- The local maximum can happen because EM is a greedy hill-climbing procedure: starting from an initial parameter guess, each iteration only increases the likelihood, so the algorithm converges to the optimum closest to its initialization, which need not be the global one.


Exercise 2
k-Means



1. Hands-on k-Means

- The following two-dimensional dataset is given:

x 3 6 8 1 2 2 6 6 7 7 8 8

y 5 2 3 5 4 6 1 8 3 6 1 7

- Apply the k-means clustering algorithm with k = 3 (3 Clusters).


- Use the first 3 data tuples as initial cluster centroids
- Track the movement of the cluster centers (e.g. by drawing their movement into the diagram).
- Determine the Voronoi regions of the final clustering and draw the final cluster
centers, all data points, and the Voronoi regions into a diagram.



1. Hands-on k-Means

Given k, the k-means algorithm proceeds in four steps:

1. Partition the objects into k non-empty subsets and calculate their centroids (i.e., the mean point of each cluster).
2. Assign each object to the cluster with the nearest centroid (Euclidean distance).
3. Recompute the centroids of the current partition as the cluster means: $\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
4. Go back to step 2; repeat until the updated centroids stop moving significantly.

- Note: each data point belongs to exactly one cluster: its membership is 1 for the cluster with the closest prototype, and 0 otherwise.
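As a cross-check of the hand calculation (and of the iteration figures that follow), here is a minimal sketch of these four steps on the given dataset, with the first three tuples as initial centroids; it converges to the centroids C1(2, 5), C2(7, 2), C3(7, 7) reported on the solution slide.

```python
import numpy as np

X = np.array([[3., 5], [6, 2], [8, 3], [1, 5], [2, 4], [2, 6],
              [6, 1], [6, 8], [7, 3], [7, 6], [8, 1], [8, 7]])
centroids = X[:3].copy()  # the first 3 data tuples as initial centroids

for it in range(100):
    # Step 2: assign each object to the nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute the centroids as the cluster means.
    new = np.array([X[labels == i].mean(axis=0) for i in range(3)])
    print(f"iteration {it}: centroids = {new.tolist()}")
    if np.allclose(new, centroids):  # Step 4: stop once the centroids stop moving
        break
    centroids = new
```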


1. Hands-on k-Means

x 3 6 8 1 2 2 6 6 7 7 8 8
y 5 2 3 5 4 6 1 8 3 6 1 7

- Use the first 3 data tuples, (3, 5), (6, 2) and (8, 3), as initial cluster centroids; k = 3.
- [Figures: the original slides plot the data points and the centroid positions on a 10 × 10 grid for iterations 0 through 5.]


1. Hands-on k-Means

- The k-means algorithm needs 3 iterations to reach a stable state, meaning that all cluster centroids have reached a stable position (or 4 iterations if you count the final pass in which the cluster assignment no longer changes).
- The end positions of the cluster centroids are C1(2, 5), C2(7, 2) and C3(7, 7).
- [Figure: the final cluster centers, all data points, and the Voronoi regions of the final clustering.]


2. k-Means Drawbacks

- The following two-dimensional dataset is given:

x 3 3 4 4 5 6 7 7 8 9 1 2 2 3 4 5 5 6 7 7

y 1 2 2 3 3 4 4 6 5 7 3 4 5 6 6 7 8 8 8 9

- Which problem arises when this dataset is clustered with the k-means algorithm
with k = 2 clusters?
- Describe the difference between the desired behaviour and the result of the k-
means algorithm.
- Which clustering algorithm could better fit the data and why?



2. k-Means Drawbacks

Which problem arises when this dataset is clustered with the k-means algorithm with k = 2 clusters?

- There are two ellipsoidal clusters in the dataset. The k-means algorithm with Euclidean distance is not able to detect these cluster shapes. [Figure: the k-means result with k = 2 on the dataset.]
- One option is to modify the distance measure by including a covariance matrix $\Sigma$: $d(x, \mu) = \sqrt{(x - \mu)^{T} \Sigma^{-1} (x - \mu)}$. This is called the Mahalanobis distance.

Which clustering algorithm could better fit the data and why?

- Another option is to use the EM clustering algorithm with a normal probability density distribution, which also includes a covariance matrix.
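A small sketch of the Mahalanobis distance mentioned above, assuming (purely for illustration) that the first ten tuples of the dataset form one of the elongated clusters:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Assumed, illustrative cluster: the first ten tuples of the dataset.
cluster = np.array([[3., 1], [3, 2], [4, 2], [4, 3], [5, 3],
                    [6, 4], [7, 4], [7, 6], [8, 5], [9, 7]])
mu = cluster.mean(axis=0)
VI = np.linalg.inv(np.cov(cluster.T))  # inverse covariance matrix

# Distances grow slowly along the cluster's main axis, quickly across it.
print(mahalanobis([9, 7], mu, VI))  # on the axis of elongation
print(mahalanobis([3, 7], mu, VI))  # off the axis, much larger
```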


Gaussian Mixture Model – EM Clustering

- Assumption: Data were generated by sampling a set of normal


distributions. (The probability density is a mixture of normal
distributions.)

- Aim: Find the parameters for the normal distributions and how much
each normal distribution contributes to the data.

- Algorithm: EM clustering (expectation maximisation).

- An alternating scheme that estimates the parameters of the normal distributions and the likelihoods that each data point was generated by each of the normal distributions.

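A minimal sketch of EM clustering on the drawback dataset, using scikit-learn's GaussianMixture: with full covariance matrices a two-component mixture can capture the ellipsoidal shapes, and several restarts (n_init) mitigate the local-maximum problem from Exercise 1.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[3., 1], [3, 2], [4, 2], [4, 3], [5, 3], [6, 4], [7, 4], [7, 6],
              [8, 5], [9, 7], [1, 3], [2, 4], [2, 5], [3, 6], [4, 6], [5, 7],
              [5, 8], [6, 8], [7, 8], [7, 9]])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=10, random_state=0)
labels = gmm.fit_predict(X)           # EM alternates E- and M-steps internally
print(labels)                         # hard assignment per point
print(gmm.predict_proba(X).round(2))  # soft, likelihood-based memberships
```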


Exercise 3
Agglomerative Hierarchical
Clustering



Hierarchical Clustering

- The following one-dimensional dataset is given:

x 2 12 16 25 29 45

- Cluster this dataset with agglomerative clustering using:


a) the centroid method
b) the single-linkage method
c) the complete-linkage method

- Draw a dendrogram for each of the methods (a)-(c).



Hierarchical Clustering: Centroid Method

x 2 12 16 25 29 45

The cluster distance is the distance between the cluster centroids; after each merge a new centroid is computed.

- Merge {12} and {16} at distance 4; new centroid 14.
- Merge {25} and {29} at distance 4; new centroid 27.
- Merge {2} with {12, 16} at distance 12; new centroid 10.
- Merge {2, 12, 16} with {25, 29} at distance 17; new centroid 16.8.
- Merge {2, 12, 16, 25, 29} with {45} at distance 28.2.

[Figure: dendrogram with merge heights 4, 4, 12, 17, 28.2.]


Hierarchical Clustering: Single Linkage

x 2 12 16 25 29 45

The cluster distance is the smallest distance between any pair of points from the two clusters; there is no cluster representative.

- Merge {12} and {16} at distance 4, and {25} and {29} at distance 4.
- Merge {12, 16} with {25, 29} at distance 9 (= 25 - 16).
- Merge {2} with {12, 16, 25, 29} at distance 10 (= 12 - 2).
- Merge {2, 12, 16, 25, 29} with {45} at distance 16 (= 45 - 29).

[Figure: dendrogram with merge heights 4, 4, 9, 10, 16.]


Hierarchical Clustering: Complete Linkage

x 2 12 16 25 29 45

The cluster distance is the largest distance between any pair of points from the two clusters; there is no cluster representative.

- Merge {12} and {16} at distance 4, and {25} and {29} at distance 4.
- Merge {2} with {12, 16} at distance 14 (= 16 - 2).
- Merge {25, 29} with {45} at distance 20 (= 45 - 25).
- Merge {2, 12, 16} with {25, 29, 45} at distance 43 (= 45 - 2).

[Figure: dendrogram with merge heights 4, 4, 14, 20, 43.]
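The three dendrograms can be reproduced with SciPy; the merge heights above appear in the third column of each linkage matrix. A minimal sketch:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

x = np.array([[2.], [12], [16], [25], [29], [45]])  # one observation per row

for method in ("centroid", "single", "complete"):
    Z = linkage(x, method=method)
    print(method, Z[:, 2])  # merge distances, e.g. 4, 4, 12, 17, 28.2 for centroid
    dendrogram(Z, labels=["2", "12", "16", "25", "29", "45"])
    plt.title(method)
    plt.show()
```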


Exercise 4
DBSCAN



1. Hands-on DBSCAN

- The following two-dimensional dataset is given:

x 3 3 4 4 5 6 7 7 8 9 1 2 2 3 4 5 5 6 7 7

y 1 2 2 3 3 4 4 6 5 7 3 4 5 6 6 7 8 8 8 9

- Use the DBSCAN algorithm to cluster the data:
  a. with ε = 1.2 and MinPoints = 2
  b. with ε = 1.5 and MinPoints = 5
- Draw the clusters in a diagram.
- Determine the core, directly density-reachable and density-connected objects.



1. Hands-on DBSCAN

Use the DBSCAN algorithm to cluster the data with ε = 1.2 and MinPoints = 2.

x 3 3 4 4 5 6 7 7 8 9 1 2 2 3 4 5 5 6 7 7
y 1 2 2 3 3 4 4 6 5 7 3 4 5 6 6 7 8 8 8 9

Let's draw and name the data points. [Figure: scatter plot of the 20 data points, labeled A through T.]


Clustering: DBSCAN

DBSCAN, a density-based clustering algorithm, defines five types of points in a dataset:

- Core points have at least a minimum number of neighbors (MinPts) within a specified distance (ε).
- Border points are within ε of a core point, but have fewer than MinPts neighbors.
- Noise points are neither core points nor border points.
- Directly density-reachable points are within ε of a core point.
- Density-reachable points are reachable through a chain of directly density-reachable points.

Clusters are built by joining core and density-reachable points to one another.
1. Hands-on DBSCAN

Use the DBSCAN algorithm to cluster the data with ε = 1.2 and MinPoints = 2.

Let's look for core points: we center a circle of radius ε = 1.2 on each point and check whether MinPts = 2 is satisfied. [Figure: ε-circles around the data points.]


1. Hands-on DBSCAN

Use the DBSCAN algorithm to cluster the data with ε = 1.2 and MinPoints = 2.

- All points in a red circle are core points: after placing the circle on each of these points, you always get at least one other point in it (MinPoints = 2 satisfied).
- All points in a grey circle are noise points: there is no other point in the circle.
- [Figure: all core points marked in red, noise points in grey.]


1. Hands-on DBSCAN

Use the DBSCAN algorithm to cluster the data with ε = 1.2 and MinPoints = 2.

Final solution: start from any point until you find a core point, then from there reach all density-reachable points to form the cluster. [Figure: the resulting clusters for ε = 1.2 and MinPoints = 2.]


1. Hands-on DBSCAN

Use the DBSCAN algorithm to cluster the data with ε = 1.5 and MinPoints = 5.

A larger radius, but more points required: the only core points are C and R. [Figure: ε-circles for ε = 1.5.]


1. Hands-on DBSCAN

Use the DBSCAN algorithm to cluster the data with ε = 1.5 and MinPoints = 5.

Final solution:
- From point C the directly density-reachable points are A, B, D and E.
- From point R the directly density-reachable points are P, Q, S and T.
- All points within a cluster are density-connected.
- [Figure: the two resulting clusters.]
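Both settings can be checked with scikit-learn's DBSCAN; note that its min_samples parameter counts the point itself, which matches the convention used here (a point with one neighbor satisfies MinPoints = 2). A minimal sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[3., 1], [3, 2], [4, 2], [4, 3], [5, 3], [6, 4], [7, 4], [7, 6],
              [8, 5], [9, 7], [1, 3], [2, 4], [2, 5], [3, 6], [4, 6], [5, 7],
              [5, 8], [6, 8], [7, 8], [7, 9]])

for eps, min_pts in ((1.2, 2), (1.5, 5)):
    db = DBSCAN(eps=eps, min_samples=min_pts).fit(X)
    print(f"eps={eps}, MinPoints={min_pts}")
    print("  labels:", db.labels_)                      # -1 marks noise points
    print("  core point indices:", db.core_sample_indices_)
```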


2. DBSCAN for outlier detection

- Eight two-dimensional data points A, B, C, D, E, F, G, H are given (see the figures below).
- Find the clusters using the DBSCAN algorithm with Euclidean distance and MinPoints = 2, using ε = 1.5, ε = 2.1, and ε = 10.
- Which objects are not covered by each clustering solution?



2. DBSCAN for outlier detection

- ε = 1.5 and MinPoints = 2
- Outliers: A, B, E, G
- Clusters: {C, D}, {F, H}
- [Figure: ε = 1.5 circles; the outliers have no neighbor within ε.]


2. DBSCAN for outlier detection

- ε = 2.1 and MinPoints = 2
- Outliers: B
- Clusters: {A, C, D}, {E, F, G, H}
- [Figure: ε = 2.1 circles; only B has no neighbor within ε.]


2. DBSCAN for outlier detection

- ε = 10 and MinPoints = 2
- Outliers: none
- Cluster: {A, B, C, D, E, F, G, H}
- [Figure: with ε = 10 all points fall into a single cluster.]


Exercise 5
Practice with KNIME



1. k-Means Clustering

Use the k-Means algorithm to cluster location data.

1. Read the dataset location_data.table


2. Filter to entries from California (region_code = CA)
3. Train a k-means model with k=3. Use only position data for clustering
(latitude and longitude)
4. Calculate the Silhouette Coefficients using the Silhouette Coefficient node
5. Plot latitude and longitude in a view (OSM Map or Scatter Plot) and
use that to help you visually optimize k

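For reference, a Python sketch of the same workflow, assuming the data have been exported from KNIME to CSV with columns region_code, latitude and longitude (the .table format itself is KNIME-internal):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv("location_data.csv")          # assumed CSV export of the .table file
ca = df[df["region_code"] == "CA"]             # filter to entries from California

X = ca[["latitude", "longitude"]].to_numpy()   # use only the position data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))
```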


1. k-Means Clustering

Use the k-Means algorithm to cluster location data. [Figure: KNIME workflow implementing the steps above.]


1. k-Means Clustering

Plot latitude and longitude in a view (OSM Map or Scatter Plot) and use that to help you visually optimize k.

- k = 3: avg. silhouette coeff. 0.600
- k = 6: avg. silhouette coeff. 0.571
- k = 10: avg. silhouette coeff. 0.496

[Figures: maps of the clustered locations for k = 3, 6 and 10.]


Thank you
For any questions please contact: [email protected]

Guide to Intelligent Data Science, Second Edition, 2020
