Lab 08 Solutions
Exercises on Clustering
1. Use the k-means algorithm and Euclidean distance to cluster the following 8 examples into 3 clusters:
A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9). Suppose that the
initial seeds (centers of each cluster) are A1, A4 and A7. Run the k-means algorithm for 1 epoch. At the
end of this epoch show:
a. The new clusters (i.e. the examples belonging to each cluster);
b. The centers of the new clusters;
c. Draw a 10 by 10 space with all the 8 points and show the clusters after the first epoch and the
new centroids.
d. How many more iterations are needed to converge? Draw the result for each epoch.
Solution
The Euclidean distances between the given points are collected in an 8-by-8 distance matrix; each point is then assigned to the closest of the three seeds A1, A4 and A7.
a. After the first epoch the clusters are: cluster 1 = {A1}, cluster 2 = {A3, A4, A5, A6, A8}, cluster 3 = {A2, A7}.
b. The new centers are the means of the clusters: (2, 10), (6, 6) and (1.5, 3.5).
c. (Plot of the 10 by 10 space with the 8 points, the clusters after the first epoch and the new centroids.)
d. Two more epochs change the assignments (A8 moves to cluster 1 in the second epoch, A4 moves to cluster 1 in the third), after which the clusters {A1, A4, A8}, {A3, A5, A6}, {A2, A7} no longer change and the algorithm has converged.
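For reference, here is a minimal NumPy sketch (not part of the original lab material; the variable names are my own) that prints the pairwise Euclidean distance matrix and runs one k-means epoch from the seeds A1, A4, A7:

import numpy as np

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
names = list(points)
X = np.array(list(points.values()), dtype=float)

# Pairwise Euclidean distance matrix between the 8 examples.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.round(D, 2))

# One k-means epoch: assign each point to the nearest seed, then recompute centers.
centers = np.array([points["A1"], points["A4"], points["A7"]], dtype=float)
assign = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1), axis=1)
for k in range(3):
    members = [names[i] for i in range(len(names)) if assign[i] == k]
    centers[k] = X[assign == k].mean(axis=0)
    print("cluster", k + 1, members, "new center", centers[k])

Running the sketch reproduces the clusters and centers listed in parts a and b.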
2. Use single and complete link agglomerative clustering to group the data described by the following
distance matrix. Show the dendrograms.
    A   B   C   D
A   0   1   4   5
B   1   0   2   6
C   4   2   0   3
D   5   6   3   0
Solution
1. Single link: distance between two clusters is the shortest distance between a pair of elements from
the two clusters.
At the beginning, each point A, B, C, and D is a cluster → c1 = {A}, c2 = {B}, c3 = {C}, c4 = {D}
Iteration 1
The shortest distance is d(c1,c2) = 1 → c1 and c2 are merged → the clusters are c3 = {C}, c4 = {D}, c5 = {A,B}
The distances from the new cluster to the others are: d(c5,c3) = 2, d(c5,c4) = 5
Iteration 2
The shortest distance is d(c5,c3) = 2 → c5 and c3 are merged → the clusters are c6 = {A,B,C}, c4 = {D}
The distances from the new cluster to the others are: d(c6,c4)=3
Iteration 3
c6 and c4 are merged → the final cluster is c7 = {A,B,C,D}
The dendrogram merges A and B at height 1, adds C at height 2, and adds D at height 3.
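A short sketch, assuming SciPy is available (this is not the lecture's algorithm listing), that reproduces the single-link merge order for this distance matrix:

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Full symmetric distance matrix for A, B, C, D.
D = np.array([[0, 1, 4, 5],
              [1, 0, 2, 6],
              [4, 2, 0, 3],
              [5, 6, 3, 0]], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z)  # each row: merged cluster ids, merge distance, new cluster size

The merge distances come out as 1, 2 and 3, matching the three iterations above.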
2. Complete link: the distance between two clusters is the distance between the two furthest data points in the two clusters.
We apply the algorithm presented in lecture 10 (ml_2012_lecture_10.pdf) page 4.
At the beginning, each point A, B, C, and D is a cluster → c1 = {A}, c2 = {B}, c3 = {C}, c4 = {D}
Iteration 1
The shortest distance is d(c1,c2) = 1 → c1 and c2 are merged → the clusters are c3 = {C}, c4 = {D}, c5 = {A,B}
The distances from the new cluster to the others are: d(c5,c3) = 4, d(c5,c4)=6
Iteration 2
The shortest distance is d(c3,c4) = 3 → c3 and c4 are merged → the clusters are c6 = {C,D}, c5 = {A,B}
The distances from the new cluster to the others are: d(c6,c5)=6
Iteration 3
c6 and c5 are merged → the final cluster is c7 = {A,B,C,D}
The dendrogram merges A and B at height 1, C and D at height 3, and the two resulting clusters at height 6.
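To make the "furthest pair" rule explicit, here is a rough hand-rolled sketch (my own helper code, not the lecture's listing) of complete-link agglomeration on the same matrix:

D = {("A", "B"): 1, ("A", "C"): 4, ("A", "D"): 5,
     ("B", "C"): 2, ("B", "D"): 6, ("C", "D"): 3}

def point_dist(p, q):
    return D.get((p, q)) or D.get((q, p), 0)

def complete_link(c1, c2):
    # Distance between clusters = distance of the two furthest points.
    return max(point_dist(p, q) for p in c1 for q in c2)

clusters = [{"A"}, {"B"}, {"C"}, {"D"}]
while len(clusters) > 1:
    # Pick the pair of clusters with the smallest complete-link distance.
    i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: complete_link(clusters[ij[0]], clusters[ij[1]]))
    d = complete_link(clusters[i], clusters[j])
    print("merge", sorted(clusters[i]), "+", sorted(clusters[j]), "at distance", d)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] | clusters[j]]

It prints merges at distances 1, 3 and 6, matching the iterations above.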
3. Use single-link, complete-link, average-link, and centroid agglomerative clustering to cluster the
following 8 examples: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2),
A8=(4,9). Show the dendrograms.
Solution
The solutions for single-link and complete-link are analogous to the previous exercise. The solutions for average-link and centroid are also similar; what changes is how the distance between clusters is computed (see the sketch after the list).
• For average link the distance is the average of all the distances between points belonging to the two
clusters. For instance, if c1 = {A,B} and c2 = {C,D},
dist(c1, c2) = (dist(A,C) + dist(A,D) + dist(B,C) + dist(B,D)) / 4
• For centroid the distance between two clusters is the distance between their centroids.
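As a small illustration (using the first four points of Exercise 3 as two hypothetical clusters; this pairing is mine, not the exercise's), the two distances can be computed as follows:

import numpy as np

c1 = np.array([[2.0, 10.0], [2.0, 5.0]])   # A1, A2
c2 = np.array([[8.0, 4.0], [5.0, 8.0]])    # A3, A4

# Average link: mean of all pairwise distances between the two clusters.
pairwise = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)
avg_link = pairwise.mean()

# Centroid distance: distance between the two cluster centroids.
centroid_dist = np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))

print(round(avg_link, 2), round(centroid_dist, 2))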
4. Consider a data set in two dimensions with five data points at: {(1, 0), (−1, 0), (0, 1), (3, 0), (3, 1)}. Run
two iterations of k-means by hand with initial points at (−1, 0) and (3, 1). What are the assignments at
each iteration and what are the centroids? Has the algorithm converged?
Solution
The solution follows the same assign/update steps as Exercise 1. After the first iteration the clusters are {(1, 0), (−1, 0), (0, 1)} and {(3, 0), (3, 1)}, with centroids (0, 1/3) and (3, 1/2). The second iteration leaves the assignments unchanged, so the centroids stay the same and the algorithm has converged.
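A minimal sketch of the two iterations (same NumPy style as the Exercise 1 snippet; not from the original handout):

import numpy as np

X = np.array([(1, 0), (-1, 0), (0, 1), (3, 0), (3, 1)], dtype=float)
centers = np.array([(-1, 0), (3, 1)], dtype=float)

for it in range(2):
    # Assignment step: each point goes to its nearest centroid.
    assign = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1), axis=1)
    # Update step: recompute each centroid as the mean of its assigned points.
    centers = np.array([X[assign == k].mean(axis=0) for k in range(2)])
    print("iteration", it + 1, "assignments", assign, "centers", centers)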
5. How can we make k-means robust to outliers? Explain the two methods we have seen.
Solution
Refer to lecture 9 (ml_2012_lecture_09.pdf), pages 15-16.
6. Explain the main similarities and differences between k-means and hierarchical clustering.
Solution
Refer to lecture 9 (ml_2012_lecture_09.pdf) and lecture 10 (ml_2012_lecture_10.pdf).
9. Is the result of k-means clustering sensitive to the choice of the initial seeds? How? Give an example.
Solution
Refer to lecture 9 (ml_2012_lecture_09.pdf), page 17.
10. Which is a good algorithm for finding clusters of arbitrary shape? Is finding these clusters always a good
idea? When is it not?
Solution
Refer to lecture 9 (ml_2012_lecture_09.pdf), page 21 and to lecture 10 (ml_2012_lecture_10.pdf), page 5.
12. Explain the single-link and the complete-link methods for hierarchical clustering.
Solution
Refer to lecture 10 (ml_2012_lecture_10.pdf), pages 5-6.
13. Give two examples of distance functions that can be used for numeric attributes.
Solution
Refer to lecture 10 (ml_2012_lecture_10.pdf), pages 8-9.
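For reference (the lecture slides may list different examples), two standard distance functions for numeric attributes are the Euclidean and the Manhattan distance; a small sketch:

import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((2, 10), (5, 8)), manhattan((2, 10), (5, 8)))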