MS6711 Data Mining
Homework 1
Semester B, 2021-2022
Due: Feb. 8, 2PM
Instructions
This homework contains both coding and non-coding questions. Please submit two files:
1. One Word or PDF document with the answers and plots for ALL questions, without coding details.
2. One file containing your code for all coding questions.
Problem 1: Implement K-Means Manually [8 pts]
This question checks your understanding of the K-means algorithm; you are asked to perform each step of K-means manually. It is not necessary to use the KMeans function in python, but you may use python or a simple calculator to help you calculate distances.
Given a matrix X of data points, with 10 data points and 2 variables, you are asked to perform K-means clustering step by step. Here we use the L2 distance $d = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$. The figure below plots all data observations as blue triangles, while the three initialized centers are c1 = (6.2, 3.2), c2 = (6.5, 3.0), and c3 = (6.6, 3.7).
X =
    [ 5.9  3.2 ]
    [ 4.6  2.9 ]
    [ 6.2  2.8 ]
    [ 4.7  3.2 ]
    [ 5.0  3.0 ]
    [ 5.5  4.2 ]
    [ 4.9  3.1 ]
    [ 6.7  3.1 ]
    [ 6.0  3.0 ]
    [ 5.1  3.8 ]

[Figure 1: the 10 data observations plotted as blue triangles, with the three initial centers c1 = (6.2, 3.2), c2 = (6.5, 3.0), and c3 = (6.6, 3.7).]
3. Which data observations will be assigned to the first cluster (red) in the second iteration? Answer with the row indices of the data in matrix X.
Hint: you may write a python function to help you calculate distances, for example

import numpy as np

def L2dis(x1, y1, x2, y2):
    # L2 (Euclidean) distance between the points (x1, y1) and (x2, y2)
    return np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# calculate the L2 distance between the first two observations
L2dis(5.9, 3.2, 4.6, 2.9)
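Beyond single distances, the full assignment and update steps can also be checked with a short numpy sketch; the rows of X below are the ten observations from the matrix above, and two iterations are enough for question 3:

import numpy as np

X = np.array([[5.9, 3.2], [4.6, 2.9], [6.2, 2.8], [4.7, 3.2], [5.0, 3.0],
              [5.5, 4.2], [4.9, 3.1], [6.7, 3.1], [6.0, 3.0], [5.1, 3.8]])
centers = np.array([[6.2, 3.2], [6.5, 3.0], [6.6, 3.7]])

for it in range(2):
    # assignment step: distance from every point to every center, shape (10, 3)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)          # nearest center for each point
    # update step: move each center to the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])

print(np.where(labels == 0)[0] + 1)       # 1-based row indices in the first cluster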
Problem 2 [20 points]
The figures below show clustering results for 6 different datasets, denoted A, B, C, D, E, F. Each dataset is clustered using two different methods, one of which is K-means. Please determine which result is more likely to come from K-means. Note that cluster centers are denoted by a black cross (X) in the figures. All distances are L2 here.
[Figure 2: Clustered results for the 6 datasets. Each dataset is shown in two panels (A1/A2, B1/B2, C1/C2, D1/D2, E1/E2, F1/F2); cluster centers are marked with a black X.]

Please answer whether, e.g., A1 or A2 is more likely to be generated by the K-means method, and likewise for the other datasets. (Hint: check the state when K-means converges; the centers for each cluster have been noted as X. Since the x and y axes are scaled proportionally, you can determine the distances to the centers geometrically.) The distance measure used here is the Euclidean distance.

1. Dataset A (write A1 or A2; [1 pt], same for the following questions)
2. Dataset B
3. Dataset C
4. Dataset D
5. Dataset E
6. Dataset F
Problem 3 [20 points]
This problem checks your understanding of the hierarchical clustering. Same as Problem 1, not necessary
to perform clustering in python but you may use it to help calculate distances.
The figure below plots data of two clusters, denoted by red triangular (cluster A) and blue diamond
(cluster B) respectively, with coordinate labeled.
[Figure 3: scatter plot of clusters A (red triangles) and B (blue diamonds) with coordinates labeled; visible labels include (4.6, 2.9) and (6.2, 2.8).]
2. What is the distance between the two closest members of the two clusters? (single link) [2 pt] (See the hint after question 4.)
4. Among all three distances above, which one is robust to noise? Answer either “complete”, “single”,
or “average”. [1 pt]
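Hint: all three linkage distances can be read off the matrix of pairwise distances between the two clusters. Below is a minimal scipy sketch; the coordinate arrays are hypothetical placeholders (only (4.6, 2.9) and (6.2, 2.8) are labeled above), so substitute the actual points from the figure:

import numpy as np
from scipy.spatial.distance import cdist

# hypothetical coordinates -- replace with the actual points read off the figure
A = np.array([[4.6, 2.9], [5.0, 3.2]])   # cluster A (red triangles)
B = np.array([[6.2, 2.8], [6.8, 3.1]])   # cluster B (blue diamonds)

D = cdist(A, B)          # all pairwise L2 distances between the two clusters
print(D.min())           # single link: distance between the two closest members
print(D.max())           # complete link: distance between the two farthest members
print(D.mean())          # average link: mean of all pairwise distances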
2. If we project all points onto the 1-D subspace spanned by the second principal component, what is the variance of the projected data? [2 pt]
3. For a given dataset X, the eigenvalues of its covariance matrix C are {2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001}. If we want to explain more than 90% of the total variance using the first k principal components, what is the least value of k? [3 pt] (See the hint below.)
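Hint: for question 3, the smallest k follows from the cumulative sum of the eigenvalues; a minimal numpy sketch of that computation:

import numpy as np

# eigenvalues of the covariance matrix C, in decreasing order
eigvals = np.array([2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001])

# cumulative fraction of total variance explained by the first k components
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.argmax(explained > 0.9)) + 1   # smallest k whose cumulative share exceeds 90%
print(np.round(explained, 3), k)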
Problem 4 [20 points]
This problem considers hierarchical clustering with single or complete linkage, and confirms that the results are invariant under monotone transformations of the distances.
Load the data hw1p4.csv; you will have a matrix X, a 40 × 2 matrix of 40 observations and 2 variables.
1. Calculate the distance matrix d, which should be a 40 × 40 matrix, where d_ij denotes the L2 distance between X_i and X_j.
2. Run hierarchical clustering with single linkage, using the distance matrix directly rather than the X matrix. Cut the tree at K = 4. Plot the points X with different colors denoting the clusters. Also show the dendrogram.
4. Apply a monotone transformation: use d² as the distance and repeat (1) and (2). Did the clustering results change? Did the dendrogram change?
5. Run hierarchical clustering with average linkage using both distances d and d². Cut both trees at K = 4. Are the clustering results the same? How about K = 3?
Hint: In python, you may use the following to force HC to cut at K = 3 and to fit hierarchical clustering using the distance matrix. For more details (drawing the dendrogram, etc.), check pages 184-188 of the reference book Introduction to Machine Learning with Python.

from sklearn.cluster import AgglomerativeClustering

# n_clusters: cut at K = 3
# affinity: set it to 'precomputed' to use the distance matrix instead of raw features
# linkage: 'single', 'complete', or 'average'
HC = AgglomerativeClustering(n_clusters=3,
                             affinity='precomputed', linkage='single')
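sklearn's AgglomerativeClustering does not draw dendrograms by itself, so one possible end-to-end route for parts (1) and (2) is to use scipy for both the distance matrix and the dendrogram. This is only a sketch, assuming hw1p4.csv loads into the 40 × 2 array described above:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = pd.read_csv('hw1p4.csv').to_numpy()          # assumed to yield the 40 x 2 matrix
d = squareform(pdist(X))                         # 40 x 40 L2 distance matrix (part 1)

Z = linkage(pdist(X), method='single')           # single-linkage tree (part 2)
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree at K = 4

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels)          # points colored by cluster
plt.figure()
dendrogram(Z)                                    # the corresponding dendrogram
plt.show()

# for the monotone transformation in part (4), pass pdist(X)**2 to linkage instead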
Problem 5 [20 points]
This problem considers K-means clustering to diagnose breast cancer based solely on Fine Needle Aspiration (FNA), which takes a small tissue sample from the tumor and analyzes it. The data contains 30 characteristics (features) of the tissue sample, such as size, shape, and texture.
Load the data breast_data.csv; you should have a matrix with 30 characteristics and 569 samples. Load breast_truth.csv, which contains a list of 569 values, either 0 or 1. It denotes the diagnosis result for the corresponding features in breast_data.csv, where 0 denotes benign and 1 denotes malignant.
1. Run K-means clustering on the data with K = 2. Check the percentage of malignant tumors in each cluster, and report the percentage of malignant / benign samples in each cluster. Did your K-means model help distinguish malignant from benign? Ideally we want to see that one cluster has a high percentage of benign samples while the other has a high percentage of malignant samples.
2. Repeat your algorithm five times with different initializations (you may pass init='random' when you call sklearn.cluster.KMeans to initialize randomly). Report the results as in part 1; did they change much from run to run? (A sketch of this workflow is given below.)
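A minimal sketch of both parts, assuming breast_data.csv and breast_truth.csv load into the shapes described above (the data matrix is transposed if needed so that rows are samples):

import pandas as pd
from sklearn.cluster import KMeans

X = pd.read_csv('breast_data.csv').to_numpy()           # assumed 30 features x 569 samples
y = pd.read_csv('breast_truth.csv').to_numpy().ravel()  # 0 = benign, 1 = malignant
if X.shape[0] == 30:
    X = X.T                                             # KMeans expects rows = samples

for run in range(5):
    km = KMeans(n_clusters=2, init='random', n_init=1, random_state=run)
    labels = km.fit_predict(X)
    for k in (0, 1):
        frac = y[labels == k].mean()                    # fraction malignant in cluster k
        print(f"run {run}, cluster {k}: {frac:.1%} malignant, {1 - frac:.1%} benign")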