
MS6711 Data Mining

Homework 1

Semester B, 2021-2022
Due: Feb. 8, 2PM

Instructions

This homework contains both coding and non-coding questions. Please submit two files:

1. One Word or PDF document with the answers and plots for ALL questions, without coding details.

2. One Jupyter notebook with your code.

Problem 1 [20 points]

This question checks your understanding of the K-means algorithm; you are asked to perform each step
of K-means manually. It is not necessary to use the KMeans function in Python, but you may use Python
or a simple calculator to help you calculate distances.
Given a matrix X of data points, with 10 data points and 2 variables, you are asked to perform
K-means clustering step by step. Here we use the L2 distance d = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}.
The figure below plots all data observations as blue triangles, while the three initialized centers
are c1 = (6.2, 3.2), c2 = (6.5, 3.0), and c3 = (6.6, 3.7).
  4.4

5.9 3.2
 
  4.2
4.6 2.9 (5.5, 4.2)
 
  4
6.2 2.8
 
  3.8
4.7 3.2 (5.1, 3.8)
 
  (6.6, 3.7)
5.0 3.0
3.6
 
X= 
5.5 4.2 3.4
 
 
4.9 3.1 3.2
  (4.7, 3.2) (5.9, 3.2) (6.2, 3.2)
 
  (4.9, 3.1) (6.7, 3.1)
6.7 3.1 3
  (5.0, 3.0) (6.0, 3.0) (6.5, 3.0)
  (4.6, 2.9)
5.1 3.8 2.8
  (6.2, 2.8)

6.0 3.0 2.6


4.5 5 5.5 6 6.5

1. Which data observations will be assigned to the second cluster (blue) in the first iteration?
Answer by the row index of the data in matrix X.
2. Suppose you have assigned the data to the cluster centers. Next, where is the center of the first
cluster (red) after one iteration? (Answer in the format [x1, x2] indicating the location, rounding
your results to three decimals.)

3. Which data observations will be assigned to the first cluster (red) in the second iteration? Answer
by the row index of data in matrix X.

Hint: you may write a Python function to help you calculate distances, for example:

import numpy as np

def L2dis(x1, y1, x2, y2):
    return np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# calculate the L2 distance between the first two observations
L2dis(5.9, 3.2, 4.6, 2.9)
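
Building on this hint, here is a minimal sketch of one full assignment-and-update step, using the matrix X and the three given centers. The variable names and the vectorized form are illustrative, not part of the assignment:

import numpy as np

X = np.array([[5.9, 3.2], [4.6, 2.9], [6.2, 2.8], [4.7, 3.2], [5.0, 3.0],
              [5.5, 4.2], [4.9, 3.1], [6.7, 3.1], [5.1, 3.8], [6.0, 3.0]])
centers = np.array([[6.2, 3.2], [6.5, 3.0], [6.6, 3.7]])

# Assignment step: L2 distance from every point to every center,
# then assign each point to its nearest center.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (10, 3)
labels = np.argmin(dists, axis=1)                                    # values 0, 1, or 2

# Update step: move each center to the mean of its assigned points
# (this sketch assumes every cluster received at least one point).
new_centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])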

Problem 2 [20 points]

The figures below show the clustering results of 6 different datasets, denoted A, B, C, D, E, F. Each
dataset is clustered using two different methods, one of which is K-means. Please determine which
result is more likely to come from K-means. Note that cluster centers are denoted by a black cross X
in the figure. All distances are L2 here.

[Figure 2: Clustered results for the 6 datasets, shown in pairs of panels A1/A2, B1/B2, C1/C2,
D1/D2, E1/E2, F1/F2; cluster centers are marked with a black X.]

For each dataset below, answer which panel is more likely to be the result of K-means (e.g., answer
A1 or A2 for Dataset A). (Hint: check the state when K-means converges; the centers of each cluster
are marked with X; since the x and y axes are scaled proportionally, you can determine the distances
to the centers geometrically.)

1. Dataset A

2. Dataset B

3. Dataset C

4. Dataset D

5. Dataset E

6. Dataset F

Problem 3 [20 points]

This problem checks your understanding of the hierarchical clustering. Same as Problem 1, not necessary
to perform clustering in python but you may use it to help calculate distances.
The figure below plots data of two clusters, denoted by red triangular (cluster A) and blue diamond
(cluster B) respectively, with coordinate labeled.
[Figure 3: Scatter plot of the samples in the two clusters. The labeled points fall into a left
group, (4.7, 3.2), (4.9, 3.1), (5.0, 3.0), (4.6, 2.9), and a right group, (5.9, 3.2), (6.7, 3.1),
(6.0, 3.0), (6.2, 2.8).]


1. What is the linkage between clusters A and B if you use complete linkage, i.e., the distance
between the two farthest members? (Round to four decimal places, here and in the next two questions.)

2. What is the linkage between clusters A and B if you use single linkage, i.e., the distance between
the two closest members?

3. What is the average linkage, i.e., the average distance over all pairs of members, one from each
cluster?

4. Among all three linkages above, which one is robust to noise? Answer either "complete", "single",
or "average".
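
As in Problem 1, you may let Python compute the pairwise distances. Below is a minimal sketch using scipy; the split of the eight labeled points into clusters A and B is read off Figure 3 (left group vs. right group), so treat that membership as an assumption:

import numpy as np
from scipy.spatial.distance import cdist

# Assumed membership, read off Figure 3: left group vs. right group.
A = np.array([[4.7, 3.2], [4.9, 3.1], [5.0, 3.0], [4.6, 2.9]])
B = np.array([[5.9, 3.2], [6.7, 3.1], [6.0, 3.0], [6.2, 2.8]])

D = cdist(A, B)  # 4 x 4 matrix of all pairwise L2 distances

print(D.max())   # complete linkage: distance between the two farthest members
print(D.min())   # single linkage: distance between the two closest members
print(D.mean())  # average linkage: mean distance over all pairs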

1.4 PCA [7 pts]

Consider 4 data points in the 2-d space: (-1, -1), (0.5, 0.5), (1, 1), (-0.5, -0.5).

1. What is the first principal component? (Answer in the format [a, b], rounded to 4 decimal places;
use positive values in the case of roots.) [2 pt]

2. If we project all points onto the 1-d subspace given by the second principal component, what is
the variance of the projected data? [2 pt]

3. For a given dataset X, all the eigenvalues of its covariance matrix C are
{2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001}. If we want to explain more than 90% of the total
variance using the first k principal components, what is the least value of k? [3 pt]
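
To check a hand calculation, here is a minimal numpy sketch; it assumes the reconstruction of the data points above (i.e., that the signs dropped in the extracted text were minus signs):

import numpy as np

# Assumed points: (-1, -1), (0.5, 0.5), (1, 1), (-0.5, -0.5), which lie on y = x.
X = np.array([[-1.0, -1.0], [0.5, 0.5], [1.0, 1.0], [-0.5, -0.5]])

C = np.cov(X, rowvar=False)           # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues returned in ascending order

pc1 = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue
print(np.round(pc1, 4), eigvals)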
Problem 4 [20 points]

This problem considers hierarchical clustering with single or complete linkage and confirms that the
results are invariant under monotone transformations of the distance.
Load the data hw1p4.csv; you will have a matrix x of size 40 × 2: 40 observations and 2 variables.

1. Calculate the distance matrix d, which should be a 40 × 40 matrix whose entry d_ij denotes the L2
distance between observations x_i and x_j.

2. Run hierarchical clustering with single linkage, using the distance matrix directly rather than
the matrix x. Cut the tree at K = 4. Plot the points x with different colors denoting the clusters.
Also show the dendrogram.

3. Repeat (2) with complete linkage.

4. Take a monotone transformation: use d² as the distance and repeat (2) and (3). Did the clustering
results change? Did the dendrogram change?

5. Run hierarchical clustering with average linkage using both distances d and d². Cut both trees at
K = 4. Are the clustering results the same? How about at K = 3?

Hint: in Python, you may use the following code to cut the hierarchical clustering at K = 3 and fit
it using the distance matrix. For more details (drawing the dendrogram, etc.), check pages 184-188 of
the reference book Introduction to Machine Learning with Python.

from sklearn.cluster import AgglomerativeClustering

# suppose you already have a distance matrix (numpy array) D
# calculate your own matrix D!!!

# n_clusters: cut at K = 3
# affinity: set it to 'precomputed' so the distance matrix is used instead of X
# linkage: 'single', 'complete', or 'average'
HC = AgglomerativeClustering(n_clusters=3,
                             affinity='precomputed', linkage='single')

# WARNING: here the input is the distance matrix D
label = HC.fit_predict(D)
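
To build the distance matrix and draw the dendrogram, one possible route is scipy. This is a sketch; it assumes hw1p4.csv is comma-separated with no header row:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

x = np.loadtxt('hw1p4.csv', delimiter=',')  # assumed: 40 x 2, no header row
D = squareform(pdist(x))                    # 40 x 40 matrix of L2 distances

# scipy's linkage expects the condensed (1-d) distance vector, not D itself.
Z = linkage(pdist(x), method='single')
dendrogram(Z)
plt.show()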

Problem 5 [20 points]

This problem considers K-means clustering to diagnose breast cancer based solely on Fine Needle
Aspiration (FNA), which takes a small tissue sample from the tumor and analyzes it. The data contains
30 characteristics (features), such as the size, shape, and texture of the tissue sample.
Load the data breast_data.csv; you should have a matrix with 30 characteristics and 569 samples.
Load breast_truth.csv, which contains a list of 569 values, either 0 or 1, denoting the diagnosis for
the corresponding features in breast_data.csv: 0 denotes benign and 1 denotes malignant.

1. Run K-means clustering on the data with K = 2. Report the percentage of malignant/benign samples
in each cluster. Did your K-means model help distinguish malignant from benign? Ideally, we want to
see that one cluster has a high percentage of benign samples and the other a high percentage of
malignant samples.

2. Repeat your algorithm five times with different initializations (you may pass init='random' when
you call sklearn.cluster.KMeans to initialize randomly). Report the results as in part 1. Did they
change much from run to run?

3. Is K = 2 a good choice? Report the results of the Calinski-Harabasz (CH) and Davies-Bouldin (DB)
indices.
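
Here is a minimal sketch of a single run plus the two indices, using sklearn's calinski_harabasz_score and davies_bouldin_score; the file-loading details and array orientation are assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Assumed: features load into a 569 x 30 array, truth into a 569-vector.
X = np.loadtxt('breast_data.csv', delimiter=',')
y = np.loadtxt('breast_truth.csv', delimiter=',')

km = KMeans(n_clusters=2, init='random', n_init=1)  # one random initialization
labels = km.fit_predict(X)

# Fraction of malignant samples (truth == 1) in each cluster.
for k in range(2):
    print(k, y[labels == k].mean())

print(calinski_harabasz_score(X, labels))  # higher is better
print(davies_bouldin_score(X, labels))     # lower is better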
