MS6711 Data Mining
Homework 1
Semester B, 2021-2022
Due: Feb. 8, 2PM
Instructions
This homework contains both coding and non-coding questions. Please submit two files:
1. One Word or PDF document with the answers and plots for ALL questions, without coding details.
2. One file containing your code for all coding questions.
Problem 1: Implement K-Means Manually [8 pts]
This question checks your understanding of the K-means algorithm; you are asked to perform each step of K-means manually. It is not necessary to use the KMeans function in python, but you may use python or a simple calculator to help you calculate distances.
Given a matrix X of data points, with 10 data points and 2 variables, you are asked to perform K-means clustering step by step. Here we use the L2 distance $d = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$. The figure below plots all data observations as blue triangles, while the three initialized centers are c1 = (6.2, 3.2), c2 = (6.5, 3.0), and c3 = (6.6, 3.7).
X =
    [ 5.9  3.2 ]
    [ 4.6  2.9 ]
    [ 6.2  2.8 ]
    [ 4.7  3.2 ]
    [ 5.0  3.0 ]
    [ 5.5  4.2 ]
    [ 4.9  3.1 ]
    [ 6.7  3.1 ]
    [ 6.0  3.0 ]
    [ 5.1  3.8 ]

[Figure 1: the 10 data observations plotted as blue triangles, with the three initial centers c1 = (6.2, 3.2), c2 = (6.5, 3.0), and c3 = (6.6, 3.7).]
3. Which data observations will be assigned to the first cluster (red) in the second iteration? Answer with the row indices of the data in matrix X.
Hint: you may write a python function to help you calculate distances, for example

import numpy as np

def L2dis(x1, y1, x2, y2):
    # L2 (Euclidean) distance between the points (x1, y1) and (x2, y2)
    return np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# calculate the L2 distance between the first two observations
L2dis(5.9, 3.2, 4.6, 2.9)
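Beyond single distances, the full assignment and update steps can also be checked with a short numpy sketch; the rows of X below are the ten observations from the matrix above, and two iterations are enough for question 3:

import numpy as np

X = np.array([[5.9, 3.2], [4.6, 2.9], [6.2, 2.8], [4.7, 3.2], [5.0, 3.0],
              [5.5, 4.2], [4.9, 3.1], [6.7, 3.1], [6.0, 3.0], [5.1, 3.8]])
centers = np.array([[6.2, 3.2], [6.5, 3.0], [6.6, 3.7]])

for it in range(2):
    # assignment step: distance from every point to every center, shape (10, 3)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)          # nearest center for each point
    # update step: move each center to the mean of its assigned points
    centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])

print(np.where(labels == 0)[0] + 1)       # 1-based row indices in the first cluster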
Problem 2 [20 points]
The figures below show clustering results for 6 different datasets, denoted A, B, C, D, E, F. Each dataset is clustered using two different methods, one of which is K-means. Please determine which result is more likely to come from K-means. Note that cluster centers are denoted by a black cross (X) in the figures. All distances are L2 here.
[Figure 2: Clustered results for the 6 datasets. Each dataset is shown in two panels (A1/A2, B1/B2, C1/C2, D1/D2, E1/E2, F1/F2); cluster centers are marked with a black X.]

Please answer whether, e.g., A1 or A2 is more likely to be generated by the K-means method, and likewise for the other datasets. (Hint: check the state when K-means converges; the centers for each cluster have been noted as X. Since the x and y axes are scaled proportionally, you can determine the distances to the centers geometrically.) The distance measure used here is the Euclidean distance.

1. Dataset A (write A1 or A2; [1 pt], same for the following questions)
2. Dataset B
3. Dataset C
4. Dataset D
5. Dataset E
6. Dataset F
Problem 3 [20 points]
This problem checks your understanding of the hierarchical clustering. Same as Problem 1, not necessary
to perform clustering in python but you may use it to help calculate distances.
The figure below plots data of two clusters, denoted by red triangular (cluster A) and blue diamond
(cluster B) respectively, with coordinate labeled.
[Figure 3: scatter plot of clusters A (red triangles) and B (blue diamonds) with coordinates labeled; visible labels include (4.6, 2.9) and (6.2, 2.8).]
2. What is the distance between the two closest members of the two clusters? (single link) [2 pt] (See the hint after question 4.)
4. Among all three distances above, which one is robust to noise? Answer either “complete”, “single”,
or “average”. [1 pt]
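Hint: all three linkage distances can be read off the matrix of pairwise distances between the two clusters. Below is a minimal scipy sketch; the coordinate arrays are hypothetical placeholders (only (4.6, 2.9) and (6.2, 2.8) are labeled above), so substitute the actual points from the figure:

import numpy as np
from scipy.spatial.distance import cdist

# hypothetical coordinates -- replace with the actual points read off the figure
A = np.array([[4.6, 2.9], [5.0, 3.2]])   # cluster A (red triangles)
B = np.array([[6.2, 2.8], [6.8, 3.1]])   # cluster B (blue diamonds)

D = cdist(A, B)          # all pairwise L2 distances between the two clusters
print(D.min())           # single link: distance between the two closest members
print(D.max())           # complete link: distance between the two farthest members
print(D.mean())          # average link: mean of all pairwise distances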
2. If we project all points onto the 1-D subspace spanned by the second principal component, what is the variance of the projected data? [2 pt]
3. For a given dataset X, the eigenvalues of its covariance matrix C are {2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001}. If we want to explain more than 90% of the total variance using the first k principal components, what is the least value of k? [3 pt] (See the hint below.)
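Hint: for question 3, the smallest k follows from the cumulative sum of the eigenvalues; a minimal numpy sketch of that computation:

import numpy as np

# eigenvalues of the covariance matrix C, in decreasing order
eigvals = np.array([2.2, 1.7, 1.4, 0.8, 0.4, 0.2, 0.15, 0.02, 0.001])

# cumulative fraction of total variance explained by the first k components
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.argmax(explained > 0.9)) + 1   # smallest k whose cumulative share exceeds 90%
print(np.round(explained, 3), k)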
Problem 4 [20 points]
This problem considers hierarchical clustering with single or complete linkage, and confirms that the results are invariant under monotone transformations of the distances.
Load the data hw1p4.csv; you will have a matrix X, a 40 × 2 matrix of 40 observations and 2 variables.
1. Calculate the distance matrix d, which should be a 40 × 40 matrix, where d_ij denotes the L2 distance between X_i and X_j.
2. Run hierarchical clustering with single linkage, using the distance matrix directly rather than the X matrix. Cut the tree at K = 4. Plot the points X with different colors denoting the clusters. Also show the dendrogram.
4. Apply a monotone transformation: use d² as the distance and repeat (1) and (2). Did the clustering results change? Did the dendrogram change?
5. Run hierarchical clustering with average linkage using both distances d and d². Cut both trees at K = 4. Are the clustering results the same? How about K = 3?
Hint: In python, you may use the following to force HC to cut at K = 3 and to fit hierarchical clustering using the distance matrix. For more details (drawing the dendrogram, etc.), check pages 184-188 of the reference book Introduction to Machine Learning with Python.

from sklearn.cluster import AgglomerativeClustering

# n_clusters: cut at K = 3
# affinity: set it to 'precomputed' to use the distance matrix instead of raw features
# linkage: 'single', 'complete', or 'average'
HC = AgglomerativeClustering(n_clusters=3,
                             affinity='precomputed', linkage='single')
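sklearn's AgglomerativeClustering does not draw dendrograms by itself, so one possible end-to-end route for parts (1) and (2) is to use scipy for both the distance matrix and the dendrogram. This is only a sketch, assuming hw1p4.csv loads into the 40 × 2 array described above:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = pd.read_csv('hw1p4.csv').to_numpy()          # assumed to yield the 40 x 2 matrix
d = squareform(pdist(X))                         # 40 x 40 L2 distance matrix (part 1)

Z = linkage(pdist(X), method='single')           # single-linkage tree (part 2)
labels = fcluster(Z, t=4, criterion='maxclust')  # cut the tree at K = 4

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels)          # points colored by cluster
plt.figure()
dendrogram(Z)                                    # the corresponding dendrogram
plt.show()

# for the monotone transformation in part (4), pass pdist(X)**2 to linkage instead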
Problem 5 [20 points]
This problem considers K-means clustering to diagnose breast cancer based solely on Fine Needle Aspiration (FNA), which takes a small tissue sample from the tumor and analyzes it. The data contains 30 characteristics (features) of the tissue sample, such as size, shape, and texture.
Load the data breast_data.csv; you should have a matrix with 30 characteristics and 569 samples. Load breast_truth.csv, which contains a list of 569 values, either 0 or 1. It denotes the diagnosis result for the corresponding features in breast_data.csv, where 0 denotes benign and 1 denotes malignant.
1. Run K-means clustering on the data with K = 2. Check the percentage of malignant tumors in each cluster, and report the percentage of malignant / benign samples in each cluster. Did your K-means model help distinguish malignant from benign? Ideally we want to see that one cluster has a high percentage of benign samples while the other has a high percentage of malignant samples.
2. Repeat your algorithm five times with different initializations (you may pass init='random' when you call sklearn.cluster.KMeans to initialize randomly). Report the results as in part 1; did they change much from run to run? (A sketch of this workflow is given below.)
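A minimal sketch of both parts, assuming breast_data.csv and breast_truth.csv load into the shapes described above (the data matrix is transposed if needed so that rows are samples):

import pandas as pd
from sklearn.cluster import KMeans

X = pd.read_csv('breast_data.csv').to_numpy()           # assumed 30 features x 569 samples
y = pd.read_csv('breast_truth.csv').to_numpy().ravel()  # 0 = benign, 1 = malignant
if X.shape[0] == 30:
    X = X.T                                             # KMeans expects rows = samples

for run in range(5):
    km = KMeans(n_clusters=2, init='random', n_init=1, random_state=run)
    labels = km.fit_predict(X)
    for k in (0, 1):
        frac = y[labels == k].mean()                    # fraction malignant in cluster k
        print(f"run {run}, cluster {k}: {frac:.1%} malignant, {1 - frac:.1%} benign")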