
EE4146 Test 1 (2023/24 Semester B) Solution

The document provides solutions and marking schemes for Test 1 of EE4146 Data Engineering and Learning Systems, covering three parts: cosine similarity classification of documents, K-means clustering, and hierarchical clustering of universities. It also includes questions on dimensionality reduction, clustering methods, dataset issues for supervised learning, types of data, definitions of accuracy and precision, and eigenvalues for PCA. Each question is accompanied by detailed calculations and justifications for the answers.


EE4146 Data Engineering and Learning Systems Semester B, 2023-24

Test 1 Solution and Marking Scheme


The test consists of 3 parts.

PART B
Part B consists of 3 questions, carrying 28 marks in total.

B.1. The following two 4-dimensional vectors were collected from 2 types of topics: stories and fictions. (10 marks)
Stories: [0.6, 0.1, 2.0, -1.0]    Fictions: [1.9, 3.6, -0.4, -2.0]
We are also given the 4-d vectors of the 3 documents below, whose topics are unknown:
D1 [0.9, -0.1, 1.2, -0.3]
D2 [1.3, 1.2, -0.9, -0.3]
D3 [0.7, 0.8, 0.8, -0.9]

If the cosine similarity between a document and a topic is greater than 0.5, then the document is considered to be in the area of that topic. Use cosine similarity to classify whether the documents D1, D2, D3 are in the area of stories or fictions. Show your work to justify the answers.
Solution: Let us first calculate the cosine similarity between the topic vectors and the document vectors.

Cos(A, B) = (A . B) / (|A| |B|) = (a1*b1 + ... + aN*bN) / ( sqrt(a1^2 + ... + aN^2) * sqrt(b1^2 + ... + bN^2) )
Cos(Stories, D1) = 0.90    Cos(Fictions, D1) = 0.22
Cos(Stories, D2) = -0.13   Cos(Fictions, D2) = 0.85
Cos(Stories, D3) = 0.81    Cos(Fictions, D3) = 0.79

Therefore D1 is in the area of stories (0.90 > 0.5), D2 is in the area of fictions (0.85 > 0.5), and D3 is in the area of both stories and fictions (0.81 > 0.5 and 0.79 > 0.5).
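The similarity values can be reproduced with a short script (a sketch for checking, not part of the original test; the vectors are taken from the question, and the computed values agree with the solution's figures to within rounding):

```python
# Reproducing the cosine-similarity table (vectors taken from the question).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stories  = [0.6, 0.1, 2.0, -1.0]
fictions = [1.9, 3.6, -0.4, -2.0]
docs = {"D1": [0.9, -0.1, 1.2, -0.3],
        "D2": [1.3, 1.2, -0.9, -0.3],
        "D3": [0.7, 0.8, 0.8, -0.9]}

for name, d in docs.items():
    cs, cf = cosine(stories, d), cosine(fictions, d)
    # A document is in a topic's area when its cosine similarity exceeds 0.5.
    areas = [t for t, c in (("stories", cs), ("fictions", cf)) if c > 0.5]
    print(f"{name}: Cos(stories)={cs:.2f}, Cos(fictions)={cf:.2f} -> {areas}")
```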

EE4146 Test1 4/2/2024 Page 1 of 7


B.2. Below is 1-D data with 10 data points forming 3 clusters. Find the 3 clusters using K-means.
You are given that the 3 initial cluster centers are 18, 12 and 8. Your work should converge in 2 iterations.
For the 1st and 2nd iterations, give the cluster centers (CC), e.g., CC1 = 18, CC2 = 12, CC3 = 8, and the
cluster elements, e.g., C1(a); C2(b, c); C3(d, e, f, g, h, j, k). (12 marks)

VALUE  26  18  16  14  13  12  11  9   8   4
LABEL  a   b   c   d   e   f   g   h   j   k

Solution and scheme:


Initial CC (cluster centers): CC1(0) = 18, CC2(0) = 12, CC3(0) = 8

1st iteration:
Elements: C1(1): (a, b, c), C2(1): (d, e, f, g), C3(1): (h, j, k)
Centers: CC1(1) = 20; CC2(1) = 12.5; CC3(1) = 7 (accept answers with minor rounding differences)

2nd and final iteration:
Elements: C1(2): (a, b), C2(2): (c, d, e, f, g), C3(2): (h, j, k)   [2 marks x 3 = 6 marks]
Centers: CC1(2) = 22; CC2(2) = 13.2; CC3(2) = 7   [2 marks x 3 = 6 marks]
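The two iterations can be verified with a minimal 1-D K-means sketch (an illustration, not part of the test paper; labels follow the question, where i is skipped):

```python
# 1-D K-means sketch for the test data (labels as in the question; i is skipped).
data = {"a": 26, "b": 18, "c": 16, "d": 14, "e": 13, "f": 12, "g": 11,
        "h": 9, "j": 8, "k": 4}
centers = [18, 12, 8]  # the given initial cluster centers

for it in range(2):  # the question states the work converges in 2 iterations
    # Assignment step: each point goes to its nearest center.
    clusters = [[] for _ in centers]
    for label, x in data.items():
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(label)
    # Update step: each center becomes the mean of its assigned points.
    centers = [sum(data[l] for l in c) / len(c) for c in clusters]
    print(f"iteration {it + 1}: clusters={clusters}, centers={centers}")
```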



B.3. Below is a table showing the proximity matrix of 5 universities. Use hierarchical clustering to find the
tree diagram of these 5 universities. Draw the tree diagram as your answer and give the key values
(the distances at which clusters are merged) in the tree diagram. (6 marks)
Solution: Step 1: The given proximity matrix.
CityU Birmingham Hull UCL Durham
CityU 0 59.4 63.9 118.3 106.9
Birmingham 0 5.7 69 60.1
Hull 0 68.5 60.6
UCL 0 12.2
Durham 0

The nearest pair is (Bir, Hull) at 5.7; merge them into one cluster. (The updated distances below use single linkage, i.e., the minimum distance between members of the two clusters.)

Step 2: Update the proximity matrix.


CityU UCL Durham Bir/Hull
CityU 0 118.3 106.9 59.4
UCL 0 12.2 68.5
Durham 0 60.1
Bir/Hull 0

The nearest pair is (UCL, Durham) at 12.2; merge them into one cluster.

Step 3: Update the proximity matrix.


CityU Bir/Hull UCL/Durham
CityU 0 59.4 106.9
Bir/Hull 0 60.1
UCL/Durham 0

The nearest pair is (CityU, Bir/Hull) at 59.4; merge them into one cluster.


Step 4: Update the proximity matrix.
UCL/Durham CityU/Bir/Hull
UCL/Durham 0 60.1
CityU/Bir/Hull 0

The final pair, (CityU/Bir/Hull) and (UCL/Durham), merges at 60.1. The hierarchical tree is as follows:



60.1   (CityU, Bir, Hull) - (UCL, Durham)
59.4   CityU - (Bir, Hull)
12.2   UCL - Durham
 5.7   Bir - Hull

       CityU   Bir   Hull   UCL   Durham
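The merge order and key values can be cross-checked with a small single-linkage sketch (an illustration assuming single linkage, which matches the matrix updates in the solution; distances are taken from the given proximity matrix):

```python
# Single-linkage agglomerative clustering on the university proximity matrix (a sketch).
import itertools

dist = {
    frozenset({"CityU", "Birmingham"}): 59.4, frozenset({"CityU", "Hull"}): 63.9,
    frozenset({"CityU", "UCL"}): 118.3, frozenset({"CityU", "Durham"}): 106.9,
    frozenset({"Birmingham", "Hull"}): 5.7, frozenset({"Birmingham", "UCL"}): 69.0,
    frozenset({"Birmingham", "Durham"}): 60.1, frozenset({"Hull", "UCL"}): 68.5,
    frozenset({"Hull", "Durham"}): 60.6, frozenset({"UCL", "Durham"}): 12.2,
}
clusters = [frozenset({u}) for u in ("CityU", "Birmingham", "Hull", "UCL", "Durham")]

def single_link(c1, c2):
    # Single linkage: the minimum pairwise distance between the two clusters.
    return min(dist[frozenset({a, b})] for a in c1 for b in c2)

merges = []  # (members, merge height), in merge order
while len(clusters) > 1:
    c1, c2 = min(itertools.combinations(clusters, 2), key=lambda p: single_link(*p))
    merges.append((sorted(c1 | c2), single_link(c1, c2)))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]

for members, height in merges:
    print(height, members)
```

The printed heights 5.7, 12.2, 59.4, 60.1 are the key values that appear in the tree diagram.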

PART C
Part C consists of 6 questions; each question carries 5 marks.

C.1. The figure below shows a dataset that we need to reduce from 3-D to 2-D. In this
process, the result must preserve the original data information as closely as possible, i.e., we need to keep the
"red dots" data information inside the dark ring in the reduced 2-D space. Can we use PCA to do the job? If
yes, why? If no, why not?

No score will be awarded if no correct justification/explanation is given.

Answer: No, PCA cannot be used, because PCA is linear and the red-dot information inside the dark ring would
be distorted or lost (or merged with the blue data information) when mapped into the 2-D space.



C.2. Below are data points in a 2-D space that we need to cluster. Determine whether K-means or
GMM clustering should be used. You MUST construct the possible clustering boundaries in the diagram to
support your claim. Also use fewer than 10 words to justify your answer.

Answer: We must use GMM, because GMM cluster boundaries are elliptical and can fit this dataset well. The
circular boundaries of K-means would cause cluster elements to overlap in this case.



C.3. The table below shows part of a dataset used for classifying elderly dementia disease. The dataset consists
of 7 attributes with over 500 data points and a class label of "demented" or "not demented". The
researchers want to use a supervised learning method to study its characteristics, perform classification, and
determine whether a newly given patient suffers from dementia or not. List 2 major problems of the shown
dataset for supervised learning.

ID  Sex  Years of   Height    Weight  Heart  Mother     Father     DEMENTED?
         Education  (inches)  (lbs.)  Rate   Demented?  Demented?
1   M    16         --        --      --     Yes        No         Yes
2   F    15         58        110     72     No         No         No
3   F    --         63        211     64     No         No         --
4   M    20         64        142     52     --         No         Yes
5   M    19         69        192     64     Yes        No         --
6   F    14         65        121     72     Yes        --         No
7   F    16         60        101     67     No         No         Yes
8   F    19         57        110     70     No         Yes        No
9   M    18         --        --      72     Yes        Yes        --
10  F    16         65        136     60     Yes        No         Yes

Answer: (1) Missing data: empty entries ("--"), including some missing class labels (2 marks). (2) Non-numerical
data, i.e., ordinal and nominal attributes that must be encoded before learning (2 marks). If both are correct,
award 1 bonus mark for a total of 5 marks.
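Both problems can be made concrete with a small sketch (column names are abbreviated, only two rows of the table are shown, and the encoding map is illustrative, not from the test):

```python
# Two sample rows from the table above (columns abbreviated; "--" marks a missing entry).
cols = ["ID", "Sex", "YearsEdu", "Height", "Weight", "HeartRate",
        "MotherDem", "FatherDem", "DEMENTED"]
rows = [
    ["1", "M", "16", "--", "--", "--", "Yes", "No", "Yes"],
    ["3", "F", "--", "63", "211", "64", "No", "No", "--"],
]

# Problem 1: missing entries, including missing class labels (row 3's DEMENTED).
missing = [(row[0], col) for row in rows for col, v in zip(cols, row) if v == "--"]
print(missing)

# Problem 2: non-numerical (nominal) attributes need encoding before supervised learning.
encode = {"Yes": 1, "No": 0, "M": 0, "F": 1}  # illustrative mapping
print([encode[v] for v in rows[0][6:9]])
```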

C.4. Name 3 different types of data and give 1 example of each type to explain what it is.
Answer: Numerical, e.g., 3, 5 (1 mark); Ordinal, e.g., Very Good, Bad (2 marks); Nominal, e.g., Red, Black (2 marks).

C.5. The confusion matrix is shown below. Define Accuracy and Precision in terms of true positives (TP),
false positives (FP), false negatives (FN), and true negatives (TN).

Answer: Accuracy = (TP + TN) / (TP + FP + TN + FN) (3 marks); Precision = TP / (TP + FP) (2 marks)
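The two definitions can be sketched in code (the counts below are made-up illustrative numbers, not taken from the test):

```python
# Accuracy and Precision from confusion-matrix counts.
def accuracy(tp, fp, fn, tn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Fraction of positive predictions that are truly positive.
    return tp / (tp + fp)

# Illustrative counts (made up for this sketch, not from the test paper).
tp, fp, fn, tn = 40, 10, 5, 45
print(accuracy(tp, fp, fn, tn))  # (40 + 45) / 100
print(precision(tp, fp))         # 40 / 50
```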


C.6. Below are the eigenvalues found in a PCA process applied to a 12-dimensional dataset.
Determine how many dimensions (and which ones) can adequately be kept for dimensionality reduction from
the viewpoint of preserving data information. You must use numerical evidence to support your answer.



Eigenvalues: 7.5, 4, 3.5, 1.2, 0.8, 0.6, 0.2, 0.1, 0.07, 0.02, 0.006, 0.004

Sum of all eigenvalues = 18:

First 2 eigenvalues / 18 = 11.5/18 = 63.9% (not high enough)

First 3 eigenvalues / 18 = 15/18 = 83.3% (good enough, well over 75%)

First 4 eigenvalues / 18 = 16.2/18 = 90% (no need to go as high as 90%)

So keep the first 3 dimensions, i.e., the principal components with eigenvalues 7.5, 4 and 3.5.
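The cumulative preserved-information percentages can be checked with a few lines (the 75% threshold follows the solution's reasoning; the eigenvalues are from the question):

```python
# Cumulative fraction of variance preserved by the leading eigenvalues.
eig = [7.5, 4, 3.5, 1.2, 0.8, 0.6, 0.2, 0.1, 0.07, 0.02, 0.006, 0.004]
total = sum(eig)  # equals 18

cumulative = 0.0
for k, value in enumerate(eig, start=1):
    cumulative += value
    pct = 100 * cumulative / total
    print(f"first {k:2d} eigenvalues preserve {pct:5.1f}% of the variance")
```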

