
EE4146 Test 1 (2023/24 Semester B) Solution

The document provides solutions and marking schemes for Test 1 of EE4146 Data Engineering and Learning Systems, covering three parts: cosine similarity classification of documents, K-means clustering, and hierarchical clustering of universities. It also includes questions on dimensionality reduction, clustering methods, dataset issues for supervised learning, types of data, definitions of accuracy and precision, and eigenvalues for PCA. Each question is accompanied by detailed calculations and justifications for the answers.


EE4146 Data Engineering and Learning Systems Semester B, 2023-24

Test 1 Solution and Marking Scheme


The test consists of 3 parts.

PART B
Part B consists of 3 questions, carrying 28 marks in total.

B.1. The following two 4-dimensional vectors were collected from 2 types of topics: stories and fictions. (10 marks)
Stories: [0.6, 0.1, 2.0, -1.0]    Fictions: [1.9, 3.6, -0.4, -2.0]
We are also given the 4-d vectors of the 3 documents below, whose topics are unknown:
D1 [0.9, -0.1, 1.2, -0.3]
D2 [1.3, 1.2, -0.9, -0.3]
D3 [0.7, 0.8, 0.8, -0.9]

If the cosine similarity between a document and a topic is greater than 0.5, then the document is considered to be in the area of that topic. Use cosine similarity to classify whether the documents D1, D2, D3 are in the area of stories or fictions. Show your work to justify the answers.
Solution: Let us first calculate the cosine similarity between the topic vectors and the document vectors.

Cos(A, B) = (A . B) / (|A| |B|) = (a1*b1 + ... + aN*bN) / ( sqrt(a1^2 + ... + aN^2) * sqrt(b1^2 + ... + bN^2) )
Cos(Stories, D1) = 0.90    Cos(Fictions, D1) = 0.22
Cos(Stories, D2) = -0.13   Cos(Fictions, D2) = 0.85
Cos(Stories, D3) = 0.81    Cos(Fictions, D3) = 0.79

Therefore D1 is in the area of stories (0.90 > 0.5), D2 is in the area of fictions (0.85 > 0.5), and D3 is in the area of both stories and fictions (0.81 > 0.5 and 0.79 > 0.5).
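The similarity values can be reproduced with a short script (a sketch for checking, not part of the original test; the vectors are taken from the question, and the computed values agree with the solution's figures to within rounding):

```python
# Reproducing the cosine-similarity table (vectors taken from the question).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

stories  = [0.6, 0.1, 2.0, -1.0]
fictions = [1.9, 3.6, -0.4, -2.0]
docs = {"D1": [0.9, -0.1, 1.2, -0.3],
        "D2": [1.3, 1.2, -0.9, -0.3],
        "D3": [0.7, 0.8, 0.8, -0.9]}

for name, d in docs.items():
    cs, cf = cosine(stories, d), cosine(fictions, d)
    # A document is in a topic's area when its cosine similarity exceeds 0.5.
    areas = [t for t, c in (("stories", cs), ("fictions", cf)) if c > 0.5]
    print(f"{name}: Cos(stories)={cs:.2f}, Cos(fictions)={cf:.2f} -> {areas}")
```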

EE4146 Test1 4/2/2024 Page 1 of 7


B.2. Below is 1-D data with 10 data points forming 3 clusters. Find the 3 clusters using K-means.
You are given that the 3 initial cluster centers are 18, 12 and 8. Your work should converge in 2 iterations.
For the 1st and 2nd iterations, give the cluster centers (CC), e.g., CC1 = 18, CC2 = 12, CC3 = 8, and the
cluster elements, e.g., C1(a); C2(b, c); C3(d, e, f, g, h, j, k). (12 marks)

VALUE  26  18  16  14  13  12  11  9   8   4
LABEL  a   b   c   d   e   f   g   h   j   k

Solution and scheme:


Initial CC (cluster centers): CC1(0) = 18, CC2(0) = 12, CC3(0) = 8

1st iteration:
Elements: C1(1): (a, b, c), C2(1): (d, e, f, g), C3(1): (h, j, k)
Centers: CC1(1) = 20; CC2(1) = 12.5; CC3(1) = 7 (accept answers with minor rounding differences)

2nd and final iteration:
Elements: C1(2): (a, b), C2(2): (c, d, e, f, g), C3(2): (h, j, k)   [2 marks x 3 = 6 marks]
Centers: CC1(2) = 22; CC2(2) = 13.2; CC3(2) = 7   [2 marks x 3 = 6 marks]
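The two iterations can be verified with a minimal 1-D K-means sketch (an illustration, not part of the test paper; labels follow the question, where i is skipped):

```python
# 1-D K-means sketch for the test data (labels as in the question; i is skipped).
data = {"a": 26, "b": 18, "c": 16, "d": 14, "e": 13, "f": 12, "g": 11,
        "h": 9, "j": 8, "k": 4}
centers = [18, 12, 8]  # the given initial cluster centers

for it in range(2):  # the question states the work converges in 2 iterations
    # Assignment step: each point goes to its nearest center.
    clusters = [[] for _ in centers]
    for label, x in data.items():
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        clusters[nearest].append(label)
    # Update step: each center becomes the mean of its assigned points.
    centers = [sum(data[l] for l in c) / len(c) for c in clusters]
    print(f"iteration {it + 1}: clusters={clusters}, centers={centers}")
```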



B.3. Below is a table showing the proximity matrix of 5 universities. Use hierarchical clustering to find the
tree diagram of these 5 universities. Draw the tree diagram as your answer and give the key values
(the distances at which clusters are merged) in the tree diagram. (6 marks)
Solution: Step 1: The given proximity matrix.
CityU Birmingham Hull UCL Durham
CityU 0 59.4 63.9 118.3 106.9
Birmingham 0 5.7 69 60.1
Hull 0 68.5 60.6
UCL 0 12.2
Durham 0

The nearest pair is (Bir, Hull) at 5.7; merge them into one cluster. (The updated distances below use single linkage, i.e., the minimum distance between members of the two clusters.)

Step 2: Update the proximity matrix.


CityU UCL Durham Bir/Hull
CityU 0 118.3 106.9 59.4
UCL 0 12.2 68.5
Durham 0 60.1
Bir/Hull 0

The nearest pair is (UCL, Durham) at 12.2; merge them into one cluster.

Step 3: Update the proximity matrix.


CityU Bir/Hull UCL/Durham
CityU 0 59.4 106.9
Bir/Hull 0 60.1
UCL/Durham 0

The nearest pair is (CityU, Bir/Hull) at 59.4; merge them into one cluster.


Step 4: Update the proximity matrix.
UCL/Durham CityU/Bir/Hull
UCL/Durham 0 60.1
CityU/Bir/Hull 0

The final pair, (CityU/Bir/Hull) and (UCL/Durham), merges at 60.1. The hierarchical tree is as follows:



60.1   (CityU, Bir, Hull) - (UCL, Durham)
59.4   CityU - (Bir, Hull)
12.2   UCL - Durham
 5.7   Bir - Hull

       CityU   Bir   Hull   UCL   Durham
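The merge order and key values can be cross-checked with a small single-linkage sketch (an illustration assuming single linkage, which matches the matrix updates in the solution; distances are taken from the given proximity matrix):

```python
# Single-linkage agglomerative clustering on the university proximity matrix (a sketch).
import itertools

dist = {
    frozenset({"CityU", "Birmingham"}): 59.4, frozenset({"CityU", "Hull"}): 63.9,
    frozenset({"CityU", "UCL"}): 118.3, frozenset({"CityU", "Durham"}): 106.9,
    frozenset({"Birmingham", "Hull"}): 5.7, frozenset({"Birmingham", "UCL"}): 69.0,
    frozenset({"Birmingham", "Durham"}): 60.1, frozenset({"Hull", "UCL"}): 68.5,
    frozenset({"Hull", "Durham"}): 60.6, frozenset({"UCL", "Durham"}): 12.2,
}
clusters = [frozenset({u}) for u in ("CityU", "Birmingham", "Hull", "UCL", "Durham")]

def single_link(c1, c2):
    # Single linkage: the minimum pairwise distance between the two clusters.
    return min(dist[frozenset({a, b})] for a in c1 for b in c2)

merges = []  # (members, merge height), in merge order
while len(clusters) > 1:
    c1, c2 = min(itertools.combinations(clusters, 2), key=lambda p: single_link(*p))
    merges.append((sorted(c1 | c2), single_link(c1, c2)))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]

for members, height in merges:
    print(height, members)
```

The printed heights 5.7, 12.2, 59.4, 60.1 are the key values that appear in the tree diagram.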

PART C
Part C consists of 6 questions; each question carries 5 marks.

C.1. The figure below shows a dataset that we need to reduce from 3-D to 2-D. In this
process, the result must preserve the original data information as closely as possible, i.e., we need to keep the
"red dots" data information inside the dark ring in the reduced 2-D space. Can we use PCA to do the job? If
yes, why? If no, why not?

No score will be awarded if no correct justification/explanation is given.

Answer: No, PCA cannot be used, because PCA is linear and the red-dot information inside the dark ring would
be distorted or lost (or merged with the blue data information) when mapped into the 2-D space.



C.2. Below are data points in a 2-D space that we need to cluster. Determine whether K-means or
GMM clustering should be used. You MUST construct the possible clustering boundaries in the diagram to
support your claim. Also use fewer than 10 words to justify your answer.

Answer: We must use GMM, because GMM cluster boundaries are elliptical and can fit this dataset well. The
circular boundaries of K-means would cause cluster elements to overlap in this case.



C.3. The table below shows part of a dataset used for classifying elderly dementia disease. The dataset consists
of 7 attributes with over 500 data points and a class label of "demented" or "not demented". The
researchers want to use a supervised learning method to study its characteristics, perform classification, and
determine whether a newly given patient suffers from dementia or not. List 2 major problems of the shown
dataset for supervised learning.

ID  Sex  Years of   Height    Weight  Heart  Mother     Father     DEMENTED?
         Education  (inches)  (lbs.)  Rate   Demented?  Demented?
1   M    16         --        --      --     Yes        No         Yes
2   F    15         58        110     72     No         No         No
3   F    --         63        211     64     No         No         --
4   M    20         64        142     52     --         No         Yes
5   M    19         69        192     64     Yes        No         --
6   F    14         65        121     72     Yes        --         No
7   F    16         60        101     67     No         No         Yes
8   F    19         57        110     70     No         Yes        No
9   M    18         --        --      72     Yes        Yes        --
10  F    16         65        136     60     Yes        No         Yes

Answer: (1) Missing data: empty entries ("--"), including some missing class labels (2 marks). (2) Non-numerical
data, i.e., ordinal and nominal attributes that must be encoded before learning (2 marks). If both are correct,
award 1 bonus mark for a total of 5 marks.
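Both problems can be made concrete with a small sketch (column names are abbreviated, only two rows of the table are shown, and the encoding map is illustrative, not from the test):

```python
# Two sample rows from the table above (columns abbreviated; "--" marks a missing entry).
cols = ["ID", "Sex", "YearsEdu", "Height", "Weight", "HeartRate",
        "MotherDem", "FatherDem", "DEMENTED"]
rows = [
    ["1", "M", "16", "--", "--", "--", "Yes", "No", "Yes"],
    ["3", "F", "--", "63", "211", "64", "No", "No", "--"],
]

# Problem 1: missing entries, including missing class labels (row 3's DEMENTED).
missing = [(row[0], col) for row in rows for col, v in zip(cols, row) if v == "--"]
print(missing)

# Problem 2: non-numerical (nominal) attributes need encoding before supervised learning.
encode = {"Yes": 1, "No": 0, "M": 0, "F": 1}  # illustrative mapping
print([encode[v] for v in rows[0][6:9]])
```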

C.4. Name 3 different types of data and give 1 example of each type to explain what it is.
Answer: Numerical, e.g., 3, 5 (1 mark); Ordinal, e.g., Very Good, Bad (2 marks); Nominal, e.g., Red, Black (2 marks).

C.5. The confusion matrix is shown below. Define Accuracy and Precision in terms of true positives (TP),
false positives (FP), false negatives (FN), and true negatives (TN).

Answer: Accuracy = (TP + TN) / (TP + FP + TN + FN) (3 marks); Precision = TP / (TP + FP) (2 marks)
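The two definitions can be sketched in code (the counts below are made-up illustrative numbers, not taken from the test):

```python
# Accuracy and Precision from confusion-matrix counts.
def accuracy(tp, fp, fn, tn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    # Fraction of positive predictions that are truly positive.
    return tp / (tp + fp)

# Illustrative counts (made up for this sketch, not from the test paper).
tp, fp, fn, tn = 40, 10, 5, 45
print(accuracy(tp, fp, fn, tn))  # (40 + 45) / 100
print(precision(tp, fp))         # 40 / 50
```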


C.6. Below are the eigenvalues found in a PCA process applied to a 12-dimensional dataset.
Determine how many dimensions (and which ones) can adequately be kept for dimensionality reduction from
the viewpoint of preserving data information. You must use numerical evidence to support your answer.



Eigenvalues: 7.5, 4, 3.5, 1.2, 0.8, 0.6, 0.2, 0.1, 0.07, 0.02, 0.006, 0.004

Sum of all eigenvalues = 18:

First 2 eigenvalues / 18 = 11.5/18 = 63.9% (not high enough)

First 3 eigenvalues / 18 = 15/18 = 83.3% (good enough, well over 75%)

First 4 eigenvalues / 18 = 16.2/18 = 90% (no need to go as high as 90%)

So keep the first 3 dimensions, i.e., the principal components with eigenvalues 7.5, 4 and 3.5.
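The cumulative preserved-information percentages can be checked with a few lines (the 75% threshold follows the solution's reasoning; the eigenvalues are from the question):

```python
# Cumulative fraction of variance preserved by the leading eigenvalues.
eig = [7.5, 4, 3.5, 1.2, 0.8, 0.6, 0.2, 0.1, 0.07, 0.02, 0.006, 0.004]
total = sum(eig)  # equals 18

cumulative = 0.0
for k, value in enumerate(eig, start=1):
    cumulative += value
    pct = 100 * cumulative / total
    print(f"first {k:2d} eigenvalues preserve {pct:5.1f}% of the variance")
```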

