EE4146 Test 1 2023/24 Semester B Solution
B.1. The following two 4-dimensional vectors are collected from 2 types of topics, stories and fictions. (10 marks)
Stories  [0.6, 0.1, 2.0, -1.0]
Fictions [1.9, 3.6, -0.4, -2.0]
We are also given the following three 4-dimensional document vectors from unknown topics:
D1 [0.9, -0.1, 1.2, -0.3]
D2 [1.3, 1.2, -0.9, -0.3]
D3 [0.7, 0.8, 0.8, -0.9]
If the cosine similarity between a document and a topic is greater than 0.5, we consider the document to be in the area of that topic. Use cosine similarity to classify whether documents D1, D2, and D3 are in the area of stories or fictions. Show your work to justify your answers.
Solution: Let us first calculate the cosine similarity between each topic vector and each document vector.
Cos(A, B) = (A · B) / (|A| |B|) = (a1·b1 + ... + aN·bN) / (√(a1² + ... + aN²) · √(b1² + ... + bN²))
Cos(Stories, D1) = 0.90    Cos(Fictions, D1) = 0.22
Cos(Stories, D2) = −0.13   Cos(Fictions, D2) = 0.85
Cos(Stories, D3) = 0.81    Cos(Fictions, D3) = 0.79
Using the 0.5 threshold: D1 is in the area of stories (0.90 > 0.5, 0.22 < 0.5), D2 is in the area of fictions (0.85 > 0.5, −0.13 < 0.5), and D3 is in the area of both stories and fictions (0.81 > 0.5 and 0.79 > 0.5).
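For reference, the numbers above can be reproduced with the short NumPy sketch below; the vectors are exactly the ones given in the question, and the 0.5 threshold is applied to each topic.

```python
# Minimal sketch (not the official marking code): reproduces the cosine
# similarities above and applies the 0.5 threshold to classify each document.
import numpy as np

def cos_sim(a, b):
    """Cosine similarity: (a . b) / (|a| |b|)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

topics = {
    "Stories":  [0.6, 0.1, 2.0, -1.0],
    "Fictions": [1.9, 3.6, -0.4, -2.0],
}
docs = {
    "D1": [0.9, -0.1, 1.2, -0.3],
    "D2": [1.3, 1.2, -0.9, -0.3],
    "D3": [0.7, 0.8, 0.8, -0.9],
}

for name, d in docs.items():
    sims = {t: cos_sim(v, d) for t, v in topics.items()}
    areas = [t for t, s in sims.items() if s > 0.5]
    print(name, {t: round(s, 2) for t, s in sims.items()},
          "-> in area of", areas or "neither")
```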
C.1. The figure below shows a dataset that we need to reduce from 3-D to 2-D. In this process, the result must preserve the original data information as closely as possible, i.e., we need to keep the “red dots” data information inside the dark ring in the reduced 2-D space. Can we use PCA to do the job? If yes, why? If no, why not?
Answer: No, PCA cannot do the job, because PCA is a linear projection and the red-dot information inside the dark ring will be distorted or lost (or merged with the blue data information) when mapped into the 2-D space.
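As an illustration only (the actual figure is not reproduced here), the sketch below builds a toy 3-D dataset in the same spirit, a red ring embedded inside a blue point cloud, and shows that a linear PCA projection to 2-D leaves the red and blue points mixed together.

```python
# Hedged toy illustration (assumed data, not the exam figure): a red ring
# embedded inside a blue 3-D point cloud. After a linear PCA projection to
# 2-D, the red points remain surrounded by blue points, so the red-ring
# information cannot be kept separate.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Blue cloud: a 3-D Gaussian blob around the origin.
blue = rng.normal(scale=0.8, size=(500, 3))

# Red ring: radius-1 circle in the x-y plane, sitting inside the blue cloud.
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
red = np.column_stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)])

X = np.vstack([blue, red])
Z = PCA(n_components=2).fit_transform(X)
z_blue, z_red = Z[:500], Z[500:]

r_blue = np.linalg.norm(z_blue, axis=1)
r_red = np.linalg.norm(z_red, axis=1)
print("blue 2-D radii:", r_blue.min().round(2), "to", r_blue.max().round(2))
print("red  2-D radii:", r_red.min().round(2), "to", r_red.max().round(2))
# The two ranges overlap heavily: the linear map cannot keep the red-ring
# information separated from the blue data, which is why PCA is unsuitable here.
```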
Answer: We must use GMM, because GMM cluster boundaries are elliptical and can fit this dataset well. K-means uses circular (spherical) boundaries, so its cluster elements would overlap in this case.
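A toy comparison (not the exam dataset) that illustrates this point: on elongated, tilted Gaussian clusters, a full-covariance GMM typically recovers the grouping, while the spherical clusters of K-means mislabel points where the ellipses come close together.

```python
# Hedged toy comparison: two elongated, tilted Gaussian clusters. A
# full-covariance GaussianMixture typically fits the elliptical shapes,
# while KMeans (spherical, distance-based) cuts across them.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
cov = [[4.0, 3.5], [3.5, 4.0]]                 # strongly elongated ellipse
a = rng.multivariate_normal([0.0, 0.0], cov, 300)
b = rng.multivariate_normal([5.0, 0.0], cov, 300)
X = np.vstack([a, b])
y = np.array([0] * 300 + [1] * 300)            # true cluster labels

def agreement(pred, truth):
    """Clustering accuracy up to label permutation (2 clusters)."""
    acc = (pred == truth).mean()
    return max(acc, 1.0 - acc)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit_predict(X)

print("k-means agreement:", round(agreement(km, y), 2))   # typically lower
print("GMM agreement:    ", round(agreement(gm, y), 2))   # typically near 1.0
```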
Answer: Missing links or missing data (no entry) (2%); non-numerical data (2%). If both are correct, give 1 bonus point to make it 5% (non-numerical data: ordinal and nominal data).
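A minimal pandas sketch with a hypothetical toy table (the column names are made up), showing how the two issues mentioned here, missing entries and non-numerical (ordinal/nominal) columns, can be detected.

```python
# Hedged sketch on a hypothetical toy table: detecting missing entries and
# non-numerical columns, the two data issues referred to above.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, None, 40],                  # numerical, one missing entry
    "rating": ["Very Good", "Bad", "Good", None],  # ordinal (non-numerical)
    "colour": ["Red", "Black", "Red", "Black"],    # nominal (non-numerical)
})

print(df.isna().sum())    # count of missing data (no entry) per column
print(df.dtypes)          # 'object' columns are non-numerical
```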
C.4. Name 3 different types of data and give 1 example of each type to explain what they are.
Answer: Numerical, e.g., 3, 5 (1%); Ordinal, e.g., Very Good, Bad (2%); Nominal, e.g., Red, Black (2%).
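A small sketch of how these three types could be represented in pandas; the example values mirror the ones in the answer above.

```python
# Hedged sketch: the three data types in pandas. Ordinal values carry an
# order (Bad < Good < Very Good); nominal values do not.
import pandas as pd

numerical = pd.Series([3, 5])                                    # numerical
ordinal = pd.Categorical(["Very Good", "Bad"],
                         categories=["Bad", "Good", "Very Good"],
                         ordered=True)                           # ordinal
nominal = pd.Categorical(["Red", "Black"])                       # nominal

print(numerical.mean())                    # arithmetic is meaningful for numerical data
print(ordinal.min(), "<", ordinal.max())   # ordering is meaningful for ordinal data
print(list(nominal.categories))            # nominal values are only compared for equality
```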
C.5. The confusion matrix is shown below. Define accuracy and precision in terms of true positive (TP), false positive (FP), false negative (FN), and true negative (TN).
Answer: Accuracy = (TP + TN) / (TP + TN + FP + FN); Precision = TP / (TP + FP).
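A minimal helper sketch of these two definitions; the counts used below are hypothetical, since the confusion-matrix values are not reproduced here.

```python
# Hedged sketch: accuracy and precision from the four confusion-matrix counts.
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # (TP + TN) / all predictions

def precision(tp, fp):
    return tp / (tp + fp)                    # TP / predicted positives

# Hypothetical counts, just to show the usage:
tp, fp, fn, tn = 40, 10, 5, 45
print("accuracy  =", accuracy(tp, fp, fn, tn))   # 0.85
print("precision =", precision(tp, fp))          # 0.8
```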
Answer: (Sum of the first 3 eigenvalues) / (sum of all eigenvalues) = 15/18 = 83.3%, which is well above 75%, so it is good enough.
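The individual eigenvalues in the sketch below are hypothetical, chosen only so that the first three sum to 15 and all of them sum to 18, matching the ratio used in this answer.

```python
# Hedged sketch: variance retained by the first k principal components,
# computed from (hypothetical) eigenvalues whose totals match the answer.
import numpy as np

eigvals = np.array([8.0, 4.0, 3.0, 1.5, 1.0, 0.5])   # hypothetical eigenvalues
retained = eigvals[:3].sum() / eigvals.sum()          # 15 / 18
print(f"variance retained by first 3 components: {retained:.1%}")   # 83.3%
```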