Assignment 2 Final
Assignment - 2
General Instructions
• Your assignment must be implemented in Python.
• While you’re allowed to use ChatGPT for assistance, you must explicitly
declare in comments the prompts you used and indicate which parts of
the code were generated with the help of ChatGPT.
• Plagiarism will only be taken into consideration for code that is not gen-
erated by ChatGPT. Any code generated with the assistance of ChatGPT
should be considered as a resource, similar to using a textbook or online
tutorial.
• The difficulty of your viva or assessment will be determined by the
percentage of code in your assignment that is not attributed to ChatGPT.
If you are unable to explain any part of the code during the viva, that
code will be considered plagiarized.
• Clearly label and organize your code, including comments that explain the
purpose of each section and key steps in your implementation.
• Properly document your code and include explanations for any non-trivial
algorithms or techniques you employ.
• Ensure that your Jupyter Notebook is well-structured, with headings, sub-
headings, and explanations as necessary.
• Your assignment will be evaluated not only based on correctness but also
on the quality of code, the clarity of explanations, and the extent to which
you’ve understood and applied the concepts covered in the course.
• Make sure to test your code thoroughly before submission to avoid any
runtime errors or unexpected behavior.
• The Deadline will not be extended.
• Moss will be run on all submissions along with checking against online
resources.
• We are aware of how easy it is to write code now with ChatGPT and
GitHub Copilot, but we strongly encourage you to write the code
yourself.
1 Problem 1
This task explores methods of dimensionality reduction, specifically PCA
(Principal Component Analysis). PCA is the general name for a technique
that uses sophisticated underlying mathematical principles to transform a
number of possibly correlated variables into a smaller number of variables
called principal components. [IEEE Signal Processing Magazine (accessible
through college internet)]
Use only NumPy, Pandas, Matplotlib, and Plotly libraries for the tasks. The
use of any other libraries shall be accepted only upon the approval of the TAs.
• Perform dimensionality reduction on the features that you used for
assignment 1 (pictionary dataset) and report the same metrics that you
reported for assignment 1. Compare the results and write down your
observations in Markdown.
• Observe the impact of dimensionality reduction on the dataset. Use a
classifier on the dataset before and after dimensionality reduction (if
the dataset has n features, vary the number of principal components from
1 to n) and note the accuracies of the classifier. You are free to use
external libraries for the classifier.
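As a starting point, PCA can be sketched with NumPy alone by taking the eigendecomposition of the covariance matrix of the centred data. This is only a minimal sketch: the function name `pca_fit_transform`, the synthetic array `X_demo`, and the loop over k are illustrative assumptions and not part of the provided material.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # centre each feature
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))    # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]                # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]           # columns are principal axes
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return Xc @ components, components, explained_ratio

# Synthetic data (assumption): 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

Z, W, ratio = pca_fit_transform(X_demo, 2)

# Vary the number of components from 1 to n and record the retained variance.
retained = [pca_fit_transform(X_demo, k)[2].sum()
            for k in range(1, X_demo.shape[1] + 1)]
```

In the classifier experiment, the same loop over k could feed each projected dataset to the classifier of your choice and record its accuracy in place of the retained variance.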
2 Problem 2
The EM algorithm is used for obtaining maximum likelihood estimates of pa-
rameters when some of the data is missing. More generally, however, the EM
algorithm can also be applied when there is latent, i.e. unobserved, data which
was never intended to be observed in the first place. In that case, we simply
assume that the latent data is missing and proceed to apply the EM algorithm.
The EM algorithm has many applications throughout statistics. It is often
used, for example, in machine learning and data mining applications, and in
Bayesian statistics, where it is often used to obtain the mode of the
posterior marginal distributions of parameters. [Columbia University]
The membership value r_{ic} of a sample x_i is the probability that the
sample belongs to cluster c in a given GMM (Gaussian Mixture Model). The
likelihood of a set of samples measures how probable the given data is under
a fixed model. In other words, likelihoods are about how likely the data is
given the model, while membership values are about how likely each cluster
of the model is given a sample. [reference]
Use only NumPy, Pandas, Matplotlib, and Plotly libraries for the tasks. The
use of any other libraries shall be accepted only upon the approval of the TAs.
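The difference between membership values and likelihoods can be made concrete with the E-step of EM for a GMM. The sketch below uses only NumPy. The helper names `gaussian_pdf` and `e_step`, the synthetic data, and the fixed parameters are assumptions chosen for illustration, and the M-step is not shown.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    # Squared Mahalanobis distance of every sample to the mean.
    maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return np.exp(-0.5 * maha) / norm

def e_step(X, weights, means, covs):
    """Return membership values r_{ic} and the data log-likelihood."""
    # weighted[i, c] = pi_c * N(x_i | mu_c, Sigma_c)
    weighted = np.stack(
        [w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs)],
        axis=1,
    )
    total = weighted.sum(axis=1, keepdims=True)   # p(x_i) under the mixture
    resp = weighted / total                       # r_{ic}; each row sums to 1
    log_likelihood = np.log(total).sum()          # log p(all data | model)
    return resp, log_likelihood

# Synthetic two-cluster data and fixed parameters (assumptions).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                    rng.normal(5.0, 1.0, (50, 2))])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
covs = np.array([np.eye(2), np.eye(2)])

resp, log_likelihood = e_step(X_demo, weights, means, covs)
```

Here `resp` holds one membership value per sample and cluster, while `log_likelihood` is a single number for the whole dataset under the fixed model.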
Figure 1: Scatter plot of the Wine dataset after PCA with 2 principal
components. (One axis represents Principal Component 1 and the other
represents Principal Component 2.)
3 Problem 3
Hierarchical clustering is a popular method for grouping objects. It creates
groups so that objects within a group are similar to each other and different
from objects in other groups. Clusters are visually represented in a hierarchical
tree called a dendrogram. [Reference for hierarchical clustering and linkages]
Use only NumPy, Pandas, Matplotlib, and Plotly libraries for the tasks. The
use of any other libraries shall be accepted only upon the approval of the TAs.
Figure 2: Example of a dendrogram, obtained from a different dataset. The
vertical axis represents individual points. The horizontal axis represents
the distance between clusters.
• Perform hierarchical clustering on the dataset and obtain the linkage ma-
trix. Vary the linkages and features used and state your observations.
Plot the dendrogram using the linkage matrix.
• Perform hierarchical clustering on the gene expression dataset and obtain
the linkage matrix. Vary the linkages and features used and state your
observations. (The dataset contains 58 genes with their IDs and their
respective expression levels for 12 proteins, resulting in a 58 × 13
matrix.) Plot the dendrogram using the linkages obtained.
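One way to obtain a linkage matrix with NumPy alone is a naive agglomerative loop that repeatedly merges the two closest clusters. The sketch below is an illustration under assumptions: the function name `linkage_matrix` and the toy points are hypothetical, only single, complete, and average linkage are covered, and the row layout (two merged cluster ids, merge distance, new cluster size) follows the convention of SciPy's linkage matrices. It favours readability over speed.

```python
import numpy as np

def linkage_matrix(X, method="single"):
    """Naive agglomerative clustering; returns an (n - 1, 4) linkage matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between the original points.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = {i: [i] for i in range(n)}   # cluster id -> member point indices
    Z = np.zeros((n - 1, 4))
    next_id = n                             # ids n, n + 1, ... name merged clusters
    for step in range(n - 1):
        ids = sorted(clusters)
        best_d, best_a, best_b = np.inf, None, None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                block = D[np.ix_(clusters[a], clusters[b])]
                if method == "single":
                    d = block.min()         # closest pair of members
                elif method == "complete":
                    d = block.max()         # farthest pair of members
                elif method == "average":
                    d = block.mean()        # mean over all member pairs
                else:
                    raise ValueError(f"unknown linkage method: {method}")
                if d < best_d:
                    best_d, best_a, best_b = d, a, b
        members = clusters.pop(best_a) + clusters.pop(best_b)
        clusters[next_id] = members
        Z[step] = [best_a, best_b, best_d, len(members)]
        next_id += 1
    return Z

# Toy points (assumption): two tight pairs and one distant point.
points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]])
Z_single = linkage_matrix(points, method="single")
Z_complete = linkage_matrix(points, method="complete")
```

Comparing the third column of `Z_single` and `Z_complete` shows how the choice of linkage changes the merge distances even when the merge order is the same.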
4 Problem 4
For this section, you are free to use external libraries to solve the question(s).
Submit separate notebooks for each of the following sub-questions in this section.
4.2 Problem 4.2 [20]
The task is to determine the optimal horizontal and vertical Euclidean
distance thresholds between bounding boxes containing words on a document
page. The objective is to connect boxes within a paragraph while ensuring
that boxes across paragraphs and columns remain unconnected. Attached are
illustrative examples showing the desired box connections and a sample
visualization of the expected output: ATTACHMENT.
You have also been given the following scripts -
• To visualize the enclosing boxes.
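To make the role of the two thresholds concrete, the sketch below connects two boxes when both their horizontal and their vertical gaps fall within the given thresholds, then groups connected boxes with a small union-find. Everything here is an assumption for illustration: the box format `(x_min, y_min, x_max, y_max)`, the edge-to-edge gap used as the distance, the helper names, and the threshold values. The provided scripts and data may use a different format.

```python
import numpy as np

def axis_gap(lo_a, hi_a, lo_b, hi_b):
    """Gap between two intervals on one axis; 0 when they overlap."""
    return max(0.0, max(lo_a, lo_b) - min(hi_a, hi_b))

def connect_boxes(boxes, h_thresh, v_thresh):
    """Return the connected box pairs and a group label for each box."""
    boxes = np.asarray(boxes, dtype=float)
    n = len(boxes)
    parent = list(range(n))                 # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            h = axis_gap(boxes[i, 0], boxes[i, 2], boxes[j, 0], boxes[j, 2])
            v = axis_gap(boxes[i, 1], boxes[i, 3], boxes[j, 1], boxes[j, 3])
            if h <= h_thresh and v <= v_thresh:
                edges.append((i, j))
                parent[find(i)] = find(j)   # merge the two groups
    labels = [find(i) for i in range(n)]
    return edges, labels

# Hypothetical word boxes: two words on one line, one word on the next line,
# one word in a later paragraph, and one word in another column.
demo_boxes = [
    (0, 0, 40, 10),
    (45, 0, 90, 10),
    (0, 14, 50, 24),
    (0, 60, 40, 70),
    (200, 0, 240, 10),
]
edges, labels = connect_boxes(demo_boxes, h_thresh=10, v_thresh=8)
```

With these illustrative thresholds the first three boxes form one group, while the later paragraph and the other column stay separate. Choosing the thresholds for real pages is the actual task and would need to be driven by the data.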
5 Relevant Readings
This section contains some reading material for the assignment, which may
assist you in solving or understanding the questions, coupled with some
resources for gaining deeper knowledge of the topics. This section is
intended only as help, and it is not graded or evaluated.
• Reference for Gaussian Mixture Model