Assignment 2 Final

The document provides instructions for an assignment on statistical methods in artificial intelligence. It outlines 4 problems involving dimensionality reduction, clustering, and alignment techniques. Students are asked to implement algorithms like PCA, GMM, hierarchical clustering and orientation alignment on various datasets and analyze the results.

Uploaded by

Mohammed Nasir Ali Khan

Statistical Methods in AI

Instructor : Prof Ravi Kiran Sarvadevabhatla

Deadline : 25 September 2023 11:55 P.M

Assignment - 2

General Instructions
• Your assignment must be implemented in Python.

• While you’re allowed to use ChatGPT for assistance, you must explicitly
declare in comments the prompts you used and indicate which parts of
the code were generated with the help of ChatGPT.
• Plagiarism will only be taken into consideration for code that is not
generated by ChatGPT. Any code generated with the assistance of ChatGPT
should be considered as a resource, similar to using a textbook or online
tutorial.
• The difficulty of your viva or assessment will be determined by the
percentage of code in your assignment that is not attributed to ChatGPT.
If you are unable to explain any part of the code during the viva, that
code will be considered plagiarized.
• Clearly label and organize your code, including comments that explain the
purpose of each section and key steps in your implementation.
• Properly document your code and include explanations for any non-trivial
algorithms or techniques you employ.
• Ensure that your Jupyter Notebook is well-structured, with headings,
subheadings, and explanations as necessary.
• Your assignment will be evaluated not only based on correctness but also
on the quality of code, the clarity of explanations, and the extent to which
you’ve understood and applied the concepts covered in the course.
• Make sure to test your code thoroughly before submission to avoid any
runtime errors or unexpected behavior.
• The Deadline will not be extended.

• Moss will be run on all submissions along with checking against online
resources.
• We are aware of how easy it is to write code now in the presence of ChatGPT
and GitHub Copilot, but we strongly encourage you to write the code
yourself.

• We are aware of the possibility of submitting the assignment late in GitHub
Classroom using various hacks. Note that we have measures in place for
that, and anyone caught attempting to do so will be given zero in the
assignment.
• SUBMISSION FORMAT: Submit separate files with all the worked-out code
and necessary observations in MARKDOWN for each problem.

1 Problem 1
This task involves exploring methods of dimensionality reduction. We will be
looking into PCA (principal component analysis) for this task. Principal
Component Analysis (PCA) is the general name for a technique that uses
sophisticated underlying mathematical principles to transform a number of
possibly correlated variables into a smaller number of variables called
principal components. [IEEE Signal Processing Magazine, accessible through
the college internet]
Use only NumPy, Pandas, Matplotlib, and Plotly libraries for the tasks. The
use of any other libraries shall be accepted only upon the approval of the TAs.

1.1 PCA [25]

This task requires you to implement Principal Component Analysis and perform
dimensionality reduction on a given dataset(s). The list of subtasks is given
below.
• Perform dimensionality reduction on the IIIT-CFW dataset, varying the
number of principal components. We have provided a script (Script) to
pre-process the data and to get the necessary information from the images.
• Plot the relationship between the cumulative explained variance and
the number of principal components. The x-axis of the plot represents
the number of principal components, and the y-axis represents the
cumulative explained variance.
• Perform the dimensionality reduction on the features that you used for
Assignment 1 (Pictionary dataset) and report the same metrics that you
reported for Assignment 1. Compare the results and write down the
observations in the MARKDOWN.

• Observe the impact of dimensionality reduction on the dataset. Use a
classifier on the dataset pre and post-dimensionality reduction (if the
number of features of the dataset is n, perform dimensionality reduction
varying the principal components from 1 to n) and note the accuracies of
the classifier. You are free to use external libraries for the classifier.
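
The PCA and cumulative explained variance steps listed above can be sketched as follows. This is a minimal illustration using only NumPy, assuming the data is already a numeric array of shape (n_samples, n_features); the function names `pca_fit` and `pca_transform` and the synthetic data are hypothetical and are not part of the provided script.

```python
import numpy as np

def pca_fit(X):
    """Return (mean, components, explained_variance_ratio) for X of shape (n_samples, n_features)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # centre the data
    cov = np.cov(Xc, rowvar=False)                 # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh: symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    eigvals = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    components = eigvecs[:, order].T               # rows are principal directions
    ratio = eigvals / eigvals.sum()
    return mean, components, ratio

def pca_transform(X, mean, components, k):
    """Project X onto the first k principal components."""
    return (X - mean) @ components[:k].T

# Synthetic stand-in data (hypothetical, not the IIIT-CFW features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

mean, components, ratio = pca_fit(X)
cumulative = np.cumsum(ratio)                      # y-axis of the requested plot
Z = pca_transform(X, mean, components, k=2)        # reduced data, shape (200, 2)
```

Plotting `cumulative` against `np.arange(1, len(cumulative) + 1)` with Matplotlib gives the requested curve. For image data with many more features than samples, an SVD of the centred data may be cheaper than forming the full covariance matrix.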

1.2 Pictionary Dataset [10]

This task is to perform PCA on the Pictionary Dataset (Dataset). The
attachment also contains the description of the dataset. Perform PCA for
both drawer and guesser.
• Plot the above features with respect to the obtained PCA axes.
• What does each of the new axes that are obtained from PCA represent?
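
One possible way to approach the interpretation question is to inspect which original features carry the largest weights on each principal axis. The sketch below assumes the principal directions are available as rows of a matrix; the helper `top_loadings`, the feature names, and the toy component matrix are all hypothetical.

```python
import numpy as np

def top_loadings(components, feature_names, k=3):
    """For each principal axis, list the k features with the largest absolute weight."""
    summary = []
    for axis in components:
        idx = np.argsort(np.abs(axis))[::-1][:k]   # indices of the strongest weights
        summary.append([(feature_names[i], float(axis[i])) for i in idx])
    return summary

# Hypothetical feature names and a toy 2 x 3 component matrix
names = ["f0", "f1", "f2"]
components = np.array([[0.90, 0.10, -0.42],
                       [0.05, -0.99, 0.10]])
report = top_loadings(components, names, k=2)
```

A feature with a large positive or negative weight on an axis moves strongly along that axis, which is often a reasonable starting point for describing what the axis represents.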

2 Problem 2
The EM algorithm is used for obtaining maximum likelihood estimates of
parameters when some of the data is missing. More generally, however, the EM
algorithm can also be applied when there is latent, i.e. unobserved, data which
was never intended to be observed in the first place. In that case, we simply
assume that the latent data is missing and proceed to apply the EM algorithm.
The EM algorithm has many applications throughout statistics. It is often used,
for example, in machine learning and data mining applications, and in Bayesian
statistics, where it is often used to obtain the mode of the posterior marginal
distributions of parameters. [Columbia University]
The membership value $r_{ic}$ of a sample $x_i$ is the probability that the
sample belongs to cluster c in a given GMM (Gaussian Mixture Model). The
likelihood value for a set of samples measures how likely the given data is
under a fixed model. In other words, likelihoods are about how likely the data
is given the model, while membership values are about how likely each
component of the model is given a data sample. [reference]
Use only NumPy, Pandas, Matplotlib, and Plotly libraries for the tasks. The
use of any other libraries shall be accepted only upon the approval of the TAs.
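
The membership and likelihood values described above can be written in symbols as follows. This is a sketch assuming the standard GMM parameterization with K components, mixing weights $\pi_c$, means $\mu_c$ and covariances $\Sigma_c$; these symbols are introduced here for illustration and do not appear in the handout.

```latex
r_{ic} = \frac{\pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}
              {\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)},
\qquad
\log L = \sum_{i=1}^{N} \log \sum_{c=1}^{K} \pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)
```

For each sample the memberships sum to one over the components, whereas the log-likelihood is a single number for the whole set of N samples.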

2.1 GMM: Gaussian Mixture Models [25]

This task requires you to implement the EM algorithm for GMM and perform
clustering operations on a given dataset(s). The list of subtasks is given below.
• Find the parameters of GMM associated with the customer-dataset, using
the EM method. Vary the number of components, and observe the
results. Implement GMM in a class which has the routines to fit data (e.g.
gmm.fit(data, number of clusters)), a routine to obtain the parameters, a
routine to calculate the likelihoods for a given set of samples and a routine
to obtain the membership values of data samples.

Figure 1: Scatter plot of the Wine Dataset after PCA with 2 principal
components. (One axis represents Principal Component 1 and the other
Principal Component 2.)

• Perform clustering on the wine-dataset using Gaussian Mixture Model
(GMM) and K-Means algorithms. Find the optimal number of clusters
for GMM using BIC (Bayesian Information Criterion) and AIC (Akaike
Information Criterion). Reduce the dataset dimension to 2 using Principal
Component Analysis (PCA), plot scatter plots for each of the clusterings
mentioned above, analyze your observations and report them. Also,
compute the silhouette scores for each clustering and compare the results.
You are free to use sklearn for the dataset, PCA, and Silhouette Score
computation.
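
The class layout requested above could look roughly like the sketch below. It is a minimal full-covariance EM implementation in NumPy, not a reference solution: only `fit(data, number of clusters)` is named in the handout, so the other method names (`get_params`, `log_likelihood`, `membership`, `bic`, `aic`), the initialization scheme, and the small ridge added to each covariance are assumptions.

```python
import numpy as np

class GMM:
    """Minimal EM for a full-covariance Gaussian mixture (illustrative sketch)."""

    def fit(self, data, n_clusters, n_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        n, d = data.shape
        # Assumed initialization: equal weights, random samples as means, shared covariance
        self.weights = np.full(n_clusters, 1.0 / n_clusters)
        self.means = data[rng.choice(n, n_clusters, replace=False)].astype(float)
        self.covs = np.array([np.cov(data, rowvar=False) + 1e-6 * np.eye(d)
                              for _ in range(n_clusters)])
        prev = -np.inf
        for _ in range(n_iter):
            r = self.membership(data)                      # E-step: responsibilities
            nk = r.sum(axis=0)                             # M-step: effective counts
            self.weights = nk / n
            self.means = (r.T @ data) / nk[:, None]
            for c in range(n_clusters):
                diff = data - self.means[c]
                self.covs[c] = (r[:, c, None] * diff).T @ diff / nk[c] + 1e-6 * np.eye(d)
            ll = self.log_likelihood(data)
            if abs(ll - prev) < tol:                       # stop when the likelihood stalls
                break
            prev = ll
        return self

    def get_params(self):
        return self.weights, self.means, self.covs

    def _log_joint(self, data):
        """log(pi_c * N(x_i | mu_c, Sigma_c)) for every sample i and component c."""
        n, d = data.shape
        out = np.empty((n, len(self.weights)))
        for c, (w, mu, cov) in enumerate(zip(self.weights, self.means, self.covs)):
            diff = data - mu
            _, logdet = np.linalg.slogdet(cov)
            maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
            out[:, c] = np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        return out

    def log_likelihood(self, data):
        lj = self._log_joint(data)
        m = lj.max(axis=1, keepdims=True)                  # log-sum-exp for stability
        return float(np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))))

    def membership(self, data):
        lj = self._log_joint(data)
        lj -= lj.max(axis=1, keepdims=True)
        r = np.exp(lj)
        return r / r.sum(axis=1, keepdims=True)

    def n_parameters(self):
        k, d = self.means.shape
        return (k - 1) + k * d + k * d * (d + 1) // 2      # weights + means + covariances

    def bic(self, data):
        return self.n_parameters() * np.log(len(data)) - 2 * self.log_likelihood(data)

    def aic(self, data):
        return 2 * self.n_parameters() - 2 * self.log_likelihood(data)

# Synthetic two-blob data (hypothetical, not the customer or wine dataset)
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
gmm = GMM().fit(data, 2)
r = gmm.membership(data)
```

To choose the number of clusters, one could fit the model for a range of values and compare `bic` and `aic`, where lower values are preferred. EM is sensitive to initialization, so running several seeds per setting is a common precaution.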

3 Problem 3
Hierarchical clustering is a popular method for grouping objects. It creates
groups so that objects within a group are similar to each other and different
from objects in other groups. Clusters are visually represented in a hierarchical
tree called a dendrogram. [Reference for hierarchical clustering and linkages]
Use only NumPy, Pandas, Matplotlib, and Plotly libraries for the tasks. The
use of any other libraries shall be accepted only upon the approval of the TAs.

3.1 Hierarchical Clustering: Linkages and Features [25]

This task requires you to implement hierarchical clustering and perform
clustering on a given dataset(s). The list of subtasks is given below. You are
expected to implement the required functionality using classes and methods.
We expect to see routines like hc.linkages(X, linkage type) (takes the data and
provides the linkage matrix) and hc.dendogram(Z) (takes the linkage matrix and
plots a dendrogram).

Figure 2: Example of a dendrogram, obtained from a different dataset.
The vertical axis represents individual points. The horizontal one
represents the distance between clusters.

• Perform hierarchical clustering on the dataset and obtain the linkage
matrix. Vary the linkages and features used and state your observations.
Plot the dendrogram using the linkage matrix.
• Perform hierarchical clustering on the gene expression dataset and obtain
the linkage matrix. Vary the linkages and features used and state your
observations. (In the dataset, you are given 58 genes, their respective
expression levels for 12 proteins and their IDs, resulting in a 58 × 13
matrix). Plot the dendrogram using the linkages obtained.
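
A linkage routine of the kind described above could be sketched as below. This is a naive agglomerative implementation in NumPy with Euclidean distances. The class name `HC` and the keyword values "single", "complete" and "average" are assumptions. The output layout, with one row per merge holding the two merged cluster ids, the merge distance and the new cluster size, follows the convention of SciPy's linkage matrix, which is also an assumption about what the graders expect. Dendrogram plotting is left out of the sketch.

```python
import numpy as np

class HC:
    """Naive agglomerative clustering (illustrative sketch, cubic time or worse)."""

    def linkages(self, X, linkage_type="single"):
        n = len(X)
        # Pairwise Euclidean distances between all points
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        clusters = {i: [i] for i in range(n)}      # cluster id -> member indices
        Z = []
        next_id = n                                # ids of merged clusters start at n
        while len(clusters) > 1:
            ids = list(clusters)
            best = (np.inf, None, None)
            for a in range(len(ids)):
                for b in range(a + 1, len(ids)):
                    block = dist[np.ix_(clusters[ids[a]], clusters[ids[b]])]
                    if linkage_type == "single":
                        d = block.min()            # closest pair of members
                    elif linkage_type == "complete":
                        d = block.max()            # farthest pair of members
                    elif linkage_type == "average":
                        d = block.mean()           # mean over all member pairs
                    else:
                        raise ValueError(f"unknown linkage type: {linkage_type}")
                    if d < best[0]:
                        best = (d, ids[a], ids[b])
            d, i, j = best
            merged = clusters.pop(i) + clusters.pop(j)
            clusters[next_id] = merged
            Z.append([i, j, d, len(merged)])
            next_id += 1
        return np.array(Z, dtype=float)

# Four toy points (hypothetical data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 3.0]])
Z_single = HC().linkages(X, "single")
Z_complete = HC().linkages(X, "complete")
```

On the toy points, single linkage first merges the two points at distance 1, then the pair at distance 3, and finally joins the two groups at distance 5. Complete linkage performs the same merges but joins the two groups at the largest cross-group distance instead.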

4 Problem 4
For this section, you are free to use external libraries to solve the question(s).
Submit separate notebooks for each of the following sub-questions in this section.

4.1 Problem 4.1 [20]

You have been provided a dataset of 99 different shapes, KIMIA-99. The task
is to align the remaining shapes based on the orientation of the given
template shape. Along with the code, write the flowchart of the algorithm that
you will be using to implement this task in the Jupyter Notebook itself
as a MARKDOWN.
Attached is an example of the task (Example).
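
One possible approach to the alignment task, sketched here under stated assumptions, is to estimate each shape's dominant axis from the covariance of its foreground pixel coordinates and rotate the shape by the difference from the template's axis. The handout does not prescribe a method. The function names are hypothetical, the shapes are assumed to be binary masks, and the angle is measured in image coordinates (y increasing downward).

```python
import numpy as np

def principal_angle(mask):
    """Angle in degrees, in [0, 180), of the dominant axis of a binary shape."""
    ys, xs = np.nonzero(mask)                       # foreground pixel coordinates
    coords = np.stack([xs, ys], axis=1).astype(float)
    coords -= coords.mean(axis=0)                   # centre on the shape's centroid
    cov = coords.T @ coords / len(coords)
    eigvals, eigvecs = np.linalg.eigh(cov)
    vx, vy = eigvecs[:, np.argmax(eigvals)]         # direction of largest spread
    return float(np.degrees(np.arctan2(vy, vx)) % 180.0)

def rotation_to_template(shape_mask, template_mask):
    """Rotation in degrees that maps the shape's dominant axis onto the template's."""
    return (principal_angle(template_mask) - principal_angle(shape_mask)) % 180.0

# Toy masks (hypothetical): a horizontal bar as template, a vertical bar as shape
template = np.zeros((20, 20), dtype=np.uint8)
template[10, 2:18] = 1
shape = np.zeros((20, 20), dtype=np.uint8)
shape[2:18, 10] = 1
angle = rotation_to_template(shape, template)
```

The dominant axis is only defined up to 180 degrees, so a shape may end up flipped relative to the template; resolving that would need an extra cue, such as comparing overlap with the template for both candidates. The rotation itself could be applied with an image library routine such as `scipy.ndimage.rotate`, after checking its sign convention against the image coordinate system.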

4.2 Problem 4.2 [20]
The task is to determine the optimal horizontal and vertical Euclidean distance
thresholds between bounding boxes containing words on a document page. The
objective of this task is to establish connections between boxes within a
paragraph while ensuring that boxes across paragraphs and columns remain
unconnected. Attached are illustrative examples showcasing the desired box
connections and a sample visualization of the expected output (ATTACHMENT).
You have also been given the following scripts -
• To visualize the enclosing boxes.
• Script to visualize the connecting boxes is there in the above attachment.
The input for this script is a dataframe object with the following
attributes.
– ID: The ID number of the word, stored as an int.
– Top-Left, Bottom-Right, Top edge center, Bottom edge center, Right
edge center, Left edge center: A list containing the x and y
coordinates of the respective point, as indicated by the attribute
name.
– Top box, Bottom box, Right box, Left Box: A list containing the
distance and ID of the nearest neighbour in the Top, Bottom, Right and
Left directions respectively. HINT: To remove a connection,
make sure that the list is [-1, 0].
NOTE: Make all necessary modifications to the script and then
run the script.
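
Once candidate thresholds have been chosen, removing the connections that exceed them could be sketched as follows. This is a minimal illustration that assumes each neighbour entry is a `[distance, id]` list and that the dataframe columns are named exactly as in the attribute list above; the function names are hypothetical, and how the thresholds themselves are selected is left to your analysis.

```python
def prune_link(link, threshold):
    """Return [-1, 0] if the neighbour is farther than the threshold, else the link."""
    distance, _neighbour_id = link
    if distance < 0 or distance > threshold:       # already removed, or too far away
        return [-1, 0]
    return list(link)

def apply_thresholds(df, horizontal_threshold, vertical_threshold):
    """Prune the four neighbour columns of the word-box dataframe (column names assumed)."""
    out = df.copy()
    for col in ("Left Box", "Right box"):
        out[col] = out[col].apply(lambda link: prune_link(link, horizontal_threshold))
    for col in ("Top box", "Bottom box"):
        out[col] = out[col].apply(lambda link: prune_link(link, vertical_threshold))
    return out

# Toy links (hypothetical values)
kept = prune_link([12.0, 7], threshold=20)
removed = prune_link([35.0, 9], threshold=20)
```

The pruned dataframe returned by `apply_thresholds` could then be passed to the provided visualization script to check whether boxes in different paragraphs or columns are still connected.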

4.3 Problem 4.3 [20]

For this problem, you shall be provided with a dataset of 2-dimensional points
of various colors. The data you will be given is in the form of an array, where
each element, X, represents a point in the 2D color space. The data has been
generated from 7 distinct Gaussian color components. The list of subtasks is
given below.
• Find the likely color components which generate the dataset.
• Create a function which takes as input the number of components
(an integer, n), the means (a NumPy array of shape (n, 2)) and the
covariances (a NumPy array of shape (n, 2, 2)), and generates a sample
dataset from the n likely components described by these parameters.
State your observations.
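
The sampling function described above could be sketched as below. It is a minimal version using NumPy. The handout specifies only the number of components, the means and the covariances, so the extra arguments (`n_samples`, `weights`, `seed`) and the default of equal mixing weights are assumptions.

```python
import numpy as np

def sample_gmm(n, means, covariances, n_samples=1000, weights=None, seed=None):
    """Draw points from a 2D Gaussian mixture with n components."""
    means = np.asarray(means, dtype=float)
    covariances = np.asarray(covariances, dtype=float)
    if means.shape != (n, 2) or covariances.shape != (n, 2, 2):
        raise ValueError("expected means of shape (n, 2) and covariances of shape (n, 2, 2)")
    # Assumption: equal mixing weights unless the caller provides them
    weights = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    labels = rng.choice(n, size=n_samples, p=weights)   # pick a component per point
    points = np.empty((n_samples, 2))
    for c in range(n):
        idx = labels == c
        if idx.any():
            points[idx] = rng.multivariate_normal(means[c], covariances[c],
                                                  size=int(idx.sum()))
    return points, labels

# Two tight, well-separated components (hypothetical parameters)
points, labels = sample_gmm(2, [[0, 0], [10, 10]], [np.eye(2) * 0.01] * 2,
                            n_samples=500, seed=0)
```

Comparing a scatter plot of the generated points with the original data is one way to judge whether the estimated components are plausible.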

5 Relevant Readings
This section contains some reading material regarding the assignment, which
may assist you in solving or understanding the questions, coupled with some
resources to gain deeper knowledge regarding the topics. This section is
intended as just some help, and it is not graded or evaluated.
• Reference for Gaussian Mixture Model

• Reference for Bayesian Information Criterion

• Reference for hierarchical clustering
