CS F320 - Assignment II - Draft (Subject To A Few Changes in The Description of Problems)
Lab Problems
Instructions
1. You are allowed to use sklearn only for PCA and scaling operations.
2. You can use pandas, numpy, SciPy, and matplotlib for data manipulation and visualization.
3. Document all your steps clearly in your code with comments. Include plots wherever
necessary to support your analysis.
4. You can access the dataset for these lab problems through this link:
https://drive.google.com/drive/folders/1jphk1rq2yPXZKebSXZLKVMGTl25NYiMP?usp=drive_link
P1: KL Divergence Calculation
The objective of this task is to code a function to calculate KL Divergence between two
discrete probability distributions, measuring how one diverges from the other.
Problem Statement:
You are given the following probability distributions for a random variable X:
X (Random Variable) 1 2 3 4 5
Instructions:
1. KL Divergence Formula:
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
2. Steps to Implement:
○ Write a function named kl_divergence() to compute the KL Divergence using the
formula above.
○ Use the given P(X) and Q(X) distributions to calculate the KL Divergence.
○ Validate that all user inputs (if provided) are valid probability distributions.
○ Display the KL Divergence score rounded to 4 decimal places.
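The steps above can be sketched as follows, using NumPy for the sum. The example distributions at the bottom are placeholders, not the values from the assignment's table:

```python
import numpy as np

def kl_divergence(p, q):
    """Compute D_KL(P || Q) for two discrete probability distributions.

    Raises ValueError if the inputs are not valid distributions
    (non-negative entries summing to 1)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("distributions must share the same support")
    for dist in (p, q):
        if np.any(dist < 0) or not np.isclose(dist.sum(), 1.0):
            raise ValueError("inputs must be valid probability distributions")
    # By convention 0 * log(0 / q) = 0, so only sum over p(x) > 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Placeholder distributions for illustration only:
p = [0.1, 0.2, 0.3, 0.2, 0.2]
q = [0.2, 0.2, 0.2, 0.2, 0.2]
print(round(kl_divergence(p, q), 4))  # → 0.0523
```

Note that D_KL(P ‖ Q) is not symmetric: swapping the arguments generally gives a different value.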
P2: Feature Selection 1 (Greedy Feature Selection Methods)
The goal is to predict home sale prices using the optimal subset of features from the given
dataset. Two feature selection methods will be applied: greedy forward selection and greedy
backward selection.
Instructions:
1. Data Preprocessing:
○ Load the dataset.
○ Handle missing or inconsistent data.
○ Apply standardization or Min-Max scaling to the features.
○ Perform an 80:20 or 90:10 train-test split.
2. Greedy Forward Selection:
○ Start with an empty model.
○ Add features one-by-one, selecting the feature that gives the best improvement in
model performance at each step.
○ Stop when no further improvement is observed or all features are added.
3. Greedy Backward Selection:
○ Start with all features in the model.
○ Remove features one-by-one, selecting the feature whose removal leads to the
smallest decrease in performance.
○ Stop when further removal decreases performance significantly.
Input:
House_Price_Prediction.csv
Output:
● Forward Selection: List of selected features, feature count, and training/testing error.
● Backward Selection: List of selected features, feature count, and training/testing error.
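Since instruction 1 restricts sklearn to PCA and scaling, the regression model inside the selection loop can be fit with NumPy least squares. A minimal sketch of greedy forward selection, assuming the data are already numeric arrays split into train and test (function names are illustrative):

```python
import numpy as np

def add_bias(X):
    # Prepend a column of ones for the intercept term.
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_ols(X, y):
    # Ordinary least squares via the pseudo-inverse.
    return np.linalg.pinv(X) @ y

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

def forward_selection(X_tr, y_tr, X_te, y_te):
    """Greedily add the feature that most reduces test error;
    stop when no candidate improves on the current best error."""
    remaining = list(range(X_tr.shape[1]))
    selected, best_err = [], np.inf
    while remaining:
        trials = []
        for j in remaining:
            cols = selected + [j]
            w = fit_ols(add_bias(X_tr[:, cols]), y_tr)
            trials.append((mse(add_bias(X_te[:, cols]), y_te, w), j))
        err, j = min(trials)
        if err >= best_err:   # stopping rule from the instructions
            break
        best_err, selected = err, selected + [j]
        remaining.remove(j)
    return selected, best_err
```

Backward selection is symmetric: start with all feature indices, and at each step drop the feature whose removal raises the error the least, stopping when the error rises significantly.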
P3: Feature Selection 2 (Spearman Correlation Coefficient Method)
Use the dataset from Part 2 to build regression models for predicting housing prices. Find the
optimal subset of features by selecting those with the highest Spearman correlation with the
target attribute (price), then train regression models on this feature subset and evaluate their
predictive performance.
Input Format:
● Dataset from Part 2 (containing attributes such as price, bedrooms, and other property
characteristics).
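The ranking step can be sketched with `scipy.stats.spearmanr`, assuming the features are already in a numeric matrix (the function name is illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_select(X, y, feature_names, k):
    """Rank features by |Spearman rho| against the target; keep the top k."""
    scores = []
    for j, name in enumerate(feature_names):
        rho, _pval = spearmanr(X[:, j], y)
        scores.append((abs(rho), name))
    scores.sort(reverse=True)      # strongest monotone association first
    return scores[:k]
```

Because Spearman correlation works on ranks, it captures any monotone relationship with the target, not just linear ones.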
Instructions:
Output Format:
● Spearman Feature Selection: Display selected features, feature count, and the
corresponding training and testing errors.
● Comparison Table:
○ Present a table showing the training and testing errors for:
■ Greedy Forward Feature Selection
■ Greedy Backward Feature Selection
■ Spearman Correlation Coefficient Selection
■ Model with all features
Conclusion:
● Identify the feature selection method that yields the best performance.
● Discuss trade-offs, such as using fewer features vs. better predictive accuracy.
● Provide recommendations for future models based on your analysis.
P4: Prior and Posterior Distributions
A study was conducted to assess public opinion on a new smartphone brand. Let ‘p’ denote the
probability of a person liking the smartphone. Before the product launch, market analysts
assumed that ‘p’ follows a beta distribution with parameters α, β = (3, 5). After the initial survey,
70 out of 100 respondents stated they liked the smartphone. Plot the prior and posterior
probability distribution of ‘p.’
The following day, a second survey was conducted, where out of the 70 respondents who liked
the smartphone, 40 said they would not recommend it. Plot the posterior distribution of ‘p’ after
this survey.
Input Format:
● α, β = (3, 5)
● First Survey: Like - 70, Total - 100 respondents
● Second Survey: Dislike - 40, Total - 70 respondents
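With a Beta prior and binomial observations, the posterior is conjugate: Beta(α, β) updated with s successes and f failures becomes Beta(α + s, β + f). A sketch of the plots follows; note that treating the second survey's 40 "would not recommend" responses as failures for p (and the other 30 as successes) is one possible reading of the problem, flagged as an assumption in the code:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")               # render without a display
import matplotlib.pyplot as plt
from scipy.stats import beta

# Conjugate Beta-Binomial updates: Beta(a, b) + (s successes, f failures)
# -> Beta(a + s, b + f).
stages = {
    "Prior: Beta(3, 5)": (3, 5),
    "After survey 1 (70 like / 30 do not): Beta(73, 35)": (3 + 70, 5 + 30),
    # Assumption: the 40 "would not recommend" answers count as failures
    # for p and the remaining 30 as successes.
    "After survey 2: Beta(103, 75)": (73 + 30, 35 + 40),
}

p = np.linspace(0, 1, 500)
for label, (a, b) in stages.items():
    plt.plot(p, beta.pdf(p, a, b), label=label)
plt.xlabel("p")
plt.ylabel("density")
plt.legend()
plt.savefig("beta_posteriors.png")
```

The posterior mean after the first survey is 73/108 ≈ 0.676, pulled from the prior mean 3/8 toward the observed proportion 0.7.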
Output Format:
● Plots of the prior distribution and of the posterior distributions of ‘p’ after each survey.
Select features using the mutual information I(X; Y) = H(X) + H(Y) − H(X, Y), where H(X) is the
entropy of the feature, H(Y) is the entropy of the target variable, and H(X, Y) is their joint
entropy.
Input Format:
The first line contains three integers n, m, and k: n is the number of data points, m is the
number of features, and k is the number of required features. The next n lines contain m integers
each, separated by spaces; this is the input matrix X, each column of which is a feature. The
final line contains n integers separated by spaces; this is the target vector y.
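One way the top-k selection described by this input format could be implemented, estimating I(X; Y) = H(X) + H(Y) − H(X, Y) from empirical counts (function names are illustrative):

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def mutual_information(x, y):
    # I(X; Y) = H(X) + H(Y) - H(X, Y), with the joint sample as pairs.
    joint = list(zip(x, y))
    return entropy(x) + entropy(y) - entropy(joint)

def top_k_features(X, y, k):
    scores = [(mutual_information(X[:, j], y), j) for j in range(X.shape[1])]
    scores.sort(key=lambda t: (-t[0], t[1]))  # ties broken by feature index
    return [j for _, j in scores[:k]]
```

A feature identical to the target attains the maximum possible score, I(X; Y) = H(Y), while an independent feature scores near zero.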
Output Format:
Instructions:
○ Visualize Relationships:
■ Use pair plots and correlation heatmaps to visualize relationships
between key features like visit_per_week, avg_time_in_gym, and
abonement_type.
3. PCA Application:
○ Apply PCA:
■ Use sklearn's PCA to reduce dimensionality.
■ Identify how many principal components are required to cover 80-90% of
the variance.
○ Scree Plot and Cumulative Variance Plot:
■ Create a scree plot to show the explained variance ratio for each
component.
■ Plot the cumulative variance to visualize how many components are
needed to cover most of the variability.
● Discuss Findings:
○ Interpret the principal components and their importance in explaining the
variability of the dataset.
○ Comment on any interesting patterns or relationships observed in the gym
members’ data (e.g., how different membership types cluster or how visit
frequency influences PCA).
Input:
gym_customers.csv
Output:
1. Scree Plot and Cumulative Variance Plot showing explained variance by components.
2. 2D Scatter Plot of the first two principal components with loading vectors.
3. Insights from PCA results and interpretation of component meanings.
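The scaling-plus-PCA pipeline above can be sketched with sklearn (the only permitted uses of sklearn in this assignment); the helper name and the 90% variance default are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def run_pca(df, variance_target=0.9):
    """Standardize the numeric columns, fit PCA, and report how many
    components cover the requested share of the variance."""
    X = StandardScaler().fit_transform(df.select_dtypes(include=np.number))
    pca = PCA().fit(X)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    # First index where the cumulative variance reaches the target.
    n_components = int(min(np.searchsorted(cumvar, variance_target) + 1,
                           len(cumvar)))
    return pca, n_components, cumvar
```

The per-component ratios (`pca.explained_variance_ratio_`) feed the scree plot, and `cumvar` feeds the cumulative variance plot; categorical columns such as abonement_type would need encoding before being included.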
Steps:
● Train-Test Split:
○ Split the dataset into 80% training and 20% testing data.
● Implement Multiple Linear Regression:
○ Develop a linear regression model using the training set.
○ Evaluate the model's performance using metrics like R² and Mean Absolute
Error (MAE).
○ Visualize the predictions vs. actual prices on a scatter plot.
3. PCA Analysis:
● Develop a multivariate regression model using the transformed dataset with selected
principal components.
● Train and evaluate the model using the same metrics (R², MAE) for comparison with
the previous regression model.
● Visualize the PCA-based regression predictions against actual prices.
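The baseline regression and its PCA-based counterpart can be compared with the sketch below, fitting OLS via NumPy least squares (sklearn being reserved for PCA and scaling). The scaler and PCA are fit on the training split only, to avoid leaking test information; function names are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def train_eval(X_tr, X_te, y_tr, y_te):
    """Fit OLS with an intercept; return (R^2, MAE) on the test split."""
    A_tr = np.hstack([np.ones((len(X_tr), 1)), X_tr])
    A_te = np.hstack([np.ones((len(X_te), 1)), X_te])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    pred = A_te @ w
    ss_res = np.sum((y_te - pred) ** 2)
    ss_tot = np.sum((y_te - y_te.mean()) ** 2)
    return 1 - ss_res / ss_tot, float(np.mean(np.abs(y_te - pred)))

def compare(X, y, n_components, test_frac=0.2, seed=0):
    """Evaluate OLS on scaled features vs. on PCA components."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    scaler = StandardScaler().fit(X[tr])
    Xs_tr, Xs_te = scaler.transform(X[tr]), scaler.transform(X[te])
    base = train_eval(Xs_tr, Xs_te, y[tr], y[te])
    pca = PCA(n_components=n_components).fit(Xs_tr)
    pca_res = train_eval(pca.transform(Xs_tr), pca.transform(Xs_te),
                         y[tr], y[te])
    return base, pca_res
```

Plotting predictions against actual prices for both models then makes the comparison table straightforward to fill in.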
Input:
Flight_data.csv
Output: