ISE 291 Introduction to Data Science
Term 212
Homework #6

[Cover figure: data science process wheel with the stages Prerequisites, Understand, Prepare, Analyze, Model, Validate, Interpret.]
Homework Guidelines
To receive full credit, you should make sure you adhere to the following guidelines. For any questions/comments, contact your section instructor.
Problem #A (50 marks)
Consider data given in CSV file HW6DataA and the following data description:
+ Note: Solve all the above questions using Python. Use Pandas, Seaborn, Sklearn, etc. libraries for all the above analysis.
Do the following tasks using data given in HW6DataA and Table-1:
A-1: Regression. Given a regression problem along with the input columns and output column, describe
the steps to build a regression model. Explain how the regression model can be used for predicting
the output column values.
A-2: Regularization. Discuss in detail the potential uses of both Ridge and LASSO regression. How are they different from OLS regression?
A-3: Cross-Validation. In both Ridge and LASSO regression, which technique do we use to select the
best value for α?
A-4: Given Data. Read and display the data given in HW6DataA. Refer to Table-1 for the data
description.
A-5: OLS Regression. Build an OLS regression model for predicting the Final score of each student. Consider the following:
• All the variables except StdID, Gender, and Final shall be considered as input variables.
• Train the model using 70% of the data and use the rest for testing. Set random state to 42.
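A minimal sketch of one possible setup for task A-5, assuming the file name HW6DataA.csv and the column names StdID, Gender, and Final from Table-1; it is a sketch, not a prescribed solution:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

dfA = pd.read_csv('HW6DataA.csv')                     # assumed file name
X = dfA.drop(columns=['StdID', 'Gender', 'Final'])    # all variables except StdID, Gender, Final
y = dfA['Final']                                      # output: Final score
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
ols = LinearRegression().fit(X_train, y_train)        # OLS fit on the training data
print(ols.intercept_, ols.coef_)                      # estimated betas
print(ols.score(X_test, y_test))                      # R^2 on the testing data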
A-6: LASSO and Ridge. Using the same training data from the OLS model (task A-5), estimate the coefficients (betas) using LASSO and Ridge regression. Obtain the best value of α using 10-fold cross-validation. Compare and comment on the coefficients of the three models. Compare the performance of the OLS model against the LASSO and Ridge models on the testing data.
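A sketch for task A-6, reusing X, X_train, X_test, y_train, y_test, and ols from the A-5 sketch above; the candidate grid of α values is an assumption:

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 2, 50)                                  # assumed candidate alphas
lasso = LassoCV(alphas=alphas, cv=10).fit(X_train, y_train)      # 10-fold CV selects alpha
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
print('best alpha (LASSO):', lasso.alpha_, '| best alpha (Ridge):', ridge.alpha_)
coefs = pd.DataFrame({'OLS': ols.coef_, 'LASSO': lasso.coef_, 'Ridge': ridge.coef_},
                     index=X.columns)
print(coefs)                                                     # compare the betas of the three models
for name, model in [('OLS', ols), ('LASSO', lasso), ('Ridge', ridge)]:
    print(name, 'test R^2:', model.score(X_test, y_test))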
A-7: SISO Regression. Using the closed-form method (formula), build a SISO regression model to predict the Final score. Use the variable with the highest regression coefficient obtained by LASSO as the input variable (say, top variable). Using the corresponding testing data, compare the performance of the SISO model (top variable vs Final score) with that of LASSO reported in A-6. Also, depict top variable vs Final score.
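For task A-7, the closed-form least-squares estimates of y = β0 + β1·x are β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β0 = ȳ − β1·x̄. A sketch continuing from the A-6 sketch above; top_var is a hypothetical name for the top variable:

from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

top_var = X.columns[np.argmax(lasso.coef_)]                    # variable with the highest LASSO coefficient
x_tr, y_tr = X_train[top_var].values, y_train.values
b1 = ((x_tr - x_tr.mean()) * (y_tr - y_tr.mean())).sum() / ((x_tr - x_tr.mean()) ** 2).sum()
b0 = y_tr.mean() - b1 * x_tr.mean()
y_pred = b0 + b1 * X_test[top_var].values                      # predictions on the testing data
print('SISO test R^2:', r2_score(y_test, y_pred))              # compare with LASSO from A-6
plt.scatter(X_test[top_var], y_test)                           # top variable vs Final score
plt.plot(X_test[top_var], y_pred, color='red')                 # fitted SISO line
plt.xlabel(top_var); plt.ylabel('Final'); plt.show()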
Problem #B (50 marks)
Consider data given in CSV file HW6DataB and the following data description of hypothetical samples
of gilled mushrooms:
+ Note: Solve all the above questions using Python. Use Pandas, Seaborn, Sklearn, etc. libraries
for all the above analysis.
Do the following tasks (in exact sequence) using data given in HW6DataB and Table-2:
B-1: Entropy. What do we measure with entropy? What does it mean to say that the entropy is 0? What is the use of entropy in a decision tree?
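For reference, the entropy of a column with class proportions p1, ..., pk is H = −Σ pi·log2(pi), and it is 0 when a single class has proportion 1. A small helper, sketched here for reuse in later tasks (the function name is an assumption):

import numpy as np
import pandas as pd

def entropy(s):
    """Entropy (in bits) of a pandas Series of class labels."""
    p = s.value_counts(normalize=True).values     # class proportions
    return float(-(p * np.log2(p)).sum())         # 0 when the column is pure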
B-2: Given Data. Read and display the data. Display the unique values for each column.
B-3: Decision Tree. Build a decision tree classifier for predicting the class label. Consider the following:
• All the features (input columns) shall be considered as input to the model. You can do the necessary transformations for the input columns.
• Fit the model using 75% of the data and use the rest for testing. Set random state to 110, criterion to entropy, and splitter to best.
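A minimal sketch for task B-3, assuming the file name HW6DataB.csv, an output column named 'class' (see Table-2), ordinal encoding as the input transformation, and that random state 110 applies to both the split and the tree:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

dfB = pd.read_csv('HW6DataB.csv')                                    # assumed file name
Xb = OrdinalEncoder().fit_transform(dfB.drop(columns=['class']))     # encode categorical inputs
yb = dfB['class']
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, train_size=0.75, random_state=110)
dt = DecisionTreeClassifier(criterion='entropy', splitter='best', random_state=110)
dt.fit(Xb_train, yb_train)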
B-4: Information Gain. Calculate the Information Gain (IG) for the class variable given the feature selected as the root node.
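For task B-4, IG(class, A) = H(class) − Σv (|Sv|/|S|)·H(class | A = v). A sketch reusing the entropy() helper from B-1; information_gain, feat_cols, and root_feature are hypothetical names:

def information_gain(df, feature, target):
    h_before = entropy(df[target])                          # H(target)
    h_after = sum(len(g) / len(df) * entropy(g[target])     # weighted entropy of each subset
                  for _, g in df.groupby(feature))
    return h_before - h_after

feat_cols = dfB.drop(columns=['class']).columns
root_feature = feat_cols[dt.tree_.feature[0]]               # feature chosen at the root node
print(information_gain(dfB, root_feature, 'class'))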
B-5: Classification Rules. Write all the classification rules from the decision tree classifier.
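One possible way to read the rules off the fitted tree for task B-5 (feature names taken from feat_cols in the B-4 sketch, which assumes the encoded inputs keep the original column order):

from sklearn.tree import export_text
print(export_text(dt, feature_names=list(feat_cols)))   # one rule per root-to-leaf path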
B-6: Association Rules. Write the association rule of the form “bruises → gill-size” that has the highest support. Write the corresponding support and accuracy.
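For task B-6, the support of a rule “bruises = a → gill-size = b” is P(bruises = a and gill-size = b), and its accuracy (confidence) is P(gill-size = b | bruises = a). A sketch using a crosstab, with the column names taken from the rule:

ct = pd.crosstab(dfB['bruises'], dfB['gill-size'])   # joint counts
support = ct / len(dfB)                              # P(bruises=a and gill-size=b)
accuracy = ct.div(ct.sum(axis=1), axis=0)            # P(gill-size=b | bruises=a)
a, b = support.stack().idxmax()                      # rule with the highest support
print('bruises =', a, '-> gill-size =', b,
      '| support =', support.loc[a, b], '| accuracy =', accuracy.loc[a, b])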
B-7: Naïve Bayes. Using the same training data from B-3, fit a Naïve Bayes classifier. Use the CategoricalNB classifier. Set alpha to 0.001, class_prior to None, and fit_prior to True.
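A minimal sketch for task B-7, reusing the ordinal-encoded training data from the B-3 sketch (CategoricalNB expects non-negative integer-encoded categories):

from sklearn.naive_bayes import CategoricalNB

nb = CategoricalNB(alpha=0.001, class_prior=None, fit_prior=True)
nb.fit(Xb_train, yb_train)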
B-8: Metrics. Using the same test data from B-3, compare the performance of the Decision Tree with the Naïve Bayes classifier in terms of accuracy, precision, and recall. Print the confusion matrix for both classifiers. Which classifier showed better performance?
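A sketch for task B-8, reusing Xb_test, yb_test, dt, and nb from the sketches above; the positive label 'e' is an assumption and should be replaced by whichever class is treated as positive:

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

for name, model in [('Decision Tree', dt), ('Naive Bayes', nb)]:
    pred = model.predict(Xb_test)
    print(name,
          '| accuracy:', accuracy_score(yb_test, pred),
          '| precision:', precision_score(yb_test, pred, pos_label='e'),   # 'e' is assumed
          '| recall:', recall_score(yb_test, pred, pos_label='e'))
    print(confusion_matrix(yb_test, pred))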
Consider the following Python methods, available in native Python or the numpy/pandas/sklearn libraries:
+ Note: You must use the help() function from Python to answer all the above questions.
Consider data given in HW6DataC.csv, taken from a public repository [1]. The data is related to the tic-tac-toe game. Specifically, the database shows possible board configurations at the end of a tic-tac-toe game between two players, ‘x’ and ‘o’. In all the given board configurations, player ‘x’ played the first move. Each board end configuration is presented in a row (record), and there are 958 instances. Each end configuration is represented by 9 features, corresponding to the nine tic-tac-toe boxes. All nine features contain exactly one of the following values: ‘x’, ‘o’, or ‘b’, where an ‘x’ indicates player ‘x’ took the box, an ‘o’ indicates player ‘o’ took the box, and a ‘b’ indicates the box is blank at the end of the game. The output column is ‘win-for-x’, where a ‘True’ value indicates player ‘x’ was the winner in that game instance, and a ‘False’ value indicates the game was either a draw or player ‘o’ was the winner.
Do the following tasks using data given in HW6DataC:
D-1: Given Data. Read and display the data. Count the number of rows and columns in the data. Count the number of non-null rows for each column. Display the description of both numeric and non-numeric columns.
D-2: Entropy & Information Gain. Do the following:
• Identify the entropy of the output column.
• Identify the input column(s) with the maximum information gain. Report any ties.
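A sketch for task D-2, reusing the entropy() and information_gain() helpers sketched under Problem B and assuming the file name HW6DataC.csv and the output column ‘win-for-x’:

dfC = pd.read_csv('HW6DataC.csv')                                   # assumed file name
features = dfC.columns.drop('win-for-x')
print(entropy(dfC['win-for-x']))                                    # entropy of the output column
ig = pd.Series({f: information_gain(dfC, f, 'win-for-x') for f in features})
print(ig.sort_values(ascending=False))                              # equal top values indicate ties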
+ Note: Solve all the above questions using Python (not by hand). Use Pandas, Seaborn, SkLearn,
etc. libraries for all the above analysis.
[1] UCI repository.
Explain the following Python codes. Assume df represents an existing pandas DataFrame, where the columns are C1, C2, ..., C14, O1. The columns with odd numbers are categorical, and the columns with even numbers are numerical. The column labelled 'O1' is the output column. 'Set1' and 'Set2' are two random subsets of the rows of the dataframe. Also, assume that the relevant libraries are imported before executing the following code:
Code-1:
P = df['C1'].value_counts()/len(df.index)
print(P)
Code-2:
np.average(df['C2'].values, weights=df['C4'].values)
Code-3:
Code-4:
Code-5:
Code-6:
Code-7:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
print(np.c_[np.ones(len(df.index)), X])