
ISE 291
Introduction to Data Science
Term 212
Homework #6

[The HW must be submitted as one .ipynb file. Write the names & IDs of all the group members.]

[Cover figure: the data-science cycle (Prerequisites, Understand, Prepare, Analyze, Model, Validate, Interpret)]
Homework Guidelines
To receive full credit, you should make sure you adhere to the following guidelines. For any questions/comments, contact your section instructor.

Homework Presentation & Submission:


• You should submit the solutions for the FIRST TWO problems only.
• Every sub-problem (part) should be answered in a DIFFERENT CELL, as given in the template.
• EVERY CELL should have the problem and part number clearly written in the first line.
• All cells of your homework should be in CHRONOLOGICAL order. One cell per sub-problem.
• Any text should be written as a comment in the code cell. Do NOT modify a code cell into a markdown cell.
• Submit the entire HW as ONE single .ipynb document.
• Do NOT add/delete any cell in the given template.
• ONE HW per group should be submitted.
• Your NAMEs, IDs, and the homework number should be clearly indicated in the FIRST CELL of the notebook.

Problem #A (50 marks)

Consider data given in CSV file HW6DataA and the following data description:

Table 1: Data Description


Field                    Description
StdID                    Student ID (index)
Statistical background   Whether the student has a background in statistics
Python background        The student's background in Python (Excellent, Good, Fair)
Gender                   The student's gender (Male or Female)
Class level              The student's class level (Freshman, Sophomore, Junior, Senior)
Weekly studying hours    Average number of hours the student studies per week
Previous exams           Number of previous exams solved
Absences                 Number of absences throughout the semester
Class size               Number of students in the class
Mid                      Midterm score
Project score            Project score
Final                    Final score (output variable)

+ Note: Solve all the following tasks using Python. Use Pandas, Seaborn, Sklearn, etc. libraries for all the analysis.
Do the following tasks using data given in HW6DataA and Table-1:
A-1: Regression. Given a regression problem along with the input columns and output column, describe
the steps to build a regression model. Explain how the regression model can be used for predicting
the output column values.

A-2: Regularization. Discuss in detail the potential uses of both Ridge and LASSO regression. How are they different from OLS regression?
A-3: Cross-Validation. In both Ridge and LASSO regression, which technique do we use to select the
best value for α?

A-4: Given Data. Read and display the data given in HW6DataA. Refer to Table-1 for the data
description.
A-5: OLS Regression. Build an OLS regression model for predicting the Final score of each student.
Consider the following:
• All the variables except StdID, Gender, and Final shall be considered as input variables.
• Train the model using 70% of the data and use the rest for testing. Set the random state to 42.
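
A minimal sketch of one possible A-5 workflow (the file name HW6DataA.csv, the index_col choice, and one-hot encoding of the categorical inputs are assumptions, not part of the task statement):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv('HW6DataA.csv', index_col='StdID')   # assumed file and index names
    # all inputs except StdID (index), Gender, and Final; encode the categorical columns
    X = pd.get_dummies(df.drop(columns=['Gender', 'Final']), drop_first=True)
    y = df['Final']
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
    ols = LinearRegression().fit(X_train, y_train)
    print(ols.intercept_, ols.coef_)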

A-6: LASSO and Ridge. Using the same training data from the OLS model (task A-5), estimate the coefficients (betas) using LASSO and Ridge regression. Obtain the best value of α among

{10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, 10², 10³}

using 10-fold cross-validation. Compare and comment on the coefficients of the three models. Compare the performance of the OLS model against the LASSO and Ridge models on the testing data.
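
One possible approach for A-6, reusing the A-5 split from the sketch above; the alpha grid matches the set in the task:

    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.metrics import mean_squared_error

    alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
    lasso = LassoCV(alphas=alphas, cv=10, random_state=42).fit(X_train, y_train)  # 10-fold CV picks alpha
    ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train, y_train)
    print('best alphas:', lasso.alpha_, ridge.alpha_)
    print('LASSO betas:', lasso.coef_)
    print('Ridge betas:', ridge.coef_)
    for name, m in [('OLS', ols), ('LASSO', lasso), ('Ridge', ridge)]:
        print(name, 'test MSE =', mean_squared_error(y_test, m.predict(X_test)))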

A-7: SISO Regression. Using the closed-form method (formula), build a SISO regression model to predict the Final score. Use the variable with the highest regression coefficient obtained by LASSO as the input variable (say, the top variable). Using the corresponding testing data, compare the performance of the SISO model (top variable vs Final score) with that of LASSO reported in A-6. Also, plot the top variable vs the Final score.
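
A sketch of the closed-form (normal equations) fit for A-7; 'top' is a placeholder for the column name found in A-6:

    import numpy as np

    A = np.c_[np.ones(len(X_train)), X_train[top].values]       # design matrix [1, x]
    beta = np.linalg.inv(A.T @ A) @ A.T @ y_train.values        # (A'A)^-1 A'y
    print('intercept, slope:', beta)
    y_hat = beta[0] + beta[1] * X_test[top].values               # predictions for the comparison with LASSO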


Problem #B (50 marks)

Consider data given in CSV file HW6DataB and the following data description of hypothetical samples
of gilled mushrooms:

Table 2: Data Description


Field          Description
class          classes: edible=e, poisonous=p
bruises        bruises=t, no=f
gill-spacing   close=c, crowded=w, distant=d
gill-size      broad=b, narrow=n
stalk-shape    enlarging=e, tapering=t
veil-color     brown=n, orange=o, white=w, yellow=y

+ Note: Solve all the following tasks using Python. Use Pandas, Seaborn, Sklearn, etc. libraries for all the analysis.
Do the following tasks (in exact sequence) using data given in HW6DataB and Table-2:
B-1: Entropy. What do we measure with Entropy? What does it mean to say that the Entropy is 0? What is the use of Entropy in a decision tree?

B-2: Given Data. Read the data and display the data. Display the unique values for each column.
B-3: Decision Tree. Build a decision tree classifier for predicting the class label. Consider the following:
• All the features (input columns) shall be considered as input to the model. You can do the necessary transformations for the input columns.
• Fit the model using 75% of the data and use the rest for testing. Set the random state to 110, the criterion to entropy, and the splitter to best.
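
A minimal sketch for B-3 (the file name HW6DataB.csv and one-hot encoding as the transformation are assumptions; the task only fixes the split size, random state, criterion, and splitter):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    dfb = pd.read_csv('HW6DataB.csv')                      # assumed file name
    Xb = pd.get_dummies(dfb.drop(columns=['class']), drop_first=True)
    yb = dfb['class']
    Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, train_size=0.75, random_state=110)
    dt = DecisionTreeClassifier(criterion='entropy', splitter='best', random_state=110).fit(Xb_train, yb_train)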

B-4: Information Gain. Calculate the Information Gain (IG) for the class variable given the feature selected as the root node.
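
One way to compute B-4 directly with scipy, on the raw (untransformed) columns; 'root' is a placeholder for the feature sklearn placed at the root of the B-3 tree:

    from scipy.stats import entropy

    def H(s):
        # Shannon entropy (base 2) of a pandas Series
        return entropy(s.value_counts(normalize=True), base=2)

    ig = H(dfb['class']) - sum((len(g) / len(dfb)) * H(g['class'])
                               for _, g in dfb.groupby(root))
    print('IG(class | root) =', ig)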
B-5: Classification Rules. Write all the classification rules from the decision tree classifier.

B-6: Association Rules. Write the association rule of the form “bruises → gill-size” that has the highest support. Report the corresponding support and accuracy.
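
A short sketch for B-6: enumerate every rule of the form (bruises=a) → (gill-size=b), then report the pair with the highest support:

    pair = dfb.groupby(['bruises', 'gill-size']).size() / len(dfb)  # support of each (a, b) pair
    a, b = pair.idxmax()
    support = pair.max()
    accuracy = support / (dfb['bruises'] == a).mean()               # P(gill-size=b | bruises=a)
    print(f'bruises={a} -> gill-size={b}: support={support:.3f}, accuracy={accuracy:.3f}')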
B-7: Naïve Bayes. Using the same training data from B-3, fit a Naïve Bayes classifier. Use the CategoricalNB classifier. Set alpha to 0.001, class_prior=None, and fit_prior=True.
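
A sketch for B-7 with the stated settings, reusing the B-3 training split; CategoricalNB expects integer-coded categories, so the 0/1 dummies from B-3 (or an OrdinalEncoder transform of the raw columns) both work:

    from sklearn.naive_bayes import CategoricalNB

    nb = CategoricalNB(alpha=0.001, class_prior=None, fit_prior=True).fit(Xb_train, yb_train)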
B-8: Metrics. Using the same test data from B-3, compare the performance of the Decision Tree with the Naïve Bayes classifier in terms of accuracy, precision, and recall. Print the confusion matrix for both classifiers. Which classifier showed better performance?
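
A possible comparison for B-8 on the B-3 test split; the choice of pos_label below is an assumption, so set it to whichever class ('e' or 'p') you treat as positive:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

    for name, clf in [('Decision Tree', dt), ('Naive Bayes', nb)]:
        pred = clf.predict(Xb_test)
        print(name,
              'acc =', accuracy_score(yb_test, pred),
              'prec =', precision_score(yb_test, pred, pos_label='e'),
              'rec =', recall_score(yb_test, pred, pos_label='e'))
        print(confusion_matrix(yb_test, pred))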


Problem #C (Practice only. No submission required.)

Consider the following Python methods, available in native Python or in the numpy/scipy/pandas/sklearn libraries:

C-1: sklearn.model_selection.train_test_split()
C-2: sklearn.metrics.mean_squared_error()
C-3: numpy.linalg.inv()
C-4: numpy.c_
C-5: numpy.average()
C-6: numpy.exp()
C-7: scipy.stats.entropy()
C-8: sklearn.tree.plot_tree()
C-9: sklearn.tree.export_text()
C-10: sklearn.metrics.accuracy_score()
C-11: sklearn.metrics.confusion_matrix()

Answer the following questions for each of the above methods:


• List all the arguments of the method.
• Classify the arguments as positional or keyword arguments.
• Identify the data types for each of the arguments.
• Write the default values for each of the arguments.

Consider the following python classes, available in sklearn library:

C-12: sklearn.linear_model.LinearRegression()
C-13: sklearn.linear_model.RidgeCV()
C-14: sklearn.linear_model.LassoCV()
C-15: sklearn.tree.DecisionTreeClassifier()
C-16: sklearn.ensemble.RandomForestClassifier()
C-17: sklearn.naive_bayes.CategoricalNB()
C-18: sklearn.naive_bayes.GaussianNB()

Answer the following questions for each of the above classes:


• List all the methods and properties/attributes.
• Discuss any three input arguments for the above classes.
• Discuss the .fit() method.
• Discuss the .predict() method.
• Discuss the .coef_ attribute (wherever applicable).
• Discuss the .intercept_ attribute (wherever applicable).
• Discuss the .classes_ attribute (wherever applicable).

+ Note: You must use the help() function from Python to answer all the above questions.
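
For instance, C-1 can be inspected as follows; the same pattern works for every method and class listed above:

    from sklearn.model_selection import train_test_split
    help(train_test_split)   # prints the signature, the argument defaults, and the docstring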


Problem #D (Practice only. No submission required.)

Consider data given in HW6DataC.csv taken from a public repository¹. The data is related to the tic-tac-toe game. Specifically, the database shows possible board configurations at the end of a tic-tac-toe game between two players, ‘x’ and ‘o’. In all the given board configurations, player ‘x’ played the first move. Each board end configuration is presented in a row (record), and there are 958 instances. Each end configuration is represented by 9 features, corresponding to the nine tic-tac-toe boxes. All nine features contain exactly one of the following values: ‘x’, ‘o’ or ‘b’, where an ‘x’ indicates player ‘x’ took the box, an ‘o’ indicates player ‘o’ took the box, and a ‘b’ indicates the box is blank at the end of the game. The output column is ‘win-for-x’, where a ‘True’ value indicates player ‘x’ was the winner in that game instance, and a ‘False’ value indicates the game was either a draw or player ‘o’ was the winner.
Do the following tasks using data given in HW6DataC:
D-1: Given Data. Read the data and display the data. Count the number of rows and columns in the
data. Count the number of non-null rows for each column. Display the description of both numeric
and non-numeric columns.
D-2: Entropy & Information Gain. Do the following:
• Identify the entropy of the output column.
• Identify the input column(s) that have the maximum information gain. Report any ties.

D-3: Classification Rules. Do the following (a sketch for the last two bullets appears after this list):
• Transform the input and output columns appropriately, so that the data can be used for building decision trees.
• Build a decision tree with a max depth of 3. Do you see any pure leaf nodes?
• Build a decision tree with no restriction on max depth.
• Identify two classification rules, one for each of the ‘True’ and ‘False’ classes, that have the shortest length (least number of conditions on the if side).
• Build a random forest classifier with 4 estimators and a max depth of 3 for each estimator.
• Identify a classification rule that appears in 2 or more estimators.
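
A sketch for the last two bullets of D-3 (the random_state value and the names Xc, yc for the transformed inputs/output from the first bullet are assumptions):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import export_text

    rf = RandomForestClassifier(n_estimators=4, max_depth=3, random_state=0).fit(Xc, yc)
    for i, est in enumerate(rf.estimators_):
        print(f'--- estimator {i} ---')
        print(export_text(est, feature_names=list(Xc.columns)))  # compare the printed rules across trees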

+ Note: Solve all the above questions using Python (not by hand). Use Pandas, Seaborn, Sklearn, etc. libraries for all the above analysis.

¹ UCI repository.


Problem #E (Practice only. No submission required.)

Explain the following Python codes. Assume df represents an existing pandas DataFrame, where the columns are C1, C2, ..., C14, O1. The columns with odd numbers are categorical, and the columns with even numbers are numerical. The column labeled ‘O1’ is the output column. ‘Set1’ and ‘Set2’ are two random subsets of the rows of the DataFrame. Also, assume that the relevant libraries are imported before executing the following code:

Code-1:

In [1]: P = df['C1'].value_counts()/len(df.index)
        print(P)

Code-2:

In [2]: np.average(df['C2'].values, weights=df['C4'].values)

Code-3:

In [3]: from sklearn.tree import DecisionTreeClassifier
        clf = DecisionTreeClassifier(random_state=0, criterion='entropy', splitter='best')
        ndf = pd.get_dummies(df, drop_first=True)
        clf = clf.fit(ndf.drop('O1', axis=1), ndf['O1'])

Code-4:

In [4]: from sklearn.ensemble import RandomForestClassifier
        ndf = pd.get_dummies(df, drop_first=True)
        rf = RandomForestClassifier(n_estimators=6, criterion='entropy', max_depth=2,
                                    random_state=0)
        rf = rf.fit(ndf.drop('O1', axis=1), ndf['O1'])
        tree.plot_tree(rf.estimators_[2], feature_names=ndf.columns[0:-1],
                       class_names=['no', 'yes'], filled=True);

Code-5:

In [5]: from sklearn.preprocessing import OrdinalEncoder
        encoder = OrdinalEncoder()
        x = encoder.fit_transform(df.values)
        y = encoder.inverse_transform(x)[:,1]

Code-6:

In [6]: from sklearn.linear_model import LassoCV
        reg = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3],
                      fit_intercept=False, cv=10, random_state=0).fit(X_train, y_train)
        y_pred = reg.predict(X_test)

Code-7:

In [7]: X = df.iloc[:,:-1].values
        y = df.iloc[:, -1].values
        print(np.c_[np.ones(len(df.index)), X])
