Homework3
Question 1: [4 points] Explain what the bias-variance trade-off is. Describe a few techniques to
reduce bias and variance, respectively.
Question 2: [6 points] Assume the following confusion matrix of a classifier. Please compute its
1) precision,
2) recall, and
3) F1-score.
                     Predicted results
                     Class 1    Class 2
Actual    Class 1       50         30
values    Class 2       40         60
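The three metrics can be checked with a few lines of Python. This is a minimal sketch that assumes the rows of the matrix are actual values, the columns are predicted results, and Class 1 is treated as the positive class; if Class 2 is taken as positive, the roles of the cells swap.

```python
# Cells of the confusion matrix from the question.
# Assumption: rows = actual, columns = predicted, Class 1 = positive class.
tp, fn = 50, 30   # actual Class 1: predicted Class 1 / predicted Class 2
fp, tn = 40, 60   # actual Class 2: predicted Class 1 / predicted Class 2

precision = tp / (tp + fp)                          # share of predicted positives that are correct
recall = tp / (tp + fn)                             # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: precision=0.5556 recall=0.6250 f1=0.5882
```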
Question 3: [10 points] Build a decision tree using the following training instances (using
information gain approach):
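The information gain criterion can be sketched as follows. The toy rows and the feature names `outlook` and `windy` are hypothetical and are not the training instances from the assignment; the helper names `entropy` and `information_gain` are also just illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data, NOT the instances from the assignment.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

# Compute the gain of every feature and split on the largest one.
gains = {f: information_gain(rows, labels, f) for f in rows[0]}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

To build the full tree, the same computation would be repeated on each child subset until the labels in a node are pure or no features remain.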
STEVENS INSTITUTE of TECHNOLOGY
Question 4: [10 points] The naïve Bayes method is an ensemble method, as we learned in
Module 5. Assume we have 3 classifiers whose predicted results are given in Table 1.
The confusion matrix of each classifier is given in Table 2. Give the final decision using the
naïve Bayes method:
Table 1: Predicted results
Classifier      Predicted class
Classifier 1    Class 1
Classifier 2    Class 1
Classifier 3    Class 2
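One common form of the naïve Bayes combination rule multiplies the class prior by each classifier's estimated probability of casting its vote given the true class. The sketch below assumes that form. The three confusion matrices in it are hypothetical placeholders, not the values of Table 2, and the function name `naive_bayes_combine` is illustrative; only the votes (Class 1, Class 1, Class 2) come from Table 1.

```python
def naive_bayes_combine(conf_mats, votes):
    """Return (winning class index, normalized supports).

    conf_mats[i][k][s] = number of samples of true class k that classifier i labeled s.
    votes[i] = class index predicted by classifier i for the new sample.
    """
    n_classes = len(conf_mats[0])
    # Class priors estimated from the row sums of the first matrix.
    class_counts = [sum(conf_mats[0][k]) for k in range(n_classes)]
    total = sum(class_counts)
    supports = []
    for k in range(n_classes):
        support = class_counts[k] / total          # prior P(class k)
        for cm, s in zip(conf_mats, votes):
            support *= cm[k][s] / sum(cm[k])       # P(classifier outputs s | class k)
        supports.append(support)
    norm = sum(supports)
    supports = [s / norm for s in supports]
    winner = max(range(n_classes), key=supports.__getitem__)
    return winner, supports

# Hypothetical confusion matrices (rows = actual, columns = predicted); NOT Table 2.
cm1 = [[40, 10], [30, 20]]
cm2 = [[30, 20], [10, 40]]
cm3 = [[35, 15], [5, 45]]
votes = [0, 0, 1]  # Table 1: Class 1, Class 1, Class 2 (as zero-based indices)

winner, supports = naive_bayes_combine([cm1, cm2, cm3], votes)
print("final decision: Class", winner + 1, "supports:", supports)
```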
Train a decision tree and a random forest on the titanic.csv dataset included in the assignment.
Step 1: Read in titanic.csv and inspect a few samples. Some features are categorical and
others are numerical. If a feature value is missing, fill it in with the average of that
feature over the other samples. Take a random 80% of the samples for training and the remaining 20% for testing.
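Step 1 could look roughly like the sketch below, assuming pandas and scikit-learn are available. The helper name `fill_and_split` and the fixed random seed are illustrative choices, and only numerical columns are mean-filled, since an average is not defined for categorical ones.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def fill_and_split(df, seed=0):
    """Fill missing numerical values with the column mean, then split 80/20 at random."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    return train_test_split(df, test_size=0.2, random_state=seed)

# Usage (file name taken from the assignment text):
#   data = pd.read_csv("titanic.csv")
#   print(data.head())                 # observe a few samples
#   train, test = fill_and_split(data)
```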
Step 2: Fit a decision tree model using independent variables ‘pclass + sex + age + sibsp’ and
dependent variable ‘survived’. Plot the full tree. Make sure ‘survived’ is a qualitative variable
taking 1 (yes) or 0 (no) in your code. You may see a tree similar to this one (the actual structure
and size of your tree can be different):
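A minimal sketch of Step 2 with scikit-learn follows. The column names come from the assignment text; the one-hot encoding of `sex` via `pd.get_dummies` and the helper names `make_xy`, `fit_full_tree`, and `plot_full_tree` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

FEATURES = ["pclass", "sex", "age", "sibsp"]

def make_xy(df):
    """Build the feature matrix and a 0/1 integer target from the data frame."""
    # 'sex' is categorical, so encode it as an indicator column.
    X = pd.get_dummies(df[FEATURES], columns=["sex"], drop_first=True)
    y = df["survived"].astype(int)   # qualitative target: 1 (yes) or 0 (no)
    return X, y

def fit_full_tree(train_df, seed=0):
    """Fit an unpruned decision tree; return it with the encoded feature names."""
    X, y = make_xy(train_df)
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    return tree, list(X.columns)

def plot_full_tree(tree, feature_names):
    """Draw the fitted tree (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt
    plt.figure(figsize=(20, 10))
    plot_tree(tree, feature_names=feature_names, class_names=["0", "1"], filled=True)
    plt.show()
```

Calling `fit_full_tree` on the training frame and passing its two return values to `plot_full_tree` draws the unpruned tree.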
Step 3: Use the GridSearchCV() function to find the best value of the max_leaf_nodes parameter for pruning the
tree. Plot the pruned tree, which should be smaller than the tree you obtained in Step 2.
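The search in Step 3 might be set up as sketched below. The candidate range 2 to 20, the 5-fold cross-validation, and the helper name `prune_by_max_leaf_nodes` are assumptions; the assignment does not specify them.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def prune_by_max_leaf_nodes(X_train, y_train, candidates=range(2, 21), seed=0):
    """Cross-validate over max_leaf_nodes and return (best tree, best value)."""
    search = GridSearchCV(
        DecisionTreeClassifier(random_state=seed),
        param_grid={"max_leaf_nodes": list(candidates)},
        cv=5,                 # assumed number of folds
        scoring="accuracy",
    )
    search.fit(X_train, y_train)
    # best_estimator_ is refit on the whole training set with the best parameter.
    return search.best_estimator_, search.best_params_["max_leaf_nodes"]
```

The returned estimator can be passed to `sklearn.tree.plot_tree` to draw the pruned tree, and its `score(X_test, y_test)` method gives the test accuracy.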
Step 4: For the pruned tree, report its accuracy on the test set for the following:
Step 5: Use the RandomForestClassifier() function to train a random forest using the value of
max_leaf_nodes you found in Step 3. You can set n_estimators to 50. Report the accuracy of
the random forest on the test set for the following:
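A sketch of the random forest step, assuming the features have already been encoded numerically and that `max_leaf_nodes` holds the value found in Step 3. The helper name `forest_test_accuracy` and the fixed seed are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def forest_test_accuracy(X_train, y_train, X_test, y_test, max_leaf_nodes, seed=0):
    """Train a 50-tree random forest and return its accuracy on the test set."""
    forest = RandomForestClassifier(
        n_estimators=50,                 # value suggested in the assignment
        max_leaf_nodes=max_leaf_nodes,   # value found by the grid search in Step 3
        random_state=seed,
    )
    forest.fit(X_train, y_train)
    return accuracy_score(y_test, forest.predict(X_test))
```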