Model Answers For Chapter 7: CLASSIFICATION AND REGRESSION TREES
Note 1: the variable DEP_TIME (actual departure time) cannot be used for predicting
new flights, unless we are classifying them after their departure. For this reason we
omit it from the model.
Note 2: Once binned dummies are used for the scheduled departure time, we
remove the original variable (CRS_DEP_TIME) from the analysis.
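The binning described in Note 2 can be sketched as follows. This is a hypothetical illustration (not the XLMiner procedure): the column name CRS_DEP_TIME and the sample hhmm values are assumptions for the sake of the example.

```python
import pandas as pd

# Hypothetical data: scheduled departure times in hhmm format
df = pd.DataFrame({"CRS_DEP_TIME": [615, 830, 915, 1455, 2110]})

# Bin into hours of the day, then create one dummy per hour
df["DEP_HOUR"] = df["CRS_DEP_TIME"] // 100
dummies = pd.get_dummies(df["DEP_HOUR"], prefix="DEP_HOUR")

# Once the dummies are in place, drop the original variable
df = pd.concat([df.drop(columns=["CRS_DEP_TIME", "DEP_HOUR"]), dummies], axis=1)
```

Keeping both the original variable and its dummies would duplicate the same information, which is why CRS_DEP_TIME is removed.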
Answer to 7.2.a:
Note: If variable names are too long, they might be truncated when they appear on
the tree. To see their full name, examine the “Best Pruned Tree Rules” table at the
bottom of worksheet “CT_Output1”.
Answer to 7.2.b:
We cannot use this tree, because it requires the day of month and the distance
traveled (although the latter can be inferred from the route DCA-EWR). The
redundant information is the day of week (Monday) and the arrival airport (EWR).
The tree requires knowing whether the departure airport is DCA, the day of
month, whether the distance traveled exceeds 220.5, and whether the departure
time falls between 8AM and 10AM.
Answer to 7.2.c.i:
In the best-pruned tree we get a single terminal node labeled “ontime.” Therefore
any new flight will be classified as being “on time”.
Answer to 7.2.c.ii:
This is equivalent to the naïve rule, which is the majority rule. In this dataset most of
the flights arrived on time, and therefore the naïve rule is to classify a new flight as
arriving on time.
Answer to 7.2.c.iii:
Answer to 7.2.c.iv:
The pruned tree results in a single node because adding splits does not reduce the
classification error on the validation set (see sheet “CT_PruneLog2”). With one node
the validation error rate is 19.5%; adding nodes increases the error to 20.8%.
Answer to 7.2.c.v:
A fully grown tree overfits the training data, which leads to poor
performance on new data. In contrast, the best-pruned tree is obtained by
assessing the classification accuracy of the tree on the validation set, and therefore
avoids overfitting.
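The idea of growing a full tree and then selecting the pruned subtree with the lowest validation error can be sketched with scikit-learn's cost-complexity pruning path. This is not the XLMiner procedure, and the synthetic dataset is an assumption for illustration only.

```python
# Sketch: pick the pruned tree that maximizes validation accuracy
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data
X, y = make_classification(n_samples=600, n_features=8, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)

# Candidate pruning levels, from the full tree (alpha=0) up to a single node
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)

# Evaluate each candidate subtree on the validation set
scores = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    scores.append(tree.score(X_valid, y_valid))

# Refit at the alpha with the best validation accuracy
best_alpha = path.ccp_alphas[scores.index(max(scores))]
best_tree = DecisionTreeClassifier(random_state=1, ccp_alpha=best_alpha)
best_tree.fit(X_train, y_train)
```

Because the full tree (alpha = 0) is one of the candidates, the selected tree's validation accuracy is never worse than the full tree's, and it is usually smaller — the same logic that can shrink the flight-delay tree all the way to a single node.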
Answer to 7.2.c.vi:
Our second classification tree coincides with the naïve rule (it has a single “ontime”
node). Considering that 80.55% of the flights are on time in the full dataset, the
error rate for this tree's rule of "every flight is on time" is only 19.45%. The logistic
regression has only a slightly lower overall error rate on the validation
data (18%). So it could be that there is little predictive power in the predictor
variables (regardless of method). In addition, logistic regression's improvement
might be due to the different pre-processing of the data. For example, in the logistic
regression the days of week are grouped into two categories (“Sunday or Monday” vs.
“Other”), whereas in the tree we have six dummies. Also, the departure time in the
logistic regression is broken into 16 bins, whereas the classification tree uses 8
bins. Finally, because the dataset is not very large, a model-based method such as
logistic regression (which imposes more structure) is likely to be more accurate
than a data-driven method such as the classification tree.