
Model Answers for chapter 7: CLASSIFICATION AND REGRESSION TREES

Note 1: the variable DEP_TIME (actual departure time) cannot be used for predicting
new flights, unless we are classifying them after their departure. For this reason we
omit it from the model.

Note 2: Once binned dummies are used for the scheduled departure time, we
remove the original variable (CRS_DEP_TIME) from the analysis.

Refer to the “Data_Partition1” Excel sheet in 7.2_Flightdelay.

Answer to 7.2.a:

Refer to the “CT_PruneTree1” and “CT_Output1” Excel sheets in 7.2_Flightdelay.

Note: If variable names are too long, they might be truncated when they appear on
the tree. To see their full name, examine the “Best Pruned Tree Rules” table at the
bottom of worksheet “CT_Output1”.

If (Origin_DCA > 0.5) then classify as ontime

If (Origin_DCA < 0.5) and (DAY_OF_MONTH < 3.5) then classify as ontime
If (Origin_DCA < 0.5) and (5.5 < DAY_OF_MONTH < 14.5) then classify as ontime
If (Origin_DCA < 0.5) and (14.5 < DAY_OF_MONTH < 24.5) then classify as ontime
If (Origin_DCA < 0.5) and (24.5 < DAY_OF_MONTH < 27.5) then classify as delayed
If (Origin_DCA < 0.5) and (DAY_OF_MONTH > 27.5) then classify as ontime
If (Origin_DCA < 0.5) and (3.5 < DAY_OF_MONTH < 5.5) and (Distance < 220.5) and (BinnedDEP_Time2 < 0.5) then classify as delayed
[Note: BinnedDEP_Time2 means 8AM-10AM]
If (Origin_DCA < 0.5) and (3.5 < DAY_OF_MONTH < 5.5) and (Distance < 220.5) and (BinnedDEP_Time2 > 0.5) then classify as ontime
If (Origin_DCA < 0.5) and (3.5 < DAY_OF_MONTH < 5.5) and (Distance > 220.5) then classify as ontime
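The rules above can be collected into a single function. The following is an illustrative sketch only, not XLMiner output; the function and argument names (e.g. `dep_8_to_10am` standing in for the BinnedDEP_Time2 dummy) are our own:

```python
def classify_flight(origin_dca, day_of_month, distance, dep_8_to_10am):
    """Apply the best-pruned tree's rules (sketch).

    origin_dca and dep_8_to_10am are 0/1 dummies, matching the
    0.5 split thresholds in the rules above.
    """
    if origin_dca > 0.5:                      # flights out of DCA
        return "ontime"
    # Origin is not DCA: split on day of month
    if day_of_month < 3.5:
        return "ontime"
    if 5.5 < day_of_month < 14.5:
        return "ontime"
    if 14.5 < day_of_month < 24.5:
        return "ontime"
    if 24.5 < day_of_month < 27.5:
        return "delayed"
    if day_of_month > 27.5:
        return "ontime"
    # Remaining region: 3.5 < day_of_month < 5.5
    if distance < 220.5:
        # BinnedDEP_Time2 dummy: 1 if scheduled departure is 8AM-10AM
        return "delayed" if dep_8_to_10am < 0.5 else "ontime"
    return "ontime"
```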

Answer to 7.2.b:

We cannot use this tree for the given flight, because classifying it requires knowing the day of month and the distance traveled (although the latter can be inferred from the route DCA-EWR). The redundant information is the day of week (Monday) and the arrival airport (EWR). The tree requires knowing only whether the departure airport is DCA, the day of month, whether the distance exceeds 220.5, and whether the departure time is between 8AM and 10AM.

Refer to the “CT_PruneTree2” sheet in 7.2_Flightdelay.xls.

Answer to 7.2.c.i:

In the best-pruned tree we get a single terminal node, labeled “ontime.” Therefore any new flight will be classified as “ontime.”

Answer to 7.2.c.ii:

This is equivalent to the naïve rule, which is the majority rule. In this dataset most of
the flights arrived on time, and therefore the naïve rule is to classify a new flight as
arriving on time.
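The naïve (majority) rule can be sketched in a few lines of Python; the helper name is our own, and the example labels are illustrative:

```python
from collections import Counter

def naive_rule(labels):
    """Majority-class classifier: predict the most common label for
    every new case, and report the resulting training error rate."""
    majority, count = Counter(labels).most_common(1)[0]
    error_rate = 1 - count / len(labels)
    return majority, error_rate
```

With mostly on-time flights, the rule predicts “ontime” for everything, and its error rate equals the share of delayed flights.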

Answer to 7.2.c.iii:

Refer to the “CT_FullTree2” Excel sheet in 7.2_Flightdelay.

The top three predictors are ORIGIN_DCA, DEST_JFK, and CARRIER_MQ.

Answer to 7.2.c.iv:

The pruned tree results in a single node because adding splits does not reduce the classification error on the validation set (see sheet “CT_PruneLog2”). With one node the validation error rate is 19.5%; adding nodes increases the error to 20.8%.
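The pruning criterion, choosing the smallest tree whose validation error is minimal, can be sketched as follows. The function name is our own; the node-count/error figures in the test mirror the ones reported above:

```python
def best_pruned_size(val_errors):
    """Pick the smallest tree (fewest terminal nodes) attaining the
    minimum validation error.

    val_errors maps number of terminal nodes -> validation error rate.
    """
    min_err = min(val_errors.values())
    return min(n for n, e in val_errors.items() if e == min_err)
```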

Answer to 7.2.c.v:

A fully grown tree overfits the training data, which leads to poor performance on new data. In contrast, the best-pruned tree is obtained by assessing the classification accuracy of the tree on the validation set, and therefore avoids overfitting.

Answer to 7.2.c.vi:

Our second classification tree coincides with the naïve rule (it has a single “ontime” node). Considering that 80.55% of the flights in the full data set are on time, the error rate for this tree's rule of “every flight is on time” is only 19.45%. The logistic regression has only a slightly lower overall error rate on the validation data (18%). So it could be that there is little predictive power in the predictor variables, regardless of the method. In addition, the logistic regression's improvement might be due to different pre-processing of the data. For example, in the logistic regression the days of the week are grouped into two categories (“Sunday or Monday” vs. “Other”), whereas in the tree we have six dummies. Also, the departure time in the logistic regression is broken down into 16 bins, whereas the classification tree uses 8 bins. Finally, because the dataset is not very large, a model-based method such as logistic regression (which imposes more structure) is likely to be more accurate than a data-driven method such as the classification tree.
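As a quick arithmetic check of the error rates quoted above (the variable names are ours; the figures are the ones reported in the text):

```python
# Error rates reported in the text.
ontime_share = 0.8055            # fraction of on-time flights in the full data
tree_error = 1 - ontime_share    # single-node tree classifies every flight "ontime"
logit_error = 0.18               # logistic regression validation error rate

# The tree's error exceeds logistic regression's by only about 1.45 points.
gap = tree_error - logit_error
```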
