Decision Trees
We shall examine two of the many methods for measuring leaf node purity, which lead to the
two leading algorithms for constructing decision trees:
• CART algorithm
• C5.0 algorithm
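The two algorithms differ in the purity measure they optimize: CART uses the Gini index, while C5.0 uses entropy (information gain). As a small sketch (the function names here are ours, not from any library), the two measures for a vector of class proportions:

```python
import math

def gini(p):
    # Gini impurity: 1 minus the sum of squared class proportions
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p):
    # Shannon entropy in bits; classes with proportion 0 contribute nothing
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

# A pure node scores 0 under both measures; a 50/50 node scores highest
print(gini([1.0]), gini([0.5, 0.5]))        # → 0.0 0.5
print(entropy([1.0]), entropy([0.5, 0.5]))  # → 0.0 1.0
```

Both measures are minimized (zero) when a node contains a single class, which is why either can serve as a splitting criterion.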
import pandas as pd
import numpy as np
import statsmodels.tools.tools as stattools
from sklearn.tree import DecisionTreeClassifier, export_graphviz
adult_tr = pd.read_csv("C:/.../adult_ch6_training")
For simplicity, we save the Income variable as y.
y = adult_tr[['Income']]
We have a categorical variable, Marital status, among our predictors. The CART model
implemented in the sklearn package needs categorical variables converted to a dummy variable
form. Thus, we will make a series of dummy variables for Marital status using the categorical()
command.
mar_np = np.array(adult_tr['Marital status'])
(mar_cat, mar_cat_dict) = stattools.categorical(mar_np, drop=True, dictnames = True)
We turn the variable Marital status into an array using array(), then use the categorical()
command from the stattools package to create a matrix of dummy variables for each value of
Marital status. We save the matrix and dictionary separately using (mar_cat, mar_cat_dict).
The matrix mar_cat contains five columns, one for each category in the original Marital status
variable. Each row represents a record in the adult_tr data set. Each row will have a 1 in the
column which matches the value that record had in the original Marital status variable. You can
tell which column represents which category by examining mar_cat_dict. In our case, the first
row of mar_cat has a 1 in the third column. By examining mar_cat_dict, we know the third
column represents the “Never married” category. Sure enough, the first record of adult_tr has
“Never married” as the Marital status variable value.
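Note that categorical() has been deprecated and removed in recent releases of statsmodels; if it is unavailable, pandas' get_dummies() builds the same kind of dummy matrix. A minimal sketch on hypothetical values standing in for the Marital status column:

```python
import pandas as pd

# Hypothetical stand-in for the Marital status column
marital = pd.Series(["Never married", "Divorced", "Married", "Never married"])

# One dummy column per category, in sorted order; each row has exactly one 1
mar_dummies = pd.get_dummies(marital)
print(list(mar_dummies.columns))  # → ['Divorced', 'Married', 'Never married']
```

Here the column names carry the category labels directly, so no separate dictionary like mar_cat_dict is needed.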
Now, we need to add the newly made dummy variables back into the X variables.
mar_cat_pd = pd.DataFrame(mar_cat)
X = pd.concat((adult_tr[['Cap_Gains_Losses']], mar_cat_pd), axis = 1)
We first make the mar_cat matrix a data frame using the DataFrame() command. We then use
the concat() command to attach the predictor variable Cap_Gains_Losses to the data frame of
dummy variables that represent marital status. We save the result as X.
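The column-wise concatenation can be checked on toy frames (the values here are invented for illustration):

```python
import pandas as pd

# Toy predictor column and toy dummy matrix
a = pd.DataFrame({"Cap_Gains_Losses": [0.0, 0.5]})
b = pd.DataFrame({0: [1, 0], 1: [0, 1]})

# axis = 1 stacks the frames side by side, matching rows by index
X_toy = pd.concat((a, b), axis=1)
print(X_toy.shape)  # → (2, 3)
```

With axis = 1, rows are aligned on the index, so both frames should share the same row index before concatenating.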
Before we run the CART algorithm, note that the columns of X are not labeled with the values of the Marital status variable. Examine mar_cat_dict to see that the first dummy column is for the value "Divorced," the second for "Married," and so on. Since the first column of X is Cap_Gains_Losses, we can specify the names of each column of X ourselves.
X_names = ["Cap_Gains_Losses", "Divorced", "Married", "Never-married", "Separated", "Widowed"]
It will help us when visualizing the CART model to know the levels of y as well.
y_names = ["<=50K", ">50K"]
Now, we are ready to run the CART algorithm!
cart01 = DecisionTreeClassifier(criterion = "gini", max_leaf_nodes=5).fit(X,y)
To run the CART algorithm, we use the DecisionTreeClassifier() command. The
DecisionTreeClassifier() command sets up the various parameters for the decision tree. For
example, the criterion = “gini” input specifies that we are using a CART model which utilizes
the Gini criterion, and the max_leaf_nodes input trims the CART tree to have at most the
specified number of leaf nodes. For this example, we have limited our tree to five leaf nodes. The
fit() command tells Python to fit the decision tree that was previously specified to the data. The
predictor variables are given first, followed by the target variable. Thus, the two inputs to fit()
are the X and y objects we created. We save the decision tree as cart01.
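The effect of max_leaf_nodes can be checked on synthetic data (the data and variable names below are invented for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 100 records, two numeric predictors, binary target
rng = np.random.RandomState(0)
X_toy = rng.rand(100, 2)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

# Gini criterion with at most five leaf nodes, as in cart01
tree = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=5).fit(X_toy, y_toy)
print(tree.get_n_leaves())
```

However complex the data, the fitted tree never exceeds the requested number of leaves, which is how the parameter keeps the tree readable.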
Finally, to obtain the tree structure, we use the export_graphviz() command.
export_graphviz(cart01, out_file = "C:/.../cart01.dot", feature_names = X_names, class_names = y_names)
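As an aside, export_graphviz() can also return the dot source directly by passing out_file = None, which is convenient for a quick check that the feature and class names landed in the output; a sketch using sklearn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Small tree on the iris data, limited to three leaves for readability
iris = load_iris()
tree = DecisionTreeClassifier(max_leaf_nodes=3).fit(iris.data, iris.target)

# out_file=None returns the dot source as a string instead of writing a file
dot_src = export_graphviz(tree, out_file=None,
                          feature_names=iris.feature_names,
                          class_names=iris.target_names)
print(dot_src[:13])  # → digraph Tree
```

The returned string is ordinary Graphviz dot source, so it can be rendered with any Graphviz tool.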
The first input is the decision tree itself, which we saved as cart01. The out_file input will save
the tree structure to the specified location and name the file cart01.dot. Run the contents of the
file through the graphviz package to display the CART model. Specifying feature_names =
X_names and class_names = y_names adds the predictor variable names and the target variable
values to the cart01.dot file, greatly increasing its readability. To obtain the classifications of the
Income variable for every record in the training data set, use the predict() command.
predIncomeCART = cart01.predict(X)
Using the predict() command on cart01 says that we want to use our CART model to make the
classifications. Including the predictor variables X as input specifies that we want predictions for
those records in particular. The result is the classification, according to our CART model, for
every record in the training data set. We save the predictions as predIncomeCART.
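A natural next step (not shown in the source) is to cross-tabulate the predicted classes against the actual ones with pandas' crosstab(); a sketch on hypothetical labels:

```python
import pandas as pd

# Hypothetical actual and predicted Income labels
actual = pd.Series(["<=50K", "<=50K", ">50K", ">50K", "<=50K"], name="Actual")
predicted = pd.Series(["<=50K", ">50K", ">50K", ">50K", "<=50K"], name="Predicted")

# Contingency table of actual vs. predicted classifications
table = pd.crosstab(actual, predicted)
print(table)
```

The diagonal of the table counts correct classifications, and the off-diagonal cells count the two kinds of error.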