Decision Tree
A decision tree is a machine learning classifier that makes predictions based on previous data. It
works like a series of sequential “if … then” statements that new data is fed through to get a
result.
To demonstrate decision trees, let’s take a look at an example. Imagine we want to predict whether
Mike is going to go grocery shopping on any given day. We can look at previous factors that led Mike
to go to the store:
Here we can see the amount of grocery supplies Mike had, the weather, and whether Mike worked
each day. Green rows are days he went to the store, and red rows are days he didn’t. The goal of a
decision tree is to try to understand why Mike goes to the store, and apply that to new data later on.
Let’s divide the first attribute up into a tree. Mike can either have a low, medium, or high amount of
supplies:
Figure 2. Our first split
Here we can see that Mike never goes to the store if he has a high amount of supplies. This is called
a pure subset, a subset with only positive or only negative examples. With decision trees, there is no
need to break a pure subset down further.
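A purity check is simple to express in code. The following is a minimal sketch, not part of the original material; the function name is_pure is hypothetical, and the labels are the three High Supplies rows from the example data.

```python
def is_pure(labels):
    """Return True when every example in the subset has the same label."""
    return len(set(labels)) <= 1


# "Shopped?" labels for the High Supplies rows in the example data.
high_supplies = ["No", "No", "No"]
# A made-up mixed subset, for contrast.
mixed_subset = ["Yes", "No", "Yes"]

print(is_pure(high_supplies))  # True
print(is_pure(mixed_subset))   # False
```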
Let’s break the Med Supplies category into whether Mike worked that day:
Here we can see we have two more pure subsets, so this branch is complete. We can replace any pure
subsets with their respective answer - in this case, yes or no.
Finally, let’s split the Low Supplies category by the Weather attribute:
Figure 4. Our third split
Now that we have all pure subsets, we can create our final decision tree:
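Because a decision tree is a series of “if … then” statements, a finished tree can be written by hand as a function. The sketch below is illustrative only: the function name will_shop is hypothetical, and since the figure is not reproduced here, the leaf values for the Med and Low branches are assumptions chosen to agree with the rows of example data shown later in this module.

```python
def will_shop(supplies, weather, worked):
    """Hand-written sketch of a small decision tree for the shopping example."""
    if supplies == "High":
        # Pure subset described in the text: Mike never shops with high supplies.
        return "No"
    if supplies == "Med":
        # Assumed leaves: Mike shops only on days he did not work.
        return "No" if worked == "Yes" else "Yes"
    # supplies == "Low". Assumed leaves: Mike shops only when it is raining.
    return "Yes" if weather == "Raining" else "No"


print(will_shop("High", "Sunny", "Yes"))  # No
print(will_shop("Med", "Sunny", "No"))    # Yes
```

Classifying a new day only requires following one path from the top of the function to a return statement, which mirrors walking the tree from the root to a leaf.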
Motivation
Decision trees are easily created, visualized, and interpreted. Because of this, they are typically the
first method used to model a dataset. The hierarchical structure and categorical nature of a decision
tree make it highly intuitive to implement. For a reasonably balanced tree, the depth grows roughly
logarithmically with the number of data points, so larger datasets tend to impact the tree creation
process less than they do for many other classifiers. Classifying a new data point only requires
following a single path from the root to a leaf, so prediction time is proportional to the depth of
the tree rather than to the size of the dataset.
Splitting (Induction)
Decision trees are created through a process of splitting called induction, but how do we know when
to split? We need a recursive algorithm that determines the best attributes to split on. One such
algorithm is the greedy algorithm:
This process is repeated until every node contains examples with a single target value, or until
splitting adds no value to a prediction. Because the best attribute is always chosen first, the
attribute at the root node is the single best classifier for the whole dataset.
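One way to sketch greedy induction is as a recursive function that picks the attribute with the highest information gain, splits on it, and repeats on each subset. This is a minimal sketch under stated assumptions rather than a definitive implementation: the names gini, gain, and build_tree are hypothetical, Gini impurity (covered in the next section) is assumed as the cost function, and the data is limited to five example rows, which may not be the full dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gain(rows, attr, target):
    """Parent impurity minus the size-weighted impurity of the children."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([r[target] for r in rows]) - weighted


def build_tree(rows, attrs, target):
    """Greedy recursive induction; returns a label or a nested dict."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the subset is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: gain(rows, a, target))
    # Stop when even the best split adds (almost) no information.
    if gain(rows, best, target) <= 1e-12:
        return majority
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


# Five rows from the shopping example (possibly a partial dataset).
columns = ["Supplies", "Weather", "Worked", "Shopped"]
data = [
    ("High", "Sunny", "No", "No"),
    ("High", "Raining", "No", "No"),
    ("Low", "Raining", "No", "Yes"),
    ("Med", "Sunny", "No", "Yes"),
    ("High", "Sunny", "Yes", "No"),
]
rows = [dict(zip(columns, d)) for d in data]

tree = build_tree(rows, ["Supplies", "Weather", "Worked"], "Shopped")
print(tree)
```

On these five rows alone, Supplies separates the outcomes completely, so the sketch stops after a single split. With more data, the recursion would continue into the impure branches.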
Cost of Splitting
The cost of a split is determined by a cost function. The goal of using a cost function is to split the
data in a way that can be computed and that provides the most information gain.
For classification trees, which predict a class label rather than a numeric value, we can compute
information gain using Gini Impurity:
Ref: https://sebastianraschka.com/faq/docs/decision-tree-binary.html
To calculate information gain, we first start by computing the Gini Impurity of our root node. Let’s
take a look at the data we used earlier:
Day   Supplies   Weather   Worked?   Shopped?
D6    High       Sunny     No        No
D7    High       Raining   No        No
D10   Low        Raining   No        Yes
D11   Med        Sunny     No        Yes
D12   High       Sunny     Yes       No
Our root node is the target variable, whether Mike is going to go shopping. To calculate its Gini
Impurity, we need to find the sum of probabilities squared for each outcome and subtract this result
from one:
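Written as a formula, with p_i denoting the fraction of examples that fall in outcome i and k the number of outcomes (this notation is chosen here for illustration). The worked number uses only the five rows visible in the table above (2 Yes, 3 No), so it is an example of the arithmetic and not necessarily the impurity of the full dataset:

```latex
G = 1 - \sum_{i=1}^{k} p_i^{2}
\qquad
G_{\text{rows shown}}
  = 1 - \left[ \left(\tfrac{2}{5}\right)^{2} + \left(\tfrac{3}{5}\right)^{2} \right]
  = 1 - \tfrac{13}{25}
  = 0.48
```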
Let’s calculate the Gini Information Gain if we split on the first attribute, Supplies. We have three
different categories we can split by - Low, Med, and High. For each of these, we calculate its Gini
Impurity:
As you can see, the impurity for High supplies is 0. This means that if we split on Supplies and receive
High input, we immediately know what the outcome will be. To determine the Gini Information Gain
for this split, we compute the root’s impurity minus the weighted average of each child’s impurity:
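A minimal sketch of that computation for the Supplies split follows. The function names gini and gini_gain are hypothetical, and the input is limited to the five rows visible in the table above, so the resulting number illustrates the method and may differ from the value for the full dataset.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_gain(parent_labels, children):
    """Root impurity minus the size-weighted average of child impurities."""
    total = len(parent_labels)
    weighted = sum(len(c) / total * gini(c) for c in children)
    return gini(parent_labels) - weighted


# (Supplies, Shopped?) pairs for the five rows visible in the table.
rows = [("High", "No"), ("High", "No"), ("Low", "Yes"),
        ("Med", "Yes"), ("High", "No")]

shopped = [label for _, label in rows]

# Group the target labels by Supplies category.
by_supplies = {}
for supplies, label in rows:
    by_supplies.setdefault(supplies, []).append(label)

supplies_gain = gini_gain(shopped, list(by_supplies.values()))
print(round(gini(by_supplies["High"]), 2))  # impurity of the High branch
print(round(supplies_gain, 2))              # gain of splitting on Supplies
```

In this five-row subset every Supplies category happens to be pure, so the gain equals the root impurity. With the complete data, the Low and Med branches contain mixed outcomes and the gain would be smaller.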
We continue this pattern for every possible split, then choose the split that gives us the highest
information gain value. Maximizing information gain leaves us with the most polarized splits possible,
lowering the probability new input is incorrectly classified.
Pruning
A decision tree built from a sufficiently large dataset may end up with an excessive number of
splits, each with decreasing usefulness. A highly detailed decision tree can even lead to overfitting,
discussed in the previous module. Because of this, it’s beneficial to prune less important splits of a
decision tree away. Pruning involves calculating the information gain of each ending sub-tree (the
leaf nodes and their parent node), then removing the sub-tree with the least information gain:
Ref: http://www.cs.cmu.edu/~bhiksha/courses/10-601/decisiontrees/
As you can see, the sub-tree is replaced with its most common outcome, becoming a new leaf. This
process can be repeated until you reach a desired complexity level, tree height, or information gain
amount. Information gain can be tracked and stored as the tree is built to save time when pruning as
well. Each model should make use of its own pruning algorithm to meet its needs.
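The pruning step described above can be sketched as follows. This is one possible approach under stated assumptions, not a standard algorithm from a library: the node layout (dicts with "gain", "counts", and "children" keys, with plain strings as leaves), the function names, and all numbers in the example tree are hypothetical, and the information gain of each split is assumed to have been stored while the tree was built.

```python
def is_leaf(node):
    """Leaves are plain labels; internal nodes are dicts."""
    return not isinstance(node, dict)


def ending_subtrees(node, parent=None, key=None):
    """Yield (parent, key, node) for internal nodes whose children are all leaves."""
    if is_leaf(node):
        return
    if all(is_leaf(c) for c in node["children"].values()):
        yield parent, key, node
    for k, child in node["children"].items():
        yield from ending_subtrees(child, node, k)


def prune_once(root):
    """Replace the ending sub-tree with the least gain by its majority label.

    The root itself is never replaced. Returns True if something was pruned.
    """
    candidates = [c for c in ending_subtrees(root) if c[0] is not None]
    if not candidates:
        return False
    parent, key, node = min(candidates, key=lambda c: c[2]["gain"])
    majority = max(node["counts"], key=node["counts"].get)
    parent["children"][key] = majority
    return True


# Hypothetical tree with stored gains and class counts (illustrative numbers).
tree = {
    "attr": "Supplies", "gain": 0.30, "counts": {"Yes": 6, "No": 6},
    "children": {
        "High": "No",
        "Med": {"attr": "Worked", "gain": 0.44,
                "counts": {"Yes": 2, "No": 1},
                "children": {"Yes": "No", "No": "Yes"}},
        "Low": {"attr": "Weather", "gain": 0.32,
                "counts": {"Yes": 4, "No": 1},
                "children": {"Raining": "Yes", "Sunny": "No"}},
    },
}

pruned = prune_once(tree)
print(pruned, tree["children"]["Low"])
```

Calling prune_once in a loop until the tree reaches a target height, or until the smallest remaining gain exceeds a chosen threshold, gives the repeated pruning described above.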
Conclusion
Decision trees allow you to quickly and efficiently classify data. Because they shape data into a
hierarchy of decisions, they are highly understandable by even non-experts. Decision trees are
created and refined in a two-step process - induction and pruning. Induction involves picking the best
attribute to split on, while pruning removes splits that contribute little. Because decision trees
are so simple to create and understand, they are typically the first approach used to model and
predict outcomes of a dataset.