Open Machine Learning With Decision Trees and Random Forests
Table of Contents
Abstract
1 Motivation
5 Pruning
7 Relevant Metrics
7.5 F1 Score
Abstract
The decision tree is a supervised machine learning algorithm. It is famous for being one of the most intuitive and easy-to-understand methods. This also makes it a good starter algorithm for learning the quirks of the specific dataset and problem one is trying to solve. When it comes to actual results, though, decision trees have another trick up their sleeve – they can be stacked together to form a random forest, which can outperform many other methods.
These course notes accompany the “Machine Learning with Decision Trees and Random Forests” course. They list the algorithms’ pros and cons, outline the working of the decision tree and random forest algorithms, cover in greater detail the more involved topics of Gini impurity and entropy, and discuss pruning and the relevant evaluation metrics.
1 Motivation
In this section, we motivate the use of the decision tree and random forest algorithms.
To the naked eye, decision trees might look simple at first glance – in fact, they may look way too simple to be remotely useful. But it is that very simplicity that makes them useful. Since in today’s world it is extremely easy to create very complex models with just a few clicks, many data scientists can neither understand nor explain what their model is doing. Decision trees, while performing only averagely in their basic form, are easy to understand, and when stacked together they reach excellent results.
Pros:
• Intuitive
• Easy to visualize and interpret
• In-built feature selection
• No preprocessing required
• Performs well with large datasets
• Moderately fast to train and extremely fast to predict with

Cons:
• Average results
• Trees can be unstable with respect to the train data
• Greedy learning algorithms
• Susceptible to overfitting (there are measures to counter this)
Random forests, however, are built upon many different decision trees, with different measures set in place to restrict overfitting. Thus, they obtain higher performance.
Pros:
• Gives great results
• Requires no preprocessing of the data
• Automatically handles overfitting in most cases
• Lots of hyperparameters to control
• Performs well with large datasets

Cons:
• Black box model – loses the interpretability of a single decision tree
• Depending on the number of trees in the forest, can take a while to train
• Outperformed by gradient-boosted trees
From this picture, it’s clear how the name came about – it definitely does look
like an upside-down tree, branching more and more as you go down. Now, there
are 2 main elements that make up the tree – nodes and edges/branches.
• Nodes are the black circles in the picture above. They contain the actual data. This data is, generally, not restricted to a particular type.
• Edges are the black lines connecting the different nodes. They are what link the nodes together into a hierarchy.
Some of you may have noticed that this resembles another well-known data structure – a graph. And that’s entirely correct, you can think of the tree as a graph with some additional restrictions.
Those restrictions are that a node can only be connected to nodes that are one level higher or one level lower. Moreover, every node, except the very first one, should be connected to exactly one node higher up. These rules mean that connections such as the ones illustrated below are not permitted.
From the pictures so far, we can see that a tree is a structure with the pattern of a node, connected to other nodes through edges, repeated again and again. To work with trees, we need some terminology to distinguish between the different parts. First, it’s crucial to remember that a tree has a well-defined hierarchy and we always view it from top to bottom. Then, we can define the following terms:
• Root node – this is the uppermost node, the one that starts the tree.
• Parent node – when considering a given node or subtree, the parent node is the one that is one level higher and connects to that subtree (see figure 3).
• Child node – when given a node, the ones stemming from it, one level lower, are called its child nodes.
• Leaf node – a node that has no children. This is where the tree ends (leaf nodes can appear at different levels and on different sides).
• Height – how many levels the tree has. For example, a tree consisting of a root, its children, and their children has a height of 3.
• Branching factor – this signifies how many children there are per node. If different nodes have a different number of children, we say that the tree has no definitive branching factor. In principle, there can exist trees with as big a branching factor as you want. However, an especially important special case is the binary tree, in which every node has at most two children.
One very common use of the binary tree is the binary search tree, which is used for efficient implementations of searching and sorting algorithms. There are, of course, many other applications of trees as well.
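To make these terms concrete, here is a minimal Python sketch of a binary tree, assuming nothing beyond the standard library; the class and function names are our own, purely for illustration.

```python
class TreeNode:
    """A node in a binary tree: it stores a value and links to up to two children."""

    def __init__(self, value):
        self.value = value   # the data held by the node (not restricted to a particular type)
        self.left = None     # left child (an edge going one level down)
        self.right = None    # right child (an edge going one level down)


def insert(root, value):
    """Insert a value into a binary search tree, keeping smaller values to the left."""
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root


def height(root):
    """Number of levels in the tree (0 for an empty tree)."""
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))


# Build a small binary search tree and inspect it
root = None
for v in [8, 3, 10, 1, 6]:
    root = insert(root, v)

print(height(root))      # 3 levels: the root, its children, and their children
print(root.left.value)   # 3 is the left child of the root
```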
Decision trees are a common occurrence in many fields, not just machine
learning. In fact, we commonly use this data structure in operations research and
decision analysis to help identify the strategy that is most likely to reach a goal.
The idea is that there are different questions a person might ask about a particular
problem, with branching answers that lead to other questions and respective
answers, until they can reach a final decision. But this is a very visual topic, so let’s look at a couple of examples.
As can be seen from those examples, the nodes in a decision tree hold important questions regarding the decision one wants to make. Then, the edges represent the possible answers to those questions. By answering the different questions and following the structure down, one arrives at a leaf node (marked yellow in the above illustrations) that represents the outcome (decision).
So far, these have been everyday examples rather than a machine learning problem. What would an ML decision tree look like? Well, here it is:
Figure 6: ML decision tree based on the Iris dataset. The dataset features flower petal and sepal dimensions with
the objective to predict the exact flower species
This is a real tree trained on the Iris dataset. The input features are the sizes of the petals and sepals of different Iris flowers, with the objective to classify them into 3 Iris species. The nodes now carry a lot more information, most of which is just informative to the reader, not part of the “questions”. In a machine learning context, the decision tree asks questions regarding the input features themselves (most often the question is whether a feature value is bigger or smaller than some threshold).
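For reference, a tree like the one in the figure can be trained with scikit-learn in a few lines. This is only a sketch, assuming scikit-learn is installed; the max_depth value is an illustrative choice, not the setting used for the figure.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the Iris dataset: petal/sepal measurements and the species label
iris = load_iris()
X, y = iris.data, iris.target

# Fit a small decision tree (max_depth=3 is an illustrative choice)
clf = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
clf.fit(X, y)

# Print the learned tree: each node asks whether a feature is below some threshold
print(export_text(clf, feature_names=iris.feature_names))
```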
And here lies the usefulness of decision trees. As they can be easily visualized, they offer data scientists a tool to analyze how the model makes its predictions. Moreover, since the tree is a hierarchical structure, the higher a certain node sits, the more important it is to the problem at hand. Thus, we can say that decision trees perform a form of in-built feature selection.
From the above tree, we can extract another important term for decision trees – split. Let’s consider what happens to our training set when we pass it through a decision tree. Since it consists of many different data points with different feature values, we would expect that some of the points follow the left arrow, while others follow the right one. Thus, we have effectively split our dataset in two. Then, each of the parts is further chopped in two at every subsequent node. That’s why a node in the tree is often referred to as a split.
As for the kinds of problems they can tackle, decision trees can solve both regression and classification problems.
While the decision tree itself is easy to understand, the process that creates it is slightly more complicated. Nevertheless, there are a couple of main points that we should keep in mind. First of all, the algorithms that build decision trees are greedy. Greedy algorithms are ones that work by choosing the best option available right now, without considering the whole picture. Therefore, they are fast, as there’s no need to evaluate every possible combination, but they are not guaranteed to find the truly optimal solution. Take, as an example, the problem of finding the shortest path between two cities. If you take the shortest street at every junction, you may end up in a situation where you are actually moving away from the desired destination. So, a true optimal solution can be reached only if you consider all the roads between you and the destination, not just the ones at each junction.
In practice, though, greedy algorithms usually find a solution as close as possible to the optimal one, so this doesn’t matter much. What does this mean for decision trees?
So, these algorithms construct the tree one node at a time. During the process, they look at the tree so far and decide which node would best separate the data. In other words, they look for the best way to split the data at each node. Of course, “best” needs a precise definition behind it. In decision trees, this is done by defining different metrics that quantify how “good” or “bad” a certain split is. The algorithm simply tries to minimize or maximize those. So, the really important part is the actual metrics. The two most popular ones are Gini impurity and information gain (entropy). Let’s take a look at those.
Before we dive in, it is worth noting what really matters for the algorithms –
the number of samples from each class that is present in the node. They don’t look
at the entire tree, but rather take it one node at a time. In this context, a node may
be called a “split”, since we split our data in 2 – for Yes or No. Of course, some
splits are more useful than others. For instance, one that funnels all of the data to
the left branch and none of it to the right might be a really bad split, since we
haven’t actually changed anything. The metrics’ job is to quantify exactly how
good or bad a certain split is. Gini impurity is one such metric.
It is computed as:

\[ \mathrm{Gini} = \sum_{i=1}^{n} p_i\,(1 - p_i) \]
where \(p_i\) is the proportion of samples in the i-th class with respect to the whole set of data. The data considered here is the data present at the node we are computing this metric for. This will be the whole dataset for the root node, but it will necessarily become a smaller and smaller subset the deeper down the tree we go.
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were labeled at random according to the distribution of labels in the subset. In simple words, the idea is to find how much would change in the node if we shuffled the data in it at random.
Let’s say we have a dataset with 2 classes – red and blue. Imagine we have a node that contains 10 data points. Now, suppose these samples have the following distribution:
Figure 7: Data inside a node - 3 data points of the red class and 7 data points of the blue class
So far, every data point is in the correct bin, so we might say that we have no misclassifications.
In this case, Gini impurity will try to measure how the accuracy would change if we randomly shuffled those 10 samples. We still need to have 3 samples in the red bin and 7 samples in the blue bin. However, the bins are no longer guaranteed to contain only red and blue data points, respectively. In other words, there is now a chance of misclassification, and Gini impurity measures what the expected misclassification rate is. The bigger it is, the bigger the Gini impurity. The algorithms, thus, try to minimize the Gini impurity. The smallest possible Gini is 0 and it is achieved when all samples in the node are of a single class.
So, a random shuffle of the data above may look like this:
Figure 8: The same data in the node, but now randomly shuffled. Notice that there is some "misclassification".
Now, we can see that not all red points are in the red bin. Likewise for the blue data points. Thus, there is some misclassification. Gini impurity tries to quantify exactly how much of it we would expect on average.
So, how does this relate to the formula expressed above? Well, let’s take another look at it:

\[ \mathrm{Gini} = \sum_{i=1}^{n} p_i\,(1 - p_i) \]
The process of randomly shuffling the data can be divided into 2 steps:
1. \(p_i\) represents the probability that we pick a data point of the i-th class at random
2. \((1 - p_i)\) represents the probability that this data point then ends up in the wrong bin
So, we multiply those two probabilities and sum them for every different class present in the node. For our example node of 3 red and 7 blue samples, we get:

For the red class: \( \frac{3}{10}\left(1 - \frac{3}{10}\right) = \frac{3}{10} \times \frac{7}{10} = \frac{21}{100} \)

For the blue class: \( \frac{7}{10}\left(1 - \frac{7}{10}\right) = \frac{7}{10} \times \frac{3}{10} = \frac{21}{100} \)

And, summing those, we get \( \mathrm{Gini} = \frac{21}{100} + \frac{21}{100} = \frac{42}{100} = 0.42 \).
In this manner, we can obtain the Gini metric of both child nodes of a split. By combining them (weighted by the number of samples in each one), we can judge how good the split is as a whole.
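As a quick sanity check of the arithmetic above, here is a small sketch that computes the Gini impurity of a node from its class counts and the weighted impurity of a split; the function names are ours, not part of any library.

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples from each class."""
    total = sum(counts)
    proportions = [c / total for c in counts]
    return sum(p * (1 - p) for p in proportions)


def split_gini(left_counts, right_counts):
    """Weighted Gini impurity of a split, combining the two child nodes."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)


# The example node: 3 red and 7 blue samples
print(gini([3, 7]))                 # 0.42, matching the calculation above

# A perfectly pure node has impurity 0
print(gini([10, 0]))                # 0.0

# Comparing two candidate splits of the same 10 samples
print(split_gini([3, 0], [0, 7]))   # 0.0   -> a perfect split
print(split_gini([2, 5], [1, 2]))   # ~0.419 -> barely better than the parent's 0.42
```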
Information gain is another metric to measure how good a certain split is. It is often used as an alternative to Gini impurity and is based on entropy – a metric that measures how much information there is in a set. Its formula is:
\[ H = -\sum_{i=1}^{n} p_i \log_2 p_i \]
where \(p_i\) is, again, the proportion of samples in the i-th class. To evaluate a split, the algorithms compute the weighted entropy of the child nodes and subtract it from the entropy of the parent in order to find how much information is gained by the split. Their goal is to maximize this information gain (maximize the information contained in the tree). They will continue to run attempts of splitting the data until the information gain is 0 and no more useful splits can be found.
One difference between the two metrics is that Gini impurity tends to favor larger partitions, while the information gain favors smaller ones. However, in practice, there is not much difference between the two. Researchers estimate that the choice of metric matters in only about 2% of the cases. In the rest, the decision whether to use Gini or entropy makes little difference, and since entropy involves computing logarithms, which is slower, most implementations default to Gini.
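In the same spirit, here is a minimal sketch of entropy and information gain (again, the function names are ours):

```python
from math import log2


def entropy(counts):
    """Entropy of a node, given the number of samples from each class."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in proportions)


def information_gain(parent_counts, left_counts, right_counts):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = sum(parent_counts)
    weighted_children = (sum(left_counts) / n) * entropy(left_counts) \
                      + (sum(right_counts) / n) * entropy(right_counts)
    return entropy(parent_counts) - weighted_children


# The 3-red / 7-blue node from before
print(entropy([3, 7]))                           # ~0.881 bits

# A perfect split recovers all of the parent's entropy...
print(information_gain([3, 7], [3, 0], [0, 7]))  # ~0.881

# ...while a useless split (all data to one side) gains nothing
print(information_gain([3, 7], [3, 7], [0, 0]))  # 0.0
```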
5 Pruning
In this section, we look at a technique for dealing with overfitting in decision trees.
Now, even though decision trees are a relatively simple model, they have a tendency to overfit. A lot. This is expressed in the tree having way too many nodes and levels, like the one shown below.
Figure 9: A heavily overfitted decision tree with an extremely large number of nodes.
We can see that this is an overly complicated tree. This not only reduces the performance of the model itself, but also negates one of the main advantages of decision trees – their interpretability.
Luckily, it is not all gloom and doom. There is a technique to deal with this overfitting and it’s called pruning. As an example, here is how the same tree looks after pruning:
Figure 10: The same tree as the one above, after pruning. It is much simpler and has 5% better accuracy.
It is exactly what it says on the tin. Think of how you trim bushes and plants –
this is practically the same thing. In essence, pruning is a technique that removes
parts of the tree that are not necessary for the final classification. It reduces the
complexity of the final classifier, and hence improves predictive accuracy by the
reduction of overfitting.
Pruning processes can be divided into two types – pre-pruning and post-pruning. Pre-pruning is done during the training process itself, while post-pruning is done after the tree is already generated. In practice, post-pruning is the more common approach.
In terms of pruning algorithms, there are many. But here are a couple:
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected, then the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
A more involved technique is cost complexity pruning. Here, a complexity parameter \(\alpha\) is used to define the cost-complexity measure, \(R_\alpha(T)\), of a given tree \(T\):

\[ R_\alpha(T) = R(T) + \alpha\,|\tilde{T}| \]

where \(R(T)\) is the total impurity (or misclassification rate) of the tree’s leaves and \(|\tilde{T}|\) is the number of leaf nodes.
In simple words, the algorithm identifies the subtree with the smallest contribution, as measured by the cost-complexity metric, and cuts it off from the tree, repeating the process until the effective \(\alpha\) for the whole tree is large enough.
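For reference, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of its tree estimators. The sketch below assumes scikit-learn is available; the dataset and the final alpha value are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree, free to grow until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The pruning path lists the effective alphas at which subtrees would be removed
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas[:5])   # candidate alpha values, starting from 0 (no pruning)

# Retrain with a non-zero alpha: the weakest subtrees are pruned away
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())      # pruned tree is much smaller
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))
```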
6 Random Forest
The random forest algorithm is one of the few non-neural network models that give very high accuracy for both regression and classification tasks. It simply
gives good results. And while decision trees do provide great interpretability,
when it comes down to performance, they lose against random forests. In fact,
unless transparency of the model is a priority, almost every data scientist and
analyst will use random forests over decision trees. So, let’s see what this algorithm
is made of.
The random forest is an ensemble method – a collection of models whose combined predictions perform better than any of the individual algorithms alone. So, in our case, the collection of decision trees as a whole unit behaves much better than any stand-alone decision tree. In short, random forests are ensembles of many decision trees.
The more observant among you might have noticed, though, that through the process of creating many different trees, we lose one of the important properties that decision trees had in the first place – namely, interpretability. Even though each individual tree in the collection is simple to follow, when we have hundreds of them, we can no longer comprehend what is happening at a glance. That’s why this algorithm is usually treated as a black box model – we gain additional performance and accuracy, but lose the interpretability and transparency of a single decision tree.
A logical follow-up question you might have right now is: “How do we
determine what the final result should be?”. After all, each decision tree produces
its own answer. So, we end up with hundreds of different answers. The good news
is that we can deal with this problem in a very intuitive way – by using majority
voting to determine the final outcome. In other words, we choose the most
common result.
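As a tiny illustration of majority voting, assuming nothing beyond the standard library (the votes below are made up):

```python
from collections import Counter

# Hypothetical predictions for a single sample, one vote per tree in the forest
tree_votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]

# The final prediction is the most common vote
final_prediction = Counter(tree_votes).most_common(1)[0][0]
print(final_prediction)   # "setosa" wins with 3 out of 5 votes
```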
6.2 Bootstrapping
As mentioned earlier, a single decision tree has a tendency to overfit. That’s why there are many individual decision trees, with the idea that one tree can overfit, but many will not do so in the same manner. Thus, their mistakes tend to average out when their results are combined.
A crucial part of that logic is that we don’t train all of the trees on the exact same dataset. But it is rare that we have several different datasets related to the same problem, nor can we split one dataset into a hundred parts. So, how do we create many different datasets out of a single one? Well, that is the job of a technique called bootstrapping.
Bootstrapping creates a new dataset by sampling from the original dataset with replacement. What that means is that it goes through the original dataset and copies data points at random to create the new set. However, the copied points still remain in the original set and, potentially, can be copied again (one can think of them as being moved to the new set and then replaced in the original one – that’s where “with replacement” comes from). Thus, the newly generated datasets contain no new data; it’s the same data, but some of it is repeated while some of it is left out.
Figure 12: New datasets generated through bootstrapping. Notice how some of the data is repeated.
These newly generated datasets are then used as the training data for the individual decision trees in the forest.
As a note, a dataset created in this manner that is of the same size as the original will, on average, contain only a portion of the original samples. The expected proportion of unique samples in it is \(1 - \frac{1}{e} \approx 63\%\).
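Here is a minimal sketch of bootstrapping with NumPy, which also checks the roughly 63% figure empirically; the dataset size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Pretend this is our original dataset of 1000 samples (represented by their indices)
original_indices = np.arange(1000)

# Sampling with replacement: some indices appear multiple times, others not at all
bootstrap_sample = rng.choice(original_indices, size=len(original_indices), replace=True)

unique_fraction = len(np.unique(bootstrap_sample)) / len(original_indices)
print(unique_fraction)    # ~0.63 on average
print(1 - 1 / np.e)       # 0.6321..., the theoretical value
```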
So, to recap: we generate many new datasets through bootstrapping, we feed them to decision trees, and then we collect the results and choose the final outcome through majority voting. You would be forgiven for thinking that this construct is the random forest. But actually, this is called Bagged decision trees – it stands for Bootstrap Aggregated decision trees. There is one crucial ingredient missing to turn it into a random forest.
And that is to allow each tree access to only some of the features, not all. That is right, in the forest, each tree can only see part of the input features. This is done so that the individual trees differ from one another, and the subset of features is, again, chosen at random for each tree. What we can control is the size of this subset – whether we want to consider half the features, or 70%, and so on.
This is, in essence, the random forest algorithm. All of the steps outlined above act as regularization to reduce overfitting. And so, random forests rarely overfit, if at all. This, in turn, leads to better performance of the algorithm.
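Putting it all together, scikit-learn's RandomForestClassifier implements these ideas (bootstrapping, random feature subsets, and aggregation of the trees' predictions). A brief sketch with illustrative hyperparameters:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single decision tree versus a forest of 100 bootstrapped trees,
# each considering a random subset of the features at every split (max_features)
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)

print(cross_val_score(tree, X, y, cv=5).mean())    # single tree accuracy
print(cross_val_score(forest, X, y, cv=5).mean())  # usually as good or better
```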
7 Relevant Metrics
In this section, we introduce some of the relevant metrics that can be used to evaluate the performance of a model on a classification task.
7.1 Confusion Matrix
A confusion matrix, \(C\), is constructed such that each entry, \(C_{ij}\), equals the number of samples whose true class is \(i\) and whose predicted class is \(j\). Let’s first consider a binary classification problem, where every sample belongs to only one of two classes. We denote these two classes by 0 and 1 and, for the time being, define 1 to be the positive class. This would result in the confusion matrix shown below.
                Predicted: 0    Predicted: 1
True label 0:   TN              FP
True label 1:   FN              TP

Figure 11: A 2 × 2 confusion matrix denoting the cells representing the true and false positives and negatives. Here, class 1 is defined as the positive one.
• Top-left cell – true negatives (TN). This is the number of samples whose
true class is 0 and the model has correctly classified them as such.
• Top-right cell – false positives (FP). This is the number of samples whose true class is 0, but the model has incorrectly classified them as 1.
• Bottom-left cell – false negatives (FN). This is the number of samples whose true class is 1, but the model has incorrectly classified them as 0.
• Bottom-right cell – true positives (TP). This is the number of samples whose
true class is 1 and the model has correctly classified them as such.
Now, let’s consider a problem in which each sample belongs to one of three classes, 0, 1, or 2, with class 1 again defined as the positive class. This makes classes 0 and 2 negative. The confusion matrix would then look like the one in the figure below.
                Predicted: 0    Predicted: 1    Predicted: 2
True label 0:   TN              FP              FN
True label 1:   FN              TP              FN
True label 2:   FN              FP              TN

Figure 12: A 3 × 3 confusion matrix denoting the cells representing the true and false positives and negatives. Here, class 1 is defined as the positive one.
Making use of these confusion matrices, we introduce four useful metrics for evaluating classification models.
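For completeness, such a confusion matrix can be computed with scikit-learn; the labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a binary problem (1 = positive class)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows correspond to the true label, columns to the predicted label
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[3 1]     -> 3 true negatives, 1 false positive
#  [1 3]]    -> 1 false negative, 3 true positives
```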
7.2 Accuracy
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \]
7.3 Precision

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
7.4 Recall
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
7.5 F1 Score
\[ F_1 = \frac{2}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}}} \]
The F1 score can be thought of as putting precision and recall into a single metric.
Contrary to taking the simple arithmetic mean of precision and recall, the F1 score
penalizes low values more heavily. That is to say, if either precision or recall is very
low, while the other is high, the F1 score would be significantly lower compared to
the ordinary arithmetic mean.
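Finally, a short sketch computing all four metrics, both by hand from the confusion-matrix cells and via scikit-learn, reusing the made-up labels from the previous example:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Cell counts for positive class 1 (matching the confusion matrix above)
TP, TN, FP, FN = 3, 3, 1, 1

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 / (1 / precision + 1 / recall)

print(accuracy, precision, recall, f1)   # 0.75 0.75 0.75 0.75

# The same values computed by scikit-learn
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```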
Copyright 2022 365 Data Science Ltd. Reproduction is forbidden unless authorized. All rights reserved.
Nikola Pulev
Email: [email protected]