
Nikola Pulev

Machine Learning with Decision Trees and Random Forests

Table of Contents

Abstract

1 Motivation
1.1 Decision Trees
1.2 Random Forests
2 Computer Science Background: Trees
3 What is a decision tree?
4 How are decision trees constructed?
4.1 Gini impurity
4.2 Information gain (entropy)
4.3 Which metric to use?
5 Pruning
6 From Decision Trees to Random Forests
6.1 Ensemble learning
6.2 Bootstrapping
6.3 Random Forests
7 Relevant Metrics
7.1 The Confusion Matrix
7.2 Accuracy
7.3 Precision
7.4 Recall
7.5 F1 Score

Abstract

A decision tree is a supervised machine learning algorithm for classification and regression. It is famous for being one of the most intuitive and easy-to-understand methods, which also makes it a good starter algorithm for learning the quirks of the specific dataset and problem one is trying to solve. When it comes to actual results, though, decision trees have another trick up their sleeve: they can be stacked together to form a random forest, which can outperform many other methods.

The following notes serve as a complement to the “Machine Learning with Decision Trees and Random Forests” course. They list the algorithms’ pros and cons, outline how the decision tree and random forest algorithms work, cover in greater detail the more involved topics of Gini impurity and entropy, and summarize the most commonly used performance metrics.

Keywords: machine learning algorithm, decision tree, random forest,

classification, gini impurity, information gain, pruning, ensemble learning,

bootstrapping, confusion matrix, accuracy, precision, recall, F1 score



1 Motivation

In this section, we summarize the advantages and disadvantages of both

algorithms.

1.1 Decision Trees

At first glance, decision trees might look simple; in fact, they may look way too simple to be remotely useful. But it is that simplicity that makes them useful. Since in today’s world it is extremely easy to create very complex models with just a few clicks, many data scientists can neither understand nor explain what their model is doing. Decision trees, while performing only averagely in their basic form, are easy to understand, and when stacked they reach excellent results.

Table 1: Pros and cons of decision trees.

Pros:
• Intuitive
• Easy to visualize and interpret
• In-built feature selection
• No preprocessing required
• Performs well with large datasets
• Moderately fast to train and extremely fast to predict with

Cons:
• Average results
• Trees can be unstable with respect to the training data
• Greedy learning algorithms
• Susceptible to overfitting (there are measures to counter this)

1.2 Random Forests

Random forests are built from many different decision trees, with various measures set in place to restrict overfitting. Thus, they obtain higher performance.

Table 2: Pros and cons of random forests.

Pros:
• Gives great results
• Requires no preprocessing of the data
• Automatically handles overfitting in most cases
• Lots of hyperparameters to control
• Performs well with large datasets

Cons:
• Black-box model: loses the interpretability of a single decision tree
• Depending on the number of trees in the forest, can take a while to train
• Outperformed by gradient-boosted trees

2 Computer Science Background: Trees

In order to discuss decision trees, we have to clear up the meaning of ‘tree’

in a programming context. In computer science, a tree is a specific structure used

to represent data. It might look something like this.

Figure 1: A typical tree in computer science

From this picture, it’s clear how the name came about – it definitely does look

like an upside-down tree, branching more and more as you go down. Now, there

are 2 main elements that make up the tree – nodes and edges/branches.

• Nodes are the black circles in the picture above. They contain the

actual data. This data is, generally, not restricted to a particular type.

• Edges are the black lines connecting the different nodes. They are

often called branches.

You might recognize these two elements from a different mathematical structure: the graph. And that’s entirely correct; you can think of a tree as a graph with additional restrictions.



Those restrictions are that a node can be connected to a different node that

is either one level higher, or one level lower. Moreover, every node, except the very

first one, should be connected to exactly one node higher up. These rules mean

that connections such as the ones illustrated below are not permitted.

Figure 2: The highlighted connections are forbidden in a typical tree structure

From the pictures so far, we can see that a tree is a structure with the pattern

of a node, connected to other nodes through edges, repeated again and again

recursively to create the whole tree. Thus, it is a good idea to be able to

distinguish between the different parts. First, it’s crucial to remember that a tree

has a well-defined hierarchy and we always view it from top to bottom. Then, we

can identify the following elements:

• Root node – this is the uppermost node, the one that starts the tree

• Parent node – when considering a subset of the tree, the parent

node is the one that is one level higher and connects to that subtree

(see figure 3)

• Child node – when given a node, the ones stemming from it, one

level lower, are its children (see figure 3)



• Leaf node – A node that has no children. This is where the tree

terminates (Note that a tree can terminate at different points on

different sides)

• Height – how many levels the tree has. For example, we can say that

the tree from figure 1 has a height of 4

• Branching factor – this signifies how many children there are per node. If different nodes have different numbers of children, we can say that the tree has no definitive branching factor. In principle, trees can have as large a branching factor as you want. However, an extremely popular variant has at most 2 branches per node. This type of tree is called a binary tree.

Figure 3: The relationship between a parent node and its child nodes

One very common use of binary trees is the binary search tree, which enables efficient implementations of searching and sorting algorithms. There are, of course, many other uses, including decision trees.
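To make this concrete, here is a minimal sketch of a binary tree node in Python; the class name, payload, and helper function are purely illustrative and not part of the course material.

```python
class TreeNode:
    """A node in a binary tree: holds some data and links to at most two children."""
    def __init__(self, data):
        self.data = data     # the payload stored in the node
        self.left = None     # left child node (None if absent)
        self.right = None    # right child node (None if absent)

def height(node):
    """Number of levels in the subtree rooted at this node."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

# A tiny tree: a root with two children, so at most 2 branches per node (a binary tree)
root = TreeNode("root")
root.left = TreeNode("left child")
root.right = TreeNode("right child")
print(height(root))  # prints 2
```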



3 What is a decision tree?

Decision trees are a common occurrence in many fields, not just machine

learning. In fact, we commonly use this data structure in operations research and

decision analysis to help identify the strategy that is most likely to reach a goal.

The idea is that there are different questions a person might ask about a particular problem, with branching answers that lead to other questions and their respective answers, until they can reach a final decision. But this is a very visual topic, so let’s just look at two examples:

Figure 4: Decision tree example about method of transportation based on weather



Figure 5: Decision tree example: deciding whether to accept a job offer

As can be seen from those examples, the nodes in a decision tree hold

important questions regarding the decision one wants to make. Then, the

edges/branches represent the possible answers to those questions. By answering

the different questions and following the structure down, one arrives at a leaf node

(marked yellow in the above illustrations) that represents the outcome (decision).

The decision trees so far, however, do not represent a machine learning problem. What would an ML decision tree look like? Well, here it is:

Figure 6: ML decision tree trained on the Iris dataset. The dataset features flower petal and sepal dimensions, with the objective of predicting the exact flower species

This is a real tree trained on the Iris dataset. The input features are the sizes of the petals and sepals of different Iris flowers, with the objective of classifying them into 3 Iris species. The nodes now carry a lot more information, most of which is just informative to the reader, not part of the “questions”. In a machine learning context, the decision tree asks questions about the input features themselves (most often, the question is whether a value is bigger or smaller than some threshold).

And here lies the usefulness of decision trees. As they can be easily visualized, they offer data scientists the tools to analyze how the model makes predictions. Moreover, since the tree is a hierarchical structure, the higher a certain node is, the more important it is to the problem at hand. Thus, we can say that decision trees incorporate feature selection automatically.



From the above tree, we can extract another important term for decision trees: the split. Let’s consider what happens to our training set when we pass it through a decision tree. Since it consists of many different data points with different feature values, we would expect that some of the points follow the left arrow, while others follow the right one. Thus, we have effectively split our dataset in two. Then, each of the parts is further chopped in two at every subsequent node. That’s why a node is often referred to as a split during training.

So, what types of decision trees are there?

Well, decision trees can solve both regression and classification problems. Popular implementations of the algorithm include ID3, C4.5, and CART. CART (Classification And Regression Tree) is especially important here, as it is the algorithm that sklearn chose for its implementation of decision trees.
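As a brief sketch of how this looks in practice, here is how one might fit sklearn’s DecisionTreeClassifier on the Iris dataset; the particular hyperparameters and the train/test split are illustrative choices, not prescriptions from the course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the Iris data and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)

# CART tree using Gini impurity (the default criterion); depth capped to keep it readable
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))                           # accuracy on the held-out data
print(export_text(tree, feature_names=iris.feature_names))  # the learned splits as text
```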



4 How are decision trees constructed?

While the decision tree itself is easy to understand, the process that creates

it is slightly more complicated. Nevertheless, there are a couple of main points that

we can discuss. So, let’s take a look at them.

Decision trees are generated through greedy algorithms. Greedy

algorithms are ones that work by choosing the best option available right now,

without considering the whole picture. Therefore, they are fast, as there’s no need

to go through all of the possibilities, but can sometimes produce suboptimal

solutions. Take, as an example, the problem of finding the shortest path between

two cities. If you take the shortest street at every junction, you may end up in a

situation where you are actually going away from the desired destination. So, a

true optimal solution can be reached only if you consider all the roads between

you and the destination, not just the ones at each junction.

Nevertheless, greedy algorithms usually do a good enough job of finding a solution close to the optimal one that this drawback doesn’t matter much. What matters is that they are fast.

So, these algorithms construct the tree one node at a time. During the

process, they look at the tree so far, and decide which node would best separate

the data. In other words, they look for the best way to split the data at each node.

Here, the word “best” is subjective, so we need to assign concrete meaning to it. In decision trees, this is done by defining different metrics that quantify

how “good” or “bad” a certain split is. The algorithm simply tries to minimize or

maximize those. So, the really important part is the actual metrics. The two most popular ones are Gini impurity and information gain (entropy). Let’s take a look at those.

4.1 Gini impurity

Before we dive in, it is worth noting what really matters for the algorithms: the number of samples from each class that are present in the node. They don’t look at the entire tree, but rather take it one node at a time. In this context, a node may be called a “split”, since we split our data in two, for Yes or No. Of course, some splits are more useful than others. For instance, one that funnels all of the data to the left branch and none of it to the right would be a really bad split, since we haven’t actually changed anything. The metrics’ job is to quantify exactly how good or bad a certain split is. Gini impurity is one such metric.

The formula for Gini impurity is:

$$\mathrm{Gini} = \sum_{i=1}^{n} p_i (1 - p_i) = 1 - \sum_{i=1}^{n} p_i^2,$$

where $p_i$ is the proportion of samples in the i-th class with respect to the whole set of data. The data considered here is the data present at the node we are computing this metric for. This will be the whole dataset for the root node, but it will necessarily become a smaller and smaller subset the deeper down the tree we go.

Gini impurity (named after Italian mathematician Corrado Gini) is a measure

of how often a randomly chosen element from the set would be incorrectly

labeled if it was randomly labeled according to the distribution of labels in the



subset. In simple words, the idea is to measure how much misclassification would appear in the node if we randomly shuffled our data. Allow me to illustrate with an example.

Let’s say we have a dataset with 2 classes – red and blue. Imagine we have a

node that contains 10 data points. Now, suppose these samples have the

following class distribution – 3 of them are red, and 7 are blue.

Figure 7: Data inside a node - 3 data points of the red class and 7 data points of the blue class

So far, every data point is in the correct bin, so we might say that we have no

misclassifications.

In this case, Gini impurity tries to measure how the accuracy would change if we randomly shuffled those 10 samples. We still need to have 3 samples in the red bin and 7 samples in the blue bin. However, the bins are no longer guaranteed to contain only red and blue data points, respectively. In other words, some of the data is misclassified. Essentially, we use this measure to identify what the misclassification rate is. The bigger it is, the bigger the Gini impurity. The algorithms, thus, try to minimize the Gini impurity. The smallest possible Gini is 0, and it is achieved when all samples in the node are of a single class.

So, a random shuffle of the data above may look like this:

Figure 8: The same data in the node, but now randomly shuffled. Notice that there is some "misclassification".

Now, we can see that not all red points are in the red bin, and likewise for the blue data points. Thus, there is some misclassification. Gini impurity tries to measure the average rate of precisely this misclassification.

So, how does this relate to the formula expressed above? Well, let’s take another look at it:

$$\mathrm{Gini} = \sum_{i=1}^{n} p_i (1 - p_i)$$

The process of randomly shuffling the data can be divided into 2 steps:

1. First, we pick one datapoint at random,

2. Then we place it in one of the bins at random.

We can see those two steps expressed mathematically in the formula:

1. $p_i$ represents the probability that we pick a datapoint of the i-th class at random

2. $(1 - p_i)$ represents the probability that it is placed in a different class bin

So, we multiply those two probabilities and sum them for every different class present in the node. For our example node of 3 red and 7 blue samples, we can compute the Gini impurity to be:

For the red class: $\frac{3}{10}\left(1 - \frac{3}{10}\right) = \frac{3}{10} \times \frac{7}{10} = \frac{21}{100}$

For the blue class: $\frac{7}{10}\left(1 - \frac{7}{10}\right) = \frac{7}{10} \times \frac{3}{10} = \frac{21}{100}$

And, summing those, we get $\mathrm{Gini} = \frac{42}{100} = 0.42$

In this manner, we can obtain the Gini metric of both child nodes of a split. By combining them (weighted by the number of samples in each one), we can judge whether the split is good or not.
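As a minimal sketch, the Gini impurity of a node and the weighted impurity of a split can be computed as follows; the function names and the label-list representation are illustrative assumptions, not part of the course material.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Impurity of a split: the children's Gini values weighted by their sizes."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

node = ["red"] * 3 + ["blue"] * 7   # the example node: 3 red and 7 blue samples
print(gini(node))                   # ≈ 0.42, matching the calculation above
```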

4.2 Information gain (entropy)

This is another metric to measure how good a certain split is. It is often

called entropy because of a very similar metric in information theory. Entropy is a

metric that measures how much information there is in a set. Its formula is:

$$\mathrm{Entropy} = -\sum_{i=1}^{n} p_i \log_2 p_i$$

Information gain’s job is to compute the entropy (information content) in the

child nodes and subtract it from the entropy of the parent in order to find how

much information can be gained by making the split.



Tree-generating algorithms using this metric try to maximize it (that is, maximize the information contained in the tree). They will continue attempting to split the data until the information gain is 0 and no more information can be squeezed out of the data.
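As a minimal sketch, and reusing the same label-list representation as the Gini example above, the information gain of a split could be computed like this; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a node: minus the sum of p_i * log2(p_i) over the classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["red"] * 3 + ["blue"] * 7
left, right = ["red"] * 3, ["blue"] * 7   # a perfect split of the example node
print(information_gain(parent, left, right))   # ≈ 0.88: the parent's entire entropy is gained
```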

4.3 Which metric to use?

In theory, Gini impurity favors bigger partitions, or distributions, whereas information gain favors smaller ones. However, in practice, there is not much difference between the two. Researchers have estimated that the choice of metric matters in only about 2% of cases. In the rest, the decision whether to use Gini or entropy won’t influence your results.

In fact, since information gain requires computing logarithms, it turns out to be a bit more computationally expensive. That’s why most implementations default to Gini.

5 Pruning

In this section, we outline the technique of pruning in the context of

decision trees.

Now, even though decision trees are a relatively simple model, they have a

tendency to overfit. A lot. This is expressed in the tree having way too many

nodes and splits, going on forever. Here is an example of an overfitted tree:

Figure 9: A heavily overfitted decision tree with an extremely large number of nodes.

We can see that this is an overly complicated tree. This not only reduces the performance of the model itself, but also negates one of the main advantages of decision trees: being easy to visualize and understand.



Luckily, it is not all gloom and doom. There is a technique to deal with this overfitting, and it’s called pruning. As an example, here is how the same tree looks after pruning:

Figure 10: The same tree as the one above, after pruning. It is much simpler and has 5% better accuracy.

Now this is a much better-looking tree. It even exhibits 5% better accuracy.

So, how does pruning work?

It is exactly what it says on the tin. Think of how you trim bushes and plants –

this is practically the same thing. In essence, pruning is a technique that removes

parts of the tree that are not necessary for the final classification. It reduces the

complexity of the final classifier, and hence improves predictive accuracy by the

reduction of overfitting.

Pruning processes can be divided into two types: pre-pruning and post-pruning. Pre-pruning is done during the training process itself, while post-pruning is done after the tree has already been generated. In practice, post-pruning is by far the more popular method.

In terms of pruning algorithms, there are many. But here are a couple:

Reduced error pruning

One of the simplest forms of pruning is reduced error pruning. Starting at the

leaves, each node is replaced with its most popular class. If the prediction

accuracy is not affected, then the change is kept. While somewhat naive, reduced

error pruning has the advantage of simplicity and speed.

Minimal cost-complexity pruning

This is arguably the most popular pruning algorithm. It is parameterized by $\alpha \ge 0$, known as the complexity parameter. The complexity parameter is used to define the cost-complexity measure, $R_\alpha(T)$, of a given tree $T$:

$$R_\alpha(T) = R(T) + \alpha |\widetilde{T}|,$$

where $|\widetilde{T}|$ is the number of terminal nodes in $T$ and $R(T)$ is traditionally defined as the total misclassification rate of the terminal nodes. Minimal cost-complexity pruning finds the subtree of $T$ that minimizes $R_\alpha(T)$.

In simple words, the algorithm identifies the subtree with the smallest contribution, as measured by the cost-complexity metric, and cuts it off from the actual tree, repeating the process until the effective complexity parameter for the whole tree is large enough.
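Since sklearn implements minimal cost-complexity pruning through the ccp_alpha parameter of its tree estimators, a rough sketch of post-pruning could look as follows; the alpha value here is an arbitrary illustration and would normally be chosen via cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a full, unpruned tree first
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)

# The candidate effective alphas along the pruning path, from weakest link upwards
path = unpruned.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)

# Refit with a chosen alpha; larger values prune more aggressively
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X, y)
print(unpruned.tree_.node_count, pruned.tree_.node_count)  # the pruned tree has fewer nodes
```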

6 From Decision Trees to Random Forests

The random forest algorithm is one of the few non-neural network models

that give very high accuracy for both regression and classification tasks. It simply

gives good results. And while decision trees do provide great interpretability,

when it comes down to performance, they lose against random forests. In fact,

unless transparency of the model is a priority, almost every data scientist and

analyst will use random forests over decision trees. So, let’s see what this algorithm

is made of.

6.1 Ensemble learning

In essence, a random forest is the collection of many decision trees applied

to the same problem. In machine learning, this is referred to as ensemble

modelling. In general, ensemble methods use multiple learning algorithms to

obtain better predictive performance than any of the constituent learning

algorithms alone. So, in our case, the collection of decision trees as a whole unit

behaves much better than any stand-alone decision tree. In short, random forests

rely on the wisdom of the crowd.

The more observant among you might have noticed, though, that through

the process of creating many different trees, we lose one of the important

properties that decision trees had in the first place – namely, interpretability. Even

though each individual tree in the collection is simple to follow, when we have

hundreds of them in a single model it becomes almost impossible to grasp what’s

happening at a glance. That’s why this algorithm is usually treated as a black box

model. That’s the trade-off of random forests as opposed to decision trees – we

gain additional performance and accuracy, but lose the interpretability and

transparency of the model.

A logical follow-up question you might have right now is: “How do we

determine what the final result should be?”. After all, each decision tree produces

its own answer. So, we end up with hundreds of different answers. The good news

is that we can deal with this problem in a very intuitive way – by using majority

voting to determine the final outcome. In other words, we choose the most

common result.
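As a tiny sketch of majority voting, where the per-tree predictions for a single sample are made-up values:

```python
from collections import Counter

# Hypothetical predictions from five trees for one sample
tree_predictions = ["setosa", "versicolor", "setosa", "setosa", "versicolor"]

# Majority voting: the most common prediction becomes the forest's answer
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)   # "setosa"
```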

6.2 Bootstrapping

The purpose of the random forest algorithm is to organically decrease

overfitting. That’s why there are many individual decision trees, with the idea that

one tree can overfit, but many will not do so in the same manner. Thus, their

average would reflect the true dependence.

A crucial part of that logic is that we don’t train all of the trees on the exact same dataset. But it is rare that we have several different datasets related to the same problem, nor can we split one dataset into a hundred parts. So, how do we create many different datasets out of a single one? Well, that is the technique of bootstrapping.

In technical language, bootstrapping works by uniformly sampling from the original dataset with replacement. What that means is that it goes through the original dataset and copies data points at random to create the new set. However, the copied points still remain in the original set and, potentially, can be copied again (one can think of them as being moved to the new set and then replaced in the original one; that’s where “with replacement” comes from). Thus, the newly generated datasets contain no new data; it’s the same data, but some of it is repeated. You can see two examples of this below:

Figure 11: Schematic of datasets created through bootstrapping.

Figure 12: New datasets generated through bootstrapping. Notice how some of the data is repeated.

These newly generated datasets are then used as the training data for the

decision trees in the forest.



As a note, a dataset created in this manner that is of the same size as the original dataset is called a bootstrap sample. The expected proportion of unique samples in it is $1 - \frac{1}{e} \approx 63\%$.
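As a minimal sketch of bootstrapping with NumPy, treating a small array as a stand-in dataset (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
original = np.arange(10)   # stand-in for a dataset of 10 samples

# A bootstrap sample: same size as the original, drawn uniformly with replacement
bootstrap_sample = rng.choice(original, size=len(original), replace=True)

print(bootstrap_sample)                  # some values appear twice, others not at all
print(len(np.unique(bootstrap_sample)))  # on average roughly 63% of the 10 values are unique
```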

6.3 Random Forests

So far, we’ve discussed that we create slightly different datasets through bootstrapping, feed them to decision trees, and then collect the results and choose the final outcome through majority voting. You would be forgiven for thinking that this construct is the random forest. But actually, this is called bagged decision trees, which stands for Bootstrap Aggregated decision trees. There is one crucial detail that must be satisfied in order for it to become a random forest.

And that is to allow each tree access to only a random subset of the features, not all of them. That is right: in the forest, each tree can only see part of the input features. This is done to further reduce the chance of overfitting. The features to be considered are, again, chosen at random (in implementations such as sklearn, this random subset is redrawn at every split). What we can control is the size of this subset: whether we want to consider half of the features, 70%, and so on.

This is, in essence, the random forest algorithm. All of the steps outlined above act as regularization to reduce overfitting. And so, random forests rarely overfit, if at all. This, in turn, leads to better performance of the algorithm.
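As a sketch of the full algorithm with sklearn’s RandomForestClassifier; the hyperparameter values are illustrative rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features="sqrt",  # size of the random feature subset considered at each split
    bootstrap=True,       # train every tree on its own bootstrap sample
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the majority-vote predictions
```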

7 Relevant Metrics

In this section, we introduce some of the relevant metrics that could be used

to evaluate the performance of a machine learning model dealing with a

classification task.

7.1 The Confusion Matrix

A confusion matrix, $C$, is constructed such that each entry, $C_{ij}$, equals the number of observations known to be in group $i$ and predicted to be in group $j$.

A confusion matrix is a square matrix ($2 \times 2$ or larger) showing the number of correctly and incorrectly predicted samples from each class.

Consider a classification problem where each sample in a dataset belongs to only one of two classes. We denote these two classes by 0 and 1 and, for the time being, define 1 to be the positive class. This would result in the confusion matrix from Figure 13.

                    Predicted label
                      0        1
    True label  0    TN       FP
                1    FN       TP

Figure 13: A 2 × 2 confusion matrix denoting the cells representing the true and false positives and negatives. Here, class 1 is defined as the positive one.

The matrix consists of the following cells:

• Top-left cell – true negatives (TN). This is the number of samples whose

true class is 0 and the model has correctly classified them as such.

• Top-right cell – false positives (FP). This is the number of samples whose

true class is 0 but have been incorrectly classified as 1s.

• Bottom-left cell – false negatives (FN). This is the number of samples whose

true class is 1 but have been incorrectly classified as 0s.

• Bottom-right cell – true positives (TP). This is the number of samples whose

true class is 1 and the model has correctly classified them as such.

Consider now a classification problem where each sample in a dataset belongs to one of three classes, 0, 1, or 2, with class 1 again defined as the positive class. This makes classes 0 and 2 negative. The confusion matrix would then look like the one in Figure 14.

                    Predicted label
                      0        1        2
    True label  0    TN       FP       TN
                1    FN       TP       FN
                2    TN       FP       TN

Figure 14: A 3 × 3 confusion matrix denoting the cells representing the true and false positives and negatives. Here, class 1 is defined as the positive one, so any cell involving neither a true nor a predicted label of 1 counts as a true negative.

Making use of these confusion matrices, we introduce four useful metrics for

evaluating the performance of a classifier.



7.2 Accuracy

The ratio between the number of all correctly predicted samples and the number of all samples.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

7.3 Precision

The ratio between the number of true positives and the number of all samples classified as positive.

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

7.4 Recall

The ratio between the number of true positives and the number of all samples whose true class is the positive one.

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

7.5 F1 Score

The harmonic mean of precision and recall.

$$F_1 = \frac{2}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}}}$$

The F1 score can be thought of as putting precision and recall into a single metric.
Contrary to taking the simple arithmetic mean of precision and recall, the F1 score
penalizes low values more heavily. That is to say, if either precision or recall is very
low, while the other is high, the F1 score would be significantly lower compared to
the ordinary arithmetic mean.
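As a sketch of computing all of these metrics with sklearn, using made-up true and predicted labels for a binary problem where 1 is the positive class:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]   # hypothetical true labels
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / all samples
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```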

Copyright 2022 365 Data Science Ltd. Reproduction is forbidden unless authorized. All rights reserved.
Nikola Pulev

Email: [email protected]
