Unit 3 Part 2

Decision tree learning is a popular method for approximating discrete-valued target functions, represented as a tree structure that can also be expressed as if-then rules. It effectively classifies instances by sorting them down the tree based on attribute tests, with various impurity measures like information gain, gain ratio, and Gini index used to evaluate splits. Algorithms such as ID3, C4.5, and CART are commonly used for constructing decision trees, with ID3 focusing on maximizing information gain at each node.

MACHINE LEARNING

UNIT-III

DECISION TREE LEARNING

5.1 INTRODUCTION
▪ Decision tree learning is a method for approximating discrete-valued target functions, in
which the learned function is represented by a decision tree.
▪ Learned trees can also be re-represented as sets of if-then rules to improve human
readability. These learning methods are among the most popular of inductive inference
algorithms and have been successfully applied to a broad range of tasks from learning to
diagnose medical cases to learning to assess credit risk of loan applicants.
▪ A decision tree can be said to be a map of the reasoning process. It uses a tree-like structure to describe a dataset, and solutions can be visualized by following different pathways through the tree. It is a hierarchical set of rules explaining the way in which a large set of data can be divided into smaller data partitions.
▪ Each time a split takes place, the components of the resulting partitions become increasingly similar to one another with regard to the target. If we had to select a classification method that performs well across a wide range of situations, requires little effort from the analyst, and is easy for the end user to understand, the tree methodology would be the preferred choice. Several types of decision-tree learning techniques are available, with varying needs and abilities. Decision-tree learning is usually best suited to problems with the following features:
• Patterns are described by a fixed set of attributes xj; j = 1, 2, …, n, and each attribute xj takes on a small number of disjoint possible values (categorical or numeric) vl(xj); l = 1, 2, …, dj.
• The output variable y is a Boolean-valued function (binary classification problems) defined over the set S of patterns {s(i)} = {x(i)}; i = 1, 2, …, N. That is, y takes on values yq; q = 1, 2. If we assume y1 = 0 and y2 = 1, then y : S → {0, 1}.
▪ The training data is described by the dataset D of N patterns with corresponding observed outputs: D = {s(i), y(i)} = {x(i), y(i)}; i = 1, 2, …, N.

5.2 EXAMPLE OF CLASSIFICATION DECISION TREE


▪ Decision trees classify instances by sorting them down the tree from the root to some leaf
node, which provides the classification of the instance. Each node in the tree specifies a
test of some attribute of the instance, and each branch descending from that node
corresponds to one of the possible values for this attribute.
▪ An instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value of the
attribute in the given example. This process is then repeated for the subtree rooted at the
new node.
▪ Consider the weather data set:
▪ Below figure illustrates a typical learned decision tree. This decision tree classifies
Saturday mornings according to whether they are suitable for playing tennis. For example,
the instance

(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)


would be sorted down the leftmost branch of this decision tree and would therefore be
classified as a negative instance (i.e., the tree predicts that PlayTennis = no).
● In general, decision trees represent a disjunction of conjunctions of constraints on the
attribute values of instances. Each path from the tree root to a leaf corresponds to a
conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
● For example, the decision tree shown in Figure 1 corresponds to the expression
(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)
Fig: A decision tree for the concept PlayTennis.
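To make the sorting-down process concrete, here is a minimal sketch (not from the text) that represents the tree of the figure as nested Python dicts and classifies the sample instance above; the dict layout and the function name are illustrative assumptions.

# PlayTennis tree as nested dicts: an internal node maps an attribute name to
# its branches; a leaf is just a class label string.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    if not isinstance(node, dict):      # reached a leaf: return its class label
        return node
    attribute = next(iter(node))        # attribute tested at this node
    value = instance[attribute]         # follow the branch matching the instance
    return classify(node[attribute][value], instance)

instance = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, instance))         # -> "No" (a negative instance)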

5.3 MEASURES OF IMPURITY FOR EVALUATING SPLITS IN DECISION TREES


An impurity measure is a heuristic for selection of the splitting criterion that best separates a given
dataset D of class-labeled training tuples into individual classes. If we divide D into smaller
partitions as per the outcome of the splitting criterion, each partition should ideally be pure, with
all the tuples falling into each partition belonging to the same class.
Three popular impurity measures are information gain, gain ratio, and Gini index.

5.3.1 Information Gain/Entropy reduction


Information theory has a measure for this, called entropy, which measures how disorganized a system is. The root node holds the entire dataset D, which describes the patterns s(1), s(2), …, s(N) with their corresponding classes y1 or y2 (for a binary classification task). Imagine selecting a pattern at random from the dataset D and announcing that it belongs to class yq. This message has the probability

Pq = freq(yq, D) / |D|

where freq(yq, D) stands for the number of patterns in D that belong to class yq and |D| denotes the total number of patterns in D (|D| = N). The expected information needed to classify a pattern in D is given by

Info(D) = – Σq Pq log2(Pq)

A log function to the base 2 is used because information is encoded in bits. Info (D) is just the
average amount of information needed to identify the class label of a pattern in D. Note that at
this point, the information we have is solely based on the proportions of patterns in each class.
Info (D) can also be expressed as entropy of D, denoted as Entropy (D).

Associated with the root node of the decision tree, Info(D) represents the expected amount of information that would be needed to specify whether a new instance should be classified as y1 or y2, given that the example reached the node. Info(D) is 0 if all patterns in D belong to the same class (P1 = 0, P2 = 1): – P1 log2 P1 – P2 log2 P2 = 0 (note that 0 log2 0 = 0). Info(D) is 1 when the collection D contains an equal number of Class 1 and Class 2 patterns (P1 = 1/2, P2 = 1/2), representing maximum heterogeneity (randomness) in the dataset: – P1 log2 P1 – P2 log2 P2 = 1. If the collection D contains an unequal number of Class 1 and Class 2 patterns, Info(D) is between 0 and 1. It is, thus, a measure of impurity of the collection of examples.
To illustrate, we consider the training set of Table 8.1 (Weather Data). It has nine examples of class Yes and five examples of class No. Therefore,

Info(D) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940
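These entropy values are easy to check with a few lines of Python (a minimal sketch, not part of the text; the helper name is my own).

import math

def info(class_counts):
    """Info(D) = – sum_q Pq log2(Pq), taking 0 log2 0 = 0."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

print(round(info([9, 5]), 3))   # 0.94 -> the root node of the weather data is highly impure
print(info([7, 0]))             # 0.0  -> a pure node
print(info([7, 7]))             # 1.0  -> maximum impurity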

Root node with dataset D will therefore be a highly impure node. The training set D contains
instances that belong to a mixture of classes (high entropy). In this situation, the idea of ‘divide-
and-conquer’ strategy is to divide D into subsets of instances that are, or seem to be, heading
towards single-class collection of instances.
Suppose we select attribute xj for the root node. xj has distinct values vl(xj); l = 1, …, dj, as observed from the training data D. Attribute xj can be used to split D into dj partitions or subsets {D1, D2, …, Ddj}, where Dl contains those patterns in D that have value vl(xj) of xj. These partitions would correspond to branches grown from the node. Ideally, we would like this partitioning to produce an exact classification, i.e., we would like each partition to be pure. The amount of information still needed to classify the patterns after the split on xj is measured by

Info(D, xj) = Σl=1..dj (|Dl|/|D|) Info(Dl),  where  Info(Dl) = – Σq Pql log2(Pql)

The term |Dl|/|D| acts as the weight of the l-th partition, and Pql is the probability that an arbitrary sample in subset Dl belongs to class yq, estimated as

Pql = freq(yq, Dl) / |Dl|
Information gain is defined as the difference between the original information requirement (i.e., based on the proportion of classes in the entire dataset D) and the new requirement (i.e., obtained after partitioning on xj). That is,

Gain(D, xj) = Info(D) – Info(D, xj)

In other words, Gain(D, xj) tells us how much would be gained by branching on xj. It is the expected reduction in information requirement (expected reduction in entropy) obtained by partitioning on xj. The attribute xj with the highest information gain, Gain(D, xj), is chosen as the splitting attribute at the root node. This is equivalent to saying that we want to partition on the attribute xj that would do the best classification, so that the amount of information still required to finish the classification task (i.e., Info(D, xj)) is minimal.
The information gain, Gain(D, xj), measures the expected reduction in entropy caused by partitioning the patterns in dataset D according to the attribute xj.
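A minimal sketch of the gain computation just defined, with patterns stored as Python dicts; the function names are my own, and the small demo reuses the Wind attribute counts of the weather data.

import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels: – sum_q Pq log2(Pq)."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def info_gain(patterns, labels, attribute):
    """Gain(D, xj) = Info(D) – sum_l (|Dl|/|D|) Info(Dl)."""
    partitions = {}
    for pattern, label in zip(patterns, labels):
        partitions.setdefault(pattern[attribute], []).append(label)
    remainder = sum(len(part) / len(labels) * info(part) for part in partitions.values())
    return info(labels) - remainder

# Demo: Wind splits the 14 weather examples into 8 Weak (6 Yes, 2 No) and 6 Strong (3 Yes, 3 No).
patterns = [{"Wind": "Weak"}] * 8 + [{"Wind": "Strong"}] * 6
labels = ["Yes"] * 6 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 3
print(round(info_gain(patterns, labels, "Wind"), 3))   # 0.048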

5.3.2 Gain Ratio


It applies a kind of normalization to information gain using a ‘split information’ value defined analogously with Info(D, xj) as

SplitInfo(D, xj) = – Σl=1..dj (|Dl|/|D|) log2(|Dl|/|D|)

This value represents the potential information generated by dividing the dataset D into dj partitions corresponding to the dj values of the attribute xj. For each value of xj, the number of tuples possessing that value is considered with respect to the total number of tuples in D. This is different from information gain, which measures the information with respect to the classification obtained on the basis of the same partitioning. The gain ratio is defined as

GainRatio(D, xj) = Gain(D, xj) / SplitInfo(D, xj)
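A small sketch of split information and gain ratio as defined above; the partition sizes 5/4/5 correspond to the three Outlook values of the weather data, and the printed numbers are rounded.

import math

def split_info(partition_sizes):
    """SplitInfo(D, xj) = – sum_l (|Dl|/|D|) log2(|Dl|/|D|)."""
    total = sum(partition_sizes)
    return sum((n / total) * math.log2(total / n) for n in partition_sizes if n > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(D, xj) = Gain(D, xj) / SplitInfo(D, xj)."""
    return gain / split_info(partition_sizes)

print(round(split_info([5, 4, 5]), 3))          # 1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))   # 0.156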

5.3.3 Gini Index


Gini index is used in CART. For dataset D, it is defined as

Gini(D) = 1 – Σq Pq²

where Pq is the probability that a tuple in D belongs to class yq, and is estimated by

Pq = freq(yq, D) / |D|

Gini index considers a binary split for each attribute. Let us first consider the case where xj is a continuous-valued attribute having dj distinct values vl(xj); l = 1, 2, …, dj. It is common to take the mid-point between each pair of (sorted) adjacent values as a possible split-point (a simple policy, although something might be gained by adopting a more sophisticated one; one such policy is discussed in the next section). The point giving the minimum Gini index for the attribute xj is taken as its split-point.

For a possible split-point of xj, D1 is the set of tuples in D satisfying xj ≤ split-point, and D2 is the set of tuples satisfying xj > split-point. The Gini index of the binary split is the weighted sum

Gini_xj(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

and the reduction in impurity that would be incurred by a binary split on xj is

ΔGini(xj) = Gini(D) – Gini_xj(D)

The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini index) is selected as the splitting attribute. Then one of the two parts (D1, D2) is divided in a similar manner by choosing a variable and a split value for that variable again. This process is continued until we get pure leaf nodes.
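The Gini computation and the mid-point scan described above can be sketched as follows; the function names and the toy values are illustrative only.

def gini(labels):
    """Gini(D) = 1 – sum_q Pq^2, computed from a list of class labels."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_gini_split(values, labels):
    """Scan mid-points between sorted adjacent values of a numeric attribute and
    return (split_point, weighted Gini) with the minimum weighted Gini."""
    pairs = sorted(zip(values, labels))
    best = None
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                                  # no cut between identical values
        split = (v1 + v2) / 2.0
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = (split, weighted)
    return best

print(best_gini_split([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"]))
# (80.0, 0.0): the split x <= 80 produces two perfectly pure partitions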

5.4 ID3, C4.5 AND CART DECISION TREES


These algorithms are explained below.
5.4.1 ID3 Algorithm
▪ Many algorithms have been developed for constructing the decision trees.
▪ The basic decision tree learning algorithm is ID3 which follows the top-down approach
for constructing the decision tree.

5.4.2 Which Attribute Is the Best Classifier?


▪ The central choice in the ID3 algorithm is selecting which attribute to test at each node in
the tree.
▪ For this, ID3 algorithm uses information gain measure. The attribute with the highest
information gain is chosen for testing at a node.
▪ Information gain measures how well a given attribute separates the training examples
according to the target classification.
▪ In order to define information gain precisely, a measure called entropy is used.
▪ Entropy measures the impurity of an arbitrary collection of samples (i.e., it measures how mixed, or non-homogeneous, the collection is).
▪ Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is

Entropy(S) = – p⊕ log2 p⊕ – p⊖ log2 p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S.

▪ To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5 negative examples (denoted [9+, 5–]). Then the entropy of S relative to this Boolean classification is

Entropy([9+, 5–]) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940

▪ Notice that the entropy is 0 if all members of S belong to the same class. Also note that
the entropy is 1 when the collection contains an equal number of positive and negative
examples.
▪ If the collection contains unequal numbers of positive and negative examples, the entropy
is between 0 and 1. The following figure shows this.

▪ If the target classification has c classes, then the entropy of S relative to this c-wise classification is defined as

Entropy(S) = Σi=1..c – pi log2 pi

▪ where pi is the proportion of S belonging to class i.


▪ Having defined entropy, we can now define the information gain, Gain(S, A), of an attribute A relative to a collection of samples S as

Gain(S, A) = Entropy(S) – Σv ∈ Values(A) (|Sv|/|S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v.
▪ As an example, let us construct the decision tree for the following data which shows
training examples for the target concept PlayTennis.

First, compute Entropy(S) [S is the given data set; there are two classes: Yes = 9, No = 5]:

Entropy(S) = – (9/14) log2(9/14) – (5/14) log2(5/14) = 0.940

Next, compute the information gain of each attribute in the data set.

For Outlook (values Sunny, Overcast, Rain), the weighted entropy of the partitions is

Σv ∈ Values(Outlook) (|Sv|/|S|) Entropy(Sv)
= (5/14)[– (3/5) log2(3/5) – (2/5) log2(2/5)] + (4/14)[– (4/4) log2(4/4)] + (5/14)[– (3/5) log2(3/5) – (2/5) log2(2/5)]
= 0.694

Therefore
Gain(S, Outlook) = Entropy(S) – Σv (|Sv|/|S|) Entropy(Sv)
= 0.940 – 0.694
= 0.246

Similarly, for Temperature (values Hot, Mild, Cool):

Σv ∈ Values(Temperature) (|Sv|/|S|) Entropy(Sv)
= (4/14)[– (2/4) log2(2/4) – (2/4) log2(2/4)] + (6/14)[– (4/6) log2(4/6) – (2/6) log2(2/6)] + (4/14)[– (3/4) log2(3/4) – (1/4) log2(1/4)]
= 0.911

Therefore
Gain(S, Temperature) = 0.940 – 0.911 = 0.029

Similarly, for Humidity (values High, Normal):

Σv ∈ Values(Humidity) (|Sv|/|S|) Entropy(Sv)
= (7/14)[– (3/7) log2(3/7) – (4/7) log2(4/7)] + (7/14)[– (6/7) log2(6/7) – (1/7) log2(1/7)]
= 0.789

Therefore
Gain(S, Humidity) = 0.940 – 0.789 = 0.151

Similarly, for Wind (values Weak, Strong):

Σv ∈ Values(Wind) (|Sv|/|S|) Entropy(Sv)
= (8/14)[– (6/8) log2(6/8) – (2/8) log2(2/8)] + (6/14)[– (3/6) log2(3/6) – (3/6) log2(3/6)]
= 0.892

Therefore
Gain(S, Wind) = 0.940 – 0.892 = 0.048
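The hand computations above can be checked with a short script; the table below is the standard 14-example PlayTennis data assumed throughout this section, and small differences (e.g., 0.247 vs 0.246) come only from rounding the intermediate values.

import math
from collections import Counter

data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def gain(data, attr_index):
    labels = [row[-1] for row in data]
    subsets = {}
    for row in data:
        subsets.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(s) / len(data) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

for i, name in enumerate(attributes):
    print(name, round(gain(data, i), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048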

Since Gain(S, Outlook) is the highest, the attribute Outlook is tested first and becomes the root of the decision tree. It then partitions S into three subsets [say S1: leftmost subtree, S2: middle subtree, S3: rightmost subtree], because it has three different values (Sunny, Overcast, Rainy), as shown below.
Now, we need to apply the same procedure to decide the root node for the leftmost subtree and for the rightmost subtree. However, observe that all the samples of the middle subtree belong to the same class (the Yes class). So, we can create a leaf node here and label it with Yes. The final decision tree for the given data set, S, is given below.
After understanding this procedure, we are now ready to state the basic decision tree learning algorithm (ID3). This is given below.

ID3(Examples, Target_attribute, Attributes)
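Since the full pseudocode is not reproduced here, below is a hedged Python sketch of ID3(Examples, Target_attribute, Attributes) following the standard top-down, greedy, information-gain formulation described above; a complete implementation would also handle empty branches and unseen attribute values, typically by falling back to the majority class.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(examples, labels, attribute):
    parts = {}
    for ex, y in zip(examples, labels):
        parts.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Grow a tree (nested dicts) top-down, greedily maximizing information gain."""
    if len(set(labels)) == 1:                    # all examples in one class: leaf node
        return labels[0]
    if not attributes:                           # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in sorted(set(ex[best] for ex in examples)):
        subset = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        tree[best][value] = id3([ex for ex, _ in subset], [y for _, y in subset], rest)
    return tree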

5.4.3 The C4.5 Decision Tree

To illustrate the process of implementation of a C4.5 decision tree, we consider the (toy) weather training set, in which there are four attributes: x1 (Outlook), x2 (Temperature), x3 (Humidity), x4 (Wind); and two classes for the output variable PlayTennis: Yes, No. The attributes x1 and x4 have categorical values, and the attributes x2 and x3 have continuous numeric values. Attribute x1 (Outlook) has three categorical values: sunny, overcast and rain, and therefore the node labeled Outlook will have three branches. The node for the categorical variable x4 (Wind) will have two branches. The other two variables are continuous-valued, and as per the strategy followed in C4.5, the corresponding nodes will have binary splits. This requires discretization of these variables.

There are four choices of attribute at the root node: Outlook, Temperature, Humidity, and Wind.
Tree stumps for the attribute Outlook (x1) are shown above; a three-way split corresponding to the
three categorical values of x1.

Temperature (x2) has continuous numeric values. For this attribute, we select the best cut-point Tx2 from its range of values by evaluating every candidate cut-point. Examples are first sorted by increasing value of the attribute, and the interval between each successive pair of values in the sorted sequence gives a potential cut-point. The cut-point that minimizes the entropy will never occur between two patterns of the same class; therefore, it is only necessary to consider potential divisions that separate patterns of different classes. For the weather data, this gives us eight potential cut-points: {64, 65, 70, 71, 72, 75, 80, 83}. Note that boundary points of the intervals between classes have been taken as the potential cut-points. Entropy for each of these cut-points is evaluated, and the one that results in the maximum information gain/gain ratio is selected as the split-value for the attribute x2. It follows that x2 = 83 is the selected split-value.
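A hedged sketch of split-point selection for a continuous attribute by information gain. Note an assumption: this version evaluates the mid-point between every pair of adjacent distinct values, whereas C4.5 as described above restricts attention to class boundaries and reports an actual attribute value (such as 83) as the threshold.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Sort by the numeric attribute, try each mid-point between adjacent distinct
    values, and return the cut with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best_cut, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between identical values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [y for x, y in pairs if x <= cut]
        right = [y for x, y in pairs if x > cut]
        g = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if g > best_gain:
            best_cut, best_gain = cut, g
    return best_cut, best_gain

print(best_cut_point([60, 65, 70, 75], ["No", "No", "Yes", "Yes"]))   # (67.5, 1.0)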

The tree stump for x2 (Temperature) is shown below. Evaluating all four candidate root-node variables, Outlook turns out to be the best choice with respect to the entropy reduction/gain ratio measure.
The table below shows the dataset for the branch sunny, obtained from the data D. Repeating the process described above on this dataset, we select Humidity as the daughter node, with split-value = 70.

The fully-grown C4.5 decision tree is shown below:

Fig: Decision tree for the weather data

5.4.4 CART Decision Tree

Consider the data sample given below


● There are N = 24 points in two dimensions given by the data sample; a pilot random sample
of households in a city with respect to ownership of a lawn tractor.
● The dataset is set up for a predictive model—a lawn-tractor manufacturer would like to
find a way of classifying households in a city into those likely to purchase a lawn tractor
and those not likely to buy one. Input variables, x1 = Income and x2 = Lawn size, are
recorded for 24 households, and the target variable y = Ownership of a lawn tractor, is
assigned to each household.
● The dataset is balanced, containing equal numbers of Owner/Nonowner households. When searching for a binary split on a continuous-valued input variable, midpoints between the consecutive values may be treated as candidate values for the split. The candidate split points for the variable x1 (Income) are {38.1, 45.3, 50.1, …, 109.5}, and those for x2 (Lawn size) are {14.4, 15.4, 16.2, …, 23}. We need to rank the candidate split points according to how much they reduce impurity (heterogeneity) in the resulting subsets after the split. Note that the total dataset D given in the table above has the highest impurity. With respect to the Gini index as impurity measure,

Gini(D) = 1 – (12/24)² – (12/24)² = 0.5
● It can easily be verified that the Gini index impurity measure is at its peak when Pq = 0.5, i.e., when the data is perfectly balanced. Calculating the Gini index for all the candidate split points for both the x1 and x2 variables, and ranking them according to how much they reduce impurity, we choose x2 (Lawn size) for the first split, with a splitting value of 19. The (x1, x2) space is now divided into two rectangles, one with x2 ≤ 19 and the other with x2 > 19. This is illustrated in the figure below.
● Thus, the Gini impurity index decreased from 0.5 before the split to 0.375 after the split. Each of the rectangles created by the split is more homogeneous than the rectangle before the split. The upper rectangle contains points that are mostly Owners and the lower rectangle contains mostly Nonowners. By comparing the reduction in impurity across all possible splits in all possible attributes, the next split is chosen. The next split is found to be on the x1 (Income) variable at the value 84.75. The figure below shows that once again the tree procedure has chosen to split a rectangle to increase the purity of the resulting rectangles. The lower-left rectangle (x1 ≤ 84.75, x2 ≤ 19) has all points that are Nonowners with one exception, whereas the lower-right rectangle (x1 > 84.75, x2 ≤ 19) consists exclusively of Owners.
If we continue partitioning till all the branches hit leaf nodes, each rectangle will have data points
from just one of the two classes.
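Since the 24-household table itself is not reproduced here, the sketch below only checks the impurity figures quoted above; the 9/3 versus 3/9 class mix of the two 12-household rectangles is an assumption on my part, chosen because it is consistent with the reported weighted Gini of 0.375.

def gini(counts):
    """Gini(D) = 1 – sum_q Pq^2, computed from class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini_after_split(left_counts, right_counts):
    """Weighted Gini of the two partitions produced by a binary split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

print(gini([12, 12]))                              # 0.5: 12 Owners, 12 Nonowners before any split
print(weighted_gini_after_split([9, 3], [3, 9]))   # 0.375: after the Lawn size <= 19 split (assumed mix)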

5.5 PRUNING THE TREE

● Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances.
● The fundamental algorithm for decision trees continues to grow the tree by splitting nodes
as long as new divisions generate daughter nodes that increase purity. Such a tree has
undergone optimization for the training set. Therefore, elimination of any leaf nodes will
simply lead to an increase in the error rate of the tree on the training set. But this certainly
does not mean that the entire tree with pure leaf nodes also performs the best on new data!
● Pruning is the procedure that decreases the size of decision trees. It can decrease the risk of overfitting by limiting the size of the tree or eliminating areas of the tree that provide little predictive power. Pruning helps by trimming the branches that reflect anomalies in the training data caused by noise or outliers, and modifies the original tree in a way that enhances its generalization performance.
● Various methods generally use statistical measures to remove the least reliable branches, frequently resulting in quicker classification and an improvement in the ability of the tree to correctly classify independent test data.
● There are two approaches to tree pruning, described below.
5.5.1 Pre-pruning Approach

● In the pre-pruning approach, a tree is “pruned” by halting its construction early (e.g., by deciding not to further split or partition the subset of training samples at a given node).
● Upon halting, the node becomes a leaf. The leaf may hold the most common class among the subset samples, or the probability distribution of those samples.
● When building a tree, measures such as statistical significance, information gain, etc., can be used to assess the goodness of a split. If partitioning the samples at a node results in a split that falls below a pre-specified threshold, then partitioning of the given subset is halted. There are problems in selecting an appropriate threshold: high thresholds can result in oversimplified trees, while low thresholds can result in very little simplification.
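As a concrete but hedged illustration, scikit-learn's CART-style DecisionTreeClassifier exposes stopping criteria that behave like pre-pruning; the thresholds and the synthetic data below are illustrative choices, not values from the text.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pre_pruned = DecisionTreeClassifier(
    max_depth=4,                 # stop growing below this depth
    min_samples_split=20,        # do not split nodes with fewer than 20 samples
    min_impurity_decrease=0.01,  # require a minimum impurity reduction per split
    random_state=0,
)
pre_pruned.fit(X_train, y_train)
print("depth:", pre_pruned.get_depth(), "leaves:", pre_pruned.get_n_leaves())
print("test accuracy:", round(pre_pruned.score(X_test, y_test), 3))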
5.5.2 Post-pruning Approach

The post-pruning approach eliminates branches from a “completely grown” tree.

● A tree node is pruned by eliminating its branches. The cost-complexity pruning algorithm is an instance of the post-pruning approach. The pruned node becomes a leaf and is labeled with the most common class among its former branches.
● For each non-leaf node in the tree, the algorithm computes the expected error rate that would result if the subtree at that node were pruned. Next, the expected error rate if the node were not pruned is computed using the error rates for each branch, combined by weighting according to the proportion of observations along each branch. If pruning the node leads to a higher expected error rate, the subtree is kept; otherwise, it is pruned.
● After generating a set of progressively pruned trees, an independent test set can be used to estimate the accuracy of each tree. The decision tree that minimizes the expected error rate is preferred.
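A sketch of post-pruning via cost-complexity pruning as implemented in scikit-learn (again CART-style, used here only as an illustration of the idea above); the data is synthetic, and the loop simply keeps the pruned tree that does best on a held-out set.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree fully, then compute the effective alphas along the pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the one that performs best on held-out data.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print("best ccp_alpha:", best_alpha, "held-out accuracy:", round(best_score, 3))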

5.6 METRICS FOR ASSESSING CLASSIFICATION ACCURACY


● Evaluation metrics are tied to machine learning tasks. There are different metrics for the
tasks of classification and regression. Some metrics, like precision-recall, are useful for
multiple tasks. Classification and regression are examples of supervised learning, which
constitutes a majority of machine learning applications.
● Using different metrics for performance evaluation, we should be able to improve our model’s overall predictive power before we roll it out for production on unseen data. Evaluating a machine learning model with accuracy alone, without considering other evaluation metrics, can lead to problems when the model is deployed on unseen data and may end in poor predictions.
● There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC-ROC are some of the most popular metrics. Precision and recall are also widely used metrics for classification problems.

i. Accuracy
Accuracy simply measures how often the classifier predicts correctly. We can define accuracy as the ratio of the number of correct predictions to the total number of predictions.
Accuracy=(TP+TN)/(TP+TN+FP+FN)
Where TP=True Positive, TN=True Negative, FP=False Positive, FN=False Negative

ii. Confusion Matrix


Confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table with combinations of predicted and actual values.

A confusion matrix is defined as the table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.
Fig: Confusion matrix

iii. Precision
Precision explains how many of the cases predicted as positive actually turned out to be positive. Precision is useful in cases where False Positives are a higher concern than False Negatives.

“Precision for a label is defined as the number of true positives divided by the number of
predicted positives”.

iv. Recall (Sensitivity)
Recall explains how many of the actual positive cases we were able to predict correctly with our model. It is a useful metric in cases where False Negatives are of higher concern than False Positives.

Recall for a label is defined as the number of true positives divided by the total number of
actual positives.

v. F1 Score
It gives a combined idea about Precision and Recall metrics. It is maximum when Precision is
equal to Recall.
F1 Score is the harmonic mean of precision and recall.
The F1 score punishes extreme values more. F1 Score could be an effective evaluation metric in
the following cases:

● When FP and FN are equally costly.


● When adding more data does not effectively change the outcome.
● When the number of True Negatives is high.
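Putting the metrics above together, here is a minimal sketch that computes accuracy, precision, recall, and F1 directly from confusion-matrix counts; the TP/TN/FP/FN values are illustrative only.

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)        # of the predicted positives, how many are truly positive
    recall = tp / (tp + fn)           # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.85 precision=0.89 recall=0.80 f1=0.84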
