
Micro, Macro & Weighted Averages of F1 Score, Clearly Explained


Understanding the concepts behind the micro average, macro
average and weighted average of F1 score in multi-class
classification with simple illustrations


The F1 score (aka F-measure) is a popular metric for evaluating the performance of a classification model.

In the case of multi-class classification, we adopt averaging methods for F1 score calculation, resulting in a set of different average scores (macro, weighted, micro) in the classification report.

This article looks at the meaning of these averages, how to calculate them, and which one to choose for reporting.
Contents

(1) Recap of the Basics (Optional)
(2) Setting the Motivating Example
(3) Macro Average
(4) Weighted Average
(5) Micro Average
(6) Which average should I choose?

(1) Recap of the Basics (Optional)

Note: Skip this section if you are already familiar with the concepts of
precision, recall, and F1 score.

Precision

Layman definition: Of all the positive predictions I made, how many of them are truly positive?

Calculation: Number of True Positives (TP) divided by the total number of True Positives (TP) and False Positives (FP).

The equation for precision | Image by author
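Written out as an equation, this is:

$$\text{Precision} = \frac{TP}{TP + FP}$$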


Recall

Layman definition: Of all the actual positive examples out there, how many of them did I correctly predict to be positive?

Calculation: Number of True Positives (TP) divided by the total number of True Positives (TP) and False Negatives (FN).

The equation for Recall | Image by author
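In symbols:

$$\text{Recall} = \frac{TP}{TP + FN}$$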

If you compare the formulas for precision and recall, you will notice that both look similar. The only difference is the second term of the denominator: False Positives for precision and False Negatives for recall.

F1 Score

To evaluate model performance comprehensively, we should examine both precision and recall. The F1 score serves as a helpful metric that considers both of them.

Definition: Harmonic mean of precision and recall for a more balanced summarization of model performance.
Calculation:

The equation for F1 score | Image by author
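In symbols:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$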

If we express it in terms of True Positive (TP), False Positive (FP), and False Negative (FN), we get this equation:

The alternative equation for F1 score | Image by author
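Substituting the definitions of precision and recall gives:

$$F_1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$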

(2) Setting the Motivating Example

To illustrate the concepts of averaging F1 scores, we will use the following example as the context of this tutorial.

Imagine we have trained an image classification model on a multi-class dataset containing images of three classes: Airplane, Boat, and Car.

We use this model to predict the classes of ten test set images. Here are the raw predictions:

Sample predictions of our demo classifier | Image by author


Upon running sklearn.metrics.classification_report, we get the
following classification report:

Classification report from scikit-learn package | Image by author
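The raw predictions and the report itself appear as images in the original article. For readers who want to follow along in code, here is a minimal sketch with hypothetical labels, chosen only so that the resulting averages match the figures discussed below (macro F1 of 0.58, weighted F1 of 0.64, accuracy of 0.60); they are not necessarily the exact predictions from the article's images.

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth labels and predictions for ten test images.
# Illustrative stand-ins consistent with the article's reported averages.
y_true = ["Airplane", "Airplane", "Airplane", "Boat",
          "Car", "Car", "Car", "Car", "Car", "Car"]
y_pred = ["Airplane", "Airplane", "Boat", "Boat",
          "Car", "Car", "Car", "Airplane", "Boat", "Boat"]

# Prints per-class precision, recall, F1, and support,
# plus the accuracy, macro avg, and weighted avg rows.
print(classification_report(y_true, y_pred))
```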

The columns (in orange) with the per-class scores (i.e. score for each
class) and average scores are the focus of our discussion.

We can see from the above that the dataset is imbalanced (only one
out of ten test set instances is ‘Boat’). Thus the proportion of
correct matches (aka accuracy) would be ineffective in assessing
model performance.

Instead, let us look at the confusion matrix for a holistic understanding of the model predictions.
Confusion matrix | Image by author

The confusion matrix above allows us to compute the critical values of True Positive (TP), False Positive (FP), and False Negative (FN), as shown below.

Calculated TP, FP, and FN values from confusion matrix | Image by author
The above table sets us up nicely to compute the per-class values
of precision, recall, and F1 score for each of the three classes.
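As a code sketch of how these per-class counts can be derived from the confusion matrix (reusing the hypothetical y_true and y_pred lists from the earlier snippet; in scikit-learn's convention, rows are actual classes and columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

labels = ["Airplane", "Boat", "Car"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

for i, label in enumerate(labels):
    tp = cm[i, i]                # predicted as this class and actually this class
    fp = cm[:, i].sum() - tp     # predicted as this class but actually another class
    fn = cm[i, :].sum() - tp     # actually this class but predicted as another class
    print(f"{label}: TP={tp}, FP={fp}, FN={fn}")
```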

It is important to remember that in multi-class classification, we calculate the F1 score for each class in a One-vs-Rest (OvR) approach instead of a single overall F1 score as seen in binary classification.

In this OvR approach, we determine the metrics for each class separately, as if there is a different classifier for each class. Here are the per-class metrics (with the F1 score calculation displayed):
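The per-class table is an image in the original article. Using illustrative counts consistent with the averages reported below (for example, Airplane: TP = 2, FP = 1, FN = 1; Boat: TP = 1, FP = 3, FN = 0; Car: TP = 3, FP = 0, FN = 3), the per-class F1 scores work out roughly as:

$$F_1^{\text{Airplane}} = \frac{2 \times 2}{2 \times 2 + 1 + 1} \approx 0.67, \quad F_1^{\text{Boat}} = \frac{2 \times 1}{2 \times 1 + 3 + 0} = 0.40, \quad F_1^{\text{Car}} = \frac{2 \times 3}{2 \times 3 + 0 + 3} \approx 0.67$$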

However, instead of having multiple per-class F1 scores, it would be better to average them to obtain a single number to describe overall performance.

Now, let's discuss the averaging methods that led to the three different average F1 scores in the classification report.
(3) Macro Average

Macro averaging is perhaps the most straightforward amongst the numerous averaging methods.

The macro-averaged F1 score (or macro F1 score) is computed by taking the arithmetic mean (aka unweighted mean) of all the per-class F1 scores.

This method treats all classes equally regardless of their support values.

Calculation of macro F1 score | Image by author
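With the illustrative per-class F1 scores above, this amounts to:

$$\text{Macro F1} = \frac{0.67 + 0.40 + 0.67}{3} \approx 0.58$$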

The value of 0.58 we calculated above matches the macro-averaged F1 score in our classification report.
(4) Weighted Average

The weighted-averaged F1 score is calculated by taking the mean of all per-class F1 scores while considering each class's support.

Support refers to the number of actual occurrences of the class in the dataset. For example, the support value of 1 in Boat means that there is only one observation with an actual label of Boat.

The 'weight' essentially refers to the proportion of each class's support relative to the sum of all support values.
Calculation of weighted F1 score | Image by author
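With the illustrative per-class F1 scores above and supports of 3 (Airplane), 1 (Boat), and 6 (Car), assumed here to be consistent with the reported result, the calculation works out to:

$$\text{Weighted F1} = \frac{3 \times 0.67 + 1 \times 0.40 + 6 \times 0.67}{3 + 1 + 6} \approx 0.64$$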

With weighted averaging, the output average accounts for the contribution of each class, weighted by the number of examples of that class.

The calculated value of 0.64 tallies with the weighted-averaged F1 score in our classification report.
(5) Micro Average

Micro averaging computes a global average F1 score by counting the sums of the True Positives (TP), False Negatives (FN), and False Positives (FP).

We first sum the respective TP, FP, and FN values across all classes
and then plug them into the F1 equation to get our micro F1 score.

Calculation of micro F1 score | Image by author
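Using the illustrative totals from the earlier counts (sum of TP = 6, sum of FP = 4, sum of FN = 4):

$$\text{Micro F1} = \frac{2 \times \sum TP}{2 \times \sum TP + \sum FP + \sum FN} = \frac{2 \times 6}{2 \times 6 + 4 + 4} = 0.60$$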

In the classification report, you might be wondering why our micro F1 score of 0.60 is displayed as 'accuracy' and why there is no row stating 'micro avg'.

The reason is that micro-averaging essentially computes the proportion of correctly classified observations out of all observations. If we think about it, this definition is exactly what we use to calculate overall accuracy.

Furthermore, if we were to do micro-averaging for precision and recall, we would get the same value of 0.60.

Calculation of all micro-averaged metrics | Image by author
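With the same illustrative totals:

$$\text{Micro Precision} = \frac{\sum TP}{\sum TP + \sum FP} = \frac{6}{10} = 0.60, \qquad \text{Micro Recall} = \frac{\sum TP}{\sum TP + \sum FN} = \frac{6}{10} = 0.60$$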

These results mean that in multi-class classification cases where each observation has a single label, the micro-F1, micro-precision, micro-recall, and accuracy share the same value (i.e., 0.60 in this example).

And this explains why the classification report only needs to display a single accuracy value, since micro-F1, micro-precision, and micro-recall also have the same value.

micro-F1 = accuracy = micro-precision = micro-recall

(6) Which average should I choose?

In general, if you are working with an imbalanced dataset where all classes are equally important, using the macro average would be a good choice as it treats all classes equally.

This means that for our example involving the classification of airplanes, boats, and cars, we would use the macro-F1 score.

If you have an imbalanced dataset but want to assign greater contribution to classes with more examples in the dataset, then the weighted average is preferred.

This is because, in weighted averaging, the contribution of each class to the F1 average is weighted by its size.

Suppose you have a balanced dataset and want an easily understandable metric for overall performance regardless of the class. In that case, you can go with accuracy, which is essentially our micro F1 score.
