
Continual Learning: On Machines that can Learn Continually

Official Open-Access Course @ University of Pisa, ContinualAI, AIDA

Lecture 4: Evaluation & Metrics

Vincenzo Lomonaco
University of Pisa & ContinualAI
[email protected]
TABLE OF CONTENTS

01 Evaluation Protocols
02 Continual Learning Metrics
03 Avalanche Metrics & Loggers

Evaluation Protocols
Classic ML Evaluation

Train - Validation - Test split

● Model selection: train on the training set, evaluate on the validation set

● Model assessment: train on the training (+ validation) set, evaluate on the test set

Variations allowed

● K-fold Cross-Validation
● Leave-one-out
● ...

Test, training and validation sets (brainstobytes.com)

Evaluating Machine Learning Models, by Alice Zheng, 2015.


Basic CL Evaluation Protocol

Different Data

● Classic Machine Learning -> static dataset

● Continual Learning -> stream of datasets (experiences)

(Figure: each experience provides its own Train Split and Test Split)

A Simple Extension to CL

● Split by patterns: one train-(validation)-test split per experience (or parallel streams of experiences)

● This is the simplest and most common evaluation protocol
Split by Patterns

● Training phase: train the model on the training set of each experience, sequentially

● Test phase: evaluate the model on the test sets of the experiences (order does not matter)

● Examples in the training and test sets are disjoint!

● We may have a single test set or one for each experience

● Multiple evaluation streams are possible (Valid, Test, Out-of-Distribution, etc.)

● Cross-Validation & hyper-parameter selection can be performed based on the final aggregate metric at the end of training (a minimal sketch of this protocol follows below).
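A minimal, framework-agnostic sketch of the "split by patterns" protocol. The run_protocol helper and the train_on / evaluate callables are hypothetical placeholders, not part of any specific library.

# Minimal sketch of the "split by patterns" protocol.
# train_on and evaluate are placeholders for your own training/evaluation routines.

def run_protocol(model, train_stream, test_stream, train_on, evaluate):
    """Train sequentially on each experience, then evaluate on every test experience."""
    results = []  # results[i][j] = metric on test experience j after training on experience i
    for train_exp in train_stream:
        train_on(model, train_exp)                  # sequential training, one experience at a time
        results.append([evaluate(model, test_exp)   # evaluation order does not matter
                        for test_exp in test_stream])
    return results

# Toy usage with dummy callables, just to show the control flow:
accs = run_protocol(model=None,
                    train_stream=["e1", "e2", "e3"],
                    test_stream=["t1", "t2", "t3"],
                    train_on=lambda m, e: None,
                    evaluate=lambda m, t: 0.0)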
When and What to Test On

When to test?

● At the end of each experience, usually.


● A finer granularity is always possible (epochs, iterations, etc.)

On what to test?

● Current experience
● Future experiences
● Past experiences
● All experiences
● …

...depending also on the metrics you want to use!


Growing vs Fixed Test Set

Growing Test Set

● We consider only the test set of the current and previously encountered experiences

● Compute the performance metrics averaged over those

Fixed test set

● Common for some benchmarks

● Clear view on overall system performance

● Recover experience-wise performance, if needed

Continuous Learning in Single-Incremental-Task Scenarios. Maltoni & Lomonaco, Neural Networks Journal 2019.
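A small sketch contrasting the two policies, assuming an accuracy matrix acc[i][j] (accuracy on test experience j after training on experience i) has already been collected, e.g. with the protocol sketched earlier; the numbers below are made up for illustration.

# Sketch: averaging the accuracy matrix under the two test-set policies.

def growing_test_set_curve(acc):
    # After training on experience i, average only over experiences seen so far (j <= i).
    return [sum(row[: i + 1]) / (i + 1) for i, row in enumerate(acc)]

def fixed_test_set_curve(acc):
    # After each training experience, average over the full, fixed set of test experiences.
    return [sum(row) / len(row) for row in acc]

acc = [[0.9, 0.1, 0.1],
       [0.7, 0.9, 0.1],
       [0.6, 0.8, 0.9]]
print(growing_test_set_curve(acc))  # [0.9, 0.8, 0.766...]
print(fixed_test_set_curve(acc))    # [0.366..., 0.566..., 0.766...]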
Is it Enough for CL?

● Split by patterns: one train-validation-test split per experience (or parallel streams of experiences)

● But is it enough for Continual Learning? -> we would like a way to evaluate whether we are actually able to learn continually!

● Split by experiences: model selection on a first set of experiences, model assessment on a second set of experiences

● Model assessment should also involve training.

Efficient Lifelong Learning with A-GEM. Chaudhry et al., ICLR 2019.
Hyper-parameters Selection for CL

● We mentioned that hyper-parameter selection can be performed based on the final aggregate metric at the end of training

● But this may be seen as a form of cheating: we select the hyper-parameters that maximize the performance on one specific sequence of training experiences

● We may partially address this with several runs over random orderings of the training experiences (sketch below)

● This may still be unfair: we should calibrate hyper-parameters on a limited set of experiences

Class-incremental learning: survey and performance evaluation on image classification. Masana et al., 2020.
A continual learning survey: Defying forgetting in classification tasks. De Lange et al., 2019.
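A sketch of this idea, assuming a hypothetical build_and_run callable that trains a fresh model on the given experience order and returns the final aggregate metric.

# Sketch: repeat the run over randomly shuffled experience orders, so hyper-parameters
# are not tuned to one specific sequence. build_and_run is a placeholder callable.
import random

def evaluate_config(config, experiences, build_and_run, n_runs=5, seed=0):
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        order = list(experiences)
        rng.shuffle(order)                      # random order of the training experiences
        finals.append(build_and_run(config, order))
    return sum(finals) / len(finals)            # average final metric across runs

# Toy usage with a dummy runner:
print(evaluate_config({"lr": 0.01}, ["e1", "e2", "e3"],
                      build_and_run=lambda cfg, order: 0.5))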
A more Articulated Protocol: An Example

● Model selection: train the model on a first split of experiences, select the best hyper-parameters with a cross-validation scheme.

● Model assessment: train & evaluate the CL algorithm on a second split of experiences.

Efficient Lifelong Learning with A-GEM. Chaudhry et al., ICLR 2019.
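A compact sketch of this protocol under the same assumptions as the previous snippets; evaluate_config and assess are hypothetical callables standing in for model selection and assessment runs, not the A-GEM authors' code.

# Sketch of the A-GEM-style protocol: a first split of experiences (D_CV) for model
# selection, a disjoint second split (D_EV) for model assessment.

def articulated_protocol(experiences, configs, evaluate_config, assess, n_cv=3):
    d_cv, d_ev = experiences[:n_cv], experiences[n_cv:]
    # Model selection: pick the config with the best aggregate metric on D_CV.
    best = max(configs, key=lambda cfg: evaluate_config(cfg, d_cv))
    # Model assessment: train & evaluate the CL algorithm from scratch on D_EV.
    return best, assess(best, d_ev)

# Toy usage with dummy callables:
best, score = articulated_protocol(
    experiences=["e1", "e2", "e3", "e4", "e5", "e6"],
    configs=[{"lr": 0.1}, {"lr": 0.01}],
    evaluate_config=lambda cfg, exps: cfg["lr"],
    assess=lambda cfg, exps: 0.0)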
Continual Learning Metrics
What to Monitor?

● Performance on current experience


● Performance on past experiences
● Performance on future experiences
● Resource consumption
(Memory / CPU / GPU / Disk usage)
● Model size growth
(with respect to the first model)
● Execution time
● Data efficiency
● ...

Gradient Episodic Memory for Continual Learning. Lopez-Paz & Ranzato, NeurIPS 2017.
Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges, Lesort et al. Information Fusion, 2020.
Accuracy
Q: How accurate is my model?

In many different flavors

● Accuracy on the current experience


● Accuracy on previous experiences (plus the current one)
● Accuracy on future experiences (plus the current one)

ACC Metric

● After training on all experiences, average accuracy over all the test experiences (see ACC in the sketch below)

A Metric

● Average of the accuracy on all experiences seen so far, at every point in time (see A in the sketch below)

Don't forget, there is more than forgetting: new metrics for Continual Learning. Díaz-Rodríguez et al., CL Workshop @ NeurIPS, 2018.
Gradient Episodic Memory for Continual Learning. Lopez-Paz & Ranzato, NeurIPS 2017.
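A sketch of both metrics computed from a toy accuracy matrix acc[i][j] (accuracy on test experience j after training on experience i); the lower-triangular average used for A follows one common reading of Díaz-Rodríguez et al.

# Sketch: ACC and A from the accuracy matrix.

def ACC(acc):
    # Average accuracy over all test experiences after training on the last one.
    return sum(acc[-1]) / len(acc[-1])

def A(acc):
    # Average over every entry with j <= i: all experiences seen so far, at every point in time.
    vals = [acc[i][j] for i in range(len(acc)) for j in range(i + 1)]
    return sum(vals) / len(vals)

acc = [[0.9, 0.1, 0.1],
       [0.7, 0.9, 0.1],
       [0.6, 0.8, 0.9]]
print(ACC(acc))  # (0.6 + 0.8 + 0.9) / 3 = 0.766...
print(A(acc))    # (0.9 + 0.7 + 0.9 + 0.6 + 0.8 + 0.9) / 6 = 0.8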
Forward Transfer

Q: How much does learning the current experience improve my performance on future experiences?

FWT Metric

● Accuracy on experience i after training up to experience i-1 (i.e., before ever training on experience i)

Minus

● Accuracy on experience i before training on the first experience (model init)

● Averaged over i = 2, ..., T (sketch below)

Gradient Episodic Memory for Continual Learning. Lopez-Paz & Ranzato, NeurIPS 2017.
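A sketch of FWT following the GEM definition, assuming the same accuracy matrix acc[i][j] plus a vector b of accuracies obtained with the randomly initialized model; the toy numbers are made up.

# Sketch of FWT: acc[i][j] as above, b[j] = accuracy on test experience j
# with the untrained (randomly initialized) model.

def FWT(acc, b):
    T = len(acc)
    # (accuracy on experience i after training up to experience i-1) minus the
    # random-init baseline, averaged over experiences i = 2, ..., T.
    return sum(acc[i - 1][i] - b[i] for i in range(1, T)) / (T - 1)

acc = [[0.9, 0.3, 0.2],
       [0.7, 0.9, 0.4],
       [0.6, 0.8, 0.9]]
b = [0.33, 0.33, 0.33]
print(FWT(acc, b))  # ((0.3 - 0.33) + (0.4 - 0.33)) / 2 ≈ 0.02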
Backward Transfer

Q: How much does learning the current experience improve my performance on previous experiences?

BWT Metric

● Accuracy on experience i after training on experience T (the last one)

Minus

● Accuracy on experience i right after training on experience i

● Averaged over i = 1, ..., T-1 (sketch below)

FORGETTING = - BWT

Gradient Episodic Memory for Continual Learning. Lopez-Paz & Ranzato, NeurIPS 2017.
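The corresponding sketch for BWT (and forgetting) over the same kind of toy accuracy matrix.

# Sketch of BWT and forgetting from the accuracy matrix.

def BWT(acc):
    T = len(acc)
    # (accuracy on experience i after training on the last experience) minus
    # (accuracy on experience i right after training on it), averaged over i = 1, ..., T-1.
    return sum(acc[T - 1][i] - acc[i][i] for i in range(T - 1)) / (T - 1)

acc = [[0.9, 0.3, 0.2],
       [0.7, 0.9, 0.4],
       [0.6, 0.8, 0.9]]
print(BWT(acc))   # ((0.6 - 0.9) + (0.8 - 0.9)) / 2 = -0.2
print(-BWT(acc))  # forgetting = -BWT = 0.2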
Memory

Not only performance

● How much space does your model occupy? (MB, # of params, etc.)

● What is the increment in space required for each new experience?

● How much space do you require for additional information (replay buffer, past models, ...)?

Don't forget, there is more than forgetting: new metrics for Continual Learning. Díaz-Rodríguez et al., CL Workshop @ NeurIPS, 2018.
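A minimal PyTorch sketch of how these quantities can be tracked; the model and the (empty) replay buffer below are placeholders for your own.

# Sketch: tracking model size and replay-buffer size with PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

n_params = sum(p.numel() for p in model.parameters())
size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024 ** 2
print(f"{n_params} parameters, ~{size_mb:.2f} MB of weights")

replay_buffer = []  # hypothetical: tensors stored for rehearsal
buffer_mb = sum(x.numel() * x.element_size() for x in replay_buffer) / 1024 ** 2
print(f"replay buffer: {len(replay_buffer)} samples, ~{buffer_mb:.2f} MB")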
Computation

Not only performance

● What is the computational overhead during training? (# MACs, running time, GPU/CPU time, ...)

● What about its increment over time?

● What is the computational overhead during inference?

Don't forget, there is more than forgetting: new metrics for Continual Learning. Díaz-Rodríguez et al., CL Workshop @ NeurIPS, 2018.
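A tiny sketch of one simple proxy, wall-clock time per step, using only the standard library; the dummy step stands in for an actual training call.

# Sketch: measuring per-experience running time as a proxy for computational cost.
import time

def timed(step, *args):
    """Run a training/evaluation step and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    out = step(*args)
    return out, time.perf_counter() - start

# Toy usage: time a dummy "training step" and track how the cost evolves over the stream.
_, seconds = timed(lambda: sum(i * i for i in range(10_000)))
print(f"step took {seconds:.4f}s")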
Don't Forget: There is More than Forgetting!

A plethora of other possible metrics

● Accuracy vs offline baseline


● Model Robustness
● Model Plasticity & Capacity
● …

More complex Score Functions

● Additional, more informative derived metrics can be devised as well.

● They can be tuned depending on the specific application goals.

Don't forget, there is more than forgetting: new metrics for Continual Learning. Díaz-Rodríguez et al., CL Workshop @ NeurIPS, 2018.
Summing Up

● Choose an evaluation protocol and declare it (no standard, yet)

● Choose the metrics you monitor wisely (what are you interested in?)

● Do not focus exclusively on performance metrics, if possible

Q: You can achieve low/zero forgetting by occupying a lot of space. How?

Q: Which metrics would you monitor to evaluate a continual learner deployed and trained on the edge for image classification tasks?
CLEVA-Compass: A Continual Learning EValuation Assessment Compass

Recall lecture 1: there are various machine learning formulations that have continuous components.

Depending on where inspiration has been drawn from, continual learning setups and evaluation can vary dramatically.

(A small snapshot of the overall paradigm relationships)

CLEVA-Compass: A Continual Learning EValuation Assessment Compass to Promote Research Transparency and Comparability, arXiv, 2021.
CLEVA-Compass: A Continual Learning EValuation Assessment Compass

● The existence of various scenarios is not a problem; it is actually meaningful, because different applications can desire different things!

● But reproducibility & comparability can be problematic, which is a constant subject in the scientific literature.

● Recently, the CLEVA-Compass has been introduced to promote transparency & comparability.

CLEVA-Compass: A Continual Learning EValuation Assessment Compass to Promote Research Transparency and Comparability, arXiv, 2021.
CLEVA-Compass: A Continual Learning EValuation Assessment Compass

Inner compass level (star plot):
indicates related paradigm inspiration & continual setting configuration (assumptions)

Inner compass level of supervision:
"rings" on the star plot indicate the presence of supervision. Importantly: supervision is individual to each dimension!

Outer compass level:
contains a comprehensive set of practically reported measures

CLEVA-Compass: A Continual Learning EValuation Assessment Compass to Promote Research Transparency and Comparability, arXiv, 2021.
Avalanche Metrics & Loggers
How to Monitor Experiments?

Evaluation module provides

● Metrics (accuracy, forgetting, CPU Usage…) - you can create your own!

● Loggers to report results in different ways - you can create your own!

● Automatic integration in the training and evaluation loop through the Evaluation Plugin

● A dictionary with all recorded metrics always available for custom use

V. Lomonaco et al. Avalanche: an End-to-End Library for Continual Learning. CLVision Workshop at CVPR 2021.
Let’s Track our Experiments

V. Lomonaco et al. Avalanche: an End-to-End Library for Continual Learning. CLVision Workshop at CVPR 2021.
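A minimal example along the lines of the Avalanche "From Zero To Hero" evaluation tutorial; the API names match Avalanche ~0.1.x (the version in use at the time of this lecture) and may differ slightly in newer releases.

# Sketch: metrics + loggers wired into a strategy through the EvaluationPlugin.
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.evaluation.metrics import accuracy_metrics, loss_metrics, forgetting_metrics
from avalanche.logging import InteractiveLogger, TensorboardLogger
from avalanche.training.plugins import EvaluationPlugin
from avalanche.training.strategies import Naive

benchmark = SplitMNIST(n_experiences=5)
model = SimpleMLP(num_classes=10)

eval_plugin = EvaluationPlugin(
    accuracy_metrics(minibatch=True, epoch=True, experience=True, stream=True),
    loss_metrics(epoch=True, experience=True, stream=True),
    forgetting_metrics(experience=True, stream=True),
    loggers=[InteractiveLogger(), TensorboardLogger()],
)

strategy = Naive(
    model, torch.optim.SGD(model.parameters(), lr=0.001),
    torch.nn.CrossEntropyLoss(), train_epochs=1, evaluator=eval_plugin,
)

for experience in benchmark.train_stream:           # train sequentially on each experience...
    strategy.train(experience)
    results = strategy.eval(benchmark.test_stream)  # ...and evaluate on the whole test stream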
Interactive Logger Output

-- >> Start of training phase << --


-- Starting training on experience 0 (Task 0) from train stream --
Epoch 0 ended.
Loss_Epoch/train_phase/train_stream/Task000 = 1.1099
Top1_Acc_Epoch/train_phase/train_stream/Task000 = 0.8926
...
-- >> End of training phase << --
-- >> Start of eval phase << --
-- Starting eval on experience 0 (Task 0) from test stream --
> Eval on experience 0 (Task 0) from test stream ended.
Loss_Exp/eval_phase/test_stream/Task000/Exp000 = 0.0208
Top1_Acc_Exp/eval_phase/test_stream/Task000/Exp000 = 0.9981
...
-- >> End of eval phase << --
Loss_Stream/eval_phase/test_stream = 4.4492

V. Lomonaco et al. Avalanche: an End-to-End Library for Continual Learning. CLVision Workshop at CVPR 2021.
Tensorboard Logger in Action

V. Lomonaco et al. Avalanche: an End-to-End Library for Continual Learning. CLVision Workshop at CVPR 2021.
Standalone Metrics

V. Lomonaco et al. Avalanche: an End-to-End Library for Continual Learning. CLVision Workshop at CVPR 2021.
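A sketch of using a metric standalone, outside the training loop; the update signature shown here (predictions, targets, task labels) follows the 0.1.x tutorial and may vary across Avalanche versions.

# Sketch: a standalone Accuracy metric updated manually.
import torch
from avalanche.evaluation.metrics import Accuracy

acc_metric = Accuracy()

predictions = torch.tensor([0, 1, 1, 2])
targets = torch.tensor([0, 1, 2, 2])
acc_metric.update(predictions, targets, task_labels=0)

print(acc_metric.result())   # running accuracy, reported per task label
acc_metric.reset()           # clear the internal state before reusing the metric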
What’s Next?

● Evaluation of a CL algorithm is not only about metrics and loggers.

● More support for the definition of training and evaluation protocols

○ How to perform cross validation in CL?

○ How to evaluate multiple runs?

● The objective of a shared protocol can be achieved only with the help of the community

V. Lomonaco et al. Avalanche: an End-to-End Library for Continual Learning. CLVision Workshop at CVPR 2021.
Avalanche Evaluation Module Demo Session

Avalanche "From Zero To Hero" tutorials: Evaluation, Loggers
Next:
Methodologies [Part 1]
Do you have any questions?

[email protected]
vincenzolomonaco.com
University of Pisa

THANKS
CREDITS: This presentation template was created by Slidesgo,
including icons by Flaticon, and infographics & images by Freepik
