
Beyond Trial & Error: A Tutorial on Automated Reinforcement Learning

André Biedenkapp & Theresa Eimer

ECAI 2024
Santiago de Compostela, 19 October 2024

André Biedenkapp (University of Freiburg)
Theresa Eimer (University of Hannover)
2
Come to the Next COSEAL Workshop
The next COSEAL workshop will take place 5th - 7th of May 2025 in Porto.
● Algorithm selection
● Algorithm configuration
● Algorithm portfolios
● Performance predictions and empirical performance models
● Bayesian optimization
● Hyperparameter optimization
● Automated machine learning (AutoML)
● Automated reinforcement learning (AutoRL)
● Neural architecture search
● Meta-learning
● Algorithm and parameter control
● Explorative landscape analysis
● Programming by optimization
● Hyper-heuristics
COSEAL.net
3
Outline
Introduction and algorithmic part on AutoRL (André, ~60min; 9:00 - 10:00)
💪Practical Session I (André, ~30min; 10:00 - 10:30)

☕ coffee break (resume 11:00) ☕

Practical guidelines and case studies (Theresa, ~45min; 11:00 - 11:45)


💪Practical Session II: Tools for HPO in RL (Theresa, ~45min; 11:45 - 12:30)

4
Why does AutoRL matter?

5
Primer on Reinforcement Learning
Agents learn by interacting with their world

6
Primer on Reinforcement Learning
Environments/Worlds are typically modelled as Markov Decision Processes (MDPs)

An MDP consists of:

• A state space S ← Observable by the agent
• An action space A ← Known to the agent
• A transition function T: S × A × S → [0, 1] ← Intractable in practice
• A reward function R: S × A → ℝ ← Observable by the agent

(A minimal code sketch of these components follows below.)

7
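To make these components concrete, here is a minimal sketch of a tabular MDP in Python; the two-state example, the class name and all numeric values are illustrative assumptions, not part of the tutorial.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    n_states: int     # |S|, the state space size
    n_actions: int    # |A|, the action space size
    T: np.ndarray     # transition probabilities T[s, a, s'] in [0, 1], shape (S, A, S)
    R: np.ndarray     # expected rewards R[s, a], shape (S, A)

# A toy two-state, two-action MDP with made-up numbers.
toy = TabularMDP(
    n_states=2,
    n_actions=2,
    T=np.array([[[0.9, 0.1], [0.2, 0.8]],
                [[0.5, 0.5], [0.1, 0.9]]]),
    R=np.array([[0.0, 1.0],
                [1.0, 0.0]]),
)
# Sanity check: every (s, a) row of T is a probability distribution over next states.
assert np.allclose(toy.T.sum(axis=-1), 1.0)
```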
Primer on Reinforcement Learning
How can we learn by interacting with the world?

1. Initialize a policy π: S × A → [0, 1]

2. Generate rollouts by following this policy.


3. Learn from your collected transitions (s, a, r, s’):
   a. Learn a model of the value (i.e., the expected reward-to-go), V(s) or Q(s, a);
      use this model to find a better policy
   b. Directly update the policy by following a policy gradient
   c. Learn a model of the world (T’: S × A × S → [0, 1]);
      use planning to find a better policy
4. Repeat 2 & 3 until your training budget is exhausted
   (A compact code sketch of steps 1 and 2 follows below.)

8
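A compact sketch of steps 1 and 2 of this loop using the Gymnasium API (gym.make, env.reset and env.step are the library's real calls); the policy placeholder and the learning step are illustrative assumptions, not the tutorial's own code.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

def collect_rollout(env, policy, max_steps=500):
    """Step 2: generate one rollout of (s, a, r, s') transitions by following `policy`."""
    transitions = []
    obs, _ = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward, next_obs))
        obs = next_obs
        if terminated or truncated:
            break
    return transitions

# Step 1 would initialize a learnable policy; here a random placeholder stands in for it.
random_policy = lambda obs: env.action_space.sample()
rollout = collect_rollout(env, random_policy)  # steps 3-4: learn from `rollout`, then repeat
```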
Primer on Reinforcement Learning

9
Primer on Reinforcement Learning
Many well-publicised success stories

10
11
source: https://huggingface.co/learn/deep-rl-course/unit3/hands-on
12
source: https://sonyresearch.github.io/gt_sophy_public/
13
source: UZH Robotics and Perception Group YouTube channel
Reinforcement Learning is Sensitive to Hyperparameters

All previous examples worked well due to lots and lots of domain expertise!

14
Reinforcement Learning is Sensitive to Hyperparameters
RL is sensitive to:

● Hyperparameters
● Network Architecture
● Reward Scale
● Random Seeds & Trials
● Environment Type
● Codebases

15
Henderson et al., AAAI 2018

16
How to Choose a Reinforcement-Learning Algorithm
More and more “heuristics” on how to choose the right algorithm for the task [Caspi et al., GitHub 2017; Bongratz et al., arXiv 2024]

17
Properties of AutoRL Hyperparameter Landscapes

18
Learning by Iteration
Supervised Learning:
1. Initialize model
2. Fit model to data set
3. Observe model performance → compute loss
4. Adapt model based on loss
5. Repeat 2-4 until learning budget is exhausted

Reinforcement Learning:
1. Initialize policy
2. Generate observations
3. Observe policy performance → compute loss
4. Adapt policy based on loss
5. Repeat 2-4 until learning budget is exhausted

19
Reinforcement Learning by Iteration

[Sutton & Barto, MIT Press 2018]
20


Reinforcement Learning by Iteration
Q-Learning example

Parameters that are being learned:
● Model of the value function, Q(s, a)

Hyperparameters:
● Learning rate α
● Discount factor γ

(The corresponding update rule is sketched below.)

21
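For reference, the tabular Q-learning update into which these two hyperparameters enter; the NumPy sketch below is illustrative (table sizes and variable names are assumptions), not the tutorial's own code.

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # the learned parameters: a tabular model of the value function

alpha = 0.1    # learning rate (hyperparameter)
gamma = 0.99   # discount factor (hyperparameter)

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) towards the bootstrapped target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```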
Properties of AutoRL Hyperparameter Landscapes
Landscape analysis of a DQN agent on the CartPole environment by [Mohan et al., AutoML 2023].

We’ll come back to this in the first practical session!
22


Properties of AutoRL Hyperparameter Landscapes

HPO for RL has to deal with much stronger non-stationarity than is the case for supervised learning!

23
AutoRL: Optimizing the Full Pipeline

24
What Can We Automate?

25
What Can We Automate? → The Role of Networks
As in supervised learning:
The architecture choice matters! (recall Henderson et al., AAAI’18)

But:
Deep RL often uses much smaller networks, particularly when the state features are “well engineered”.

This could potentially make RL the perfect playground for NAS with smaller search spaces.
27
What Can We Automate? → The Role of Networks
Recall: (Reinforcement) Learning is done by iteration

In Deep RL we are faced with the Primacy Bias (Nikishin et al., ICML 2022):
• Agents are prone to overfit to early experiences
• This often negatively affects the whole learning process

Why?
Replay Buffers & Replay Ratios: they combat training non-stationarity and increase sample efficiency.
(A minimal replay-buffer sketch follows below.)

28
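A minimal sketch of how a replay buffer and a replay ratio interact in a training loop; the class name, buffer sizes, warm-up threshold and update function are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO store of past transitions; re-using them smooths over the non-stationary data stream."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size=32):
        return random.sample(list(self.storage), batch_size)  # uniform sampling for simplicity

buffer = ReplayBuffer()
REPLAY_RATIO = 4            # gradient updates per environment step
WARMUP_TRANSITIONS = 1_000

def training_step(transition, update_fn):
    """Add one environment transition, then do REPLAY_RATIO gradient updates from the buffer."""
    buffer.add(transition)
    if len(buffer.storage) >= WARMUP_TRANSITIONS:
        for _ in range(REPLAY_RATIO):
            update_fn(buffer.sample())
```

A higher replay ratio extracts more learning from each environment step (better sample efficiency), but it also multiplies the updates made on early data, which is exactly where the primacy bias discussed next enters.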
What Can We Automate? → The Role of Networks
Heavy Priming (Nikishin et al., ICML 2022):

● After collecting only 100 data points, the agent is updated 10^5 times using the resulting replay buffer
● This amounts to heavy overfitting to a single batch of early data

29
What Can We Automate? → The Role of Networks
“Is the data collected by an overfitted agent unusable for learning?” (Nikishin et al., ICML 2022)

Train the same agent twice:
● Once with an empty replay buffer
● Once starting from the final replay buffer of the first run

The collected data is of high quality and enables quick warmstarting.
30
Resetting Combats the Primacy Bias
Partial or even full resetting to the rescue! (Nikishin et al., ICML 2022)
● Periodically resetting the full network tends to harm performance
● Keeping the input layers but periodically resetting the output layers tends to drastically improve performance

When or how often should we reset?

Related idea: resetting internal parameters of optimizers such as Adam (Asadi et al., NeurIPS 2023).
(A minimal sketch of periodic output-layer resetting follows below.)
31
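A minimal sketch of the partial-resetting idea with PyTorch: keep the feature layers and periodically re-initialize the output head. The network sizes and the reset interval are arbitrary assumptions; as the slide notes, when and how often to reset is itself an open hyperparameter.

```python
import torch.nn as nn

# A small value network; the sizes are arbitrary for illustration.
q_net = nn.Sequential(
    nn.Linear(4, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),            # output head
)

RESET_INTERVAL = 200_000  # assumed value; choosing it is part of the AutoRL problem

def maybe_reset_head(step: int) -> None:
    """Keep the input/feature layers, but periodically re-initialize the output layer."""
    if step > 0 and step % RESET_INTERVAL == 0:
        q_net[-1].reset_parameters()
```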
What Can We Automate? → The Role of Environments
We can, e.g., optimize:
● Observation Spaces
● Reward Functions
● Action Spaces
● Task Curriculum

32
Q&A: Part 1

33
Practical Session 1 https://s.gwdg.de/kuBEC1

34
Examples of Successful AutoRL, DAC (Dynamic Algorithm Configuration) and Online Configuration Approaches

35
Prioritized Level Replay [Jiang et al. 2021]

Idea: train on levels which are most important for training performance

36
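A highly simplified sketch of the sampling idea behind PLR: keep a score per level (some measure of learning potential, e.g. recent value-loss or advantage magnitude) and sample levels with rank-based priorities. The real method additionally handles unseen levels and staleness; the array size, temperature and EMA factor below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
level_scores = np.zeros(100)  # one "learning potential" score per level

def sample_level(temperature=0.3):
    """Sample a training level; higher-scoring levels get (rank-based) higher probability."""
    ranks = np.argsort(np.argsort(-level_scores)) + 1      # rank 1 = currently most useful level
    weights = (1.0 / ranks) ** (1.0 / temperature)
    probs = weights / weights.sum()
    return rng.choice(len(level_scores), p=probs)

def update_score(level_id, new_score, ema=0.9):
    """After training on a level, refresh its score (e.g., mean |advantage| of the rollout)."""
    level_scores[level_id] = ema * level_scores[level_id] + (1 - ema) * new_score
```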
HOOF: Hyperparameter Optimisation on the Fly [Paul et al. 2019]

Idea: efficient dynamic configuration by approximating hyperparameter quality.

Loop: Partial Training → Query new configuration → Fit existing samples → Approximate cost

For each collected sample s:
Is s more likely to occur under the new configuration than before?  ×  Quality of s
(sketched in code below)
40
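The question on the slide ("is s more likely under the new configuration?" times "quality of s") corresponds to a weighted importance sampling estimate. A minimal sketch, assuming trajectories stored as (states, actions, return) and placeholder log-probability functions; none of these names come from the HOOF codebase.

```python
import numpy as np

def wis_estimate(trajectories, logp_new, logp_old):
    """
    Weighted importance sampling estimate of a candidate configuration's quality,
    using only trajectories already collected under the old (behaviour) policy.
    trajectories: iterable of (states, actions, total_return)
    logp_new / logp_old: log pi(a | s) under the candidate / behaviour policy (placeholders).
    """
    weights, returns = [], []
    for states, actions, ret in trajectories:
        # "Is s more likely to occur under the new configuration than before?"
        log_ratio = sum(logp_new(s, a) - logp_old(s, a) for s, a in zip(states, actions))
        weights.append(np.exp(log_ratio))
        returns.append(ret)                 # "... x quality of s"
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(weights, returns) / (weights.sum() + 1e-8))

# HOOF-style selection: pick the candidate with the best estimate at no extra environment cost, e.g.
# best_cfg = max(candidates, key=lambda cfg: wis_estimate(trajs, cfg.logp, behaviour_logp))
```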
Population-Based Training [Jaderberg et al. 2017]

Idea: dynamic configuration via parallel trials

Loop over the population: Partial Training → Selection & Mutation
(a minimal exploit/explore sketch follows below)


41
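A minimal sketch of one PBT exploit/explore round under common assumptions (truncation selection of the bottom quartile, multiplicative hyperparameter perturbation); the dictionary keys and perturbation factors are illustrative, not the paper's exact settings.

```python
import copy
import random

def pbt_step(population, perturb=(0.8, 1.25)):
    """
    One exploit/explore round after partial training.
    `population` is a list of dicts with keys "weights", "hyperparams", "score" (placeholder names).
    """
    ranked = sorted(population, key=lambda member: member["score"], reverse=True)
    cutoff = max(1, len(ranked) // 4)
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for member in bottom:
        parent = random.choice(top)
        member["weights"] = copy.deepcopy(parent["weights"])   # exploit: copy a strong member
        member["hyperparams"] = {                              # explore: mutate its hyperparameters
            name: value * random.choice(perturb)
            for name, value in parent["hyperparams"].items()
        }
    return population
```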
Combining AutoRL Approaches

42
In RL Everything Is Connected

• Most successful AutoRL methods are dynamic

• But: changing one aspect of the learning setting influences others

• Example 1: as we progress to harder levels with PLR, we likely need different hyperparameters and maybe even a different network

• Example 2: when the agent improves in training, we may want to not only adapt the hyperparameters with PBT, but also change the algorithm
46
Combining HPO And NAS: BG-PBT [Wan et al. 2022]

47
Evaluation and Generalization of AutoRL

48
What Is Generalization in AutoRL?

[Four figure-only slides illustrating generalization in AutoRL; images via Freepik]


Evaluating RL

Commonly we measure the return R (a standard definition is sketched after this slide).

Problem: both the environment and the policy can be stochastic.

Common practice: average across multiple rollouts
• Trade-off between extra cost and reliability of the estimate
• Some return distributions are bimodal and lead to large deviations between rollouts
• This has to be done per seed, task variation, etc.
• Usually either the mean or the interquartile mean (IQM) is used as the metric
60
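The return shown as a formula on the slides is, in the standard discounted episodic setting (an assumption about which variant the slides use):

```latex
R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t ,
\qquad
\hat{J}(\pi) = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)
```

The Monte-Carlo average over N rollouts is the estimate whose reliability the bullets above discuss.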
Evaluating AutoRL

Problem: unreliable estimates meet unstable target function meet potentially stochastic AutoRL method

The evaluation costs compound (figure on the slides; images via Freepik):
• RL evaluation cost: 1
• RL across seeds: 10
• Across AutoRL seeds: 50
• AutoRL evaluation across task variations: 50 per task
65
HPO for RL

• RL algorithms have many important hyperparameters [Eimer et al. 2023]

• Which hyperparameters matter most is highly domain dependent
  (Figure: DQN hyperparameter importance on Acrobot (left) vs. on MiniGrid 5x5 (right))

• But: in practice, hand tuning and grid search are most common

• There is little systematic work, partly due to the high computational cost

(A minimal HPO sketch follows after this slide.)
70
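A minimal sketch of what moving from grid search to automated HPO can look like, using Optuna (create_study, suggest_float and optimize are real Optuna calls); the objective below is a synthetic stand-in for an actual RL training run, so the numbers mean nothing.

```python
import optuna

def train_and_evaluate(lr: float, gamma: float) -> float:
    # Hypothetical placeholder for "train an RL agent with these hyperparameters and
    # return its mean evaluation return"; a synthetic score keeps the sketch runnable.
    return -abs(lr - 3e-4) * 1e3 - abs(gamma - 0.99) * 10

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    return train_and_evaluate(lr, gamma)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)   # each trial would be a full (or partial) RL training run
print(study.best_params)
```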
HPO for RL

HPO for RL outperforms grid-search baselines from state-of-the-art papers!

But…

72
Seeding Matters

[Two figure-only slides on the effect of random seeds]
74
Zero-Cost Benchmarking: HPO-RL-Bench [Shala et al. 2024]

• Tabular benchmark: results for different configurations of 5 algorithms on 22 environments
• “Evaluation” is a single lookup, so essentially free
• Support for dynamic configuration schedules
• Pre-evaluated baselines

AutoML-Conf Best Paper Nominee
76
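To illustrate why a tabular benchmark makes evaluation essentially free, here is a generic lookup sketch; this is not the actual HPO-RL-Bench interface, and all entries are made up.

```python
# Purely illustrative: a tabular benchmark maps (algorithm, environment, configuration)
# to pre-computed learning curves, so "evaluating" a configuration is just a lookup.
# This is NOT the actual HPO-RL-Bench interface, and all numbers are made up.
precomputed = {
    ("PPO", "CartPole-v1", (3e-4, 0.99)): [21.0, 98.5, 195.3, 200.0],
    ("PPO", "CartPole-v1", (1e-2, 0.90)): [19.4, 35.2, 41.0, 38.7],
}

def evaluate(algorithm: str, env: str, config: tuple) -> float:
    """Zero-cost 'evaluation': return the final performance stored in the table."""
    return precomputed[(algorithm, env, config)][-1]

best_key = max(precomputed, key=lambda key: evaluate(*key))
print(best_key, evaluate(*best_key))
```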
Flexible Benchmarking: ARLBench [Becktepe, Dierkes et al., 2024]

• JAX-based RL implementations for efficient online execution
• Subset selection to find 4-5 relevant environments (from 22) for three algorithms
• Speedup factors between 7x and 12x per algorithm
• Support for dynamic configuration and flexible objectives as well as large search spaces

78
Practical Session 2 https://s.gwdg.de/xsIDyi

79
Thank You!

André Biedenkapp (University of Freiburg)
Theresa Eimer (University of Hannover)
