A Crash Course On Reinforcement Learning
Felix Wagner
Institute of High Energy Physics of the Austrian Academy of Sciences
Inverted CERN School of Computing 2023
Three types of machine learning

Supervised Learning: learning to label.

[Figure: a model learns to assign labels (pentagon, square, triangle, circle) to shapes; in the final build-up, the model is instead asked to “build a pyramid with a suitable item”.]
Reinforcement learning

… a framework for model-free, time-discrete control problems.
Well, if you …
● have ever asked yourself “What would be the best strategy to win UNO, chess, blackjack, …?”
● work on problems that involve optimizing the control of machines or other
types of goal-oriented action planning.
● are not frightened by mathematical definitions and linear algebra.
● are generally curious about machine learning and artificial intelligence.
● and otherwise ready to learn something completely new and exciting!
Outline
Markov decision processes (MDPs)
● state space
● action space
● dynamics function
● reward function
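As a minimal sketch, these four ingredients can be written out explicitly for a toy MDP. All state/action names and numbers below are made up for illustration:

```python
# A minimal sketch of a toy MDP, written out as plain Python objects.
# The state/action names and numbers are made up for illustration.
states = ["s0", "s1"]
actions = ["a0", "a1"]

# Dynamics function p(s' | s, a): for each (state, action), a distribution over next states.
dynamics = {
    ("s0", "a0"): {"s0": 0.9, "s1": 0.1},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.0, "s1": 1.0},
}

# Reward function r(s, a): expected immediate reward for taking action a in state s.
rewards = {
    ("s0", "a0"): 0.0,
    ("s0", "a1"): 1.0,
    ("s1", "a0"): -1.0,
    ("s1", "a1"): 2.0,
}
```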
Values and the Bellman equation

For a given policy and MDP, we can calculate the expected returns for any initial state. These are also called the state values.

For a fixed policy, the Bellman equation is a linear system of equations ⇒ the values are unique and can be computed with linear algebra.
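Since the Bellman equation for a fixed policy is linear in the state values, a minimal sketch with NumPy can solve it directly. The policy-averaged transition matrix and rewards below are invented toy numbers, not taken from the slides:

```python
import numpy as np

# Sketch: solve the Bellman equation v = r_pi + gamma * P_pi @ v for a fixed policy.
# P_pi[s, s'] and r_pi[s] are the policy-averaged dynamics and rewards; numbers are made up.
gamma = 0.9
P_pi = np.array([[0.55, 0.45],
                 [0.25, 0.75]])          # transition matrix under the policy
r_pi = np.array([0.5, 0.5])              # expected immediate reward under the policy

# (I - gamma * P_pi) v = r_pi  =>  unique solution via linear algebra
v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(v)  # state values v(s0), v(s1)
```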
Notebook 1 on github.com/fewagner/icsc23
Some more realistic examples
states (500) = {position on 5x4 grid, states (704) = {own points, dealers
location and destination passenger visible points, whether you hold a
g/r/y/b} usable ace}
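These match Gymnasium's toy-text environments Taxi-v3 and Blackjack-v1 (an assumption about the exact environments meant here; the notebooks may differ). A minimal sketch to instantiate them and inspect their state spaces:

```python
import gymnasium as gym

# Assumption: the slide refers to the Gymnasium toy-text environments.
taxi = gym.make("Taxi-v3")
print(taxi.observation_space)       # Discrete(500)
print(taxi.action_space)            # Discrete(6)

blackjack = gym.make("Blackjack-v1")
print(blackjack.observation_space)  # Tuple(Discrete(32), Discrete(11), Discrete(2))
```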
Attention!
Not every RL environment necessarily satisfies the Markov property. Some have
unobserved, internal states. These are called partially observable MDPs
(POMDPs).
MDPs: transfer questions
Solving small MDPs with tabular methods
A greedy control algorithm
Q-table:
      a0     a1
s0    1.96   2.10
s1    5.69   4.72
s2    2.00   2.54
We can just always take the action with the highest Q value!
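As a minimal sketch, the greedy rule applied to the Q-table above (values copied from the slide):

```python
import numpy as np

# Q-table from the slide: rows are states s0..s2, columns are actions a0, a1.
Q = np.array([[1.96, 2.10],
              [5.69, 4.72],
              [2.00, 2.54]])

def greedy_action(Q, state):
    """Always pick the action with the highest Q value in the given state."""
    return int(np.argmax(Q[state]))

print([greedy_action(Q, s) for s in range(3)])  # -> [1, 0, 1]
```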
In practice, it’s not that easy …
We need to learn the values from data
We need to sample data cleverly
Temporal difference (TD) learning

“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter
Exploration vs. exploitation

“Epsilon-greedy” policy: with probability ε take a random action (explore), otherwise take the greedy, highest-value action (exploit).
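A minimal sketch of ε-greedy action selection over a Q-table (Q, state, and ε here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: greedy action
```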
Let’s look back at the Bellman equation

We can use the Bellman equation as an update rule
Temporal difference (TD) learning
● “Epsilon-greedy” policy
● Update rule: after each step, move the current value estimate a small step towards the TD target (the reward plus the discounted value of the next state-action pair)
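A sketch of one such tabular update, here using the value of the actually chosen next action as the bootstrapped target (the SARSA flavour discussed next); the step size alpha and discount gamma are assumed hyperparameters:

```python
def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular TD update: move Q(s, a) a small step towards the TD target."""
    td_target = r + gamma * Q[s_next, a_next]   # bootstrapped target from the Bellman equation
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```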
On-policy and off-policy methods

On-policy: SARSA considers that future actions are taken according to the current (behaviour) policy.

Off-policy: Q-learning considers that future actions are taken with another (target) policy, in this case the greedy policy.
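A minimal sketch of both updates side by side; they differ only in how the bootstrapped target is formed (alpha and gamma are again assumed hyperparameters):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap with the action a_next actually chosen by the behaviour policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap with the greedy (target-policy) action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```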
Cliff walking (from Sutton & Barto)

The agent walks in a gridworld, receiving -1 per step until it reaches the goal, and -100 for falling off the cliff.

Q-learning learns the optimal policy of walking right next to the cliff, but falls off sometimes due to the ε-greedy action selection. SARSA takes this action selection into account and obtains higher rewards online.
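For reference, a minimal training loop on Gymnasium's CliffWalking-v0 environment, reusing the ε-greedy selection and Q-learning update sketched above (a sketch assuming the gymnasium package; the lecture notebooks may be set up differently):

```python
import numpy as np
import gymnasium as gym

env = gym.make("CliffWalking-v0")
Q = np.zeros((env.observation_space.n, env.action_space.n))
rng = np.random.default_rng(0)
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(500):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = rng.integers(env.action_space.n) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        # Q-learning update (off-policy, greedy bootstrap)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) * (not terminated) - Q[s, a])
        s = s_next
        done = terminated or truncated
```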
SARSA/Q-learning on taxi driver/blackjack
Notebook 3 on
github.com/fewagner/icsc23
Tabular methods: transfer questions
Solving large MDPs with policy gradient methods
How large is a large MDP?
● How many states does driving a car have? (all positions and velocities of the wheels, all sensor readings, visual input, GPS data, … ???)
Function approximation

Many environments have continuous (real-valued) state and action spaces. We cannot build an exact Q-table for such environments!

Instead, we treat the Q-table as a function, called the value function. In this formulation, we can use function approximators to learn the value function, similarly to how we updated the Q-table.

For large or continuous action spaces, we can likewise treat the policy as a policy function.
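A minimal sketch of what such parametric stand-ins could look like, here with simple linear models in NumPy (the feature dimensions and the Gaussian policy are illustrative choices; in practice deep neural networks are the common approximators):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch: parametric stand-ins for the Q-table, with made-up feature dimensions.
state_dim, action_dim = 4, 2

# Value function v(s; w): here simply linear in the state features.
w = np.zeros(state_dim)

def value(state, w):
    return w @ state

# Policy function pi(a | s; theta): a Gaussian over continuous actions,
# whose mean is a linear function of the state.
theta = np.zeros((action_dim, state_dim))

def sample_action(state, theta, sigma=0.1):
    mean = theta @ state
    return rng.normal(mean, sigma)
```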
Examples of parametric function approximators
Learning parameters with gradient descent
Notebook 4 on
github.com/fewagner/icsc23
Loss functions for values/policy

Value function loss: minimize the mean squared error between returns and values, leading to a gradient update of the value parameters.

Policy function loss: maximize the probability of actions with a high TD error, leading to a gradient update of the policy parameters.
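A sketch of both losses for a single transition, written with PyTorch autograd as an example framework (value_net and policy_net are assumed modules, and policy_net is assumed to return a torch distribution; the notebooks may implement this differently):

```python
import torch

def value_loss(value_net, state, G):
    """Mean squared error between the (target) return G and the predicted value."""
    v = value_net(state)
    return (G - v).pow(2).mean()

def policy_loss(policy_net, state, action, td_error):
    """Increase the log-probability of actions with a positive TD error."""
    log_prob = policy_net(state).log_prob(action)   # assumes the net returns a distribution
    return -(td_error.detach() * log_prob).mean()   # minus sign: optimizers minimize
```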
Actor-critic: TD learning in continuous spaces
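Putting the two losses together gives a minimal actor-critic step, in which the TD error both trains the critic and weights the actor. This is a hypothetical sketch with assumed names, not the exact implementation from the slides:

```python
import torch

def actor_critic_step(policy_net, value_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One actor-critic update: the TD error both trains the critic and weights the actor."""
    v, v_next = value_net(s), value_net(s_next)
    td_target = r + gamma * v_next.detach() * (1.0 - done)
    td_error = td_target - v

    critic_loss = td_error.pow(2).mean()
    actor_loss = -(td_error.detach() * policy_net(s).log_prob(a)).mean()

    optimizer.zero_grad()
    (critic_loss + actor_loss).backward()
    optimizer.step()
```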
Actor-critic on the lunar lander
Policy gradient methods: transfer questions

When would you prefer a method that uses function approximation over a tabular method?

● Actor-critic (?)
● SARSA (on-policy)
● Q-learning (off-policy)
Recap

● Markov decision processes
● Policy, dynamics and reward functions
● Values, Bellman equation
● SARSA (on-policy)
● Prediction/control
● Bandits
Reinforcement learning in (experimental) physics
Phys. Rev. Accel. Beams 24, 104601–104618 (2021). https://doi.org/10.1103/PhysRevAccelBeams.24.104601
Nature 602, 414–419 (2022). https://doi.org/10.1038/s41586-021-04301-9
Questions?
Backup
Bandits - contextual bandits - reinforcement learning

Bandits consider only immediate rewards.

Bandit: “Considering this MDP, the actions bring on average: a0: 1.2, a1: -0.1.”

RL: “We have to plan action-state trajectories ahead and consider delayed rewards! 🤓 But … how to assign credit to individual actions…? 🤔”

We discuss this in sections 2 and 3!
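For illustration, a minimal sketch of the bandit view: it only needs running estimates of the immediate reward per action (the noisy reward distributions below are invented around the averages from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = {"a0": 1.2, "a1": -0.1}       # average rewards from the slide

counts = {a: 0 for a in true_means}
estimates = {a: 0.0 for a in true_means}

for step in range(1000):
    # epsilon-greedy over *immediate* reward estimates, no planning ahead
    if rng.random() < 0.1:
        a = rng.choice(list(true_means))
    else:
        a = max(estimates, key=estimates.get)
    r = rng.normal(true_means[a], 1.0)      # made-up noisy reward
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]   # running sample average

print(estimates)
```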
Derivation of the Bellman equation
The tiger problem, a POMDP
observations = {hear tiger right, hear tiger left}, 85% probability to hear tiger on correct side
rewards = 10 for opening door with treasure, -100 for opening door with tiger, -1 for listening
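A minimal sketch writing the tiger problem's observation and reward models down as plain Python dictionaries (the action/observation names are illustrative):

```python
# Sketch of the tiger POMDP's observation and reward models as plain dicts.
# P(observation | true tiger location) when the agent listens:
observation_model = {
    "tiger_left":  {"hear_left": 0.85, "hear_right": 0.15},
    "tiger_right": {"hear_left": 0.15, "hear_right": 0.85},
}

# Rewards for each action given the true (hidden) tiger location:
rewards = {
    ("open_left",  "tiger_left"):  -100,  # opened the tiger's door
    ("open_left",  "tiger_right"):   10,  # opened the treasure door
    ("open_right", "tiger_left"):    10,
    ("open_right", "tiger_right"): -100,
    ("listen",     "tiger_left"):    -1,
    ("listen",     "tiger_right"):   -1,
}
```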
Prediction and control

“Solving” an MDP can mean two separate things: solving a prediction problem and solving a control problem. For many algorithms (especially for control) the two problems are solved simultaneously and iteratively.
Temporal difference (TD) prediction

● Fix a policy.
● Let an agent take actions in the environment according to this policy.
● After every step, update the values according to the Bellman equation.

Notebook 2 on github.com/fewagner/icsc23
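A minimal sketch of tabular TD(0) prediction for a fixed policy, assuming an environment with the Gymnasium reset/step API and a policy(s) function that returns an action:

```python
import numpy as np

def td0_prediction(env, policy, n_states, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate state values V for a fixed policy with one TD update per step."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            target = r + gamma * V[s_next] * (not terminated)   # Bellman-style bootstrapped target
            V[s] += alpha * (target - V[s])                     # move V(s) towards the target
            s = s_next
            done = terminated or truncated
    return V
```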
Approximating values with gradient descent

We choose a function approximator for our value function and update its parameters such that the squared error with respect to the true value function is minimized.

Note that we bootstrap the true value function with the reward and the next state’s value, as introduced in the TD learning chapter!
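A sketch of this semi-gradient update with PyTorch as an example framework: the bootstrapped target is treated as a constant, so no gradient flows through it (value_net and optimizer are assumed objects):

```python
import torch

def semi_gradient_td_step(value_net, optimizer, s, r, s_next, done, gamma=0.99):
    """Minimize the squared error against the bootstrapped target r + gamma * v(s')."""
    target = r + gamma * value_net(s_next).detach() * (1.0 - done)  # bootstrapped, no gradient
    loss = (target - value_net(s)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```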
Approximating policies with gradient descent
Can you think of any additional challenges when using RL with large function
approximators (e.g. deep neural networks)?