
A crash course on

reinforcement learning

Felix Wagner
Institute of High Energy Physics of the Austrian Academy of Sciences
Inverted CERN School of Computing 2023
Three types of machine learning

Supervised Learning: labeled training data (pentagon, square, triangle, circle, …); the model learns to label new data (“Square!”).
Three types of machine learning

Supervised Learning: labeled training data; the model learns to label new data (“Square!”).
Unsupervised Learning: unlabeled training data; the model learns to cluster new data.
Three types of machine learning

Supervised Learning: labeled training data; learning to label new data.
Unsupervised Learning: unlabeled training data; learning to cluster new data.
Reinforcement Learning: learning to make decisions with the best expected return. Example task: “build a pyramid with a suitable item”.
Reinforcement learning

Reinforcement learning (RL) in a nutshell:


“Learn a policy to maximize rewards over time”
Reinforcement learning

There are famous examples of RL learning to play games:

“AlphaGo” won against the Go world champion.
“AlphaStar” wins StarCraft against 99.85% of human players.

But RL can also be applied to real-world problems, as …


6
Reinforcement learning

… a framework for
model-free,
time-discrete control
problems.

Videos and GIFs are disabled in the PDF version of these slides.

OpenAI Gym - A framework for reinforcement learning
https://www.gymlibrary.dev/
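
A minimal interaction loop with a Gym environment, as a quick illustration (this sketch is not from the lecture notebooks and assumes gym >= 0.26, where reset() returns (obs, info) and step() returns five values):

import gym

env = gym.make("CartPole-v1")           # any registered environment works here
observation, info = env.reset(seed=42)

for _ in range(200):
    action = env.action_space.sample()  # random policy, just to show the loop
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()

env.close()

The same reset/step loop is what the agent-environment interaction in the following slides formalizes.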
This lecture is for you!

Well, if you …

● have ever asked yourself “What would be the best strategy to win UNO,
chess, black jack, …?”
● work on problems that involve optimizing the control of machines or other
types of goal-oriented action planning.
● are not frightened by mathematical definitions and linear algebra.
● are generally curious about machine learning and artificial intelligence.
● and are otherwise ready to learn something completely new and exciting!

8
Outline

We cover the following topics:

● Markov decision processes (MDPs)
● Solving small MDPs with tabular methods
● Solving large MDPs with policy gradient methods

Transfer questions follow every section.

Jupyter notebooks with code examples are hosted on github.com/fewagner/icsc23.
9
Literature

“Sutton & Barto” is the standard textbook for RL, and the main source for this lecture.

However, developments in the field are fast, and staying up to date involves skimming papers and proceedings all the time.

10
Markov decision processes (MDPs)
Markov decision processes (MDPs)

An MDP is a 4-tuple consisting of an action space and a state space, a dynamics function, and a reward function. The following notation for the dynamics and reward functions is often convenient:
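(Standard notation, following Sutton & Barto; assumed to match the equations rendered on the slide.)

p(s', r \mid s, a) = \Pr(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a)

r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)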

12
Markov decision processes (MDPs)

For the MDP displayed on the right:

state space

action space

dynamics function

reward function

13
Markov decision processes (MDPs)

RL models the interaction of an


agent with an environment (an
MDP).

The agent aims to take actions that


maximize the collected rewards over
time (returns), following a policy
function:

14
Markov decision processes (MDPs)

RL models the interaction of an


agent with an environment (an
MDP).

The agent aims to take actions that


maximize the collected rewards over
time (returns), following a policy
function:
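
(In standard notation, assumed to match the slide: \pi(a \mid s) = \Pr(A_t = a \mid S_t = s), the probability of taking action a in state s.)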

15
Values and the Bellman equation
For a given policy and MDP, we can calculate the expected returns
for any initial state. These are also called the state values.

16
Values and the Bellman equation
For a given policy and MDP, we can calculate the expected returns
for any initial state. These are also called the state values.

Some MDPs have terminal states (episodic MDPs),


others run forever (continuous MDPs).

Especially for continuous MDPs values could diverge


over time. Therefore we introduce a discounting
factor 0<γ<1.
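
(In standard notation, assumed to match the slide's equations: the discounted return and the state value are

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s].)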

17
Values and the Bellman equation
For a given policy and MDP, we can calculate the expected returns
for any initial state. These are also called the state values.

The state values satisfy a recursive relationship that is


called the Bellman equation:

18
Values and the Bellman equation
For a given policy and MDP, we can calculate the expected returns
for any initial state. These are also called the state values.

The state values satisfy a recursive relationship that is


called the Bellman equation:
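
(Standard form, following Sutton & Barto and assumed to match the equation on the slide:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma\, v_\pi(s') \big].)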

Linear system of
equations ⇒ values
are unique and can
be computed with
linear algebra. 19
Values and the Bellman equation
For a given policy and MDP, we can calculate the expected returns
for any initial state. These are also called the state values.

The state values satisfy a recursive relationship that is


called the Bellman equation:

In practice it’s often


handy to define
state-action values
instead.
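
(Standard definition, assumed to match the slide: q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma\, v_\pi(s') \big].)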
20
A few numerical experiments with our MDP

Notebook 1 on
github.com/fewagner/icsc23

21
Some more realistic examples

Taxi
states (500) = {position on 5x5 grid, location and destination of the passenger (r/g/y/b)}
actions (6) = {move up/down/left/right, pickup/dropoff passenger}
rewards = +20 for delivering, -10 for wrong dropoff/pickup, -1 otherwise

Blackjack
states (704) = {own points, dealer's visible points, whether you hold a usable ace}
actions (2) = {hit, stick}
rewards = +1 win, -1 lose, 0 draw

We solve these environments in section 2!
22
The Markov property

MDPs are “memoryless”, i.e. the dynamics and reward


function do not depend on history prior to the current
state and action.

Attention!
Not every RL environment necessarily satisfies the Markov property. Some have
unobserved, internal states. These are called partially observable MDPs
(POMDPs).
23
MDPs: transfer questions

On which level of detail should we formulate the MDP?


Consider driving a car. Is the action “drive to the shop”, or is it “turn the steering wheel to the right”, or even “move the left front wheel by 15 degrees”?

Are there limitations to the requirement of a scalar


reward? Can all types of goal-oriented decision making
be formulated as maximizing rewards?

24
Solving small MDPs
with tabular methods

25
A greedy control algorithm

Assume we know the action-value (Q) function corresponding to an optimal policy.

        a0     a1
s0    1.96   2.10
s1    5.69   4.72
s2    2.00   2.54

26
A greedy control algorithm

Assume we know the action-value (Q) function corresponding to an optimal policy.

We can just always take the action with the highest Q value!

        a0     a1
s0    1.96   2.10
s1    5.69   4.72
s2    2.00   2.54
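
A short illustration of this greedy lookup (the Q-values are the ones from the table above; the code itself is ours, not from the notebooks):

import numpy as np

Q = np.array([[1.96, 2.10],   # rows: states s0..s2
              [5.69, 4.72],   # columns: actions a0, a1
              [2.00, 2.54]])

def greedy_action(state: int) -> int:
    """Return the action with the highest Q-value in the given state."""
    return int(np.argmax(Q[state]))

print([greedy_action(s) for s in range(3)])   # -> [1, 0, 1]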

27
In practice, it’s not that easy …

Let’s look back at the Bellman equation

28
We need to learn the values from data

The expected value depends on the dynamics of the


environment and has to be learned from experience.

29
We need to sample data cleverly

The expected value depends on the dynamics of the


environment and has to be learned from experience.

An efficient algorithm preferably collects experience that is relevant for finding an optimal policy.

30
Temporal difference (TD) learning
“If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter

General idea of TD learning:

● Implement a policy based on the Q-values to choose next action.

● After every step, make update on Q-values.

31
Temporal difference (TD) learning
“If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter

General idea of TD learning:

● Implement a policy based on the Q-values to choose next action.

● After every step, make update on Q-values.

32
Exploration vs. exploitation

Rewards in RL are typically sparse and need


to be discovered before they can be
propagated to neighboring and distant state
values.

“Epsilon-greedy” policy:

Take greedy action with probability 1-ε and


random action with probability ε.
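
A small sketch of this policy (our own naming, not necessarily that of the notebooks):

import numpy as np

def epsilon_greedy_action(Q: np.ndarray, state: int, epsilon: float,
                          rng: np.random.Generator) -> int:
    """With probability epsilon act randomly, otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[state]))            # exploit: greedy action

rng = np.random.default_rng(0)
action = epsilon_greedy_action(np.zeros((3, 2)), state=0, epsilon=0.1, rng=rng)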

33
Temporal difference (TD) learning
“If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter

General idea of TD learning:

● “Epsilon-greedy” policy

● After every step, make update on Q-values.

34
Temporal difference (TD) learning
“If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter

General idea of TD learning:

● “Epsilon-greedy” policy

● After every step, make update on Q-values.

35
Let’s look back at the Bellman equation

Can’t we learn an expected value iteratively?

36
We can use the Bellman equation as an update rule

learning rate ∊ (0,1)

37
We can use the Bellman equation as an update rule

learning rate ∊ (0,1)

We are moving our previous estimate of the Q-value a small


step towards the new experience!
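
(A common form of this update, assumed to match the slide, with learning rate \alpha:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big],

where the bracket is the difference between the new experience, i.e. the bootstrapped target, and the previous estimate.)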

38
Temporal difference (TD) learning
“If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter

General idea of TD learning:

● “Epsilon-greedy” policy

● Update rule:

39
On-policy and off-policy methods

The simplest on-policy and off-policy control schemes:

SARSA

Q-learning
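
(The standard update rules, following Sutton & Barto and assumed to match the slide:

SARSA:      Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]

Q-learning: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \big].)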

On-policy: SARSA assumes that future actions are taken according to the current policy.
Off-policy: Q-learning assumes that future actions are taken with another (target) policy, in this case the greedy policy.

⇒ Q-learning can learn a policy without actually following it!


40
On-policy and off-policy methods

SARSA

Q-learning

41
Cliff walking
The agent walks in a gridworld, receiving -1 until it reaches the goal, and -100 for falling off the cliff.

Q-learning learns the optimal policy of walking right next to the cliff, but sometimes falls off due to the ε-greedy action selection. SARSA takes this action selection into account and obtains higher rewards online.

42
(from Sutton & Barto)
SARSA/Q-learning on taxi driver/black jack

Notebook 3 on
github.com/fewagner/icsc23
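
For orientation, a minimal, independent sketch of tabular Q-learning on Taxi-v3 (not copied from Notebook 3; the hyperparameters are arbitrary examples, and the API assumes gym >= 0.26):

import numpy as np
import gym

env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1      # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: bootstrap with the greedy (target) policy
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

Replacing np.max(Q[next_state]) with the Q-value of the action actually chosen next turns this into SARSA.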

43
Tabular methods: transfer questions

What problems could occur with the epsilon-greedy scheme?

When would you use an on-policy method (SARSA) and when an


off-policy method (Q-learning)?

44
Solving large MDPs
with policy gradient methods

45
How large is a large MDP?

● How many states does the game tic-tac-toe have?

46
How large is a large MDP?

● How many states does the game tic-tac-toe have? 765

● How many states does the game Go have?

47
How large is a large MDP?

● How many states does the game tic-tac-toe have? 765

● How many states does the game Go have? 3^361 ≈ 10^172

(for comparison: our universe has ~10^82 atoms)

● How many states does driving a car have?

48
How large is a large MDP?

● How many states does the game tic-tac-toe have? 765

● How many states does the game Go have? 3^361 ≈ 10^172

● How many states does driving a car have? (all positions and velocities of the wheels, all sensor readings, visual input, GPS data, … ???)

How would we set up a


Q-table for this?

49
Function approximation

Many environments have continuous (real-valued) state and action spaces. We cannot build an exact Q-table for such environments!
Instead, we treat the Q-table as a function, called the value function:

In this formulation, we can use function approximators to learn the value function, similarly to how we updated the Q-table.

But … how to choose a greedy action for continuous action spaces? 🤔

50
Function approximation

Many environments have continuous (real-valued) state and action spaces. We cannot build an exact Q-table for such environments!
Instead, we treat the Q-table as a function, called the value function:

In this formulation, we can use function approximators to learn the value function, similarly to how we updated the Q-table.
For large or continuous action spaces, we can treat the policy as a policy function:

We call this value a “preference”. There are many ways to approximate a function; we focus on parametric approximators.
51
w and θ are the real-valued parameter vectors.
Actor-critic: TD learning in continuous spaces

Simplest version of an actor-critic agent:

The value function (critic) estimates the returns for states.

The policy (actor) learns actions that have high values.

52
Examples of parametric function approximators

linear regression, Gaussian radial basis function, neural network
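
As a concrete (if simplistic) example, a linear value function with one-hot state features, v_hat(s, w) = w·x(s); the feature choice and names here are our own illustration, not taken from the lecture:

import numpy as np

n_states = 5
w = np.zeros(n_states)                  # parameter vector

def features(state: int) -> np.ndarray:
    x = np.zeros(n_states)
    x[state] = 1.0                      # one-hot features; RBFs or a network also work
    return x

def v_hat(state: int) -> float:
    return float(w @ features(state))

def td_update(s: int, r: float, s_next: int, alpha: float = 0.1, gamma: float = 0.99):
    """One semi-gradient TD(0) step; the gradient of v_hat w.r.t. w is x(s)."""
    global w
    delta = r + gamma * v_hat(s_next) - v_hat(s)
    w += alpha * delta * features(s)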

53
Learning parameters with gradient descent

Videos and GIFs are disabled in


the PDF version of these slides.

Notebook 4 on
github.com/fewagner/icsc23
54
Loss functions for values/policy

Value function loss: minimize the mean squared error between returns and values, leading to the gradient update

Policy function loss: maximize the probability of actions with a high TD error, leading to the gradient update
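
(Written out in Sutton & Barto's notation, which the slide's equations are assumed to follow, with TD error \delta_t:

\delta_t = R_{t+1} + \gamma\, \hat v(S_{t+1}, w) - \hat v(S_t, w)

w \leftarrow w + \alpha_w\, \delta_t\, \nabla_w \hat v(S_t, w)

\theta \leftarrow \theta + \alpha_\theta\, \gamma^t\, \delta_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta).)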

55
Actor-critic: TD learning in continuous spaces

Simplest version of an actor-critic agent:

The value function (critic) estimates the returns for states.

The policy (actor) learns actions that have high values.

Exploration-exploitation is balanced by sampling actions from


the policy’s probability distribution.

After every step, updates are done according to:

In this formulation, the algorithm can only be used for episodic tasks and uses a state value function.
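
A compact sketch of one such update step with a Gaussian policy, written with PyTorch purely for illustration (the lecture notebooks may use a different library and network layout; obs_dim, act_dim and all hyperparameters are placeholder assumptions):

import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                       # placeholder sizes (e.g. LunarLander-like)

critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
actor_mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std, for simplicity

critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(list(actor_mean.parameters()) + [log_std], lr=1e-4)

def update(state, action, reward, next_state, done, gamma=0.99):
    state = torch.as_tensor(state, dtype=torch.float32)
    next_state = torch.as_tensor(next_state, dtype=torch.float32)
    action = torch.as_tensor(action, dtype=torch.float32)

    # TD error; no gradient flows through the bootstrapped target
    with torch.no_grad():
        target = reward + gamma * (1.0 - float(done)) * critic(next_state)
    delta = target - critic(state)

    # critic: minimize the squared TD error
    critic_loss = delta.pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: raise the log-probability of actions with positive TD error
    dist = torch.distributions.Normal(actor_mean(state), log_std.exp())
    actor_loss = -(delta.detach() * dist.log_prob(action).sum()).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

Calling update(...) once per environment step reproduces the "after every step" scheme described above.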

56
Actor-critic on the lunar lander

Videos and GIFs are disabled in


the PDF version of these slides.

Notebook 5 on github.com/fewagner/icsc23

OpenAI Gym: LunarLander-v2

57
Policy gradient methods: transfer questions

When would you prefer a method that uses approximation over a tabular
method?

Is our version of actor-critic an on-policy or an off-policy algorithm?

Actor-critic (?)

SARSA (on-policy)

Q-learning (off-policy)

58
Recap

59
Recap

Markov decision processes

60
Recap

Markov decision processes Policy, dynamics- and reward function

61
Recap

Markov decision processes Policy, dynamics- and reward function Values, Bellman equation

62
Recap

Markov decision processes Policy, dynamics- and reward function Values, Bellman equation

SARSA (on-policy) 63
Recap

Markov decision processes Policy, dynamics- and reward function Values, Bellman equation

SARSA (on-policy) Q-learning (off-policy) 64


Recap

Markov decision processes Policy, dynamics- and reward function Values, Bellman equation

SARSA (on-policy) Q-learning (off-policy) Actor-critic 65


model-free/model-based
partially observable MDPs
prediction/control
bandits

There is much more to learn …

A few examples of reinforcement


learning in physics.

66
Reinforcement learning in (experimental) physics

Phys. Rev. Accel. Beams 24, 104601 (2021). https://doi.org/10.1103/PhysRevAccelBeams.24.104601
Nature 602, 414–419 (2022). https://doi.org/10.1038/s41586-021-04301-9
67
Reinforcement learning in (experimental) physics

Optimal operation of cryogenic calorimeters with reinforcement learning


(paper in preparation)

68
Questions?
Backup

70
Bandits - contextual bandits - reinforcement learning

Bandits consider only immediate rewards. (considering this MDP)

Bandit: “The actions bring on average the rewards:

        a0     a1
       1.2   -0.1 ”

Contextual bandit: “The action-state pairs bring on average the rewards:

  S0,a0   S0,a1   S1,a0   S1,a1   S2,a0   S2,a1
      0       0     3.5       0       0    -0.3 ”

RL: “We have to plan action-state trajectories ahead and consider delayed rewards! 🤓 But … how to assign credit to individual actions? 🤔”

We discuss this in sections 2 and 3!
71
Derivation of the Bellman equation

72
The tiger problem, a POMDP

The tiger: cute, but dangerous. The treasure: valuable, but not worth dying for.

For POMDPs, the agent obtains observations of the environment, while the state is a hidden internal property of the environment.

states = {tiger right, tiger left}

actions = {open left, open right, listen}

observations = {hear tiger right, hear tiger left}, 85% probability to hear the tiger on the correct side

rewards = +10 for opening the door with the treasure, -100 for opening the door with the tiger, -1 for listening
73
Prediction and control

“Solving” an MDP can mean two separate things: solving a prediction problem and a control problem. For many algorithms (especially for control) the two problems are solved simultaneously and iteratively.

Prediction: the policy is fixed; predict the returns for given starting states.

Control: find a policy that maximizes the returns (does not necessarily need values).
74
Model-based and model-free

If we only knew everything … 🤷


In general we do not know the dynamics and reward functions (model-free)!

For problems with a known environment we can use model-based techniques: planning, ...

75
Temporal difference (TD) prediction
“If one had to identify one idea as central and novel to
reinforcement learning, it would undoubtedly be TD learning.”
- Sutton & Barto, first sentence of the TD chapter

General idea of TD prediction:

Fix a policy.
● Let an agent take actions in an environment according to the policy.
● After every step, make an update on the values according to the Bellman equation:

A scheme with this update rule is also called TD(0).

learning rate α ∊ (0,1)
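
(In standard notation, assumed to match the slide:

V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t) \big].)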
76
TD prediction on our MDP

Notebook 2 on
github.com/fewagner/icsc23

77
Approximating values with gradient descent

We choose a function approximator for our value function and update its parameters such that the squared errors with the true value function are minimized:

Note that we bootstrap the true value function with the reward and the next state value, as introduced in the TD learning chapter!
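
(A standard way to write this, assumed to match the slide: the loss \big( v_\pi(S_t) - \hat v(S_t, w) \big)^2 with the bootstrapped target v_\pi(S_t) \approx R_{t+1} + \gamma\, \hat v(S_{t+1}, w) leads to the semi-gradient step

w \leftarrow w + \alpha \big[ R_{t+1} + \gamma\, \hat v(S_{t+1}, w) - \hat v(S_t, w) \big] \nabla_w \hat v(S_t, w).)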

78
Approximating policies with gradient descent

For the policy function, a parametrization with an explicitly known expectation is advantageous. The natural choice is a Gaussian function. We therefore use two function approximators to learn the mean and standard deviation:

We update the policy parameters to maximize the state values, using the policy gradient theorem to calculate the derivative. (This expression is not trivial! Proof in Sutton & Barto.)

δ_t is again the TD error. The γ^t factor is not trivial; its derivation is an exercise in Sutton & Barto.

⇒ update rule:
79
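
(The Gaussian parametrization and the resulting update, in standard form following Sutton & Barto and assumed to match the slide's equations:

\pi(a \mid s, \theta) = \frac{1}{\sigma(s, \theta)\sqrt{2\pi}} \exp\!\left( -\frac{(a - \mu(s, \theta))^2}{2\,\sigma(s, \theta)^2} \right)

\theta \leftarrow \theta + \alpha_\theta\, \gamma^t\, \delta_t\, \nabla_\theta \ln \pi(A_t \mid S_t, \theta).)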
TD(0) full algorithm

(from Sutton & Barto) 80


SARSA full algorithm

(from Sutton & Barto) 81


Q-learning full algorithm

(from Sutton & Barto) 82


Actor-critic full algorithm

(from Sutton & Barto) 83


Additional transfer questions

How would we describe the pendulum problem as an MDP?

When would you prefer a model-based algorithm and when a model-free


algorithm?

Tabular methods are interesting as they allow proofs of convergence and other mathematical properties. Can you imagine any disadvantages or limitations of tabular methods?

Can you think of any additional challenges when using RL with large function
approximators (e.g. deep neural networks)?

Could we use different parameterizations of the policy function (non-Gaussian)?

84
