
Introduction to Reinforcement Learning

‣ Leonardo De Marchi
www.ideai.io
Leonardo De Marchi
Reinforcement learning

Theory
‣ Intro
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-Learning
‣ Gradient methods

PRACTICE
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-learning
‣ Gradient Methods
Poll: What do you hope to
get out of today's course?
Machine learning
MACHINE LEARNING

SUPERVISED UNSUPERVISED REINFORCEMENT


SUPERVISED - TRAINING

INPUT OUTPUT
SUPERVISED - TRAINING

INPUT MODEL OUTPUT


SUPERVISED - scoring

INPUT MODEL OUTPUT


SUPERVISED - scoring
unSUPERVISED

INPUT MODEL input clustered


unSUPERVISED

INPUT MODEL input clustered


Reinforcement
learning
Reinforcement learning

agent

feedback action

environment
RL applications
AlphaGo
AlphaZero
Robotics
Why it matters
‣ Text summarisation engines
‣ Dialog agents (text, speech)
‣ Learning optimal treatment policies in healthcare
‣ Online stocking
‣ Scheduling
‣ …
Why it matters
‣ Learn how to make decisions to achieve a goal
Why it matters
‣ Learn how to make decisions to achieve a goal
by itself!
games
A2C
GQN
GQN
Interesting Applications

https://www.youtube.com/watch?v=oo0TraGu6QY

https://www.youtube.com/watch?time_continue=72&v=TmPfTpjtdgg

https://www.youtube.com/watch?v=UZHTNBMAfAA

https://www.youtube.com/watch?time_continue=118&v=eHipy_j29Xw
Questions
MULTI-ARMED BANDIT
Poll : What do you know
about Bandit methods?
Maximising
reward
Markov property
‣ The next state depends only on the current state and action
‣ The state must include all information about past agent–environment interaction that makes a difference for the future
‣ Action: the decision we want to learn how to make

p(s’,r|s,a) = Pr{St=s’, Rt=r | St-1=s, At-1=a}


Simple multi-armed Bandit

[Figure: three slot machines, with current success rates of 30%, 77% and 50%]
Simple multi-armed Bandit
‣ How many trials?
‣ Are the estimates stable?

[Figure: three slot machines, with current success rates of 30%, 77% and 50%]
multi-armed Bandit
‣ Many options, high variability
[Animation: a large grid of bandit arms, each labelled “Current Success Rate”; most rates are unknown (?%) and, as trials accumulate, some estimates settle around 30% and 77%]
Exploration vs Exploitation

We want:
‣ Maximise our total reward
‣ Explore different solutions to find the best one

MAB:
‣ Estimate the payoff for each option
‣ Take the best option, but sometimes explore others
Example - Newspaper Headlines

‣ “Disneyland increases prices”
‣ “Disneyland increases prices by 67% in 10 years”
‣ “You will never believe what Disneyland did”
Example - Newspaper Headlines

‣ “Disneyland increases prices”
‣ “Disneyland increases prices by 67% in 10 years”
‣ “You will never believe what Disneyland did”

‣ Age band 1 ‣ Age band 2 ‣ Age band 3


Algorithms
‣ Greedy
‣ 𝜀-greedy
‣ Thompson sampling
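A minimal sketch of the greedy / 𝜀-greedy approach on a simulated bandit, assuming three Bernoulli arms with hypothetical success rates (the greedy algorithm is the special case 𝜀 = 0):

import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.30, 0.77, 0.50]        # hypothetical arms, as in the earlier slides
epsilon = 0.1
counts = np.zeros(len(true_probs))     # number of pulls per arm
values = np.zeros(len(true_probs))     # estimated success rate per arm

for t in range(1000):
    # explore with probability epsilon, otherwise exploit the current best estimate
    if rng.random() < epsilon:
        arm = rng.integers(len(true_probs))
    else:
        arm = int(np.argmax(values))
    reward = rng.random() < true_probs[arm]               # Bernoulli reward
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental average

print(values)  # estimates of the frequently pulled arms approach their true rates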
Thompson Sampling
‣ Best solution: Thompson Sampling

Regret = the sum of all differences between the reward returned by the strategy taken and the best possible reward
Thompson Sampling
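A minimal Beta–Bernoulli Thompson sampling sketch on the same hypothetical arms: keep a Beta posterior per arm, sample a plausible success rate from each posterior, and pull the arm whose sample is highest.

import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.30, 0.77, 0.50]            # hypothetical arms
alpha = np.ones(len(true_probs))           # Beta posterior: 1 + successes
beta = np.ones(len(true_probs))            # Beta posterior: 1 + failures

for t in range(1000):
    samples = rng.beta(alpha, beta)        # one posterior sample per arm
    arm = int(np.argmax(samples))          # pull the most promising arm
    reward = rng.random() < true_probs[arm]
    alpha[arm] += reward                   # update the posterior counts
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))              # posterior mean success rate per arm

Arms with poor posteriors get sampled less and less often, which is what keeps the regret low.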
Exercise: MAB
𝜀-greedy
Poll: How are you planning
to use Bandit methods?
RL problem
The Problem

agent

feedback action

environment
Environment
Environment
‣ Anything that
cannot be
changed arbitrarily
by the agent is
considered
environment
Environment
‣ More complex than MAB
‣ Multiple states
‣ Complex reward function
Feedback
‣ Returned by the environment

+10
+1
Goal
‣ Maximise the total reward
Total Reward
OpenAi’s gym basics
‣ Import: import gym
‣ Load environment: env = gym.make('SpaceInvaders-v0')
‣ Start an episode: env.reset()
‣ Display the environment: env.render()
‣ Evaluate an action: env.step(action)
‣ It returns an observation, a reward, whether the episode has finished, and some info on the environment
‣ observation, reward, done, info = env.step(action)
‣ You can start with 20 episodes of 100 time steps each (see the sketch below)
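Putting those calls together, a minimal random-agent loop, assuming the classic gym API in which env.step returns (observation, reward, done, info); newer gym/gymnasium versions return slightly different tuples:

import gym

env = gym.make('SpaceInvaders-v0')   # may require the Atari extras to be installed
for episode in range(20):
    observation = env.reset()        # start a new episode
    for t in range(100):             # 100 time steps per episode
        env.render()                 # display the environment
        action = env.action_space.sample()   # random action for now
        observation, reward, done, info = env.step(action)
        if done:
            break
env.close()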
Value-based methods
‣ Estimate the value
function
‣ Policy is implicit
(eg 𝜀-greedy)
‣ e.g. Sarsa,
Q-learning

Value-based
methods
Policy-Based methods
‣ No value function
‣ Estimate the policy
‣ For simpler
problems
‣ e.g. Reinforce

Policy-based
methods
Actor Critic Models

[Diagram: actor-critic methods sit at the intersection of value-based methods and policy-based methods]
Markov Decision
Process
Markov Decision Process

at

Agent Environment
Rt Rt+1 st+1

St
Markov property
‣ The next state depends only on the current state and action
‣ The state must include all information about past agent–environment interaction that makes a difference for the future
‣ Action: the decision we want to learn how to make

p(s’,r|s,a) = Pr{St=s’, Rt=r | St-1=s, At-1=a}


Goal - Episodic
‣ Goal is the maximisation of the expected value of the
cumulative sum of a received scalar signal (called reward).

‣ The time limit is well defined

Gt ≐ Rt+1 + Rt+2 + Rt+3 + ··· + RT


Goal - Continuous
‣ Goal is the maximisation of the expected value of the cumulative
sum of a discounted received scalar signal (called reward).

Gt ≐ Rt+1 + 𝛾Rt+2 + 𝛾²Rt+3 + ··· = Σ_{k=t+1}^{∞} 𝛾^(k−t−1) Rk

‣ Where the discount rate 𝛾 satisfies 0 ≤ 𝛾 ≤ 1


Goal - Unified Notation
‣ Goal is the maximisation of the expected value of the cumulative
sum of a discounted (or not) reward till state T

Unified formula for total rewards: Gt ≐ Σ_{k=t+1}^{T} 𝛾^(k−t−1) Rk (allowing either T = ∞ or 𝛾 = 1, but not both)

[Diagram: an episodic task drawn as a continuing one — S0 → S1 → S2 followed by an absorbing terminal state, with rewards R1=+1, R2=+1, R3=+1, R4=0, R5=0, …]

‣ Where 0 ≤ 𝛾 ≤ 1 and T can be finite or infinite
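As a small worked example of these formulas, a sketch that computes Gt for every step of a finite reward sequence (hypothetical rewards), working backwards so that Gt = Rt+1 + 𝛾·Gt+1:

def discounted_returns(rewards, gamma=0.9):
    # rewards[t] holds R_{t+1}; walk backwards: G_t = R_{t+1} + gamma * G_{t+1}
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1, 1, 1, 0, 0]))   # rewards like the episode sketched above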
Questions
break
Monte carlo
Methods
Monte carlo Methods
‣ Class of computational algorithms
‣ They rely on repeated random sampling
‣ To obtain numerical results.
Monte carlo simulation
‣ y = a*x + b*z

[Figure: distributions over the random inputs x and z induce a distribution over y]
Policy iteration
‣ Policy iteration runs a loop between policy evaluation and policy improvement.
Methods
‣ Policy Iteration
‣ Policy iteration runs a loop between policy evaluation and policy improvement.
Model-free v.s. Model-based
‣ The model stands for a simulation of the dynamics of the environment. Model-based algorithms become impractical as the state space and action space grow.
‣ Model-free algorithms, on the other hand, rely on trial and error to update their knowledge. As a result, they do not require space to store all the combinations of states and actions. All the algorithms discussed in the next section fall into this category.
Model-based
‣ Model is given
‣ Monte Carlo tree search (MCTS)
‣ computer Go
‣ AlphaGo was a milestone for Go programs as well as for machine learning, as it uses Monte Carlo tree search with artificial neural networks (a deep learning method) for policy (move selection) and value
‣ The focus of Monte Carlo tree search is on the analysis of the
most promising moves, expanding the search tree based on
random sampling of the search space
Monte Carlo
‣ Optimises the rewards using sampling and averages
‣ Play a large enough number of episodes of the game and extract the information needed.
‣ In Monte Carlo (MC) we play an episode of the game starting from some random state (not necessarily the beginning) until the end, record the states, actions and rewards encountered, then compute V(s) and Q(s) for each state we passed through. We repeat this process by playing more episodes and we average the discovered values of V(s) and Q(s).
‣ In Monte Carlo there is no guarantee that we will visit all the possible states. Another weakness of this method is that we need to wait until the game ends before we can update V(s) and Q(s), which is problematic in games that never end.
Monte Carlo
‣ The main problem with TD learning and DP is that their step updates are biased by the initial conditions of the learning parameters.
‣ The bootstrapping process typically updates a function or lookup table Q(s,a) towards a successor value Q(s',a') using whatever the current estimates are in the latter. Clearly, at the very start of learning these estimates contain no information from any real rewards or state transitions.
‣ If learning works as intended, the bias reduces asymptotically over multiple iterations. However, the bias can cause significant problems, especially for off-policy methods (e.g. Q-learning) and when using function approximators. That combination is so likely to fail to converge that it is called the deadly triad in Sutton & Barto.
Monte Carlo
Monte Carlo

Episode following policy 𝞹

S0 S1 S2 End
Monte Carlo

Episode following policy 𝞹

S0 S1 S2 End

Returns + R Returns + R Returns + R


Monte Carlo

Episode following policy 𝞹

S0 S1 S2 End

Returns + R Returns + R Returns + R

V(s) = Avg(Returns)
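A minimal first-visit Monte Carlo prediction sketch along the lines of these slides: play full episodes with the policy 𝞹, then set V(s) to the average of the returns observed from s. The play_episode interface is an assumption, not part of any particular library.

from collections import defaultdict

def mc_prediction(play_episode, num_episodes=1000, gamma=1.0):
    # play_episode() is assumed to return one full episode as a list of
    # (state, reward) pairs, with the reward received after leaving that state
    returns = defaultdict(list)                # state -> list of observed returns
    for _ in range(num_episodes):
        episode = play_episode()
        G = 0.0
        # walk the episode backwards, accumulating the discounted return
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # first-visit check: record G only if the state does not appear earlier
            if state not in [s for s, _ in episode[:t]]:
                returns[state].append(G)
    # V(s) = Avg(Returns), as in the slide
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}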
Exercise: MAB
Questions
Value BASED METHODS
(Sarsa, q-learning)
Value function (Q-matrix)
Policies
𝞹(a|s) ≐ probability of taking the action At = a when in state St = s

The value function of a state s under a policy 𝞹 is the


expected return when starting in s and following 𝞹 afterwards.

For MDP:

E𝞹 is the expected value under policy 𝞹, starting in state s

This function considers only the current state


Policies
We can define the same for a value function that depends on both the state and the action taken.

The action-value function for policy 𝞹 is the expected return when starting in s, taking action a and following 𝞹 afterwards

𝑞∗(s, a) ≐ max𝞹 𝑞𝞹(s, a) is the optimal action-value function
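For reference, in symbols the two definitions above read (Sutton & Barto notation):

v_\pi(s) \doteq \mathbb{E}_\pi[\, G_t \mid S_t = s \,]
q_\pi(s, a) \doteq \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,]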


Bellman Equation
‣ The value of the start state must be equal to the discounted
value of the expected next state plus the expected reward
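Written out, that statement is the Bellman equation for v_𝞹:

v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \,[\, r + \gamma \, v_\pi(s') \,]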
Dynamic Programming
‣ Algorithms to compute optimal policies

‣ Needs a perfect model of the environment as a MDP


‣ Not RL, but a useful foundation for it

‣ Main idea: Use value functions to organise and structure the


search for good policies
Policy Iteration
‣ Initialisation

‣ Policy evaluation using the value function

‣ Policy Improvement
Value Iteration
‣ Initialisation

‣ Value evaluation

‣ Value Improvement
Generalised policy
iteration
Optimality
‣ We define the optimal action-value function as 𝑞∗(s, a) ≐ max𝞹 𝑞𝞹(s, a)

‣ Convergence is only guaranteed in the limit

‣ In the real world we just need good approximations


Value-based methods
‣ Estimate the value
function
‣ Policy is implicit
(eg 𝜀-greedy)
‣ e.g. Sarsa,
Q-learning

Value-based
methods
Temporal-Difference Learning

‣ Can learn directly from raw experience without a model of the


environment’s dynamics.

‣ Like DP, TD methods update estimates based in part on other


learned estimates, without waiting for a final outcome (they
bootstrap)
Exploration vs Exploitation
‣ An algorithm:
‣ wants to take the best decision
‣ wants to explore to find the best decision
Temporal Difference (TD)
‣ TD-learning estimates the value function directly
Temporal Difference (TD)
‣ TD-learning estimates the value function directly
‣ Doesn't try to learn the underlying MDP
Temporal Difference (TD)
‣ TD-learning estimates the value function directly
‣ Doesn't try to learn the underlying MDP
‣ Keep an estimate of V π(s) in a table
Temporal Difference (TD)
‣ TD-learning estimates the value function directly
‣ Doesn't try to learn the underlying MDP
‣ Keep an estimate of V π(s) in a table
‣ Update these estimates as we gather more experience
Temporal Difference (TD)
‣ TD-learning estimates the value function directly
‣ Doesn't try to learn the underlying MDP
‣ Keep an estimate of V π(s) in a table
‣ Update these estimates as we gather more experience
‣ Estimates depend on the exploration policy (e.g. 𝜀-greedy) and π
Temporal Difference (TD)
‣ TD-learning estimates the value function directly
‣ Doesn't try to learn the underlying MDP
‣ Keep an estimate of V π(s) in a table
‣ Update these estimates as we gather more experience
‣ Estimates depend on the exploration policy (e.g. 𝜀-greedy) and π
‣ Generate a policy from the Value Function (e.g. using 𝜀-greedy)

V π(s) is guaranteed to converge to V *(s) after an infinite number of experiences


Policy update
On-policy (e.g. SARSA)
‣ The agent commits to always exploring and finds the best policy that still explores

Off-policy (e.g. Q-learning)

‣ The agent learns a deterministic optimal policy that might be unrelated to the policy followed
sarsa
‣ Episode: alternating sequence of state/action pairs
‣ SARSA is a TD technique

Rt+1
St St+1
At At+1
Sarsa algorithm
‣ Algorithm parameters: step size 𝛼 ∈(0, 1], small 𝜀 > 0
‣ Initialise Q(s,a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal,·) = 0
‣ Loop for each episode:
‣ Initialise S
‣ Choose A from S using policy derived from Q (e.g., 𝜀-greedy)
‣ Loop for each step of episode:
‣ Take action A, observe R, S’
‣ Choose A’ from S’ using policy derived from Q (e.g., 𝜀-greedy)
Q(S, A) ⟵ Q(S, A) + 𝛼[R +𝛾 Q(S’, A’) - Q(S, A)]
S ⟵S’; A ⟵ A’;
until S is terminal
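A minimal tabular SARSA sketch of the pseudocode above, assuming a gym-style environment with discrete states and actions and the classic step API (observation, reward, done, info):

import numpy as np

def sarsa(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        S = env.reset()
        A = eps_greedy(S)
        done = False
        while not done:
            S2, R, done, _ = env.step(A)
            A2 = eps_greedy(S2)
            # on-policy target: uses the action A' actually chosen for the next step;
            # the (not done) factor enforces Q(terminal, .) = 0
            Q[S, A] += alpha * (R + gamma * Q[S2, A2] * (not done) - Q[S, A])
            S, A = S2, A2
    return Q

For example, Q = sarsa(gym.make('FrozenLake-v0')) would train on the Frozen Lake environment used in the exercise (the exact environment name depends on the installed gym version).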
Value function (Q-matrix)
Policy
𝞹(a|s) ≐ probability of taking the action At = a when in state St = s

• Use policy and expected return to take action


• Estimate the value function
• Policy is implicit (eg 𝜀-greedy)
• e.g. Sarsa, Q-learning
Policy
𝞹(a|s) ≐ probability of taking the action At = a when in state St = s
Value function (Q-matrix)

[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy

[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 Q(S’, A’) - Q(1, 3)]
S ⟵S’; A ⟵ A’;
[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 Q(3, 4) - Q(1, 3)]
1 ⟵3; 3 ⟵ 4;
[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
Poll: How would you
use the SARSA method?
Exercise: Frozen Lake -
Actions

S
Exercise: Frozen Lake -
Environment
Exercise: sarsa
Questions
Break
Q-learning
Q-learning
‣ Estimating the Q-matrix
‣ Off-policy: does not necessarily use the policy being learned
Q-learning
‣ Algorithm parameters: step size 𝛼 ∈(0, 1], small 𝜀 > 0
‣ Initialise Q(s,a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal,·) =
0
‣ Loop for each episode:
‣ Initialise S
‣ Loop for each step of episode:
‣ Choose A from S using policy derived from Q (e.g., 𝜀-greedy)
‣ Take action A, observe R, S’
Q(S, A) ⟵ Q(S, A) + 𝛼[R +𝛾 max Q(S’, a) - Q(S,A)]
max Q(S’, a) is the estimated optimal future value
‣ S ⟵ S’
until S is terminal
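The same sketch with the off-policy Q-learning update, where the target uses the max over actions in S’ rather than the action actually taken next (same assumed environment interface as the SARSA sketch):

import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    rng = np.random.default_rng(0)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        S = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy on the current Q estimates
            if rng.random() < epsilon:
                A = env.action_space.sample()
            else:
                A = int(np.argmax(Q[S]))
            S2, R, done, _ = env.step(A)
            # off-policy target: max over actions in the next state
            Q[S, A] += alpha * (R + gamma * np.max(Q[S2]) * (not done) - Q[S, A])
            S = S2
    return Q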
Value function (Q-matrix)
Policy
𝞹(a|s) ≐ probability of taking the action At = a when in state St = s

• Use policy and expected return to take action


• Estimate the value function
• Policy is implicit (eg 𝜀-greedy)
• e.g. Sarsa, Q-learning
Policy
𝞹(a|s) ≐ probability of taking the action At = a when in state St = s
Value function (Q-matrix)

[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy

[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 max Q(S’, a) - Q(1, 3)]
S ⟵S’; A ⟵ A’;
[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 max Q(S’, a) - Q(1, 3)]
S ⟵S’; A ⟵ A’;
[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 Q(3, 4) - Q(1, 3)]
1 ⟵3; 3 ⟵ 4;
[Diagram: example state graph over S0–S5, with transition values 0, 0, 0, 0, 320 and 500]
Poll: Where do you think
Q-learning can be used?
Exercise:
Frozen lake
with q-learning
questions
Gradient methods
Policy-Based methods
‣ No value function
‣ Estimate the policy
‣ For simpler
problems
‣ e.g. Reinforce

Policy-based
methods
Policy gradients

𝝅𝜽(a|s)

action
feedback
environment

𝝅𝜃(a|s) = probability of action a in state s


Advantages
• Stochastic policies
• Continuous actions
• Directly improving the policy
• Can be used in conjunction with DNNs
Policy bASED METHODS

Policy objective:

max𝜃 E[ Σt R(st) | 𝝅𝜃 ]

If we change an action we have a big impact


Changing the action distribution will have a smaller impact
Policy improvement

Update the action distribution for all possible actions

Correcting the update by the probability of taking that action


Reinforce
Push harder for actions that are more promising
Reinforce
REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

The policy part can be rewritten equivalently (thanks to the chain rule and the derivative of the logarithm)

By subtracting V(s) from Q(s, a), we get the advantage function A(s, a).

This function tells us how much better or worse taking action a in state s is compared to acting according to the policy
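In symbols, with the notation used earlier:

A_\pi(s, a) = q_\pi(s, a) - v_\pi(s)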
Reinforce
1.Trajectory roll-out using the current policy
2.Store log probabilities of both policy and reward values at each step
3.Calculate discounted cumulative future reward at each step
4.Compute policy gradient and update policy parameter

5.Repeat
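A minimal REINFORCE sketch following the five steps above, assuming a gym-style environment with a discrete (integer) state, discrete actions and a tabular softmax policy; a neural-network policy would replace the preference table:

import numpy as np

def reinforce(env, episodes=1000, alpha=0.01, gamma=0.99):
    rng = np.random.default_rng(0)
    n_s, n_a = env.observation_space.n, env.action_space.n
    theta = np.zeros((n_s, n_a))                 # policy parameters (action preferences)

    def policy(s):
        prefs = theta[s] - theta[s].max()        # softmax over action preferences
        p = np.exp(prefs)
        return p / p.sum()

    for _ in range(episodes):
        # 1. trajectory roll-out using the current policy
        states, actions, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = rng.choice(n_a, p=policy(s))
            s2, r, done, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s = s2
        # 2-3. discounted cumulative future reward at each step
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # 4. policy gradient step: grad log pi(a|s) = one_hot(a) - pi(.|s)
        for s, a, G in zip(states, actions, returns):
            grad_log = -policy(s)
            grad_log[a] += 1.0
            theta[s] += alpha * G * grad_log     # push harder for promising actions
        # 5. repeat with the next episode
    return theta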
Reinforce
Policy-Based methods
Gradient methods
‣ Perform policy gradient directly on the performance surface underlying the
chosen parametric policy class

‣ Solve simpler problems, faster

‣ Innate exploration thanks to its stochastic nature

‣ Can be used together with supervised learning


Deep RL Algorithm
• Some approaches do not use gradients: hill climbing, simplex, genetic algorithms

• Greater efficiency is often possible using gradients


Deep learning

Perceptron
Deep learning

Input layer First Layer Second layer Output layer


Deep learning

Input layer First Layer Second layer Output layer


Deep learning

Input layer First Layer Second layer Output layer


Deep learning

Input layer First Layer Second layer Output layer


Deep learning

0.77
Leonardo Da Vinci
0.11
Leonardo De Marchi
0.12
Leonardo DI Caprio

Input layer First Layer Second layer Output layer


Deep learning

0.77
Leonardo Da Vinci
0.11
Leonardo De Marchi
0.12
Leonardo DI Caprio

Feature Engineering
Policy gradients

𝝅𝜽(a|s)

action
feedback
environment

𝝅𝜃(a|s) = probability of action a in state s


Deep learning

0.77
Leonardo Da Vinci
0.11
Leonardo De Marchi
0.12
Leonardo DI Caprio

Input layer First Layer Second layer Output layer


Dataset

[Diagram: the State feeds the network; a Softmax output gives Prob Action 1, Prob Action 2 and Prob Action 3, and the sampled action yields a Reward]
Dataset

Each stored row: State | Action taken | Reward
Deep Reinforcement learning

[Diagram: the network outputs action probabilities 0.77 (Action 1), 0.11 (Action 2), 0.12 (Action 3); the probability of the action taken is multiplied by the Reward. Layers: Input layer, First Layer, Second layer, Output layer]


Policy gradients
Actions

𝝅𝜽(a|s)

action
feedback
environment
Poll: Where do you think
gradient methods can be
used?
EXERCISE: GRADIENT METHODS
Summary

Theory
‣ Intro
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-Learning
‣ Gradient methods

PRACTICE
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-learning
‣ Gradient Methods
Value-based methods
‣ Estimate the value
function
‣ Policy is implicit
(eg 𝜀-greedy)
‣ e.g. Sarsa,
Q-learning

Value-based
methods
Policy-Based methods
‣ No value function
‣ Estimate the policy
‣ For simpler
problems
‣ e.g. Reinforce

Policy-based
methods
Deep RL Algorithm
• Get the state from the environment.
• Feed the state forward through our policy network to predict the probability of each action.
• Sample from this distribution to choose which action to take.
• Receive the reward and the next state.
• Store this transition (state, action, reward) for later training.
• Repeat the previous steps until the episode ends.
• Once the episode is over, train our neural network on the stored transitions using a reward-guided loss function (see the sketch below).
• Play the next episode and repeat the steps above.
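A compact sketch of that loop with a small policy network, assuming PyTorch, the classic gym step API, and an environment with a vector observation and discrete actions (CartPole-v1 is used only as an example). The "reward guided loss" here is the usual negative log-probability of each action weighted by its discounted return:

import gym
import torch
import torch.nn as nn

env = gym.make('CartPole-v1')                      # example environment (assumption)
policy = nn.Sequential(
    nn.Linear(env.observation_space.shape[0], 64),
    nn.ReLU(),
    nn.Linear(64, env.action_space.n),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        # feed forward the policy network and sample an action from its distribution
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # discounted cumulative future reward at each step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)
    # reward-guided loss: raise the log-probability of actions that led to high returns
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()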
Actor Critic Models

[Diagram: actor-critic methods sit at the intersection of value-based methods and policy-based methods]
Actor Critic Models
In TD models:
‣ TD only evaluates a particular policy
‣ Does not learn a better policy
‣ We can change the policy as we learn

In AC models:
‣ Policy is the actor
‣ Value-function estimate is the critic

Success is generally dependent on the starting policy being “good enough”


Actor Critic Models

Critic

Values

Actor

state action
reward
environment
Actor Critic Models
• Actor: takes in the current environment state and determines the best action
to take from there

• Critic plays the “evaluation” role from the DQN by taking in the environment
state and an action and returning a score that represents how apt the action is
for the state.

• This allows actor-critic methods to be more sample efficient via TD updates at every step.
Actor Critic Models
• Implement generalised policy iteration, alternating between a policy evaluation step and a policy improvement step

• Actor improvement: aims at improving the current policy

• Critic evaluation: evaluates the current policy


If the critic is modelled by a bootstrapping method, it reduces the variance, so learning is more stable than with pure policy gradient methods
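A minimal one-step (TD) actor-critic sketch in the tabular setting, assuming discrete states and actions and the same gym-style interface as before: the critic keeps V(s), the actor keeps softmax action preferences, and both are updated from the same TD error.

import numpy as np

def actor_critic(env, episodes=1000, alpha_v=0.1, alpha_pi=0.1, gamma=0.99):
    rng = np.random.default_rng(0)
    n_s, n_a = env.observation_space.n, env.action_space.n
    V = np.zeros(n_s)                      # critic: state-value estimates
    prefs = np.zeros((n_s, n_a))           # actor: softmax action preferences

    def policy(s):
        p = np.exp(prefs[s] - prefs[s].max())
        return p / p.sum()

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            p = policy(s)
            a = rng.choice(n_a, p=p)
            s2, r, done, _ = env.step(a)
            # the critic's TD error evaluates the action the actor just took
            delta = r + gamma * V[s2] * (not done) - V[s]
            V[s] += alpha_v * delta                    # critic: policy evaluation
            grad_log = -p
            grad_log[a] += 1.0
            prefs[s] += alpha_pi * delta * grad_log    # actor: policy improvement
            s = s2
    return V, prefs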
Questions
Thank you!
You can contact me at
www.ideai.io [email protected]
