Ideai Reinforcement Learning
‣ Leonardo De Marchi
www.ideai.io
Reinforcement learning
Theory
‣ Intro
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-learning
‣ Gradient methods
Practice
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-learning
‣ Gradient methods
Poll: What do you hope to get out of today's course?
Machine learning
[Diagram: supervised machine learning maps INPUT to OUTPUT during training; in reinforcement learning an agent takes actions in an environment and receives feedback]
RL applications
‣ AlphaGo
‣ AlphaZero
‣ Robotics
Why it matters
‣ Text summarisation engines
‣ Dialog agents (text, speech)
‣ Learning optimal treatment policies in healthcare
‣ Online stocking
‣ Scheduling
‣ …
Why it matters
‣ The agent learns how to make decisions to achieve a goal by itself!
games
A2C
GQN
Interesting Applications
https://www.youtube.com/watch?v=oo0TraGu6QY
https://www.youtube.com/watch?time_continue=72&v=TmPfTpjtdgg
https://www.youtube.com/watch?v=UZHTNBMAfAA
https://www.youtube.com/watch?time_continue=118&v=eHipy_j29Xw
Questions
MULTI-ARMED BANDIT
Poll: What do you know about Bandit methods?
Maximising reward
Markov property
‣ The next state depends only on the current state and action
‣ The state must include all information about past agent–environment interaction that makes a difference in the future
‣ Action: decision we want to learn how to make
[Slide animation: two options with unknown "Current Success Rate"; after repeated trials the estimated success rates settle at 77% for one and 30% for the other]
Exploration vs Exploitation
With a MAB we want to:
‣ Maximise our total reward
‣ Estimate the payoff for each option
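A minimal 𝜀-greedy bandit sketch; the two arm probabilities echo the 77% / 30% example above, while 𝜀 = 0.1 and the 2000-step horizon are illustrative assumptions:

    import random

    true_probs = [0.77, 0.30]          # unknown to the agent (taken from the slide example)
    counts = [0, 0]                    # pulls per arm
    estimates = [0.0, 0.0]             # estimated success rate per arm
    epsilon = 0.1                      # exploration rate

    for step in range(2000):
        # Explore with probability epsilon, otherwise exploit the best current estimate
        if random.random() < epsilon:
            arm = random.randrange(len(true_probs))
        else:
            arm = max(range(len(true_probs)), key=lambda a: estimates[a])

        reward = 1.0 if random.random() < true_probs[arm] else 0.0

        # Incremental average: new estimate = old + (reward - old) / n
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    print(estimates)   # should approach [0.77, 0.30]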
Example - Newspaper Headlines
‣ "Disneyland increases prices"
‣ "Disneyland increases prices by 67% in 10 years"
‣ "You will never believe what Disneyland did"
[Diagram: the agent takes actions in the environment and receives feedback]
Environment
‣ Anything that cannot be changed arbitrarily by the agent is considered part of the environment
Environment
‣ More complex than MAB
‣ Multiple states
‣ Complex reward function
Feedback
‣ Returned by the environment
(e.g. rewards of +10 or +1)
Goal
‣ Maximise the total reward
Total Reward
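For reference, the standard definition of the (discounted) return, following Sutton & Barto; the discount factor 𝛾 is an assumption of the discounted setting:

    G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1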
OpenAI’s gym basics
‣ Import: import gym
‣ Load environment: env = gym.make(‘SpaceInvaders-v0’)
‣ Start episode: env.reset()
‣ Display the environment: env.render()
‣ Evaluate an action: env.step(action)
‣ It returns an observation, a reward, a flag indicating whether the episode is finished, and some info on the environment
‣ observation, reward, done, info = env.step(action)
‣ You can start with 20 episodes with 100 time steps each
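A minimal sketch of this loop, assuming the classic gym API (pre-0.26, where step returns four values) and swapping in CartPole-v1 so it runs without the Atari dependencies; the 20 episodes of 100 time steps follow the suggestion above:

    import gym

    env = gym.make('CartPole-v1')          # any registered environment works here

    for episode in range(20):              # 20 episodes
        observation = env.reset()          # start a new episode
        total_reward = 0.0
        for t in range(100):               # up to 100 time steps each
            # env.render()                 # uncomment to display the environment
            action = env.action_space.sample()               # random action for now
            observation, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                break
        print('episode', episode, 'reward', total_reward)

    env.close()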
Value-based methods
‣ Estimate the value function
‣ Policy is implicit (e.g. 𝜀-greedy)
‣ e.g. Sarsa, Q-learning
Policy-Based methods
‣ No value function
‣ Estimate the policy directly
‣ For simpler problems
‣ e.g. REINFORCE
Actor Critic Models
[Diagram: actor-critic methods combine value-based methods and policy-based methods]
Markov Decision Process
[Diagram: at time t the agent, in state S_t, takes action a_t; the environment returns reward R_{t+1} and next state S_{t+1}]
Markov property
‣ The next state depends only on the current state and action
‣ The state must include all information about past agent–environment interaction that makes a difference in the future
‣ Action: decision we want to learn how to make
Policy iteration
‣ Policy iteration runs a loop between policy evaluation and policy improvement.
Model-free vs. model-based
‣ The model is a simulation of the dynamics of the environment. Model-based algorithms become impractical as the state space and action space grow.
‣ Model-free algorithms, on the other hand, rely on trial and error to update their knowledge. As a result, they do not need to store all the combinations of states and actions. All the algorithms discussed in the next section fall into this category.
Model-based
‣ The model is given
‣ Monte Carlo tree search (MCTS)
‣ Used in computer Go: AlphaGo was a milestone in machine learning, combining Monte Carlo tree search with artificial neural networks (a deep learning method) for policy (move selection) and value estimation
‣ The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space
Monte Carlo
‣ Optimises the rewards using sampling and averages
‣ Play enough episodes of the game and extract the information needed.
‣ In Monte Carlo (MC) we play an episode of the game starting from some state (not necessarily the beginning) until the end, record the states, actions and rewards we encountered, then compute V(s) and Q(s, a) for each state we passed through. We repeat this by playing more episodes and average the discovered values of V(s) and Q(s, a).
‣ In Monte Carlo there is no guarantee that we will visit all the possible states. Another weakness of this method is that we need to wait until the game ends before we can update V(s) and Q(s, a), which is problematic in games that never end.
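A first-visit Monte Carlo evaluation sketch. The toy episode generator, 𝛾 = 0.9 and the episode count are illustrative assumptions; in practice the episodes would come from an environment rollout:

    import random
    from collections import defaultdict

    gamma = 0.9
    returns = defaultdict(list)        # all sampled returns observed for each state
    V = defaultdict(float)             # value estimate: average of the returns

    def play_episode():
        """Toy episode generator: replace with a real environment rollout."""
        states = [0, 1, 2]             # S0 -> S1 -> S2 -> End
        rewards = [0.0, 0.0, random.choice([0.0, 1.0])]
        return list(zip(states, rewards))

    for _ in range(5000):
        episode = play_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: only record G the first time the state appears
            if state not in [s for s, _ in episode[:t]]:
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])

    print(dict(V))   # V(s) = Avg(Returns), as on the next slide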
Monte Carlo
‣ The main problem with TD learning and DP is that their step updates are biased by the initial conditions of the learning parameters.
‣ The bootstrapping process typically updates a function or lookup value Q(s,a) towards a successor value Q(s',a') using whatever the current estimates are for the latter. At the very start of learning these estimates contain no information from any real rewards or state transitions.
‣ If learning works as intended, the bias reduces asymptotically over multiple iterations. However, the bias can cause significant problems, especially for off-policy methods (e.g. Q-learning) and when using function approximators. That combination is so likely to fail to converge that it is called the deadly triad in Sutton & Barto.
Monte Carlo
[Diagram: sampled trajectories S0 → S1 → S2 → End]
V(s) = Avg(Returns)
Exercise: MAB
Questions
VALUE-BASED METHODS
(Sarsa, Q-learning)
Value function (Q-matrix)
Policies
𝞹(a|s) ≐ probability of taking action a (A_t = a) when in state s (S_t = s)
For MDP:
‣ Policy Improvement
Value Iteration
‣ Initialisation
‣ Value evaluation
‣ Value Improvement
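A compact value-iteration sketch on a toy MDP; the transition table, rewards and 𝛾 = 0.9 are illustrative assumptions:

    # Toy deterministic MDP: state -> {action: (next_state, reward)}
    mdp = {
        0: {0: (1, 0.0), 1: (2, 0.0)},
        1: {0: (2, 1.0)},
        2: {},                          # terminal state
    }
    gamma = 0.9

    # Initialisation
    V = {s: 0.0 for s in mdp}

    # Value evaluation / improvement loop
    for _ in range(100):
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:
                continue
            best = max(r + gamma * V[s2] for a, (s2, r) in actions.items())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < 1e-6:                # stop once the values have converged
            break

    print(V)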
Generalised policy iteration
Optimality
‣ We define 𝑞∗(s, a) as the optimal action-value function: 𝑞∗(s, a) ≐ max_𝞹 𝑞_𝞹(s, a)
‣ The iteration converges to it only in the limit
Value-based methods
Temporal-Difference Learning
[Diagram: transition from S_t, A_t to R_{t+1}, S_{t+1}, A_{t+1}]
Sarsa algorithm
‣ Algorithm parameters: step size 𝛼 ∈(0, 1], small 𝜀 > 0
‣ Initialise Q(s,a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal,·) = 0
‣ Loop for each episode:
‣ Initialise S
‣ Choose A from S using policy derived from Q (e.g., 𝜀-greedy)
‣ Loop for each step of episode:
Take action A, observe R, S’
Choose A’ from S’ using policy derived from Q (e.g., 𝜀-greedy)
Q(S, A) ⟵ Q(S, A) + 𝛼[R + 𝛾 Q(S’, A’) - Q(S, A)]
S ⟵ S’; A ⟵ A’
until S is terminal
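A tabular SARSA sketch on FrozenLake (the environment used in the exercise later; the id may be 'FrozenLake-v1' in newer gym releases). The hyperparameters 𝛼, 𝛾, 𝜀 and the episode count are illustrative assumptions, and the classic gym API is assumed:

    import random
    import gym
    import numpy as np

    env = gym.make('FrozenLake-v0')
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for episode in range(10000):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(s2)
            # On-policy TD update: bootstraps on the action actually chosen next (A')
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] * (not done) - Q[s, a])
            s, a = s2, a2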
Value function (Q-matrix)
Policy
𝞹(a|s) ≐ probability of taking action a (A_t = a) when in state s (S_t = s)
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 Q(S’, A’) - Q(1, 3)]
S ⟵S’; A ⟵ A’;
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 Q(3, 4) - Q(1, 3)]
1 ⟵3; 3 ⟵ 4;
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
Poll: How would you use the SARSA method?
Exercise: Frozen Lake - Actions
Exercise: Frozen Lake - Environment
Exercise: SARSA
Questions
Break
Q-learning
‣ Estimates the Q-matrix
‣ Off-policy: it does not necessarily follow the policy it is learning
Q-learning
‣ Algorithm parameters: step size 𝛼 ∈(0, 1], small 𝜀 > 0
‣ Initialise Q(s,a), for all s ∈ S+, a ∈ A(s), arbitrarily except that Q(terminal,·) = 0
‣ Loop for each episode:
‣ Initialise S
‣ Loop for each step of episode:
‣ Choose A from S using policy derived from Q (e.g., 𝜀-greedy)
Take action A, observe R, S’
Q(S, A) ⟵ Q(S, A) + 𝛼[R + 𝛾 max_a Q(S’, a) - Q(S, A)]
max_a Q(S’, a) is the estimated optimal future value
S ⟵ S’
until S is terminal
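The same FrozenLake setup as before, now with the off-policy update; this is a sketch, and the hyperparameters are illustrative assumptions:

    import random
    import gym
    import numpy as np

    env = gym.make('FrozenLake-v0')     # 'FrozenLake-v1' in newer gym releases
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    for episode in range(10000):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q estimates
            if random.random() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # Off-policy TD update: bootstraps on max_a Q(S', a), not on the action taken next
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
            s = s2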
Value function (Q-matrix)
Policy
𝞹(a|s) ≐ probability of taking action a (A_t = a) when in state s (S_t = s)
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 max Q(S’, a) - Q(1, 3)]
S ⟵S’; A ⟵ A’;
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 max Q(S’, a) - Q(1, 3)]
S ⟵S’; A ⟵ A’;
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
POLICY – Value function + 𝜀-greedy
Q(S, A) ⟵ Q(1, 3) + 𝛼[R +𝛾 Q(3, 4) - Q(1, 3)]
1 ⟵3; 3 ⟵ 4;
[Diagram: example graph with states S0–S5 and the values 0, 320 and 500]
Poll: Where do you think Q-learning can be used?
Exercise: Frozen Lake with Q-learning
Questions
Gradient methods
Policy-Based methods
‣ No value function
‣ Estimate the policy directly
‣ For simpler problems
‣ e.g. REINFORCE
Policy gradients
𝝅𝜽(a|s)
[Diagram: the policy chooses actions in the environment and receives feedback]
Policy
The objective is to find the policy parameters 𝜃 that maximise the expected sum of rewards E[∑_t R(s_t) | 𝝅_𝜃].
Thanks to the chain rule and the derivative of the logarithm, the gradient of this objective can be written in terms of the gradient of log 𝝅_𝜃(a|s).
By subtracting V(s) from Q(s, a), we get the advantage function A(s, a). This function tells us how much better or worse taking action a in state s is compared to acting according to the policy.
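Written out, these are the standard policy-gradient identities (not formulas copied from the slide):

    J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} R(s_t, a_t)\right], \qquad
    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]

    A(s, a) = Q(s, a) - V(s), \qquad
    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right]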
Reinforce
1. Roll out a trajectory using the current policy
2. Store the log probabilities of the chosen actions and the reward values at each step
3. Calculate the discounted cumulative future reward at each step
4. Compute the policy gradient and update the policy parameters
5. Repeat
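A minimal REINFORCE sketch with a tabular softmax policy on FrozenLake; the environment choice, learning rate and 𝛾 are illustrative assumptions (the deep-learning version later uses a neural-network policy instead):

    import numpy as np
    import gym

    env = gym.make('FrozenLake-v0')
    n_s, n_a = env.observation_space.n, env.action_space.n
    theta = np.zeros((n_s, n_a))        # tabular softmax policy parameters
    lr, gamma = 0.1, 0.99

    def policy(s):
        prefs = theta[s] - np.max(theta[s])
        p = np.exp(prefs)
        return p / p.sum()

    for episode in range(5000):
        # 1. Trajectory roll-out using the current policy
        s, done, trajectory = env.reset(), False, []
        while not done:
            a = np.random.choice(n_a, p=policy(s))
            s2, r, done, _ = env.step(a)
            trajectory.append((s, a, r))
            s = s2

        # 2-3. Discounted cumulative future reward at each step
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            # 4. Policy gradient step: grad of log softmax = onehot(a) - pi(.|s)
            grad_log = -policy(s)
            grad_log[a] += 1.0
            theta[s] += lr * G * grad_log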
Reinforce
Policy-Based methods
Gradient methods
‣ Perform gradient ascent directly on the performance surface underlying the chosen parametric policy class
Perceptron
Deep learning
[Example: a classifier outputs probabilities over classes, e.g. Leonardo Da Vinci 0.77, Leonardo De Marchi 0.11, Leonardo Di Caprio 0.12]
Feature Engineering
Policy gradients
𝝅𝜽(a|s)
[Diagram: the policy network maps the State through a softmax to a probability for each action (Prob Action 1, Prob Action 2, Prob Action 3); the Output and the Reward are stored in a Dataset of (State, Action taken, Reward) entries]
Deep Reinforcement learning
[Diagram: the policy 𝝅𝜽(a|s) outputs probabilities for Action 1, Action 2 and Action 3 (e.g. 0.77, 0.11, 0.12), which are weighted by the Reward; the agent loop with the environment provides the feedback]
Poll: Where do you think gradient methods can be used?
EXERCISE: GRADIENT METHODS
Summary
Theory
‣ Intro
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-learning
‣ Gradient methods
Practice
‣ Bandit methods
‣ Monte Carlo
‣ SARSA
‣ Q-learning
‣ Gradient methods
Value-based methods
‣ Estimate the value function
‣ Policy is implicit (e.g. 𝜀-greedy)
‣ e.g. Sarsa, Q-learning
Policy-Based methods
‣ No value function
‣ Estimate the policy directly
‣ For simpler problems
‣ e.g. REINFORCE
Deep RL Algorithm
• Get the state from the environment.
• Feed the state forward through our policy network to predict the probability of each action.
• Sample from this distribution to choose which action to take.
• Receive the reward and the next state.
• Store this transition (state, action, reward) for later training.
• Repeat the previous steps until the episode ends.
• Once the episode is over, train the neural network on the stored transitions using a reward-guided loss function.
• Play the next episode and repeat the steps above.
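A sketch of that loop with a small PyTorch policy network on CartPole; the framework, network size and hyperparameters are assumptions rather than anything prescribed by the slides:

    import gym
    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    env = gym.make('CartPole-v1')
    policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
    gamma = 0.99

    for episode in range(500):
        state, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            probs = torch.softmax(policy(torch.as_tensor(state, dtype=torch.float32)), dim=-1)
            dist = Categorical(probs)
            action = dist.sample()                      # sample an action from the predicted distribution
            state, reward, done, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))     # store the log probability for training
            rewards.append(reward)

        # Discounted cumulative future reward at each step
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns)

        # Reward-guided loss: maximise sum of log_prob * return
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()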
Actor Critic Models
[Diagram: actor-critic methods combine value-based methods and policy-based methods]
Actor Critic Models
In TD models:
‣ TD only evaluates a particular policy
‣ It does not learn a better policy
‣ However, we can change the policy as we learn
In AC models:
‣ The policy is the actor
‣ The value-function estimate is the critic
[Diagram: the actor takes the state and chooses an action; the environment returns a reward; the critic’s value estimates guide the actor’s updates]
Actor Critic Models
• The actor takes in the current environment state and determines the best action to take from there.
• The critic plays the “evaluation” role (as in DQN): it takes in the environment state and an action and returns a score that represents how apt the action is for that state.
• This allows actor-critic methods to be more sample efficient via TD updates at every step.
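A one-step actor-critic sketch with a tabular softmax actor and a tabular critic (FrozenLake again; the hyperparameters are illustrative assumptions, and the slides do not prescribe a particular implementation):

    import numpy as np
    import gym

    env = gym.make('FrozenLake-v0')
    n_s, n_a = env.observation_space.n, env.action_space.n
    theta = np.zeros((n_s, n_a))   # actor: softmax policy parameters
    V = np.zeros(n_s)              # critic: state-value estimates
    lr_actor, lr_critic, gamma = 0.1, 0.1, 0.99

    def policy(s):
        p = np.exp(theta[s] - np.max(theta[s]))
        return p / p.sum()

    for episode in range(5000):
        s, done = env.reset(), False
        while not done:
            p = policy(s)
            a = np.random.choice(n_a, p=p)
            s2, r, done, _ = env.step(a)

            # Critic: the TD error evaluates the action just taken
            td_error = r + gamma * V[s2] * (not done) - V[s]
            V[s] += lr_critic * td_error

            # Actor: move the policy towards actions with positive TD error
            grad_log = -p
            grad_log[a] += 1.0
            theta[s] += lr_actor * td_error * grad_log

            s = s2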
Actor Critic Models
• Implement generalised policy iteration: alternate between a policy evaluation step and a policy improvement step.