Tut21 RL
Reinforcement Learning
Markov Decision Processes
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Planning Agent
• Environment: Static vs. Dynamic; Fully vs. Partially Observable; Deterministic vs. Stochastic
• Percepts: Perfect vs. Noisy; Actions: Instantaneous vs. Durative
• The agent repeatedly decides: what action next?
2
Search Algorithms
• Environment: Static, Fully Observable, Deterministic
• Percepts: Perfect; Actions: Instantaneous
• The agent decides: what action next?
3
Stochastic Planning: MDPs
• Environment: Static, Fully Observable, Stochastic
• Percepts: Perfect; Actions: Instantaneous
• The agent decides: what action next?
4
Markov Decision Process (MDP)
• S: A set of states
• A: A set of actions
• T(s,a,s’): transition model
• C(s,a,s’): cost model
• G: set of goals
• s0: start state
• γ: discount factor
• R(s,a,s’): reward model
5
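A minimal sketch of how this tuple could be held in code; the dictionary encoding, field names, and the tiny two-state example are illustrative assumptions, not from the lecture.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Container for the tuple <S, A, T, C, G, s0> (plus gamma) from the slide."""
    S: list        # states
    A: dict        # A[s] = list of actions applicable in s
    T: dict        # T[(s, a)] = {s_next: probability}
    C: dict        # C[(s, a, s_next)] = cost
    G: set         # goal states
    s0: str        # start state
    gamma: float = 1.0   # discount factor (1.0 = undiscounted)

# Tiny illustrative example: s0 --a--> sg with cost 1
toy = MDP(
    S=["s0", "sg"],
    A={"s0": ["a"], "sg": []},
    T={("s0", "a"): {"sg": 1.0}},
    C={("s0", "a", "sg"): 1.0},
    G={"sg"},
    s0="s0",
)
```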
Objective of an MDP
• Find a policy π: S → A
• which optimizes one of:
• minimizes discounted expected cost to reach a goal
• maximizes discounted expected reward
• maximizes undiscounted expected (reward − cost)
6
Role of Discount Factor (γ)
• Keeps the total reward / total cost finite
• useful for infinite horizon problems
• Intuition (economics):
• Money today is worth more than money tomorrow.
7
Acyclic vs. Cyclic MDPs
[Figure: two example MDPs with action costs C(a) = 5, C(b) = 10, C(c) = 1; the left one is acyclic, the right one has a cycle at P under action a (Pr 0.6 back to P, Pr 0.4 onward)]
Acyclic MDP: expectimin works
• V(Q) = V(R) = V(S) = V(T) = 1
• V(P) = 6, via action a
Cyclic MDP: expectimin doesn't work
• infinite loop
• V(R) = V(S) = V(T) = 1
• Q(P,b) = 11
• Q(P,a) = ????
• suppose I decide to take a in P
• Q(P,a) = 5 + 0.4 × 1 + 0.6 × Q(P,a)
• ⇒ Q(P,a) = 13.5
Brute force Algorithm
• Go over all policies π
• How many? |A|^|S| (finite)
9
Policy Evaluation
• Given a policy π: compute V^π
• V^π: expected cost of reaching a goal while following π
10
Deterministic MDPs
• Policy Graph for π: π(s0) = a0, π(s1) = a1
[Figure: s0 --a0 (C=5)--> s1 --a1 (C=1)--> sg]
• V^π(s1) = 1
• V^π(s0) = 6
• add up the costs on the path to the goal
11
Acyclic MDPs
• Policy Graph for π
[Figure: from s0, action a0 reaches s1 with Pr 0.6 (C=5) and s2 with Pr 0.4 (C=2); from s1, a1 reaches sg (C=1); from s2, a2 reaches sg (C=4)]
• backward pass in reverse topological order
• V^π(s1) = 1
• V^π(s2) = 4
• V^π(s0) = 0.6(5+1) + 0.4(2+4) = 6
12
General MDPs can be cyclic!
[Figure: from s0, action a0 reaches s1 with Pr 0.6 (C=5) and s2 with Pr 0.4 (C=2); from s1, a1 reaches sg (C=1); from s2, a2 reaches sg with Pr 0.7 (C=4) and loops back to s0 with Pr 0.3 (C=3)]
• cannot do a simple single pass
• V^π(s1) = 1
• V^π(s2) = ?? (depends on V^π(s0))
• V^π(s0) = ?? (depends on V^π(s2))
13
General SSPs can be cyclic!
[Figure: same cyclic policy graph as above]
• a simple system of linear equations:
• V^π(sg) = 0
• V^π(s1) = 1 + V^π(sg) = 1
• V^π(s2) = 0.7(4 + V^π(sg)) + 0.3(3 + V^π(s0))
• V^π(s0) = 0.6(5 + V^π(s1)) + 0.4(2 + V^π(s2))
14
Policy Evaluation (Approach 1)
• Solving the System of Linear Equations
V^π(s) = 0                                                     if s ∈ G
V^π(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + V^π(s')]    otherwise
• |S| variables
• O(|S|³) running time
15
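A possible NumPy implementation of Approach 1, assuming the dictionary-style MDP container sketched earlier (undiscounted costs, goal values fixed at 0):

```python
import numpy as np

def policy_evaluation_linear(mdp, pi):
    """Solve V(s) - sum_s' T(s,pi(s),s') V(s') = sum_s' T(s,pi(s),s') C(s,pi(s),s') for non-goal s."""
    states = [s for s in mdp.S if s not in mdp.G]   # goal states are pinned to V = 0
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.zeros(len(states))
    for s in states:
        a = pi[s]
        for s_next, p in mdp.T[(s, a)].items():
            b[idx[s]] += p * mdp.C[(s, a, s_next)]
            if s_next not in mdp.G:
                A[idx[s], idx[s_next]] -= p
    x = np.linalg.solve(A, b)          # the O(|S|^3) step
    V = {s: 0.0 for s in mdp.G}
    V.update({s: float(x[idx[s]]) for s in states})
    return V
```

On the cyclic policy graph above this yields V^π(s0) ≈ 6.68 and V^π(s2) ≈ 5.70, the limits of the iterates shown on the next slide.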
Iterative Policy Evaluation
[Figure: iterative policy evaluation on the cyclic policy graph above; with V^π(s1) = 1, the coupled updates V^π(s0) = 4.4 + 0.4 V^π(s2) and V^π(s2) = 3.7 + 0.3 V^π(s0) produce the iterates
s0: 0, 5.88, 6.5856, 6.670272, 6.68043, ...
s2: 3.7, 5.464, 5.67568, 5.7010816, 5.704129, ...]
16
Policy Evaluation (Approach 2)
V^π(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + V^π(s')]
iterative refinement:
V_n^π(s) ← Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + V_{n-1}^π(s')]      (1)
17
Iterative Policy Evaluation
[Algorithm: repeat update (1) for iterations n = 1, 2, ...; terminate when the ε-consistency condition holds, i.e., max_s |V_n^π(s) − V_{n-1}^π(s)| ≤ ε]
18
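A matching sketch of Approach 2 under the same assumed MDP container, with ε-consistency as the termination condition:

```python
def iterative_policy_evaluation(mdp, pi, eps=1e-4):
    """Repeat update (1) until max_s |V_n(s) - V_{n-1}(s)| <= eps."""
    V = {s: 0.0 for s in mdp.S}
    while True:
        V_new, residual = dict(V), 0.0
        for s in mdp.S:
            if s in mdp.G:
                continue
            a = pi[s]
            V_new[s] = sum(p * (mdp.C[(s, a, s_next)] + V[s_next])
                           for s_next, p in mdp.T[(s, a)].items())
            residual = max(residual, abs(V_new[s] - V[s]))
        V = V_new
        if residual <= eps:          # epsilon-consistency reached
            return V
```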
Policy Evaluation → Value Iteration (Bellman Equations)
• <S, A, T, C, G, s0>
• Define V*(s) {optimal cost} as the minimum expected cost
to reach a goal from this state.
• V* should satisfy the following equation:
V*(s) = 0                                                        if s ∈ G
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]    otherwise
• the inner expression is Q*(s,a)
• V*(s) = min_a Q*(s,a)
19
Bellman Equations for Reward Maximization MDP
• <S, A, T, R, s0, γ>
• Define V*(s) {optimal value} as the maximum expected discounted reward
from this state.
• V* should satisfy the following equation:
V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') [R(s, a, s') + γ V*(s')]
20
Fixed Point Computation in VI
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]
iterative refinement:
V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V_{n-1}(s')]
• non-linear
21
Example
[Figure: example MDP with states s0, s1, s2, s3, s4, sg. From s4, action a40 reaches sg with cost 5; action a41 has cost 2 and reaches sg with Pr 0.6 and s3 with Pr 0.4. The remaining actions (a00, a01 from s0; a20, a21 from s2; a1 from s1; a3 from s3) are deterministic.]
22
Bellman Backup
[Figure: Bellman backup at s4, with V0(sg) = 0 and V0(s3) = 2; a40 (C=5) reaches sg; a41 (C=2) reaches sg with Pr 0.6 and s3 with Pr 0.4]
• Q1(s4, a40) = 5 + 0 = 5
• Q1(s4, a41) = 2 + 0.6 × 0 + 0.4 × 2 = 2.8
• V1(s4) = min(5, 2.8) = 2.8
• agreedy = a41
Value Iteration [Bellman 57]
• No restriction on the initial value function
[Algorithm: repeat Bellman backups for iterations n = 1, 2, ...; terminate when the ε-consistency condition holds, i.e., max_s |V_n(s) − V_{n-1}(s)| ≤ ε]
24
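A sketch of value iteration for the cost-minimization setting, again assuming the illustrative MDP container from earlier; starting from all zeros is only one choice, since the slide notes there is no restriction on the initial value function:

```python
def value_iteration(mdp, eps=1e-4):
    """Bellman backups until epsilon-consistency; returns V and a greedy policy."""
    def q_value(s, a, V):
        return sum(p * (mdp.C[(s, a, s_next)] + V[s_next])
                   for s_next, p in mdp.T[(s, a)].items())

    V = {s: 0.0 for s in mdp.S}           # any initialization is allowed
    while True:
        residual = 0.0
        for s in mdp.S:
            if s in mdp.G:
                continue
            best = min(q_value(s, a, V) for a in mdp.A[s])   # Bellman backup
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual <= eps:
            pi = {s: min(mdp.A[s], key=lambda a: q_value(s, a, V))
                  for s in mdp.S if s not in mdp.G}
            return V, pi
```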
Example
(all actions cost 1 unless otherwise stated)
[Figure: same example MDP as above]
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
25
Comments
• Decision-theoretic Algorithm
• Dynamic Programming
• Fixed Point Computation
• Probabilistic version of Bellman-Ford Algorithm
• for shortest path computation
• Cost Minimization MDP : Stochastic Shortest Path Problem
• Time Complexity
• one iteration: O(|S|²|A|)
• number of iterations: poly(|S|, |A|, 1/ε, 1/(1−γ))
• Space Complexity: O(|S|)
26
Thank You
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad
Reinforcement Learning
Q Learning
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Pavlov’s Dog
2
Image from https://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=83455&section=2.2.1
MDPs
• Environment: Static, Fully Observable, Stochastic
• Percepts: Perfect; Actions: Instantaneous
• The agent decides: what action next?
3
Reinforcement Learning
• S: a set of states
• A: a set of actions
• T(s,a,s'): transition model   (the "Model")
• R(s,a): reward model
• γ: discount factor
• Still looking for a policy π(s)
• V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
7
TD Learning
• V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
• Say I know the correct values of V^π(s1) and V^π(s2)
[Figure: from s, action a0 reaches s1 (R=5, V^π(s1)=5) with Pr 0.6 and s2 (R=2, V^π(s2)=3) with Pr 0.4]
• Using the model: V^π(s) = 0.6(5+5) + 0.4(2+3) = 6 + 2 = 8
• Using five samples: V^π(s) = (10+10+10+5+5)/5 = 8
TD Learning
• V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
• The inner term is the sample value
• (s, s', r): reached s' from s by executing π(s) and got immediate reward r
• sample = r + γ V^π(s')
• Compute V^π(s) = (1/N) Σ_i sample_i
• Problem: we don't know the true values of V^π(s')
• learn together using dynamic programming!
Skill Needed: Estimating mean via online updates
• average of n+1 samples, rewritten as an online update with learning rate α:
• V^π_{n+1}(s) ← V^π_n(s) + α (sample_{n+1} − V^π_n(s))
• Nudge the old estimate towards the sample
10
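A tiny sketch of this running-average update; with α = 1/(n+1) it reproduces the batch mean of the five samples from the earlier slide:

```python
def running_mean_update(v_old, sample, n):
    """Online mean: with alpha = 1/(n+1), this equals the average of the first n+1 samples."""
    alpha = 1.0 / (n + 1)
    return v_old + alpha * (sample - v_old)

samples = [10, 10, 10, 5, 5]    # the samples used on the earlier TD slide
v = 0.0
for n, x in enumerate(samples):
    v = running_mean_update(v, x, n)
print(v)    # 8.0, same as sum(samples) / len(samples)
```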
TD Learning
• (s,s’,r)
• V^π(s) ← V^π(s) + α (sample − V^π(s))
• V^π(s) ← V^π(s) + α (r + γ V^π(s') − V^π(s)), where r + γ V^π(s') − V^π(s) is the TD error
• V^π(s) ← (1 − α) V^π(s) + α (r + γ V^π(s'))
11
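A minimal TD(0) sketch of this update; the env.reset()/env.step() interface, the constant learning rate, and the episode count are illustrative assumptions:

```python
def td0_policy_evaluation(env, pi, episodes=1000, alpha=0.1, gamma=1.0):
    """V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s')) along trajectories that follow pi."""
    V = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(pi[s])            # execute pi(s), observe (s, s', r)
            sample = r + gamma * V.get(s_next, 0.0)      # sample = r + gamma V(s')
            V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
            s = s_next
    return V
```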
Pavlov’s Dog
• Evaluate a policy: PE on an approximate MDP (model-based) vs. TD-Learning (model-free)
• Compute V*, Q*, π*: VI/PI on an approximate MDP (model-based) vs. Q-learning (model-free)
TD Learning → TD (V*) Learning?
• Can we do TD-like updates on V*?
• V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
• Problem: the max over actions sits outside the sum over next states, so it cannot be estimated from single sampled transitions; work with Q-values instead
• Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
• Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]
• VI → Q-Value Iteration
• TD Learning → Q Learning
Q Learning
• Directly learn Q*(s,a) values
• Receive a sample (s, a, s’, r)
• Your old estimate Q(s,a)
• New sample value: r + γ max_{a'} Q(s', a')
21
Q Learning Algorithm
• Forall s, a
• Initialize Q(s, a) = 0
• Repeat Forever
Where are you? s.
Choose some action a
Execute it in real world: (s, a, r, s’)
Do update:
Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s', a'))
• Q-learning converges to Q* as long as
• states/actions are finite and all rewards are bounded
• no (s,a) is starved: infinite visits in the limit of infinite samples
• the learning rate decays with visits to state-action pairs
• but not too fast: Σ_i α_i(s,a) = ∞ and Σ_i α_i²(s,a) < ∞
23
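A tabular Q-learning sketch with an ε-greedy behaviour policy (action choice is the topic of the next slides); the environment interface and hyperparameters are illustrative assumptions:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    Q = defaultdict(float)                              # Q[(s, a)], initialized to 0

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s_next, r, done = env.step(a)
            # assumption: a terminal next state contributes no future value
            future = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * future)
            s = s_next
    return Q
```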
Q Learning Algorithm
• Forall s, a
• Initialize Q(s, a) = 0
• Repeat Forever
Where are you? s.
Choose some action a
Execute it in real world: (s, a, r, s’)
Do update:
Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s', a'))
How to choose a? Something new: exploration; the greedy action: exploitation
24
Exploration vs. Exploitation Tradeoff
• A fundamental tradeoff in RL
• Exploration: must take actions that may be suboptimal but help discover
new rewards and in the long run increase utility
• Exploitation: must take actions that are known to be good (and seem
currently optimal) to optimize the overall utility
25
Explore/Exploit Policies
• Simplest scheme: ϵ-greedy
• Every time step flip a coin
• With probability 1-ϵ, take the greedy action
• With probability ϵ, take a random action
• Problem
• Exploration probability is constant
• Solutions
• Lower ϵ over time
• Use an exploration function
26
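A small sketch of ε-greedy with ε lowered over time; the particular decay schedule is an arbitrary illustrative choice:

```python
import random

def epsilon_greedy(Q, s, actions, episode, eps0=1.0, decay=0.001):
    """Explore with probability eps_n = eps0 / (1 + decay * episode), exploit otherwise."""
    eps = eps0 / (1.0 + decay * episode)
    if random.random() < eps:
        return random.choice(actions)                   # exploration
    return max(actions, key=lambda a: Q[(s, a)])        # exploitation (greedy)
```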
Explore/Exploit Policies
• Boltzmann Exploration
• Select action a with probability
Pr(a|s) = exp(Q(s,a)/T) / Σ_{a'∈A} exp(Q(s,a')/T)
• T: Temperature
• Similar to simulated annealing
• Large T: uniform, Small T: greedy
• Start with large T and decrease with time
28
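A sketch of Boltzmann action selection; subtracting the largest Q-value before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import math
import random

def boltzmann_action(Q, s, actions, T):
    """Sample a with Pr(a|s) proportional to exp(Q(s,a)/T); large T ~ uniform, small T ~ greedy."""
    m = max(Q[(s, a)] for a in actions)
    weights = [math.exp((Q[(s, a)] - m) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```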
Explore/Exploit Policies
• A Famous Exploration Policy: UCB
• Upper Confidence Bound
UCT(s) = argmax_a [ Q(s,a) + c √( ln n(s) / n(s,a) ) ]
• Value term: favors actions that looked good historically
• Exploration term: actions get an exploration bonus that grows with ln n(s)
Optimistic in the Face of Uncertainty
29
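A sketch of the UCB-style rule above; counting untried actions as having an infinite bonus is one common convention, assumed here:

```python
import math

def ucb_action(Q, n_s, n_sa, s, actions, c=1.0):
    """argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a))."""
    def score(a):
        if n_sa[(s, a)] == 0:
            return float("inf")                # try every action at least once
        return Q[(s, a)] + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)])
    return max(actions, key=score)
```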
Next Class
• RL + Deep Learning = Deep RL
Thank You
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad
Reinforcement Learning
Deep RL
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Q Learning Algorithm
• Forall s, a
• Initialize Q(s, a) = 0
• Repeat Forever
Where are you? s.
Choose some explore-exploit action a
Execute it in real world: (s, a, r, s’)
Do update:
Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s', a') − Q(s,a))
2
Model based RL vs. Model Free RL
• Model based RL
• estimate O(|S|²|A|) parameters
• requires relatively larger data for learning
• can make use of background knowledge easily
• Model free RL
• estimate O(|S||A|) parameters
• requires relatively less data for learning
Can we Enumerate State Space?
• Basic Q-Learning (or VI) keeps a table of all q-values
4
Chess
• branching factor b≈35
5
Game of Go
• Size of board: Chess 8x8 vs. Go 19x19
• Average no. of moves per game: Chess 100 vs. Go 300
• Average branching factor per turn: Chess 35 vs. Go 235
• Additional complexity in Go: players can pass
6
Generalize Across States
• Basic Q-Learning (or VI) keeps a table of all q-values
7
Deep Q Learning
Function Approximation
• Lookup Table (e.g., Q(s,a) table)
• does not scale – humongous for large problems
• (also known as curse of dimensionality)
• is not feasible if state space is continuous
• The Key Idea of Function Approximation
• approximate Q(s,a) as a parametric function
• automatically learn the parameters (w)
Q(s,a) ≈ Q(s,a; w)
• The Key Idea of Deep Q Learning
• Train a deep network to represent the Q function
• w are the parameters of the deep network
9
Deep Q Learning
13
Online Deep Q Learning Algorithm
• Estimate Q*(s,a) values using deep network(w)
• Receive a sample (s, a, s’, r)
• Compute target: y = r + γ max_{a'} Q(s', a'; w)
• Update w to minimize: L(w) = (y − Q(s, a; w))²
14
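A minimal PyTorch sketch of this online update; the network architecture, state dimension (4), and action count (2) are placeholders, not the lecture's choices:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, .; w)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def online_q_update(s, a, r, s_next, done):
    """One step: minimize (y - Q(s,a;w))^2 with y = r + gamma max_a' Q(s',a';w)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():                                 # the target is treated as a constant
        y = r + (0.0 if done else gamma * q_net(s_next).max().item())
    loss = (y - q_net(s)[a]) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```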
Challenges in Training Deep Q Learning
• Target values are moving
• Successive states are not i.i.d., they are correlated
• Successive states depend on w (policy)
• Small changes in w might lead to large changes in policy
• Solutions
• Freeze target Q network weights; update sporadically
• Experience Replay
17
Experience Replay
• Step I: Compute experience buffer
18
Experience Replay
• Step II: Weight updates using replay buffer
• Repeat K steps
• Randomly sample a mini-batch B of (s,a,r,s’) experiences from
replay memory buffer
• Perform an update to minimize the loss on this minibatch:
L(w) = Σ_{(s,a,r,s')∈B} ( r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w) )²
(the parameters w⁻ of the target network are frozen)
• Once in a while: w⁻ ← w
• Go to Step I (recompute replay buffer)
20
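A sketch of the minibatch update with a frozen target network, reusing the illustrative q_net, optimizer, and gamma from the earlier sketch:

```python
import copy
import random
import torch

replay_buffer = []                      # filled in Step I with (s, a, r, s_next, done) tuples
target_net = copy.deepcopy(q_net)       # frozen copy holding the parameters w-

def replay_update(batch_size=32):
    """Minimize sum over the minibatch of (r + gamma max_a' Q(s',a'; w-) - Q(s,a; w))^2."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(list, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    not_done = 1.0 - torch.as_tensor(done, dtype=torch.float32)
    with torch.no_grad():                                       # target network is frozen
        y = r + gamma * not_done * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a; w)
    loss = ((y - q) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# once in a while: target_net.load_state_dict(q_net.state_dict())   # w- <- w
```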
Atari Games
21
Deep Q Learning for Atari
22
End-to-End Deep Q Learning for Atari
• Input: last 4 frames
• Output: Q(s,a) for 18 joystick actions
• Architecture: CNN (same config for all games)
23
Image taken from Mnih et al, Nature (2015)
Deep Q Learning for Atari: Results
● Atari-Breakout trained agent:
https://www.youtube.com/watch?v=V1eYniJ0Rnk
● Uses Variation of DQN
24
(Deep) Policy Gradient
Policy gradient methods
• Learning the policy directly can be much simpler than learning Q values
π_θ(a|s) = exp(f(s, a; θ)) / Σ_{a'} exp(f(s, a'; θ))
• Update: θ ← θ + α ∇_θ V^π(θ)
REINFORCE
V^π(s; θ) = Σ_a π(a|s; θ) G(s, a)
∇_θ V^π(s) = ∇_θ Σ_a π(a|s) G(s, a)
∇_θ V^π(s) = Σ_a G(s, a) ∇_θ π(a|s)
∇_θ V^π(s) = Σ_a π(a|s) G(s, a) ∇_θ π(a|s) / π(a|s) = Σ_a π(a|s) G(s, a) ∇_θ log π(a|s)
• Sample-based estimate from an executed trajectory: use the observed return G(s_t, a_t) with ∇_θ log π(a_t|s_t)
28
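A sketch of the REINFORCE update, implemented as a loss whose gradient is the negative of Σ_t G(s_t, a_t) ∇_θ log π_θ(a_t|s_t); the softmax policy network and its sizes are placeholders:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # f(s, .; theta)
pg_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """theta <- theta + alpha * sum_t G(s_t, a_t) grad log pi_theta(a_t | s_t)."""
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    log_pi = torch.log_softmax(policy(states), dim=1)            # log pi_theta(.|s)
    chosen = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi_theta(a_t|s_t)
    loss = -(returns * chosen).sum()     # gradient ascent via minimizing the negative
    pg_optimizer.zero_grad()
    loss.backward()
    pg_optimizer.step()
```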
Policy Gradient vs Q Learning
• In many domains policy is easier to approximate than value function
30
Image taken from Silver et al, Nature (2016)
AlphaGo Zero
[Slide: RL for program synthesis / semantic parsing*]
π_θ(a | c) = Π_t π_θ(a_t | c, a_{1:t−1})
R(a | c, E^s): reward computed from Recall(E^s, E^a) and Precision(E^s, E^a)
1. REINFORCE
• Inefficient search space exploration
*Memory augmented policy optimization for program synthesis and semantic parsing, Liang et al, NIPS 2018
A Plug on NPTEL AI Course
https://onlinecourses.nptel.ac.in/noc21_cs42/preview
A Plug on School of AI
https://scai.iitd.ac.in
Thank You