
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad

Introduction to Deep Learning

Reinforcement Learning
Markov Decision Processes
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Planning Agent
[Diagram: dimensions along which an agent's setting can vary.]
• Environment: Static vs. Dynamic
• Fully vs. Partially Observable
• Deterministic vs. Stochastic
• Percepts: Perfect vs. Noisy
• Actions: Instantaneous vs. Durative
• The agent repeatedly asks: what action next?
Search Algorithms
• Static environment, Fully Observable, Deterministic
• Perfect percepts, Instantaneous actions
• What action next?
Stochastic Planning: MDPs
• Static environment, Fully Observable, Stochastic
• Perfect percepts, Instantaneous actions
• What action next?
Markov Decision Process (MDP)
• S: a set of states
• A: a set of actions
• T(s,a,s’): transition model
• C(s,a,s’): cost model
• G: set of goals
• s0: start state
• γ: discount factor
• R(s,a,s’): reward model
Objective of an MDP
• Find a policy π: S → A

• which optimizes
  • minimizes discounted expected cost to reach a goal, or
  • maximizes discounted expected reward, or
  • maximizes undiscounted expected (reward − cost)

• given a ____ horizon
  • finite
  • infinite
  • indefinite

• assuming full observability
Role of Discount Factor (γ)
• Keeps the total reward/total cost finite
  • useful for infinite-horizon problems

• Intuition (economics):
  • Money today is worth more than money tomorrow.

• Total reward: r1 + γ·r2 + γ²·r3 + …
• Total cost: c1 + γ·c2 + γ²·c3 + …
Acyclic vs. Cyclic MDPs
[Two example MDPs with C(a) = 5, C(b) = 10, C(c) = 1. Left (acyclic): from P, action a reaches Q/R with probabilities 0.6/0.4 and action b reaches S/T with 0.5/0.5; Q, R, S, T each reach goal G via c. Right (cyclic): the same, except action a loops back to P with probability 0.6 and reaches R with probability 0.4.]

Acyclic MDP: expectimin works
• V(Q/R/S/T) = 1
• V(P) = 6 – choose action a

Cyclic MDP: expectimin doesn’t work (infinite loop)
• V(R/S/T) = 1
• Q(P,b) = 11
• Q(P,a) = ????
  • suppose I decide to take a in P
  • Q(P,a) = 5 + 0.4·1 + 0.6·Q(P,a)  ⇒  Q(P,a) = 13.5
  • so V(P) = min(13.5, 11) = 11 – choose action b
Brute force Algorithm
• Go over all policies π
  • How many? |A|^|S| – finite

• Evaluate each policy
  • how to evaluate?
  • Vπ(s) ← expected cost of reaching the goal from s

• Choose the best
  • We know that a best policy exists (SSP optimality principle)
  • Vπ*(s) ≤ Vπ(s) for every policy π
Policy Evaluation
• Given a policy π: compute Vπ
• Vπ: cost of reaching the goal while following π
Deterministic MDPs
• Policy graph for π, with π(s0) = a0 and π(s1) = a1:
  s0 —a0 (C=5)→ s1 —a1 (C=1)→ sg

• Vπ(s1) = 1
• Vπ(s0) = 6
• add up the costs on the path to the goal
Acyclic MDPs
• Policy graph for π:
  s0 —a0→ s1 with Pr=0.6 (C=5), and s0 —a0→ s2 with Pr=0.4 (C=2)
  s1 —a1 (C=1)→ sg;  s2 —a2 (C=4)→ sg

• backward pass in reverse topological order:
  • Vπ(s1) = 1
  • Vπ(s2) = 4
  • Vπ(s0) = 0.6(5+1) + 0.4(2+4) = 6
General MDPs can be cyclic!
• Same policy graph, except that a2 from s2 now reaches sg with Pr=0.7 (C=4) and loops back to s0 with Pr=0.3 (C=3).

• Vπ(s1) = 1
• Vπ(s2) = ?? (depends on Vπ(s0))
• Vπ(s0) = ?? (depends on Vπ(s2))
• cannot do a simple single pass
General SSPs can be cyclic!
• but they give a simple system of linear equations:
  • Vπ(sg) = 0
  • Vπ(s1) = 1 + Vπ(sg) = 1
  • Vπ(s2) = 0.7(4 + Vπ(sg)) + 0.3(3 + Vπ(s0))
  • Vπ(s0) = 0.6(5 + Vπ(s1)) + 0.4(2 + Vπ(s2))
Policy Evaluation (Approach 1)
• Solve the system of linear equations:

  Vπ(s) = 0,  if s ∈ G
  Vπ(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ(s')],  otherwise

• |S| variables
• O(|S|³) running time
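To make Approach 1 concrete, here is a minimal numpy sketch (not from the slides) that solves the linear system for the cyclic SSP example above; the matrix and vector names are illustrative.

```python
import numpy as np

# Policy evaluation by solving (I - T_pi) V = c_pi for the cyclic SSP example.
# Rows/columns are ordered [s0, s1, s2]; the goal sg is dropped since V(sg) = 0.
T_pi = np.array([
    [0.0, 0.6, 0.4],   # s0 --a0--> s1 (Pr 0.6), s2 (Pr 0.4)
    [0.0, 0.0, 0.0],   # s1 --a1--> sg (Pr 1.0)
    [0.3, 0.0, 0.0],   # s2 --a2--> sg (Pr 0.7), s0 (Pr 0.3)
])
# Expected immediate cost of the policy's action in each state.
c_pi = np.array([
    0.6 * 5 + 0.4 * 2,   # from s0
    1.0,                 # from s1
    0.7 * 4 + 0.3 * 3,   # from s2
])

V_pi = np.linalg.solve(np.eye(3) - T_pi, c_pi)   # the O(|S|^3) direct solution
print(V_pi.round(4))   # -> [6.6818 1.     5.7045], the fixed point the next slide's iterates approach
```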
Iterative Policy Evaluation
Running the iterative updates on the cyclic example (with Vπ(s1) = 1 throughout):
• Vπ(s0) = 4.4 + 0.4·Vπ(s2): iterates 0, 5.88, 6.5856, 6.670272, 6.68043, …
• Vπ(s2) = 3.7 + 0.3·Vπ(s0): iterates 3.7, 5.464, 5.67568, 5.7010816, 5.704129, …
Policy Evaluation (Approach 2)
Vπ(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ(s')]

iterative refinement:

Vπ_n(s) ← Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + Vπ_{n−1}(s')]    (1)
Iterative Policy Evaluation
[Algorithm box: start from an arbitrary Vπ_0; at iteration n apply update (1) to every state; the termination condition is ε-consistency, i.e., successive iterates differ by at most ε in every state.]
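A minimal Python sketch of Approach 2 (iterative refinement with an ε-consistency stopping test), run on the same cyclic example; the dictionary-based MDP encoding is an assumption made for the sketch.

```python
def iterative_policy_evaluation(T, C, policy, goals, eps=1e-6):
    """Sweep V(s) <- sum_s' T(s,pi(s),s') * (C(s,pi(s),s') + V(s')) over all states
    until the largest change in a sweep is at most eps (epsilon-consistency).
    T and C map (state, action) -> {next_state: probability / cost}."""
    V = {s: 0.0 for s in policy}                  # arbitrary initialization (zeros here)
    while True:
        delta = 0.0
        for s, a in policy.items():
            if s in goals:
                continue                          # goal states keep value 0
            new_v = sum(p * (C[(s, a)][s2] + V[s2]) for s2, p in T[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v                          # update in place
        if delta <= eps:
            return V

# The cyclic SSP example from the slides:
T = {("s0", "a0"): {"s1": 0.6, "s2": 0.4},
     ("s1", "a1"): {"sg": 1.0},
     ("s2", "a2"): {"sg": 0.7, "s0": 0.3}}
C = {("s0", "a0"): {"s1": 5.0, "s2": 2.0},
     ("s1", "a1"): {"sg": 1.0},
     ("s2", "a2"): {"sg": 4.0, "s0": 3.0}}
policy = {"s0": "a0", "s1": "a1", "s2": "a2", "sg": None}
print(iterative_policy_evaluation(T, C, policy, goals={"sg"}))
# -> roughly {'s0': 6.68, 's1': 1.0, 's2': 5.70, 'sg': 0.0}
```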
Policy Evaluation → Value Iteration (Bellman Equations)
• ⟨S, A, T, C, G, s0⟩
• Define V*(s) {optimal cost} as the minimum expected cost to reach a goal from this state.
• V* should satisfy the following equation:

  V*(s) = 0,  if s ∈ G
  V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')],  otherwise

  The inner sum is Q*(s,a), so V*(s) = min_a Q*(s,a).
Bellman Equations for Reward Maximization MDP
• ⟨S, A, T, R, s0, γ⟩
• Define V*(s) {optimal value} as the maximum expected discounted reward from this state.
• V* should satisfy the following equation:

  V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') [R(s, a, s') + γ V*(s')]
Fixed Point Computation in VI
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]

iterative refinement:

V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V_{n−1}(s')]

(non-linear, because of the min over actions)
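A minimal Python sketch of this fixed-point computation (value iteration for a cost-minimization MDP); the toy two-action MDP at the bottom is hypothetical and only meant to exercise the min over actions.

```python
def value_iteration(T, C, goals, eps=1e-6):
    """Repeat V(s) <- min_a sum_s' T(s,a,s') * (C(s,a,s') + V(s')) until
    epsilon-consistency.  T, C map (state, action) -> {next_state: prob / cost}."""
    states = {s for s, _ in T} | {s2 for succ in T.values() for s2 in succ}
    actions = {}
    for s, a in T:
        actions.setdefault(s, []).append(a)
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in goals:
                continue
            new_v = min(sum(p * (C[(s, a)][s2] + V[s2]) for s2, p in T[(s, a)].items())
                        for a in actions[s])          # the non-linear min over actions
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta <= eps:
            return V

# Hypothetical toy MDP: "safe" costs 3 and always reaches the goal g;
# "risky" costs 1 but reaches g only with probability 0.5 (else stays in s0).
T = {("s0", "safe"):  {"g": 1.0},
     ("s0", "risky"): {"g": 0.5, "s0": 0.5}}
C = {("s0", "safe"):  {"g": 3.0},
     ("s0", "risky"): {"g": 1.0, "s0": 1.0}}
print(value_iteration(T, C, goals={"g"}))   # V(s0) -> 2.0: "risky" (expected cost 2) beats "safe" (3)
```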
Example
[MDP with states s0 … s4 and goal sg. Actions: a00, a01 from s0; a1 from s1; a20, a21 from s2; a3 from s3; a40, a41 from s4. All transitions are deterministic except a41, which reaches sg with Pr=0.6 (C=2) and s3 with Pr=0.4; a40 reaches sg with C=5.]
Bellman Backup
Backing up state s4, with current values V0(sg) = 0 and V0(s3) = 2:
• Q1(s4, a40) = 5 + 0 = 5
• Q1(s4, a41) = 2 + 0.6×0 + 0.4×2 = 2.8
• V1(s4) = min(5, 2.8) = 2.8,  a_greedy = a41
Value Iteration [Bellman 57]
• No restriction on the initial value function.
[Algorithm box: at iteration n apply the Bellman backup to every state; the termination condition is ε-consistency of successive value functions.]
Example
(all actions cost 1 unless otherwise stated; same MDP as above)

n    Vn(s0)    Vn(s1)    Vn(s2)    Vn(s3)    Vn(s4)
0    3         3         2         2         1
1    3         3         2         2         2.8
2    3         3         3.8       3.8       2.8
3    4         4.8       3.8       3.8       3.52
4    4.8       4.8       4.52      4.52      3.52
5    5.52      5.52      4.52      4.52      3.808
…
20   5.99921   5.99921   4.99969   4.99969   3.99969
Comments
• Decision-theoretic Algorithm
• Dynamic Programming
• Fixed Point Computation
• Probabilistic version of the Bellman-Ford Algorithm
  • for shortest path computation
  • Cost-Minimization MDP: Stochastic Shortest Path Problem

• Time Complexity
  • one iteration: O(|S|²|A|)
  • number of iterations: poly(|S|, |A|, 1/ε, 1/(1−γ))
• Space Complexity: O(|S|)
Thank You
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad

Introduction to Deep Learning

Reinforcement Learning
Q Learning
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Pavlov’s Dog

Image from https://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=83455&section=2.2.1
MDPs
• Static environment, Fully Observable, Stochastic
• Perfect percepts, Instantaneous actions
• What action next?
Reinforcement Learning
• S: a set of states
• A: a set of actions
• T(s,a,s’): transition model (part of the “model”)
• R(s,a): reward model (part of the “model”)
• γ: discount factor
• Still looking for a policy π(s)

• New twist: we don’t know T and/or R
  • we don’t know which states are good or what the actions do
  • must learn from data/experience

• Fundamental model for the learning of human behavior


Settings
• Batch setting in MDPs
  • Data (Experience) → Model (MDP) → Prediction (V.I.)

• Active setting in MDPs
  • Action → Data → (Model?)

• Actions have two purposes
  • To maximize reward
  • To learn the model
Skill Needed: Expectation = avg(samples)
Goal: Compute the expected age of DL course students
• Known P(A): take the expectation directly, E[A] = Σa P(a)·a
• Without P(A), instead collect samples [a1, a2, …, aN]
  • Unknown P(A), “Model Based”: estimate P̂(a) from the sample frequencies, then E[A] ≈ Σa P̂(a)·a
    Why does this work? Because eventually you learn the right model.
  • Unknown P(A), “Model Free”: E[A] ≈ (1/N) Σi ai
    Why does this work? Because samples appear with the right frequencies.
Temporal Difference Learning
• Given a policy π: compute Vπ
• Vπ: expected discounted long-term reward following π

• Vπ(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ Vπ(s')]

• TD Learning: compute this expectation as an average
TD Learning
• Vπ(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ Vπ(s')]

• Say I know the correct values Vπ(s1) = 5 and Vπ(s2) = 3, and the policy’s action a0 in s reaches s1 with Pr=0.6 (R=5) and s2 with Pr=0.4 (R=2), taking γ = 1 here:
  • exact expectation:  Vπ(s) = 0.6(5+5) + 0.4(2+3) = 6 + 2 = 8
  • sample average (five samples of r + Vπ(s')):  Vπ(s) ≈ (10+10+10+5+5)/5 = 8
TD Learning
• Vπ(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ Vπ(s')]
• The inner term is the sample value
• (s, s’, r): reached s’ from s by executing π(s) and got immediate reward r
  • sample = r + γ Vπ(s’)
• Compute Vπ(s) = (1/N) Σi samplei

• Problem: we don’t know the true values Vπ(s’)
  • learn them together, using dynamic programming!
Skill Needed: Estimating a mean via online updates

• Don’t learn T or R; directly maintain Vπ

• Update Vπ(s) each time you take an action in s, via a moving average:
  • Vπ_{n+1}(s) ← (1/(n+1)) · (n·Vπ_n(s) + sample_{n+1})
  • Vπ_{n+1}(s) ← (1/(n+1)) · ((n+1−1)·Vπ_n(s) + sample_{n+1})
  • Vπ_{n+1}(s) ← Vπ_n(s) + (1/(n+1)) · (sample_{n+1} − Vπ_n(s))
    (average of n+1 samples; 1/(n+1) acts as the learning rate)

• Vπ_{n+1}(s) ← Vπ_n(s) + α · (sample_{n+1} − Vπ_n(s))
  • Nudge the old estimate towards the sample
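A tiny sketch showing that the incremental update with learning rate 1/(n+1) reproduces the ordinary sample mean (the sample values are the hypothetical r + V(s') samples used earlier).

```python
samples = [10.0, 10.0, 10.0, 5.0, 5.0]   # hypothetical sample values r + V(s')

v = 0.0
for n, sample in enumerate(samples):
    alpha = 1.0 / (n + 1)                # learning rate = 1 / (number of samples so far)
    v = v + alpha * (sample - v)         # nudge the old estimate towards the new sample

print(v, sum(samples) / len(samples))    # both print 8.0: the online update equals the mean
```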
TD Learning
• On each transition (s, s’, r):
  • Vπ(s) ← Vπ(s) + α (sample − Vπ(s))
  • Vπ(s) ← Vπ(s) + α (r + γ Vπ(s’) − Vπ(s))      (the term in parentheses is the TD error)
  • Vπ(s) ← (1 − α) Vπ(s) + α (r + γ Vπ(s’))

• The update maintains a mean of (noisy) value samples

• If the learning rate decreases appropriately with the number of samples (e.g. 1/n), then the value estimates converge to the true values! (non-trivial)
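A minimal TD sketch for the two-successor example above, using a hypothetical simulator `step` in place of the environment (V(s1) and V(s2) are held at their assumed true values, as on the earlier slide).

```python
import random

random.seed(0)

def step():
    """Hypothetical simulator: the policy's action reaches s1 (reward 5) with
    probability 0.6 and s2 (reward 2) with probability 0.4."""
    return ("s1", 5.0) if random.random() < 0.6 else ("s2", 2.0)

gamma = 1.0
V = {"s": 0.0, "s1": 5.0, "s2": 3.0}     # V(s1), V(s2) assumed known

for n in range(1, 10001):
    s_next, r = step()
    alpha = 1.0 / n                       # decaying learning rate
    td_error = r + gamma * V[s_next] - V["s"]
    V["s"] += alpha * td_error            # TD update on V(s)

print(V["s"])                             # close to 8 = 0.6*(5+5) + 0.4*(2+3)
```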
Pavlov’s Dog

Image from https://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=83455&section=2.2.1


The Story So Far: MDPs and RL

Known MDP: Offline Solution
  Goal                     Technique
  Compute V*, Q*, π*       Value / policy iteration
  Evaluate a policy π      Policy evaluation

Unknown MDP: Model-Based
  Goal                     Technique
  Compute V*, Q*, π*       VI/PI on approx. MDP
  Evaluate a policy π      PE on approx. MDP

Unknown MDP: Model-Free
  Goal                     Technique
  Compute V*, Q*, π*       Q-learning
  Evaluate a policy π      TD-Learning
TD Learning → TD (V*) Learning
• Can we do TD-like updates on V*?

• V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

• Hmmm… what to do?
  • For sample-based updates the RHS should be an expectation, but here the max sits outside it.
  • Instead of V*, write all the equations in terms of Q*.
Bellman Equations (V*) → Bellman Equations (Q*)

• V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

• Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

• Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]

• VI → Q-Value Iteration
• TD Learning → Q Learning
Q Learning
• Directly learn Q*(s,a) values
• Receive a sample (s, a, s’, r)
• Your old estimate: Q(s,a)
• New sample value: r + γ max_{a'} Q(s’, a’)

• Nudge the estimates:
  • Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s’, a’) − Q(s,a))
  • Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s’, a’))
Q Learning Algorithm
• Forall s, a
  • Initialize Q(s, a) = 0

• Repeat Forever
  • Where are you? s.
  • Choose some action a
  • Execute it in the real world: (s, a, r, s’)
  • Do update: Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s’, a’))

• Q Learning is an off-policy learning algorithm.
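A minimal tabular Q-learning sketch. The environment interface (env.reset(), env.actions(s), and env.step(s, a) returning (next_state, reward, done)) is an assumption made for this sketch, not part of the slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (off-policy:
    the max in the target ignores how the action was actually chosen)."""
    Q = defaultdict(float)                       # Q(s, a) initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:        # explore
                a = random.choice(acts)
            else:                                # exploit
                a = max(acts, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[(s2, x)] for x in env.actions(s2))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```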
Properties
• Q Learning converges to the optimal values Q*
  • irrespective of initialization,
  • irrespective of the action choice policy,
  • irrespective of the learning rate

• as long as
  • states/actions are finite and all rewards are bounded
  • no (s,a) is starved: infinite visits over infinite samples
  • the learning rate decays with visits to state-action pairs
    • but not too fast: Σi αi(s,a) = ∞ and Σi αi²(s,a) < ∞
Q Learning Algorithm
• Forall s, a
  • Initialize Q(s, a) = 0

• Repeat Forever
  • Where are you? s.
  • Choose some action a
    • how to choose? a new action: exploration; the greedy action: exploitation
  • Execute it in the real world: (s, a, r, s’)
  • Do update: Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s’, a’))
Exploration vs. Exploitation Tradeoff
• A fundamental tradeoff in RL

• Exploration: must take actions that may be suboptimal but that help discover new rewards and, in the long run, increase utility

• Exploitation: must take actions that are known to be good (and seem currently optimal) to optimize the overall utility

• Slowly move from exploration to exploitation
Explore/Exploit Policies
• Simplest scheme: ϵ-greedy
• Every time step flip a coin
• With probability 1-ϵ, take the greedy action
• With probability ϵ, take a random action

• Problem
• Exploration probability is constant

• Solutions
• Lower ϵ over time
• Use an exploration function

26
Explore/Exploit Policies
• Boltzmann Exploration
  • Select action a with probability
    Pr(a|s) = exp(Q(s,a)/T) / Σ_{a'∈A} exp(Q(s,a')/T)
  • T: temperature
    • similar to simulated annealing
    • large T: uniform, small T: greedy
    • start with large T and decrease it with time

• GLIE: greedy in the limit of infinite exploration
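Minimal sketches of the two action-selection rules from these slides (the Q-values at the bottom are hypothetical).

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, temperature):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {("s", "left"): 1.0, ("s", "right"): 2.0}                    # hypothetical Q-values
print(boltzmann(Q, "s", ["left", "right"], temperature=100.0))   # large T: nearly uniform
print(boltzmann(Q, "s", ["left", "right"], temperature=0.1))     # small T: almost always "right"
```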
Explore/Exploit Policies
• Exploration Functions
  • stop exploring actions whose badness is established
  • continue exploring other actions
  • Let Q(s,a) = q, #visits(s,a) = n
    • e.g.: f(q, n) = q + k/n
    • unexplored states have infinite f
    • highly explored bad states have low f
  • Modified Q update
    • Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} f(Q(s’, a’), N(s’, a’)))
    • states leading to unexplored states are also preferred
Explore/Exploit Policies
• A Famous Exploration Policy: UCB (Upper Confidence Bound)

  π_UCT(s) = argmax_a [ Q(s,a) + c √( ln n(s) / n(s,a) ) ]

  • Value term: favors actions that looked good historically
  • Exploration term: actions get an exploration bonus that grows with ln n(s)
  • Optimistic in the Face of Uncertainty
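A small sketch of UCB action selection; the counts and Q-values are hypothetical.

```python
import math

def ucb_action(Q, s, actions, n_s, n_sa, c=1.0):
    """Pick argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a)); unvisited actions get
    an infinite bonus, so they are tried at least once."""
    def score(a):
        visits = n_sa.get((s, a), 0)
        if visits == 0:
            return float("inf")
        return Q[(s, a)] + c * math.sqrt(math.log(n_s) / visits)
    return max(actions, key=score)

Q = {("s", "left"): 0.5, ("s", "right"): 0.8}     # "right" looks better historically...
n_sa = {("s", "left"): 2, ("s", "right"): 98}     # ...but "left" has barely been tried
print(ucb_action(Q, "s", ["left", "right"], n_s=100, n_sa=n_sa))   # -> "left": the bonus dominates
```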
Next Class
• RL + Deep Learning = Deep RL
Thank You
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad

Introduction to Deep Learning

Reinforcement Learning
Deep RL
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Q Learning Algorithm
• Forall s, a
  • Initialize Q(s, a) = 0

• Repeat Forever
  • Where are you? s.
  • Choose some explore-exploit action a
  • Execute it in the real world: (s, a, r, s’)
  • Do update: Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s’, a’) − Q(s, a))
Model based RL vs. Model Free RL
• Model based RL
  • estimate O(|S|²|A|) parameters
  • requires relatively more data for learning
  • can make use of background knowledge easily

• Model free RL
  • estimate O(|S||A|) parameters
  • requires relatively less data for learning
Can we Enumerate State Space?
• Basic Q-Learning (or VI) keeps a table of all q-values

• In realistic situations, we cannot possibly learn about every


single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory

4
Chess
• branching factor b ≈ 35

• game length m ≈ 100

• search space b^m ≈ 35^100 ≈ 10^154

• The Universe: number of atoms ≈ 10^78

• Exact solution completely infeasible
Game of Go

                                 Chess     Go
Size of board                    8 x 8     19 x 19
Average no. of moves per game    100       300
Avg branching factor per turn    35        235
Additional complexity            –         Players can pass
Generalize Across States
• Basic Q-Learning (or VI) keeps a table of all q-values

• In realistic situations, we cannot possibly learn about every


single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory

• Instead, we want to generalize:


• Learn about some small number of training states from experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning

7
Deep Q Learning
Function Approximation
• Lookup Table (e.g., a Q(s,a) table)
  • does not scale – humongous for large problems (the curse of dimensionality)
  • is not feasible if the state space is continuous
• The Key Idea of Function Approximation
  • approximate Q(s,a) as a parametric function Q(s,a; w)
  • automatically learn the parameters w
• The Key Idea of Deep Q Learning
  • train a deep network to represent the Q function
  • w are the parameters of the deep network
Deep Q Learning

• Regular Q Learning: nudge Q(s,a) towards the target
  • Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s’, a’) − Q(s,a))
    where the target is r + γ max_{a'} Q(s’, a’)
Deep Q Learning

• Regular Q Learning: nudge Q(s,a) towards the target
  • Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s’, a’) − Q(s,a))

• Deep Q Learning: nudge the approximated Q values towards the target by minimizing a squared error
  • Loss(w) = ( r + γ max_{a'} Q(s’, a’; w)  −  Q(s, a; w) )²
              (target)                         (estimate)
Deep Q Learning vs. Deep Supervised Learning

• Deep (Supervised) Learning: nudge the prediction towards the target y
  • Loss(w) = ( y − f(x; w) )²      (target vs. estimate)

• Deep Q Learning: nudge the approximated Q values towards the target by minimizing a squared error
  • Loss(w) = ( r + γ max_{a'} Q(s’, a’; w) − Q(s, a; w) )²      (target vs. estimate)

• Difference: the target in Q Learning is also moving!


Online Deep Q Learning Algorithm
• Estimate Q*(s,a) values using a deep network (w)
• Receive a sample (s, a, s’, r)
• Compute the target: y = r + γ max_{a'} Q(s’, a’; w)
Online Deep Q Learning Algorithm
• Estimate Q*(s,a) values using a deep network (w)
• Receive a sample (s, a, s’, r)
• Compute the target: y = r + γ max_{a'} Q(s’, a’; w)
• Update w to minimize: L(w) = ( y − Q(s, a; w) )²
Online Deep Q Learning Algorithm
• Estimate Q*(s,a) values using a deep network (w)
• Receive a sample (s, a, s’, r)
• Compute the target: y = r + γ max_{a'} Q(s’, a’; w)
• Update w to minimize: L(w) = ( y − Q(s, a; w) )²

• Nudge the estimates:
  • w ← w − α ∂L/∂w, i.e., w ← w − α ∇w L
  • where the gradient ∇w L = (Q(s, a; w) − y) ∇w Q(s, a; w)
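A minimal single-sample deep Q update, written with PyTorch as an assumption (the network size, learning rate, and state/action dimensions below are illustrative, not from the slides).

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done):
    """One gradient step on L(w) = (y - Q(s,a;w))^2 with y = r + gamma * max_a' Q(s',a';w)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():                          # the target y is treated as a constant
        y = r if done else r + gamma * q_net(s_next).max().item()
    q_sa = q_net(s)[a]                             # Q(s, a; w)
    loss = (q_sa - y) ** 2                         # gradient is (Q(s,a;w) - y) * grad_w Q, up to a factor 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```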
Challenges in Training Deep Q Learning
• Target values are moving
• Successive states are not i.i.d., they are correlated
• Successive states depend on w (policy)
• Small changes in w might lead to large changes in policy

• These make training deep Q networks highly unstable

16
Challenges in Training Deep Q Learning
• Target values are moving
• Successive states are not i.i.d., they are correlated
• Successive states depend on w (policy)
• Small changes in w might lead to large changes in policy

• These make training deep Q networks highly unstable

• Solutions
• Freeze target Q network weights; update sporadically
• Experience Replay

17
Experience Replay
• Step I: Compute experience buffer

• At each time step:


• Take action at according to ϵ-greedy policy
• Store experience (st, at, rt+1, st+1) in replay memory buffer
• Replay memory buffer contains
• Many (st, at, rt+1, st+1) tuples for t=0,1,2,…T

18
Experience Replay
• Step II: Weight updates using replay buffer
• Repeat K steps
• Randomly sample a mini-batch B of (s,a,r,s’) experiences from
replay memory buffer

19
Experience Replay
• Step II: Weight updates using the replay buffer
  • Repeat K steps
    • Randomly sample a mini-batch B of (s, a, r, s’) experiences from the replay memory buffer
    • Perform an update to minimize the loss on this minibatch:
      Σ_{(s,a,r,s')∈B} ( r + γ max_{a'} Q(s’, a’; w⁻) − Q(s, a; w) )²
      (the parameters w⁻ of the target network are frozen)
  • Once in a while: w⁻ ← w
  • Go to Step I (recompute the replay buffer)
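A minimal replay-buffer sketch (the capacity and batch size are illustrative); sampling uniformly at random breaks the correlation between successive states.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest experiences fall off automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random mini-batch, decorrelating successive states
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Inside the training loop (sketch): store each transition, then update on a random batch,
# using the frozen target network w- for the max term, and occasionally copy w into w-.
```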
Atari Games

21
Deep Q Learning for Atari

22
End-to-End Deep Q Learning for Atari
• Input: last 4 frames
• Output: Q(s,a) for 18 joystick actions
• Architecture: CNN (same config for all games)

23
Image taken from Mnih et al, Nature (2015)
Deep Q Learning for Atari: Results
● Atari-Breakout trained agent: https://www.youtube.com/watch?v=V1eYniJ0Rnk
● Uses a variation of DQN

Image taken from Mnih et al, Nature (2015)
(Deep) Policy Gradient

Policy gradient methods
• Learning the policy directly can be much simpler than learning Q values

• We can train a neural network to output stochastic policies, i.e., probabilities of taking each action in a given state: πθ(a|s)
• Softmax policy: compute logits f(s,a) and take a softmax over all a:

  πθ(a|s) = exp(f(s, a; θ)) / Σ_{a'} exp(f(s, a'; θ))

• Update: θ ← θ + α ∇θ Vπ(θ)
REINFORCE

Vπ(s; θ) = Σa π(a|s; θ) G(s, a)

∇θ Vπ(s) = ∇θ Σa π(a|s) G(s, a)
∇θ Vπ(s) = Σa G(s, a) ∇θ π(a|s)
∇θ Vπ(s) = Σa π(a|s) G(s, a) ∇θ π(a|s) / π(a|s)
∇θ Vπ(s) = Σa π(a|s) G(s, a) ∇θ ln π(a|s)
∇θ Vπ(s) = E[ G(s, a) ∇θ ln π(a|s) ]

REINFORCE estimates this expectation from sampled episodes: for each visited (st, at), update θ ← θ + α G(st, at) ∇θ ln π(at|st; θ).
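A minimal REINFORCE sketch, assuming PyTorch and an episodic environment with env.reset() -> state and env.step(action) -> (state, reward, done); the network size and learning rate are illustrative.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(env):
    """Sample one episode with the softmax policy, then ascend G_t * grad ln pi(a_t|s_t)."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)   # pi_theta(a|s) via softmax
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)

    returns, g = [], 0.0
    for r in reversed(rewards):                  # discounted return G_t, computed backwards
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    loss = -sum(g_t * lp for g_t, lp in zip(returns, log_probs))   # minimize the negative objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```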
Policy Gradient vs Q Learning
• In many domains the policy is easier to approximate than the value function

• Naturally applies to continuous actions, unlike Q learning

• Useful in games with imperfect information
  • where the optimal play is a stochastic policy
DeepMind’s AlphaGo
• Policy network: initialized by supervised training on a large amount of human games

• Value network: trained like Q-learning, based on self-play

• The networks are used to guide Monte Carlo tree search (MCTS)

Image taken from Silver et al, Nature (2016)
AlphaGo Zero

Image taken from https://deepmind.com/blog/article/alphago-zero-starting-scratch
Applications
• Stochastic Games
• Robotics: navigation, helicopter maneuvers, …
• Finance: options, investments
• Communication Networks
• Medicine: Radiation planning for cancer
• Controlling workflows
• Optimize bidding decisions in auctions
• Traffic flow optimization
• Aircraft queueing for landing; airline meal provisioning
• Optimizing software on mobiles
• Forest firefighting
• …
Case Study: RL for NLP

Introduction: Task Oriented Dialogs (TOD)

User (t=1): Suggest an expensive restaurant that's in the south section of the city
Agent: Do you have a cuisine preference?
User: No, I don't care about the type of cuisine. Also, could you make the price range moderate
Agent (t=2) issues a KB query:  SELECT * FROM KB WHERE location = south AND price = moderate
  Query results:
    Restaurant          Cuisine   …   Phone
    Prezzo              Italian   …   1098-1134
    Taj Tandoori        Indian    …   1648-1796
    Peking Restaurant   Chinese   …   2343-4040
Agent: Peking Restaurant is a moderately priced restaurant and is in the south part of the town
User (t=3): May I have their phone number?
Agent: Their phone number is 2343-4040
Introduction: Existing TOD Systems
[Same dialog as above, with the KB query at t=2 and its query results shown explicitly.]
Introduction: Unannotated TOD Systems
• As KB queries go undocumented, they need manual annotation
• manual annotation is expensive and hinders scalability
[Same dialog as above, but the KB query and its results at t=2 are not available in the training data.]
KB Query Predictor: Weak Supervision
[Same dialog as above; the unannotated dialog itself provides weak supervision for predicting the KB query at t=2.]
KB Query Predictor: RL Formulation
[The dialog turns up to the KB query form the dialog context; the turns after it form the subsequent dialog.]
• c is the dialog context
• E^s is the set of entities in the subsequent dialog
KB Query Predictor: RL Formulation
[Same figure, now labelled: the dialog context is c, the predicted KB query is a, the KB results contribute entities E^a, and the subsequent dialog contributes entities E^s.]
• c is the dialog context
• a is the KB query
• E^a is the set of entities in the KB results
• E^s is the set of entities in the subsequent dialog

KB Query Predictor: RL Formulation
[The policy πθ maps the dialog context c to a KB query a; executing a against the KB yields entities E^a, which are compared with the entities E^s in the subsequent dialog to compute the reward.]
KB Query Predictor: Policy

πθ(a | c) = Π_{t=1}^{T} πθ(at | c, a1:t−1)

Each action at is a word predicted by the query predictor. It can be
1. a keyword from the SQL query language
2. a field name from the KB
3. a word in the dialog context c
4. an <end_of_query> token
KB Query Predictor: Rewards

R(a | c, E^s) = 1[Recall(E^s, E^a) = 1] · Precision(E^s, E^a)

• the recall-based indicator ensures that all necessary entities are retrieved
• the precision term penalizes retrieval of a large number of unused entities
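A small Python sketch of this reward (the entity strings below are made up); the handling of an empty result set is an extra assumption for the sketch.

```python
def kb_query_reward(retrieved_entities, dialog_entities):
    """R(a | c, E^s) = 1[Recall(E^s, E^a) = 1] * Precision(E^s, E^a),
    where E^a are entities in the KB results and E^s are entities in the subsequent dialog."""
    E_a, E_s = set(retrieved_entities), set(dialog_entities)
    if not E_s <= E_a:                   # recall < 1: some needed entity was not retrieved
        return 0.0
    if not E_a:                          # empty result with nothing needed (assumed convention)
        return 1.0
    return len(E_s & E_a) / len(E_a)     # precision over the retrieved entities

# Hypothetical example: both needed entities are retrieved, but so are four unused ones.
print(kb_query_reward(
    ["Peking", "2343-4040", "Prezzo", "1098-1134", "Taj Tandoori", "1648-1796"],
    ["Peking", "2343-4040"]))            # -> 0.333... (recall 1, precision 2/6)
```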
KB Query Predictor: Policy Optimization

1. REINFORCE
   • inefficient search-space exploration

2. MAPO: Memory Augmented Policy Optimization*
   • systematically explores the search space [Liang et al., NIPS ’18]
   • our proposed solution is an extension of MAPO [Raghu et al., TACL ’21]

*Memory augmented policy optimization for program synthesis and semantic parsing, Liang et al., NIPS 2018
A Plug on NPTEL AI Course

https://onlinecourses.nptel.ac.in/noc21_cs42/preview

A Plug on School of AI
https://scai.iitd.ac.in
Thank You
