Tut21 RL
Reinforcement Learning
Markov Decision Processes
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Planning Agent
• Environment: Static vs. Dynamic; Fully vs. Partially Observable; Deterministic vs. Stochastic
• Percepts: Perfect vs. Noisy; Actions: Instantaneous vs. Durative
• The agent repeatedly decides: what action next?
2
Search Algorithms
• Environment: Static, Fully Observable, Deterministic
• Percepts: Perfect; Actions: Instantaneous
• The agent decides: what action next?
3
Stochastic Planning: MDPs
• Environment: Static, Fully Observable, Stochastic
• Percepts: Perfect; Actions: Instantaneous
• The agent decides: what action next?
4
Markov Decision Process (MDP)
• S: A set of states
• A: A set of actions
• T(s,a,s’): transition model
• C(s,a,s’): cost model
• G: set of goals
• s0: start state
• γ: discount factor
• R(s,a,s’): reward model
5
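A minimal sketch of how this tuple could be held in code; the dictionary encoding, field names, and the tiny two-state example are illustrative assumptions, not from the lecture.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Container for the tuple <S, A, T, C, G, s0> (plus gamma) from the slide."""
    S: list        # states
    A: dict        # A[s] = list of actions applicable in s
    T: dict        # T[(s, a)] = {s_next: probability}
    C: dict        # C[(s, a, s_next)] = cost
    G: set         # goal states
    s0: str        # start state
    gamma: float = 1.0   # discount factor (1.0 = undiscounted)

# Tiny illustrative example: s0 --a--> sg with cost 1
toy = MDP(
    S=["s0", "sg"],
    A={"s0": ["a"], "sg": []},
    T={("s0", "a"): {"sg": 1.0}},
    C={("s0", "a", "sg"): 1.0},
    G={"sg"},
    s0="s0",
)
```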
Objective of an MDP
• Find a policy π: S → A
• which optimizes one of:
• minimizes discounted expected cost to reach a goal
• maximizes discounted expected reward
• maximizes undiscounted expected (reward − cost)
6
Role of Discount Factor (γ)
• Keeps the total reward / total cost finite
• useful for infinite horizon problems
• Intuition (economics):
• Money today is worth more than money tomorrow.
7
Acyclic vs. Cyclic MDPs
[Figure: two example MDPs with action costs C(a) = 5, C(b) = 10, C(c) = 1; the left one is acyclic, the right one has a cycle at P under action a (Pr 0.6 back to P, Pr 0.4 onward)]
Acyclic MDP: expectimin works
• V(Q) = V(R) = V(S) = V(T) = 1
• V(P) = 6, via action a
Cyclic MDP: expectimin doesn't work
• infinite loop
• V(R) = V(S) = V(T) = 1
• Q(P,b) = 11
• Q(P,a) = ????
• suppose I decide to take a in P
• Q(P,a) = 5 + 0.4 × 1 + 0.6 × Q(P,a)
• ⇒ Q(P,a) = 13.5
Brute force Algorithm
• Go over all policies π
• How many? |A|^|S| (finite)
9
Policy Evaluation
• Given a policy π: compute V^π
• V^π: expected cost of reaching a goal while following π
10
Deterministic MDPs
• Policy Graph for π: π(s0) = a0, π(s1) = a1
[Figure: s0 --a0 (C=5)--> s1 --a1 (C=1)--> sg]
• V^π(s1) = 1
• V^π(s0) = 6
• add up the costs on the path to the goal
11
Acyclic MDPs
• Policy Graph for π
[Figure: from s0, action a0 reaches s1 with Pr 0.6 (C=5) and s2 with Pr 0.4 (C=2); from s1, a1 reaches sg (C=1); from s2, a2 reaches sg (C=4)]
• backward pass in reverse topological order
• V^π(s1) = 1
• V^π(s2) = 4
• V^π(s0) = 0.6(5+1) + 0.4(2+4) = 6
12
General MDPs can be cyclic!
[Figure: from s0, action a0 reaches s1 with Pr 0.6 (C=5) and s2 with Pr 0.4 (C=2); from s1, a1 reaches sg (C=1); from s2, a2 reaches sg with Pr 0.7 (C=4) and loops back to s0 with Pr 0.3 (C=3)]
• cannot do a simple single pass
• V^π(s1) = 1
• V^π(s2) = ?? (depends on V^π(s0))
• V^π(s0) = ?? (depends on V^π(s2))
13
General SSPs can be cyclic!
[Figure: same cyclic policy graph as above]
• a simple system of linear equations:
• V^π(sg) = 0
• V^π(s1) = 1 + V^π(sg) = 1
• V^π(s2) = 0.7(4 + V^π(sg)) + 0.3(3 + V^π(s0))
• V^π(s0) = 0.6(5 + V^π(s1)) + 0.4(2 + V^π(s2))
14
Policy Evaluation (Approach 1)
• Solving the System of Linear Equations
V^π(s) = 0                                                     if s ∈ G
V^π(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + V^π(s')]    otherwise
• |S| variables
• O(|S|³) running time
15
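A possible NumPy implementation of Approach 1, assuming the dictionary-style MDP container sketched earlier (undiscounted costs, goal values fixed at 0):

```python
import numpy as np

def policy_evaluation_linear(mdp, pi):
    """Solve V(s) - sum_s' T(s,pi(s),s') V(s') = sum_s' T(s,pi(s),s') C(s,pi(s),s') for non-goal s."""
    states = [s for s in mdp.S if s not in mdp.G]   # goal states are pinned to V = 0
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.zeros(len(states))
    for s in states:
        a = pi[s]
        for s_next, p in mdp.T[(s, a)].items():
            b[idx[s]] += p * mdp.C[(s, a, s_next)]
            if s_next not in mdp.G:
                A[idx[s], idx[s_next]] -= p
    x = np.linalg.solve(A, b)          # the O(|S|^3) step
    V = {s: 0.0 for s in mdp.G}
    V.update({s: float(x[idx[s]]) for s in states})
    return V
```

On the cyclic policy graph above this yields V^π(s0) ≈ 6.68 and V^π(s2) ≈ 5.70, the limits of the iterates shown on the next slide.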
Iterative Policy Evaluation
[Figure: iterative policy evaluation on the cyclic policy graph above; with V^π(s1) = 1, the coupled updates V^π(s0) = 4.4 + 0.4 V^π(s2) and V^π(s2) = 3.7 + 0.3 V^π(s0) produce the iterates
s0: 0, 5.88, 6.5856, 6.670272, 6.68043, ...
s2: 3.7, 5.464, 5.67568, 5.7010816, 5.704129, ...]
16
Policy Evaluation (Approach 2)
V^π(s) = Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + V^π(s')]
iterative refinement:
V_n^π(s) ← Σ_{s'∈S} T(s, π(s), s') [C(s, π(s), s') + V_{n-1}^π(s')]      (1)
17
Iterative Policy Evaluation
[Algorithm: repeat update (1) for iterations n = 1, 2, ...; terminate when the ε-consistency condition holds, i.e., max_s |V_n^π(s) − V_{n-1}^π(s)| ≤ ε]
18
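A matching sketch of Approach 2 under the same assumed MDP container, with ε-consistency as the termination condition:

```python
def iterative_policy_evaluation(mdp, pi, eps=1e-4):
    """Repeat update (1) until max_s |V_n(s) - V_{n-1}(s)| <= eps."""
    V = {s: 0.0 for s in mdp.S}
    while True:
        V_new, residual = dict(V), 0.0
        for s in mdp.S:
            if s in mdp.G:
                continue
            a = pi[s]
            V_new[s] = sum(p * (mdp.C[(s, a, s_next)] + V[s_next])
                           for s_next, p in mdp.T[(s, a)].items())
            residual = max(residual, abs(V_new[s] - V[s]))
        V = V_new
        if residual <= eps:          # epsilon-consistency reached
            return V
```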
Policy Evaluation → Value Iteration (Bellman Equations)
• <S, A, T, C, G, s0>
• Define V*(s) {optimal cost} as the minimum expected cost
to reach a goal from this state.
• V* should satisfy the following equation:
V*(s) = 0                                                        if s ∈ G
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]    otherwise
• the inner expression is Q*(s,a)
• V*(s) = min_a Q*(s,a)
19
Bellman Equations for Reward Maximization MDP
• <S, A, T, R, s0, γ>
• Define V*(s) {optimal value} as the maximum expected discounted reward
from this state.
• V* should satisfy the following equation:
V*(s) = max_{a∈A} Σ_{s'∈S} T(s, a, s') [R(s, a, s') + γ V*(s')]
20
Fixed Point Computation in VI
V*(s) = min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V*(s')]
iterative refinement:
V_n(s) ← min_{a∈A} Σ_{s'∈S} T(s, a, s') [C(s, a, s') + V_{n-1}(s')]
• non-linear
21
Example
[Figure: example MDP with states s0, s1, s2, s3, s4, sg. From s4, action a40 reaches sg with cost 5; action a41 has cost 2 and reaches sg with Pr 0.6 and s3 with Pr 0.4. The remaining actions (a00, a01 from s0; a20, a21 from s2; a1 from s1; a3 from s3) are deterministic.]
22
Bellman Backup
[Figure: Bellman backup at s4, with V0(sg) = 0 and V0(s3) = 2; a40 (C=5) reaches sg; a41 (C=2) reaches sg with Pr 0.6 and s3 with Pr 0.4]
• Q1(s4, a40) = 5 + 0 = 5
• Q1(s4, a41) = 2 + 0.6 × 0 + 0.4 × 2 = 2.8
• V1(s4) = min(5, 2.8) = 2.8
• agreedy = a41
Value Iteration [Bellman 57]
• No restriction on the initial value function
[Algorithm: repeat Bellman backups for iterations n = 1, 2, ...; terminate when the ε-consistency condition holds, i.e., max_s |V_n(s) − V_{n-1}(s)| ≤ ε]
24
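A sketch of value iteration for the cost-minimization setting, again assuming the illustrative MDP container from earlier; starting from all zeros is only one choice, since the slide notes there is no restriction on the initial value function:

```python
def value_iteration(mdp, eps=1e-4):
    """Bellman backups until epsilon-consistency; returns V and a greedy policy."""
    def q_value(s, a, V):
        return sum(p * (mdp.C[(s, a, s_next)] + V[s_next])
                   for s_next, p in mdp.T[(s, a)].items())

    V = {s: 0.0 for s in mdp.S}           # any initialization is allowed
    while True:
        residual = 0.0
        for s in mdp.S:
            if s in mdp.G:
                continue
            best = min(q_value(s, a, V) for a in mdp.A[s])   # Bellman backup
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual <= eps:
            pi = {s: min(mdp.A[s], key=lambda a: q_value(s, a, V))
                  for s in mdp.S if s not in mdp.G}
            return V, pi
```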
Example
(all actions cost 1 unless otherwise stated)
[Figure: same example MDP as above]
n Vn(s0) Vn(s1) Vn(s2) Vn(s3) Vn(s4)
0 3 3 2 2 1
1 3 3 2 2 2.8
2 3 3 3.8 3.8 2.8
3 4 4.8 3.8 3.8 3.52
4 4.8 4.8 4.52 4.52 3.52
5 5.52 5.52 4.52 4.52 3.808
20 5.99921 5.99921 4.99969 4.99969 3.99969
25
Comments
• Decision-theoretic Algorithm
• Dynamic Programming
• Fixed Point Computation
• Probabilistic version of Bellman-Ford Algorithm
• for shortest path computation
• Cost Minimization MDP : Stochastic Shortest Path Problem
• Time Complexity
• one iteration: O(|S|²|A|)
• number of iterations: poly(|S|, |A|, 1/ε, 1/(1−γ))
• Space Complexity: O(|S|)
26
Thank You
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad
Reinforcement Learning
Q Learning
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Pavlov’s Dog
2
Image from https://www.open.edu/openlearn/ocw/mod/oucontent/view.php?id=83455&section=2.2.1
MDPs
• Environment: Static, Fully Observable, Stochastic
• Percepts: Perfect; Actions: Instantaneous
• The agent decides: what action next?
3
Reinforcement Learning
• S: a set of states
• A: a set of actions
• T(s,a,s'): transition model   (the "Model")
• R(s,a): reward model
• γ: discount factor
• Still looking for a policy π(s)
• V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
7
TD Learning
• V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
• Say I know the correct values of V^π(s1) and V^π(s2)
[Figure: from s, action a0 reaches s1 (R=5, V^π(s1)=5) with Pr 0.6 and s2 (R=2, V^π(s2)=3) with Pr 0.4]
• Using the model: V^π(s) = 0.6(5+5) + 0.4(2+3) = 6 + 2 = 8
• Using five samples: V^π(s) = (10+10+10+5+5)/5 = 8
TD Learning
• V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]
• The inner term is the sample value
• (s, s', r): reached s' from s by executing π(s) and got immediate reward r
• sample = r + γ V^π(s')
• Compute V^π(s) = (1/N) Σ_i sample_i
• Problem: we don't know the true values of V^π(s')
• learn together using dynamic programming!
Skill Needed: Estimating mean via online updates
• average of n+1 samples, rewritten as an online update with learning rate α:
• V^π_{n+1}(s) ← V^π_n(s) + α (sample_{n+1} − V^π_n(s))
• Nudge the old estimate towards the sample
10
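A tiny sketch of this running-average update; with α = 1/(n+1) it reproduces the batch mean of the five samples from the earlier slide:

```python
def running_mean_update(v_old, sample, n):
    """Online mean: with alpha = 1/(n+1), this equals the average of the first n+1 samples."""
    alpha = 1.0 / (n + 1)
    return v_old + alpha * (sample - v_old)

samples = [10, 10, 10, 5, 5]    # the samples used on the earlier TD slide
v = 0.0
for n, x in enumerate(samples):
    v = running_mean_update(v, x, n)
print(v)    # 8.0, same as sum(samples) / len(samples)
```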
TD Learning
• (s,s’,r)
• V^π(s) ← V^π(s) + α (sample − V^π(s))
• V^π(s) ← V^π(s) + α (r + γ V^π(s') − V^π(s)), where r + γ V^π(s') − V^π(s) is the TD error
• V^π(s) ← (1 − α) V^π(s) + α (r + γ V^π(s'))
11
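A minimal TD(0) sketch of this update; the env.reset()/env.step() interface, the constant learning rate, and the episode count are illustrative assumptions:

```python
def td0_policy_evaluation(env, pi, episodes=1000, alpha=0.1, gamma=1.0):
    """V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s')) along trajectories that follow pi."""
    V = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(pi[s])            # execute pi(s), observe (s, s', r)
            sample = r + gamma * V.get(s_next, 0.0)      # sample = r + gamma V(s')
            V[s] = V.get(s, 0.0) + alpha * (sample - V.get(s, 0.0))
            s = s_next
    return V
```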
Pavlov’s Dog
• Evaluate a policy: PE on an approximate MDP (model-based) vs. TD-Learning (model-free)
• Compute V*, Q*, π*: VI/PI on an approximate MDP (model-based) vs. Q-learning (model-free)
TD Learning → TD (V*) Learning?
• Can we do TD-like updates on V*?
• V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
• Problem: the max over actions sits outside the sum over next states, so it cannot be estimated from single sampled transitions; work with Q-values instead
• Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
• Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]
• VI → Q-Value Iteration
• TD Learning → Q Learning
Q Learning
• Directly learn Q*(s,a) values
• Receive a sample (s, a, s’, r)
• Your old estimate Q(s,a)
• New sample value: r + γ max_{a'} Q(s', a')
21
Q Learning Algorithm
• Forall s, a
• Initialize Q(s, a) = 0
• Repeat Forever
Where are you? s.
Choose some action a
Execute it in real world: (s, a, r, s’)
Do update:
Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s', a'))
• Q-learning converges to Q* as long as
• states/actions are finite and all rewards are bounded
• no (s,a) is starved: infinite visits in the limit of infinite samples
• the learning rate decays with visits to state-action pairs
• but not too fast: Σ_i α_i(s,a) = ∞ and Σ_i α_i²(s,a) < ∞
23
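A tabular Q-learning sketch with an ε-greedy behaviour policy (action choice is the topic of the next slides); the environment interface and hyperparameters are illustrative assumptions:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
    """Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))."""
    Q = defaultdict(float)                              # Q[(s, a)], initialized to 0

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s_next, r, done = env.step(a)
            # assumption: a terminal next state contributes no future value
            future = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * future)
            s = s_next
    return Q
```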
Q Learning Algorithm
• Forall s, a
• Initialize Q(s, a) = 0
• Repeat Forever
Where are you? s.
Choose some action a
Execute it in real world: (s, a, r, s’)
Do update:
Q(s,a) ← (1 − α) Q(s,a) + α (r + γ max_{a'} Q(s', a'))
How to choose a? Something new: exploration; the greedy action: exploitation
24
Exploration vs. Exploitation Tradeoff
• A fundamental tradeoff in RL
• Exploration: must take actions that may be suboptimal but help discover
new rewards and in the long run increase utility
• Exploitation: must take actions that are known to be good (and seem
currently optimal) to optimize the overall utility
25
Explore/Exploit Policies
• Simplest scheme: ϵ-greedy
• Every time step flip a coin
• With probability 1-ϵ, take the greedy action
• With probability ϵ, take a random action
• Problem
• Exploration probability is constant
• Solutions
• Lower ϵ over time
• Use an exploration function
26
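A small sketch of ε-greedy with ε lowered over time; the particular decay schedule is an arbitrary illustrative choice:

```python
import random

def epsilon_greedy(Q, s, actions, episode, eps0=1.0, decay=0.001):
    """Explore with probability eps_n = eps0 / (1 + decay * episode), exploit otherwise."""
    eps = eps0 / (1.0 + decay * episode)
    if random.random() < eps:
        return random.choice(actions)                   # exploration
    return max(actions, key=lambda a: Q[(s, a)])        # exploitation (greedy)
```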
Explore/Exploit Policies
• Boltzmann Exploration
• Select action a with probability
Pr(a|s) = exp(Q(s,a)/T) / Σ_{a'∈A} exp(Q(s,a')/T)
• T: Temperature
• Similar to simulated annealing
• Large T: uniform, Small T: greedy
• Start with large T and decrease with time
28
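A sketch of Boltzmann action selection; subtracting the largest Q-value before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import math
import random

def boltzmann_action(Q, s, actions, T):
    """Sample a with Pr(a|s) proportional to exp(Q(s,a)/T); large T ~ uniform, small T ~ greedy."""
    m = max(Q[(s, a)] for a in actions)
    weights = [math.exp((Q[(s, a)] - m) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```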
Explore/Exploit Policies
• A Famous Exploration Policy: UCB
• Upper Confidence Bound
UCT(s) = argmax_a [ Q(s,a) + c √( ln n(s) / n(s,a) ) ]
• Value term: favors actions that looked good historically
• Exploration term: actions get an exploration bonus that grows with ln n(s)
Optimistic in the Face of Uncertainty
29
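A sketch of the UCB-style rule above; counting untried actions as having an infinite bonus is one common convention, assumed here:

```python
import math

def ucb_action(Q, n_s, n_sa, s, actions, c=1.0):
    """argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a))."""
    def score(a):
        if n_sa[(s, a)] == 0:
            return float("inf")                # try every action at least once
        return Q[(s, a)] + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)])
    return max(actions, key=score)
```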
Next Class
• RL + Deep Learning = Deep RL
Thank You
IIT Kharagpur IIT Madras IIT Goa IIT Palakkad
Reinforcement Learning
Deep RL
Mausam
IIT Delhi
National Supercomputing Mission
Centre for Development of Advanced Computing
Q Learning Algorithm
• Forall s, a
• Initialize Q(s, a) = 0
• Repeat Forever
Where are you? s.
Choose some explore-exploit action a
Execute it in real world: (s, a, r, s’)
Do update:
Q(s,a) ← Q(s,a) + α (r + γ max_{a'} Q(s', a') − Q(s,a))
2
Model based RL vs. Model Free RL
• Model based RL
• estimate O(|S|²|A|) parameters
• requires relatively larger data for learning
• can make use of background knowledge easily
• Model free RL
• estimate O(|S||A|) parameters
• requires relatively less data for learning
Can we Enumerate State Space?
• Basic Q-Learning (or VI) keeps a table of all q-values
4
Chess
• branching factor b≈35
5
Game of Go
• Size of board: Chess 8x8 vs. Go 19x19
• Average no. of moves per game: Chess 100 vs. Go 300
• Average branching factor per turn: Chess 35 vs. Go 235
• Additional complexity in Go: players can pass
6
Generalize Across States
• Basic Q-Learning (or VI) keeps a table of all q-values
7
Deep Q Learning
Function Approximation
• Lookup Table (e.g., Q(s,a) table)
• does not scale – humongous for large problems
• (also known as curse of dimensionality)
• is not feasible if state space is continuous
• The Key Idea of Function Approximation
• approximate Q(s,a) as a parametric function
• automatically learn the parameters (w)
Q(s,a) ≈ Q(s,a; w)
• The Key Idea of Deep Q Learning
• Train a deep network to represent the Q function
• w are the parameters of the deep network
9
Deep Q Learning
13
Online Deep Q Learning Algorithm
• Estimate Q*(s,a) values using deep network(w)
• Receive a sample (s, a, s’, r)
• Compute target: y = r + γ max_{a'} Q(s', a'; w)
• Update w to minimize: L(w) = (y − Q(s, a; w))²
14
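A minimal PyTorch sketch of this online update; the network architecture, state dimension (4), and action count (2) are placeholders, not the lecture's choices:

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, .; w)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def online_q_update(s, a, r, s_next, done):
    """One step: minimize (y - Q(s,a;w))^2 with y = r + gamma max_a' Q(s',a';w)."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():                                 # the target is treated as a constant
        y = r + (0.0 if done else gamma * q_net(s_next).max().item())
    loss = (y - q_net(s)[a]) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```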
Challenges in Training Deep Q Learning
• Target values are moving
• Successive states are not i.i.d., they are correlated
• Successive states depend on w (policy)
• Small changes in w might lead to large changes in policy
• Solutions
• Freeze target Q network weights; update sporadically
• Experience Replay
17
Experience Replay
• Step I: Compute experience buffer
18
Experience Replay
• Step II: Weight updates using replay buffer
• Repeat K steps
• Randomly sample a mini-batch B of (s,a,r,s’) experiences from
replay memory buffer
• Perform an update to minimize the loss on this minibatch:
L(w) = Σ_{(s,a,r,s')∈B} ( r + γ max_{a'} Q(s', a'; w⁻) − Q(s, a; w) )²
(the parameters w⁻ of the target network are frozen)
• Once in a while: w⁻ ← w
• Go to Step I (recompute replay buffer)
20
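A sketch of the minibatch update with a frozen target network, reusing the illustrative q_net, optimizer, and gamma from the earlier sketch:

```python
import copy
import random
import torch

replay_buffer = []                      # filled in Step I with (s, a, r, s_next, done) tuples
target_net = copy.deepcopy(q_net)       # frozen copy holding the parameters w-

def replay_update(batch_size=32):
    """Minimize sum over the minibatch of (r + gamma max_a' Q(s',a'; w-) - Q(s,a; w))^2."""
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(list, zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    not_done = 1.0 - torch.as_tensor(done, dtype=torch.float32)
    with torch.no_grad():                                       # target network is frozen
        y = r + gamma * not_done * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)           # Q(s, a; w)
    loss = ((y - q) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# once in a while: target_net.load_state_dict(q_net.state_dict())   # w- <- w
```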
Atari Games
21
Deep Q Learning for Atari
22
End-to-End Deep Q Learning for Atari
• Input: last 4 frames
• Output: Q(s,a) for 18 joystick actions
• Architecture: CNN (same config for all games)
23
Image taken from Mnih et al, Nature (2015)
Deep Q Learning for Atari: Results
● Atari-Breakout trained agent:
https://www.youtube.com/watch?v=V1eYniJ0Rnk
● Uses Variation of DQN
24
(Deep) Policy Gradient
Policy gradient methods
• Learning the policy directly can be much simpler than learning Q values
π_θ(a|s) = exp(f(s, a; θ)) / Σ_{a'} exp(f(s, a'; θ))
• Update: θ ← θ + α ∇_θ V^π(θ)
REINFORCE
V^π(s; θ) = Σ_a π(a|s; θ) G(s, a)
∇_θ V^π(s) = ∇_θ Σ_a π(a|s) G(s, a)
∇_θ V^π(s) = Σ_a G(s, a) ∇_θ π(a|s)
∇_θ V^π(s) = Σ_a π(a|s) G(s, a) ∇_θ π(a|s) / π(a|s) = Σ_a π(a|s) G(s, a) ∇_θ log π(a|s)
• Sample-based estimate from an executed trajectory: use the observed return G(s_t, a_t) with ∇_θ log π(a_t|s_t)
28
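A sketch of the REINFORCE update, implemented as a loss whose gradient is the negative of Σ_t G(s_t, a_t) ∇_θ log π_θ(a_t|s_t); the softmax policy network and its sizes are placeholders:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # f(s, .; theta)
pg_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    """theta <- theta + alpha * sum_t G(s_t, a_t) grad log pi_theta(a_t | s_t)."""
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    log_pi = torch.log_softmax(policy(states), dim=1)            # log pi_theta(.|s)
    chosen = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)   # log pi_theta(a_t|s_t)
    loss = -(returns * chosen).sum()     # gradient ascent via minimizing the negative
    pg_optimizer.zero_grad()
    loss.backward()
    pg_optimizer.step()
```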
Policy Gradient vs Q Learning
• In many domains policy is easier to approximate than value function
30
Image taken from Silver et al, Nature (2016)
AlphaGo Zero
[Slide: RL for program synthesis / semantic parsing*]
π_θ(a | c) = Π_t π_θ(a_t | c, a_{1:t−1})
R(a | c, E^s): reward computed from Recall(E^s, E^a) and Precision(E^s, E^a)
1. REINFORCE
• Inefficient search space exploration
*Memory augmented policy optimization for program synthesis and semantic parsing, Liang et al, NIPS 2018
A Plug on NPTEL AI Course
https://onlinecourses.nptel.ac.in/noc21_cs42/preview
A Plug on School of AI
https://scai.iitd.ac.in
Thank You