Solution 3
Reinforcement Learning
Prof. B. Ravindran
1. In solving a multi-armed bandit problem using the policy gradient method, are we assured of
converging to the optimal solution?
(a) no
(b) yes
Sol. (a)
Depending on the properties of the function whose gradient is being ascended, the policy
gradient approach may converge only to a local optimum, so convergence to the optimal
solution is not guaranteed.
2. In many supervised machine learning algorithms, such as neural networks, we rely on the
gradient descent technique. However, in the policy gradient approach to bandit problems, we
made use of gradient ascent. This discrepancy can mainly be attributed to the differences in
Sol. (c)
The feedback in most supervised learning algorithms is an error signal which we wish to
minimise. Hence, we would look to perform gradient descent. In policy gradient, we are trying
to maximise the reward signal, hence gradient ascent.
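As an illustration of this sign difference, here is a minimal sketch; the parameter value, gradients, and learning rate below are illustrative assumptions, not part of the original question.

```python
# Hypothetical parameter, gradients, and learning rate, for illustration only.
def gradient_descent_step(theta, grad_loss, lr=0.1):
    # supervised learning: step against the gradient to reduce an error/loss signal
    return theta - lr * grad_loss

def gradient_ascent_step(theta, grad_reward, lr=0.1):
    # policy gradient: step along the gradient to increase the expected reward
    return theta + lr * grad_reward

print(gradient_descent_step(1.0, 0.5))  # 0.95
print(gradient_ascent_step(1.0, 0.5))   # 1.05
```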
3. Consider a bandit problem in which the parameters on which the policy depends are the
preferences of the actions and the action selection probabilities are determined by the softmax
relationship π(ai) = e^(θi) / Σ_{j=1}^{k} e^(θj), where k is the total number of actions and θi is the
preference value of action ai. Derive the parameter update conditions according to the REINFORCE
procedure considering the above described parameters and where the baseline is the reference
reward defined as the average of the rewards received for all arms.
Sol. (a)
According to the REINFORCE method, at each step we update the parameters as

∆θn = α(Rn − bn) ∂ ln π(an; θ)/∂θn

Considering the baseline b as the average of all rewards seen so far, and α as the step size
parameter, the update for the preference of the selected action ai is

∆θi = α(R − b)(1 − π(ai; θ))

and for every other action aj, j ≠ i, the corresponding update is ∆θj = −α(R − b) π(aj; θ).
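To make the derived update concrete, the following is a minimal sketch of a REINFORCE learner for a k-armed bandit with softmax preferences and an average-reward baseline; the bandit itself (true_means), the step size alpha, and the number of steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
true_means = rng.normal(0.0, 1.0, k)   # hypothetical true reward means of the k arms
theta = np.zeros(k)                    # action preferences
alpha = 0.1                            # step size (assumed)
baseline, n_rewards = 0.0, 0           # running average of all rewards seen so far

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(theta)
    a = rng.choice(k, p=pi)            # sample an arm from the softmax policy
    r = rng.normal(true_means[a], 1.0)

    # d ln pi(a_i)/d theta_i = 1 - pi(a_i);  d ln pi(a_i)/d theta_j = -pi(a_j) for j != i
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi

    # update the reference reward (average of all rewards received)
    n_rewards += 1
    baseline += (r - baseline) / n_rewards

print("most preferred arm:", int(np.argmax(theta)), "| best arm:", int(np.argmax(true_means)))
```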
4. Repeat the above problem for the case where the parameters are the mean and variance of the
normal distribution according to which the actions are selected and the baseline is zero.

(a) ∆µn = αn rn (an − µn)/σn ;  ∆σn = αn rn {(an − µn)²/σn − 1}

(b) ∆µn = αn rn (an − µn)/σn² ;  ∆σn = αn rn [(an − µn)² − σn²]/σn²
Sol. (c)
Assuming the parameters to be µ and σ, we have the policy

π(a; µ, σ) = (1/(√(2π) σ)) e^(−(a−µ)²/(2σ²))

For the mean:

∂ ln π(an; µn, σn)/∂µn = ∂/∂µn [−(an − µn)²/(2σn²)] = (an − µn)/σn²

For the variance:

∂ ln π(an; µn, σn)/∂σn = ∂/∂σn [−ln(√(2π) σn)] + ∂/∂σn [−(an − µn)²/(2σn²)]

Solving, we have:

∂ ln π(an; µn, σn)/∂σn = −1/σn + (an − µn)²/σn³ = (1/σn){((an − µn)/σn)² − 1}

With a zero baseline, the REINFORCE updates are therefore
∆µn = αn rn (an − µn)/σn² and ∆σn = αn rn (1/σn){((an − µn)/σn)² − 1}.
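The following sketch applies the gradients derived above in a REINFORCE loop for a Gaussian policy with zero baseline; the reward function, step size, iteration count, and the lower bound on σ are all illustrative assumptions (the bound simply keeps σ from collapsing to zero).

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # hypothetical continuous-armed bandit: reward peaks at a = 2.0
    return float(np.exp(-(a - 2.0) ** 2))

mu, sigma = 0.0, 1.0        # initial policy parameters (assumed)
alpha, sigma_min = 0.01, 0.2

for t in range(20000):
    a = rng.normal(mu, sigma)
    r = reward(a)

    # gradients of ln pi(a; mu, sigma) derived above
    grad_mu = (a - mu) / sigma ** 2
    grad_sigma = ((a - mu) ** 2 - sigma ** 2) / sigma ** 3

    # REINFORCE with zero baseline: Delta(parameter) = alpha * r * gradient
    mu += alpha * r * grad_mu
    sigma = max(sigma_min, sigma + alpha * r * grad_sigma)

print("learned mu:", round(mu, 2), "learned sigma:", round(sigma, 2))
```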
5. Which among the following is/are differences between contextual bandits and full RL problems?
(a) the actions and states in contextual bandits share features, but not in full RL problems
(b) the actions in contextual bandits do not determine the next state, but typically do in full
RL problems
(c) full RL problems can be modelled as MDPs whereas contextual bandit problems cannot
(d) no difference
Sol. (b)
Option (a) is not true because we can think of representations where actions and states share
features. Similarly, option (c) is false because contextual bandit problems can also be modelled
as MDPs.
6. Given a stationary policy, is it possible that if the agent is in the same state at two different
time steps, it can choose two different actions?
(a) no
(b) yes
Sol. (b)
A stationary policy does not mean that the policy is deterministic. For example, for state
s1 and actions a1 and a2 , a stationary policy π may have π(a1 |s1 ) = 0.4 and π(a2 |s1 ) = 0.6.
Thus, at different time steps, if the agent is in the same state s1 , it may perform either of the
two actions.
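A tiny sketch of this point, with the hypothetical stationary policy π(a1|s1) = 0.4, π(a2|s1) = 0.6 held fixed throughout:

```python
import numpy as np

rng = np.random.default_rng(0)
# the stationary policy never changes, yet sampled actions can differ across visits to s1
policy = {"s1": (["a1", "a2"], [0.4, 0.6])}   # pi(a1|s1) = 0.4, pi(a2|s1) = 0.6

actions, probs = policy["s1"]
print([str(rng.choice(actions, p=probs)) for _ in range(5)])  # a mix of 'a1' and 'a2'
```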
7. In class we saw that it is possible to learn via a sequence of stationary policies, i.e., during an
episode, the policy does not change, but we move to a different stationary policy before the
next episode begins. Does the temporal difference method encountered when discussing the
tic-tac-toe example follow this pattern of learning?
(a) no
(b) yes
Sol. (a)
Recall that within an episode, at each step, after observing a reward, we modify the value
function (the probability values in the tic-tac-toe example). Since the policy used to select
actions is derived from the value function, in effect we are modifying the policy at each step.
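For reference, a schematic sketch of the kind of within-episode value update used in the tic-tac-toe example; the state names, value estimates, and step size here are hypothetical. Because move selection is greedy with respect to these values, updating them after every move amounts to changing the policy before the episode ends.

```python
# values: estimated probability of eventually winning from each board state
def td_backup(values, prev_state, next_state, alpha=0.1):
    # nudge the previous state's value toward the value of the state that followed
    values[prev_state] += alpha * (values[next_state] - values[prev_state])

V = {"s": 0.5, "s_next": 0.9}   # hypothetical states and estimates
td_backup(V, "s", "s_next")
print(V["s"])                   # 0.54: the estimate changes mid-episode
```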
8. We saw the following definition of the action-value function for policy π: qπ (s, a) = Eπ [Gt |St =
s, At = a]. Suppose that the action selected according to the policy π in state s is a1 . For the
same state, will the function be defined for actions other than a1 ?
(a) no
(b) yes
Sol. (b)
The interpretation of the action-value function qπ (s, a) = Eπ [Gt |St = s, At = a] is the
expectation of the return given that we take action a in state s at time t and thereafter select actions
based on policy π. Thus, the only constraint for the actions is that they must be applicable
in state s.
9. Consider a 100x100 grid world domain where the agent starts each episode in the bottom-left
corner, and the goal is to reach the top-right corner in the least number of steps. To learn an
optimal policy to solve this problem you decide on a reward formulation in which the agent
receives a reward of +1 on reaching the goal state and 0 for all other transitions. Suppose
you try two variants of this reward formulation, (P1 ), where you use discounted returns with
γ ∈ (0, 1), and (P2 ), where no discounting is used. Which among the following would you
expect to observe?
Sol. (c)
In (P2 ), since there is no discounting, the return for each episode regardless of the number of
steps is +1. This prevents the agent from learning a policy which tries to minimise the number
of steps to reach the goal state. In (P1 ), the discount factor ensures that the longer the agent
takes to reach the goal state, the smaller the return it receives. This motivates the agent to
take the shortest path to the goal state.
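A small numeric check of this argument; the value of γ and the path lengths are illustrative assumptions.

```python
gamma = 0.99                          # assumed discount factor for (P1)
for steps in (198, 5000):             # shortest path (198 moves) vs. a long wandering path
    discounted = gamma ** (steps - 1)  # the single +1 reward arrives on the final transition
    print(steps, "steps | discounted return:", round(discounted, 3), "| undiscounted return: 1")
```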
10. Given an MDP with finite state set, S, and an arbitrary function, f , that maps S to the real
numbers, the MDP has a policy π such that f = V π .
(a) false
(b) true
Sol. (a)
Consider an MDP where all rewards are 0. Then V π (s) = 0 for every state s and every policy
π, so any function f that is non-zero for some state cannot equal V π for any policy.