
Practice Assignment 4

Reinforcement Learning
Prof. B. Ravindran
1. Select the correct Bellman optimality equation:
(a) v∗(s) = max_a ∑_{s′} p(s′ | s, a) [ E[r | s, a, s′] + γ v∗(s′) ]
(b) v∗(s) = max_a ∑_{s′} p(s′ | s, a) v∗(s′)
(c) v∗(s) = max_a ∑_{s′} p(s′ | s, a) [ γ E[r | s, a, s′] + v∗(s′) ]
(d) v∗(s) = max_a ∑_{s′} p(s′ | s, a) γ [ E[r | s, a, s′] + v∗(s′) ]

Sol. (a)
Refer to the lecture video on the Bellman optimality equation.
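As an illustration only (not part of the assignment), here is a minimal value-iteration sketch that repeatedly applies the Bellman optimality backup from option (a) to a small made-up MDP; the arrays P and R and all sizes below are assumptions chosen purely to show the update.

import numpy as np

# Made-up MDP (all sizes and values are assumptions for illustration).
# P[s, a, s2] = p(s'|s, a), R[s, a, s2] = E[r | s, a, s'].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # normalise into valid distributions
R = rng.random((n_states, n_actions, n_states))

v = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: v(s) = max_a sum_{s'} p(s'|s,a)[E[r|s,a,s'] + gamma*v(s')]
    q = (P * (R + gamma * v)).sum(axis=2)  # shape: (n_states, n_actions)
    v_new = q.max(axis=1)
    if np.abs(v_new - v).max() < 1e-8:     # stop once a sweep barely changes v
        break
    v = v_new
print(v)                                   # approximate v*(s) for each state

Each sweep replaces v(s) by the right-hand side of option (a); the iteration converges because the backup is a γ-contraction.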
2. State True/False
In MDPs, there is a unique resultant state for any given state-action pair.
(a) True
(b) False
Sol. (b)
The statement holds only for deterministic MDPs. In a general MDP, a given state-action pair can lead to several resultant states, each with its own transition probability.
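To see the stochasticity concretely, the small sketch below (the distribution is an assumption, not from the assignment) samples the resultant state for a single state-action pair; repeated samples land in different states according to p(s′ | s, a).

import numpy as np

# Assumed transition distribution p(.|s, a) for one state-action pair:
# three possible resultant states with different probabilities.
next_state_probs = np.array([0.7, 0.2, 0.1])

rng = np.random.default_rng(42)
samples = rng.choice(len(next_state_probs), size=10, p=next_state_probs)
print(samples)   # a mix of states 0, 1 and 2, not one fixed resultant state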
3. State True/False
The state transition graph for any MDP is a directed acyclic graph.
(a) True
(b) False
Sol. (b)
The statement is false. An MDP can transition back to the same state (a self-loop), and its transition graph can contain longer cycles as well.
4. Consider the following statements:
(i) The optimal policy of an MDP is unique.
(ii) We can determine an optimal policy for an MDP using only the optimal value function (v∗),
without accessing the MDP parameters.
(iii) We can determine an optimal policy for a given MDP using only the optimal q-value
function (q∗), without accessing the MDP parameters.
Which of these statements are true?
(a) Only (ii)
(b) Only (iii)
(c) Only (i), (ii)
(d) Only (i), (iii)
(e) Only (ii), (iii)

Sol. (b)
An optimal policy can be recovered from the optimal q-value function by acting greedily, π∗(s) = argmax_a q∗(s, a), which needs no MDP parameters. Recovering a policy from v∗ alone requires a one-step lookahead through the transition probabilities and rewards, and the optimal policy of an MDP need not be unique, so (i) and (ii) are false.
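A minimal sketch of this recovery, assuming a tabular q∗ stored as a small hypothetical array: acting greedily with respect to q∗ uses no transition probabilities or rewards.

import numpy as np

# Hypothetical optimal q-value table: q_star[s, a] = q*(s, a).
q_star = np.array([[1.0, 2.0],
                   [0.5, 0.3],
                   [4.0, 4.5]])

# pi*(s) = argmax_a q*(s, a): no MDP parameters needed.
pi_star = q_star.argmax(axis=1)
print(pi_star)   # -> [1 0 1]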
5. Which of the following is a benefit of using RL algorithms for solving MDPs?
(a) They do not require the state of the agent for solving an MDP.
(b) They do not require the action taken by the agent for solving an MDP.
(c) They do not require the state transition probability matrix for solving an MDP.
(d) They do not require the reward signal for solving an MDP.
Sol. (c)
RL algorithms need to know the state the agent is in, the action it takes, and the reward
signal received from the environment in order to solve the MDP. However, they do not need to
know the state transition probability matrix.
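As an illustration of this model-free property, here is a minimal tabular Q-learning update sketch; the function name and toy sizes are assumptions, not part of the assignment. Each update uses only an observed sample (s, a, r, s′) and never references p(s′ | s, a).

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # One tabular Q-learning step: uses only the observed sample (s, a, r, s'),
    # never the state transition probability matrix.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy 4-state, 2-action table (sizes are assumptions for illustration).
Q = np.zeros((4, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)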
