Dynamic Programming in Reinforcement Learning

1. What is Dynamic Programming in Reinforcement Learning?

- Dynamic Programming (DP) is a collection of algorithms used in reinforcement learning to compute the optimal policy and value functions when a perfect model of the environment is available.

- The environment is modeled as a Markov Decision Process (MDP).

- DP uses Bellman equations to compute the value of states or state-action pairs.

- The main idea is to break a complex problem down into smaller subproblems and solve them recursively.

- DP provides exact solutions, but it is computationally expensive, so it is rarely applied directly to large real-world problems.

- Other RL methods, such as Monte Carlo and Temporal Difference learning, can be viewed as approximations of DP that do not require a full model; a toy model representation is sketched after this list.
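As a concrete, made-up illustration of what a "perfect model" means here, the sketch below encodes a tiny two-state MDP as a table of transition probabilities and rewards. The state names, actions, rewards, and discount factor are assumptions chosen purely for illustration, not part of the original notes.

```python
# A toy MDP with two states and two actions, represented as a nested dict:
# model[state][action] is a list of (probability, next_state, reward) tuples.
# All names and numbers are illustrative assumptions.
GAMMA = 0.9  # discount factor

model = {
    "A": {
        "stay": [(1.0, "A", 0.0)],
        "move": [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    },
    "B": {
        "stay": [(1.0, "B", 2.0)],
        "move": [(1.0, "A", 0.0)],
    },
}

states = list(model)          # ["A", "B"]
actions = ["stay", "move"]
```

The later sketches in these notes reuse this same model[s][a] -> list of (probability, next_state, reward) convention.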

2. Policy Evaluation (Prediction)

- Policy Evaluation is used to calculate how good a policy pi is by estimating the value function v_pi(s) for all states s.

- It uses the Bellman expectation equation: v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r | s,a) [r + gamma * v_pi(s')].

- Instead of solving the equation directly, we use iterative updates starting from an initial guess.

- This method is called Iterative Policy Evaluation and continues until the value function converges.

- The updates are based on expected values rather than samples, and are performed through repeated sweeps of the state space; a minimal evaluation loop is sketched below.
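The following sketch shows Iterative Policy Evaluation on the made-up two-state model from earlier, assuming the model[s][a] -> (probability, next_state, reward) representation. The convergence threshold THETA and the uniform random policy are illustrative choices, not part of the original notes.

```python
# Iterative Policy Evaluation on a toy two-state MDP (illustrative model).
GAMMA = 0.9
THETA = 1e-6  # stop when a full sweep changes no value by more than this

model = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}
# A uniformly random policy: policy[s][a] = probability of taking a in s.
policy = {s: {"stay": 0.5, "move": 0.5} for s in model}

def evaluate_policy(model, policy, gamma=GAMMA, theta=THETA):
    v = {s: 0.0 for s in model}          # initial guess v(s) = 0
    while True:
        delta = 0.0
        for s in model:                  # one full sweep of the state space
            new_v = sum(
                policy[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                for a, outcomes in model[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v                 # in-place update
        if delta < theta:                # converged: the sweep barely changed v
            return v

print(evaluate_policy(model, policy))
```

Note that the update is done in place, so later states in a sweep already see the new values of earlier states; this common variant still converges.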

3. Policy Improvement

- Policy Improvement improves a given policy by checking if different actions provide higher value.

- We compute the action-value function: q_pi(s, a) = sum_{s',r} p(s',r | s,a) [r + gamma * v_pi(s')].

- If a different action gives a higher value, we update the policy at that state.

- The new policy pi' is at least as good as pi if v_pi'(s) >= v_pi(s) for all states s. This is the Policy Improvement Theorem.

- Acting greedily with respect to the current value function never makes the policy worse and usually improves it (see the greedy-improvement sketch below).
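A minimal sketch of greedy policy improvement, assuming the same toy model representation as before. The value function v_pi used here is a placeholder standing in for the output of an earlier evaluation step.

```python
# Greedy policy improvement given a value function v_pi (illustrative toy model).
GAMMA = 0.9

model = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}

def q_value(model, v, s, a, gamma=GAMMA):
    """q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in model[s][a])

def improve_policy(model, v, gamma=GAMMA):
    # For each state, pick the action with the highest action value.
    return {s: max(model[s], key=lambda a: q_value(model, v, s, a, gamma))
            for s in model}

# Example: improving on a value function produced by an earlier evaluation step
# (the numbers below are placeholders for illustration).
v_pi = {"A": 5.0, "B": 12.0}
print(improve_policy(model, v_pi))       # -> {'A': 'move', 'B': 'stay'}
```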

4. Policy Iteration

- Policy Iteration finds the optimal policy by repeating Policy Evaluation and Policy Improvement.

- It starts with any policy and evaluates it using Iterative Policy Evaluation.

- Then it improves the policy using the Policy Improvement step.

- These steps are repeated until the policy no longer changes.

- The final policy and value function are both optimal; the sketch below shows the full loop.
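Putting the two pieces together, here is a sketch of the Policy Iteration loop on the same made-up model. The arbitrary starting policy and the deterministic-policy evaluation routine are simplifying assumptions for illustration.

```python
# Policy Iteration: alternate evaluation and greedy improvement until the
# policy stops changing (toy two-state model, illustrative only).
GAMMA = 0.9
THETA = 1e-6

model = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}

def evaluate(model, policy, gamma=GAMMA, theta=THETA):
    # Iterative policy evaluation for a deterministic policy {state: action}.
    v = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:
            new_v = sum(p * (r + gamma * v[s2]) for p, s2, r in model[s][policy[s]])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_iteration(model, gamma=GAMMA):
    policy = {s: next(iter(model[s])) for s in model}   # arbitrary starting policy
    while True:
        v = evaluate(model, policy, gamma)               # policy evaluation
        new_policy = {                                   # greedy improvement
            s: max(model[s],
                   key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in model[s][a]))
            for s in model
        }
        if new_policy == policy:      # policy stable -> policy and v are optimal
            return policy, v
        policy = new_policy

print(policy_iteration(model))
```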

5. Value Iteration

- Value Iteration simplifies Policy Iteration by combining evaluation and improvement into one step.

- It updates the values using the Bellman Optimality Equation: v(s) = max_a sum_{s',r} p(s',r | s,a) [r + gamma * v(s')].

- Values are updated directly and repeatedly until they converge.

- Once the values stabilize, the optimal policy is formed by choosing the action with the highest expected value in each state.

- It is often faster than Policy Iteration because it avoids running a full policy evaluation to convergence at each step (see the sketch below).
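A sketch of Value Iteration on the same toy model; note that each backup takes a max over actions instead of averaging under a fixed policy, which is what folds the improvement step into the update.

```python
# Value Iteration: repeatedly apply the Bellman optimality backup, then read
# off a greedy policy (toy model, illustrative only).
GAMMA = 0.9
THETA = 1e-6

model = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}

def value_iteration(model, gamma=GAMMA, theta=THETA):
    v = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s in model:
            # v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma * v(s')]
            best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
                       for outcomes in model[s].values())
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = {s: max(model[s],
                     key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in model[s][a]))
              for s in model}
    return v, policy

print(value_iteration(model))
```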

6. Asynchronous Dynamic Programming

- Asynchronous DP updates the value of states in any order instead of all at once.

- In regular DP, we perform full sweeps of all states, but in Asynchronous DP, we update one or a few states at a time.

- This is useful in large problems where full sweeps are expensive.

- It still converges to the correct values as long as every state continues to be updated often enough.

- It is more practical and flexible for real-world applications; a small in-place update sketch follows.
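A sketch of an asynchronous variant of value iteration, again on the made-up toy model: states are backed up one at a time in random order rather than in full sweeps. The fixed backup budget is an illustrative simplification; in practice one would use a convergence check or prioritize which states to update.

```python
# Asynchronous DP sketch: instead of full sweeps, back up one randomly chosen
# state at a time, in place (toy model, illustrative only).
import random

GAMMA = 0.9

model = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(0.8, "B", 1.0), (0.2, "A", 0.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}

def async_value_iteration(model, num_backups=10_000, gamma=GAMMA):
    v = {s: 0.0 for s in model}
    states = list(model)
    for _ in range(num_backups):
        s = random.choice(states)        # any order works, as long as every
        v[s] = max(                      # state keeps getting selected
            sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
            for outcomes in model[s].values()
        )
    return v

print(async_value_iteration(model))
```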

7. Generalized Policy Iteration (GPI)

- GPI is the general idea of combining Policy Evaluation and Policy Improvement.

- Evaluation and improvement are done together, not in fixed steps.


- As value functions improve, policies improve, and vice versa.

- This process continues until both the policy and value function converge.

- GPI is the foundation for many advanced reinforcement learning algorithms.
