Dynamic_Programming_RL_Answers_Final
1. Introduction to Dynamic Programming
- Dynamic Programming (DP) is a group of algorithms used in reinforcement learning to find the optimal policy and value functions when we have a perfect model of the environment.
- The main idea is to break down a complex problem into smaller subproblems and solve them
recursively.
- DP provides exact solutions but is computationally expensive, so it is rarely applied directly to large real-world problems.
- Other RL methods, such as Monte Carlo and Temporal Difference learning, can be viewed as approximations of DP that do not require a full model of the environment.
2. Policy Evaluation
- Policy Evaluation is used to calculate how good a policy \pi is by estimating its value function v_\pi(s).
- It uses the Bellman expectation equation: v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v_\pi(s')].
- Instead of solving the equation directly, we use iterative updates starting from an initial guess.
- This method is called Iterative Policy Evaluation and continues until the value function converges.
- The updates are based on expected values, not samples, and are done through multiple sweeps of the state space, as shown in the sketch below.
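Below is a minimal sketch of Iterative Policy Evaluation. The tabular model P[s][a] as a list of (probability, next_state, reward, done) tuples, the toy two-state MDP, and the name policy_evaluation are all illustrative assumptions, not part of the notes above.

```python
def policy_evaluation(P, n_states, n_actions, policy, gamma=0.9, theta=1e-8):
    """Estimate v_pi by sweeping all states until updates fall below theta."""
    V = [0.0] * n_states                      # initial guess v(s) = 0
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    # Bellman expectation backup: expectation over actions
                    # (weighted by pi) and over the environment dynamics.
                    v_next = 0.0 if done else V[s_next]
                    v_new += policy[s][a] * prob * (reward + gamma * v_next)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:                     # value function has converged
            return V


if __name__ == "__main__":
    # Hypothetical 2-state, 2-action MDP: action 1 in state 0 reaches a
    # terminal state with reward +1; everything else gives reward 0.
    P = {
        0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
        1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
    }
    uniform_policy = [[0.5, 0.5], [0.5, 0.5]]
    print(policy_evaluation(P, n_states=2, n_actions=2, policy=uniform_policy))
```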
3. Policy Improvement
- Policy Improvement improves a given policy by checking if different actions provide higher value.
- We compute the action-value function: q_\pi(s, a) = \sum_{s',r} p(s',r|s,a) [r + \gamma v_\pi(s')].
- If a different action gives a higher value, we update the policy at that state.
- By the Policy Improvement Theorem, if q_\pi(s, \pi'(s)) \ge v_\pi(s) for every state s, then the new policy \pi' is at least as good as \pi, i.e. v_{\pi'}(s) \ge v_\pi(s) for all states s.
- Acting greedily with respect to v_\pi therefore yields a policy that is at least as good as \pi, and strictly better unless \pi is already optimal (see the sketch below).
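A hedged sketch of the greedy improvement step, under the same assumed P[s][a] transition-model layout used in the Policy Evaluation sketch above; the policy_improvement name and the deterministic tie-breaking are illustrative choices.

```python
def policy_improvement(P, n_states, n_actions, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    policy = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        # q(s, a) = sum over s', r of p(s', r | s, a) [r + gamma * V(s')]
        q = [0.0] * n_actions
        for a in range(n_actions):
            for prob, s_next, reward, done in P[s][a]:
                v_next = 0.0 if done else V[s_next]
                q[a] += prob * (reward + gamma * v_next)
        best_a = max(range(n_actions), key=lambda a: q[a])
        policy[s][best_a] = 1.0               # act greedily in this state
    return policy
```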
4. Policy Iteration
- Policy Iteration finds the optimal policy by repeating Policy Evaluation and Policy Improvement.
- It starts with an arbitrary policy, evaluates it using Iterative Policy Evaluation, and then improves it by acting greedily with respect to the resulting value function.
- These two steps are repeated until the policy no longer changes, at which point it is optimal (see the sketch below).
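A sketch of the Policy Iteration loop, reusing the illustrative policy_evaluation and policy_improvement helpers sketched earlier; it assumes both are defined in the same module.

```python
def policy_iteration(P, n_states, n_actions, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    # Start from an arbitrary policy; uniform random is a common choice.
    policy = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    while True:
        V = policy_evaluation(P, n_states, n_actions, policy, gamma)
        new_policy = policy_improvement(P, n_states, n_actions, V, gamma)
        if new_policy == policy:          # no state changed its action: stop
            return policy, V
        policy = new_policy
```

Because each improvement step can only make the policy better and a finite MDP has finitely many deterministic policies, this loop terminates.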
5. Value Iteration
- Value Iteration simplifies Policy Iteration by combining evaluation and improvement into one step.
- It updates the values using the Bellman optimality equation: v(s) = \max_a \sum_{s',r} p(s',r|s,a) [r + \gamma v(s')].
- Once the values stabilize, the optimal policy is formed by choosing, in each state, the action with the highest expected return under the converged values (see the sketch below).
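A sketch of Value Iteration under the same assumed P[s][a] convention; reading off the final policy reuses the illustrative policy_improvement helper from the Policy Improvement section.

```python
def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Combine evaluation and improvement into a single max-backup per state."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: take the max over actions instead of
            # averaging them under the current policy.
            best = max(
                sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                    for prob, s_next, reward, done in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the optimal policy by acting greedily on the converged values.
    return policy_improvement(P, n_states, n_actions, V, gamma), V
```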
6. Asynchronous DP
- Asynchronous DP updates the values of states in any order instead of all at once.
- In regular DP, we perform full sweeps of all states, but in Asynchronous DP, we update one or a few states at a time (see the sketch below).
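A sketch of one asynchronous variant: value-iteration backups applied to a single randomly chosen state per step. The random selection scheme, the fixed update budget, and the function name are illustrative assumptions; any ordering works as long as every state keeps being updated.

```python
import random

def async_value_iteration(P, n_states, n_actions, gamma=0.9, n_updates=10_000):
    """Back up one randomly chosen state per step instead of sweeping them all."""
    V = [0.0] * n_states
    for _ in range(n_updates):
        # Any update order is allowed, as long as no state is starved forever.
        s = random.randrange(n_states)
        V[s] = max(
            sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                for prob, s_next, reward, done in P[s][a])
            for a in range(n_actions)
        )
    return V
```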
7. Generalized Policy Iteration (GPI)
- GPI is the general idea of combining Policy Evaluation and Policy Improvement, regardless of how often or in what order each is performed.
- This process continues until both the policy and the value function converge, at which point the policy is greedy with respect to its own value function and therefore optimal.