Reinforcement Learning Note
Unit - 2
Reinforcement Learning
Interaction protocol
In a given MDP M = (S, A, P, R, γ), the agent interacts with the environment
according to the following protocol: the agent starts at some state s1; at each
time step t = 1, 2, . . . , H, the agent takes an action at ∈ A, obtains the
immediate reward rt = R(st, at), and observes the next state st+1 sampled from
the transition model, i.e., st+1 ∼ P(· | st, at). The interaction record
τ = (s1, a1, r1, s2, a2, r2, . . . , sH, aH, rH, sH+1)
is called a trajectory of length H.
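As a concrete picture of this protocol, here is a minimal sketch of the interaction loop that records a length-H trajectory. The environment interface (env.reset() returning the start state, env.step(a) returning the reward and next state) and the callable policy are assumptions made for illustration, not part of the note:

```python
def collect_trajectory(env, policy, H):
    """Roll out one episode of length H and return the trajectory
    tau = (s1, a1, r1, s2, ..., sH, aH, rH, sH+1)."""
    s = env.reset()              # start at some state s1 (assumed interface)
    trajectory = []
    for t in range(H):
        a = policy(s)            # agent picks a_t in A
        r, s_next = env.step(a)  # r_t = R(s_t, a_t), s_{t+1} ~ P(.|s_t, a_t)
        trajectory.append((s, a, r))
        s = s_next
    trajectory.append((s,))      # final state s_{H+1}
    return trajectory
```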
Reward Models:
In the context of Markov Decision Processes (MDPs), there are several types of
reward models that characterize different aspects of the expected cumulative
rewards the agent aims to optimize. These reward models include:
Episodic Tasks:
● Examples:
● Playing a single game of chess, where the game starts from an
initial board state, and it ends when one player wins or a draw
occurs.
● Solving a maze, where the agent starts at the entrance and
finishes upon reaching the exit.
● Training an agent to perform a specific task in a video game
level, where an episode ends when the level is completed or the
character dies.
Continuing Tasks:
● Examples:
● Stock trading, where an agent makes investment decisions over
an indefinite time horizon.
The return Gt is defined recursively as
Gt = Rt+1 + γ Gt+1
Bellman's equation expresses the value of a state in terms of the values of its successor states:
V(s) = maxa [ R(s, a) + γ Σs' P(s' | s, a) V(s') ]
Where:
- V(s) represents the value of being in state s. This is the expected cumulative reward or utility that can be obtained starting from state s and following an optimal policy.
- a represents the action taken in state s.
- R(s, a) is the immediate reward obtained after taking action a in state s.
- γ (gamma) is the discount factor, which represents the importance of future rewards. It is a value between 0 and 1.
- Σs' represents a sum over all possible next states s' that can be reached from state s by taking action a.
- P(s' | s, a) is the probability of transitioning to state s' when action a is taken in state s.
- V(s') represents the value of the next state s'.
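To make the backup concrete, here is a minimal sketch of a single Bellman optimality update for one state, assuming the MDP is given as plain Python dictionaries; the names P, R, and gamma below are illustrative, not from the note:

```python
# Hypothetical tabular MDP: P[s][a] is a list of (prob, next_state) pairs,
# R[s][a] is the immediate reward, gamma is the discount factor.
def bellman_backup(s, V, P, R, gamma):
    """Return max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s') ]."""
    best = float("-inf")
    for a in P[s]:
        expected_next = sum(prob * V[s_next] for prob, s_next in P[s][a])
        q_sa = R[s][a] + gamma * expected_next
        best = max(best, q_sa)
    return best
```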
Policy Iteration:
The policy iteration algorithm consists of two main steps that are repeated
iteratively until convergence (a short code sketch follows the description):
1. Policy Evaluation:
- In this step, we evaluate the value function for a given policy. The value function, denoted as Vπ(s), represents the expected cumulative reward starting from state s and following policy π thereafter.
- The value function is updated iteratively using the Bellman expectation equation:
Vπ(s) = Σa π(a | s) Σs', r P(s', r | s, a) [r + γ Vπ(s')]
- Here, π(a | s) is the probability of taking action a in state s, and P(s', r | s, a) represents the transition probabilities and rewards associated with taking action a in state s and transitioning to state s' with reward r.
- The above equation is solved for each state until the value function converges to a fixed point.
2. Policy Improvement:
- Once we have the value function Vπ(s) for the current policy π, we can improve the policy by selecting the action in each state that maximizes the expected reward. This results in a new policy π':
π'(s) = argmaxa Σs', r P(s', r | s, a) [r + γ Vπ(s')]
- The new policy π' is a greedy policy with respect to the current value function Vπ(s). It selects the action that is expected to yield the highest reward in each state.
- If π' is not significantly different from π (i.e., the policies have not changed much), then the algorithm terminates, indicating that the optimal policy has been found. Otherwise, the process continues with policy evaluation using the new policy π'.
This algorithm is effective for finding the optimal policy in MDPs and is widely
used in reinforcement learning and dynamic programming.
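As a rough illustration of the two steps above, here is a minimal policy iteration sketch for a tabular MDP. The dictionary layout (P[s][a] as a list of (prob, next_state, reward) triples) and the parameter names are assumptions for the example, not part of the note:

```python
# Hypothetical tabular MDP: P[s][a] is a list of (prob, next_state, reward) triples.
def policy_iteration(P, states, actions, gamma=0.9, eval_sweeps=100, tol=1e-6):
    policy = {s: actions[0] for s in states}   # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. Policy Evaluation: iterate the Bellman expectation equation
        for _ in range(eval_sweeps):
            delta = 0.0
            for s in states:
                v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # 2. Policy Improvement: act greedily with respect to V
        stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in actions}
            best_a = max(q, key=q.get)
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                              # policy unchanged -> optimal
            return policy, V
```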
Value Iteration:
1. Initialization:
- Initialize a value function V(s) for each state s in the MDP. This can be done arbitrarily or with an initial guess.
- Set a convergence threshold ε (epsilon) to determine when the algorithm has converged.
2. Value Iteration:
● For each state s, update the value function V(s) using the Bellman optimality
equation:
V(s) ← maxa Σs', r P(s', r | s, a) [r + γ V(s')]
In this equation:
● a represents the action taken in state s.
● P(s', r | s, a) represents the transition probabilities and rewards
associated with taking action a in state s and transitioning to state s'
with reward r.
● γ (gamma) is the discount factor, which represents the importance of future
rewards.
3. Policy Extraction:
● Once the value iteration process converges, you can extract the optimal
policy π* by choosing the action that maximizes the right-hand side of the
Bellman optimality equation for each state:
π*(s) = argmaxa Σs', r P(s', r | s, a) [r + γ V(s')]
● The policy π* is now the optimal policy that maximizes the expected
cumulative reward in the MDP.
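A minimal sketch of steps 1-3 for a tabular MDP, reusing the same assumed P[s][a] layout of (prob, next_state, reward) triples as in the earlier examples (illustrative names only):

```python
# Hypothetical tabular MDP: P[s][a] is a list of (prob, next_state, reward) triples.
def value_iteration(P, states, actions, gamma=0.9, epsilon=1e-6):
    # 1. Initialization: arbitrary value function, convergence threshold epsilon
    V = {s: 0.0 for s in states}
    while True:
        # 2. Value Iteration: sweep the Bellman optimality update over all states
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < epsilon:
            break
    # 3. Policy Extraction: act greedily with respect to the converged V
    policy = {}
    for s in states:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in actions}
        policy[s] = max(q, key=q.get)
    return policy, V
```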