Reinforcement Learning
UNIT-II
Markov Decision Process: problem, policy, and value function. Reward
models (infinite discounted, total, finite horizon, and average). Episodic and
continuing tasks. Bellman’s optimality operator. Value iteration and
policy iteration.
In this problem setting, an agent must decide the best action to take based
on its current state. When this decision step is repeated over time, the problem is
modeled as a Markov Decision Process (MDP).
Formally, an MDP is defined by the tuple:
MDP = (S, A, P, R, γ)
where S is the set of states, A is the set of actions, P is the state-transition
probability function, R is the reward function, and γ is the discount factor.
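As an illustration only, the tuple (S, A, P, R, γ) can be held in a small data structure; the two-state example below is a hypothetical sketch and not the grid world discussed next.

# A tiny illustrative MDP as plain Python data (hypothetical example)
mdp = {
    "S": ["s0", "s1"],                       # states
    "A": ["stay", "go"],                     # actions
    # P[(s, a)]: list of (next_state, probability) pairs
    "P": {("s0", "go"): [("s1", 0.9), ("s0", 0.1)],
          ("s0", "stay"): [("s0", 1.0)],
          ("s1", "go"): [("s0", 1.0)],
          ("s1", "stay"): [("s1", 1.0)]},
    # R[(s, a)]: expected immediate reward for taking action a in state s
    "R": {("s0", "go"): 1.0, ("s0", "stay"): 0.0,
          ("s1", "go"): 0.0, ("s1", "stay"): 0.5},
    "gamma": 0.9,                            # discount factor
}
print(mdp["S"], mdp["A"], mdp["gamma"])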
As a running example, consider a grid world in which the agent follows the action
sequence UP, UP, RIGHT, RIGHT, RIGHT in the subsequent discussion.
For example, if the agent chooses UP, the probability of actually moving UP is 0.8, whereas the
probability of moving LEFT is 0.1 and the probability of moving RIGHT is 0.1 (since
LEFT and RIGHT are at right angles to UP).
A small reward is received at each step (it can be negative, in which case it also acts as a
punishment; in the above example, entering the Fire state can carry a reward of -1).
Big rewards come at the end (good or bad).
The goal is to maximize the sum of rewards.
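A minimal sketch of this stochastic transition model, assuming the 0.8/0.1/0.1 slip probabilities described above; the grid layout itself is omitted and the extension of the slip rule to the other directions is an assumption for illustration.

import random

# Intended move -> (intended, slip-left, slip-right) directions on the grid
SLIP = {
    "UP":    ["UP", "LEFT", "RIGHT"],
    "DOWN":  ["DOWN", "RIGHT", "LEFT"],
    "LEFT":  ["LEFT", "DOWN", "UP"],
    "RIGHT": ["RIGHT", "UP", "DOWN"],
}
PROBS = [0.8, 0.1, 0.1]   # intended move vs. each perpendicular slip

def sample_actual_direction(intended):
    """Sample the direction the agent actually moves in."""
    return random.choices(SLIP[intended], weights=PROBS, k=1)[0]

# Example: simulate the plan UP, UP, RIGHT, RIGHT, RIGHT once
plan = ["UP", "UP", "RIGHT", "RIGHT", "RIGHT"]
print([sample_actual_direction(a) for a in plan])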
Value Function:
The value function is a function that estimates the expected
cumulative reward an agent can achieve starting from a
particular state and following a specific policy.
The value function provides a measure of how good it is for
the agent to be in a particular state and follow a particular
policy.
The notion of “how good” here is defined in terms of the
future rewards that can be expected.
There are two types of value functions: the state-value function Vπ(s),
described informally above, and the action-value function qπ(s, a).
Action-Value Function
We define the value of taking action a in state s under a policy π,
denoted qπ(s, a), as the expected return starting from s, taking the
action a, and thereafter following policy π:
qπ(s, a) = Eπ[ G_t | S_t = s, A_t = a ] = Eπ[ Σ_k γ^k * R_{t+k+1} | S_t = s, A_t = a ]
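To make the definition concrete, here is a minimal Monte Carlo sketch that estimates qπ(s, a) by averaging sampled discounted returns. The two-state dynamics, rewards, and policy below are illustrative assumptions, not part of the notes.

import random

GAMMA = 0.9

def step(state, action):
    """Hypothetical dynamics: 'right' tends to move A -> B and pays +1 from B."""
    if state == "A":
        next_state = "B" if (action == "right" and random.random() < 0.8) else "A"
        reward = 0.0
    else:  # state == "B"
        next_state = "A" if action == "left" else "B"
        reward = 1.0 if action == "right" else 0.0
    return next_state, reward

def policy(state):
    """A fixed deterministic policy pi: always choose 'right'."""
    return "right"

def estimate_q(state, action, episodes=5000, horizon=50):
    """Average discounted return from taking `action` in `state`, then following pi."""
    total = 0.0
    for _ in range(episodes):
        s, a, g, discount = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, r = step(s, a)
            g += discount * r
            discount *= GAMMA
            a = policy(s)          # thereafter follow pi
        total += g
    return total / episodes

print(estimate_q("A", "right"))    # estimated q_pi(A, right)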
Reward models
In the context of reinforcement learning and Markov Decision
Processes (MDPs), reward models play a crucial role in guiding the
learning process and shaping the behavior of an agent.
Reward models define the immediate feedback an agent
receives for each action taken in a particular state.
They are used to represent the goals and objectives of the
agent in the learning environment.
Various types of reward models are used to represent different objectives and goals
for an agent. The four main types of reward models are:
1. Infinite Discounted Reward Model: The agent aims to maximize the expected
sum of discounted rewards over an infinite time horizon, where future rewards
are weighted by a discount factor γ (0 ≤ γ < 1):
Discounted Reward = Σ [γ^t * R_t],
where t ranges from 0 to ∞.
2. Total Reward Model: The agent aims to maximize the undiscounted sum of
rewards accumulated over a fixed time horizon of T time steps:
Total Reward = Σ [R_t],
where t ranges from 0 to T.
3.Finite Horizon Reward Model: The finite horizon reward model is similar
to the total reward model, but it allows for variable time horizons.
Instead of having a fixed time horizon T, each episode (sequence of
actions) may have a different length, and the agent aims to maximize
the cumulative reward over each individual episode.
Finite Horizon Reward = Σ [R_t],
where t ranges from 0 to T_k, the time step at the end of
episode k.
The finite horizon reward model is particularly useful in
episodic environments, where each episode has a natural
ending point.
4.Average Reward Model: In the average reward model, the agent aims to
maximize the average reward obtained per time step over an infinite
time horizon. Unlike the infinite discounted reward model, which
sums up discounted rewards, the average reward model calculates
the average of rewards without discounting.
Average Reward = lim (T → ∞) [Σ [R_t] / T],
where T is the number of time steps.
The average reward model is often used in stationary
environments, where the reward distribution remains constant
over time, and the agent's objective is to find a policy that
maximizes the long-term average reward per time step.
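A short sketch showing how the four reward models aggregate the same rewards differently. The reward sequence is the one used in the worked example later in this unit, and the average-reward limit is only approximated over the finite sequence.

rewards = [-1, 2, 6, 3, 2]      # illustrative reward sequence
gamma = 0.5                     # discount factor for the discounted model

# Infinite discounted reward (truncated to the available rewards here)
discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))

# Total reward over a fixed horizon T
total = sum(rewards)

# Finite horizon reward for one episode of length T_k (same sum, computed per episode)
finite_horizon = sum(rewards)

# Average reward per time step (the T -> infinity limit, approximated)
average = sum(rewards) / len(rewards)

print(discounted, total, finite_horizon, average)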
Episodic Tasks:
In episodic tasks, the agent's interactions with the environment
are organized into episodes.
Each episode is a sequence of actions and states that begins
with an initial state and ends with a terminal state.
The terminal state is a special state that marks the end of an
episode, after which the environment is reset to the initial state,
and a new episode begins.
Examples of episodic tasks include games with multiple rounds
or levels, where each round is an episode, or a robotic task that
requires the agent to complete a specific task within a limited
time frame.
The objective of the agent in an episodic task is often to maximize the
cumulative reward obtained within each individual episode.
After each episode, the agent receives a terminal signal (usually a
reward of 0) to indicate the end of the episode.
Continuing Tasks:
In continuing tasks, the agent's interaction with the environment does not break
naturally into episodes; it continues indefinitely without a terminal state.
The objective of the agent in a continuing task is typically to maximize the long-term
cumulative reward over an infinite or very long time horizon.
Since there are no episodes, the agent's experience continually accumulates, and there
is no explicit end-of-episode signal.
Key Differences:
Episodic tasks have terminal states, so the return for each episode is a finite sum of
rewards; continuing tasks have no terminal state, so the return is a sum over an
unbounded number of steps and discounting (γ < 1) is needed to keep it finite.
Worked example: consider an episode with rewards R1 = -1, R2 = 2, R3 = 6, R4 = 3,
and R5 = 2 that ends at the terminal time step T = 5.
To calculate the discounted cumulative rewards G0, G1, ..., G5, we work
backward starting from the terminal state (T = 5) and use the given discount
factor (γ = 0.5). The discounted cumulative reward G_t at time step t is defined
as:
G_t = R_{t+1} + (γ * G_{t+1}).
1. G5 (Terminal State):
G5 = 0 (no rewards are received after the terminal time step).
2. G4:
G4 = R5 + (γ * G5) = 2 + (0.5 * 0) = 2.
3. G3:
G3 = R4 + (γ * G4) = 3 + (0.5 * 2) = 3 + 1 = 4.
4. G2:
G2 = R3 + (γ * G3) = 6 + (0.5 * 4) = 6 + 2 = 8.
5. G1:
G1 = R2 + (γ * G2) = 2 + (0.5 * 8) = 2 + 4 = 6.
6. G0 (Initial State): G0 = R1 + (γ * G1) = -1 + (0.5 * 6) = -1 + 3 = 2.
So, the discounted cumulative rewards for each time step are: G0 = 2, G1 = 6,
G2 = 8, G3 = 4, G4 = 2, G5 = 0.
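A minimal sketch that reproduces the backward computation above; the rewards and discount factor are the ones from the worked example, and G_T is taken to be 0 at the terminal step.

rewards = {1: -1, 2: 2, 3: 6, 4: 3, 5: 2}    # R_1 .. R_5 from the example
gamma = 0.5
T = 5

G = {T: 0.0}                                 # G_T = 0 at the terminal time step
for t in range(T - 1, -1, -1):               # work backward: t = 4, 3, 2, 1, 0
    G[t] = rewards[t + 1] + gamma * G[t + 1] # G_t = R_{t+1} + gamma * G_{t+1}

print([G[t] for t in range(T + 1)])          # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]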
For a given state s, the Bellman optimality equation for the state value V*(s) is
defined as follows:
V*(s) = max_a Σ[s', r] P(s', r | s, a) * [r + γ * V*(s')]
Where:
max_a takes the maximum over all actions a available in state s,
P(s', r | s, a) is the probability of moving to state s' with reward r when action a
is taken in state s, and γ is the discount factor.
The Bellman optimality operator takes the current value function estimate V(s)
and updates it to a new value based on the expected rewards of taking
the optimal action from each state.
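As a sketch of a single application of this operator, consider a hypothetical two-state MDP; the states, transition probabilities, and rewards below are made up purely for illustration.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]

# P[s][a]: list of (probability, next_state, reward) triples (illustrative dynamics)
P = {
    "s0": {"a0": [(1.0, "s0", 0.0)],
           "a1": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a0": [(1.0, "s1", 0.5)],
           "a1": [(1.0, "s0", 0.0)]},
}

def bellman_optimality_backup(V):
    """One application of the Bellman optimality operator to a value function V."""
    return {
        s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]) for a in ACTIONS)
        for s in STATES
    }

V = {s: 0.0 for s in STATES}         # start from an all-zero value function
print(bellman_optimality_backup(V))  # {'s0': 0.8, 's1': 0.5}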
Policy iteration:
Policy iteration is an iterative algorithm used to find the optimal
policy (the policy that specifies which action to take in each state so as to
maximize the expected total discounted reward) in a
Markov Decision Process (MDP) or reinforcement learning
problem.
It combines two main steps: 1. policy evaluation and 2. policy
improvement.
The goal of policy iteration is to converge to the optimal policy
by improving an initial policy through multiple iterations.
The algorithm starts with an initial policy, which can be any arbitrary
policy or a random policy. Then, it alternates between the following
two steps until the policy converges to the optimal policy:
1. Policy Evaluation: In this step, the algorithm evaluates the state-value
function for the current policy (a code sketch of both steps follows under the
Policy Iteration heading below). The value function for a policy π is
denoted by Vπ(s) and can be computed using the following
equation:
Vπ(s) = Σ[s', r] P(s', r | s, π(s)) * [r + γ * Vπ(s')]
Where:
P(s', r | s, π(s)) is the probability of moving to state s' with reward r when the
action π(s) is taken in state s, and γ is the discount factor.
The policy evaluation step involves solving the above equation for each state
in the MDP, and this process is usually repeated until the values of the value
function converge.
2. Policy Improvement: The policy improvement step involves selecting, in each
state, the action that maximizes the expected cumulative reward according to
the current value function:
π'(s) = argmax_a Σ[s', r] P(s', r | s, a) * [r + γ * Vπ(s')]
Where:
argmax_a selects the action a with the highest one-step lookahead value, using
the same transition probabilities P(s', r | s, a) and discount factor γ as above.
After policy improvement, the new policy π' is obtained, and the process
repeats.
Policy Iteration:
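A compact sketch of the full policy iteration loop (evaluation followed by greedy improvement). The two-state dynamics are the same illustrative ones used in the operator sketch, repeated here so the code is self-contained; they are assumptions for the example, not from the notes.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]

# P[s][a]: list of (probability, next_state, reward) triples (illustrative dynamics)
P = {
    "s0": {"a0": [(1.0, "s0", 0.0)],
           "a1": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a0": [(1.0, "s1", 0.5)],
           "a1": [(1.0, "s0", 0.0)]},
}

def q_value(V, s, a):
    """One-step lookahead: expected reward plus discounted value of successors."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}          # arbitrary initial policy
    while True:
        # 1. Policy evaluation: approximate V_pi by repeated sweeps
        V = {s: 0.0 for s in STATES}
        while True:
            delta = 0.0
            for s in STATES:
                v_new = q_value(V, s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < 1e-8:
                break
        # 2. Policy improvement: act greedily with respect to V_pi
        new_policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
        if new_policy == policy:                      # policy stable -> optimal
            return policy, V
        policy = new_policy

print(policy_iteration())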
Value iteration:
Value iteration computes the optimal value function directly by repeatedly applying
the Bellman optimality backup to every state until the values converge, and then
extracts the optimal policy by acting greedily with respect to the converged values.
The algorithm is based on the Bellman optimality equation, which relates the
value of a state to the value of its successor states under the optimal policy.
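A minimal value iteration sketch under the same illustrative two-state dynamics used above (repeated so the sketch is self-contained); the dynamics and convergence threshold are assumptions for the example, not from the notes.

GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]

# P[s][a]: list of (probability, next_state, reward) triples (illustrative dynamics)
P = {
    "s0": {"a0": [(1.0, "s0", 0.0)],
           "a1": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a0": [(1.0, "s1", 0.5)],
           "a1": [(1.0, "s0", 0.0)]},
}

def value_iteration(theta=1e-8):
    """Apply the Bellman optimality backup to every state until the values converge."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v_new = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                        for a in ACTIONS)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy (optimal) policy from the converged values
    policy = {
        s: max(ACTIONS,
               key=lambda a: sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a]))
        for s in STATES
    }
    return V, policy

print(value_iteration())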