
Reinforcement Learning
UNIT-II
Markov Decision Process: problem, policy, and value function. Reward models (infinite discounted, total, finite horizon, and average). Episodic & continuing tasks, Bellman's optimality operator, and value iteration & policy iteration.

Markov Decision Process (MDP):


A Markov Decision Process is a mathematical framework used to model decision-making in
situations where the outcome depends on both the current state and the action taken by an agent.
It is widely used in the field of artificial intelligence, reinforcement learning, operations research,
and control theory.

In such a problem, an agent must decide the best action to take based on its current state. When this decision step is repeated over time, the problem is known as a Markov Decision Process.

The Agent–Environment Interface


A Markov Decision Process (MDP) model consists of the tuple

MDP = (S, A, P, R, γ)

1. States (S): A finite set of possible states. These states represent the different
situations or configurations of the environment.
2. Actions (A): A finite set of actions that the agent can take.
Actions represent the decisions or choices that the agent can
make to influence the state transitions.
3. Transition Probabilities (P): A set of transition probabilities
that describe the likelihood of moving from one state to
another when an action is taken.

These probabilities are denoted by P(s'|s, a), which is the


probability of transitioning to state s' given that the current
state is s and the action a is taken.
4. Rewards (R): A reward function that provides a scalar value
as feedback to the agent after taking an action in a particular
state. The reward function is denoted by R(s, a, s'), where s
is the current state, a is the action taken, and s' is the next
state.
5. Discount Factor (γ): A discount factor that is used to weigh
the importance of future rewards compared to immediate
rewards. It ranges between 0 and 1, where 0 indicates that
the agent only cares about immediate rewards, and 1
indicates that the agent values future rewards equally with
immediate rewards.
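For concreteness, the five components above might be written down in Python as in the following minimal sketch; the two states, two actions, and all of the probabilities and rewards are invented purely for illustration.

# A hypothetical two-state MDP represented with plain Python data structures.
# The state names, actions, probabilities, and rewards are made up for illustration.

states = ["s0", "s1"]          # S: finite set of states
actions = ["stay", "move"]     # A: finite set of actions
gamma = 0.9                    # gamma: discount factor, 0 <= gamma <= 1

# P[(s, a)] lists (next_state, probability) pairs, i.e. P(s' | s, a).
P = {
    ("s0", "stay"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "move"): [("s0", 0.2), ("s1", 0.8)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 0.7), ("s1", 0.3)],
}

# R[(s, a, s_next)] gives the scalar reward R(s, a, s').
R = {
    ("s0", "stay", "s0"): 0.0, ("s0", "stay", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0, ("s0", "move", "s1"): 1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): -1.0, ("s1", "move", "s1"): 0.0,
}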


 The objective of an agent in a Markov Decision Process


is to find a policy (π), which is a mapping from states to
actions that maximizes the expected cumulative reward
over time.
 The policy tells the agent which action to take in each
state to achieve the best overall outcome.
 Solving a Markov Decision Process involves finding the
optimal policy that maximizes the expected total
reward.
 There are various algorithms to do this, such as Value
Iteration, Policy Iteration, Q-Learning, and SARSA,
which are commonly used in the context of
reinforcement learning.
 Markov Decision Processes provide a powerful
framework for modeling decision-making problems in
stochastic and uncertain environments, making them
essential tools in the study of artificial intelligence and
decision theory.


Policy and value function

 In the context of a Markov Decision Process (MDP), both


policy and value function are fundamental concepts
used to guide decision-making and evaluate the quality
of actions in different states.
Policy: A policy (π) in an MDP is a strategy or a rule that defines the
agent's behavior.
 It specifies the action taken by the agent in each state to
maximize its expected cumulative reward over time.
 Mathematically, a policy is represented as:

π(a | s) = P(take action a | in state s),


where:

π(a | s) is the probability of taking action a in state s according to the policy.

 A policy can be deterministic (choosing a single action for


each state) or stochastic (selecting actions with certain
probabilities).
 The objective of the agent is to find the best policy that leads
to the highest expected total reward
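To make the deterministic/stochastic distinction concrete, here is a small Python sketch; the state and action names and the probabilities are illustrative assumptions, in the same style as the toy MDP sketched earlier.

import random

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {"s0": "move", "s1": "stay"}

# Stochastic policy: each state maps to a probability distribution over actions, pi(a | s).
stochastic_pi = {
    "s0": {"stay": 0.3, "move": 0.7},
    "s1": {"stay": 0.9, "move": 0.1},
}

def sample_action(pi, state):
    """Sample an action from a stochastic policy pi(a | s)."""
    acts, probs = zip(*pi[state].items())
    return random.choices(acts, weights=probs, k=1)[0]

print(deterministic_pi["s0"])              # always "move"
print(sample_action(stochastic_pi, "s0"))  # "stay" with prob. 0.3, "move" with prob. 0.7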


Let us take the example of a grid world:

An agent lives in the grid. The above example is a 3×4 grid.
The grid has a START state (grid no. 1,1). The purpose of the agent is to wander around the grid and finally reach the Blue Diamond (grid no. 4,3). Under all circumstances, the agent should avoid the Fire grid (orange colour, grid no. 4,2). Grid no. 2,2 is a blocked grid; it acts as a wall, so the agent cannot enter it.
The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond.
Two such sequences can be found:

 RIGHT RIGHT UP UP RIGHT


 UP UP RIGHT RIGHT RIGHT


Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent
discussion.
For example, if the agent chooses UP, the probability of actually going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
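A minimal Python sketch of this noisy action model, assuming the 0.8/0.1/0.1 slip behaviour described above (the helper name actual_move and the SLIP table are invented for illustration):

import random

# Perpendicular directions for each intended grid action (straight, slip-left, slip-right).
SLIP = {
    "UP":    ("UP", "LEFT", "RIGHT"),
    "DOWN":  ("DOWN", "RIGHT", "LEFT"),
    "LEFT":  ("LEFT", "DOWN", "UP"),
    "RIGHT": ("RIGHT", "UP", "DOWN"),
}

def actual_move(intended):
    """Return the direction actually taken under the 0.8 / 0.1 / 0.1 slip model."""
    straight, slip_left, slip_right = SLIP[intended]
    return random.choices([straight, slip_left, slip_right],
                          weights=[0.8, 0.1, 0.1], k=1)[0]

print(actual_move("UP"))   # "UP" about 80% of the time, "LEFT" or "RIGHT" about 10% each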

The agent receives rewards each time step:-

 Small reward at each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire grid can have a reward of -1).
 Big rewards come at the end (good or bad).
 The goal is to Maximize the sum of rewards.

Value Function:
 The value function is a function that estimates the expected
cumulative reward an agent can achieve starting from a
particular state and following a specific policy.
 The value function provides a measure of how good it is for
the agent to be in a particular state and follow a particular
policy.
 The notion of “how good” here is defined in terms of
future rewards that can be expected
 There are two types of value functions:


1. The state-value function (Vπ(s))

We call the function Vπ the state-value function for policy π. It is defined as:

Vπ(s) = Eπ[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... | S_t = s ] = Eπ[ Σ_{k=0..∞} γ^k · R_{t+k+1} | S_t = s ]

 where Eπ[·] denotes the expected value of a random variable given that the agent follows policy π, and t is any time step.
 γ is the discount factor that weighs the importance of future rewards.
 R_t is the reward obtained at time step t.

Action-Value Function
We define the value of taking action a in state s under a policy π, denoted qπ(s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:

qπ(s, a) = Eπ[ Σ_{k=0..∞} γ^k · R_{t+k+1} | S_t = s, A_t = a ]
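The two value functions are linked: under a stochastic policy, the state value is the policy-weighted average of the action values, Vπ(s) = Σ_a π(a | s) · qπ(s, a). A short Python sketch of this relationship, with invented numbers:

# Hypothetical action values q_pi(s, a) for a single state, for illustration only.
q_pi = {("s0", "stay"): 1.2, ("s0", "move"): 2.5}

# Stochastic policy pi(a | s) for that state.
pi = {"s0": {"stay": 0.3, "move": 0.7}}

def state_value(state, pi, q_pi):
    """V_pi(s) = sum over a of pi(a | s) * q_pi(s, a)."""
    return sum(prob * q_pi[(state, action)]
               for action, prob in pi[state].items())

print(state_value("s0", pi, q_pi))   # 0.3 * 1.2 + 0.7 * 2.5 = 2.11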


Reward models
In the context of reinforcement learning and Markov Decision
Processes (MDPs), reward models play a crucial role in guiding the
learning process and shaping the behavior of an agent.
 Reward models define the immediate feedback an agent
receives for each action taken in a particular state.
 They are used to represent the goals and objectives of the
agent in the learning environment.

A reward model can be thought of as a mapping from state-action pairs to


real-valued rewards. Mathematically, a reward model can be represented as
a function R(s, a), where:
 s is the current state in the environment.
 a is the action taken by the agent in that state.
 R(s, a) is the reward received by the agent after taking action a in
state s.


Various types of reward models are used to represent different objectives and goals
for an agent. The four main types of reward models are:

1. Infinite Discounted Reward Model: In the infinite discounted


reward model, the agent aims to maximize the cumulative discounted
reward over an infinite time horizon.
The cumulative discounted reward at time step t is given by the sum
of rewards obtained at each time step, each multiplied by a discount
factor γ raised to the power of the time step:
Cumulative Discounted Reward = Σ [γ^t * R_t],
where:
 γ (gamma) is the discount factor, 0 ≤ γ < 1. It determines the relative importance of immediate rewards compared to future rewards. Smaller γ values emphasize immediate rewards, while larger γ values emphasize future rewards.

 The objective of the agent is to find a policy that maximizes the expected cumulative discounted reward.
 The discounted reward model helps the agent prioritize getting rewards sooner rather than later, as future rewards are discounted with each time step.

2. Total Reward Model: In the total reward model, the agent aims to maximize the cumulative sum of rewards over a finite time horizon T. The objective is to obtain as much reward as possible within a fixed number of time steps.
 Total Reward = Σ [R_t],
 where t ranges from 0 to T.
 Unlike the infinite discounted reward model, the
total reward model does not involve a discount
factor, and the agent's focus is on maximizing the
sum of rewards obtained over a fixed time horizon.


3. Finite Horizon Reward Model: The finite horizon reward model is similar
to the total reward model, but it allows for variable time horizons.
Instead of having a fixed time horizon T, each episode (sequence of
actions) may have a different length, and the agent aims to maximize
the cumulative reward over each individual episode.
 Finite Horizon Reward = Σ [R_t],
 where t ranges from 0 to T_k, the time step at the end of
episode k.
 The finite horizon reward model is particularly useful in
episodic environments, where each episode has a natural
ending point.

4. Average Reward Model: In the average reward model, the agent aims to
maximize the average reward obtained per time step over an infinite
time horizon. Unlike the infinite discounted reward model, which
sums up discounted rewards, the average reward model calculates
the average of rewards without discounting.
 Average Reward = lim (T → ∞) [Σ [R_t] / T],
 where T is the number of time steps.
 The average reward model is often used in stationary
environments, where the reward distribution remains constant
over time, and the agent's objective is to find a policy that
maximizes the long-term average reward per time step.

Each of these reward models represents different objectives for the


agent and can lead to distinct optimal policies and behaviors. The
choice of reward model depends on the specific task and objectives
of the reinforcement learning problem at hand.
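As a rough illustration of how the different criteria score the same experience, the sketch below computes the (truncated) discounted return, the total return, and the empirical average reward for one arbitrary reward sequence with γ = 0.9; both the sequence and γ are made-up values:

rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # hypothetical rewards R_0 ... R_4
gamma = 0.9                            # discount factor for the discounted model

# Infinite discounted model (truncated to the rewards we actually have):
discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))

# Total / finite-horizon model: the plain sum of rewards over the horizon.
total = sum(rewards)

# Average reward model: mean reward per time step.
average = sum(rewards) / len(rewards)

print(discounted, total, average)      # approximately 3.86, 5.0, 1.0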


Episodic and continuing tasks


Episodic and continuing tasks are two types of environments that
are commonly encountered in the context of reinforcement learning
and Markov Decision Processes (MDPs).
 They differ in terms of the duration and structure of the agent's
interactions with the environment.

Episodic Tasks:
 In episodic tasks, the agent's interactions with the environment
are organized into episodes.
 Each episode is a sequence of actions and states that begins
with an initial state and ends with a terminal state.
 The terminal state is a special state that marks the end of an
episode, after which the environment is reset to the initial state,
and a new episode begins.
 Examples of episodic tasks include games with multiple rounds
or levels, where each round is an episode, or a robotic task that
requires the agent to complete a specific task within a limited
time frame.
The objective of the agent in an episodic task is often to maximize the
cumulative reward obtained within each individual episode.
After each episode, the agent receives a terminal signal (usually a
reward of 0) to indicate the end of the episode.

Continuing Tasks:

 In continuing tasks, the agent's interactions with the


environment do not have an explicit episode structure or a
terminal state.
 The agent's experience continues indefinitely without any
predefined end point.
 Examples of continuing tasks include controlling a robot that
operates continuously without a predefined stopping condition
or an agent learning to navigate a virtual environment without
any specific endpoint.


The objective of the agent in a continuing task is typically to maximize the long-term
cumulative reward over an infinite or very long time horizon.
Since there are no episodes, the agent's experience continually accumulates, and there
is no explicit end-of-episode signal.

Key Differences:

1. Unified Notation:
 Episodic tasks: the agent's experience is divided into atomic episodes, and the choice of action in each episode depends only on that episode itself.
 Continuing tasks: in a sequential (continuing) environment, on the other hand, the current decision affects future decisions.

2. Temporal Structure:
 Episodic tasks have well-defined episodes with a fixed number of time steps.
 Continuing tasks do not have a predefined episode structure or terminal states.

3. Learning Objective:
 In episodic tasks, the agent aims to maximize the cumulative reward within each episode.
 In continuing tasks, the objective is to maximize the long-term cumulative reward over an infinite or extended time horizon.

4. Resetting:
 Episodic tasks involve resetting the environment to its initial state after each episode, allowing the agent to start a new episode.
 In continuing tasks, there is no reset between episodes since they do not have an end.

5. Termination Signal:
 In episodic tasks, the terminal state or a terminal signal marks the end of each episode.
 In continuing tasks, there is no explicit termination signal.

Reinforcement learning algorithms and approaches can be


adapted to handle both episodic and continuing tasks, and
the choice of the task type depends on the specific problem
being addressed.


Worked example: to calculate the discounted cumulative rewards (returns) G0, G1, ..., G5, we work backward starting from the terminal state (T = 5) and use the given discount factor γ = 0.5. The discounted cumulative reward G_t at time step t is defined as:
G_t = R_{t+1} + (γ * G_{t+1}).

Using the given sequence of rewards R1 = -1, R2 = 2, R3 = 6, R4 = 3, and R5 = 2, let's calculate


G0, G1, ..., G5 step by step:

1. G5 (terminal state):
G5 = 0 (the return from the terminal state is zero, since no rewards follow it).
2. G4:
G4 = R5 + (γ * G5) = 2 + (0.5 * 0) = 2.
3. G3:
G3 = R4 + (γ * G4) = 3 + (0.5 * 2) = 3 + 1 = 4.
4. G2:
G2 = R3 + (γ * G3) = 6 + (0.5 * 4) = 6 + 2 = 8.
5. G1:
G1 = R2 + (γ * G2) = 2 + (0.5 * 8) = 2 + 4 = 6.
6. G0 (initial state):
G0 = R1 + (γ * G1) = -1 + (0.5 * 6) = -1 + 3 = 2.

So, the discounted cumulative returns for each time step are: G0 = 2, G1 = 6,
G2 = 8, G3 = 4, G4 = 2, G5 = 0.
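The same backward recursion can be sketched in a few lines of Python to check the numbers above; it assumes, as in the calculation, that the return from the terminal state is zero:

def returns_backward(rewards, gamma):
    """Compute G_0 ... G_T from rewards R_1 ... R_T using G_t = R_{t+1} + gamma * G_{t+1}."""
    G = [0.0] * (len(rewards) + 1)     # G[T] = 0 at the terminal state
    for t in range(len(rewards) - 1, -1, -1):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G

print(returns_backward([-1, 2, 6, 3, 2], gamma=0.5))
# -> [2.0, 6.0, 8.0, 4.0, 2.0, 0.0], i.e. G0 = 2, G1 = 6, G2 = 8, G3 = 4, G4 = 2, G5 = 0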

Bellman’s optimality operator

 The Bellman optimality operator is a mathematical operator used to define the optimal value function in a Markov Decision Process (MDP) and is a central tool of dynamic programming.
 In an MDP, an agent makes decisions in an environment,
transitioning from one state to another while receiving rewards
based on its actions.
 The goal of the agent is to find a policy (a strategy for selecting
actions) that maximizes the expected cumulative reward over
time.

The Bellman optimality operator is used to iteratively update the value function based on the Bellman optimality equation, and thereby to derive the optimal policy.

For a given state s, the Bellman optimality equation for the state value V*(s) is defined as follows:

V*(s) = max[a] { Σ[s', r] P(s', r | s, a) * [r + γ * V*(s')] }


Where:

 V*(s) is the optimal value function for state s.


 max[a] denotes taking the maximum over all possible actions a that
the agent can take in state s.
 P(s', r | s, a) is the probability of transitioning to state s' and receiving
reward r, given that the agent is in state s and takes action a.
 r is the immediate reward the agent receives from the environment
after taking action a in state s and transitioning to state s'.
 γ (gamma) is the discount factor, a constant between 0 and 1, which
determines the agent's preference for immediate rewards versus
future rewards.

 The Bellman optimality operator takes the current value function V*(s)
and updates it to a new value based on the expected rewards of taking
the optimal action from each state.
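A minimal Python sketch of a single Bellman optimality backup applied to one state. The dictionary-of-triples representation of P(s', r | s, a), the state names, and the numbers are assumptions made purely for illustration:

# Hypothetical model: P[(s, a)] -> list of (next_state, reward, probability) triples.
P = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
}
gamma = 0.9
V = {"s0": 0.0, "s1": 5.0}   # current (made-up) value estimates

def bellman_backup(s, actions, P, V, gamma):
    """Apply V*(s) <- max over a of sum over (s', r) of P(s', r | s, a) * [r + gamma * V(s')]."""
    return max(
        sum(prob * (r + gamma * V[s_next]) for s_next, r, prob in P[(s, a)])
        for a in actions
    )

print(bellman_backup("s0", ["stay", "move"], P, V, gamma))
# "stay" gives 0.0; "move" gives 0.8 * (1 + 0.9 * 5) + 0.2 * (0 + 0) = 4.4, so the backup returns 4.4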


Policy iteration:
 Policy iteration is an iterative algorithm used to find the optimal policy (the policy that specifies which action to take in each state so as to maximize the total discounted reward) in a Markov Decision Process (MDP) or reinforcement learning problem.
 It combines two main steps: 1. policy evaluation and 2. policy improvement.
 The goal of policy iteration is to converge to the optimal policy
by improving an initial policy through multiple iterations.

The algorithm starts with an initial policy, which can be any arbitrary
policy or a random policy. Then, it alternates between the following
two steps until the policy converges to the optimal policy:

1. Policy Evaluation: In this step, the algorithm evaluates the value function of each state under the current policy. The value function for a policy π is denoted by Vπ(s) and can be computed using the following equation:
Vπ(s) = Σ[s', r] P(s', r | s, π(s)) * [r + γ * Vπ(s')]
Where:

 Vπ(s) is the value function for policy π in state s.


 Vπ(s') is the value function of the future(next) state
 P(s', r | s, π(s)) is the probability of transitioning to state s' and receiving
reward r when taking the action determined by policy π in state s.
 γ (gamma) is the discount factor, a constant between 0 and 1, which
determines the agent's preference for immediate rewards versus future
rewards.

The policy evaluation step involves solving the above equation for each state
in the MDP, and this process is usually repeated until the values of the value
function converge.
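A rough Python sketch of this iterative policy evaluation step, assuming a deterministic policy pi and the same dictionary-of-triples model format used in the earlier sketches; the function name and the tolerance theta are illustrative choices:

def policy_evaluation(states, pi, P, gamma, theta=1e-6):
    """Iteratively sweep the Bellman expectation equation for a deterministic policy pi
    until the largest change in any state value is below the tolerance theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(prob * (r + gamma * V[s_next])
                        for s_next, r, prob in P[(s, pi[s])])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V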


2. Policy Improvement: The policy improvement step involves selecting the best
action in each state to maximize the expected cumulative reward according to
the current value function.

The updated policy, denoted as π', is given by:

π'(s) = argmax[a] { Σ[s', r] P(s', r | s, a) * [r + γ * Vπ(s')] }

Where:

 π'(s) is the action selected by the updated policy π' in state s.


 The argmax[a] denotes taking the action that maximizes the expression in the
curly brackets.

After policy improvement, the new policy π' is obtained, and the process
repeats.



Policy iteration Algorithm:
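A minimal Python sketch of the full policy iteration loop, combining the evaluation and improvement steps above; it reuses the hypothetical model format and the policy_evaluation helper sketched earlier:

def policy_improvement(states, actions, P, V, gamma):
    """Return the greedy (deterministic) policy with respect to the value function V."""
    return {
        s: max(actions,
               key=lambda a: sum(prob * (r + gamma * V[s_next])
                                 for s_next, r, prob in P[(s, a)]))
        for s in states
    }

def policy_iteration(states, actions, P, gamma):
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        V = policy_evaluation(states, pi, P, gamma)   # from the sketch above
        new_pi = policy_improvement(states, actions, P, V, gamma)
        if new_pi == pi:                              # policy is stable, hence optimal
            return pi, V
        pi = new_pi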

Value iteration:

 The process of repeatedly applying the Bellman optimality operator to


the value function until it converges to the true optimal value function
is known as the value iteration algorithm, a common approach to
solving MDPs in dynamic programming.

 It is a dynamic programming method; because it folds policy improvement into each value update, it avoids the separate full policy-evaluation step used by policy iteration, which can make it practical even for MDPs with fairly large state spaces.
 The value iteration algorithm starts with an initial estimate of the optimal
value function and then repeatedly updates the value function until it
converges to the true optimal value function.


 The algorithm is based on the Bellman optimality equation, which relates the
value of a state to the value of its successor states under the optimal policy.
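A minimal Python sketch of the value iteration loop under the same assumed model format (P[(s, a)] as (next_state, reward, probability) triples); the stopping tolerance theta is an arbitrary choice:

def value_iteration(states, actions, P, gamma, theta=1e-6):
    """Repeatedly apply the Bellman optimality backup to every state until the value
    function converges, then read off a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(prob * (r + gamma * V[s_next])
                            for s_next, r, prob in P[(s, a)])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy with respect to the converged value function.
    pi = {s: max(actions,
                 key=lambda a: sum(prob * (r + gamma * V[s_next])
                                   for s_next, r, prob in P[(s, a)]))
          for s in states}
    return V, pi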
