
Reinforcement Learning

Unit - 2

Markov Decision Problem, Policy and value function, Reward models (infinite discounted, total, finite horizon, and average), Episodic & continuing tasks, Bellman's optimality operator, and value iteration & policy iteration

Markov Decision Problem

Markov Decision Processes

In reinforcement learning, the interactions between the agent and
the environment are often described by a Markov Decision
Process (MDP), specified by:

• State space S: A finite set of states representing the different
situations or configurations the agent can be in. At each time step, the agent
is in one of these states.
• Action space A: A finite set of actions representing the
choices the agent can make. Actions are the decisions or moves
available to the agent.

• Transition function P : S × A → ∆(S), where ∆(S) is the
space of probability distributions over S (i.e., the probability
simplex). P(s′ | s, a) is the probability of transitioning into state s′
upon taking action a in state s.
• Reward function R : S × A → [0, Rmax], where Rmax >
0 is a constant. R(s, a) is the immediate reward associated with
taking action a in state s.
• Discount factor γ ∈ [0, 1), which defines a horizon for
the problem.

Interaction protocol
In a given MDP M = (S, A, P, R, γ), the agent interacts with
the environment according to the following protocol: the agent
starts at some state s1; at each time step t = 1, 2, . . ., the agent
takes an action at ∈ A, obtains the immediate reward rt = R(st,
at), and observes the next state st+1 sampled from P(st, at), or
st+1 ∼ P(st, at). The interaction record
τ = (s1, a1, r1, s2, . . . , sH+1)
is called a trajectory of length H.

In some situations, it is necessary to specify how the initial


state s1 is generated. We consider s1 sampled from an initial
distribution d0 ∈ ∆(S). When d0 is of importance to the
discussion, we include it as part of the MDP definition, and write
M = (S, A, P, R, γ, d0).
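
To make the interaction protocol concrete, the sketch below represents a small tabular MDP with NumPy arrays and samples a length-H trajectory; the two-state example and all numerical values are assumptions chosen only for illustration, not part of these notes.

import numpy as np

def sample_trajectory(P, R, d0, policy, H, seed=None):
    # P[s, a, s'] holds transition probabilities, R[s, a] immediate rewards,
    # d0 the initial state distribution; policy maps a state index to an action index.
    # Returns the (s_t, a_t, r_t) triples for t = 1..H plus the final state s_{H+1}.
    rng = np.random.default_rng(seed)
    s = rng.choice(len(d0), p=d0)                    # s1 ~ d0
    steps = []
    for _ in range(H):
        a = policy(s)                                # a_t = pi(s_t)
        r = R[s, a]                                  # r_t = R(s_t, a_t)
        s_next = rng.choice(P.shape[2], p=P[s, a])   # s_{t+1} ~ P(s_t, a_t)
        steps.append((s, a, r))
        s = s_next
    return steps, s

# A toy two-state, two-action MDP (all numbers chosen only for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
d0 = np.array([1.0, 0.0])
print(sample_trajectory(P, R, d0, policy=lambda s: 0, H=5))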

The goal in a Markov Decision Problem is to find an


optimal policy, denoted as π*, that maximizes the expected
cumulative reward over time. This cumulative reward is often
referred to as the "return." In other words, the agent aims to
make a sequence of decisions that yield the highest expected sum
of rewards over the long run.

To find the optimal policy, various algorithms and methods


can be used, including:

Value Iteration: An iterative algorithm that computes the


optimal value function (expected cumulative reward) for each
state and then derives the optimal policy from it.

Policy Iteration: An iterative algorithm that alternates between


policy evaluation (computing the value function for a policy) and
policy improvement (selecting a better policy based on the value
function).

Q-Learning: A popular model-free reinforcement learning


algorithm that learns the optimal action-value function
(Q-function) through exploration and exploitation.

SARSA: Another model-free reinforcement learning algorithm
that learns the Q-function; unlike Q-learning, it is on-policy,
updating its estimates from the action the current policy actually
takes in the next state.
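
As a concrete illustration of the two update rules just described, here is a minimal tabular sketch; the learning rate alpha, the toy table sizes, and the sample transitions are assumptions for illustration, not taken from these notes.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD update: bootstrap from the greedy action in s_next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy TD update: bootstrap from the action actually taken in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Q is a table of shape (num_states, num_actions); the transitions below are
# made-up samples standing in for real environment interaction.
Q = np.zeros((2, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
sarsa_update(Q, s=1, a=0, r=0.5, s_next=0, a_next=1)
print(Q)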

MDPs are used to model a wide range of real-world


decision-making problems, including robotics, game playing,
autonomous systems, recommendation systems, and more. They
provide a structured framework for studying and solving
problems where decisions must be made sequentially in the
presence of uncertainty.

Policy and value

A (deterministic and stationary) policy π : S → A specifies a


decision-making strategy in which the agent chooses actions
adaptively based on the current state, i.e., at = π(st). More
generally, the agent may also choose actions according to a
stochastic policy π : S → ∆(A), and with a slight abuse of
notation we write at ∼ π(st). A deterministic policy is the special
case in which π(s) is a point mass for all s ∈ S.
The goal of the agent is to choose a policy π to maximize the
expected discounted sum of rewards, or value:

E[ Σt≥1 γ^(t−1) rt ]    (1)

The expectation is with respect to the randomness of the


trajectory, that is, the randomness in state transitions and the
stochasticity of π. Notice that, since rt is nonnegative and upper
bounded by Rmax, we have

0 ≤ Σt≥1 γ^(t−1) rt ≤ Rmax / (1 − γ)    (2)

Hence, the discounted sum of rewards (or the discounted
return) along any actual trajectory is always bounded in the range
[0, Rmax / (1 − γ)], and so is its expectation of any form. This fact will
be important when we later analyze the error propagation of
planning and learning algorithms.
Note that for a fixed policy, its value may differ for
different choices of s1, and we define the value function
V^π_M : S → R as

V^π_M(s) = E[ Σt≥1 γ^(t−1) rt | π, s1 = s ],

which is the value obtained by following policy π starting at
state s. Similarly, we define the action-value (or Q-value) function
Q^π_M : S × A → R as

Q^π_M(s, a) = E[ Σt≥1 γ^(t−1) rt | π, s1 = s, a1 = a ].

Henceforth, the dependence of any notation on M will be


made implicit whenever it is clear from context.
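
For a finite MDP, V^π can also be computed exactly rather than estimated from trajectories, by solving the linear system V = R_π + γ P_π V. The sketch below does this for a deterministic policy on a toy tabular MDP; the array shapes and values are assumptions for illustration, not from these notes.

import numpy as np

def policy_value(P, R, gamma, pi):
    # Exact policy evaluation for a tabular MDP: V^pi solves the linear system
    # V = R_pi + gamma * P_pi V, where P_pi and R_pi are the transition matrix
    # and reward vector induced by the deterministic policy pi (array of actions).
    n_states = P.shape[0]
    P_pi = P[np.arange(n_states), pi]        # shape (S, S)
    R_pi = R[np.arange(n_states), pi]        # shape (S,)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Same toy layout as before: P[s, a, s'] and R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_value(P, R, gamma=0.9, pi=np.array([0, 1])))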

Reward Models:
In the context of Markov Decision Problems (MDPs), there
are several types of reward models that characterize different
aspects of the expected cumulative rewards the agent aims to
optimize. These reward models include:

1. Infinite Discounted Reward Model:


In the infinite discounted reward model, the objective is to
maximize the expected cumulative discounted reward over an
infinite time horizon.
The cumulative reward is discounted at each time step by a
discount factor γ (0 ≤ γ < 1) to account for the agent's preference
for immediate rewards over future rewards. The objective is to
maximize the following quantity:

E[ Σt≥1 γ^(t−1) rt ]

2. Total Reward Model:


- In the total reward model, the objective is to maximize the
expected cumulative reward over a finite time horizon T.
- Unlike the infinite discounted reward model, there is no
discount factor applied to future rewards. The objective is to
maximize the following quantity:

E[ r1 + r2 + ··· + rT ]

3. Finite Horizon Reward Model:


- In the finite horizon reward model, the agent aims to
maximize the expected cumulative reward over a fixed and finite
time horizon T.
- This model is similar to the total reward model but is
explicitly defined for a predetermined number of time steps.

4. Average Reward Model:


- In the average reward model, the objective is to maximize
the expected average reward per time step over an infinite time
horizon.
- It is particularly useful when comparing different policies
because it normalizes the cumulative reward by the number of
time steps. The objective is to maximize the following quantity:

lim T→∞ (1/T) E[ r1 + r2 + ··· + rT ]

These different reward models reflect variations in the


temporal focus and objectives of the agent's decision-making
process. The choice of reward model depends on the specific
problem and the agent's goals.
If the agent is interested in optimizing rewards over a fixed
time horizon without discounting, the total or finite horizon
reward model may be used. The average reward model is often
used when comparing policies in a steady-state setting.
It's important to note that the choice of reward model can
significantly influence the optimal policy and the behavior of the
agent in an MDP. Therefore, it should be carefully considered
when formulating and solving MDPs.
Episodic & Continuing tasks:
Episodic Tasks:
● Episodic tasks are problems that have a well-defined starting point
(initial state) and a terminal point (terminal state) or goal. The agent
interacts with the environment for a finite number of time steps, and
the episode terminates when a specific goal state is reached or when
a predetermined maximum number of steps is reached.

1. Episodic tasks naturally have a finite and discrete time horizon.


2. The objective is typically to maximize the cumulative reward within a
single episode.
3. Learning and decision-making occur independently in each episode,
and the agent's behavior doesn't have to consider long-term
consequences beyond the current episode.

● Examples:
● Playing a single game of chess, where the game starts from an
initial board state, and it ends when one player wins or a draw
occurs.
● Solving a maze, where the agent starts at the entrance and
finishes upon reaching the exit.
● Training an agent to perform a specific task in a video game
level, where an episode ends when the level is completed or the
character dies.

Gt = Rt+1 + Rt+2 + Rt+3 + ··· + RT

Continuing Tasks:

● Continuing tasks, on the other hand, do not have a natural endpoint or


terminal state. The agent interacts with the environment indefinitely,
and there is no predefined limit on the number of time steps or
episodes.

1. Continuing tasks involve a potentially infinite and continuous time


horizon.
2. The objective is to maximize the agent's expected cumulative reward
over the long run, rather than within a single episode.
3. Learning and decision-making must take into account the long-term
consequences of actions because there is no natural endpoint to the
task.

● Examples:
● Stock trading, where an agent makes investment decisions over
an indefinite time horizon.

● Robot control, where a robot must continuously adapt to its


surroundings and perform tasks over time.
● Recommendation systems, where an algorithm continuously
suggests items to users based on their preferences.

Gt = Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + ···

   = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + ···)

   = Rt+1 + γGt+1
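
This recursion translates directly into a backward pass over an observed reward sequence. The short sketch below is only an illustration; the reward values are assumed, not taken from these notes.

def discounted_returns(rewards, gamma=0.99):
    # Compute G_t for every step of a finite reward sequence using the recursion
    # G_t = R_{t+1} + gamma * G_{t+1}, working backwards from the last reward.
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Rewards observed along one trajectory (illustrative values only).
print(discounted_returns([1.0, 0.0, 2.0, 1.0], gamma=0.9))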

Bellman's equation

Bellman's equation is a fundamental concept in dynamic programming and


reinforcement learning. It plays a crucial role in solving problems that involve
making a sequence of decisions over time. The equation is named after Richard E.
Bellman, who made significant contributions to the field of dynamic programming.

The basic form of Bellman's equation can be expressed as follows:

V(s) = maxa [ R(s, a) + γ Σs' P(s' | s, a) V(s') ]

or, when taking action a in state s leads deterministically to a single next state s',

V(s) = maxa [ R(s, a) + γ V(s') ]



Bellman's equation is a key component of various algorithms in reinforcement


learning, such as the Bellman equation for policy evaluation, Q-learning, and the
value iteration algorithm. These algorithms use Bellman's equation as a foundation
for finding optimal policies and values in Markov decision processes (MDPs) and
other sequential decision-making problems.

Where:
- V(s) represents the value of being in state s: the expected
cumulative reward or utility that can be obtained starting from state s and
following an optimal policy.
- a represents the action taken in state s.
- R(s, a) is the immediate reward obtained after taking action a in state s.
- γ (gamma) is the discount factor, which represents the importance of future
rewards. It is a value between 0 and 1.
- Σs' denotes a sum over all possible next states s' that can be reached from state s
by taking action a.
- P(s' | s, a) is the probability of transitioning to state s' when action a is taken in
state s.
- V(s') represents the value of the next state s'.

The objective of using Bellman's equation is to find the optimal value


function V*(s), which represents the maximum expected cumulative reward
achievable from each state under an optimal policy. Solving for V*(s) allows you
to determine the best actions to take in each state to maximize your expected return.
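
The right-hand side of the equation is just a maximization over actions of an expected one-step return. For a tabular MDP stored as arrays P[s, a, s'] and R[s, a] (a toy setup assumed here, not taken from these notes), a single Bellman backup for one state can be sketched as follows.

import numpy as np

def bellman_backup(V, P, R, gamma, s):
    # One Bellman optimality backup for state s:
    # V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
    return np.max(R[s] + gamma * P[s] @ V)

# Toy arrays: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V = np.zeros(2)
print(bellman_backup(V, P, R, gamma=0.9, s=0))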

Bellman equations for policy iteration:


Bellman's policy iteration is an iterative algorithm used in reinforcement
learning and dynamic programming to find an optimal policy for a Markov decision
process (MDP). The goal of policy iteration is to determine the best actions to take
in each state in order to maximize the expected cumulative reward.

The policy iteration algorithm consists of two main steps that are repeated
iteratively until convergence:

1. Policy Evaluation:
- In this step, we evaluate the value function for a given policy. The value
function, denoted as Vπ (s), represents the expected cumulative reward starting from
state s and following policy π thereafter.
- The value function is updated iteratively using the Bellman expectation
equation:

Vπ(s) = Σa π(a | s) Σs', r P(s', r | s, a) [ r + γ Vπ(s') ]

- Here, π(a | s) is the probability of taking action a in state s, and P(s', r | s, a)
represents the transition probabilities and rewards associated with taking action a in
state s and transitioning to state s' with reward r.
- The above equation is solved for each state until the value function converges to
a fixed point.

2. Policy Improvement:
- Once we have the value function Vπ(s) for the current policy π, we can
improve the policy by selecting, in each state, the action that maximizes the expected
return. This results in a new policy π':

π'(s) = argmaxa Σs', r P(s', r | s, a) [ r + γ Vπ(s') ]

- The new policy π' is a greedy policy with respect to the current value function
Vπ(s). It selects the action that is expected to yield the highest return in each state.
- If π' is the same as π (i.e., the policy no longer changes), the algorithm
terminates, indicating that the optimal policy has been found. Otherwise, the process
continues with policy evaluation using the new policy π'.

The policy iteration algorithm alternates between policy evaluation and


policy improvement until convergence. At convergence, the policy becomes
optimal, meaning that it maximizes the expected cumulative reward in the given
MDP. Policy iteration is guaranteed to converge to the optimal policy for finite
MDPs, although it may take several iterations to do so.

This algorithm is effective for finding the optimal policy in MDPs and is widely
used in reinforcement learning and dynamic programming.
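
As a rough sketch of the two alternating steps (not the example from the class notes), the following runs policy iteration on a small tabular MDP; the P(s', r | s, a) form used in the text is simplified here to a deterministic reward table R[s, a], and all numbers are illustrative assumptions.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    # Alternate exact policy evaluation and greedy policy improvement until the
    # policy stops changing. P has shape (S, A, S), R has shape (S, A).
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
        P_pi = P[np.arange(n_states), pi]
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated V.
        Q = R + gamma * P @ V                        # shape (S, A)
        new_pi = np.argmax(Q, axis=1)
        if np.array_equal(new_pi, pi):               # policy unchanged: optimal
            return pi, V
        pi = new_pi

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))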

Example for policy iteration given in class notes

Bellman equations for Value iteration:

Bellman's value iteration is an iterative algorithm used in reinforcement


learning and dynamic programming to find an optimal policy for a Markov decision
process (MDP). It is a powerful approach for solving MDPs and determining the
best actions to take in each state in order to maximize the expected cumulative
reward.

Value iteration is an iterative process that works as follows:

1. Initialization:
- Initialize a value function V(s) for each state s in the MDP. This can be done
arbitrarily or with an initial guess.
- Set a convergence threshold epsilon to determine when the algorithm has
converged.

2. Value Iteration:
● For each state s, update the value function V(s) using the Bellman optimality
equation:
V(s) ← maxa Σs', r P(s', r | s, a) [ r + γ V(s') ]
In this equation:
● a represents the action taken in state s.
● P(s', r | s, a) represents the transition probabilities and rewards
associated with taking action a in state s and transitioning to state s'
with reward r.
● γ (gamma) is the discount factor, which represents the importance of future
rewards.

● Update V(s) for all states simultaneously.


● Repeat this process until the change in the value function ΔV(s) for all
states is smaller than the convergence threshold ε (ΔV(s) < ε for all s).

3. Policy Extraction:
● Once the value iteration process converges, you can extract the optimal
policy π* by choosing the action that maximizes the right-hand side of the
Bellman optimality equation for each state:

π*(s) = argmaxa Σs', r P(s', r | s, a) [ r + γ V*(s') ]

● The policy π* is now the optimal policy that maximizes the expected
cumulative reward in the MDP.

Value iteration converges to the optimal policy in finite MDPs, guaranteeing


that the policy it produces is the best policy. It combines policy evaluation and
policy improvement into a single step and iteratively refines the value function until
it converges. This makes it a computationally efficient way to solve MDPs and is
widely used in reinforcement learning and dynamic programming.
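
A compact sketch of this procedure on the same kind of toy tabular MDP follows; the arrays and the tolerance value are illustrative assumptions, not the example from the class notes.

import numpy as np

def value_iteration(P, R, gamma=0.9, epsilon=1e-6):
    # Repeatedly apply V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]
    # until the largest change across states falls below epsilon, then extract
    # the greedy policy from the converged value function.
    V = np.zeros(R.shape[0])                         # arbitrary initialization
    while True:
        Q = R + gamma * P @ V                        # shape (S, A)
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:
            V = V_new
            break
        V = V_new
    pi_star = np.argmax(R + gamma * P @ V, axis=1)   # greedy policy extraction
    return pi_star, V

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(P, R))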

Example for value iteration given in class notes
