
Module 1

CSE3011 Reinforcement Learning


Credit Structure : 3-0-3
Module 1 : Introduction to RL
Topics : Elements of RL - Agent, environment Interface, Goals and rewards, RL
platforms, Applications of RL, Markov decision process (MDP), RL environment
as a MDP, Maths essentials of RL, Policy and its types, episodic and continuous
tasks, return and discount factor, fundamental functions of RL – value and Q
functions, model-based and model-free learning, types of RL environments,
Solving MDP using Bellman Equation, Algorithms for optimal policy using
Dynamic Programming -Value iteration and policy iteration, Example : Frozen
Lake problem, Limitations and Scope.

Introduction to RL
• Reinforcement Learning (RL) is one of the areas of Machine Learning (ML).
• It is a feedback-based machine learning technique in which an agent learns to behave in an
environment by performing actions and observing the results of those actions.
• For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty.
• The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve its performance by collecting the maximum
positive reward.
• It is one of the most active research areas in AI.
• It has evolved to the point of building everything from recommendation systems to self-driving cars.
• A key reason for this evolution is deep RL, a combination of deep learning (DL) and RL.
Introduction to RL..
• In Reinforcement Learning, the agent learns automatically from feedback
without any labeled data, unlike supervised learning.

• Since there is no labeled data, the agent is bound to learn from its
experience alone.

• RL solves a specific type of problem where decision making is sequential and
the goal is long-term, such as game playing, robotics, etc.
RL agents
• The goal of reinforcement learning is to train an agent to complete
a task within an uncertain environment.
• At each time step, the agent receives observations and a
reward from the environment and sends an action to the
environment.
• The reward is a measure of how successful the previous action
(taken from the previous state) was with respect to completing the
task goal.
Example
• Goal : an AI agent has to find the diamond present within a maze environment.
• The agent interacts with the environment by performing some actions, and based on those
actions, the state of the agent changes, and it also receives a reward or penalty as
feedback for its actions.
• The agent keeps doing these three things
(take an action, change state/remain in the same state, and
get feedback), and by doing so, it learns and
explores the environment.
• The agent learns which actions lead to positive feedback
or rewards and which actions lead to negative feedback or a penalty.
As a positive reward, the agent gets a positive point, and as a
penalty, it gets a negative point.
Elements of RL
1. Agent:
• It is software that learns to make intelligent decisions.
• In an RL setting, the agent is the learner.
• Ex1: a chess player is an agent; the player learns to make the best
moves (decisions) to win the game.
• Ex2 : Mario in a Super Mario Bros Video Game
Elements of RL
2. Environment :
• It is the world of the agent, within which the agent stays, takes
actions and interacts.
• Ex1: the chess board in a chess game.
• The chess player (agent) stays on the chess board to learn how to
play the game.
Elements of RL
3. State and Action:
• In an RL setup, the environment has many positions that the
agent can be in.
• Each such position is a state.
• A state is denoted by s
• Ex: in a chess-board environment, each position is a state.
Elements of RL
3. State and Action:
• The agent interacts with the environment and moves from one
state to another state by performing an action.
• Ex: In a chess-game environment, the action is the move
performed by the player(agent).
• An action is denoted by a
Elements of RL
4. Reward :
• The agent interacts with the environment by performing an action
and moves from one state to another.
• Based on the action the agent receives a reward.
• A reward is a numerical value, e.g. +1 for a good action, -1 for a bad
action.
• Ex: in a chess game, a good action is the agent's move that captures
one of the opponent's chess pieces; a bad action is the agent's move that
loses one of its own pieces to the opponent.
Basic idea of RL
• To understand the working of RL, we need to consider two main things:
• Environment: It can be anything such as a room, maze, football ground, etc.
• Agent: An intelligent agent such as an AI robot.
• Let's take the example of a maze environment in which the goal of the agent is to explore and find
the path to the diamond in as few steps as possible.
• States : S1 to S12, where S6 is a wall, S8 is a
fire pit and S4 has the diamond.
• Actions : move left, right, up and down
A typical RL setup
• The agent has two components : the policy and the RL algorithm.
• Policy :
• It is a mapping from the current environment’s
observation to a probability distribution of the
actions to be taken.
• Within an agent, the policy is implemented by a
function approximator with tunable parameters
and a specific approximation model, such as a
deep neural network.
• RL algorithm:
• The learning algorithm continuously updates the
policy parameters based on the actions,
observations, and rewards.
• The goal of the learning algorithm is to find an
optimal policy that maximizes the expected
cumulative long-term reward received during the
task.
How RL differs from other ML paradigms?
Task : train a dog to catch a ball
• Difference between RL and Supervised learning:
• In supervised learning, we would train the dog explicitly
with training data : turn left, go right, move forward
seven steps, catch the ball, and so on.
• In RL, we simply throw the ball, every time the dog
catches the ball, we give it a cookie(reward).
• So the dog will learn to catch the ball while trying to
maximize the cookies(rewards) it can get.
How RL differs from other ML paradigms?
• Difference between RL and unsupervised learning?
• Task : Movie recommendation system- recommend a new movie to the user
• Unsupervised learning: the model will recommend a new movie based on the
similar movies the user has viewed before.
• In RL, each time the user watches a movie, the agent receives feedback from the
user.
• The feedback takes the form of rewards (the rating given by the user to the movie, the time spent
watching the movie, etc.).
• Based on these rewards, the RL agent learns the movie preferences of the
user and then suggests new movies accordingly.
How RL differs from other ML paradigms?
• Hence, in supervised and unsupervised learning, the models learn from
a training data set.
• In RL, the agent learns by continuously interacting with the
environment.
• Hence the whole of RL is about the interaction between the agent and the
environment.
A typical RL algorithm
• The steps involved in a typical RL algorithm are:
1. First, the agent interacts with the environment by performing an action.
2. By performing an action, the agent moves from one state to another.
3. Then the agent will receive a reward based on the action it performed.
4. Based on the reward, the agent will understand whether the action is good or bad.
5. If the action was good, that is, if the agent received a positive reward, then the agent will
prefer performing that action, else the agent will try performing other actions in search of
a positive reward.
The goal of the agent is to maximize the reward it gets. If the agent receives a good reward, it
means it has performed a good action, and performing good actions is what lets it win the game.
Thus, the agent learns to win the game by maximizing the reward. A minimal sketch of this
interaction loop is shown below.
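The sketch below shows these five steps as a single interaction loop, assuming the third-party gymnasium package and its FrozenLake-v1 environment (any environment with the same reset/step interface would do). It is a minimal illustration of the loop only, not a learning algorithm, since this agent just samples random actions.

```python
# Minimal agent-environment interaction loop (sketch, assuming the gymnasium package).
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)   # assumed environment id
state, info = env.reset(seed=0)                     # start in an initial state

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()              # 1. agent performs an action (random here)
    next_state, reward, terminated, truncated, info = env.step(action)
    # 2. the agent has moved from `state` to `next_state`
    # 3. it received `reward` for the action it performed
    # 4./5. a learning agent would use (state, action, reward, next_state) to update its policy
    total_reward += reward
    state = next_state
    done = terminated or truncated

print("return of this episode:", total_reward)
env.close()
```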
RL agent in the grid world environment
• Environment : Grid World environment
• States : A,B,C,D,E,F,G,H and I. Shaded states are hole states.
• Goal of the agent : reach state I from state A.
• Actions : move up, down, left and right.
• Every time the agent reaches one of the shaded states it
receives a negative reward(-1).
• Every time the agent reaches one of the unshaded states it
receives a positive reward(+1).
• The first time the agent interacts with the envt (first
iteration), it performs a random action in each state, and
mostly ends up with negative rewards.
• But, over a series of iterations, it learns to perform the
correct action in each state, based on the rewards it has
obtained in that state in the previous iterations and hence
reaches the goal.
RL agent in the Grid World Environment
• Iteration 1:
RL agent in the Grid World Environment
• Iteration 2:
RL agent in the Grid World Environment
• Iteration 3:
RL agent in the Grid World Environment
• As a result of Iteration 3, the agent reaches the goal state without visiting the shaded states.
• The agent has successfully learnt to reach the goal state I from state A, without visiting the
shaded states, based on the rewards.
• The goal of the agent is to maximize the rewards and
ultimately achieve the goal.
• Each iteration is known as an episode in RL terms.
Types of RL environments
• Deterministic environment : when an agent in state s performs an action a,
it is certain that it always reaches the same next state s'.
Ex: Chess – there are only a few possible moves for a piece in the current
state, and these moves can be determined.
Types of RL environments
• Stochastic environment : when an agent in state s performs an action a,
we cannot say that it always reaches the same next state s'.
• We cannot determine the outcome of the action in the current state.
• This is due to the randomness in the stochastic environment.
Ex1: Self-driving cars – the outcome of a self-driving car's actions is not unique; it varies from time to time.
Ex2: A radio station is a stochastic environment where the listener does not know which song
comes next.
Types of RL environments
3. Discrete Environment: The action space of the environment is discrete.
Ex: action space of the grid world environment is [up, down, left, right].
4. Continuous environment: The action space of the environment is
continuous.
Ex1: If we train an agent to drive a car, the action space involves continuous
quantities such as [the car's speed, the number of degrees to rotate the wheel,
etc.].
Ex2: In a basketball game, the positions of the players (environment) keep
changing continuously, and shooting (action) the ball towards the basket can
be done with different angles and speeds, so there are infinite possibilities.
Types of RL environments
5. Episodic/Non-Sequential Environment: the agent’s current
action will not affect the future actions
Ex: A support bot (agent) answers one question, then another,
and so on; each question-answer pair is a single episode.
6. Non-Episodic/Sequential Envt: the agent’s current action will
affect its future actions.
Ex: a chess-board is a sequential environment since the agent’s
current action will affect its future actions in a chess match.
7. Single and Multi-agent environments:
• Single agent environment where an environment is explored by a single agent. All actions
are performed by a single agent in the environment.

• Real-life Example: Playing tennis against the ball is a single agent environment where there
is only one player.

• If two or more agents are taking actions in the environment, it is known as a multi-agent
environment.

• Real-life Example: Playing a soccer match is a multi-agent environment.


Markov Decision Process(MDP)
• It provides a mathematical framework for solving the RL problem
• MDP is mainly used to study optimization problems via dynamic programming.
• A Markov decision process (MDP) refers to a stochastic decision-making process that
uses a mathematical framework to model the decision-making of a dynamic system.
• It is used in scenarios where the results are either random or controlled by a decision
maker, which makes sequential decisions over time.
• MDPs evaluate which actions the decision maker should take considering the current
state and environment of the system.
• Almost all RL problems can be modelled as an MDP.
• In artificial intelligence, MDPs model sequential decision-making scenarios with probabilistic
dynamics.
• They are used to design intelligent machines or agents that need to operate for long periods in an
environment where actions can yield uncertain results.
MDP
• MDP uses two entities namely Markov property and Markov chain.
• Markov property : the future depends only on the present and not on the past.
• Markov chain/Markov process : has a sequence of states that strictly obey the Markov
property.
• Markov chain is a probabilistic model that solely depends on the current state to predict the
next state and not the previous states.
• The future is conditionally independent of the past.
• Ex : if the current state of weather is cloudy, we can predict the next state to be rainy.
• We made this prediction only based on the current state(cloudy) and not on the previous
states which might be sunny, windy, etc.
• The Markov property is not valid for all processes.
• Ex: the number obtained when throwing a dice (the next state) doesn't depend on the previous
number that showed up on the dice (the current state).
MDP
• MDP uses states and state transition probabilities.
• The transition probability P(s'|s) is the probability of moving from the current state s to the
next state s'.
• State transition probabilities can be represented using a Markov table, a state
diagram or a transition matrix.
• Hence, a Markov process consists
of a set of states along with their
transition probabilities.
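As a small illustration of a transition matrix, the sketch below encodes a hypothetical three-state weather chain (the probability values are made up for illustration) and samples the next state using only the current state:

```python
# Hypothetical weather Markov chain: the next state depends only on the current state.
import numpy as np

states = ["sunny", "cloudy", "rainy"]
# P[i, j] = probability of moving from states[i] to states[j] (illustrative numbers).
P = np.array([
    [0.6, 0.3, 0.1],   # from sunny
    [0.3, 0.4, 0.3],   # from cloudy
    [0.2, 0.4, 0.4],   # from rainy
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

rng = np.random.default_rng(0)
current = states.index("cloudy")
next_state = rng.choice(len(states), p=P[current])   # uses only the current state
print("cloudy ->", states[next_state])
```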
MDP…
• Markov Reward Process (MRP) :
• An extension of the Markov chain with a reward function.
• A reward function gives the reward we obtain in each state.
• An MRP consists of states S, transition probabilities P(s'|s) and a reward function R(s).
• Markov Decision Process (MDP) :
• An extension of the MRP with states S, actions A, transition probabilities P(s'|s,a) and a
reward function R(s,a,s').
• In an RL setup, the agent makes decisions based only on the current state and not on the
past states.
• Hence we can model an RL problem as an MDP.
Grid World as MDP
• Goal : the agent has to move from state A to state I, without visiting
the shaded states.
• States : set of states, from A to I
• Actions : a set of actions that our agent can perform in each state as in
up, down, left, right
• Transition probability: the probability P(s'|s,a) of moving from the current
state s to the next state s' by performing action a.
• Ex: in a deterministic grid world, P(D|A, down) = 1, i.e. moving down from A always leads to D.
Grid World as MDP
• Reward function: the reward R(s,a,s') the agent receives while moving from state s to
state s' while performing action a.
• Ex: R(A, down, D) = +1, since D is an unshaded state; R(A, right, B) = -1, since B is a shaded state.
• A small code sketch of this MDP model is shown below.
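A minimal sketch of how this MDP model could be written down in code, assuming a deterministic 3x3 layout (A B C / D E F / G H I) with B treated as a shaded hole state; the layout and reward values are illustrative, following the +1/-1 convention above.

```python
# Sketch of the grid-world MDP model: transition probabilities P and rewards R.
# Layout assumed: A B C / D E F / G H I, deterministic moves, B treated as a shaded state.

SHADED = {"B"}          # illustrative set of hole states

# P[(s, a)] = next state s' (deterministic, so its probability is 1)
P = {
    ("A", "down"): "D",
    ("A", "right"): "B",
    ("D", "right"): "E",
    # ... remaining (state, action) pairs would be filled in the same way
}

def R(s, a, s_next):
    """Reward for moving from s to s_next with action a (+1 unshaded, -1 shaded)."""
    return -1.0 if s_next in SHADED else 1.0

print(P[("A", "down")], R("A", "down", P[("A", "down")]))     # D 1.0
print(P[("A", "right")], R("A", "right", P[("A", "right")]))  # B -1.0
```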
Fundamental concepts of RL
• Maths essentials : Expectation of a random variable X
• A random variable takes values from a random experiment such as throwing a dice, tossing a coin, etc.
• Ex : if we throw a fair dice, the possible outcomes (X) are 1, 2, 3, 4, 5 and 6.
• The probability of occurrence of each of these outcomes is 1/6.

• Find the average value of the random variable X?

Ans : take the weighted average of X :
E[X] = Σ_x x P(X = x) = (1 + 2 + 3 + 4 + 5 + 6) × 1/6 = 3.5
Fundamental concepts of RL
• Expectation of a function of a random variable X :
E[f(X)] = Σ_x f(x) P(X = x)
• Ex: for the fair dice above, taking f(X) = X^2 gives E[f(X)] = (1 + 4 + 9 + 16 + 25 + 36) × 1/6 ≈ 15.17
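A short numerical check of these two expectations in plain Python, with f(X) = X^2 as an illustrative choice of function:

```python
# Expectation of a fair dice outcome X and of a function f(X) = X**2 (illustrative).
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6                                   # each outcome is equally likely

E_X = sum(x * p for x in outcomes)          # weighted average of X
E_fX = sum((x ** 2) * p for x in outcomes)  # weighted average of f(X)

print(E_X)    # 3.5
print(E_fX)   # 15.166...
```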
Fundamental concepts of RL
• Action Space : the set of all possible actions in the environment
• Ex: for the grid world environment, the action space is [up, down, left, right]
• Types of Action Space: discrete and continuous
• A discrete action space has actions that are discrete.
• Ex: the action space of the grid world environment
• A continuous action space has actions that are continuous.
• Ex: when training an agent to drive a car, the actions are continuous in nature, such as the speed,
the number of degrees to rotate the wheel, etc.
Fundamental concepts of RL
• A policy defines the agent’s behaviour in an environment.
• It tells the agent what action to perform in each state.
• Ex: in the grid world envt with states A to I and 4 actions, a policy may
tell the agent to move down in state A, move right in state D, and so on.
• In the first iteration, the agent starts with a random policy, taking a random
action in each state.
• It learns whether the actions taken in each state are good or bad based on the
reward it gets.
• Over a series of iterations, the agent learns a good policy that gets a positive
reward.
• This good(optimal) policy is the policy that gets the agent a good reward
and helps the agent to reach the goal state.
Fundamental concepts of RL..
• Types of policy : deterministic and stochastic
• Ex of an optimal policy
• Deterministic Policy:
• This policy tells the agent to perform one particular
action in a state.
• Denoted by μ
• If the agent is in state s at time t, the deterministic
policy tells the agent to perform action a, expressed by
a_t = μ(s_t)

Ex : μ(A) = down, i.e. in state A this policy always selects the action down.
• Stochastic policy :
• This policy doesn’t map a state to one particular action.
• It maps a state to a probability distribution over an
action space
• Denoted by π; the probability of choosing action a in state s is written π(a|s)
• Ex: if the stochastic policy for state A over the 4 action
space [up,down,left, right] is [0.10,0.70,0.10,0.10]
respectively, when the agent in state A, it chooses
action ‘up’ 10% of the time, ‘down’ 70% of the time, left
10% of the time and right 10% of the time.
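A small sketch of sampling from this stochastic policy for state A, using the probabilities quoted above (the dictionary-based representation is just one possible encoding):

```python
# Sampling an action from a stochastic policy pi(a|s) for state A.
import numpy as np

actions = ["up", "down", "left", "right"]
pi = {"A": [0.10, 0.70, 0.10, 0.10]}        # pi(a|A) as given above

rng = np.random.default_rng(0)
samples = rng.choice(actions, size=1000, p=pi["A"])
print((samples == "down").mean())           # roughly 0.70: 'down' about 70% of the time
```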
Fundamental concepts of RL..
• Types of Stochastic policy : Categorical and Gaussian
• Categorical policy:
• If the action space of a stochastic policy is discrete,
then it is a categorical policy
• Prob distributions are taken over a discrete action space
• Gaussian policy
• A stochastic policy whose action space is continuous
• Its uses a Gaussian prob distribution over an action space
• Ex: if we are training an agent to drive a car, there is a continuous action in our action space –
speed of the car whose value ranges from 0 to 150kmph.
• The stochastic policy uses the Gaussian distribution over the action space to select an action
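A tiny sketch of a Gaussian policy for the car-speed action, with an assumed mean and standard deviation (illustrative numbers only):

```python
# Gaussian policy over a continuous action (car speed in km/h), illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)
mean_speed, std_speed = 60.0, 10.0           # assumed policy outputs for some state
speed = rng.normal(mean_speed, std_speed)    # sample an action from N(mean, std)
speed = float(np.clip(speed, 0.0, 150.0))    # keep it inside the valid range 0-150 km/h
print(speed)
```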
• Episode: the agent-environment interaction starting from the initial state until the final state
is called an episode
• Often known as a trajectory(the path taken by the agent)
• Denoted by τ
• An agent can play a game for any number of episodes; each episode is independent of the others.
• What is the use of playing the same game for multiple episodes?
• To learn the optimal policy, that is, the policy that tells the agent to perform the correct
action in each state
• The episode information is of the form (state, action, reward) starting from the initial state to
the final state, i.e. τ = (s_0, a_0, r_0, s_1, a_1, r_1, …, s_T)
Episode and optimal policy in Grid world Envt
• The agent generates the first episode using a random policy
• Explores the envt over several episodes to learn an optimal policy
• Episode 1
• Episode 2 : the agent tries a different policy to avoid the negative
rewards it got in the previous episode
• Episode n: Over a series of episodes, the agent learns the optimal
policy, the policy that takes the agent from state A to state I,
without visiting the shaded states and also maximising the
rewards.

Episodic and continuous tasks
• Episodic tasks: tasks made up of episodes; thus they have a terminal state
• Ex: car racing game
• Continuous tasks: do not have any episodes and so don’t have any terminal state.
• Ex: a personal assistance robot does not have a terminal state
• Horizon : the time step until which the agent interacts with the envt.
• Types : finite and infinite horizon
• Finite horizon : the agent-envt interaction stops at a particular time step.
• Ex: in an episodic task, the agent-envt interaction stops after the agent reaches the final
state T.
• Infinite horizon: the agent-envt interaction never stops
• Ex: a continuous task without final state has an infinite horizon.
Return and discount factor
• Return : the sum of the rewards obtained by an agent in an episode.
• Denoted by R(τ) or G.
• Ex: if the agent starts at the initial state at time step t = 0 and reaches the final
state at time step T, then the return obtained by the agent is
R(τ) = r_0 + r_1 + r_2 + … + r_{T-1}

• Ex: for a trajectory from A to I whose rewards are 1, 1, 1 and 1, the return is
R(τ) = 1 + 1 + 1 + 1 = 4

Return and discount factor..
• So the goal of the agent is to maximise the return, i.e, maximise the sum of the
rewards obtained over an episode.
• How can we maximise this return? How can we perform the correct action in
each state?
• By using the optimal policy – the policy that gets our agent the maximum
return (sum of the rewards) by performing the correct action in each
state.
• How do we define the return for continuous tasks, where there is no terminal state?
• Return for continuous tasks – the sum of the rewards up to infinity.
Return and discount factor..
• How can we maximise a return that sums to infinity?
• By using a discount factor γ.

• The discount factor prevents the return from reaching infinity by
deciding how much importance we give to immediate rewards
versus future rewards :
R(τ) = r_0 + γ r_1 + γ^2 r_2 + … = Σ_t γ^t r_t
• Its value ranges from 0 to 1.
Return and discount factor..
• If the discount factor is small (close to 0), we give
more importance to immediate rewards than to future rewards.
• If the discount factor is large (close to 1), we give
more importance to future rewards than to immediate rewards.
Return and discount factor..
• What happens when the discount factor is small, e.g. γ = 0.2?
R(τ) = r_0 + 0.2 r_1 + 0.04 r_2 + 0.008 r_3 + …

• The weights on later rewards shrink very quickly, so when we set the discount factor to a
small value, we give more importance to immediate rewards than to future rewards.
Return and discount factor..
• What happens when the discount factor is large, e.g. γ = 0.9?
R(τ) = r_0 + 0.9 r_1 + 0.81 r_2 + 0.729 r_3 + …
• The weights on later rewards decay slowly, so future rewards still carry significant weight.
Return and discount factor..
• What happens when the discount factor is set to 0, i.e. γ = 0?
R(τ) = r_0

• The return is just the immediate reward.

Return and discount factor..
• What happens when the discount factor is set to 1, i.e. γ = 1?
R(τ) = r_0 + r_1 + r_2 + …

• The return is just the undiscounted sum of the rewards up to infinity.

Return and discount factor..
• If the discount factor is set to 0, the agent never learns about the future, as it
considers only the immediate reward.
• If the discount factor is set to 1, the agent keeps learning forever,
looking for future rewards whose sum goes to infinity.
• So, in practice the discount factor is typically set between 0.2 and 0.8.
• For certain tasks future rewards are more important than
immediate rewards, and vice versa.
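A short sketch comparing discounted returns for the same (hypothetical) reward sequence under different discount factors:

```python
# Discounted return R = sum_t gamma**t * r_t for a hypothetical reward sequence.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 1, -1, 1, 1]          # illustrative rewards of one episode
for gamma in (0.0, 0.2, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0 keeps only the first reward; gamma = 1.0 is the plain sum of the rewards.
```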
Value function
• Value function also called the state value function gives the value
of a state.
• The value of a state s is the return of the trajectory τ starting
from that state and following a policy π until the final state.

• The policy could be deterministic or stochastic.

• A deterministic policy maps each state to one particular action.
• A stochastic policy selects an action for a state based on a probability
distribution over the action space.
Value function for a deterministic policy
• Suppose the trajectory τ for the grid world environment, using some
deterministic policy π, is (for example) :
A -(down, +1)-> D -(right, +1)-> E -(right, +1)-> F -(down, +1)-> I
Value function for a deterministic policy…
• The value function can be calculated for each state as the
return (sum of the rewards) of the trajectory starting from that
state :
V(A) = 1 + 1 + 1 + 1 = 4, V(D) = 1 + 1 + 1 = 3, V(E) = 1 + 1 = 2, V(F) = 1

• The value function of the final state is zero, since a reward is
associated only with a state that has an outgoing transition.
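A sketch that computes these state values directly from the trajectory above (the state and reward lists are the hypothetical example, and a discount factor of 1 is used, matching the plain sums):

```python
# State values along one trajectory: V(s_t) = sum of the rewards from time t onward.
states  = ["A", "D", "E", "F", "I"]       # visited states (hypothetical trajectory above)
rewards = [1, 1, 1, 1]                    # reward received on each transition

V = {}
running_return = 0.0
for t in reversed(range(len(rewards))):   # accumulate returns from the back of the episode
    running_return += rewards[t]
    V[states[t]] = running_return
V[states[-1]] = 0.0                       # final state has no outgoing transition

print(V)   # {'F': 1.0, 'E': 2.0, 'D': 3.0, 'A': 4.0, 'I': 0.0}
```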
Value function for stochastic policy
• The (expected) value function of a state under a stochastic policy is the
expected return that the agent would get starting from that state s
and following the stochastic policy π.
• The return of a state in a trajectory τ, following a stochastic policy π, is
a random variable.
• It takes different values with some probability in each trajectory.
• It is expressed as :
V^π(s) = E_{τ~π} [ R(τ) | s_0 = s ]
Value function for stochastic policy…
• Ex: In state A, the stochastic policy gives a probability distribution over
the action space [up, down, left, right] as [0.0, 0.8, 0.0, 0.2], i.e.
perform the action down 80% of the time, that is, π(down|A) = 0.8,
and the action right 20% of the time, that is, π(right|A) = 0.2.
• This gives two trajectories from state A.
• Assume the stochastic policy selects "right" in states D and E and
"down" in B and F 100% of the time.
Value function for stochastic policy…
• The first trajectory (chosen with probability 0.8, starting with down in A) is :
A -(down, +1)-> D -(right, +1)-> E -(right, +1)-> F -(down, +1)-> I

• The value of state A is the return (sum of the rewards) of the trajectory
starting from state A.
• Thus, V(A) = R(τ₁) = 1 + 1 + 1 + 1 = 4.
Value function for stochastic policy…
• The second trajectory (chosen with probability 0.2, starting with right in A) is :
A -(right, -1)-> B -(down, +1)-> E -(right, +1)-> F -(down, +1)-> I
• The value of state A is again the return (sum of the rewards) of the trajectory starting from state A.
• Thus, V(A) = R(τ₂) = -1 + 1 + 1 + 1 = 2.
• So, even for the same policy, V(A) differs across trajectories.
• For this policy, the return is 4 for 80% of the time and 2 for 20% of the time.
Value function for stochastic policy…
• The value of a state for a stochastic policy is the expected return of the trajectory starting from
that state :
V^π(s) = E_{τ~π} [ R(τ) | s_0 = s ]
• The expected return is the weighted average : the sum of the returns, each multiplied by its probability.
• So, V(A) = 0.8 × 4 + 0.2 × 2 = 3.6.
Value function for stochastic policy…
• Thus, the value of a state is the expected return of the trajectory starting from that state.
• Value function depends on the policy.
• There can be many value functions for a state, according to different policies.
• The optimal value function V*(s) is the maximum value of the state among all its value
functions :
V*(s) = max_π V^π(s)

Ex: Textbook Pg. 33. We can find the optimal value of a state from a value table.
Q function
• Q function denotes the value of a state-action pair for a particular
state, s.
• It is the return that the agent will obtain starting from a state s,
performing an action a, and thereafter following a policy π :
Q^π(s, a) = E_τ [ R(τ) | s_0 = s, a_0 = a ]
• It is also known as the state-action value function.

• Ex: Given the trajectory τ used above :
A -(down, +1)-> D -(right, +1)-> E -(right, +1)-> F -(down, +1)-> I
Q function…
• Find the Q values of (A, down) and (D, right) for the trajectory above :
• Ans 1: Q(A, down) = 1 + 1 + 1 + 1 = 4
• Ans 2: Q(D, right) = 1 + 1 + 1 = 3
• Since the return is a random variable taking different values with
some probability, instead of taking the return directly, we take
the expected return.
Q function….
• The Q function depends on the policy.
• There can be many Q values for an (s, a) pair, depending on the policy.
• The optimal Q value of an (s, a) pair is :
Q*(s, a) = max_π Q^π(s, a)
• The optimal policy π* is the policy that gives the maximum Q value for every (s, a).
• Given a Q table, we can find the optimal policy π* by picking, in each state, the action with the
highest Q value.
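A short sketch of reading a greedy (optimal with respect to the table) policy out of a Q table; the Q values here are made-up illustrative numbers:

```python
# Extract a greedy policy from a Q table: pi*(s) = argmax_a Q(s, a).
Q = {   # illustrative Q values for two states
    "A": {"up": 0.1, "down": 0.9, "left": 0.0, "right": 0.2},
    "D": {"up": 0.2, "down": 0.1, "left": 0.0, "right": 0.8},
}

policy = {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}
print(policy)   # {'A': 'down', 'D': 'right'}
```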
Model-based and model-free learning
• Model-based learning : the agent finds/learns the optimal policy
by using the model dynamics of the environment.
• The model dynamics of the environment are defined by
• 1. the state transition probabilities P(s'|s,a) and 2. the reward function R(s,a,s')

• Model-free learning : the agent tries to find/learn the optimal
policy without using the model dynamics of the environment.
Bellman Equation and Dynamic Programming
• In RL, the agent has to learn an optimal policy.
• An optimal policy selects the correct action for the agent in each
state, so that the agent can get the maximum return and achieve
its goal.
• Two classical RL algorithms – value iteration and policy iteration – help
the agent learn an optimal policy.
• These algorithms are model-based and use Dynamic Programming.
• The Bellman equation is used in RL to find the optimal value function
and the optimal Q function recursively.
• These are then used to find the optimal policy; a minimal value iteration sketch on a toy
grid-world MDP is given below.
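The following sketch implements value iteration on a small, hand-written deterministic grid-world MDP. The layout, hole states, and reward convention (+1 only for reaching the goal, -1 for entering a shaded state, 0 otherwise) are illustrative assumptions in the spirit of the A–I grid above, not the exact textbook problem. Policy iteration would use the same Bellman backup but alternate policy evaluation and policy improvement steps.

```python
# Value iteration on a toy 3x3 grid world with states A..I (illustrative model).
GAMMA = 0.9          # discount factor
THETA = 1e-6         # convergence threshold

grid = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]
states = [s for row in grid for s in row]
actions = ["up", "down", "left", "right"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
pos = {grid[r][c]: (r, c) for r in range(3) for c in range(3)}

shaded = {"B", "H"}  # assumed hole states
goal = "I"

def step(s, a):
    """Deterministic model dynamics: returns (next state, reward)."""
    r, c = pos[s]
    dr, dc = moves[a]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < 3 and 0 <= nc < 3):    # hitting the boundary: stay in place
        return s, 0.0
    s_next = grid[nr][nc]
    if s_next == goal:
        return s_next, 1.0
    if s_next in shaded:
        return s_next, -1.0
    return s_next, 0.0

def q_value(s, a, V):
    """Bellman backup for one state-action pair: R(s,a,s') + gamma * V(s')."""
    s_next, r = step(s, a)
    return r + GAMMA * V[s_next]

# Value iteration: repeatedly apply V(s) <- max_a [ R(s,a,s') + gamma * V(s') ].
V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if s == goal:
            continue                          # terminal state keeps value 0
        best = max(q_value(s, a, V) for a in actions)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Extract the greedy (optimal w.r.t. V) policy from the converged value function.
policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states if s != goal}

print({s: round(v, 3) for s, v in V.items()})
print(policy)
```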
Bellman Equation?
• Bellman equation?
• Bellman optimality equation?
• Relationship between value and Q function?
• Dynamic Programming – value and policy iteration methods?
• Solving the Frozen Lake problem using value and policy iteration
methods.
Bellman equation for the value function
• As per the Bellman equation, the value of a state is the sum of the
immediate reward and the discounted value of the next state :
V(s) = r + γ V(s')
Bellman equation for the value function
• In a deterministic environment:
• Ex: Given a trajectory τ using some policy π as :
s_1 -(a_1, r_1)-> s_2 -(a_2, r_2)-> s_3

• Using the Bellman equation, find V(s_1).

• Ans:
V^π(s_1) = r_1 + γ V^π(s_2) = r_1 + γ ( r_2 + γ V^π(s_3) )
• Hence the Bellman equation of the value function for a deterministic environment associated with a
policy π is :
V^π(s) = R(s, a, s') + γ V^π(s'),  where a = π(s)
• The right-hand side term is known as the Bellman backup.
Bellman equation for the value function
• In a stochastic environment:
Bellman equation for the value function
• Modify the Bellman equation of the value function with an
expectation (weighted average) :
• each Bellman backup is multiplied by the transition probability of the
corresponding next state.
Bellman equation for the value function
• Ex: for the same trajectory, when the environment is stochastic, V(s_1) becomes the
sum of the Bellman backups over all possible next states, each weighted by its
transition probability.
• Hence the Bellman equation for the value function for a
stochastic environment for a policy π is :
V^π(s) = Σ_{s'} P(s'|s,a) [ R(s, a, s') + γ V^π(s') ],  where a = π(s)
Bellman equation for the value function
• What if the policy itself is stochastic? Instead of performing the same action in a state, we
select an action based on the probability distribution π(a|s) over the action space.
• Ex: in state A, the policy might choose down with probability 0.8 and right with probability 0.2.
Bellman eqn for value function…
• To include the stochasticity present in the environment in the Bellman equation, we took
the expectation (the weighted average), that is, a sum of the Bellman backups multiplied by
the corresponding transition probabilities of the next states.
• Similarly, to include the stochastic nature of the policy in the Bellman equation, we can use
the expectation (the weighted average), that is, a sum of the Bellman backups multiplied by
the corresponding probabilities of the actions.

• Using both expectations :
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s, a, s') + γ V^π(s') ]
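A small numerical sketch of this equation for a single state, using made-up probabilities, rewards, and next-state value estimates:

```python
# One Bellman expectation backup: V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma*V(s')].
GAMMA = 0.9

pi = {"down": 0.8, "right": 0.2}                      # pi(a|A), as in the earlier example
# P[a] maps next state -> probability; R[a] maps next state -> reward (illustrative numbers).
P = {"down": {"D": 1.0}, "right": {"B": 1.0}}
R = {"down": {"D": 1.0}, "right": {"B": -1.0}}
V = {"D": 3.0, "B": 1.0}                              # assumed current value estimates

v_A = sum(
    pi[a] * sum(p * (R[a][s_next] + GAMMA * V[s_next]) for s_next, p in P[a].items())
    for a in pi
)
print(round(v_A, 3))   # 0.8*(1 + 0.9*3) + 0.2*(-1 + 0.9*1) = 2.94
```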
Bellman Eqn of the Q function
• For a deterministic envt : the Bellman eqn of the Q function says that
the Q value of a state-action pair is the sum of the immediate reward
and the discounted Q value of the next state-action pair :
Q^π(s, a) = R(s, a, s') + γ Q^π(s', a'),  where a' = π(s')
• Ex: Given a trajectory τ using some policy π, find the Q value of (s_1, a_1) :
• Ans: Q^π(s_1, a_1) = r_1 + γ Q^π(s_2, a_2)
Bellman Eqn of the Q function…
• For a stochastic environment: when an agent in state s performs an
action a, the next state is not always the same.
• The Bellman eqn of the Q function for a stochastic envt uses the
expectation (weighted average), that is, a sum of the Bellman
backups multiplied by their corresponding transition probabilities of
the next states.
• The Bellman equation of the Q function is:
Q^π(s, a) = Σ_{s'} P(s'|s,a) [ R(s, a, s') + γ Σ_{a'} π(a'|s') Q^π(s', a') ]