
Module 1

CSE3011 Reinforcement Learning


Credit Structure : 3-0-3
Module 1 : Introduction to RL
Topics : Elements of RL - Agent, environment Interface, Goals and rewards, RL
platforms, Applications of RL, Markov decision process (MDP), RL environment
as a MDP, Maths essentials of RL, Policy and its types, episodic and continuous
tasks, return and discount factor, fundamental functions of RL – value and Q
functions, model-based and model-free learning, types of RL environments,
Solving MDP using Bellman Equation, Algorithms for optimal policy using
Dynamic Programming -Value iteration and policy iteration, Example : Frozen
Lake problem, Limitations and Scope.

Introduction to RL
• Reinforcement Learning (RL) is one of the areas of Machine Learning (ML).
• It is a feedback-based machine learning technique in which an agent learns to behave in an
environment by performing actions and observing the results of those actions.
• For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty.
• The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve its performance by collecting the maximum
positive reward.
• It is one of the most active research areas in AI.
• It has evolved to the point of building everything from recommendation systems to self-driving cars.
• A key reason for this evolution is deep RL, a combination of deep learning (DL) and RL.
Introduction to RL..
• In Reinforcement Learning, the agent learns automatically from feedback
without any labeled data, unlike supervised learning.

• Since there is no labeled data, the agent is bound to learn from its
experience alone.

• RL solves a specific type of problem where decision making is sequential and
the goal is long-term, such as game playing, robotics, etc.
RL agents
• The goal of reinforcement learning is to train an agent to complete
a task within an uncertain environment.
• At each time step, the agent receives observations and a
reward from the environment and sends an action to the
environment.
• The reward is a measure of how successful the previous action
(taken from the previous state) was with respect to completing the
task goal.
Example
• Goal : an AI agent has to find the diamond present within a maze environment.
• The agent interacts with the environment by performing some actions, and based on those
actions, the state of the agent changes, and it also receives a reward or penalty as
feedback for its actions.
• The agent keeps doing these three things
(take an action, change state/remain in the same state, and
get feedback), and by doing so, it learns and
explores the environment.
• The agent learns which actions lead to positive feedback
or rewards and which actions lead to negative feedback or a penalty.
As a positive reward, the agent gets a positive point, and as a
penalty, it gets a negative point.
Elements of RL
1. Agent:
• It is software that learns to make intelligent decisions.
• In an RL setting, the agent is the learner.
• Ex1: a chess player is an agent; the player learns to make the best
moves (decisions) to win the game.
• Ex2 : Mario in a Super Mario Bros Video Game
Elements of RL
2. Environment :
• It is the world of the agent, within which the agent stays, takes
actions and interacts.
• Ex1: the chess board in a chess game.
• The chess player (agent) stays on the chess board to learn how to
play the game.
Elements of RL
3. State and Action:
• In an RL setup, the environment has many positions that the
agent can be in.
• Each such position is a state.
• A state is denoted by s
• Ex: in a chess-board environment, each position is a state.
Elements of RL
3. State and Action:
• The agent interacts with the environment and moves from one
state to another state by performing an action.
• Ex: In a chess-game environment, the action is the move
performed by the player(agent).
• An action is denoted by a
Elements of RL
4. Reward :
• The agent interacts with the environment by performing an action
and moves from one state to another.
• Based on the action the agent receives a reward.
• A reward is a numerical value, e.g. +1 for a good action, -1 for a bad
action.
• Ex: in a chess game, a good action is the agent's move that captures
one of the opponent's chess pieces; a bad action is the agent's move that
loses one of its own pieces to the opponent.
Basic idea of RL
• To understand the working of RL, we need to consider two main things:
• Environment: It can be anything such as a room, maze, football ground, etc.
• Agent: An intelligent agent such as an AI robot.
• Let's take the example of a maze environment in which the goal of the agent is to explore and find
the path to the diamond in as few steps as possible.
• States : S1 to S12, where S6 is a wall, S8 is a
fire pit and S4 has the diamond.
• Actions : move left, right, up and down
A typical RL setup
• The agent has two components : the policy and the RL algorithm.
• Policy :
• It is a mapping from the current environment’s
observation to a probability distribution of the
actions to be taken.
• Within an agent, the policy is implemented by a
function approximator with tunable parameters
and a specific approximation model, such as a
deep neural network.
• RL algorithm:
• The learning algorithm continuously updates the
policy parameters based on the actions,
observations, and rewards.
• The goal of the learning algorithm is to find an
optimal policy that maximizes the expected
cumulative long-term reward received during the
task.
How RL differs from other ML paradigms?
Task : train a dog to catch a ball
• Difference between RL and Supervised learning:
• In supervised learning, we would train the dog explicitly
with training data : turn left, go right, move forward
seven steps, catch the ball, and so on.
• In RL, we simply throw the ball, every time the dog
catches the ball, we give it a cookie(reward).
• So the dog will learn to catch the ball while trying to
maximize the cookies(rewards) it can get.
How RL differs from other ML paradigms?
• Difference between RL and unsupervised learning?
• Task : Movie recommendation system- recommend a new movie to the user
• Unsupervised learning: the model will recommend a new movie based on the
similar movies the user has viewed before.
• In RL, each time the user watches a movie, the agent receives feedback from the
user.
• The feedback takes the form of rewards (the rating given by the user to the movie, the time spent
watching the movie, etc.).
• Based on these rewards, the RL agent learns the movie preferences of the
user and then suggests new movies accordingly.
How RL differs from other ML paradigms?
• Hence, in supervised and unsupervised learning, the models learn from
a training data set.
• In RL, the agent learns by continuously interacting with the
environment.
• Hence the whole of RL is about the interaction between the agent and the
environment.
A typical RL algorithm
• The steps involved in a typical RL algorithm are:
1. First, the agent interacts with the environment by performing an action.
2. By performing an action, the agent moves from one state to another.
3. Then the agent will receive a reward based on the action it performed.
4. Based on the reward, the agent will understand whether the action is good or bad.
5. If the action was good, that is, if the agent received a positive reward, then the agent will
prefer performing that action, else the agent will try performing other actions in search of
a positive reward.
The goal of the agent is to maximize the reward it gets. If the agent receives a good reward, it
means it has performed a good action, and performing good actions is what lets it win the game.
Thus, the agent learns to win the game by maximizing the reward. A minimal sketch of this
interaction loop is shown below.
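The sketch below shows these five steps as a single interaction loop, assuming the third-party gymnasium package and its FrozenLake-v1 environment (any environment with the same reset/step interface would do). It is a minimal illustration of the loop only, not a learning algorithm, since this agent just samples random actions.

```python
# Minimal agent-environment interaction loop (sketch, assuming the gymnasium package).
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)   # assumed environment id
state, info = env.reset(seed=0)                     # start in an initial state

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()              # 1. agent performs an action (random here)
    next_state, reward, terminated, truncated, info = env.step(action)
    # 2. the agent has moved from `state` to `next_state`
    # 3. it received `reward` for the action it performed
    # 4./5. a learning agent would use (state, action, reward, next_state) to update its policy
    total_reward += reward
    state = next_state
    done = terminated or truncated

print("return of this episode:", total_reward)
env.close()
```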
RL agent in the grid world environment
• Environment : Grid World environment
• States : A,B,C,D,E,F,G,H and I. Shaded states are hole states.
• Goal of the agent : reach state I from state A.
• Actions : move up, down, left and right.
• Every time the agent reaches one of the shaded states it
receives a negative reward(-1).
• Every time the agent reaches one of the unshaded states it
receives a positive reward(+1).
• The first time the agent interacts with the envt (first
iteration), it performs a random action in each state, and
mostly ends up with negative rewards.
• But, over a series of iterations, it learns to perform the
correct action in each state, based on the rewards it has
obtained in that state in the previous iterations and hence
reaches the goal.
RL agent in the Grid World Environment
• Iteration 1:
RL agent in the Grid World Environment
• Iteration 2:
RL agent in the Grid World Environment
• Iteration 3:
RL agent in the Grid World Environment
• As a result of Iteration 3, the agent reaches the goal state without visiting the shaded states.
• The agent has successfully learnt to reach the goal state I from state A, without visiting the
shaded states, based on the rewards.
• The goal of the agent is to maximize the rewards and
ultimately achieve the goal.
• Each iteration is known as an episode in RL terms.
Types of RL environments
• Deterministic environment : when an agent in state s performs an action a,
it is certain that it always reaches the same next state s'.
Ex: Chess – there are only a few possible moves for a piece in the current
state, and these moves can be determined.
Types of RL environments
• Stochastic environment : when an agent in state s performs an action a,
we cannot say that it always reaches the same next state s'.
• We cannot determine the outcome of the action in the current state.
• This is due to the randomness in the stochastic environment.
Ex1: Self-driving cars – the outcome of a self-driving car's actions is not unique; it varies from time to time.
Ex2: A radio station is a stochastic environment where the listener does not know which song
comes next.
Types of RL environments
3. Discrete Environment: The action space of the environment is discrete.
Ex: action space of the grid world environment is [up, down, left, right].
4. Continuous environment: The action space of the environment is
continuous.
Ex1: If we train an agent to drive a car, the action space involves continuous
quantities such as [the car's speed, the number of degrees to rotate the wheel,
etc.].
Ex2: In a basketball game, the positions of the players (environment) keep
changing continuously, and shooting (action) the ball towards the basket can
be done with different angles and speeds, so there are infinite possibilities.
Types of RL environments
5. Episodic/Non-Sequential Environment: the agent’s current
action will not affect the future actions
Ex: A support bot (agent) answers one question, then another,
and so on; each question-answer pair is a single episode.
6. Non-Episodic/Sequential Envt: the agent’s current action will
affect its future actions.
Ex: a chess-board is a sequential environment since the agent’s
current action will affect its future actions in a chess match.
7. Single and Multi-agent environments:
• Single agent environment where an environment is explored by a single agent. All actions
are performed by a single agent in the environment.

• Real-life Example: Playing tennis against the ball is a single agent environment where there
is only one player.

• If two or more agents are taking actions in the environment, it is known as a multi-agent
environment.

• Real-life Example: Playing a soccer match is a multi-agent environment.


Markov Decision Process(MDP)
• It provides a mathematical framework for solving the RL problem
• MDP is mainly used to study optimization problems via dynamic programming.
• A Markov decision process (MDP) refers to a stochastic decision-making process that
uses a mathematical framework to model the decision-making of a dynamic system.
• It is used in scenarios where the results are either random or controlled by a decision
maker, which makes sequential decisions over time.
• MDPs evaluate which actions the decision maker should take considering the current
state and environment of the system.
• Almost all RL problems can be modelled as an MDP.
• In artificial intelligence, MDPs model sequential decision-making scenarios with probabilistic
dynamics.
• They are used to design intelligent machines or agents that need to operate for long periods in an
environment where actions can yield uncertain results.
MDP
• MDP uses two entities namely Markov property and Markov chain.
• Markov property : the future depends only on the present and not on the past.
• Markov chain/Markov process : has a sequence of states that strictly obey the Markov
property.
• Markov chain is a probabilistic model that solely depends on the current state to predict the
next state and not the previous states.
• The future is conditionally independent of the past.
• Ex : if the current state of weather is cloudy, we can predict the next state to be rainy.
• We made this prediction only based on the current state(cloudy) and not on the previous
states which might be sunny, windy, etc.
• The Markov property is not valid for all processes.
• Ex: the number obtained when throwing a dice (the next state) doesn't depend on the previous
number that showed up on the dice (the current state).
MDP
• MDP uses states and state transition probabilities.
• The transition probability P(s'|s) is the probability of moving from the current state s to the
next state s'.
• State transition probabilities can be represented using a Markov table, a state
diagram or a transition matrix.
• Hence, a Markov process consists
of a set of states along with their
transition probabilities.
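As a small illustration of a transition matrix, the sketch below encodes a hypothetical three-state weather chain (the probability values are made up for illustration) and samples the next state using only the current state:

```python
# Hypothetical weather Markov chain: the next state depends only on the current state.
import numpy as np

states = ["sunny", "cloudy", "rainy"]
# P[i, j] = probability of moving from states[i] to states[j] (illustrative numbers).
P = np.array([
    [0.6, 0.3, 0.1],   # from sunny
    [0.3, 0.4, 0.3],   # from cloudy
    [0.2, 0.4, 0.4],   # from rainy
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

rng = np.random.default_rng(0)
current = states.index("cloudy")
next_state = rng.choice(len(states), p=P[current])   # uses only the current state
print("cloudy ->", states[next_state])
```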
MDP…
• Markov Reward Process (MRP) :
• An extension of the Markov chain with a reward function.
• A reward function gives the reward we obtain in each state.
• An MRP consists of states S, transition probabilities P(s'|s) and a reward function R(s).
• Markov Decision Process (MDP) :
• An extension of the MRP with states S, actions A, transition probabilities P(s'|s,a) and a
reward function R(s,a,s').
• In an RL setup, the agent makes decisions based only on the current state and not on the
past states.
• Hence we can model an RL problem as an MDP.
Grid World as MDP
• Goal : the agent has to move from state A to state I, without visiting
the shaded states.
• States : set of states, from A to I
• Actions : a set of actions that our agent can perform in each state as in
up, down, left, right
• Transition probability: the probability P(s'|s,a) of moving from the current
state s to the next state s' by performing action a.
• Ex: in a deterministic grid world, P(D|A, down) = 1, i.e. moving down from A always leads to D.
Grid World as MDP
• Reward function: the reward R(s,a,s') the agent receives while moving from state s to
state s' while performing action a.
• Ex: R(A, down, D) = +1, since D is an unshaded state; R(A, right, B) = -1, since B is a shaded state.
• A small code sketch of this MDP model is shown below.
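A minimal sketch of how this MDP model could be written down in code, assuming a deterministic 3x3 layout (A B C / D E F / G H I) with B treated as a shaded hole state; the layout and reward values are illustrative, following the +1/-1 convention above.

```python
# Sketch of the grid-world MDP model: transition probabilities P and rewards R.
# Layout assumed: A B C / D E F / G H I, deterministic moves, B treated as a shaded state.

SHADED = {"B"}          # illustrative set of hole states

# P[(s, a)] = next state s' (deterministic, so its probability is 1)
P = {
    ("A", "down"): "D",
    ("A", "right"): "B",
    ("D", "right"): "E",
    # ... remaining (state, action) pairs would be filled in the same way
}

def R(s, a, s_next):
    """Reward for moving from s to s_next with action a (+1 unshaded, -1 shaded)."""
    return -1.0 if s_next in SHADED else 1.0

print(P[("A", "down")], R("A", "down", P[("A", "down")]))     # D 1.0
print(P[("A", "right")], R("A", "right", P[("A", "right")]))  # B -1.0
```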
Fundamental concepts of RL
• Maths essentials : Expectation of a random variable X
• A random variable takes values from a random experiment such as throwing a dice, tossing a coin, etc.
• Ex : if we throw a fair dice, the possible outcomes (X) are 1, 2, 3, 4, 5 and 6.
• The probability of occurrence of each of these outcomes is 1/6.

• Find the average value of the random variable X?

Ans : take the weighted average of X :
E[X] = Σ_x x P(X = x) = (1 + 2 + 3 + 4 + 5 + 6) × 1/6 = 3.5
Fundamental concepts of RL
• Expectation of a function of a random variable X :
E[f(X)] = Σ_x f(x) P(X = x)
• Ex: for the fair dice above, taking f(X) = X^2 gives E[f(X)] = (1 + 4 + 9 + 16 + 25 + 36) × 1/6 ≈ 15.17
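A short numerical check of these two expectations in plain Python, with f(X) = X^2 as an illustrative choice of function:

```python
# Expectation of a fair dice outcome X and of a function f(X) = X**2 (illustrative).
outcomes = [1, 2, 3, 4, 5, 6]
p = 1 / 6                                   # each outcome is equally likely

E_X = sum(x * p for x in outcomes)          # weighted average of X
E_fX = sum((x ** 2) * p for x in outcomes)  # weighted average of f(X)

print(E_X)    # 3.5
print(E_fX)   # 15.166...
```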
Fundamental concepts of RL
• Action Space : the set of all possible actions in the environment
• Ex: for the grid world environment, the action space is [up, down, left, right]
• Types of Action Space: discrete and continuous
• A discrete action space has actions that are discrete.
• Ex: the action space of the grid world environment
• A continuous action space has actions that are continuous.
• Ex: when training an agent to drive a car, the actions are continuous in nature, such as the speed,
the number of degrees to rotate the wheel, etc.
Fundamental concepts of RL
• A policy defines the agent’s behaviour in an environment.
• It tells the agent what action to perform in each state.
• Ex: in the grid world envt with states A to I and 4 actions, a policy may
tell the agent to move down in state A, move right in state D, and so on.
• In the first iteration, the agent starts with a random policy, taking a random
action in each state.
• It learns whether the actions taken in each state are good or bad based on the
reward it gets.
• Over a series of iterations, the agent learns a good policy that gets a positive
reward.
• This good(optimal) policy is the policy that gets the agent a good reward
and helps the agent to reach the goal state.
Fundamental concepts of RL..
• Types of policy : deterministic and stochastic
• Ex of an optimal policy
• Deterministic Policy:
• This policy tells the agent to perform one particular
action in a state.
• Denoted by μ
• If the agent is in state s at time t, the deterministic
policy tells the agent to perform action a, expressed by
a_t = μ(s_t)

Ex : μ(A) = down, i.e. in state A this policy always selects the action down.
• Stochastic policy :
• This policy doesn’t map a state to one particular action.
• It maps a state to a probability distribution over an
action space
• Denoted by π; the probability of choosing action a in state s is written π(a|s)
• Ex: if the stochastic policy for state A over the 4 action
space [up,down,left, right] is [0.10,0.70,0.10,0.10]
respectively, when the agent in state A, it chooses
action ‘up’ 10% of the time, ‘down’ 70% of the time, left
10% of the time and right 10% of the time.
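A small sketch of sampling from this stochastic policy for state A, using the probabilities quoted above (the dictionary-based representation is just one possible encoding):

```python
# Sampling an action from a stochastic policy pi(a|s) for state A.
import numpy as np

actions = ["up", "down", "left", "right"]
pi = {"A": [0.10, 0.70, 0.10, 0.10]}        # pi(a|A) as given above

rng = np.random.default_rng(0)
samples = rng.choice(actions, size=1000, p=pi["A"])
print((samples == "down").mean())           # roughly 0.70: 'down' about 70% of the time
```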
Fundamental concepts of RL..
• Types of Stochastic policy : Categorical and Gaussian
• Categorical policy:
• If the action space of a stochastic policy is discrete,
then it is a categorical policy
• Prob distributions are taken over a discrete action space
• Gaussian policy
• A stochastic policy whose action space is continuous
• Its uses a Gaussian prob distribution over an action space
• Ex: if we are training an agent to drive a car, there is a continuous action in our action space –
speed of the car whose value ranges from 0 to 150kmph.
• The stochastic policy uses the Gaussian distribution over the action space to select an action
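A tiny sketch of a Gaussian policy for the car-speed action, with an assumed mean and standard deviation (illustrative numbers only):

```python
# Gaussian policy over a continuous action (car speed in km/h), illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)
mean_speed, std_speed = 60.0, 10.0           # assumed policy outputs for some state
speed = rng.normal(mean_speed, std_speed)    # sample an action from N(mean, std)
speed = float(np.clip(speed, 0.0, 150.0))    # keep it inside the valid range 0-150 km/h
print(speed)
```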
• Episode: the agent-environment interaction starting from the initial state until the final state
is called an episode
• Often known as a trajectory(the path taken by the agent)
• Denoted by τ
• An agent can play a game for any number of episodes; each episode is independent of the others.
• What is the use of playing the same game for multiple episodes?
• To learn the optimal policy, that is, the policy that tells the agent to perform the correct
action in each state
• The episode information is of the form (state, action, reward) starting from the initial state to
the final state, i.e. τ = (s_0, a_0, r_0, s_1, a_1, r_1, …, s_T)
Episode and optimal policy in Grid world Envt
• The agent generates the first episode using a random policy
• Explores the envt over several episodes to learn an optimal policy
• Episode 1
• Episode 2 : the agent tries a different policy to avoid the negative
rewards it got in the previous episode
• Episode n: Over a series of episodes, the agent learns the optimal
policy, the policy that takes the agent from state A to state I,
without visiting the shaded states and also maximising the
rewards.

Episodic and continuous tasks
• Episodic tasks: tasks made up of episodes; thus they have a terminal state
• Ex: car racing game
• Continuous tasks: do not have any episodes and so don’t have any terminal state.
• Ex: a personal assistance robot does not have a terminal state
• Horizon : the time step until which the agent interacts with the envt.
• Types : finite and infinite horizon
• Finite horizon : the agent-envt interaction stops at a particular time step.
• Ex: in an episodic task, the agent-envt interaction stops after the agent reaches the final
state T.
• Infinite horizon: the agent-envt interaction never stops
• Ex: a continuous task without final state has an infinite horizon.
Return and discount factor
• Return : the sum of the rewards obtained by an agent in an episode.
• Denoted by R(τ) or G.
• Ex: if the agent starts at the initial state at time step t = 0 and reaches the final
state at time step T, then the return obtained by the agent is
R(τ) = r_0 + r_1 + r_2 + … + r_{T-1}

• Ex: for a trajectory from A to I whose rewards are 1, 1, 1 and 1, the return is
R(τ) = 1 + 1 + 1 + 1 = 4

Return and discount factor..
• So the goal of the agent is to maximise the return, i.e, maximise the sum of the
rewards obtained over an episode.
• How can we maximise this return? How can we perform the correct action in
each state?
• By using the optimal policy – the policy that gets our agent the maximum
return (sum of the rewards) by performing the correct action in each
state.
• How do we define the return for continuous tasks, where there is no terminal state?
• Return for continuous tasks – the sum of the rewards up to infinity.
Return and discount factor..
• How can we maximise a return that sums to infinity?
• By using a discount factor γ.

• The discount factor prevents the return from reaching infinity by
deciding how much importance we give to immediate rewards
versus future rewards :
R(τ) = r_0 + γ r_1 + γ^2 r_2 + … = Σ_t γ^t r_t
• Its value ranges from 0 to 1.
Return and discount factor..
• If the discount factor is small (close to 0), we give
more importance to immediate rewards than to future rewards.
• If the discount factor is large (close to 1), we give
more importance to future rewards than to immediate rewards.
Return and discount factor..
• What happens when the discount factor is small, e.g. γ = 0.2?
R(τ) = r_0 + 0.2 r_1 + 0.04 r_2 + 0.008 r_3 + …

• The weights on later rewards shrink very quickly, so when we set the discount factor to a
small value, we give more importance to immediate rewards than to future rewards.
Return and discount factor..
• What happens when the discount factor is large, e.g. γ = 0.9?
R(τ) = r_0 + 0.9 r_1 + 0.81 r_2 + 0.729 r_3 + …
• The weights on later rewards decay slowly, so future rewards still carry significant weight.
Return and discount factor..
• What happens when the discount factor is set to 0, i.e. γ = 0?
R(τ) = r_0

• The return is just the immediate reward.

Return and discount factor..
• What happens when the discount factor is set to 1, i.e. γ = 1?
R(τ) = r_0 + r_1 + r_2 + …

• The return is just the undiscounted sum of the rewards up to infinity.

Return and discount factor..
• If the discount factor is set to 0, the agent never learns about the future, as it
considers only the immediate reward.
• If the discount factor is set to 1, the agent keeps learning forever,
looking for future rewards whose sum goes to infinity.
• So, in practice the discount factor is typically set between 0.2 and 0.8.
• For certain tasks future rewards are more important than
immediate rewards, and vice versa.
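A short sketch comparing discounted returns for the same (hypothetical) reward sequence under different discount factors:

```python
# Discounted return R = sum_t gamma**t * r_t for a hypothetical reward sequence.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 1, -1, 1, 1]          # illustrative rewards of one episode
for gamma in (0.0, 0.2, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0 keeps only the first reward; gamma = 1.0 is the plain sum of the rewards.
```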
Value function
• Value function also called the state value function gives the value
of a state.
• The value of a state s is the return of the trajectory τ starting
from that state and following a policy π until the final state.

• The policy could be deterministic or stochastic.

• A deterministic policy maps each state to one particular action.
• A stochastic policy selects an action for a state based on a probability
distribution over the action space.
Value function for a deterministic policy
• Suppose the trajectory τ for the grid world environment, using some
deterministic policy π, is (for example) :
A -(down, +1)-> D -(right, +1)-> E -(right, +1)-> F -(down, +1)-> I
Value function for a deterministic policy…
• The value function can be calculated for each state as the
return (sum of the rewards) of the trajectory starting from that
state :
V(A) = 1 + 1 + 1 + 1 = 4, V(D) = 1 + 1 + 1 = 3, V(E) = 1 + 1 = 2, V(F) = 1

• The value function of the final state is zero, since a reward is
associated only with a state that has an outgoing transition.
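A sketch that computes these state values directly from the trajectory above (the state and reward lists are the hypothetical example, and a discount factor of 1 is used, matching the plain sums):

```python
# State values along one trajectory: V(s_t) = sum of the rewards from time t onward.
states  = ["A", "D", "E", "F", "I"]       # visited states (hypothetical trajectory above)
rewards = [1, 1, 1, 1]                    # reward received on each transition

V = {}
running_return = 0.0
for t in reversed(range(len(rewards))):   # accumulate returns from the back of the episode
    running_return += rewards[t]
    V[states[t]] = running_return
V[states[-1]] = 0.0                       # final state has no outgoing transition

print(V)   # {'F': 1.0, 'E': 2.0, 'D': 3.0, 'A': 4.0, 'I': 0.0}
```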
Value function for stochastic policy
• The (expected) value function of a state under a stochastic policy is the
expected return that the agent would get starting from that state s
and following the stochastic policy π.
• The return of a state in a trajectory τ, following a stochastic policy π, is
a random variable.
• It takes different values with some probability in each trajectory.
• It is expressed as :
V^π(s) = E_{τ~π} [ R(τ) | s_0 = s ]
Value function for stochastic policy…
• Ex: In state A, the stochastic policy gives a probability distribution over
the action space [up, down, left, right] as [0.0, 0.8, 0.0, 0.2], i.e.
perform the action down 80% of the time, that is, π(down|A) = 0.8,
and the action right 20% of the time, that is, π(right|A) = 0.2.
• This gives two trajectories from state A.
• Assume the stochastic policy selects "right" in states D and E and
"down" in B and F 100% of the time.
Value function for stochastic policy…
• The first trajectory (chosen with probability 0.8, starting with down in A) is :
A -(down, +1)-> D -(right, +1)-> E -(right, +1)-> F -(down, +1)-> I

• The value of state A is the return (sum of the rewards) of the trajectory
starting from state A.
• Thus, V(A) = R(τ₁) = 1 + 1 + 1 + 1 = 4.
Value function for stochastic policy…
• The second trajectory (chosen with probability 0.2, starting with right in A) is :
A -(right, -1)-> B -(down, +1)-> E -(right, +1)-> F -(down, +1)-> I
• The value of state A is again the return (sum of the rewards) of the trajectory starting from state A.
• Thus, V(A) = R(τ₂) = -1 + 1 + 1 + 1 = 2.
• So, even for the same policy, V(A) differs across trajectories.
• For this policy, the return is 4 for 80% of the time and 2 for 20% of the time.
Value function for stochastic policy…
• The value of a state for a stochastic policy is the expected return of the trajectory starting from
that state :
V^π(s) = E_{τ~π} [ R(τ) | s_0 = s ]
• The expected return is the weighted average : the sum of the returns, each multiplied by its probability.
• So, V(A) = 0.8 × 4 + 0.2 × 2 = 3.6.
Value function for stochastic policy…
• Thus, the value of a state is the expected return of the trajectory starting from that state.
• Value function depends on the policy.
• There can be many value functions for a state, according to different policies.
• The optimal value function V*(s) is the maximum value of the state among all its value
functions :
V*(s) = max_π V^π(s)

Ex: Textbook Pg. 33. We can find the optimal value of a state from a value table.
Q function
• Q function denotes the value of a state-action pair for a particular
state, s.
• It is the return that the agent will obtain starting from a state s,
performing an action a, and thereafter following a policy π :
Q^π(s, a) = E_τ [ R(τ) | s_0 = s, a_0 = a ]
• It is also known as the state-action value function.

• Ex: Given the trajectory τ used above :
A -(down, +1)-> D -(right, +1)-> E -(right, +1)-> F -(down, +1)-> I
Q function…
• Find the Q values of (A, down) and (D, right) for the trajectory above :
• Ans 1: Q(A, down) = 1 + 1 + 1 + 1 = 4
• Ans 2: Q(D, right) = 1 + 1 + 1 = 3
• Since the return is a random variable taking different values with
some probability, instead of taking the return directly, we take
the expected return.
Q function….
• The Q function depends on the policy.
• There can be many Q values for an (s, a) pair, depending on the policy.
• The optimal Q value of an (s, a) pair is :
Q*(s, a) = max_π Q^π(s, a)
• The optimal policy π* is the policy that gives the maximum Q value for every (s, a).
• Given a Q table, we can find the optimal policy π* by picking, in each state, the action with the
highest Q value.
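A short sketch of reading a greedy (optimal with respect to the table) policy out of a Q table; the Q values here are made-up illustrative numbers:

```python
# Extract a greedy policy from a Q table: pi*(s) = argmax_a Q(s, a).
Q = {   # illustrative Q values for two states
    "A": {"up": 0.1, "down": 0.9, "left": 0.0, "right": 0.2},
    "D": {"up": 0.2, "down": 0.1, "left": 0.0, "right": 0.8},
}

policy = {s: max(q_s, key=q_s.get) for s, q_s in Q.items()}
print(policy)   # {'A': 'down', 'D': 'right'}
```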
Model-based and model-free learning
• Model-based learning : the agent finds/learns the optimal policy
by using the model dynamics of the environment.
• The model dynamics of the environment are defined by
• 1. the state transition probabilities P(s'|s,a) and 2. the reward function R(s,a,s')

• Model-free learning : the agent tries to find/learn the optimal
policy without using the model dynamics of the environment.
Bellman Equation and Dynamic Programming
• In RL, the agent has to learn an optimal policy.
• An optimal policy selects the correct action for the agent in each
state, so that the agent can get the maximum return and achieve
its goal.
• Two classical RL algorithms – value iteration and policy iteration – help
the agent learn an optimal policy.
• These algorithms are model-based and use Dynamic Programming.
• The Bellman equation is used in RL to find the optimal value function
and the optimal Q function recursively.
• These are then used to find the optimal policy; a minimal value iteration sketch on a toy
grid-world MDP is given below.
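The following sketch implements value iteration on a small, hand-written deterministic grid-world MDP. The layout, hole states, and reward convention (+1 only for reaching the goal, -1 for entering a shaded state, 0 otherwise) are illustrative assumptions in the spirit of the A–I grid above, not the exact textbook problem. Policy iteration would use the same Bellman backup but alternate policy evaluation and policy improvement steps.

```python
# Value iteration on a toy 3x3 grid world with states A..I (illustrative model).
GAMMA = 0.9          # discount factor
THETA = 1e-6         # convergence threshold

grid = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]
states = [s for row in grid for s in row]
actions = ["up", "down", "left", "right"]
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
pos = {grid[r][c]: (r, c) for r in range(3) for c in range(3)}

shaded = {"B", "H"}  # assumed hole states
goal = "I"

def step(s, a):
    """Deterministic model dynamics: returns (next state, reward)."""
    r, c = pos[s]
    dr, dc = moves[a]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < 3 and 0 <= nc < 3):    # hitting the boundary: stay in place
        return s, 0.0
    s_next = grid[nr][nc]
    if s_next == goal:
        return s_next, 1.0
    if s_next in shaded:
        return s_next, -1.0
    return s_next, 0.0

def q_value(s, a, V):
    """Bellman backup for one state-action pair: R(s,a,s') + gamma * V(s')."""
    s_next, r = step(s, a)
    return r + GAMMA * V[s_next]

# Value iteration: repeatedly apply V(s) <- max_a [ R(s,a,s') + gamma * V(s') ].
V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if s == goal:
            continue                          # terminal state keeps value 0
        best = max(q_value(s, a, V) for a in actions)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Extract the greedy (optimal w.r.t. V) policy from the converged value function.
policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states if s != goal}

print({s: round(v, 3) for s, v in V.items()})
print(policy)
```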
Bellman Equation?
• Bellman equation?
• Bellman optimality equation?
• Relationship between value and Q function?
• Dynamic Programming – value and policy iteration methods?
• Solving the Frozen Lake problem using value and policy iteration
methods.
Bellman equation for the value function
• As per the Bellman equation, the value of a state is the sum of the
immediate reward and the discounted value of the next state :
V(s) = r + γ V(s')
Bellman equation for the value function
• In a deterministic environment:
• Ex: Given a trajectory τ using some policy π as :
s_1 -(a_1, r_1)-> s_2 -(a_2, r_2)-> s_3

• Using the Bellman equation, find V(s_1).

• Ans:
V^π(s_1) = r_1 + γ V^π(s_2) = r_1 + γ ( r_2 + γ V^π(s_3) )
• Hence the Bellman equation of the value function for a deterministic environment associated with a
policy π is :
V^π(s) = R(s, a, s') + γ V^π(s'),  where a = π(s)
• The right-hand side term is known as the Bellman backup.
Bellman equation for the value function
• In a stochastic environment:
Bellman equation for the value function
• Modify the Bellman equation of the value function with an
expectation (weighted average) :
• each Bellman backup is multiplied by the transition probability of the
corresponding next state.
Bellman equation for the value function
• Ex: for the same trajectory, when the environment is stochastic, V(s_1) becomes the
sum of the Bellman backups over all possible next states, each weighted by its
transition probability.
• Hence the Bellman equation for the value function for a
stochastic environment for a policy π is :
V^π(s) = Σ_{s'} P(s'|s,a) [ R(s, a, s') + γ V^π(s') ],  where a = π(s)
Bellman equation for the value function
• What if the policy itself is stochastic? Instead of performing the same action in a state, we
select an action based on the probability distribution π(a|s) over the action space.
• Ex: in state A, the policy might choose down with probability 0.8 and right with probability 0.2.
Bellman eqn for value function…
• To include the stochasticity present in the environment in the Bellman equation, we took
the expectation (the weighted average), that is, a sum of the Bellman backups multiplied by
the corresponding transition probabilities of the next states.
• Similarly, to include the stochastic nature of the policy in the Bellman equation, we can use
the expectation (the weighted average), that is, a sum of the Bellman backups multiplied by
the corresponding probabilities of the actions.

• Using both expectations :
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s, a, s') + γ V^π(s') ]
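A small numerical sketch of this equation for a single state, using made-up probabilities, rewards, and next-state value estimates:

```python
# One Bellman expectation backup: V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma*V(s')].
GAMMA = 0.9

pi = {"down": 0.8, "right": 0.2}                      # pi(a|A), as in the earlier example
# P[a] maps next state -> probability; R[a] maps next state -> reward (illustrative numbers).
P = {"down": {"D": 1.0}, "right": {"B": 1.0}}
R = {"down": {"D": 1.0}, "right": {"B": -1.0}}
V = {"D": 3.0, "B": 1.0}                              # assumed current value estimates

v_A = sum(
    pi[a] * sum(p * (R[a][s_next] + GAMMA * V[s_next]) for s_next, p in P[a].items())
    for a in pi
)
print(round(v_A, 3))   # 0.8*(1 + 0.9*3) + 0.2*(-1 + 0.9*1) = 2.94
```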
Bellman Eqn of the Q function
• For a deterministic envt : the Bellman eqn of the Q function says that
the Q value of a state-action pair is the sum of the immediate reward
and the discounted Q value of the next state-action pair :
Q^π(s, a) = R(s, a, s') + γ Q^π(s', a'),  where a' = π(s')
• Ex: Given a trajectory τ using some policy π, find the Q value of (s_1, a_1) :
• Ans: Q^π(s_1, a_1) = r_1 + γ Q^π(s_2, a_2)
Bellman Eqn of the Q function…
• For a stochastic environment: when an agent in state s performs an
action a, the next state is not always the same.
• The Bellman eqn of the Q function for a stochastic envt uses the
expectation (weighted average), that is, a sum of the Bellman
backups multiplied by their corresponding transition probabilities of
the next states.
• The Bellman equation of the Q function is:
Q^π(s, a) = Σ_{s'} P(s'|s,a) [ R(s, a, s') + γ Σ_{a'} π(a'|s') Q^π(s', a') ]