Module 1
Introduction to RL
• Reinforcement Learning (RL) is one of the areas of Machine Learning (ML).
• It is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions.
• For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or penalty.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by maximising the positive rewards it receives.
• It is one of the most active research areas in AI.
• It has evolved to the point where it can be used to build anything from recommendation systems to self-driving cars.
• The reason for this evolution is deep RL, a combination of Deep Learning (DL) and RL.
Introduction to RL..
• In Reinforcement Learning, the agent learns automatically using feedback, without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience only.
• Real-life example: playing tennis against a wall is a single-agent environment, where there is only one player.
• If two or more agents are taking actions in the environment, it is known as a multi-agent environment.
• Stochastic policy:
• This policy doesn't map a state to one particular action.
• It maps a state to a probability distribution over the action space.
• Denoted by π, or π(a|s).
• Ex: if the stochastic policy for state A over the 4-action space [up, down, left, right] is [0.10, 0.70, 0.10, 0.10] respectively, then when the agent is in state A, it chooses action 'up' 10% of the time, 'down' 70% of the time, 'left' 10% of the time and 'right' 10% of the time (see the sketch below).
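• A minimal Python sketch (not part of the slides) of sampling actions from the stochastic policy above; the probabilities are the ones given in the example:

import numpy as np

# Stochastic policy for state A: a probability distribution over the action space.
actions = ["up", "down", "left", "right"]
policy_A = [0.10, 0.70, 0.10, 0.10]

rng = np.random.default_rng(0)

# Sample an action each time the agent visits state A; 'down' comes up ~70% of the time.
samples = rng.choice(actions, size=1000, p=policy_A)
for action in actions:
    print(action, float(np.mean(samples == action)))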
Fundamental concepts of RL..
• Types of stochastic policy: categorical and Gaussian.
• Categorical policy:
• If the action space of a stochastic policy is discrete, then it is a categorical policy.
• Probability distributions are taken over a discrete action space.
• Gaussian policy:
• A stochastic policy whose action space is continuous.
• It uses a Gaussian probability distribution over the action space.
• Ex: if we are training an agent to drive a car, there is a continuous action in our action space: the speed of the car, whose value ranges from 0 to 150 km/h.
• The stochastic policy uses a Gaussian distribution over the action space to select an action (see the sketch below).
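• A minimal Python sketch (not part of the slides) of a Gaussian policy over the continuous speed action; the mean and standard deviation are made-up values for illustration:

import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy(mean_speed, std_speed):
    # Sample a continuous action (speed in km/h) from a Gaussian distribution,
    # then clip it to the valid action range [0, 150].
    speed = rng.normal(loc=mean_speed, scale=std_speed)
    return float(np.clip(speed, 0.0, 150.0))

# Hypothetical mean/std produced by the policy for the current state.
print(gaussian_policy(mean_speed=60.0, std_speed=10.0))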
• Episode: the agent-environment interaction starting from the initial state until the final state is called an episode.
• Often known as a trajectory (the path taken by the agent).
• Denoted by τ.
• An agent can play a game for any number of episodes; each episode is independent of the others.
• What is the use of playing the same game for multiple episodes?
• To learn the optimal policy, that is, the policy that tells the agent to perform the correct
action in each state
• The episode information is of the form (state, action, reward) from the initial state to the final state, i.e., τ = (s0, a0, r0, s1, a1, r1, …, sT).
Episode and optimal policy in Grid world Envt
• The agent generates the first episode using a random policy.
• It explores the envt over several episodes to learn an optimal policy.
• Episode 1
• Episode 2: the agent tries a different policy to avoid the negative rewards it got in the previous episode.
• Episode n: over a series of episodes, the agent learns the optimal policy, that is, the policy that takes the agent from state A to state I without visiting the shaded states, while also maximising the rewards.
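• A minimal Python sketch (not part of the slides) of generating episodes with a random policy in a small grid world; the 3×3 layout, the shaded states and the reward values are simplified assumptions, not the exact textbook figure:

import random

random.seed(0)

# 3x3 grid of states laid out row-wise; I is the terminal state.
GRID = [["A", "B", "C"],
        ["D", "E", "F"],
        ["G", "H", "I"]]
SHADED = {"B", "F"}            # assumed shaded (penalty) states
ACTIONS = ["up", "down", "left", "right"]

POS = {s: (r, c) for r, row in enumerate(GRID) for c, s in enumerate(row)}

def step(state, action):
    # Deterministic transition: move within the grid, staying put at the borders.
    r, c = POS[state]
    if action == "up":    r = max(r - 1, 0)
    if action == "down":  r = min(r + 1, 2)
    if action == "left":  c = max(c - 1, 0)
    if action == "right": c = min(c + 1, 2)
    next_state = GRID[r][c]
    reward = -1 if next_state in SHADED else +1
    return next_state, reward

def generate_episode():
    # One episode (trajectory): (state, action, reward) tuples from A until I.
    state, episode = "A", []
    while state != "I":
        action = random.choice(ACTIONS)          # random policy
        next_state, reward = step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode

for i in range(2):
    ep = generate_episode()
    print(f"Episode {i + 1}: length={len(ep)}, return={sum(r for _, _, r in ep)}")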
Episodic and continuous tasks
• Episodic tasks: tasks made up of episodes, and thus they have a terminal state.
• Ex: a car racing game.
• Continuous tasks: do not have episodes and so do not have a terminal state.
• Ex: a personal assistant robot does not have a terminal state.
• Horizon : the time step until which the agent interacts with the envt.
• Types : finite and infinite horizon
• Finite horizon : the agent-envt interaction stops at a particular time step.
• Ex: in an episodic task, the agent-envt interaction stops after the agent reaches the final
state T.
• Infinite horizon: the agent-envt interaction never stops
• Ex: a continuous task, which has no final state, has an infinite horizon.
Return and discount factor
• Return: the sum of the rewards obtained by the agent in an episode.
• Denoted by R or G.
• Ex: if the agent starts at the initial state at time step t = 0 and reaches the final state at time step T, then the return obtained by the agent is R(τ) = r0 + r1 + r2 + … + rT.
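• The slide title also mentions the discount factor; a minimal Python sketch (not part of the slides) of the standard discounted return, of which the plain sum of rewards above is the special case gamma = 1:

def discounted_return(rewards, gamma=0.9):
    # R(tau) = r0 + gamma*r1 + gamma^2*r2 + ...; gamma = 1 gives the plain sum of rewards.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-1, 1, 1, 1]                        # an example reward sequence
print(discounted_return(rewards, gamma=1.0))   # 2.0, the undiscounted return
print(discounted_return(rewards, gamma=0.9))   # ≈ 1.439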
• The value of state A is the return (sum of the rewards) of the trajectory starting from state A.
• Thus, 𝑉(𝐴) = 𝑅(τ) = -1 + 1 + 1 + 1 = 2.
• But if the policy is stochastic, then even with the same policy, V(A) differs with the trajectory.
• For this policy, the return is 4 for 80% of the time and 2 for 20% of the time.
Value function for stochastic policy…
• Value of a state for a stochastic policy is the expected return of the trajectory starting from
that state.
• The expected return is the weighted average, that is, the sum of the returns multiplied by their probabilities.
• So, V(A) = (4 × 0.8) + (2 × 0.2) = 3.2 + 0.4 = 3.6.
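• A minimal Python sketch (not part of the slides) of the same weighted-average computation:

def expected_return(returns_and_probs):
    # Weighted average: each possible return multiplied by its probability, then summed.
    return sum(ret * prob for ret, prob in returns_and_probs)

# The return is 4 with probability 0.8 and 2 with probability 0.2 (the example above).
print(round(expected_return([(4, 0.8), (2, 0.2)]), 2))   # 3.6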
Value function for stochastic policy…
• Thus, the value of a state is the expected return of the trajectory starting from that state.
• Value function depends on the policy.
• There can be many value functions for a state, according to different policies.
• The optimal value function V*(s) is the maximum value of the state among all its value functions, i.e., V*(s) = maxπ Vπ(s).
• Ex: Textbook, pg. 33. We can find the optimal value of a state from a value table.
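• A minimal Python sketch (not part of the slides) of picking the optimal value of each state from a value table; the two policies and their values are made up for illustration:

# Value table: the value of each state under two different (made-up) policies.
value_table = {
    "A": {"policy_1": 0.3, "policy_2": 1.0},
    "B": {"policy_1": 1.2, "policy_2": 0.7},
}

# Optimal value function: for each state, the maximum value over all policies.
v_star = {state: max(values.values()) for state, values in value_table.items()}
print(v_star)   # {'A': 1.0, 'B': 1.2}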
Q function
• The Q function denotes the value of a state-action pair (s, a).
• It is the return that the agent will obtain starting from a state s, performing an action a, and thereafter following a policy π.
• It is also known as the state-action value function.
• Since the return is a random variable that takes different values with some probability, instead of taking the return directly, we take the expected return.
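• A minimal Python sketch (not part of the slides) of estimating the expected return for a state-action pair by averaging sampled returns; the two possible returns and their probabilities are made-up toy numbers:

import random

random.seed(0)

def sample_return():
    # Toy model of the random return of a trajectory starting from some (s, a):
    # 4 with probability 0.8, 2 with probability 0.2.
    return 4 if random.random() < 0.8 else 2

# Monte Carlo estimate of Q(s, a): average the returns over many sampled trajectories.
returns = [sample_return() for _ in range(10_000)]
print(sum(returns) / len(returns))   # close to the expected return of 3.6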
Q function….
• The Q function depends on the policy.
• There will be different Q values for an (s, a) pair, depending on the policy.
• The optimal policy for an (s, a) pair is: π* = arg maxπ Qπ(s, a).
• That is, the optimal policy π* is the policy that gives the maximum Q value for (s, a).
• Given a Q table, we can find the optimal policy π* by selecting, in each state, the action with the maximum Q value (see the sketch below).
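• A minimal Python sketch (not part of the slides) of extracting the optimal policy from a Q table; the states, actions and Q values are made up for illustration:

# A Q table maps each (state, action) pair to its Q value.
q_table = {
    ("A", "up"): 0.1, ("A", "down"): 0.9,
    ("B", "up"): 0.5, ("B", "down"): 0.2,
}

def greedy_policy(q_table):
    # For every state, pick the action with the maximum Q value.
    policy = {}
    for (state, action), q in q_table.items():
        if state not in policy or q > q_table[(state, policy[state])]:
            policy[state] = action
    return policy

print(greedy_policy(q_table))   # {'A': 'down', 'B': 'up'}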
Model-based and model-free learning
• Model-based learning: the agent finds/learns the optimal policy by using the model dynamics of the environment.
• The model dynamics of the environment are defined using: 1. the state transition probabilities and 2. the reward function.
• Model-free learning: the agent finds/learns the optimal policy without using the model dynamics, relying only on its interaction with the environment.
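• A minimal Python sketch (not part of the slides) of how the model dynamics can be represented; the states, probabilities and rewards are made up for illustration:

# Model dynamics: for each (state, action) pair, a list of
# (transition probability, next state, reward) tuples.
model = {
    ("A", "right"): [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "right"): [(1.0, "C", 1.0)],
}

# The agent can use these probabilities and rewards (model-based learning)
# or ignore them and learn only from experience (model-free learning).
for (state, action), transitions in model.items():
    for prob, next_state, reward in transitions:
        print(f"P({next_state} | {state}, {action}) = {prob}, reward = {reward}")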
Bellman Eqn of the Q function
• For a deterministic envt: the Bellman eqn of the Q function says that the Q value of a state-action pair is the sum of the immediate reward and the discounted Q value of the next state-action pair: Q(s, a) = r + γQ(s′, a′).
• Ex: Given a trajectory τ generated using some policy π, find the Q value of each (state, action) pair in it.
• Ans: apply the equation above recursively, starting from the end of the trajectory (see the sketch below).
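• A minimal Python sketch (not part of the slides) applying this recursion backwards along a trajectory; the trajectory itself is a made-up example, since the one from the original slide is not reproduced here:

# A trajectory as a list of (state, action, reward) tuples, ending at the final state.
trajectory = [("A", "right", 1.0), ("B", "right", -1.0), ("C", "down", 1.0)]
gamma = 0.9

# Deterministic Bellman recursion: Q(s_t, a_t) = r_t + gamma * Q(s_{t+1}, a_{t+1}).
q = {}
next_q = 0.0                       # Q value after the final transition is 0
for state, action, reward in reversed(trajectory):
    next_q = reward + gamma * next_q
    q[(state, action)] = next_q

print(q)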
Bellman Eqn of the Q function…
• For a stochastic environment: when an agent in state s performs an action a, the next state is not always the same.
• The Bellman eqn of the Q function for a stochastic envt uses the expectation (weighted average), that is, a sum of the Bellman backups multiplied by the corresponding transition probabilities of the next states.
• The Bellman equation of the Q function is: Q(s, a) = Σ_s′ P(s′|s, a)[r + γQ(s′, a′)].
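• A minimal Python sketch (not part of the slides) of this Bellman backup for a stochastic envt; the transition probabilities, rewards and next-state Q values are made up for illustration:

gamma = 0.9

# For action a in state s, the environment can move to different next states:
# (transition probability, reward, next state).
transitions = [(0.7, 1.0, "s1"), (0.3, -1.0, "s2")]

# Q values of the next state-action pairs (a' chosen by the policy in each next state).
q_next = {"s1": 2.0, "s2": 0.5}

# Q(s, a) = sum over s' of P(s' | s, a) * (r + gamma * Q(s', a')).
q_sa = sum(p * (r + gamma * q_next[s_next]) for p, r, s_next in transitions)
print(q_sa)   # 0.7*(1 + 1.8) + 0.3*(-1 + 0.45) ≈ 1.795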