
UNIT-5

REINFORCEMENT LEARNING
CONTENTS

• Reinforcement learning
• Active reinforcement and passive reinforcement learning
• Adaptive dynamic programming (ADP)
• Temporal difference learning
• Function approximation
• Generalization in reinforcement learning
• Applications of reinforcement learning
REINFORCEMENT LEARNING

• Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions.
• For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
• In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience alone.
REINFORCEMENT LEARNING

• RL solves a specific type of problem in which decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by collecting the maximum positive reward.
• Example: how a robotic dog learns the movement of its arms is an example of reinforcement learning.
REINFORCEMENT LEARNING-EXAMPLE

• Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing some actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
• The agent continues doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing these actions it learns and explores the environment.
• The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties). As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative point.
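• A minimal sketch of this interaction loop in Python, assuming a toy 4x4 maze; the reward values (+1.0 for the diamond, -0.04 per step) are illustrative assumptions, not part of the original example:

```python
import random

class MazeEnv:
    """Toy 4x4 maze; the diamond sits in the bottom-right cell."""
    def __init__(self):
        self.state = (0, 0)

    def step(self, action):
        r, c = self.state
        dr, dc = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[action]
        self.state = (max(0, min(3, r + dr)), max(0, min(3, c + dc)))
        done = self.state == (3, 3)              # diamond found
        reward = 1.0 if done else -0.04          # small penalty for every other step
        return self.state, reward, done

env, done = MazeEnv(), False
while not done:                                  # take action -> change state -> get feedback
    action = random.choice(["up", "down", "left", "right"])
    state, reward, done = env.step(action)
```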
ELEMENTS OF REINFORCEMENT LEARNING

• There are four main elements of reinforcement learning


• Policy
• Reward signal
• Value function
• Model of the environment
ELEMENTS OF REINFORCEMENT LEARNING

• Policy: A policy defines the way an agent behaves at a given time. It maps perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process.
• Reward signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as the reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it receives for good actions. The reward signal can also change the policy: for example, if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.
ELEMENTS OF REINFORCEMENT LEARNING

• Value Function: The value function gives information about how good a situation and action are and how much reward an agent can expect. A reward indicates the immediate signal for each good and bad action, whereas the value function specifies which states and actions are good in the long run. The value function depends on the reward because, without reward, there could be no value. The goal of estimating values is to obtain more rewards.
• Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, the model can predict the next state and reward (a small sketch follows below).
• The model is used for planning, which means it provides a way to choose a course of action by considering possible future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are termed model-based approaches; approaches that do not use a model are called model-free approaches.
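• A minimal sketch of the model idea: a learned table that, given a state and an action, predicts the next state and reward. The names and the deterministic behaviour are illustrative assumptions:

```python
model = {}   # (state, action) -> (next_state, reward)

def record_transition(state, action, next_state, reward):
    """Learn the model from experience (here: remember the last observed outcome)."""
    model[(state, action)] = (next_state, reward)

def predict(state, action):
    """Use the model for planning: what would happen if we took this action?"""
    return model.get((state, action))

record_transition((0, 0), "right", (0, 1), -0.04)
print(predict((0, 0), "right"))    # ((0, 1), -0.04)
```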
TYPES OF REINFORCEMENT LEARNING

• There are mainly two types of reinforcement learning


• Positive Reinforcement
• Negative Reinforcement

• Positive Reinforcement: Positive reinforcement means adding something to increase the tendency that the expected behavior will occur again. It impacts the behavior of the agent positively and increases the strength of the behavior.
• This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
• Negative Reinforcement: Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that the specific behavior will occur again by avoiding a negative condition.
• It can be more effective than positive reinforcement depending on the situation and behavior, but it provides only enough reinforcement to meet the minimum required behavior.
PASSIVE REINFORCEMENT LEARNING

• In passive reinforcement learning, the agent's policy is fixed, which means the agent is told what to do.
• As the goal of the agent is to evaluate how good the given policy is, the agent needs to learn the expected utility of each state s. This can be done in three ways.
a) Direct Utility Estimation: In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions that continue until the agent reaches a terminal state).
• Each trial gives a sample value, and the agent estimates the utility based on the sample values.
• The main drawback is that this method makes the wrong assumption that state utilities are independent, while in reality they are Markovian.
b) Adaptive Dynamic Programming (ADP): ADP is a smarter method than Direct Utility Estimation. It runs trials to learn a model of the environment and estimates the utility of a state as the sum of the reward for being in that state and the expected discounted utility of the next state.
c) Temporal Difference Learning (TD): TD learning does not require the agent to learn the transition model. The update occurs between successive states, and the agent only updates the states that are directly affected.
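• As a hedged illustration of option (c), a minimal sketch of passive TD(0) utility estimation under a fixed policy; the trial format, learning rate, and discount factor are illustrative assumptions:

```python
from collections import defaultdict

def td_passive(trials, alpha=0.1, gamma=0.9):
    U = defaultdict(float)                     # utility estimate per state
    for trial in trials:                       # each trial: [(s0, r0), (s1, r1), ...]
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            # update only the state just left, using the successor actually observed
            U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U

trials = [[("A", -0.04), ("B", -0.04), ("GOAL", 1.0)]]
print(td_passive(trials * 50)["A"])            # utility estimate of state A
```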
ACTIVE LEARNING

• As the goal of an active agent is to learn an optimal policy, the agent needs to learn the expected utility of each state and update its policy. This can be done using a passive ADP agent and then applying value or policy iteration to learn the optimal actions. However, this approach results in a greedy agent. Hence, we use an approach that gives higher weights to unexplored actions and lower weights to actions with lower utilities, as sketched below.
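• A hedged sketch of this weighting idea: until a state-action pair has been tried N_E times, it is treated optimistically as if it were worth R_PLUS. The constants and data structures (U, N) are illustrative assumptions, not from the original slides:

```python
from collections import defaultdict

R_PLUS, N_E = 2.0, 5          # optimistic utility estimate; minimum number of tries
U = defaultdict(float)        # U[(s, a)]: current utility estimate
N = defaultdict(int)          # N[(s, a)]: how often (s, a) has been tried

def exploration_value(utility, visits):
    """Higher weight for unexplored actions, the learned utility otherwise."""
    return R_PLUS if visits < N_E else utility

def choose_action(state, actions):
    return max(actions, key=lambda a: exploration_value(U[(state, a)], N[(state, a)]))
```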
ACTIVE LEARNING
• An active RL agent is an extension of a passive one (e.g., the passive ADP agent) and adds the following:
• It needs to learn a complete transition model for all actions (not just for π), using passive ADP learning.
• Utilities need to reflect the optimal policy π*, as expressed by the Bellman equations.
• The equations can be solved by the VI or PI methods described before.
• The action to be selected is the optimal/maximizing one.
DIFFERENCES BETWEEN REINFORCEMENT
LEARNING AND SUPERVISED LEARNING
• In supervised learning, the model learns from labeled data; in reinforcement learning there is no labeled data, and the agent learns only from the feedback (reward or penalty) it receives for its own actions.
• Supervised learning makes predictions on independent examples, whereas in reinforcement learning decision making is sequential and the goal is long-term.
• In supervised learning the correct answer is provided with the data, whereas in reinforcement learning the agent must explore the environment and discover good actions by itself.
APPLICATIONS OF REINFORCEMENT LEARNING
1. Robotics: RL is used in robot navigation, Robo-soccer, walking, juggling, etc.
2. Control: RL can be used for adaptive control, such as factory processes, admission control in telecommunications, and helicopter piloting.
3. Game playing: RL can be used in game playing such as tic-tac-toe, chess, etc.
4. Chemistry: RL can be used for optimizing chemical reactions.
5. Business: RL is now used for business strategy planning.
6. Manufacturing: In various automobile manufacturing companies, robots use deep reinforcement learning to pick goods and place them in containers.
7. Finance sector: RL is currently used in the finance sector for evaluating trading strategies.
ADAPTIVE DYNAMIC PROGRAMMING (ADP)
• An adaptive dynamic programming (or ADP) agent takes advantage of
the constraints among the utilities of states by learning the transition
model that connects them and solving the corresponding Markov
decision process using a dynamic programming method
• The process of learning the model itself is easy, because the
environment is fully observable. This means that we have a supervised
learning task where the input is a state–action pair and the output is
the resulting state.
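• A hedged sketch of this supervised model-learning step, using simple frequency counts to estimate transition probabilities; the function names and example states are illustrative:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = how often s' followed

def observe(s, a, s_next):
    """Record one observed (state, action) -> resulting-state example."""
    counts[(s, a)][s_next] += 1

def transition_prob(s, a, s_next):
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

observe("A", "right", "B")
observe("A", "right", "B")
observe("A", "right", "C")
print(transition_prob("A", "right", "B"))        # 0.666...
```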
WHY FUNCTION APPROXIMATION IS
REQUIRED
• A state is a combination of observable features or variables, which means that every time a feature or variable takes a new value, the result is a new state.
• Let's take a concrete example. Suppose an agent is in a 4x4 grid, so the location of the agent on the grid is a feature. This gives 16 different locations, meaning 16 different states.
• But that's not all: suppose the orientation (north, south, east, west) is also a feature. This gives 4 possibilities for each location, which brings the number of states to 16 * 4 = 64. Furthermore, if the agent can use 5 different tools (including the "no tool" case), the number of states grows to 64 * 5 = 320.
• Hence, as features and their possible values are added, the number of states can become enormous.
• One way to represent these states is to create a multidimensional array such as V[row, column, direction, tool] (sketched below). We can then query or update a state.
• For example, V[1, 2, north, torch] represents the state where the agent is at row 1, column 2, looking north and holding a torch. The value inside this array cell tells how valuable that state is.
• So once we have the set of states, we can assign a state-value function to each state.
• Needless to say, the amount of memory needed to accommodate this number of states is huge, and the amount of time needed to compute the value of each state is also prohibitive.
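• A small sketch of this tabular representation: one value per (row, column, direction, tool) combination, 4 * 4 * 4 * 5 = 320 entries in total. The tool names beyond the torch are illustrative assumptions:

```python
import numpy as np

directions = ["north", "south", "east", "west"]
tools = ["no tool", "torch", "rope", "key", "map"]   # illustrative tool names

V = np.zeros((4, 4, len(directions), len(tools)))    # 320 state values in one table

# e.g. the value of being at row 1, column 2, facing north, holding a torch
v = V[1, 2, directions.index("north"), tools.index("torch")]
```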
Solutions
• It is always useful to keep in mind what we are trying to do, because with all the details we might lose sight of it. The idea is that we want to find the value of each state/action in an environment, so that the agent follows the optimal path that collects the maximum reward.
• In order to address this shortcoming, we can adopt a new approach based on the features of each state.
• The aim is to use this set of features to generalize the estimation of the value at states that have similar features.
• We use the word estimation to indicate that this approach will never find the true value of a state, but only an approximation of it.
• Despite this seemingly inconvenient result, this approach achieves much faster computation and much better generalisation.
• The methods that compute these approximations are called Function Approximators.
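• A minimal sketch of one such function approximator: a linear approximation whose weights are adjusted with a semi-gradient TD(0) update. The choice of features and step sizes is an illustrative assumption:

```python
import numpy as np

def features(row, col, direction_idx, tool_idx):
    # a hand-crafted feature vector for a state (an illustrative choice of features)
    return np.array([1.0, row, col, direction_idx, tool_idx], dtype=float)

w = np.zeros(5)                                   # one weight per feature

def v_hat(state, w):
    """Estimated (approximate) value of a state, not its true value."""
    return features(*state) @ w

def td_update(w, state, reward, next_state, alpha=0.01, gamma=0.9):
    """Semi-gradient TD(0) update of the weight vector."""
    td_error = reward + gamma * v_hat(next_state, w) - v_hat(state, w)
    return w + alpha * td_error * features(*state)

# e.g. one update after moving from (1, 2, 0, 1) to (1, 3, 0, 1) with reward -0.04
w = td_update(w, (1, 2, 0, 1), -0.04, (1, 3, 0, 1))
```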
TEMPORAL DIFFERENCE LEARNING (TD)

• One of the problems with the environment is that rewards usually are
not immediately observable. For example, in tic-tac-toe or others, we
only know the reward(s) on the final move (terminal state). All other
moves will have 0 immediate rewards.
• TD learning is an unsupervised technique to predict a variable's
expected value in a sequence of states. TD uses a mathematical trick
to replace complex reasoning about the future with a simple learning
procedure that can produce the same results.
• Instead of calculating the total future reward, TD tries to predict the
combination of immediate reward and its own reward prediction at the
next moment in time.
TEMPORAL DIFFERENCE LEARNING-TD

• Mathematically, the key concept of TD learning is the discounted return:

  V*_t = r_t + γ·r_(t+1) + γ²·r_(t+2) + γ³·r_(t+3) + ...

• Here the value at time t is a combination of the discounted rewards in the future, which implies that future rewards are valued less. The TD error is the difference between the ultimately correct value (V*_t) and our current prediction (V_t):

  TD error = V*_t − V_t ≈ r_t + γ·V_(t+1) − V_t
CONTENTS

• Plan of Attack
• Bellman equation
• Markov decision process
• Policy vs plan
• Living penalty
• Temporal difference
Q-LEARNING

• Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
• The main objective of Q-learning is to learn a policy that tells the agent what action should be taken under what circumstances in order to maximize the reward.
• It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
• The goal of the agent in Q-learning is to maximize the value of Q.
• The Q-value can be derived from the Bellman equation.
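• A minimal tabular sketch of the Q-learning update that follows from the Bellman equation; the learning rate, discount factor, and table layout are illustrative assumptions:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> estimated value of taking action in state

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# e.g. one update after taking "right" in state "A", landing in "B" with reward -0.04
q_update("A", "right", -0.04, "B", actions=["up", "down", "left", "right"])
```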
BELLMAN EQUATION

• The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point by including the values of successor states.
• It is a way of calculating value functions in dynamic programming, and it leads to modern reinforcement learning.
• The key elements used in the Bellman equation are:
• The action performed by the agent is referred to as "a".
• The current state of the agent is "s".
• The reward/feedback obtained for each good and bad action is "R".
• The discount factor is gamma, "γ".
• The Bellman equation can be written as:
• V(s) = max_a [R(s,a) + γV(s')]
• Where,
• V(s) = the value calculated at a particular point.
• R(s,a) = the reward obtained at state s by performing action a.
• γ = the discount factor.
• V(s') = the value of the next state s'.
• In the above equation, we take the maximum over the possible actions because the agent always tries to find the optimal solution (a small worked example follows).
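• A small worked sketch of one Bellman backup with two hypothetical actions and illustrative numbers:

```python
# one Bellman backup, V(s) = max_a [R(s,a) + gamma * V(s')], with illustrative numbers
gamma = 0.9

# hypothetical rewards and next-state values for two available actions
actions = {
    "right": {"reward": 0.0, "next_value": 1.0},
    "down":  {"reward": 0.0, "next_value": 0.5},
}

V_s = max(a["reward"] + gamma * a["next_value"] for a in actions.values())
print(V_s)   # 0.9 -- the max is taken because the agent seeks the optimal action
```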
MARKOV DECISION PROCESS (MDP)

• A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.

• An MDP uses the Markov property, and to better understand MDPs, we need to learn about it.
• Markov Property:
• It says that "if the agent is present in the current state s1, performs an action a1 and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states."
• In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a game of chess, the players only focus on the current board position and do not need to remember past actions or states.
• An MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
• An MDP contains a tuple of four elements (S, A, Pa, Ra):
• A set of finite states S.
• A set of finite actions A.
• A transition probability Pa(s, s'): the probability of moving from state s to state s' due to action a.
• A reward Ra(s, s'): the reward received after transitioning from state s to state s' due to action a.
• Finite MDP:
• A finite MDP is when there are finite states, finite rewards, and finite
actions. In RL, we consider only the finite MDP.
• Markov Process:
• A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S, P) of a state set S and a transition function P. These two components (S and P) can define the dynamics of the system.
POLICY VS PLAN

• A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from
perceived states of the environment to actions to be taken when in those states.
• In general (as the name suggests), planning consists in creating a "plan" which you will use to reach a "goal".
The goal depends on the context or problem. For example, in robotics, you can use a "planning algorithm"
(e.g. Dijkstra's algorithm) in order to find the path between two points on a map (given e.g. the map as a
graph).
• In RL, planning usually refers to the use of a model of the environment in order to find a policy that hopefully
will help the agent to behave optimally (that is, obtain the highest amount of return or "future cumulative
discounted reward").
• In RL, the problem (or environment) is usually represented as a Markov Decision Process (MDP). The "model"
of the environment (or MDP) refers to the transition probability distribution (and reward function) associated
with the MDP. If the transition model (and reward function) is known, you can use an algorithm that exploits it
to (directly or indirectly) find a policy. This is the usual meaning of planning in RL
• Planning is often performed "offline", that is, you "plan" before executing. While
you're executing the "plan", you often do not change it.
• However, often this is not desirable, given that you might need to change the
plan because the environment might also have changed.
• Furthermore, planning algorithms have a notable limitation: in the case of RL, a "model" of the environment is required in order to plan.
• In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we directly learn the policy function that maps states to actions.
LIVING PENALTY

• In Reinforcement Learning (RL), agents are trained on


a reward and punishment mechanism. The agent is rewarded for correct
moves and punished for the wrong ones. In doing so, the agent tries to
minimize wrong moves and maximize the right ones.
• When an agent takes an action in a state, it receives a reward. Here the term
“reward” is an abstract concept that describes feedback from the
environment. A reward can be positive or negative. When the reward is
positive, it is corresponding to our normal meaning of reward. When the
reward is negative, it is corresponding to what we usually call “punishment.”
• When the agent receives a small negative reward on every move, including correct ones, it is forced to act correctly and reach the goal with minimal wastage of time in the environment. Adding this small per-step cost is called adding a living penalty, and it helps to achieve higher efficiency.
• Example: In a 4x4 maze environment, suppose the agent performs two different actions, MoveUp and MoveDown; each right action is given a reward of -0.01 and each wrong action a reward of -0.5. Since even correct actions receive a small negative reward, the agent is pushed to reach the goal as quickly as possible.
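• A hedged sketch of such a reward function, following the maze example above; the +1.0 goal reward is an illustrative assumption:

```python
def reward(reached_goal, action_was_wrong):
    if reached_goal:
        return 1.0        # reaching the goal/diamond
    if action_was_wrong:
        return -0.5       # larger penalty for a wrong move
    return -0.01          # living penalty: even a right move costs a little time
```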
TEMPORAL DIFFERENCE

• Temporal Difference Learning is an unsupervised learning technique that is very


commonly used in reinforcement learning for the purpose of predicting the total
reward expected over the future.
• It can, however, be used to predict other quantities as well. It is essentially a way to learn how to predict a quantity that depends on the future values of a given signal.
• Temporal difference learning is a method that is used to compute the long-term utility
of a pattern of behaviour from a series of intermediate rewards.
• Essentially, TD Learning focuses on predicting a variable's future value in a sequence of states. Temporal
difference learning was a major breakthrough in solving the problem of reward prediction.
• You could say that it employs a mathematical trick that allows it to replace complicated reasoning with a simple learning procedure that can be used to generate the very same results.

• The trick is that rather than attempting to calculate the total future reward, temporal difference learning just
attempts to predict the combination of immediate reward and its own reward prediction at the next moment
in time. Now when the next moment comes and brings fresh information with it, the new prediction is
compared with the expected prediction. If these two predictions are different from each other, the TD
algorithm will calculate how different the predictions are from each other and make use of this temporal
difference to adjust the old prediction toward the new prediction.
PARAMETERS USED IN TEMPORAL
DIFFERENCE
• Alpha (α): learning rate
It shows how much our estimates should be adjusted, based on the error. This rate varies
between 0 and 1.
• Gamma (γ): the discount rate
This indicates how much future rewards are valued. A larger discount rate signifies that
future rewards are valued to a greater extent. The discount rate also varies between 0 and 1.
• Epsilon (e): the exploration vs. exploitation trade-off.
This involves exploring new options with probability e and staying with the current maximum with probability 1 - e. A larger e signifies that more exploration is carried out during training.
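• A hedged sketch tying the three parameters together, using epsilon-greedy action selection and a SARSA-style TD update; all names, default values, and the table layout are illustrative assumptions:

```python
import random
from collections import defaultdict

Q = defaultdict(float)

def epsilon_greedy(state, actions, e=0.1):
    if random.random() < e:                                    # explore new options
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])           # exploit the current max

def td_step(state, action, reward, next_state, next_action, alpha=0.5, gamma=0.9):
    # adjust the old estimate toward: immediate reward + discounted estimate at the next step
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```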
Thank You
