Reinforcement Learning

Reinforcement learning is a type of machine learning where an agent learns how to achieve a goal by interacting with its environment. The agent performs actions and receives rewards or punishments, allowing it to gradually learn what actions yield the maximum reward. Key aspects of reinforcement learning include the agent, environment, actions, states, rewards, and policies to maximize long-term rewards. Methods like Q-learning use reinforcement learning to find optimal actions by learning action-value functions through trial-and-error interactions with dynamic and uncertain environments.

Reinforcement Learning

By Shweta Saxena
Types of machine learning
Reinforcement Learning

Figure: The Agent (a computer program) performs an Action in the Environment, which returns (State, Action, Reward) to the Agent.
Reinforcement Learning
• Art of optimal decision making
• Reinforcement learning is a type of machine learning method in which an
intelligent agent (computer program) interacts with the environment and
learns to act within it.
• In RL an agent learns by trial and error using feedback from its own actions
and experiences.
• How a robotic dog learns the movement of its limbs is an example of reinforcement
learning.
• RL solves a specific type of problem where decision making is sequential and
the goal is long-term.
• Game playing
• Robotics, etc.
Reinforcement Learning
• The figure below illustrates the action-reward feedback loop of a
generic RL model.
Reinforcement Learning
Reinforcement Learning
• The image above shows a robot, a diamond, and fire.
• The goal of the robot is to reach the reward, the diamond, while
avoiding the hurdles, the fire.
• The robot learns by trying all the possible paths and then choosing
the path that gives it the reward with the fewest hurdles.
• Each right step gives the robot a reward, and each wrong step
subtracts from the robot's reward.
• The total reward is calculated when the robot reaches the final reward,
the diamond.
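The robot/diamond/fire story above can be sketched as a tiny path-scoring exercise. The grid layout and reward values below are illustrative assumptions, not taken from the slides:

```python
# Hypothetical 2x2 grid for the robot example; cell contents and
# reward values are made-up assumptions for illustration.
GRID = [["start", "fire"],
        ["empty", "diamond"]]
REWARD = {"start": 0, "empty": 0, "fire": -10, "diamond": 10}

def path_reward(path):
    """Total reward for a path given as a list of (row, col) cells."""
    return sum(REWARD[GRID[r][c]] for r, c in path)

safe = path_reward([(1, 0), (1, 1)])   # around the fire, then the diamond
risky = path_reward([(0, 1), (1, 1)])  # through the fire, then the diamond
```

After trying both paths, the robot would prefer the one with the higher total reward, which here is the path that avoids the fire.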
Main points in Reinforcement learning
1. Input: The input should be an initial state from which the model will
start.
2. Output: There are many possible outputs, as there are a variety of
solutions to a particular problem.
3. Training:
• The training is based upon the input.
• The model returns a state, and the user decides whether to reward or punish the
model based on its output.
• The model continues to learn.
• The best solution is decided based on the maximum reward.
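The interaction loop described in these points can be sketched in a few lines. The two-state environment and its reward rule below are hypothetical stand-ins, not part of the slides:

```python
import random

def step(state, action):
    """Hypothetical environment: returns (next_state, reward) for an action.
    Rewarding 'right' and punishing 'left' is an illustrative assumption."""
    reward = 1 if action == "right" else -1
    next_state = (state + 1) % 2
    return next_state, reward

state = 0            # input: the initial state the model starts from
total_reward = 0
for _ in range(10):  # the model keeps interacting and accumulating feedback
    action = random.choice(["left", "right"])
    state, reward = step(state, action)
    total_reward += reward  # the best behavior maximizes this total
```

A learning agent would use the accumulated rewards to prefer actions that scored well, rather than choosing uniformly at random as this sketch does.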
Terms used in Reinforcement Learning
• Agent(): An entity that can perceive/explore the environment and act upon it.
• Environment(): The situation in which the agent is present or by which it is surrounded. In RL, we
assume a stochastic environment, which means it is random in nature.
• Action(): Actions are the moves taken by an agent within the environment.
• State(): State is a situation returned by the environment after each action taken by the agent.
• Reward(): A feedback returned to the agent from the environment to evaluate the action of the
agent.
• Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
• Value(): The expected long-term return with the discount factor, as opposed to the short-
term reward.
• Q-value(): Mostly similar to the value, but it takes one additional parameter, the current
action (a).
Reinforcement learning and Supervised
learning
• Both supervised and reinforcement learning use a mapping between
input and output.
• In supervised learning, the feedback provided to the agent is the correct set
of actions for performing a task.
• Reinforcement learning uses rewards and punishments as signals for positive
and negative behavior.
• The goal in unsupervised learning is to find similarities and differences
between data points.
• In reinforcement learning, the goal is to find a suitable action model
that maximizes the total cumulative reward of the agent.
Difference between Reinforcement learning
and Supervised learning
Reinforcement learning:
• All about making decisions sequentially: the output depends on the state of
the current input, and the next input depends on the output of the previous input.
• Decisions are dependent, so we give labels to sequences of dependent decisions.
• Example: a chess game.

Supervised learning:
• The decision is made on the initial input, or the input given at the start.
• Decisions are independent of each other, so labels are given to each decision.
• Example: object recognition.

Markov Decision Process
• A Markov Decision Process, or MDP, is used to formalize
reinforcement learning problems.
• If the environment is completely observable, then its dynamics can be
modeled as a Markov Process.
• In MDP, the agent constantly interacts with the environment and
performs actions; at each action, the environment responds and
generates a new state.
Markov Decision Process
Markov Decision Process
• An MDP contains a tuple of four elements (S, A, Pa, Ra):
• S = a finite set of states
• A = a finite set of actions
• Pa = the probability of transitioning from state S to state S'
due to action a
• Ra = the reward received after transitioning from state S to state S'
due to action a
• MDP uses the Markov property.
Markov property
• "If the agent is present in the current state S1, performs an action a1,
and moves to the state S2, then the state transition from S1 to S2 depends
only on the current state; future actions and states do not
depend on past actions, rewards, or states."
• As per the Markov property, the current state transition does not depend
on any past action or state.
• Example: in a chess game, the players focus only on the current state and do not
need to remember past actions or states.
Finite MDP
• A finite MDP is when there are finite states, finite rewards, and finite
actions.
• In RL, we consider only the finite MDP.
• Markov Process:
• It is a memoryless process with a sequence of random states S1, S2, ....., St
that uses the Markov Property.
• Markov process is also known as Markov chain, which is a tuple (S, P) on state
S and transition function P.
• These two components (S and P) can define the dynamics of the system.
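A Markov chain (S, P) as defined above can be simulated directly: the next state is sampled using only the current state. The states and transition probabilities below are illustrative assumptions:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical two-state Markov chain (S, P); the weather states and
# probabilities are made up for illustration.
S = ["sunny", "rainy"]
P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def next_state(s):
    """Sample a successor of s; note it depends only on the current
    state, never on the history (the Markov property)."""
    r = random.random()
    cum = 0.0
    for s2, p in P[s].items():
        cum += p
        if r < cum:
            return s2
    return s2  # guard against floating-point round-off

chain = ["sunny"]
for _ in range(5):
    chain.append(next_state(chain[-1]))
```

The two components S and P are all the sampler needs, which is exactly the sense in which they define the dynamics of the system.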
Types of learning
Passive Reinforcement learning
• The agent's policy is fixed.
• The goal of the agent is to evaluate how good the fixed policy is.
• The agent needs to learn the expected utility Uπ(s) for each state s
(or for state-action pairs).
• This can be done in three ways.
• Direct Utility Estimation
• Adaptive Dynamic Programming(ADP)
• Temporal Difference Learning (TD)
Active Reinforcement learning
• The goal of the agent is not only to evaluate a policy; the agent
must also learn what to do, i.e., find an optimal policy.
• Types of Active reinforcement learning
• Adaptive Dynamic Programming(ADP) with exploration function
Q-learning
• Q-learning is used when we have to find the optimal path.
• It finds the next best action, given a current state.
• It may choose actions at random while exploring, but its aim is to
maximize the reward.
Q-Learning
• Q stands for Quality, which means it specifies the quality of an action taken by
the agent.
• Q-learning is an off-policy RL algorithm.
• The objective of the model is to find the best course of action given its current
state.
• To do this, it may come up with rules of its own, or it may act outside
the policy it is given to follow.
• Because the policy being learned (the greedy policy) can differ from
the policy used to act, Q-learning is called off-policy.
• At each state S, we choose an action "a" which maximizes the function Q(S, a).
• The function Q(S, a) measures how good it is to take action "a" in a particular state "S".
• Q(S, a) is updated with the Bellman equation, which is used to modify the Q-table.
Q table
• A Q-table or matrix is created while performing the Q-learning. The
table follows the state and action pair, i.e., [s, a], and initializes the
values to zero.
• After each action, the table is updated, and the Q-values are stored
within the table.
• The RL agent uses this Q-table as a reference table to select the best
action based on the Q-values.
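A Q-table keyed by [s, a] pairs, initialized to zero and updated after each observed transition, can be sketched as follows. The states, actions, and the single transition shown are illustrative assumptions:

```python
# Hypothetical states and actions for a minimal Q-table sketch.
states, actions = [0, 1], ["left", "right"]
Q = {(s, a): 0.0 for s in states for a in actions}  # initialize to zero

alpha, gamma = 0.5, 0.9  # learning rate and discount factor (assumed values)

def update(s, a, r, s2):
    """Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

update(0, "right", 1.0, 1)  # one observed (s, a, r, s') transition

# The agent then uses the table as a reference to pick the best action:
best = max(actions, key=lambda a: Q[(0, a)])
```

After the single update, Q[(0, "right")] has risen above zero, so the lookup selects "right" in state 0, which is how the Q-table serves as a reference for action selection.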
Working of Q learning
Bellman Equation
• It is used to determine the value of a particular state, i.e., to deduce how
good it is to be in that state.
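Although the slide does not write it out, the Bellman equation for the value of a state s, with reward R, discount factor γ, and transition probabilities P, is commonly stated as:

```latex
V(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big)
```

In words: the value of a state is the best achievable sum of the immediate reward and the discounted expected value of the next state.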
Discount Factor/ Rate
• It determines how much importance is to be given to the immediate
reward and future rewards.
• This basically helps us to avoid infinity as a reward in continuous
tasks.
• It has a value between 0 and 1.
• A value of 0 means that more importance is given to the immediate reward
and a value of 1 means that more importance is given to future rewards.
• In practice, an agent with a discount factor of 0 never learns beyond the
immediate reward, as it ignores the future entirely.
• A discount factor of 1 weighs future rewards fully, which may lead to an
infinite return in continuous tasks.
• Therefore, the optimal value for the discount factor typically lies between 0.2 and 0.8.
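The effect of the discount factor on the return can be checked with a short computation. The reward sequence below is an illustrative assumption:

```python
# Hypothetical reward sequence: one unit of reward per time step.
rewards = [1, 1, 1, 1, 1]

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the sequence (0**0 == 1 in Python)."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

g0 = discounted_return(rewards, 0.0)  # gamma = 0: only the immediate reward
g8 = discounted_return(rewards, 0.8)  # gamma = 0.8: future rewards count too
```

With γ = 0 the return collapses to the first reward, while γ = 0.8 accumulates a geometrically weighted sum of the whole sequence; keeping γ strictly below 1 is what bounds that sum in continuing tasks.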
Example
• An advertisement recommendation system.
• In a normal ad recommendation system, the ads you get are based on your
previous purchases or websites you may have visited.
• If you’ve bought a TV, you will get recommended TVs of different brands.

Figure: Ad Recommendation System


Example
• Using Q-learning, we can optimize the ad recommendation system to
recommend products that are frequently bought together.
• The reward will be if the user clicks on the suggested product.

Figure: Ad Recommendation System with Q-Learning


State Action Reward State action (SARSA)
• It is an on-policy temporal difference learning method.
• The on-policy control method selects the action for each state while
learning using a specific policy.
• SARSA is so named because its update uses the quintuple (s, a, r, s', a').
Where,
s: original state
a: original action
r: reward observed after taking action a
s', a': new state-action pair.
State Action Reward State action (SARSA)
• The goal of SARSA is to calculate Qπ(s, a) for the currently selected
policy π and all state-action pairs (s, a).
• The main difference between the Q-learning and SARSA algorithms is that,
unlike Q-learning, SARSA does not use the maximum reward of the next state
when updating the Q-value in the table.
• In SARSA, the new action and reward are selected using the same policy
that determined the original action.
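The contrast between the two updates can be sketched on a single transition. The states, Q-values, and the next action a' chosen by the policy are illustrative assumptions:

```python
# Hypothetical Q-values and one (s, a, r, s') transition for illustration.
alpha, gamma = 0.5, 0.9
Q = {("s", "a"): 0.0, ("s2", "a1"): 2.0, ("s2", "a2"): 4.0}

s, a, r, s2 = "s", "a", 1.0, "s2"
a2 = "a1"  # the next action the current policy actually picked

# SARSA (on-policy): the target uses the Q-value of the action the
# policy chose, whether or not it is the best one.
sarsa_target = r + gamma * Q[(s2, a2)]

# Q-learning (off-policy): the target uses the maximum Q-value over
# all next actions, regardless of what the policy would pick.
q_target = r + gamma * max(Q[(s2, "a1")], Q[(s2, "a2")])

sarsa_q = Q[(s, a)] + alpha * (sarsa_target - Q[(s, a)])
q_learning_q = Q[(s, a)] + alpha * (q_target - Q[(s, a)])
```

Because the policy here picked the non-greedy action a1, the SARSA update is smaller than the Q-learning update, which is exactly the on-policy versus off-policy distinction described above.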
Generalization in Reinforcement learning
• In Reinforcement learning, the generalization of the agents is
benchmarked on the environments they have been trained on.
• In a supervised learning setting, this would mean testing the model
using the training dataset.
Policy search
• It is a subfield of reinforcement learning that focuses on finding
good parameters for a given policy parametrization.
• It is well suited for robotics as it can cope with high-dimensional state
and action spaces, one of the main challenges in robot learning.
References
• https://www.geeksforgeeks.org/what-is-reinforcement-learning/
• https://www.kdnuggets.com/2018/06/explaining-reinforcement-learning-active-passive.html
• https://www.javatpoint.com/reinforcement-learning#Markov
• https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
• https://www.simplilearn.com/tutorials/machine-learning-tutorial/what-is-q-learning
• https://towardsdatascience.com/introduction-to-reinforcement-learning-markov-decision-process-44c533ebf8da
