
REINFORCEMENT LEARNING

• Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an
environment by performing actions and observing their results. For each good action, the agent receives positive
feedback, and for each bad action, it receives negative feedback or a penalty.
• In reinforcement learning, the agent learns automatically from this feedback without any labelled data, unlike
supervised learning.
• Since there is no labelled data, the agent is bound to learn from its own experience.

ELEMENTS OF RL

• Policy: A policy defines the learning agent's behaviour at a given time. It is a mapping from perceived states of
the environment to the actions to be taken when in those states.
• Reward function: The reward function defines the goal of a reinforcement learning problem. It is a function that
provides a numerical score based on the state of the environment.
• Value function: Value functions specify what is good in the long run. The value of a state is the total amount of
reward an agent can expect to accumulate over the future, starting from that state.
• Model: The last element of reinforcement learning is the model, which mimics the behaviour of the environment.
With the help of the model, one can make inferences about how the environment will behave; for example, given a state
and an action, the model can predict the next state and reward.

FRAMEWORK OF RL

• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which the agent is present or by which it is surrounded. In RL we usually assume a
stochastic environment, meaning its responses are random in nature.
• Action: Actions are the moves taken by the agent within the environment.
• State: The state is the situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent by the environment to evaluate the agent's action.
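
The interaction among these elements can be illustrated with a minimal agent-environment loop. The sketch below is a hypothetical Python example; the two-state environment, its 80% success probability, and the random policy are assumptions made purely for illustration, not part of the original notes.

import random

# A toy stochastic environment: two states, two actions (hypothetical example).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment: returns (next_state, reward) for a state-action pair."""
    # Stochastic transition: the intended move succeeds 80% of the time.
    intended = "s1" if action == "right" else "s0"
    next_state = intended if random.random() < 0.8 else state
    reward = 1.0 if next_state == "s1" else 0.0   # goal: reach s1
    return next_state, reward

def random_policy(state):
    """Agent's policy: here simply a random choice of action."""
    return random.choice(ACTIONS)

# Agent-environment interaction loop.
state = "s0"
total_reward = 0.0
for t in range(10):
    action = random_policy(state)          # agent acts on the environment
    state, reward = step(state, action)    # environment returns new state and reward
    total_reward += reward                 # feedback accumulates over time
print("total reward:", total_reward)
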
Q-LEARNING
1. Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by
taking the correct actions. Q-learning is a type of reinforcement learning.
2. Q-Learning is a reinforcement learning method that finds the next best action, given the current state. During
training it sometimes chooses actions at random to explore, while aiming to maximize the long-term reward.
3. The objective of the model is to find the best course of action given its current state. The agent may gather
experience with an exploratory behaviour policy, yet the values it learns correspond to the greedy (optimal) policy;
because the policy being learned differs from the policy used to act, Q-learning is called off-policy. A minimal
sketch of the update rule is given below.
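
At the heart of Q-learning is the update rule Q(s, a) <- Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]. The sketch below is a minimal tabular Q-learning example in Python; the five-state chain environment, the learning rate, discount factor, and epsilon-greedy exploration values are illustrative assumptions, not part of the original notes.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate (assumed values)
ACTIONS = [-1, +1]                        # move left / move right on a 5-state chain
N_STATES = 5                              # states 0..4; reaching state 4 gives reward 1

Q = defaultdict(float)                    # tabular Q-values, keyed by (state, action)

def step(state, action):
    """Toy deterministic chain environment (hypothetical example)."""
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy behaviour policy: explore sometimes, otherwise act greedily.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Off-policy update: bootstrap from the greedy value of the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Learned greedy policy per state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
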
SUPERVISED VS RL
• Supervised learning learns a mapping from inputs to outputs from a labelled dataset, and each prediction receives
direct feedback in the form of the correct answer.
• Reinforcement learning has no labelled data; the agent learns from reward signals obtained by interacting with the
environment, and its decisions are sequential, with each action affecting future states.

MARKOV MODEL
A Markov decision process (MDP) is a mathematical framework for modelling decision-making problems in which the
outcome of each decision depends on the current state of the world and the action taken. MDPs are used in reinforcement
learning (RL), a type of machine learning that allows software agents to learn how to behave in an environment by trial
and error.
An MDP mainly consists of:

• A set of possible world states S.
• A set of models, i.e. the transition model T(s, a, s') giving the probability of moving to state s' when taking
action a in state s.
• A set of possible actions A.
• A real-valued reward function R(s, a).
• A policy π, which is the solution of the Markov decision process.
The goal of the agent in an MDP is to find a policy, which is a mapping from states to actions, that maximizes the
expected sum of rewards over time.
Reinforcement learning algorithms use trial and error to learn the optimal policy for an MDP. They do this by interacting
with the environment and receiving rewards and penalties. Over time, the algorithm learns to associate certain actions
with certain states and rewards, and it can then choose the actions that are most likely to lead to a good outcome.
Dynamic Programming for MDP
1. Dynamic programming (DP) is a mathematical optimization method for solving sequential decision
problems. It is based on the principle of optimality, which states that an optimal policy has the property
that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first decision.
2. MDPs, or Markov decision processes, are a type of sequential decision problem where the agent's actions
affect the state of the environment and the rewards that it receives. DP can be used to solve MDPs by
recursively solving the problem for smaller and smaller subproblems.
3. The basic idea of DP for MDPs is to construct a value function for each state. The value function for a
state represents the expected reward that the agent can expect to receive if it starts in that state and follows
the optimal policy.
To construct the value function, DP uses the following recursive (Bellman) equation:

V(s) = max_a [ R(s, a) + γ · E[V(s')] ]

where:
• V(s) is the value function for state s
• a is an action
• R(s, a) is the immediate reward for taking action a in state s
• γ is the discount factor (0 ≤ γ ≤ 1) that weighs future rewards against immediate ones
• E[V(s')] is the expected value of the value function for the next state s', given by the sum over s' of the
transition probability from s to s' under action a, times the value function for s'.
This equation states that the value of a state equals the maximum, over actions, of the immediate reward plus the
discounted expected value of the next state when the optimal policy is followed thereafter.
To solve the MDP using DP, we start by initializing the value function for each state to zero. Then, we
repeatedly iterate over the recursive equation above, updating the value function for each state until it
converges. Once the value function has converged, we can then determine the optimal policy for each state by
taking the action that maximizes the expected reward.
DP is a powerful tool for solving MDPs, but it can be computationally expensive for large state spaces; a number of
techniques exist to improve the efficiency of DP algorithms. A compact sketch of the procedure is given below.
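
The sketch below follows the procedure just described: initialize V to zero, repeatedly apply the Bellman update until it converges, then extract the greedy policy. The three-state MDP, its transition probabilities, and the discount factor are hypothetical values chosen only to keep the example small.

# Value-function DP (value iteration) on a tiny assumed MDP.
GAMMA, THETA = 0.9, 1e-6          # discount factor and convergence threshold (assumed)

STATES = ["A", "B", "C"]
ACTIONS = ["stay", "go"]

# P[(s, a)] = list of (probability, next_state, reward) triples (hypothetical dynamics).
P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 0.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"):   [(0.8, "C", 1.0), (0.2, "B", 0.0)],
    ("C", "stay"): [(1.0, "C", 0.0)],
    ("C", "go"):   [(1.0, "C", 0.0)],
}

def q_value(V, s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

V = {s: 0.0 for s in STATES}       # 1. initialize the value function to zero
while True:                         # 2. sweep the Bellman update until convergence
    delta = 0.0
    for s in STATES:
        new_v = max(q_value(V, s, a) for a in ACTIONS)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < THETA:
        break

# 3. extract the greedy (optimal) policy from the converged value function.
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(V, policy)
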

Bellman’s Principle of Optimality


1. It is a fundamental aspect of dynamic programming, which states that the optimal solution to a dynamic
programming optimization problem can be found by combining the optimal solutions to its subproblems.
2. The principle is generally applicable to problems with a finite or countable state space, which keeps the
theoretical complexity manageable.
3. It cannot be applied directly to classic models such as inventory management or dynamic pricing models that have
a continuous state space; dynamic programming with a general state space remains challenging.
4. The principle states that an optimal policy has the property that, whatever the initial state and initial
decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting
from the first decision.
5. The dynamic programming method breaks down a multi-step decision problem into smaller (recursive) subproblems
using Bellman's principle of optimality.
6. The current state summarizes everything needed for future decisions (the Markov property), so future decisions do
not depend on how the state was reached. This allows us to separate the initial decision from the future decisions
and optimize the future decisions on their own.

Function Approximation
1. Function approximation in reinforcement learning is a technique that allows agents to learn in
environments with large state and action spaces by representing the value function or policy as a parameterized
function (for example a linear model or a neural network). This is in contrast to tabular reinforcement learning,
where the value function or policy is stored explicitly for each state or state-action pair.
2. Function approximation is important in reinforcement learning because it allows agents to learn in
environments with large state and action spaces. For example, if an agent is learning to play a game like
Atari Pong, there are billions of possible states that the agent could be in. It would be impractical to learn
a separate value or policy for each state. Instead, the agent can use function approximation to learn a
function that can estimate the value of any state.
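
As a concrete illustration, the sketch below approximates Q-values with a linear function of hand-crafted state-action features and updates the weights with a semi-gradient Q-learning step; the feature construction, learning rate, discount factor, and the example transition are all assumptions for illustration only.

import numpy as np

ALPHA, GAMMA = 0.01, 0.9            # step size and discount factor (assumed values)
ACTIONS = [0, 1]                    # two abstract actions

def features(state, action):
    """Hand-crafted feature vector for a (state, action) pair (illustrative choice)."""
    x = np.array([1.0, state, state ** 2])             # polynomial features of the state
    phi = np.zeros(2 * len(x))
    phi[action * len(x):(action + 1) * len(x)] = x      # one block of features per action
    return phi

w = np.zeros(6)                     # weight vector: Q(s, a) is approximated by w · features(s, a)

def q(state, action):
    return float(w @ features(state, action))

def semi_gradient_update(s, a, r, s_next):
    """One semi-gradient Q-learning step on the linear approximator."""
    global w
    target = r + GAMMA * max(q(s_next, b) for b in ACTIONS)
    td_error = target - q(s, a)
    w += ALPHA * td_error * features(s, a)

# Example update on a single transition (values are made up for illustration).
semi_gradient_update(s=0.5, a=1, r=1.0, s_next=0.7)
print(q(0.5, 1), q(0.31, 1))        # the approximator generalizes to states never updated directly
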
Advantages of function approximation in reinforcement learning:
• Scalability: Function approximation allows agents to learn in environments with large state and action
spaces.
• Efficiency: Function approximation can improve the efficiency of reinforcement learning algorithms by
reducing the number of training steps required.
• Generalization: Function approximation allows agents to generalize to new states and actions that they
have not seen before.
Here are some specific examples of how function approximation can be used to improve reinforcement
learning performance:
• In Atari games, function approximation has been used to train agents to achieve superhuman
performance. For example, the Deep Q-Network (DQN) algorithm uses a neural network function
approximator to learn the Q-function, which maps state-action pairs to expected rewards.
• In robotics, function approximation has been used to train agents to perform complex tasks, such as
walking over uneven terrain and assembling products. For example, the Model Predictive Control (MPC)
algorithm uses a linear function approximator to learn the dynamics of the robot and to generate optimal
control trajectories.
• In finance, function approximation has been used to train agents to make investment decisions and
manage risk. For example, the Reinforcement Learning Portfolio Optimization (RLPO) algorithm uses a
neural network function approximator to learn the value of different investment portfolios.
Least Square Method
1. The least squares method is used to derive the best-fitting linear equation between two variables, one of
which is independent and the other dependent on it. The value of the independent variable is represented as the
x-coordinate and that of the dependent variable as the y-coordinate in a 2D Cartesian coordinate system. Initially,
the known values are marked on a plot.
2. The plot obtained at this point is called a scatter plot. Then, we try to represent all the marked points by a
straight line, i.e. a linear equation. The equation of such a line is obtained with the least squares method, which
chooses the line minimizing the sum of squared vertical distances between the points and the line. This is done to
estimate the value of the dependent variable for an independent variable whose value was initially unknown, which
helps us fill in missing points in a data table or forecast the data.
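
For a simple linear fit y = m·x + c, the least-squares estimates are m = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and c = ȳ - m·x̄. The short Python sketch below applies these formulas; the sample data points are made up for illustration.

# Simple least-squares fit of y = m*x + c (sample data is made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 7.9, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
c = y_mean - m * x_mean             # intercept: the fitted line passes through the mean point

predict = lambda x: m * x + c       # estimate/forecast unknown dependent values
print(m, c, predict(6.0))
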
Applications of Reinforcement Learning

1. Natural Language Processing (NLP):


• RL can be used to train agents to perform tasks such as text classification, machine
translation, and dialogue generation.
• In text classification, an agent can learn to classify text into different categories by
maximizing a reward signal based on how well the classification matches the desired
output.
• In machine translation, an agent can learn to translate text from one language to another
by maximizing a reward signal based on how well the translation matches the desired
output.
• In dialogue generation, an agent can learn to generate responses by maximizing a reward
signal based on how well the generated response matches the desired response.

2. Robotics:
• RL can be used to train robots to perform tasks such as grasping objects, walking, and
playing games.
• In grasping objects, an agent can learn to grasp objects by maximizing a reward signal
based on how well it grasps the object.
• In walking, an agent can learn to walk by maximizing a reward signal based on how far
it walks without falling.
• In playing games, an agent can learn to play games such as chess or Go by maximizing
a reward signal based on how well it plays the game.

3. Recommendation Systems:
• RL can be used to train agents to make recommendations based on user feedback.
• In news recommendation systems, an agent can learn to recommend news articles by
maximizing a reward signal based on how well the recommended articles match the
user's interests.
• In E-commerce recommendation systems, an agent can learn to recommend products
by maximizing a reward signal based on how well the recommended products match
the user's preferences.
• In social media recommendation systems, an agent can learn to recommend posts or
accounts by maximizing a reward signal based on how well the recommended content
matches the user’s interests.
Value Iteration:

• Value iteration is a dynamic programming algorithm used to find the optimal value function of a Markov
decision process (MDP).
• The algorithm starts with an initial estimate of the value function and iteratively updates it until
convergence.
• At each iteration, the algorithm computes the maximum expected reward that can be obtained from each
state by considering all possible actions.
• The algorithm is guaranteed to converge to the optimal value function (to within any desired accuracy) after a
finite number of iterations.
• Value iteration is relatively simple and efficient per iteration, although very large state spaces can still make
it computationally demanding.

Policy Iteration:

• Policy iteration is a dynamic programming algorithm used to find the optimal policy of an MDP.
• The algorithm starts with an initial policy and iteratively improves it until convergence.
• At each iteration, the algorithm evaluates the current policy and computes a new policy that is guaranteed
to be better than or equal to the current policy.
• The algorithm is guaranteed to converge to the optimal policy after a finite number of iterations.
• Policy iteration can be computationally expensive per iteration, especially for large MDPs, but it often converges
in fewer iterations than value iteration.
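
A compact sketch of policy iteration (iterative policy evaluation followed by greedy policy improvement) is given below; the two-state MDP, its dynamics, and the discount factor are hypothetical values chosen only to keep the example readable.

# Policy iteration on a tiny assumed MDP: evaluate the current policy, then improve it greedily.
GAMMA, THETA = 0.9, 1e-6

STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]
# P[(s, a)] = list of (probability, next_state, reward) triples (hypothetical dynamics).
P = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(0.9, "s1", 1.0), (0.1, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s1", 2.0)],
    ("s1", "a1"): [(1.0, "s0", 0.0)],
}

def q_value(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

policy = {s: "a0" for s in STATES}             # start from an arbitrary initial policy
V = {s: 0.0 for s in STATES}

while True:
    # Policy evaluation: compute V for the current policy until it stabilizes.
    while True:
        delta = 0.0
        for s in STATES:
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < THETA:
            break
    # Policy improvement: act greedily with respect to the evaluated value function.
    stable = True
    for s in STATES:
        best = max(ACTIONS, key=lambda a: q_value(V, s, a))
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:                                  # no change means the policy is optimal
        break

print(policy, V)
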
