REINFORCEMENT LEARNING (RL)
• Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an
environment by performing actions and observing their results. For each good action, the agent receives positive
feedback, and for each bad action, it receives negative feedback or a penalty.
• In Reinforcement Learning, the agent learns automatically from this feedback without any labelled data, unlike
supervised learning.
• Since there is no labelled data, the agent is bound to learn from its experience alone.
ELEMENTS OF RL
• Policy: A policy defines the learning agent's behaviour at a given time. It is a mapping from perceived states of
the environment to the actions to be taken when in those states.
• Reward function: The reward function defines the goal of a reinforcement learning problem. It is a function that
provides a numerical score based on the state of the environment.
• Value function: Value functions specify what is good in the long run. The value of a state is the total amount of
reward an agent can expect to accumulate over the future, starting from that state.
• Model: The last element of reinforcement learning is the model, which mimics the behaviour of the environment.
With the help of the model, one can make inferences about how the environment will behave; for example, given a
state and an action, the model can predict the next state and reward.
FRAMEWORK OF RL
• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The surroundings in which the agent is present and with which it interacts. In RL, we assume a
stochastic environment, which means it is random in nature.
• Action: Actions are the moves taken by an agent within the environment.
• State: The state is the situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent from the environment to evaluate its action.
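These pieces fit together in the standard agent-environment interaction loop. Below is a minimal sketch of one episode, assuming hypothetical env.reset()/env.step() and agent.act()/agent.learn() interfaces rather than any particular library:

```python
def run_episode(env, agent):
    """One episode of the RL interaction loop (hypothetical env/agent interfaces)."""
    state = env.reset()                     # environment provides the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)           # agent chooses an action from its policy
        next_state, reward, done = env.step(action)            # environment responds
        agent.learn(state, action, reward, next_state, done)   # agent updates from the reward
        state = next_state
        total_reward += reward
    return total_reward
```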
Q-LEARNING
1. Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by
taking the correct actions. Q-learning is a type of reinforcement learning.
2. Q-Learning is a reinforcement learning method that finds the next best action given the current state. During
training it may choose actions at random to explore, while aiming to maximize the cumulative reward.
3. The objective of the model is to find the best course of action given its current state. To do this, it can learn
from actions taken outside the policy it is currently following, such as random exploratory actions; because the
update does not depend on the policy being followed, Q-learning is called an off-policy method. A minimal sketch
of the update rule follows.
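A minimal sketch of tabular Q-learning with the update Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]; the environment interface (reset()/step() over small integer-indexed state and action spaces) is an assumed placeholder:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch (assumes a gym-like env with reset()/step())."""
    Q = np.zeros((n_states, n_actions))        # Q-table: one entry per (state, action)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration: random action with probability epsilon
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy update: bootstrap from the greedy value of the next state
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```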
SUPERVISED VS RL
• In supervised learning, the model learns from a labelled dataset in which the correct output is provided for every
input; in RL, there are no labels and the agent learns from reward signals obtained by interacting with the environment.
• Supervised learning makes a one-shot prediction for each independent example, whereas RL involves sequential
decisions whose feedback (reward) may be delayed.
MARKOV MODEL
A Markov decision process (MDP) is a mathematical framework for modelling decision-making problems in which the
outcome of each decision depends on the current state of the world and the action taken. MDPs are used in reinforcement
learning (RL), a type of machine learning that allows software agents to learn how to behave in an environment by trial
and error.
An MDP mainly consists of:
• A set of states S
• A set of actions A
• A transition probability function P(s' | s, a), the probability of moving to state s' after taking action a in state s
• A reward function R(s, a), the immediate reward for taking action a in state s
• A discount factor γ that weighs immediate rewards against future rewards
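For illustration, a tiny hypothetical MDP can be written down directly as these components; the numbers below are made up and only show the structure used by the value and policy iteration sketches later in this section:

```python
# Hypothetical 2-state, 2-action MDP.
# transitions[s][a] is a list of (probability, next_state, reward) tuples.
states = [0, 1]
actions = [0, 1]
gamma = 0.9  # discount factor

transitions = {
    0: {0: [(0.8, 0, 0.0), (0.2, 1, 1.0)],
        1: [(1.0, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
}
```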
Function Approximation
1. Function approximation in reinforcement learning is a technique that allows agents to learn in
environments with large state and action spaces by representing the value function or policy as a
parameterized function. This is in contrast to tabular reinforcement learning, where the value function or
policy is stored explicitly for each state or state-action pair.
2. Function approximation is important in reinforcement learning because it allows agents to learn in
environments with large state and action spaces. For example, if an agent is learning to play a game like
Atari Pong, the number of possible states is far too large to enumerate, so it would be impractical to learn
a separate value or policy for each state. Instead, the agent can use function approximation to learn a
function that can estimate the value of any state, as in the sketch below.
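As a minimal sketch (not the DQN algorithm itself), the snippet below approximates Q(s, a) with a linear function of a state feature vector phi(s) and updates its weights with a semi-gradient Q-learning step; the feature extractor that produces phi is assumed to exist elsewhere:

```python
import numpy as np

class LinearQ:
    """Linear function approximation: Q(s, a) = w[a] . phi(s)."""
    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.99):
        self.w = np.zeros((n_actions, n_features))   # one weight vector per action
        self.alpha, self.gamma = alpha, gamma

    def q_values(self, phi):
        return self.w @ phi                          # Q estimates for every action

    def update(self, phi, action, reward, phi_next, done):
        # Semi-gradient Q-learning step on the weights of the taken action
        target = reward if done else reward + self.gamma * np.max(self.q_values(phi_next))
        td_error = target - self.q_values(phi)[action]
        self.w[action] += self.alpha * td_error * phi
```

Replacing the linear map with a neural network gives the kind of approximator used by DQN, discussed in the examples below.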
Advantages of function approximation in reinforcement learning:
• Scalability: Function approximation allows agents to learn in environments with large state and action
spaces.
• Efficiency: Function approximation can improve the efficiency of reinforcement learning algorithms by
reducing the number of training steps required.
• Generalization: Function approximation allows agents to generalize to new states and actions that they
have not seen before.
Here are some specific examples of how function approximation can be used to improve reinforcement
learning performance:
• In Atari games, function approximation has been used to train agents to achieve superhuman
performance. For example, the Deep Q-Network (DQN) algorithm uses a neural network function
approximator to learn the Q-function, which maps state-action pairs to expected rewards.
• In robotics, function approximation has been used to train agents to perform complex tasks, such as
walking over uneven terrain and assembling products. For example, the Model Predictive Control (MPC)
algorithm uses a linear function approximator to learn the dynamics of the robot and to generate optimal
control trajectories.
• In finance, function approximation has been used to train agents to make investment decisions and
manage risk. For example, the Reinforcement Learning Portfolio Optimization (RLPO) algorithm uses a
neural network function approximator to learn the value of different investment portfolios.
Least Squares Method
1. The least squares method is used to derive the linear equation that best relates two variables, one of
which is independent and the other dependent on it. The value of the independent variable is represented
as the x-coordinate and that of the dependent variable as the y-coordinate in a 2D Cartesian coordinate
system. Initially, the known values are marked on a plot.
2. The plot obtained at this point is called a scatter plot. We then try to represent all the marked points by a
single straight line, or linear equation, whose coefficients are obtained with the help of the least squares
method. This line lets us estimate the value of the dependent variable for an independent variable whose
value was not observed, which helps us fill in missing points in a data table or forecast data, as in the sketch below.
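A minimal sketch: for points (x_i, y_i), the best-fit line y = m*x + c has slope m = sum((x_i - x_mean) * (y_i - y_mean)) / sum((x_i - x_mean)^2) and intercept c = y_mean - m*x_mean. The data below is made up purely for illustration:

```python
import numpy as np

def least_squares_fit(x, y):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

# Illustrative data only
x = [1, 2, 3, 4, 5]
y = [2.1, 4.1, 6.2, 8.0, 9.9]
m, c = least_squares_fit(x, y)
print(f"y = {m:.2f}x + {c:.2f}")   # approximately y = 1.95x + 0.21
```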
Applications of Reinforcement Learning
2. Robotics:
• RL can be used to train robots to perform tasks such as grasping objects, walking, and
playing games.
• In grasping objects, an agent can learn to grasp objects by maximizing a reward signal
based on how well it grasps the object.
• In walking, an agent can learn to walk by maximizing a reward signal based on how far
it walks without falling.
• In playing games, an agent can learn to play games such as chess or Go by maximizing
a reward signal based on how well it plays the game.
3. Recommendation Systems:
• RL can be used to train agents to make recommendations based on user feedback.
• In news recommendation systems, an agent can learn to recommend news articles by
maximizing a reward signal based on how well the recommended articles match the
user's interests.
• In E-commerce recommendation systems, an agent can learn to recommend products
by maximizing a reward signal based on how well the recommended products match
the user's preferences.
• In social media recommendation systems, an agent can learn to recommend posts or
accounts by maximizing a reward signal based on how well the recommended content
matches the user’s interests.
Value Iteration:
• Value iteration is a dynamic programming algorithm used to find the optimal value function of a Markov
decision process (MDP).
• The algorithm starts with an initial estimate of the value function and iteratively updates it until
convergence.
• At each iteration, the algorithm computes the maximum expected reward that can be obtained from each
state by considering all possible actions.
• The algorithm converges to the optimal value function in the limit, and reaches any desired accuracy after a
finite number of iterations.
• Each iteration of value iteration is computationally cheap, which makes it practical even for fairly large MDPs
(sketched below).
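A minimal sketch of value iteration, assuming the MDP is given as a transitions table of (probability, next_state, reward) tuples like the hypothetical one shown in the MDP section above:

```python
def value_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """Repeat the Bellman optimality backup until the value function stops changing."""
    V = {s: 0.0 for s in states}                  # initial estimate of the value function
    while True:
        delta = 0.0
        for s in states:
            # Backed-up value of each action: expected reward plus discounted next value
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                         # stop when updates become negligible
            break
    # Greedy policy with respect to the converged value function
    policy = {s: max(actions,
                     key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in transitions[s][a]))
              for s in states}
    return V, policy
```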
Policy Iteration:
• Policy iteration is a dynamic programming algorithm used to find the optimal policy of an MDP.
• The algorithm starts with an initial policy and iteratively improves it until convergence.
• At each iteration, the algorithm evaluates the current policy and computes a new policy that is guaranteed
to be better than or equal to the current policy.
• The algorithm is guaranteed to converge to the optimal policy after a finite number of iterations.
• Each iteration of policy iteration can be computationally expensive, especially for large MDPs, because it
requires a full policy evaluation, but the algorithm typically converges in fewer iterations than value iteration
(sketched below).
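A matching sketch of policy iteration over the same kind of transitions table, alternating policy evaluation and greedy policy improvement until the policy stops changing:

```python
def policy_iteration(states, actions, transitions, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: compute V for the current policy
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in transitions[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in transitions[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                                # unchanged policy is optimal
            return V, policy
```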