
REINFORCEMENT LEARNING

• Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an
environment by performing actions and observing their results. For each good action, the agent receives positive
feedback, and for each bad action, it receives negative feedback or a penalty.
• In reinforcement learning, the agent learns automatically from this feedback without any labelled data, unlike
supervised learning.
• Since there is no labelled data, the agent is bound to learn from its own experience.

ELEMENTS OF RL

• Policy: A policy defines the learning agent's behaviour at a given time. It is a mapping from perceived states of
the environment to the actions to be taken when in those states.
• Reward function: The reward function defines the goal of a reinforcement learning problem. It is a function that
provides a numerical score based on the state of the environment.
• Value function: Value functions specify what is good in the long run. The value of a state is the total amount of
reward an agent can expect to accumulate over the future, starting from that state.
• Model: The last element of reinforcement learning is the model, which mimics the behaviour of the environment.
With the help of the model, one can make inferences about how the environment will behave; for example, given a state
and an action, the model can predict the next state and reward.

FRAMEWORK OF RL

• Agent: An entity that can perceive/explore the environment and act upon it.
• Environment: The situation in which the agent is present or by which it is surrounded. In RL we usually assume a
stochastic environment, meaning its responses are random in nature.
• Action: Actions are the moves taken by the agent within the environment.
• State: The state is the situation returned by the environment after each action taken by the agent.
• Reward: Feedback returned to the agent by the environment to evaluate the agent's action.
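
The interaction among these elements can be illustrated with a minimal agent-environment loop. The sketch below is a hypothetical Python example; the two-state environment, its 80% success probability, and the random policy are assumptions made purely for illustration, not part of the original notes.

import random

# A toy stochastic environment: two states, two actions (hypothetical example).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

def step(state, action):
    """Environment: returns (next_state, reward) for a state-action pair."""
    # Stochastic transition: the intended move succeeds 80% of the time.
    intended = "s1" if action == "right" else "s0"
    next_state = intended if random.random() < 0.8 else state
    reward = 1.0 if next_state == "s1" else 0.0   # goal: reach s1
    return next_state, reward

def random_policy(state):
    """Agent's policy: here simply a random choice of action."""
    return random.choice(ACTIONS)

# Agent-environment interaction loop.
state = "s0"
total_reward = 0.0
for t in range(10):
    action = random_policy(state)          # agent acts on the environment
    state, reward = step(state, action)    # environment returns new state and reward
    total_reward += reward                 # feedback accumulates over time
print("total reward:", total_reward)
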
Q-LEARNING
1. Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by
taking the correct actions. Q-learning is a type of reinforcement learning.
2. Q-Learning is a reinforcement learning method that finds the next best action, given the current state. During
training it sometimes chooses actions at random to explore, while aiming to maximize the long-term reward.
3. The objective of the model is to find the best course of action given its current state. The agent may gather
experience with an exploratory behaviour policy, yet the values it learns correspond to the greedy (optimal) policy;
because the policy being learned differs from the policy used to act, Q-learning is called off-policy. A minimal
sketch of the update rule is given below.
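
At the heart of Q-learning is the update rule Q(s, a) <- Q(s, a) + α [ r + γ max_a' Q(s', a') - Q(s, a) ]. The sketch below is a minimal tabular Q-learning example in Python; the five-state chain environment, the learning rate, discount factor, and epsilon-greedy exploration values are illustrative assumptions, not part of the original notes.

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2    # learning rate, discount, exploration rate (assumed values)
ACTIONS = [-1, +1]                        # move left / move right on a 5-state chain
N_STATES = 5                              # states 0..4; reaching state 4 gives reward 1

Q = defaultdict(float)                    # tabular Q-values, keyed by (state, action)

def step(state, action):
    """Toy deterministic chain environment (hypothetical example)."""
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy behaviour policy: explore sometimes, otherwise act greedily.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Off-policy update: bootstrap from the greedy value of the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Learned greedy policy per state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
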
SUPERVISED VS RL
• Supervised learning learns a mapping from inputs to outputs from a labelled dataset, and each prediction receives
direct feedback in the form of the correct answer.
• Reinforcement learning has no labelled data; the agent learns from reward signals obtained by interacting with the
environment, and its decisions are sequential, with each action affecting future states.

MARKOV MODEL
A Markov decision process (MDP) is a mathematical framework for modelling decision-making problems in which the
outcome of each decision depends on the current state of the world and the action taken. MDPs are used in reinforcement
learning (RL), a type of machine learning that allows software agents to learn how to behave in an environment by trial
and error.
An MDP mainly consists of:

• A set of possible world states S.
• A set of models, i.e. the transition model T(s, a, s') giving the probability of moving to state s' when taking
action a in state s.
• A set of possible actions A.
• A real-valued reward function R(s, a).
• A policy π, which is the solution of the Markov decision process.
The goal of the agent in an MDP is to find a policy, which is a mapping from states to actions, that maximizes the
expected sum of rewards over time.
Reinforcement learning algorithms use trial and error to learn the optimal policy for an MDP. They do this by interacting
with the environment and receiving rewards and penalties. Over time, the algorithm learns to associate certain actions
with certain states and rewards, and it can then choose the actions that are most likely to lead to a good outcome.
Dynamic Programming for MDP
1. Dynamic programming (DP) is a mathematical optimization method for solving sequential decision
problems. It is based on the principle of optimality, which states that an optimal policy has the property
that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first decision.
2. MDPs, or Markov decision processes, are a type of sequential decision problem where the agent's actions
affect the state of the environment and the rewards that it receives. DP can be used to solve MDPs by
recursively solving the problem for smaller and smaller subproblems.
3. The basic idea of DP for MDPs is to construct a value function for each state. The value function for a
state represents the expected reward that the agent can expect to receive if it starts in that state and follows
the optimal policy.
To construct the value function, DP uses the following recursive (Bellman) equation:

V(s) = max_a [ R(s, a) + γ · E[V(s')] ]

where:
• V(s) is the value function for state s
• a is an action
• R(s, a) is the immediate reward for taking action a in state s
• γ is the discount factor (0 ≤ γ ≤ 1) that weighs future rewards against immediate ones
• E[V(s')] is the expected value of the value function for the next state s', given by the sum over s' of the
transition probability from s to s' under action a, times the value function for s'.
This equation states that the value of a state equals the maximum, over actions, of the immediate reward plus the
discounted expected value of the next state when the optimal policy is followed thereafter.
To solve the MDP using DP, we start by initializing the value function for each state to zero. Then, we
repeatedly iterate over the recursive equation above, updating the value function for each state until it
converges. Once the value function has converged, we can then determine the optimal policy for each state by
taking the action that maximizes the expected reward.
DP is a powerful tool for solving MDPs, but it can be computationally expensive for large state spaces; a number of
techniques exist to improve the efficiency of DP algorithms. A compact sketch of the procedure is given below.
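
The sketch below follows the procedure just described: initialize V to zero, repeatedly apply the Bellman update until it converges, then extract the greedy policy. The three-state MDP, its transition probabilities, and the discount factor are hypothetical values chosen only to keep the example small.

# Value-function DP (value iteration) on a tiny assumed MDP.
GAMMA, THETA = 0.9, 1e-6          # discount factor and convergence threshold (assumed)

STATES = ["A", "B", "C"]
ACTIONS = ["stay", "go"]

# P[(s, a)] = list of (probability, next_state, reward) triples (hypothetical dynamics).
P = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 0.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"):   [(0.8, "C", 1.0), (0.2, "B", 0.0)],
    ("C", "stay"): [(1.0, "C", 0.0)],
    ("C", "go"):   [(1.0, "C", 0.0)],
}

def q_value(V, s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

V = {s: 0.0 for s in STATES}       # 1. initialize the value function to zero
while True:                         # 2. sweep the Bellman update until convergence
    delta = 0.0
    for s in STATES:
        new_v = max(q_value(V, s, a) for a in ACTIONS)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < THETA:
        break

# 3. extract the greedy (optimal) policy from the converged value function.
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(V, policy)
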

Bellman’s Principle of Optimality


1. It is a fundamental aspect of dynamic programming, which states that the optimal solution to a dynamic
programming optimization problem can be found by combining the optimal solutions to its subproblems.
2. The principle is generally applicable to problems with a finite or countable state space, which keeps the
theoretical complexity manageable.
3. It cannot be applied directly to classic models such as inventory management or dynamic pricing models that have
a continuous state space; dynamic programming with a general state space remains challenging.
4. The principle states that an optimal policy has the property that, whatever the initial state and initial
decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting
from the first decision.
5. The dynamic programming method breaks down a multi-step decision problem into smaller (recursive) subproblems
using Bellman's principle of optimality.
6. The current state summarizes everything needed for future decisions (the Markov property), so future decisions do
not depend on how the state was reached. This allows us to separate the initial decision from the future decisions
and optimize the future decisions on their own.

Function Approximation
1. Function approximation in reinforcement learning is a technique that allows agents to learn in
environments with large state and action spaces by representing the value function or policy as a parameterized
function (for example a linear model or a neural network). This is in contrast to tabular reinforcement learning,
where the value function or policy is stored explicitly for each state or state-action pair.
2. Function approximation is important in reinforcement learning because it allows agents to learn in
environments with large state and action spaces. For example, if an agent is learning to play a game like
Atari Pong, there are billions of possible states that the agent could be in. It would be impractical to learn
a separate value or policy for each state. Instead, the agent can use function approximation to learn a
function that can estimate the value of any state.
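
As a concrete illustration, the sketch below approximates Q-values with a linear function of hand-crafted state-action features and updates the weights with a semi-gradient Q-learning step; the feature construction, learning rate, discount factor, and the example transition are all assumptions for illustration only.

import numpy as np

ALPHA, GAMMA = 0.01, 0.9            # step size and discount factor (assumed values)
ACTIONS = [0, 1]                    # two abstract actions

def features(state, action):
    """Hand-crafted feature vector for a (state, action) pair (illustrative choice)."""
    x = np.array([1.0, state, state ** 2])             # polynomial features of the state
    phi = np.zeros(2 * len(x))
    phi[action * len(x):(action + 1) * len(x)] = x      # one block of features per action
    return phi

w = np.zeros(6)                     # weight vector: Q(s, a) is approximated by w · features(s, a)

def q(state, action):
    return float(w @ features(state, action))

def semi_gradient_update(s, a, r, s_next):
    """One semi-gradient Q-learning step on the linear approximator."""
    global w
    target = r + GAMMA * max(q(s_next, b) for b in ACTIONS)
    td_error = target - q(s, a)
    w += ALPHA * td_error * features(s, a)

# Example update on a single transition (values are made up for illustration).
semi_gradient_update(s=0.5, a=1, r=1.0, s_next=0.7)
print(q(0.5, 1), q(0.31, 1))        # the approximator generalizes to states never updated directly
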
Advantages of function approximation in reinforcement learning:
• Scalability: Function approximation allows agents to learn in environments with large state and action
spaces.
• Efficiency: Function approximation can improve the efficiency of reinforcement learning algorithms by
reducing the number of training steps required.
• Generalization: Function approximation allows agents to generalize to new states and actions that they
have not seen before.
Here are some specific examples of how function approximation can be used to improve reinforcement
learning performance:
• In Atari games, function approximation has been used to train agents to achieve superhuman
performance. For example, the Deep Q-Network (DQN) algorithm uses a neural network function
approximator to learn the Q-function, which maps state-action pairs to expected rewards.
• In robotics, function approximation has been used to train agents to perform complex tasks, such as
walking over uneven terrain and assembling products. For example, the Model Predictive Control (MPC)
algorithm uses a linear function approximator to learn the dynamics of the robot and to generate optimal
control trajectories.
• In finance, function approximation has been used to train agents to make investment decisions and
manage risk. For example, the Reinforcement Learning Portfolio Optimization (RLPO) algorithm uses a
neural network function approximator to learn the value of different investment portfolios.
Least Square Method
1. The least squares method is used to derive the best-fitting linear equation between two variables, one of
which is independent and the other dependent on it. The value of the independent variable is represented as the
x-coordinate and that of the dependent variable as the y-coordinate in a 2D Cartesian coordinate system. Initially,
the known values are marked on a plot.
2. The plot obtained at this point is called a scatter plot. Then, we try to represent all the marked points by a
straight line, i.e. a linear equation. The equation of such a line is obtained with the least squares method, which
chooses the line minimizing the sum of squared vertical distances between the points and the line. This is done to
estimate the value of the dependent variable for an independent variable whose value was initially unknown, which
helps us fill in missing points in a data table or forecast the data.
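
For a simple linear fit y = m·x + c, the least-squares estimates are m = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and c = ȳ - m·x̄. The short Python sketch below applies these formulas; the sample data points are made up for illustration.

# Simple least-squares fit of y = m*x + c (sample data is made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 6.2, 7.9, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
m = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum((x - x_mean) ** 2 for x in xs)
c = y_mean - m * x_mean             # intercept: the fitted line passes through the mean point

predict = lambda x: m * x + c       # estimate/forecast unknown dependent values
print(m, c, predict(6.0))
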
Applications of Reinforcement Learning

1. Natural Language Processing (NLP):


• RL can be used to train agents to perform tasks such as text classification, machine
translation, and dialogue generation.
• In text classification, an agent can learn to classify text into different categories by
maximizing a reward signal based on how well the classification matches the desired
output.
• In machine translation, an agent can learn to translate text from one language to another
by maximizing a reward signal based on how well the translation matches the desired
output.
• In dialogue generation, an agent can learn to generate responses by maximizing a reward
signal based on how well the generated response matches the desired response.

2. Robotics:
• RL can be used to train robots to perform tasks such as grasping objects, walking, and
playing games.
• In grasping objects, an agent can learn to grasp objects by maximizing a reward signal
based on how well it grasps the object.
• In walking, an agent can learn to walk by maximizing a reward signal based on how far
it walks without falling.
• In playing games, an agent can learn to play games such as chess or Go by maximizing
a reward signal based on how well it plays the game.

3. Recommendation Systems:
• RL can be used to train agents to make recommendations based on user feedback.
• In news recommendation systems, an agent can learn to recommend news articles by
maximizing a reward signal based on how well the recommended articles match the
user's interests.
• In E-commerce recommendation systems, an agent can learn to recommend products
by maximizing a reward signal based on how well the recommended products match
the user's preferences.
• In social media recommendation systems, an agent can learn to recommend posts or
accounts by maximizing a reward signal based on how well the recommended content
matches the user’s interests.
Value Iteration:

• Value iteration is a dynamic programming algorithm used to find the optimal value function of a Markov
decision process (MDP).
• The algorithm starts with an initial estimate of the value function and iteratively updates it until
convergence.
• At each iteration, the algorithm computes the maximum expected reward that can be obtained from each
state by considering all possible actions.
• The algorithm is guaranteed to converge to the optimal value function (to within any desired accuracy) after a
finite number of iterations.
• Value iteration is relatively simple and efficient per iteration, although very large state spaces can still make
it computationally demanding.

Policy Iteration:

• Policy iteration is a dynamic programming algorithm used to find the optimal policy of an MDP.
• The algorithm starts with an initial policy and iteratively improves it until convergence.
• At each iteration, the algorithm evaluates the current policy and computes a new policy that is guaranteed
to be better than or equal to the current policy.
• The algorithm is guaranteed to converge to the optimal policy after a finite number of iterations.
• Policy iteration can be computationally expensive per iteration, especially for large MDPs, but it often converges
in fewer iterations than value iteration.
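
A compact sketch of policy iteration (iterative policy evaluation followed by greedy policy improvement) is given below; the two-state MDP, its dynamics, and the discount factor are hypothetical values chosen only to keep the example readable.

# Policy iteration on a tiny assumed MDP: evaluate the current policy, then improve it greedily.
GAMMA, THETA = 0.9, 1e-6

STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]
# P[(s, a)] = list of (probability, next_state, reward) triples (hypothetical dynamics).
P = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(0.9, "s1", 1.0), (0.1, "s0", 0.0)],
    ("s1", "a0"): [(1.0, "s1", 2.0)],
    ("s1", "a1"): [(1.0, "s0", 0.0)],
}

def q_value(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[(s, a)])

policy = {s: "a0" for s in STATES}             # start from an arbitrary initial policy
V = {s: 0.0 for s in STATES}

while True:
    # Policy evaluation: compute V for the current policy until it stabilizes.
    while True:
        delta = 0.0
        for s in STATES:
            new_v = q_value(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < THETA:
            break
    # Policy improvement: act greedily with respect to the evaluated value function.
    stable = True
    for s in STATES:
        best = max(ACTIONS, key=lambda a: q_value(V, s, a))
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:                                  # no change means the policy is optimal
        break

print(policy, V)
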
