RL Unit 3,4,5


UNIT-III

Overview of dynamic programming for MDP

The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the
environment as a Markov decision process (MDP). Classical DP algorithms are
of limited utility in reinforcement learning both because of their assumption of
a perfect model and because of their great computational expense, but they
are still important theoretically. DP provides an essential foundation for the
understanding of the methods presented in the rest of this book. In fact, all of
these methods can be viewed as attempts to achieve much the same effect as
DP, only with less computation and without assuming a perfect model of the
environment.

We usually assume that the environment is a finite MDP. That is, we assume that its state and action sets, S and A(s), for s ∈ S, are finite, and that its dynamics are given by a set of transition probabilities, p(s' | s, a) = Pr{ s_{t+1} = s' | s_t = s, a_t = a }, and expected immediate rewards, r(s, a, s') = E[ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ], for all s ∈ S, a ∈ A(s), and s' ∈ S⁺ (S⁺ is S plus a terminal state if the problem is episodic). Although DP ideas can be applied to problems with continuous state and action spaces, exact solutions are possible only in special cases. A common way of obtaining approximate solutions for tasks with continuous states and actions is to quantize the state and action spaces and then apply finite-state DP methods. The methods explored for continuous problems are a significant extension of that approach.

The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies. We show how DP can be used to compute these value functions. We can easily obtain optimal policies once we have found the optimal value functions, V* or Q*, which satisfy the Bellman optimality equations:

V*(s) = max_a Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ V*(s') ]
Q*(s, a) = Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ max_{a'} Q*(s', a') ]

for all s ∈ S, a ∈ A(s), and s' ∈ S⁺. As we shall see, DP algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.
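For concreteness, here is a minimal value-iteration sketch in Python that turns the Bellman optimality equation for V* into an update rule. The two-state MDP (its transition probabilities and rewards) and the discount factor gamma = 0.9 are assumed purely for illustration; they are not part of the notes.

# Minimal value-iteration sketch: apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma * V(s') ]
# repeatedly until the value function stops changing, then read off a greedy policy.

gamma = 0.9  # assumed discount factor

# P[s][a] is a list of (probability, next_state, reward) triples (made-up MDP)
P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s1", 2.0)]},
}

V = {s: 0.0 for s in P}  # arbitrary initial value function

for sweep in range(1000):
    delta = 0.0
    for s in P:
        backup = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                     for a in P[s])
        delta = max(delta, abs(backup - V[s]))
        V[s] = backup
    if delta < 1e-8:  # converged
        break

# Greedy policy with respect to the (near-)optimal value function
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                         for p, s2, r in P[s][a]))
          for s in P}
print(V)
print(policy)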

Bellman optimality principle-

It is a fundamental aspect of dynamic programming, which states that the optimal solution to a dynamic programming optimization problem can be found by combining the optimal solutions to its subproblems.

The principle is generally applicable for problems with finite or countable state spaces, in order to minimize the theoretical complexity.

It cannot be applied directly to classic models such as inventory management or dynamic pricing, which have continuous state spaces; this is the challenge involved in dynamic programming with general state spaces.

The principle states that an optimal policy has the property that, whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

The dynamic programming method breaks down a multi-step decision problem into smaller (recursive) subproblems using Bellman's principle of optimality. The state is independent of the decisions taken at previous states. This allows us to separate the initial decision from the future decisions and to optimize the future decisions on their own.

Vπ(s) = r(s, a) + γ Σ_{s'} p(s' | s, a) Vπ(s'),   where a = π(s)

Example-
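Shortest-path problems illustrate the principle: whatever the first move is, the rest of an optimal route must itself be an optimal route from the node it reaches. A minimal sketch in Python (the graph and edge costs below are made up for illustration):

# Principle of optimality on a tiny, made-up directed acyclic graph:
# cost_to_go(node) = min over outgoing edges of edge_cost + cost_to_go(next_node),
# i.e. the optimal remainder of the route is reused as the solution of a subproblem.
from functools import lru_cache

edges = {                 # node -> list of (next_node, edge_cost)
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],              # terminal node
}

@lru_cache(maxsize=None)  # memoize the subproblem solutions
def cost_to_go(node):
    if node == "D":
        return 0
    return min(cost + cost_to_go(nxt) for nxt, cost in edges[node])

print(cost_to_go("A"))    # 4, via A -> B -> C -> D (costs 1 + 2 + 1)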

Iterative policy evaluation-

Iterative policy evaluation computes the value function Vπ of a fixed policy π by repeatedly applying the Bellman equation for Vπ as an update rule until the values stop changing. Once a policy π has been improved using Vπ to yield a better policy π', we can then compute Vπ' and improve it again to yield an even better π''. We can thus obtain a sequence of monotonically improving policies and value functions:

π0 --E--> Vπ0 --I--> π1 --E--> Vπ1 --I--> π2 --E--> ... --I--> π* --E--> V*

where E denotes a policy evaluation and I denotes a policy improvement.


Each policy is guaranteed to be a strict improvement over the previous one
(unless it is already optimal). Because a finite MDP has only a finite number of
policies, this process must converge to an optimal policy and optimal value
function in a finite number of iterations.

This way of finding an optimal policy is called policy iteration. A complete algorithm is given below. Note that each policy evaluation, itself an iterative computation, is started with the value function for the previous policy. This typically results in a great increase in the speed of convergence of policy evaluation (presumably because the value function changes little from one policy to the next).
EXAMPLE: -
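A minimal policy-iteration sketch in Python, alternating iterative policy evaluation with greedy policy improvement; the two-state MDP and gamma = 0.9 below are assumed purely for illustration:

# Minimal policy-iteration sketch: evaluate the current policy, then improve it
# greedily, and stop when the policy no longer changes (it is then optimal).

gamma = 0.9
# P[s][a] = list of (probability, next_state, reward) triples (made-up MDP)
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "s1", 1.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)], "go": [(1.0, "s0", 0.0)]},
}

def q(s, a, V):
    # Expected one-step return of taking action a in state s, backing up V
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

policy = {s: next(iter(P[s])) for s in P}   # arbitrary starting policy
V = {s: 0.0 for s in P}

while True:
    # 1. Iterative policy evaluation, started from the previous value function
    while True:
        delta = 0.0
        for s in P:
            v_new = q(s, policy[s], V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < 1e-8:
            break
    # 2. Greedy policy improvement with respect to V_pi
    improved = {s: max(P[s], key=lambda a: q(s, a, V)) for s in P}
    if improved == policy:                   # stable policy: stop
        break
    policy = improved

print(policy)
print(V)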

POLICY ITERATION                                   VALUE ITERATION

1. It starts with a random policy.                 1. It starts with a random value function.
2. It is a more complex algorithm.                 2. It is a simpler algorithm.
3. It is computationally cheaper.                  3. It is computationally more expensive.
4. It requires fewer iterations to converge.       4. It requires more iterations to converge.
5. It is a faster algorithm.                       5. It is a slower algorithm.


UNIT IV-
Function Approximation-

Least Square Method-

In statistics, when we have data in the form of data points that can be represented on a Cartesian plane, we take one of the variables as the independent variable, represented as the x-coordinate (X), and the other as the dependent variable, represented as the y-coordinate (Y). The data points are plotted in a scatter plot.

This data might not be useful on its own for making interpretations or for predicting the value of the dependent variable at values of the independent variable where it is initially unknown. So, we try to obtain the equation of a line that best fits the given data points with the help of the Least Square Method.

The method uses averages of the data points and some formulae, discussed below, to find the slope and intercept of the line of best fit. This line can then be used to make further interpretations about the data and to predict the unknown values.

The Least Squares Method is used to derive a generalized linear equation between two variables, one of which is independent and the other dependent on the former. The value of the independent variable is represented as the x-coordinate and that of the dependent variable is represented as the y-coordinate in a 2D Cartesian coordinate system. Initially,
known values are marked on a plot. The plot obtained at this point is called a
scatter plot. Then, we try to represent all the marked points as a straight line
or a linear equation. The equation of such a line is obtained with the help of
the least squares method. This is done to get the value of the dependent
variable for an independent variable for which the value was initially
unknown. This helps us to fill in the missing points in a data table or forecast
the data.
Least Square Method Definition

“The least-squares method can be defined as a statistical method that is used to find the equation of the line of best fit related to the given data. This
method is called so as it aims at reducing the sum of squares of deviations as
much as possible. The line obtained from such a method is called a regression
line.”

Formula of Least Square Method

The formula used in the least squares method and the steps used in deriving
the line of best fit from this method are discussed as follows:
• Step 1: Denote the independent variable values as xi and the dependent ones as yi.
• Step 2: Calculate the average values of xi and yi as X and Y.
• Step 3: Presume the equation of the line of best fit as y = mx + c, where m is the slope of the line and c represents the intercept of the line on the Y-axis.
• Step 4: The slope m can be calculated from the following formula:

m = [Σ (X – xi)×(Y – yi)] / [Σ (X – xi)²]

• Step 5: The intercept c is calculated from the following formula:
c = Y – mX
Thus, we obtain the line of best fit as y = mx + c, where values of m and c can
be calculated from the formulae defined above.
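These steps translate directly into code. A minimal sketch in Python (the function name least_squares_fit is chosen here only for illustration):

# Least-squares line fit following the steps above: compute the means X and Y,
# then the slope m and intercept c of the best-fit line y = m*x + c.

def least_squares_fit(xs, ys):
    n = len(xs)
    X = sum(xs) / n                      # mean of the independent variable
    Y = sum(ys) / n                      # mean of the dependent variable
    num = sum((X - x) * (Y - y) for x, y in zip(xs, ys))   # Σ (X - xi)(Y - yi)
    den = sum((X - x) ** 2 for x in xs)                    # Σ (X - xi)²
    m = num / den
    c = Y - m * X
    return m, c

# Using the data of Problem 1 below:
m, c = least_squares_fit([1, 2, 4, 6, 8], [3, 4, 8, 10, 15])
print(round(m, 2), round(c, 2))  # ~1.68 and ~0.96 (Problem 1 gets 0.94 by rounding m first)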

Least Square Method Solved Examples-

Problem 1: Find the line of best fit for the following data points using the
least squares method: (x,y) = (1,3), (2,4), (4,8), (6,10), (8,15).

Solution:

Here, we have x as the independent variable and y as the dependent variable.


First, we calculate the means of x and y values denoted by X and Y
respectively.
X = (1+2+4+6+8)/5 = 4.2
Y = (3+4+8+10+15)/5 = 8
xi       yi       X – xi      Y – yi      (X – xi)(Y – yi)      (X – xi)²
1        3        3.2         5           16                    10.24
2        4        2.2         4           8.8                   4.84
4        8        0.2         0           0                     0.04
6        10       -1.8        -2          3.6                   3.24
8        15       -3.8        -7          26.6                  14.44
Sum (Σ)           0           0           55                    32.8

The slope of the line of best fit can be calculated from the formula as follows:
m = [Σ (X – xi)×(Y – yi)] / [Σ (X – xi)²]
m = 55/32.8 ≈ 1.68 (rounded to 2 decimal places)
Now, the intercept will be calculated from the formula as follows:
c = Y – mX
c = 8 – 1.68*4.2 = 0.94
Thus, the equation of the line of best fit becomes, y = 1.68x + 0.94.
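The result can be cross-checked with numpy (assuming numpy is installed); np.polyfit with degree 1 returns the slope and intercept of the least-squares line:

import numpy as np

x = np.array([1, 2, 4, 6, 8])
y = np.array([3, 4, 8, 10, 15])
m, c = np.polyfit(x, y, 1)        # degree-1 least-squares fit: [slope, intercept]
print(round(m, 2), round(c, 2))   # ~1.68 and ~0.96 (0.94 above comes from using the rounded slope)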

Problem 2: Find the line of best fit for the following data of heights and
weights of students of a school using the least squares method:
• Height (in centimeters): [160, 162, 164, 166, 168]
• Weight (in kilograms): [52, 55, 57, 60, 61]

Solution:
Here, we denote Height as x (independent variable) and Weight as y
(dependent variable). Now, we calculate the means of x and y values denoted
by X and Y respectively.
X = (160 + 162 + 164 + 166 + 168 ) / 5 = 164
Y = (52 + 55 + 57 + 60 + 61) / 5 = 57

xi       yi       X – xi      Y – yi      (X – xi)(Y – yi)      (X – xi)²
160      52       4           5           20                    16
162      55       2           2           4                     4
164      57       0           0           0                     0
166      60       -2          -3          6                     4
168      61       -4          -4          16                    16
Sum (Σ)           0           0           46                    40

Now, the slope of the line of best fit can be calculated from the formula as
follows:
m = [Σ (X – xi)×(Y – yi)] / [Σ (X – xi)²]
m = 46/40 = 1.15
Now, the intercept will be calculated from the formula as follows:
c = Y – mX
c = 57 – 1.15×164 = –131.6
Thus, the equation of the line of best fit becomes y = 1.15x – 131.6. (As a check, at the mean height x = 164 this gives y = 1.15×164 – 131.6 = 57, which is the mean weight, so the line passes through (X, Y) as it should.)
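The same cross-check for Problem 2 (again assuming numpy is available):

import numpy as np

height = np.array([160, 162, 164, 166, 168])
weight = np.array([52, 55, 57, 60, 61])
m, c = np.polyfit(height, weight, 1)
print(round(m, 2), round(c, 1))   # 1.15 and -131.6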
Unit V
Case study
Prepare real-time applications of reinforcement learning, for example:
1. Natural language processing (NLP)
2. Recommendation systems, etc.
