RL Unit 3,4,5
we usually assume that the environment is a finite MDP. That is, we assume
that its state and action sets, S and A(s) for s ∈ S, are finite, and that its
dynamics are given by a set of transition probabilities,
p(s' | s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a}, and expected immediate
rewards, r(s, a, s') = E[R_{t+1} | S_t = s, A_t = a, S_{t+1} = s'], for all s ∈ S,
a ∈ A(s), and s' ∈ S+ (S+ is S plus a terminal state if the problem is episodic). Although
DP ideas can be applied to problems with continuous state and action spaces,
exact solutions are possible only in special cases. A common way of obtaining
approximate solutions for tasks with continuous states and actions is to
quantize the state and action spaces and then apply finite-state DP methods.
The methods we explore for continuous problems are a significant
extension of that approach.
The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organize and structure the search for good policies. Here we show how
DP can be used to compute the value functions. We can easily obtain optimal
policies once we have found the optimal value functions, v* or q*, which satisfy
the Bellman optimality equations:

v*(s) = max_a Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ v*(s') ]
q*(s, a) = Σ_{s'} p(s' | s, a) [ r(s, a, s') + γ max_{a'} q*(s', a') ]

for all s ∈ S, a ∈ A(s), and s' ∈ S+. As we shall see, DP algorithms are obtained
by turning Bellman equations such as these into assignments, that is, into
update rules for improving approximations of the desired value functions.
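One such update rule is value iteration, which turns the Bellman optimality equation for v* directly into an assignment. A minimal sketch (the 2-state MDP and γ = 0.9 below are made-up illustrations, not from the source):

```python
# Value iteration: turn the Bellman optimality equation into an update rule.
# The MDP below (2 states, 2 actions) is a made-up illustration.

# P[s][a] = list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

V = {s: 0.0 for s in P}               # arbitrary initial value function
for _ in range(1000):                  # sweep until (approximately) converged
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) <- max_a sum p * (r + gamma * V(s'))
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-10:
        break
```

Each sweep replaces the old estimate with the right-hand side of the Bellman optimality equation; the estimates converge geometrically to v*.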
The principle of optimality states that an optimal policy has the property that,
whatever the initial state and initial decisions are, the remaining decisions
must constitute an optimal policy with regard to the state resulting from the
first decision. The principle is generally applicable to problems with finite
or countable state spaces, which keeps the theoretical complexity manageable.
Example-
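As a concrete illustration of the principle (a hypothetical shortest-path instance, not from the source): on a shortest path from A to C that passes through B, the B-to-C portion must itself be a shortest path from B to C. This is what lets dynamic programming build optimal routes from optimal sub-routes:

```python
# Shortest paths by dynamic programming (Bellman's principle of optimality):
# an optimal route's tail is itself optimal, so dist[v] can be built from
# the already-optimal dist[] values of v's predecessors.
# The graph below is a made-up illustration.

graph = {            # edge list: node -> [(neighbor, cost), ...]
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2)],
    "C": [],
}

INF = float("inf")
dist = {v: INF for v in graph}
dist["A"] = 0

# Bellman-Ford style relaxation: repeat |V| - 1 times.
for _ in range(len(graph) - 1):
    for u in graph:
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w   # optimal tail extends optimal prefix
```

Here the direct edge A→C (cost 4) is beaten by the route A→B→C (cost 3), whose tail B→C is itself the shortest B-to-C path.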
Once a policy, π, has been improved using v_π to yield a better policy, π', we can
then compute v_π' and improve it again to yield an even better π''. We can thus
obtain a sequence of monotonically improving policies and value functions:

π0 --E--> vπ0 --I--> π1 --E--> vπ1 --I--> π2 --E--> ... --I--> π* --E--> v*

where E denotes policy evaluation and I denotes policy improvement.
Policy iteration starts with a random policy, whereas value iteration starts
with a random value function.
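The alternation of evaluation and improvement can be sketched as follows (a minimal sketch on a made-up 2-state MDP; the transition table and γ = 0.9 are illustrative assumptions, not from the source):

```python
# Policy iteration: alternate policy evaluation and policy improvement,
# starting from an arbitrary policy (here: always take action 0).
# The 2-state MDP below is a made-up illustration.

# P[s][a] = list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
policy = {s: 0 for s in P}           # arbitrary starting policy

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: V(s) <- sum p * (r + gamma * V(s'))."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        for s in P:
            V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
    return V

stable = False
while not stable:
    V = evaluate(policy)             # E: policy evaluation
    stable = True
    for s in P:                      # I: greedy policy improvement
        best_a = max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
        if best_a != policy[s]:
            policy[s] = best_a
            stable = False
```

The loop stops when improvement no longer changes the policy, at which point the policy is greedy with respect to its own value function, i.e. optimal.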
In statistics, we often have data in the form of points that can be
represented on a Cartesian plane by taking one of the variables as the
independent variable, represented as the x-coordinate (X), and the other
as the dependent variable, represented as the y-coordinate (Y).
The data points are plotted in a scatter plot.
The method uses averages of the data points and the formulae discussed
below to find the slope and intercept of the line of best fit. This line can
then be used to make further interpretations about the data and to predict
unknown values.
The formula used in the least squares method and the steps used in deriving
the line of best fit from this method are discussed as follows:
Step 1: Denote the independent variable values as xi and the dependent
ones as yi.
Step 2: Calculate the average values of xi and yi as X and Y.
Step 3: Presume the equation of the line of best fit as y = mx + c, where m
is the slope of the line and c represents the intercept of the line on the Y-
axis.
Step 4: The slope m can be calculated from the following formula:
m = Σ(X − xi)(Y − yi) / Σ(X − xi)²
Step 5: The intercept is then obtained as c = Y − mX.
Problem 1: Find the line of best fit for the following data points using the
least squares method: (x,y) = (1,3), (2,4), (4,8), (6,10), (8,15).
Solution:
Here, X = (1 + 2 + 4 + 6 + 8)/5 = 4.2 and Y = (3 + 4 + 8 + 10 + 15)/5 = 8.

xi    yi    X − xi    Y − yi    (X − xi)(Y − yi)    (X − xi)²
1     3     3.2       5         16                  10.24
2     4     2.2       4         8.8                 4.84
4     8     0.2       0         0                   0.04
6     10    −1.8      −2        3.6                 3.24
8     15    −3.8      −7        26.6                14.44
Sum (Σ)     0         0         55                  32.8
The slope of the line of best fit can be calculated from the formula as follows:
m = Σ(X − xi)(Y − yi) / Σ(X − xi)²
m = 55/32.8 = 1.68 (rounded to 2 decimal places)
Now, the intercept will be calculated from the formula as follows:
c = Y – mX
c = 8 – 1.68*4.2 = 0.94
Thus, the equation of the line of best fit becomes, y = 1.68x + 0.94.
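The steps above can be checked with a short script (a minimal sketch using only the formulas from the text, with the slope rounded to 2 decimal places before computing the intercept, as in the worked solution):

```python
# Least squares line of best fit, following the steps in the text:
# m = sum((X - xi)*(Y - yi)) / sum((X - xi)**2),  c = Y - m*X
xs = [1, 2, 4, 6, 8]
ys = [3, 4, 8, 10, 15]

X = sum(xs) / len(xs)                 # mean of x values: 4.2
Y = sum(ys) / len(ys)                 # mean of y values: 8.0

num = sum((X - x) * (Y - y) for x, y in zip(xs, ys))   # 55
den = sum((X - x) ** 2 for x in xs)                    # 32.8
m = round(num / den, 2)               # slope, rounded as in the text
c = round(Y - m * X, 2)               # intercept
```

With this data the script reproduces the values derived above: m = 1.68 and c = 0.94.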
Problem 2: Find the line of best fit for the following data of heights and
weights of students of a school using the least squares method:
Height (in centimeters): [160, 162, 164, 166, 168]
Weight (in kilograms): [52, 55, 57, 60, 61]
Solution:
Here, we denote Height as x (independent variable) and Weight as y
(dependent variable). Now, we calculate the means of x and y values denoted
by X and Y respectively.
X = (160 + 162 + 164 + 166 + 168 ) / 5 = 164
Y = (52 + 55 + 57 + 60 + 61) / 5 = 57
xi     yi    X − xi    Y − yi    (X − xi)(Y − yi)    (X − xi)²
160    52    4         5         20                  16
162    55    2         2         4                   4
164    57    0         0         0                   0
166    60    −2        −3        6                   4
168    61    −4        −4        16                  16
Sum (Σ)      0         0         46                  40
Now, the slope of the line of best fit can be calculated from the formula as
follows:
m = Σ(X − xi)(Y − yi) / Σ(X − xi)²
m = 46/40 = 1.15
Now, the intercept will be calculated from the formula as follows:
c = Y – mX
c = 57 − 1.15 × 164 = −131.6
Thus, the equation of the line of best fit becomes, y = 1.15x − 131.6.
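The same formulas applied to this data give a quick check of the computation (note that c = Y − mX must use the mean weight Y = 57 and the mean height X = 164):

```python
# Verify Problem 2 with the least squares formulas from the text.
xs = [160, 162, 164, 166, 168]   # heights
ys = [52, 55, 57, 60, 61]        # weights

X = sum(xs) / len(xs)            # mean height: 164.0
Y = sum(ys) / len(ys)            # mean weight: 57.0

m = sum((X - x) * (Y - y) for x, y in zip(xs, ys)) \
    / sum((X - x) ** 2 for x in xs)   # 46/40 = 1.15
c = Y - m * X                         # 57 - 1.15*164 = -131.6
```

Plugging x = 164 back in gives y = 1.15 × 164 − 131.6 = 57, the mean weight, as expected since the fitted line always passes through (X, Y).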
Unit V
Case study
Prepare real-time applications of reinforcement learning, such as:
1. NLP
2. Recommendation systems, etc.