MODULE – 5
EVALUATING HYPOTHESIS, INSTANCE-BASED
LEARNING, REINFORCEMENT LEARNING
PART 1- EVALUATING HYPOTHESIS
MOTIVATION
It is important to evaluate the performance of learned hypotheses for two reasons:
1. To understand whether to use the hypothesis.
• For instance, when learning from a limited-size database indicating the
effectiveness of different medical treatments, it is important to understand as
precisely as possible the accuracy of the learned hypotheses.
2. Evaluating hypotheses is an integral component of many learning methods.
• For example, in post-pruning decision trees to avoid overfitting, we must
evaluate the resultant trees.
When data is plentiful, estimating accuracy is straightforward.
Difficulties arise when only a limited set of data is available. They are:
1. Bias in the estimate.
The observed accuracy of the learned hypothesis over the training examples
is often a poor estimator of its accuracy over future examples.
It is a biased estimate of hypothesis accuracy over future examples.
To obtain an unbiased estimate of future accuracy, we typically test the
hypothesis on some set of test examples chosen independently of the training
examples and the hypothesis.
2. Variance in the estimate.
Even if the hypothesis accuracy is measured over an unbiased set of test
examples, independent of the training examples, the measured accuracy can
still vary from the true accuracy, depending on the makeup of the particular
set of test examples.
The smaller the set of test examples, the greater the expected variance.
ESTIMATING HYPOTHESIS ACCURACY
Setting:
There is some space of possible instances X (e.g., the set of all people) over which various
target functions may be defined (e.g., people who plan to purchase new skis this year).
Definition: The sample error (denoted errors(h)) of hypothesis h with respect to target
function f and data sample S is
errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x), h(x))
Where,
n is the number of examples in S
δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
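As a concrete illustration, here is a minimal sketch of computing the sample error from paired lists of true labels f(x) and predictions h(x); the function and variable names are illustrative, not part of the original text.

```python
def sample_error(f_values, h_values):
    """Fraction of the sample S that hypothesis h misclassifies: errorS(h)."""
    n = len(f_values)
    disagreements = sum(1 for f_x, h_x in zip(f_values, h_values) if f_x != h_x)
    return disagreements / n

print(sample_error([1, 0, 1, 1], [1, 1, 1, 0]))   # 2 of 4 examples disagree -> 0.5
```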
True Error
The true error of a hypothesis is the probability that it will misclassify a single randomly
drawn instance from the distribution D.
Definition: The true error (denoted errorD(h) )of hypothesis h with respect to target
function f and distribution D ,is the probability that h will misclassify an instance drawn
at random according to D.
errorD(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]
Where, Pr_{x∈D} denotes that the probability is taken over the instance distribution D.
Requirement
We need to know errorD(h), but all we have in hand is errorS(h).
Question: How good an estimate of errorD(h) is provided by errorS(h)?
Confidence Intervals for Discrete-Valued Hypothesis
Answers the question How good an estimate of errorD(h) is provided by errors(h)?
Here, h discrete-valued function.
To estimate the true error for some discrete-valued hypothesis h, based on its observed
sample error over a sample S
Given,
Sample S contains n examples drawn independently of one another, and independently of h, according to the
probability distribution D.
n≥30
h commits r errors over these n examples (i.e., errors(h)=r/n)
The statistical theory allows us to make the following assertions:
1. Given no other information, the most probable value of errorD(h) is errors(h).
2. With approximately 95% probability, the true error errorD(h) lies in the interval
errorS(h) ± 1.96 √( errorS(h) (1 − errorS(h)) / n )   --- (i)
Example: Suppose h commits r = 12 errors over a sample S of n = 40 examples, so that errorS(h) = 12/40 = 0.30. Then
0.30 ± 1.96 √( 0.30 × (1 − 0.30) / 40 )
= 0.30 ± 1.96 √( 0.21 / 40 )
= 0.30 ± 1.96 √0.00525
= 0.30 ± 1.96 × 0.07
Confidence Interval: 0.30 ± 0.14
The expression in Eq. (i) for the 95% confidence interval can be generalized to any desired
confidence level. Let zN be the constant used to calculate N% confidence intervals for errorD(h).
Then Eq. (i) can be rewritten as
errorS(h) ± zN √( errorS(h) (1 − errorS(h)) / n )   --- (5.1)
Therefore, for the above example, if 68% is the desired confidence level (z68 = 1.00), we get the
confidence interval 0.30 ± 1.00 × 0.07.
Eq. (5.1.) describes how to calculate the confidence intervals, or error bars, for
estimates of errorD(h) that are based on errors(h).
It provides only an approximate confidence interval, though the approximation is quite
good when the sample contains at least 30 examples and errorS(h) is not too close to 0
or 1.
Rule of thumb The above approximation works well when
n · errorS(h) · (1 − errorS(h)) ≥ 5
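A minimal sketch of this calculation, using a small table of two-sided zN values from the standard Normal distribution; the function name and structure are illustrative.

```python
import math

# Approximate two-sided z values for common confidence levels (standard Normal table).
Z_N = {0.68: 1.00, 0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def confidence_interval(r, n, level=0.95):
    """N% confidence interval for errorD(h), given r errors observed on n test examples."""
    error_s = r / n
    if n * error_s * (1 - error_s) < 5:
        print("Warning: Normal approximation may be poor (rule of thumb violated).")
    margin = Z_N[level] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - margin, error_s + margin

print(confidence_interval(12, 40))   # roughly (0.16, 0.44), i.e., 0.30 +/- 0.14
```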
BASICS OF SAMPLING THEORY
Summary
σx = √𝑛𝑝(1 − 𝑝)
For sufficiently large values of n the Binomial distribution is closely approximated
by a Normal distribution with the same mean and variance.
Note: Use the Normal approximation only when np(1-p) ≥5.
R ≡ Σ_{i=1}^{n} Yi
The probability that the random variable R will take on a specific value r (e.g.,
the probability of observing exactly r heads) is given by the Binomial distribution
(Figure 5.1.)
Pr(R = r) = [ n! / ( r! (n − r)! ) ] p^r (1 − p)^(n−r)   --- (5.2)
𝜎𝑌 ≡ √𝑛𝑝(1 − 𝑝) ---(5.7)
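For illustration, a small sketch of the Binomial probability in Eq. 5.2; the function name is illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of observing exactly r errors in n independent trials (Eq. 5.2)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# e.g., probability of exactly 12 misclassifications in 40 trials when p = 0.30
print(binomial_pmf(12, 40, 0.30))
```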
Estimators, Bias, and Variance
Question What is the likely difference between errors(h) and the true error errorD(h)?
Rewrite errors(h) and errorD(h) using terms in Eq.(5.2) defining the Binomial
distribution:
errorS(h) = r / n
errorD(h) = p
Where,
n number of instances in the sample S
r number of instances from S misclassified by h
p probability of misclassifying a single instance drawn from D.
Estimation Bias
errors(h) an estimator for true error errorD(h).
An estimator is any random variable used to estimate some parameter of the underlying
population from which the sample is drawn.
Estimation bias is the difference between the expected value of the estimator and the true
value of the parameter.
Definition: The estimation bias of an estimator Y for an arbitrary parameter p is
E[Y] − p
If the estimation bias is zero, Y is an unbiased estimator for p.
Question Is errors(h) an unbiased estimator for errorD(h)?
Yes.
For the Binomial distribution, the expected value of r is equal to np.
Given that n is a constant, the expected value of r/n is p.
Another property of an estimator is its variance. Given a choice among alternative
unbiased estimators, choose the one with least variance; it yields the smallest expected
squared error between the estimate and the true value of the parameter.
Example: for r = 12 and n = 40, an unbiased estimate for errorD(h) is given by errorS(h) = r/n = 0.30.
Confidence Interval
To describe the uncertainty associated with an estimate, give an interval within which
the true value is expected to fall, along with the probability with which it is expected to
fall into this interval. Such estimates are called Confidence Interval estimates.
Definition: An N% confidence interval for some parameter p is an interval that is
expected with probability N% to contain p.
Example: Given r = 12 and n = 40, there is approximately a 95% probability that the interval
0.30 ± 0.14 contains the true error errorD(h).
Normal Distribution
It is a bell-shaped distribution fully specified by its mean μ and standard deviation σ
(Figure 5.2).
Figure 5.3: A Normal distribution with mean 0 and standard deviation 1. With 80%
confidence, the value of the random variable will lie in the two-sided interval
[−1.28, 1.28]. Note z0.80 = 1.28. With 10% probability it will lie to the right of this
interval, and with 10% probability it will lie to the left.
Therefore,
If a random variable Y obeys a Normal distribution with mean μ and standard
deviation σ, then the measured random value y of Y will fall into the following
interval N% of the time
𝜇 ± 𝑍𝑁𝜎 --- (5.10)
Equivalently, the mean μ will fall into the following interval N% of the time
𝑦 ± 𝑍𝑁𝜎 --- (5.11)
Derivation of General Expression for N% confidence interval:
We know that,
- errorS(h) follows a Binomial distribution with mean value errorD(h) and
standard deviation as in Eq. 5.9.
- For sufficiently large sample size n, the Binomial distribution is well
approximated by a Normal distribution.
- Eq. 5.11 is used to find the N% confidence interval for estimating the mean
value of a Normal distribution.
Substituting the mean and standard deviation of errors(h) into Eq. 5.11 we get Eq.
5.1. for N% confidence intervals for discrete-valued hypotheses.
Eq. 5.1 gives a two-sided bound: it bounds the estimated quantity from above and from
below.
One-sided bound example:
Question of interest: What is the probability that errorD(h) is at most U?
This reflects a case where we only need to bound the maximum error of h and do not mind if the true error
is much smaller than estimated.
Such a one-sided error bound on errorD(h) is obtained by keeping only the upper bound U of a two-sided
interval; this halves the chance of error, so an N% two-sided interval yields a one-sided bound with
(100 − (100 − N)/2)% confidence (e.g., a 95% two-sided interval gives a 97.5% one-sided upper bound).
GENERAL APPROACH OF DERIVING CONFIDENCE
INTERVALS
The general process includes the following steps:
1. Identify the underlying population parameter p to be estimated, for example,
errorD(h).
2. Define the estimator Y. (E.g., errors(h)). It is desirable to choose a minimum-
variance, unbiased estimator.
3. Determine the probability distribution DY that governs the estimator Y, including
its mean and variance.
4. Determine the N% confidence interval by finding thresholds L and U such
that N% of the mass in the probability distribution DY falls between L and U.
5.4.1. Central Limit Theorem
Theorem 5.1: Consider a set of independent, identically distributed random variables
Y1…Yn governed by an arbitrary probability distribution with mean μ and finite variance
σ². Define the sample mean
Ȳn ≡ (1/n) Σ_{i=1}^{n} Yi
Then as n → ∞, the distribution governing (Ȳn − μ) / (σ / √n) approaches a Normal distribution
with mean 0 and standard deviation 1.
Difference in Error of Two Hypotheses
Suppose we wish to estimate the difference d between the true errors of two hypotheses h1 and h2,
where h1 has been tested on a sample S1 containing n1 examples and h2 on an independent sample S2
containing n2 examples:
d ≡ errorD(h1) − errorD(h2)
The obvious estimator is the difference between the sample errors, d̂ ≡ errorS1(h1) − errorS2(h2).
Next, derive confidence intervals that characterize the likely error in employing d̂ to
estimate d. For a random variable d̂ obeying a Normal distribution with mean d and
variance σ², the N% confidence interval estimate for d is d̂ ± zN σ.
Using the approximate variance σd̂², the approximate N% confidence interval estimate
for d is
d̂ ± zN √( errorS1(h1)(1 − errorS1(h1)) / n1 + errorS2(h2)(1 − errorS2(h2)) / n2 )   --- (5.13)
Eq. 5.13 is the general two-sided confidence interval for estimating the difference
between the errors of two hypotheses.
Acceptable to be used where h1 and h2 are tested on a single sample S (where S is
still independent of h1 and h2). Hence, 𝑑̂ can be redefined as
d̂ ≡ errorS(h1) − errorS(h2)
5.5.1. Hypothesis Testing
What is the probability that errorD(h1 ) > errorD(h2)?
Suppose the sample errors for h1 and h2 are measured using two independent samples S1 and
S2, each of size 100, and we find that
errorS1(h1) = 0.30 and errorS2(h2) = 0.20
Eq. 5.14 gives the expected value of the difference in errors between two learning methods LA
and LB:
E_{S⊂D} [ errorD(LA(S)) − errorD(LB(S)) ]   --- (5.14)
where the expectation is taken over training sets S drawn according to the underlying instance distribution D.
Consider a limited sample D0 for comparing algorithms.
Divide D0 into a training set S0 and a disjoint test set T0 .
o The training data can be used to train both LA and LB.
o The test set can be used to compare the accuracy of the two learned hypotheses.
Here we measure the quantity,
𝑒𝑟𝑟𝑜𝑟𝑇0 (𝐿𝐴(𝑆0)) − 𝑒𝑟𝑟𝑜𝑟𝑇0 (𝐿𝐵(𝑆0)) ---(5.15)
Difference between Eq. 5.14 and Eq. 5.15
- errorT0(h) is used to approximate errorD(h).
- The difference in errors is measured for a single training set S0 rather than taking the
expected value over all samples S that might be drawn from D.
To improve the estimator in Eq. 5.15
- Repeatedly partition the data D0 into disjoint training and test sets and
- Take the mean of the test set errors for these different experiments
Result Procedure below
Procedure: to estimate the difference in error between LA and LB.
1. Partition the available data D0 into k disjoint subsets T1 ,T2…Tk of equal size,
where this size is at least 30.
2. For i from 1 to k, do
   use Ti for the test set, and the remaining data for training set Si
      Si ← {D0 − Ti}
      hA ← LA(Si)
      hB ← LB(Si)
      δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄, where
   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi   --- (T5.1)
- Train and test the learning algorithms k times, using each of the k subsets in turn
as the test set, and use all remaining data as the training set.
- After testing for all k independent test sets, return the mean difference 𝛿̅ that
represents an estimate of the difference between the two learning algorithms.
δ̄ can be taken as an estimate of the desired quantity from Eq. 5.14. More precisely, δ̄ is an
estimate of the quantity
E_{S⊂D0} [ errorD(LA(S)) − errorD(LB(S)) ]   --- (5.16)
where S is a random sample of size ((k−1)/k) |D0| drawn uniformly from D0.
The approximate N% confidence interval for estimating the quantity in Eq. 5.16 using
𝛿̅ is given by
𝛿 ̅ ± 𝑡𝑁,𝑘−1 𝑆𝛿̅ --- (5.17)
Where,
tN,k−1 is a constant analogous to zN; the subscript k − 1 is the number of degrees of freedom.
Sδ̄ is an estimate of the standard deviation of the distribution governing δ̄. It is
defined as
Sδ̄ ≡ √( (1 / (k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² )   --- (5.18)
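A minimal sketch of this k-fold paired comparison (Eqs. T5.1, 5.17 and 5.18); the learner and error functions are assumed to be supplied by the caller, and the small t-table holds approximate standard 95% two-sided values.

```python
import math
import random

# Approximate two-sided t values for 95% confidence, indexed by degrees of freedom k-1.
T_95 = {2: 4.30, 4: 2.78, 9: 2.26, 19: 2.09, 29: 2.05}

def paired_k_fold_difference(data, train_A, train_B, error_fn, k=10, seed=0):
    """Estimate the difference in error between learners LA and LB with the
    k-fold procedure above.  train_A/train_B map a training set to a hypothesis;
    error_fn maps (hypothesis, test_set) to its sample error."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]                 # k disjoint subsets T1..Tk
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_A, h_B = train_A(train), train_B(train)
        deltas.append(error_fn(h_A, test) - error_fn(h_B, test))           # delta_i
    delta_bar = sum(deltas) / k                                            # Eq. T5.1
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))  # Eq. 5.18
    margin = T_95[k - 1] * s                                               # Eq. 5.17, 95% level
    return delta_bar, (delta_bar - margin, delta_bar + margin)
```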
Paired Tests
Tests where the hypotheses are evaluated over identical samples are called paired
tests. Paired tests typically produce tighter confidence intervals because any
differences in observed errors in a paired test are due to differences between the
hypotheses.
Note: when the hypotheses are tested on separate data samples, differences in the two
sample errors might be partially attributable to differences in the makeup of the two
samples.
Paired t Tests
To understand the justification for the confidence interval estimate given by Eq. 5.17,
consider,
- Given, the observed values of a set of independent, identically distributed random
variables Y1, Y2, ...,Yk.
- Mean μ of the probability distribution governing Yi.
- The estimator used is sample mean 𝑌̅
Ȳ ≡ (1/k) Σ_{i=1}^{k} Yi
The t test represented by Eq. 5.17 and Eq. 5.18 applies to the case in which the Yi follow
a Normal distribution, i.e., to the situation in which the task is to estimate
the sample mean of a collection of independent, identically and Normally distributed
random variables. Using Eq. 5.17 and Eq. 5.18, the N% confidence interval for estimating μ is
Ȳ ± tN,k−1 SȲ
Where, 𝑆𝑌̅ is the estimated standard deviation of the sample mean
SȲ ≡ √( (1 / (k(k−1))) Σ_{i=1}^{k} (Yi − Ȳ)² )
region of instance space closest to that point (i.e., the instances for which the 1-NEAREST
NEIGHBOR algorithm will assign the classification belonging to that training example).
What is the nature of the hypothesis space H implicitly considered by the k-Nearest
Neighbor algorithm?
k-Nearest Neighbor algorithm never forms an explicit general hypothesis 𝑓̂ regarding
the target function f.
What does it do?
It just computes the classification of each new query instance as it is needed.
As the figure shows, the k-Nearest Neighbor algorithm nonetheless induces an implicit decision surface
over the instance space.
The decision surface is a combination of convex polyhedra surrounding each of the
training examples.
For every training example, the polyhedron indicates the set of query points whose
classification will be completely determined by that training example. Query points
outside the polyhedron are closer to some other training example. This diagram is often called the Voronoi diagram
of the set of training examples.
The k-Nearest Neighbor algorithm is easily adapted to approximating continuous-
valued target function.
Update the algorithm to:
• Calculate the mean value of the k nearest training examples rather than calculate
their most common value, i.e.,
• Replace the final line of the algorithm by
f̂(xq) ← (1/k) Σ_{i=1}^{k} f(xi)   --- (1)
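For illustration, a minimal k-Nearest Neighbor sketch that returns the most common value of the k nearest neighbors for a discrete target, or their mean (Eq. 1) for a continuous-valued target; the names and the Euclidean distance choice are illustrative.

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3, continuous=False):
    """k-Nearest Neighbor prediction.
    examples: list of (x, f_x) pairs, where x is a tuple of real-valued attributes.
    Returns the most common value among the k nearest neighbors, or their mean
    when approximating a continuous-valued target function (Eq. 1)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    values = [f_x for _, f_x in neighbors]
    if continuous:
        return sum(values) / k
    return Counter(values).most_common(1)[0][0]
```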
Completely eliminate the least relevant attributes from the instance space.
- Equivalent to setting some of the zi scaling factors to zero.
Moore and Lee (1994)
- Leave-one-out cross-validation: the set of m training instances is repeatedly divided
into a training set of size m − 1 and a test set of size 1, in all possible ways.
- It can be applied easily with k-NN, since no separate training phase is needed for each partition.
Another issue is efficient memory indexing.
Reason: because the algorithm delays all processing until a new query is received, significant
computation can be required to process each new query.
A Note on Terminology
Instance-based learning uses terminology that has arisen from the field of statistical pattern recognition.
• Regression means approximating a real-valued target function.
• Residual is the error 𝑓̂(𝑥) − 𝑓(𝑥) in approximating the target function
• Kernel function is the function of distance that is used to determine the weight of
each training example.
o It is a function K such that wi = K(d(xi, xq))
LOCALLY WEIGHTED REGRESSION
Locally Weighted Regression (LWR) is the generalization of nearest-neighbor
approaches.
Nearest-neighbor approaches approximate the target function f(x) at the single query
point x = xq.
LWR constructs an explicit approximation to f over a local region surrounding xq.
• It uses nearby or distance-weighted training examples to form the local approximation to f.
• f can be approximated using a linear function, a quadratic function, a multilayer neural
network, or some other functional form.
LWR is
• LOCAL because nearby or distance-weighted training examples are used to form
the local approximation to f
• WEIGHTED because the contribution of each training example is weighted by its
distance from the query point
• REGRESSION because this is the term used widely in the statistical learning
community for the problem of approximating real-valued functions.
and the corresponding gradient descent training rule
Δwj = η Σ_{x∈D} (f(x) − f̂(x)) aj(x)   --- (6)
How shall we modify this procedure to derive a local approximation rather than a
global one?
Ans: Redefine the error criterion E to emphasize fitting the local training examples.
There are 3 possible criteria.
Let E(xq) denote the error defined as a function of the query point xq.
1. Minimize the squared error over just the k nearest neighbors:
E1(xq) ≡ (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))²
2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq
E2(xq) ≡ (1/2) Σ_{x∈D} (f(x) − f̂(x))² K(d(xq, x))
3. Combine 1 and 2
E3(xq) ≡ (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))² K(d(xq, x))
Criterion 3 is the best approach. If criterion 3 is used and the gradient descent rule in Eq. (6) is re-
derived, we get the following training rule:
Δwj = η Σ_{x ∈ k nearest nbrs of xq} K(d(xq, x)) (f(x) − f̂(x)) aj(x)   --- (7)
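As an illustration, here is a minimal sketch of locally weighted linear regression that, instead of the iterative rule in Eq. (7), fits the local linear model at the query point with a direct weighted least-squares solve; the Gaussian kernel width tau and all names are illustrative assumptions.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=1.0):
    """Locally weighted linear regression prediction at a single query point.
    X: (m, n) array of training inputs, y: (m,) array of target values.
    tau: kernel width controlling how quickly weights fall off with distance."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])               # add bias term
    xq = np.hstack([1.0, x_query])
    d2 = np.sum((X - x_query) ** 2, axis=1)            # squared distances to the query
    w = np.exp(-d2 / (2 * tau ** 2))                   # Gaussian kernel weights K(d(xq, x))
    W = np.diag(w)
    # Weighted least squares: solve (Xb^T W Xb) beta = Xb^T W y (tiny ridge term for stability)
    beta = np.linalg.solve(Xb.T @ W @ Xb + 1e-8 * np.eye(Xb.shape[1]), Xb.T @ W @ y)
    return xq @ beta
```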
Remarks on Locally Weighted Regression
In most cases, the target function is approximated by a constant, linear, or quadratic
function.
More complex functional forms are not often found because
(1) the cost of fitting more complex functions for each query instance is prohibitively
high, and
(2) these simple approximations model the target function quite well over a sufficiently
small sub-region of the instance space
RADIAL BASIS FUNCTION
Learning with Radial Basis Functions (RBFs) is closely related to distance-weighted regression and
to artificial neural networks (Powell 1987; Broomhead and Lowe 1988; Moody and Darken 1989).
The learned hypothesis is a function of the form
f̂(x) ≡ w0 + Σ_{u=1}^{k} wu Ku(d(xu, x))   --- (8)
where Ku(d(xu, x)) is a Gaussian function centered at the point xu with some variance σu²:
Ku(d(xu, x)) = e^( − d²(xu, x) / (2σu²) )
Eq. 8 can represent 2-layer network where
• the first layer of units computes the values of the various 𝐾𝑢(𝑑(𝑥𝑢, 𝑥)).
• the second layer computes a linear combination of these first-layer unit values
Figure 5.4: A radial basis function network. Each hidden unit produces an activation
determined by a Gaussian function centered at some instance xu.
Therefore, its activation will be close to zero unless the input x is near xu. The
output unit produces a linear combination of the hidden unit activations. Although the
network shown here has just one output, multiple output units can also be included
RBF networks are trained in 2-stage process:
1. The number k of hidden units is determined, and each hidden unit u is defined by
choosing the values of xu and σu² that define its kernel function Ku(d(xu, x)).
2. the weights wu are trained to maximize the fit of the network to the training data,
using the global error criterion given by Eq. 5
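A minimal sketch of this two-stage process, assuming the kernel centers and a shared width σ are fixed in stage one and the output-layer weights are then fit by linear least squares in stage two (the function names are illustrative, and least squares is used here in place of a gradient-descent fit of the global error criterion).

```python
import numpy as np

def train_rbf(X, y, centers, sigma=1.0):
    """Two-stage RBF training sketch: kernel centers/width fixed first, then the
    output-layer weights fit by linear least squares.
    Returns (w0, w) for f_hat(x) = w0 + sum_u w[u] * K_u(d(x_u, x))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances to centers
    Phi = np.exp(-d2 / (2 * sigma ** 2))                            # Gaussian hidden-unit activations
    Phi = np.hstack([np.ones((X.shape[0], 1)), Phi])                # bias column for w0
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return weights[0], weights[1:]

def rbf_predict(x, centers, w0, w, sigma=1.0):
    d2 = ((centers - x) ** 2).sum(axis=1)
    return w0 + np.exp(-d2 / (2 * sigma ** 2)) @ w
```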
Alternative methods to choose an appropriate number of hidden units or, equivalently,
kernel functions
o Allocate a Gaussian kernel function for each training example ⟨𝒙𝒊, 𝒇(𝒙𝒊)⟩
centering this Gaussian at the point xi.
Each kernel may be assigned the same width σ².
Result: RBF network learns a global approximation to the target function in which
each training example ⟨𝑥𝑖, 𝑓(𝑥𝑖)⟩ can influence the value of 𝑓̂ only in the
neighborhood of xi.
Advantage:
This Kernel function allows the RBF network to fit the training data exactly.
In other words,
For any set of m training examples the weights wo . ..wm for combining the
m Gaussian kernel functions can be set so that 𝑓̂(𝑥) = 𝑓(𝑥) for each training
example ⟨𝑥𝑖, 𝑓(𝑥𝑖)⟩.
o Choose a set of kernel functions that is smaller than the number of training
examples.
This is efficient if the number of training examples is large.
The set of kernel functions may be distributed with centers spaced uniformly
throughout the instance space X.
Or
Distribute the centers non-uniformly, especially if the instances themselves are
found to be distributed non-uniformly over X.
Or
Identify prototypical clusters of instances, then add a kernel function centered at each
cluster.
CASE-BASED REASONING
3 key properties of kNN and LWR:
1. They are lazy learning methods in that they defer the decision of how to generalize
beyond the training data until a new query instance is observed.
2. They classify new query instances by analyzing similar instances while ignoring
instances that are very different from the query.
3. They represent instances as real-valued points in an n-dimensional Euclidean space.
CASE-BASED REASONING
It is a learning paradigm based on properties 1 and 2 above, but not 3: instances are represented by
richer symbolic descriptions rather than points in an n-dimensional Euclidean space.
Example,
CADET System-Sycara et al. 1992
o Uses CBR to assist in the conceptual design of simple mechanical devices
such as water faucets.
o It uses a library containing approximately 75 previous designs and design
fragments to suggest conceptual designs to meet the specifications of new
design problems.
o Each instance stored in memory is represented by describing both its structure
and its qualitative function.
o Example, Water Pipes
Therefore,
CADET
• searches its library for stored cases whose functional descriptions match the design
problem.
• If an exact match is found, indicating that some stored case implements
exactly the desired function, then this case can be returned as a suggested
solution to the design problem.
• If no exact match occurs, CADET may find cases that match various sub-
graphs of the desired functional specification
The system may elaborate the original function specification graph in order to
create functionally equivalent graphs that may match still more cases.
Example,
It uses a rewrite rule that allows it to rewrite the influence
𝐴→𝐵
As
𝐴→𝑥→𝐵
Correspondence between the problem setting of CADET and the general setting for
instance-based methods such as k-Nearest Neighbor.
CADET
Each stored training example describes a function graph along with the structure
that implements it.
New queries correspond to new function graphs.
Consider the following setting.
Task of the agent: to learn a policy π: S → A for selecting its next action at based
on the current observed state st; that is, π(st) = at.
How shall we specify precisely which policy 𝜋 we would like the agent to learn?
Require the policy that produces the greatest possible cumulative reward for the
robot over time.
Therefore,
Define the cumulative value V𝜋(st) achieved by following an arbitrary policy 𝜋 , from
an arbitrary initial state st as:
Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + …
Vπ(st) ≡ Σ_{i=0}^{∞} γ^i rt+i   --- (1)
Where 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards.
Vπ(st) is called the Discounted Cumulative Reward.
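For illustration, a tiny sketch of computing the discounted cumulative reward of Eq. (1) for a finite observed reward sequence; the names are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... (Eq. 1),
    computed for a finite sequence of observed rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 0, 100], gamma=0.9))   # 0 + 0 + 0.81*100 = 81.0
```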
Variants
Finite Horizon Reward: Σ_{i=0}^{h} rt+i, the undiscounted sum of rewards over a finite
number h of steps.
Average Reward: lim_{h→∞} (1/h) Σ_{i=0}^{h} rt+i, the average reward per time step over the
entire lifetime of the agent.
numerical evaluation function defined over states and actions, then implement the
optimal policy in terms of this evaluation function.
What evaluation function should the agent attempt to learn?
One obvious choice is V*. The agent should prefer state s1 over state s2 whenever V*(s1)
> V*(s2), because the cumulative future reward will be greater from s1.
The optimal action in state s is the action a that maximizes the sum of the
immediate reward r(s, a) plus the value V* of the immediate successor state, discounted
by γ.
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]   --- (3)
The Q Function
The value of Evaluation function Q(s, a) is the reward received immediately upon
executing action a from state s, plus the value (discounted by γ ) of following the
optimal policy thereafter
𝑄(𝑠, 𝑎) ≡ 𝑟(𝑠, 𝑎) + 𝛾𝑉∗(𝛿(𝑠, 𝑎))---(4)
Q(s,a) quantity that is maximized in Eq. 3 to choose the optimal action a in
state s
Re-write Eq. (3) in terms of Q(s,a) to get,
π*(s) = argmax_a Q(s, a)   --- (5)
That is, it needs to only consider each available action a in its current state s and
choose the action that maximizes Q(s, a).
Why is this rewrite important?
If the agent learns the Q function instead of the V* function, it will be able to select
optimal actions even when it has no knowledge of the functions r and δ .
Figure 5.6 shows Q values for every state and action in the simple grid world.
Q value for each state-action transition equals the r value for this transition plus
the V* value for the resulting state discounted by 𝛾.
Optimal policy shown in the figure corresponds to selecting actions with
maximal Q values
An Algorithm for Learning Q
Learning the Q function corresponds to learning the optimal policy.
The key problem is finding a reliable way to estimate training values for Q, given
only a sequence of immediate rewards r spread out over time. This can be
accomplished through iterative approximation
V*(s) = max_{a'} Q(s, a')
o The agent moves one cell to the right in its grid world and receives an
immediate reward of zero for this transition.
o Apply the training rule
Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(δ(s, a), a')   --- (7)
to refine its estimate Q for the state-action transition it just executed. According the
training rule, the new 𝑄̂ estimate for this transition is the sum of the received reward
(zero) and the highest 𝑄̂ value associated with the resulting state (100), discounted by
γ(.9).
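A minimal sketch of tabular Q learning for a deterministic environment, using the training rule of Eq. 7; the environment functions delta and reward, and the simple random exploration, are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, reward, episodes=1000, gamma=0.9, start=None):
    """Tabular Q learning sketch for a deterministic environment.
    delta(s, a) -> successor state, reward(s, a) -> immediate reward.
    Returns the learned table Q_hat keyed by (state, action)."""
    Q = defaultdict(float)                      # Q_hat initialized to zero everywhere
    for _ in range(episodes):
        s = start if start is not None else random.choice(states)
        for _ in range(100):                    # bound the episode length
            a = random.choice(actions)          # simple exploratory action selection
            s_next, r = delta(s, a), reward(s, a)
            # Training rule (Eq. 7): Q_hat(s,a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```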
Convergence
Will the Q Learning Algorithm converge toward a Q equal to the true Q function?
Yes, under certain conditions.
i. Assume the system is a deterministic MDP.
ii. Assume the immediate reward values are bounded; that is, there exists some
positive constant c such that for all states s and actions a, | r(s, a)| < c
iii. Assume the agent selects actions in such a fashion that it visits every possible
state-action pair infinitely often
Theorem 5.1: Convergence of Q learning for deterministic Markov decision
processes.
Consider a Q learning agent in a deterministic MDP with bounded rewards ((∀s, a) |r(s, a)| ≤ c).
The Q learning agent uses the training rule of Eq. 7, initializes its table Q̂(s, a) to arbitrary
finite values, and uses a discount factor γ such that 0 ≤ γ < 1. Let Q̂n(s, a) denote the agent's
hypothesis Q̂(s, a) following the nth update. If each state-action pair is visited infinitely often,
then Q̂n(s, a) converges to Q(s, a) as n → ∞, for all s, a.
Proof: Since each state-action transition occurs infinitely often, consider consecutive
intervals during which each state-action transition occurs at least once.
To prove: the maximum error over all entries in the Q̂ table is reduced by at least
a factor of γ during each such interval. Let Q̂n denote the agent's table of estimated Q values
after n updates. Writing s' for δ(s, a), the error after an update satisfies
|Q̂n+1(s, a) − Q(s, a)| = γ | max_{a'} Q̂n(s', a') − max_{a'} Q(s', a') |   --- (ii)
≤ γ max_{a'} | Q̂n(s', a') − Q(s', a') |   --- (iii)
≤ γ max_{s'', a'} | Q̂n(s'', a') − Q(s'', a') |   --- (iv)
One common experimentation strategy is to select actions probabilistically, according to
P(ai | s) = k^(Q̂(s, ai)) / Σ_j k^(Q̂(s, aj))
Where,
P(ai | s) is the probability of selecting action ai, given that the agent is in state s
k > 0 is a constant that determines how strongly the selection favors actions with high
Q̂ values.
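A minimal sketch of this probabilistic action-selection rule; all names are illustrative, and Q is assumed to be a dictionary keyed by (state, action).

```python
import random

def select_action(Q, s, actions, k=2.0):
    """Probabilistic action selection favoring high Q_hat values:
    P(a_i | s) is proportional to k ** Q_hat(s, a_i).
    k near 1 gives nearly uniform exploration; larger k favors exploitation."""
    weights = [k ** Q[(s, a)] for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]
```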