MODULE – 5
EVALUATING HYPOTHESIS, INSTANCE-BASED
LEARNING, REINFORCEMENT LEARNING
PART 1- EVALUATING HYPOTHESIS
MOTIVATION
It is important to evaluate the performance of learned hypotheses for two reasons:
1. To understand whether to use the hypothesis.
• For instance, when learning from a limited-size database indicating the
effectiveness of different medical treatments, it is important to understand as
precisely as possible the accuracy of the learned hypotheses.
2. Evaluating hypotheses is an integral component of many learning methods.
• For example, in post-pruning decision trees to avoid overfitting, we must
evaluate the resultant trees.
When data is plentiful, estimating accuracy is straightforward.
Difficulties arise when only a limited set of data is available. They are:
1. Bias in the estimate.
The observed accuracy of the learned hypothesis over the training examples
is often a poor estimator of its accuracy over future examples.
It is a biased estimate of hypothesis accuracy over future examples.
To obtain an unbiased estimate of future accuracy, we typically test the
hypothesis on some set of test examples chosen independently of the training
examples and the hypothesis.
2. Variance in the estimate.
Even if the hypothesis accuracy is measured over an unbiased set of test
examples, independent of the training examples, the measured accuracy can
still vary from the true accuracy, depending on the makeup of the particular
set of test examples.
The smaller the set of test examples, the greater the expected variance.
ESTIMATING HYPOTHESIS ACCURACY
Setting:
There is some space of possible instances X (e.g., the set of all people) over which various
target functions may be defined (e.g., people who plan to purchase new skis this year).
Definition: The sample error (denoted errors(h)) of hypothesis h with respect to target
function f and data sample S is
errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x), h(x))
Where,
n is the number of examples in S
δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
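As a concrete illustration, here is a minimal sketch of computing the sample error from paired lists of true labels f(x) and predictions h(x); the function and variable names are illustrative, not part of the original text.

```python
def sample_error(f_values, h_values):
    """Fraction of the sample S that hypothesis h misclassifies: errorS(h)."""
    n = len(f_values)
    disagreements = sum(1 for f_x, h_x in zip(f_values, h_values) if f_x != h_x)
    return disagreements / n

print(sample_error([1, 0, 1, 1], [1, 1, 1, 0]))   # 2 of 4 examples disagree -> 0.5
```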
True Error
The true error of a hypothesis is the probability that it will misclassify a single randomly
drawn instance from the distribution D.
Definition: The true error (denoted errorD(h) )of hypothesis h with respect to target
function f and distribution D ,is the probability that h will misclassify an instance drawn
at random according to D.
errorD(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]
Where, Pr_{x∈D} denotes that the probability is taken over the instance distribution D.
Requirement
We need to know errorD(h), but all we have in hand is errorS(h).
Question: How good an estimate of errorD(h) is provided by errorS(h)?
Confidence Intervals for Discrete-Valued Hypothesis
Answers the question How good an estimate of errorD(h) is provided by errors(h)?
Here, h discrete-valued function.
To estimate the true error for some discrete-valued hypothesis h, based on its observed
sample error over a sample S
Given,
Sample S contains n examples drawn independently of one another, and independently of h, according to the
probability distribution D.
n≥30
h commits r errors over these n examples (i.e., errors(h)=r/n)
The statistical theory allows us to make the following assertions:
1. Given no other information, the most probable value of errorD(h) is errors(h).
2. With approximately 95% probability, the true error errorD(h) lies in the interval
errorS(h) ± 1.96 √( errorS(h) (1 − errorS(h)) / n )   --- (i)
Example: Suppose h commits r = 12 errors over a sample S of n = 40 examples, so that errorS(h) = 12/40 = 0.30. Then
0.30 ± 1.96 √( 0.30 × (1 − 0.30) / 40 )
= 0.30 ± 1.96 √( 0.21 / 40 )
= 0.30 ± 1.96 √0.00525
= 0.30 ± 1.96 × 0.07
Confidence Interval: 0.30 ± 0.14
The expression in Eq. (i) for the 95% confidence interval can be generalized to any desired
confidence level. Let zN be the constant used to calculate N% confidence intervals for errorD(h).
Then Eq. (i) can be rewritten as
errorS(h) ± zN √( errorS(h) (1 − errorS(h)) / n )   --- (5.1)
Therefore, for the above example, if 68% is the desired confidence level (z68 = 1.00), we get the
confidence interval 0.30 ± 1.00 × 0.07.
Eq. (5.1.) describes how to calculate the confidence intervals, or error bars, for
estimates of errorD(h) that are based on errors(h).
It provides only an approximate confidence interval, though the approximation is quite
good when the sample contains at least 30 examples and errorS(h) is not too close to 0
or 1.
Rule of thumb The above approximation works well when
n · errorS(h) · (1 − errorS(h)) ≥ 5
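A minimal sketch of this calculation, using a small table of two-sided zN values from the standard Normal distribution; the function name and structure are illustrative.

```python
import math

# Approximate two-sided z values for common confidence levels (standard Normal table).
Z_N = {0.68: 1.00, 0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}

def confidence_interval(r, n, level=0.95):
    """N% confidence interval for errorD(h), given r errors observed on n test examples."""
    error_s = r / n
    if n * error_s * (1 - error_s) < 5:
        print("Warning: Normal approximation may be poor (rule of thumb violated).")
    margin = Z_N[level] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - margin, error_s + margin

print(confidence_interval(12, 40))   # roughly (0.16, 0.44), i.e., 0.30 +/- 0.14
```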
BASICS OF SAMPLING THEORY
Summary
σx = √𝑛𝑝(1 − 𝑝)
For sufficiently large values of n the Binomial distribution is closely approximated
by a Normal distribution with the same mean and variance.
Note: Use the Normal approximation only when np(1-p) ≥5.
R ≡ Σ_{i=1}^{n} Yi
The probability that the random variable R will take on a specific value r (e.g.,
the probability of observing exactly r heads) is given by the Binomial distribution
(Figure 5.1.)
Pr(R = r) = [ n! / ( r! (n − r)! ) ] p^r (1 − p)^(n−r)   --- (5.2)
𝜎𝑌 ≡ √𝑛𝑝(1 − 𝑝) ---(5.7)
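For illustration, a small sketch of the Binomial probability in Eq. 5.2; the function name is illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of observing exactly r errors in n independent trials (Eq. 5.2)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# e.g., probability of exactly 12 misclassifications in 40 trials when p = 0.30
print(binomial_pmf(12, 40, 0.30))
```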
Estimators, Bias, and Variance
Question What is the likely difference between errors(h) and the true error errorD(h)?
Rewrite errors(h) and errorD(h) using terms in Eq.(5.2) defining the Binomial
distribution:
errorS(h) = r / n
errorD(h) = p
Where,
n number of instances in the sample S
r number of instances from S misclassified by h
p probability of misclassifying a single instance drawn from D.
Estimation Bias
errors(h) an estimator for true error errorD(h).
An estimator is any random variable used to estimate some parameter of the underlying
population from which the sample is drawn.
Estimation bias is the difference between the expected value of the estimator and the true
value of the parameter.
Definition: The estimation bias of an estimator Y for an arbitrary parameter p is
E[Y] − p
If the estimation bias is zero, Y is an unbiased estimator for p.
Question Is errors(h) an unbiased estimator for errorD(h)?
Yes.
For the Binomial distribution, the expected value of r is equal to np.
Given that n is a constant, the expected value of r/n is p.
Another property of an estimator is its variance. Given a choice among alternative
unbiased estimators, choose the one with least variance; it yields the smallest expected
squared error between the estimate and the true value of the parameter.
Example: for r = 12 and n = 40, an unbiased estimate for errorD(h) is given by errorS(h) = r/n = 0.30.
Confidence Interval
To describe the uncertainty associated with an estimate, give an interval within which
the true value is expected to fall, along with the probability with which it is expected to
fall into this interval. Such estimates are called Confidence Interval estimates.
Definition: An N% confidence interval for some parameter p is an interval that is
expected with probability N% to contain p.
Example: Given r = 12 and n = 40, there is approximately a 95% probability that the interval
0.30 ± 0.14 contains the true error errorD(h).
Normal Distribution
It is a bell-shaped distribution fully specified by its mean μ and standard deviation σ
(Figure 5.2).
Figure 5.3: A Normal distribution with mean 0 and standard deviation 1. With 80%
confidence, the value of the random variable will lie in the two-sided interval
[−1.28, 1.28]. Note z0.80 = 1.28. With 10% probability it will lie to the right of this
interval, and with 10% probability it will lie to the left.
Therefore,
If a random variable Y obeys a Normal distribution with mean μ and standard
deviation σ, then the measured random value y of Y will fall into the following
interval N% of the time
𝜇 ± 𝑍𝑁𝜎 --- (5.10)
Equivalently, the mean μ will fall into the following interval N% of the time
𝑦 ± 𝑍𝑁𝜎 --- (5.11)
Derivation of General Expression for N% confidence interval:
We know that,
- errorS(h) follows a Binomial distribution with mean value errorD(h) and
standard deviation as in Eq. 5.9.
- For sufficiently large sample size n, the Binomial distribution is well
approximated by a Normal distribution.
- Eq. 5.11 is used to find the N% confidence interval for estimating the mean
value of a Normal distribution.
Substituting the mean and standard deviation of errors(h) into Eq. 5.11 we get Eq.
5.1. for N% confidence intervals for discrete-valued hypotheses.
Eq. 5.1 gives a two-sided bound: it bounds the estimated quantity from above and from
below.
One-sided bound example:
Question of interest: What is the probability that errorD(h) is at most U?
This reflects a case where we only need to bound the maximum error of h and do not mind if the true error
is much smaller than estimated.
Such a one-sided error bound on errorD(h) is obtained by keeping only the upper bound U of a two-sided
interval; this halves the chance of error, so an N% two-sided interval yields a one-sided bound with
(100 − (100 − N)/2)% confidence (e.g., a 95% two-sided interval gives a 97.5% one-sided upper bound).
GENERAL APPROACH OF DERIVING CONFIDENCE
INTERVALS
The general process includes the following steps:
1. Identify the underlying population parameter p to be estimated, for example,
errorD(h).
2. Define the estimator Y. (E.g., errors(h)). It is desirable to choose a minimum-
variance, unbiased estimator.
3. Determine the probability distribution DY that governs the estimator Y, including
its mean and variance.
4. Determine the N% confidence interval by finding thresholds L and U such
that N% of the mass in the probability distribution DY falls between L and U.
5.4.1. Central Limit Theorem
Theorem 5.1: Consider a set of independent, identically distributed random variables
Y1…Yn governed by an arbitrary probability distribution with mean μ and finite variance
σ². Define the sample mean
Ȳn ≡ (1/n) Σ_{i=1}^{n} Yi
Then as n → ∞, the distribution governing (Ȳn − μ) / (σ / √n) approaches a Normal distribution
with mean 0 and standard deviation 1.
Difference in Error of Two Hypotheses
Suppose we wish to estimate the difference d between the true errors of two hypotheses h1 and h2,
where h1 has been tested on a sample S1 containing n1 examples and h2 on an independent sample S2
containing n2 examples:
d ≡ errorD(h1) − errorD(h2)
The obvious estimator is the difference between the sample errors, d̂ ≡ errorS1(h1) − errorS2(h2).
Next, derive confidence intervals that characterize the likely error in employing d̂ to
estimate d. For a random variable d̂ obeying a Normal distribution with mean d and
variance σ², the N% confidence interval estimate for d is d̂ ± zN σ.
Using the approximate variance σd̂², the approximate N% confidence interval estimate
for d is
d̂ ± zN √( errorS1(h1)(1 − errorS1(h1)) / n1 + errorS2(h2)(1 − errorS2(h2)) / n2 )   --- (5.13)
Eq. 5.13 is the general two-sided confidence interval for estimating the difference
between the errors of two hypotheses.
Acceptable to be used where h1 and h2 are tested on a single sample S (where S is
still independent of h1 and h2). Hence, 𝑑̂ can be redefined as
d̂ ≡ errorS(h1) − errorS(h2)
5.5.1. Hypothesis Testing
What is the probability that errorD(h1 ) > errorD(h2)?
Suppose the sample errors for h1 and h2 are measured using two independent samples S1 and
S2, each of size 100, and we find that
errorS1(h1) = 0.30 and errorS2(h2) = 0.20
Eq. 5.14 gives the expected value of the difference in errors between two learning methods LA
and LB:
E_{S⊂D} [ errorD(LA(S)) − errorD(LB(S)) ]   --- (5.14)
where the expectation is taken over training sets S drawn according to the underlying instance distribution D.
Consider a limited sample D0 for comparing algorithms.
Divide D0 into a training set S0 and a disjoint test set T0 .
o The training data can be used to train both LA and LB.
o The test set can be used to compare the accuracy of the two learned hypotheses.
Here we measure the quantity,
𝑒𝑟𝑟𝑜𝑟𝑇0 (𝐿𝐴(𝑆0)) − 𝑒𝑟𝑟𝑜𝑟𝑇0 (𝐿𝐵(𝑆0)) ---(5.15)
Difference between Eq. 5.14 and Eq. 5.15
- errorT0(h) is used to approximate errorD(h).
- The difference in errors is measured for a single training set S0 rather than taking the
expected value over all samples S that might be drawn from D.
To improve the estimator in Eq. 5.15
- Repeatedly partition the data D0 into disjoint training and test sets and
- Take the mean of the test set errors for these different experiments
Result Procedure below
Procedure: to estimate the difference in error between LA and LB.
1. Partition the available data D0 into k disjoint subsets T1 ,T2…Tk of equal size,
where this size is at least 30.
2. For i from 1 to k, do
   use Ti for the test set, and the remaining data for training set Si
      Si ← {D0 − Ti}
      hA ← LA(Si)
      hB ← LB(Si)
      δi ← errorTi(hA) − errorTi(hB)
3. Return the value δ̄, where
   δ̄ ≡ (1/k) Σ_{i=1}^{k} δi   --- (T5.1)
- Train and test the learning algorithms k times, using each of the k subsets in turn
as the test set, and use all remaining data as the training set.
- After testing for all k independent test sets, return the mean difference 𝛿̅ that
represents an estimate of the difference between the two learning algorithms.
δ̄ can be taken as an estimate of the desired quantity from Eq. 5.14. More precisely, δ̄ is an
estimate of the quantity
E_{S⊂D0} [ errorD(LA(S)) − errorD(LB(S)) ]   --- (5.16)
where S is a random sample of size ((k−1)/k) |D0| drawn uniformly from D0.
The approximate N% confidence interval for estimating the quantity in Eq. 5.16 using
𝛿̅ is given by
𝛿 ̅ ± 𝑡𝑁,𝑘−1 𝑆𝛿̅ --- (5.17)
Where,
tN,k−1 is a constant analogous to zN; the subscript k − 1 is the number of degrees of freedom.
Sδ̄ is an estimate of the standard deviation of the distribution governing δ̄. It is
defined as
Sδ̄ ≡ √( (1 / (k(k−1))) Σ_{i=1}^{k} (δi − δ̄)² )   --- (5.18)
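A minimal sketch of this k-fold paired comparison (Eqs. T5.1, 5.17 and 5.18); the learner and error functions are assumed to be supplied by the caller, and the small t-table holds approximate standard 95% two-sided values.

```python
import math
import random

# Approximate two-sided t values for 95% confidence, indexed by degrees of freedom k-1.
T_95 = {2: 4.30, 4: 2.78, 9: 2.26, 19: 2.09, 29: 2.05}

def paired_k_fold_difference(data, train_A, train_B, error_fn, k=10, seed=0):
    """Estimate the difference in error between learners LA and LB with the
    k-fold procedure above.  train_A/train_B map a training set to a hypothesis;
    error_fn maps (hypothesis, test_set) to its sample error."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]                 # k disjoint subsets T1..Tk
    deltas = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_A, h_B = train_A(train), train_B(train)
        deltas.append(error_fn(h_A, test) - error_fn(h_B, test))           # delta_i
    delta_bar = sum(deltas) / k                                            # Eq. T5.1
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))  # Eq. 5.18
    margin = T_95[k - 1] * s                                               # Eq. 5.17, 95% level
    return delta_bar, (delta_bar - margin, delta_bar + margin)
```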
Paired Tests
Tests where the hypotheses are evaluated over identical samples are called paired
tests. Paired tests typically produce tighter confidence intervals because any
differences in observed errors in a paired test are due to differences between the
hypotheses.
Note: when the hypotheses are tested on separate data samples, differences in the two
sample errors might be partially attributable to differences in the makeup of the two
samples.
Paired t Tests
To understand the justification for the confidence interval estimate given by Eq. 5.17,
consider,
- Given, the observed values of a set of independent, identically distributed random
variables Y1, Y2, ...,Yk.
- Mean μ of the probability distribution governing Yi.
- The estimator used is sample mean 𝑌̅
Ȳ ≡ (1/k) Σ_{i=1}^{k} Yi
The t test represented by Eq. 5.17 and Eq. 5.18 applies to the case in which the Yi follow
a Normal distribution, i.e., to the situation in which the task is to estimate
the sample mean of a collection of independent, identically and Normally distributed
random variables. Using Eq. 5.17 and Eq. 5.18, the N% confidence interval for estimating μ is
Ȳ ± tN,k−1 SȲ
Where, 𝑆𝑌̅ is the estimated standard deviation of the sample mean
SȲ ≡ √( (1 / (k(k−1))) Σ_{i=1}^{k} (Yi − Ȳ)² )
region of instance space closest to that point (i.e., the instances for which the 1-NEAREST
NEIGHBOR algorithm will assign the classification belonging to that training example).
What is the nature of the hypothesis space H implicitly considered by the k-Nearest
Neighbor algorithm?
k-Nearest Neighbor algorithm never forms an explicit general hypothesis 𝑓̂ regarding
the target function f.
What does it do?
It just computes the classification of each new query instance as it is needed.
As the figure shows, the k-Nearest Neighbor algorithm nonetheless induces an implicit decision surface
over the instance space.
The decision surface is a combination of convex polyhedra surrounding each of the
training examples.
For every training example, the polyhedron indicates the set of query points whose
classification will be completely determined by that training example. Query points
outside the polyhedron are closer to some other training example. This diagram is often called the Voronoi diagram
of the set of training examples.
The k-Nearest Neighbor algorithm is easily adapted to approximating continuous-
valued target function.
Update the algorithm to:
• Calculate the mean value of the k nearest training examples rather than calculate
their most common value, i.e.,
• Replace the final line of the algorithm by
f̂(xq) ← (1/k) Σ_{i=1}^{k} f(xi)   --- (1)
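For illustration, a minimal k-Nearest Neighbor sketch that returns the most common value of the k nearest neighbors for a discrete target, or their mean (Eq. 1) for a continuous-valued target; the names and the Euclidean distance choice are illustrative.

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3, continuous=False):
    """k-Nearest Neighbor prediction.
    examples: list of (x, f_x) pairs, where x is a tuple of real-valued attributes.
    Returns the most common value among the k nearest neighbors, or their mean
    when approximating a continuous-valued target function (Eq. 1)."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    values = [f_x for _, f_x in neighbors]
    if continuous:
        return sum(values) / k
    return Counter(values).most_common(1)[0][0]
```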
Completely eliminate the least relevant attributes from the instance space.
- Equivalent to setting some of the zi scaling factors to zero.
Moore and Lee (1994)
- Leave-one-out cross-validation: the set of m training instances is repeatedly divided
into a training set of size m − 1 and a test set of size 1, in all possible ways.
- It can be applied easily with k-NN, since no separate training phase is needed for each partition.
Another issue is efficient memory indexing.
Reason: because the algorithm delays all processing until a new query is received, significant
computation can be required to process each new query.
A Note on Terminology
Instance-based learning uses terminology that has arisen from the field of statistical pattern recognition.
• Regression means approximating a real-valued target function.
• Residual is the error 𝑓̂(𝑥) − 𝑓(𝑥) in approximating the target function
• Kernel function is the function of distance that is used to determine the weight of
each training example.
o It is a function K such that wi = K(d(xi, xq))
LOCALLY WEIGHTED REGRESSION
Locally Weighted Regression (LWR) is the generalization of nearest-neighbor
approaches.
Nearest-neighbor approaches approximate the target function f(x) at the single query
point x = xq.
LWR constructs an explicit approximation to f over a local region surrounding xq.
• It uses nearby or distance-weighted training examples to form the local approximation to f.
• f can be approximated using a linear function, a quadratic function, a multilayer neural
network, or some other functional form.
LWR is
• LOCAL because nearby or distance-weighted training examples are used to form
the local approximation to f
• WEIGHTED because the contribution of each training example is weighted by its
distance from the query point
• REGRESSION because this is the term used widely in the statistical learning
community for the problem of approximating real-valued functions.
and the corresponding gradient descent training rule
Δwj = η Σ_{x∈D} (f(x) − f̂(x)) aj(x)   --- (6)
How shall we modify this procedure to derive a local approximation rather than a
global one?
Ans: Redefine the error criterion E to emphasize fitting the local training examples.
There are 3 possible criteria.
Let E(xq) denote the error defined as a function of the query point xq.
1. Minimize the squared error over just the k nearest neighbors:
E1(xq) ≡ (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))²
2. Minimize the squared error over the entire set D of training examples, while
weighting the error of each training example by some decreasing function K of its
distance from xq
E2(xq) ≡ (1/2) Σ_{x∈D} (f(x) − f̂(x))² K(d(xq, x))
3. Combine 1 and 2
E3(xq) ≡ (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))² K(d(xq, x))
Criterion 3 is the best approach. If criterion 3 is used and the gradient descent rule in Eq. (6) is re-
derived, we get the following training rule:
Δwj = η Σ_{x ∈ k nearest nbrs of xq} K(d(xq, x)) (f(x) − f̂(x)) aj(x)   --- (7)
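As an illustration, here is a minimal sketch of locally weighted linear regression that, instead of the iterative rule in Eq. (7), fits the local linear model at the query point with a direct weighted least-squares solve; the Gaussian kernel width tau and all names are illustrative assumptions.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=1.0):
    """Locally weighted linear regression prediction at a single query point.
    X: (m, n) array of training inputs, y: (m,) array of target values.
    tau: kernel width controlling how quickly weights fall off with distance."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])               # add bias term
    xq = np.hstack([1.0, x_query])
    d2 = np.sum((X - x_query) ** 2, axis=1)            # squared distances to the query
    w = np.exp(-d2 / (2 * tau ** 2))                   # Gaussian kernel weights K(d(xq, x))
    W = np.diag(w)
    # Weighted least squares: solve (Xb^T W Xb) beta = Xb^T W y (tiny ridge term for stability)
    beta = np.linalg.solve(Xb.T @ W @ Xb + 1e-8 * np.eye(Xb.shape[1]), Xb.T @ W @ y)
    return xq @ beta
```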
Remarks on Locally Weighted Regression
In most cases, the target function is approximated by a constant, linear, or quadratic
function.
More complex functional forms are not often found because
(1) the cost of fitting more complex functions for each query instance is prohibitively
high, and
(2) these simple approximations model the target function quite well over a sufficiently
small sub-region of the instance space
RADIAL BASIS FUNCTION
Learning with Radial Basis Functions (RBFs) is closely related to distance-weighted regression and
to artificial neural networks (Powell 1987; Broomhead and Lowe 1988; Moody and Darken 1989).
The learned hypothesis is a function of the form
f̂(x) ≡ w0 + Σ_{u=1}^{k} wu Ku(d(xu, x))   --- (8)
where Ku(d(xu, x)) is a Gaussian function centered at the point xu with some variance σu²:
Ku(d(xu, x)) = e^( − d²(xu, x) / (2σu²) )
Eq. 8 can represent 2-layer network where
• the first layer of units computes the values of the various 𝐾𝑢(𝑑(𝑥𝑢, 𝑥)).
• the second layer computes a linear combination of these first-layer unit values
Figure 5.4: A radial basis function network. Each hidden unit produces an activation
determined by a Gaussian function centered at some instance xu.
Therefore, its activation will be close to zero unless the input x is near xu. The
output unit produces a linear combination of the hidden unit activations. Although the
network shown here has just one output, multiple output units can also be included
RBF networks are trained in 2-stage process:
1. The number k of hidden units is determined, and each hidden unit u is defined by
choosing the values of xu and σu² that define its kernel function Ku(d(xu, x)).
2. the weights wu are trained to maximize the fit of the network to the training data,
using the global error criterion given by Eq. 5
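A minimal sketch of this two-stage process, assuming the kernel centers and a shared width σ are fixed in stage one and the output-layer weights are then fit by linear least squares in stage two (the function names are illustrative, and least squares is used here in place of a gradient-descent fit of the global error criterion).

```python
import numpy as np

def train_rbf(X, y, centers, sigma=1.0):
    """Two-stage RBF training sketch: kernel centers/width fixed first, then the
    output-layer weights fit by linear least squares.
    Returns (w0, w) for f_hat(x) = w0 + sum_u w[u] * K_u(d(x_u, x))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances to centers
    Phi = np.exp(-d2 / (2 * sigma ** 2))                            # Gaussian hidden-unit activations
    Phi = np.hstack([np.ones((X.shape[0], 1)), Phi])                # bias column for w0
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return weights[0], weights[1:]

def rbf_predict(x, centers, w0, w, sigma=1.0):
    d2 = ((centers - x) ** 2).sum(axis=1)
    return w0 + np.exp(-d2 / (2 * sigma ** 2)) @ w
```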
Alternative methods to choose an appropriate number of hidden units or, equivalently,
kernel functions
o Allocate a Gaussian kernel function for each training example ⟨𝒙𝒊, 𝒇(𝒙𝒊)⟩
centering this Gaussian at the point xi.
Each kernel may be assigned the same width σ².
Result: RBF network learns a global approximation to the target function in which
each training example ⟨𝑥𝑖, 𝑓(𝑥𝑖)⟩ can influence the value of 𝑓̂ only in the
neighborhood of xi.
Advantage:
This Kernel function allows the RBF network to fit the training data exactly.
In other words,
For any set of m training examples the weights wo . ..wm for combining the
m Gaussian kernel functions can be set so that 𝑓̂(𝑥) = 𝑓(𝑥) for each training
example ⟨𝑥𝑖, 𝑓(𝑥𝑖)⟩.
o Choose a set of kernel functions that is smaller than the number of training
examples.
This is efficient if the number of training examples is large.
The set of kernel functions may be distributed with centers spaced uniformly
throughout the instance space X.
Or
Distribute the centers non-uniformly, especially if the instances themselves are
found to be distributed non-uniformly over X.
Or
Identify prototypical clusters of instances, then add a kernel function centered at each
cluster.
CASE-BASED REASONING
3 key properties of kNN and LWR:
1. They are lazy learning methods in that they defer the decision of how to generalize
beyond the training data until a new query instance is observed.
2. They classify new query instances by analyzing similar instances while ignoring
instances that are very different from the query.
3. They represent instances as real-valued points in an n-dimensional Euclidean space.
CASE-BASED REASONING
It is a learning paradigm based on properties 1 and 2 above, but not 3: instances are represented by
richer symbolic descriptions rather than points in an n-dimensional Euclidean space.
Example,
CADET System-Sycara et al. 1992
o Uses CBR to assist in the conceptual design of simple mechanical devices
such as water faucets.
o It uses a library containing approximately 75 previous designs and design
fragments to suggest conceptual designs to meet the specifications of new
design problems.
o Each instance stored in memory is represented by describing both its structure
and its qualitative function.
o Example, Water Pipes
Therefore,
CADET
• searches its library for stored cases whose functional descriptions match the design
problem.
• If an exact match is found, indicating that some stored case implements
exactly the desired function, then this case can be returned as a suggested
solution to the design problem.
• If no exact match occurs, CADET may find cases that match various sub-
graphs of the desired functional specification
The system may elaborate the original function specification graph in order to
create functionally equivalent graphs that may match still more cases.
Example,
It uses a rewrite rule that allows it to rewrite the influence
𝐴→𝐵
As
𝐴→𝑥→𝐵
Correspondence between the problem setting of CADET and the general setting for
instance-based methods such as k-Nearest Neighbor.
CADET
Each stored training example describes a function graph along with the structure
that implements it.
New queries correspond to new function graphs.
Consider the following setting.
Task of the agent: to learn a policy π: S → A for selecting its next action at based
on the current observed state st; that is, π(st) = at.
How shall we specify precisely which policy 𝜋 we would like the agent to learn?
Require the policy that produces the greatest possible cumulative reward for the
robot over time.
Therefore,
Define the cumulative value V𝜋(st) achieved by following an arbitrary policy 𝜋 , from
an arbitrary initial state st as:
Vπ(st) ≡ rt + γ rt+1 + γ² rt+2 + …
Vπ(st) ≡ Σ_{i=0}^{∞} γ^i rt+i   --- (1)
Where 0 ≤ γ < 1 is a constant that determines the relative value of delayed versus immediate rewards.
Vπ(st) is called the Discounted Cumulative Reward.
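For illustration, a tiny sketch of computing the discounted cumulative reward of Eq. (1) for a finite observed reward sequence; the names are illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... (Eq. 1),
    computed for a finite sequence of observed rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 0, 100], gamma=0.9))   # 0 + 0 + 0.81*100 = 81.0
```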
Variants
Finite Horizon Reward: Σ_{i=0}^{h} rt+i, the undiscounted sum of rewards over a finite
number h of steps.
Average Reward: lim_{h→∞} (1/h) Σ_{i=0}^{h} rt+i, the average reward per time step over the
entire lifetime of the agent.
numerical evaluation function defined over states and actions, then implement the
optimal policy in terms of this evaluation function.
What evaluation function should the agent attempt to learn?
One obvious choice is V*. The agent should prefer state s1 over state s2 whenever V*(s1)
> V*(s2), because the cumulative future reward will be greater from s1.
The optimal action in state s is the action a that maximizes the sum of the
immediate reward r(s, a) plus the value V* of the immediate successor state, discounted
by γ.
π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]   --- (3)
The Q Function
The value of Evaluation function Q(s, a) is the reward received immediately upon
executing action a from state s, plus the value (discounted by γ ) of following the
optimal policy thereafter
𝑄(𝑠, 𝑎) ≡ 𝑟(𝑠, 𝑎) + 𝛾𝑉∗(𝛿(𝑠, 𝑎))---(4)
Q(s,a) quantity that is maximized in Eq. 3 to choose the optimal action a in
state s
Re-write Eq. (3) in terms of Q(s,a) to get,
π*(s) = argmax_a Q(s, a)   --- (5)
That is, it needs to only consider each available action a in its current state s and
choose the action that maximizes Q(s, a).
Why is this rewrite important?
If the agent learns the Q function instead of the V* function, it will be able to select
optimal actions even when it has no knowledge of the functions r and δ .
Figure 5.6 shows Q values for every state and action in the simple grid world.
Q value for each state-action transition equals the r value for this transition plus
the V* value for the resulting state discounted by 𝛾.
Optimal policy shown in the figure corresponds to selecting actions with
maximal Q values
An Algorithm for Learning Q
Learning the Q function corresponds to learning the optimal policy.
The key problem is finding a reliable way to estimate training values for Q, given
only a sequence of immediate rewards r spread out over time. This can be
accomplished through iterative approximation
V*(s) = max_{a'} Q(s, a')
o The agent moves one cell to the right in its grid world and receives an
immediate reward of zero for this transition.
o Apply the training rule
Q̂(s, a) ← r(s, a) + γ max_{a'} Q̂(δ(s, a), a')   --- (7)
to refine its estimate Q for the state-action transition it just executed. According the
training rule, the new 𝑄̂ estimate for this transition is the sum of the received reward
(zero) and the highest 𝑄̂ value associated with the resulting state (100), discounted by
γ(.9).
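A minimal sketch of tabular Q learning for a deterministic environment, using the training rule of Eq. 7; the environment functions delta and reward, and the simple random exploration, are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(states, actions, delta, reward, episodes=1000, gamma=0.9, start=None):
    """Tabular Q learning sketch for a deterministic environment.
    delta(s, a) -> successor state, reward(s, a) -> immediate reward.
    Returns the learned table Q_hat keyed by (state, action)."""
    Q = defaultdict(float)                      # Q_hat initialized to zero everywhere
    for _ in range(episodes):
        s = start if start is not None else random.choice(states)
        for _ in range(100):                    # bound the episode length
            a = random.choice(actions)          # simple exploratory action selection
            s_next, r = delta(s, a), reward(s, a)
            # Training rule (Eq. 7): Q_hat(s,a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```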
Convergence
Will the Q Learning Algorithm converge toward a Q equal to the true Q function?
Yes, under certain conditions.
i. Assume the system is a deterministic MDP.
ii. Assume the immediate reward values are bounded; that is, there exists some
positive constant c such that for all states s and actions a, | r(s, a)| < c
iii. Assume the agent selects actions in such a fashion that it visits every possible
state-action pair infinitely often
Theorem 5.1: Convergence of Q learning for deterministic Markov decision
processes.
Consider a Q learning agent in a deterministic MDP with bounded rewards ((∀s, a) |r(s, a)| ≤ c).
The Q learning agent uses the training rule of Eq. 7, initializes its table Q̂(s, a) to arbitrary
finite values, and uses a discount factor γ such that 0 ≤ γ < 1. Let Q̂n(s, a) denote the agent's
hypothesis Q̂(s, a) following the nth update. If each state-action pair is visited infinitely often,
then Q̂n(s, a) converges to Q(s, a) as n → ∞, for all s, a.
Proof: Since each state-action transition occurs infinitely often, consider consecutive
intervals during which each state-action transition occurs at least once.
To prove: the maximum error over all entries in the Q̂ table is reduced by at least
a factor of γ during each such interval. Let Q̂n denote the agent's table of estimated Q values
after n updates. Writing s' for δ(s, a), the error after an update satisfies
|Q̂n+1(s, a) − Q(s, a)| = γ | max_{a'} Q̂n(s', a') − max_{a'} Q(s', a') |   --- (ii)
≤ γ max_{a'} | Q̂n(s', a') − Q(s', a') |   --- (iii)
≤ γ max_{s'', a'} | Q̂n(s'', a') − Q(s'', a') |   --- (iv)
One common experimentation strategy is to select actions probabilistically, according to
P(ai | s) = k^(Q̂(s, ai)) / Σ_j k^(Q̂(s, aj))
Where,
P(ai | s) is the probability of selecting action ai, given that the agent is in state s
k > 0 is a constant that determines how strongly the selection favors actions with high
Q̂ values.
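A minimal sketch of this probabilistic action-selection rule; all names are illustrative, and Q is assumed to be a dictionary keyed by (state, action).

```python
import random

def select_action(Q, s, actions, k=2.0):
    """Probabilistic action selection favoring high Q_hat values:
    P(a_i | s) is proportional to k ** Q_hat(s, a_i).
    k near 1 gives nearly uniform exploration; larger k favors exploitation."""
    weights = [k ** Q[(s, a)] for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]
```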