AIML- Module 4- Updated
MODULE – 4
BAYESIAN LEARNING
INTRODUCTION
Bayesian learning methods are relevant to our study of machine learning for two different
reasons.
First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses,
such as the naive Bayes classifier, are among the most practical approaches to certain
types of learning problems.
The second reason that Bayesian methods are important to our study of machine
learning is that they provide a useful perspective for understanding many learning
algorithms that do not explicitly manipulate probabilities.
Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example.
Prior knowledge can be combined with observed data to determine the final probability
of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a
prior probability for each candidate hypothesis, and (2) a probability distribution over
observed data for each possible hypothesis.
Bayesian methods can accommodate hypotheses that make probabilistic predictions
(e.g., hypotheses such as "this pneumonia patient has a 93% chance of complete
recovery").
New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
BAYESIAN THEOREM
In machine learning we are often interested in determining the best hypothesis from some
space H, given the observed training data D. Bayes theorem provides a way to calculate
the probability of a hypothesis based on its prior probability, the probabilities of observing
various data given the hypothesis, and the observed data itself.
P(h) → initial probability that hypothesis h holds, before we have observed the training
data. P(h) is often called the prior probability of h and may reflect any background
knowledge we have about the chance that h is a correct hypothesis. Similarly, P(D)
denotes the prior probability that training data D will be observed, and P(D|h) denotes
the probability of observing D given a world in which h holds.
“Bayes theorem provides a way to calculate the posterior probability P(h|D), from the
prior probability P(h), together with P(D) and P(D|h).”
Bayes Theorem:    P(h|D) = P(D|h) P(h) / P(D)    ---(1)
Here, P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem. It is also
reasonable to see that P(h|D) decreases as P(D) increases, because the more probable it
is that D will be observed independent of h, the less evidence D provides in support of h.
In many learning scenarios, the learner considers some set of candidate hypotheses H and
is interested in finding the most probable hypothesis h ∈ H given the observed data D (or
at least one of the maximally probable if there are several). Any such maximally probable
hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the
MAP hypotheses by using Bayes theorem to calculate the posterior probability of each
candidate hypothesis. More precisely, we will say that hMAP is a MAP hypothesis
provided,
hMAP ≡ argmax_{h∈H} P(h|D)

     ≡ argmax_{h∈H} P(D|h) P(h) / P(D)

     ≡ argmax_{h∈H} P(D|h) P(h)    ---(2)

(In the final step we drop P(D), because it is a constant independent of h.)
In some cases, we will assume that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H). In this case we can further simplify the above
equation and need only consider the term P(D|h) to find the most probable hypothesis.
P(D|h) is often called the likelihood of the data D given h, and any hypothesis that
maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, hML.
hML ≡ argmax_{h∈H} P(D|h)    ---(3)
An Example
To illustrate Bayes rule, consider a medical diagnosis problem in which there are two
alternative hypotheses: (1) that the patient has a particular form of cancer, and (2) that
the patient does not.
The available data is from a particular laboratory test with two possible outcomes:
⊕ (positive) and ⊖ (negative).
We have prior knowledge that over the entire population of people only .008 have this
disease.
The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the disease
is not present.
Suppose we now observe a new patient for whom the lab test returns a positive result.
Should we diagnose the patient as having cancer or not?
Solution
The prior and conditional probabilities given in the problem statement are:

P(cancer) = 0.008        P(¬cancer) = 0.992
P(⊕|cancer) = 0.98       P(⊖|cancer) = 0.02
P(⊕|¬cancer) = 0.03      P(⊖|¬cancer) = 0.97

hMAP ≡ argmax_{h∈H} P(D|h) P(h)

P(⊕|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(⊕|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298

OUTPUT: hMAP = ¬cancer, because P(⊕|¬cancer) P(¬cancer) is the larger of the two
quantities.
Note: The exact posterior probabilities can also be determined by normalizing the above
quantities so that they sum to 1.
P(cancer|⊕) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(¬cancer|⊕) = 0.0298 / (0.0078 + 0.0298) = 0.79
This step is warranted because Bayes theorem states that the posterior probabilities
are just the above quantities divided by the probability of the data, P(+). Although P(+)
was not provided directly as part of the problem statement, we can calculate it in this
fashion because we know that P(cancer|+) and P(¬cancer|+) must sum to 1.
Notice that while the posterior probability of cancer is significantly higher than its
prior probability, the most probable hypothesis is still that the patient does not have
cancer.
As this example illustrates, the result of Bayesian inference depends strongly on the
prior probabilities, which must be available in order to apply the method directly. Note
also that in this example the hypotheses are not completely accepted or rejected, but
rather become more or less probable as more data is observed.
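A minimal Python sketch of this computation (the probability values are those given in the example above; variable names are illustrative):

```python
# Priors and test characteristics from the example above.
p_cancer = 0.008
p_not_cancer = 0.992
p_pos_given_cancer = 0.98      # correct positive rate
p_pos_given_not_cancer = 0.03  # 1 - 0.97 correct negative rate

# Unnormalized posteriors P(D|h)P(h) for D = positive test result.
u_cancer = p_pos_given_cancer * p_cancer              # ~0.0078
u_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~0.0298

# Normalize so the posteriors sum to 1 (i.e., divide by P(+)).
p_data = u_cancer + u_not_cancer
print(f"P(cancer|+)  = {u_cancer / p_data:.2f}")      # 0.21
print(f"P(~cancer|+) = {u_not_cancer / p_data:.2f}")  # 0.79
```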
Bayes theorem provides a principled way to calculate the posterior probability of each
hypothesis given the training data. It acts as a basis for a straightforward learning
algorithm that calculates the probability for each possible hypothesis, then outputs the
most probable one.
WHAT IS NOTICED?
Under certain conditions several algorithms output the same hypotheses as the brute-force
Bayesian algorithm, which applies Bayes theorem

P(h|D) = P(D|h) P(h) / P(D)

directly to every candidate hypothesis:

BRUTE-FORCE MAP LEARNING algorithm
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D).
2. Output the hypothesis hMAP with the highest posterior probability.

To specify the learning problem → specify what values are to be used for P(h) and
P(D|h).
Assumptions
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than
any other.
• Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H.
• Since we assume the target concept is contained in H, we should require that these
prior probabilities sum to 1.
Therefore,

P(h) = 1/|H|   for all h in H
• P(D|h) is the probability of observing the target values D = ⟨d1 … dm⟩ for the fixed set of
instances ⟨x1 … xm⟩, given a world in which hypothesis h holds. Since we assume
noise-free training data,
• Therefore,

P(D|h) = 1   if di = h(xi) for all di in D
P(D|h) = 0   otherwise

Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for the
BRUTE-FORCE MAP LEARNING algorithm.
From Bayes theorem,

P(h|D) = P(D|h) P(h) / P(D)

Case 1: h is inconsistent with the training data D. Then P(D|h) = 0, and therefore

P(h|D) = (0 · P(h)) / P(D) = 0

Case 2: h is consistent with D. Then P(D|h) = 1, and therefore

P(h|D) = (1 · 1/|H|) / P(D)
       = (1 · 1/|H|) / (|VS_{H,D}| / |H|)
       = 1 / |VS_{H,D}|

where VS_{H,D} is the subset of hypotheses from H that are consistent with D (the
version space), and where the value of P(D) is derived below.
We can derive P(D) from the theorem of total probability, noting that the hypotheses
are mutually exclusive (∀i ≠ j, P(hi ∧ hj) = 0):
P(D) = ∑_{hi∈H} P(D|hi) P(hi)
     = ∑_{hi∈VS_{H,D}} 1 · (1/|H|) + ∑_{hi∉VS_{H,D}} 0 · (1/|H|)
     = ∑_{hi∈VS_{H,D}} 1 · (1/|H|)
     = |VS_{H,D}| / |H|
Therefore,

P(h|D) = 1 / |VS_{H,D}|   if h is consistent with D
P(h|D) = 0                otherwise
Every consistent hypothesis is, therefore, a MAP hypothesis.
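To make this concrete, here is a small sketch of the brute-force MAP calculation over a hypothetical hypothesis space of threshold classifiers; the space, training data, and threshold range are invented purely for illustration:

```python
# Hypothetical hypothesis space: threshold classifiers h_t(x) = (x >= t)
# for integer thresholds t in 0..10.
H = [lambda x, t=t: int(x >= t) for t in range(11)]

# Noise-free training data (x, d); the hidden target here is x >= 4.
D = [(1, 0), (3, 0), (5, 1), (8, 1)]

def consistent(h):
    return all(h(x) == d for x, d in D)

# P(h) = 1/|H| for all h; P(D|h) = 1 if h is consistent with D, else 0.
# Hence P(h|D) = 1/|VS_{H,D}| for consistent h, and 0 otherwise.
VS = [h for h in H if consistent(h)]
for t, h in enumerate(H):
    posterior = (1 / len(VS)) if h in VS else 0.0
    print(f"t={t}: P(h|D) = {posterior:.3f}")  # nonzero only for t in {4, 5}
```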
Definition: a learning algorithm is a consistent learner provided it outputs a hypothesis
that commits zero errors over the training examples.
Therefore, every consistent learner outputs a MAP hypothesis, if we assume a uniform
prior probability distribution over H and deterministic, noise-free training data.
For example,
consider the concept learning algorithm FIND-S. We know that FIND-S searches the
hypothesis space H from specific to general hypotheses, outputting a maximally
specific consistent hypothesis. Because the hypothesis it outputs is consistent, FIND-S
outputs a MAP hypothesis under the probability distributions P(h) and P(D|h) defined
above.
Are there other probability distributions for P(h) and P(D|h) under which FIND-S
outputs MAP hypotheses? Yes: FIND-S outputs a MAP hypothesis relative to any prior
probability distribution that favors more specific hypotheses.
Consider Candidate Elimination algorithm with an assumption that the target concept c
is included in the hypothesis space H. Its output follows deductively from its inputs
plus this implicit inductive bias assumption.
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
Consider learning a continuous-valued target function from training examples whose
observed target values are corrupted by random noise:
di = f(xi) + ei
where
f(xi) → noise-free value of the target function
ei → random variable representing the noise
Assumption
• The values of the ei are drawn independently.
• They are distributed according to a Normal distribution with zero mean.
Task of the Learner
Output a maximum likelihood hypothesis, or, equivalently, a MAP hypothesis
assuming all hypotheses are equally probable a priori .
Example: Learning a linear function, though the analysis applies to learning arbitrary
real-valued functions.
Figure 6.2 illustrates the whole scenario. Here notice that the maximum likelihood
hypothesis is not necessarily identical to the correct hypothesis, f, because it is inferred
from only a limited sample of noisy training data.
p(x) = (1/√(2πσ²)) e^{−(1/2)((x−μ)/σ)²}
A Normal distribution is fully determined by two parameters in the above formula: μ and
σ. If the random variable X follows a normal distribution, then:
The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx
The expected, or mean value of X, E[X], is E[X] = μ
The variance of X, Var(X), is Var(X) = σ²
The standard deviation of X, σ_X, is σ_X = σ
The Central Limit Theorem states that the sum of a large number of independent,
identically distributed random variables follows a distribution that is approximately
Normal.
Prove: Maximum likelihood hypothesis hML minimizes the sum of the squared errors
between the observed training values di and the hypothesis predictions h(xi)
Proof: We derive the maximum likelihood hypothesis starting with our earlier
definition of hML in equation (3), but using lower case p to refer to the probability
density:

hML = argmax_{h∈H} p(D|h)
Assumptions
• Fixed set of training instances ⟨x1 … xm⟩
• D → the corresponding sequence of target values D = ⟨d1 … dm⟩, with di = f(xi) + ei
• The training examples are mutually independent given h, so p(D|h) is the product of
the individual p(di|h)
hML = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)

Since the noise ei obeys a Normal distribution with zero mean and variance σ², each di
obeys a Normal distribution with variance σ² centered at the true value f(xi). Assuming
h is the correct description of f, the mean is f(xi) = h(xi), so

hML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(1/(2σ²))(di − h(xi))²}

Rather than maximizing this complicated expression we maximize its (less complicated)
logarithm, which is justified because ln p is a monotonic function of p:

hML = argmax_{h∈H} ∑_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/(2σ²))(di − h(xi))² ]
The first term in this expression is a constant independent of h, and can therefore be
discarded, yielding,
hML = argmax_{h∈H} ∑_{i=1}^{m} −(1/(2σ²))(di − h(xi))²
Maximizing this negative quantity is equivalent to minimizing the
corresponding positive quantity
hML = argmin_{h∈H} ∑_{i=1}^{m} (1/(2σ²))(di − h(xi))²

Finally, we can discard the constant factor 1/(2σ²), which is independent of h:

hML = argmin_{h∈H} ∑_{i=1}^{m} (di − h(xi))²
Above equation shows that the maximum likelihood hypothesis hML is the one that
minimizes the sum of the squared errors between the observed training values d i and
the hypothesis predictions h(xi).
Limitations: The above analysis considers noise only in the target value of the training
example and does not consider noise in the attributes describing the instances
themselves.
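A short sketch of this result on synthetic data, assuming a hypothetical linear target f(x) = 2x + 1 observed with zero-mean Gaussian noise; under this noise model the least-squares fit returned by np.polyfit is a maximum likelihood hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# d_i = f(x_i) + e_i with f(x) = 2x + 1 and Normal noise (illustrative).
x = np.linspace(0, 10, 50)
d = 2 * x + 1 + rng.normal(0.0, 1.0, size=x.shape)

# Minimizing the sum of squared errors (ordinary least squares) yields
# the maximum likelihood hypothesis under Gaussian target noise.
slope, intercept = np.polyfit(x, d, deg=1)
sse = np.sum((d - (slope * x + intercept)) ** 2)
print(f"h_ML: d = {slope:.2f} x + {intercept:.2f}, SSE = {sse:.2f}")
```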
MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES
Consider learning a nondeterministic Boolean target function, e.g., predicting whether
a loan applicant will repay:
f(x) → target function = 1 if the applicant successfully repays their next loan,
                       = 0 if not.
Here, f can be expected to be probabilistic.
PROBABILISTIC f !!!
For example, among a collection of patients exhibiting the same set of observable
symptoms, we might find that 92% survive and 8% do not. In such cases the output of
the target function f(x) is a probabilistic function of its input.
LEARNING PROBLEM!!!
Learn a neural network (or other real-valued function approximator) whose
output is the probability that f(x)=1
i.e., to learn the target function f′ : X → [0,1] such that f′(x) = P(f(x) = 1).
From the example above,
f′(x) = 0.92
i.e., the probabilistic function f(x) will be equal to 1 in 92% of cases and equal to 0 in
the remaining 8%.
How can we learn f’ using, say, a neural network?
Solution
- first collect the observed frequencies of 1's and 0's for each possible value of x
- then train the neural network to output the target frequency for each x
Further to be Proven!!
It is possible to train a neural network directly from the observed training examples of
f, yet still derive a maximum likelihood hypothesis for f' .
What criterion should we optimize in order to find a maximum likelihood
hypothesis for f' in this setting?
To answer this, first obtain an expression for P(D|h).
Assumptions
- D training Data {⟨𝑥1, 𝑑1⟩ … ⟨𝑥𝑚, 𝑑𝑚⟩}
- di observed 0 or 1 for f(xi)
- Both xi and di are random variables
- Each training example is drawn independently
Therefore, we can write

P(D|h) = ∏_{i=1}^{m} P(xi, di|h)    ...(6.7)

Assuming, moreover, that each xi is independent of h,

P(D|h) = ∏_{i=1}^{m} P(di|h, xi) P(xi)    ...(6.8)
What is the probability P(di|h,xi) of observing di = 1 for a single instance xi, given a
world in which hypothesis h holds?
We know that h computes exactly this probability.
Therefore,
P(di = 1 |h, xi) = h(xi)
Hence,
P(di|h, xi) = h(xi)        if di = 1
P(di|h, xi) = 1 − h(xi)    if di = 0    ...(6.9)
In order to substitute for P(di|h, xi) in Equation (6.8), let us first re-express
Equation (6.9) in a more mathematically manipulable form:

P(di|h, xi) = h(xi)^di (1 − h(xi))^(1−di)    ...(6.10)
Substituting Equation (6.10) into Equation (6.8) gives

P(D|h) = ∏_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) P(xi)    ...(6.11)

The expression on the right side of Equation (6.11) can be seen as a generalization of
the Binomial distribution. It describes the probability that flipping each of m distinct
coins will produce the outcome (d1 … dm), assuming that each coin xi has probability
h(xi) of producing a heads. Note the Binomial distribution is similar, but makes the
additional assumption that the coins have identical probabilities of turning up heads
(i.e., that h(xi) = h(xj) for every i, j). In both cases we assume the outcomes of the
coin flips are mutually independent, an assumption that fits our current setting.
It is easier to work with the logarithm of the likelihood; dropping the terms P(xi),
which are independent of h, we get

hML = argmax_{h∈H} ∑_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]    ...(6.13)
Equation (6.13) describes the quantity that must be maximized in order to obtain the
maximum likelihood hypothesis in our current problem setting.
Note the similarity of the quantity in Equation (6.13) to the general form of the
entropy, −∑_i pi log pi; for this reason the negative of the quantity in Equation (6.13)
is commonly called the cross entropy.
Gradient Search to Maximize Likelihood in a Neural Net
Let G(h|D) = ∑_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ] denote the quantity to
be maximized, and let wjk denote the weight from input k to sigmoid unit j.
Differentiating,

∂G(h|D)/∂wjk = ∑_{i=1}^{m} [ (di − h(xi)) / (h(xi)(1 − h(xi))) ] ∂h(xi)/∂wjk    ---(1)

For a sigmoid unit,

∂h(xi)/∂wjk = σ′(h(xi)) xijk = h(xi)(1 − h(xi)) xijk    ---(2)

Where,
xijk → kth input to unit j for the ith training example
σ′ → derivative of the sigmoid squashing function
Substitute (2) in (1):

∂G(h|D)/∂wjk = ∑_{i=1}^{m} [ (di − h(xi)) / (h(xi)(1 − h(xi))) ] h(xi)(1 − h(xi)) xijk

∂G(h|D)/∂wjk = ∑_{i=1}^{m} (di − h(xi)) xijk
Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent
rather than gradient descent search.
On each iteration of the search the weight vector is adjusted in the direction of the
gradient, using the weight update rule
wjk ← wjk + Δwjk

Where,

Δwjk = η ∑_{i=1}^{m} (di − h(xi)) xijk
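A sketch of this gradient-ascent rule for a single sigmoid unit on synthetic data; the data-generating weights, learning rate, and iteration count are illustrative assumptions, and the update below averages the sum over examples for a stable step size:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: each row of X is one input (with a constant-1 bias input);
# d holds observed Boolean targets drawn from a probabilistic f (assumed).
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
d = (rng.random(m) < sigmoid(1.5 * X[:, 1] - 0.5)).astype(float)

w = np.zeros(2)   # weights w_jk of the single sigmoid unit
eta = 0.1         # learning rate (illustrative)
for _ in range(500):
    h = sigmoid(X @ w)
    # Gradient-ascent rule: Delta w = eta * sum_i (d_i - h(x_i)) x_ijk,
    # here divided by m (averaged) so eta need not depend on m.
    w += eta * (X.T @ (d - h)) / m
print("learned weights:", w)
```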
CONCLUSION
The rule that minimizes sum of squared error seeks the maximum likelihood hypothesis
under the assumption that the training data can be modeled by normally distributed noise
added to the target function value.
The rule that minimizes cross entropy seeks the maximum likelihood hypothesis under
the assumption that the observed Boolean value is a probabilistic function of the input
instance.
MINIMUM DESCRIPTION LENGTH PRINCIPLE
Recall that the MAP hypothesis can equivalently be written as

hMAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]

or, alternatively, as minimizing the negative of this quantity:

hMAP = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]

This can be interpreted in terms of information theory: −log2 P(h) is the description
length of h under the optimal encoding of the hypothesis space, and −log2 P(D|h) is
the description length of the training data given h under its optimal encoding.
CONCLUSION:
The Minimum Description Length (MDL) principle recommends choosing the
hypothesis that minimizes the sum of these two description lengths.
Using codes C1 and C2 to represent the hypothesis and the data given the hypothesis,
respectively, we can state the MDL principle as:
Minimum Description Length principle: Choose hMDL where

hMDL = argmin_{h∈H} [ L_{C1}(h) + L_{C2}(D|h) ]
BAYES OPTIMAL CLASSIFIER
Consider a new instance whose possible classifications are the values vj from some set
V. Given the training data D, the probability P(vj|D) that the correct classification for
the new instance is vj is
P(vj|D) = ∑_{hi∈H} P(vj|hi) P(hi|D)
The optimal classification of the new instance is the value vj for which P(vj|D) is
maximum.
Bayes Optimal Classification:

argmax_{vj∈V} ∑_{hi∈H} P(vj|hi) P(hi|D)    ---(6.18)
Example,
The set of possible classifications of the new instance is V = {⊕, ⊖}, and suppose there
are three hypotheses with

P(h1|D) = 0.4,  P(⊖|h1) = 0,  P(⊕|h1) = 1
P(h2|D) = 0.3,  P(⊖|h2) = 1,  P(⊕|h2) = 0
P(h3|D) = 0.3,  P(⊖|h3) = 1,  P(⊕|h3) = 0

Therefore,

∑_{hi∈H} P(⊕|hi) P(hi|D) = 0.4

and

∑_{hi∈H} P(⊖|hi) P(hi|D) = 0.6

so

argmax_{vj∈{⊕,⊖}} ∑_{hi∈H} P(vj|hi) P(hi|D) = ⊖
Any system that classifies new instances according to Equation (6.18) is called a Bayes
optimal classifier, or Bayes optimal learner. Therefore, Bayes Optimal Classifier
maximizes the probability that the new instance is classified correctly, given the available
data, hypothesis space, and prior probabilities over the hypotheses.
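The three-hypothesis example above can be checked directly with a few lines of Python implementing Eq. (6.18):

```python
# Posteriors P(h_i|D) and per-hypothesis predictions from the example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_pos = {"h1": 1.0, "h2": 0.0, "h3": 0.0}   # P(+|h_i)
p_neg = {"h1": 0.0, "h2": 1.0, "h3": 1.0}   # P(-|h_i)

# Eq. (6.18): score each classification by sum_i P(v_j|h_i) P(h_i|D).
score = {
    "+": sum(p_pos[h] * posteriors[h] for h in posteriors),  # 0.4
    "-": sum(p_neg[h] * posteriors[h] for h in posteriors),  # 0.6
}
print(max(score, key=score.get))  # '-' : the Bayes optimal classification
```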
GIBBS ALGORITHM
Bayes optimal classifier obtains the best performance that can be achieved from the
given training data.
Disadvantage!!
It can be costly to apply.
WHY??
• It computes the posterior probability for every hypothesis in H
• then combines the predictions of each hypothesis to classify each new instance.
ALTERNATIVE!!
GIBBS ALGORITHM
Definition:
Gibbs Algorithm
1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
Haussler et al. (1994) show that, under certain conditions, the expected
misclassification error of the Gibbs algorithm is at most twice the expected error of the
Bayes optimal classifier.
Implication for the concept learning problem
Consider,
• a uniform prior probability over H, and
• target concepts drawn with uniform distribution.
Then classification using a hypothesis drawn at random from the current version space
VS_{H,D} has expected error at most twice that of the Bayes optimal classifier.
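A minimal sketch of the Gibbs classifier; the two hypotheses and their posterior probabilities below are placeholders standing in for a real version space:

```python
import random

# Placeholder hypotheses and posterior P(h|D); here uniform over a
# two-element version space VS_{H,D} (illustrative assumption).
hypotheses = [lambda x: x >= 4, lambda x: x >= 5]
posteriors = [0.5, 0.5]

def gibbs_classify(x):
    # 1. Sample one hypothesis according to the posterior distribution.
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    # 2. Use it to predict the classification of the next instance x.
    return int(h(x))

print(gibbs_classify(4))  # prediction varies with the sampled hypothesis
```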
NAÏVE BAYES CLASSIFIER
Highly practical learning method → the Naïve Bayes Classifier (NBC)
Where can NBC be applied??
Learning tasks where
• Each instance x is described by a conjunction of attribute values
• The target function f(x) can take on any value from some finite set V.
Given
A set of training examples of target function
A new instance described by the tuple of attribute values <a1, a2, …..,an>
Task!!
To predict the target value, or classification for the new instance
Bayesian approach!!!
Assign the most probable target value, VMAP, given the attribute values <a1,a2 . . .an>
that describe the instance.
vMAP = argmax_{vj∈V} P(vj | a1, a2, …, an)
Rewrite the above expression using Bayes theorem:

vMAP = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)

vMAP = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj)    ---(6.19)
The naive Bayes classifier is based on the simplifying assumption that the attribute
values are conditionally independent given the target value, so that
P(a1, a2, …, an | vj) = ∏_i P(ai|vj). Substituting this into (6.19) gives the naive Bayes
classifier:

vNB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj)    ---(6.20)
An Illustrative Example
Apply the naive Bayes classifier to the PlayTennis concept learning problem used to
illustrate decision tree learning.
Example Outlook Temperature Humidity Wind PlayTennis
D1 sunny hot high weak No
D2 sunny hot high strong No
D3 overcast hot high weak Yes
D4 rain mild high weak Yes
D5 rain cool normal weak Yes
D6 rain cool normal strong No
D7 overcast cool normal strong Yes
D8 sunny mild high weak No
D9 sunny cool normal weak Yes
D10 rain mild normal weak Yes
D11 sunny mild normal strong Yes
D12 overcast mild high strong Yes
D13 overcast hot normal weak Yes
D14 rain mild high strong No
Use naive Bayes classifier and the training data from the table to classify the
following novel instance
<Outlook =sunny, Temperature =cool, Humidity = high, Wind = strong>
Task!!
Predict the target value (yes or no) of the target concept PlayTennis for this new
instance.
We know that ,
vNB = argmax_{vj∈V} P(vj) ∏_i P(ai|vj)    ---(6.20)
Here,

vNB = argmax_{vj∈{yes,no}} P(vj) P(Outlook = sunny|vj) P(Temperature = cool|vj) P(Humidity = high|vj) P(Wind = strong|vj)    ---(6.21)
Estimating vNB requires 10 probabilities: the two class priors and the eight conditional
probabilities appearing in (6.21).
The probabilities of the different target values can easily be estimated based
on their frequencies over the 14 training examples.
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
Now estimate the conditional probabilities
𝑃(𝑂𝑢𝑡𝑙𝑜𝑜𝑘 =sunny|PlayTennis=yes) = 2/9=0.22
𝑃(𝑂𝑢𝑡𝑙𝑜𝑜𝑘 =sunny|PlayTennis=no) = 3/5=0.6
𝑃(Temperature =cool|PlayTennis=yes) = 3/9=0.33
𝑃(Temperature =cool|PlayTennis=no)=1/5=0.2
𝑃(Humidity = high|PlayTennis=yes) = 3/9=0.33
𝑃(Humidity = high|PlayTennis=no)=4/5=0.8
𝑃(Wind = strong|PlayTennis=yes) = 3/9=0.33
𝑃(Wind = strong|PlayTennis=no)=3/5=0.6
Substituting in (6.21) we get,

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.64 × 0.22 × 0.33 × 0.33 × 0.33 = 0.0051

P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.36 × 0.6 × 0.2 × 0.8 × 0.6 = 0.0207

Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new
instance. Normalizing these quantities, the probability of no is
0.0207 / (0.0207 + 0.0051) = 0.80.
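A sketch of the naive Bayes classifier of Eq. (6.20), using raw frequency estimates on the PlayTennis table above (the helper name v_nb is illustrative):

```python
from collections import Counter

# The 14 PlayTennis examples from the table above:
# (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("sunny","hot","high","weak","no"), ("sunny","hot","high","strong","no"),
    ("overcast","hot","high","weak","yes"), ("rain","mild","high","weak","yes"),
    ("rain","cool","normal","weak","yes"), ("rain","cool","normal","strong","no"),
    ("overcast","cool","normal","strong","yes"), ("sunny","mild","high","weak","no"),
    ("sunny","cool","normal","weak","yes"), ("rain","mild","normal","weak","yes"),
    ("sunny","mild","normal","strong","yes"), ("overcast","mild","high","strong","yes"),
    ("overcast","hot","normal","weak","yes"), ("rain","mild","high","strong","no"),
]

def v_nb(instance):
    labels = Counter(row[-1] for row in data)
    best, best_score = None, -1.0
    for v, n_v in labels.items():
        score = n_v / len(data)              # P(v_j)
        for i, a in enumerate(instance):     # multiply in each P(a_i|v_j)
            n_c = sum(1 for row in data if row[i] == a and row[-1] == v)
            score *= n_c / n_v
        if score > best_score:
            best, best_score = v, score
    return best, best_score

print(v_nb(("sunny", "cool", "high", "strong")))  # ('no', ~0.0206)
```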
ESTIMATING PROBABILITIES
Up to this point we have estimated probabilities by the fraction nc/n of times an event
is observed to occur. For example, P(Wind = strong | PlayTennis = no) above was
estimated as nc/n, where
n = 5 (total number of training examples with target value no)
nc = 3 (number of those n examples where Wind = strong)
Problem !!
This provides a poor estimate when nc is very small.
Example,
Suppose the true value of P(Wind = strong | PlayTennis = no) is, say, 0.08, and we have
only 5 samples where PlayTennis = no. Then the most probable value for nc is 0.
Two difficulties
- nc/n produces a biased underestimate of the probability.
- When nc/n is zero, it will dominate the Bayes classifier if the future query contains
Wind = strong
WHY??
Because the quantity calculated in Equation (6.20) requires multiplying all the other
probability terms by this zero value.
How to avoid this difficulty??
Adopt a Bayesian approach to estimating the probability, using the m-
estimate:
m-estimate of Probability:

(nc + m·p) / (n + m)
Where,
p → prior estimate of the probability we wish to determine (in the absence of other
information, a typical choice is the uniform prior p = 1/k for an attribute with k
possible values)
m → a constant called the equivalent sample size, which determines how heavily to
weight p relative to the observed data
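A one-function sketch of the m-estimate; the choices p = 1/2 (uniform over the two Wind values) and equivalent sample size m = 2 below are illustrative assumptions:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Wind=strong among PlayTennis=no: n_c = 3, n = 5, with prior p = 1/2
# and equivalent sample size m = 2 (assumed values).
print(m_estimate(3, 5, 0.5, 2))  # 0.571... instead of the raw 3/5 = 0.6
```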
BAYESIAN BELIEF NETWORKS
Figure: A Bayesian belief network. The network on the left represents a set of conditional
independence assumptions. In particular, each node is asserted to be conditionally
independent of its non-descendants, given its immediate parents. Associated with each node is
a conditional probability table that specifies the conditional distribution for the variable given
its immediate parents in the graph. The conditional probability table for the Campfire node is
shown at the right, where Campfire is abbreviated to C, Storm to S, and BusTourGroup to B.
The joint probability for any desired assignment of values ⟨y1, …, yn⟩ to the tuple of
network variables ⟨Y1, …, Yn⟩ can be computed from the conditional probability tables
as

P(y1, …, yn) = ∏_{i=1}^{n} P(yi | Parents(Yi))

Gradient ascent training of belief networks. Let wijk denote one entry in the
conditional probability table for variable Yi, i.e., wijk = P(Yi = yij | Ui = uik), where
Ui = Parents(Yi) and uik is one assignment of values to these parents. The gradient of
ln P(D|h) is given by the derivatives ∂ln P(D|h)/∂wijk for each wijk:

∂ln P(D|h)/∂wijk = ∑_{d∈D} Ph(yij, uik | d) / wijk    ---(4)
Example: to calculate the derivative of ln P(D|h) with respect to the upper-rightmost
entry in the table, we must calculate the quantity
P(Campfire = True, Storm = False, BusTourGroup = False | d) for each training
example d in D.
Derivation of Eq. (4)
Let Ph(D) denote P(D|h). We want to derive ∂ln Ph(D)/∂wijk.
Proof:
∂ln Ph(D)/∂wijk = ∂/∂wijk ln ∏_{d∈D} Ph(d)
               = ∑_{d∈D} ∂ln Ph(d)/∂wijk
Since ∂ln f(x)/∂x = (1/f(x)) ∂f(x)/∂x,

= ∑_{d∈D} (1/Ph(d)) ∂Ph(d)/∂wijk
Introduce the values of the variables Yi and Ui = Parents(Yi), by summing over their
possible values yij′ and uik′:

∂ln Ph(D)/∂wijk = ∑_{d∈D} (1/Ph(d)) ∂/∂wijk ∑_{j′,k′} Ph(d | yij′, uik′) Ph(yij′, uik′)
The derivative is nonzero only for the term with j′ = j and k′ = k, because
wijk = Ph(yij | uik) appears only in that term.
Therefore,
∂ln Ph(D)/∂wijk = ∑_{d∈D} (1/Ph(d)) ∂/∂wijk [ Ph(d | yij, uik) Ph(yij | uik) Ph(uik) ]

                = ∑_{d∈D} (1/Ph(d)) ∂/∂wijk [ Ph(d | yij, uik) wijk Ph(uik) ]
∂ln Ph(D)/∂wijk = ∑_{d∈D} (1/Ph(d)) Ph(d | yij, uik) Ph(uik)

Applying Bayes theorem to rewrite Ph(d | yij, uik):

= ∑_{d∈D} Ph(yij, uik | d) Ph(d) Ph(uik) / ( Ph(d) Ph(yij, uik) )

= ∑_{d∈D} Ph(yij, uik | d) Ph(uik) / Ph(yij, uik)

= ∑_{d∈D} Ph(yij, uik | d) / Ph(yij | uik)

∂ln Ph(D)/∂wijk = ∑_{d∈D} Ph(yij, uik | d) / wijk    ---(4)
Requirement,
- as the weights wijk are updated they must remain valid probabilities in the
interval [0,1]
- the sum ∑_j wijk remains 1 for all i, k.
To satisfy the requirements!!
Update weights in a two-step process.
i. Update each wijk by gradient ascent:

wijk ← wijk + η ∑_{d∈D} Ph(yij, uik | d) / wijk
ii. Renormalize the weights wijk to assure that the above constraints are satisfied.
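A schematic sketch of this two-step update. Computing ∑_d Ph(yij, uik | d) requires probabilistic inference in the network for each training example; random placeholder values stand in for that step here, so only the update-and-renormalize mechanics are shown:

```python
import numpy as np

rng = np.random.default_rng(2)

# CPT for one node Y_i: w[j, k] holds w_ijk = P(Y_i = y_ij | U_i = u_ik),
# with 2 values of Y_i and 4 parent configurations (illustrative sizes).
w = np.full((2, 4), 0.5)
eta = 0.01

for _ in range(100):
    # Placeholder for sum over d of P_h(y_ij, u_ik | d); in a real system
    # this comes from inference in the network for each training example.
    joint = rng.random((2, 4))
    w += eta * joint / w                   # gradient-ascent step, Eq. (4)
    w = np.clip(w, 1e-9, None)             # keep entries valid (> 0)
    w /= w.sum(axis=0, keepdims=True)      # renormalize: sum_j w_ijk = 1

print(w)
```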
THE EM ALGORITHM
Consider data D generated by a mixture of k distinct Normal distributions. Here (as in
the two-means illustration),
k = 2
instances → points along the x-axis
Each instance is generated using a two-step process.
i. One of the k Normal distributions is selected at random.
ii. A single random instance xi is generated according to this selected distribution
This process is repeated to generate a set of data points.
Consider the special case in which the selection of the single Normal distribution at
each step is made with uniform probability, and all k distributions share the same
known variance σ².
Learning task → output a hypothesis h = ⟨μ1, …, μk⟩ that describes the means of each
of the k distributions.
Goal → find a maximum likelihood hypothesis for these means; that is, a hypothesis h
that maximizes p(D|h).
It is easy to calculate the maximum likelihood hypothesis for the mean of a
single Normal distribution given the observed data instances x1,x2,.. .,xm drawn from
this single distribution.
We know that,
μML = argmin_μ ∑_{i=1}^{m} (xi − μ)²    ---(6)
In the mixture setting, however, the full description of each instance is the triple
⟨xi, zi1, zi2⟩, where
xi → observed value of the ith instance
zi1 and zi2 → indicate which of the two Normal distributions was used to generate the
value xi
zij = 1 if xi was generated by the jth Normal distribution, and 0 otherwise
Here,
xi → observed variable
zi1 and zi2 → hidden variables
If zi1 and zi2 were observed then Eq. (6) can be applied to solve for means µ1
and µ2 . Here we use EM Algorithm to find µ1 and µ2 .
EM Algorithm
Problem → estimating the means of a mixture of k Normal distributions (the k-means
problem).
Task → search for a maximum likelihood hypothesis by repeatedly re-estimating the
expected values of the hidden variables zij given the current hypothesis ⟨μ1 … μk⟩,
then recalculating the maximum likelihood hypothesis using these expected values for
the hidden variables.
Procedure:
First initialize the hypothesis to h = < µ1 , µ2 >
Iteratively re-estimate h by repeating the following two steps until the procedure
converges to a stationary value for h.
Step 1: Calculate the expected value E[zij] of each hidden variable zij assuming the
current hypothesis h = < µ1 , µ2 > holds.
Step 2: Calculate a new maximum likelihood hypothesis h’ = < µ1’ , µ2’ > assuming
the value taken on by each hidden variable zij is its expected value E[zij] calculated in
Step 1.
Replace h = < µ1 , µ2 > by h’ = < µ1’ , µ2’ > and iterate
Step 1 must calculate the expected value of zij.
E[zij] → the probability that instance xi was generated by the jth Normal distribution.
E[zij] = p(x = xi | μ = μj) / ∑_{n=1}^{2} p(x = xi | μ = μn)

       = e^{−(1/(2σ²))(xi − μj)²} / ∑_{n=1}^{2} e^{−(1/(2σ²))(xi − μn)²}
First step is implemented by substituting the current values < µ1 , µ2 > and the
observed xi into the above expression.
Second step use E[zij] calculated in Step 1 to derive a new maximum likelihood
hypothesis h’ = < µ1’ , µ2’ >.
It is

μj ← ∑_{i=1}^{m} E[zij] xi / ∑_{i=1}^{m} E[zij]    ---(8)
The above expression is similar to

μML = (1/m) ∑_{i=1}^{m} xi    ---(7)
(7) → used to estimate μ for a single Normal distribution.
(8) → the weighted sample mean for μj, with each instance weighted by the
expectation E[zij] that it was generated by the jth Normal distribution.
Conclusion:
The current hypothesis is used to estimate the unobserved variables, and the expected
values of these variables are then used to calculate an improved hypothesis.
Further proof:
On each iteration through this loop, the EM algorithm increases the likelihood P(D|h)
unless it is at a local maximum. The algorithm thus converges to a local maximum
likelihood hypothesis for < µ1 , µ2 > .
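A compact sketch of this EM procedure for k = 2 Normals with known, equal variance; the true means, sample size, and initialization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Generate data from a hypothetical mixture of two Normals with equal,
# known variance sigma^2 and unknown means (the setting described above).
sigma = 1.0
true_means = [0.0, 4.0]
z = rng.integers(0, 2, size=300)                # hidden z_ij (uniform choice)
x = rng.normal(np.take(true_means, z), sigma)   # observed x_i

mu = np.array([-1.0, 1.0])                      # initial hypothesis <mu1, mu2>
for _ in range(50):
    # Step 1 (E): expected values E[z_ij] under the current hypothesis.
    w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
    e_z = w / w.sum(axis=1, keepdims=True)
    # Step 2 (M): weighted sample means, as in Eq. (8) / Eq. (11).
    mu = (e_z * x[:, None]).sum(axis=0) / e_z.sum(axis=0)

print("estimated means:", mu)   # converges near 0.0 and 4.0
```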
General Statement of EM Algorithm
What have we learnt in the previous session!!
The EM algorithm for the problem of estimating the means of a mixture of Normal
distributions.
More generally, the EM algorithm can be applied to estimate a set of parameters θ that
describe an underlying probability distribution, given only the observed portion of the
full data produced by this distribution.
For example, in the two-means problem the parameters are θ = ⟨μ1, μ2⟩ and the full
data are the triples ⟨xi, zi1, zi2⟩, of which only the xi are observed. The probability of
a single full instance yi = ⟨xi, zi1, zi2⟩, given a hypothesis h′ = ⟨μ1′, μ2′⟩, is

p(yi|h′) = (1/√(2πσ²)) e^{−(1/(2σ²)) ∑_{j=1}^{2} zij (xi − μj′)²}

Here, only one of the zij can have the value 1 and all others must be 0.
Given this probability for a single instance p(yi|h′), the logarithm of the probability
ln P(Y|h′) for all m instances in the data is

ln P(Y|h′) = ln ∏_{i=1}^{m} p(yi|h′)
           = ∑_{i=1}^{m} ln p(yi|h′)
           = ∑_{i=1}^{m} [ ln(1/√(2πσ²)) − (1/(2σ²)) ∑_{j=1}^{2} zij (xi − μj′)² ]

Because ln P(Y|h′) is a linear function of the zij, its expected value (over the
distribution of the hidden variables implied by the current hypothesis h) is obtained
simply by replacing each zij with E[zij]; this expected log likelihood defines the
function Q(h′|h) = E[ln P(Y|h′)].
Thus,
• The first (estimation) step of the EM algorithm defines the Q function based on
the estimated E[zij] terms.
• The second (maximization) step then finds the values 𝜇′1, … , 𝜇′𝑘 that maximize
this Q function.
In the current case
𝑎𝑟𝑔𝑚𝑎𝑥 𝑚
𝑎𝑟𝑔𝑚𝑎𝑥 ′ 1 −
1 ∑𝑘 𝐸[𝑧𝑖𝑗](𝑥 −𝜇𝘍)2
2 𝑗=1
ℎ′𝑄(ℎ |ℎ) = ∑ (𝑙𝑛 𝑒 2𝜋𝜎 𝑖 𝑗
)
√2𝜋𝜎2
ℎ′𝑖=1
𝑎𝑟𝑔𝑚𝑖𝑛 𝑚 𝑘 𝑘
---(10)
Therefore,
The maximum likelihood hypothesis here minimizes a weighted sum of squared
errors, where the contribution of each instance xi to the error that defines 𝜇𝑗′ is
weighted by E[zij] .
The quantity given by Equation (10) is minimized by setting each 𝜇𝑗′ to the
weighted sample mean
μj ← ∑_{i=1}^{m} E[zij] xi / ∑_{i=1}^{m} E[zij]    ---(11)
Eq. (10) and Eq. (11) correspond to the two steps in the k-means algorithm.