
Maharaja Institute of Technology, Mysore Department of CSE

MODULE – 4

BAYESIAN LEARNING

INTRODUCTION

Bayesian reasoning provides a probabilistic approach to inference. It assumes that the quantities of interest are governed by probability distributions and that optimal decisions can be made by reasoning about these probabilities together with observed data. It is important to machine learning because it provides a quantitative approach to weighing the evidence supporting alternative hypotheses.

Bayesian learning methods are relevant to our study of machine learning for two different
reasons.

• First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
• The second reason that Bayesian methods are important to our study of machine learning is that they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.

Features of Bayesian learning methods include:

• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a prior probability for each candidate hypothesis, and (2) a probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions (e.g., hypotheses such as "this pneumonia patient has a 93% chance of complete recovery").
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.


• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Practical Difficulty

1. Bayesian methods typically require initial knowledge of many probabilities. When these probabilities are not known in advance they are often estimated based on background knowledge, previously available data, and assumptions about the form of the underlying distributions.
2. A second difficulty is the significant computational cost required to determine the Bayes optimal hypothesis in the general case (linear in the number of candidate hypotheses). In certain specialized situations, this computational cost can be significantly reduced.

BAYESIAN THEOREM

In machine learning we are often interested in determining the best hypothesis from some
space H, given the observed training data D. Bayes theorem provides a way to calculate
the probability of a hypothesis based on its prior probability, the probabilities of observing
various data given the hypothesis, and the observed data itself.

To define Bayes theorem precisely, let us first introduce a little notation.

 P(h)  initial probability that hypothesis h holds, before we have observed the training
data. P(h) is often called the prior-probability of h and may reflect any background
knowledge we have about the chance that h is a correct hypothesis.

 P(D)  prior probability that training data D will be observed

 P(D|h)  probability of observing data D given some world in which hypothesis h


holds.

In general, we write P(x|y) to denote the probability of x given y. In machine learning problems we are interested in the probability P(h|D) that h holds given the observed training data D. P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D. Notice the posterior probability P(h|D) reflects the influence of the training data D, in contrast to the prior probability P(h), which is independent of D.


“Bayes theorem provides a way to calculate the posterior probability P(h|D), from the
prior probability P(h), together with P(D) and P(D|h).”
Bayes Theorem:   P(h|D) = P(D|h)P(h) / P(D)   ---(1)

Here, P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem. It is also
reasonable to see that P(h|D) decreases as P(D) increases, because the more probable it
is that D will be observed independent of h, the less evidence D provides in support of h.

Maximum-a-Posteriori (MAP) Hypothesis

In many learning scenarios, the learner considers some set of candidate hypotheses H and
is interested in finding the most probable hypothesis h ∈ H given the observed data D (or
at least one of the maximally probable if there are several). Any such maximally probable
hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the
MAP hypotheses by using Bayes theorem to calculate the posterior probability of each
candidate hypothesis. More precisely, we will say that hMAP is a MAP hypothesis
provided,

hMAP ≡ argmax_{h∈H} P(h|D)
     = argmax_{h∈H} P(D|h)P(h) / P(D)
     ≡ argmax_{h∈H} P(D|h)P(h)   ---(2)

We dropped the term P(D) because it is a constant independent of h.

Maximum Likelihood (ML) Hypothesis

In some cases, we will assume that every hypothesis in H is equally probable a priori (P(hi) = P(hj) for all hi and hj in H). In this case we can further simplify the above equation and need
only consider the term P(D|h) to find the most probable hypothesis. P(D|h) is often called
the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called
a maximum likelihood (ML) hypothesis, hML.

hML ≡ argmax_{h∈H} P(D|h)   ---(3)

In order to make clear the connection to machine learning problems, we introduced Bayes theorem above by referring to the data D as training examples of some target function and referring to H as the space of candidate target functions.


Summary of Basic Probability Formulae

• Product rule: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
• Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Bayes theorem: P(h|D) = P(D|h)P(h) / P(D)
• Theorem of total probability: if events A1, …, An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai)

An Example

To illustrate Bayes rule, consider a medical diagnosis problem in which there are two
alternative hypotheses:

(1) that the patient has a particular form of cancer, and

(2) that the patient does not.

The available data is from a particular laboratory test with two possible outcomes:

⊕ (positive) and

⊖ (negative).

We have prior knowledge that over the entire population of people only 0.008 (0.8%) have this disease.

Furthermore, the lab test is only an imperfect indicator of the disease.

The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the disease
is not present.

In other cases, the test returns the opposite result.

Suppose we now observe a new patient for whom the lab test returns a positive result.
Should we diagnose the patient as having cancer or not?


Solution

The given situation can be summarized by the following probabilities:

P(cancer) = 0.008 P(¬cancer) = 0.992

P(+|cancer) = 0.98 P(-|cancer) = 0.02

P(-|¬cancer) = 0.97 P(+|¬cancer) = 0.03

The maximum a posteriori hypothesis can be found using Equation (2):

hMAP ≡ argmax_{h∈H} P(D|h)P(h)

D  +  true positive and false positive

hcancer= P(+|cancer)P(cancer) = 0.98 ×0.008 = 0.0078

hcancer= P(+|¬cancer)P(¬cancer) = 0.03 × 0.992 = 0.0298

OUTPUT  hMAP = ¬cancer because false positive has the highest values.

Note: The exact posterior probabilities can also be determined by normalizing the above
quantities so that they sum to 1.

P(cancer|⊕) = 0.0078 / (0.0078 + 0.0298) = 0.21

P(¬cancer|⊕) = 0.0298 / (0.0078 + 0.0298) = 0.79


This step is warranted because Bayes theorem states that the posterior probabilities
are just the above quantities divided by the probability of the data, P(+). Although P(+)
was not provided directly as part of the problem statement, we can calculate it in this
fashion because we know that P(cancer|+) and P(¬cancer|+) must sum to 1.

Notice that while the posterior probability of cancer is significantly higher than its
prior probability, the most probable hypothesis is still that the patient does not have
cancer.

As this example illustrates, the result of Bayesian inference depends strongly on the
prior probabilities, which must be available in order to apply the method directly. Note
also that in this example the hypotheses are not completely accepted or rejected, but
rather become more or less probable as more data is observed.
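
The arithmetic above is easy to check in code. The following minimal Python sketch (the variable names are illustrative, not from the text) reproduces both the MAP decision and the normalized posteriors:

```python
# Priors and likelihoods from the cancer example
priors = {"cancer": 0.008, "not_cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "not_cancer": 0.03}  # P(+|h)

# Unnormalized posteriors P(+|h)P(h)
scores = {h: likelihood_pos[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)            # the MAP hypothesis
total = sum(scores.values())                   # P(+), by total probability
posteriors = {h: s / total for h, s in scores.items()}

print(h_map)        # not_cancer
print(posteriors)   # {'cancer': ~0.21, 'not_cancer': ~0.79}
```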

BAYESIAN THEOREM AND CONCEPT LEARNING


What is the relationship between Bayes theorem and the problem of concept learning?

Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data. It acts as a basis for a straightforward learning algorithm that calculates the probability of each possible hypothesis, then outputs the most probable one.

WHAT IS UNDER DISCUSSION?

Consider a brute-force Bayesian concept learning algorithm, then compare it to concept learning algorithms.

WHAT IS NOTICED?

Under certain conditions, several algorithms output the same hypotheses as the brute-force Bayesian algorithm.

Brute-Force Bayes Concept Learning

Consider a finite hypothesis space H defined over the instance space X

TASK learn a target concept c : X  {0,1}

Given Sequence of training examples, ⟨⟨𝑥1, 𝑑1⟩ … ⟨𝑥𝑚, 𝑑𝑚⟩⟩

Where, xi –instance from X

di  target value of xi di = c(xi)


Assumption: the sequence of instances ⟨x1 … xm⟩ is held fixed, so D can be written as the sequence of target values ⟨d1 … dm⟩.

We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm

1. For each hypothesis h in H, calculate the posterior probability

   P(h|D) = P(D|h)P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

   hMAP ≡ argmax_{h∈H} P(h|D)

The algorithm requires significant computation → Bayes theorem must be applied to each hypothesis in H to obtain P(h|D).

• To specify the learning problem → specify what values are to be used for P(h) and P(D|h).

• P(D) is then determined from P(h) and P(D|h).

P(h) and P(D|h) should be consistent with the following assumptions

1. The training data D is noise free (i.e., di =c(xi)).

2. The target concept c is contained in the hypothesis space H

3. We have no a priori reason to believe that any hypothesis is more probable than
any other.

Given the assumptions, what values should we specify for P(h)?

• Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H .

• Since we assume the target concept is contained in H we should require that these
prior
probabilities sum to 1.

Therefore,

P(h) = 1/|H| for all h in H


What choice shall we make for P(D|h)?

• P(D|h) is the probability of observing the target values D = ⟨𝑑1 … 𝑑𝑚⟩ for fixed set of
instances ⟨𝑥1 … 𝑥𝑚⟩ given a world in which hypothesis h holds.

• Assumption  noise-free training data

• The probability of observing classification di given h is just 1 if di = h(xi), and 0 if di ≠ h(xi).

• Therefore,

  P(D|h) = 1 if di = h(xi) for all di in D, and 0 otherwise   ---(1)

That is, the probability of data D given hypothesis h is 1 if D is consistent with h, and 0 otherwise.

Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for
BRUTE-FORCE MAP LEARNING algorithm.
From Bayes theorem,

P(h|D) = P(D|h)P(h) / P(D)

Case 1: h is inconsistent with the training data D

From (1)

P(D|h) is 0 when h is inconsistent with D.

Therefore,

P(h|D) = 0 · P(h) / P(D) = 0, if h is inconsistent with D

That is, the posterior probability of a hypothesis inconsistent with D is zero.

Case 2: h is consistent with the training data D

From (1)

P(D|h) is 1 when h is consistent with D.

Therefore,
P(h|D) = (1 · 1/|H|) / P(D)
       = (1 · 1/|H|) / (|VS_{H,D}| / |H|)
       = 1 / |VS_{H,D}|,   if h is consistent with D

Where VSH,D is the subset of hypotheses from H that are consistent with D .

How do we get P(D) = |VS_{H,D}| / |H| ?

• The sum over all hypotheses of P(h|D) must be one

• The number of hypotheses from H consistent with D is by definition |VSH,D |

We can derive P(D) from the theorem of total probability, noting that the hypotheses are mutually exclusive, i.e., (∀i ≠ j)(P(hi ∧ hj) = 0):

P(D) = Σ_{hi∈H} P(D|hi)P(hi)
     = Σ_{hi∈VS_{H,D}} 1 · (1/|H|) + Σ_{hi∉VS_{H,D}} 0 · (1/|H|)
     = Σ_{hi∈VS_{H,D}} 1 · (1/|H|)

P(D) = |VS_{H,D}| / |H|

Therefore,

P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise
Every consistent hypothesis is, therefore, a MAP hypothesis.
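
A minimal Python sketch of the BRUTE-FORCE MAP LEARNING algorithm under these assumptions is given below. Representing hypotheses as Python functions over single-bit instances is an illustrative choice, not part of the text:

```python
def brute_force_map_posteriors(hypotheses, data):
    """Posterior P(h|D) for each h, under a uniform prior and noise-free
    data: P(D|h) is 1 iff h is consistent with every training example."""
    prior = 1.0 / len(hypotheses)
    unnormalized = [prior if all(h(x) == d for x, d in data) else 0.0
                    for h in hypotheses]
    p_data = sum(unnormalized)   # equals |VS_{H,D}|/|H|; nonzero since c is in H
    return [u / p_data for u in unnormalized]

# Toy hypothesis space over single-bit instances
H = [lambda x: 0, lambda x: 1, lambda x: x, lambda x: 1 - x]
D = [(0, 0), (1, 1)]             # consistent only with h(x) = x
print(brute_force_map_posteriors(H, D))   # [0.0, 0.0, 1.0, 0.0]
```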

The evolution of probabilities associated with hypotheses is depicted schematically in Figure 1 below. Initially (Figure 1a) all hypotheses have the same probability. As training data accumulates (Figures 1b and 1c), the posterior probability of inconsistent hypotheses becomes zero, while the total probability, which sums to one, is shared equally among the remaining consistent hypotheses.


Figure 1: Evolution of posterior probabilities P(h|D) with increasing training data. (a) Uniform priors assign equal probability to each hypothesis. As training data increases, first to D1 (b), then to D1 ∧ D2 (c), the posterior probability of inconsistent hypotheses becomes zero, while posterior probabilities increase for hypotheses remaining in the version space.

MAP Hypothesis and Consistent Learners

From the previous analysis

“every hypothesis consistent with D is a MAP hypothesis”

Definition

This implies a definition for consistent learners.

“A learning algorithm is a consistent learner provided it outputs a hypothesis that commits zero errors over the training examples.”

Therefore,

“Every consistent learner outputs a MAP hypothesis, if we assume a uniform prior probability distribution over H (i.e., P(hi) = P(hj) for all i, j), and if we assume deterministic, noise-free training data (i.e., P(D|h) = 1 if D and h are consistent, and 0 otherwise).”

For example, consider the concept learning algorithm FIND-S. FIND-S searches the hypothesis space H from specific to general hypotheses, outputting a maximally specific consistent hypothesis → therefore it will output a MAP hypothesis under the probability distributions P(h) and P(D|h) defined above.

Are there other probability distributions for P(h) and P(D|h) under which FIND-S outputs MAP hypotheses? Yes.

FIND-S outputs a MAP hypothesis relative to any prior probability distribution that favors more specific hypotheses.

More precisely, consider any probability distribution P(h) over H that assigns P(h1) ≥ P(h2) whenever h1 is more specific than h2. It can be shown that FIND-S outputs a MAP hypothesis assuming this prior distribution over H and the same distribution P(D|h) as above.

From concept learning,

An inductive bias of a learning algorithm is the set of assumptions B sufficient to deductively justify the inductive inference performed by the learner.

Consider the CANDIDATE-ELIMINATION algorithm with the assumption that the target concept c is included in the hypothesis space H. Its output follows deductively from its inputs plus this implicit inductive bias assumption.

Alternatively, the Bayesian interpretation models inductive bias by an equivalent probabilistic reasoning system based on Bayes theorem. The implicit assumptions are:

• prior probabilities over H are given by the distribution P(h), and
• the strength of the data in rejecting or accepting a hypothesis is given by P(D|h).

MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES


Learning Problem: learning a continuous-valued target function
Bayesian Analysis
“Under certain assumptions any learning algorithm that minimizes the squared error
between the output hypothesis predictions and the training data will output a maximum
likelihood hypothesis.”
Problem Setting

Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X (i.e., each h in H is a function h : X → R). L must learn an unknown target function f : X → R drawn from H.

Given: a set of m training examples ⟨xi, di⟩, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution:


di = f(xi) + ei

where f(xi) is the noise-free value of the target function and ei is a random variable representing the noise.

Assumptions:
• The values of the ei are drawn independently.
• They are distributed according to a Normal distribution with zero mean.

Task of the Learner: output a maximum likelihood hypothesis or, equivalently, a MAP hypothesis, assuming all hypotheses are equally probable a priori.
Example: Learning a linear function, though the analysis applies to learning arbitrary
real-valued functions.
Figure 2 below illustrates the whole scenario. Notice that the maximum likelihood hypothesis is not necessarily identical to the correct hypothesis f, because it is inferred from only a limited sample of noisy training data.

Figure 2: Learning a real-valued function. The target function f corresponds to the solid line. The training examples (xi, di) are assumed to have Normally distributed noise ei with zero mean added to the true target value f(xi). The dashed line corresponds to the linear function that minimizes the sum of squared errors. Therefore, it is the maximum likelihood hypothesis hML, given these five training examples.
To show: the least-squared error hypothesis is, in fact, the maximum likelihood hypothesis.

Probability densities:
• We need probabilities over continuous variables such as e.
• Requirement for our problem → the total probability over all possible values of the random variable must sum to one.


Why is a probability density needed?

Reason → for a continuous variable it is not possible to assign a finite probability to each of the infinite set of possible values of the random variable.

What is done with probability densities? We instead require that the integral of the probability density over all possible values be one.

Representation:
p → probability density function
P → finite probability

Definition: the probability density p(x0) is the limit, as ε goes to zero, of 1/ε times the probability that x will take a value in the interval [x0, x0 + ε]:

p(x0) ≡ lim_{ε→0} (1/ε) P(x0 ≤ x ≤ x0 + ε)
Normal Distribution: the random noise variable e is generated by a Normal probability distribution. A Normal distribution (also called a Gaussian distribution) is a smooth, bell-shaped distribution that can be completely characterized by its mean μ and its standard deviation σ. It is defined by the probability density function

p(x) = (1/√(2πσ²)) e^{−(1/2)((x−μ)/σ)²}

A Normal distribution is fully determined by the two parameters μ and σ. If the random variable X follows a Normal distribution, then:

• The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx
• The expected, or mean, value of X is E[X] = μ
• The variance of X is Var(X) = σ²
• The standard deviation of X is σ_X = σ
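
The density just defined is easy to evaluate directly; a minimal sketch:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

print(normal_pdf(0.0, 0.0, 1.0))   # ~0.3989, the standard Normal density at its mean
```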


The Central Limit Theorem states that the sum of a large number of independent,
identically distributed random variables follows a distribution that is approximately
Normal.
Prove: Maximum likelihood hypothesis hML minimizes the sum of the squared errors
between the observed training values di and the hypothesis predictions h(xi)
Proof: Starting from equation (3), we derive the maximum likelihood hypothesis, using lower case p to refer to the probability density:

hML = argmax_{h∈H} p(D|h)

Assumptions:
• Fixed set of training instances ⟨x1 … xm⟩
• D is the corresponding sequence of target values, D = ⟨d1 … dm⟩, with di = f(xi) + ei
• The training examples are mutually independent given h, so P(D|h) is the product of the individual p(di|h):

hML = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)

Since ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ², centered around the true target value f(xi) rather than zero. Hence p(di|h) can be written as a Normal density with variance σ² and mean μ = f(xi). Because we are writing the expression for the probability of di given that h is the correct description of the target function f, we also substitute μ = f(xi) = h(xi), yielding

hML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(1/(2σ²))(di − μ)²}

hML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^{−(1/(2σ²))(di − h(xi))²}

Rather than maximizing the above complicated expression, we maximize its (less complicated) logarithm:

hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln (1/√(2πσ²)) − (1/(2σ²))(di − h(xi))² ]

The first term in this expression is a constant independent of h and can therefore be discarded, yielding

hML = argmax_{h∈H} Σ_{i=1}^{m} −(1/(2σ²))(di − h(xi))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

hML = argmin_{h∈H} Σ_{i=1}^{m} (1/(2σ²))(di − h(xi))²

hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

The above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).

Limitations: the above analysis considers noise only in the target value of the training example and does not consider noise in the attributes describing the instances themselves.
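
To make the result concrete, the sketch below fits a linear hypothesis by minimizing the sum of squared errors, using the standard closed-form solution for simple linear regression; by the derivation above, this yields the maximum likelihood hypothesis under zero-mean Gaussian noise. The data values are illustrative:

```python
def fit_line_least_squares(xs, ds):
    """Least-squares fit d ~ w0 + w1*x; by the derivation above, this is
    the ML hypothesis under zero-mean Gaussian noise on the targets."""
    m = len(xs)
    x_bar, d_bar = sum(xs) / m, sum(ds) / m
    w1 = (sum((x - x_bar) * (d - d_bar) for x, d in zip(xs, ds))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = d_bar - w1 * x_bar
    return w0, w1

# Noisy samples of f(x) = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ds = [1.1, 2.9, 5.2, 6.8, 9.1]
print(fit_line_least_squares(xs, ds))   # roughly (1.0, 2.0)
```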

MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING


PROBABILITIES
Proven:
Maximum likelihood hypothesis is the one that minimizes the sum of squared errors
over the training examples.
NOW!!
Derive an analogous criterion for a second setting that is common in neural network learning → learning to predict probabilities.
Setting!!
Learn a nondeterministic (probabilistic) function f : X → {0,1}, which has two discrete output values.
Example 1:
X → medical patients described by their symptoms
f(x) = 1 if the patient survives, 0 if not.

Example 2:
X → loan applicants described by their past credit history
f(x) = 1 if the applicant successfully repays their next loan, 0 if not.
Here, f can be expected to be probabilistic
PROBABILISTIC f !!!
For example, among a collection of patients exhibiting the same set of observable symptoms, we might find that 92% survive and 8% do not → the output of the target function f(x) is a probabilistic function of its input.

LEARNING PROBLEM!!!
Learn a neural network (or other real-valued function approximator) whose output is the probability that f(x) = 1; i.e., learn the target function f' : X → [0,1] such that f'(x) = P(f(x) = 1).

From the example above, f'(x) = 0.92: the probabilistic function f(x) will equal 1 in 92% of cases and 0 in the remaining 8%.
How can we learn f’ using, say, a neural network?
Solution
- first collect the observed frequencies of 1's and 0's for each possible value of x
- then train the neural network to output the target frequency for each x
Further to be Proven!!
It is possible to train a neural network directly from the observed training examples of
f, yet still derive a maximum likelihood hypothesis for f' .
What criterion should we optimize in order to find a maximum likelihood
hypothesis for f' in this setting?
To answer this, first obtain an expression for P(D|h).

Assumptions:
- D → training data {⟨x1, d1⟩ … ⟨xm, dm⟩}
- di → the observed 0 or 1 value for f(xi)
- Both xi and di are random variables
- Each training example is drawn independently

Therefore, we can write


P(D|h) = ∏_{i=1}^{m} P(xi, di|h)

The probability of encountering any particular instance xi is independent of the hypothesis h. For example, the probability that our training set contains a particular patient xi is independent of our hypothesis about survival rates (though the survival di of the patient depends strongly on h). When xi is independent of h we can rewrite the above expression as

P(D|h) = ∏_{i=1}^{m} P(di|h, xi) P(xi)   ---(6.8)

What is the probability P(di|h, xi) of observing di = 1 for a single instance xi, given a world in which hypothesis h holds? We know that h computes this probability, so P(di = 1|h, xi) = h(xi). Hence

P(di|h, xi) = h(xi) if di = 1, and 1 − h(xi) if di = 0   ---(6.9)

In order to substitute for P(D|h) in Equation (6.8), let us first re-express Equation (6.9) in a more mathematically manipulable form:

P(di|h, xi) = h(xi)^{di} (1 − h(xi))^{1−di}   ---(6.10)


Substituting (6.10) into (6.8), and noting that P(xi) does not depend on h, the maximum likelihood hypothesis maximizes

P(D|h) = ∏_{i=1}^{m} h(xi)^{di} (1 − h(xi))^{1−di}   ---(6.12)

The expression on the right side of Equation (6.12) can be seen as a generalization of the Binomial distribution. It describes the probability that flipping each of m distinct coins will produce the outcome ⟨d1 … dm⟩, assuming that each coin xi has probability h(xi) of producing a heads. Note the Binomial distribution is similar, but makes the additional assumption that the coins have identical probabilities of turning up heads (i.e., that h(xi) = h(xj) for every i, j). In both cases we assume the outcomes of the coin flips are mutually independent, an assumption that fits our current setting.
It is easier to work with the log of the likelihood, giving

hML = argmax_{h∈H} Σ_{i=1}^{m} di ln h(xi) + (1 − di) ln(1 − h(xi))   ---(6.13)

Equation (6.13) describes the quantity that must be maximized in order to obtain the maximum likelihood hypothesis in our current problem setting. The expression Σ_{i=1}^{m} di ln h(xi) + (1 − di) ln(1 − h(xi)) is similar in form to the entropy expression and is hence referred to as the cross entropy.

Gradient Search to Maximize Likelihood in a Neural Net

Let G(h, D) = Σ_{i=1}^{m} di ln h(xi) + (1 − di) ln(1 − h(xi)).

Task → derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using gradient ascent.

The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the various network weights that define the hypothesis h represented by the learned network. The partial derivative of G(h, D) with respect to weight wjk, from input k to unit j, is

∂G(h,D)/∂wjk = Σ_{i=1}^{m} [∂G(h,D)/∂h(xi)] · [∂h(xi)/∂wjk]
             = Σ_{i=1}^{m} [∂(di ln h(xi) + (1 − di) ln(1 − h(xi)))/∂h(xi)] · [∂h(xi)/∂wjk]
             = Σ_{i=1}^{m} [(di − h(xi)) / (h(xi)(1 − h(xi)))] · [∂h(xi)/∂wjk]   ---(1)
If the neural network is constructed from a single layer of sigmoid units, then

∂h(xi)/∂wjk = σ'(xi) · xijk = h(xi)(1 − h(xi)) · xijk   ---(2)

where xijk is the kth input to unit j for the ith training example, and σ' is the derivative of the sigmoid squashing function.

Substituting (2) into (1):

∂G(h,D)/∂wjk = Σ_{i=1}^{m} [(di − h(xi)) / (h(xi)(1 − h(xi)))] · h(xi)(1 − h(xi)) · xijk

∂G(h,D)/∂wjk = Σ_{i=1}^{m} (di − h(xi)) xijk

Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than gradient descent search. On each iteration of the search the weight vector is adjusted in the direction of the gradient, using the weight update rule

wjk ← wjk + Δwjk

where

Δwjk = η Σ_{i=1}^{m} (di − h(xi)) xijk

Compare this weight-update rule to the weight-update rule used by the BACKPROPAGATION algorithm to minimize the sum of squared errors between predicted and observed network outputs. Re-expressed in the current notation, that rule is

wjk ← wjk + Δwjk

where

Δwjk = η Σ_{i=1}^{m} h(xi)(1 − h(xi))(di − h(xi)) xijk

CONCLUSION
The rule that minimizes sum of squared error seeks the maximum likelihood hypothesis
under the assumption that the training data can be modeled by normally distributed noise
added to the target function value.
The rule that minimizes cross entropy seeks the maximum likelihood hypothesis under
the assumption that the observed Boolean value is a probabilistic function of the input
instance.
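
A minimal sketch of the gradient ascent rule derived above, for a single sigmoid unit, follows. The data set, learning rate, and iteration count are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_ml(examples, n_inputs, eta=0.1, epochs=1000):
    """Gradient ascent on G(h,D) = sum_i di ln h(xi) + (1-di) ln(1-h(xi)):
    w_k <- w_k + eta * sum_i (di - h(xi)) * x_ik."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        grad = [0.0] * n_inputs
        for x, d in examples:
            h = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
            for k in range(n_inputs):
                grad[k] += (d - h) * x[k]
        w = [wk + eta * gk for wk, gk in zip(w, grad)]
    return w

# Each instance is (bias_input, feature); targets follow a noisy threshold
data = [((1.0, 0.1), 0), ((1.0, 0.4), 0), ((1.0, 0.6), 1), ((1.0, 0.9), 1)]
w = train_sigmoid_ml(data, n_inputs=2)
print([round(sigmoid(w[0] + w[1] * x), 2) for x in (0.1, 0.9)])  # low, then high
```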


MINIMUM DESCRIPTION LENGTH PRINCIPLE


Occam's razor: choose the shortest explanation for the observed data.

The Bayesian perspective leads to the Minimum Description Length (MDL) principle. Motivation → interpreting the definition of hMAP in the light of basic concepts from information theory.

Consider

hMAP = argmax_{h∈H} P(D|h)P(h)

hMAP can equivalently be expressed in terms of maximizing the log:

hMAP = argmax_{h∈H} log2 P(D|h) + log2 P(h)

or, alternatively, minimizing the negative of this quantity:

hMAP = argmin_{h∈H} −log2 P(D|h) − log2 P(h)   ---(16)

This can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data.
Consider the problem of designing a code to transmit messages drawn at random,
where the probability of encountering message i is pi.
What is needed? The most compact code → the one that minimizes the expected number of bits we must transmit in order to encode a message drawn at random.
How?
Assign shorter codes to messages that are more probable.
Existing Proof!!
Shannon and Weaver (1949) → the optimal code (i.e., the code that minimizes the expected message length) assigns −log2 pi bits to encode message i.
Let LC(i) denote the number of bits required to encode message i using code C → the description length of message i with respect to C.
Consider Equation (16) and interpret it in terms of coding theory.

• −log2 P(h) → the description length of h under the optimal encoding for the hypothesis space H. This is the size of the description of hypothesis h using this optimal representation:

L_{CH}(h) = −log2 P(h)


CH  Optimal code for hypothesis space H.


• -log2P(D|h)  description length of the training data D given hypothesis h, under
its optimal encoding.
𝐿𝐶𝐷|ℎ(𝐷|ℎ) = -log2P(D|h)
𝐶𝐷|ℎ optimal code for describing data D assuming that both the sender and receiver
know the hypothesis h
Rewrite (16)
hMAP = 𝑎𝑟𝑔𝑚𝑖𝑛ℎ𝐿𝐶 𝐻 (ℎ)+𝐿𝐶𝐷|ℎ (𝐷|ℎ)

CONCLUSION:
The Minimum Description Length (MDL) principle recommends choosing the
hypothesis that minimizes the sum of these two description lengths.
Given codes C1 and C2 to represent the hypothesis and the data given the hypothesis, we can state the MDL principle as:

Minimum Description Length principle: choose hMDL where

hMDL = argmin_h L_{C1}(h) + L_{C2}(D|h)

If we choose C1 to be the optimal encoding of hypotheses CH, and C2 to be the optimal encoding CD|h, then hMDL = hMAP.
Example,
To apply the MDL principle to the problem of learning decision trees from some
training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
C1  choose some obvious encoding of decision trees such that description
length grows with the number of nodes in the tree and with the number of edges
How shall we choose the encoding C2 of the data given a particular decision tree
hypothesis?
Consider a sequence of instances ⟨x1, …, xm⟩ known to both transmitter and receiver. It then suffices to transmit only the classifications ⟨f(x1), …, f(xm)⟩. If the training classifications ⟨f(x1), …, f(xm)⟩ are identical to the predictions of the hypothesis, then there is no need to transmit any information about these examples; the description length of the classifications given the hypothesis is 0. Now consider the case where some examples are misclassified by h. For each misclassification we transmit a message that identifies which example is misclassified, along with its correct classification.


Identifying the misclassified example requires at most log2 m bits; transmitting its correct classification requires at most log2 k bits, where k is the number of possible classifications.
The hypothesis hMDL under the encodings C1 and C2 is just the one that minimizes the sum of these description lengths. Therefore the MDL principle provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis: it might select a shorter hypothesis that makes a few errors over a longer hypothesis that perfectly classifies the training data → one way of addressing overfitting.
Conclusion on MDL!!
Does this prove once and for all that short hypotheses are best?
NO
What is shown?
If a representation of hypotheses is chosen so that the size of hypothesis h is −log2 P(h), and if a representation for exceptions is chosen so that the encoding length of D given h is equal to −log2 P(D|h), then the MDL principle produces MAP hypotheses.
There is no reason to believe that the MDL hypothesis relative to arbitrary
encodings C1 and C2 should be preferred.
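
The trade-off can be illustrated with a small sketch. Here the description lengths come directly from assumed probabilities via the Shannon code length −log2 p; the candidate hypotheses and all the numbers are purely illustrative:

```python
import math

def description_length_bits(p):
    """Optimal (Shannon) code length, in bits, for an event of probability p."""
    return -math.log2(p)

def h_mdl(candidates):
    """Choose the hypothesis minimizing L_C1(h) + L_C2(D|h), with both
    lengths derived from the given probabilities."""
    return min(candidates,
               key=lambda c: description_length_bits(c["p_h"])
                             + description_length_bits(c["p_d_given_h"]))

# A short, slightly inaccurate tree vs. a long tree that fits D perfectly
candidates = [
    {"name": "short_tree", "p_h": 0.25, "p_d_given_h": 0.50},   # 2 + 1 = 3 bits
    {"name": "big_tree",   "p_h": 0.01, "p_d_given_h": 1.00},   # ~6.64 + 0 bits
]
print(h_mdl(candidates)["name"])   # short_tree
```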
BAYES OPTIMAL CLASSIFIER
What is the most probable hypothesis given the training data?
What is the most probable classification of the new instance given the training data?
It is possible to do better than the MAP hypothesis.

To develop the intuition, consider H containing h1, h2, and h3, with posterior probabilities 0.4, 0.3, and 0.3. The MAP hypothesis is h1. Consider a new instance x classified positive by h1 but negative by h2 and h3.

• The probability that x is positive is 0.4, and the probability that x is negative is 0.6.
• The most probable classification therefore differs from the MAP classification.
• The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
• If the possible classification of a new example can take on any value vj from a set V, then the probability P(vj|D) that the correct classification for the new instance is vj is

P(vj|D) = Σ_{hi∈H} P(vj|hi) P(hi|D)

• The optimal classification of the new instance is the value vj for which P(vj|D) is maximum.

Bayes Optimal Classification:

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)   ---(6.18)

Example: the set of possible classifications of the new instance is V = {⊕, ⊖}, with

P(h1|D) = 0.4, P(⊖|h1) = 0, P(⊕|h1) = 1
P(h2|D) = 0.3, P(⊖|h2) = 1, P(⊕|h2) = 0
P(h3|D) = 0.3, P(⊖|h3) = 1, P(⊕|h3) = 0

Therefore

Σ_{hi∈H} P(⊕|hi) P(hi|D) = 0.4
Σ_{hi∈H} P(⊖|hi) P(hi|D) = 0.6

and

argmax_{vj∈{⊕,⊖}} Σ_{hi∈H} P(vj|hi) P(hi|D) = ⊖

Any system that classifies new instances according to Equation (6.18) is called a Bayes
optimal classifier, or Bayes optimal learner. Therefore, Bayes Optimal Classifier
maximizes the probability that the new instance is classified correctly, given the available
data, hypothesis space, and prior probabilities over the hypotheses.
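
A minimal sketch of Equation (6.18), reusing the three-hypothesis example above (the dictionary representation of the per-hypothesis predictions is an illustrative choice):

```python
def bayes_optimal(classes, posteriors, predictions):
    """Eq. (6.18): argmax over v of sum_i P(v|h_i) P(h_i|D).
    predictions[i][v] plays the role of P(v|h_i)."""
    def score(v):
        return sum(p_h * pred[v] for p_h, pred in zip(posteriors, predictions))
    return max(classes, key=score)

posteriors = [0.4, 0.3, 0.3]            # P(h1|D), P(h2|D), P(h3|D)
predictions = [{"+": 1, "-": 0},        # h1 classifies x as positive
               {"+": 0, "-": 1},        # h2 classifies x as negative
               {"+": 0, "-": 1}]        # h3 classifies x as negative
print(bayes_optimal(["+", "-"], posteriors, predictions))   # "-"
```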
GIBBS ALGORITHM
Bayes optimal classifier obtains the best performance that can be achieved from the
given training data.
Disadvantage!!
It can be costly to apply.
WHY??
• It computes the posterior probability for every hypothesis in H
• then combines the predictions of each hypothesis to classify each new instance.


ALTERNATIVE!!
GIBBS ALGORITHM
Definition:
Gibbs Algorithm
1. Choose a hypothesis h from H at random, according to the posterior probability
distribution over H.
2. Use h to predict the classification of the next instance x.
Haussler et al. (1994) → it can be shown that under certain conditions the expected misclassification error for the Gibbs algorithm is at most twice the expected error of the Bayes optimal classifier.

Implication for the concept learning problem: assume a uniform prior probability over H, with target concepts drawn according to this same distribution. Then classifying a new instance using a hypothesis drawn at random from the current version space VSH,D has expected error at most twice that of the Bayes optimal classifier.
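
A sketch of the Gibbs algorithm takes only a few lines once the posterior distribution P(h|D) is available (here it is simply passed in as sampling weights):

```python
import random

def gibbs_classify(hypotheses, posteriors, x):
    """Gibbs algorithm: draw one hypothesis at random according to P(h|D),
    then use it alone to classify x."""
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

# Illustrative use with the three hypotheses from the previous section
H = [lambda x: "+", lambda x: "-", lambda x: "-"]
print(gibbs_classify(H, [0.4, 0.3, 0.3], None))   # "+" with prob. 0.4, else "-"
```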
NAÏVE BAYES CLASSIFIER
A highly practical learning method → the Naïve Bayes Classifier (NBC).

Where can NBC be applied?? Learning tasks where
• each instance x is described by a conjunction of attribute values, and
• the target function f(x) can take on any value from some finite set V.

Given: a set of training examples of the target function, and a new instance described by the tuple of attribute values ⟨a1, a2, …, an⟩.

Task!! Predict the target value, or classification, for the new instance.

Bayesian approach!!! Assign the most probable target value, vMAP, given the attribute values ⟨a1, a2, …, an⟩ that describe the instance.
vMAP = argmax_{vj∈V} P(vj|a1, a2, …, an)
Rewrite the above expression using Bayes theorem


vMAP = argmax_{vj∈V} P(a1, a2, …, an|vj) P(vj) / P(a1, a2, …, an)

vMAP = argmax_{vj∈V} P(a1, a2, …, an|vj) P(vj)   ---(6.19)

The value of P(vj) is estimated simply by counting the frequency with which each target value vj occurs in the training data.

How do we estimate P(a1, a2, …, an|vj)? Estimating it in the same fashion is not feasible unless we have a very large set of training examples → the number of such terms equals the number of possible instances times the number of possible target values.

The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value.
Given the target value of the instance, the probability of observing the conjunction a1, a2, …, an is then just the product of the probabilities for the individual attributes: P(a1, a2, …, an|vj) = ∏i P(ai|vj). Substituting into (6.19),

vNB = argmax_{vj∈V} P(vj) ∏i P(ai|vj)   ---(6.20)

An Illustrative Example
Apply the naive Bayes classifier to a concept learning problem considered earlier for decision tree learning → the PlayTennis training data below.
Example Outlook Temperature Humidity Wind PlayTennis
D1 sunny hot high weak No
D2 sunny hot high strong No
D3 overcast hot high weak Yes
D4 rain mild high weak Yes
D5 rain cool normal weak Yes
D6 rain cool normal strong No
D7 overcast cool normal strong Yes
D8 sunny mild high weak No
D9 sunny cool normal weak Yes
D10 rain mild normal weak Yes
D11 sunny mild normal strong Yes
D12 overcast mild high strong Yes
D13 overcast hot normal weak Yes
D14 rain mild high strong No


Use naive Bayes classifier and the training data from the table to classify the
following novel instance
<Outlook =sunny, Temperature =cool, Humidity = high, Wind = strong>
Task!!
Predict the target value (yes or no) of the target concept PlayTennis for this new
instance.
We know that

vNB = argmax_{vj∈V} P(vj) ∏i P(ai|vj)   ---(6.20)

Here,

vNB = argmax_{vj∈{yes,no}} P(vj) P(Outlook=sunny|vj) P(Temperature=cool|vj) P(Humidity=high|vj) P(Wind=strong|vj)   ---(6.21)
Requires 10 probabilities for estimating vNB.
The probabilities of the different target values can easily be estimated based
on their frequencies over the 14 training examples.
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
Now estimate the conditional probabilities
𝑃(𝑂𝑢𝑡𝑙𝑜𝑜𝑘 =sunny|PlayTennis=yes) = 2/9=0.22
𝑃(𝑂𝑢𝑡𝑙𝑜𝑜𝑘 =sunny|PlayTennis=no) = 3/5=0.6
𝑃(Temperature =cool|PlayTennis=yes) = 3/9=0.33
𝑃(Temperature =cool|PlayTennis=no)=1/5=0.2
𝑃(Humidity = high|PlayTennis=yes) = 3/9=0.33
𝑃(Humidity = high|PlayTennis=no)=4/5=0.8
𝑃(Wind = strong|PlayTennis=yes) = 3/9=0.33
𝑃(Wind = strong|PlayTennis=no)=3/5=0.6
Substituting in (6.21) we get,
P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes)
=0.64×0.22×0.33×0.33 ×0.33=0.0051
P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no)
=0.36 ×0.6 ×0.2 ×0.8 ×0.6
=0.0207


Thus the naive Bayes classifier assigns the target value no to this new instance.

By normalizing the above quantities to sum to one, we can calculate the conditional probability that the target value is no, given the observed attribute values:

0.0207 / (0.0051 + 0.0207) = 0.80
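
The full calculation is easy to reproduce in code. The sketch below recomputes vNB from the 14-example table (the tuple encoding of the table is an illustrative choice; small numeric differences from the hand calculation are due to the two-decimal rounding used above):

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) from the table
DATA = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def naive_bayes(instance):
    """Eq. (6.20): v_NB = argmax_v P(v) * prod_i P(a_i|v), with all
    probabilities estimated as frequencies over DATA."""
    labels = Counter(row[-1] for row in DATA)
    scores = {}
    for v in labels:
        rows = [r for r in DATA if r[-1] == v]
        s = labels[v] / len(DATA)                  # P(v)
        for i, a in enumerate(instance):           # product of P(a_i|v)
            s *= sum(r[i] == a for r in rows) / len(rows)
        scores[v] = s
    return max(scores, key=scores.get), scores

print(naive_bayes(("sunny", "cool", "high", "strong")))
# ('no', {'yes': ~0.0053, 'no': ~0.0206})
```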
Estimating Probabilities

Up to this point we have estimated probabilities by the fraction of times the event is observed to occur over the total number of opportunities. For example, P(Wind = strong|PlayTennis = no) is calculated as nc/n, where n = 5 is the total number of training examples with target value no, and nc = 3 is the number of these for which Wind = strong.

Problem!! This provides a poor estimate when nc is very small. For example, suppose the true P(Wind = strong|PlayTennis = no) = 0.08 and we have only 5 examples for which PlayTennis = no. Then the most probable value for nc is 0.
This raises two difficulties:
- nc/n produces a biased underestimate of the probability.
- When nc/n is zero, it will dominate the Bayes classifier if a future query contains Wind = strong. WHY?? Because the quantity calculated in Equation (6.20) requires multiplying all the other probability terms by this zero value.

To avoid this difficulty, adopt a Bayesian approach to estimating the probability, using the m-estimate:

m-estimate of probability:   (nc + m·p) / (n + m)
where p is a prior estimate of the probability and m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.

How do we estimate p in the absence of other information? Assume uniform priors: if an attribute has k possible values, set p = 1/k.

NOTE: if m = 0, the m-estimate is equivalent to the simple fraction nc/n.
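
A minimal sketch of the m-estimate; the choices p = 1/2 (uniform over the two Wind values) and m = 3 below are illustrative assumptions:

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m).
    With m = 0 this reduces to the raw frequency n_c / n."""
    return (n_c + m * p) / (n + m)

# P(Wind=strong | PlayTennis=no): n_c = 3 of n = 5 examples,
# with uniform prior p = 1/2 and equivalent sample size m = 3
print(m_estimate(3, 5, 0.5, 3))   # 0.5625, pulled toward the prior 0.5
```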
BAYESIAN BELIEF NETWORKS
Naive Bayes classifier → uses the assumption that the values of the attributes a1 … an are conditionally independent given the target value v, and outputs the optimal Bayes classification under that assumption.

Bayesian belief networks → describe the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities. Where the naive Bayes classifier assumes that all the variables are conditionally independent given the value of the target variable, Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the variables.

A Bayesian belief network describes the probability distribution over a set of variables. Consider an arbitrary set of random variables Y1 … Yn, where each Yi can take on the set of possible values V(Yi). The joint space of the set of variables Y is the cross product V(Y1) × V(Y2) × … × V(Yn). Each item in the joint space corresponds to one of the possible assignments of values to the tuple of variables ⟨Y1 … Yn⟩. The probability distribution over this joint space is called the joint probability distribution; it specifies the probability for each of the possible variable bindings for the tuple ⟨Y1 … Yn⟩. A Bayesian belief network describes the joint probability distribution for a set of variables.
Conditional Independence

For three discrete-valued random variables X, Y, and Z, we say X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is, if

(∀xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)   ---(1)

where xi ∈ V(X), yj ∈ V(Y), and zk ∈ V(Z). Equation (1) is commonly abbreviated P(X|Y, Z) = P(X|Z).

The definition extends to sets of variables: the set of variables X1 … Xl is conditionally independent of the set of variables Y1 … Ym given the set of variables Z1 … Zn if

P(X1 … Xl | Y1 … Ym, Z1 … Zn) = P(X1 … Xl | Z1 … Zn)

In the naive Bayes classifier, the instance attribute A1 is conditionally independent of instance attribute A2 given the target value V. Therefore

P(A1, A2|V) = P(A1|A2, V) P(A2|V)   ---(2)
            = P(A1|V) P(A2|V)   ---(3)

Equation (3) follows from (2) because, if A1 is conditionally independent of A2 given V, then by our definition of conditional independence P(A1|A2, V) = P(A1|V).
Representation
A Bayesian belief network (Bayesian network) represents a joint probability
distribution for a set of variables.
Example,

Figure: A Bayesian belief network. The network on the left represents a set of conditional independence assumptions. In particular, each node is asserted to be conditionally independent of its non-descendants, given its immediate parents. Associated with each node is a conditional probability table that specifies the conditional distribution for the variable given its immediate parents in the graph. The conditional probability table for the Campfire node is shown at the right, where Campfire is abbreviated to C, Storm to S, and BusTourGroup to B.
The joint probability for any desired assignment of values ⟨y1, …, yn⟩ to the tuple of network variables ⟨Y1 … Yn⟩ can be computed by the formula

P(y1, …, yn) = ∏_{i=1}^{n} P(yi | Parents(Yi))

From the table for Campfire, one such assertion is

P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
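
The joint-probability formula above is straightforward to evaluate once the network's conditional probability tables are given. In the sketch below only the 0.4 Campfire entry comes from the text; the remaining CPT numbers are illustrative placeholders:

```python
def joint_probability(assignment, parents, cpt):
    """P(y1..yn) = product over i of P(yi | Parents(Yi)), read off the CPTs.
    cpt[var] maps a tuple of parent values to P(var=True | parents)."""
    prob = 1.0
    for var, value in assignment.items():
        parent_vals = tuple(assignment[par] for par in parents[var])
        p_true = cpt[var][parent_vals]
        prob *= p_true if value else 1.0 - p_true
    return prob

# Fragment of the example network (Storm, BusTourGroup -> Campfire)
parents = {"Storm": (), "BusTourGroup": (), "Campfire": ("Storm", "BusTourGroup")}
cpt = {
    "Storm": {(): 0.3},                    # illustrative
    "BusTourGroup": {(): 0.5},             # illustrative
    "Campfire": {(True, True): 0.4,        # from the text
                 (True, False): 0.1, (False, True): 0.8,
                 (False, False): 0.2},     # illustrative
}
print(joint_probability({"Storm": True, "BusTourGroup": True, "Campfire": True},
                        parents, cpt))     # 0.3 * 0.5 * 0.4 = 0.06
```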
Full Joint Probability Distribution
The full joint probability distribution is specified by the set of local conditional probability tables for all the variables, together with the set of conditional independence assumptions described by the network.
Attractive feature → a BBN provides a convenient way to represent causal knowledge, such as the fact that Lightning causes Thunder.

Inference

Task → infer the value of some target variable (e.g., ForestFire) given the observed values of the other variables. More precisely, we wish to infer the probability distribution for the target variable, specifying the probability that it will take on each of its possible values given the observed values of the other variables. This is straightforward when the values of all of the other variables in the network are known exactly.

In the more general case we wish to infer the probability distribution for some variable (e.g., ForestFire) given observed values for only a subset of the other variables (e.g., Thunder and BusTourGroup may be the only observed values available). A Bayesian network can be used to support this kind of inference.
Learning Bayesian Belief Networks
Can we devise effective algorithms for learning Bayesian belief networks from
training data?
Ans: Yes. Several different settings arise. For example:

i. The network structure might be given in advance, or it might have to be inferred from the training data.
ii. The network variables might be directly observable in each training example, or some might be unobservable.


Case 1: the network structure is given and all the variables are fully observable in the training examples.
- It is easy to learn the conditional probability tables.

Case 2: the network structure is given and only some of the variable values are observable in the training data.
- Learning is more difficult, similar to learning the weights for hidden units in an ANN where only the input and output values are known.
- SOLUTION → the gradient ascent procedure (GAP) of Russell et al. (1995), which learns the entries in the conditional probability tables. GAP searches through a space of hypotheses that corresponds to the set of all possible entries for the conditional probability tables, and maximizes the probability P(D|h) of the observed training data D given the hypothesis h.
Gradient Ascent Training of Bayesian Networks (Russell et al., 1995)

What is done? Maximize P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network.

Let wijk denote a single entry in one of the conditional probability tables → the conditional probability that the network variable Yi will take on the value yij given that its immediate parents Ui take on the values given by uik.

Example: if wijk is the top right entry in the Campfire conditional probability table above, then Yi is the variable Campfire, Ui is the tuple of its parents (Storm, BusTourGroup), yij = True, and uik = (False, False).


The gradient of ln P(D|h) is given by the derivatives ∂ln P(D|h)/∂wijk for each wijk. Each of these can be calculated as

∂ln P(D|h)/∂wijk = Σ_{d∈D} P(Yi = yij, Ui = uik | d) / wijk   ---(4)
For example, to calculate the derivative of ln P(D|h) with respect to the upper rightmost entry in the table, we must calculate the quantity P(Campfire = True, Storm = False, BusTourGroup = False | d) for each training example d in D.

Derivation of Equation (4): let Ph(D) denote P(D|h). We wish to derive ∂ln Ph(D)/∂wijk.
Proof:

∂ln Ph(D)/∂wijk = ∂/∂wijk ln ∏_{d∈D} Ph(d)
               = Σ_{d∈D} ∂ln Ph(d)/∂wijk

Since ∂ln f(x)/∂x = (1/f(x)) ∂f(x)/∂x,

               = Σ_{d∈D} (1/Ph(d)) ∂Ph(d)/∂wijk

Introduce the values of the variables Yi and Ui = Parents(Yi) by summing over their possible values yij' and uik':

∂ln Ph(D)/∂wijk = Σ_{d∈D} (1/Ph(d)) ∂/∂wijk Σ_{j',k'} Ph(d | yij', uik') Ph(yij', uik')

From the product rule of probability,

∂ln Ph(D)/∂wijk = Σ_{d∈D} (1/Ph(d)) ∂/∂wijk Σ_{j',k'} Ph(d | yij', uik') Ph(yij' | uik') Ph(uik')

Consider the rightmost sum. Given that wijk ≡ Ph(yij | uik), the only term in this sum for which ∂/∂wijk is nonzero is the term for which j' = j and k' = k. Therefore

∂ln Ph(D)/∂wijk = Σ_{d∈D} (1/Ph(d)) ∂/∂wijk [ Ph(d | yij, uik) Ph(yij | uik) Ph(uik) ]
               = Σ_{d∈D} (1/Ph(d)) ∂/∂wijk [ Ph(d | yij, uik) wijk Ph(uik) ]
               = Σ_{d∈D} (1/Ph(d)) Ph(d | yij, uik) Ph(uik)

Applying Bayes theorem to rewrite Ph(d | yij, uik), we have

∂ln Ph(D)/∂wijk = Σ_{d∈D} (1/Ph(d)) [ Ph(yij, uik | d) Ph(d) / Ph(yij, uik) ] Ph(uik)
               = Σ_{d∈D} Ph(yij, uik | d) Ph(uik) / Ph(yij, uik)
               = Σ_{d∈D} Ph(yij, uik | d) / Ph(yij | uik)

∂ln Ph(D)/∂wijk = Σ_{d∈D} Ph(yij, uik | d) / wijk   ---(5)
Requirement: as the weights wijk are updated they must remain valid probabilities in the interval [0,1], and the sum Σj wijk must remain 1 for all i, k.

To satisfy these requirements, the weights are updated in a two-step process:

i. Update each wijk by gradient ascent:

   wijk ← wijk + η Σ_{d∈D} Ph(yij, uik | d) / wijk

ii. Renormalize the weights wijk to ensure that the above constraints are satisfied.
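
A minimal sketch of this two-step update is shown below. It assumes the posterior sums Σ_d Ph(yij, uik|d) have already been computed by inference over the network (that inference step is not shown), and the nested-list layout w[i][j][k] is an illustrative representation of the conditional probability tables:

```python
def gradient_ascent_step(w, posterior_sums, eta):
    """One update of the CPT entries w[i][j][k]:
    (i)  w_ijk <- w_ijk + eta * sum_d P(Yi=yij, Ui=uik | d) / w_ijk,
    (ii) renormalize so that sum over j of w_ijk = 1 for every i, k."""
    for i in range(len(w)):
        for j in range(len(w[i])):
            for k in range(len(w[i][j])):
                w[i][j][k] += eta * posterior_sums[i][j][k] / w[i][j][k]
        for k in range(len(w[i][0])):               # renormalize over j
            total = sum(w[i][j][k] for j in range(len(w[i])))
            for j in range(len(w[i])):
                w[i][j][k] /= total
    return w

# Illustrative use: one variable with two values (j) and two parent configs (k)
w = [[[0.6, 0.3], [0.4, 0.7]]]
sums = [[[2.0, 1.0], [1.0, 3.0]]]
print(gradient_ascent_step(w, sums, eta=0.01))
```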

Learning the Structure of Bayesian Networks


Cooper and Herskovits (1992) → a Bayesian scoring metric for choosing among alternative networks. They also presented a heuristic search algorithm called K2 for learning network structure when the data is fully observable.

K2 performs a greedy search that trades off network complexity for accuracy over the training data. In one experiment, K2 was given a set of 3,000 training examples generated at random from a manually constructed Bayesian network containing 37 nodes and 46 arcs, together with an initial ordering over the 37 variables that was consistent with the partial ordering of variable dependencies in the actual network. OUTCOME → K2 succeeded in reconstructing the correct Bayesian network structure almost exactly, with the exception of one incorrectly deleted arc and one incorrectly added arc.

Spirtes et al. (1993) → constraint-based approaches to learning Bayesian network structure: infer independence and dependence relationships from the data, and then use these relationships to construct Bayesian networks.
The EM Algorithm

Many approaches have been proposed to handle the problem of learning in the presence of unobserved variables. If some variable is sometimes observed and sometimes not, then the cases for which it has been observed can be used to learn to predict its values when it is not.

The EM algorithm (Dempster et al., 1977):
• is a widely used approach to learning in the presence of unobserved variables
• can be used even for variables whose values are never directly observed, provided the general form of their probability distribution is known
• has been used to train Bayesian belief networks
• is the basis for many unsupervised clustering algorithms
• is the basis for the Baum-Welch forward-backward algorithm for learning partially observable Markov models

Estimating the Means of k Gaussians


Example: the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions.

Here k = 2, and the instances are points along the x-axis. Each instance is generated using a two-step process:

i. One of the k Normal distributions is selected at random.
ii. A single random instance xi is generated according to this selected distribution.

This process is repeated to generate a set of data points. We consider the special case where the selection of the single Normal distribution at each step is made with uniform probability, and where each of the k Normal distributions has the same known variance σ².

Learning task → output a hypothesis h = ⟨µ1, …, µk⟩ that describes the means of each of the k distributions. The goal is a maximum likelihood hypothesis for these means; that is, a hypothesis h that maximizes p(D|h).

It is easy to calculate the maximum likelihood hypothesis for the mean of a single Normal distribution given observed data instances x1, x2, …, xm drawn from that single distribution.
We know that

µML = argmin_µ Σ_{i=1}^{m} (xi − µ)²   ---(6)

and the sum of squared errors is minimized by the sample mean:

µML = (1/m) Σ_{i=1}^{m} xi   ---(7)

Our problem, however, involves a mixture of k different Normal distributions, and we cannot observe which instances were generated by which distribution.

Let the full description of each instance be the triple ⟨xi, zi1, zi2⟩, where xi is the observed value of the ith instance and zi1, zi2 indicate which of the two Normal distributions was used to generate the value xi → zij = 1 if xi was generated by the jth Normal distribution, and 0 otherwise.

Here xi is an observed variable, while zi1 and zi2 are hidden variables. If zi1 and zi2 were observed, Equation (6) could be applied to solve for the means µ1 and µ2. Because they are not, we instead use the EM algorithm.
EM Algorithm applied to the k-means problem

Task → search for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden variables zij given the current hypothesis ⟨µ1 … µk⟩, then recalculating the maximum likelihood hypothesis using these expected values for the hidden variables.

Procedure: first initialize the hypothesis to h = ⟨µ1, µ2⟩, then iteratively re-estimate h by repeating the following two steps until the procedure converges to a stationary value for h.

Step 1: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨µ1, µ2⟩ holds.

Step 2: Calculate a new maximum likelihood hypothesis h' = ⟨µ1', µ2'⟩, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in Step 1. Then replace h = ⟨µ1, µ2⟩ by h' = ⟨µ1', µ2'⟩ and iterate.
Step 1 must calculate the expected value of each zij. E[zij] is just the probability that instance xi was generated by the jth Normal distribution:

E[zij] = p(x = xi | µ = µj) / Σ_{n=1}^{2} p(x = xi | µ = µn)
       = e^{−(1/(2σ²))(xi − µj)²} / Σ_{n=1}^{2} e^{−(1/(2σ²))(xi − µn)²}


The first step is implemented by substituting the current values ⟨µ1, µ2⟩ and the observed xi into the above expression. The second step uses the E[zij] calculated in Step 1 to derive a new maximum likelihood hypothesis h' = ⟨µ1', µ2'⟩:

µj ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]   ---(8)

This expression is similar to the sample mean of Equation (7), which is used to estimate µ for a single Normal distribution. Expression (8) is the weighted sample mean for µj, with each instance weighted by the expectation E[zij] that it was generated by the jth Normal distribution.
Conclusion:
The current hypothesis is used to estimate the unobserved variables, and the expected
values of these variables are then used to calculate an improved hypothesis.
Further proof:
On each iteration through this loop, the EM algorithm increases the likelihood P(D|h)
unless it is at a local maximum. The algorithm thus converges to a local maximum
likelihood hypothesis for < µ1 , µ2 > .
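
The two steps can be implemented directly. The sketch below runs EM for a mixture of two Normals with known, equal variance and uniform mixing, matching the setting above; the synthetic data and the initialization are illustrative:

```python
import math, random

def em_two_gaussians(xs, sigma=1.0, iters=50):
    """EM for a two-component Gaussian mixture with known variance:
    E-step computes E[z_ij]; M-step takes the weighted sample means."""
    mu = [min(xs), max(xs)]                        # crude initialization
    for _ in range(iters):
        # E-step: expected membership of each instance in each component
        E = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(w)
            E.append([wj / s for wj in w])
        # M-step: weighted sample means (Eq. 8)
        mu = [sum(E[i][j] * xs[i] for i in range(len(xs)))
              / sum(E[i][j] for i in range(len(xs)))
              for j in range(2)]
    return mu

random.seed(0)
xs = ([random.gauss(0, 1) for _ in range(100)]
      + [random.gauss(5, 1) for _ in range(100)])
print(em_two_gaussians(xs))   # approximately [0, 5]
```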
General Statement of the EM Algorithm

What have we learnt in the previous session? The EM algorithm for the problem of estimating the means of a mixture of Normal distributions. More generally, the EM algorithm can be applied in many settings where we wish to estimate some set of parameters θ that describe an underlying probability distribution, given only the observed portion of the full data produced by this distribution. In the example above, θ = ⟨µ1, µ2⟩ and the full data were the triples ⟨xi, zi1, zi2⟩, of which only xi was observed.


Let:
X = {x1, …, xm} → the observed data in a set of m independently drawn instances
Z = {z1, …, zm} → the unobserved data in these same instances
Y = X ∪ Z → the full data

The unobserved Z can be treated as a random variable whose probability distribution depends on the unknown parameters θ and on the observed data X. Y is a random variable because it is defined in terms of the random variable Z.
What do we learn further? The general form of the EM algorithm.

Notation:
h → the current hypothesized values of the parameters θ
h' → the revised hypothesis estimated on each iteration of the EM algorithm

Learning task of the EM algorithm!! The EM algorithm searches for the maximum likelihood hypothesis h' by seeking the h' that maximizes E[ln P(Y|h')], where the expectation is taken over the probability distribution governing Y, which is determined by the unknown parameters θ.

What does this expression signify? P(Y|h') is the likelihood of the full data Y given hypothesis h'. Maximizing the logarithm ln P(Y|h') also maximizes P(Y|h'). We introduce the expected value E[ln P(Y|h')] because the full data Y is itself a random variable: Y is a combination of the observed data X and the unobserved data Z, so we must average over the possible values of the unobserved Z, weighting each according to its probability. That is, we take the expected value E[ln P(Y|h')] over the probability distribution governing the random variable Y. This distribution is determined by the completely known values for X, plus the distribution governing Z.


What is the probability distribution governing Y ?
Usually unknown  determined by the parameters θ
Therefore,
EM algorithm uses its current hypothesis h in place of the actual parameters θ
to estimate the distribution governing Y .
To define!!
A function Q(h’|h) that gives E[ln P(Y|h’)] as a function of h’ under the
assumption that θ=h and given the observed portion of X of the full data Y.
Q(h’|h) = E[ln P(Y|h’)|h,X]
In its general form, the EM algorithm repeats the following two steps until
convergence:
Step 1: Estimation (E) step: Calculate Q(h’|h) using the current hypothesis h and
the observed data X to estimate the probability distribution over Y.
Q(h’|h) ← E[ln P(Y|h’)|h,X]
Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis h' that
maximizes this Q function.
ℎ← 𝑎𝑟𝑔𝑚𝑎𝑥𝑄(ℎ′|ℎ)
ℎ′

Derivation of the k-Means Algorithm

The k-means problem is to estimate the parameters θ = ⟨µ1 … µk⟩ that define the means of the k Normal distributions. We are given the observed data X = {⟨xi⟩}; the hidden variables Z = {⟨zi1, …, zik⟩} indicate which of the k Normal distributions was used to generate each xi.

To apply the EM algorithm we must derive an expression for Q(h'|h), and to do that we first derive an expression for ln p(Y|h'). The probability p(yi|h') of a single instance yi = ⟨xi, zi1, …, zik⟩ of the full data can be written as

p(yi|h') = p(xi, zi1, …, zik|h') = (1/√(2πσ²)) e^{−(1/(2σ²)) Σ_{j=1}^{k} zij (xi − µj')²}


Here only one of the zij can have the value 1; all others must be 0. Given this probability for a single instance, p(yi|h'), the logarithm of the probability ln P(Y|h') for all m instances in the data is

ln P(Y|h') = ln ∏_{i=1}^{m} p(yi|h')
           = Σ_{i=1}^{m} ln p(yi|h')
           = Σ_{i=1}^{m} [ ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} zij (xi − µj')² ]

Note that ln P(Y|h') is a linear function of the zij.


In general, for any function f(z) that is a linear function of z, the following equality holds:

E[f(z)] = f(E[z])

Therefore

E[ln P(Y|h')] = E[ Σ_{i=1}^{m} ( ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} zij (xi − µj')² ) ]
             = Σ_{i=1}^{m} ( ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[zij] (xi − µj')² )

Therefore the function Q(h'|h) for the k-means problem is

Q(h'|h) = Σ_{i=1}^{m} ( ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[zij] (xi − µj')² )

where h' = ⟨µ1', …, µk'⟩ and each E[zij] is calculated based on the current hypothesis h and the observed data X. From the k-means Gaussians,

E[zij] = e^{−(1/(2σ²))(xi − µj)²} / Σ_{n=1}^{k} e^{−(1/(2σ²))(xi − µn)²}   ---(9)

Thus,
• The first (estimation) step of the EM algorithm defines the Q function based on
the estimated E[zij] terms.
• The second (maximization) step then finds the values 𝜇′1, … , 𝜇′𝑘 that maximize
this Q function.
In the current case


argmax_{h'} Q(h'|h) = argmax_{h'} Σ_{i=1}^{m} ( ln (1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1}^{k} E[zij] (xi − µj')² )

                   = argmin_{h'} Σ_{i=1}^{m} Σ_{j=1}^{k} E[zij] (xi − µj')²   ---(10)
Therefore, the maximum likelihood hypothesis here minimizes a weighted sum of squared errors, where the contribution of each instance xi to the error that defines µj' is weighted by E[zij]. The quantity given by Equation (10) is minimized by setting each µj' to the weighted sample mean

µj ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]   ---(11)

Equations (9) and (11) define the two steps of the k-means (EM) algorithm: the E-step computes the expected values E[zij], and the M-step computes the new weighted sample means.
