Maximum Likelihood (ML) Hypothesis, h_ML
If we assume that every hypothesis in H is equally probable,
i.e. P(h_i) = P(h_j) for all h_i and h_j in H,
then we need only consider P(D|h) to find the most probable hypothesis.
P(D|h) is often called the likelihood of the data D given h.
Any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, h_ML.
$h_{ML} = \arg\max_{h \in H} P(D|h)$
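This reduction follows directly from Bayes' theorem; a short sketch, assuming the usual definition of the MAP hypothesis, $h_{MAP} = \arg\max_{h \in H} P(h|D)$, and an equal prior P(h) for every hypothesis in H:

```latex
\begin{aligned}
h_{MAP} &= \arg\max_{h \in H} P(h \mid D)
         = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} \\
        &= \arg\max_{h \in H} P(D \mid h)\,P(h)
         && \text{($P(D)$ does not depend on $h$)} \\
        &= \arg\max_{h \in H} P(D \mid h) = h_{ML}
         && \text{(equal priors: $P(h)$ is the same for every $h$)}
\end{aligned}
```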
Maximum Likelihood and Least-Squared Error Hypotheses
Many learning approaches such as neural network learning, linear regression, and
polynomial curve fitting try to learn a continuous-valued target function.
Under certain assumptions, any learning algorithm that minimizes the squared error
between the output hypothesis predictions and the training data will output
a maximum likelihood hypothesis.
— Any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis, h_ML
— $h_{ML} = \arg\max_{h \in H} P(D|h)$
The significance of this result is that it provides a Bayesian justification (under certain
assumptions) for many neural network and other curve fitting methods that attempt to
minimize the sum of squared errors over the training data.

Maximum Likelihood and Least-Squared Error Hypotheses — Deriving h_ML
In order to find the maximum likelihood hypothesis, we start with our earlier definition
but using lower case p to refer to the probability density function.
$h_{ML} = \arg\max_{h \in H} p(D|h)$
We assume a fixed set of training instances $(x_1, \ldots, x_m)$ and therefore consider the data
D to be the corresponding sequence of target values $D = (d_1, \ldots, d_m)$.
Here $d_i = f(x_i) + e_i$, where $e_i$ is a random noise term. Assuming the training examples are
mutually independent given h, we can write p(D|h) as the product of the various $p(d_i|h)$:
$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i|h)$
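The next slide states the least-squared-error result without the intermediate steps; here is a sketch of the standard derivation, assuming each noise term $e_i$ is drawn independently from a Normal distribution with zero mean and a fixed (possibly unknown) variance $\sigma^2$, so that $p(d_i|h)$ is a Normal density centred at $h(x_i)$:

```latex
\begin{aligned}
h_{ML} &= \arg\max_{h \in H} \prod_{i=1}^{m}
          \frac{1}{\sqrt{2\pi\sigma^2}}
          \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right) \\
       &= \arg\max_{h \in H} \sum_{i=1}^{m}
          -\frac{(d_i - h(x_i))^2}{2\sigma^2}
          && \text{(take logarithms and drop terms constant in $h$)} \\
       &= \arg\min_{h \in H} \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)^2
\end{aligned}
```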
Maximum Likelihood and Least-Squared Error Hypotheses
The maximum likelihood hypothesis h_ML is the one that minimizes the sum of the
squared errors between the observed training values $d_i$ and the hypothesis predictions $h(x_i)$.
This holds under the assumption that the observed training values $d_i$ are generated by
adding random noise to the true target value, where this random noise is drawn
independently for each example from a Normal distribution with zero mean.
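A small numerical illustration of this equivalence. This is only a sketch under the stated assumptions: the dataset, the grid of linear hypotheses, and the noise level $\sigma = 0.1$ are invented for the example, and scipy is used only to evaluate the Normal density.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Training data: d_i = f(x_i) + e_i with f(x) = 2x + 1 and Gaussian noise.
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

# Hypothesis space: lines h(x) = w0 + w1*x over a coarse grid of (w0, w1).
w0, w1 = np.meshgrid(np.linspace(0.0, 2.0, 81), np.linspace(1.0, 3.0, 81))
pred = w0 + w1 * x[:, None, None]          # predictions, shape (20, 81, 81)

# Sum of squared errors of every hypothesis on the grid.
sse = ((d[:, None, None] - pred) ** 2).sum(axis=0)

# Gaussian log-likelihood of the data under every hypothesis (fixed sigma).
log_lik = norm.logpdf(d[:, None, None], loc=pred, scale=0.1).sum(axis=0)

# The hypothesis that minimizes the squared error is the one that
# maximizes the likelihood.
assert np.unravel_index(sse.argmin(), sse.shape) == \
       np.unravel_index(log_lik.argmax(), log_lik.shape)
```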
Similar derivations can be performed starting with other assumed noise distributions,
producing different results.
Why is it reasonable to choose the Normal distribution to characterize noise?
— One reason is that it allows for a mathematically straightforward analysis.
— A second reason is that the smooth, bell-shaped distribution is a good approximation to
many types of noise in physical systems.
Minimizing the sum of squared errors is a common approach in many neural network,
curve fitting, and other approaches to approximating real-valued functions.

Bayes Optimal Classifier
Normally we consider:
— What is the most probable hypothesis given the training data?
We can also consider:
— What is the most probable classification of the new instance given the training
data?
Consider a hypothesis space containing three hypotheses, h1, h2, and h3.
— Suppose that the posterior probabilities of these hypotheses given the training data are .4, .3, and .3 respectively.
— Thus, h1 is the MAP hypothesis.
— Suppose a new instance x is encountered, which is classified positive by h1, but negative by h2 and h3.
— Taking all hypotheses into account, the probability that x is positive is .4 (the probability associated with h1), and the probability that it is negative is .6.
— The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.

Bayes Optimal Classifier
The most probable classification of the new instance is obtained by combining the
predictions of all hypotheses, weighted by their posterior probabilities.
If the possible classification of the new example can take on any value $v_j$ from some
set V, then the probability $P(v_j|D)$ that the correct classification for the new instance
is $v_j$ is:

$P(v_j|D) = \sum_{h_i \in H} P(v_j|h_i)\, P(h_i|D)$
Bayes optimal classification:
$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)\, P(h_i|D)$
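A minimal sketch of this computation in Python, using the three-hypothesis example from the previous slide. The posteriors (.4, .3, .3), the per-hypothesis prediction tables, and the function name bayes_optimal_classification are illustrative assumptions, not part of the source.

```python
def bayes_optimal_classification(posteriors, predictions, values):
    """Return argmax over v in values of sum_i P(v|h_i) * P(h_i|D).

    posteriors  -- list of posteriors P(h_i|D), one per hypothesis
    predictions -- list of dicts giving P(v|h_i) for each hypothesis
    values      -- the set V of possible classifications
    """
    def weighted_vote(v):
        return sum(p_h * p_v_given_h[v]
                   for p_h, p_v_given_h in zip(posteriors, predictions))
    return max(values, key=weighted_vote)


# Example from the slide: P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3;
# h1 classifies x as positive, h2 and h3 classify it as negative.
posteriors = [0.4, 0.3, 0.3]
predictions = [
    {"+": 1.0, "-": 0.0},   # h1
    {"+": 0.0, "-": 1.0},   # h2
    {"+": 0.0, "-": 1.0},   # h3
]
print(bayes_optimal_classification(posteriors, predictions, ["+", "-"]))
# Prints "-": P(+|D) = .4 and P(-|D) = .6, so the optimal classification is negative.
```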
Bayes Optimal Classifier
Although the Bayes optimal classifier obtains the best performance that can be
achieved from the given training data, it can be quite costly to apply.
— The expense is due to the fact that it computes the posterior probability for every
hypothesis in H and then combines the predictions of each hypothesis to classify
each new instance.
An alternative, less optimal method is the Gibbs algorithm (a sketch follows the steps below):
1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H.
2. Use h to predict the classification of the next instance x.
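A minimal sketch of the two steps in Python, reusing the three-hypothesis example above. The posterior list, the per-hypothesis prediction functions, and the name gibbs_classify are illustrative assumptions.

```python
import random

def gibbs_classify(posteriors, predictions, x, rng=random):
    """Gibbs algorithm sketch: sample a single hypothesis according to its
    posterior probability, then use only that hypothesis to classify x."""
    # Step 1: choose h from H at random, weighted by P(h|D).
    (chosen,) = rng.choices(range(len(posteriors)), weights=posteriors, k=1)
    # Step 2: use the chosen hypothesis to predict the classification of x.
    return predictions[chosen](x)


# Reusing the example: h1 predicts positive, h2 and h3 predict negative.
posteriors = [0.4, 0.3, 0.3]
predictions = [lambda x: "+", lambda x: "-", lambda x: "-"]
print(gibbs_classify(posteriors, predictions, x="some new instance"))
```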