## 7 Rev ML Module-2 Bayesian Learning

### Maximum Likelihood (ML) Hypothesis

- If we assume that every hypothesis in H is equally probable a priori, i.e. $P(h_i) = P(h_j)$ for all $h_i$ and $h_j$ in H, then we need only consider $P(D \mid h)$ to find the most probable hypothesis.
- $P(D \mid h)$ is often called the likelihood of the data D given h.
- Any hypothesis that maximizes $P(D \mid h)$ is called a maximum likelihood (ML) hypothesis, $h_{ML}$:

$$h_{ML} = \arg\max_{h \in H} P(D \mid h)$$

### Maximum Likelihood and Least-Squared Error Hypotheses

- Many learning approaches, such as neural network learning, linear regression, and polynomial curve fitting, try to learn a continuous-valued target function.
- Under certain assumptions, any learning algorithm that minimizes the squared error between the hypothesis predictions and the training data will output a maximum likelihood hypothesis.
- The significance of this result is that it provides a Bayesian justification (under certain assumptions) for many neural network and other curve-fitting methods that attempt to minimize the sum of squared errors over the training data.

Deriving $h_{ML}$:

- To find the maximum likelihood hypothesis, we start with our earlier definition, but use lower-case p to refer to the probability density function:

$$h_{ML} = \arg\max_{h \in H} p(D \mid h)$$

- We assume a fixed set of training instances $(x_1, \ldots, x_m)$ and therefore consider the data D to be the corresponding sequence of target values $D = (d_1, \ldots, d_m)$, where $d_i = f(x_i) + e_i$.
- Assuming the training examples are mutually independent given h, we can write $p(D \mid h)$ as the product of the individual $p(d_i \mid h)$:

$$h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)$$

- If the noise $e_i$ is drawn from a Normal distribution with zero mean and variance $\sigma^2$, each $p(d_i \mid h)$ is a Normal density centred at $h(x_i)$; taking the logarithm and dropping the terms that do not depend on h turns the product into $-\sum_{i=1}^{m}(d_i - h(x_i))^2$, so maximizing the likelihood is the same as minimizing the sum of squared errors.
- Hence the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of the squared errors between the observed training values $d_i$ and the hypothesis predictions $h(x_i)$.
- This holds under the assumption that the observed training values $d_i$ are generated by adding random noise to the true target value, where the noise is drawn independently for each example from a Normal distribution with zero mean.
- Similar derivations can be performed starting with other assumed noise distributions, producing different results.
- Why is it reasonable to choose the Normal distribution to characterize noise?
  - One reason is that it allows for a mathematically straightforward analysis.
  - A second reason is that the smooth, bell-shaped distribution is a good approximation to many types of noise in physical systems.
- Minimizing the sum of squared errors is a common approach in many neural network, curve-fitting, and other methods for approximating real-valued functions.
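To make the connection concrete, here is a minimal sketch (not part of the original notes) in which data are generated as $d_i = f(x_i) + e_i$ with zero-mean Gaussian noise; the particular target function, noise level, and alternative hypothesis are invented for illustration. It checks that the least-squares fit also has the higher Gaussian log-likelihood, as the derivation above suggests.

```python
# Sketch: least-squared error vs. maximum likelihood under zero-mean Gaussian noise.
# The target function f, noise level sigma, and the perturbed hypothesis are
# illustrative assumptions, not taken from the notes.
import numpy as np

rng = np.random.default_rng(0)

# Training data: d_i = f(x_i) + e_i, with e_i ~ Normal(0, sigma^2)
f = lambda x: 2.0 * x + 1.0          # assumed "true" target function
sigma = 0.5
x = np.linspace(0.0, 1.0, 50)
d = f(x) + rng.normal(0.0, sigma, size=x.shape)

def gaussian_log_likelihood(pred, d, sigma):
    """log p(D | h) when d_i = h(x_i) + Gaussian noise with standard deviation sigma."""
    n = len(d)
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - np.sum((d - pred) ** 2) / (2.0 * sigma**2))

# Hypothesis space: straight lines h(x) = w1*x + w0.
# The least-squares fit minimizes sum_i (d_i - h(x_i))^2 ...
w1, w0 = np.polyfit(x, d, deg=1)
ls_pred = w1 * x + w0

# ... and any other line has both a larger squared error and a smaller
# Gaussian log-likelihood, consistent with the least-squares fit being h_ML.
other_pred = (w1 + 0.3) * x + w0
print("squared error (LS fit):   ", np.sum((d - ls_pred) ** 2))
print("squared error (other h):  ", np.sum((d - other_pred) ** 2))
print("log-likelihood (LS fit):  ", gaussian_log_likelihood(ls_pred, d, sigma))
print("log-likelihood (other h): ", gaussian_log_likelihood(other_pred, d, sigma))
```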
### Bayes Optimal Classifier

- Normally we consider: what is the most probable hypothesis given the training data?
- We can also consider: what is the most probable classification of a new instance given the training data?
- Consider a hypothesis space containing three hypotheses, $h_1$, $h_2$, and $h_3$:
  - Suppose that the posterior probabilities of these hypotheses given the training data are 0.4, 0.3, and 0.3 respectively. Thus, $h_1$ is the MAP hypothesis.
  - Suppose a new instance x is encountered, which is classified positive by $h_1$ but negative by $h_2$ and $h_3$.
  - Taking all hypotheses into account, the probability that x is positive is 0.4 (the probability associated with $h_1$), and the probability that it is negative is 0.6.
  - The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.
- The most probable classification of the new instance is therefore obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
- If the possible classification of the new example can take on any value $v_j$ from some set V, then the probability $P(v_j \mid D)$ that the correct classification for the new instance is $v_j$ is:

$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$

- The Bayes optimal classification is:

$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$$

- Although the Bayes optimal classifier obtains the best performance that can be achieved from the given training data, it can be quite costly to apply: it computes the posterior probability of every hypothesis in H and then combines the predictions of all hypotheses to classify each new instance.
- An alternative, less optimal method is the Gibbs algorithm:
  1. Choose a hypothesis h from H at random, according to the posterior probability distribution over H.
  2. Use h to predict the classification of the next instance x.
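As a concrete illustration (not part of the original notes), the sketch below applies the Bayes optimal classification rule to the three-hypothesis example above; the dictionary layout and the function name `bayes_optimal` are illustrative choices.

```python
# Bayes optimal classification on the h1/h2/h3 example:
# posteriors 0.4/0.3/0.3, with h1 predicting '+' and h2, h3 predicting '-'.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h_i | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}    # each h_i's class for the instance x

def bayes_optimal(posterior, prediction, classes=("+", "-")):
    """Return argmax_{v_j} sum_{h_i} P(v_j | h_i) P(h_i | D).

    Here P(v_j | h_i) is 1 if hypothesis h_i predicts class v_j, and 0 otherwise.
    """
    score = {v: sum(p for h, p in posterior.items() if prediction[h] == v)
             for v in classes}
    return max(score, key=score.get), score

label, score = bayes_optimal(posterior, prediction)
print(score)   # {'+': 0.4, '-': 0.6}
print(label)   # '-': differs from the MAP hypothesis h1, which predicts '+'
```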

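For comparison, here is a minimal sketch (again not from the original notes) of the two-step Gibbs algorithm, reusing the same invented posterior and per-hypothesis predictions; the random seed and helper name are assumptions made for illustration.

```python
# Gibbs algorithm: sample one hypothesis according to P(h | D), then
# classify the next instance using that single hypothesis.
import random

posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h_i | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}    # each h_i's class for the instance x

def gibbs_classify(posterior, prediction, rng):
    # Step 1: choose a hypothesis h from H at random, according to P(h | D).
    hypotheses = list(posterior)
    weights = [posterior[h] for h in hypotheses]
    h = rng.choices(hypotheses, weights=weights, k=1)[0]
    # Step 2: use h alone to predict the classification of the next instance.
    return h, prediction[h]

rng = random.Random(0)
for _ in range(3):
    print(gibbs_classify(posterior, prediction, rng))  # e.g. ('h2', '-')
```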