
UNIT III

Bayesian Learning

1. Significance of Bayesian Learning in Machine Learning


• Bayesian learning algorithms calculate explicit probabilities for hypotheses.
• The naive Bayes classifier is competitive with other learning algorithms in many cases
and in some cases outperforms other methods.
• Bayesian methods provide a useful perspective for understanding many learning
algorithms that do not explicitly manipulate probabilities.

2. Features of Bayesian learning methods


• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis.
• New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they
can provide a standard of optimal decision making against which other practical
methods can be measured.
Prior knowledge is provided by
(1) a prior probability for each candidate hypothesis
(2) a probability distribution over observed data for each possible hypothesis

3. Drawbacks of Bayesian learning


• Bayesian methods require initial knowledge of many probabilities.
• Probabilities are estimated based on background knowledge, previously available
data, and assumptions about the form of the underlying distributions.
• Significant computational cost is required to determine the Bayes optimal hypothesis.

4. Bayesian Theorem
• Let D be a data sample whose class label is unknown
• Let h be a hypothesis that D belongs to class C
• Determine P(h|D): probability of h given D
• P(h): prior probability of hypothesis h
• P(D): prior probability of training data
• P(D|h): probability of D given h
Given training data D, the posterior probability of a hypothesis h, P(h | D), follows from Bayes
theorem:

  P(h | D) = P(D | h) P(h) / P(D)


• Consider the set of candidate hypotheses H; find the most probable hypothesis h ∈ H for
data D. Any such maximally probable hypothesis is called a maximum a posteriori
(MAP) hypothesis.
• We can determine the MAP hypotheses by using Bayes theorem to calculate the
posterior probability of each candidate hypothesis.
• hMAP is a MAP hypothesis provided

  hMAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D) = argmax_{h ∈ H} P(D | h) P(h)

(In the final step, P(D) is dropped because it is a constant independent of h.)
• P(D | h) is called the likelihood of the data D given h, and any hypothesis that
maximizes P(D | h) is called a maximum likelihood (ML) hypothesis, hML:

  hML = argmax_{h ∈ H} P(D | h)
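As a minimal numeric sketch (the priors and likelihoods below are invented for illustration, not from the text), the following Python snippet picks the MAP and ML hypotheses from explicit P(h) and P(D | h) tables:

```python
# Hypothetical priors P(h) and likelihoods P(D|h) for three candidate hypotheses.
priors = {"h1": 0.3, "h2": 0.5, "h3": 0.2}
likelihoods = {"h1": 0.9, "h2": 0.2, "h3": 0.6}   # P(D|h)

# MAP: maximize P(D|h) * P(h); P(D) is constant and can be ignored.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML: maximize P(D|h) alone (equivalent to MAP under a uniform prior).
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map, h_ml)  # h1 h1 (here the two happen to coincide)
```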

5. BAYES THEOREM AND CONCEPT LEARNING


• What is the relationship between Bayes theorem and concept learning?
• Bayes theorem calculates the posterior probability of each hypothesis given the
training data.
• We can use it to calculate the posterior probability of each possible hypothesis and then
output the most probable.

Brute-Force Bayes Concept Learning


• Learner considers some finite hypothesis space H defined over the instance space X,
in which the task is to learn some target concept c : X → {0,1}.
• Learner is given some sequence of training examples (<x1,d1>. . . <xm,dm>) where
xi is some instance from X and where di is the target value of xi (i.e., di = c(xi)).
• We can design a concept learning algorithm to output the maximum a posteriori
hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING Algorithm


1. For each hypothesis h in H, calculate the posterior probability

   P(h | D) = P(D | h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability:

   hMAP = argmax_{h ∈ H} P(h | D)

Assumptions
1. The training data D is noise free (di = c(xi)).
2. The target concept c is contained in the hypothesis space H
3. We have no a priori reason to believe that any hypothesis is more probable than any
other.
• What values should we specify for P(h)?
• Given no prior knowledge that one hypothesis is more likely than another, assign the
same prior probability to every hypothesis h in H.
• Since the prior probabilities must sum to 1, choose

  P(h) = 1 / |H| for all h in H

• Because the training data is noise free, the probability of observing classification di
given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Hence

  P(D | h) = 1 if di = h(xi) for all di in D, and 0 otherwise

Applying Bayes theorem:

• Consider the case where h is inconsistent with the training data D. Then

  P(h | D) = (0 · P(h)) / P(D) = 0

• The posterior probability of a hypothesis inconsistent with D is zero.


• When h is consistent with D, P(D | h) = 1, so

  P(h | D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|

• where VSH,D is the version space: the subset of hypotheses from H that are consistent
with D, and |VSH,D| is the number of such hypotheses.
• P(D) = |VSH,D| / |H| can be derived from the theorem of total probability and the fact
that the hypotheses are mutually exclusive:

  P(D) = Σ_{hi ∈ H} P(D | hi) P(hi) = Σ_{hi ∈ VSH,D} 1 · (1/|H|) = |VSH,D| / |H|

• The above analysis implies that under our choice for P(h) and P(D | h), every consistent
hypothesis has posterior probability 1/|VSH,D|, and every inconsistent hypothesis
has posterior probability 0. Every consistent hypothesis is, therefore, a MAP
hypothesis.
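To make BRUTE-FORCE MAP LEARNING concrete, here is a small illustrative sketch; the hypothesis space, training examples, and uniform prior are all hypothetical:

```python
# Brute-force MAP learning over a tiny hypothesis space (illustrative only).
# Each hypothesis maps an instance to {0, 1}.
H = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
    "ge_5":     lambda x: 1 if x >= 5 else 0,
}
D = [(3, 0), (7, 1), (5, 1)]  # noise-free training examples (x_i, d_i)

prior = 1.0 / len(H)  # uniform prior P(h) = 1/|H|

posteriors = {}
for name, h in H.items():
    # P(D|h) is 1 if h is consistent with every example, else 0.
    likelihood = 1.0 if all(h(x) == d for x, d in D) else 0.0
    posteriors[name] = likelihood * prior  # unnormalized P(h|D)

# Normalizing by P(D) = |VS_{H,D}|/|H| gives 1/|VS| for each consistent h.
p_D = sum(posteriors.values())
posteriors = {name: p / p_D for name, p in posteriors.items()}
print(posteriors)  # only "ge_5" is consistent, so it gets posterior 1.0
```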

6. Bayes optimal classifier

• So far we have asked which is the most probable hypothesis given the training data. A
related, more significant question is: what is the most probable classification of a new
instance given the training data?


• Consider a hypothesis space containing three hypotheses, h1, h2, and h3.
Suppose that the posterior probabilities of these hypotheses given the training data are .4, .3,
and .3 respectively. h1 is the MAP hypothesis. Suppose a new instance x is encountered,
which is classified positive by h1, but negative by h2 and h3. Taking all hypotheses into
account, the probability that x is positive is .4 and the probability that it is negative is
therefore .6. The most probable classification (negative) in this case is different from the
classification generated by the MAP hypothesis. The most probable classification of the new
instance is obtained by combining the predictions of all hypotheses, weighted by their
posterior probabilities. If the possible classification of the new example can take on any
value vj from some set V, then the probability P(vj | D) that the correct classification for
the new instance is vj is

  P(vj | D) = Σ_{hi ∈ H} P(vj | hi) P(hi | D)

The optimal classification of the new instance is the value vj for which P(vj | D) is
maximum.
Bayes optimal classification
Any system that classifies new instances according to

  argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)
is called Bayes optimal classifier or Bayes optimal learner. This method maximizes the
probability that the new instance is classified correctly, given the available data, hypothesis
space, and prior probabilities over the hypotheses.

Predictions made by Bayes optimal classifier can correspond to a hypothesis not contained in
H. Bayes optimal classifier effectively considers a hypothesis space H' different from the
space of hypotheses H to which Bayes theorem is being applied. H' effectively includes
hypotheses that perform comparisons between linear combinations of predictions from
multiple hypotheses in H.
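The three-hypothesis example above can be verified directly; a minimal sketch in Python:

```python
# Posteriors P(h_i|D) and each hypothesis's vote, from the example in the text.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction = {"h1": "+", "h2": "-", "h3": "-"}  # classification of x by each h

# P(v|D) = sum over hypotheses of P(v|h_i) * P(h_i|D); here P(v|h_i) is 0 or 1.
p_v = {"+": 0.0, "-": 0.0}
for h, post in posterior.items():
    p_v[prediction[h]] += post

v_star = max(p_v, key=p_v.get)
print(p_v, v_star)  # {'+': 0.4, '-': 0.6} '-' : differs from the MAP hypothesis h1
```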

7. Naive Bayes Classifier

The naive Bayes classifier applies to learning tasks where each instance x is described by a
conjunction of attribute values and where the target function f(x) can take on any value
from some finite set V. A set of training examples of the target function is provided, and a
new instance is presented, described by the tuple of attribute values (a1, a2, . . . , an). The
learner is asked to predict the target value, or classification, for this new instance.
The Bayesian approach to classifying the new instance is to assign the most probable target
value, vMAP, given the attribute values (a1, a2, . . . , an) that describe the instance:

  vMAP = argmax_{vj ∈ V} P(vj | a1, a2, . . . , an)

We can use Bayes theorem to rewrite this expression as

  vMAP = argmax_{vj ∈ V} P(a1, . . . , an | vj) P(vj)

Estimate each of the P(vj) by counting the frequency with which each target value vj occurs
in the training data. The naive Bayes classifier is based on the simplifying assumption that
the attribute values are conditionally independent given the target value, so that
P(a1, . . . , an | vj) = Πi P(ai | vj). The approach used by the naive Bayes classifier is therefore:

  vNB = argmax_{vj ∈ V} P(vj) Πi P(ai | vj)

Naive Bayes learning method: a learning step in which the various P(vj) and P(ai | vj) terms
are estimated, based on their frequencies over the training data. The set of these estimates
corresponds to the learned hypothesis. This hypothesis is then used to classify each new
instance by applying the rule above.

Whenever the naive Bayes assumption of conditional independence is satisfied, this naive
Bayes classification vNB is identical to the MAP classification vMAP.

The difference between the naive Bayes learning method and other learning methods is that
there is no explicit search through the space of possible hypotheses (here, the space of
possible values that can be assigned to the various P(vj) and P(ai | vj) terms). The hypothesis
is formed without searching, simply by counting the frequency of various data combinations
within the training examples.
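A minimal frequency-counting sketch of this learning method (the toy attribute data is invented for illustration):

```python
from collections import Counter, defaultdict

# Toy training data: each example is (attribute tuple, target value).
train = [
    (("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
    (("rain",  "mild"), "yes"), (("rain",  "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]

# Learning step: estimate P(v_j) and P(a_i|v_j) by counting frequencies.
class_counts = Counter(v for _, v in train)
attr_counts = defaultdict(Counter)  # attr_counts[(i, v)][a] = count
for attrs, v in train:
    for i, a in enumerate(attrs):
        attr_counts[(i, v)][a] += 1

def classify(attrs):
    # v_NB = argmax_v P(v) * prod_i P(a_i|v)
    def score(v):
        p = class_counts[v] / len(train)
        for i, a in enumerate(attrs):
            p *= attr_counts[(i, v)][a] / class_counts[v]
        return p
    return max(class_counts, key=score)

print(classify(("rain", "cool")))  # -> "yes"
```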

ESTIMATING PROBABILITIES
To avoid assigning zero probability when an attribute value never co-occurs with a class in
the training data, we use a Bayesian approach to estimating the probability, the m-estimate,
defined as follows.

m-estimate of probability:

  P(ai | cj) ≈ (nc + m p) / (n + m)

where
– n : number of training examples for which c = cj
– nc : number of examples for which c = cj and a = ai
– p : prior estimate of P(ai | cj)
– m : weight given to the prior (number of “virtual” examples)
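A one-function sketch of the m-estimate (the example numbers are placeholders):

```python
def m_estimate(nc, n, p, m):
    """m-estimate of P(a_i|c_j): (nc + m*p) / (n + m).

    nc: count of examples with class c_j and attribute value a_i
    n:  count of examples with class c_j
    p:  prior estimate of the probability (e.g., 1/k for k attribute values)
    m:  equivalent sample size ("virtual" examples)
    """
    return (nc + m * p) / (n + m)

# With no observed co-occurrences (nc = 0), the estimate stays near the prior
# instead of collapsing to zero:
print(m_estimate(nc=0, n=10, p=0.5, m=2))  # 0.0833... rather than 0.0
```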

8. BAYESIAN BELIEF NETWORKS


Naive Bayes classifier assumes that the values of the attributes a1 . . .an, are conditionally
independent given the target value v. This assumption reduces the complexity of learning the
target function. A Bayesian belief network describes the probability distribution of a set of
variables by specifying a set of conditional independence assumptions along with a set of
conditional probabilities. Whereas the naive Bayes classifier assumes that all the variables are
conditionally independent given the value of the target variable, Bayesian belief networks
allow stating conditional independence assumptions that apply to subsets of the variables.
Joint probability distribution
• Bayesian belief network describes the probability distribution over a set of variables.
• Consider an arbitrary set of random variables Y1 . . . Yn, where each variable Yi can
take on the set of possible values V(Yi).
• We define the joint space of the set of variables Y to be the cross product V(Y1) ×
V(Y2) × · · · × V(Yn).
Each item in the joint space corresponds to one of the possible assignments of values to the
tuple of variables (Y1 . . . Yn).
The probability distribution over this joint space is called the joint probability distribution.
The joint probability distribution specifies the probability for each of the possible variable
bindings for the tuple (Y1 . . . Yn).
A Bayesian belief network describes the joint probability distribution for a set of variables.

9. Conditional Independence
Let X, Y, and Z be three discrete-valued random variables. We say that X is conditionally
independent of Y given Z if the probability distribution governing X is independent of the
value of Y given a value for Z; that is, if

• where xi ƐV(X), yj Ɛ V(Y), and zk Ɛ V(Z).


• P(X/Y, Z) = P(X/Z).
We say that the set of variables X1 . . . Xl is conditionally independent of the set of
variables Y1 . . . Ym given the set of variables Z1 . . . Zn if

  P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)
The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent
of instance attribute A2 given the target value V. This allows the naive Bayes classifier to
calculate P(A1, A2 | V) as:

  P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V)

If A1 is conditionally independent of A2 given V, then by our definition of conditional
independence P(A1 | A2, V) = P(A1 | V), so this factors as

  P(A1, A2 | V) = P(A1 | V) P(A2 | V)

Representation
A Bayesian belief network represents the joint probability distribution for a set of
variables. A Bayesian network represents the joint probability distribution by specifying a set
of conditional independence assumptions (represented by a directed acyclic graph), together
with sets of local conditional probabilities. Each variable in the joint space is represented by a
node in the Bayesian network. For each variable, two types of information are specified. First,
the network arcs represent the assertion that the variable is conditionally independent of its
nondescendants in the network given its immediate predecessors in the network. We say X is
a descendant of Y if there is a directed path from Y to X. Second, a conditional
probability table is given for each variable, describing the probability distribution for that
variable given the values of its immediate predecessors. The joint probability for any desired
assignment of values (y1, . . . , yn) to the tuple of network variables (Y1 . . . Yn) can be
computed by the formula

  P(y1, . . . , yn) = Πi P(yi | Parents(Yi))

where Parents(Yi) denotes the set of immediate predecessors of Yi in the network. The values
P(yi | Parents(Yi)) are exactly the values stored in the conditional probability table associated
with node Yi.
The set of local conditional probability tables for all the variables, together with the set of
conditional independence assumptions described by the network, describe the full joint
probability distribution for the network.

To illustrate, the Bayesian network in Figure 6.3 represents the joint probability distribution
over the boolean variables Storm, Lightning, Thunder, Forest Fire, Campfire, and
BusTourGroup. Consider the node Campfire. The network nodes and arcs represent the
assertion that Campfire is conditionally independent of its nondescendants Lightning and
Thunder, given its immediate parents Storm and BusTourGroup. This means that once we
know the value of the variables Storm and BusTourGroup, the variables Lightning and
Thunder provide no additional information about Campfire. The right side of the figure
shows the conditional probability table associated with the variable Campfire. The top left
entry in this table, for example, expresses the assertion that
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4

Note this table provides only the conditional probabilities of Campfire given its parent
variables Storm and BusTourGroup.
One attractive feature of Bayesian belief networks is that they allow a convenient way to
represent causal knowledge such as the fact that Lightning causes Thunder. In the
terminology of conditional independence, we express this by stating that Thunder is
conditionally independent of other variables in the network, given the value of Lightning.
Note this conditional independence assumption is implied by the arcs in the Bayesian
network of Figure 6.3.
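Using the factored form P(y1, . . . , yn) = Πi P(yi | Parents(Yi)), a joint probability for part of the Figure 6.3 network can be computed as a product of CPT lookups. In the sketch below, only P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4 comes from the text; every other CPT entry is a hypothetical placeholder:

```python
# Hypothetical CPTs for part of the Figure 6.3 network; only the 0.4 entry
# for Campfire is taken from the text, the rest are made-up placeholders.
p_storm = {True: 0.2, False: 0.8}
p_bus = {True: 0.1, False: 0.9}
# P(Campfire | Storm, BusTourGroup), indexed by (storm, bus)
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}

def joint(storm, bus, campfire):
    # P(s, b, c) = P(s) * P(b) * P(c | s, b): Storm and BusTourGroup have no
    # parents, while Campfire's parents are Storm and BusTourGroup.
    pc = p_campfire[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (pc if campfire else 1 - pc)

print(joint(True, True, True))  # 0.2 * 0.1 * 0.4 = 0.008
```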

Inference
We can use a Bayesian network to infer the value of some target variable (e.g., ForestFire)
given the observed values of the other variables. The probability distribution for the target
variable specifies the probability that it will take on each of its possible values given the
observed values of the other variables. This inference step can be straightforward if values for
all of the other variables in the network are known exactly. We may wish to infer the
probability distribution for some variable (e.g., ForestFire) given observed values for only
a subset of the other variables (e.g., Thunder and BusTourGroup may be the only
observed values available).

A Bayesian network can be used to compute the probability distribution for any subset of
network variables given the values or distributions for any subset of the remaining variables.
Exact inference of probabilities for an arbitrary Bayesian network is known to be NP-hard.
Numerous methods have been proposed for probabilistic inference in Bayesian networks,
including exact inference methods and approximate inference methods that sacrifice precision
to gain efficiency.

10. Learning Bayesian Belief Networks


• Can we devise effective algorithms for learning Bayesian belief networks from
training data? Several settings for this learning problem can be considered:
1. The network structure might be given in advance, or it might have to be inferred from the
training data.
2. All the network variables might be directly observable in each training example, or some
might be unobservable.
If the network structure is given in advance and all the variables are fully observable in each
training example, learning is straightforward: we simply estimate the conditional probability
table entries from the observed frequencies. In the case where the network structure is given
but only some of the variable values are observable in the training data, the learning problem
is more difficult. This problem is somewhat analogous to learning the weights for the hidden
units in an artificial neural network, where the input and output node values are given but the
hidden unit values are left unspecified by the training examples.
A gradient ascent procedure can be used in this case: it searches through a space of
hypotheses that corresponds to the set of all possible entries for the conditional probability
tables. The objective function that is maximized during gradient ascent is the probability
P(D | h) of the observed training data D given the hypothesis h. This corresponds to searching
for the maximum likelihood hypothesis for the table entries.

11. Parametric Methods: Maximum Likelihood Estimation

Let us say we have an independent and identically distributed (iid) sample

  X = {x^t}, t = 1, . . . , N

We assume that the x^t are instances drawn from some known probability density family,
p(x | θ), defined up to parameters θ:

  x^t ∼ p(x | θ)
We want to find θ that makes sampling x^t from p(x | θ) as likely as possible. Because the x^t
are independent, the likelihood of parameter θ given sample X is the product of the
likelihoods of the individual points:

  l(θ | X) ≡ p(X | θ) = Π_{t=1}^{N} p(x^t | θ)

In maximum likelihood estimation, we are interested in finding θ that makes X the most
likely to be drawn. We thus search for θ that maximizes the likelihood, which we denote by
l(θ|X). We can maximize the log of the likelihood without changing the value where it takes
its maximum.

log(·) converts the product into a sum and leads to further computational simplification when
certain densities are assumed, for example, those containing exponents. The log likelihood is
defined as

  L(θ | X) ≡ log l(θ | X) = Σ_{t=1}^{N} log p(x^t | θ)

Let us now see some distributions that arise in the applications we are interested in. If we
have a two-class problem, the distribution we use is Bernoulli. When there are K > 2 classes,
its generalization is the multinomial. Gaussian (normal) density is the one most frequently
used for modeling class-conditional input densities with numeric input. For these three
distributions, we discuss the maximum likelihood estimators (MLE) of their parameters.

Bernoulli Density

In a Bernoulli distribution, there are two outcomes: an event occurs or it does not; for
example, an instance is a positive example of the class, or it is not. The event occurs and the
Bernoulli random variable X takes the value 1 with probability p, and the nonoccurrence of
the event has probability 1 − p and is denoted by X taking the value 0. This is written as

  P(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}

The expected value and variance can be calculated as

  E[X] = p,  Var(X) = p(1 − p)

Given a sample of size N, maximizing the log likelihood gives the estimate

  p̂ = (Σ_t x^t) / N

The estimate for p is the ratio of the number of occurrences of the event to the number of
experiments. Remembering that if X is Bernoulli with parameter p, E[X] = p; as expected, the
maximum likelihood estimator of the mean is the sample average.
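A quick numeric check of the Bernoulli MLE (the sample is invented):

```python
import math

x = [1, 0, 1, 1, 0, 1]   # invented Bernoulli sample
p_hat = sum(x) / len(x)  # MLE: the sample average, here 4/6

# The log likelihood L(p) = sum_t [x_t*log p + (1 - x_t)*log(1 - p)]
# is indeed highest at p_hat among a few nearby candidate values:
def log_lik(p):
    return sum(xt * math.log(p) + (1 - xt) * math.log(1 - p) for xt in x)

print(p_hat, max([0.3, 0.5, p_hat, 0.8], key=log_lik))  # both 0.666...
```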

12. k-Nearest Neighbor Estimator

The nearest neighbor class of estimators adapts the amount of smoothing to the local density
of data. The degree of smoothing is controlled by k, the number of neighbors taken into
account, which is much smaller than N, the sample size. Let us define a distance between a
and b, for example, |a − b|, and for each x, we define

d1(x) ≤ d2(x) ≤ · · · ≤ dN(x)

to be the distances arranged in ascending order, from x to the points in the sample: d1(x) is
the distance to the nearest sample, d2(x) is the distance to the next nearest, and so on. If x^t
are the data points, then we define d1(x) = min_t |x − x^t|, and if i is the index of the closest
sample, namely, i = argmin_t |x − x^t|, then d2(x) = min_{j ≠ i} |x − x^j|, and so forth.

The k-nearest neighbor (k-nn) density estimate is

  p̂(x) = k / (2 N d_k(x))
This is like a naive estimator with h = 2dk(x), the difference being that instead of fixing h and
checking how many samples fall in the bin, we fix k, the number of observations to fall in the
bin, and compute the bin size. Where density is high, bins are small, and where density is
low, bins are larger (see figure 8.4). The k-nn estimator is not continuous; its derivative has a
discontinuity at all points (x(j) + x(j+k))/2, where x(j) are the order statistics of the sample.
The k-nn estimate is not a probability density function since it integrates to ∞, not 1.

[Figure 8.4: k-nearest neighbor estimate for various k values.]

To get a smoother estimate, we can use a kernel function whose effect decreases with
increasing distance:

  p̂(x) = (1 / (N d_k(x))) Σ_{t=1}^{N} K((x − x^t) / d_k(x))

This is like a kernel estimator with adaptive smoothing parameter h = d_k(x). K(·) is typically
taken to be the Gaussian kernel.
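A direct sketch of the k-nn density estimate p̂(x) = k / (2 N d_k(x)), on an invented one-dimensional sample:

```python
# k-nn density estimate: p_hat(x) = k / (2 * N * d_k(x)), where d_k(x) is the
# distance from x to its k-th nearest sample point.
sample = [1.0, 1.2, 1.3, 2.0, 4.5, 4.7, 5.0]  # invented 1-d data
N = len(sample)

def knn_density(x, k=3):
    dists = sorted(abs(x - xt) for xt in sample)  # d_1(x) <= ... <= d_N(x)
    dk = dists[k - 1]
    return k / (2 * N * dk)

# Bins adapt to the local density: where data is dense the k-th neighbor is
# close (small bin, high estimate); where data is sparse the bin widens.
print(knn_density(1.2), knn_density(3.0))  # ~1.07 vs ~0.13
```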

13. Support Vector Machine: Introduction

Kernel machines are maximum margin methods that allow the model to be written as a sum
of the influences of a subset of the training instances. These influences are given by
application-specific similarity kernels, and we discuss “kernelized” classification, regression,
ranking, outlier detection and dimensionality reduction, and how to choose and use kernels.

Support vectors: support vectors are those data points that the margin pushes up against.
The distance from the hyperplane to the instances closest to it on either side is called
the margin, which we want to maximize for best generalization.

Maximum Margin

A linear classifier has the form f(x, w, b) = sign(w · x − b), where the output +1 denotes one
class and −1 the other. The maximum margin linear classifier is the linear classifier with the
maximum margin. Support vectors are those data points that the margin pushes up against.
This is the simplest kind of SVM (called an LSVM).
Linear SVM

The kernel function defines the space according to its notion of similarity, and a kernel
function is good if we have better separation in its corresponding space.

Classification Margin

• The distance from example x_i to the separator is r = (w^T x_i + b) / ||w||.
• Examples closest to the hyperplane are support vectors.
• The margin ρ of the separator is the distance between the support vectors.

Optimal Separating Hyperplane


Perceptron Revisited: Linear Separators

• Binary classification can be viewed as the task of separating classes in feature space. The
separator is the hyperplane

  w^T x + b = 0

with w^T x + b > 0 on one side and w^T x + b < 0 on the other, so the classifier is

  f(x) = sign(w^T x + b)

Now that we are using the hypothesis class of lines, the optimal separating hyperplane is the
one that maximizes the margin. The distance of x^t to the discriminant is

  |w^T x^t + w0| / ||w||

We would like to maximize the margin ρ, but there are an infinite number of solutions that we
can get by scaling w; for a unique solution, we fix ρ||w|| = 1, and thus, to maximize the
margin, we minimize ||w||. The task can therefore be defined (see Cortes and Vapnik 1995;
Vapnik 1995) as

  min (1/2)||w||^2  subject to  r^t (w^T x^t + w0) ≥ +1, ∀t

In finding the optimal hyperplane, we can convert the optimization problem to a form whose
complexity depends on N, the number of training instances, and not on d. Another advantage
of this new formulation is that it will allow us to rewrite the basis functions in terms of kernel
functions. To get the new formulation, we first write equation 13.3 as an unconstrained
problem using Lagrange multipliers α^t:

  Lp = (1/2)||w||^2 − Σ_t α^t [r^t (w^T x^t + w0) − 1]

This should be minimized with respect to w,w0 and maximized with respect to αt ≥ 0. The
saddle point gives the solution. This is a convex quadratic optimization problem because the
main term is convex and the linear constraints are also convex. Therefore, we can
equivalently solve the dual problem, making use of the Karush-Kuhn-Tucker conditions. The
dual is to maximize Lp with respect to α^t, subject to the constraints that the gradient of Lp
with respect to w and w0 is zero and also that α^t ≥ 0:

  Ld = Σ_t α^t − (1/2) Σ_t Σ_s α^t α^s r^t r^s (x^t)^T x^s
  subject to Σ_t α^t r^t = 0 and α^t ≥ 0, ∀t
This can be solved using quadratic optimization methods. The size of the dual depends on N,
sample size, and not on d, the input dimensionality. The upper bound for time complexity is
O(N^3), and the upper bound for space complexity is O(N^2).

Once we solve for α^t, we see that though there are N of them, most vanish with α^t = 0 and
only a small percentage have α^t > 0. The set of x^t whose α^t > 0 are the support vectors, and as
we see in equation 13.5, w is written as the weighted sum of these training instances that are
selected as the support vectors. These are the x^t that satisfy

  r^t (w^T x^t + w0) = 1

and lie on the margin. We can use this fact to calculate w0 from any support vector as

  w0 = r^t − w^T x^t
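For a concrete, library-based illustration, scikit-learn's SVC with a linear kernel solves this dual problem and exposes the resulting support vectors; a minimal sketch with invented two-dimensional data:

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable 2-d data with labels in {-1, +1}.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # a very large C approximates the hard margin
clf.fit(X, y)

# Only the instances with alpha^t > 0 are kept as support vectors;
# w is their weighted sum, and w0 is recovered from any one of them.
print(clf.support_vectors_)
print(clf.coef_, clf.intercept_)  # w and w0 of the separating hyperplane
```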

The Nonseparable Case: Soft Margin Hyperplane

If the data is not linearly separable, the algorithm we discussed earlier will not work. In such
a case, if the two classes are not linearly separable such that there is no hyperplane to separate
them, we look for the hyperplane that incurs the least error. We define slack variables,
ξ^t ≥ 0, which store the deviation from the margin. There are two types of deviation: an
instance may lie on the wrong side of the hyperplane and be misclassified, or it may be on the
right side but lie within the margin, namely, not sufficiently away from the hyperplane.
Relaxing equation 13.1, we require

  r^t (w^T x^t + w0) ≥ 1 − ξ^t

Adding the constraints, the Lagrangian of equation 13.4 then becomes

  Lp = (1/2)||w||^2 + C Σ_t ξ^t − Σ_t α^t [r^t (w^T x^t + w0) − 1 + ξ^t] − Σ_t μ^t ξ^t

where the μ^t are the Lagrange multipliers enforcing ξ^t ≥ 0.

The nonseparable instances that we store as support vectors are the instances that we would
have trouble correctly classifying if they were not in the training set; they would either be
misclassified or classified correctly but not with enough confidence. We can say that the
number of support vectors is an upper-bound estimate for the expected number of errors. The
expected test error rate is bounded as

  E_N[P(error)] ≤ E_N[number of support vectors] / N
where EN[·] denotes expectation over training sets of size N. The nice implication of this is
that it shows that the error rate depends on the number of support vectors and not on the input
dimensionality.

Hinge loss: Equation 13.9 implies that we count an error if the instance is on the wrong side
or if its margin is less than 1. This is called the hinge loss. If y^t = w^T x^t + w0 is the output
and r^t is the desired output, the hinge loss is defined as

  L_hinge(y^t, r^t) = 0 if y^t r^t ≥ 1, and 1 − y^t r^t otherwise
In figure 13.3, we compare hinge loss with 0/1 loss, squared error, and cross-entropy. We see
that unlike 0/1 loss, hinge loss also penalizes instances in the margin even though they may
be on the correct side, and the loss increases linearly as the instance moves away on the
wrong side. This is different from the squared loss, which is therefore not as robust as the
hinge loss. We see that cross-entropy minimized in logistic discrimination (section 10.7) or by the
linear perceptron (section 11.3) is a good continuous approximation to the hinge loss.
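A small sketch computing hinge loss alongside 0/1 loss for a few (output, desired) pairs, mirroring the comparison in figure 13.3 numerically:

```python
def hinge(y, r):
    # 0 if the instance is on the correct side with margin >= 1,
    # else grows linearly as y*r falls below 1.
    return max(0.0, 1.0 - y * r)

def zero_one(y, r):
    return 0.0 if y * r > 0 else 1.0

# (output y, desired r): correct and confident, correct but inside margin, wrong.
for y, r in [(2.0, 1), (0.5, 1), (-0.5, 1)]:
    print(y, hinge(y, r), zero_one(y, r))
# hinge: 0.0, 0.5, 1.5 -- penalizes in-margin and wrong-side points linearly;
# 0/1:   0.0, 0.0, 1.0 -- blind to the margin.
```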

C of equation 13.10 is the regularization parameter fine-tuned using cross-validation. It


defines the trade-off between margin maximization and error minimization: If it is too large,
we have a high penalty for nonseparable points, and we may store many support vectors and
overfit.

If it is too small, we may find solutions that are too simple and underfit. Typically, one
chooses from [10^−6, 10^−5, . . . , 10^+5, 10^+6] on the log scale by looking at the accuracy
on a validation set.

Defining Kernels

It is also possible to define application-specific kernels. Kernels are generally considered to
be measures of similarity, in the sense that K(x, y) takes a larger value as x and y become
more “similar” from the point of view of the application. This implies that any prior
knowledge we have regarding the application can be provided to the learner through
appropriately defined kernels; this is sometimes called “kernel engineering.”

There are string kernels, tree kernels, graph kernels, and so on depending on how we
represent the data and how we measure similarity in that representation.
For example, given two documents, the number of words appearing in both may be a kernel.
Let us say D1 and D2 are two documents. One possible representation is called bag of
words, where we predefine M words relevant for the application and define φ(D1) as the
M-dimensional binary vector whose dimension i is 1 if word i appears in D1 and 0
otherwise. Then φ(D1)^T φ(D2) counts the number of shared words. Here, we see that if we
directly define and implement K(D1, D2) as the number of shared words, we do not need to
preselect M words and can use any word in the vocabulary (after discarding uninformative
words like “of,” “and,” etc.); we would not need to generate the bag-of-words representation
explicitly, and it would be as if we allowed M to be as large as we want.
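A direct sketch of the shared-word document kernel, implemented without materializing the bag-of-words vectors (the stop-word list is a stand-in):

```python
def shared_words_kernel(d1: str, d2: str) -> int:
    # K(D1, D2) = number of distinct words appearing in both documents,
    # equivalent to phi(D1)^T phi(D2) for binary bag-of-words vectors,
    # without ever building the M-dimensional representation.
    stop = {"of", "and", "the", "a", "on"}  # discard uninformative words
    w1 = set(d1.lower().split()) - stop
    w2 = set(d2.lower().split()) - stop
    return len(w1 & w2)

print(shared_words_kernel("the cat sat on the mat",
                          "a cat and a dog sat"))  # 2: "cat" and "sat"
```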

Sometimes—for example, in bioinformatics applications—we can calculate a similarity score


between two objects, which may not necessarily be positive semidefinite. Given two strings
(of genes), a kernel measures the edit distance, namely, how many operations (insertions,
deletions, substitutions) it takes to convert one string into another; this is also called
alignment. In such a case, a trick is to define a set of M templates and represent an object as
the M-dimensional vector of scores to all the templates.

That is, if mi, i = 1, . . . , M are the templates and s(x^t, mi) is the score between x^t and mi,
then we define the M-dimensional vector

  φ(x^t) = [s(x^t, m1), s(x^t, m2), . . . , s(x^t, mM)]^T

and use the usual dot product in this space, K(x^t, x) = φ(x^t)^T φ(x); this is called the
empirical kernel map.
Sometimes, we have a binary score function; for example, two proteins may interact or not,
and we want to be able to generalize from this to scores for two arbitrary instances. In such a
case, a trick is to define a graph where the nodes are the instances and two nodes are linked if
they interact, that is, if the binary score returns 1. Then we say that two nodes that are not
immediately linked are “similar” if the path between them is short or if they are connected by
many paths. This converts pairwise local interactions to a global similarity measure, rather
like defining a geodesic distance used in Isomap, and it is called the diffusion kernel.

If p(x) is a probability density, then

  K(x^t, x) = p(x^t) p(x)

is a valid kernel. This is used when p(x) is a generative model for x, measuring how likely it is
that we see x. For example, if x is a sequence, p(x) can be a hidden Markov model. With this
kernel, K(xt , x) will take a high value if both xt and x are likely to have been generated by
the same model. It is also possible to parametrize the generative model as p(x | θ) and learn θ
from data; this is called the Fisher kernel.
