Bayesian Learning
h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of the
cases in which the disease is not present. Furthermore, .008 of the entire
population have this cancer.
P(cancer) = .008, \quad P(\neg cancer) = .992
P(+ \mid cancer) = .98, \quad P(- \mid cancer) = .02
P(+ \mid \neg cancer) = .03, \quad P(- \mid \neg cancer) = .97

P(cancer \mid +) = \frac{P(+ \mid cancer)\, P(cancer)}{P(+)}

P(\neg cancer \mid +) = \frac{P(+ \mid \neg cancer)\, P(\neg cancer)}{P(+)}
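Plugging in the quantities above:

P(+ \mid cancer)\, P(cancer) = .98 \times .008 = .0078
P(+ \mid \neg cancer)\, P(\neg cancer) = .03 \times .992 = .0298

so h_{MAP} = \neg cancer: even given the positive test result, the posterior probability of cancer is only .0078 / (.0078 + .0298) \approx .21.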
MAP Learner
Comments:
• Computationally intensive.
• Provides a standard for judging the performance of learning algorithms.
• Choosing P(h) and P(D|h) reflects our prior knowledge about the learning task.
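A minimal sketch of the brute-force MAP learner these comments refer to; the hypotheses, prior, and likelihood arguments are hypothetical placeholders to be supplied by the caller.

```python
# Brute-force MAP learning: score every hypothesis by P(D|h) * P(h)
# (proportional to the posterior P(h|D)) and return the best one.

def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the h in `hypotheses` maximizing likelihood(data, h) * prior(h).

    prior(h)            -> P(h)
    likelihood(data, h) -> P(D | h)
    """
    best_h, best_score = None, float("-inf")
    for h in hypotheses:
        score = likelihood(data, h) * prior(h)   # unnormalized posterior
        if score > best_score:
            best_h, best_score = h, score
    return best_h
```

Computing the likelihood for every hypothesis in H is exactly why the approach is computationally intensive for large hypothesis spaces.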
MAP Learner
• Assumptions:
– The training data D is noise free (i.e., di = c(xi))
– The target concept c is contained in the hypothesis space H .
– We have no a priori reason to believe that any hypothesis is
more probable than any other.
MAP Learner
• Case 1: If h is not consistent with D, then P(D|h) = 0 and hence P(h|D) = 0.
• Case 2: If h is consistent with D, then under the assumptions above P(h|D) = 1 / |VS_{H,D}|, where VS_{H,D} is the version space of H with respect to D.
• Derivation of hML
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
• The first term is independent of h, so it can be dropped (see the sketch below).
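The chain of steps this derivation follows, stated compactly (assuming each observed value is the true target corrupted by zero-mean Gaussian noise, d_i = f(x_i) + e_i with e_i \sim N(0, \sigma^2)):

h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}
= \arg\max_{h \in H} \sum_{i=1}^{m} \left( \ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(d_i - h(x_i))^2}{2\sigma^2} \right)
= \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
= \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2

The second-to-third step is exactly the removal of the h-independent first term noted above.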
• Limitations
– The above analysis considers noise only in the
target value of the training example and does
not consider noise in the attributes describing
the instances themselves.
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• Learn a nondeterministic (probabilistic) function f : X → {0, 1}, which has two discrete output values.
• Example: the instances x might describe a medical patient's symptoms, with f(x) = 1 if the patient survives and 0 otherwise.
• We might wish to learn a neural network (or other real-valued
function approximator) whose output is the probability that f (x) =
1.
• In this setting we want to learn the target function f' : X → [0, 1] such that f'(x) = P(f(x) = 1).
• What criterion should we optimize in order to find a maximum likelihood hypothesis for f' in this setting?
MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES
• The training data D is of the form D = {⟨x_1, d_1⟩, …, ⟨x_m, d_m⟩}, where d_i is the observed 0 or 1 value for f(x_i).
• P(D|h) can be written as shown below.
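Treating the m examples as independent and h(x_i) as the hypothesis's probability that d_i = 1 (with the instances x_i themselves independent of h), the standard form is

P(D \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i}\, (1 - h(x_i))^{1 - d_i}

and taking logarithms gives the cross-entropy criterion

h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)).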
Example (Bayes optimal classification): suppose the posteriors and predictions of the hypotheses in H are
P(h_1 \mid D) = .4, \quad P(- \mid h_1) = 0, \quad P(+ \mid h_1) = 1
P(h_2 \mid D) = .3, \quad P(- \mid h_2) = 1, \quad P(+ \mid h_2) = 0
P(h_3 \mid D) = .3, \quad P(- \mid h_3) = 1, \quad P(+ \mid h_3) = 0
Then
\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = .4, \qquad \sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = .6
so the Bayes optimal classification is -, even though the MAP hypothesis h_1 predicts +.
Naive Bayes Classifier

v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \dots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \dots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \dots, a_n \mid v_j)\, P(v_j)
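Under the naive Bayes conditional-independence assumption P(a_1, \dots, a_n \mid v_j) = \prod_i P(a_i \mid v_j), this reduces to the naive Bayes decision rule, whose per-attribute terms are exactly what the table below estimates:

v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)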
Estimated conditional probabilities P(value | P) and P(value | N), where P denotes the positive class (9 training examples) and N the negative class (5 training examples):

Outlook       P     N        Humidity    P     N
sunny         2/9   3/5      high        3/9   4/5
overcast      4/9   0        normal      6/9   1/5
rain          3/9   2/5

Temperature   P     N        Windy       P     N
hot           2/9   2/5      true        3/9   3/5
mild          4/9   2/5      false       6/9   2/5
cool          3/9   1/5
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
Classify_New_Instance (x)
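A minimal sketch of Classify_New_Instance as the naive Bayes computation over the table above. The class priors (9/14 and 5/14) and the new instance ⟨sunny, cool, high, windy = true⟩ are assumptions for illustration, consistent with the 9 positive and 5 negative training examples behind the table.

```python
# Naive Bayes: v_NB = argmax_v P(v) * prod_i P(a_i | v), with probabilities
# taken from the conditional probability table above.

priors = {"P": 9 / 14, "N": 5 / 14}        # assumed class priors

cond = {  # cond[attribute][value][class] = P(value | class)
    "outlook":     {"sunny": {"P": 2/9, "N": 3/5},
                    "overcast": {"P": 4/9, "N": 0.0},
                    "rain": {"P": 3/9, "N": 2/5}},
    "temperature": {"hot":  {"P": 2/9, "N": 2/5},
                    "mild": {"P": 4/9, "N": 2/5},
                    "cool": {"P": 3/9, "N": 1/5}},
    "humidity":    {"high":   {"P": 3/9, "N": 4/5},
                    "normal": {"P": 6/9, "N": 1/5}},
    "windy":       {"true":  {"P": 3/9, "N": 3/5},
                    "false": {"P": 6/9, "N": 2/5}},
}

def classify_new_instance(x):
    """x maps attribute -> value; returns (label, scores), maximizing P(v) * prod P(a_i|v)."""
    scores = {}
    for v, p_v in priors.items():
        score = p_v
        for attribute, value in x.items():
            score *= cond[attribute][value][v]
        scores[v] = score
    return max(scores, key=scores.get), scores

label, scores = classify_new_instance(
    {"outlook": "sunny", "temperature": "cool", "humidity": "high", "windy": "true"})
# scores ≈ {"P": 0.0053, "N": 0.0206}  ->  label "N" (don't play)
```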
A Bayesian belief network M represents the joint distribution over its variables x_1, …, x_n as a product of local conditional distributions:

P(x_1, \dots, x_n \mid M) = \prod_{i=1}^{n} P(x_i \mid Pa_i, M), \qquad Pa_i = parents(x_i)
Bayesian Belief Networks
• Inference
– Exact inference of probabilities in general for an arbitrary Bayesian network is known to be NP-hard.
– Numerous methods have been proposed for probabilistic inference in Bayesian networks.
– Monte Carlo methods provide approximate solutions by randomly sampling the distributions of the unobserved variables, as in the sketch below.
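A minimal sketch of one such Monte Carlo method, rejection sampling, on a toy two-variable network; the structure, states, and probabilities are invented for illustration and are not the lecture's example network.

```python
import random

# Rejection sampling for P(Paper = ? | Age = evidence) in a toy network Age -> Paper.

P_AGE = {"young": 0.6, "old": 0.4}                 # P(Age)
P_PAPER = {                                        # P(Paper | Age)
    "young": {"Sun": 0.5, "DM": 0.5},
    "old":   {"Sun": 0.2, "DM": 0.8},
}

def sample(dist):
    """Draw one value from a {value: probability} dictionary."""
    r, cumulative = random.random(), 0.0
    for value, p in dist.items():
        cumulative += p
        if r <= cumulative:
            return value
    return value  # guard against floating-point round-off

def estimate(query_paper, evidence_age, n=100_000):
    """Estimate P(Paper = query_paper | Age = evidence_age) by rejection sampling."""
    accepted = matched = 0
    for _ in range(n):
        age = sample(P_AGE)
        if age != evidence_age:          # reject samples inconsistent with the evidence
            continue
        paper = sample(P_PAPER[age])
        accepted += 1
        matched += (paper == query_paper)
    return matched / accepted if accepted else float("nan")

# estimate("Sun", "old") should come out close to 0.2
```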
Inference in Bayesian Networks
[Figure: a Bayesian network with nodes Age, Income, House Owner, Living Location, Newspaper Preference, and EU Voting Pattern. Example query: how likely are elderly rich people to buy the Sun?]
Inference in Bayesian Networks
[Figure: the same network, illustrating the query P(paper = DM | Age > 60, Income > 60k, v = labour).]
Bayesian Belief Networks
• Learning Bayesian Belief Networks
– Settings
• The network structure might be given in advance, or it might have to be
inferred from the training data.
• All the network variables might be directly observable in each training
example, or some might be unobservable.
– If the network structure is given in advance and the variables are fully
observable in the training examples, learning the conditional probability tables
is straightforward.
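A minimal sketch of that straightforward, fully observable case: each conditional probability table entry is estimated as an observed relative frequency. The function name, data layout, and variable names here are hypothetical.

```python
from collections import Counter, defaultdict

# Estimate P(child | parents) by counting, assuming the network structure is
# given and every variable is observed in every training example.
# Each example is a dict mapping variable name -> observed value.

def estimate_cpt(examples, child, parents):
    """Return {parent_value_tuple: {child_value: probability}}."""
    counts = defaultdict(Counter)
    for ex in examples:
        key = tuple(ex[p] for p in parents)
        counts[key][ex[child]] += 1
    cpt = {}
    for key, child_counts in counts.items():
        total = sum(child_counts.values())
        cpt[key] = {value: n / total for value, n in child_counts.items()}
    return cpt

# e.g. estimate_cpt(training_data, child="Paper", parents=["Age", "Income"])
```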
Bayesian Belief Networks
• If the network structure is given but only some of the variable values are observable in the training data, the learning problem is more difficult.
• Russell et al. (1995) propose a gradient ascent procedure
that learns the entries in the conditional probability tables.
• The procedure searches through a space of hypotheses that corresponds to the set of all possible entries for the conditional probability tables.
Bayesian Belief Networks
• Gradient Ascent Training of Bayesian Networks
– It maximizes P(D|h) by following the gradient of ln P(D|h) with respect to the parameters that define the conditional probability tables of the Bayesian network.
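In its usual form the rule is stated in terms of the table entries w_ijk = P(Y_i = y_ij | U_i = u_ik), a small learning rate η, and probabilities P_h(· | d) computed by inference in the current network (restated here as the standard formulation):

\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P_h(Y_i = y_{ij},\, U_i = u_{ik} \mid d)}{w_{ijk}}

w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij}, u_{ik} \mid d)}{w_{ijk}}

After each update the entries w_ijk are renormalized so that they remain valid probabilities (non-negative and summing to 1 over j for each parent value u_ik).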
EM algorithm
• A widely used approach to learning in the presence of
unobserved variables.
• Consider a problem in which the data D is a set of instances
generated by a probability distribution that is a mixture of k
distinct Normal distributions.
• Each of the k Normal distributions has the same variance σ², and σ² is known.
• The learning task is to output a hypothesis h = ⟨μ1, …, μk⟩ that describes the means of each of the k distributions.
EM algorithm
• If k=2
– The EM algorithm first initializes the hypothesis to h = ⟨μ1, μ2⟩, where μ1 and μ2 are arbitrary initial values.
– It then iteratively re-estimates h by repeating the following two steps until the procedure converges to a stationary value for h.
• Step 1: Calculate the expected value E[z_ij] of each hidden variable z_ij (z_ij indicates whether instance x_i was generated by the j-th Normal distribution), assuming the current hypothesis h = ⟨μ1, μ2⟩ holds.
• Step 2: Calculate a new maximum likelihood hypothesis h' = ⟨μ1', μ2'⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij].
• E[z_ij] is calculated as shown below.
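For this two-Gaussian, known-variance setting, the E-step takes the standard form

E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-(x_i - \mu_j)^2 / 2\sigma^2}}{\sum_{n=1}^{2} e^{-(x_i - \mu_n)^2 / 2\sigma^2}}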
• The second (maximization) step then finds the values μ1', …, μk' that maximize this Q function (the expected log-likelihood computed using the E[z_ij] from the first step).
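The corresponding M-step update is the standard weighted-mean form

\mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]}

A minimal end-to-end sketch of this two-step loop; the data, σ, and initial means below are placeholders chosen for illustration.

```python
import math
import random

# EM for a mixture of two 1-D Gaussians with known, shared variance sigma^2
# (the setting described above).

def em_two_gaussians(xs, sigma, mu=(0.0, 1.0), iters=50):
    """Return estimated means (mu1, mu2) after `iters` EM iterations."""
    mu = list(mu)
    for _ in range(iters):
        # E-step: expected responsibility E[z_ij] of each component j for each x_i
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: re-estimate each mean as a responsibility-weighted average
        for j in range(2):
            num = sum(r[j] * x for r, x in zip(resp, xs))
            den = sum(r[j] for r in resp)
            mu[j] = num / den
    return tuple(mu)

# Example with synthetic data drawn from N(0,1) and N(5,1):
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(em_two_gaussians(data, sigma=1.0, mu=(-1.0, 6.0)))   # ≈ (0, 5)
```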