
Machine Learning

CS 165B
Spring 2012

1
Course outline
• Introduction (Ch. 1)
• Concept learning (Ch. 2)
• Decision trees (Ch. 3)
• Ensemble learning
• Neural Networks (Ch. 4)
• Linear classifiers
• Support Vector Machines
• Bayesian Learning (Ch. 6)
• Instance-based Learning (Ch. 8)
• Clustering
• Genetic Algorithms (Ch. 9)
• Computational learning theory (Ch. 7)
2
Three approaches to classification

• Use Discriminant Functions directly without probabilities:


– Convert the input vector into one or more real values so
that a simple operation (like thresholding) can be applied
to get the class.
• Infer conditional class probabilities:
– Compute the conditional probability of each class:
P(class = Ck | x)
 Then make a decision that minimizes some loss function
 Discriminative Models
• Compare the probability of the input under separate, class-
specific, Generative Models.
– E.g. fit a multivariate Gaussian to the input vectors of each
class and see which Gaussian fits the test data vector best.
3
Bayesian Learning
• Provides practical learning algorithms
– Assigns probabilities to hypotheses
 Typically learns most probable hypothesis
– Combines prior knowledge (prior probabilities)
– Competitive with ANNs/DTs
– Several classes of models, including:
 Naïve Bayes learning
 Bayesian belief network learning
(side note: the Bayesian vs. Frequentist debate)
• Provides foundations for machine learning
– Evaluating/interpreting other learning algorithms
 E.g., Find-S, Candidate Elimination, ANNs, …
 Shows they output most probable hypotheses
– Guiding the design of new algorithms

4
Basic formulas for probabilities
• Product rule : probability PAB of a conjunction of two
events A and B :
PABPA|BPBPB|APA

• Sum rule: probability PAB of a disjunction of two


events A and B:
PABPAPBPAB

• Total probability : if events A, …, An are mutually


exclusive with i  n PAi, then
n
P ( B )   P( B | Ai ) P ( Ai )
5
i 1
Probability distributions
• Bernoulli Distribution: Random Variable X takes values {0, 1}, s.t
P(X=1) = p = 1 – P(X=0)

• Binomial Distribution: Random Variable X takes values {0, 1, …, n}, representing
the number of successes in n Bernoulli trials.
P(X=k) = f(n, p, k) = C(n, k) p^k (1−p)^(n−k)

• Categorical Distribution: Random Variable X takes on values in {1, 2, …, k} s.t.
P(X=i) = pi and Σ_{i=1..k} pi = 1

• Multinomial Distribution: is to Categorical what Binomial is to Bernoulli

• Let the random variables Xi (i=1, 2,…, k) indicate the number of times
outcome i was observed over the n trials.
• The vector X = (X1, ..., Xk) follows a multinomial distribution with
parameters n and p, where p = (p1, ..., pk) and Σ_{i=1..k} pi = 1
f(x1, x2, …, xk; n, p) = P(X1=x1, …, Xk=xk) = [n! / (x1! ··· xk!)] p1^x1 ··· pk^xk
6
Basics of Bayesian Learning

• P(h) - the prior probability of a hypothesis h


Reflects background knowledge; before data is observed. If no
information - uniform distribution.

• P(D) - The probability that this sample of the Data is observed.


(No knowledge of the hypothesis)

• P(D|h): The probability of observing the sample D, given


hypothesis h

• P(h|D): The posterior probability of h. The probability of h


given that D has been observed.

7
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
• P(h): prior probability of hypothesis h
• P(D): prior probability of training data D
• P(h|D): (posterior) probability of h given D
• P(D|h): probability of D given h /*likelihood*/
• Note proof of theorem:
from definition of conditional probabilities
e.g., P(h, D) = P(h|D) P(D) = P(D|h) P(h)

8
Choosing Hypotheses
P(h | D) = P(D | h) P(h) / P(D)
• The goal of Bayesian Learning: the most probable hypothesis given the
training data
Maximum a Posteriori hypothesis hMAP:

hMAP = argmax_{h∈H} P(h | D)
     = argmax_{h∈H} P(D | h) P(h) / P(D)
     = argmax_{h∈H} P(D | h) P(h)

• If P(hi) = P(hj) for all i, j, the Maximum Likelihood (ML) hypothesis:
hML = argmax_{h∈H} P(D | h)
9
Maximum Likelihood Estimate
• Assume that you toss a (p, 1−p) coin m times and get k Heads, m−k Tails.
What is p?
(The model we assumed is binomial. You could assume a different model!)
• If p is the probability of Heads, the probability of the data
observed is:
P(D|p) = p^k (1−p)^(m−k)
• The log likelihood:
L(p) = log P(D|p) = k log(p) + (m−k) log(1−p)

• To maximize, set the derivative w.r.t. p equal to 0:

dL(p)/dp = k/p – (m−k)/(1−p)

• Solving this for p gives p = k/m


10
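As a quick check of the closed-form result p = k/m, the following sketch (not from the slides; the counts k = 7 heads out of m = 10 tosses are made up, and SciPy is assumed to be available) maximizes the log likelihood numerically and compares it with k/m.

```python
import numpy as np
from scipy.optimize import minimize_scalar

k, m = 7, 10  # hypothetical coin-toss data: 7 heads in 10 tosses

def neg_log_likelihood(p):
    # L(p) = k log p + (m - k) log(1 - p); negated for a minimizer
    return -(k * np.log(p) + (m - k) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~0.7, numerically recovered MLE
print(k / m)   # 0.7, closed-form MLE p = k/m
```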
Example: Does patient have cancer or not?
• A patient takes a lab test and the result comes back
positive. The test returns a correct positive result in only
98% of the cases in which the disease is actually present,
and a correct negative result in only 97% of the cases in
which the disease is not present. Furthermore, 0.008 of the
entire population have this cancer.
P(cancer) = .008          P(¬cancer) = .992
P(+ | cancer) = .98       P(− | cancer) = .02
P(+ | ¬cancer) = .03      P(− | ¬cancer) = .97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+) = (.98 × .008) / P(+) = .0078 / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+) = (.03 × .992) / P(+) = .0298 / P(+)
⇒ hMAP = ¬cancer
11
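To make the arithmetic concrete, here is a small Python check of the posterior; the normalization by P(+) is implied by the slide but not written out.

```python
# Worked cancer-test example: numbers taken from the slide above
p_cancer = 0.008
p_no_cancer = 1 - p_cancer                             # 0.992
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03

# Unnormalized posteriors P(+|h) P(h)
joint_cancer = p_pos_given_cancer * p_cancer           # 0.00784
joint_no_cancer = p_pos_given_no_cancer * p_no_cancer  # 0.02976

p_pos = joint_cancer + joint_no_cancer                 # P(+) by total probability
print(joint_cancer / p_pos)     # P(cancer | +)  ≈ 0.21
print(joint_no_cancer / p_pos)  # P(¬cancer | +) ≈ 0.79  →  hMAP = ¬cancer
```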
Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability


P(h | D) = P(D | h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

hMAP = argmax_{h∈H} P(h | D)

• May require significant computation (large |H|)


• Need to specify P(h), P(D|h) for all h

12
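A minimal sketch of the brute-force MAP learner in Python (the function and argument names are illustrative, not from the slides): it scores every hypothesis by P(D|h)P(h) and returns the best one together with the normalized posteriors.

```python
# Brute-force MAP learning: hypotheses, prior, and likelihood are supplied by the caller.
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the hypothesis h maximizing P(h|D) ∝ P(D|h) P(h), plus all posteriors."""
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    z = sum(scores.values())                 # P(D), used only for normalization
    h_map = max(scores, key=scores.get)
    return h_map, {h: s / z for h, s in scores.items()}
```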
Coin toss example
h MAP  argmaxhH P(h | D)  argmaxhH P(D | h)P(h)/P(D)
• A given coin is either fair or has a 60% bias in favor of Head.
• Decide what is the bias of the coin [This is a learning problem!]

• Two hypotheses: h1: P(H)=0.5; h2: P(H)=0.6


– Prior: P(h): P(h1)=0.75 P(h2 )=0.25
– Now we need Data. 1st Experiment: coin toss is H.
– P(D|h): P(D|h1)=0.5 ; P(D|h2) =0.6

– P(D): P(D) = P(D|h1)P(h1) + P(D|h2)P(h2)
            = 0.5 × 0.75 + 0.6 × 0.25 = 0.525

– P(h|D):
P(h1|D) = P(D|h1)P(h1)/P(D) = 0.5 × 0.75 / 0.525 = 0.714
P(h2|D) = P(D|h2)P(h2)/P(D) = 0.6 × 0.25 / 0.525 = 0.286

13
Coin toss example
h MAP  argmaxhH P(h | D)  argmaxhH P(D | h)P(h)/P(D)

• After 1st coin toss is H we still think that the coin is more likely
to be fair
• If we were to use Maximum Likelihood approach (i.e., assume
equal priors) we would think otherwise. The data supports the
biased coin better.

• Try: 100 coin tosses; 70 heads.

14
Coin toss example
h MAP  argmaxhH P(h | D)  argmaxhH P(D | h)P(h)/P(D)
• Case of 100 coin tosses; 70 heads.

P(D)  P(D | h1 )P(h1 )  P(D | h 2 )P(h2 )


 0.5 100  0.75  0.6700.4 30  0.25
 7.9  10 -31  0.75  3.4  10 -28  0.25
P(D | h1 )P(h 1 ) P(D | h2 )P(h 2 )
P(h 1 | D)    P(h 2 | D)
P(D) P(D)
 0.0057 0.9943


15
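The two coin-toss calculations above can be reproduced with a few lines of Python (an illustrative sketch; the hypothesis names h1/h2 and priors follow the slides).

```python
def coin_posteriors(heads, tosses):
    """Posterior over h1 (fair coin) and h2 (60% heads) with priors 0.75 / 0.25."""
    prior = {"h1": 0.75, "h2": 0.25}
    bias = {"h1": 0.5, "h2": 0.6}
    like = {h: bias[h] ** heads * (1 - bias[h]) ** (tosses - heads) for h in prior}
    joint = {h: like[h] * prior[h] for h in prior}
    z = sum(joint.values())                  # P(D)
    return {h: joint[h] / z for h in prior}

print(coin_posteriors(1, 1))      # one head:     {'h1': ~0.714, 'h2': ~0.286}
print(coin_posteriors(70, 100))   # 70/100 heads: h2 dominates (~0.99)
```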
Example: Relation to Concept Learning

• Consider the concept learning task


– instance space X, hypothesis space H, training examples D
– consider the Find-S learning algorithm (outputs most specific hypothesis from
the version space VSH,D)
• What would Bayes rule produce as the MAP hypothesis?
• Does Find-S output a MAP hypothesis?
P(h | D) = P(D | h) P(h) / P(D)

16
Relation to Concept Learning

• Assume: given set of instances x1, …, xm,
D = ⟨c(x1), …, c(xm)⟩ is the set of classifications
• For all h in H, P(h) = 1/|H| (uniform distribution)

• Choose P(D | h) = 1 if h is consistent with D, 0 otherwise

• Compute P(D) = Σ_{h∈H} P(D | h) P(h) = Σ_{h∈VS(H,D)} 1/|H| = |VS(H,D)| / |H|

• Now P(h | D) = 1 / |VS(H,D)| if h is consistent with D, 0 otherwise

• Every hypothesis consistent with D is a MAP hypothesis


17
Evolution of Posterior Probabilities

P(h) P(h|D1) P(h|D1D2)

hypotheses hypotheses hypotheses

Characterization of concept learning:


use of prior instead of bias

18
(Bayesian) Learning a real-valued
function
• Continuous-valued target function
– Goal: learn h: X → R
– Bayesian justification for minimizing SSE
• Assume
– Target function h(x) is corrupted by noise
 probability density functions model noise

 Normal iid errors (N(mean, sd))

– Observe di = h(xi) + ei,  i = 1, …, n


– All hypotheses equally likely (a priori)
• Linear h
– A linear combination of basis functions

19
hML = Minimizing Squared Error

hML = argmax_{h∈H} P(D | h)
    = argmax_{h∈H} Π_{i=1..m} P(di | h)
    = argmax_{h∈H} Π_{i=1..m} (1/√(2πσ²)) exp(−(di − h(xi))² / 2σ²)
    = argmax_{h∈H} Π_{i=1..m} exp(−(di − h(xi))² / 2σ²)
    = argmax_{h∈H} Σ_{i=1..m} −(di − h(xi))² / 2σ²
    = argmax_{h∈H} Σ_{i=1..m} −(di − h(xi))²
    = argmin_{h∈H} Σ_{i=1..m} (di − h(xi))²
20
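A small numerical illustration of this equivalence on synthetic data (the linear target 2x + 1 and noise level 0.1 are made up): the least-squares fit is also the hypothesis with the highest Gaussian log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical linear target corrupted by N(0, sigma^2) noise
x = np.linspace(0, 1, 50)
d = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=x.shape)

# Least-squares fit = ML hypothesis under the Gaussian-noise assumption
slope, intercept = np.polyfit(x, d, deg=1)

def log_likelihood(w1, w0, sigma=0.1):
    resid = d - (w1 * x + w0)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# The fitted line scores at least as high as any other linear hypothesis, e.g. the true (2, 1)
print(log_likelihood(slope, intercept) >= log_likelihood(2.0, 1.0))  # True
```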
Learning to Predict Probabilities
• Consider predicting survival probability from patient data
– Training examples ⟨xi, di⟩, where di is either 1 or 0
– Want to learn a probabilistic function (like a coin) that for a
given input outputs 0/1 with certain probabilities.
– Could train a NN/SVM to learn ratios.
• Approach: train a neural network to output a probability given xi
(training examples are for f; learn f’ using ML)
• Modified target function f’(x) = P(f(x) = 1)
• Max likelihood hypothesis: hence need to find
P(D | h)
= Π_{i=1..m} P(xi, di | h)             (independence of each example)
= Π_{i=1..m} P(di | h, xi) P(xi | h)   (conditional probabilities)
= Π_{i=1..m} P(di | h, xi) P(xi)       (independence of h and xi)
21
Maximum Likelihood Hypothesis
P(di | h, xi) = h(xi) if di = 1,  1 − h(xi) if di = 0
(h outputs h(xi) for input xi: the probability that di is 1 is h(xi), and the probability that di is 0 is 1 − h(xi))

P(di | h, xi) = h(xi)^di (1 − h(xi))^(1−di)

P(D | h) = Π_{i=1..m} h(xi)^di (1 − h(xi))^(1−di) P(xi)

hML = argmax_{h∈H} Π_{i=1..m} h(xi)^di (1 − h(xi))^(1−di) P(xi)

hML = argmax_{h∈H} Σ_{i=1..m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

Cross entropy error
22


Weight update rule for ANN sigmoid unit

• Go up the gradient of the likelihood function
G(h, D) = Σ_{i=1..m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

∂G(h, D)/∂wj = Σ_{i=1..m} ∂[ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]/∂wj
             = Σ_{i=1..m} ∂[ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]/∂h(xi) · ∂h(xi)/∂neti · ∂neti/∂wj
             = Σ_{i=1..m} [ (di − h(xi)) / (h(xi)(1 − h(xi))) ] · h(xi)(1 − h(xi)) · xji
             = Σ_{i=1..m} (di − h(xi)) xji

• Weight update rule:

wj ← wj + Δwj,  where  Δwj = η Σ_{i=1..m} (di − h(xi)) xji
(Same as minimizing sum of squared error for linear ANN units)
23
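A minimal sketch of this update rule in Python (the dataset and learning rate are made up): each epoch moves the weights up the cross-entropy likelihood gradient X^T(d − h).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sigmoid_unit(X, d, eta=0.1, epochs=1000):
    """Gradient ascent on the cross-entropy likelihood: w += eta * X^T (d - h)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        h = sigmoid(X @ w)          # h(xi) for every example
        w += eta * X.T @ (d - h)    # the update rule from the slide
    return w

# Tiny made-up dataset (bias term included as a constant first column)
X = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 0.4], [1.0, 0.8]])
d = np.array([0.0, 1.0, 0.0, 1.0])
w = train_sigmoid_unit(X, d)
print(sigmoid(X @ w))  # outputs approach the 0/1 targets
```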
Information theoretic view of hMAP
hMAP  arg max P( D | h) P (h)
hH

 arg max log 2 P ( D | h)  log 2 P(h)


hH

 arg min  log 2 P ( D | h)  log 2 P (h)


hH

• Information theory: the optimal (shortest expected coding length) code


assigns logp bits to an event with probability p
– Shorter codes for more probable messages
– Interpret logP(h) as the length of h under optimal code for the
hypothesis space
 Optimal description length of h given its probability
– InterpretlogP(D | h) as length of D given h under optimal code
 Assume both receiver/sender know h
– cost of encoding hypothesis + cost of encoding data given the hypothesis
24
Minimum Description Length Principle
• Occam’s razor: prefer the shortest hypothesis
– Now have Bayesian interpretation
• Let LC1(h), LC2(D | h) be optimal length descriptions of h and D|h in some
encoding scheme C
– Interpretation: MAP hypothesis is the one that minimizes
LC1(h) + LC2(D | h)
– MDL: prefer the hypothesis h that minimizes
hMDL = argmin_{h∈H} [ LC1(h) + LC2(D | h) ]

• Example of decision trees


– LC1(h) as related to depth of tree
– LC2(D | h) as related to number of correct classifications for D
 Assume sender/receiver know sequence of x’s and knows h’s
 Receiver can compute if correct classification of each x from the h
 Hence only need to transmit misclassifications for receiver to know all
– prefer the hypothesis that minimizes
length(h) + length(misclassifications)
Can use for pruning trees
25
Bayes Optimal Classifier
• Bayes optimal classification:

arg max  P (v | h) P (h | D)
vV hH

• Example: H = {h1, h2, h3}


– P(h1 | D) = .4 P | h1) = 0 P | h1) = 1
– P(h2 | D) = .3 P | h2) = 1 P | h2) = 0
– P(h3 | D) = .3 P | h3) = 1 P | h3) = 0
 P( | h) P(h | D)  0.4
hH

 P( | h) P(h | D)  0.6


hH

arg max  P(v | h) P(h | D)  


vV hH
26
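The example above in code (values copied from the slide): the Bayes optimal classifier sums P(v|h)P(h|D) over hypotheses and picks the best class, here −, even though the single MAP hypothesis h1 predicts +.

```python
# Bayes optimal classification for the three-hypothesis example above
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}          # P(h | D)
p_plus_given_h = {"h1": 1.0, "h2": 0.0, "h3": 0.0}      # P(+ | h); P(- | h) = 1 - this

def bayes_optimal(posteriors, p_plus_given_h):
    score_plus = sum(p_plus_given_h[h] * posteriors[h] for h in posteriors)
    score_minus = sum((1 - p_plus_given_h[h]) * posteriors[h] for h in posteriors)
    return ("+" if score_plus > score_minus else "-", score_plus, score_minus)

print(bayes_optimal(posteriors, p_plus_given_h))  # ('-', 0.4, 0.6)
```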
Simplest approximation: Gibbs Classifier
• Bayes optimal classifier
– Maximizes prob that new example will be classified correctly,
given, D, H, prior p’s
– provides best result, but can be expensive if too many hypotheses
• Gibbs algorithm:
1. Randomly choose a hypothesis h, according to Ph|D
2. Use h to classify new instance
• Surprising fact: Assume target concepts are drawn at random from
H according to priors on H. Then:
E[errorGibbs] ≤ 2 E[errorBayesOptimal]
• Suppose uniform prior distribution over H, then
– Pick any hypothesis from VS, with uniform probability
– Its expected error no worse than twice Bayes optimal

27
Simpler classification: Naïve Bayes
• Along with decision trees, neural networks, nearest
neighbor, one of the most practical learning methods
• When to use
– Moderate or large training set available
– Attributes that describe instances are conditionally independent
given classification
• Successful applications:
– Diagnosis
– Classifying text documents

28
Naïve Bayes Classifier
• Assume target function f : X → V
each instance x described by attributes a1, …, an
– In simplest case, V has two values (0,1)
• Most probable value of f(x) is:
vMAP = argmax_{v∈V} P(v | a1, a2, …, an)
     = argmax_{v∈V} P(a1, a2, …, an | v) P(v) / P(a1, a2, …, an)
     = argmax_{v∈V} P(a1, a2, …, an | v) P(v)
• Naïve Bayes assumption:
P(a1, a2, …, an | v) = Π_i P(ai | v)
• Naïve Bayes classifier:
vNB = argmax_{v∈V} P(v) Π_i P(ai | v)
29
Example

• Consider PlayTennis again (table below)
• P(yes) = 9/14, P(no) = 5/14
• P(Sunny | yes) = 2/9
• P(Sunny | no) = 3/5
• Classify: (Sunny, Cool, High, Strong)
vNB = argmax_{v∈V} P(v) Π_i P(ai | v)
P(y) P(sunny|y) P(cool|y) P(high|y) P(strong|y) = 0.005
P(n) P(sunny|n) P(cool|n) P(high|n) P(strong|n) = 0.021

Day  Outlook   Temp  Humidity  Wind    Tennis?
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
30
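A compact Python sketch of the Naïve Bayes classifier on the PlayTennis table; it reproduces the 0.005 vs. 0.021 scores above (unnormalized, no smoothing).

```python
from collections import Counter, defaultdict

# PlayTennis data from the table above: (Outlook, Temp, Humidity, Wind) -> label
data = [
    (("Sunny","Hot","High","Weak"),"No"), (("Sunny","Hot","High","Strong"),"No"),
    (("Overcast","Hot","High","Weak"),"Yes"), (("Rain","Mild","High","Weak"),"Yes"),
    (("Rain","Cool","Normal","Weak"),"Yes"), (("Rain","Cool","Normal","Strong"),"No"),
    (("Overcast","Cool","Normal","Strong"),"Yes"), (("Sunny","Mild","High","Weak"),"No"),
    (("Sunny","Cool","Normal","Weak"),"Yes"), (("Rain","Mild","Normal","Weak"),"Yes"),
    (("Sunny","Mild","Normal","Strong"),"Yes"), (("Overcast","Mild","High","Strong"),"Yes"),
    (("Overcast","Hot","Normal","Weak"),"Yes"), (("Rain","Mild","High","Strong"),"No"),
]

labels = Counter(v for _, v in data)                       # {'Yes': 9, 'No': 5}
counts = defaultdict(Counter)                              # counts[(i, v)][a_i]
for attrs, v in data:
    for i, a in enumerate(attrs):
        counts[(i, v)][a] += 1

def score(attrs, v):
    p = labels[v] / len(data)                              # P(v)
    for i, a in enumerate(attrs):
        p *= counts[(i, v)][a] / labels[v]                 # P(a_i | v)
    return p

query = ("Sunny", "Cool", "High", "Strong")
print({v: round(score(query, v), 4) for v in labels})      # Yes: ~0.0053, No: ~0.0206 → classify No
```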
Conditional Independence
• Conditional independence assumption
P( a1  a2    an | v)   P( ai | v)
i

is often violated
• but it works surprisingly well anyway
• Don’t need estimated posteriors Pˆ (v | a1    an )
to be correct; need only that

arg max Pˆ (v) Pˆ (ai | v)  arg max P (v) P(a1    an | v)


vV i vV

31
Estimating Probabilities
• If none of the training instances with target value v
have attribute value ai ?
P̂(ai | v) = 0  ⟹  P̂(v) Π_i P̂(ai | v) = 0

• Typical solution: Bayesian estimate for P̂(ai | v):

P̂(ai | v) = (nc + mp) / (n + m)
– n: number of training examples with result v
– nc: number of examples with result v and ai
– p: prior estimate for P̂(ai | v)
 Uniform priors (e.g., uniform over attribute values)
– m: weight given to prior (equivalent sample size)

32
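The m-estimate as a one-line helper (the example numbers are illustrative): with n_c = 0 the smoothed estimate is m·p/(n+m) rather than zero.

```python
def m_estimate(n_c, n, p_prior, m):
    """Smoothed estimate P(a_i | v) = (n_c + m p) / (n + m) from the slide."""
    return (n_c + m * p_prior) / (n + m)

# An attribute value never seen with class v (n_c = 0) no longer gets probability 0:
print(m_estimate(n_c=0, n=9, p_prior=1/3, m=3))   # 0.0833... instead of 0.0
```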
Classify Text
• Why?
– Learn which news articles are of interest
– Learn to classify web pages by topic
– Junk mail filtering
• Naïve Bayes is among the most effective algorithms
• What attributes shall we use to represent text documents?

33
Learning to Classify Text

• Target concept Interesting? : Document → {+, −}

• Represent each document by vector of words
– one attribute per word position in document
• Learning: Use training examples to estimate
– P(+) and P(−)
– P(doc | +) and P(doc | −)
• Naïve Bayes conditional independence assumption
P(doc | v) = Π_{i=1..length(doc)} P(ai = wk | v)

– P(ai = wk | v): probability of the i-th word being wk, given v

34
Position Independence Assumption
P(doc | v) = Π_{i=1..length(doc)} P(ai = wk | v)

• P(ai = wk | v) is hard to compute (#words = 50K, #v = 2, length = 111)

• Add one more assumption (position independence):
∀ i, m:  P(ai = wk | v) = P(am = wk | v)
 Need to compute only P(wk | v)
– 2 × 50K terms
• Estimate for P(wk | v):
P(wk | v) = (nc + mp) / (n + m)   with p = 1/|Vocabulary| and m = |Vocabulary|
          = (nk + 1) / (n + |Vocabulary|)

35
LEARN_Naïve_Bayes_Text (Examples, V)
• collect all words and other tokens that occur in Examples
– Vocabulary ← all distinct words and other tokens in Examples
• calculate probability terms P (v) and P (wk | v)
For each target value v in V do
– docsv ← subset of Examples for which the target value is v
– P(v) ← |docsv| / |Examples|
– Textv ← a single document created by concatenating all
members of docsv
– n ← total number of words in Textv (duplicates counted)
– for each word wk in Vocabulary
 nk ← number of times word wk occurs in Textv
 P(wk | v) ← (nk + 1) / (n + |Vocabulary|)

36
CLASSIFY_Naïve_Bayes_Text (Doc)
• positions ← all word positions in Doc that contain tokens found in
Vocabulary
• Return

vNB  arg max P(v)


vV

i positions
P (ai | v)

37
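A runnable sketch of LEARN_Naïve_Bayes_Text and CLASSIFY_Naïve_Bayes_Text (the tiny spam/ham corpus is made up; log probabilities are used to avoid underflow on long documents, which the slide's product form would otherwise suffer from).

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (document_text, label). Returns vocabulary, priors, word probabilities."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    labels = {v for _, v in examples}
    prior, word_prob = {}, {}
    for v in labels:
        docs_v = [doc for doc, label in examples if label == v]
        prior[v] = len(docs_v) / len(examples)
        text_v = " ".join(docs_v).split()            # Text_v, duplicates counted
        n = len(text_v)
        counts = Counter(text_v)
        word_prob[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, prior, word_prob

def classify_naive_bayes_text(doc, vocabulary, prior, word_prob):
    positions = [w for w in doc.split() if w in vocabulary]
    scores = {v: math.log(prior[v]) + sum(math.log(word_prob[v][w]) for w in positions)
              for v in prior}
    return max(scores, key=scores.get)

# Tiny made-up corpus
examples = [("free money now", "spam"), ("meeting schedule today", "ham"),
            ("win free prize", "spam"), ("project meeting notes", "ham")]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("free prize today", *model))   # 'spam'
```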
Example: 20 Newsgroups
• Given 1000 training documents from each group
• Learn to classify new documents to a newsgroup
– comp.graphics, comp.os.ms-windows.misc,
comp.sys.ibm.pc.hardware, comp.sys.mac.hardware,
comp.windows.x
– misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball,
rec.sport.hockey
– alt.atheism, talk.religion.misc, talk.politics.mideast,
talk.politics.misc, talk.politics.guns
– soc.religion.christian, sci.space sci.crypt, sci.electronics,
sci.med
• Naive Bayes: 89% classification accuracy
38
Conditional Independence
• X is conditionally independent of Y given Z if the probability
distribution governing X is independent of the value of Y given
the value of Z
xiyjzk PXxi|Y  yj Z  zk  PX xi|Z  zk
 [or PX|Y,Z  PX|Z
• Example: P(Thunder|Rain, Lightning) = P(Thunder|
Lightning)
• Can generalize to X1…Xn, Y1…Ym, Z1…Zk
• Extreme case:
– Naive Bayes assumes full conditional independence:
PX1…Xn|Z  PX1…Xn|XnZPXn|Z
PX1…Xn|ZPXn|Z…
 i PXi|Z
39
• Symmetry of conditional independence
– Assume X is conditionally independent of Z given Y
 P(X|Y,Z) = P(X|Y)

– Now,
 P(Z|X,Y) = P(X|Y,Z) P(Z|Y) / P(X|Y)

– Therefore,
 P(Z|X,Y) = P(Z|Y)

– Or, Z is conditionally independent of X given Y

40
Bayesian Belief Networks
• Problems with above methods:
– Bayes Optimal Classifier expensive computationally
– Naive Bayes assumption of conditional independence too
restrictive
• For tractability/reliability, need other assumptions
– Model of world intermediate between
 Full conditional probabilities
 Full conditional independence
• Bayesian Belief networks describe conditional independence among
subsets of variables
– Assume only proper subsets are conditionally independent
– Combines prior knowledge about dependencies among variables
with observed training data
41
Bayesian Belief Networks (a.k.a.
Bayesian Networks)
a.k.a. Probabilistic networks, Belief nets, Bayes nets, etc.
• Belief network
– A data structure (depicted as a graph) that represents the
dependence among variables and allows us to concisely specify
the joint probability distribution
• A belief network is a directed acyclic graph where:
– The nodes represent the set of random variables (one node per
random variable)
– Arcs between nodes represent influence, or dependence
 A link from node X to node Y means that X “directly
influences” Y
– Each node has a conditional probability table (CPT) that defines
P(node | parents)

Judea Pearl, Turing Award winner 2012 42


Bayesian Belief Network
• Network represents conditional independence assertions:
– Each node conditionally independent of its non-descendants (what
is descendent?), given its immediate predecessors (represented by
arcs)

[Figure: belief network over Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire; the CPT shown is for Campfire, conditioned on its parents Storm (S) and BusTourGroup (B)]

        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
  ¬C    0.6    0.9    0.2    0.8
43
Example
• Random variables X and Y X P(X)
– X: It is raining
– Y: The grass is wet
• X affects Y
Or, Y is a symptom of X Y P(Y|X)
• Draw two nodes and link them
• Define the CPT for each node
− P(X) and P(Y | X)
• Typical use: we observe Y and we want to query P(X | Y)
− Y is an evidence variable
− X is a query variable
44
Try it…
• What is P(X | Y)?
– Given that we know the CPTs of each
node in the graph
[Figure: two-node net X → Y with CPTs P(X) and P(Y|X)]

P(X | Y) = P(Y | X) P(X) / P(Y)
         = P(Y | X) P(X) / Σ_X P(X, Y)
         = P(Y | X) P(X) / Σ_X P(Y | X) P(X)
45
Belief nets represent joint probability
• The joint probability function can be calculated directly
from the network
– It is the product of the CPTs of all the nodes
– P(var1, …, varN) = Πi P(vari|Parents(vari))

[Example 1: net X → Y]      P(X, Y) = P(X) P(Y|X)
[Example 2: net X → Z ← Y]  P(X, Y, Z) = P(X) P(Y) P(Z|X,Y)


• Derivation 46
• General case
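A tiny illustration of "joint = product of CPTs" for the two-node net X → Y (the CPT numbers are made up): the product P(X)P(Y|X) sums to 1 over all assignments.

```python
# Joint probability as a product of CPTs for the tiny net X -> Y (values True/False).
p_x = {True: 0.3, False: 0.7}                              # P(X)
p_y_given_x = {True: {True: 0.8, False: 0.2},              # P(Y | X)
               False: {True: 0.1, False: 0.9}}

def joint(x, y):
    return p_x[x] * p_y_given_x[x][y]                      # P(X, Y) = P(X) P(Y|X)

# Sanity check: the joint sums to 1 over all assignments
print(sum(joint(x, y) for x in (True, False) for y in (True, False)))  # 1.0
print(joint(True, True))   # P(X=T, Y=T) = 0.3 * 0.8 = 0.24
```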
Example
I’m at work and my neighbor John calls to say my home
alarm is ringing, but my neighbor Mary doesn’t call. The
alarm is sometimes triggered by minor earthquakes. Was
there a burglar at my house?
• Random (boolean) variables:
– JohnCalls, MaryCalls, Earthquake, Burglar, Alarm
• The belief net shows the influence links
• This defines the joint probability
– P(JohnCalls, MaryCalls, Earthquake, Burglar, Alarm)
• What do we want to know? P(B | J, M)

Why not P(B | J, A, M) ?


47
Example

Links and CPTs? 48


Example

Joint probability? P(J, M, A, B, E)? 49


Calculate P(J, M, A, B, E)

P(J, M, A, B, E) = P(B) P(E) P(A|B,E) P(J|A) P(M|A)


= 0.001 * 0.998 * 0.94 * 0.9 * 0.3
= 0.0002532924

How about P(B | J, M) ?


Remember, this means P(B=true | J=true, M=false)

50
Calculate P(B | J, M)

P ( B , J , M )
P ( B | J , M ) 
P ( J , M ) By marginalization:

  P ( J , M , A , B , E )
i j
i j


   P ( J , M , A , B , E )
i j k
i j k

  P ( B ) P ( E ) P ( A | B , E ) P ( J | A ) P ( M | A )
i j
j i j i i


   P ( B ) P ( E ) P ( A | B , E ) P ( J | A ) P ( M | A )
i j k
j k i j k i i

51
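The marginalization above, done by brute-force enumeration in Python. Only a few of the CPT numbers appear explicitly in these slides; the rest are the standard textbook values for this burglary network, so treat them as assumptions.

```python
from itertools import product

# CPTs for the burglary network (standard textbook values; only some of these
# numbers appear explicitly in the slides)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(A=true | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                    # P(M=true | A)

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(B=true | J=true, M=false) by summing the joint over the hidden variables A, E
num = sum(joint(True, e, a, True, False) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, False) for b, e, a in product([True, False], repeat=3))
print(num / den)   # ≈ 0.005: a single John call with Mary silent barely raises P(Burglary)
```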
Example

• Conditional independence is seen here


– P(JohnCalls | MaryCalls, Alarm, Earthquake, Burglary) =
P(JohnCalls | Alarm)
– So JohnCalls is independent of MaryCalls, Earthquake, and
Burglary, given Alarm
• Does this mean that an earthquake or a burglary do not
influence whether or not John calls?
– No, but the influence is already accounted for in the Alarm
variable
– JohnCalls is conditionally independent of Earthquake, but not
absolutely independent of it

52
Course outline

53
Class feedback

• Difficult concepts
o PCA
o Fisher’s Linear Discriminant
o Backpropagation
o Logistic regression
o SVM?
o Bayesian learning?

54
Class feedback

• Pace
• Slightly fast
• Slow down on difficult parts
• Difficulty of homework
• Slightly hard
• Difficulty of project
• Need more structure
• Other feedback
• More depth?

55
Naive Bayes model
• A common situation is when a single cause directly
influences several variables, which are all conditionally
independent, given the cause.

P(C, e1, e2, e3) = P(C) P(e1 | C) P(e2 | C) P(e3 | C)

[Figure: cause C = Rain with conditionally independent effects e1 = Wet grass, e2 = People with umbrellas, e3 = Car accidents]

In general,
P(C, e1, …, en) = P(C) Π_i P(ei | C)
56
Naive Bayes model
• Typical query for naive Bayes:
– Given some evidence, what’s the probability of the cause?
– P(C | e1) = ?
– P(C | e1, e3) = ?

P(C | e1) = P(e1 | C) P(C) / P(e1)
          = P(e1 | C) P(C) / Σ_C P(e1 | C) P(C)
57
Drawing belief nets
• What would a belief net look like if all the variables were
fully dependent?

X1 X2 X3 X4 X5

P(X1,X2,X3,X4,X5) = P(X1)P(X2|X1)P(X3|X1,X2)P(X4|X1,X2,X3)P(X5|X1,X2,X3,X4)

• But this isn’t the only way to draw the belief net when all
the variables are fully dependent

58
Fully connected belief net
• In fact, there are N! ways of connecting up a fully-
connected belief net
– That is, there are N! ways of ordering the nodes
A way to represent joint probability
For N=2 Does not really capture causality!

X1 X2 X1 X2 P(X1,X2) = ?

For N=5

X1 X2 X3 X4 X5 P(X1,X2,X3,X4,X5) = ?

and 119 others… 59


Drawing belief nets (cont.)

Fully-connected net displays the joint distribution


P(X1, X2, X3, X4, X5) = P(X1) P(X2|X1) P(X3|X1,X2) P(X4|X1,X2,X3) P(X5|X1, X2, X3, X4)

X1 X2 X3 X4 X5

But what if there are conditionally independent variables?


P(X1, X2, X3, X4, X5) = P(X1) P(X2|X1) P(X3|X1,X2) P(X4|X2,X3) P(X5|X3, X4)

X1 X2 X3 X4 X5

60
Drawing belief nets (cont.)
What if the variables are all independent?
P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3) P(X4) P(X5)

X1 X2 X3 X4 X5

What if the links are drawn like this:

X1 X2 X3 X4 X5

Not allowed – not a DAG


61
Drawing belief nets (cont.)

What if the links are drawn like this:

X1 X2 X3 X4 X5

P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X3) P(X3 | X1) P(X4 | X2) P(X5 | X4)

It can be redrawn like this:

X1 X3 X2 X4 X5

All arrows going left-to-right


62
Belief nets
• General assumptions
– A DAG is a reasonable representation of the influences among the
variables
 Leaves of the DAG have no direct influence on other variables

– Conditional independences cause the graph to be much less than


fully connected (the system is sparse)

63
What are belief nets used for?
• Given the structure, we can now pose queries:
– Typically: P(Cause | Symptoms)
– P(X1 | X4, X5)
– P(Earthquake | JohnCalls)
– P(Burglary | JohnCalls, MaryCalls)

Query variable Evidence variables

64
[Figure: two chains.
Left: Rained (X, with P(X)) → Wet grass (Y, with P(Y|X)); query: ASK P(X|Y).
Right: Raining (X, with P(X)) → Wet grass (Y, with P(Y|X)) → Worm sighting (Z, with P(Z|Y)); query: ASK P(X|Z).]
65
How to construct a belief net
• Choose the random variables that describe the domain
– These will be the nodes of the graph

• Choose a left-to-right ordering of the variables that indicates


a general order of influence
– “Root causes” to the left, symptoms to the right

X1 X2 X3 X4 X5

Causes Symptoms

66
How to construct a belief net (cont.)
• Draw arcs from left to right to indicate “direct influence” among variables
– May have to reorder some nodes

X1 X2 X3 X4 X5

• Define the conditional probability table (CPT) for each node


– P(node | parents)
P(X1) P(X4 | X2,X3)
P(X2) P(X5 | X4)
P(X3 | X1,X2)
67
Example: Flu and measles

[Figure: Flu and Measles are parents of Fever; Measles is also a parent of Spots.
CPTs: P(Flu), P(Measles), P(Fever | Flu, Measles), P(Spots | Measles)]

To create the belief net:


• Choose variables (evidence and query)
• Choose an ordering and create links (direct influences)
• Fill in probabilities (CPTs)
68
Example: Flu and measles

[Figure: same network with abbreviated names — F (flu) and M (measles) are parents of V (fever); M is a parent of S (spots)]

CPTs: P(F) = 0.01 P(S| M) = [0, 0.9]


P(M) = 0.001 P(V| F, M) = [0.01, 0.8, 0.9, 1.0]

Compute P(F | V) and P(F | V, S).


Are they equivalent?
How about P(V | M) and P(V | M, S)? 69
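A sketch that answers the question by enumeration. The ordering of the CPT entries on the slide is not spelled out, so the code assumes [0.01, 0.8, 0.9, 1.0] lists P(V=1 | F, M) for (F,M) = (0,0), (0,1), (1,0), (1,1), and [0, 0.9] lists P(S=1 | M) for M = 0, 1. With that reading, P(F | V) and P(F | V, S) are clearly not equivalent: observing Spots explains the fever away via Measles.

```python
from itertools import product

P_F, P_M = 0.01, 0.001
P_V = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.9, (1, 1): 1.0}   # P(V=1 | F, M), assumed ordering
P_S = {0: 0.0, 1: 0.9}                                        # P(S=1 | M),    assumed ordering

def joint(f, m, v, s):
    pf = P_F if f else 1 - P_F
    pm = P_M if m else 1 - P_M
    pv = P_V[(f, m)] if v else 1 - P_V[(f, m)]
    ps = P_S[m] if s else 1 - P_S[m]
    return pf * pm * pv * ps

def posterior_f(evidence):
    """P(F=1 | evidence), where evidence maps a subset of {'v', 's'} to 0/1."""
    num = den = 0.0
    for f, m, v, s in product([0, 1], repeat=4):
        if any({"v": v, "s": s}[k] != val for k, val in evidence.items()):
            continue
        p = joint(f, m, v, s)
        den += p
        if f == 1:
            num += p
    return num / den

print(posterior_f({"v": 1}))            # P(F | V)    ≈ 0.46
print(posterior_f({"v": 1, "s": 1}))    # P(F | V, S) ≈ 0.01 — Spots explains the fever away
```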
Independence
• Variables X and Y are independent if and only if
– P(X, Y) = P(X) P(Y)
– P(X | Y) = P(X)
– P(Y | X) = P(Y)
• We can determine independence of variables in a belief net
directly from the graph
– Variables X and Y are independent if they share no common
ancestry
 I.e., the set of { X, parents of X, grandparents of X, … } has a
null intersection with the set of {Y, parents of Y, grandparents
of Y, … }
X, Y dependent

X Y 70
Conditional Independence
• X and Y are (conditionally) independent given E iff
– P(X | Y, E) = P(X | E) Independence is the same as
– P(Y | X, E) = P(Y | E) conditional independence given
empty E

• {X1,…,Xn} and {Y1,…,Ym} are conditionally independent


given {E1,…,Ek} iff
– P(X1,…,Xn | Y1, …, Ym, E1, …,Ek) = P(X1,…,Xn | E1, …,Ek)
– P(Y1, …, Ym | X1,…,Xn, E1, …,Ek) = P(Y1, …, Ym | E1, …,Ek)

• We can determine conditional independence of variables


(and sets of variables) in a belief net directly from the
graph
71
Conditional independence and d-separation
• Two sets of nodes, X and Y, are conditionally independent given
evidence nodes, E, if every undirected path from a node in X to a node
in Y is blocked by E. Also called d-separation.
• A path is blocked given E if there is a node Z on the path for which
one of the following holds:

Cases 1 and 2:
variable Z is in E

Case 3:
neither variable Z
nor any of its descendants
is in E
72
Path Blockage

Blocked Unblocked
Active
Three cases:
– Common cause
[Figure: X ← E → Y; the path is blocked when E is observed, active otherwise]

73
Path Blockage

Blocked Unblocked
Active
Three cases:
– Common cause
– Intermediate cause
[Figure: X → E → Y; the path is blocked when E is observed, active otherwise]

74
Path Blockage

Blocked Unblocked
Active
Three cases:
– Common cause
– Intermediate cause
– Common effect
[Figure: X → A ← Y, with descendant C of A; the path is blocked unless A or one of its descendants is in E]
75
Examples

R G W P(W | R, G) = P(W | G)
Rain Wet Worms
Grass

T F C P(T | C, F) = P(T | F)
Tired Flu Cough

W M I P(W | I, M) ≠ P(W | M)
Work Money Inherit P(W | I) = P(W)

76
Examples

X Z Y X ind. of Y? X ind. of Y given Z?


Yes Yes

X Z Y X ind. of Y? X ind. of Y given Z?


No Yes

X Z Y X ind. of Y? X ind. of Y given Z?


No Yes

X Z Y X ind. of Y? X ind. of Y given Z?


Yes No

X Y X ind. of Y? X ind. of Y given Z?


No No
Z
77
Examples (cont.)

[Figure: Z = rain is a common cause of X = wet grass and Y = rainbow]
X – wet grass
Y – rainbow      P(X, Y) ≠ P(X) P(Y)
Z – rain         P(X | Y, Z) = P(X | Z)

[Figure: X = rain and Y = sprinkler are parents of Z = wet grass, which has child W = worms]
X – rain
Y – sprinkler    P(X, Y) = P(X) P(Y)
Z – wet grass    P(X | Y, Z) ≠ P(X | Z)
W – worms        P(X | Y, W) ≠ P(X | W)
78
Examples
Are X and Y independent?
Are X and Y conditionally independent given Z?

X Y X Y

Z W Z W

X – rain X – rain
Y – sprinkler Y – sprinkler
Z – rainbow Z – rainbow
W – wet grass W – wet grass

P(X,Y) = P(X) P(Y) Yes P(X,Y) = P(X) P(Y) No


P(X | Y, Z) = P(X | Z) Yes P(X | Y, Z) = P(X | Z) No
79
Conditional Independence
• What are the conditional independences here?
Radio and Ignition, given Battery?
Yes
Radio and Starts, given Ignition?
Yes
Gas and Radio, given Battery?
Yes
Gas and Radio, given Starts?
No
Gas and Battery, given Moves?
No

80
Conditional Independence

• What are the conditional independences here?

A B C D E

A and E, given null?


Yes
A and E, given D?
No
A and E, given C,D?
Yes

81
Theorems
• A node is conditionally independent of its non-descendants
given its parents.

• A node is conditionally independent of all other nodes


given its Markov blanket (its parents, its descendants, and
other parents of its children).

82
Why does conditional independence matter?
• Helps the developer (or the user) verify the graph structure
– Are these things really independent?
– Do I need more/fewer arcs?

• Gives hints about computational efficiencies

• Shows that you understand BNs…

• Try this applet:


http://www.phil.cmu.edu/~wimberly/dsep/dSep.html

83
Case Study
• Pathfinder system. (Heckerman 1991, Probabilistic
Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
– 60 diseases and 100 symptoms and test-results.
– 14,000 probabilities
– Expert consulted to make net.
– 8 hours to determine variables.
– 35 hours for net topology.
– 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the
links and probabilities.
• Pathfinder is now outperforming world experts.
84
Inference in Bayesian Networks
• How can one infer (probabilities of) values of one/more network
variables, given observed values of others?
– Bayes net contains all information needed for this
– Easy if only one variable with unknown value
– In general case, problem is NP hard
 Need to compute sums of probs over unknown values
• In practice, can succeed in many cases
– Exact inference methods work well for some network
structures (polytrees)
– Variable elimination methods reduce the amount of repeated
computation
– Monte Carlo methods “simulate” the network randomly to
calculate approximate solutions
85
Learning Bayesian Networks
• Object of current research
• Several variants of this learning task
– Network structure might be known or unknown
 Structure incorporates prior beliefs

– Training examples might provide values of all network variables,


or just some
• If structure known and can observe all variables
– Then it’s as easy as training a Naïve Bayes classifier
– Compute relative frequencies from observations

86
Learning Bayes Nets
• Suppose structure known, variables partially observable
– e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not
Lightning, Campfire...
• Analogous to learning weights for hidden units of ANN
– Assume know input/output node values
– Do not know values of hidden units
• In fact, can learn network conditional probability tables
using gradient ascent
– Search through hypothesis space corresponding to set of all
possible entries for conditional probability tables
– Maximize P(D|h) (ML hypoth for table entries)
– Converge to network h that (locally) maximizes PD|h

87
Gradient for Bayes Net
• Let wijk denote one entry in the conditional probability table for variable
Yi in the network
wijk  PYi  yij|Parents(Yi)  the list uik of parents values)
– e.g., if Yi  Campfire, then uik could be
Storm  T, BusTourGroup  F
• Perform gradient ascent repeatedly by:
– update wijk using training data D
 Using gradient ascent up lnP(D|h) in w-space using wijk update rule with small
step
 Need to calculate sum over training examples of
P(Yi=yij, Ui=uik|d)/wijk
– Calculate these from network
– If unobservable for a given d, use inference
– Renormalize wijk by summing to 1 and normalizing to between [0,1]

88
Gradient for Bayes Net
• Let wijk denote one entry in the conditional probability
table for variable Yi in the network
wijk  PYi  yij|Parents(Yi)  the list uik of values)
¶ln P(D | h) 1 
¶wijk   P(d | h)  w
d D ijk
P(d | yij , ui , k , h) P( yij | u i ,k , h) P(u i ,k | h)

1 
=

ln Õ P(d | h)   P(d | h)  w P(d | yij , ui , k , h) wijk P (u i ,k | h)
¶wijk dÎ D d D ijk

1

¶ln P(d | h)   P (d | h)  P( d | y
d D
ij , u i , k , h) P (u i , k | h)

dÎ D ¶wijk P ( yij , ui , k | d , h)  P (d | h)
1
1 ¶P(d| h)
   P (ui , k | h)
=å d D
P ( d | h ) P ( y ij , u i , k | h )
×
dÎ D P(d | h) ¶wijk P ( yij , ui , k | d , h)
  P ( yij , ui , k | h)
P (ui ,k | h)
1
=å å P(d | yij ', ui,k', h)P(yij ', ui,k' | h)
¶ d D
× P ( yij , ui , k | d , h)
dÎ D P(d | h) ¶wijk j ',k'  
d D
P ( yij | u i, k , h)
1
=å å P(d | yij ', ui,k', h)P(yij ' | ui,k', h)P(ui,k' | h)

× P ( yij , ui , k | d , h)
dÎ D P(d | h) ¶wijk j ',k'
 
d D
wijk
89
Gradient Ascent for Bayes Net
• wijk  PYi  yij|Parents(Yi)  the list uik of values)
 ln P( D | h) P ( yij , ui , k | d , h)

wijk d D wijk

• Perform gradient ascent by repeatedly


1. update all wijk using training data D
P ( yij , uik | d )
wijk  wijk   
d D
wijk
2. then, renormalize the wijk to enssure
j wijk  
 wijk 

90
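A sketch of the update-and-renormalize loop (not a full learner: `posterior` stands in for an inference routine returning P(Yi = yij, Ui = uik | d, h), which the slides say must be computed from the network or by inference; it is assumed, not defined here).

```python
def gradient_ascent_step(w, training_data, posterior, eta=0.01):
    """w[i][j][k] holds the CPT entry wijk = P(Yi = yij | Ui = uik)."""
    # 1. Move every entry up its gradient: wijk += eta * sum_d P(yij, uik | d) / wijk
    for i in w:
        for j in w[i]:
            for k in w[i][j]:
                grad = sum(posterior(i, j, k, d) / w[i][j][k] for d in training_data)
                w[i][j][k] += eta * grad
    # 2. Renormalize so that, for each parent configuration k, sum_j wijk = 1
    for i in w:
        parent_configs = {k for j in w[i] for k in w[i][j]}
        for k in parent_configs:
            total = sum(w[i][j][k] for j in w[i] if k in w[i][j])
            for j in w[i]:
                if k in w[i][j]:
                    w[i][j][k] /= total
    return w
```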
Course outline

91
