AI Learning

The document discusses topics related to machine learning, including inductive learning methods like decision tree learning using information gain, and statistical learning topics such as parameter estimation, naive Bayes classification, and learning Bayesian networks. It provides examples and explanations of key concepts in machine learning like constructing hypotheses from examples and measuring learning performance on test data.

Learning

LESSON 13-14
Reading
Chapter 18
Chapter 20

3/21/2016 503043 - LEARNING 2


Outline
1. Inductive Learning
• Learning agents
• Inductive learning
• Decision tree learning

2. Statistical Learning
◦ Parameter Estimation:
◦ Maximum Likelihood (ML); Maximum A Posteriori (MAP); Bayesian; Continuous case
◦ Learning Parameters for a Bayesian Network
◦ Naive Bayes
◦ Maximum Likelihood estimates; Priors
◦ Learning Structure of Bayesian Networks



Learning
Learning is essential for unknown environments,
◦ i.e., when the designer lacks omniscience

Learning is useful as a system construction method,
◦ i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent's decision mechanisms to improve performance



Learning agents

[Figure omitted: architecture of a learning agent (performance element plus learning element)]


Learning element
Design of a learning element is affected by
◦ Which components of the performance element are to be
learned
◦ What feedback is available to learn these components
◦ What representation is used for the components

Type of feedback:
◦ Supervised learning: correct answers for each example
◦ Unsupervised learning: correct answers not given
◦ Reinforcement learning: occasional rewards



Inductive learning
Simplest form: learn a function from examples

f is the target function
An example is a pair (x, f(x))

Problem: find a hypothesis h such that h ≈ f, given a training set of examples

(This is a highly simplified model of real learning:
◦ Ignores prior knowledge
◦ Assumes examples are given)



Inductive learning method
Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)

E.g., curve fitting:

Ockham’s razor: prefer the simplest hypothesis consistent with data



Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)



Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:

Classification of examples is positive (T) or negative (F)



Decision trees
One possible representation for hypotheses. E.g., here is the "true" tree for deciding whether to wait:



Expressiveness
Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:

Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples

Prefer to find more compact decision trees


Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses

A more expressive hypothesis space
◦ increases the chance that the target function can be expressed
◦ increases the number of hypotheses consistent with the training set
⇒ may get worse predictions
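As a quick arithmetic check, the two counts above can be verified directly (an illustrative snippet, not part of the slides):

```python
# Count the hypothesis spaces for n = 6 Boolean attributes.
n = 6
num_trees = 2 ** (2 ** n)       # one Boolean function per truth table with 2^n rows
num_conjunctions = 3 ** n       # each attribute: positive literal, negative literal, or absent
print(num_trees)                # 18446744073709551616
print(num_conjunctions)         # 729
```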



Decision tree learning
Aim: find a small tree consistent with the training examples
Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree



Choosing an attribute
Idea: a good attribute splits the examples into subsets that
are (ideally) "all positive" or "all negative"

Patrons? is a better choice



Using information theory
To implement Choose-Attribute in the DTL algorithm

Information Content (Entropy):
I(P(v1), … , P(vn)) = Σi −P(vi) log2 P(vi)

For a training set containing p positive examples and n negative examples:

I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))



Information gain
A chosen attribute A divides the training set E into subsets E1, … , Ev according to their values for A, where A has v distinct values.

remainder(A) = Σi=1..v (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))

Information Gain (IG), or reduction in entropy from the attribute test:

IG(A) = I(p/(p+n), n/(p+n)) − remainder(A)

Choose the attribute with the largest IG
Choose the attribute with the largest IG



Information gain
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit

Consider the attributes Patrons and Type (and others too):

IG(Patrons) = 1 − [2/12·I(0,1) + 4/12·I(1,0) + 6/12·I(2/6, 4/6)] ≈ 0.541 bits
IG(Type) = 1 − [2/12·I(1/2,1/2) + 2/12·I(1/2,1/2) + 4/12·I(2/4,2/4) + 4/12·I(2/4,2/4)] = 0 bits

Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root

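The slide's numbers can be reproduced with a short script (a sketch; the helper names `entropy` and `remainder` are chosen here to follow the slide's notation):

```python
import math

def entropy(ps):
    """Information content I(p1, ..., pn) = sum of -p * log2(p), with 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def remainder(splits):
    """splits: one (p_i, n_i) positive/negative count pair per attribute value."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
               for p, n in splits)

I0 = entropy([6/12, 6/12])                                    # 1 bit, since p = n = 6
gain_patrons = I0 - remainder([(0, 2), (4, 0), (2, 4)])       # None, Some, Full
gain_type = I0 - remainder([(1, 1), (1, 1), (2, 2), (2, 2)])  # French, Italian, Thai, Burger
print(gain_patrons, gain_type)   # ≈ 0.541 and ≈ 0
```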


Example contd.
Decision tree learned from the 12 examples:

Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by the small amount of data



Performance measurement
How do we know that h ≈ f ?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size



Summary 1
Learning is needed for unknown environments (and for lazy designers)
Learning agent = performance element + learning element
For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
Decision tree learning using information gain
Learning performance = prediction accuracy measured on the test set



Statistical Learning
Parameter Estimation:
◦ Maximum Likelihood (ML)
◦ Maximum A Posteriori (MAP)
◦ Bayesian
◦ Continuous case

Learning Parameters for a Bayesian Network


Naive Bayes
◦ Maximum Likelihood estimates
◦ Priors

Learning Structure of Bayesian Networks

Coin Flip
C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9

Which coin will I use?


P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3

Prior: Probability of a hypothesis before we make any observations
Coin Flip
C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9

Which coin will I use?


P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3
Uniform Prior: All hypotheses are equally likely before we make any observations
Experiment 1: Heads
Which coin did I use?
P(C1|H) = ? P(C2|H) = ? P(C3|H) = ?

C1 C2 C3

P(H|C1)=0.1 P(H|C2) = 0.5 P(H|C3) = 0.9

P(C1)=1/3 P(C2) = 1/3 P(C3) = 1/3


Experiment 1: Heads
Which coin did I use?
P(C1|H) = 0.066 P(C2|H) = 0.333 P(C3|H) = 0.6

Posterior: Probability of a hypothesis given data


C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3
Terminology
Prior:
◦ Probability of a hypothesis before we see any data
Uniform Prior:
◦ A prior that makes all hypotheses equally likely
Posterior:
◦ Probability of a hypothesis after we saw some data
Likelihood:
◦ Probability of data given hypothesis



Experiment 2: Tails
Which coin did I use?
P(C1|HT) = ? P(C2|HT) = ? P(C3|HT) = ?

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3
Experiment 2: Tails
Which coin did I use?
P(C1|HT) = 0.21 P(C2|HT) = 0.58 P(C3|HT) = 0.21

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3


Your Estimate?
What is the probability of heads after two experiments?

Most likely coin: C2
Best estimate for P(H): P(H|C2) = 0.5

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 1/3 P(C2) = 1/3 P(C3) = 1/3
Your Estimate?
Maximum Likelihood Estimate: The best hypothesis that fits the observed data, assuming a uniform prior

Most likely coin: C2
Best estimate for P(H): P(H|C2) = 0.5

C2

P(H|C2) = 0.5
P(C2) = 1/3
Using Prior Knowledge
Should we always use a Uniform Prior ?
Background knowledge:
◦ Heads => we get a take-home midterm
◦ Dan doesn't like take-homes…
◦ => Dan is more likely to use a coin biased in his favor

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


Using Prior Knowledge
We can encode it in the prior:

P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70


C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


Experiment 1: Heads
Which coin did I use?
P(C1|H) = ? P(C2|H) = ? P(C3|H) = ?

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70



Experiment 1: Heads
Which coin did I use?
P(C1|H) = 0.006 P(C2|H) = 0.165 P(C3|H) = 0.829
Compare with ML posterior after Exp 1:
P(C1|H) = 0.066 P(C2|H) = 0.333 P(C3|H) = 0.600
C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70
Experiment 2: Tails
Which coin did I use?

P(C1|HT) = ? P(C2|HT) = ? P(C3|HT) = ?

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70
Experiment 2: Tails
Which coin did I use?
P(C1|HT) = 0.035  P(C2|HT) = 0.481  P(C3|HT) = 0.485

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70
Your Estimate?
What is the probability of heads after two experiments?

Most likely coin: C3
Best estimate for P(H): P(H|C3) = 0.9

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9


P(C1) = 0.05 P(C2) = 0.25 P(C3) = 0.70
Your Estimate?
Maximum A Posteriori (MAP) Estimate:
The best hypothesis that fits observed data
assuming a non-uniform prior

Most likely coin: C3 (P(C3) = 0.70)
Best estimate for P(H): P(H|C3) = 0.9



Did We Do The Right Thing?
P(C1|HT)=0.035 P(C2|HT)=0.481 P(C3|HT)=0.485

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9



Did We Do The Right Thing?
P(C1|HT) =0.035 P(C2|HT)=0.481 P(C3|HT)=0.485
C2 and C3 are almost
equally likely

C1 C2 C3

P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9



A Better Estimate
P(H) = Σi P(Ci|HT) P(H|Ci) = 0.680

Recall:
P(C1|HT)=0.035 P(C2|HT)=0.481 P(C3|HT)=0.485

C1 C2 C3
P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9



Bayesian Estimate
Bayesian Estimate: Minimizes prediction error,
given data and (generally) assuming a non-uniform prior

P(H) = Σi P(Ci|HT) P(H|Ci) = 0.680

P(C1|HT)=0.035 P(C2|HT)=0.481 P(C3|HT)=0.485

C1 C2 C3
P(H|C1) = 0.1 P(H|C2) = 0.5 P(H|C3) = 0.9
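The posterior updates and the three estimates from the coin example can be sketched in a few lines. The biases and priors are the slide's numbers; the function and variable names are chosen here for illustration:

```python
# Sketch of the three-coin example: Bayes-rule updates, MAP, and Bayesian estimates.
coins = {"C1": 0.1, "C2": 0.5, "C3": 0.9}          # P(H | coin)

def posterior(prior, flips):
    """Update P(coin) given a sequence of 'H'/'T' flips (Bayes rule + normalize)."""
    post = dict(prior)
    for f in flips:
        for c, ph in coins.items():
            post[c] *= ph if f == "H" else (1 - ph)
        z = sum(post.values())
        post = {c: p / z for c, p in post.items()}
    return post

informed = {"C1": 0.05, "C2": 0.25, "C3": 0.70}    # background-knowledge prior
post = posterior(informed, "HT")
map_est = coins[max(post, key=post.get)]           # bias of the most probable coin
bayes_est = sum(post[c] * coins[c] for c in coins) # posterior-weighted average of biases
print({c: round(p, 3) for c, p in post.items()})   # {'C1': 0.035, 'C2': 0.481, 'C3': 0.485}
print(map_est, round(bayes_est, 3))                # 0.9 0.68
```

Running the same update with the uniform prior {1/3, 1/3, 1/3} reproduces the earlier slides' numbers as well.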
Comparison
After more experiments: HTHHHHHHHH (HT followed by eight more heads)

ML (Maximum Likelihood):
P(H) = 0.5
after 10 experiments: P(H) = 0.9

MAP (Maximum A Posteriori):
P(H) = 0.9
after 10 experiments: P(H) = 0.9

Bayesian:
P(H) = 0.68
after 10 experiments: P(H) = 0.9



Comparison
ML (Maximum Likelihood):
Easy to compute
MAP (Maximum A Posteriori):
Still easy to compute
Incorporates prior knowledge
Bayesian:
Minimizes error => great when data is scarce
Potentially much harder to compute



Summary For Now

Estimate                        Prior     Hypothesis
Maximum Likelihood Estimate     Uniform   The most likely
Maximum A Posteriori Estimate   Any       The most likely
Bayesian Estimate               Any       Weighted combination


Continuous Case
In the previous example,
◦ we chose from a discrete set of three coins

In general,
◦ we have to pick from a continuous distribution of biased coins
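The slides do not name a particular continuous distribution; a standard choice for a coin bias is the Beta family, under which the update has a closed form. The following sketch assumes a Beta prior (an assumption made here, not stated in the slides):

```python
# With a Beta(a, b) prior over the coin bias, the posterior after
# h heads and t tails is Beta(a + h, b + t).
def beta_update(a, b, flips):
    h = flips.count("H")
    t = flips.count("T")
    return a + h, b + t

# Uniform prior = Beta(1, 1); after the sequence HT the posterior is Beta(2, 2)
a, b = beta_update(1, 1, "HT")
map_est = (a - 1) / (a + b - 2)     # posterior mode (MAP), defined for a, b > 1
bayes_est = a / (a + b)             # posterior mean (Bayesian estimate)
print(a, b, map_est, bayes_est)     # 2 2 0.5 0.5
```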


Continuous Case

[Figure omitted: a prior density over the coin bias on [0, 1]]


Continuous Case

[Figures omitted: densities over the coin bias, in three columns (Prior, after Exp 1: Heads, after Exp 2: Tails); top row with the uniform prior, bottom row with the background-knowledge prior]


Continuous Case
Posterior after 2 experiments:

[Figures omitted: posterior densities with the ML, MAP, and Bayesian estimates marked, under the uniform prior and under the background-knowledge prior]


After 10 Experiments...
Posterior:

[Figures omitted: posterior densities after 10 experiments, with the ML, MAP, and Bayesian estimates marked, under the uniform prior and under the background-knowledge prior]


After 100 Experiments...

[Figure omitted: posterior after 100 experiments]


Topics
Parameter Estimation:
◦ Maximum Likelihood (ML)
◦ Maximum A Posteriori (MAP)
◦ Bayesian
◦ Continuous case

Learning Parameters for a Bayesian Network


Naive Bayes
◦ Maximum Likelihood estimates
◦ Priors

Learning Structure of Bayesian Networks



Review: Conditional Probability
P(A | B) is the probability of A given B
Assumes that B is the only info known.
Defined by:

P(A | B) = P(A ∧ B) / P(B)

[Venn diagram omitted: events A and B and their intersection A ∧ B]


Conditional Independence
A and B are not independent, since P(A|B) < P(A)

[Venn diagram omitted: A, B, and A ∧ B]


Conditional Independence
But: A and B are made independent by C

P(A|C) = P(A|B,C)

[Venn diagram omitted: A, B, and C, with intersections A∧B, A∧C, B∧C]
Bayes Rule

P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:

1. P(H | E) = P(H ∧ E) / P(E)            (def. cond. prob.)
2. P(E | H) = P(H ∧ E) / P(H)            (def. cond. prob.)
3. P(H ∧ E) = P(E | H) P(H)              (multiply #2 by P(H))
QED: P(H | E) = P(E | H) P(H) / P(E)     (substitute #3 into #1)
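As a quick numeric illustration of the rule (the numbers here are chosen for illustration, not taken from the slides):

```python
# Bayes rule on made-up numbers: P(H|E) = P(E|H) P(H) / P(E).
p_h = 0.05              # prior P(H)
p_e_given_h = 0.9       # likelihood P(E | H)
p_e_given_not_h = 0.2   # P(E | not H)

# P(E) by total probability, then Bayes rule
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 3))   # 0.191
```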


An Example Bayes Net
Structure: Earthquake → Radio; Earthquake → Alarm ← Burglary; Alarm → Nbr1Calls; Alarm → Nbr2Calls

P(B=t) = 0.05   P(B=f) = 0.95

P(A|E,B), with P(¬A) in parentheses:
e, b:   0.9  (0.1)
e, ¬b:  0.2  (0.8)
¬e, b:  0.85 (0.15)
¬e, ¬b: 0.01 (0.99)

Given Parents, X is Independent of Non-Descendants



Given Markov Blanket, X is Independent of All Other Nodes

MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))



Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

We have: the Bayes Net structure and observations
We need: the Bayes Net parameters
Parameter Estimation and Bayesian Networks

E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...

P(B) = ?

[Figures omitted: a prior density over the value of P(B); prior + data yields either a MAP or a Bayesian estimate]


Parameter Estimation and Bayesian Networks
E B R A J M
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
P(A|E,B) = ?
P(A|E,¬B) = ?
P(A|¬E,B) = ?
P(A|¬E,¬B) = ?
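One way to sketch these estimates from the five rows above. The helper `p_a_given` and the use of Laplace smoothing are illustrative choices made here, not taken from the slides; smoothing is one standard way to handle parent configurations with no observations:

```python
# ML and Laplace-smoothed estimates of P(A | E, B) from the five observations,
# one string of T/F values per row, in variable order E, B, R, A, J, M.
rows = ["TFTTFT", "FFFFFT", "FTFTTT", "FFFTTT", "FTFFFF"]
data = [dict(zip("EBRAJM", r)) for r in rows]

def p_a_given(e, b, smooth=False):
    """Estimate P(A=T | E=e, B=b); smoothing avoids 0/0 on unseen parent values."""
    match = [d for d in data if d["E"] == e and d["B"] == b]
    hits = sum(d["A"] == "T" for d in match)
    if smooth:
        return (hits + 1) / (len(match) + 2)      # Laplace (add-one) smoothing
    return hits / len(match) if match else None   # None: no data for this case

print(p_a_given("F", "T"))               # 0.5  (rows 3 and 5: A = T, then F)
print(p_a_given("T", "T"))               # None (no e,b rows were observed)
print(p_a_given("T", "T", smooth=True))  # 0.5  (the prior pulls the unseen case to 1/2)
```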


Recap
Given a BN structure (with discrete or continuous variables), we can learn the parameters of the conditional probability tables.

[Figures omitted: two example networks: the alarm network (Earthqk, Burgl → Alarm → N1, N2) and a spam model (Spam → Nigeria, Sex, Nude)]
What if we don't know structure?

Learning The Structure of Bayesian Networks
Search thru the space…
◦ of possible network structures!
◦ (for now, assume we observe all variables)
For each structure, learn parameters
Pick the one that fits observed data best
◦ Caveat: won't we end up fully connected?

When scoring, add a penalty proportional to model complexity
Learning The Structure of Bayesian Networks
Search thru the space
For each structure, learn parameters
Pick the one that fits observed data best

Problem?
Exponential number of networks!
And we need to learn parameters for each!
Exhaustive search out of the question!
So what now?



Learning The Structure of Bayesian Networks

Local search!
◦ Start with some network structure
◦ Try to make a change
◦ (add or delete or reverse edge)
◦ See if the new network is any better

◦What should be the initial state?

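The local-search loop above can be sketched as follows. The scoring function is left abstract (a real implementation might use a BIC-style score: log-likelihood minus a complexity penalty), and full acyclicity checking is omitted for brevity; the function names are chosen here for illustration:

```python
import itertools

def neighbors(edges, nodes):
    """All structures one change away: add, delete, or reverse a directed edge."""
    out = []
    for a, b in itertools.permutations(nodes, 2):
        if (a, b) in edges:
            out.append(edges - {(a, b)})                # delete edge
            out.append(edges - {(a, b)} | {(b, a)})     # reverse edge
        elif (b, a) not in edges:
            out.append(edges | {(a, b)})                # add edge (skip 2-cycles;
                                                        # full cycle checking omitted)
    return out

def hill_climb(nodes, score, start=frozenset(), steps=100):
    """Greedy local search: move to the best-scoring neighbor until no gain."""
    current = set(start)
    for _ in range(steps):
        best = max(neighbors(current, nodes), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current          # local optimum reached
        current = best
    return current
```

With a toy score that rewards one particular edge and penalizes edge count, the search finds exactly that edge and stops, illustrating how the complexity penalty prevents a fully connected result.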


Initial Network Structure?
Uniform prior over random networks?

Network which reflects expert knowledge?



Learning BN Structure

[Figure omitted]
The Big Picture
We described how to do MAP (and ML) learning of a Bayes net
(including structure)

How would Bayesian learning (of BNs) differ?

Find all possible networks
Calculate their posteriors
When doing inference, return a weighted combination of predictions from all networks!
