Lecture 5: Bayesian Classification 3

Bayesian Theorem

Abu Saleh Musa Miah


Assist. Professor, Dept. of CSE, BAUST, Bangladesh
email: [email protected], tel: +8801734264899
web: www.baust.edu.bd/cse
1
Probability
Probability is the measure of the likelihood that an event will occur.

Probability is quantified as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.

2
Example
 A simple example is the toss of a fair (unbiased) coin.
 Since the two outcomes are equally probable,
 the probability of "heads" equals the probability of "tails",
 so there is a 1/2 (or 50%) chance of either "heads" or "tails".

3
Probability Theory
Try to write rules for dental diagnosis using first-order logic, so that
we can see how the logical approach breaks down.
Consider the following rule:
∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity)

The problem is that this rule is wrong. Not all patients with
toothaches have cavities; some of them have gum disease, an abscess,
or one of several other problems:

4
Unfortunately, in order to make the rule true, we have to add an
almost unlimited list of possible causes. We could try turning the
rule into a causal rule:
∀p Disease(p, Cavity) ⇒ Symptom(p, Toothache)

 But this rule is not right either; not all cavities cause pain. The only
way to fix the rule is to make it logically exhaustive:
 to augment the left-hand side with all the qualifications required for
a cavity to cause a toothache.
 Even then, for the purposes of diagnosis, one must also take into
account the possibility that the patient might have a toothache and a
cavity that are unconnected

5
Trying to use first-order logic to cope with a domain like medical
diagnosis thus fails for three main reasons:

 Laziness: It is too much work to list the complete set of
antecedents or consequents needed to ensure an exceptionless rule,
and too hard to use such rules.
 Theoretical ignorance: Medical science has no complete theory
for the domain.
 Practical ignorance: Even if we know all the rules, we might be
uncertain about a particular patient because not all the necessary
tests have been or can be run.

 The connection between toothaches and cavities is just not a logical
consequence in either direction.
 This is typical of the medical domain, as well as most other
judgmental domains:
 law, business, design, automobile repair, gardening, dating, and so on.
6
 Our main tool for dealing with degrees of belief will be probability
theory, which assigns to each sentence a numerical degree of belief
between 0 and 1.
 Probability provides a way of summarizing the uncertainty that
comes from our laziness and ignorance.

 We might not know for sure what afflicts a particular patient, but
we believe that there is, say, an 80% chance (that is, a probability of
0.8) that the patient has a cavity if he or she has a toothache.
 The 80% summarizes those cases in which all the factors needed for
a cavity to cause a toothache are present and other cases in which
the patient has both toothache and cavity but the two are
unconnected. The missing 20% summarizes all the other possible
causes of toothache that we are too lazy or ignorant to confirm or
deny.

7
Probability Theory: Variables and Events
• In probability theory, the set of all possible worlds is called the sample space.
• For example, if we are about to roll two (distinguishable) dice, there are 36
possible worlds to consider: (1,1), (1,2), . . ., (6,6).
• A random variable is an observation, outcome, or event whose value is uncertain.
• Total and Die1 are random variables; each random variable has a domain, the set of
possible values it can take on.
possible values it can take on.
• The domain of Total for two dice is the set {2, . . . , 12} and the domain of Die1 is
{1, . . . , 6}.
• e.g., tossing a coin: let Throw be the random variable denoting the outcome.
• The set of possible outcomes for a random variable is called its domain.
• The domain of Throw is {head, tail}
• A Boolean random variable has two outcomes.
• University {true, false}
• Cavity has the domain {true, false}
• Toothache has the domain {true, false}
• Weather has the domain {sunny, rainy, cloudy, snow}
Probability Theory: Variables and Events
Each random variable has a domain of values that it can take on.

For example, the domain of Cavity might be {true, false}.

A random variable can be:

 Discrete: e.g., the roll of a die.
 Continuous: e.g., tomorrow's temperature.

For example, the domain of Weather might be {sunny, rainy, cloudy, snow}.
Prior/Unconditional probability
The unconditional or prior probability associated with a proposition
a is the degree of belief accorded to it in the absence of any other
information. It is written as P(a).
For example,
Probability = (number of outcomes favourable to the event) / (total number of possible outcomes)

 If we roll a fair die, the outcome can be 1, 2, 3, 4, 5, or 6.
 The probability of rolling a 6 is 1/6.
 The probability of an odd (or even) number is 3/6 = 1/2.

 If I am going to the dentist for a regular checkup, the prior probability P(cavity) = 0.2
might be of interest.
10
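As a quick illustrative sketch (my addition, not from the slides), the prior probabilities above can be checked in a few lines of Python using the fraction of favourable outcomes in the sample space:

```python
from fractions import Fraction

# Sample space of a fair six-sided die; every outcome is equally likely.
sample_space = [1, 2, 3, 4, 5, 6]

def prior(event):
    """Prior probability = favourable outcomes / total possible outcomes."""
    favourable = [w for w in sample_space if event(w)]
    return Fraction(len(favourable), len(sample_space))

print(prior(lambda w: w == 6))      # 1/6
print(prior(lambda w: w % 2 == 1))  # 1/2 (odd numbers)
```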
Prior probability
Probability = (number of outcomes favourable to the event) / (total number of possible outcomes)
For example, for a fair die you would expect the event of throwing a
five to occur with a relative frequency of about one-sixth.

11
Independent variables
Two random variables are considered independent if their joint
probability, that is, the probability of X and Y, equals the product of
their marginals.
So it is the probability of X times the probability of Y.

X and Y are independent if:

P(X, Y) = P(X) P(Y)
(joint) = (product of marginals)

12
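A minimal sketch (my addition, not part of the slides) that checks this definition numerically for two independently thrown fair coins; the joint distribution below is assumed for illustration:

```python
import itertools

# Assumed joint distribution of two fair, independently thrown coins.
joint = {(x, y): 0.25 for x, y in itertools.product(["H", "T"], repeat=2)}

# Marginals obtained by summing the joint over the other variable.
p_x = {x: sum(joint[(x, y)] for y in ["H", "T"]) for x in ["H", "T"]}
p_y = {y: sum(joint[(x, y)] for x in ["H", "T"]) for y in ["H", "T"]}

# Independence: P(X, Y) == P(X) * P(Y) for every pair of values.
independent = all(
    abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
    for x, y in joint
)
print(independent)  # True
```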
Independence
Let's see an example.

 Imagine we have a deck of 52 cards and draw two cards from it at random.
 The first random variable is the picture drawn on the first card,
 and the second is the picture drawn on the second card.
Those variables are dependent, since it is impossible to
draw the same card twice.
Another example is throwing two coins independently.

The probability that the first coin lands heads up and the second
lands tails up equals the product of the two probabilities.

And so these random variables are independent. 13


Conditional Probability
Probability of X given that Y happened:

P(X | Y) = P(X, Y) / P(Y)

That is, the probability of X given Y equals the joint probability
of X and Y divided by the marginal probability of Y.
Conditional Probability
• A conditional probability expresses the likelihood that one event a will occur if b
occurs. We denote this as follows
P(a | b)
• This is read as "the probability of a, given that all we know is b." For example,

• e.g.
P(Toothache = true) = 0.2

P(Toothache = true | Cavity = true) = 0.6


• So conditional probabilities reflect the fact that some events make other events
more (or less) likely

• If one event doesn’t affect the likelihood of another event they are said to be
independent and therefore
P(a | b) = P(a)
• E.g. if you roll a 6 on a die, it doesn’t make it more or less likely that you will
roll a 6 on the next throw. The rolls are independent.
Conditional Probability
Let's consider an example.  
 Imagine you are a student and you want to pass some course.
 It has two exams in it, a midterm and the final.
 The probability that the student will pass a midterm is 0.4 and
 the probability that the student will pass a midterm and the final 0.25.
 If you want to find the probability that you will pass the final, given that you already passed
the midterm, you can apply the formula from the previous slide. This gives a
value of 0.625, i.e. 62.5%.

 P(M) = 0.4
 P(M ∧ F) = 0.25
 P(F | M) = P(M ∧ F) / P(M) = 0.25 / 0.4 = 0.625

 We need two tricks:

 the chain rule and the sum rule.
16
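A small sketch (my own illustration, not from the slides) that reproduces the midterm/final calculation using the definition P(F | M) = P(M, F) / P(M):

```python
# Given values from the example above.
p_midterm = 0.40             # P(M): probability of passing the midterm
p_midterm_and_final = 0.25   # P(M, F): probability of passing both exams

# Conditional probability by definition: P(F | M) = P(M, F) / P(M).
p_final_given_midterm = p_midterm_and_final / p_midterm
print(p_final_given_midterm)  # 0.625
```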
Joint Probability
 P( Weather, Cavity) denotes all combinations of the values of Weather and Cavity
Where Weather=sunny, rainy, cloudy, snow
Cavity=yes, no

 That is a 4 × 2 table of probabilities, called the joint probability distribution of
Weather and Cavity.

 P(sunny, Cavity) would be a two-element vector giving the probabilities of a sunny day
with a cavity and a sunny day with no cavity.

 P(Weather, Cavity) = P(Weather | Cavity) P(Cavity), instead of writing these as 4 × 2 = 8
equations (using abbreviations W and C):

17
Joint Probability
 P( Weather, Cavity) denotes all combinations of the values of Weather and Cavity

 P(W, C) = P(W | C) P(C), instead of these 4 × 2 = 8 equations (using abbreviations
W for Weather and C for Cavity):

 Written as P(sunny, cavity) or P(sunny ∧ cavity).


 We will sometimes use the P notation to derive results about individual P values, and when we
say "P(sunny) = 0.6" it is really an abbreviation for "P(sunny) is the one-element vector
⟨0.6⟩, which means that P(sunny) = 0.6."

18
Full Joint Probability Distribution
 A joint probability distribution that covers this complete set is
called the full joint probability distribution.
 We borrow this part directly from the semantics of propositional
logic, as follows. A possible world is defined to be an
assignment of values to all of the random variables under
consideration.
 For example, Cavity, Toothache, and Weather, then the full joint
distribution is given by P(Cavity , Toothache, Weather).
 This joint distribution can be represented as a 2 x 2 x 4 table with
16 entries
 Probability distributions for continuous variables are called
probability density functions.

19
Combining Probabilities

 For example, the sentence P(a | b) = 0.8 cannot be interpreted to mean "whenever b holds, conclude that P(a) is 0.8."
First, P(a) always denotes the prior probability of a, not the posterior probability given some evidence;
 second, the statement P(a | b) = 0.8 is immediately relevant just when b is the only available evidence.
 When additional information c is available, the degree of belief in a is P(a | b ∧ c), which might have little relation
to P(a | b).
 For example, c might tell us directly whether a is true or false.
 If we examine a patient who complains of toothache, and discover a cavity, then we have additional evidence cavity,
and
 we conclude (trivially) that P(cavity | toothache ∧ cavity) = 1.0.
 So
Chain rule

 We can derive it from the definition of the conditional probability.


 That is, the joint probability of X and Y equals the product of the probability of X given Y and the
probability of Y.
 By induction, we can prove the same formula for three variables:
 the probability of X, Y, and Z equals the probability of X given Y and
Z, times the probability of Y given Z, and finally the probability of Z. In a similar way, we can
obtain the formula for an arbitrary number of variables: the probability of the
current variable, given all its previous variables.
21
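In symbols (my notation for the verbal statement above):

P(X, Y) = P(X | Y) P(Y)
P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
P(X1, . . . , Xn) = Π_{i=1..n} P(Xi | X1, . . . , X(i−1))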
Sum rule
If events Y1, . . . , Yn are mutually exclusive and exhaustive, then

P(X) = P(X, Y1) + P(X, Y2) + . . . + P(X, Yn)
     = P(X | Y1) P(Y1) + P(X | Y2) P(Y2) + . . . + P(X | Yn) P(Yn)

If you want to find the marginal distribution P(X), and

you know only the joint probability P(X, Y),
you can sum (or, for a continuous variable, integrate) out the random variable Y, as given by the formula above.
The Axioms of Probability
Using the axioms of probability
We can derive a variety of useful facts from the basic axioms. For
example, the familiar rule for negation follows by substituting ¬a for
b in axiom 3, giving us:
P(¬a) = 1 − P(a)

 Let the discrete variable D have the domain {d1, . . . , dn}.

 Then it is easy to show (Exercise 13.2) that
P(D = d1) + P(D = d2) + . . . + P(D = dn) = 1

 The probability of a proposition is equal to the sum of the probabilities of the
atomic events in which it holds.
Inference by Enumeration
Inference by Enumeration
Normalization

Normalization will turn out to be a useful shortcut in many probability calculations


Independence
BAYES RULE
Conditional probability can be written in two forms because of the
commutativity of conjunction:

P(a ∧ b) = P(a | b) P(b)    and    P(a ∧ b) = P(b | a) P(a)

Equating the two right-hand sides and dividing by P(a), we get

P(b | a) = P(a | b) P(b) / P(a)

B = parameter
A = observation/data

This equation is known as Bayes' rule (also Bayes' law or Bayes'
theorem).

29
BAYES RULE
The Bayes theorem was developed by and named for Thomas
Bayes (1702–1761).
 It shows the relation between a conditional probability and its
inverse.
 It provides a mathematical rule for revising an estimate or forecast in
light of experience and observation.
In the 18th century, Thomas Bayes
 pondered this question:
"Does God really exist?"
• Being interested in mathematics, he attempted to develop a formula
to arrive at the probability that God does exist, based on
the evidence that was available to him on earth.
Later, Laplace refined Bayes’ work and gave it the name
“Bayes’ Theorem”.
30
Applying: Bayesian Rule
1. Classification:

2.Regularization:
This formula can also lead to regularization.
We can treat the prior on theta as a regularizer.
Imagine that you want to estimate the probability that your coin will land
heads up.
You already know that most coins land heads up with probability 0.5, so
you can use a prior that says most coins are fair.
However, if you know that for your experiment the probability
of heads can either be fair, that is 0.5, or biased towards heads,
that is greater than 0.5, you could use, for example, the following prior.
31
Applying: BAYES RULE
3. Online Learning

4.Prediction:
 Bayes’ rule is useful in practice because there are many cases where we do have good probability
estimates for these three numbers and need to compute the fourth.

The conditional probability P(effect | cause) quantifies the relationship in the causal direction, whereas
P(cause | effect) describes the diagnostic direction. In a task such as medical diagnosis,
we often have conditional probabilities on causal relationships (that is, the doctor knows P(symptoms |
disease)) and want to derive a diagnosis, P(disease | symptoms).

32
Example-1 BAYES RULE
A doctor knows that the disease meningitis causes the patient to have
a stiff neck, say, 50% of the time. The doctor also knows some
unconditional facts: the prior probability that a patient has meningitis is
1/50,000, and the prior probability that any patient has a stiff neck is 1/20.
Letting s be the proposition that the patient has a stiff neck and m be
the proposition that the patient has meningitis, we have

That is, we expect only 1 in 5000 patients with a stiff neck to have
meningitis

33
Bayes’ rule can capture causal models
• Suppose a doctor knows that meningitis causes a stiff neck in 50% of cases:
P(s | m) = 0.5
• She also knows that the probability in the general population of someone having
a stiff neck at any time is 1/20:
P(s) = 0.05

• She also has to know the incidence of meningitis in the population (1/50,000):
P(m) = 0.00002
• Using Bayes' rule she can calculate the probability that the patient has meningitis:

P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 0.00002) / 0.05 = 0.0002 = 1/5000

P(cause | effect) = P(effect | cause) P(cause) / P(effect)
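As a quick check (my own sketch, not part of the slides), the same calculation in Python:

```python
# Known quantities from the meningitis example.
p_s_given_m = 0.5      # P(s | m): stiff neck given meningitis
p_m = 1 / 50_000       # P(m): prior probability of meningitis
p_s = 1 / 20           # P(s): prior probability of a stiff neck

# Bayes' rule: P(m | s) = P(s | m) * P(m) / P(s)
p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 6))  # 0.0002, i.e. 1 in 5000
```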
Example-2 of Bayes Rule
Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent
years, it has rained only 5 days each year. Unfortunately, the weatherman has
predicted rain for tomorrow. When it actually rains, the weatherman correctly
forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10%
of the time. What is the probability that it will rain on the day of Marie's wedding?

Solution…
The sample space is defined by two mutually-exclusive events - it rains
or it does not rain. Additionally, a third event occurs when the
weatherman predicts rain. Notation for these events appears below.

 Event A1. It rains on Marie's wedding.


 Event A2. It does not rain on Marie's wedding.
 Event B. The weatherman predicts rain

35
In terms of probabilities, we know the following:
 P( A1 ) = 5/365 =0.0136985 [It rains 5 days out of the year.]
 P( A2 ) = 360/365 = 0.9863014 [It does not rain 360 days out
of the year.]
 P( B | A1 ) = 0.9 [When it rains, the weatherman predicts rain
90% of the time.]
 P( B | A2 ) = 0.1 [When it does not rain, the weatherman
predicts rain 10% of the time.]

We want to know P(A1 | B), the probability it will rain on the day
of Marie's wedding, given a forecast for rain by the weatherman.
The answer can be determined from Bayes' theorem, as shown
below.

36
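A short sketch (my addition) that completes the calculation with Bayes' theorem and the law of total probability:

```python
# Probabilities stated in the problem.
p_rain = 5 / 365              # P(A1): it rains on the wedding day
p_dry = 360 / 365             # P(A2): it does not rain
p_forecast_given_rain = 0.9   # P(B | A1): forecast of rain when it rains
p_forecast_given_dry = 0.1    # P(B | A2): forecast of rain when it is dry

# Total probability of a rain forecast: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
p_forecast = (p_forecast_given_rain * p_rain
              + p_forecast_given_dry * p_dry)

# Bayes' theorem: P(A1 | B) = P(B | A1) P(A1) / P(B)
p_rain_given_forecast = p_forecast_given_rain * p_rain / p_forecast
print(round(p_rain_given_forecast, 3))  # about 0.111
```

So even with a rain forecast, rain on the wedding day is still fairly unlikely, because rainy days are rare to begin with.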
Using Bayes' rule: Combining evidence
What happens when we have two or more pieces of evidence? For example, what can a
dentist conclude if her nasty steel probe catches in the aching tooth of a patient? If we
know the full joint distribution

That might be feasible for just two evidence variables, but again it will not scale up. If there are
n possible evidence variables (X-rays, diet, oral hygiene, etc.), then there are 2^n possible
combinations of observed values for which we would need to know conditional probabilities.

This equation expresses the conditional independence of toothache and catch given Cavity.
We can plug it into Equation (13.12) to obtain the probability of a cavity

37
Using Bayes' rule: Combining evidence
 The dentistry example illustrates a commonly occurring pattern in which a single cause
directly influences a number of effects, all of which are conditionally independent, given
the cause. The full joint distribution can be written as

Such a probability distribution is called a naive Bayes model: "naive" because it is often
used (as a simplifying assumption) in cases where the "effect" variables are not actually conditionally
independent given the cause variable. (The naive Bayes model is sometimes called a
Bayesian classifier, a somewhat careless usage that has prompted true Bayesians to call it
the idiot Bayes model.)

38
What is a Bayesian Network ?
A graphical model that efficiently encodes the joint probability
distribution for a large set of variables
We remarked on the importance of independence and conditional
independence relationships in simplifying probabilistic
representations of the world.
This chapter introduces a systematic way to represent
such relationships explicitly in the form of Bayesian networks

39
Bayesian network Or Belief Networks
A Bayesian network is a directed graph in which each node is
annotated with quantitative probability information. The full
specification is as follows:

1. A set of random variables makes up the nodes of the network.


Variables may be discrete or continuous.
2. A set of directed links or arrows connects pairs of nodes. If
there is an arrow from node X to node Y, X is said to be a parent of Y.
3. Each node Xi has a conditional probability distribution that
quantifies the effect of the parents on the node.
4. The graph has no directed cycles (and hence is a directed, acyclic
graph, or DAG).
Probabilistic model from BN
Nodes: Random variables
Edges: Direct impact
BN Excercise
 consisting of the variables Toothache, Cavity, Catch, and Weather
 We argued that Weather is independent of the other variables;
 we argued that Toothache and Catch are conditionally independent, given Cavity.

Figure 14.1: A simple Bayesian network in which Weather is
independent of the other three variables, and Toothache and Catch are
conditionally independent, given Cavity.
 Formally, the conditional independence of Toothache and Catch given Cavity is indicated by
the absence of a link between Toothache and Catch.
 Intuitively, the network represents the fact that Cavity is a direct cause of Toothache and
Catch, whereas no direct causal relationship exists between Toothache and Catch. 42
BN Excercise

1) Consider the following Bayesian network, where F = having the flu


and C = coughing:

a) Write down the joint probability table specified by the Bayesian


network.
Answer:

43
b) Determine the probabilities for the following Bayesian network

so that it specifies the same joint probabilities as the given one.

Answer:

44
C) Which Bayesian network would you have specified using the
rules learned in class?
Answer:
The first one. It is good practice to add nodes that correspond to
causes before nodes that correspond to their effects.

d) Are C and F independent in the given Bayesian network?

Answer:
No, since (for example) P(F) = 0.1 but P(F | C) = 0.23
e) Are C and F independent in the Bayesian network from Question b?
Answer:
No, for the same reason.

45
2) To safeguard your house, you recently installed two different
alarm systems by two different reputable manufacturers that use
completely different sensors for their alarm systems.
a) Which one of the two Bayesian networks given below makes
independence assumptions that are not true? Explain all of your
reasoning. Alarm1 means that the first alarm system rings,
Alarm2 means that the second alarm system rings, and Burglary
means that a burglary is in progress.

Answer: The second one falsely assumes that Alarm1 and Alarm2 are
independent if the value of Burglary is unknown. However, if the
alarms are working as intended, it should be more likely that Alarm1
rings if Alarm2 rings (that is, they should not be independent). 46
2) To safeguard your house, you recently installed two different
alarm systems by two different reputable manufacturers that use
completely different sensors for their alarm systems.

b) Consider the first Bayesian network. How many probabilities


need to be specified for its conditional probability tables? How
many probabilities would need to be given if the same joint
probability distribution were specified in a joint probability table?

47
 Answer:

We need to specify 5 probabilities,

namely P(Burglary),
P(Alarm1 | Burglary),
P(Alarm1 | ¬Burglary),
P(Alarm2 | Burglary), and
P(Alarm2 | ¬Burglary).

A joint probability table would need 2^3 − 1 = 7 probabilities.

48
c) Consider the second Bayesian network. Assume that:

49
Answer:

with

50
4) Consider the following Bayesian network. A, B, C, and D
are Boolean random variables. If we know that A is true, what
is the probability of D being true?

51
5) For the following Bayesian network

We know that X and Z are not guaranteed to be independent if


the value of Y is unknown.
This means that, depending on the probabilities, X and Z can
be independent or dependent if the value of Y is unknown.

Construct probabilities where X and Z are independent if the


value of Y is unknown, and show that they are indeed
independent. 52
53
Bayesian network earthquake example
 Imagine that you install an alarm in your house to protect it against thieves.
 If a thief breaks into the house, the alarm will go off and you will get, for example,
an SMS notification.
 However, the alarm may give a false alarm in case of an
earthquake.
 Also, if there is a strong earthquake, the radio will report
it, so you get another source of notification.
 Here is a graphical model for it. The joint
probability of the four variables, thief, alarm,
earthquake, and radio, is given by the following formula.

P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)
What is the probability that there is a thief in our house?
Let's define it as 10 to the power of minus three; that is, one in a
thousand houses is robbed.
What is the probability of an earthquake?
Let's say it is 10 to the power of minus two; earthquakes
happen about once in 100 days. Now, we define the probability
of the alarm given the thief and the earthquake, so those will be four
numbers. If there is a thief in our house, the alarm will go off for sure.
BN probabilistic distribution
Priors:
P(T=1) = 10^-3
P(E=1) = 10^-2

P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)

P(A=1 | T, E):     E=0     E=1
  T=0               0      1/10
  T=1               1       1

P(R=1 | E):   E=0: 0     E=1: 1/2

 If there is no thief and there is no earthquake, the alarm has no reason to send us a signal.
 However, if there is no thief but there is an earthquake, the alarm will notify us in about one case out of ten.
 If there is no earthquake, the radio has nothing to report, so it will not report anything.
 However, if there is an earthquake, the radio will report it with probability one half; that is, it does not report some small earthquakes.
BN probabilistic distribution
(CPTs P(A=1 | T, E) and P(R=1 | E) and the priors P(T=1) = 10^-3, P(E=1) = 10^-2 as on the previous slide.)

 Suppose you are at work and you get a notification from the alarm system.
 You want to estimate the probability that there is a thief in your house,
given that the alarm has gone off, i.e. given that we have received a notification from the alarm.
 This is the probability of a thief given the alarm:
 P(T=1 | A=1) ≈ 50%
 If, in addition, the radio reports an earthquake (R=1), the earthquake explains the alarm and the probability of a thief drops:
 P(T=1 | A=1, R=1) ≈ 1%
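A small enumeration sketch (my addition) that computes these two posteriors from the CPTs above, using the factorization P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e):

```python
import itertools

p_t = {1: 1e-3, 0: 1 - 1e-3}          # prior on thief
p_e = {1: 1e-2, 0: 1 - 1e-2}          # prior on earthquake
p_a = {(0, 0): 0.0, (0, 1): 0.1,      # P(A=1 | T, E)
       (1, 0): 1.0, (1, 1): 1.0}
p_r = {0: 0.0, 1: 0.5}                # P(R=1 | E)

def joint(t, a, e, r):
    """P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)."""
    pa = p_a[(t, e)] if a == 1 else 1 - p_a[(t, e)]
    pr = p_r[e] if r == 1 else 1 - p_r[e]
    return p_t[t] * p_e[e] * pa * pr

def posterior_thief(evidence):
    """P(T=1 | evidence) by enumerating the full joint distribution."""
    num = den = 0.0
    for t, a, e, r in itertools.product([0, 1], repeat=4):
        world = {"T": t, "A": a, "E": e, "R": r}
        if all(world[k] == v for k, v in evidence.items()):
            den += joint(t, a, e, r)
            if t == 1:
                num += joint(t, a, e, r)
    return num / den

print(round(posterior_thief({"A": 1}), 2))           # about 0.50
print(round(posterior_thief({"A": 1, "R": 1}), 2))   # about 0.01
```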
Bayesian network earthquake example
Example:
I'm at work, neighbor John calls to say my alarm is ringing, but
neighbor Mary doesn't call. Sometimes the alarm is set off by minor
earthquakes. Is there a burglary?

Variables: 5 variables: Burglary, Earthquake, Alarm, JohnCalls,
and MaryCalls. Network topology reflects 'causal knowledge'.
Conditional Probability Table(CPT) Example

In the CPTs, the letters B, E, A, J, and M stand for Burglary,
Earthquake, Alarm, JohnCalls, and MaryCalls respectively.
Joint distribution

  Joint Distribution for a network


with n Boolean nodes has 2^n rows,
one for each combination of node values.
For the burglary network, that is 2^5 = 32 rows… OK, 31 independent numbers.

59
Compactness

 A CPT for a Boolean variable with k Boolean parents has 2^k rows, one for each
combination of parent values.

 Each row requires one number p for X = true; the number for X = false is just 1 − p.

 If each variable has no more than k parents, the
complete network requires O(n · 2^k) numbers.
 I.e., it grows linearly with n, vs. O(2^n) for the full joint
distribution.

 For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers vs. 2^5 − 1 = 31.

 Suppose we have n = 30 nodes, each with five parents
(k = 5); then the Bayesian network requires 960 numbers,
but the full joint distribution requires over a billion.

60
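A tiny sketch (my addition) that reproduces the parameter-count comparison:

```python
def bn_parameters(n, k):
    """Numbers needed by a Bayesian network of n Boolean nodes,
    each with at most k Boolean parents (one CPT row per parent combination)."""
    return n * 2**k

def full_joint_parameters(n):
    """Independent numbers needed by the full joint table over n Boolean variables."""
    return 2**n - 1

print(bn_parameters(30, 5))        # 960
print(full_joint_parameters(30))   # 1073741823, i.e. over a billion
```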
Compactness

P(x1, . . . , xn) = Π_{i=1..n} P(xi | parents(Xi)),

where parents(Xi) denotes the specific values of the variables in Parents(Xi).

Thus, each entry in the joint distribution is represented by the product of the
appropriate elements of the conditional probability tables (CPTs) in the Bayesian
network. The CPTs therefore provide a decomposed representation of the joint
distribution.

This identity is called the chain rule.

61
• Calculate the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John and Mary
call:
• P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)

• Q2: P(J) = ?
• P(J) = P(J | A) P(A) + P(J | ¬A) P(¬A)

P(A) = P(A | B, E) P(B ∧ E) + P(A | B, ¬E) P(B ∧ ¬E) + P(A | ¬B, E) P(¬B ∧ E) +
P(A | ¬B, ¬E) P(¬B ∧ ¬E)
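A sketch of the first calculation in Python. The CPT values below are the commonly used textbook numbers for this burglary network and are an assumption here, since the slide's own CPT figure is not reproduced in the text:

```python
# Assumed CPTs (standard textbook values for the burglary network).
P_B = 0.001                   # P(Burglary)
P_E = 0.002                   # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

# P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(round(p, 6))  # about 0.000628
```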
Compactness example
we will get a compact Bayesian network only if we choose
the node ordering well. What happens if we happen to
choose the wrong order?
Consider the burglary example again. Suppose we decide to add the nodes in the order
MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.

63
Compactness example

Suppose we choose the ordering M,J,A,B,E.


 Adding MaryCalls: No parents.

 Adding JohnCalls: If Mary calls, that probably means the


alarm has gone off, which of course would make it more
likely that John calls. Therefore, JohnCalls needs MaryCalls
as a parent.

 Adding Alarm: Clearly, if both call, it is more likely that the


alarm has gone off than if just one or neither calls, so we
need both MaryCalls and JohnCalls as parents.

64
Compactness example

Suppose we choose the ordering M,J,A,B,E.


 Adding Burglary: If we know the alarm state, then the call from
John or Mary might give us information about our phone ringing
or Mary’s music, but not about burglary: P(B | A, J ,M) = P(B | A)
 Hence we need just alarm as parent
.

 Adding Earthquake: If the alarm is on, it is more likely that there


has been an earthquake. But if we know that there has been a
burglary, then that explains the alarm, and the probability of an
earthquake would be only slightly above normal. Hence, we need
both Alarm and Burglary as parents.

65
Compactness example
 The resulting network has two more links than the original network and requires three
more probabilities to be specified, e.g., the probability of Earthquake given Burglary
and Alarm.
 Deciding conditional independence is hard in noncausal directions.
 The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed

66
Compactness example

67
Naïve Bayes Classifier
We have a class C that directly impacts
the values of the features; that is, for different
classes the distribution of the features may be
different. The joint distribution can be written
using the following formula:

P(C, f1, . . . , fn) = P(C) Π_i P(fi | C)

That is, the probability of the class times the product over
all features of the probability of the current feature
given the class.

We have a lot of identical sub-graphs. A more convenient way to write down this graph is called
plate notation. It is written as follows.
How Naive Bayes algorithm works?

Step 1: Convert the data set into a frequency table


Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29
and probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of prediction.

69
How Naive Bayes algorithm works?

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using above discussed method of posterior probability.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.

Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.

Problem: Players will play if the weather is rainy. Is this statement correct?

 
70
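A minimal sketch (my addition) of the same calculation; the counts come from the frequency table described above (9 "Yes" days, 5 "Sunny" days, 3 sunny "Yes" days out of 14):

```python
# Counts from the weather/play frequency table in the example.
total_days = 14
yes_days = 9
sunny_days = 5
sunny_and_yes = 3

p_yes = yes_days / total_days                 # P(Yes) = 0.64
p_sunny = sunny_days / total_days             # P(Sunny) = 0.36
p_sunny_given_yes = sunny_and_yes / yes_days  # P(Sunny | Yes) = 0.33

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # about 0.6
```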
Predict(Cloudy,Warm,Outdoor)=?
P(sunny)=.40
P(cloudy)= .60
71
72
Predict(Cloudy,Warm,Outdoor)=No

73
Why Bayesian Networks?

Bayesian probability represents the degree of belief in an event,

while classical probability (the frequentist approach) deals with the true or
physical probability of an event.

•Bayesian Network

•Handling of Incomplete or missing Data Sets


•Over-fitting of data can be avoided when using Bayesian networks
and Bayesian statistical methods.

74
What are the Pros and Cons of the Naive Bayes classifier?

Pros:
 It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
 It performs well in the case of categorical input variables compared to numerical variable(s).
For numerical variables, a normal distribution is assumed (bell curve, which is a strong
assumption).

Cons:
 If a categorical variable has a category (in the test data set) which was not observed in the
training data set, then the model will assign a 0 (zero) probability and will be unable to
make a prediction. This is often known as "zero frequency". To solve this, we can use
a smoothing technique; one of the simplest smoothing techniques is Laplace
estimation (see the sketch below).
 On the other side, naive Bayes is also known as a bad estimator, so the probability
outputs from predict_proba are not to be taken too seriously.
 Another limitation of Naive Bayes is the assumption of independent predictors. In real
life, it is almost impossible to get a set of predictors which are completely
independent.
75
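As an illustration of the zero-frequency fix (my sketch, not from the slides), Laplace (add-one) smoothing adds a pseudo-count to every category before estimating the likelihoods:

```python
from collections import Counter

def laplace_likelihoods(values, domain, alpha=1):
    """P(value | class) with add-alpha smoothing, so unseen categories
    get a small non-zero probability instead of zero."""
    counts = Counter(values)
    total = len(values) + alpha * len(domain)
    return {v: (counts[v] + alpha) / total for v in domain}

# 'Overcast' never appears for class "No" in this toy sample,
# yet it still receives a non-zero smoothed probability.
weather_given_no = ["Sunny", "Sunny", "Rainy"]
domain = ["Sunny", "Overcast", "Rainy"]
print(laplace_likelihoods(weather_given_no, domain))
# {'Sunny': 0.5, 'Overcast': 0.166..., 'Rainy': 0.333...}
```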
Limitations of Bayesian Networks

Typically require initial knowledge of many probabilities; the quality and
extent of prior knowledge play an important role.

 Significant computational cost (NP-hard task)


 Unanticipated probability of an event is not taken care of.

76
Decision Network
 Uncertainty

 Many environments have multiple possible outcomes, some good, some bad.

 Utilities are combined with the outcome probabilities for actions
to give an expected utility for each action.

 Decision making under uncertainty: what action to take when the
state of the world is unknown?

 Bayesian answer: find the utility of each possible outcome and
take the action that maximizes expected utility.
Decision Network
 Expected Utility: EU(a | e) = Σ_s' P(Result(a) = s' | a, e) U(s')

Maximum Expected Utility: choose the action a* = argmax_a EU(a | e)

 Generally:
MEU(e) = max_a EU(a | e)

Value of Perfect Information

79
 Expected Utility:
Prior to the execution of A, the agent assigns probability P(Result_i(A) | Do(A), E) to each outcome, where E
summarizes the agent's available evidence about the world, and Do(A) is the proposition that action A is executed in the
current state. Then we can calculate the expected utility of the action given the evidence, EU(A | E),
using the following formula:
EU(A | E) = Σ_i P(Result_i(A) | Do(A), E) U(Result_i(A))
Maximum Expected Utility: a rational agent should choose an action
that maximizes the agent’s EU
 Simple decisions are one-shot decisions.

Value of Perfect Information

80
VPI(A)?
If A is independent of C:
P(C | A) = P(C)
VPI(A) = 0

81
Decision networks

Decision networks combine Bayesian networks with
decision theory.
They extend Bayesian networks with
additional node types for actions and utilities. We will
use airport siting as an example.
Extend Bayes nets to handle actions and utilities
a.k.a. influence diagrams
Make use of Bayes net inference
Useful application: Value of Information
Decision Networks
 A decision network  is a graphical
representation of a finite sequential decision
problem.

 A decision network is an extension of the
Bayes' net representation that allows us to
calculate expected utilities for the actions
that we take.

83
Decision network representation
• Chance nodes: random variables, as in Bayes nets

• Decision nodes: actions that decision maker can take

• Utility/value nodes: the utility of the outcome state.


Representing a decision problem with
a decision network
Decision network represents information about the agent's current
state, its possible actions, the state that will result from the
agent's action, and the utility of that state.

Chance nodes (ovals) represent random variables, just as they do in


Bayes nets.
The agent could be uncertain about the construction cost, the level
of air traffic and the potential for litigation, and the Deaths, Noise,
and total Cost variables, each of which also depends on the site
chosen.
Decision nodes (rectangles) represent points where the decision-
maker has a choice of actions. In this case, the Airport site action
can take on a different value for each site under consideration. The
choice influences the cost, safety, and noise that will result.
85
Utility nodes (diamonds) represent the agent's utility function .
The utility node has as parents all variables describing the outcome
that directly affect utility.
Associated with the utility node is a description of the agent's utility
as a function of the parent attributes.
The description could be just a tabulation of the function, or it
might be a parameterized additive or multilinear function.

86
Decision network example

 Chance nodes (BNs)
 Actions (rectangles; cannot have parents)
 Utility node (diamond; depends on action and chance nodes)

87
Decision Networks
Action-utility tables:
 Notice that because the noise, death, and cost chance nodes refer to
future states, they can never have their values set as evidence
variables.

Figure 16.5: A simple decision
network for the airport-siting
problem.

88
Action-utility tables:
 Simplified version omits these nodes. Omission of an explicit
description of the outcome of the siting decision means that it is less
flexible with respect to changes in circumstances.

Figure 16.6 A simplified representation of the airport-siting problem.


Chance nodes corresponding to outcome states have been factored out. 89
Action-utility tables:
 Rather than representing a utility function on states, the table
associated with the utility node represents the expected utility
associated with each action.

Figure 16.6 A simplified representation of the airport-siting problem.


Chance nodes corresponding to outcome states have been factored out. 90
Action-utility tables:
 In the original DN, a change in aircraft noise levels can be reflected by a change in
the conditional probability table associated with the noise node, whereas a change
in the weight accorded to noise pollution in the utility function can be reflected by
a change in the utility table. In the action utility diagram, on the other hand, all
such changes have to be reflected by a change in the utility table.
 Action utility formalism is a compiled version of the original
formulation.

Figure 16.6 A simplified representation of the airport-siting problem.


Chance nodes corresponding to outcome states have been factored out. 91
Evaluating decision networks
To determine rational decisions the network has to be evaluated and
utilities computed
1. Set the evidence variables for the current state.
2. For each possible value of the decision node:
(a) Set the decision node to that value.
(b) Calculate the posterior probabilities for the parent nodes of the
utility node, using a standard probabilistic inference algorithm.
(c) Calculate the resulting utility for the action.
3. Return the action with the highest utility.

92
Evaluating decision networks
Set the evidence variables for the current state.
For each possible value of the decision node (assume just
one):
 Set the decision node to that value.
 Calculate the posterior probabilities for the parent
nodes of the utility node, using BN inference.
 Calculate the resulting utility for the action.
Return the action with the highest utility.
Example 9.11: an agent must decide whether
to take an umbrella when it goes out. The
agent's utility depends on the weather and
whether it takes an umbrella. However, it
does not get to observe the weather; it only
gets to observe the forecast. The forecast
probabilistically depends on the weather.
94
Consider a simple decision network for the decision of whether the
agent should take an umbrella when it goes out. The agent's utility
depends on the weather and whether it takes an umbrella.

Domain for each random variable and for each decision
variable:
Random variable Weather has domain {sunny, rain}
Decision variable Umbrella has domain {take, leave}
95
Expected Utilities: Optimal decision = leave
 EU(take) = 70 × 0.3 + 20 × 0.7 = 35
 EU(leave) = 0 × 0.3 + 100 × 0.7 = 70

 MEU(∅) = max_a EU(a) = 70
96
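A compact sketch (my addition) that reproduces these expected utilities; the prior P(rain) = 0.3 and the utility table are read off the arithmetic above:

```python
# Weather prior and utilities U(weather, action) implied by the example.
p_weather = {"rain": 0.3, "sunny": 0.7}
utility = {("rain", "take"): 70, ("sunny", "take"): 20,
           ("rain", "leave"): 0, ("sunny", "leave"): 100}

def expected_utility(action):
    """EU(a) = sum over weather of P(weather) * U(weather, a)."""
    return sum(p_weather[w] * utility[(w, action)] for w in p_weather)

for a in ("take", "leave"):
    print(a, round(expected_utility(a), 2))   # take 35.0, leave 70.0

best = max(("take", "leave"), key=expected_utility)
print("MEU:", round(expected_utility(best), 2))  # 70.0 (optimal decision: leave)
```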


The value of Information
 In the preceding analysis, we have assumed that all relevant information, or at least all available
information, is provided to the agent before it makes its decision. In practice, this is hardly ever the case.

 One of the most important parts of decision making is knowing


what question to ask.
 Whether to conduct expensive and critical tests or not depends on two
factors:
 Whether the different possible outcomes would make a
significant difference to the optimal course of action
 The likelihood of the various outcomes.
 Information value theory enables an agent to choose what
information to acquire.

97
The value of Information

98
The value of Information

99
The value of Information

P(w | f=bad) = P(f | w) P(w) / P(f)
P(f) = P(f | w=rain) P(w=rain) +
P(f | w=sun) P(w=sun)

100
The value of Information

Optimal decision = take;  MEU(F=bad) = max_a EU(a | bad) = 53


101
The value of Information

Forecast distribution

Value of Information = (0.59 × 95 + 0.41 × 53) − 70
VPI ≈ 7.8

102
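A small sketch (my addition) of the value-of-information calculation, assuming the forecast distribution P(good) = 0.59, P(bad) = 0.41 and the conditional MEUs 95 and 53 read off the slides above:

```python
# MEU without any forecast (from the earlier slide).
meu_no_info = 70

# Forecast distribution and the best achievable EU after seeing each forecast.
p_forecast = {"good": 0.59, "bad": 0.41}
meu_given_forecast = {"good": 95, "bad": 53}   # e.g. leave if good, take if bad

# Value of information = expected MEU with the forecast - MEU without it.
vpi = sum(p_forecast[f] * meu_given_forecast[f] for f in p_forecast) - meu_no_info
print(round(vpi, 2))  # about 7.78 (≈ 7.8 on the slide)
```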
Summary

103
