Lecture 5 Bayesian Classification 3
2
Example
A simple example is the toss of a fair (unbiased) coin.
Since the two outcomes are equally probable,
the probability of "heads" equals the probability of "tails",
so there is a 1/2 (or 50%) chance of either "heads" or
"tails".
3
Probability Theory
Try to write rules for dental diagnosis using first-order logic, so that
we can see how the logical approach breaks down.
Consider the following rule:
Toothache ⇒ Cavity
The problem is that this rule is wrong. Not all patients with
toothaches have cavities; some of them have gum disease, an abscess,
or one of several other problems.
4
Unfortunately, in order to make the rule true, we have to add an
almost unlimited list of possible causes. We could try turning the
rule into a causal rule instead:
Cavity ⇒ Toothache
But this rule is not right either; not all cavities cause pain. The only
way to fix the rule is to make it logically exhaustive:
to augment the left-hand side with all the qualifications required for
a cavity to cause a toothache.
Even then, for the purposes of diagnosis, one must also take into
account the possibility that the patient might have a toothache and a
cavity that are unconnected.
5
Trying to use first-order logic to cope with a domain like medical
diagnosis thus fails for three main reasons: laziness (it is too much
work to list the complete set of antecedents), theoretical ignorance
(medical science has no complete theory of the domain), and practical
ignorance (even if we know all the rules, we may be uncertain about a
particular patient).
We might not know for sure what afflicts a particular patient, but
we believe that there is, say, an 80% chance (that is, a probability of
0.8) that the patient has a cavity if he or she has a toothache.
The 80% summarizes those cases in which all the factors needed for
a cavity to cause a toothache are present, and other cases in which
the patient has both a toothache and a cavity but the two are
unconnected. The missing 20% summarizes all the other possible
causes of toothache that we are too lazy or ignorant to confirm or
deny.
7
Probability Theory: Variables and Events
• In probability theory, the set of all possible worlds is called the sample space.
• For example, if we are about to roll two (distinguishable) dice, there are 36
possible worlds to consider: (1,1), (1,2), . . ., (6,6).
• A random variable can be an observation, outcome, or event whose value is
uncertain.
• Total and Die1 are random variables; each random variable has a domain, the set of
possible values it can take on.
• The domain of Total for two dice is the set {2, . . . , 12} and the domain of Die1 is
{1, . . . , 6}.
• E.g., a coin toss. Let's use Throw as the random variable denoting the outcome.
• The set of possible outcomes for a random variable is called its domain.
• The domain of Throw is {head, tail}.
• A Boolean random variable has two outcomes.
• University has the domain {true, false}.
• Cavity has the domain {true, false}.
• Toothache has the domain {true, false}.
• Weather has the domain {sunny, rainy, cloudy, snow}.
Probability Theory: Variables and Events
Each random variable has a domain of values that it can take on.
11
Independent Variables
Two random variables X and Y are considered independent if their joint
probability, that is, the probability of X and Y, equals the product of
their marginals:
P(X, Y) = P(X) P(Y)
12
Independence
Let's see an example.
Imagine we have a deck of 52 cards and randomly draw two cards from it.
Let the first random variable be the picture drawn on the first card
and the second be the picture drawn on the second card.
These variables are dependent, since it is impossible to
draw the same card twice.
Another example is throwing two coins independently.
The probability that the first coin lands heads up and the second
lands tails up equals the product of the two probabilities.
• e.g. P(Toothache = true) = 0.2
• If one event doesn't affect the likelihood of another event, they are said to be
independent, and therefore
P(a | b) = P(a)
• E.g. if you roll a 6 on a die, it doesn't make it more or less likely that you will
roll a 6 on the next throw. The rolls are independent.
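To make the two-coin example concrete, here is a minimal Python sketch (the function and variable names are mine, not the lecture's) that estimates both marginals and the joint by simulation, and shows that the joint matches the product of the marginals:

```python
import random

def simulate_two_coins(trials=100_000):
    """Estimate P(coin1 = heads), P(coin2 = tails), and the joint
    P(heads, tails) to illustrate that independent events factorize."""
    first_heads = second_tails = both = 0
    for _ in range(trials):
        c1_heads = random.random() < 0.5
        c2_heads = random.random() < 0.5
        first_heads += c1_heads
        second_tails += not c2_heads
        both += c1_heads and not c2_heads
    p1, p2, joint = first_heads / trials, second_tails / trials, both / trials
    print(f"P(H) = {p1:.3f}, P(T) = {p2:.3f}, "
          f"P(H)*P(T) = {p1 * p2:.3f}, P(H, T) = {joint:.3f}")

simulate_two_coins()  # the joint is ~0.25, the product of the marginals
```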
Conditional Probability
Let's consider an example.
Imagine you are a student and you want to pass some course.
It has two exams in it, a midterm and a final.
The probability that the student will pass the midterm is 0.4, and
the probability that the student will pass both the midterm and the final is 0.25.
If you want to find the probability that you will pass the final, given that you already passed
the midterm, you can apply the formula from the previous slide:
P(M) = 0.4
P(M ∧ F) = 0.25
P(F | M) = P(M ∧ F) / P(M) = 0.25 / 0.4 = 0.625
This gives a value of about 62.5%.
P(sunny, Cavity) would be a two-element vector giving the probabilities of a sunny day
with a cavity and a sunny day with no cavity.
17
Joint Probability
P(Weather, Cavity) denotes the probabilities of all combinations of the values of Weather and Cavity.
18
Conditional Probability
A joint probability distribution that covers this complete set is
called the full joint probability distribution.
We borrow this part directly from the semantics of propositional
logic, as follows. A possible world is defined to be an
assignment of values to all of the random variables under
consideration.
For example, with Cavity, Toothache, and Weather, the full joint
distribution is given by P(Cavity, Toothache, Weather).
This joint distribution can be represented as a 2 x 2 x 4 table with
16 entries.
Probability distributions for continuous variables are called
probability density functions.
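To illustrate how such a table can be represented and queried, here is a small Python sketch; the uniform probabilities are placeholders of my own, not values from the textbook:

```python
from itertools import product

# Domains of the three variables from the slide.
cavity_vals    = [True, False]
toothache_vals = [True, False]
weather_vals   = ["sunny", "rainy", "cloudy", "snow"]

# Full joint distribution: a mapping from each possible world to its
# probability. The 16 uniform entries below are placeholders; a real
# table would hold measured probabilities that sum to 1.
joint = {w: 1 / 16 for w in product(cavity_vals, toothache_vals, weather_vals)}

# Any query is answered by summing out the other variables,
# e.g. the marginal P(Cavity = true):
p_cavity = sum(p for (cavity, _, _), p in joint.items() if cavity)
print(p_cavity)  # 0.5 with the uniform placeholder table
```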
19
Combining Probabilities
For example, the sentence P(a | b) = 0.8 cannot be interpreted to mean "whenever b holds, conclude that P(a) is 0.8."
First, P(a) always denotes the prior probability of a, not the posterior probability given some evidence;
second, the statement P(a | b) = 0.8 is immediately relevant just when b is the only available evidence.
When additional information c is available, the degree of belief in a is P(a | b ∧ c), which might have little relation
to P(a | b).
For example, c might tell us directly whether a is true or false.
If we examine a patient who complains of toothache, and discover a cavity, then we have additional evidence cavity, and
we conclude (trivially) that P(cavity | toothache ∧ cavity) = 1.0.
Chain rule:
P(X1, . . . , Xn) = P(X1) P(X2 | X1) · · · P(Xn | X1, . . . , Xn−1)
The evidence term P(A) can be expanded with the law of total probability:
P(A) = P(A | B1) P(B1) + P(A | B2) P(B2) + . . . + P(A | Bn) P(Bn)
where B = parameter and A = observation/data.
29
BAYES RULE
The Bayes theorem was developed by, and named for, Thomas
Bayes (1702-1761).
It shows the relation between one conditional probability and its
inverse:
P(B | A) = P(A | B) P(B) / P(A)
It provides a mathematical rule for revising an estimate or forecast in
light of experience and observation.
In the 18th century, Thomas Bayes pondered this question:
"Does God really exist?"
• Being interested in mathematics, he attempted to develop a formula
to arrive at the probability that God does exist based on
the evidence that was available to him on earth.
Later, Laplace refined Bayes' work and gave it the name
"Bayes' Theorem".
30
Applying: Bayesian Rule
1. Classification:
2. Regularization:
This formula can also lead to regularization:
we can treat the prior on theta as a regularizer.
Imagine you want to estimate the probability that your coin will land
heads up.
You already know that most coins land heads up with probability 0.5, and
so you can use a prior that says most coins are fair.
However, if you know that for your experiment the probability
of heads can either be fair, that is 0.5, or biased towards heads,
that is greater than 0.5, you could choose a prior that reflects this.
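One way to make the prior-as-regularizer idea concrete is a MAP estimate of the coin bias under a Beta prior. This is an illustrative sketch; the counts and prior parameters are assumptions of mine, not from the lecture:

```python
def map_heads_probability(heads, tails, alpha=50, beta=50):
    """MAP estimate of a coin's P(heads) under a Beta(alpha, beta) prior.

    The posterior is Beta(alpha + heads, beta + tails), whose mode is
    (heads + alpha - 1) / (heads + tails + alpha + beta - 2). A strong
    Beta(50, 50) prior encodes the belief that most coins are fair."""
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

# With few observations, the prior dominates and pulls the estimate
# toward 0.5 -- it acts as a regularizer:
print(map_heads_probability(heads=7, tails=3))                   # ~0.519
# With a flat Beta(1, 1) prior we recover the maximum likelihood estimate:
print(map_heads_probability(heads=7, tails=3, alpha=1, beta=1))  # 0.7
```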
31
Applying: BAYES RULE
3. Online Learning
4. Prediction:
Bayes' rule is useful in practice because there are many cases where we do have good probability
estimates for three of these numbers and need to compute the fourth.
The conditional probability P(effect | cause) quantifies the relationship in the causal direction, whereas
P(cause | effect) describes the diagnostic direction. In a task such as medical diagnosis,
we often have conditional probabilities on causal relationships (that is, the doctor knows P(symptoms |
disease)) and want to derive a diagnosis, P(disease | symptoms):
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
32
Example-1 BAYES RULE
A doctor knows that the disease meningitis causes the patient to have
a stiff neck, say, 50% of the time. The doctor also knows some
unconditional facts: the prior probability that a patient has meningitis is
1/50,000, and the prior probability that any patient has a stiff neck is 1/20.
Letting s be the proposition that the patient has a stiff neck and m be
the proposition that the patient has meningitis, we have
P(m | s) = P(s | m) P(m) / P(s) = (0.5 × 1/50000) / (1/20) = 0.0002
That is, we expect only 1 in 5000 patients with a stiff neck to have
meningitis.
33
Bayes’ rule can capture causal models
• Suppose a doctor knows that meningitis causes a stiff neck in 50% of cases:
P(s | m) = 0.5
• She also knows that the probability in the general population of someone having
a stiff neck at any time is 1/20:
P(s) = 0.05
• She also has to know the incidence of meningitis in the population (1/50,000):
P(m) = 0.00002
• Using Bayes' rule she can calculate the probability that the patient has meningitis:
P(m | s) = P(s | m) P(m) / P(s) = 0.5 × 0.00002 / 0.05 = 0.0002
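The same calculation as a small Python helper (an illustrative sketch; the function name is my own):

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    return likelihood * prior / evidence

# Meningitis example from the slide:
print(posterior(likelihood=0.5, prior=1 / 50_000, evidence=1 / 20))
# 0.0002, i.e. 1 in 5000 stiff-neck patients has meningitis
```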
Solution…
The sample space is defined by two mutually exclusive events: it rains
or it does not rain. Additionally, a third event occurs when the
weatherman predicts rain. Notation for these events appears below.
35
In terms of probabilities, we know the following:
P(A1) = 5/365 = 0.0136985 [It rains 5 days out of the year.]
P(A2) = 360/365 = 0.9863014 [It does not rain 360 days out of the year.]
P(B | A1) = 0.9 [When it rains, the weatherman predicts rain 90% of the time.]
P(B | A2) = 0.1 [When it does not rain, the weatherman predicts rain 10% of the time.]
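The question this solution is building toward (its statement is not reproduced on these slides) is presumably: given that the weatherman predicts rain, what is the probability that it actually rains? Bayes' rule gives
P(A1 | B) = P(B | A1) P(A1) / [P(B | A1) P(A1) + P(B | A2) P(A2)]
= (0.9 × 0.0136985) / (0.9 × 0.0136985 + 0.1 × 0.9863014) ≈ 0.111
so even when rain is predicted, it actually rains only about 11% of the time.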
36
Using Bayes' rule: Combining evidence
What happens when we have two or more pieces of evidence? For example, what can a
dentist conclude if her nasty steel probe catches in the aching tooth of a patient? If we
know the full joint distribution, we can read off the answer P(Cavity | toothache ∧ catch).
That might be feasible for just two evidence variables, but again it will not scale up. If there are
n possible evidence variables (X-rays, diet, oral hygiene, etc.), then there are 2^n possible
combinations of observed values for which we would need to know conditional probabilities.
Toothache and catch are, however, conditionally independent given Cavity:
P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)
We can plug this equation into Equation (13.12) to obtain the probability of a cavity.
37
Using Bayes' rule: Combining evidence
The dentistry example illustrates a commonly occurring pattern in which a single cause
directly influences a number of effects, all of which are conditionally independent, given
the cause. The full joint distribution can be written as
P(Cause, Effect1, . . . , Effectn) = P(Cause) ∏i P(Effecti | Cause)
38
What is a Bayesian Network ?
A graphical model that efficiently encodes the joint probability
distribution for a large set of variables.
We remarked on the importance of independence and conditional
independence relationships in simplifying probabilistic
representations of the world.
This chapter introduces a systematic way to represent
such relationships explicitly in the form of Bayesian networks.
39
Bayesian Networks or Belief Networks
A Bayesian network is a directed graph in which each node is
annotated with quantitative probability information. The full
specification is as follows:
1. Each node corresponds to a random variable, which may be discrete or continuous.
2. Directed links (arrows) connect pairs of nodes; if there is an arrow from node X to node Y, X is said to be a parent of Y. The graph has no directed cycles (it is a directed acyclic graph, DAG).
3. Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
43
b) Determine the probabilities for the following Bayesian network
Answer:
44
c) Which Bayesian network would you have specified using the
rules learned in class?
Answer:
The first one. It is good practice to add nodes that correspond to
causes before nodes that correspond to their effects.
Answer:
No, since (for example) P(F) = 0.1 but P(F | C) = 0.23
e) Are C and F independent in the Bayesian network from Question b?
Answer:
No, for the same reason.
45
2) To safeguard your house, you recently installed two different
alarm systems by two different reputable manufacturers that use
completely different sensors for their alarm systems.
a) Which one of the two Bayesian networks given below makes
independence assumptions that are not true? Explain all of your
reasoning. Alarm1 means that the first alarm system rings,
Alarm2 means that the second alarm system rings, and Burglary
means that a burglary is in progress.
Answer: The second one falsely assumes that Alarm1 and Alarm2 are
independent if the value of Burglary is unknown. However, if the
alarms are working as intended, it should be more likely that Alarm1
rings if Alarm2 rings (that is, they should not be independent).
46
2) To safeguard your house, you recently installed two different
alarm systems by two different reputable manufacturers that use
completely different sensors for their alarm systems.
47
Answer:
48
c) Consider the second Bayesian network. Assume that:
49
Answer:
50
4) Consider the following Bayesian network. A, B, C, and D
are Boolean random variables. If we know that A is true, what
is the probability of D being true?
51
5) For the following Bayesian network
P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)
What is the probability that there is a thief in our house?
Let's define it as 10 to the power of minus three; that is, one in a
thousand houses gets robbed.
What is the probability of an earthquake?
Let's say it is 10 to the power of minus two; earthquakes
happen about once in 100 days. Now we define the probability
of the alarm given thief and earthquake, so those will be four
numbers. If there is a thief in our house, the alarm will go off for sure.
BN probabilistic distribution
Factorization: P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)
Priors: P(T=1) = 0.001, P(E=1) = 0.01
P(A=1 | T, E):
         E=0    E=1
  T=0     0     1/10
  T=1     1      1
P(R=1 | E): P(R=1 | E=0) = 0, P(R=1 | E=1) = 1/2
If there is no thief and there is no earthquake, the alarm has no reason
to send us signals. However, if there is no thief but there is an
earthquake, the alarm will notify us by mistake one time in ten.
If there is no earthquake, the radio has nothing to report, and so it
will not report. However, if there is an earthquake, the radio will
report it with probability one half; that is, it does not report
some small earthquakes.
BN probabilistic distribution
You are at work, and you get a notification from the alarm system.
What is the probability that there is a thief in the house, P(T=1 | A=1)?
Using the priors and CPTs above, P(T=1 | A=1) ≈ 0.5. If the radio also
reports an earthquake, the posterior drops to P(T=1 | A=1, R=1) ≈ 1%:
the earthquake "explains away" the alarm.
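A short Python sketch (the structure and names are mine) that performs inference by enumeration over the CPTs above and verifies both posteriors:

```python
from itertools import product

# CPTs from the slides: priors, P(A=1 | T, E), and P(R=1 | E).
P_T = {1: 0.001, 0: 0.999}
P_E = {1: 0.01, 0: 0.99}
P_A = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 1.0, (1, 1): 1.0}
P_R = {0: 0.0, 1: 0.5}

def joint(t, a, e, r):
    """P(t, a, e, r) = P(t) P(e) P(a | t, e) P(r | e)."""
    pa = P_A[(t, e)] if a == 1 else 1 - P_A[(t, e)]
    pr = P_R[e] if r == 1 else 1 - P_R[e]
    return P_T[t] * P_E[e] * pa * pr

def p_thief(evidence):
    """P(T=1 | evidence) by enumerating all worlds (t, a, e, r)."""
    worlds = list(product([0, 1], repeat=4))
    def matches(w):
        return all(w[i] == v for i, v in evidence.items())
    num = sum(joint(*w) for w in worlds if matches(w) and w[0] == 1)
    den = sum(joint(*w) for w in worlds if matches(w))
    return num / den

# World-tuple index order: 0 = t, 1 = a, 2 = e, 3 = r.
print(p_thief({1: 1}))        # P(T=1 | A=1)      ~ 0.50
print(p_thief({1: 1, 3: 1}))  # P(T=1 | A=1, R=1) ~ 0.01 (explaining away)
```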
Bayesian network earthquake example
Example:
I'm at work, neighbor John calls to say my alarm is ringing, but
neighbor Mary doesn't call. Sometimes the alarm is set off by minor
earthquakes. Is there a burglary?
59
Compactness
A CPT for a Boolean variable with k Boolean parents has 2^k rows for the
combinations of parent values; each row requires one number p for the
probability that the variable is true.
60
Compactness
61
• Calculate
the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John and Mary
call
• P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.000628
(using the standard CPT values of the AIMA burglary network)
• Q2: P(J) = ?
• P(J) = P(J | A) P(A) + P(J | ¬A) P(¬A)
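A sketch of both calculations in Python. The CPT values are the standard ones from the AIMA burglary network, an assumption on my part since the slide's network figure is not reproduced here:

```python
# Assumed CPTs of the standard AIMA burglary network.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                     # P(J | A)
P_M = {True: 0.70, False: 0.01}                     # P(M | A)

# Q1: P(j, m, a, ~b, ~e) = P(j|a) P(m|a) P(a|~b,~e) P(~b) P(~e)
q1 = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(q1)  # ~0.000628

# Q2: P(J) = P(J|A) P(A) + P(J|~A) P(~A), with P(A) summed over B and E.
p_a = sum(P_A[(b, e)]
          * (P_B if b else 1 - P_B)
          * (P_E if e else 1 - P_E)
          for b in (True, False) for e in (True, False))
p_j = P_J[True] * p_a + P_J[False] * (1 - p_a)
print(p_j)  # ~0.052
```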
63
Compactness example
64
Compactness example
65
Compactness example
The resulting network has two more links than the original network and
requires three more probabilities to be specified, e.g., the probability
of Earthquake given Burglary and Alarm.
Deciding conditional independence is hard in noncausal directions.
The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed.
66
Compactness example
67
Naïve Bayes Classifier
We have a class C that directly impacts
the values of the features; that is, for different
classes the distribution of the features may be
different. The joint distribution can be written
using the following formula:
P(C, f1, . . . , fn) = P(C) ∏i P(fi | C)
Since we have a lot of identical sub-graphs, a more convenient way to write down this graph is
called plate notation. It is written as follows.
How does the Naive Bayes algorithm work?
69
How does the Naive Bayes algorithm work?
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
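The slide's computation, spelled out in a few lines of Python (the counts come from the classic weather/play table this example is based on):

```python
# Counts from the classic 14-day weather/play table:
# 9 "Yes" days, 5 "Sunny" days, 3 of which are "Yes".
p_sunny_given_yes = 3 / 9   # P(Sunny | Yes)
p_sunny = 5 / 14            # P(Sunny)
p_yes = 9 / 14              # P(Yes)

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6 -- so "Yes" is the predicted class
```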
70
Predict(Cloudy, Warm, Outdoor) = ?
P(sunny) = 0.40
P(cloudy) = 0.60
71
72
Predict(Cloudy,Warm,Outdoor)=No
73
Why Bayesian Networks?
74
What are the pros and cons of the Bayesian theorem?
Pros:
It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
It performs well in the case of categorical input variables compared to numerical variable(s).
For numerical variables, a normal distribution is assumed (a bell curve, which is a strong
assumption).
Cons:
If a categorical variable has a category (in the test data set) which was not observed in the
training data set, then the model will assign a 0 (zero) probability and will be unable to
make a prediction. This is often known as "Zero Frequency". To solve this, we can use
a smoothing technique. One of the simplest smoothing techniques is called Laplace
estimation.
On the other hand, naive Bayes is also known as a bad estimator, so the probability
outputs from predict_proba are not to be taken too seriously.
Another limitation of Naive Bayes is the assumption of independent predictors. In real
life, it is almost impossible to get a set of predictors which are completely
independent.
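A minimal sketch of the Laplace (add-one) smoothing mentioned above; the category counts are invented for illustration:

```python
def laplace_smoothed_probs(counts, alpha=1):
    """Add-alpha (Laplace) smoothing over category counts, so a category
    never seen in training still gets a small nonzero probability."""
    total = sum(counts.values()) + alpha * len(counts)
    return {category: (n + alpha) / total for category, n in counts.items()}

# "rainy" was never observed with this class, but no longer gets probability 0:
print(laplace_smoothed_probs({"sunny": 3, "overcast": 4, "rainy": 0}))
# {'sunny': 0.4, 'overcast': 0.5, 'rainy': 0.1}
```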
75
Limitations of Bayesian Networks
76
Decision Network
Uncertainty
Generally:
EU(A | E) = Σi P(Resulti(A) | Do(A), E) U(Resulti(A))
MEU(E) = maxA EU(A | E); the agent chooses the action that achieves it.
79
Expected Utility:
Prior to execution of A, the agent assigns probability P(Resulti(A) | Do(A), E) to each outcome, where E
summarizes the agent's available evidence about the world, and Do(A) is the proposition that action A is
executed in the current state. Then we can calculate the expected utility of the action given the evidence,
EU(A | E), using the following formula:
EU(A | E) = Σi P(Resulti(A) | Do(A), E) U(Resulti(A))
Maximum Expected Utility: a rational agent should choose an action
that maximizes the agent’s EU
Simple decisions are one-shot decisions.
80
VPI(A)?
If A is independent of C:
P(C | A) = P(C)
so VPI(A) = 0
81
Decision networks
83
Decision network representation
• Chance nodes: random variables, as in Bayes nets
86
Decision network example
Chance nodes (as in BNs)
Action nodes (rectangles; cannot have parents)
Utility node (diamond; depends on action and chance nodes)
87
Decision Networks
Action-utility tables:
Notice that because the noise, death, and cost chance nodes refer to
future states, they can never have their values set as evidence
variables.
88
Action-utility tables:
The simplified version omits these nodes. Omitting an explicit
description of the outcome of the siting decision means that the network is less
flexible with respect to changes in circumstances.
92
Evaluating decision networks
Set the evidence variables for the current state.
For each possible value of the decision node (assume just
one):
Set the decision node to that value.
Calculate the posterior probabilities for the parent
nodes of the utility node, using BN inference.
Calculate the resulting utility for the action.
Return the action with the highest utility.
Example 9.11: Consider whether the agent should take an
umbrella when it goes out. The agent's
utility depends on the weather and
whether it takes an umbrella. However, it
does not get to observe the weather. It only
gets to observe the forecast. The forecast
probabilistically depends on the weather.
94
Consider a simple decision network for the decision of whether the
agent should take an umbrella when it goes out. The agent's utility
depends on the weather and whether it takes an umbrella.
We specify the domain of each random variable and the domain of each
decision variable:
Random variable Weather has domain {sunny, rain}
Decision variable Umbrella has domain {take, leave}
95
Expected Utilities: Optimal decision = leave
EU(take) = 70 × 0.3 + 20 × 0.7 = 35
EU(leave) = 0 × 0.3 + 100 × 0.7 = 70
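The same expected-utility evaluation as a small Python sketch, following the evaluation procedure given earlier (the table layout and names are mine):

```python
# Utilities and the weather prior implied by the slide: P(rain) = 0.3.
utility = {("take", "rain"): 70, ("take", "sunny"): 20,
           ("leave", "rain"): 0, ("leave", "sunny"): 100}
p_weather = {"rain": 0.3, "sunny": 0.7}

def expected_utility(action):
    """EU(action) = sum over weather of P(weather) * U(action, weather)."""
    return sum(p * utility[(action, w)] for w, p in p_weather.items())

for action in ("take", "leave"):
    print(action, expected_utility(action))  # take 35.0, leave 70.0

print("optimal decision:", max(("take", "leave"), key=expected_utility))
```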
97
The value of Information
98
The value of Information
99
The value of Information
P(w | f=bad) = P(f | w) P(w) / P(f)
P(f) = P(f | w=rain) P(w=rain) + P(f | w=sun) P(w=sun)
100
The value of Information
Forecast Distribution
Value of information = (0.59 × 95 + 0.41 × 53) − 70 = 7.78 ≈ 7.8
102
Summary
103