Bayesian Learning
Probabilistic Models
• A probabilistic model is a joint distribution over a set of random variables
• Probabilistic models:
  • (Random) variables with domains
  • Assignments are called outcomes
  • Joint distributions: say whether assignments (outcomes) are likely
  • Normalized: sum to 1.0
  • Ideally: only certain variables directly interact
• We will not be discussing independent events and mutually exclusive events

Distribution over T, W:
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3
Main Types of Probability (discussed here)
• Joint Probability
  • The probability of two (or more) events occurring together is called the joint probability. The joint probability of two or more random variables is referred to as the joint probability distribution.
  • P(A and B) = P(A given B) * P(B)
• Marginal Probability
  • The probability of one event, summed over all (or a subset of) the outcomes of the other random variable, is called the marginal probability or the marginal distribution.
  • P(X=A) = sum of P(X=A, Y=yi) over all yi
• Conditional Probability
  • The probability of one event given the occurrence of another event is called the conditional probability.
  • P(A given B) = P(A|B) = P(A and B) / P(B)
Events
• An event is a set E of outcomes
Joint distribution over T, W:
T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3

Marginal distribution over T:
T      P
hot    0.5
cold   0.5

Marginal distribution over W:
W      P
sun    0.6
rain   0.4
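As a sketch, the two single-variable tables above can be computed from the joint table by summing out the other variable (illustrative Python; the names are my own):

# Joint distribution P(T, W) from the table above
joint = {
    ("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2, ("cold", "rain"): 0.3,
}

def marginalize(joint, keep):
    """Sum out every variable except the one at position `keep` (0 = T, 1 = W)."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[keep]] = out.get(outcome[keep], 0.0) + p
    return out

print(marginalize(joint, 0))  # {'hot': 0.5, 'cold': 0.5}
print(marginalize(joint, 1))  # {'sun': 0.6, 'rain': 0.4} (up to float rounding)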
Conditional Probabilities
• A simple relation connects joint and conditional probabilities
• In fact, this is taken as the definition of a conditional probability:

P(a|b) = P(a,b) / P(b)

T      W      P
hot    sun    0.4
hot    rain   0.1
cold   sun    0.2
cold   rain   0.3
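• Worked example from the table above: P(W=sun | T=cold) = P(cold, sun) / P(cold) = 0.2 / (0.2 + 0.3) = 0.4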
The Product Rule
• P(x, y) = P(x|y) P(y)
• Example:

P(W):
W      P
sun    0.8
rain   0.2

P(D|W):
D      W      P
wet    sun    0.1
dry    sun    0.9
wet    rain   0.7
dry    rain   0.3

P(D, W):
D      W      P
wet    sun    0.08
dry    sun    0.72
wet    rain   0.14
dry    rain   0.06
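• Quick check of the rule on the tables above: P(wet, sun) = P(wet | sun) × P(sun) = 0.1 × 0.8 = 0.08, and P(wet, rain) = P(wet | rain) × P(rain) = 0.7 × 0.2 = 0.14.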
Joint Probability Distribution
• A joint probability distribution can be used to represent the probabilities of combined statements, such as A ∧ B.
• Purely logical analysis does not work in situations that lack certainty.
• If we are unsure whether A is true, then we cannot make use of this expression.
• In many real-world situations, it is very useful to be able to talk about things that lack certainty.
• For example, what will the weather be like tomorrow?
Probabilistic Reasoning
• We might formulate a very simple hypothesis based on general observation, such as “it is sunny only 10% of the time, and rainy 70% of the time.”
• P(S) = 0.1
• P(R) = 0.7
Bayes Classifier
• Given feature points, we want to compute class probabilities using Bayes' Rule:

P(C|x) = P(x|C) P(C) / P(x)

• More formally:

posterior = (likelihood × prior) / evidence

• P(C): prior probability
• P(x): probability of x (the evidence)
• P(x|C): conditional probability of x given C (the likelihood)
• P(C|x): conditional probability of C given x (the posterior probability)
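As a small illustration of the rule (a sketch with made-up numbers, not taken from the slides), in Python:

# Hypothetical two-class setting: priors P(C) and likelihoods P(x|C) for an observed x
priors = {"C1": 0.6, "C2": 0.4}
likelihoods = {"C1": 0.2, "C2": 0.5}

# Evidence P(x) = sum over classes of P(x|C) P(C)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Posterior P(C|x) = P(x|C) P(C) / P(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)  # ≈ {'C1': 0.375, 'C2': 0.625}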
Example
• Problem: will the player play if the weather is sunny?

Weather    Play
Sunny      No
Overcast   Yes
Rainy      Yes
Sunny      Yes
Sunny      Yes
Overcast   Yes
Rainy      No
Rainy      No
Sunny      Yes
Rainy      Yes
Sunny      No
Overcast   Yes
Overcast   Yes
Rainy      No
Example
• We can solve this using the posterior probability method discussed above:

P(C|F) = P(F|C) × P(C) / P(F)
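Filling in the numbers from the 14-row Weather/Play table above (5 Sunny days, of which 3 are Yes; 9 Yes and 5 No days overall):

P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny) = (3/9) × (9/14) / (5/14) = 3/5 = 0.6
P(No | Sunny) = P(Sunny | No) × P(No) / P(Sunny) = (2/5) × (5/14) / (5/14) = 2/5 = 0.4

Since 0.6 > 0.4, the classifier predicts that the player will play when it is sunny.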
Multiple Input Attributes
So far we have only considered Bayes classification with a single attribute (e.g., “Weather” in the example above). But we may have many features.
How do we use all the features?
Naïve Bayes Rule
• Recall:

P(A|B) * P(B) = P(A, B) = P(B, A)

• Bayes Rule:

P(C|F) = P(F|C) * P(C) / P(F)

• General Bayes Rule:

P(C|F1, …, Fn) = P(F1, …, Fn | C) * P(C) / P(F1, …, Fn)

• Chain-rule factorization of the joint:

P(C, F1, …, Fn) = P(C) * P(F1|C) * P(F2|C, F1) * P(F3, …, Fn | C, F1, F2)
Naïve Bayes Rule
• To simplify the task, Naïve Bayes classifiers assume the attributes are conditionally independent given the class, so that P(F1, …, Fn | C) = P(F1|C) * P(F2|C) * … * P(Fn|C), and thereby estimate:

P(C | F1, …, Fn) = [ P(C) * P(F1|C) * P(F2|C) * … * P(Fn|C) ] / P(F1, …, Fn)
• Key idea: compute a probability for each class based on the probability distributions in the training data.
• First take into account the probability of each attribute. Treat all attributes as equally important, i.e., multiply their probabilities.
• Then take into account the overall probability of the given class, and multiply it with the probabilities of the attributes.
• Finally choose the class that maximizes this probability. This means that the new instance will be classified as YES or NO:
arg max over C ∈ {yes, no} of  P(C) × P(Outlook = sunny | C) × P(Temp = cool | C) × P(Humidity = high | C) × P(Wind = strong | C)
Frequency and likelihood tables from the play-tennis training data:

Outlook:
          Yes   No    P(value | Yes)   P(value | No)
Sunny     2     3     2/9              3/5
Overcast  4     0     4/9              0/5
Rainy     3     2     3/9              2/5
Total     9     5     100%             100%

Temperature:
          Yes   No    P(value | Yes)   P(value | No)
Hot       2     2     2/9              2/5
Mild      4     2     4/9              2/5
Cool      3     1     3/9              1/5
Total     9     5     100%             100%

Humidity:
          Yes   No    P(value | Yes)   P(value | No)
High      3     4     3/9              4/5
Normal    6     1     6/9              1/5
Total     9     5     100%             100%

Wind:
          Yes   No    P(value | Yes)   P(value | No)
False     6     2     6/9              2/5
True      3     3     3/9              3/5
Total     9     5     100%             100%
For X = {Sunny, Cool, High, Strong}, compute P[X | C] P[C] for each class:

P[X | Play = Yes] P[Play = Yes] = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
P[X | Play = No] P[Play = No] = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Answer: PlayTennis(X) = No
Supervised Classifier: Using Bayes
P[X | Play = Yes] P[Play = Yes] = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
P[X | Play = No] P[Play = No] = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

P(X) = P(Outlook = Sunny) × P(Temperature = Cool) × P(Humidity = High) × P(Wind = Strong)
P(X) = 5/14 × 4/14 × 7/14 × 6/14 = 0.02186
(Note: P(X) is itself estimated here by treating the features as independent, so the two posteriors below need not sum exactly to 1.)

P(Play = Yes | X) = 0.0053 / 0.02186 = 0.2424
P(Play = No | X) = 0.0206 / 0.02186 = 0.9421
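A compact Python sketch of the same Naïve Bayes computation, built from the count tables above (the function and variable names are my own, and Wind = Strong/Weak corresponds to True/False in the table):

# Counts from the frequency tables: counts[attribute][(value, class)] = number of rows
counts = {
    "Outlook":     {("Sunny", "Yes"): 2, ("Sunny", "No"): 3,
                    ("Overcast", "Yes"): 4, ("Overcast", "No"): 0,
                    ("Rainy", "Yes"): 3, ("Rainy", "No"): 2},
    "Temperature": {("Hot", "Yes"): 2, ("Hot", "No"): 2,
                    ("Mild", "Yes"): 4, ("Mild", "No"): 2,
                    ("Cool", "Yes"): 3, ("Cool", "No"): 1},
    "Humidity":    {("High", "Yes"): 3, ("High", "No"): 4,
                    ("Normal", "Yes"): 6, ("Normal", "No"): 1},
    "Wind":        {("Strong", "Yes"): 3, ("Strong", "No"): 3,   # "Strong" = True above
                    ("Weak", "Yes"): 6, ("Weak", "No"): 2},      # "Weak"   = False above
}
class_counts = {"Yes": 9, "No": 5}
total = sum(class_counts.values())  # 14 training rows

def score(x, label):
    """Unnormalized Naive Bayes score: P(C) * product over attributes of P(F_i | C)."""
    s = class_counts[label] / total                               # prior P(C)
    for attr, value in x.items():
        s *= counts[attr][(value, label)] / class_counts[label]   # likelihood P(F_i = value | C)
    return s

x = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}
scores = {label: score(x, label) for label in class_counts}
print(scores)                       # ≈ {'Yes': 0.0053, 'No': 0.0206}
print(max(scores, key=scores.get))  # 'No'

For classification only the arg max matters; dividing both scores by P(X), as on the slide above, rescales them but does not change the decision.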
Question
• For the given dataset (attributes: Color, Type, Origin; class: Stolen), apply the naïve Bayes classifier.

Example: Witness Reliability
• Given that the lady has said that the taxi was white, what is the likelihood that she is right?
• P(Y) = 0.9 (the probability of any particular taxi being yellow)
• P(W) = 0.1 (the probability of any particular taxi being white)
• Let us denote by
• P(CW) the probability that the culprit was driving a white taxi
• P(CY) the probability that it was a yellow car
• P(WW) to denote the probability that the witness says she saw a white car
• P(WY) to denote that she says she saw a yellow car
• Now, if the witness really saw a yellow car, she would say it was yellow 75% of the time, and if she really saw a white car, she would say it was white 75% of the time. So:
• P(WW | CW) = 0.75
• P(WY | CY) = 0.75
Example: Witness Reliability
• We can apply Bayes' theorem to find the probability, given that she says the car was white, that she is correct:

P(CW | WW) = P(WW | CW) × P(CW) / P(WW)

• We first need to calculate P(WW), the prior probability that the lady would say she saw a white car. Since 10% of taxis are white and 90% are yellow:

P(WW) = P(WW | CW) × P(CW) + P(WW | CY) × P(CY) = 0.75 × 0.1 + 0.25 × 0.9 = 0.3

• So P(CW | WW) = (0.75 × 0.1) / 0.3 = 0.25
• In other words, if the lady says that the car was white, the probability that it was in fact white is only 0.25.
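A few lines of Python to check this arithmetic (the variable names are my own):

p_cw, p_cy = 0.1, 0.9        # prior: culprit's taxi was white / yellow
p_ww_cw = 0.75               # says "white" when it really was white
p_ww_cy = 1 - 0.75           # says "white" when it really was yellow

p_ww = p_ww_cw * p_cw + p_ww_cy * p_cy   # total probability she says "white"
p_cw_ww = p_ww_cw * p_cw / p_ww          # Bayes' theorem
print(p_ww, p_cw_ww)         # ≈ 0.3 0.25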