Foundations of Machine Learning
Module 4:
Part A: Probability Basics
Sudeshna Sarkar
IIT Kharagpur
• Probability is the study of randomness and
uncertainty.
• A random experiment is a process whose
outcome is uncertain.
Examples:
– Tossing a coin once or several times
– Rolling a die
– Tossing a coin until one gets Heads
– ...
Events and Sample Spaces
Sample Space
The sample space is the set of all possible outcomes.
Simple Events
The individual outcomes are called simple events.
Event
An event is any collection of one or more simple events.
Sample Space
• Sample space Ω : the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin
three times.
Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt}
– If the experiment is the number of customers that
arrive at a service desk during a fixed time period, the
sample space is the set of nonnegative integers:
Ω = {0, 1, 2, 3, …}
Events
• Events are subsets of the sample space
o A = {the outcome of the die roll is even} = {2, 4, 6}
o B = {exactly two tosses come out tails} = {htt, tht, tth}
o C = {at least two heads} = {hhh, hht, hth, thh}
Probability
• A probability is a number assigned to each
event in the sample space.
• Axioms of Probability:
– For any event A, 0 ≤ P(A) ≤ 1.
– P(Ω) = 1 and P(∅) = 0.
– If A1, A2, …, An is a partition of A, then
P(A) = P(A1) + P(A2) + … + P(An).
Properties of Probability
• For any event A, P(Aᶜ) = 1 − P(A).
• If A ⊆ B, then P(A) ≤ P(B).
• For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• For any three events A, B, and C,
P(A ∪ B ∪ C) = P(A) + P(B) + P(C)
− P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
+ P(A ∩ B ∩ C).
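These identities can be checked mechanically on a small, equally likely sample space. The following Python sketch (the die and the events A, B, C are illustrative choices, not from the slides) enumerates outcomes and verifies inclusion-exclusion:

from fractions import Fraction

# Equally likely outcomes of one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(E) = |E| / |Omega| under the equally-likely assumption."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}        # even
B = {4, 5, 6}        # at least 4
C = {1, 2, 3}        # at most 3

# Two-event inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Three-event inclusion-exclusion
lhs = prob(A | B | C)
rhs = (prob(A) + prob(B) + prob(C)
       - prob(A & B) - prob(A & C) - prob(B & C)
       + prob(A & B & C))
assert lhs == rhs
print(lhs)  # 1 (A ∪ B ∪ C covers the whole sample space here)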
Intuitive Development (agrees with axioms)
• Intuitively, the probability of an event A can be
defined as the relative frequency with which A occurs
when the experiment is repeated a large number of times.
Random Variable
• A random variable is a function defined on the
sample space Ω
– maps each outcome w of a random experiment to a real
scalar value X(w)
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the values of two dice
Marginalization
• P(X = x_i) = Σ_j P(X = x_i, Y = y_j) = Σ_j P(X = x_i | Y = y_j) P(Y = y_j)
• Conditional probability: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
• Bayes rule: P(X = x_i | Y = y_j) = P(Y = y_j | X = x_i) P(X = x_i) / Σ_k P(Y = y_j | X = x_k) P(X = x_k)
Independent RVs
• X and Y are independent if
P(X = x, Y = y) = P(X = x) P(Y = y), equivalently P(Y = y | X = x) = P(Y = y)
• X and Y are conditionally independent given Z if
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
More on Conditional Independence
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
P(X = x | Y = y, Z = z) = P(X = x | Z = z)
P(Y = y | X = x, Z = z) = P(Y = y | Z = z)
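As a concrete illustration of marginalization, conditioning, Bayes rule, and the independence check, here is a small Python sketch over a hypothetical joint distribution of two binary variables (all numbers invented for illustration):

from itertools import product

# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {
    (0, 0): 0.30, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.20,
}

# Marginalization: P(X = x) = sum_y P(X = x, Y = y)
p_x = {x: sum(joint[(x, y)] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(joint[(x, y)] for x in (0, 1)) for y in (0, 1)}

# Conditioning: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
def p_x_given_y(x, y):
    return joint[(x, y)] / p_y[y]

def p_y_given_x(y, x):
    return joint[(x, y)] / p_x[x]

# Bayes rule: P(X = x | Y = y) = P(Y = y | X = x) P(X = x) / sum_k P(Y = y | X = x_k) P(X = x_k)
def bayes(x, y):
    num = p_y_given_x(y, x) * p_x[x]
    den = sum(p_y_given_x(y, xk) * p_x[xk] for xk in (0, 1))
    return num / den

for x, y in product((0, 1), (0, 1)):
    assert abs(p_x_given_y(x, y) - bayes(x, y)) < 1e-12

# Independence check: is P(X = x, Y = y) == P(X = x) P(Y = y) for all x, y?
independent = all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x, y in joint)
print(independent)  # True for this particular table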
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function 𝑓(𝑥) that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf:
– f(x) ≥ 0 for all x
– ∫ f(x) dx = 1 (over the whole real line)
– Must f(x) ≤ 1 hold? No: a density value may exceed 1; only the integral must equal 1.
• Actual probability can be obtained by taking
the integral of the pdf
– E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• F_X(v) = P(X ≤ v)
• Discrete RVs
– F_X(v) = Σ_{v_i ≤ v} P(X = v_i)
• Continuous RVs
– F_X(v) = ∫_{−∞}^{v} f(x) dx
– (d/dx) F_X(x) = f(x)
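A quick numeric sketch of the pdf/CDF relationship, using the illustrative density f(x) = 2x on [0, 1] (not from the slides): the probability of an interval is the integral of f, and the CDF is the running integral.

# Illustrative density f(x) = 2x on [0, 1]; its CDF is F(v) = v^2.
def f(x):
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(func, a, b, n=100_000):
    """Simple midpoint-rule approximation of the integral of func on [a, b]."""
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h

# P(0 <= X <= 0.5) = integral of f from 0 to 0.5 = 0.25
print(integrate(f, 0.0, 0.5))          # ~0.25

# CDF at v: F(v) = integral of f up to v (the density is 0 below 0)
v = 0.7
print(integrate(f, 0.0, v), v ** 2)    # both ~0.49

# Total probability integrates to 1
print(integrate(f, 0.0, 1.0))          # ~1.0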
Common Distributions
• Normal: X ~ N(μ, σ²)
– f(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) ), −∞ < x < ∞
– E.g. the height of the entire population
[Plot: density f(x) of the standard normal N(0, 1), for x from −5 to 5]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
Covariance Matrix
• 1
f X x1 , , xd
2 d 2
12
1
exp x x
T 1
2
Mean
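The density formula can be evaluated directly; below is a minimal NumPy sketch (the 2-D mean vector and covariance matrix are illustrative):

import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at x, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(Sigma)))
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return norm_const * np.exp(-0.5 * quad)

# Illustrative 2-D example with correlated components.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.5, -1.0]), mu, Sigma))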
Mean and Variance
• Mean (Expectation): μ = E[X]
– Discrete RVs: E[X] = Σ_i v_i P(X = v_i)
– Continuous RVs: E[X] = ∫ x f(x) dx
• Variance: V(X) = E[(X − μ)²]
– Discrete RVs: V(X) = Σ_i (v_i − μ)² P(X = v_i)
– Continuous RVs: V(X) = ∫ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples x1, …, xN from a distribution,
we can estimate the mean of the distribution by the
sample average:
μ̂ = (1/N) Σ_{i=1}^{N} x_i
Variance Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the variance of the
distribution by:
σ̂² = (1/N) Σ_{i=1}^{N} (x_i − μ̂)²
(dividing by N − 1 instead of N gives the unbiased estimate)
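A short Python sketch of both estimators on synthetic data (the underlying normal with mean 5 and standard deviation 2 is an arbitrary choice for illustration):

import random

# Draw N samples from an illustrative normal with true mean 5 and true sd 2.
random.seed(0)
N = 10_000
samples = [random.gauss(5.0, 2.0) for _ in range(N)]

# Sample mean
mean_hat = sum(samples) / N

# Sample variance (divide by N - 1 instead for the unbiased version)
var_hat = sum((x - mean_hat) ** 2 for x in samples) / N

print(mean_hat)  # close to 5
print(var_hat)   # close to 4 (= 2^2)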
Thank You
Foundations of Machine Learning
Module 4:
Part B: Bayesian Learning
Sudeshna Sarkar
IIT Kharagpur
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• based on a prior estimate of its probability, and
• new relevant evidence
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
Bayes Theorem
Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)
• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, 0.008 of the
entire population has this cancer.
The most probable hypothesis is the maximum a posteriori (MAP) hypothesis:
h_MAP = argmax_{h∈H} P(D | h) P(h) / P(D) = argmax_{h∈H} P(D | h) P(h)
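Working through the numbers of this example in Python (using the figures from the slide: a 0.98 true-positive rate, a 0.97 true-negative rate, and a 0.008 prior):

# Cancer test example from the slide.
p_cancer = 0.008                     # prior P(cancer)
p_no_cancer = 1.0 - p_cancer         # prior P(~cancer)
p_pos_given_cancer = 0.98            # P(+ | cancer)   (correct positive rate)
p_pos_given_no_cancer = 1.0 - 0.97   # P(+ | ~cancer)  (1 - correct negative rate)

# Unnormalized posteriors P(D | h) P(h) for the observed data D = "+"
score_cancer = p_pos_given_cancer * p_cancer           # 0.98 * 0.008 = 0.00784
score_no_cancer = p_pos_given_no_cancer * p_no_cancer  # 0.03 * 0.992 = 0.02976

# MAP hypothesis: the larger unnormalized posterior wins
h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"
print(h_map)                                            # "no cancer"

# Normalized posterior P(cancer | +)
print(score_cancer / (score_cancer + score_no_cancer))  # ~0.21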
Maximum Likelihood (ML) Hypothesis
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
If every hypothesis in H is assumed equally probable a priori, this reduces to
the maximum likelihood (ML) hypothesis: h_ML = argmax_{h∈H} P(D | h).
Comments:
– Computationally intensive.
– Provides a standard for judging the performance of
learning algorithms.
– Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task.
Maximum likelihood and least-squared error
• Learn a Real-Valued Function:
• Consider any real-valued target function f.
• Training examples (xi,di) are assumed to have Normally
distributed noise ei with zero mean and variance σ2, added
to the true target value f(xi),
i.e., d_i = f(x_i) + e_i, so d_i ~ N(f(x_i), σ²).
Assume that ei is drawn independently for each xi .
Compute the ML Hypothesis
Maximizing the log-likelihood of the targets under the Gaussian noise model:
h_ML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/(σ√(2π))) − (1/2)((d_i − h(x_i))/σ)² ]
     = argmin_{h∈H} Σ_{i=1}^{m} (d_i − h(x_i))²
i.e., the ML hypothesis is the one that minimizes the sum of squared errors.
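A minimal sketch, with an invented linear hypothesis space and synthetic Gaussian-noise data, confirming that the hypothesis with the highest log-likelihood is exactly the one with the smallest sum of squared errors:

import math
import random

# Illustrative setup: true target f(x) = 2x, targets observed with Gaussian noise.
random.seed(1)
sigma = 0.5
xs = [i / 10 for i in range(30)]
ds = [2.0 * x + random.gauss(0.0, sigma) for x in xs]

# A small hypothesis space of linear functions h(x) = w * x.
hypotheses = [lambda x, w=w: w * x for w in (0.5, 1.0, 1.5, 2.0, 2.5)]

def log_likelihood(h):
    return sum(math.log(1.0 / (sigma * math.sqrt(2 * math.pi)))
               - 0.5 * ((d - h(x)) / sigma) ** 2
               for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

h_ml = max(hypotheses, key=log_likelihood)
h_ls = min(hypotheses, key=sse)
print(h_ml is h_ls)  # True: the same hypothesis maximizes likelihood and minimizes SSE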
Bayes Optimal Classifier
Question: Given new instance x, what is its most probable classification?
• ℎ𝑀𝐴𝑃 (𝑥) is not the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3
Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = -
What is the most probable classification of x ?
Bayes optimal classification:
argmax_{v_j ∈ V} Σ_{h_i ∈ H} P(v_j | h_i) P(h_i | D)
where V is the set of all the values a classification can take and v_j is one
possible such classification.
Example (continued):
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0
Therefore Σ_{h_i ∈ H} P(+|h_i) P(h_i|D) = .4 and Σ_{h_i ∈ H} P(−|h_i) P(h_i|D) = .6,
so the Bayes optimal classification of x is −.
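The same computation in a few lines of Python (the hypotheses here are deterministic, so P(v|h_i) is simply an indicator):

# Bayes optimal classification for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # h_i(x) on the new instance x

# For each label v, sum P(v | h_i) P(h_i | D); here P(v | h_i) is 1 if h_i
# predicts v and 0 otherwise, because the hypotheses are deterministic.
score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
         for v in ("+", "-")}
print(score)                      # {'+': 0.4, '-': 0.6}
print(max(score, key=score.get))  # '-' : the Bayes optimal classification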
Gibbs Algorithm
• Bayes optimal classifier is quite computationally
expensive, if H contains a large number of
hypotheses.
• An alternative, less optimal classifier, the Gibbs algorithm,
is defined as follows:
1. Choose a hypothesis h randomly according to the
posterior probability distribution P(h|D) over H.
2. Use it to classify the new instance.
Error for Gibbs Algorithm
• Surprising fact: if the expected value is
taken over target concepts drawn at random
according to the prior probability distribution
assumed by the learner, then (Haussler et al. 1994)
E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
Module 4:
Part C: Naïve Bayes
Sudeshna Sarkar
IIT Kharagpur
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes
• Bayes classification
P(Y | X) ∝ P(X | Y) P(Y) = P(X1, …, Xn | Y) P(Y)
Difficulty: learning the joint probability P(X1, …, Xn | Y)
• Naïve Bayes classification
Assume all input features are conditionally independent given the class!
P(X1, X2, …, Xn | Y) = P(X1 | X2, …, Xn, Y) P(X2, …, Xn | Y)
                     = P(X1 | Y) P(X2, …, Xn | Y)
                     = P(X1 | Y) P(X2 | Y) ⋯ P(Xn | Y)
Naïve Bayes
Bayes rule with the independence assumption:
P(Y = y | X1, …, Xn) ∝ P(Y = y) Π_i P(X_i | Y = y)
• Classify (X^new):
y* = argmax_y P(Y = y) Π_i P(X_i^new | Y = y)
Example
Learning Phase
Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

(Similar conditional probability tables are learned for Humidity and Wind.)
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase
P(Outlook=Sunny|Play=No) = 3/5
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=No) = 1/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9
P(Play=No) = 5/14
P(Play=Yes) = 9/14
– Decision making with the MAP rule
P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes) ≈ 0.0053
P(No|x’) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No) ≈ 0.0206
Since 0.0206 > 0.0053, the MAP rule labels x’ as Play=No.
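The same decision can be reproduced by plugging the table entries into the naïve Bayes product, as in this short Python sketch:

# Naive Bayes decision for x' = (Sunny, Cool, High, Strong),
# using the conditional probabilities from the learning phase above.
likelihoods = {
    "Yes": {"Outlook=Sunny": 2/9, "Temperature=Cool": 3/9,
            "Humidity=High": 3/9, "Wind=Strong": 3/9},
    "No":  {"Outlook=Sunny": 3/5, "Temperature=Cool": 1/5,
            "Humidity=High": 4/5, "Wind=Strong": 3/5},
}
priors = {"Yes": 9/14, "No": 5/14}

x_new = ["Outlook=Sunny", "Temperature=Cool", "Humidity=High", "Wind=Strong"]

scores = {}
for label in priors:
    score = priors[label]
    for feature in x_new:
        score *= likelihoods[label][feature]   # conditional independence
    scores[label] = score

print(scores)                       # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))  # 'No'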
Estimating Parameters: Y, Xi discrete-valued
MAP estimates: smooth the empirical counts by adding a number of
“imaginary” (pseudo-count) examples to each count; this is the only
difference from the maximum-likelihood estimates.
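A minimal sketch of this idea, assuming Laplace-style pseudocounts as one common way to realize the “imaginary examples” (the tiny data set is invented):

from collections import Counter

def smoothed_conditional(values, labels, target_label, m=1):
    """Estimate P(X = x | Y = target_label) with m imaginary examples per value."""
    xs = [x for x, y in zip(values, labels) if y == target_label]
    counts = Counter(xs)
    k = len(set(values))            # number of distinct feature values
    denom = len(xs) + m * k
    return {x: (counts[x] + m) / denom for x in set(values)}

# Tiny illustrative data set (not from the slides).
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
play    = ["No",    "No",    "Yes",      "Yes",  "No"]

print(smoothed_conditional(outlook, play, "Yes"))
# "Overcast" and "Rain" get nonzero probability, and so does the unseen "Sunny".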
Naïve Bayes: Assumptions of Conditional
Independence
Often the Xi are not really conditionally independent.
• We nevertheless classify with the same rule:
y* = argmax_y P(Y = y) Π_i P(X_i^new | Y = y)
Estimating Parameters: Y discrete, Xi continuous
For a continuous attribute x, fit a Gaussian to each class:
P̂(x | Yes) = (1 / (2.35 √(2π))) exp( −(x − 21.64)² / (2 · 2.35²) )
P̂(x | No)  = (1 / (7.09 √(2π))) exp( −(x − 23.88)² / (2 · 7.09²) )
where the class-conditional means (21.64, 23.88) and standard deviations
(2.35, 7.09) are estimated from the training examples of each class.
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• but is rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Thank You
Foundations of Machine Learning
Module 4:
Part D: Bayesian Networks
Sudeshna Sarkar
IIT Kharagpur
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions:
– in practice, the features are rarely all conditionally independent.
• Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
[Diagram: example Bayesian network with nodes Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = {X1, …, Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief Networks
[Diagram: the same example network (Accident, Late wakeup, Rainy day, Traffic Jam, Meeting postponed, Late for Work, Late for meeting)]
A conditional probability table associated with each node
specifies the conditional distribution for the
variable given its immediate parents in
the graph.
Bayesian Networks
• Structure of the graph ⇔ conditional independence relations
In general,
p(X1, X2, …, XN) = Π_i p(Xi | parents(Xi))
A → B → C (Markov chain):
p(A, B, C) = p(C|B) p(B|A) p(A)
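A tiny Python sketch of this factorization for the chain A → B → C (all conditional probability table entries are invented):

# Joint probability of the chain A -> B -> C from its local CPTs:
# p(A, B, C) = p(A) p(B | A) p(C | B).
p_a = {True: 0.3, False: 0.7}

p_b_given_a = {True:  {True: 0.8, False: 0.2},   # p(B | A=True)
               False: {True: 0.1, False: 0.9}}   # p(B | A=False)

p_c_given_b = {True:  {True: 0.5, False: 0.5},
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# The joint sums to 1 over all 2^3 assignments.
total = sum(joint(a, b, c)
            for a in (True, False)
            for b in (True, False)
            for c in (True, False))
print(total)                     # 1.0
print(joint(True, True, False))  # 0.3 * 0.8 * 0.5 = 0.12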
Naïve Bayes Model
[Diagram: class node C with an arrow to each feature Y1, Y2, Y3, …, Yn]
Hidden Markov Model (HMM)
[Diagram: hidden state chain S1 → S2 → S3 → … → Sn (hidden); each St emits an observed variable Yt (observed)]
Assumptions:
1. The hidden state sequence is Markov.
2. The observation Yt is conditionally independent of all
other variables given St.
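These two assumptions give the usual HMM factorization p(S1..Sn, Y1..Yn) = p(S1) p(Y1|S1) Π_t p(St|St−1) p(Yt|St); a minimal sketch with invented two-state numbers:

# Joint probability of an HMM state path and observation sequence
# under the two assumptions above.  All probabilities are illustrative.
init = {"s0": 0.6, "s1": 0.4}                          # p(S1)
trans = {"s0": {"s0": 0.7, "s1": 0.3},                 # p(St | S(t-1))
         "s1": {"s0": 0.4, "s1": 0.6}}
emit = {"s0": {"a": 0.9, "b": 0.1},                    # p(Yt | St)
        "s1": {"a": 0.2, "b": 0.8}}

def joint(states, observations):
    p = init[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][obs]
    return p

print(joint(["s0", "s0", "s1"], ["a", "a", "b"]))
# = 0.6 * 0.9 * 0.7 * 0.9 * 0.3 * 0.8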