
Foundations of Machine Learning

Module 4:
Part A: Probability Basics

Sudeshna Sarkar
IIT Kharagpur
• Probability is the study of randomness and
uncertainty.
• A random experiment is a process whose
outcome is uncertain.
Examples:
– Tossing a coin once or several times
– Tossing a die
– Tossing a coin until one gets Heads
– ...
Events and Sample Spaces
Sample Space
The sample space is the set of all possible outcomes.

Simple Events
The individual outcomes are called simple events.

Event
An event is any collection of one or more simple events.
Sample Space
• Sample space Ω : the set of all the possible
outcomes of the experiment
– If the experiment is a roll of a six-sided die, then the
natural sample space is {1, 2, 3, 4, 5, 6}
– Suppose the experiment consists of tossing a coin
three times.
Ω = {hhh, hht, hth, htt, thh, tht, tth, ttt}
– If the experiment is the number of customers that
arrive at a service desk during a fixed time period, the
sample space is the set of nonnegative integers:
Ω = {0, 1, 2, 3, …}
Events
• Events are subsets of the sample space
o A= {the outcome that the die is even} ={2,4,6}
o B = {exactly two tosses come out tails} = {htt, tht, tth}
o C = {at least two heads} = {hhh, hht, hth, thh}
Probability
• A Probability is a number assigned to each
event in the sample space.
• Axioms of Probability:
– For any event A, 0 ≤ P(A) ≤ 1.
– P(Ω) = 1 and P(∅) = 0
– If A1, A2, …, An is a partition of A, then
P(A) = P(A1) + P(A2) + … + P(An)
Properties of Probability
• For any event A, P(Aᶜ) = 1 − P(A).
• If A ⊆ B, then P(A) ≤ P(B).
• For any two events A and B,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
• For three events A, B, and C,
P(A ∪ B ∪ C) =
P(A) + P(B) + P(C)
− P(A ∩ B) − P(A ∩ C) − P(B ∩ C)
+ P(A ∩ B ∩ C)
Intuitive Development (agrees with axioms)
• Intuitively, the probability of an event a could
be defined as:

P(a) = lim (n → ∞) N(a) / n

where N(a) is the number of times event a happens in n trials.
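A minimal simulation sketch of this frequency definition (in Python; not part of the original slides), estimating P(even) for a fair six-sided die by its relative frequency:

```python
import random

# Estimate P(even) for a fair six-sided die by the relative frequency N(a)/n.
n = 100_000
count_even = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 0)

print(count_even / n)  # approaches the true value 3/6 = 0.5 as n grows
```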
Random Variable
• A random variable is a function defined on the
sample space Ω
– maps the outcome of a random event into real
scalar values

[Diagram: an outcome w ∈ Ω is mapped to the real number X(w)]
Discrete Random Variables
• Random variables (RVs) which may take on only a
countable number of distinct values
– e.g., the sum of the values of two dice

• X is a RV with arity k if it can take on exactly one
value out of k values,
– e.g., the possible values that X can take on are
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
Probability of Discrete RV
• Probability mass function (pmf): P(X = xᵢ)
• Simple facts about the pmf
– Σᵢ P(X = xᵢ) = 1
– P(X = xᵢ ∧ X = xⱼ) = 0 if i ≠ j
– P(X = xᵢ ∨ X = xⱼ) = P(X = xᵢ) + P(X = xⱼ) if i ≠ j
– P(X = x₁ ∨ X = x₂ ∨ … ∨ X = xₖ) = 1
Common Distributions
• Uniform X ~ U[1, …, N]
– X takes values 1, 2, …, N
– P(X = i) = 1/N
– E.g. picking balls of different colors from a box
• Binomial X ~ Bin(n, p)
– X takes values 0, 1, …, n
– P(X = i) = C(n, i) pⁱ (1 − p)ⁿ⁻ⁱ
– E.g. the number of heads in n coin flips (see the sketch below)
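A small Python sketch (illustrative, not from the slides) of the binomial pmf using math.comb; the parameters n = 10, p = 0.5 are example values:

```python
from math import comb

def binomial_pmf(i, n, p):
    """P(X = i) for X ~ Bin(n, p)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

# e.g., probability of exactly 7 heads in 10 fair coin flips
print(binomial_pmf(7, 10, 0.5))                          # ~0.117
# the pmf sums to 1 over i = 0, ..., n
print(sum(binomial_pmf(i, 10, 0.5) for i in range(11)))  # 1.0
```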
Joint Distribution
• Given two discrete RVs X and Y, their joint
distribution is the distribution of X and Y together
– e.g., you and your friend each toss a coin 10 times;
P(You get 5 heads AND your friend gets 7 heads)
• Σₓ Σ_y P(X = x ∧ Y = y) = 1
– e.g., summing over i = 0..10 and j = 0..10:
Σᵢ Σⱼ P(You get i heads AND your friend gets j heads) = 1
(see the sketch below)
Conditional Probability
• P(X = x | Y = y) is the probability of X = x, given
the occurrence of Y = y
– E.g. you get 0 heads, given that your friend gets 3
heads
• P(X = x | Y = y) = P(X = x ∧ Y = y) / P(Y = y)
Law of Total Probability
• Given two discrete RVs X and Y, which take values in
{x₁, …, xₘ} and {y₁, …, yₙ}, we have

P(X = xᵢ) = Σⱼ P(X = xᵢ ∧ Y = yⱼ)
          = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
Marginalization

P(X = xᵢ) = Σⱼ P(X = xᵢ ∧ Y = yⱼ)
  (marginal probability = sum of joint probabilities)
          = Σⱼ P(X = xᵢ | Y = yⱼ) P(Y = yⱼ)
  (conditional probability × marginal probability; see the sketch below)
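A minimal sketch of marginalization and the law of total probability on a made-up joint table (the numbers are hypothetical):

```python
# Hypothetical joint distribution P(X = x, Y = y); the values sum to 1.
joint = {
    ('sunny', 'hot'): 0.30, ('sunny', 'cool'): 0.10,
    ('rainy', 'hot'): 0.15, ('rainy', 'cool'): 0.45,
}
xs = {x for x, _ in joint}
ys = {y for _, y in joint}

# Marginal: P(X = x) = sum_j P(X = x, Y = y_j)
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}

# The same via conditionals: P(X = x) = sum_j P(X = x | Y = y_j) P(Y = y_j)
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
cond = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}
p_x_total = {x: sum(cond[(x, y)] * p_y[y] for y in ys) for x in xs}

print(sorted(p_x.items()))        # [('rainy', 0.6), ('sunny', 0.4)]
print(sorted(p_x_total.items()))  # identical, as the derivation above shows
```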
Bayes Rule
• X and Y are discrete RVs…

P X  x  Y  y
P X  x Y  y 
P Y  y

 
P Y  y j X  xi P  X  xi 
 
P X  xi Y  y j 
 P Y  y
k j 
X  xk P  X  xk 
Independent RVs

• X and Y are independent means that X = x
does not affect the probability of Y = y
• Definition: X and Y are independent iff
– P(X ∧ Y) = P(X) P(Y)
– P(X = x ∧ Y = y) = P(X = x) P(Y = y)
More on Independence

P X  x  Y  y   P X  x P Y  y 

P X  x Y  y  P X  x P Y  y X  x  P Y  y 

• E.g. no matter how many heads you get, your


friend will not be affected, and vice versa
Conditionally Independent RVs
• Intuition: X and Y are conditionally
independent given Z means that once Z is
known, the value of X does not add any
additional information about Y
• Definition: X and Y are conditionally
independent given Z iff

P X  x  Y  y Z  z  P X  x Z  z  P Y  y Z  z 
More on Conditional Independence

P X  x  Y  y Z  z  P X  x Z  z  P Y  y Z  z 

P  X  x Y  y, Z  z   P  X  x Z  z 

P  Y  y X  x, Z  z   P  Y  y Z  z 
Continuous Random Variables
• What if X is continuous?
• Probability density function (pdf) instead of
probability mass function (pmf)
• A pdf is any function 𝑓(𝑥) that describes the
probability density in terms of the input
variable x.
PDF
• Properties of a pdf
– f(x) ≥ 0, ∀x
– ∫₋∞^∞ f(x) dx = 1
– f(x) ≤ 1? Not necessarily: a density can exceed 1, as long as it integrates to 1.
• Actual probabilities are obtained by taking
the integral of the pdf
– E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫₀¹ f(x) dx
Cumulative Distribution Function
• F_X(v) = P(X ≤ v)
• Discrete RVs
– F_X(v) = Σ_{vᵢ ≤ v} P(X = vᵢ)
• Continuous RVs
– F_X(v) = ∫₋∞^v f(x) dx
– d/dx F_X(x) = f(x)
(a numerical sketch follows)
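A numerical sketch (assuming an exponential density as the example, which is not from the slides) checking that integrating the pdf gives the CDF:

```python
import math

lam = 2.0
f = lambda x: lam * math.exp(-lam * x)  # pdf of Exp(lambda) for x >= 0

def cdf_numeric(v, steps=100_000):
    """F_X(v) = integral of f(x) dx from 0 to v, by the midpoint rule."""
    h = v / steps
    return sum(f((k + 0.5) * h) for k in range(steps)) * h

v = 1.5
print(cdf_numeric(v))          # ~0.9502
print(1 - math.exp(-lam * v))  # closed-form CDF of the exponential: 0.9502...
```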
Common Distributions

• Normal X ~ N(μ, σ²)
– f(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  −∞ < x < ∞
– E.g. the height of the entire population
[Plot: the standard normal density f(x) for x from −5 to 5, peaking at about 0.4 at x = 0]
Multivariate Normal
• Generalization to higher dimensions of the
one-dimensional normal
• f_X(x₁, …, x_d) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

where μ is the mean vector and Σ is the covariance matrix.
(A sketch follows.)
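A small NumPy sketch evaluating this density directly from the formula (the mean vector and covariance matrix below are arbitrary example values):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density at x, following the formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.2, -0.1]), mu, Sigma))
```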
Mean and Variance
• Mean (expectation): μ = E[X]
– Discrete RVs: E[X] = Σᵢ vᵢ P(X = vᵢ)
– Continuous RVs: E[X] = ∫₋∞^∞ x f(x) dx

• Variance: V(X) = E[(X − μ)²]
– Discrete RVs: V(X) = Σᵢ (vᵢ − μ)² P(X = vᵢ)
– Continuous RVs: V(X) = ∫₋∞^∞ (x − μ)² f(x) dx
Mean Estimation from Samples
• Given a set of N samples x₁, …, x_N from a distribution,
we can estimate the mean of the distribution
by the sample mean:

μ̂ = (1/N) Σₙ xₙ
Variance Estimation from Samples
• Given a set of N samples from a distribution,
we can estimate the variance of the
distribution by:

σ̂² = (1/N) Σₙ (xₙ − μ̂)²

(a sketch follows)
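A quick sketch of these two estimators (with an assumed Gaussian source, so the true mean and variance are known):

```python
import random

# Draw N samples from a distribution with known mean 10 and variance 4.
N = 100_000
samples = [random.gauss(10.0, 2.0) for _ in range(N)]

mean_hat = sum(samples) / N
var_hat = sum((x - mean_hat) ** 2 for x in samples) / N  # 1/N, as on the slide

print(mean_hat, var_hat)  # close to 10 and 4
```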
Thank You
Foundations of Machine Learning

Module 4:
Part B: Bayesian Learning

Sudeshna Sarkar
IIT Kharagpur
Probability for Learning
• Probability for classification and modeling
concepts.
• Bayesian probability
– Notion of probability interpreted as partial belief
• Bayesian Estimation
– Calculate the validity of a proposition
• based on a prior estimate of its probability
• and new relevant evidence
Bayes Theorem
• Goal: To determine the most probable hypothesis,
given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
Bayes Theorem

Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (posterior probability)
• P(D|h) = probability of D given h (likelihood of D given h)
An Example
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test
returns a correct positive result in only 98% of the cases in which the
disease is actually present, and a correct negative result in only 97% of
the cases in which the disease is not present. Furthermore, .008 of the
entire population have this cancer.

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+ | cancer) = 0.98     P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03    P(− | ¬cancer) = 0.97

P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
P(¬cancer | +) = P(+ | ¬cancer) P(¬cancer) / P(+)

(a worked computation follows)
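The same computation carried out numerically (a short Python sketch using the numbers above):

```python
# Posterior probability of cancer given a positive test result.
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# P(+) by the law of total probability.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not * p_not_cancer
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_cancer_given_pos)      # ~0.21
print(1 - p_cancer_given_pos)  # P(not cancer | +) ~0.79, so h_MAP = "no cancer"
```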
Maximum A Posteriori (MAP) Hypothesis
P(h | D) = P(D | h) P(h) / P(D)
The Goal of Bayesian Learning: the most probable hypothesis
given the training data (Maximum A Posteriori hypothesis)

h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)
Maximum Likelihood (ML) Hypothesis
h_MAP = argmax_{h∈H} P(h | D)
      = argmax_{h∈H} P(D | h) P(h) / P(D)
      = argmax_{h∈H} P(D | h) P(h)

• If every hypothesis in H is equally probable a priori,
we only need to consider the likelihood of the data D
given h, P(D|h). Then, h_MAP becomes the Maximum
Likelihood hypothesis,
h_ML = argmax_{h∈H} P(D | h)
MAP Learner
For each hypothesis h in H, calculate the posterior probability
P(h | D) = P(D | h) P(h) / P(D)
Output the hypothesis h_MAP with the highest posterior probability
h_MAP = argmax_{h∈H} P(h | D)

Comments:
Computationally intensive
Provides a standard for judging the performance of
learning algorithms
Choosing P(h) and P(D|h) reflects our prior
knowledge about the learning task
Maximum likelihood and least-squared error
• Learn a Real-Valued Function:
• Consider any real-valued target function f.
• Training examples (xᵢ, dᵢ) are assumed to have Normally
distributed noise eᵢ with zero mean and variance σ², added
to the true target value f(xᵢ), i.e.
dᵢ = f(xᵢ) + eᵢ, so dᵢ ~ N(f(xᵢ), σ²)
Assume that eᵢ is drawn independently for each xᵢ.
Compute ML Hypo

h_ML = argmax_{h∈H} p(D | h)
     = argmax_{h∈H} ∏ᵢ₌₁ᵐ (1 / √(2πσ²)) exp(−(dᵢ − h(xᵢ))² / (2σ²))
     = argmax_{h∈H} Σᵢ₌₁ᵐ [ −½ ln(2πσ²) − (dᵢ − h(xᵢ))² / (2σ²) ]
     = argmin_{h∈H} Σᵢ₌₁ᵐ (dᵢ − h(xᵢ))²

(a sketch follows)
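A NumPy sketch of this equivalence for a linear hypothesis class (the target function, noise level, and data are made up for illustration): under Gaussian noise, the ML hypothesis is exactly the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations d_i = f(x_i) + e_i with f(x) = 3x + 1 and Gaussian noise.
x = np.linspace(0.0, 1.0, 50)
d = 3 * x + 1 + rng.normal(0.0, 0.2, size=x.shape)

# For linear hypotheses h(x) = w*x + b, minimizing sum (d_i - h(x_i))^2
# gives the ML hypothesis; solve it with ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, d, rcond=None)

print(w, b)  # close to the true slope 3 and intercept 1
```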
Bayes Optimal Classifier
Question: Given new instance x, what is its most probable classification?
• ℎ𝑀𝐴𝑃 (𝑥) is not the most probable classification!
Example: Let P(h1|D) = .4, P(h2|D) = .3, P(h3 |D) =.3
Given new data x, we have h1(x)=+, h2(x) = -, h3(x) = -
What is the most probable classification of x ?
Bayes optimal classification:
argmax_{vⱼ∈V} Σ_{hᵢ∈H} P(vⱼ | hᵢ) P(hᵢ | D)
where V is the set of all the values a classification can take and vⱼ is one
possible such classification.
Example:
P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

Σ_{hᵢ∈H} P(+ | hᵢ) P(hᵢ | D) = .4
Σ_{hᵢ∈H} P(− | hᵢ) P(hᵢ | D) = .6

so the Bayes optimal classification of x is −. (A sketch follows.)
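The same vote computed in a short Python sketch (hypothesis names and values taken from the example above):

```python
# Posteriors over hypotheses and each hypothesis's prediction for x.
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}

# Bayes optimal: argmax_v sum_h P(v | h) P(h | D); here P(v | h) is 0 or 1.
scores = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
          for v in ('+', '-')}

print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-', even though h_MAP = h1 predicts '+'
```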
Why “Optimal”?
• Optimal in the sense that no other classifier
using the same H and prior knowledge can
outperform it on average

Gibbs Algorithm
• Bayes optimal classifier is quite computationally
expensive, if H contains a large number of
hypotheses.
• An alternative, less optimal classifier Gibbs algorithm,
defined as follows:
1. Choose a hypothesis h randomly according to
P(h|D), the posterior probability
distribution over H.
2. Use it to classify the new instance.
Error for Gibbs Algorithm
• Surprising fact: Assume the expected value is
taken over target concepts drawn at random,
according to the prior probability distribution
assumed by the learner, then (Haussler et al.
1994)

E_f[error_{X,f}(GibbsClassifier)] ≤ 2 · E_f[error_{X,f}(BayesOptimal)],

where f denotes a target function and X denotes the instance space.
Thank You
Foundations of Machine Learning

Module 4:
Part C: Naïve Bayes

Sudeshna Sarkar
IIT Kharagpur
Bayes Theorem
P(h | D) = P(D | h) P(h) / P(D)
Naïve Bayes
• Bayes classification
P(Y | X) ∝ P(X | Y) P(Y) = P(X₁, …, Xₙ | Y) P(Y)
Difficulty: learning the joint probability P(X₁, …, Xₙ | Y)
• Naïve Bayes classification
Assume all input features are conditionally independent given Y!
P(X₁, X₂, …, Xₙ | Y) = P(X₁ | X₂, …, Xₙ, Y) P(X₂, …, Xₙ | Y)
                     = P(X₁ | Y) P(X₂, …, Xₙ | Y)
                     = P(X₁ | Y) P(X₂ | Y) ⋯ P(Xₙ | Y)
Naïve Bayes
Bayes rule:
P(Y = yₖ | X₁, …, Xₙ) = P(Y = yₖ) P(X₁, …, Xₙ | Y = yₖ) / Σⱼ P(Y = yⱼ) P(X₁, …, Xₙ | Y = yⱼ)

Assuming conditional independence among the Xᵢ's:
P(Y = yₖ | X₁, …, Xₙ) = P(Y = yₖ) ∏ᵢ P(Xᵢ | Y = yₖ) / Σⱼ P(Y = yⱼ) ∏ᵢ P(Xᵢ | Y = yⱼ)

So, the classification rule for Xnew = <X₁, …, Xₙ> is:
Y_new ← argmax_{yₖ} P(Y = yₖ) ∏ᵢ P(Xᵢ_new | Y = yₖ)
Naïve Bayes Algorithm – discrete Xi

• Train Naïve Bayes (examples)
for each* value yₖ
estimate πₖ ≡ P(Y = yₖ)
for each* value xᵢⱼ of each attribute Xᵢ
estimate θᵢⱼₖ ≡ P(Xᵢ = xᵢⱼ | Y = yₖ)

• Classify (Xnew)
Y_new ← argmax_{yₖ} P(Y = yₖ) ∏ᵢ P(Xᵢ_new | Y = yₖ)
* probabilities must sum to 1, so need estimate only n-1 parameters...


Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLEs):

π̂ₖ = P̂(Y = yₖ) = #D{Y = yₖ} / |D|
θ̂ᵢⱼₖ = P̂(Xᵢ = xᵢⱼ | Y = yₖ) = #D{Xᵢ = xᵢⱼ ∧ Y = yₖ} / #D{Y = yₖ}

where #D{c} is the number of items in set D for which condition c holds (e.g., Y = yₖ).
Example
• Example: Play Tennis

Example
Learning Phase
Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
Test Phase
– Given a new instance, predict its label
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables estimated in the learning phase
P(Outlook=Sunny|Play=No) = 3/5
P(Outlook=Sunny|Play=Yes) = 2/9
P(Temperature=Cool|Play=No) = 1/5
P(Temperature=Cool|Play=Yes) = 3/9
P(Humidity=High|Play=No) = 4/5
P(Humidity=High|Play=Yes) = 3/9
P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9
P(Play=No) = 5/14
P(Play=Yes) = 9/14
– Decision making with the MAP rule
P(Yes|x’) ≈ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’) ≈ [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

Given the fact P(Yes|x’) < P(No|x’), we label x’ to be “No”.
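The two scores can be reproduced with a few lines of Python (a check of the arithmetic above, using the table values):

```python
# Unnormalized MAP scores for x' = (Sunny, Cool, High, Strong).
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(round(score_yes, 4), round(score_no, 4))  # 0.0053 0.0206 -> predict "No"
```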

Estimating Parameters: Y, Xi discrete-valued

If unlucky, our MLE estimate for P(Xᵢ | Y) may be zero.

MAP estimates: add a small number of "imaginary" examples to each count
before normalizing. This is the only difference from the MLE, and it keeps
every estimate nonzero.
Naïve Bayes: Assumptions of Conditional
Independence
Often the Xi are not really conditionally independent

• We can use Naïve Bayes in many cases anyway


– often the right classification, even when not the right
probability
Gaussian Naïve Bayes (continuous X)
• Algorithm: Continuous-valued Features
– Conditional probability often modeled with the normal
distribution

Sometimes assume the variance
– is independent of Y (i.e., σᵢ),
– or independent of Xᵢ (i.e., σₖ),
– or both (i.e., σ)
Gaussian Naïve Bayes Algorithm – continuous Xi
(but still discrete Y)
• Train Naïve Bayes (examples)
for each value yₖ
estimate* πₖ ≡ P(Y = yₖ)
for each attribute Xᵢ estimate the
class conditional mean μᵢₖ and variance σᵢₖ

• Classify (Xnew)
Y_new ← argmax_{yₖ} P(Y = yₖ) ∏ᵢ P(Xᵢ_new | Y = yₖ)
where P(Xᵢ_new | Y = yₖ) is the Gaussian density N(μᵢₖ, σᵢₖ²)
Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates (j indexes training examples, i indexes features,
k indexes classes, and δ(z) = 1 if z is true, else 0):

μ̂ᵢₖ = ( Σⱼ Xᵢʲ δ(Yʲ = yₖ) ) / ( Σⱼ δ(Yʲ = yₖ) )
σ̂ᵢₖ² = ( Σⱼ (Xᵢʲ − μ̂ᵢₖ)² δ(Yʲ = yₖ) ) / ( Σⱼ δ(Yʲ = yₖ) )
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class:
μ̂ = (1/N) Σₙ xₙ,   σ̂² = (1/N) Σₙ (xₙ − μ̂)²
μ_Yes = 21.64, σ_Yes = 2.35
μ_No = 23.88, σ_No = 7.09

– Learning Phase: output two Gaussian models for P(temp|C)
P̂(x | Yes) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²))
P̂(x | No)  = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²))

(a sketch follows)
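A sketch of the resulting classifier for a new temperature reading (the class priors 9/14 and 5/14 come from the 9 Yes and 5 No training values; the test value x = 22.0 is made up):

```python
import math

def gauss(x, mu, sigma):
    """Normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

p_yes, p_no = 9 / 14, 5 / 14   # class priors from the sample counts
x = 22.0                       # a new temperature reading

score_yes = gauss(x, 21.64, 2.35) * p_yes
score_no = gauss(x, 23.88, 7.09) * p_no

print('Yes' if score_yes > score_no else 'No', score_yes, score_no)
```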
The independence hypothesis…
• makes computation possible
• yields optimal classifiers when satisfied
• Rarely satisfied in practice, as attributes (variables)
are often correlated.
• To overcome this limitation:
– Bayesian networks combine Bayesian reasoning with
causal relationships between attributes
Thank You
Foundations of Machine Learning

Module 4:
Part D: Bayesian Networks

Sudeshna Sarkar
IIT Kharagpur
Why Bayes Network
• Bayes optimal classifier is too costly to apply
• Naïve Bayes makes overly restrictive
assumptions.
– But all variables are rarely completely independent.
• Bayes network represents conditional
independence relations among the features.
• Representation of causal relations makes the
representation and inference efficient.
[Example network: nodes Accident, Late wakeup, Rainy day, Traffic Jam,
Meeting postponed, Late for Work, and Late for meeting, connected by
directed edges from causes to effects]
Bayesian Network
• A graphical model that efficiently encodes the joint
probability distribution for a large set of variables
• A Bayesian Network for a set of variables (nodes)
X = { X1,…….Xn}
• Arcs represent probabilistic dependence among
variables
• Lack of an arc denotes a conditional independence
• The network structure S is a directed acyclic graph
• A set P of local probability distributions at each node
(Conditional Probability Table)
Representation in Bayesian Belief
Networks
[Diagram: the same example network (Accident, Late wakeup, Rainy day,
Traffic Jam, Meeting postponed, Late for Work, Late for meeting)]

A conditional probability table associated with each node specifies the
conditional distribution of the variable given its immediate parents in
the graph.

Each node is asserted to be conditionally independent of
its non-descendants, given its immediate parents.
Inference in Bayesian Networks
• Computes posterior probabilities given evidence about
some nodes
• Exploits probabilistic independence for efficient
computation.
• Unfortunately, exact inference of probabilities in
general for an arbitrary Bayesian Network is known to
be NP-hard.
• In theory, approximate techniques (such as Monte
Carlo Methods) can also be NP-hard, though in
practice, many such methods were shown to be useful.
• Efficient algorithms leverage the structure of the graph
Applications of Bayesian Networks
• Diagnosis: P(cause|symptom) = ?
• Prediction: P(symptom|cause) = ?
• Classification: P(class|data)
• Decision-making (given a cost function)

[Diagram: small cause → symptom networks illustrating reasoning from
symptom to cause (diagnosis) and from cause to symptom (prediction)]
Bayesian Networks
• Structure of the graph ⇒ conditional independence relations

In general,
p(X₁, X₂, …, X_N) = ∏ᵢ p(Xᵢ | parents(Xᵢ))

i.e., the full joint distribution is approximated by the
graph-structured factorization above.

• Two components to a Bayesian network
– The graph structure (conditional independence
assumptions)
– The numerical probabilities (for each variable given its
parents)
(a worked sketch of the factorization follows)
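A minimal sketch of both components for a small hypothetical chain Rain → Traffic Jam → Late for Work (the CPT numbers are invented for illustration):

```python
# Hypothetical CPTs (the "numerical probabilities" component).
p_rain = {True: 0.2, False: 0.8}
p_jam_given_rain = {True: {True: 0.7, False: 0.3},   # P(Jam | Rain)
                    False: {True: 0.2, False: 0.8}}
p_late_given_jam = {True: {True: 0.6, False: 0.4},   # P(Late | Jam)
                    False: {True: 0.1, False: 0.9}}

def joint(rain, jam, late):
    """Graph structure gives p(Rain, Jam, Late) = p(Rain) p(Jam|Rain) p(Late|Jam)."""
    return p_rain[rain] * p_jam_given_rain[rain][jam] * p_late_given_jam[jam][late]

print(joint(True, True, True))  # 0.2 * 0.7 * 0.6 = 0.084
print(sum(joint(r, j, l)        # the factorization defines a valid joint: sums to 1
          for r in (True, False) for j in (True, False) for l in (True, False)))
```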
Examples
• Marginal independence (A  B  C, no edges):
p(A,B,C) = p(A) p(B) p(C)

• Conditionally independent effects (A → B, A → C; e.g., A: D, B: S1, C: S2):
p(A,B,C) = p(B|A) p(C|A) p(A)
B and C are conditionally independent given A

• Independent causes (A → C ← B; e.g., A: Traffic, B: Late wakeup, C: Late):
p(A,B,C) = p(C|A,B) p(A) p(B)
"Explaining away"

• Markov dependence (A → B → C):
p(A,B,C) = p(C|B) p(B|A) p(A)
Naïve Bayes Model
[Diagram: class node C with feature nodes Y1, Y2, Y3, …, Yn as its children]
Hidden Markov Model (HMM)
[Diagram: hidden state chain S1 → S2 → S3 → … → Sn, with each St
emitting an observed variable Yt]
Assumptions:
1. hidden state sequence is Markov
2. observation Yt is conditionally independent of all other
variables given St

Widely used in sequence learning, e.g., speech recognition and POS
tagging.
Inference is linear in n
Learning Bayesian Belief Networks
1. The network structure is given in advance and all the
variables are fully observable in the training examples.
– estimate the conditional probabilities.
2. The network structure is given in advance but only
some of the variables are observable in the training
data.
– Similar to learning the weights for the hidden units of
a Neural Net: Gradient Ascent Procedure
3. The network structure is not known in advance.
– Use a heuristic search or constraint-based technique to
search through potential structures.
Thank You
