
CS480/680

Lecture 4: May 15, 2019


Statistical Learning
[RN]: Sec 20.1, 20.2, [M]: Sec. 2.2, 3.2

University of Waterloo, CS480/680 Spring 2019, Pascal Poupart


Statistical Learning

• View: we have uncertain knowledge of the world

• Idea: learning simply reduces this uncertainty



Terminology
• Probability distribution:
– A specification of a probability for each event in
our sample space
– Probabilities must sum to 1
• Assume the world is described by two (or
more) random variables
– Joint probability distribution
• Specification of probabilities for all combinations of
events



Joint distribution

• Given two random variables X and Y:

• Joint distribution:
Pr(X = x ∧ Y = y) for all x, y

• Marginalisation (sumout rule):
Pr(X = x) = Σ_y Pr(X = x ∧ Y = y)
Pr(Y = y) = Σ_x Pr(X = x ∧ Y = y)



Example: Joint Distribution
             sunny                 ~sunny
          cold    ~cold         cold    ~cold
headache  0.108   0.012         0.072   0.008
~headache 0.016   0.064         0.144   0.576

P(headache ∧ sunny ∧ cold) = 0.108
P(~headache ∧ sunny ∧ ~cold) = 0.064
P(headache ∨ sunny) = 0.108 + 0.012 + 0.016 + 0.064 + 0.072 + 0.008 = 0.28
P(headache) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2 (by marginalization)
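A small sketch (not part of the slides) of how the table and queries above can be checked in Python; the `joint` dict encoding and the `prob` helper are made up for illustration:

```python
# Joint distribution over (headache, sunny, cold), as given in the table above
# (illustrative sketch; the dict encoding and helper name are made up).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(event):
    """P(event), where event is a predicate over (headache, sunny, cold)."""
    return sum(p for (h, s, c), p in joint.items() if event(h, s, c))

print(prob(lambda h, s, c: h and s and c))             # P(headache ∧ sunny ∧ cold) = 0.108
print(prob(lambda h, s, c: (not h) and s and not c))   # P(~headache ∧ sunny ∧ ~cold) = 0.064
print(prob(lambda h, s, c: h or s))                    # P(headache ∨ sunny) = 0.28
print(prob(lambda h, s, c: h))                         # P(headache) = 0.2 (marginalization)
```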
Conditional Probability
• Pr(H|F): fraction of worlds in which F is true that also have H true

H = "Have headache"
F = "Have flu"

Pr(H) = 1/10
Pr(F) = 1/40
Pr(H|F) = 1/2

Headaches are rare and flu is rarer, but if you have the flu, then there is a 50-50 chance you will have a headache.
Conditional Probability
Pr(H|F) = fraction of flu-afflicted worlds in which you have a headache

= (# worlds with flu and headache) / (# worlds with flu)

= (area of "H and F" region) / (area of "F" region)

= Pr(H ∧ F) / Pr(F)

H = "Have headache"
F = "Have flu"

Pr(H) = 1/10
Pr(F) = 1/40
Pr(H|F) = 1/2



Conditional Probability
• Definition:
Pr(a|b) = Pr(a ∧ b) / Pr(b)

• Chain rule:
Pr(a ∧ b) = Pr(a|b) Pr(b)

Memorize these!



Inference
One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with the flu."

H = "Have headache"
F = "Have flu"

Is your reasoning correct?

Pr(H) = 1/10
Pr(F) = 1/40
Pr(H|F) = 1/2

Pr(F ∧ H) = Pr(H|F) Pr(F) = (1/2)(1/40) = 1/80
Pr(F|H) = Pr(F ∧ H) / Pr(H) = (1/80) / (1/10) = 1/8

No: given a headache, the chance of flu is only 1/8, not 1/2.
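The same arithmetic as a tiny illustrative Python check (the variable names are mine; the numbers are the slide's):

```python
# A quick numeric check of the headache/flu reasoning (sketch; numbers from the slide).
p_h = 1 / 10          # Pr(H): probability of a headache
p_f = 1 / 40          # Pr(F): probability of the flu
p_h_given_f = 1 / 2   # Pr(H | F)

# Chain rule: Pr(F ∧ H) = Pr(H | F) Pr(F)
p_f_and_h = p_h_given_f * p_f       # 1/80 = 0.0125

# Definition of conditional probability: Pr(F | H) = Pr(F ∧ H) / Pr(H)
p_f_given_h = p_f_and_h / p_h       # 1/8 = 0.125, not 1/2
print(p_f_and_h, p_f_given_h)
```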



Example: Joint Distribution
             sunny                 ~sunny
          cold    ~cold         cold    ~cold
headache  0.108   0.012         0.072   0.008
~headache 0.016   0.064         0.144   0.576

Pr(headache ∧ cold | sunny) = 0.108 / 0.2 = 0.54

Pr(headache ∧ cold | ~sunny) = 0.072 / 0.8 = 0.09
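A conditional query is just a ratio of two sums over the joint table. A minimal sketch, assuming the same hypothetical encoding of the table as in the earlier snippet:

```python
# Conditional queries over the same joint table (illustrative sketch).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(event):
    return sum(p for (h, s, c), p in joint.items() if event(h, s, c))

def cond(event, given):
    """Pr(event | given) = Pr(event ∧ given) / Pr(given)."""
    return prob(lambda h, s, c: event(h, s, c) and given(h, s, c)) / prob(given)

print(cond(lambda h, s, c: h and c, lambda h, s, c: s))      # Pr(headache ∧ cold | sunny)  ≈ 0.54
print(cond(lambda h, s, c: h and c, lambda h, s, c: not s))  # Pr(headache ∧ cold | ~sunny) ≈ 0.09
```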



Bayes Rule
• Note:
Pr(a|b) Pr(b) = Pr(a ∧ b) = Pr(b ∧ a) = Pr(b|a) Pr(a)

• Bayes rule:
Pr(b|a) = Pr(a|b) Pr(b) / Pr(a)

Memorize this!



Using Bayes Rule for inference
• Often we want to form a hypothesis about the world based on what we have observed
• Bayes rule is vitally important when viewed in terms of stating the belief given to hypothesis H, given evidence e:

Pr(H|e) = Pr(e|H) Pr(H) / Pr(e)

where Pr(H|e) is the posterior probability, Pr(e|H) is the likelihood, Pr(H) is the prior probability, and Pr(e) is the normalizing constant.
Bayesian Learning
• Prior: Pr(H)
• Likelihood: Pr(E|H)
• Evidence: E = <e1, e2, …, en>

• Bayesian learning amounts to computing the posterior using Bayes' theorem:
Pr(H|E) = k Pr(E|H) Pr(H)
where k = 1/Pr(E) is a normalizing constant



Bayesian Prediction
• Suppose we want to make a prediction about an
unknown quantity X

• Pr(X|E) = Σ_i Pr(X|E, h_i) Pr(h_i|E)
          = Σ_i Pr(X|h_i) Pr(h_i|E)
  (the second step assumes X is independent of the past evidence E given each hypothesis h_i)

• Predictions are weighted averages of the predictions of the individual hypotheses
• Hypotheses serve as "intermediaries" between raw data and prediction



Candy Example
• Favorite candy sold in two flavors:
– Lime (ugh)
– Cherry (yum)
• Same wrapper for both flavors
• Sold in bags with different ratios:
– 100% cherry
– 75% cherry + 25% lime
– 50% cherry + 50% lime
– 25% cherry + 75% lime
– 100% lime



Candy Example
• You bought a bag of candy but don’t know its flavor
ratio

• After eating k candies:


– What’s the flavor ratio of the bag?
– What will be the flavor of the next candy?



Statistical Learning
• Hypothesis H: probabilistic theory of the world
– ℎ1: 100% cherry
– ℎ2: 75% cherry + 25% lime
– ℎ3: 50% cherry + 50% lime
– ℎ4: 25% cherry + 75% lime
– ℎ5: 100% lime
• Examples E: evidence about the world
– e1: 1st candy is cherry
– e2: 2nd candy is lime
– e3: 3rd candy is lime
– …



Candy Example
• Assume prior Pr(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
• Assume candies are i.i.d. (independent and identically distributed):
Pr(E|h) = Π_i Pr(e_i|h)
• Suppose the first 10 candies all taste lime:
Pr(E|h5) = 1^10 = 1
Pr(E|h3) = 0.5^10 ≈ 0.001
Pr(E|h1) = 0^10 = 0



Posterior

[Figure: posterior probabilities Pr(h_i | e_1, …, e_n) for each hypothesis as the number of observed lime candies grows from 0 to 10]


Prediction

[Figure: probability that the next candy is lime, Pr(lime | e_1, …, e_n), as the number of observed lime candies grows from 0 to 10]
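The posterior and prediction values behind these two figures can be reproduced with a short script. This is an illustrative sketch, not part of the slides; the list names and the update loop are my own:

```python
# Bayesian learning for the candy example (illustrative sketch).
# Hypothesis h_i is identified with Pr(lime | h_i).
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # h1 .. h5
prior     = [0.1, 0.2, 0.4, 0.2, 0.1]     # Pr(h_i)

posterior = prior[:]
for n in range(1, 11):                    # observe 10 lime candies, one at a time
    # Bayes update: Pr(h_i | e_1..e_n) ∝ Pr(lime | h_i) * Pr(h_i | e_1..e_{n-1})
    posterior = [q * p for q, p in zip(lime_prob, posterior)]
    k = 1.0 / sum(posterior)              # normalizing constant
    posterior = [k * p for p in posterior]

    # Bayesian prediction: Pr(next is lime | E) = sum_i Pr(lime | h_i) Pr(h_i | E)
    pred = sum(q * p for q, p in zip(lime_prob, posterior))
    print(f"after {n:2d} limes: posterior = {[round(p, 3) for p in posterior]}, "
          f"Pr(next lime) = {pred:.3f}")
```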



Bayesian Learning
• Bayesian learning properties:
– Optimal (i.e. given prior, no other prediction is correct
more often than the Bayesian one)
– No overfitting (all hypotheses considered and weighted)

• There is a price to pay:


– When the hypothesis space is large, Bayesian learning may be intractable
– i.e., the sum (or integral) over hypotheses is often intractable
• Solution: approximate Bayesian learning



Maximum a posteriori (MAP)
• Idea: make prediction based on the most probable hypothesis h_MAP:
h_MAP = argmax_{h_i} Pr(h_i | E)
Pr(X|E) ≈ Pr(X|h_MAP)

• In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probability
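A minimal sketch of the contrast, reusing the candy numbers (hypothetical variable names; not from the slides): MAP commits to the single most probable hypothesis, while the Bayesian prediction averages over all of them.

```python
# MAP vs. Bayesian prediction for the candy example after 10 lime candies (sketch).
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # Pr(lime | h_i) for h1..h5
prior     = [0.1, 0.2, 0.4, 0.2, 0.1]     # Pr(h_i)

# Posterior after 10 limes: Pr(h_i | E) ∝ Pr(E | h_i) Pr(h_i) = lime_prob[i]**10 * prior[i]
unnorm    = [q ** 10 * p for q, p in zip(lime_prob, prior)]
posterior = [u / sum(unnorm) for u in unnorm]

# h_MAP = argmax_i Pr(h_i | E); MAP prediction uses that single hypothesis
i_map = max(range(len(posterior)), key=lambda i: posterior[i])
print(f"h_MAP = h{i_map + 1}, Pr(next lime | h_MAP) = {lime_prob[i_map]}")

# Full Bayesian prediction: weighted average over all hypotheses
print(f"Bayesian Pr(next lime | E) = {sum(q * p for q, p in zip(lime_prob, posterior)):.3f}")
```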



MAP properties
• MAP prediction less accurate than Bayesian
prediction since it relies only on one hypothesis h_MAP
• But MAP and Bayesian predictions converge as data
increases
• Controlled overfitting (prior can be used to penalize
complex hypotheses)

• Finding h_MAP may be intractable:
– h_MAP = argmax_h Pr(h|E)
– Optimization may be difficult



Maximum Likelihood (ML)
• Idea: simplify MAP by assuming uniform prior
(i.e., Pr(h_i) = Pr(h_j) for all i, j)
h_MAP = argmax_h Pr(h) Pr(E|h)
h_ML = argmax_h Pr(E|h)

• Make prediction based on h_ML only:
Pr(X|E) ≈ Pr(X|h_ML)



ML properties
• ML prediction less accurate than Bayesian and MAP
predictions since it ignores prior info and relies only
on one hypothesis h_ML
• But ML, MAP and Bayesian predictions converge as
data increases
• Subject to overfitting (no prior to penalize complex
hypotheses that could exploit statistically insignificant
data patterns)

• Finding h_ML is often easier than h_MAP:

h_ML = argmax_h Σ_i log Pr(e_i|h)
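For the candy data, this log-likelihood maximization can be written directly. A sketch under the slide's assumptions (i.i.d. candies, five candidate lime proportions); the helper name is mine:

```python
import math

# Maximum likelihood hypothesis for the candy data (illustrative sketch).
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # Pr(lime | h_i) for h1..h5
data = ["lime"] * 10                       # first 10 candies all taste lime

def log_likelihood(q):
    """Sum_i log Pr(e_i | h) when Pr(lime|h) = q and Pr(cherry|h) = 1 - q."""
    total = 0.0
    for e in data:
        p = q if e == "lime" else 1.0 - q
        if p == 0.0:
            return float("-inf")           # this hypothesis rules the data out
        total += math.log(p)
    return total

i_ml = max(range(len(lime_prob)), key=lambda i: log_likelihood(lime_prob[i]))
print(f"h_ML = h{i_ml + 1}")               # h5: the 100% lime bag
```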
