
CSCE-421 Machine Learning

4. Inference from Probabilities

Instructor: Guni Sharon, classes: TR 3:55-5:10, HRBB 124


1
Based on slides by: Pieter Abbeel, Dan Klein
Announcements
• Lecture recordings “may be synchronous”
• The student must be admitted by faculty into the Zoom session (instead of joining freely), so that it is used only by students in a quarantine situation
• The student must agree to have their name visible to the students in class; by joining the Zoom session the student gives up some privacy and discloses private medical information
• No class or office hours on Tuesday, Sep-28
• Overdue:
• Written assignment 1: K Nearest Neighbors + Linear algebra (due Wednesday, Sep-15)
• Due:
• Programming assignment 1: Perceptron (due Wednesday, Sep-22)

2
So far
• Binary classification,
• Linear classifier,
• E.g., infected with COVID-19 (y/n) based on symptoms
• Train using the perceptron algorithm
• Guaranteed to converge in the separable case
• Multiclass classification,
• Linear classifier,
• E.g., most likely disease based on symptoms
• Train using the multiclass perceptron algorithm
• Guaranteed to converge in the separable case
(http://proceedings.mlr.press/v97/beygelzimer19a/beygelzimer19a-supp.pdf)

3
Generative vs Discriminative Models
• So far: Discriminative classifiers
• Assume some functional form for $P(Y \mid X)$
• Estimate the parameters of $P(Y \mid X)$ directly from training data
• Find a decision boundary between the classes
• E.g., K-NN, linear classifier (perceptron)
• Limited by model assumptions (curse of dimensionality, linear separability)
• Next: Generative classifiers
• Assume some functional form for $P(X \mid Y)$, $P(Y)$
• Estimate the parameters of $P(X \mid Y)$, $P(Y)$ directly from training data
• Use Bayes' rule to calculate $P(Y \mid X)$
• Find the actual distribution of each class
4
Estimating probability
• Empirical estimation of probability
• 10 coin flips: estimate $P(\text{Head})$ by the relative frequency of heads, $\hat{p} = \frac{\#\text{Heads}}{10}$
• Maximum Likelihood Estimation
• Assume some underlying parametrized distribution $P(x \mid \theta)$ that the data $D$ comes from
• Find parameters $\theta$ that maximize the probability of sampling $D$: $\hat{\theta} = \arg\max_\theta P(D \mid \theta)$
5
Estimating probability
• Probability of observing $x$ "Head" samples out of $n$ samples in total, where the probability of "Head" in a single observation is $p$:
$P(x) = \binom{n}{x}\, p^x (1-p)^{n-x}$
• Assume some underlying parametrized distribution $P(x \mid \theta)$ that the data comes from
• Find parameters $\theta$ that maximize the probability of sampling the data
• For the coin case, we have a Binomial distribution
6
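To make the coin example concrete, here is a minimal sketch (not from the slides) that simulates coin flips, computes the maximum-likelihood estimate of the bias, and evaluates the binomial likelihood; the variable names and numbers are illustrative.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
true_p = 0.7                      # "true" coin bias, used only to simulate data
flips = rng.random(10) < true_p   # 10 coin flips: True = Head
x, n = int(flips.sum()), len(flips)

# The MLE for a Bernoulli/Binomial model is the relative frequency of heads
p_mle = x / n

def binom_likelihood(p, x, n):
    """Binomial probability of observing x heads out of n flips for a candidate p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(f"x={x}, n={n}, MLE p_hat={p_mle:.2f}")
print("likelihood at the MLE:", binom_likelihood(p_mle, x, n))
```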
Sidestep: PDF
• Probability density function (PDF)
• The probability of the random variable falling within a particular range of values
• Given by the integral of this variable's PDF over that range: $P(a \le X \le b) = \int_a^b f_X(x)\, dx$
• Taking on any specific value has probability 0
• The PDF is nonnegative everywhere, and its integral over the entire space (the CDF evaluated over the full range) is equal to 1

7
Estimating probability
• Maximize the log likelihood instead: $\hat{\theta} = \arg\max_\theta \log P(D \mid \theta)$
• Find the max with respect to $\theta$ (e.g., set $\frac{\partial}{\partial \theta} \log P(D \mid \theta) = 0$)

Log likelihood (important, pay attention):
• MLE is invariant under this transformation
• Common trick for breaking up a factored objective function (a product of probabilities becomes a sum of logs)
• Log is not defined over negative values, but a probability can't be negative, so this is not a problem
• Log is monotonically increasing, so $f(x)$ and $\log(f(x))$ attain their maximum at the same $x$
8
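A minimal numerical sketch (not from the slides) of the same idea: maximizing the Bernoulli log likelihood over a grid of candidate parameters recovers the closed-form MLE, because log is monotonically increasing. The counts are illustrative.

```python
import numpy as np

x, n = 7, 10  # e.g., 7 heads out of 10 flips (illustrative numbers)

def log_likelihood(p, x, n):
    # Bernoulli/Binomial log likelihood up to a constant (the log binomial coefficient)
    return x * np.log(p) + (n - x) * np.log(1 - p)

# Grid search over candidate parameters: the argmax of the log likelihood
# matches the closed-form MLE x/n.
grid = np.linspace(0.01, 0.99, 9801)
p_hat = grid[np.argmax(log_likelihood(grid, x, n))]
print(p_hat, x / n)  # both approximately 0.7
```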
Unseen Events
• MLE for P(outcome=5)?
• If outcome 5 was never observed in the data, the MLE assigns it probability 0

Laplace Smoothing
• Laplace's estimate:
• Pretend you saw every outcome once more than you actually did
• $P_{LAP}(x) = \frac{c(x) + 1}{N + |X|}$
• Example: estimate the probability of drawing 'red' given 3 observations: r, r, b
• $P_{LAP}(red) = \frac{2+1}{3+2} = \frac{3}{5}$, $P_{LAP}(blue) = \frac{1+1}{3+2} = \frac{2}{5}$
Laplace Smoothing
• Laplace's estimate (extended):
• Pretend you saw every outcome k extra times: $P_{LAP,k}(x) = \frac{c(x) + k}{N + k|X|}$
• Example (r, r, b): $P_{LAP,k}(red) = \frac{2+k}{3+2k}$
• What's Laplace with k = 0? The plain MLE (relative frequencies)
• k is the strength of the uniformity prior
• Laplace for conditionals:
• Smooth each condition independently: $P_{LAP,k}(x \mid y) = \frac{c(x, y) + k}{c(y) + k|X|}$
• A short code sketch of these estimates follows below
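A minimal sketch (not from the slides) of the MLE and Laplace-smoothed estimates for the r/r/b example; the function and variable names are mine.

```python
from collections import Counter

def laplace_estimate(observations, domain, k=1):
    """Laplace-smoothed estimate: pretend every outcome in the domain
    was seen k extra times. k=0 recovers the plain MLE."""
    counts = Counter(observations)
    n = len(observations)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

obs = ["r", "r", "b"]
print(laplace_estimate(obs, domain=["r", "b"], k=0))  # MLE: {'r': 0.667, 'b': 0.333}
print(laplace_estimate(obs, domain=["r", "b"], k=1))  # Laplace: {'r': 0.6, 'b': 0.4}
```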
Maximum Likelihood?
• Relative frequencies are the maximum likelihood estimates for a given data set:
$\hat{\theta}_{MLE} = \arg\max_\theta P(D \mid \theta)$
• Another option is to consider the maximum a posteriori probability given the data:
$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D)$
• We need to estimate a distribution over $\theta$, which is now treated as a random variable instead of a parameter
Written assignment 2: generative models
• Assume a data set with a given number of "Head" events and a given number of "Tail" events
• Define one estimator as the MLE with Laplace smoothing with some smoothing parameter
• Define a second estimator as the MAP estimate under a Beta distribution prior (whose density includes a normalizing constant)
• Prove that, in this case, the two estimators are equal
• Due Friday, Sep-24

13
Maximum a posteriori probability
• Model $\theta$ as a random variable, drawn from a distribution $P(\theta)$
• Note that $\theta$ is not a random variable associated with an event in a sample space
• $P(\theta)$ is the prior distribution over the parameter(s) $\theta$, before we see any data
• $P(D \mid \theta)$ is the likelihood of the data given the parameter(s)
• $P(\theta \mid D)$ is the posterior distribution over the parameter(s) after we have observed the data
14
Maximum a posteriori probability
• Choose the parameters that maximize the posterior $P(\theta \mid D)$:
$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \frac{P(D \mid \theta)\, P(\theta)}{P(D)} = \arg\max_\theta P(D \mid \theta)\, P(\theta)$
15
Empirical probability estimation summary
• MLE: $\hat{\theta} = \arg\max_\theta P(D \mid \theta)$, where $\theta$ is the set of model parameters
• MAP: $\hat{\theta} = \arg\max_\theta P(D \mid \theta)\, P(\theta)$, where $\theta$ is a set of random variables
• MAP only adds the term $P(\theta)$
• It is independent of the data and penalizes parameters $\theta$ that deviate too much from our prior belief
• We will later revisit this as a form of regularization, where $P(\theta)$ will be interpreted as a measure of classifier complexity
16
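To make the MLE/MAP distinction concrete, here is a minimal sketch (not part of the slides) for the coin example, assuming a Beta prior over the coin bias; the counts and prior parameters are illustrative.

```python
heads, tails = 7, 3          # illustrative counts
a, b = 3.0, 3.0              # assumed Beta(a, b) prior over the coin bias theta

# MLE: maximize P(D | theta)  ->  relative frequency of heads
theta_mle = heads / (heads + tails)

# MAP: maximize P(D | theta) * P(theta) with a Beta prior.
# For a Beta(a, b) prior the posterior mode has a closed form:
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)

print(f"MLE: {theta_mle:.3f}, MAP with Beta({a},{b}) prior: {theta_map:.3f}")
# The prior pulls the MAP estimate toward 0.5, like Laplace smoothing with k = a - 1.
```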
Generative model
• Can be trained with either MLE or MAP approaches
• Provides a distribution over labels
• Why is this powerful?
• Allows the use of statistical inference (probabilistic reasoning)!

17
Notation clarification
(Quiz 1 now available. Complete by Tuesday, Sep-28)

• $P(x \mid y)$: the probability of observing $x$ given event $y$
• $P(x \mid \theta)$: the probability of sampling $x$ given a distribution that is defined by the parameters $\theta$
• E.g., a multivariate normal distribution with $\theta = (\mu, \Sigma)$
• $P(D \mid \theta)$: the probability of sampling the entire training set $D$ given a distribution that is defined by $\theta$
• $P(D \mid \theta)$ with $\theta \sim P(\theta)$: the probability of sampling the entire training set given a sampled distribution (sampled from a distribution of distributions)
18
So far
• Discriminative model
• Maps features to labels
• Generative model
• Maps features to a distribution over labels (probability per class)
• Do we really need generative models?
• Seems like overkill
• Allows the use of statistical inference (probabilistic reasoning)!

19
Inference from probabilities
• A ghost is in the grid
somewhere
• Sensor readings tell how close a
square is to the ghost
• On the ghost: red
• 1 or 2 away: orange
• 3 or 4 away: yellow
• 5+ away: green

 Sensors are noisy, but we know P(Color | Distance)


P(red | 3) P(orange | 3) P(yellow | 3) P(green | 3)
0.05 0.15 0.5 0.3
[Demo: Ghostbuster – no probability (L12D1) ]
Video of Demo Ghostbuster – No probability
Uncertainty
• General situation:
• Observed variables (evidence, X): Agent knows
certain things about the state of the world (e.g.,
sensor readings or symptoms)
• Unobserved variables (Y): Agent needs to
reason about other aspects (e.g. where an object
is or what disease is present)
• Generative model ($P(X \mid Y)$): Agent knows something about how the known variables relate to the unknown variables
• Probabilistic reasoning gives us a framework
for managing our beliefs and knowledge
Random Variables
• A random variable is some aspect of the world about
which we (may) have uncertainty
• R = Is it raining?
• T = Is it hot or cold?
• D = How long will it take to drive to work?
• L = Where is the ghost?
• We denote random variables with capital letters
• Random variables have domains
• R in {true, false} (often write as {+r, -r})
• T in {hot, cold}
• D in [0, )
• L in possible locations, maybe {(0,0), (0,1), …}
Probability Distributions
• Associate a probability with each value

• Temperature:
T P
hot 0.5
cold 0.5

• Weather:
W P
sun 0.6
rain 0.1
fog 0.3
meteor 0.0
Probability Distributions
• Random variables are affiliated with distributions

T P
hot 0.5
cold 0.5

W P
sun 0.6
rain 0.1
fog 0.3
meteor 0.0

• Shorthand notation: $P(hot) = P(T = hot)$, OK if all domain entries are unique
• A distribution is a TABLE of probability per value
• An outcome probability is a single number
• Must have: $P(X = x) \ge 0$ for all $x$, and $\sum_x P(X = x) = 1$

Joint Distributions
• A joint distribution over a set of random variables $X_1, \dots, X_n$ specifies a real number for each assignment (or outcome): $P(X_1 = x_1, \dots, X_n = x_n)$

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

• Must obey: $P(x_1, \dots, x_n) \ge 0$ and $\sum_{(x_1, \dots, x_n)} P(x_1, \dots, x_n) = 1$
• Size of distribution if n variables with domain sizes d? $d^n$
• For all but the smallest distributions, impractical to write out!
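A minimal sketch (illustrative, not from the slides) representing the T/W joint distribution as a Python dict keyed by assignments; the sketches after the next few slides reuse this same `joint` table.

```python
# Joint distribution P(T, W) from the slide, keyed by (t, w) assignments
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

# Sanity checks: nonnegative entries that sum to one
assert all(p >= 0 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-9
```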
Events
• An event is a set E of outcomes: $P(E) = \sum_{(x_1, \dots, x_n) \in E} P(x_1, \dots, x_n)$
• From a joint distribution, we can calculate the probability of any event

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

• Probability that it's hot AND sunny?
• Probability that it's hot?
• Probability that it's hot OR sunny?
• Typically, the events we care about are partial assignments, like P(T=hot)
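A sketch (mine, not the slides') of answering the three event questions by summing the matching outcomes of the illustrative `joint` dict.

```python
# Joint P(T, W) from the slide (same illustrative dict as above)
joint = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1, ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

def event_prob(joint, condition):
    """Probability of an event: sum the joint entries whose outcome satisfies the condition."""
    return sum(p for outcome, p in joint.items() if condition(outcome))

print(event_prob(joint, lambda o: o == ("hot", "sun")))             # P(hot AND sunny) = 0.4
print(event_prob(joint, lambda o: o[0] == "hot"))                   # P(hot) = 0.5
print(event_prob(joint, lambda o: o[0] == "hot" or o[1] == "sun"))  # P(hot OR sunny) = 0.7
```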
Quiz: Events

X Y P
+x +y 0.2
+x -y 0.3
-x +y 0.4
-x -y 0.1

• P(+x, +y) ?
• P(+x) ?
• P(-y OR +x) ?
Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): combine collapsed rows by adding, e.g., $P(t) = \sum_w P(t, w)$

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

T P
hot 0.5
cold 0.5

W P
sun 0.6
rain 0.4
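A sketch of marginalization on the same illustrative `joint` dict: sum out the variable you don't care about.

```python
from collections import defaultdict

joint = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1, ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

def marginalize(joint, keep_index):
    """Sum out every variable except the one at position keep_index of each outcome tuple."""
    marginal = defaultdict(float)
    for outcome, p in joint.items():
        marginal[outcome[keep_index]] += p
    return dict(marginal)

print(marginalize(joint, 0))  # P(T): {'hot': 0.5, 'cold': 0.5}
print(marginalize(joint, 1))  # P(W): {'sun': ~0.6, 'rain': ~0.4}
```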
Quiz: Marginal Distributions

X Y P
+x +y 0.2
+x -y 0.3
-x +y 0.4
-x -y 0.1

X P
+x
-x

Y P
+y
-y
Conditional Probabilities
• Derived from the joint probability P(a,b)
• The definition of a conditional probability: $P(a \mid b) = \frac{P(a, b)}{P(b)}$

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

• E.g., $P(W = sun \mid T = cold) = \frac{P(W = sun, T = cold)}{P(T = cold)} = \frac{0.2}{0.5} = 0.4$
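A sketch of the definition applied to the illustrative `joint` dict (the helper name and indexing scheme are mine).

```python
joint = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1, ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

def conditional(joint, a_index, a_value, b_index, b_value):
    """P(A = a | B = b) = P(a, b) / P(b), both computed from the joint table."""
    p_ab = sum(p for o, p in joint.items() if o[a_index] == a_value and o[b_index] == b_value)
    p_b = sum(p for o, p in joint.items() if o[b_index] == b_value)
    return p_ab / p_b

# P(W = sun | T = cold) = 0.2 / 0.5 = 0.4
print(conditional(joint, a_index=1, a_value="sun", b_index=0, b_value="cold"))
```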
Quiz: Conditional Probabilities

X Y P
+x +y 0.2
+x -y 0.3
-x +y 0.4
-x -y 0.1

• P(+x | +y) ?
• P(-x | +y) ?
• P(-y | +x) ?
Conditional Distributions
• Conditional distributions are probability distributions over
some variables given fixed values of others
Conditional Distributions

Joint Distribution
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

P(W | T = hot)
W P
sun 0.8
rain 0.2

P(W | T = cold)
W P
sun 0.4
rain 0.6
Conditional Distributions

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

P(W | T = cold)
W P
sun 0.4
rain 0.6
Normalization Trick
• Compute P(W | T = cold) in two steps:
• SELECT the joint probabilities matching the evidence
• NORMALIZE the selection (make it sum to one)

T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3

Selected (T = cold)
T W P
cold sun 0.2
cold rain 0.3

Normalized: P(W | T = cold)
W P
sun 0.4
rain 0.6
To Normalize
• (Dictionary) To bring or restore to a normal condition: all entries sum to ONE
• Procedure:
• Step 1: Compute Z = sum over all entries
• Step 2: Divide every entry by Z

• Example 1 (Z = 0.5):
Before normalizing:
W P
sun 0.2
rain 0.3
After normalizing:
W P
sun 0.4
rain 0.6

• Example 2 (Z = 50):
Before normalizing:
T W P
hot sun 20
hot rain 5
cold sun 10
cold rain 15
After normalizing:
T W P
hot sun 0.4
hot rain 0.1
cold sun 0.2
cold rain 0.3
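A sketch of the two-step procedure (compute Z, divide by Z) on the same illustrative dict representation used above.

```python
joint = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1, ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

def normalize(table):
    """Step 1: compute Z = sum over all entries. Step 2: divide every entry by Z."""
    z = sum(table.values())
    return {k: v / z for k, v in table.items()}

# SELECT the entries matching the evidence T = cold, then NORMALIZE the selection
selected = {o: p for o, p in joint.items() if o[0] == "cold"}
print(normalize(selected))  # P(W | T=cold): {('cold', 'sun'): ~0.4, ('cold', 'rain'): ~0.6}
```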
Quiz: Normalization Trick
• P(X | Y = -y) ?
• SELECT the joint probabilities matching the evidence, then NORMALIZE the selection (make it sum to one)

X Y P
+x +y 0.2
+x -y 0.3
-x +y 0.4
-x -y 0.1

Selected
X P
+x
-x

Normalized
X P
+x
-x
Probabilistic Inference
• Probabilistic inference: compute a desired
probability from other known probabilities (e.g.
conditional from joint)
• We generally compute conditional probabilities
• P(on time | no reported accidents) = 0.90
• These represent the agent’s beliefs given the
evidence
• Probabilities change with new evidence:
• P(on time | no accidents, 5 a.m.) = 0.95
• P(on time | no accidents, 5 a.m., raining) = 0.80
• Observing new evidence causes beliefs to be updated
Inference by Enumeration
• General case:
• Evidence variables: $E_1, \dots, E_k = e_1, \dots, e_k$
• Query* variable: $Q$
• Hidden variables: $H_1, \dots, H_r$
(together: all the variables)
• We want: $P(Q \mid e_1, \dots, e_k)$
(* works fine with multiple query variables, too)

 Step 1: Select the entries consistent with the evidence
 Step 2: Sum out H to get the joint of the query and evidence: $P(Q, e_1, \dots, e_k) = \sum_{h_1, \dots, h_r} P(Q, h_1, \dots, h_r, e_1, \dots, e_k)$
 Step 3: Normalize: $P(Q \mid e_1, \dots, e_k) = \frac{P(Q, e_1, \dots, e_k)}{P(e_1, \dots, e_k)}$
Inference by Enumeration

S T W P
summer hot sun 0.30
summer hot rain 0.05
summer cold sun 0.10
summer cold rain 0.05
winter hot sun 0.10
winter hot rain 0.05
winter cold sun 0.15
winter cold rain 0.20

• P(W)?
W P
sun
rain

• P(W | winter)?
W P
sun
rain

• P(W | winter, hot)?
W P
sun
rain

(The three queries are worked in the code sketch below.)
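A sketch (not from the slides) of inference by enumeration over the season/temperature/weather table, answering the three queries; the helper names are mine.

```python
from collections import defaultdict

# Joint P(S, T, W) from the slide, keyed by (season, temperature, weather)
joint_stw = {
    ("summer", "hot", "sun"): 0.30, ("summer", "hot", "rain"): 0.05,
    ("summer", "cold", "sun"): 0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot", "sun"): 0.10, ("winter", "hot", "rain"): 0.05,
    ("winter", "cold", "sun"): 0.15, ("winter", "cold", "rain"): 0.20,
}
VARS = ("S", "T", "W")

def enumerate_inference(joint, query, evidence):
    """P(query | evidence): select consistent entries, sum out hidden variables, normalize."""
    qi = VARS.index(query)
    answer = defaultdict(float)
    for outcome, p in joint.items():
        if all(outcome[VARS.index(v)] == val for v, val in evidence.items()):
            answer[outcome[qi]] += p                       # Steps 1-2: select and sum out
    z = sum(answer.values())
    return {val: p / z for val, p in answer.items()}       # Step 3: normalize

print(enumerate_inference(joint_stw, "W", {}))                           # P(W): sun 0.65, rain 0.35
print(enumerate_inference(joint_stw, "W", {"S": "winter"}))              # P(W | winter): sun 0.5, rain 0.5
print(enumerate_inference(joint_stw, "W", {"S": "winter", "T": "hot"}))  # sun 2/3, rain 1/3
```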
Inference by Enumeration
 Obvious problems:
 Worst-case time complexity O(d^n)
 Space complexity O(d^n) to store the joint distribution
The Product Rule
• Sometimes have conditional distributions but want the joint:
$P(x, y) = P(x \mid y)\, P(y)$

The Product Rule
• Example: $P(D, W) = P(D \mid W)\, P(W)$

P(W)
W P
sun 0.8
rain 0.2

P(D | W)
D W P
wet sun 0.1
dry sun 0.9
wet rain 0.7
dry rain 0.3

P(D, W)
D W P
wet sun 0.08
dry sun 0.72
wet rain 0.14
dry rain 0.06
The Chain Rule
• More generally, we can always write any joint distribution as an incremental product of conditional distributions:
$P(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \dots, x_{i-1})$
• Proof: apply the product rule repeatedly, e.g.,
$P(x_1, x_2, x_3) = P(x_3 \mid x_1, x_2)\, P(x_1, x_2) = P(x_3 \mid x_1, x_2)\, P(x_2 \mid x_1)\, P(x_1)$
Bayes’ Rule

Bayes’ Rule
• Two ways to factor a joint distribution over two variables:
$P(x, y) = P(x \mid y)\, P(y) = P(y \mid x)\, P(x)$
• Dividing, we get:
$P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}$
• Why is this at all helpful?
• Lets us build one conditional from its reverse
• Often one conditional is tricky but the other one is simple
• Foundation of many systems we’ll see later
• In the running for most important AI equation!
Inference with Bayes’ Rule
• Example: diagnostic probability from causal probability:
$P(\text{cause} \mid \text{effect}) = \frac{P(\text{effect} \mid \text{cause})\, P(\text{cause})}{P(\text{effect})}$
• Example:
• M: meningitis, S: stiff neck
• Givens: $P(+m) = 0.0001$, $P(+s \mid +m) = 0.8$, $P(+s \mid -m) = 0.01$
$P(+m \mid +s) = \frac{P(+s \mid +m)\, P(+m)}{P(+s \mid +m)\, P(+m) + P(+s \mid -m)\, P(-m)} = \frac{0.8 \times 0.0001}{0.8 \times 0.0001 + 0.01 \times 0.9999} \approx 0.0079$
• Note: posterior probability of meningitis still very small
• Note: you should still get stiff necks checked out!
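A sketch verifying the meningitis posterior numerically, using the givens as reconstructed above (the standard textbook values consistent with the 0.0079 result on the slide).

```python
def bayes_posterior(p_cause, p_effect_given_cause, p_effect_given_not_cause):
    """P(cause | effect) via Bayes' rule, with the total-probability denominator."""
    p_effect = (p_effect_given_cause * p_cause
                + p_effect_given_not_cause * (1 - p_cause))
    return p_effect_given_cause * p_cause / p_effect

# P(+m | +s) with P(+m)=0.0001, P(+s|+m)=0.8, P(+s|-m)=0.01
print(round(bayes_posterior(0.0001, 0.8, 0.01), 4))  # ~0.0079
```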
Quiz: Bayes’ Rule
• Given:

P(W)
W P
sun 0.8
rain 0.2

P(D | W)
D W P
wet sun 0.1
dry sun 0.9
wet rain 0.7
dry rain 0.3

• What is P(W | dry)?

$P(W \mid dry) = \frac{P(dry \mid W)\, P(W)}{P(dry)}$, where $P(D, W) = P(D \mid W)\, P(W)$:

P(D, W)
D W P
wet sun 0.08
dry sun 0.72
wet rain 0.14
dry rain 0.06

$P(dry) = 0.72 + 0.06 = 0.78$
Ghostbusters, Revisited
• Let’s say we have two distributions:
• Prior distribution over ghost location: P(G)
• Let’s say this is uniform
• Sensor reading model: P(R | G)
• Given: we know what our sensors do
• R = reading color measured at (1,1)
• E.g., P(R = yellow | G=(1,1)) = 0.1
• We can calculate the posterior distribution P(G | r) over ghost locations given a reading using Bayes’ rule:
$P(g \mid r) \propto P(r \mid g)\, P(g)$

[Demo: Ghostbuster – with probability (L12D2) ]


Video of Demo Ghostbusters with Probability
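A sketch of this Bayesian update over ghost locations on a small grid, with a uniform prior and a made-up distance-based sensor model; the grid size and all sensor probabilities here are illustrative, not the course's exact numbers.

```python
import itertools

GRID = list(itertools.product(range(3), range(3)))   # a small 3x3 grid of ghost locations
prior = {g: 1 / len(GRID) for g in GRID}             # uniform prior P(G)

def sensor_model(reading, ghost, sensor_pos=(1, 1)):
    """Illustrative P(R = reading | G = ghost): closer ghosts make 'red'/'orange'
    more likely, farther ones 'yellow'/'green'."""
    dist = abs(ghost[0] - sensor_pos[0]) + abs(ghost[1] - sensor_pos[1])
    table = {0: {"red": 0.7, "orange": 0.2, "yellow": 0.05, "green": 0.05},
             1: {"red": 0.1, "orange": 0.6, "yellow": 0.2, "green": 0.1},
             2: {"red": 0.05, "orange": 0.3, "yellow": 0.45, "green": 0.2}}
    return table.get(dist, {"red": 0.02, "orange": 0.08, "yellow": 0.3, "green": 0.6})[reading]

# Bayesian update: P(g | r) is proportional to P(r | g) P(g); then normalize
reading = "yellow"
unnormalized = {g: sensor_model(reading, g) * prior[g] for g in GRID}
z = sum(unnormalized.values())
posterior = {g: p / z for g, p in unnormalized.items()}
print(max(posterior, key=posterior.get), round(max(posterior.values()), 3))
```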
What did we learn?
• Generative vs Discriminative Models
• MLE and MAP
• Random variables and events
• Maximum Likelihood and Maximum a posteriori estimates
• Probability distribution
• Joint distribution and marginal distribution
• Conditional probabilities
• Probabilistic inference
• Bayes’ rule
• Bayesian inference
What next?
• Class:
• Bayesian networks
• Assignments:
• Programming assignment 1: Perceptron (due Wednesday, Sep-22)
• Written assignment 2: generative models (due Friday, Sep-24)
• Quizzes:
• Quiz 1 (due Tuesday, Sep-28)

52
