07 Probability Review
Probability Overview

• Events
– discrete random variables, continuous random variables,
compound events
• Axioms of probability
– What defines a reasonable theory of uncertainty
• Independent events
• Conditional probabilities
• Bayes rule and beliefs
• Joint probability distribution
• Expectations
• Independence, Conditional independence
Random Variables
• Informally, A is a random variable if
– A denotes something about which we are uncertain
– perhaps the outcome of a randomized experiment

• Examples
A = True if a randomly drawn person from our class is female
A = The hometown of a randomly drawn person from our class
A = True if two randomly drawn persons from our class have same birthday

• Define P(A) as “the fraction of possible worlds in which A is true” or “the fraction of times A holds, in repeated runs of the random experiment”
– the set of possible worlds is called the sample space, S
– a random variable A is a function defined over S:  A: S → {0,1}
A little formalism
More formally, we have
• a sample space S (e.g., set of students in our class)
– aka the set of possible worlds
• a random variable is a function defined over the sample space
– Gender: S → { m, f }
– Height: S → Reals
• an event is a subset of S
– e.g., the subset of S for which Gender = f
– e.g., the subset of S for which (Gender = m) AND (eyeColor = blue)
• we’re often interested in probabilities of specific events
• and of specific events conditioned on other specific events
Visualizing A
[Venn diagram: the sample space of all possible worlds has area 1; the oval of worlds in which A is true has area P(A); the remaining worlds are those in which A is false]

The Axioms of Probability
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)

[de Finetti 1931]:
when gambling based on “uncertainty formalism A” you can be exploited by an opponent
iff
your uncertainty formalism A violates these axioms

Elementary Probability in Pictures
• P(~A) + P(A) = 1
[picture: the sample space partitioned into A and ~A]
A useful theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0,
P(A or B) = P(A) + P(B) - P(A and B)

➔ P(A) = P(A ^ B) + P(A ^ ~B)

Proof: A = [A and (B or ~B)] = [(A and B) or (A and ~B)], so
P(A) = P(A and B) + P(A and ~B) – P((A and B) and (A and ~B))
     = P(A and B) + P(A and ~B)    since (A and B) and (A and ~B) = False, and P(False) = 0
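To make these identities concrete, here is a minimal Python sketch (not from the slides): events are subsets of a small finite sample space with a uniform distribution, and both the inclusion-exclusion axiom and the decomposition P(A) = P(A ^ B) + P(A ^ ~B) are checked numerically. The six worlds and the events A and B are made up for illustration.

from fractions import Fraction

# Toy sample space: 6 equally likely "possible worlds" (e.g., students).
S = {"w1", "w2", "w3", "w4", "w5", "w6"}

def P(event):
    """Probability of an event (a subset of S) under the uniform measure."""
    return Fraction(len(event), len(S))

# Two illustrative events (hypothetical labels).
A = {"w1", "w2", "w3"}          # e.g., Gender = f
B = {"w2", "w3", "w4", "w5"}    # e.g., eyeColor = blue
not_B = S - B

# Axiom-style checks.
assert P(set()) == 0 and P(S) == 1          # P(False) = 0, P(True) = 1
assert P(A | B) == P(A) + P(B) - P(A & B)   # P(A or B) = P(A) + P(B) - P(A and B)

# The useful theorem: P(A) = P(A ^ B) + P(A ^ ~B).
assert P(A) == P(A & B) + P(A & not_B)

print(P(A), P(B), P(A & B), P(A | B))       # 1/2 2/3 1/3 5/6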
Elementary Probability in Pictures
• P(A) = P(A ^ B) + P(A ^ ~B)
[picture: the oval A split by B into the regions A ^ B and A ^ ~B]
Definition of Conditional Probability

P(A|B) = P(A ^ B) / P(B)

Corollary: The Chain Rule

P(A ^ B) = P(A|B) P(B)
P(A ^ B ^ C) = P(A|BC) P(BC) = P(A|BC) P(B|C) P(C)
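To see the definition and the chain rule in action, here is a small self-contained Python check (mine, not from the slides); the uniform sample space and the events A, B, C are made up for illustration.

from fractions import Fraction

# A uniform sample space of 8 worlds; events are subsets.
S = set(range(8))
P = lambda E: Fraction(len(E), len(S))

A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
C = {3, 5, 6, 7}

def cond(E1, E2):
    """P(E1 | E2) = P(E1 ^ E2) / P(E2)."""
    return P(E1 & E2) / P(E2)

# Chain rule checks.
assert P(A & B) == cond(A, B) * P(B)
assert P(A & B & C) == cond(A, B & C) * cond(B, C) * P(C)
print(cond(A, B), P(A & B & C))   # 1/2  1/8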


Bayes Rule
• let’s write 2 expressions for P(A ^ B)

P(A ^ B) = P(A|B) P(B)
         = P(B|A) P(A)     [since P(A ^ B) = P(B ^ A)]

Therefore:

P(A|B) = P(B|A) P(A) / P(B)     ← Bayes’ rule

we call P(A) the “prior” and P(A|B) the “posterior”

[Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418]

“…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…”
Other Forms of Bayes Rule

P(A|B) = P(B|A) P(A) / P(B)

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

P(A|B ^ X) = P(B|A ^ X) P(A ^ X) / P(B ^ X)
Applying Bayes Rule

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

A = you have covid, B = you just coughed

Assume:
P(A) = 0.05
P(B|A) = 0.80
P(B|~A) = 0.20

What is P(covid | cough) = P(A|B)?

Solution:
P(A|B) = (0.8 × 0.05) / (0.8 × 0.05 + 0.2 × 0.95) ≈ 0.17
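A short sketch of this calculation in Python; the function name and structure are mine, only the three numbers come from the slide.

def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes rule, with P(B) expanded by total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence

# A = you have covid, B = you just coughed (numbers from the slide).
print(posterior(0.05, 0.80, 0.20))   # ≈ 0.174, i.e. about 0.17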
Applying Bayes Rule

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

A = You Fail, B = Did not listen to classes

Assume:
P(A) = 0.1
P(B|A) = 0.60

What is P(Fail | Did not listen to classes) = P(A|B)?

Solution:
what does all this have to do with
function approximation?

instead of F: X →Y,
learn P(Y | X)
The Joint Distribution (Example: Boolean variables A, B, C)

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variables → 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those probabilities must sum to 1.

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

[Venn diagram of A, B, C with each region labeled by its probability]
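As a quick sanity check of step 3, here is a minimal Python sketch (not part of the slides) that encodes the table above as a dictionary and verifies the eight probabilities sum to 1.

# The joint distribution table above, keyed by (A, B, C).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

# Step 3 of the recipe: the 2^M row probabilities must sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# Any single row can be read off directly, e.g. P(A=1, B=1, C=0).
print(joint[(1, 1, 0)])   # 0.25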
Using the Joint Distribution

Once you have the joint distribution, you can ask for the probability of any logical expression E involving these variables:

P(E) = Σ_{rows matching E} P(row)

e.g., from the joint distribution over gender, hours worked, and wealth used in the examples below:
P(Poor ^ Male) = 0.4654
P(Poor) = 0.7604
Inference with the Joint

P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ_{rows matching E1 and E2} P(row) ] / [ Σ_{rows matching E2} P(row) ]

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
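Since the census table behind the Poor/Male numbers is not reproduced in this copy, the sketch below (mine, not from the slides) demonstrates the same two formulas on the Boolean A, B, C joint distribution given earlier.

# Joint over Boolean A, B, C (the same table as the earlier example).
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(pred):
    """P(E) = sum of P(row) over the rows matching E."""
    return sum(p for row, p in joint.items() if pred(row))

def cond(pred1, pred2):
    """P(E1 | E2) = P(E1 ^ E2) / P(E2)."""
    return prob(lambda r: pred1(r) and pred2(r)) / prob(pred2)

A = lambda r: r[0] == 1
B = lambda r: r[1] == 1

print(prob(B))                        # P(B)     ≈ 0.50
print(prob(lambda r: A(r) and B(r)))  # P(A ^ B) ≈ 0.35
print(cond(A, B))                     # P(A | B) = 0.35 / 0.50 = 0.70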


Learning and the Joint Distribution

Suppose we want to learn the function f: <G, H> → W
Equivalently, P(W | G, H)
Solution: learn the joint distribution from data, then calculate P(W | G, H)

e.g., P(W=rich | G = female, H = 40.5- )
  = P(W=rich ^ G = female ^ H = 40.5- ) / P(G = female ^ H = 40.5- )
  = 0.024 / 0.277
  = 0.088
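Here is a minimal sketch (mine, not from the slides) of the recipe "learn the joint distribution from data, then calculate P(W | G, H)"; the eight training samples are made up, standing in for the census data used in the slides.

from collections import Counter

# Made-up training samples of (G, H, W).
data = [
    ("female", "40.5-", "poor"), ("female", "40.5-", "poor"),
    ("female", "40.5-", "rich"), ("female", "40.5+", "rich"),
    ("male",   "40.5+", "poor"), ("male",   "40.5+", "rich"),
    ("male",   "40.5-", "poor"), ("female", "40.5+", "poor"),
]

# 1. Estimate the joint distribution by normalized counts.
counts = Counter(data)
n = len(data)
joint = {row: c / n for row, c in counts.items()}

# 2. Calculate P(W=rich | G=female, H=40.5-) from the estimated joint.
def prob(pred):
    return sum(p for row, p in joint.items() if pred(*row))

numer = prob(lambda g, h, w: w == "rich" and g == "female" and h == "40.5-")
denom = prob(lambda g, h, w: g == "female" and h == "40.5-")
print(numer / denom)   # 1/3 with these made-up samples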
sounds like the solution to
learning F: X →Y,
or P(Y | X).

Are we done?
Main problem: learning P(Y|X) can require more data than we have.

Consider learning a joint distribution with 100 attributes:
• # of rows in this table? 2^100 ≈ 10^30
• # of people on earth? 7 × 10^9
• fraction of rows with 0 training examples?
What to do?
1. Be smart about how we estimate probabilities from sparse data
– maximum likelihood estimates
– maximum a posteriori estimates
2. Be smart about how to represent joint distributions
– Bayes networks, graphical models
1. Be smart about how we estimate probabilities

Estimating Probability of Heads
Flip a coin; record X=1 for heads, X=0 for tails. The intuitive estimate of P(X=1) is

α1 / (α1 + α0)

where α1 = number of flips with X=1 and α0 = number of flips with X=0.
Estimating θ = P(X=1)

Case A:
100 flips: 51 Heads (X=1), 49 Tails (X=0)

Case B:
3 flips: 2 Heads (X=1), 1 Tail (X=0)

Case C (online learning):
• keep flipping; want a single learning algorithm that gives a reasonable estimate after each flip
Principles for Estimating Probabilities

Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ)
• e.g., θ_MLE = arg max_θ P(D | θ)

Principle 2 (maximum a posteriori prob.):
• choose parameters θ that maximize P(θ | data)
• e.g., θ_MAP = arg max_θ P(θ | D)
Maximum Likelihood Estimation

P(X=1) = θ,  P(X=0) = (1-θ)

Data D: {1 1 0 0 1}
P(D|θ) : θ · θ · (1-θ) · (1-θ) · θ

Flips produce data D with α1 heads and α0 tails:
• flips are independent, identically distributed 1’s and 0’s (Bernoulli)
• α1 and α0 are counts that sum these outcomes (Binomial)
Maximum Likelihood Estimate for θ

θ_MLE = arg max_θ P(D|θ) = arg max_θ ln P(D|θ)
(the log likelihood is so common that it has its own notation: l(θ) = ln P(D|θ))

Here l(θ) = α1 ln θ + α0 ln(1-θ); setting dl/dθ = 0 and solving gives θ_MLE = α1 / (α1 + α0)
(hint: d(ln θ)/dθ = 1/θ, and maximizing ln P(D|θ) is the same as maximizing P(D|θ))
Summary: Maximum Likelihood Estimate

P(X=1) = θ
P(X=0) = 1-θ
(Bernoulli)

θ_MLE = α1 / (α1 + α0)
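A small numerical sketch (not from the slides) of the Bernoulli MLE: compute α1/(α1+α0) for the data D = {1 1 0 0 1} and confirm by a crude grid search that it maximizes the log likelihood.

import math

D = [1, 1, 0, 0, 1]            # the flip data from the slide
a1, a0 = D.count(1), D.count(0)

theta_mle = a1 / (a1 + a0)     # closed-form MLE = 0.6

def log_likelihood(theta):
    """l(theta) = ln P(D | theta) = a1*ln(theta) + a0*ln(1 - theta)."""
    return a1 * math.log(theta) + a0 * math.log(1 - theta)

# Crude grid search over theta as a sanity check.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_mle, theta_grid)   # 0.6  0.6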
Principles for Estimating Probabilities

Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ)

Principle 2 (maximum a posteriori prob.):
• choose parameters θ that maximize
  P(θ | data) = P(data | θ) P(θ) / P(data)
Beta prior distribution – P(θ)

P(θ) = Beta(β1, β0) ∝ θ^(β1−1) (1−θ)^(β0−1)
[figure of Beta densities omitted; credit C. Guestrin]

and the MAP estimate is therefore

θ_MAP = arg max_θ P(D|θ) P(θ) = (α1 + β1 − 1) / (α1 + β1 + α0 + β0 − 2)
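Since the MAP formulas above are reconstructed rather than copied from the original slides, treat the following as an illustrative sketch under the assumption of a Beta(β1, β0) prior: it computes the closed-form MAP estimate (the mode of the Beta(α1+β1, α0+β0) posterior) and double-checks it by grid search over the unnormalized posterior. The hyperparameter values are made up.

import math

a1, a0 = 3, 2        # observed counts of heads and tails (e.g. D = {1 1 0 0 1})
b1, b0 = 5, 5        # assumed Beta(β1, β0) prior hyperparameters (made up)

# Closed-form MAP estimate: mode of the Beta(a1 + b1, a0 + b0) posterior.
theta_map = (a1 + b1 - 1) / (a1 + b1 + a0 + b0 - 2)

def log_unnormalized_posterior(theta):
    """ln [ P(D|theta) * P(theta) ], up to an additive constant."""
    return (a1 + b1 - 1) * math.log(theta) + (a0 + b0 - 1) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_unnormalized_posterior)

print(theta_map, theta_grid)   # ≈ 0.538 for both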
Some terminology
• Likelihood function: P(data | θ)
• Prior: P(θ)
• Posterior: P(θ | data)

• Conjugate prior: P(θ) is the conjugate prior for likelihood function P(data | θ) if the forms of P(θ) and P(θ | data) are the same.
You should know
• Probability basics
– random variables, conditional probs, …
– Bayes rule
– Joint probability distributions
– calculating probabilities from the joint distribution
• Estimating parameters from data
– maximum likelihood estimates
– maximum a posteriori estimates
– distributions – binomial, Beta, Dirichlet, …
– conjugate priors
Extra slides
Independent Events
• Definition: two events A and B are
independent if P(A ^ B)=P(A)*P(B)
• Intuition: knowing A tells us nothing
about the value of B (and vice versa)
Picture “A independent of B”
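A tiny numerical illustration of the definition (the two-coin example is mine, not from the slides):

# Two fair, independent coin flips: joint over (first, second).
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
prob = lambda pred: sum(p for row, p in joint.items() if pred(row))

A = lambda r: r[0] == 1          # first flip is heads
B = lambda r: r[1] == 1          # second flip is heads
C = lambda r: r[0] == r[1]       # the two flips match

# A and B are independent: P(A ^ B) = P(A) * P(B).
print(prob(lambda r: A(r) and B(r)), prob(A) * prob(B))   # 0.25 0.25
# A and C are also independent here, even though C depends on both flips.
print(prob(lambda r: A(r) and C(r)), prob(A) * prob(C))   # 0.25 0.25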
Expected values

Given a discrete random variable X, the expected value of X, written E[X], is
E[X] = Σ_x x · P(X = x)

Example:
X    P(X)
0    0.3
1    0.2
2    0.5
E[X] = 0·0.3 + 1·0.2 + 2·0.5 = 1.2

We can also talk about the expected value of functions of X:
E[f(X)] = Σ_x f(x) · P(X = x)
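In code the sums above are one-liners; the pmf is the example from the slide, and f(x) = x^2 is just an arbitrary illustrative function.

# Expected value of the example random variable from the slide.
pmf = {0: 0.3, 1: 0.2, 2: 0.5}            # X -> P(X)

E_X = sum(x * p for x, p in pmf.items())
print(E_X)                                # 0·0.3 + 1·0.2 + 2·0.5 = 1.2

# Expected value of a function of X, here f(x) = x**2.
E_X2 = sum((x ** 2) * p for x, p in pmf.items())
print(E_X2)                               # 0·0.3 + 1·0.2 + 4·0.5 = 2.2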
Covariance

Given two discrete r.v.’s X and Y, we define the covariance of X and Y as
Cov(X, Y) = E[ (X − E[X]) (Y − E[Y]) ]

e.g., X = gender, Y = playsFootball
or X = gender, Y = leftHanded

Remember:
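A small sketch of the covariance definition on a made-up joint distribution of two 0/1-coded variables (think X = gender coded 0/1, Y = leftHanded coded 0/1); the numbers are illustrative only.

# Made-up joint P(X, Y) over 0/1-coded X and Y.
joint = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.35, (1, 1): 0.15}

E = lambda f: sum(f(x, y) * p for (x, y), p in joint.items())

mu_x = E(lambda x, y: x)                      # E[X] = 0.50
mu_y = E(lambda x, y: y)                      # E[Y] = 0.35
cov  = E(lambda x, y: (x - mu_x) * (y - mu_y))

print(cov)   # matches E[XY] - E[X]E[Y] = 0.15 - 0.5*0.35 ≈ -0.025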
