07 Probability Review


Probability Overview

• Events
– discrete random variables, continuous random variables,
compound events
• Axioms of probability
– What defines a reasonable theory of uncertainty
• Independent events
• Conditional probabilities
• Bayes rule and beliefs
• Joint probability distribution
• Expectations
• Independence, Conditional independence
Random Variables
• Informally, A is a random variable if
– A denotes something about which we are uncertain
– perhaps the outcome of a randomized experiment

• Examples
A = True if a randomly drawn person from our class is female
A = The hometown of a randomly drawn person from our class
A = True if two randomly drawn persons from our class have same birthday

• Define P(A) as “the fraction of possible worlds in which A is true” or
  “the fraction of times A holds, in repeated runs of the random experiment”
  – the set of possible worlds is called the sample space, S
  – a random variable A is a function defined over S:
    A: S → {0,1}
A little formalism
More formally, we have
• a sample space S (e.g., set of students in our class)
– aka the set of possible worlds

• a random variable is a function defined over the sample space
– Gender: S → { m, f }
– Height: S → Reals
• an event is a subset of S
– e.g., the subset of S for which Gender=f
– e.g., the subset of S for which (Gender=m) AND (eyeColor=blue)
• we’re often interested in probabilities of specific events
• and of specific events conditioned on other specific events
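The formalism above can be sketched in a few lines of Python (the sample space, the student records, and the function names are all hypothetical):

```python
# Hypothetical sample space S: each world is one student in the class.
S = [
    {"name": "alice", "gender": "f", "eye_color": "blue"},
    {"name": "bob",   "gender": "m", "eye_color": "brown"},
    {"name": "carol", "gender": "f", "eye_color": "green"},
    {"name": "dave",  "gender": "m", "eye_color": "blue"},
]

# Random variables are functions defined over S.
def gender(w):       # Gender: S -> {m, f}
    return w["gender"]

def eye_color(w):
    return w["eye_color"]

# An event is a subset of S, e.g. the worlds for which Gender = f.
event = [w for w in S if gender(w) == "f"]

# With equally probable worlds, P(event) = |event| / |S|.
p = len(event) / len(S)
print(p)  # 0.5
```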
Visualizing A

[Figure: the sample space of all possible worlds, drawn as a box of area 1.
The worlds in which A is true form a reddish oval; P(A) = the area of that
oval. The worlds outside the oval are those in which A is false.]

The Axioms of Probability
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)

[de Finetti 1931]:

when gambling based on “uncertainty formalism A” you can
be exploited by an opponent

iff

your uncertainty formalism A violates these axioms


Elementary Probability in Pictures
• P(~A) + P(A) = 1

[Figure: the sample space split into two regions, A and ~A]
A useful theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0,
P(A or B) = P(A) + P(B) - P(A and B)

➔ P(A) = P(A ^ B) + P(A ^ ~B)

Proof: A = [A and (B or ~B)] = [(A and B) or (A and ~B)], so

P(A) = P(A and B) + P(A and ~B) – P((A and B) and (A and ~B))
     = P(A and B) + P(A and ~B)
(the two events are mutually exclusive, so the last term is P(False) = 0)
Elementary Probability in Pictures
• P(A) = P(A ^ B) + P(A ^ ~B)

[Figure: a Venn diagram of B and an overlapping oval A, split into the
regions A ^ B and A ^ ~B]
Definition of Conditional Probability

           P(A ^ B)
P(A|B) = -----------
             P(B)

[Figure: Venn diagram of overlapping events A and B]
Definition of Conditional Probability

           P(A ^ B)
P(A|B) = -----------
             P(B)
Corollary: The Chain Rule

P(A ^ B) = P(A|B) P(B)
P(A ^ B ^ C) = P(A|B ^ C) P(B ^ C)
             = P(A|B ^ C) P(B|C) P(C)

Bayes Rule
• let’s write 2 expressions for P(A ^ B)
[Figure: Venn diagram of overlapping events A and B, with region A ^ B]

P(A ^ B ) = P(A|B) P(B)

= P(B|A) P(A) [Since, P(A ^ B ) = P(B ^ A )]

Therefore,

           P(B|A) P(A)
P(A|B) = ---------------        ← Bayes’ rule
               P(B)

we call P(A) the “prior” and P(A|B) the “posterior”

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine
of chances. Philosophical Transactions of the Royal Society of London,
53:370-418

…by no means merely a curious speculation in the doctrine of chances,
but necessary to be solved in order to a sure foundation for all our
reasonings concerning past facts, and what is likely to be hereafter….
necessary to be considered by any that would give a clear account of the
strength of analogical or inductive reasoning…
Other Forms of Bayes Rule

           P(B|A) P(A)
P(A|B) = ---------------
               P(B)

                   P(B|A) P(A)
P(A|B) = -------------------------------
          P(B|A) P(A) + P(B|~A) P(~A)

              P(B|A,X) P(A|X)
P(A|B,X) = -------------------
                 P(B|X)
Applying Bayes Rule

                   P(B|A) P(A)
P(A|B) = -------------------------------
          P(B|A) P(A) + P(B|~A) P(~A)
A = you have covid, B = you just coughed

Assume:
P(A) = 0.05
P(B|A) = 0.80
P(B| ~A) = 0.20

what is P(covid | cough) = P(A|B)?


Solution:
P(A|B) = (0.8 x 0.05) / (0.8 x 0.05 + 0.2 x 0.95) = 0.04 / 0.23 ≈ 0.17
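The calculation above can be checked with a short Python sketch of Bayes rule with P(B) expanded by the law of total probability (the function and parameter names are our own):

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|~A)P(~A)]."""
    num = p_b_given_a * p_a
    return num / (num + p_b_given_not_a * (1 - p_a))

# A = you have covid, B = you just coughed (numbers from the slide)
p = bayes(p_a=0.05, p_b_given_a=0.80, p_b_given_not_a=0.20)
print(round(p, 2))  # 0.17
```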
Applying Bayes Rule

                   P(B|A) P(A)
P(A|B) = -------------------------------
          P(B|A) P(A) + P(B|~A) P(~A)

A = you fail, B = you did not listen to classes

Assume:
P(A) = 0.1
P(B|A) = 0.60

what is P(Fail | Did not listen to classes) = P(A|B)?


Solution:
what does all this have to do with
function approximation?

instead of F: X →Y,
learn P(Y | X)
The Joint Distribution

Example: Boolean variables A, B, C

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values
   (M Boolean variables → 2^M rows).

2. For each combination of values, say how probable it is.

3. If you subscribe to the axioms of probability, those
   probabilities must sum to 1.

    A   B   C   Prob
    0   0   0   0.30
    0   0   1   0.05
    0   1   0   0.10
    0   1   1   0.05
    1   0   0   0.05
    1   0   1   0.10
    1   1   0   0.25
    1   1   1   0.10

[Figure: a Venn diagram of A, B, C labeled with these eight probabilities]
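The table above is easy to represent directly; a minimal Python sketch, with the joint stored as a dict keyed by (A, B, C) assignments:

```python
# The joint distribution from the slide, keyed by (A, B, C) assignments.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

assert len(joint) == 2 ** 3        # step 1: 2^M rows for M = 3
total = sum(joint.values())        # step 3: the axioms demand sum = 1
print(abs(total - 1.0) < 1e-9)     # True
```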
Using the Joint Distribution

Once you have the JD you can ask for the probability of any logical
expression E involving these variables:

P(E) = Σ_{rows matching E} P(row)
Using the Joint

P(Poor ^ Male) = 0.4654
Using the Joint

P(Poor) = 0.7604
Inference with the Joint

              P(E1 ^ E2)     Σ_{rows matching E1 and E2} P(row)
P(E1 | E2) = ------------ = ------------------------------------
                P(E2)           Σ_{rows matching E2} P(row)

P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Learning and the Joint Distribution

Suppose we want to learn the function f: <G, H> → W

Equivalently, P(W | G, H)

Solution: learn the joint distribution from data, then calculate P(W | G, H)

e.g., P(W=rich | G = female, H = 40.5- ) =

    P(W=rich ^ G = female ^ H = 40.5- )     0.024
    ------------------------------------- = ------- = 0.088
          P(G = female ^ H = 40.5- )         0.277
sounds like the solution to
learning F: X → Y,
or P(Y | X).

Are we done?

Main problem: learning P(Y|X)
can require more data than we have

consider learning Joint Dist. with 100 attributes


# of rows in this table? 2^100 ~= 10^30
# of people on earth? 7 x 10^9
fraction of rows with 0 training examples?
What to do?
1. Be smart about how we estimate
probabilities from sparse data
– maximum likelihood estimates
– maximum a posteriori estimates

2. Be smart about how to represent joint distributions
– Bayes networks, graphical models
1. Be smart about how we estimate probabilities
Estimating Probability of Heads

[Figure: flips of a coin, each producing X=1 (heads) or X=0 (tails)]

A natural estimate of P(X=1) is the fraction of flips that are heads,

      α1
   -----------
    α1 + α0

where α1 = # flips with X=1 and α0 = # flips with X=0.
Estimating θ = P(X=1)

Case A:
100 flips: 51 Heads (X=1), 49 Tails (X=0)

Case B:
3 flips: 2 Heads (X=1), 1 Tail (X=0)
Case C: (online learning)
• keep flipping, want a single learning algorithm
  that gives a reasonable estimate after each flip
Principles for Estimating Probabilities

Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ)

Principle 2 (maximum a posteriori prob.):
• choose parameters θ that maximize P(θ | data)
Maximum Likelihood Estimation

P(X=1) = θ    P(X=0) = (1-θ)

Data D: {1 1 0 0 1}

P(D|θ) = θ · θ · (1-θ) · (1-θ) · θ = θ^3 (1-θ)^2

Flips produce data D with α1 heads and α0 tails
• flips are independent, identically distributed 1’s and 0’s
  (Bernoulli)
• α1 and α0 are counts that sum these outcomes (Binomial)
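The MLE θ̂ = α1/(α1+α0) is just the fraction of heads; a minimal Python sketch (`mle_theta` is a hypothetical name):

```python
def mle_theta(flips):
    """MLE for theta = P(X=1): alpha1 / (alpha1 + alpha0)."""
    alpha1 = sum(flips)              # number of heads (X=1)
    alpha0 = len(flips) - alpha1     # number of tails (X=0)
    return alpha1 / (alpha1 + alpha0)

# The data D = {1 1 0 0 1} from the slide: likelihood theta^3 (1-theta)^2
print(mle_theta([1, 1, 0, 0, 1]))    # 0.6
```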
Maximum Likelihood Estimate for θ

Maximize P(D|θ) = θ^α1 (1-θ)^α0. The log likelihood is so common that
it has its own notation:

   ℓ(θ) = ln P(D|θ) = α1 ln θ + α0 ln(1-θ)

Setting dℓ/dθ = α1/θ – α0/(1-θ) = 0 gives

   θ̂ = α1 / (α1 + α0)

(hint: d/dθ ln θ = 1/θ, and the maximizer of ln f is the maximizer of f)
Summary: Maximum Likelihood Estimate

P(X=1) = θ
P(X=0) = 1-θ
(Bernoulli)

θ̂_MLE = α1 / (α1 + α0)
Principles for Estimating Probabilities

Principle 1 (maximum likelihood):


• choose parameters θ that maximize
P(data | θ )

Principle 2 (maximum a posteriori prob.):
• choose parameters θ that maximize

                 P(data | θ) P(θ)
   P(θ | data) = ----------------
                     P(data)
Beta prior distribution – P(θ)

P(θ) = Beta(β1, β0) ∝ θ^(β1-1) (1-θ)^(β0-1)        [C. Guestrin]

The posterior is then P(θ | D) ∝ θ^(α1+β1-1) (1-θ)^(α0+β0-1), i.e.,
Beta(β1+α1, β0+α0), and the MAP estimate is therefore

   θ̂_MAP = (α1 + β1 - 1) / (α1 + β1 + α0 + β0 - 2)
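A Python sketch of that closed-form MAP estimate, assuming the standard Beta(β1, β0) parameterization above (`map_theta` is a hypothetical name):

```python
def map_theta(alpha1, alpha0, beta1, beta0):
    """MAP estimate of theta under a Beta(beta1, beta0) prior:
    (alpha1 + beta1 - 1) / (alpha1 + beta1 + alpha0 + beta0 - 2)."""
    return (alpha1 + beta1 - 1) / (alpha1 + beta1 + alpha0 + beta0 - 2)

# 2 heads and 1 tail (Case B), with a Beta(3, 3) prior pulling toward 0.5:
# the prior acts like 2 extra heads and 2 extra tails of imagined data.
print(round(map_theta(alpha1=2, alpha0=1, beta1=3, beta0=3), 3))  # 0.571
```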
Some terminology
• Likelihood function: P(data | θ)
• Prior: P(θ)
• Posterior: P(θ | data)

• Conjugate prior: P(θ) is the conjugate prior for likelihood function
  P(data | θ) if the forms of P(θ) and P(θ | data) are the same.
You should know
• Probability basics
– random variables, conditional probs, …
– Bayes rule
– Joint probability distributions
– calculating probabilities from the joint distribution
• Estimating parameters from data
– maximum likelihood estimates
– maximum a posteriori estimates
– distributions – binomial, Beta, Dirichlet, …
– conjugate priors
Extra slides
Independent Events
• Definition: two events A and B are
independent if P(A ^ B)=P(A)*P(B)
• Intuition: knowing A tells us nothing
about the value of B (and vice versa)
Picture “A independent of B”

[Figure omitted: Venn diagram illustrating independence]
Expected values

Given a discrete random variable X, the expected value of X, written
E[X], is

   E[X] = Σ_x x · P(X=x)

Example:
    X    P(X)
    0    0.3
    1    0.2
    2    0.5
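Computing E[X] for the table above takes a couple of lines of Python:

```python
# E[X] = sum_x x * P(X=x) for the pmf in the table above.
pmf = {0: 0.3, 1: 0.2, 2: 0.5}
e_x = sum(x * p for x, p in pmf.items())
print(e_x)  # 1.2
```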
Expected values

Given a discrete random variable X, the expected value of X, written
E[X], is

   E[X] = Σ_x x · P(X=x)

We can also talk about the expected value of functions of X:

   E[f(X)] = Σ_x f(x) · P(X=x)
Covariance

Given two discrete r.v.’s X and Y, we define the covariance of X and Y as

   Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

e.g., X = gender, Y = playsFootball
or X = gender, Y = leftHanded

Remember:  Cov(X, Y) = E[XY] − E[X] E[Y]