3 Prob-Review
Readings:
Mitchell Ch. 1, 2, 6.1 – 6.3
Murphy Ch. 2
Bishop Ch. 1 - 2
Slides are adapted from Matt Gormley, Rob Hall, Zahra Koochak and Jeremy Irvin
Outline
• Motivation
• Probability Theory
– Sample space, Outcomes, Events
– Kolmogorov’s Axioms of Probability
• Random Variables
– Random variables, Probability mass function (pmf), Probability density function (pdf),
Cumulative distribution function (cdf)
– Examples
– Notation
– Expectation and Variance
– Joint, conditional, marginal probabilities
– Independence
– Bayes’ Rule
• Common Probability Distributions
– Beta, Dirichlet, etc.
• Recap of Decision Trees
– Entropy
– Information Gain
• Probability in ML
Why Probability?
[Figure: diagram relating Machine Learning to Computer Science, the Domain of Interest, Optimization, Statistics, Probability, Calculus, Measure Theory, and Linear Algebra]
PROBABILITY THEORY
Probability Theory: Definitions
Example 1: Flipping a coin
Sample Space {Heads, Tails}
Outcome Example: Heads
Event Example: {Heads}
Probability P({Heads}) = 0.5
P({Tails}) = 0.5
Probability Theory: Definitions
Probability provides a science for inference
about interesting events
Sample Space The set of all possible outcomes
Outcome Possible result of an experiment
Event Any subset of the sample space
Probability The non-negative number assigned
to each event in the sample space
• Each outcome is unique
• Only one outcome can occur per experiment
• An outcome can be in multiple events
• An elementary event consists of exactly one outcome
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {3}
(the event “the die came up 3”)
Probability P({3}) = 1/6
P({4}) = 1/6
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {2,4,6}
(the event “the roll was even”)
Probability P({2,4,6}) = 0.5
P({1,3,5}) = 0.5
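The definitions in the die example can be sketched in a few lines of Python. This is a minimal illustration, assuming a fair die so that every outcome is equally likely; the names `sample_space` and `prob` are made up for this sketch and do not come from the slides.

```python
from fractions import Fraction

# Sample space for one roll of a 6-sided die (assumed fair).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space),
    assuming all outcomes are equally likely."""
    event = set(event)
    if not event <= sample_space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

even = {2, 4, 6}   # the event "the roll was even"
odd = {1, 3, 5}    # the event "the roll was odd"

print(prob({3}))   # an elementary event: exactly one outcome
print(prob(even))
print(prob(odd))
```

Note that the outcome 3 belongs to both `{3}` and `odd`, which matches the remark that an outcome can be in multiple events.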
Probability Theory: Definitions
Example 3: Timing how long it takes a monkey to
reproduce Shakespeare
Sample Space [0, +∞)
Outcome Example: 1,433,600 hours
Event Example: [1, 6] hours
Probability P([1,6]) = 0.000000000001
P([1,433,600, +∞)) = 0.99
Kolmogorov’s Axioms
All of probability can be derived from just these!
In words:
1. Each event has non-negative probability.
2. The probability that some event will occur is one.
3. The probability of the union of many disjoint sets is
the sum of their probabilities
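In symbols, the three statements above are usually written as follows. The notation is an assumption of this sketch: Ω is the sample space, and A and the A_i are events.

```latex
\begin{align*}
&\text{1. } P(A) \ge 0 \quad \text{for every event } A \\
&\text{2. } P(\Omega) = 1 \\
&\text{3. } P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i)
  \quad \text{for pairwise disjoint events } A_1, A_2, \ldots
\end{align*}
```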
Axioms Deriving Probability Theorems
Theorem: If A is a subset of B, then P(A) ≤ P(B).
Proof:
• A is a subset of B ➔ B = A ∪ C for C = B − A
• A and C are disjoint ➔ P(B) = P(A or C) = P(A) + P(C)
• P(C) ≥ 0 ➔ P(B) ≥ P(A)
Slide adapted from William Cohen (10-601B, Spring 2016)
Probability Theory: Definitions
• The complement of an event E, denoted ~E,
is the event that E does not occur.
Axioms Deriving Probability Theorems
Theorem: P(~A) = 1 − P(A).
Proof:
• P(A or ~A) = P(Ω) = 1
• A and ~A are disjoint ➔ P(A) + P(~A) = P(A or ~A)
➔ P(A) + P(~A) = 1
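As a sanity check rather than a proof, the two derived facts (the subset argument and the complement argument) can be verified by brute force on a small example. The sketch below assumes a fair 6-sided die and enumerates all 64 events; `prob` and `all_events` are illustrative helper names.

```python
from fractions import Fraction
from itertools import combinations

# Sample space of a fair die (an assumption made for illustration).
omega = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Equally likely outcomes: P(E) = |E| / |Omega|."""
    return Fraction(len(event), len(omega))

def all_events(space):
    """Every subset of the sample space, i.e. every event."""
    items = sorted(space)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

events = all_events(omega)

# Complement rule: P(A) + P(~A) = 1 for every event A.
complement_ok = all(prob(a) + prob(omega - a) == 1 for a in events)

# Monotonicity: if A is a subset of B then P(A) <= P(B).
monotone_ok = all(prob(a) <= prob(b)
                  for a in events for b in events if a <= b)

print(complement_ok, monotone_ok)
```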
These Axioms are Not to be Trifled With
- Andrew Moore
• There have been many many other approaches to
understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, non-monotonic
reasoning, …
• 40 years ago people in AI argued about these; now they
mostly don’t
– Any scheme for combining uncertain information, uncertain “beliefs”,
etc. really should obey these axioms to be internally consistent (from
Jaynes, 1958; Cox, 1930’s)
– If you gamble based on “uncertain beliefs”, then [you can be exploited
by an opponent] if and only if [your uncertainty formalism violates the axioms]
(de Finetti, 1931: the “Dutch book argument”)
RANDOM VARIABLES
Random Variables: Definitions
Random Variable (capital letters): Def 1: Variable whose possible values are the outcomes of a random experiment
Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {3}
(the event “the die came up 3”)
Probability P({3}) = 1/6
P({4}) = 1/6
Discrete Random Variable: Example: The value on the top face of the die.
Prob. Mass Function (pmf): p(3) = 1/6, p(4) = 1/6
Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {2,4,6}
(the event “the roll was even”)
Probability P({2,4,6}) = 0.5
P({1,3,5}) = 0.5
Discrete Random Variable: Example: 1 if the die landed on an even number and 0 otherwise
Prob. Mass Function (pmf): p(1) = 0.5, p(0) = 0.5
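The pmf of the even/odd indicator can be computed directly from the outcome probabilities. The following is a minimal sketch, assuming a fair die; the function names `is_even` and `pmf` are illustrative and not part of the slides.

```python
from collections import defaultdict
from fractions import Fraction

# Outcomes of a fair die and their probabilities (assumed uniform).
sample_space = [1, 2, 3, 4, 5, 6]
outcome_prob = {w: Fraction(1, 6) for w in sample_space}

def is_even(w):
    """Random variable: 1 if the die landed on an even number, 0 otherwise."""
    return 1 if w % 2 == 0 else 0

def pmf(random_variable):
    """Add up the probability of every outcome that maps to each value."""
    table = defaultdict(Fraction)
    for w in sample_space:
        table[random_variable(w)] += outcome_prob[w]
    return dict(table)

p = pmf(is_even)
print(p[1], p[0])
```

Passing the identity function instead, `pmf(lambda w: w)`, gives the pmf of "the value on the top face of the die".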
Random Variables: Definitions
Discrete Random Variable: Random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
Probability mass function (pmf): Function giving the probability that discrete r.v. X takes value x: p(x) = P(X = x)
Random Variables: Definitions
Continuous Random Variable: Random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))
Probability density function (pdf): Function that returns a nonnegative real indicating the relative likelihood that a continuous r.v. X takes value x
• For any continuous random variable: P(X = x) = 0
• Non-zero probabilities are only available to intervals: P(a ≤ X ≤ b) = ∫_a^b f(x) dx, where f is the pdf
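To make the two bullets concrete, here is a small numerical sketch. It assumes an exponential pdf with rate 1 purely as an example (the slides do not fix a distribution here) and uses a simple midpoint rule; `exponential_pdf` and `integrate` are illustrative names.

```python
import math

def exponential_pdf(x, rate=1.0):
    """pdf of an exponential distribution (chosen only as an example)."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# Probability of an interval: integrate the pdf over it.
p_interval = integrate(exponential_pdf, 1.0, 2.0)
exact = math.exp(-1.0) - math.exp(-2.0)   # closed form for rate 1

# Probability of a single point: an interval of width zero.
p_point = integrate(exponential_pdf, 1.5, 1.5)

print(p_interval, exact, p_point)
```

The pdf value at 1.5 is positive, yet the probability of exactly 1.5 is zero, because a density only becomes a probability once it is integrated over an interval of positive length.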
Random Variables: Definitions
Example 3: Timing how long it takes a monkey to
reproduce Shakespeare
Sample Space [0, +∞)
Outcome Example: 1,433,600 hours
Event Example: [1, 6] hours
Probability P([1,6]) = 0.000000000001
P([1,433,600, +∞)) = 0.99
Continuous Random Var.: Example: Represents time to reproduce (not an interval!)
Prob. Density Function: Example: Gamma distribution
Random Variables: Definitions
“Region”-valued Random Variables
Sample Space Ω: {1,2,3,4,5}, or all points in the region
Events x: The sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X: Represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): Proportional to size of sub-region
Recall that an event is any subset of the sample space. So both definitions of the sample space here are valid.
[Figure: a region partitioned into sub-regions labeled X=1, X=2, X=3, X=4, X=5]
Random Variables: Definitions
Cumulative distribution function (cdf): Function that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x)
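For a discrete random variable the cdf is just a running sum of the pmf. A minimal sketch, assuming the fair-die pmf from the earlier example (the name `cdf` is illustrative):

```python
from fractions import Fraction

# pmf of the value shown by a fair die (assumed uniform).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): add the pmf over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

print([cdf(x) for x in range(0, 8)])
```

The result is a step function: it is 0 below the smallest value, 1 at and above the largest, and constant between consecutive values (for example `cdf(3.5)` equals `cdf(3)`).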
Random Variables and Events
Question: Something seems wrong…
• We defined P(E) (the capital ‘P’) as a function mapping events to probabilities
• So why do we write P(X=x)?
• A good guess: X=x is an event…
Random Variable: Def 2: A measurable function from the sample space to the real numbers: X : Ω → ℝ
Example 2:
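One way to see that "X = x" names an event is to build the set of outcomes that X maps to x. The sketch below uses a hypothetical example not taken from the slides: two fair dice, with X defined as the sum of the faces.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of faces from two fair dice (hypothetical example).
omega = list(product(range(1, 7), repeat=2))

def X(w):
    """Random variable as a function on outcomes: the sum of the two faces."""
    return w[0] + w[1]

def prob(event):
    """Equally likely outcomes: P(E) = |E| / |Omega|."""
    return Fraction(len(event), len(omega))

def event_X_equals(x):
    """The event 'X = x' is the set of outcomes w with X(w) == x."""
    return {w for w in omega if X(w) == x}

seven = event_X_equals(7)
print(sorted(seven))
print(prob(seven))   # P(X = 7) is shorthand for P of this set
```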
Notational Shortcuts
A convenient shorthand:
Expectation and Variance
The expected value of X is E[X]. Also called the mean.
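For reference, the standard definitions are written out below. The symbols are assumptions of this sketch: p is the pmf of a discrete X and f is the pdf of a continuous X. The last line works the fair-die case.

```latex
\begin{align*}
E[X] &= \sum_{x} x \, p(x) && \text{(discrete, pmf } p\text{)} \\
E[X] &= \int_{-\infty}^{\infty} x \, f(x) \, dx && \text{(continuous, pdf } f\text{)} \\
E[X] &= \tfrac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5 && \text{(fair 6-sided die)}
\end{align*}
```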
Expectation and Variance
The variance of X is Var(X).
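The variance is the expected squared deviation from the mean, Var(X) = E[(X − E[X])²], which also equals E[X²] − (E[X])². A short sketch, assuming the fair-die pmf used earlier (`expectation` is an illustrative helper):

```python
from fractions import Fraction

# pmf of a fair die (assumed uniform).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g=lambda x: x):
    """E[g(X)] for the discrete pmf above."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation()
variance = expectation(lambda x: (x - mean) ** 2)       # definition
shortcut = expectation(lambda x: x * x) - mean ** 2     # E[X^2] - (E[X])^2

print(mean, variance, shortcut)
```

Both routes give the same value, 35/12, for the die.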
Multiple Random Variables
• Joint probability
• Marginal probability
• Conditional probability
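The three quantities in this list can all be read off a joint table. The sketch below uses a hypothetical joint distribution over weather and traffic; the variables and every number in it are made up for illustration.

```python
from fractions import Fraction

# Hypothetical joint distribution P(Weather, Traffic); numbers are invented.
joint = {
    ("sunny", "light"): Fraction(4, 10),
    ("sunny", "heavy"): Fraction(1, 10),
    ("rainy", "light"): Fraction(2, 10),
    ("rainy", "heavy"): Fraction(3, 10),
}

def marginal(index):
    """Sum the joint over the other variable (index 0: weather, 1: traffic)."""
    out = {}
    for assignment, p in joint.items():
        key = assignment[index]
        out[key] = out.get(key, 0) + p
    return out

def conditional_traffic_given(weather):
    """P(Traffic | Weather = weather) = joint / marginal of the condition."""
    p_weather = marginal(0)[weather]
    return {t: p / p_weather for (w, t), p in joint.items() if w == weather}

p_weather = marginal(0)
p_traffic = marginal(1)
p_traffic_given_rainy = conditional_traffic_given("rainy")
print(p_weather, p_traffic, p_traffic_given_rainy)
```

In this made-up table P(heavy | rainy) = 3/5 differs from P(heavy) = 2/5, so the two variables are not independent.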
Joint Probability
Slide from Sam Roweis (MLSS, 2005)
Marginal Probabilities
Conditional Probability
Independence and Conditional Independence
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
Bayes’ rule
P(A|B) = P(B|A) * P(A) / P(B)
where P(A) is the prior and P(A|B) is the posterior
P(B|A) = P(A|B) * P(B) / P(A)
Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
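A small worked example of Bayes’ rule follows. The scenario (a diagnostic test) and all three input numbers are hypothetical; the evidence term P(B) is expanded with the law of total probability, P(B) = P(B|A) P(A) + P(B|~A) P(~A).

```python
from fractions import Fraction

def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' rule: P(A|B) = P(B|A) P(A) / P(B),
    with P(B) = P(B|A) P(A) + P(B|~A) P(~A)."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
p_a = Fraction(1, 100)               # prior P(A)
p_b_given_a = Fraction(9, 10)        # P(B|A)
p_b_given_not_a = Fraction(1, 10)    # P(B|~A)

p_a_given_b = posterior(p_a, p_b_given_a, p_b_given_not_a)
print(p_a_given_b)
```

With these made-up inputs the posterior is 1/12: a positive result raises the probability from 1% to about 8%, far below the 90% that P(B|A) alone might suggest.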
Common Probability Distributions
• For Discrete Random Variables:
– Bernoulli
– Binomial
– Multinomial
– Categorical
– Poisson
• For Continuous Random Variables:
– Exponential
– Gamma
– Beta
– Dirichlet
– Laplace
– Gaussian (1D)
– Multivariate Gaussian
Bernoulli Distribution
Binomial Distribution
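Both distributions are simply functions from a value to a probability. The sketch below writes out the standard pmfs, with p the success probability and n the number of independent trials; the function names are illustrative.

```python
import math

def bernoulli_pmf(x, p):
    """P(X = x) for a single trial with success probability p."""
    if x not in (0, 1):
        return 0.0
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    if not 0 <= k <= n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

print(bernoulli_pmf(1, 0.3))
print(binomial_pmf(2, 4, 0.5))
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))
```

A Binomial with n = 1 reduces to a Bernoulli, and summing the pmf over k = 0..n gives 1 up to floating-point rounding.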
Continuous Distributions
Law of Large Numbers (LLN)
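Informally, the LLN says that the average of many independent draws from the same distribution approaches the expected value. A simulation sketch, assuming fair die rolls (expected value 3.5) and a fixed seed for repeatability; the function name is illustrative.

```python
import random

def sample_mean_of_die_rolls(n, seed=0):
    """Average of n simulated fair die rolls (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n

# The sample mean should drift toward E[X] = 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_die_rolls(n))
```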
Central Limit Theorem (CLT)
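Informally, the CLT says that the standardized mean of n independent draws with finite variance is approximately standard normal for large n. The simulation below is a rough illustration, assuming Uniform(0, 1) draws (mean 1/2, variance 1/12); the sample size, trial count, and function name are arbitrary choices for this sketch.

```python
import math
import random

def standardized_means(n=30, trials=2000, seed=0):
    """Standardized sample means of n Uniform(0, 1) draws, repeated `trials` times."""
    rng = random.Random(seed)
    mu = 0.5                      # mean of Uniform(0, 1)
    sigma = math.sqrt(1 / 12)     # standard deviation of Uniform(0, 1)
    z = []
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        z.append((mean - mu) / (sigma / math.sqrt(n)))
    return z

z = standardized_means()

# For a standard normal, about 95% of the mass lies within 1.96 of zero.
within = sum(abs(v) <= 1.96 for v in z) / len(z)
print(within)
```

The individual draws are uniform, not bell-shaped, yet the fraction of standardized means inside ±1.96 should land close to 0.95.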
Oh, the Places You’ll Use Probability!
Supervised Classification
• Naïve Bayes
• Logistic regression
Oh, the Places You’ll Use Probability!
ML Theory
(Example: Sample Complexity)
Oh, the Places You’ll Use Probability!
Deep Learning
(Example: Deep Bi-directional RNN)
[Figure: deep bi-directional RNN with inputs x1 to x4, two rows of hidden states h1 to h4, and outputs y1 to y4]
Oh, the Places You’ll Use Probability!
Graphical Models
• Hidden Markov Model (HMM)
[Figure: HMM chain over the tag sequence <START>, n, v, p, d, n with factors ψ1, ψ3, ψ5, ψ7, ψ9]
Summary
1. Probability theory is rooted in (simple)
axioms
2. Random variables provide an important
tool for modeling the world
3. Our favorite probability distributions are
just functions! (usually with interesting
properties)
4. Probability and Statistics are essential to
Machine Learning