Introduction to Machine Learning
Probability Theory for ML
Readings:
Mitchell Ch. 1, 2, 6.1 – 6.3
Murphy Ch. 2
Bishop Ch. 1 - 2
Slides adapted from Matt Gormley, Rob Hall, Zahra Koochak, and Jeremy Irvin
Outline
• Motivation
• Probability Theory
– Sample space, Outcomes, Events
– Kolmogorov’s Axioms of Probability
• Random Variables
– Random variables, Probability mass function (pmf), Probability density function (pdf),
Cumulative distribution function (cdf)
– Examples
– Notation
– Expectation and Variance
– Joint, conditional, marginal probabilities
– Independence
– Bayes’ Rule
• Common Probability Distributions
– Beta, Dirichlet, etc.
• Recap of Decision Trees
– Entropy
– Information Gain
• Probability in ML
Why Probability?
[Figure: Machine Learning sits at the intersection of Computer Science, the Domain of Interest, Optimization, Statistics, Probability, Calculus, Measure Theory, and Linear Algebra.]
PROBABILITY THEORY
Probability Theory: Definitions
Example 1: Flipping a coin
Sample Space {Heads, Tails}
Outcome Example: Heads
Event Example: {Heads}
Probability P({Heads}) = 0.5
P({Tails}) = 0.5
Probability Theory: Definitions
Probability provides a principled framework for
reasoning about uncertain events
Sample Space The set of all possible outcomes
Outcome Possible result of an experiment
Event Any subset of the sample space
Probability The non-negative number assigned
to each event in the sample space
• Each outcome is unique
• Only one outcome can occur per experiment
• An outcome can be in multiple events
• An elementary event consists of exactly one outcome
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {3}
(the event “the die came up 3”)
Probability P({3}) = 1/6
P({4}) = 1/6
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {2,4,6}
(the event “the roll was even”)
Probability P({2,4,6}) = 0.5
P({1,3,5}) = 0.5
Probability Theory: Definitions
Example 3: Timing how long it takes a monkey to
reproduce Shakespeare
Sample Space [0, +∞)
Outcome Example: 1,433,600 hours
Event Example: [1, 6] hours
Probability P([1,6]) = 0.000000000001
P([1,433,600, +∞)) = 0.99
Kolmogorov’s Axioms
All of probability can be derived from just these axioms!
In words:
1. Each event has non-negative probability.
2. The probability that some outcome in the sample space occurs is one.
3. The probability of the union of countably many disjoint events is the sum of their probabilities.
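Formally (a standard statement of the axioms, for reference):
1. P(E) >= 0 for every event E
2. P(Ω) = 1
3. For any countable sequence of disjoint events E1, E2, …: P(E1 or E2 or …) = P(E1) + P(E2) + …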
Axioms Deriving Probability Theorems
Monotonicity: if A is a subset of B, then P(A) <= P(B)
Proof:
• A is a subset of B ➔ B = A or C, where C = B - A
• A and C are disjoint ➔ P(B) = P(A or C) = P(A) + P(C)
• P(C) >= 0
• So P(B) >= P(A)
Slide adapted from William Cohen (10-601B, Spring 2016)
Probability Theory: Definitions
• The complement of an event E, denoted ~E,
is the event that E does not occur.
Axioms Deriving Probability Theorems
Theorem: P(~A) = 1 - P(A)
Proof:
• P(A or ~A) = P(Ω) = 1
• A and ~A are disjoint ➔ P(A) + P(~A) = P(A or ~A)
➔ P(A) + P(~A) = 1
• Solving for P(~A) gives P(~A) = 1 - P(A)
Slide adapted from William Cohen (10-601B, Spring 2016)
Axioms Deriving Probability Theorems
Theorem: P(A or B) = P(A) + P(B) - P(A and B)
Proof:
• Let E1 = A and ~(A and B), E2 = (A and B), E3 = B and ~(A and B)
• E1 or E2 or E3 = A or B, and E1, E2, E3 are disjoint ➔
P(A or B) = P(E1) + P(E2) + P(E3)
• Further, P(A) = P(E1) + P(E2) and P(B) = P(E3) + P(E2)
• Substituting: P(A or B) = [P(A) - P(E2)] + P(E2) + [P(B) - P(E2)] = P(A) + P(B) - P(A and B)
Slide adapted from William Cohen (10-601B, Spring 2016)
These Axioms are Not to be Trifled With
- Andrew Moore
• There have been many other approaches to understanding “uncertainty”:
– Fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
• 40 years ago people in AI argued about these; now they mostly don’t
– Any scheme for combining uncertain information, uncertain “beliefs”, etc. really should obey these axioms to be internally consistent (from Jaynes, 1958; Cox, 1930s)
– If you gamble based on “uncertain beliefs” and your uncertainty formalism violates the axioms, then you can be exploited by an opponent (de Finetti, 1931: the “Dutch book argument”)
RANDOM VARIABLES
Random Variables: Definitions
Random Variable (capital letters): Def 1: a variable whose possible values are the outcomes of a random experiment
Value of a Random Variable (lowercase letters): the value taken by a random variable
Random Variables: Definitions
Random Variable: Def 1: a variable whose possible values are the outcomes of a random experiment. Def 2: a measurable function from the sample space to the real numbers, X : Ω → R
Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))
Random Variables: Definitions
Discrete Random Variable: a random variable whose values come from a countable set (e.g. the natural numbers or {True, False})
Probability mass function (pmf): the function giving the probability that discrete r.v. X takes value x: p(x) := P(X = x)
Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {3}
(the event “the die came up 3”)
Probability P({3}) = 1/6
P({4}) = 1/6
Discrete Random Variable: Example: the value on the top face of the die
Prob. Mass Function (pmf): p(3) = 1/6, p(4) = 1/6
Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {2,4,6}
(the event “the roll was even”)
Probability P({2,4,6}) = 0.5
P({1,3,5}) = 0.5
Discrete Random Variable: Example: 1 if the die landed on an even number, 0 otherwise
Prob. Mass Function (pmf): p(1) = 0.5, p(0) = 0.5
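A minimal Python sketch of the two pmfs above (the die value, and the even-roll indicator); the dictionary encoding is an illustrative choice, not anything from the slides:

import random

die_pmf = {x: 1/6 for x in range(1, 7)}        # pmf of the top face
even_pmf = {1: 0.5, 0: 0.5}                    # pmf of the even-roll indicator
assert abs(sum(die_pmf.values()) - 1.0) < 1e-9 # a pmf must sum to one
# Sampling a roll according to the pmf:
roll = random.choices(list(die_pmf), weights=list(die_pmf.values()))[0]
print(roll, int(roll % 2 == 0))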
Random Variables: Definitions
Continuous Random Variable: a random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5))
Probability density function (pdf): a function f(x) returning a nonnegative real indicating the relative likelihood that continuous r.v. X takes value x
• For any continuous random variable: P(X = x) = 0
• Non-zero probabilities are assigned only to intervals:
P(a <= X <= b) = integral from a to b of f(x) dx
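A quick sketch of an interval probability, assuming scipy is available (the exponential distribution here is an arbitrary illustrative choice):

from scipy.stats import expon

# P(1 <= X <= 6) = F(6) - F(1) for a standard exponential r.v.
p = expon.cdf(6) - expon.cdf(1)
print(p)   # about 0.365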
Random Variables: Definitions
Example 3: Timing how long it takes a monkey to
reproduce Shakespeare
Sample Space [0, +∞)
Outcome Example: 1,433,600 hours
Event Example: [1, 6] hours
Probability P([1,6]) = 0.000000000001
P([1,433,600, +∞)) = 0.99
Continuous Random Var.: Example: represents the time to reproduce (a value, not an interval!)
Prob. Density Function: Example: a Gamma distribution
Random Variables: Definitions
“Region”-valued Random Variables
Sample Space Ω: {1, 2, 3, 4, 5}
Events x: the sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X: represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): proportional to the size of the sub-region
[Figure: a region divided into five sub-regions labeled X=1 through X=5.]
Random Variables: Definitions
“Region”-valued Random Variables
Sample Space Ω: all points in the region
Events x: the sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X: represents a random selection of a sub-region
Prob. Mass Fn. P(X=x): proportional to the size of the sub-region
Recall that an event is any subset of the sample space, so both definitions of the sample space here are valid.
[Figure: the same region divided into five sub-regions labeled X=1 through X=5.]
Random Variables: Definitions
Cumulative distribution function (cdf): the function giving the probability that a random variable X is less than or equal to x: F(x) = P(X <= x)
• For discrete random variables: F(x) = sum of p(x') over all x' <= x
• For continuous random variables: F(x) = integral from -∞ to x of f(t) dt
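A small sketch of the discrete case, building the cdf of the die by accumulating its pmf (the dictionary encoding is again an illustrative assumption):

die_pmf = {x: 1/6 for x in range(1, 7)}
cdf, total = {}, 0.0
for x in sorted(die_pmf):
    total += die_pmf[x]   # F(x) = sum of p(x') for x' <= x
    cdf[x] = total
print(cdf[3])   # P(X <= 3) = 0.5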
Random Variables and Events
Question: Something seems wrong…
• We defined P(E) (the capital ‘P’) as a function mapping events to probabilities
• So why do we write P(X=x)?
• A good guess: X=x is an event…
Answer: P(X=x) is just shorthand! Recall Def 2: a random variable is a measurable function from the sample space to the real numbers. The set {ω in Ω : X(ω) = x} is an event, and P(X=x) abbreviates P({ω in Ω : X(ω) = x}).
Notational Shortcuts
A convenient shorthand: writing P(A|B) = P(A,B)/P(B) means P(A=a|B=b) = P(A=a, B=b)/P(B=b) for all values a and b.
Expectation and Variance
The expected value of X is E[X]. Also called the mean.
• Discrete random variables: E[X] = sum over x of x p(x)
• Continuous random variables: E[X] = integral of x f(x) dx
Expectation and Variance
The variance of X is Var(X) = E[(X - E[X])^2], a measure of spread around the mean. Writing mu = E[X]:
• Discrete random variables: Var(X) = sum over x of (x - mu)^2 p(x)
• Continuous random variables: Var(X) = integral of (x - mu)^2 f(x) dx
A useful identity: Var(X) = E[X^2] - (E[X])^2
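A minimal sketch computing both quantities for the die pmf from earlier:

die_pmf = {x: 1/6 for x in range(1, 7)}
mean = sum(x * p for x, p in die_pmf.items())             # E[X] = 3.5
var = sum((x - mean)**2 * p for x, p in die_pmf.items())  # Var(X) ≈ 2.917
print(mean, var)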
Multiple Random Variables
• Joint probability
• Marginal probability
• Conditional probability
Joint Probability
Slide from Sam Roweis (MLSS, 2005)
Marginal Probabilities
Slide from Sam Roweis (MLSS, 2005)
Conditional Probability
Slide from Sam Roweis (MLSS, 2005)
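The Roweis slides illustrate these with a joint table; here is a small Python sketch in the same spirit, over a made-up joint distribution of weather W and traffic T (all numbers invented for illustration):

# Joint: P(W, T)
joint = {("sun", "light"): 0.4, ("sun", "heavy"): 0.1,
         ("rain", "light"): 0.2, ("rain", "heavy"): 0.3}
# Marginal: P(W=rain) = sum over t of P(rain, t)
p_rain = sum(p for (w, t), p in joint.items() if w == "rain")   # 0.5
# Conditional: P(T=heavy | W=rain) = P(rain, heavy) / P(rain)
p_heavy_given_rain = joint[("rain", "heavy")] / p_rain          # 0.6
print(p_rain, p_heavy_given_rain)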
Independence and
Conditional Independence
Slide from Sam Roweis (MLSS, 2005)
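For reference, the standard definitions: X and Y are independent if P(X=x, Y=y) = P(X=x) P(Y=y) for all x, y. X and Y are conditionally independent given Z if P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z) for all x, y, z.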
Definition of Conditional
Probability
P(A ^ B)
P(A|B) = -----------
P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Slide from William Cohen (10-601B, Spring 2016)
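A standard consequence worth noting: the chain rule extends to any number of events, e.g. P(A ^ B ^ C) = P(A|B,C) P(B|C) P(C).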
BAYES’ RULE
Bayes’ rule relates the posterior P(A|B) to the prior P(A):

           P(B|A) * P(A)
P(A|B) = ---------------        Bayes’ rule
                P(B)

and, symmetrically:

           P(A|B) * P(B)
P(B|A) = ---------------
                P(A)

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
…by no means merely a curious speculation in the doctrine of chances, but
necessary to be solved in order to a sure foundation for all our reasonings
concerning past facts, and what is likely to be hereafter…. necessary to be
considered by any that would give a clear account of the strength of
analogical or inductive reasoning…
Slide from William Cohen (10-601B, Spring 2016)
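A worked sketch of Bayes’ rule on invented numbers (a diagnostic test with 99% sensitivity, a 5% false-positive rate, and 1% prevalence; every value here is assumed for illustration):

p_d = 0.01         # prior: P(disease)
p_pos_d = 0.99     # likelihood: P(positive | disease)
p_pos_nd = 0.05    # P(positive | no disease)
p_pos = p_pos_d * p_d + p_pos_nd * (1 - p_d)   # marginal: P(positive)
posterior = p_pos_d * p_d / p_pos              # Bayes' rule
print(posterior)   # about 0.167: a positive test leaves only ~17% chance of disease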
COMMON PROBABILITY
DISTRIBUTIONS
Common Probability Distributions
• For Discrete Random Variables:
– Bernoulli
– Binomial
– Multinomial
– Categorical
– Poisson
• For Continuous Random Variables:
– Exponential
– Gamma
– Beta
– Dirichlet
– Laplace
– Gaussian (1D)
– Multivariate Gaussian
Bernoulli Distribution
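For reference, the standard Bernoulli pmf with parameter φ in [0,1]: p(x) = φ^x (1-φ)^(1-x) for x in {0, 1}, i.e. p(1) = φ and p(0) = 1-φ.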
Binomial Distribution
Slide from http://mathworld.wolfram.com/BinomialDistribution.html
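For reference, the standard Binomial pmf gives the probability of k successes in n independent Bernoulli(φ) trials: p(k) = C(n,k) φ^k (1-φ)^(n-k) for k = 0, 1, …, n.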
Multinomial Distribution
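For reference, the standard Multinomial pmf for n draws over K categories with probabilities φ_1, …, φ_K: p(x_1, …, x_K) = n! / (x_1! ⋯ x_K!) * φ_1^x_1 ⋯ φ_K^x_K, where the x_k sum to n. A quick sketch evaluating all three pmfs, assuming scipy is available (parameter values invented for illustration):

from scipy.stats import bernoulli, binom, multinomial

print(bernoulli.pmf(1, 0.3))                 # φ = 0.3
print(binom.pmf(3, 10, 0.5))                 # C(10,3) * 0.5^10 ≈ 0.117
print(multinomial.pmf([2, 3, 5], n=10, p=[0.2, 0.3, 0.5]))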
Continuous Distributions
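Among these, the workhorse is the Gaussian; for reference, its pdf is f(x) = (1 / sqrt(2π σ^2)) exp(-(x - μ)^2 / (2σ^2)), with mean μ and variance σ^2.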
Law of Large Numbers (LLN)
Informally (a standard statement): as the number of i.i.d. samples n grows, the sample mean (1/n) Σ X_i converges to the true mean E[X].
Central Limit Theorem (CLT)
Informally (a standard statement): for i.i.d. samples with mean μ and variance σ^2, the standardized sample mean sqrt(n) (X̄ - μ) / σ converges in distribution to a standard Gaussian N(0, 1).
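A quick simulation sketch of both results for die rolls, assuming numpy is available:

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=(10000, 100))  # 10000 experiments of n=100 rolls
means = rolls.mean(axis=1)                     # one sample mean per experiment
print(means.mean())  # LLN: close to the true mean 3.5
print(means.std())   # CLT: close to σ/sqrt(n) = 1.708/10 ≈ 0.171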
Oh, the Places You’ll Use Probability!
Supervised Classification
• Naïve Bayes
• Logistic regression
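Both are probabilistic at heart: Naïve Bayes models the joint P(X, Y), while logistic regression models the class posterior directly, P(Y=1 | x) = 1 / (1 + exp(-(w·x + b))).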
Oh, the Places You’ll Use Probability!
ML Theory
(Example: Sample Complexity)
Oh, the Places You’ll Use Probability!
Deep Learning
(Example: Deep Bi-directional RNN)
[Figure: a deep bi-directional RNN with inputs x1…x4, forward and backward hidden layers h1…h4, and outputs y1…y4.]
Oh, the Places You’ll Use Probability!
Graphical Models
• Hidden Markov Model (HMM)
[Figure: an HMM tagging “time flies like an arrow” with hidden tag sequence <START> n v p d n.]
• Conditional Random Field (CRF)
[Figure: a linear-chain CRF over the same sentence, with potentials ψ0 … ψ9 linking adjacent tags and tags to words.]
Summary
1. Probability theory is rooted in (simple)
axioms
2. Random variables provide an important
tool for modeling the world
3. Our favorite probability distributions are
just functions! (usually with interesting
properties)
4. Probability and Statistics are essential to
Machine Learning