
Introduction to Machine Learning

Probability Theory for ML

Readings:
Mitchell Ch. 1, 2, 6.1 – 6.3
Murphy Ch. 2
Bishop Ch. 1 - 2

Slides are adapted from Matt Gormley, Rob Hall, Zahra Koochak, and Jeremy Irvin
Outline
• Motivation
• Probability Theory
– Sample space, Outcomes, Events
– Kolmogorov’s Axioms of Probability
• Random Variables
– Random variables, Probability mass function (pmf), Probability density function (pdf),
Cumulative distribution function (cdf)
– Examples
– Notation
– Expectation and Variance
– Joint, conditional, marginal probabilities
– Independence
– Bayes’ Rule
• Common Probability Distributions
– Beta, Dirichlet, etc.
• Recap of Decision Trees
– Entropy
– Information Gain
• Probability in ML

Why Probability?

[Diagram: machine learning sits at the intersection of computer science and a domain of interest, and builds on optimization and statistics, which in turn rest on probability, calculus, measure theory, and linear algebra.]
PROBABILITY THEORY

Probability Theory: Definitions
Example 1: Flipping a coin
Sample Space {Heads, Tails}
Outcome Example: Heads
Event Example: {Heads}
Probability P({Heads}) = 0.5
P({Tails}) = 0.5

Probability Theory: Definitions
Probability provides a formal framework for inference about events of interest.
Sample Space: The set of all possible outcomes
Outcome: A possible result of an experiment
Event: Any subset of the sample space
Probability: The non-negative number assigned to each event in the sample space
• Each outcome is unique
• Only one outcome can occur per experiment
• An outcome can be in multiple events
• An elementary event consists of exactly one outcome
Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {3}
(the event “the die came up 3”)
Probability P({3}) = 1/6
P({4}) = 1/6

Probability Theory: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {2,4,6}
(the event “the roll was even”)
Probability P({2,4,6}) = 0.5
P({1,3,5}) = 0.5

Probability Theory: Definitions
Example 3: Timing how long it takes a monkey to
reproduce Shakespeare
Sample Space [0, +∞)
Outcome Example: 1,433,600 hours
Event Example: [1, 6] hours
Probability P([1,6]) = 0.000000000001
P([1,433,600, +∞)) = 0.99

Kolmogorov’s Axioms
All of probability can be derived from just these three axioms!

In words:
1. Each event has non-negative probability.
2. The probability that some outcome in the sample space occurs is one.
3. The probability of the union of countably many disjoint events is the sum of their probabilities.
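In symbols, the standard statement of the three axioms, written here in LaTeX:

\begin{align*}
&\text{(1)}\quad P(E) \ge 0 \quad \text{for every event } E \subseteq \Omega \\
&\text{(2)}\quad P(\Omega) = 1 \\
&\text{(3)}\quad P\Big(\bigcup_{i=1}^{\infty} E_i\Big) = \sum_{i=1}^{\infty} P(E_i) \quad \text{for pairwise disjoint events } E_1, E_2, \ldots
\end{align*}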
Axioms Deriving Probability Theorems

Monotonicity: if A is a subset of B, then P(A) ≤ P(B)

Proof:
• A ⊆ B ⇒ B = A ∪ C for C = B − A
• A and C are disjoint ⇒ P(B) = P(A or C) = P(A) + P(C)
• P(C) ≥ 0
• So P(B) ≥ P(A)
Slide adapted from William Cohen (10-601B, Spring 2016)
Probability Theory: Definitions
• The complement of an event E, denoted ~E,
is the event that E does not occur.

Axioms Deriving Probability Theorems

Theorem: P(~A) = 1 - P(A)

Proof:
• P(A or ~A) = P(Ω) = 1
• A and ~A are disjoint ⇒ P(A) + P(~A) = P(A or ~A)
  ⇒ P(A) + P(~A) = 1
• Solving for P(~A): P(~A) = 1 − P(A)
Slide adapted from William Cohen (10-601B, Spring 2016)
Axioms Deriving Probability Theorems

Theorem: P(A or B) = P(A) + P(B) − P(A and B)

Proof:
• E1 = A and ~(A and B)
• E2 = (A and B)
• E3 = B and ~(A and B)
• E1 or E2 or E3 = A or B, and E1, E2, E3 are disjoint ⇒
  P(A or B) = P(E1) + P(E2) + P(E3)
• Further, P(A) = P(E1) + P(E2) and P(B) = P(E3) + P(E2)
• So P(A or B) = [P(E1) + P(E2)] + [P(E3) + P(E2)] − P(E2) = P(A) + P(B) − P(A and B)
Slide adapted from William Cohen (10-601B, Spring 2016)
These Axioms are Not to be Trifled With
- Andrew Moore
• There have been many, many other approaches to understanding “uncertainty”:
– Fuzzy logic, three-valued logic, Dempster-Shafer, non-monotonic reasoning, …
• 40 years ago people in AI argued about these; now they mostly don’t
– Any scheme for combining uncertain information, uncertain “beliefs”, etc. really should obey these axioms to be internally consistent (from Jaynes, 1958; Cox, 1930s)
– If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] if and only if [your uncertainty formalism violates the axioms] - de Finetti 1931 (the “Dutch book argument”)
RANDOM VARIABLES

Random Variables: Definitions
Random Variable (written with a capital letter): Def 1: A variable whose possible values are the outcomes of a random experiment.
Value of a Random Variable (written with a lowercase letter): The value taken by a random variable.
Random Variables: Definitions
Random Variable: Def 1: A variable whose possible values are the outcomes of a random experiment.
Def 2: A measurable function from the sample space to the real numbers, X: Ω → ℝ.
Discrete Random Variable: A random variable whose values come from a countable set (e.g. the natural numbers or {True, False}).
Continuous Random Variable: A random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)).
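To make Def 2 concrete, here is a minimal Python sketch (not from the slides): a random variable is just a function on the sample space, and its pmf is induced by the probabilities of the outcomes that map to each value.

from fractions import Fraction
from collections import defaultdict

# Sample space for one roll of a fair 6-sided die, with P(outcome) = 1/6.
omega = [1, 2, 3, 4, 5, 6]
p_outcome = {w: Fraction(1, 6) for w in omega}

# Two random variables, each a function from the sample space to the reals.
X = lambda w: w                       # the value on the top face
Y = lambda w: 1 if w % 2 == 0 else 0  # 1 if the roll was even, else 0

def pmf(rv):
    # Induced pmf: p(x) = sum of P(w) over outcomes w with rv(w) = x.
    p = defaultdict(Fraction)
    for w in omega:
        p[rv(w)] += p_outcome[w]
    return dict(p)

print(pmf(X))  # each face has probability 1/6
print(pmf(Y))  # {1: 1/2, 0: 1/2}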
Random Variables: Definitions
Discrete Random Variable: A random variable whose values come from a countable set (e.g. the natural numbers or {True, False}).
Probability mass function (pmf): The function giving the probability that the discrete r.v. X takes value x: p(x) = P(X = x).
Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {3} (the event “the die came up 3”)
Probability P({3}) = 1/6
P({4}) = 1/6
Discrete Random Variable Example: The value on the top face of the die.
Prob. Mass Function (pmf) p(3) = 1/6
p(4) = 1/6
Random Variables: Definitions
Example 2: Rolling a 6-sided die
Sample Space {1,2,3,4,5,6}
Outcome Example: 3
Event Example: {2,4,6} (the event “the roll was even”)
Probability P({2,4,6}) = 0.5
P({1,3,5}) = 0.5
Discrete Random Variable Example: 1 if the die landed on an even number and 0 otherwise
Prob. Mass Function (pmf) p(1) = 0.5
p(0) = 0.5
Random Variables: Definitions
Continuous Random Variable: A random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)).
Probability density function (pdf): A function f that returns a non-negative real indicating the relative likelihood that the continuous r.v. X takes value x.
• For any continuous random variable: P(X = x) = 0
• Non-zero probabilities are only assigned to intervals: P(a ≤ X ≤ b) = ∫ f(x) dx over [a, b]
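A quick sketch of this point in Python (scipy assumed; the gamma shape and scale values are illustrative, not from the slides):

from scipy import stats

# An illustrative continuous r.v.: Gamma with shape a=2.0 and scale=1.0 (assumed values).
T = stats.gamma(a=2.0, scale=1.0)

# P(T = 3) is zero for a continuous r.v.: single points carry no probability mass.
# Intervals get probability by integrating the pdf, i.e. by differencing the cdf.
p_interval = T.cdf(6.0) - T.cdf(1.0)   # P(1 <= T <= 6)
print(p_interval)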
Random Variables: Definitions
Example 3: Timing how long it takes a monkey to
reproduce Shakespeare
Sample Space [0, +∞)
Outcome Example: 1,433,600 hours
Event Example: [1, 6] hours
Probability P([1,6]) = 0.000000000001
P([1,433,600, +∞)) = 0.99
Continuous Random Var. Example: Represents time to reproduce (not an interval!)
Prob. Density Function Example: Gamma distribution
Random Variables: Definitions
“Region”-valued Random Variables
Sample Space Ω {1,2,3,4,5}
Events x The sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X Represents a random selection of a sub-region
Prob. Mass Fn. P(X=x) Proportional to size of sub-region

[Figure: a region divided into five sub-regions labeled X=1 through X=5]
Random Variables: Definitions
“Region”-valued Random Variables
Sample Space Ω All points in the region
Events x The sub-regions 1, 2, 3, 4, or 5
Discrete Random Variable X Represents a random selection of a sub-region
Prob. Mass Fn. P(X=x) Proportional to size of sub-region

Recall that an event is any subset of the sample space, so both definitions of the sample space here are valid.
Random Variables: Definitions
Cumulative distribution function (cdf): The function that returns the probability that a random variable X is less than or equal to x: F(x) = P(X ≤ x)

• For discrete random variables: F(x) = Σ p(x′) over all values x′ ≤ x

• For continuous random variables: F(x) = ∫ f(t) dt over (−∞, x]
Random Variables and Events
Question: Something seems wrong…
• We defined P(E) (the capital ‘P’) as a function mapping events to probabilities
• So why do we write P(X=x)?
• A good guess: X=x is an event…

Answer: P(X=x) is just shorthand! Recall Def 2: a random variable is a measurable function from the sample space to the real numbers, so the set of outcomes {ω ∈ Ω : X(ω) = x} is an event.
Example: for the die, P(X=3) abbreviates P({ω ∈ Ω : X(ω) = 3}) = P({3}).
Notational Shortcuts
A convenient shorthand: we write p(x) for P(X = x), letting the lowercase argument name both the random variable and the value it takes.
Expectation and Variance
The expected value of X is E[X]. Also called the mean.

• Discrete random variables: E[X] = Σ_x x p(x)

• Continuous random variables: E[X] = ∫ x f(x) dx
Expectation and Variance
The variance of X is Var(X) = E[(X − E[X])²], the expected squared deviation from the mean.

• Discrete random variables: Var(X) = Σ_x (x − E[X])² p(x)

• Continuous random variables: Var(X) = ∫ (x − E[X])² f(x) dx
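A minimal numeric sketch of both definitions for the fair die (numpy assumed; not from the slides), with a Monte Carlo check:

import numpy as np

# Fair die: p(x) = 1/6 for x in {1, ..., 6}.
xs = np.arange(1, 7)
p = np.full(6, 1 / 6)

mean = np.sum(xs * p)             # E[X] = sum_x x p(x) = 3.5
var = np.sum((xs - mean)**2 * p)  # Var(X) = sum_x (x - E[X])^2 p(x) ≈ 2.9167

# Monte Carlo check: sample statistics approach the exact values.
rng = np.random.default_rng(0)
samples = rng.choice(xs, size=100_000, p=p)
print(mean, samples.mean())  # both ≈ 3.5
print(var, samples.var())    # both ≈ 2.9167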
Multiple Random Variables
• Joint probability
• Marginal probability
• Conditional probability

Joint Probability
p(x, y) = P(X = x, Y = y): the probability that X takes value x and Y takes value y simultaneously.
Slide from Sam Roweis (MLSS, 2005)
Marginal Probabilities
Sum rule: p(x) = Σ_y p(x, y), obtained by summing the joint distribution over the other variable.
Slide from Sam Roweis (MLSS, 2005)
Conditional Probability
p(x | y) = p(x, y) / p(y): the probability that X = x given that Y = y (defined when p(y) > 0).
Slide from Sam Roweis (MLSS, 2005)
Independence and Conditional Independence
X and Y are independent iff p(x, y) = p(x) p(y) for all x and y.
X and Y are conditionally independent given Z iff p(x, y | z) = p(x | z) p(y | z) for all x, y, z.
Slide from Sam Roweis (MLSS, 2005)
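A minimal sketch of these operations on a small tabular joint distribution (the table values are invented for illustration; numpy assumed):

import numpy as np

# Joint pmf p(x, y): rows are x in {0, 1}, columns are y in {0, 1, 2}.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)    # marginal: p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)    # marginal: p(y) = sum_x p(x, y)
p_x_given_y = joint / p_y  # conditional: p(x | y) = p(x, y) / p(y), one column per y

# Independence check: X and Y are independent iff p(x, y) = p(x) p(y) everywhere.
print(np.allclose(joint, np.outer(p_x, p_y)))  # False for this table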
Definition of Conditional Probability

P(A|B) = P(A ∧ B) / P(B)

Corollary: The Chain Rule

P(A ∧ B) = P(A|B) P(B)
Slide from William Cohen (10-601B, Spring 2016)


BAYES’ RULE

Bayes’ rule:

P(A|B) = P(B|A) P(A) / P(B)

where P(A) is the prior and P(A|B) is the posterior. Equivalently,

P(B|A) = P(A|B) P(B) / P(A)

Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.

…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…

Slide from William Cohen (10-601B, Spring 2016)


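A small numeric sketch of Bayes’ rule in Python (the prevalence and test accuracies below are invented for illustration): computing the posterior P(disease | positive test) from a prior and a likelihood.

# Hypothetical numbers: P(D) = 0.01, P(+|D) = 0.95, P(+|~D) = 0.05.
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Denominator via the sum rule: P(+) = P(+|D) P(D) + P(+|~D) P(~D).
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' rule: P(D|+) = P(+|D) P(D) / P(+).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(p_d_given_pos)  # ≈ 0.161: far above the 0.01 prior, yet D remains unlikely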
COMMON PROBABILITY
DISTRIBUTIONS

Common Probability Distributions
• For Discrete Random Variables:
– Bernoulli
– Binomial
– Multinomial
– Categorical
– Poisson
• For Continuous Random Variables:
– Exponential
– Gamma
– Beta
– Dirichlet
– Laplace
– Gaussian (1D)
– Multivariate Gaussian

Bernoulli Distribution
A Bernoulli random variable takes value 1 with probability φ and 0 with probability 1 − φ:
p(x) = φ^x (1 − φ)^(1−x) for x ∈ {0, 1}
Binomial Distribution
The number of successes in n independent Bernoulli(φ) trials:
p(k) = C(n, k) φ^k (1 − φ)^(n−k) for k = 0, 1, …, n

Slide from http://mathworld.wolfram.com/BinomialDistribution.html


Multinomial Distribution
Counts (x1, …, xk) over k categories from n independent trials with category probabilities φ1, …, φk:
p(x1, …, xk) = [n! / (x1! ⋯ xk!)] φ1^x1 ⋯ φk^xk, where x1 + … + xk = n
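A sketch of these discrete distributions via scipy.stats (parameter values chosen arbitrarily for illustration):

from scipy import stats

# Bernoulli(phi=0.3): p(0) = 0.7, p(1) = 0.3.
print(stats.bernoulli(p=0.3).pmf([0, 1]))

# Binomial(n=10, phi=0.3): probability of exactly 3 successes in 10 trials.
print(stats.binom(n=10, p=0.3).pmf(3))

# Multinomial(n=10, phi=(0.2, 0.3, 0.5)): probability of counts (2, 3, 5).
print(stats.multinomial(n=10, p=[0.2, 0.3, 0.5]).pmf([2, 3, 5]))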
Continuous Distributions

[Figure slides: definitions and density plots for the continuous families listed above, including the exponential, gamma, beta, Dirichlet, Laplace, and Gaussian distributions.]
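For the continuous families, a similar scipy.stats sketch (again with arbitrary illustrative parameters), evaluating a Beta pdf and sampling from a Dirichlet:

import numpy as np
from scipy import stats

# Beta(a=2, b=5): a density on [0, 1], commonly used as a prior over probabilities.
B = stats.beta(a=2, b=5)
print(B.pdf(0.3), B.mean())  # relative likelihood at 0.3; mean = 2 / (2 + 5)

# Dirichlet(alpha=(1, 2, 3)): a density over probability vectors that sum to 1.
sample = stats.dirichlet(alpha=[1, 2, 3]).rvs(random_state=np.random.default_rng(0))
print(sample, sample.sum())  # one random probability vector; components sum to 1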
Law of Large Numbers (LLN)
For i.i.d. random variables X1, X2, … with mean μ, the sample mean X̄n = (X1 + … + Xn)/n converges to μ as n → ∞.
Central Limit Theorem (CLT)
For i.i.d. random variables with mean μ and finite variance σ², the distribution of the standardized sample mean √n (X̄n − μ)/σ approaches the standard Gaussian N(0, 1) as n → ∞.
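A simulation sketch of both statements with numpy, assuming i.i.d. exponential draws (an arbitrary choice of distribution):

import numpy as np

rng = np.random.default_rng(0)

# LLN: the sample mean of n i.i.d. draws approaches the true mean (here 2.0) as n grows.
for n in (10, 1_000, 100_000):
    print(n, rng.exponential(scale=2.0, size=n).mean())

# CLT: means of many size-n samples spread like sigma / sqrt(n) around the true mean.
n, reps = 50, 10_000
means = rng.exponential(scale=2.0, size=(reps, n)).mean(axis=1)
print(means.std(), 2.0 / np.sqrt(n))  # empirical spread vs. the CLT prediction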
Oh, the Places You’ll Use Probability!
Supervised Classification
• Naïve Bayes

• Logistic regression

Oh, the Places You’ll Use Probability!
ML Theory
(Example: Sample Complexity)

Oh, the Places You’ll Use Probability!
Deep Learning
(Example: Deep Bi-directional RNN)

[Figure: a deep bi-directional RNN with inputs x1–x4, two layers of hidden states h1–h4 running in opposite directions, and outputs y1–y4.]
Oh, the Places You’ll Use Probability!
Graphical Models
• Hidden Markov Model (HMM)
[Figure: an HMM for the sentence “time flies like an arrow” with hidden tag sequence <START> n v p d n emitting the words.]

• Conditional Random Field (CRF)
[Figure: a linear-chain CRF over the same sentence, with factors ψ0–ψ9 connecting adjacent tags and each tag to its word.]
Summary
1. Probability theory is rooted in (simple) axioms
2. Random variables provide an important tool for modeling the world
3. Our favorite probability distributions are just functions! (usually with interesting properties)
4. Probability and Statistics are essential to Machine Learning
