Probability Theory For Machine Learning: Chris Cremer September 2015

This document provides an outline and overview of probability theory concepts relevant to machine learning. It discusses sources of uncertainty and how probability theory provides a framework to quantify uncertainty. Key concepts covered include sample spaces, probability definitions and rules, probability distributions like binomial and Gaussian, maximum likelihood estimation, and the relationships between maximum likelihood estimation and least squares regression. The goal is to motivate these probability concepts and provide a high-level introduction.

Probability Theory for Machine Learning
Chris Cremer
September 2015
Outline
• Motivation
• Probability Definitions and Rules
• Probability Distributions
• MLE for Gaussian Parameter Estimation
• MLE and Least Squares
Material
• Pattern Recognition and Machine Learning - Christopher M. Bishop
• All of Statistics – Larry Wasserman
• Wolfram MathWorld
• Wikipedia
Motivation
• Uncertainty arises through:
• Noisy measurements
• Finite size of data sets
• Ambiguity: The word bank can mean (1) a financial institution, (2) the side of a river,
or (3) tilting an airplane. Which meaning was intended, based on the words that
appear nearby?
• Limited Model Complexity
• Probability theory provides a consistent framework for the quantification
and manipulation of uncertainty
• Allows us to make optimal predictions given all the information available to
us, even though that information may be incomplete or ambiguous
Sample Space
• The sample space Ω is the set of possible outcomes of an experiment.
Points ω in Ω are called sample outcomes, realizations, or elements.
Subsets of Ω are called Events.

• Example. If we toss a coin twice then Ω = {HH,HT, TH, TT}. The event
that the first toss is heads is A = {HH,HT}

• We say that events A1 and A2 are disjoint (mutually exclusive) if A1 ∩ A2 = {}. More generally, events A1, A2, . . . are disjoint if Ai ∩ Aj = {} for all i ≠ j
• Example: first flip being heads and first flip being tails
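
The two-toss example above can be written out with plain Python sets. This is a minimal sketch; the variable names (`omega`, `first_heads`, `first_tails`) are illustrative and not from the slides.

```python
from itertools import product

# Sample space for two coin tosses: every ordered sequence of H/T.
omega = {"".join(seq) for seq in product("HT", repeat=2)}

# Events are subsets of the sample space.
first_heads = {w for w in omega if w[0] == "H"}
first_tails = {w for w in omega if w[0] == "T"}

# Disjoint (mutually exclusive) events share no outcomes.
are_disjoint = first_heads.isdisjoint(first_tails)

print(sorted(omega))
print(sorted(first_heads), are_disjoint)
```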
Probability
• We will assign a real number P(A) to every event A, called the
probability of A.
• To qualify as a probability, P must satisfy three axioms:
• Axiom 1: P(A) ≥ 0 for every A
• Axiom 2: P(Ω) = 1
• Axiom 3: If A1, A2, . . . are disjoint then $P\left(\bigcup_{i} A_i\right) = \sum_{i} P(A_i)$
Joint and Conditional Probabilities
• Joint Probability
• P(X,Y)
• Probability of X and Y

• Conditional Probability
• P(X|Y)
• Probability of X given Y
Independent and Conditional Probabilities
• Assuming that P(B) > 0, the conditional probability of A given B:
• P(A|B)=P(AB)/P(B)
• P(AB) = P(A|B)P(B) = P(B|A)P(A)
• Product Rule

• Two events A and B are independent if

• P(AB) = P(A)P(B)
• Joint = Product of Marginals

• Two events A and B are conditionally independent given C if they are independent after conditioning on C
• P(AB|C) = P(B|AC)P(A|C) = P(B|C)P(A|C)
Example
• 60% of ML students pass the final and 45% of ML students pass both the
final and the midterm *
• What percent of students who passed the final also passed the
midterm?

• Reworded: What percent of students passed the midterm given they passed the final?
• P(M|F) = P(M,F) / P(F)
• = .45 / .60
• = .75
* These are made up values.
Marginalization and Law of Total Probability
• Marginalization (Sum Rule): $P(X) = \sum_{Y} P(X, Y)$
• Law of Total Probability: $P(X) = \sum_{Y} P(X \mid Y)\, P(Y)$
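
The sum rule and the law of total probability can be checked on a small joint table. The sketch below uses a made-up joint distribution over two hypothetical variables (weather and commute); the numbers and the function names are assumptions for illustration only.

```python
# Hypothetical joint distribution P(X, Y); the values are made up.
joint = {
    ("rain", "late"): 0.15,
    ("rain", "on_time"): 0.15,
    ("sun", "late"): 0.07,
    ("sun", "on_time"): 0.63,
}

def marginal_x(joint):
    """Sum rule: P(X) = sum over Y of P(X, Y)."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, 0.0) + p
    return out

def total_probability_y(joint, y):
    """Law of total probability: P(Y=y) = sum over x of P(Y=y | X=x) P(X=x)."""
    total = 0.0
    for x, p_x in marginal_x(joint).items():
        p_y_given_x = joint[(x, y)] / p_x  # conditional probability P(Y=y | X=x)
        total += p_y_given_x * p_x
    return total

p_x = marginal_x(joint)
p_late = total_probability_y(joint, "late")
print(p_x, p_late)
```

Both routes give the same marginal: summing the joint directly over X, or weighting each conditional by its prior.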

Bayes’ Rule
P(A|B) = P(AB) / P(B) (Conditional Probability)
P(A|B) = P(B|A)P(A) / P(B) (Product Rule)
P(A|B) = P(B|A)P(A) / Σi P(B|Ai)P(Ai) (Law of Total Probability, where the events Ai partition the sample space)
Example
• Suppose you have tested positive for a disease; what is the
probability that you actually have the disease?
• It depends on the accuracy and sensitivity of the test, and on the
background (prior) probability of the disease.
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)

• P(D=1|T=1) = ?
Example
• P(T=1|D=1) = .95 (true positive)
• P(T=1|D=0) = .10 (false positive)
• P(D=1) = .01 (prior)

Law of Total Probability
• P(T=1) = Σ P(T=1|D)P(D)
= P(T=1|D=1)P(D=1) + P(T=1|D=0)P(D=0)
= .95 * .01 + .10 * .99
= .1085

Bayes’ Rule
• P(D=1|T=1) = P(T=1|D=1)P(D=1) / P(T=1)
= .95 * .01 / .1085
≈ .0876

The probability that you have the disease given you tested positive is about 8.8%
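
The same calculation can be sketched as a small function. The name `posterior` and its parameter names are illustrative, not from the slides; the inputs are the values given in the example.

```python
def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """Return (P(D=1 | T=1), P(T=1)) using Bayes' rule."""
    # Law of total probability for the evidence P(T=1).
    evidence = p_pos_given_d * prior + p_pos_given_not_d * (1.0 - prior)
    # Bayes' rule: likelihood times prior, divided by evidence.
    return p_pos_given_d * prior / evidence, evidence

p_disease_given_pos, p_pos = posterior(
    prior=0.01, p_pos_given_d=0.95, p_pos_given_not_d=0.10
)
print(round(p_pos, 4), round(p_disease_given_pos, 4))
```

The posterior stays low despite the 95% true positive rate because the prior is small: most positive results come from the much larger healthy group.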
Random Variable
• How do we link sample spaces and events to data?
• A random variable is a mapping that assigns a real number X(ω) to
each outcome ω

• Example: Flip a coin ten times. Let X(ω) be the number of heads in the
sequence ω. If ω = HHTHHTHHTT, then X(ω) = 6.
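
A random variable is just a function on outcomes, which the following sketch makes concrete. It also assumes a fair coin (an assumption not stated on the slide) so that all 2^10 sequences are equally likely, and counts outcomes to get the distribution of X.

```python
from collections import Counter
from itertools import product

def num_heads(omega: str) -> int:
    """Random variable X: maps an outcome (a string of H/T) to its head count."""
    return omega.count("H")

x_example = num_heads("HHTHHTHHTT")

# Assumption: fair coin, so every length-10 sequence has probability 1 / 2**10.
outcomes = ["".join(seq) for seq in product("HT", repeat=10)]
counts = Counter(num_heads(w) for w in outcomes)
pmf = {k: counts[k] / len(outcomes) for k in sorted(counts)}

print(x_example, pmf[6])
```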
Discrete vs Continuous Random Variables
• Discrete: can only take a countable number of values
• Example: number of heads
• Distribution defined by probability mass function (pmf)
• Marginalization: $p(x) = \sum_{y} p(x, y)$

• Continuous: can take uncountably many values (e.g. any real number in an interval)
• Example: time taken to accomplish task
• Distribution defined by probability density function (pdf)
• Marginalization: $p(x) = \int p(x, y)\, dy$
Probability Distribution Statistics
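• Mean: E[x] = μ = first moment
• Univariate continuous random variable: $E[x] = \int x\, p(x)\, dx$
• Univariate discrete random variable: $E[x] = \sum_{x} x\, p(x)$
• Variance: $\mathrm{Var}(x) = E[(x - \mu)^2] = E[x^2] - \mu^2$
• Nth moment = $E[x^N]$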
Discrete Distribution

Bernoulli Distribution
• Input: x ∈ {0, 1}
• Parameter: μ = P(x = 1), with 0 ≤ μ ≤ 1
• Distribution: $\mathrm{Bern}(x \mid \mu) = \mu^{x} (1 - \mu)^{1 - x}$
• Example: Probability of flipping heads (x=1)
• Mean = E[x] = μ
• Variance = μ(1 − μ)
Discrete Distribution

Binomial Distribution
• Input: m = number of successes
• Parameters: N = number of trials, μ = probability of success
• Distribution: $\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1 - \mu)^{N - m}$
• Example: Probability of flipping heads m times out of N independent flips with success probability μ
• Mean = E[m] = Nμ
• Variance = Nμ(1 − μ)
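
As a quick check of the mean and variance, the sketch below evaluates the standard binomial pmf, C(N, m) μ^m (1 − μ)^(N − m), and computes both moments by direct summation. The function name and the choice N = 10, μ = 0.3 are illustrative assumptions. Setting N = 1 recovers the Bernoulli case.

```python
from math import comb

def binomial_pmf(m, N, mu):
    """P(m successes in N independent trials with success probability mu)."""
    return comb(N, m) * mu**m * (1 - mu) ** (N - m)

N, mu = 10, 0.3  # illustrative values
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

# Moments by direct summation over the support 0..N.
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean) ** 2 * p for m, p in enumerate(pmf))

print(mean, N * mu)
print(var, N * mu * (1 - mu))
```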
Discrete Distribution

Multinomial Distribution
• The multinomial distribution is a generalization of the binomial distribution to K categories instead of just binary (success/fail)
• For N independent trials each of which leads to a success for exactly one of K categories, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories
• Example: Rolling a die N times
• Input: m1 … mK (counts), with Σk mk = N
• Parameters: N = number of trials, μ = μ1 … μK probability of success for each category, Σk μk = 1
• Distribution: $\mathrm{Mult}(m_1, \ldots, m_K \mid \mu, N) = \frac{N!}{m_1!\, m_2! \cdots m_K!} \prod_{k=1}^{K} \mu_k^{m_k}$
• Mean of mk: Nμk
• Variance of mk: Nμk(1 − μk)
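
A minimal sketch of the standard multinomial pmf, N! / (m1! … mK!) · Πk μk^mk, applied to the die-rolling example. The function name is hypothetical and the die is assumed fair. With K = 2 the result matches the binomial pmf.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, mus):
    """Probability of observing the given per-category counts."""
    N = sum(counts)
    # Multinomial coefficient N! / (m1! m2! ... mK!), kept as an exact integer.
    coeff = factorial(N)
    for m in counts:
        coeff //= factorial(m)
    return coeff * prod(mu**m for mu, m in zip(mus, counts))

# Fair die rolled 6 times: probability that every face shows exactly once.
fair_die = [1 / 6] * 6
p_each_once = multinomial_pmf([1, 1, 1, 1, 1, 1], fair_die)

# With K = 2 categories the multinomial reduces to the binomial.
p_two_cat = multinomial_pmf([3, 7], [0.3, 0.7])
p_binomial = comb(10, 3) * 0.3**3 * 0.7**7

print(p_each_once, p_two_cat, p_binomial)
```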
Continuous Distribution

Gaussian Distribution
• Aka the normal distribution
• Widely used model for the distribution of continuous variables
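• In the case of a single variable x, the Gaussian distribution can be written in the form
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$$
• where μ is the mean and σ² is the variance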
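
The sketch below evaluates the standard univariate Gaussian density, exp(−(x − μ)² / (2σ²)) / √(2πσ²), and checks numerically that it integrates to 1 with mean μ and variance σ². The parameter values, the grid step and the simple Riemann sum are illustrative choices, not part of the slides.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma2):
    """Density of N(x | mu, sigma2), where sigma2 is the variance."""
    return exp(-((x - mu) ** 2) / (2 * sigma2)) / sqrt(2 * pi * sigma2)

mu, sigma2 = 1.0, 4.0  # illustrative values
sigma = sqrt(sigma2)

# Riemann sum on a grid covering mu +/- 10 standard deviations.
step = 0.01
n_steps = round(20 * sigma / step)
xs = [mu - 10 * sigma + i * step for i in range(n_steps + 1)]
ps = [gaussian_pdf(x, mu, sigma2) for x in xs]

area = sum(ps) * step
mean = sum(x * p for x, p in zip(xs, ps)) * step
variance = sum((x - mean) ** 2 * p for x, p in zip(xs, ps)) * step

print(area, mean, variance)
```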

[Figure: Gaussian densities with different means and variances]
Multivariate Gaussian Distribution
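• For a D-dimensional vector x, the multivariate Gaussian distribution takes the form
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}$$
• where μ is a D-dimensional mean vector
• Σ is a D × D covariance matrix
• |Σ| denotes the determinant of Σ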
Inferring Parameters
• We have data X and we assume it comes from some distribution
• How do we figure out the parameters that ‘best’ fit that distribution?
• Maximum Likelihood Estimation (MLE): $\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(X \mid \theta)$
• Maximum a Posteriori (MAP): $\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)$

See ‘Gibbs Sampling for the Uninitiated’ for a straightforward introduction to parameter estimation: http://www.umiacs.umd.edu/~resnik/pubs/LAMP-TR-153.pdf
I.I.D.
• Random variables are independent and identically distributed (i.i.d.) if they all have the same probability distribution and are mutually independent.

• Example: Coin flips are assumed to be IID

MLE for parameter estimation
• The parameters of a Gaussian distribution are the mean (µ) and variance (σ²)
• We’ll estimate the parameters using MLE
• Given observations x1, . . . , xN, the likelihood of those observations for a certain µ and σ² (assuming IID) is

Likelihood = $p(x_1, \ldots, x_N \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

• Given a set of observations, what are the distribution’s mean and variance?
MLE for Gaussian Parameters

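Likelihood = $\prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

• Now we want to maximize this function wrt µ
• Instead of maximizing the product, we take the log of the likelihood so the product becomes a sum

Log Likelihood = $\log \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) = \sum_{n=1}^{N} \log \mathcal{N}(x_n \mid \mu, \sigma^2)$

• We can do this because log is monotonically increasing
• Meaning the µ that maximizes the log likelihood also maximizes the likelihood: $\arg\max_{\mu} \log L(\mu) = \arg\max_{\mu} L(\mu)$, where L denotes the likelihood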
MLE for Gaussian Parameters
• Log Likelihood simplifies to:
$$\log p(x_1, \ldots, x_N \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \log \sigma^2 - \frac{N}{2} \log(2\pi)$$
• Now we want to maximize this function wrt μ
• How? Take the derivative, set to 0, solve for μ
$$\frac{\partial}{\partial \mu} \log p = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu) = 0 \quad\Rightarrow\quad \mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n$$
• Doing the same for σ² gives
$$\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2$$

To see proofs for these derivations: http://www.statlect.com/normal_distribution_maximum_likelihood.htm
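
The closed-form Gaussian maximum likelihood estimates are the sample mean and the 1/N sample variance. The sketch below computes them on synthetic data and confirms that nudging either parameter lowers the log likelihood. The data-generating values (mean 5, standard deviation 2), the seed and the function names are assumptions made for the example.

```python
import math
import random

def gaussian_mle(xs):
    """Sample mean and (biased, 1/N) sample variance: the Gaussian MLEs."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

def log_likelihood(xs, mu, sigma2):
    """Log likelihood of i.i.d. Gaussian observations."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -sq / (2 * sigma2) - n / 2 * math.log(sigma2) - n / 2 * math.log(2 * math.pi)

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(1000)]  # synthetic sample

mu_ml, sigma2_ml = gaussian_mle(data)
best = log_likelihood(data, mu_ml, sigma2_ml)

# Any other parameter setting should score no higher than the MLE.
worse_mu = log_likelihood(data, mu_ml + 0.1, sigma2_ml)
worse_sigma2 = log_likelihood(data, mu_ml, sigma2_ml * 1.2)
print(mu_ml, sigma2_ml, best > worse_mu, best > worse_sigma2)
```

Note that this estimator divides by N, not N − 1, so it is the maximum likelihood variance rather than the unbiased sample variance.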

Maximum Likelihood and Least Squares
• Suppose that you are presented with a
sequence of data points (X1, T1), ..., (Xn, Tn),
and you are asked to find the “best fit” line
passing through those points.
• In order to answer this you need to know
precisely how to tell whether one line is
“fitter” than another
• A common measure of fitness is the squared error

For a good discussion of maximum likelihood estimators and least squares see
http://people.math.gatech.edu/~ecroot/3225/maximum_likelihood.pdf
Maximum Likelihood and Least Squares
• y(x,w) is estimating the target t (red line in the figure)
• Error/Loss/Cost/Objective function measures the squared error (green lines in the figure):
$$L(w) = \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$$
• Least Squares Regression
• Minimize L(w) wrt w
Maximum Likelihood and Least Squares
• Now we approach curve fitting from a probabilistic perspective
• We can express our uncertainty over the value of the target variable
using a probability distribution
• We assume, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x,w)
$$p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$
• β is the precision parameter (inverse variance)

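Maximum Likelihood and Least Squares
• We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood
$$p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$$
• Log Likelihood
$$\log p(\mathbf{t} \mid \mathbf{x}, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2 + \frac{N}{2} \log \beta - \frac{N}{2} \log(2\pi)$$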
Maximum Likelihood and Least Squares
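• Log Likelihood
$$\log p(\mathbf{t} \mid \mathbf{x}, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2 + \frac{N}{2} \log \beta - \frac{N}{2} \log(2\pi)$$
• Maximize Log Likelihood wrt w
• Since the last two terms don’t depend on w, they can be omitted.
• Also, scaling the log likelihood by a positive constant does not alter the location of the maximum with respect to w, so the factor β/2 can be dropped
• Result: Maximize $-\sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$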
Maximum Likelihood and Least Squares
• MLE
• Maximize $-\sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$
• Least Squares
• Minimize $\sum_{n=1}^{N} \left(y(x_n, w) - t_n\right)^2$
• Therefore, maximizing likelihood is equivalent, so far as determining w is concerned, to minimizing the sum-of-squares error function
• Significance: sum-of-squares error function arises as a consequence of
maximizing likelihood under the assumption of a Gaussian noise
distribution
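
The equivalence can be checked numerically for a straight line y(x, w) = w0 + w1·x. In the sketch below, the line that minimizes the sum of squared errors also has the highest Gaussian log likelihood, whatever fixed precision β is used. The synthetic data (true line 1 + 2x, noise standard deviation 0.5), the seed and all function names are assumptions made for the example.

```python
import math
import random

def fit_line_least_squares(xs, ts):
    """Closed-form least squares fit of t = w0 + w1 * x."""
    n = len(xs)
    x_bar, t_bar = sum(xs) / n, sum(ts) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxt = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    return t_bar - w1 * x_bar, w1

def sum_squared_error(w0, w1, xs, ts):
    return sum((w0 + w1 * x - t) ** 2 for x, t in zip(xs, ts))

def log_likelihood(w0, w1, beta, xs, ts):
    """Gaussian log likelihood with mean w0 + w1 * x and precision beta."""
    n = len(xs)
    sse = sum_squared_error(w0, w1, xs, ts)
    return -beta / 2 * sse + n / 2 * math.log(beta) - n / 2 * math.log(2 * math.pi)

random.seed(1)
xs = [i / 10 for i in range(50)]
ts = [1.0 + 2.0 * x + random.gauss(0.0, 0.5) for x in xs]  # synthetic targets

w0, w1 = fit_line_least_squares(xs, ts)
beta = 4.0  # any fixed positive precision works for comparing choices of w

# Perturbed lines: higher squared error and, equivalently, lower log likelihood.
perturbed = [(w0 + 0.2, w1), (w0, w1 - 0.1), (w0 - 0.3, w1 + 0.05)]
for p0, p1 in perturbed:
    print(sum_squared_error(p0, p1, xs, ts) > sum_squared_error(w0, w1, xs, ts),
          log_likelihood(p0, p1, beta, xs, ts) < log_likelihood(w0, w1, beta, xs, ts))
```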
Questions?
