
MSE 546: Advanced Machine Learning

Sirisha Rambhatla

University of Waterloo

Lecture 2
Math Background: Probability, Statistics and Information Theory

1 / 46
Outline

1 Math Background: Probability

2 Math Background: Statistics

3 Math Background: Information Theory

4 Reading

2 / 46
Math Background: Probability

Outline

1 Math Background: Probability


Motivation
Definitions
Random Variables
Distribution Function
Multivariate Distributions
Bayes Rule
Independence of Random Variables

2 Math Background: Statistics

3 Math Background: Information Theory

4 Reading
3 / 46
Math Background: Probability Motivation

Conditional Generation of Data


Prompt: “Alice in wonderland. Down the rabbit hole. Alice was getting
so very bored with her sister on the riverbank.”

https://dalledemo.com/ [1, 2]
https://stablediffusion.fr/ (demo, GitHub) [3]
4 / 46
Math Background: Probability Definitions

Probability: Terms and Notation

Sample Space: a set of all possible outcomes or realizations of some
random trial.
Example: Toss a coin twice; the sample space is Ω = {HH, HT, TH, TT}.

Event: A subset of the sample space.
Example: the event that at least one toss is a head is A = {HH, HT, TH}.

Probability: We assign a real number P(A) to each event A, called the
probability of A.
Example: In the coin tossing example, P(A) = 3/4.

5 / 46
Math Background: Probability Definitions

Probability Axioms

The probability P must satisfy three axioms:

1 Non-negativity: P (A) ≥ 0 for every A;


Probabilities cannot be negative!

2 Unit Measure: P(Ω) = 1;


Probability of the entire sample space is 1,
i.e. “something happens” and there are no events outside of the
sample space.

3 Mutually exclusive events: If A1, A2, . . . are disjoint, then
P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i)
disjoint: A1 ∧ A2 = ∅, where ∧ is logical AND.
∪ is “union”, which can be represented by the logical OR ∨.
Helps us to analyze events of interest and convert logical operations
into arithmetic.

6 / 46
Math Background: Probability Definitions

Probability: Terms and Notation

Probability of a union of two events: The probability of event A or B
happening is given by:

P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

7 / 46
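As a quick sanity check, the axioms and the union rule can be verified by brute-force enumeration of the two-toss sample space. The sketch below (plain Python, with illustrative event choices) reproduces P(A) = 3/4 for the “at least one head” event and checks the union rule against direct enumeration.

```python
from fractions import Fraction

# Sample space for two fair coin tosses, each outcome equally likely.
omega = ["HH", "HT", "TH", "TT"]
p = {w: Fraction(1, 4) for w in omega}

def prob(event):
    """P(event) = sum of probabilities of the outcomes in the event."""
    return sum(p[w] for w in event)

A = {w for w in omega if "H" in w}      # at least one toss is a head
B = {w for w in omega if w[0] == "H"}   # first toss is a head (illustrative choice)

# Union rule: P(A or B) = P(A) + P(B) - P(A and B)
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(prob(A))   # 3/4, matching the slide
print(lhs, rhs)  # both 3/4 here, since B is a subset of A
```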
Math Background: Probability Random Variables

Random Variables

Definition: A random variable is a function that maps from a random
event to a real number, i.e. X : Ω → R, that assigns a real number X(ω)
to each outcome ω.

Example: In coin tossing, we let H → 1 and T → 0. The event “at
least one toss is a head” can then be written as X1 + X2 > 0, where X1
and X2 are the random variables (i.e., 1 or 0, corresponding to the first
and the second toss respectively).

Two Types: Discrete (e.g. Bernoulli in Coin toss) and Continuous (e.g.
Gaussian)

8 / 46
Math Background: Probability Random Variables

From Random Variables to Data

Data: The data are specific realizations of random variables.

(X1 = 1, X2 = 0), (X1 = 1, X2 = 1), (X1 = 0, X2 = 0)

are 3 observations from the coin toss experiments (note that each
experiment involves tossing twice).

Statistic: A statistic is any function of the data or random variables, e.g.


mean, variance etc.

9 / 46
Math Background: Probability Distribution Function

Distribution Function: Discrete Random Variable

Definition: Suppose X is a random variable and x is a specific value that


it can take, then
For discrete r.v. X, the probability mass function is defined as

f_X(x) = P(X = x) : the probability that X takes the value x

Example: For a fair coin P (X = 1) = 1/2, where X is either 0 (’T’) or 1


(’H’).

10 / 46
Math Background: Probability Distribution Function

Distribution Function: Continuous Random Variable

Definition: Suppose X is a random variable and x is a specific value that


it can take, then

For a continuous r.v. X, f_X(x) ≥ 0 is the probability density function;
for every a ≤ b:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

where ∫_{−∞}^{∞} f(x) dx = 1. Note: for continuous distributions P(X = x) = 0!

Cumulative distribution function (CDF) of X : FX (x) = P (X ≤ x). If


F (x) is differentiable everywhere, f (x) = F ′ (x).

11 / 46
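The density/CDF relationship can be checked numerically. The sketch below assumes a standard normal density and compares a simple Riemann-sum approximation of ∫_a^b f(x) dx with F(b) − F(a) computed via the error function; the interval endpoints are arbitrary choices.

```python
import math
import numpy as np

# Standard normal density f(x) and CDF F(x) (via the error function).
f = lambda x: np.exp(-0.5 * x**2) / math.sqrt(2 * math.pi)
F = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))

a, b = -1.0, 2.0
xs = np.linspace(a, b, 10_001)
dx = xs[1] - xs[0]

# P(a <= X <= b) as the (approximate) area under the density ...
area = np.sum(f(xs)) * dx
# ... equals F(b) - F(a) from the CDF.
print(area, F(b) - F(a))        # both ~0.8186

# The total area under the density is 1 (unit measure).
grid = np.linspace(-10, 10, 100_001)
print(np.sum(f(grid)) * (grid[1] - grid[0]))  # ~1.0
```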
Math Background: Probability Distribution Function

Expectation
Expected Values
Of a function g(·) of a discrete r.v. X,

E[g(X)] = Σ_{x∈X} g(x) f(x)

Example: E[X] for tossing a fair coin (i.e. g(x) = x) is

µ = 1 × P(X = 1) + 0 × P(X = 0) = 1/2

Of a function g(·) of a continuous r.v. X,

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.

12 / 46
Math Background: Probability Distribution Function

Expectation

Mean and Variance: µ = E[X] is the mean; var[X] = E[(X − µ)²] is the
variance. Hence, we have var[X] = E[X²] − µ².

Example: The variance in our coin tossing example is

var[X] = E[(X − µ)²] = (1 − 1/2)² P(X = 1) + (0 − 1/2)² P(X = 0) = 1/4

Linearity of Expectation: For r.v. X1, . . . , Xn:

E[Σ_{i=1}^n X_i] = Σ_{i=1}^n E[X_i].

13 / 46
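A small sketch of these definitions for the fair-coin example: the mean and variance are computed exactly from the pmf, and linearity of expectation is checked by simulation (the trial count is an arbitrary choice).

```python
import random

# Fair-coin pmf: X = 1 ('H') with prob 1/2, X = 0 ('T') with prob 1/2.
pmf = {0: 0.5, 1: 0.5}

mu = sum(x * p for x, p in pmf.items())        # E[X]   = 1/2
ex2 = sum(x**2 * p for x, p in pmf.items())    # E[X^2] = 1/2
var = ex2 - mu**2                              # var[X] = 1/4
print(mu, var)

# Linearity of expectation: E[X1 + X2] = E[X1] + E[X2], checked by simulation.
random.seed(0)
n_trials = 100_000
avg_sum = sum(random.randint(0, 1) + random.randint(0, 1)
              for _ in range(n_trials)) / n_trials
print(avg_sum, mu + mu)  # both close to 1.0
```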
Math Background: Probability Distribution Function

Common Distributions

Discrete variable                 Probability function                    Mean
Uniform      X ∼ U[1, . . . , N]  1/N                                     (N + 1)/2
Binomial     X ∼ Bin(n, p)        (n choose x) p^x (1 − p)^(n−x)          np
Geometric    X ∼ Geom(p)          (1 − p)^(x−1) p                         1/p
Poisson      X ∼ Poisson(λ)       e^(−λ) λ^x / x!                         λ

Continuous variable               Probability density function            Mean
Uniform      X ∼ U(a, b)          1/(b − a)                               (a + b)/2
Gaussian     X ∼ N(µ, σ²)         (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))        µ
Gamma        X ∼ Γ(α, β), x ≥ 0   x^(α−1) e^(−x/β) / (Γ(α) β^α)           αβ
Exponential  X ∼ Exponential(β)   (1/β) e^(−x/β)                          β

14 / 46
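The table entries can be sanity-checked by sampling. The sketch below uses numpy's default generators; the parameter values are arbitrary illustrative choices, and numpy's parameterizations (e.g. Gamma takes shape and scale, Exponential takes the scale β) are assumed to match the table's conventions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000   # number of samples per distribution

# Arbitrary example parameters.
N, p_, lam, a, b_, mu, sigma, alpha, beta = 10, 0.3, 4.0, 2.0, 5.0, 1.0, 2.0, 3.0, 2.0

checks = {
    "uniform {1..N}":     (rng.integers(1, N + 1, n).mean(),   (N + 1) / 2),
    "binomial(20, p)":    (rng.binomial(20, p_, n).mean(),     20 * p_),
    "geometric(p)":       (rng.geometric(p_, n).mean(),        1 / p_),
    "poisson(lam)":       (rng.poisson(lam, n).mean(),         lam),
    "uniform(a, b)":      (rng.uniform(a, b_, n).mean(),       (a + b_) / 2),
    "normal(mu, s^2)":    (rng.normal(mu, sigma, n).mean(),    mu),
    "gamma(alpha, beta)": (rng.gamma(alpha, beta, n).mean(),   alpha * beta),
    "exponential(beta)":  (rng.exponential(beta, n).mean(),    beta),
}
for name, (emp, theo) in checks.items():
    print(f"{name:20s} empirical={emp:6.3f}  theoretical={theo:6.3f}")
```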
Math Background: Probability Multivariate Distributions

Multivariate Distributions

Dealing with two random variables

fX,Y (x, y) = P (X = x, Y = y) : probability of X taking x and Y taking y

Example: Let X represent ‘Vertical Jump Height’ and Y represent the
player (either ‘SN’ or ‘LJ’).

P(X = 40 inches, Y = ‘LJ’)

probability that the vertical jump height is 40 inches AND the player is LJ

15 / 46
Math Background: Probability Multivariate Distributions

Multivariate Distributions: Marginal Distribution

Marginal distribution

P(X = x) = Σ_y P(X = x, Y = y),   P(Y = y) = Σ_x P(X = x, Y = y)

Example: Represents the probability of the vertical jump being x
(irrespective of who the player is).

Discrete Case:
f_X(x) = P(X = x) = Σ_y P(X = x, Y = y) = Σ_y f_{X,Y}(x, y)

Continuous Case:
f_X(x) = ∫ f_{X,Y}(x, y) dy

16 / 46
Math Background: Probability Multivariate Distributions

Multivariate Distributions: Conditional Distribution

Conditional distribution

P (X = x|Y = y)

Represents the probability that vertical jump height is x GIVEN that the
player is y.
P (Y = y|X = x)
Represents the probability that the player is y GIVEN that the vertical
jump height is x.

Computed as

f_{X|Y}(x|y) = P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y) = f_{X,Y}(x, y) / f_Y(y)

17 / 46
Math Background: Probability Multivariate Distributions

Multivariate Distributions: An Example


Vertical Jump Height (in)   Player   # of times
35                          LJ       20
35                          SN       10
20                          LJ        5
20                          SN       10

Jointly
P(X = 35, Y = LJ) = 20 / (20 + 10 + 5 + 10) = 20/45 = 4/9

Marginally
P(X = 35) = 30/45 = 2/3,   P(Y = SN) = 20/45 = 4/9

Conditionally
P(Y = LJ | X = 35) = 20 / (10 + 20) = 2/3, which is also (4/9) ÷ (2/3)
18 / 46
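The joint, marginal, and conditional probabilities on this slide can be reproduced directly from the raw counts; a minimal sketch:

```python
from fractions import Fraction

# Counts from the slide: (vertical jump height, player) -> # of times
counts = {(35, "LJ"): 20, (35, "SN"): 10, (20, "LJ"): 5, (20, "SN"): 10}
total = sum(counts.values())                 # 45

def joint(x, y):
    return Fraction(counts[(x, y)], total)

def marginal_x(x):
    return sum(joint(x, y) for (xx, y) in counts if xx == x)

def marginal_y(y):
    return sum(joint(x, yy) for (x, yy) in counts if yy == y)

def cond_y_given_x(y, x):
    return joint(x, y) / marginal_x(x)

print(joint(35, "LJ"))           # 4/9
print(marginal_x(35))            # 2/3
print(marginal_y("SN"))          # 4/9
print(cond_y_given_x("LJ", 35))  # 2/3, i.e. (4/9) / (2/3)
```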
Math Background: Probability Bayes Rule

Bayes Rule

Consider the Conditional:

P(X = x|Y = y) = P(X = x, Y = y) / P(Y = y)

This relationship can be used to derive the relationship between the two
conditionals P(X = x|Y = y) and P(Y = y|X = x). This is Bayes Rule,
i.e.

P(Y = y|X = x) = P(X = x, Y = y) / P(X = x) = P(X = x|Y = y) P(Y = y) / P(X = x)

Here we used the fact that

P(X = x, Y = y) = P(X = x|Y = y) P(Y = y)

19 / 46
Math Background: Probability Bayes Rule

Bayes Rule

Law of Total Probability: Relates the conditional to the marginal.

For the discrete case, say X takes values x1, x2, . . ., the marginal can
be written as

P(Y = y) = Σ_x P(X = x, Y = y) = Σ_x P(Y = y|X = x) P(X = x)

Therefore, we have

f_Y(y) = Σ_j f_{Y|X}(y|x_j) f_X(x_j)

20 / 46
Math Background: Probability Bayes Rule

Bayes Rule

(Simple Form)

P(X|Y) = P(Y|X) P(X) / P(Y)

(Discrete Random Variables)

f_{X|Y}(x_i|y) = f_{Y|X}(y|x_i) f_X(x_i) / Σ_j f_{Y|X}(y|x_j) f_X(x_j)

(Continuous Random Variables)

f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / ∫_x f_{Y|X}(y|x) f_X(x) dx

21 / 46
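A sketch of the discrete form applied to the jump-height example above, with the denominator expanded via the law of total probability (the priors and likelihoods are read off the earlier table):

```python
from fractions import Fraction

# Prior over the player Y and likelihoods of the jump height X given Y,
# read off the jump-height counts from the earlier slide.
p_y = {"LJ": Fraction(25, 45), "SN": Fraction(20, 45)}
p_x_given_y = {
    ("LJ", 35): Fraction(20, 25), ("LJ", 20): Fraction(5, 25),
    ("SN", 35): Fraction(10, 20), ("SN", 20): Fraction(10, 20),
}

def posterior(y, x):
    """P(Y = y | X = x) via Bayes rule; the denominator uses total probability."""
    num = p_x_given_y[(y, x)] * p_y[y]
    den = sum(p_x_given_y[(yj, x)] * p_y[yj] for yj in p_y)
    return num / den

print(posterior("LJ", 35))  # 2/3, matching the conditional computed earlier
```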
Math Background: Probability Independence of Random Variables

Independence

Independent Variables X and Y are independent if and only if:

P (X = x, Y = y) = P (X = x)P (Y = y)

or fX,Y (x, y) = fX (x)fY (y) for all values x and y.

IID variables: Independent and identically distributed (IID) random


variables are drawn from the same distribution and are all mutually
independent.

22 / 46
Math Background: Probability Independence of Random Variables

Correlation

Covariance:

cov(X, Y) = E[(X − µ_x)(Y − µ_y)]

Correlation coefficient:

corr(X, Y) = cov(X, Y) / (σ_x σ_y)

Independence ⇒ Uncorrelated (corr(X, Y) = 0).

However, the reverse is generally not true.

23 / 46
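A simulated illustration of why the reverse implication fails: assuming X is symmetric around zero and Y = X², the pair is uncorrelated yet clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)   # symmetric around zero
y = x**2                            # fully determined by x, hence dependent

# Sample correlation is ~0 even though Y is a deterministic function of X.
print(np.corrcoef(x, y)[0, 1])      # close to 0

# Dependence still shows up in conditional behaviour: E[Y | |X| > 2] != E[Y].
print(y.mean(), y[np.abs(x) > 2].mean())  # ~1.0 vs ~5.7
```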
Math Background: Statistics

Outline

1 Math Background: Probability

2 Math Background: Statistics


Motivation
Point Estimates
Bayesian Statistics

3 Math Background: Information Theory

4 Reading

24 / 46
Math Background: Statistics Motivation

Model fitting from Data


The process of estimating parameters θ from D is called model fitting, or
training, and is at the heart of machine learning.

Fitting a line (linear model) to data vs. fitting a polynomial (degree d > 1).

BUT
How do we know which model to fit?
How do we learn the parameters of these models?
25 / 46
Math Background: Statistics Motivation

Statistics: Preliminaries

Suppose X1, . . . , XN are random variables; you may have seen empirical
estimates of the mean and variance as follows:

Sample Mean:
X̄ = (1/N) Σ_{i=1}^N X_i

Sample Variance:
S²_{N−1} = (1/(N − 1)) Σ_{i=1}^N (X_i − X̄)².

26 / 46
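A small simulation sketch of why the 1/(N − 1) factor matters: averaging each estimator over many repeated samples approximates its expectation, and dividing by N instead of N − 1 underestimates the true variance (the Gaussian data and parameter values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, N, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
xbar = samples.mean(axis=1, keepdims=True)

s2_unbiased = ((samples - xbar) ** 2).sum(axis=1) / (N - 1)   # S^2_{N-1}
s2_biased   = ((samples - xbar) ** 2).sum(axis=1) / N         # divide by N instead

# Averaging over many repetitions approximates each estimator's expectation.
print(s2_unbiased.mean())  # ~4.0  (unbiased)
print(s2_biased.mean())    # ~3.2  (= (N-1)/N * 4.0, biased low)
```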
Math Background: Statistics Point Estimates

Point Estimation

Definition: The point estimator θ̂_N is a function of samples X1, . . . , XN
that approximates a parameter θ* of the distribution of the X_i.

Sample Bias: The bias of an estimator is

bias(θ̂_N) = E_θ[θ̂_N] − θ*

An estimator is unbiased if E_θ[θ̂_N] = θ*.

Why care about unbiasedness? If an estimator is unbiased then it gives
you the exact parameter of interest (in expectation)!

27 / 46
Math Background: Statistics Point Estimates

Unbiased Estimators

Is unbiasedness enough?

Example: An estimator that looks at only one datapoint is also unbiased.

θ̂(D) = x1

What is the problem with it? It will not generalize to new data!

28 / 46
Math Background: Statistics Point Estimates

Variance is also important

So the variance of an estimator is also important:

Var[θ̂] := E[θ̂²] − (E[θ̂])²

How low can this go? This is answered by the celebrated Cramer-Rao
Lower Bound. (Out of the scope of this course)

29 / 46
Math Background: Statistics Point Estimates

Bias-Variance Trade-off
A fundamental trade-off that needs to be made when picking a method for
parameter estimation, assuming that the goal is to minimize the mean
squared error (MSE) of an estimate.

Let
θ* be the true parameter
θ̂ = θ̂(D) denote the estimator
θ̄ = E[θ̂] be its expected value
mean squared error (MSE) := E[(θ̂ − θ*)²]

E[(θ̂ − θ*)²] = Var[θ̂] + Bias²(θ̂)

Show? (A short derivation is sketched below.)
Take-away: Assuming our goal is to minimize squared error, it might be
wise to use a biased estimator, so long as it reduces our variance by more
than the square of the bias.
30 / 46
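One way to fill in the “Show?” step, using θ̄ = E[θ̂] and adding and subtracting θ̄ inside the square:

E[(θ̂ − θ*)²] = E[((θ̂ − θ̄) + (θ̄ − θ*))²]
             = E[(θ̂ − θ̄)²] + 2(θ̄ − θ*) E[θ̂ − θ̄] + (θ̄ − θ*)²
             = Var[θ̂] + 0 + Bias²(θ̂),

since E[θ̂ − θ̄] = 0 and θ̄ − θ* is a constant equal to the bias.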
Math Background: Statistics Point Estimates

Maximum Likelihood Estimation

Motivation: Most machine learning training boils down to an
optimization problem of the form

θ̂ = argmin_θ L(θ)

Maximum Likelihood Estimation picks the parameters θ that assign the
highest probability to the training data:

θ̂_mle := argmax_θ p(D|θ)

This is a point estimate, since it returns a single value for θ.

31 / 46
Math Background: Statistics Point Estimates

Maximum Likelihood Estimation

We usually assume the training examples D = {(x_n, y_n)} for
n ∈ {1, . . . , N} are independently sampled from the same distribution (iid
assumption), so the (conditional) likelihood p(D|θ) becomes

p(D|θ) := Π_{n=1}^N p(y_n|x_n, θ)

It is easier to work with the log likelihood:

θ̂_mle := argmax_θ Σ_{n=1}^N log p(y_n|x_n, θ)

32 / 46
Math Background: Statistics Point Estimates

Maximum Likelihood Estimation

We prefer positing this as a minimization of the negative log likelihood
NLL(θ):

θ̂_mle = argmin_θ NLL(θ)                               (1)
      = argmin_θ − Σ_{n=1}^N log p(y_n|x_n, θ)         (2)

Aside: It can be shown that the MLE achieves the Cramér-Rao lower
bound, and hence has the smallest asymptotic variance of any unbiased
estimator (asymptotically optimal).

33 / 46
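A minimal sketch of MLE for a Bernoulli (coin) model: the NLL is evaluated on a grid of θ values and its minimizer is compared with the closed-form MLE, the sample mean (the flips below are made-up data, assuming the iid setting above).

```python
import numpy as np

# iid coin flips (1 = 'H', 0 = 'T'); the model is Bernoulli(theta).
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def nll(theta):
    """Negative log likelihood: -sum_n log p(y_n | theta)."""
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

grid = np.linspace(1e-3, 1 - 1e-3, 1_000)
theta_mle = grid[np.argmin([nll(t) for t in grid])]

print(theta_mle)   # ~0.7, the grid point minimizing the NLL
print(y.mean())    # 0.7, the closed-form MLE for a Bernoulli
```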
Math Background: Statistics Point Estimates

Empirical Risk Minimization

We can generalize MLE by replacing the (conditional) log loss term in
NLL(θ) (1) with any loss function ℓ(y_n, θ; x_n) to get

L(θ) = (1/N) Σ_{n=1}^N ℓ(y_n, θ; x_n)                  (3)

This is known as empirical risk minimization or ERM, since it is the
expected loss where the expectation is taken w.r.t. the empirical
distribution.

34 / 46
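A tiny sketch of ERM with squared loss for a constant predictor θ: the empirical risk minimizer coincides with the sample mean (the data values are illustrative).

```python
import numpy as np

# Empirical risk with squared loss for a constant predictor theta:
# L(theta) = (1/N) * sum_n (y_n - theta)^2
y = np.array([2.0, 3.5, 4.0, 1.5, 3.0])

def empirical_risk(theta):
    return np.mean((y - theta) ** 2)

grid = np.linspace(0.0, 6.0, 6_001)
theta_erm = grid[np.argmin([empirical_risk(t) for t in grid])]

print(theta_erm, y.mean())  # both ~2.8: squared-loss ERM picks the sample mean
```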
Math Background: Statistics Point Estimates

Maximum A Posteriori (MAP) Estimation


Problem with MLE and ERM is overfitting: they try to pick parameters
that minimize loss on the training set, but this may not result in a model
that has low loss on future data.

Main solution: Add a penalty term to NLL(θ) (MLE) or the ERM objective.

Maximum A Posteriori (MAP) Estimation: This penalty takes the
form of a prior p(θ) on the parameters θ, giving

L(θ) = −(log p(D|θ) + log p(θ))                        (4)

Minimizing this objective is equivalent to maximizing the log posterior:

θ̂ = argmax_θ log p(θ|D) = argmax_θ [log p(D|θ) + log p(θ) − const]   (5)

35 / 46
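A sketch of MAP for the same Bernoulli coin model, assuming a Beta(a, b) prior as the penalty term; the hyperparameters a = b = 2 are an illustrative choice, and the closed-form Beta-Bernoulli MAP estimate is used as a check.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # same coin flips as the MLE sketch
a, b = 2.0, 2.0                                 # Beta(a, b) prior (illustrative choice)

def neg_log_posterior(theta):
    """-(log p(D|theta) + log p(theta)), dropping theta-independent constants."""
    log_lik   = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
    log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
    return -(log_lik + log_prior)

grid = np.linspace(1e-3, 1 - 1e-3, 1_000)
theta_map = grid[np.argmin([neg_log_posterior(t) for t in grid])]

# Closed form for the Beta-Bernoulli case: (sum(y) + a - 1) / (N + a + b - 2).
print(theta_map, (y.sum() + a - 1) / (len(y) + a + b - 2))  # both ~0.667,
# pulled from the MLE (0.7) toward the prior mean (0.5) by the penalty.
```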
Math Background: Statistics Bayesian Statistics

Relationship to Bayesian Inference


The point-estimate way of thinking does not consider the uncertainty in
the parameters themselves.
Inference: Modeling uncertainty about parameters using a probability
distribution (as opposed to just computing a point estimate).
The Bayesian way of doing this is by using the posterior distribution
p(θ|D), and for this we:
start with the prior distribution p(θ)
define a likelihood function p(D|θ)
use Bayes Rule to compute the posterior

p(θ|D) = p(θ) p(D|θ) / p(D)                            (6)

where the marginal likelihood p(D) is computed by marginalizing over the
unknown θ.
Methods such as variational inference originate from this.
36 / 46
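A sketch of the full Bayesian computation on a grid of θ values: the posterior is a distribution rather than a single number, and the marginal likelihood p(D) is obtained by (numerically) marginalizing over θ. Same coin data and illustrative Beta(2, 2) prior as in the MAP sketch.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
theta = np.linspace(1e-3, 1 - 1e-3, 1_000)
dtheta = theta[1] - theta[0]

prior = theta ** (2 - 1) * (1 - theta) ** (2 - 1)       # unnormalized Beta(2, 2)
prior /= np.sum(prior) * dtheta                         # normalize on the grid
likelihood = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())

# Bayes rule on the grid: posterior = prior * likelihood / p(D),
# where p(D) is the integral of prior * likelihood over theta.
evidence = np.sum(prior * likelihood) * dtheta
posterior = prior * likelihood / evidence

print(np.sum(posterior) * dtheta)           # ~1.0: a full distribution over theta
print(np.sum(theta * posterior) * dtheta)   # posterior mean ~ (7+2)/(10+4) = 0.643
```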
Math Background: Information Theory

Outline

1 Math Background: Probability

2 Math Background: Statistics

3 Math Background: Information Theory


Motivation

4 Reading

37 / 46
Math Background: Information Theory Motivation

Learning to Generate Data

We compare distributions when learning generative models!

Learning in Generative Adversarial Networks

38 / 46
Math Background: Information Theory Motivation

The Discrete Case: Shannon Entropy

Suppose X is a random variable which can take one of m values
x1, . . . , x_m, with probability P(X = x_i) = p_i for i ∈ {1, . . . , m}.
Entropy (Shannon Entropy) is the average amount of surprise in a
random variable's outcome:

H(X) = − Σ_{i=1}^m p_i log_b p_i

“High entropy” means X comes from a uniform (boring) distribution;

“Low entropy” means X comes from a varied (peaks and valleys)
distribution.

39 / 46
Math Background: Information Theory Motivation

Shannon Entropy: An Example


Entropy of a fair coin flip
For a fair coin, p_i = 0.5:

H(X) = − Σ_{i=1}^m p_i log_b p_i
     = − Σ_{i=1}^2 0.5 log_2 0.5
     = 1

Entropy of a biased coin flip
Say p_1 = 0.8:

H(X) = − Σ_{i=1}^m p_i log_b p_i
     = − 0.8 log_2 0.8 − 0.2 log_2 0.2
     = 0.7219

(Figure: entropy of coin flips.)
40 / 46
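The two entropies above can be reproduced with a few lines; a minimal sketch:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum_i p_i log_b p_i (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit for a fair coin
print(entropy([0.8, 0.2]))   # ~0.7219 bits for the biased coin
```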
Math Background: Information Theory Motivation

Conditional Entropy and Mutual Information

Conditional Entropy is the remaining entropy of a random variable Y given that
the value of another random variable X is known:

H(Y|X) = Σ_{i=1}^m p(X = x_i) H(Y|X = x_i) = − Σ_{i=1}^m Σ_{j=1}^n p(x_i, y_j) log p(y_j|x_i)

Mutual Information: if Y must be transmitted, what is the amount of information
about Y that still has to be transmitted if both sides know X?
I.e., how many bits on average would be saved if both ends of the line knew X?

I(Y; X) = H(Y) − H(Y|X).

Notice that I(Y; X) = I(X; Y) = H(X) + H(Y) − H(X, Y).
These concepts are directly used in learning Decision Trees.

41 / 46
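A sketch computing H(Y|X) and I(Y; X) from a small joint table, reusing the jump-height counts from earlier as an (illustrative) joint distribution over X = height and Y = player; the two expressions for the mutual information agree.

```python
import math
import numpy as np

# Joint p(x, y) from the jump-height table: rows X in {35, 20}, cols Y in {LJ, SN}.
p_xy = np.array([[20, 10],
                 [ 5, 10]], dtype=float)
p_xy /= p_xy.sum()              # normalize counts to probabilities

p_x = p_xy.sum(axis=1)          # marginal of X
p_y = p_xy.sum(axis=0)          # marginal of Y

def H(probs):
    return -sum(p * math.log2(p) for p in np.ravel(probs) if p > 0)

# H(Y|X) = -sum_{i,j} p(x_i, y_j) log p(y_j | x_i)
H_y_given_x = -sum(p_xy[i, j] * math.log2(p_xy[i, j] / p_x[i])
                   for i in range(2) for j in range(2) if p_xy[i, j] > 0)

I_xy = H(p_y) - H_y_given_x                 # I(Y; X) = H(Y) - H(Y|X)
print(H_y_given_x, I_xy)
print(H(p_x) + H(p_y) - H(p_xy))            # same value via H(X) + H(Y) - H(X, Y)
```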
Math Background: Information Theory Motivation

Kullback-Leibler (KL) Divergence

Kullback-Leibler divergence is a measure of distance between two distributions:
a “true” distribution p(X) and an arbitrary distribution q(X).

KL(p||q) = Σ_x p(x) log (p(x)/q(x))

Cross Entropy is the expected number of bits we need to represent data
coming from distribution p if we use distribution q:

H_ce(p, q) = − Σ_{i=1}^m p_i log_b q_i

42 / 46
Math Background: Information Theory Motivation

Relationship between KL Divergence and Cross-Entropy

Kullback-Leibler divergence can be written as

KL(p||q) = Σ_x p(x) log (p(x)/q(x))
         = Σ_x p(x) log p(x) − Σ_x p(x) log q(x)
         = −H(p) + H_ce(p, q)

where p(x) is the “true” distribution and q(x) is an arbitrary distribution.

Cross Entropy is the expected number of bits we need to represent data
coming from distribution p if we use distribution q:

H_ce(p, q) = − Σ_x p(x) log_b q(x)

Therefore, KL Divergence is the extra bits needed when using distribution q to
represent p. It is essentially a way to compare distributions!

43 / 46
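A quick numerical check of KL(p||q) = H_ce(p, q) − H(p), using the biased coin from earlier as p and a fair coin as an arbitrary q:

```python
import math

p = [0.8, 0.2]          # "true" distribution (the biased coin from earlier)
q = [0.5, 0.5]          # arbitrary model distribution (a fair coin)

kl      = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
h_p     = -sum(pi * math.log2(pi) for pi in p if pi > 0)
h_cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

print(kl)                # ~0.2781 extra bits per symbol
print(h_cross - h_p)     # same value: KL(p||q) = H_ce(p, q) - H(p)
```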
Reading

Outline

1 Math Background: Probability

2 Math Background: Statistics

3 Math Background: Information Theory

4 Reading

44 / 46
Reading

Reading

PiML1
Chapter 2 – 2.1, 2.2 (2.2.1 - 2.2.4, 2.2.5.1 - 2.2.5.3), and 2.3.
Chapter 3 – Eq (3.1), eq.(3.7), sections 3.1.3, and 3.1.4
Chapter 4 – Section 4.1, 4.2.1, 4.2.2, 4.3 (intro), 4.5 (intro about
MAP), 4.6 (intro), 4.7.6 (intro), 4.7.6.1, 4.7.6.2,
Chapter 5 – Section 5.1.6.1
Chapter 6 – 6.1 (6.1.1-6.1.4, 6.1.6. (intro)), 6.2 (6.2.1-6.2.2), 6.3
(6.3.1 - 6.3.2)

45 / 46
Reading

References I

[1] Alec Radford et al. “Learning Transferable Visual Models From


Natural Language Supervision”. In: Proceedings of the 38th
International Conference on Machine Learning. Ed. by Marina Meila
and Tong Zhang. Vol. 139. Proceedings of Machine Learning
Research. PMLR, 18–24 Jul 2021, pp. 8748–8763. url:
https://proceedings.mlr.press/v139/radford21a.html.
[2] Aditya Ramesh et al. “Hierarchical text-conditional image generation
with clip latents”. In: arXiv preprint arXiv:2204.06125 (2022).
[3] Robin Rombach et al. High-Resolution Image Synthesis with Latent
Diffusion Models. 2021. arXiv: 2112.10752 [cs.CV].

46 / 46
