Sirisha Rambhatla
University of Waterloo
Lecture 2
Math Background: Probability, Statistics and Information Theory
Outline
1 Math Background: Probability
2 Math Background: Statistics
3 Math Background: Information Theory
4 Reading
Math Background: Probability
Math Background: Probability Motivation
https://dalledemo.com/ [1, 2]
https://stablediffusion.fr/ demo GitHub [3]
Math Background: Probability Definitions
Probability Axioms
1 Non-negativity: $P(A) \geq 0$ for any event $A$.
2 Normalization: $P(\Omega) = 1$, where $\Omega$ is the sample space.
3 Mutually exclusive events: If $A_1, A_2, \ldots$ are disjoint, then
$P(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$
disjoint: $A_i \wedge A_j = \emptyset$ for $i \neq j$, where $\wedge$ is logical AND.
$\cup$ is "union", which can be represented by $\vee$, logical OR.
Helps us to analyze events of interest and convert logical operations
into arithmetic
Math Background: Probability Definitions
$P(A \vee B) = P(A) + P(B) - P(A \wedge B)$
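A quick numerical check of this identity (a minimal sketch assuming Python with numpy; the die and the events A, B are illustrative choices, not from the slides):

import numpy as np

# Simulate a fair six-sided die and check P(A or B) = P(A) + P(B) - P(A and B).
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1_000_000)
A = np.isin(rolls, [1, 2, 3])          # event A = {1, 2, 3}
B = np.isin(rolls, [3, 4])             # event B = {3, 4}; overlaps A at {3}
lhs = (A | B).mean()                   # P(A or B): logical OR becomes arithmetic
rhs = A.mean() + B.mean() - (A & B).mean()
print(lhs, rhs)                        # both approximately 4/6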
Math Background: Probability Random Variables
Random Variables
Two types: Discrete (e.g., Bernoulli, as in a coin toss) and Continuous (e.g., Gaussian).
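As a minimal illustration (assuming Python with numpy; not from the slides), one can draw samples of both kinds:

import numpy as np

rng = np.random.default_rng(0)
discrete = rng.binomial(n=1, p=0.5, size=5)          # Bernoulli(0.5): coin tosses in {0, 1}
continuous = rng.normal(loc=0.0, scale=1.0, size=5)  # standard Gaussian draws
print(discrete, continuous)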
Math Background: Probability Random Variables
[Figure] 3 observations from the coin toss experiments (note that each experiment involves tossing the coin twice).
Math Background: Probability Distribution Function
Expectation
Expected Values
Of a function g(·) of a discrete r.v. X,
$E[g(X)] = \sum_{x \in \mathcal{X}} g(x) f(x)$
e.g., for a fair coin with $g(x) = x$: $\mu = 1 \times P(X = 1) + 0 \times P(X = 0) = 1/2$
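A minimal sketch of this sum in plain Python (the names are ours, for illustration):

# Fair coin: X in {0, 1} with pmf f(x) = 0.5 each; g(x) = x recovers the mean.
f = {0: 0.5, 1: 0.5}
g = lambda x: x
mu = sum(g(x) * p for x, p in f.items())
print(mu)  # 0.5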
Math Background: Probability Distribution Function
Expectation
Mean and Variance: $\mu = E[X]$ is the mean; $\mathrm{var}[X] = E[(X - \mu)^2]$ is the variance. Hence, we have $\mathrm{var}[X] = E[X^2] - \mu^2$.
Linearity of expectation: $E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$.
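Both facts are easy to check by simulation (a sketch assuming numpy; the chosen means and variances are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(2.0, 3.0, size=1_000_000)    # mean 2, standard deviation 3
Y = rng.normal(5.0, 1.0, size=1_000_000)
mu = X.mean()
print(X.var(), (X**2).mean() - mu**2)       # var[X] = E[X^2] - mu^2
print((X + Y).mean(), X.mean() + Y.mean())  # E[X + Y] = E[X] + E[Y]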
Math Background: Probability Distribution Function
Common Distributions
Math Background: Probability Multivariate Distributions
Multivariate Distributions
$P(X = 40 \text{ inches}, Y = \text{'LJ'})$
probability that the vertical jump height is 40 inches AND the player is LJ
Math Background: Probability Multivariate Distributions
Marginal distribution
$P(X = x) = \sum_y P(X = x, Y = y), \qquad P(Y = y) = \sum_x P(X = x, Y = y)$
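As a minimal numpy sketch (the joint table values are made up for illustration):

import numpy as np

# Hypothetical joint P(X = x, Y = y): rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
P_X = joint.sum(axis=1)     # sum over y: P(X = x)
P_Y = joint.sum(axis=0)     # sum over x: P(Y = y)
print(P_X, P_Y, P_X.sum())  # marginals; total mass is 1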
Math Background: Probability Multivariate Distributions
Conditional distribution
P (X = x|Y = y)
Represents the probability that vertical jump height is x GIVEN that the
player is y.
P (Y = y|X = x)
Represents the probability that the player is y GIVEN that the vertical
jump height is x.
Computed as
$f_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)} = \dfrac{f_{X,Y}(x, y)}{f_Y(y)}$
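Continuing the hypothetical joint table from the marginals sketch above, conditioning on $Y = y$ just normalizes each column by $P(Y = y)$:

import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
P_Y = joint.sum(axis=0)
cond = joint / P_Y          # f_{X|Y}(x|y): one column per value of y
print(cond.sum(axis=0))     # each column is a distribution over x (sums to 1)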
Math Background: Probability Multivariate Distributions
Bayes Rule
$P(X = x \mid Y = y) = \dfrac{P(X = x, Y = y)}{P(Y = y)}$
This can be used to derive the relationship between the two conditionals $P(X = x \mid Y = y)$ and $P(Y = y \mid X = x)$. This is Bayes Rule, i.e.
$P(Y = y \mid X = x) = \dfrac{P(X = x, Y = y)}{P(X = x)} = \dfrac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)}$
Math Background: Probability Bayes Rule
Bayes Rule
For the discrete case, say X takes values $x_1, x_2, \ldots$; the marginal can be written as
$P(Y = y) = \sum_x P(X = x, Y = y) = \sum_x P(Y = y \mid X = x)\, P(X = x)$
Therefore, we have
$f_Y(y) = \sum_j f_{Y|X}(y \mid x_j)\, f_X(x_j)$
Math Background: Probability Bayes Rule
Bayes Rule
(Simple Form)
$P(X \mid Y) = \dfrac{P(Y \mid X)\, P(X)}{P(Y)}$
(Continuous Random Variables)
$f_{X|Y}(x \mid y) = \dfrac{f_{Y|X}(y \mid x)\, f_X(x)}{\int f_{Y|X}(y \mid x)\, f_X(x)\, dx}$
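A minimal worked example of the simple form (plain Python; the prior and likelihood numbers are invented for illustration):

# Binary state X with prior P(X = 1) = 0.3 and a noisy binary observation Y.
prior = {0: 0.7, 1: 0.3}
lik = {0: {0: 0.9, 1: 0.1},   # P(Y = y | X = 0)
       1: {0: 0.2, 1: 0.8}}   # P(Y = y | X = 1)
y = 1
P_y = sum(lik[x][y] * prior[x] for x in prior)           # marginal P(Y = y)
posterior = {x: lik[x][y] * prior[x] / P_y for x in prior}
print(posterior)  # P(X = x | Y = 1); approx {0: 0.226, 1: 0.774}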
Math Background: Probability Independence of Random Variables
Independence
$P(X = x, Y = y) = P(X = x)\, P(Y = y)$
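Equivalently, a joint table describes independent X and Y exactly when it equals the outer product of its marginals; a numpy sketch with a made-up table:

import numpy as np

joint = np.outer([0.4, 0.6], [0.5, 0.3, 0.2])  # constructed to factorize
P_X, P_Y = joint.sum(axis=1), joint.sum(axis=0)
print(np.allclose(joint, np.outer(P_X, P_Y)))  # True: X and Y independent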
Math Background: Probability Independence of Random Variables
Correlation
Covariance: $\mathrm{cov}[X, Y] := E[(X - E[X])(Y - E[Y])]$
Correlation coefficient: $\rho := \mathrm{cov}[X, Y] / \sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}$
Math Background: Statistics
Math Background: Statistics Motivation
BUT
How do we know which model to fit?
How do we learn the parameters of these models?
Math Background: Statistics Motivation
Statistics: Preliminaries
Sample Mean:
$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$
Sample Variance:
$S_{N-1}^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2$
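These map directly onto numpy (a sketch; the data vector is arbitrary):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.mean())        # sample mean X-bar
print(x.var(ddof=1))   # sample variance with the 1/(N - 1) factor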
Math Background: Statistics Point Estimates
Point Estimation
$\mathrm{bias}(\hat{\theta}_N) = E_\theta[\hat{\theta}_N] - \theta^*$
Math Background: Statistics Point Estimates
Unbiased Estimators
Is unbiasedness enough?
$\hat{\theta}(\mathcal{D}) = x_1$
What is the problem with it? It will not generalize to new data!
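A small simulation makes the problem concrete (a sketch assuming numpy; Gaussian data with true mean $\theta^* = 1$ is an illustrative choice): the "return $x_1$" estimator is unbiased, but its variance never shrinks with N.

import numpy as np

rng = np.random.default_rng(0)
theta_star, N, trials = 1.0, 100, 10_000
D = rng.normal(theta_star, 1.0, size=(trials, N))
first_obs, sample_mean = D[:, 0], D.mean(axis=1)
print(first_obs.mean(), sample_mean.mean())  # both near theta*: unbiased
print(first_obs.var(), sample_mean.var())    # about 1 vs about 1/N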
Math Background: Statistics Point Estimates
$\mathrm{Var}[\hat{\theta}] := E[\hat{\theta}^2] - E[\hat{\theta}]^2$
How low can this go? This is answered by the celebrated Cramér-Rao lower bound (beyond the scope of this course).
Math Background: Statistics Point Estimates
Bias-Variance Trade-off
A fundamental trade-off that needs to be made when picking a method for parameter estimation, assuming that the goal is to minimize the mean squared error (MSE) of an estimate.
Let
θ∗ be the true parameter
θ̂ = θ̂(D) denote the estimator
θ̄ = E[θ̂] be its expected value
mean squared error: $\mathrm{MSE} := E[(\hat{\theta} - \theta^*)^2]$
Estimators are often obtained by minimizing a loss $L(\theta)$:
$\hat{\theta} = \operatorname{argmin}_{\theta} L(\theta)$
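With the definitions above, $\mathrm{MSE} = \mathrm{var}[\hat{\theta}] + \mathrm{bias}(\hat{\theta})^2$; a quick numerical check (a sketch; the shrunk estimator $0.9\,\bar{X}$ is a made-up biased example):

import numpy as np

rng = np.random.default_rng(0)
theta_star, N, trials = 2.0, 20, 200_000
est = 0.9 * rng.normal(theta_star, 1.0, size=(trials, N)).mean(axis=1)
mse = ((est - theta_star) ** 2).mean()
bias = est.mean() - theta_star
print(mse, est.var() + bias**2)  # MSE = variance + bias^2: the two agree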
Math Background: Statistics Point Estimates
Likelihood:
$p(\mathcal{D} \mid \theta) := \prod_{n=1}^{N} p(y_n \mid x_n, \theta)$
Maximum likelihood estimate:
$\hat{\theta}_{\mathrm{mle}} := \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)$
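A minimal sketch for a Bernoulli model (ignoring the covariates $x_n$ for simplicity; numpy assumed, and grid search used purely for transparency):

import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=200)        # data from a Bernoulli(0.7)
thetas = np.linspace(0.01, 0.99, 99)
loglik = y.sum() * np.log(thetas) + (len(y) - y.sum()) * np.log(1 - thetas)
print(thetas[loglik.argmax()], y.mean())  # grid MLE matches the sample mean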
Math Background: Statistics Point Estimates
Aside: It can be shown that the MLE achieves the Cramér-Rao lower bound, and hence has the smallest asymptotic variance of any unbiased estimator (it is asymptotically optimal).
Math Background: Statistics Point Estimates
$\hat{\theta} = \operatorname{argmax}_{\theta} \log p(\theta \mid \mathcal{D}) = \operatorname{argmax}_{\theta} \left[ \log p(\mathcal{D} \mid \theta) + \log p(\theta) - \mathrm{const} \right] \quad (5)$
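A sketch of MAP vs. MLE for the Bernoulli example above, adding a log prior before maximizing as in Eq. (5) (the Beta(2, 2) prior and the data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=20)
thetas = np.linspace(0.01, 0.99, 99)
loglik = y.sum() * np.log(thetas) + (len(y) - y.sum()) * np.log(1 - thetas)
logprior = np.log(thetas) + np.log(1 - thetas)  # Beta(2, 2), up to a constant
print(thetas[loglik.argmax()], thetas[(loglik + logprior).argmax()])  # MLE vs MAP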
Math Background: Statistics Bayesian Statistics
Math Background: Information Theory Motivation
Entropy
$H(X) = -\sum_{i=1}^{m} p_i \log_b p_i$
Entropy of coin flips:
Fair coin ($p_1 = p_2 = 0.5$):
$H(X) = -\sum_{i=1}^{2} 0.5 \log_2 0.5 = 1$
Biased coin ($p_1 = 0.8$, $p_2 = 0.2$):
$H(X) = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 = 0.7219$
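A small helper reproduces both numbers (a sketch assuming numpy; the function name is ours):

import numpy as np

def entropy(p, base=2):
    """H(X) = -sum_i p_i log_b p_i for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 log 0 = 0
    return float(-(p * np.log(p) / np.log(base)).sum())

print(entropy([0.5, 0.5]))   # 1.0 bit (fair coin)
print(entropy([0.8, 0.2]))   # about 0.7219 bits (biased coin)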
Reading
PiML
Chapter 2 – 2.1, 2.2 (2.2.1 - 2.2.4, 2.2.5.1 - 2.2.5.3), and 2.3
Chapter 3 – Eq. (3.1), Eq. (3.7), Sections 3.1.3 and 3.1.4
Chapter 4 – Sections 4.1, 4.2.1, 4.2.2, 4.3 (intro), 4.5 (intro about MAP), 4.6 (intro), 4.7.6 (intro), 4.7.6.1, 4.7.6.2
Chapter 5 – Section 5.1.6.1
Chapter 6 – 6.1 (6.1.1 - 6.1.4, 6.1.6 (intro)), 6.2 (6.2.1 - 6.2.2), 6.3 (6.3.1 - 6.3.2)
References I