Lecture 2: Introduction to Probability and Statistics
Email: [email protected]
URL: https://www.zabaras.com/
Empirical distribution
We call 𝑥 = 𝑋(𝜔), 𝜔 ∈ Ω, a realization of 𝑋.
Probability density
$$P_X(B) = \int_B p_X(x)\,dx$$

We often write:

$$p_X(x) \equiv p(x)$$
Cumulative Distribution Function
$$p(x) \ge 0, \qquad F(z) = \int_{-\infty}^{z} p(x)\,dx \quad \text{(cumulative distribution function)}$$

$$\int_{-\infty}^{\infty} p(x)\,dx = 1$$

$$P\big(x \in (a,b)\big) = \int_a^b p(x)\,dx = F(b) - F(a)$$
The CDF for a random variable 𝑋 is the function 𝐹(𝑥) that returns the
probability that 𝑋 is less than 𝑥.
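As a quick numerical check (a minimal Python sketch; the exponential density and the interval are illustrative choices, not from the slides), one can verify that 𝑃(𝑥 ∈ (𝑎, 𝑏)) = 𝐹(𝑏) − 𝐹(𝑎):

```python
import numpy as np

# Illustrative density on [0, inf): p(x) = exp(-x), with CDF F(z) = 1 - exp(-z).
p = lambda x: np.exp(-x)
F = lambda z: 1.0 - np.exp(-z)

a, b = 0.5, 2.0
x = np.linspace(a, b, 100_001)
px = p(x)
# Trapezoid rule for P(x in (a, b)) = integral of p(x) over (a, b)
prob = np.sum(0.5 * (px[1:] + px[:-1]) * np.diff(x))

print(prob)         # ~0.4712
print(F(b) - F(a))  # exp(-0.5) - exp(-2) ~ 0.4712
```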
Expectation

$$\mathbb{E}[X] = \int x\, p_X(x)\,dx, \qquad \text{std}[X] = \sqrt{\text{var}[X]}$$

$$\mathbb{E}[T(X)] = \int T(x)\, p_X(x)\,dx$$

For a discrete random variable:

$$\mathbb{E}[f] = \sum_x p(x)\, f(x)$$
Conditional expectation
$$\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x) \qquad \text{(discrete)}$$

$$\mathbb{E}_x[f \mid y] = \int p(x \mid y)\, f(x)\,dx \qquad \text{(continuous)}$$
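A minimal NumPy sketch of these definitions (the pmf values below are illustrative, not from the lecture):

```python
import numpy as np

# Discrete example: values of X and their probabilities p(x)
x_vals = np.array([0, 1, 2])
p_x    = np.array([0.2, 0.5, 0.3])

mean_x = np.sum(x_vals * p_x)                  # E[X] = sum_x x p(x)
var_x  = np.sum((x_vals - mean_x)**2 * p_x)    # var[X]
std_x  = np.sqrt(var_x)                        # std[X] = sqrt(var[X])

# Conditional expectation E_x[f | y] = sum_x p(x|y) f(x) for some fixed y
p_x_given_y = np.array([0.1, 0.3, 0.6])        # an illustrative conditional pmf
f = lambda x: x**2
cond_mean_f = np.sum(p_x_given_y * f(x_vals))

print(mean_x, std_x, cond_mean_f)
```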
Uniform Random Variable
Consider the uniform random variable 𝒰(𝑥 | 𝑎, 𝑏):

$$\mathcal{U}(x \mid a,b) = \frac{1}{b-a}\,\mathbb{I}(a \le x \le b)$$

What is the CDF of a uniform random variable 𝒰(𝑥 | 0, 1)?

$$P_U(x) = \begin{cases} x, & \text{for } 0 \le x \le 1 \\ 1, & \text{for } x > 1 \\ 0, & \text{otherwise} \end{cases}$$
You can show that the mean, 2nd moment, and variance of 𝒰(𝑥 | 𝑎, 𝑏) are:

$$\mathbb{E}[x \mid a,b] = \frac{a+b}{2}, \qquad \mathbb{E}[x^2 \mid a,b] = \frac{a^2 + ab + b^2}{3}, \qquad \text{var}[x \mid a,b] = \frac{(b-a)^2}{12}$$
Note that it is possible for 𝑝(𝑥) > 1 but the density still needs to integrate to 1.
For example, note that

$$\mathcal{U}(x \mid 0, 1/2) = 2\,\mathbb{I}(0 \le x \le 1/2)$$
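A quick Monte Carlo check of the moment formulas above (a sketch; the endpoints 𝑎 = 2, 𝑏 = 5 and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x = rng.uniform(a, b, size=1_000_000)

# Compare sample statistics against the closed-form moments of U(x | a, b).
print(x.mean(),      (a + b) / 2)                # mean: (a+b)/2
print((x**2).mean(), (a*a + a*b + b*b) / 3)      # 2nd moment: (a^2+ab+b^2)/3
print(x.var(),       (b - a)**2 / 12)            # variance: (b-a)^2/12
```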
The Gaussian Distribution
A random variable 𝑋 ∈ ℝ is Gaussian or normally distributed, 𝑋 ~ 𝒩(𝜇, 𝜎²), if:
$$P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx$$
To show that this density is normalized note the following trick:
Let $I = \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) dx$. Then:

$$I^2 = \int\!\!\int \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) \exp\left(-\frac{1}{2\sigma^2}(y-\mu)^2\right) dx\,dy$$

Set $r^2 = (x-\mu)^2 + (y-\mu)^2$ and change to polar coordinates. Then:

$$I^2 = \int_0^{2\pi}\!\!\int_0^{\infty} \exp\left(-\frac{r^2}{2\sigma^2}\right) r\,dr\,d\theta = 2\pi \int_0^{\infty} \exp\left(-\frac{r^2}{2\sigma^2}\right) r\,dr$$

Thus, substituting $u = r^2$:

$$I^2 = \pi \int_0^{\infty} \exp\left(-\frac{u}{2\sigma^2}\right) du = 2\pi\sigma^2, \qquad I = \sqrt{2\pi\sigma^2}$$
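A numerical sanity check of this normalization (a sketch; 𝜇 = 1, 𝜎 = 2 are arbitrary values): the trapezoid-rule integral should approach √(2𝜋𝜎²):

```python
import numpy as np

mu, sigma = 1.0, 2.0
x = np.linspace(mu - 10*sigma, mu + 10*sigma, 200_001)
integrand = np.exp(-0.5 * ((x - mu) / sigma)**2)

# Trapezoid rule for I = integral of exp(-(x-mu)^2 / (2 sigma^2)) dx
I = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x))

print(I)                                  # ~5.0133
print(np.sqrt(2 * np.pi * sigma**2))      # sqrt(2 pi sigma^2) ~ 5.0133
```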
The Gaussian Distribution
A random variable 𝑋 ∈ ℝ is Gaussian or normally distributed, 𝑋 ~ 𝒩(𝜇, 𝜎²), if:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$$
We often work with the precision of a Gaussian, 𝜆 = 1/𝜎². The higher 𝜆 is, the narrower the distribution.
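A small sketch illustrating the precision parameterization: the peak height of 𝒩(𝑥 | 𝜇, 1/𝜆) is √(𝜆/2𝜋), so larger 𝜆 means a narrower, taller density (the 𝜆 values are illustrative):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# Higher precision lambda = 1/sigma^2  ->  narrower, taller density
for lam in [0.25, 1.0, 4.0]:
    peak = normal_pdf(0.0, 0.0, 1.0 / lam)   # value at the mode x = mu
    print(lam, peak, np.sqrt(lam / (2 * np.pi)))   # last two columns agree
```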
CDF of a Gaussian
Plot of the Standard Normal 𝒩(0,1) PDF and its CDF:

[Figure: the PDF 𝒩(x; 0, 1) and the CDF Φ(x; 0, 1), plotted over x ∈ [−3, 3].]

$$F(x; \mu, \sigma^2) = \int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^2)\,dz$$

$$F(x; \mu, \sigma^2) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(z/\sqrt{2}\right)\right], \quad z = (x-\mu)/\sigma, \qquad \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$

For the standard normal:

$$\Phi(x) = \int_{-\infty}^{x} p(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{z^2}{2}\right) dz$$

Assume 𝑋 ~ 𝒩(𝜇, 𝜎²). Then 𝐹(𝑥) = 𝑃(𝑋 < 𝑥 | 𝜇, 𝜎²) = Φ((𝑥 − 𝜇)/𝜎).
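A minimal implementation of this CDF using the error function from Python's standard library (a sketch, not the course's Matlab code):

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x; mu, sigma^2) = 0.5 * (1 + erf(z / sqrt(2))), with z = (x - mu)/sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(normal_cdf(0.0))    # 0.5 by symmetry
print(normal_cdf(1.96))   # ~0.975
```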
$$\mathbb{E}[X] = \mu, \qquad \text{var}[X] = \sigma^2$$

The Gaussian density

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

is one of the most studied and most used distributions.
[Figure: Estimation of 𝜇, 𝜎 (Matlab implementation); data on [−0.5, 0.5].]
From Bayesian Core, J.M. Marin and C.P. Robert, Chapter 2 (available online)
Datasets: CMBData
CMBdata: Spectral representation of the cosmological microwave background (CMB), i.e. electromagnetic radiation from photons dating back to 300,000 years after the Big Bang, expressed as the difference in apparent temperature from the mean temperature.
[Figure: histogram of the CMBdata with a Gaussian overlaid (solid line); MLE-based estimates of 𝜇 and 𝜎² were used in constructing the Gaussian. Maximum Likelihood Estimators (MLE) will be discussed in a follow-up lecture.]
From Bayesian Core, J.M. Marin and C.P. Robert, Chapter 2 (available online)
Quantiles
Recall that the probability density function 𝑝𝑋(·) of a random variable 𝑋 is defined as the derivative of the cumulative distribution function, so that
$$F(x_0) = \int_{-\infty}^{x_0} p_X(x)\,dx$$
The value 𝑦(𝛼) such that 𝐹(𝑦(𝛼)) = 𝛼 is called the 𝛼-quantile of the distribution with CDF 𝐹. The median is of course 𝑦(0.5), since 𝐹(𝑦(0.5)) = 0.5.
Note that $P_X(\mathbb{R}) = \int_{-\infty}^{\infty} p_X(x)\,dx = 1$.

One can define tail area probabilities. [Figure: the shaded regions each contain 𝛼/2 of the probability mass.] For 𝒩(0, 1), the leftmost cutoff point is Φ⁻¹(𝛼/2), where Φ is the CDF of 𝒩(0, 1).
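A short sketch computing the tail cutoffs with SciPy's normal quantile function (𝛼 = 0.05 is an illustrative choice):

```python
from scipy.stats import norm

alpha = 0.05
# Tail cutoffs for N(0,1): each shaded tail holds alpha/2 of the mass.
left  = norm.ppf(alpha / 2)        # Phi^{-1}(alpha/2)      ~ -1.96
right = norm.ppf(1 - alpha / 2)    # Phi^{-1}(1 - alpha/2)  ~ +1.96
print(left, right)

# Check: the mass between the two cutoffs is 1 - alpha.
print(norm.cdf(right) - norm.cdf(left))   # ~0.95
```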
Bernoulli Distribution

$$\text{Bern}(x \mid \mu) = \mu^x (1-\mu)^{1-x}$$

Using the indicator function, we can also write this as:

$$\text{Bern}(x \mid \mu) = \mu^{\mathbb{I}(x=1)} (1-\mu)^{\mathbb{I}(x=0)}$$
Recall that $\text{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$.
For the Bernoulli distribution $\text{Bern}(x \mid \mu) = \mu^x (1-\mu)^{1-x}$, we can easily show from the definitions:

$$\mathbb{E}[x] = \mu, \qquad \text{var}[x] = \mu(1-\mu)$$

$$\mathbb{H}[x] = -\sum_{x \in \{0,1\}} p(x \mid \mu) \ln p(x \mid \mu) = -\mu \ln \mu - (1-\mu)\ln(1-\mu)$$
For a dataset 𝒟 = {𝑥₁, …, 𝑥_N} of independent draws, the likelihood is

$$p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1-\mu)^{1-x_n} = \mu^m (1-\mu)^{N-m}, \qquad m = \sum_{n=1}^{N} x_n$$
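A minimal sketch on simulated Bernoulli data (the true 𝜇 = 0.3 and 𝑁 = 10,000 are illustrative): the estimate 𝑚/𝑁 maximizes the likelihood above (a standard result, stated here without derivation), and the entropy formula can be evaluated at the estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = 0.3
x = rng.binomial(1, mu_true, size=10_000)   # N Bernoulli draws

m = x.sum()                 # number of ones in the data
mu_hat = m / x.size         # maximizer of mu^m (1-mu)^(N-m)
print(mu_hat)               # ~0.3

# Entropy H[x] = -mu ln(mu) - (1-mu) ln(1-mu), evaluated at mu_hat
H = -(mu_hat * np.log(mu_hat) + (1 - mu_hat) * np.log(1 - mu_hat))
print(H)
```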
Binomial Distribution

$$\text{Bin}(X = m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$$
It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as 𝑁 → ∞ with 𝑁𝜇 → 𝜆 is the Poisson(𝜆) distribution.
[Figure: histograms of Bin(m | N = 10, 𝜇) for 𝜇 = 0.25 and 𝜇 = 0.9, m = 0, …, 10.]
For 𝑚 ~ Bin(𝑁, 𝜇): because 𝑚 = 𝑥₁ + … + 𝑥_N, and for each observation the mean and variance are known from the Bernoulli distribution:

$$\mathbb{E}[m] = \sum_{m=0}^{N} m\,\text{Bin}(m \mid N, \mu) = \mathbb{E}[x_1 + \ldots + x_N] = N\mu$$

$$\text{var}[m] = \sum_{m=0}^{N} \big(m - \mathbb{E}[m]\big)^2\,\text{Bin}(m \mid N, \mu) = \text{var}[x_1 + \ldots + x_N] = N\mu(1-\mu)$$
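These identities can be checked exactly by summing over the pmf (a sketch with illustrative values 𝑁 = 10, 𝜇 = 0.25):

```python
from math import comb

N, mu = 10, 0.25
# Binomial pmf: Bin(m | N, mu) = C(N, m) mu^m (1-mu)^(N-m)
pmf = [comb(N, m) * mu**m * (1 - mu)**(N - m) for m in range(N + 1)]

mean = sum(m * p for m, p in zip(range(N + 1), pmf))
var  = sum((m - mean)**2 * p for m, p in zip(range(N + 1), pmf))

print(mean, N * mu)               # both 2.5
print(var,  N * mu * (1 - mu))    # both 1.875
```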
For a discrete variable 𝒙 in the 1-of-𝐾 encoding (𝑥_k ∈ {0, 1}, ∑_k 𝑥_k = 1):

$$p(\boldsymbol{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$$

where 𝝁 = (𝜇₁, … , 𝜇_K)ᵀ and ∑_k 𝜇_k = 1.
Given 𝑁 independent observations, the likelihood is

$$p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}, \qquad m_k = \sum_{n=1}^{N} x_{nk}$$

where 𝑚_k is the # of observations of 𝑥_k = 1. Maximizing $\sum_{k=1}^{K} m_k \ln \mu_k + \lambda\left(\sum_{k=1}^{K} \mu_k - 1\right)$ with a Lagrange multiplier 𝜆 gives 𝜇_k = −𝑚_k/𝜆; enforcing $\sum_k \mu_k = 1$ yields 𝜆 = −𝑁, so

$$\mu_k^{ML} = \frac{m_k}{N}$$
As expected, this is the fraction of the 𝑁 observations with 𝑥_k = 1.
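A minimal sketch of this ML estimate on simulated categorical data (𝐾 = 4 and the true 𝝁 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 4, 5_000
mu_true = np.array([0.1, 0.2, 0.3, 0.4])

labels = rng.choice(K, size=N, p=mu_true)   # categorical draws, coded 0..K-1
m = np.bincount(labels, minlength=K)        # counts m_k
mu_ml = m / N                               # MLE: fraction of observations with x_k = 1
print(mu_ml)                                # ~mu_true
```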
Multinomial Distribution
We can also consider the joint distribution of 𝑚1, … , 𝑚𝐾 in 𝑁 observations
conditioned on the parameters 𝝁 = (𝜇1, … , 𝜇𝐾).
$$\text{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \ldots m_K} \prod_{k=1}^{K} \mu_k^{m_k}, \qquad \binom{N}{m_1\, m_2 \ldots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}$$
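A small sketch evaluating this pmf directly from the formula (the counts and probabilities below are illustrative):

```python
from math import factorial, prod

def multinomial_pmf(m, mu):
    """Mult(m_1..m_K | mu, N) = N!/(m_1!...m_K!) * prod_k mu_k^{m_k}."""
    N = sum(m)
    coef = factorial(N)
    for mk in m:
        coef //= factorial(mk)   # each partial quotient stays an integer
    return coef * prod(muk**mk for muk, mk in zip(mu, m))

# N = 4 trials, K = 3 categories: coefficient 4!/(2!1!1!) = 12
print(multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25]))   # 12 * 0.015625 = 0.1875
```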
To visualize the data (sequence logo), we plot the letters 𝐴, 𝐶, 𝐺 and 𝑇 with a font size proportional to their empirical probability, and with the most probable letter on the top.
Example: Biosequence Analysis
The empirical probability distribution at location 𝑡 is obtained by normalizing the vector of counts (see the MLE estimate above):
$$\hat{\boldsymbol{\theta}}_t = \frac{1}{N}\left( \sum_{i=1}^{N} \mathbb{I}(X_{it} = 1),\; \sum_{i=1}^{N} \mathbb{I}(X_{it} = 2),\; \sum_{i=1}^{N} \mathbb{I}(X_{it} = 3),\; \sum_{i=1}^{N} \mathbb{I}(X_{it} = 4) \right)$$
[Figure: sequence logo over positions 1–30, vertical axis in bits (0–2), and the corresponding empirical probabilities at each location.]
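A minimal sketch of this normalized-count estimate on simulated sequence data (random letters, not the actual biosequence dataset):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 100, 30
# N sequences of length T; letters A, C, G, T coded as 1..4
X = rng.integers(1, 5, size=(N, T))

t = 0
# theta_t: fraction of sequences showing each letter at location t
theta_t = np.array([(X[:, t] == k).sum() for k in range(1, 5)]) / N
print(theta_t, theta_t.sum())   # the four entries sum to 1
```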
Empirical Distribution

The empirical distribution of samples 𝑥₁, …, 𝑥_N is

$$p_{emp}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}(x)$$

This corresponds to a histogram with spikes at each sample point with height equal to the corresponding weight (here 1/𝑁). This distribution assigns zero weight to any point not in the dataset.
Note that the “sample mean of 𝑓(𝑥)” is the expectation of 𝑓(𝑥) under the
empirical distribution:
$$\mathbb{E}_{p_{emp}(x)}[f(x)] = \int f(x)\, \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}(x)\, dx = \frac{1}{N} \sum_{i=1}^{N} f(x_i)$$
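A quick check of this identity (a sketch; 𝑓(𝑥) = 𝑥² and standard normal samples are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1_000)   # samples x_1, ..., x_N

f = lambda x: x**2
# Expectation of f under the empirical distribution = sample mean of f(x_i)
print(np.mean(f(x)))         # ~E[X^2] = 1 for N(0, 1)
```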