Lec4 IntroToProbabilityAndStatistics
Email: [email protected]
URL: https://www.zabaras.com/
Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)
Contents
• Covariance, Uncorrelated Random Variables, Multivariate Random Variables, Independence vs Uncorrelated Random Variables
• Dirichlet Distribution
References
• Following closely Chris Bishop's PRML book, Chapter 2
Covariance
Consider two random variables $X, Y: \Omega \to \mathbb{R}$ with joint density $p(x, y)$:
$$\mathbb{P}(X \in A,\, Y \in B) = \mathbb{P}\big(X^{-1}(A) \cap Y^{-1}(B)\big) = \int_A \int_B p(x, y)\, dx\, dy$$
$X$ and $Y$ are independent if $p(x, y) = p(x)\, p(y)$. The covariance is defined as:
$$\operatorname{cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$$
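A quick numerical sketch of the equivalent shortcut $\operatorname{cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$ (the data values below are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical small dataset
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# Definition: mean of the product of the centered variables
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Shortcut: E[XY] - E[X] E[Y]
cov_short = np.mean(x * y) - x.mean() * y.mean()

# numpy's population covariance (bias=True divides by N, matching E[.])
cov_np = np.cov(x, y, bias=True)[0, 1]
```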
Correlation, Center Normalized Random Variables
Consider two random variables $X, Y: \Omega \to \mathbb{R}$. Define the centered and normalized variables
$$\tilde{X} = \frac{X - \mathbb{E}[X]}{\sqrt{\operatorname{var}[X]}}, \qquad \tilde{Y} = \frac{Y - \mathbb{E}[Y]}{\sqrt{\operatorname{var}[Y]}}$$
so that
$$\mathbb{E}[\tilde{X}] = \mathbb{E}[\tilde{Y}] = 0, \qquad \operatorname{var}[\tilde{X}] = \operatorname{var}[\tilde{Y}] = 1$$
Variance and Covariance
Variance:
$$\operatorname{var}[f] = \mathbb{E}\big[(f(X) - \mathbb{E}[f(X)])^2\big] = \mathbb{E}[f(X)^2] - \mathbb{E}[f(X)]^2$$
Covariance:
$$\operatorname{cov}[X, Y] = \mathbb{E}_{X,Y}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}_{X,Y}[XY] - \mathbb{E}[X]\, \mathbb{E}[Y]$$
Uncorrelated Random Variables
Consider two random variables $X, Y: \Omega \to \mathbb{R}$. They are called uncorrelated if $\operatorname{cov}(X, Y) = 0$, i.e. if $\mathbb{E}[XY] = \mathbb{E}[X]\, \mathbb{E}[Y]$.
Multivariate Random Variables
Consider the random vector
$$\boldsymbol{X} = (X_1, X_2, \ldots, X_n)^T : \Omega \to \mathbb{R}^n$$
Its mean is
$$\mathbb{E}[\boldsymbol{X}] = \int_{\mathbb{R}^n} \boldsymbol{x}\, p(\boldsymbol{x})\, d\boldsymbol{x} \in \mathbb{R}^n, \quad \text{or} \quad \mathbb{E}[X_i] = \int_{\mathbb{R}^n} x_i\, p(\boldsymbol{x})\, d\boldsymbol{x} = \int_{\mathbb{R}} x_i\, p(x_i)\, dx_i, \quad i = 1, 2, \ldots, n$$
Covariance Matrix
Consider $\boldsymbol{X} = (X_1, X_2, \ldots, X_n)^T : \Omega \to \mathbb{R}^n$. The covariance matrix is:
$$\operatorname{cov}[\boldsymbol{X}] = \int_{\mathbb{R}^n} \big(\boldsymbol{x} - \mathbb{E}[\boldsymbol{X}]\big)\big(\boldsymbol{x} - \mathbb{E}[\boldsymbol{X}]\big)^T p(\boldsymbol{x})\, d\boldsymbol{x} \in \mathbb{R}^{n \times n}$$
Note that the diagonal of the covariance matrix gives the variances of the individual components:
$$\operatorname{cov}[\boldsymbol{X}]_{ii} = \int_{\mathbb{R}^n} (x_i - \mathbb{E}[X_i])^2\, p(\boldsymbol{x})\, d\boldsymbol{x} = \int_{\mathbb{R}} \left( \int_{\mathbb{R}^{n-1}} p(x_i, \boldsymbol{x}_{i'})\, d\boldsymbol{x}_{i'} \right) (x_i - \mathbb{E}[X_i])^2\, dx_i = \int_{\mathbb{R}} (x_i - \mathbb{E}[X_i])^2\, p(x_i)\, dx_i = \operatorname{var}[X_i]$$
Covariance Matrix
The covariance matrix of a vector $\boldsymbol{X}$ can be written explicitly:
$$\operatorname{cov}[\boldsymbol{X}] = \begin{pmatrix} \operatorname{var}[X_1] & \operatorname{cov}[X_1, X_2] & \cdots & \operatorname{cov}[X_1, X_d] \\ \operatorname{cov}[X_2, X_1] & \operatorname{var}[X_2] & \cdots & \vdots \\ \vdots & & \ddots & \\ \operatorname{cov}[X_d, X_1] & \operatorname{cov}[X_d, X_2] & \cdots & \operatorname{var}[X_d] \end{pmatrix}$$
To show that $-1 \le \operatorname{corr}(X, Y) \le 1$, where
$$\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\, \operatorname{var}(Y)}}$$
note that
$$0 \le \operatorname{var}\left[ \frac{X - \mathbb{E}[X]}{\sigma_X} + \frac{Y - \mathbb{E}[Y]}{\sigma_Y} \right] = \frac{\operatorname{var}[X]}{\sigma_X^2} + \frac{\operatorname{var}[Y]}{\sigma_Y^2} + \frac{2\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = 1 + 1 + 2\operatorname{corr}(X, Y)$$
which gives $\operatorname{corr}(X, Y) \ge -1$. Similarly, starting with
$$0 \le \operatorname{var}\left[ \frac{X - \mathbb{E}[X]}{\sigma_X} - \frac{Y - \mathbb{E}[Y]}{\sigma_Y} \right]$$
we obtain $\operatorname{corr}(X, Y) \le 1$.
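The definition and the bound can be checked numerically (a sketch with numpy; the synthetic data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)  # noisy linear relationship

# corr = cov / sqrt(var_x * var_y), all population (divide-by-N) moments
cov_xy = np.mean(x * y) - x.mean() * y.mean()
corr = cov_xy / np.sqrt(x.var() * y.var())

# numpy's built-in correlation coefficient as reference
corr_np = np.corrcoef(x, y)[0, 1]
```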
Variance and Covariance
The covariance of vector random variables is:
$$\operatorname{cov}[\boldsymbol{X}, \boldsymbol{Y}] = \mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}\big[(\boldsymbol{X} - \mathbb{E}[\boldsymbol{X}])(\boldsymbol{Y} - \mathbb{E}[\boldsymbol{Y}])^T\big] = \mathbb{E}_{\boldsymbol{X},\boldsymbol{Y}}[\boldsymbol{X}\boldsymbol{Y}^T] - \mathbb{E}[\boldsymbol{X}]\, \mathbb{E}[\boldsymbol{Y}]^T$$
$$\operatorname{cov}[\boldsymbol{X}, \boldsymbol{X}] = \mathbb{E}[\boldsymbol{X}\boldsymbol{X}^T] - \mathbb{E}[\boldsymbol{X}]\, \mathbb{E}[\boldsymbol{X}]^T$$
Correlation as a Degree of Linearity
It can be shown that
If $Y = aX + b$ with $a > 0$, then $\operatorname{corr}(X, Y) = 1$.
If $Y = aX + b$ with $a < 0$, then $\operatorname{corr}(X, Y) = -1$.
Independent vs Uncorrelated
Note that:
$$\operatorname{var}[X + Y] = \mathbb{E}\big[\big((X - \mathbb{E}[X]) + (Y - \mathbb{E}[Y])\big)^2\big] = \operatorname{var}[X] + \operatorname{var}[Y] + 2\big(\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\big) = \sigma_X^2 + \sigma_Y^2 + 2\operatorname{cov}(X, Y)$$
Independence implies zero correlation, but the converse does not hold. As a counterexample, let $X$ be uniform on $(-1, 1)$ and $Y = X^2$:
$$\mathbb{E}[X] = 0, \qquad \operatorname{var}[X] = \mathbb{E}[X^2] = \int_{-1}^{1} \frac{x^2}{2}\, dx = \frac{1}{3}, \qquad \mathbb{E}[Y] = \mathbb{E}[X^2] = \frac{1}{3}$$
$$\mathbb{E}[XY] = \mathbb{E}[X^3] = \int_{-1}^{1} \frac{x^3}{2}\, dx = 0$$
$$\operatorname{cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\, \mathbb{E}[Y] = 0 - 0 \cdot \tfrac{1}{3} = 0 \;\Rightarrow\; \operatorname{corr}(X, Y) = 0$$
yet $Y$ is completely determined by $X$, so the two variables are certainly not independent.
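The moments of this counterexample can be verified exactly by quadrature (a sketch using scipy):

```python
from scipy.integrate import quad

# X ~ Uniform(-1, 1) with density 1/2; Y = X^2
E_X  = quad(lambda x: x * 0.5, -1, 1)[0]        # E[X]   = 0
E_Y  = quad(lambda x: x**2 * 0.5, -1, 1)[0]     # E[X^2] = 1/3
E_XY = quad(lambda x: x**3 * 0.5, -1, 1)[0]     # E[X^3] = 0

# Zero covariance although Y is a deterministic function of X
cov_XY = E_XY - E_X * E_Y
```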
Mutual Information
The Figure given next shows several data sets where
there is clear dependence between 𝑋 and 𝑌, and yet the
correlation coefficient is 0.
Uncorrelated Random Variables
Several sets of (𝑥, 𝑦) points, with the correlation coefficient of 𝑥 and 𝑦
for each set.
The correlation reflects the noisiness and direction of a linear
relationship (top row), but not the slope of that relationship (middle),
nor nonlinear relationships (bottom).
The figure in the center has a slope of 0 but the correlation coefficient
is undefined because 𝑣𝑎𝑟[𝑌] = 0.
Marginal Density
Consider two random variables $X, Y: \Omega \to \mathbb{R}$ with joint probability density $p(x, y)$.
The probability density of $X$ when $Y$ can take any value is defined as:
$$p(x) = \int p(x, y)\, dy$$
Similarly:
$$p(y) = \int p(x, y)\, dx$$
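A sketch recovering the marginal of a correlated bivariate Gaussian by numerical integration (the correlation $\rho = 0.6$ and the query point are illustrative; with unit marginal variances the marginal is a standard normal):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal, norm

rho = 0.6
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

# Marginalize over y numerically: p(x) = int p(x, y) dy
x0 = 0.7
p_x = quad(lambda y: joint.pdf([x0, y]), -np.inf, np.inf)[0]

# Known marginal for this construction: N(0, 1)
p_ref = norm.pdf(x0)
```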
Conditional Probability Density
Consider two random variables $X, Y: \Omega \to \mathbb{R}$ with joint density $p(x, y)$. For a small interval $(y, y + \Delta y)$,
$$\mathbb{P}(a \le X \le b \mid y \le Y \le y + \Delta y) = \frac{\int_a^b p(x, y)\, dx\, \Delta y}{p(y)\, \Delta y} = \int_a^b \frac{p(x, y)}{p(y)}\, dx$$
so the conditional probability density is
$$p(x \mid Y = y) = \frac{p(x, y)}{p(y)}$$
The law of total expectation follows:
$$\mathbb{E}[X] = \int \mathbb{E}[X \mid y]\, p(y)\, dy$$
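A numerical sketch of $p(x \mid Y = y) = p(x, y)/p(y)$ for the Gaussian case, using the known closed form of the conditional (the parameters are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Bivariate Gaussian with unit variances and correlation rho
rho, y0, x0 = 0.6, 2.0, 1.0
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

# Conditional density via the ratio p(x, y) / p(y)
p_cond = joint.pdf([x0, y0]) / norm.pdf(y0)

# Known closed form: X | Y = y ~ N(rho * y, 1 - rho^2)
p_closed = norm.pdf(x0, loc=rho * y0, scale=np.sqrt(1 - rho**2))
```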
Linear Transformations
Suppose $\boldsymbol{y} = f(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x} + \boldsymbol{b}$. You can show that:
$$\mathbb{E}[\boldsymbol{y}] = \boldsymbol{A}\, \mathbb{E}[\boldsymbol{x}] + \boldsymbol{b}, \qquad \operatorname{cov}[\boldsymbol{y}] = \boldsymbol{A}\, \operatorname{cov}[\boldsymbol{x}]\, \boldsymbol{A}^T$$
Similarly, for the scalar $y = \boldsymbol{a}^T \boldsymbol{x} + b$:
$$\mathbb{E}[y] = \boldsymbol{a}^T \mathbb{E}[\boldsymbol{x}] + b, \qquad \operatorname{var}[y] = \boldsymbol{a}^T \operatorname{cov}[\boldsymbol{x}]\, \boldsymbol{a}$$
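These identities hold exactly for empirical means and covariances as well; a sketch with numpy (the matrix $A$ and offset $b$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([1.0, -1.0])

x = rng.normal(size=(1000, 2))   # samples of x
y = x @ A.T + b                  # y = A x + b, applied row-wise

# cov[y] = A cov[x] A^T and E[y] = A E[x] + b (exact for sample moments)
cov_y_emp = np.cov(y, rowvar=False)
cov_y_thy = A @ np.cov(x, rowvar=False) @ A.T
mean_y_thy = A @ x.mean(axis=0) + b
```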
Multivariate Gaussian
A random variable $X: \Omega \to \mathbb{R}$ is Gaussian or normally distributed, $X \sim \mathcal{N}(x_0, \sigma^2)$, if:
$$\mathbb{P}(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - x_0)^2}{2\sigma^2} \right) dx$$
A multivariate $\boldsymbol{X}: \Omega \to \mathbb{R}^D$ is Gaussian if its probability density is
$$p(\boldsymbol{x}) = \frac{1}{(2\pi)^{D/2} \det(\boldsymbol{\Sigma})^{1/2}} \exp\left( -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{x}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{x}_0) \right)$$
where $\boldsymbol{x}_0 \in \mathbb{R}^D$ and $\boldsymbol{\Sigma} \in \mathbb{R}^{D \times D}$ is symmetric positive definite (the covariance matrix).
The symmetry property of the covariance matrix does not affect the value of $(\boldsymbol{x} - \boldsymbol{x}_0)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{x}_0)$. However, for symmetric covariance matrices we only need to describe $D(D+1)/2$ elements rather than $D^2$.
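A sketch evaluating this density directly from the formula and against scipy (the mean, covariance, and query point are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
x0 = np.array([1.0, 0.0, -1.0])
S = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.5]])   # symmetric positive definite
x = np.array([0.5, 0.5, 0.0])

# Manual evaluation of the multivariate Gaussian density
d = x - x0
quad_form = d @ np.linalg.solve(S, d)
p_manual = np.exp(-0.5 * quad_form) / np.sqrt((2 * np.pi)**D * np.linalg.det(S))

p_scipy = multivariate_normal(mean=x0, cov=S).pdf(x)
```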
Conditional and Marginal Probability Densities
[Figure: conditional and marginal bivariate normal pdfs. Left: ellipsoids (equiprobability curves) of $p(x, y)$ with the conditional $p(x \mid y = 2)$ and the marginal $p(x)$ indicated; right: surface plots of the corresponding probability densities.]
Transformation of Probability Density
A probability density transforms differently from functions. Let $x = g(y)$. Then
$$p_y(y) = p_x(g(y)) \left| \frac{dx}{dy} \right| = p_x(g(y))\, |g'(y)| = p_x(g(y))\, s\, g'(y), \quad s \in \{-1, 1\}$$
This is easily derived by taking observations in the interval $(x, x + dx)$ to be transformed to observations in $(y, y + dy)$, i.e.
$$p_y(y)\, |dy| = p_x(x)\, |dx|$$
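A one-dimensional sketch: for $X \sim \mathcal{N}(0,1)$ and $Y = e^X$, the rule gives the standard log-normal density (scipy's `lognorm` is used as the reference; the query point is illustrative):

```python
import numpy as np
from scipy.stats import norm, lognorm

# x = g(y) = ln(y), so |dx/dy| = 1/y
y = 1.7
p_y = norm.pdf(np.log(y)) / y      # p_x(g(y)) |g'(y)|
p_ref = lognorm(s=1.0).pdf(y)      # scipy's standard log-normal
```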
Transformation of Probability Density
For example, consider the Gamma distribution
$$\mathrm{Gamma}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-xb}$$
Let us compute the density of $Y = 1/X$. With $x = g(y) = 1/y$ and
$$\left| \frac{dx}{dy} \right| = \frac{1}{y^2}$$
we obtain
$$p_y(y) = p_x(g(y)) \left| \frac{dx}{dy} \right| = \frac{b^a}{\Gamma(a)}\, y^{-(a-1)} e^{-b/y}\, \frac{1}{y^2} = \frac{b^a}{\Gamma(a)}\, y^{-(a+1)} e^{-b/y}$$
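This is the inverse-Gamma density; a sketch checking the derived formula against scipy's `invgamma` (the parameter values are illustrative):

```python
import numpy as np
from scipy.special import gamma as Gamma
from scipy.stats import invgamma

a, b, y = 3.0, 2.0, 0.8

# Derived density of Y = 1/X for X ~ Gamma(shape a, rate b)
p_y = b**a / Gamma(a) * y**(-(a + 1)) * np.exp(-b / y)

# scipy's inverse-gamma with shape a and scale b
p_ref = invgamma(a, scale=b).pdf(y)
```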
Multivariate Change of Variables
If $\boldsymbol{f}$ is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping $\boldsymbol{y} \to \boldsymbol{x}$:
$$p_y(\boldsymbol{y}) = p_x(\boldsymbol{x}) \left| \det \frac{\partial \boldsymbol{x}}{\partial \boldsymbol{y}} \right|, \qquad \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial y_n}{\partial x_1} & \cdots & \frac{\partial y_n}{\partial x_n} \end{pmatrix}$$
As an example, it is trivial to show that transforming a density from Cartesian coordinates $\boldsymbol{x} = (x, y)$ to polar coordinates $\boldsymbol{y} = (r, \theta)$, where $x = r\cos\theta$ and $y = r\sin\theta$ so that $\left| \frac{\partial(x, y)}{\partial(r, \theta)} \right| = r$, gives:
$$p_{r,\theta}(r, \theta) = p_{x,y}(r\cos\theta, r\sin\theta)\, r, \qquad p_{r,\theta}(r, \theta)\, dr\, d\theta = p_{x,y}(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta$$
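A sketch confirming that the polar-coordinate density still integrates to one, for a standard 2-D Gaussian (the finite upper limit on $r$ stands in for $\infty$; the tail beyond it is negligible):

```python
import numpy as np
from scipy.integrate import dblquad

# Standard 2-D Gaussian, then p(r, theta) = p_xy(r cos t, r sin t) * r
p_xy = lambda x, y: np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)
p_polar = lambda r, th: p_xy(r * np.cos(th), r * np.sin(th)) * r

# Integrate over theta in [0, 2*pi) and r in [0, 50] (~ [0, inf))
total, _ = dblquad(p_polar, 0, 2 * np.pi, 0, 50.0)
```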
Transformation of Probability Density
Recall
$$p_y(y) = p_x(g(y)) \left| \frac{dx}{dy} \right| = p_x(g(y))\, s\, g'(y), \quad s \in \{-1, 1\}$$
Using this equation, note that modes of densities depend on the choice of variables (see the 2nd term on the rhs below):
$$p_y'(y) = s\, p_x'(g(y))\, \big(g'(y)\big)^2 + s\, p_x(g(y))\, g''(y)$$
The univariate Student's T distribution can be written as a scale mixture of Gaussians:
$$\mathcal{T}(x \mid \mu, \lambda, \nu) = \int_0^{\infty} \mathcal{N}\big(x \mid \mu, (\eta\lambda)^{-1}\big)\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
This form is useful in providing the generalization to a multivariate Student's T:
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^{\infty} \mathcal{N}\big(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\big)\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
*Use the change of variables formula for distributions, with $\tau = \eta\lambda$ and $d\tau = \lambda\, d\eta$, and notice that the extra terms that appear cancel out.
Multivariate Student’s T Distribution
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \int_0^{\infty} \mathcal{N}\big(\boldsymbol{x} \mid \boldsymbol{\mu}, (\eta\boldsymbol{\Lambda})^{-1}\big)\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$
This integral can be computed analytically as:
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{\Gamma\!\left(\frac{\nu}{2} + \frac{D}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\, \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[ 1 + \frac{\Delta^2}{\nu} \right]^{-\frac{D}{2} - \frac{\nu}{2}}, \qquad \Delta^2 = (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Lambda} (\boldsymbol{x} - \boldsymbol{\mu}) \ \text{(Mahalanobis distance)}$$
One can derive the above form of the distribution by substitution in the equation at the top:
$$\mathcal{T}(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \nu) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, \frac{|\boldsymbol{\Lambda}|^{1/2}}{(2\pi)^{D/2}} \int_0^{\infty} \eta^{D/2 + \nu/2 - 1}\, e^{-\eta\Delta^2/2}\, e^{-\nu\eta/2}\, d\eta \qquad \left( \text{use } z = \eta\, \frac{\nu + \Delta^2}{2} \right)$$
$$= \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)}\, \frac{|\boldsymbol{\Lambda}|^{1/2}}{(2\pi)^{D/2}} \left( \frac{\nu + \Delta^2}{2} \right)^{-D/2 - \nu/2} \int_0^{\infty} z^{D/2 + \nu/2 - 1} e^{-z}\, dz = \frac{\Gamma\!\left(\frac{\nu}{2} + \frac{D}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\, \frac{|\boldsymbol{\Lambda}|^{1/2}}{(\pi\nu)^{D/2}} \left[ 1 + \frac{\Delta^2}{\nu} \right]^{-D/2 - \nu/2}$$
In the limit $\nu \to \infty$, the Student's T approaches a Gaussian, since
$$\left[ 1 + \frac{\Delta^2}{\nu} \right]^{-D/2 - \nu/2} = \exp\left[ -\left( \frac{D}{2} + \frac{\nu}{2} \right) \ln\left( 1 + \frac{\Delta^2}{\nu} \right) \right] = \exp\left( -\frac{\Delta^2}{2} + O(1/\nu) \right)$$
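A univariate sketch of the scale-mixture representation: with $\mu = 0$ and $\lambda = 1$ the mixture integral should match the standard Student's T with $\nu$ degrees of freedom (the values of $\nu$ and $x$ are illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, gamma, t

nu, x = 5.0, 1.3

# Integrand: N(x | 0, eta^{-1}) * Gamma(eta | nu/2, rate nu/2)
mix = lambda eta: norm.pdf(x, scale=eta**-0.5) * gamma.pdf(eta, a=nu/2, scale=2/nu)
p_mix, _ = quad(mix, 0, np.inf)

# Reference: standard Student's T density
p_t = t.pdf(x, df=nu)
```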
Dirichlet Distribution
We introduce the Dirichlet distribution as a family of "conjugate priors" (to be formally introduced in a follow-up lecture) for the parameters $\mu_k$ of the multinomial distribution.
Dirichlet Distribution
Its probability density function returns the belief that the probabilities of $K$ rival events are $\mu_k$, given that each event has been observed $\alpha_k - 1$ times:
$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad 0 \le \mu_k \le 1, \qquad \sum_{k=1}^{K} \mu_k = 1$$
Dirichlet Distribution
The Dirichlet distribution of order $K \ge 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a PDF with respect to Lebesgue measure on $\mathbb{R}^{K-1}$ given by
$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{1}{\mathrm{Beta}(\boldsymbol{\alpha})} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$
for all $\mu_1, \ldots, \mu_{K-1} > 0$ satisfying $\mu_1 + \ldots + \mu_{K-1} < 1$, where $\mu_K$ is an abbreviation for $1 - \mu_1 - \cdots - \mu_{K-1}$. The normalizing constant is the multinomial Beta function:
$$\mathrm{Beta}(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\!\left( \sum_{k=1}^{K} \alpha_k \right)}, \qquad \boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_K)^T$$
The Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3)
is confined on a plane as shown.
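A sketch evaluating the Dirichlet PDF from the multinomial Beta function and comparing with scipy (the parameters and query point are illustrative):

```python
import numpy as np
from scipy.special import gamma as Gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])
mu = np.array([0.2, 0.3, 0.5])   # a point on the simplex

# Multinomial Beta normalizer: prod Gamma(alpha_k) / Gamma(sum alpha_k)
beta_fn = np.prod(Gamma(alpha)) / Gamma(alpha.sum())
p_manual = np.prod(mu ** (alpha - 1)) / beta_fn

p_scipy = dirichlet.pdf(mu, alpha)
```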
Dirichlet Distribution
We write the Dirichlet distribution as:
$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = K(\boldsymbol{\alpha}) \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad K(\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}, \qquad \alpha_0 = \alpha_1 + \ldots + \alpha_K$$
Note the following useful relation. Differentiating the normalization $\int \prod_k \mu_k^{\alpha_k - 1}\, d\boldsymbol{\mu} = 1/K(\boldsymbol{\alpha})$ with respect to $\alpha_j$:
$$\mathbb{E}[\ln \mu_j] = K(\boldsymbol{\alpha}) \int_0^1 \!\!\cdots\! \int_0^1 \ln \mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu_1 \cdots d\mu_K = K(\boldsymbol{\alpha})\, \frac{\partial}{\partial \alpha_j} \frac{1}{K(\boldsymbol{\alpha})} = -\frac{\partial \ln K(\boldsymbol{\alpha})}{\partial \alpha_j}$$
Since $\ln K(\boldsymbol{\alpha}) = \ln \Gamma(\alpha_0) - \sum_k \ln \Gamma(\alpha_k)$, this gives
$$\mathbb{E}[\ln \mu_j] = \psi(\alpha_j) - \psi(\alpha_0), \qquad \psi(\alpha) \equiv \frac{d \ln \Gamma(\alpha)}{d\alpha}$$
where $\psi$ is the digamma function.
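For $K = 2$ the Dirichlet reduces to a Beta distribution, so the relation $\mathbb{E}[\ln \mu_j] = \psi(\alpha_j) - \psi(\alpha_0)$ can be checked by quadrature (the parameters are illustrative):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma
from scipy.stats import beta

# Dirichlet with K = 2 is Beta(a, b); check E[ln mu] = psi(a) - psi(a + b)
a, b = 3.0, 2.0
E_log_mu = quad(lambda m: np.log(m) * beta.pdf(m, a, b), 0, 1)[0]
E_ref = digamma(a) - digamma(a + b)
```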
Dirichlet Distribution: Normalization
To show the normalization, we use induction. The case $M = 2$ was shown earlier for the Beta distribution. Assume that the Dirichlet normalization formula is valid for $M - 1$ terms; we now show it for $M$ terms:
$$p_M(\mu_1, \ldots, \mu_{M-1}) = C_M \prod_{k=1}^{M-1} \mu_k^{\alpha_k - 1} \left( 1 - \sum_{j=1}^{M-1} \mu_j \right)^{\alpha_M - 1}$$
Integrating out $\mu_{M-1}$:
$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \int_0^{1 - \sum_{j=1}^{M-2} \mu_j} \mu_{M-1}^{\alpha_{M-1} - 1} \left( 1 - \sum_{j=1}^{M-1} \mu_j \right)^{\alpha_M - 1} d\mu_{M-1}$$
Substituting $\mu_{M-1} = t \left( 1 - \sum_{j=1}^{M-2} \mu_j \right)$:
$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \left( 1 - \sum_{j=1}^{M-2} \mu_j \right)^{\alpha_{M-1} + \alpha_M - 1} \int_0^1 t^{\alpha_{M-1} - 1} (1 - t)^{\alpha_M - 1}\, dt$$
Dirichlet Distribution: Normalization
Evaluating the Beta integral:
$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \left( 1 - \sum_{j=1}^{M-2} \mu_j \right)^{\alpha_{M-1} + \alpha_M - 1} \frac{\Gamma(\alpha_{M-1})\, \Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)}$$
This is a Dirichlet over $M - 1$ variables with parameters $\alpha_1, \ldots, \alpha_{M-2}, \alpha_{M-1} + \alpha_M$, so the induction hypothesis fixes $C_M = \Gamma(\alpha_0) / \big( \Gamma(\alpha_1) \cdots \Gamma(\alpha_M) \big)$.
Dirichlet Distribution
The Dirichlet distribution over (𝜇1, 𝜇2, 𝜇3) where the horizontal
axes are 𝜇1 and 𝜇2 and the vertical axis is the density.
[Figure: surface plots of the Dirichlet density for $\{\alpha_k\} = 0.1$, $\{\alpha_k\} = 1$, and $\{\alpha_k\} = 10$ (MATLAB code).]
Dirichlet Distribution
The Dirichlet distribution over $(\mu_1, \mu_2, \mu_3)$ where the horizontal axes are $\mu_1$ and $\mu_2$ and the vertical axis is the density.
[Figure: densities for $\{\alpha_k\} = (0.1, 0.1, 0.1)$, $(2, 2, 2)$, and $(10, 10, 10)$. If $\alpha_k < 1$ for all $k$, we obtain spikes at the corners of the simplex. Run visDirichletGui and dirichlet3dPlot from PMTK.]
Dirichlet Distribution
Samples from a 5-dimensional symmetric Dirichlet distribution.
[Figure: bar plots of samples from $\mathrm{Dir}(\alpha = 5)$, i.e. $\{\alpha_k\} = (5, 5, \ldots, 5)$, and from $\mathrm{Dir}(\alpha = 0.1)$.]
Dirichlet Distribution
In closing, we have the following properties (you only need the normalization $\int \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\boldsymbol{\mu} = \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}{\Gamma(\alpha_0)}$ of the Dirichlet to derive them):
$$\mathbb{E}[\mu_k] = \frac{\alpha_k}{\alpha_0}, \quad \operatorname{mode}[\mu_k] = \frac{\alpha_k - 1}{\alpha_0 - K}, \quad \operatorname{var}[\mu_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}, \quad \operatorname{cov}[\mu_j, \mu_l] = -\frac{\alpha_j \alpha_l}{\alpha_0^2(\alpha_0 + 1)}, \ j \ne l$$
$$\text{where:} \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k$$
Often we use a symmetric Dirichlet with $\alpha_k = \alpha / K$. In this case:
$$\mathbb{E}[\mu_k] = \frac{1}{K}, \qquad \operatorname{var}[\mu_k] = \frac{K - 1}{K^2(\alpha + 1)}$$