
Introduction to Probability and Statistics (Continued)

Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA

Email: [email protected]
URL: https://www.zabaras.com/

August 28, 2018
Contents

 Covariance, Uncorrelated Random Variables, Multivariate Random Variables, Independence vs Uncorrelated Random Variables

 Marginal and Conditional Densities, Conditional Expectation

 The Multivariate Gaussian, Multivariate Student's t Distribution

 Transformations of Random Variables

 Dirichlet Distribution
References

• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2

• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.

• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
Covariance

 Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

 The joint probability distribution is defined as:

$$P(X \in A, Y \in B) = P\left(X^{-1}(A) \cap Y^{-1}(B)\right) = \int_A \int_B p(x, y)\, dx\, dy$$

 Two random variables are independent if

$$p(x, y) = p(x)\, p(y)$$

 The covariance of $X$ and $Y$ is defined as:

$$\mathrm{cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right]$$

 It is straightforward to verify that $\mathrm{cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$, as checked numerically below.
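 As a quick numerical sanity check of the identity $\mathrm{cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$, here is a minimal sketch using NumPy (the correlated pair below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # correlated with x by construction

# Definition: E[(X - E[X])(Y - E[Y])]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
# Identity: E[XY] - E[X] E[Y]
cov_id = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_id)   # both approach the true value 0.5
```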
Correlation, Center Normalized Random Variables

 Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

 The correlation coefficient of $X$ and $Y$ is defined as:

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$

where the standard deviations of $X$ and $Y$ are

$$\sigma_X = \sqrt{\mathrm{var}(X)}, \qquad \sigma_Y = \sqrt{\mathrm{var}(Y)}$$

 The center normalized random variables are defined as:

$$\tilde{X} = \frac{X - \mathbb{E}[X]}{\sigma_X}, \qquad \tilde{Y} = \frac{Y - \mathbb{E}[Y]}{\sigma_Y}$$

 It is straightforward to verify that:

$$\mathbb{E}[\tilde{X}] = \mathbb{E}[\tilde{Y}] = 0, \qquad \mathrm{var}(\tilde{X}) = \mathrm{var}(\tilde{Y}) = 1$$
Variance and Covariance

 Variance

$$\mathrm{var}[f] = \mathbb{E}\left[\left(f(X) - \mathbb{E}[f(X)]\right)^2\right] = \mathbb{E}\left[f(X)^2\right] - \mathbb{E}\left[f(X)\right]^2$$

 Covariance

$$\mathrm{cov}[X, Y] = \mathbb{E}_{X,Y}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}_{X,Y}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

 It expresses the extent to which $X$ and $Y$ vary (linearly) together.

 If $X$ and $Y$ are independent, $p(X, Y) = p(X)p(Y)$, their covariance vanishes.
Uncorrelated Random Variables

 Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

 We say that $X$ and $Y$ are uncorrelated when: $\mathrm{cov}(X, Y) = 0 \Leftrightarrow \mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.

 If $X$ and $Y$ are independent, then they are uncorrelated, since the expectation factorizes:

$$\mathrm{cov}(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}\left[X - \mathbb{E}[X]\right]\mathbb{E}\left[Y - \mathbb{E}[Y]\right] = 0$$

 The opposite is not true: uncorrelated random variables are not necessarily independent. Independence constrains the whole joint density, not just the expectation.

 $X$ and $Y$ are orthogonal if

$$\mathbb{E}[XY] = 0$$

 In the last case, the following holds:

$$\mathbb{E}\left[(X + Y)^2\right] = \mathbb{E}\left[X^2\right] + \mathbb{E}\left[Y^2\right]$$
Multivariate Random Variables

 Consider

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} : \Omega \to \mathbb{R}^n$$

where each component $X_i$ is an $\mathbb{R}$-valued variable.

 $X$ is defined by the joint probability density of its components, $p_X : \mathbb{R}^n \to \mathbb{R}$.

 The cumulative distribution function is defined as:

$$F(x_1, x_2, \ldots, x_n) = \Pr\left(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n\right) \in [0, 1]$$

 Then the probability density function of $X$ is defined as

$$p(x_1, x_2, \ldots, x_n) = \frac{\partial^n F(x)}{\partial x_1 \partial x_2 \ldots \partial x_n}, \qquad \text{and} \quad \int p(x)\, dx = 1$$
Multivariate Random Variables

 Consider $X = (X_1, X_2, \ldots, X_n)^T : \Omega \to \mathbb{R}^n$, where each component $X_i$ is an $\mathbb{R}$-valued variable.

 The expectation is defined as

$$\mathbb{E}[X] = \int_{\mathbb{R}^n} x\, p(x)\, dx \in \mathbb{R}^n, \quad \text{or} \quad \mathbb{E}[X_i] = \int_{\mathbb{R}^n} x_i\, p(x)\, dx = \int_{\mathbb{R}} x_i\, p(x_i)\, dx_i \in \mathbb{R}, \quad i = 1, 2, \ldots, n$$
Covariance Matrix

 Consider $X = (X_1, X_2, \ldots, X_n)^T : \Omega \to \mathbb{R}^n$.

 The covariance matrix is:

$$\mathrm{cov}[X] = \int_{\mathbb{R}^n} \left(x - \mathbb{E}[X]\right)\left(x - \mathbb{E}[X]\right)^T p(x)\, dx \in \mathbb{R}^{n \times n}$$

or, equivalently, componentwise:

$$\mathrm{cov}[X]_{ij} = \int_{\mathbb{R}^n} \left(x_i - \mathbb{E}[X_i]\right)\left(x_j - \mathbb{E}[X_j]\right) p(x)\, dx \in \mathbb{R}, \qquad 1 \le i, j \le n.$$

 The covariance matrix is symmetric and positive semi-definite, i.e. with $\bar{x} = \mathbb{E}[X]$,

$$\forall v \in \mathbb{R}^n, v \ne 0: \quad v^T \mathrm{cov}[X]\, v = \int_{\mathbb{R}^n} v^T (x - \bar{x})(x - \bar{x})^T v\, p(x)\, dx = \int_{\mathbb{R}^n} \left(v^T (x - \bar{x})\right)^2 p(x)\, dx \ge 0.$$

 Note that the diagonal of the covariance matrix gives the variances of the individual components:

$$\mathrm{cov}[X]_{ii} = \int_{\mathbb{R}^n} (x_i - \mathbb{E}[X_i])^2\, p(x)\, dx = \int_{\mathbb{R}} (x_i - \mathbb{E}[X_i])^2 \left(\int_{\mathbb{R}^{n-1}} p(x_i, x_{i'})\, dx_{i'}\right) dx_i = \int_{\mathbb{R}} (x_i - \mathbb{E}[X_i])^2\, p(x_i)\, dx_i = \mathrm{var}[X_i]$$
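 These properties are easy to check empirically. A minimal NumPy sketch (the mixing matrix below is an arbitrary illustrative choice) estimates a sample covariance matrix and verifies symmetry, positive semi-definiteness, and that the diagonal holds the componentwise variances:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(50_000, 3)) @ A.T     # samples with covariance A A^T

S = np.cov(X, rowvar=False)                # sample covariance matrix
print(np.allclose(S, S.T))                 # symmetric
print((np.linalg.eigvalsh(S) >= -1e-12).all())          # eigenvalues >= 0: PSD
print(np.allclose(np.diag(S), X.var(axis=0, ddof=1)))   # diagonal = variances
```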
Covariance Matrix

 The covariance matrix of a vector $X$ can be written explicitly:

$$\mathrm{cov}[X] = \begin{pmatrix} \mathrm{var}[X_1] & \mathrm{cov}[X_1, X_2] & \ldots & \mathrm{cov}[X_1, X_d] \\ \mathrm{cov}[X_2, X_1] & \mathrm{var}[X_2] & \ldots & \mathrm{cov}[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}[X_d, X_1] & \mathrm{cov}[X_d, X_2] & \ldots & \mathrm{var}[X_d] \end{pmatrix}$$

 A normalized version of this is the correlation matrix (all elements in $[-1, 1]$, diagonal elements equal to 1):

$$R = \begin{pmatrix} \mathrm{corr}[X_1, X_1] & \mathrm{corr}[X_1, X_2] & \ldots & \mathrm{corr}[X_1, X_d] \\ \vdots & \mathrm{corr}[X_2, X_2] & \ldots & \vdots \\ \mathrm{corr}[X_d, X_1] & \mathrm{corr}[X_d, X_2] & \ldots & \mathrm{corr}[X_d, X_d] \end{pmatrix}, \qquad \mathrm{corr}[X, Y] = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$$
Correlation Coefficient Between -1 and 1

 Consider two scalar random variables $X$ and $Y$. We can write the following:

$$0 \le \mathrm{var}\left[\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right] = \frac{\mathrm{var}[X]}{\sigma_X^2} + \frac{\mathrm{var}[Y]}{\sigma_Y^2} + 2\,\frac{\mathrm{cov}[X, Y]}{\sigma_X \sigma_Y} = 1 + 1 + 2\,\mathrm{corr}[X, Y]$$

$$\Rightarrow \quad \mathrm{corr}[X, Y] \ge -1$$

 Similarly, starting with

$$0 \le \mathrm{var}\left[\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right] \quad \Rightarrow \quad \mathrm{corr}[X, Y] \le 1$$

where $\mathrm{corr}[X, Y] = \dfrac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$.
Variance and Covariance

 The covariance of two vector random variables is:

$$\mathrm{cov}[X, Y] = \mathbb{E}_{X,Y}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])^T\right] = \mathbb{E}_{X,Y}\left[XY^T\right] - \mathbb{E}[X]\,\mathbb{E}[Y]^T = \Sigma_{X,Y}$$

 The covariance between the components of a vector:

$$\mathrm{cov}[X, X] = \Sigma_{X,X} = \mathbb{E}\left[XX^T\right] - \mathbb{E}[X]\,\mathbb{E}[X]^T$$
Correlation as a Degree of Linearity

 It can be shown that

$$\text{If } Y = aX + b,\ a > 0, \text{ then: } \mathrm{corr}[X, Y] = 1$$
$$\text{If } Y = aX + b,\ a < 0, \text{ then: } \mathrm{corr}[X, Y] = -1$$

 The regression coefficient is $a = \mathrm{cov}[X, Y] / \mathrm{var}[X]$.

 Think of the correlation coefficient as a degree of linearity.

 If $X$ and $Y$ are independent, $p(X, Y) = p(X)p(Y)$, then $\mathrm{cov}[X, Y] = 0$, and hence $\mathrm{corr}[X, Y] = 0$, so they are uncorrelated.

 The converse is not true: uncorrelated does not imply independence.
Independent vs Uncorrelated

 Note that:

$$\mathrm{var}[X + Y] = \mathbb{E}\left[(X + Y)^2\right] - \left(\mathbb{E}[X + Y]\right)^2 = \mathbb{E}\left[X^2\right] - \mathbb{E}[X]^2 + \mathbb{E}\left[Y^2\right] - \mathbb{E}[Y]^2 + 2\left(\mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]\right)$$
$$= \mathrm{var}[X] + \mathrm{var}[Y] + 2\,\mathrm{cov}[X, Y]$$

 From the above equation, we note that if $X, Y$ are independent (hence uncorrelated) then:

$$\mathrm{var}[X + Y] = \mathrm{var}[X] + \mathrm{var}[Y]$$

 Note, however, that the linearity of expectation is valid even when the variables are not independent:

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$
Correlation and Dependence

 Uncorrelated does not imply independent.

 For example, let $X \sim \mathcal{U}(-1, 1)$ and $Y = X^2$. Clearly $Y$ is dependent on $X$, yet one can show that $\mathrm{corr}[X, Y] = 0$:

$$\mathbb{E}[X] = 0, \qquad \mathrm{var}[X] = \frac{\left(1 - (-1)\right)^2}{12} = \frac{1}{3}$$

$$\mathbb{E}[Y] = \mathbb{E}\left[X^2\right] = \mathrm{var}[X] + \left(\mathbb{E}[X]\right)^2 = \frac{1}{3} + 0^2 = \frac{1}{3}$$

$$\mathbb{E}[XY] = \mathbb{E}\left[X^3\right] = \int_{-1}^{1} x^3\, p(x)\, dx = \int_{-1}^{1} x^3\, \frac{1}{2}\, dx = 0$$

$$\mathrm{corr}[X, Y] = \frac{\mathrm{cov}[X, Y]}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]}{\sigma_X \sigma_Y} = \frac{0 - 0 \cdot 1/3}{\sigma_X \sigma_Y} = 0$$
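 A minimal simulation of this example (NumPy sketch, sample size arbitrary) confirms both the vanishing correlation and $\mathbb{E}[Y] = 1/3$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2                             # deterministically dependent on x

print(np.corrcoef(x, y)[0, 1])         # ~0: uncorrelated despite dependence
print(np.mean(y), 1.0 / 3.0)           # E[Y] = var[X] = 1/3
```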
Mutual Information

 The figure given next shows several data sets where there is clear dependence between $X$ and $Y$, and yet the correlation coefficient is 0.

 A more general measure of dependence between random variables is mutual information.

 The mutual information is zero if and only if the variables are truly independent.
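 As a rough illustration, the sketch below estimates mutual information with a simple 2D-histogram plug-in estimator (a crude estimator with a known small positive bias; NumPy assumed, bin count arbitrary). It detects the dependence of $Y = X^2$ on $X$ that correlation misses:

```python
import numpy as np

def mutual_information(x, y, bins=30):
    """Crude plug-in estimate of I(X;Y) in nats via a 2D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100_000)
print(mutual_information(x, x ** 2))           # clearly positive: dependent
print(mutual_information(x, rng.uniform(-1, 1, 100_000)))  # ~0 up to bias
```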
Uncorrelated Random Variables

 Several sets of $(x, y)$ points, with the correlation coefficient of $x$ and $y$ shown for each set.

 The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor nonlinear relationships (bottom).

 The figure in the center has a slope of 0, but the correlation coefficient is undefined because $\mathrm{var}[Y] = 0$.
Marginal Density

 Consider two random variables $X, Y : \Omega \to \mathbb{R}$ with joint probability density $p(x, y)$.

 The probability density of $X$ when $Y$ can take any value is defined as:

$$p(x) = \int p(x, y)\, dy$$

 Similarly:

$$p(y) = \int p(x, y)\, dx.$$
Conditional Probability Density

 Consider two random variables $X, Y : \Omega \to \mathbb{R}$ with joint density $p(x, y)$.

 The probability density of $X$ assuming that $Y = y$ is defined as

$$p(x \mid y) = \frac{p(x, y)}{p(y)}, \qquad p(y) \ne 0$$

 One can show this by noting the following (with $\varepsilon \to 0$):

$$P\left(a \le X \le b \mid y - \varepsilon \le Y \le y + \varepsilon\right) = \frac{\displaystyle\int_{y-\varepsilon}^{y+\varepsilon} \int_a^b p(x, y)\, dx\, dy}{\displaystyle\int_{y-\varepsilon}^{y+\varepsilon} p(y)\, dy} \approx \frac{2\varepsilon \displaystyle\int_a^b p(x, y)\, dx}{2\varepsilon\, p(y)} = \int_a^b \underbrace{\frac{p(x, y)}{p(y)}}_{p(x \mid Y = y)}\, dx$$

 From this we derive the following important identity:

$$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x).$$

 Bayes' rule in terms of densities now follows as:

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$$
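 These identities can be exercised numerically on a grid. A minimal sketch (NumPy assumed; the correlated bivariate Gaussian and grid resolution are arbitrary illustrative choices) discretizes a joint density, forms marginals and conditionals by the definitions above, and confirms Bayes' rule:

```python
import numpy as np

# Discretize a joint density p(x, y) on a grid
x = np.linspace(-5, 5, 401)
y = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, y, indexing="ij")

rho = 0.7                                   # arbitrary correlation
pxy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2 * (1 - rho**2)))
pxy /= pxy.sum() * dx * dx                  # normalize on the grid

px = pxy.sum(axis=1) * dx                   # marginal p(x)
py = pxy.sum(axis=0) * dx                   # marginal p(y)

p_x_given_y = pxy / py[None, :]             # p(x|y) = p(x,y)/p(y)
p_y_given_x = pxy / px[:, None]             # p(y|x) = p(x,y)/p(x)

# Bayes' rule: p(x|y) = p(y|x) p(x) / p(y)
bayes = p_y_given_x * px[:, None] / py[None, :]
print(np.allclose(p_x_given_y, bayes))      # True
```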
Conditional Expectations

 Consider two random variables $X, Y : \Omega \to \mathbb{R}$.

 We define the conditional expectation as: $\mathbb{E}[X \mid y] = \int x\, p(x \mid y)\, dx$

 The expectation of $X$ via conditional expectation can be computed as:

$$\mathbb{E}[X] = \int x\, p(x)\, dx = \int x \left(\int \underbrace{p(x, y)}_{p(x \mid y)\, p(y)}\, dy\right) dx = \int \left(\int x\, p(x \mid y)\, dx\right) p(y)\, dy = \int \mathbb{E}[X \mid y]\, p(y)\, dy$$
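 A one-line Monte Carlo check of this tower property (a minimal sketch assuming NumPy; the linear-Gaussian pair is an arbitrary construction with $\mathbb{E}[X \mid y] = 2y$):

```python
import numpy as np

rng = np.random.default_rng(11)
y = rng.normal(1.0, 1.0, size=500_000)
x = 2.0 * y + rng.normal(size=500_000)   # by construction E[X|y] = 2y

# Tower property: E[X] = E[ E[X|Y] ]
print(np.mean(x), np.mean(2.0 * y))      # both ~2.0
```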
Linear Transformations

 Suppose $y = f(x) = Ax + b$. You can show that:

$$\mathbb{E}[y] = A\,\mathbb{E}[x] + b$$
$$\mathrm{cov}[y] = A\, \mathrm{cov}[x]\, A^T$$

 For a scalar-valued function $y = f(x) = a^T x + b$:

$$\mathbb{E}[y] = a^T \mathbb{E}[x] + b$$
$$\mathrm{var}[y] = a^T \mathrm{cov}[x]\, a$$
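 Both identities can be verified by sampling. A minimal NumPy sketch (the mean, covariance, and transformation below are arbitrary illustrative values; tolerances account for Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0, 0.5])
C = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])
x = rng.multivariate_normal(mu, C, size=200_000)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, 1.0])
y = x @ A.T + b                            # y = A x + b, applied row-wise

print(np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05))            # E[y] = A E[x] + b
print(np.allclose(np.cov(y, rowvar=False), A @ C @ A.T, atol=0.1))   # cov[y] = A cov[x] A^T
```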
Multivariate Gaussian

 A random variable $X \in \mathbb{R}$ is Gaussian or normally distributed, $X \sim \mathcal{N}(x_0, \sigma^2)$, if:

$$P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - x_0)^2\right) dx$$

 A multivariate $X \in \mathbb{R}^D$ is Gaussian if its probability density is

$$p(x) = \left(\frac{1}{(2\pi)^D \det \Sigma}\right)^{1/2} \exp\left(-\frac{1}{2}(x - x_0)^T \Sigma^{-1} (x - x_0)\right)$$

where $x_0 \in \mathbb{R}^D$ and $\Sigma \in \mathbb{R}^{D \times D}$ is symmetric positive definite (the covariance matrix).

 The symmetry property of the covariance matrix does not affect the value of $(x - x_0)^T \Sigma^{-1} (x - x_0)$. However, for symmetric covariance matrices we only need to describe $D(D+1)/2$ elements rather than $D^2$.

 The family is invariant under linear transformations, i.e. for $A, B \in \mathbb{R}^{M \times D}$, $c \in \mathbb{R}^M$:

$$X_1 \sim \mathcal{N}(\mu_1, \Sigma_1),\ X_2 \sim \mathcal{N}(\mu_2, \Sigma_2),\ X_1, X_2 \text{ independent} \ \Rightarrow\ AX_1 + BX_2 + c \sim \mathcal{N}\left(A\mu_1 + B\mu_2 + c,\ A\Sigma_1 A^T + B\Sigma_2 B^T\right)$$
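 The closure under linear transformations can be checked by sampling. A minimal NumPy sketch (all means, covariances, and matrices below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(5)
mu1, S1 = np.array([0.0, 1.0]), np.array([[1.0, 0.4], [0.4, 0.8]])
mu2, S2 = np.array([2.0, -1.0]), np.array([[0.5, 0.0], [0.0, 0.3]])
A = np.array([[1.0, 1.0], [0.0, 2.0]])
B = np.eye(2)
c = np.array([1.0, 1.0])

x1 = rng.multivariate_normal(mu1, S1, size=300_000)
x2 = rng.multivariate_normal(mu2, S2, size=300_000)
z = x1 @ A.T + x2 @ B.T + c                 # z = A x1 + B x2 + c

print(z.mean(axis=0), A @ mu1 + B @ mu2 + c)                 # means agree
print(np.cov(z, rowvar=False), A @ S1 @ A.T + B @ S2 @ B.T)  # covariances agree
```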
Conditional and Marginal Probability Densities

 [Figure: a bivariate normal pdf with ellipsoidal equiprobability curves of $p(x, y)$. The top panels show the conditional density $p(x \mid y = 2)$ and the bottom panels the marginal density $p(x)$, each with contour and surface views. Link here for a MatLab program to generate these figures.]
Transformation of Probability Density

 A probability density transforms differently from a function.

 Let $x = g(y)$. Then:

$$p_y(y) = p_x(g(y)) \left|\frac{dx}{dy}\right| = p_x(g(y))\, |g'(y)| = p_x(g(y))\, s\, g'(y), \qquad s \in \{-1, 1\}$$

 This is easily derived by taking observations in the interval $(x, x + dx)$ to be transformed to observations in $(y, y + dy)$, i.e.

$$p_y(y)\, dy = p_x(x)\, dx$$
Transformation of Probability Density

 For example, consider the Gamma distribution

$$\mathrm{Gamma}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-xb}$$

 Let us compute the density of $Y = 1/X$. With $x = g(y) = 1/y$,

$$p_y(y) = p_x(g(y)) \left|\frac{dx}{dy}\right| = \frac{b^a}{\Gamma(a)}\, y^{-(a-1)} e^{-\frac{b}{y}} \cdot \frac{1}{y^2} = \frac{b^a}{\Gamma(a)}\, y^{-(a+1)} e^{-\frac{b}{y}}, \qquad \text{where} \quad \left|\frac{dx}{dy}\right| = \frac{1}{y^2}$$

 This is the Inverse Gamma distribution:

$$\mathrm{InvGamma}(y \mid a, b) = \frac{b^a}{\Gamma(a)}\, y^{-(a+1)} e^{-\frac{b}{y}}$$
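 The result is easy to confirm by simulation: a histogram of $1/X$ for $X \sim \mathrm{Gamma}(a, b)$ should match the analytic Inverse Gamma density. A minimal sketch assuming NumPy and SciPy (shape $a$ and rate $b$ are arbitrary; note NumPy's gamma sampler is parametrized by scale $= 1/b$, and SciPy's `invgamma` with `scale=b` matches the density above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
a, b = 3.0, 2.0                                        # shape a, rate b
x = rng.gamma(shape=a, scale=1.0 / b, size=500_000)    # X ~ Gamma(a, rate=b)
y = 1.0 / x                                            # Y = 1/X

# Compare a histogram of Y against the analytic InvGamma(a, b) density
hist, edges = np.histogram(y, bins=200, range=(0.01, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pdf = stats.invgamma.pdf(centers, a, scale=b)          # b^a/Gamma(a) y^{-(a+1)} e^{-b/y}
print(np.max(np.abs(hist - pdf)))                      # small discrepancy
```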
Multivariate Change of Variables

 If $f$ is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping $\boldsymbol{y} \to \boldsymbol{x}$:

$$p_y(y) = p_x(x) \left|\det \frac{\partial x}{\partial y}\right|, \qquad \frac{\partial y}{\partial x} = \begin{pmatrix} \dfrac{\partial y_1}{\partial x_1} & \ldots & \dfrac{\partial y_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial y_n}{\partial x_1} & \ldots & \dfrac{\partial y_n}{\partial x_n} \end{pmatrix}$$

 As an example, it is trivial to show that transforming a density from Cartesian coordinates $\boldsymbol{x} = (x, y)$ to polar coordinates $\boldsymbol{y} = (r, \theta)$, where $x = r\cos\theta$, $y = r\sin\theta$, and $\left|\dfrac{\partial(x, y)}{\partial(r, \theta)}\right| = r$, gives:

$$p_{r,\theta}(r, \theta) = p_{x,y}(r\cos\theta, r\sin\theta)\, r$$
$$p_{r,\theta}(r, \theta)\, dr\, d\theta = p_{x,y}(r\cos\theta, r\sin\theta)\, r\, dr\, d\theta$$
Transformation of Probability Density

 Recall $p_y(y) = p_x(g(y)) \left|\dfrac{dx}{dy}\right| = p_x(g(y))\, s\, g'(y),\ s \in \{-1, 1\}$.

 Using this equation, note that the modes of densities depend on the choice of variables (see the 2nd term on the rhs below):

$$p_y'(y) = s\, p_x'(g(y)) \left(g'(y)\right)^2 + s\, p_x(g(y))\, g''(y)$$

 Consider $X \sim \mathcal{N}(6, 1)$ and the following transformation:

$$x = g(y) = \ln\left(\frac{y}{1 - y}\right) + 5, \qquad y = g^{-1}(x) = \frac{1}{1 + e^{-x+5}}$$

 Transforming $p_x(x)$ as a function gives the same mode for $p_x(g(y))$. The actual mode of $p_y(y)$ is shifted.

 The histogram of $p_y(y)$ is obtained as:

$$y^{(s)} = g^{-1}(x^{(s)}), \qquad \text{where} \quad x^{(s)} \sim p_x(x)$$
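 The mode shift is visible directly from samples. A minimal NumPy sketch (sample size and bin count arbitrary): the histogram mode of $p_y$ lands noticeably above the naive value $g^{-1}(6)$, because of the Jacobian term:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(6.0, 1.0, size=1_000_000)      # X ~ N(6, 1), mode at x = 6
y = 1.0 / (1.0 + np.exp(-x + 5.0))            # y = g^{-1}(x), logistic transform

# Mode of the transformed density from a histogram of y
hist, edges = np.histogram(y, bins=200, range=(0.0, 1.0), density=True)
k = np.argmax(hist)
mode_y = 0.5 * (edges[k] + edges[k + 1])

# Naively transforming the mode of p_x gives g^{-1}(6) ~ 0.731;
# the actual mode of p_y is shifted (here to roughly 0.84)
print(mode_y, 1.0 / (1.0 + np.exp(-6.0 + 5.0)))
```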
Multivariate Student's T Distribution

$$p(x \mid \mu, a, b) = \int_0^\infty \mathcal{N}\left(x \mid \mu, \tau^{-1}\right) \mathrm{Gamma}(\tau \mid a, b)\, d\tau$$

 If we return to the derivation of the univariate Student's T distribution and substitute $\nu = 2a$, $\lambda = a/b$, and $\tau = \eta\lambda$, using

$$\mathrm{Gamma}(\tau \mid a, b) = \frac{b^a}{\Gamma(a)}\, \tau^{a-1} e^{-b\tau}$$

we can write the Student's T distribution as:*

$$T(x \mid \mu, \lambda, \nu) = \int_0^\infty \mathcal{N}\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

 This form is useful in providing the generalization to a multivariate Student's T:

$$T(x \mid \mu, \Lambda, \nu) = \int_0^\infty \mathcal{N}\left(x \mid \mu, (\eta\Lambda)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

*Use the change of variables formula for densities, with $d\tau = \lambda\, d\eta$, and notice that the extra terms that appear cancel out.
Multivariate Student's T Distribution

$$T(x \mid \mu, \Lambda, \nu) = \int_0^\infty \mathcal{N}\left(x \mid \mu, (\eta\Lambda)^{-1}\right) \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

 This integral can be computed analytically as:

$$T(x \mid \mu, \Lambda, \nu) = \frac{\Gamma\left(\frac{D}{2} + \frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{|\Lambda|^{1/2}}{(\nu\pi)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{D}{2} - \frac{\nu}{2}}, \qquad \Delta^2 = (x - \mu)^T \Lambda (x - \mu) \quad (\text{Mahalanobis distance})$$

 One can derive the above form of the distribution by substitution in the equation on the top:

$$T(x \mid \mu, \Lambda, \nu) = \frac{(\nu/2)^{\nu/2}\, |\Lambda|^{1/2}}{(2\pi)^{D/2}\, \Gamma(\nu/2)} \int_0^\infty \eta^{D/2 + \nu/2 - 1}\, e^{-\eta\left(\nu + \Delta^2\right)/2}\, d\eta$$

 Using the change of variables $z = \eta\left(\nu + \Delta^2\right)/2$:

$$\int_0^\infty \eta^{D/2 + \nu/2 - 1}\, e^{-\eta\left(\nu + \Delta^2\right)/2}\, d\eta = \Gamma\left(\frac{D}{2} + \frac{\nu}{2}\right) \left(\frac{2}{\nu + \Delta^2}\right)^{D/2 + \nu/2}$$

which, after collecting terms and writing $\left(\nu + \Delta^2\right)^{-\nu/2 - D/2} = \nu^{-\nu/2 - D/2}\left(1 + \Delta^2/\nu\right)^{-\nu/2 - D/2}$, yields the form above.

 The normalization proof is immediate from the normalization of the normal and Gamma distributions.
Multivariate Student's T Distribution

$$T(x \mid \mu, \Lambda, \nu) = \frac{\Gamma\left(\frac{D}{2} + \frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{|\Lambda|^{1/2}}{(\nu\pi)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{D}{2} - \frac{\nu}{2}}$$

 Some useful results for the multivariate Student's T are given below:

$$\mathbb{E}[x] = \mu \ \text{ if } \nu > 1, \qquad \mathrm{cov}[x] = \frac{\nu}{\nu - 2}\, \Lambda^{-1} \ \text{ if } \nu > 2, \qquad \mathrm{mode}[x] = \mu$$

 One can show the expression for the mean easily by using $\boldsymbol{x} = \boldsymbol{z} + \boldsymbol{\mu}$:

$$\mathbb{E}[x] = \frac{\Gamma\left(\frac{D}{2} + \frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{|\Lambda|^{1/2}}{(\nu\pi)^{D/2}} \int \left[1 + \frac{z^T \Lambda z}{\nu}\right]^{-\frac{D}{2} - \frac{\nu}{2}} (z + \mu)\, dz$$

 The 1st term drops out since $T(z \mid 0, \Lambda, \nu)$ is an even function of $z$. The 2nd term gives $\boldsymbol{\mu}$ from the normalization of the distribution.

 The covariance is computed as:

$$\mathrm{cov}[x] = \int_0^\infty \left[\int \mathcal{N}\left(x \mid \mu, (\eta\Lambda)^{-1}\right) (x - \mu)(x - \mu)^T dx\right] \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta = \Lambda^{-1} \int_0^\infty \eta^{-1}\, \mathrm{Gamma}(\eta \mid \nu/2, \nu/2)\, d\eta$$

since the inner integral is the Gaussian covariance $(\eta\Lambda)^{-1}$. Finally,

$$\mathbb{E}\left[\eta^{-1}\right] = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \int_0^\infty \eta^{\nu/2 - 2}\, e^{-\nu\eta/2}\, d\eta = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} \frac{\Gamma(\nu/2 - 1)}{(\nu/2)^{\nu/2 - 1}} = \frac{\nu/2}{\nu/2 - 1} = \frac{\nu}{\nu - 2}$$
Multivariate Student's T Distribution

$$T(x \mid \mu, \Lambda, \nu) = \frac{\Gamma\left(\frac{D}{2} + \frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \frac{|\Lambda|^{1/2}}{(\nu\pi)^{D/2}} \left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{D}{2} - \frac{\nu}{2}}$$

 Differentiation with respect to $\boldsymbol{x}$ also shows the mode being $\boldsymbol{\mu}$:

$$\mathbb{E}[x] = \mu \ \text{ if } \nu > 1, \qquad \mathrm{cov}[x] = \frac{\nu}{\nu - 2}\, \Lambda^{-1} \ \text{ if } \nu > 2, \qquad \mathrm{mode}[x] = \mu$$

 The Student's T has fatter tails than a Gaussian. The smaller $\nu$ is, the fatter the tails.

 For $\nu \to \infty$, the distribution approaches a Gaussian. Indeed, note that:

$$\left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{\nu}{2} - \frac{D}{2}} = \exp\left[-\left(\frac{\nu}{2} + \frac{D}{2}\right) \ln\left(1 + \frac{\Delta^2}{\nu}\right)\right] = \exp\left[-\left(\frac{\nu}{2} + \frac{D}{2}\right)\left(\frac{\Delta^2}{\nu} + O\left(\frac{1}{\nu^2}\right)\right)\right] \to \exp\left(-\frac{\Delta^2}{2}\right)\left(1 + O\left(\frac{1}{\nu}\right)\right)$$

 The distribution can also be written in terms of $\Sigma = \Lambda^{-1}$ (the scale matrix, not the covariance) or $V = \nu\Sigma$.
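 Both the fat tails and the $\nu \to \infty$ Gaussian limit are easy to see numerically in the univariate case. A minimal sketch assuming SciPy (the grid and degrees of freedom are arbitrary illustrative choices):

```python
import numpy as np
from scipy import stats

x = np.linspace(-6, 6, 121)
for nu in (1, 2, 10, 100):
    # As nu grows, the Student-t density approaches the standard normal
    print(nu, np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x))))

# Fat tails: the tail probability P(X > 4) is far larger for small nu
print(stats.t.sf(4.0, df=2), stats.norm.sf(4.0))
```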
Dirichlet Distribution

 We introduce the Dirichlet distribution as a family of "conjugate priors" (to be formally introduced in a follow-up lecture) for the parameters $\mu_k$ of the multinomial distribution.

 The Dirichlet distribution $\mathrm{Dir}(\alpha)$ is a family of continuous multivariate probability distributions parametrized by a vector $\boldsymbol{\alpha}$ of positive reals.

 It is the multivariate generalization of the Beta distribution.
Dirichlet Distribution

 Its probability density function returns the belief that the probabilities of $K$ rival events are $\mu_k$, given that each event has been observed $\alpha_k - 1$ times:

$$p(\mu \mid \alpha) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad 0 \le \mu_k \le 1, \qquad \sum_{k=1}^{K} \mu_k = 1$$

 The distribution over the space of $\mu_k$ is $K - 1$ dimensional due to the last constraint above.
Dirichlet Distribution

 The Dirichlet distribution of order $K \ge 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a pdf with respect to Lebesgue measure on $\mathbb{R}^{K-1}$ given by

$$p(\mu \mid \alpha) = \frac{1}{\mathrm{Beta}(\alpha)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

for all $\mu_1, \ldots, \mu_{K-1} > 0$ satisfying $\mu_1 + \ldots + \mu_{K-1} < 1$, where $\mu_K$ is an abbreviation for $1 - \mu_1 - \ldots - \mu_{K-1}$. The normalizing constant is the multinomial Beta function:

$$\mathrm{Beta}(\alpha) = \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}, \qquad \alpha = \left(\alpha_1, \alpha_2, \ldots, \alpha_K\right)^T$$

 The Dirichlet distribution over $(\mu_1, \mu_2, \mu_3)$ is confined to a plane (the simplex), as shown in the figure.
Dirichlet Distribution

 We write the Dirichlet distribution as:

$$p(\mu \mid \alpha) = K(\alpha) \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \qquad K(\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}, \qquad \alpha_0 = \alpha_1 + \ldots + \alpha_K$$

 Note the following useful relation:

$$\frac{\partial}{\partial \alpha_j} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} = \frac{\partial}{\partial \alpha_j}\, e^{\sum_{k=1}^{K} (\alpha_k - 1) \ln \mu_k} = \ln \mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}$$

 From this we can derive an interesting expression for $\mathbb{E}[\ln \mu_j]$:

$$\mathbb{E}[\ln \mu_j] = K(\alpha) \int_0^1 \!\!\cdots\! \int_0^1 \ln \mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu_1 \cdots d\mu_K = K(\alpha)\, \frac{\partial}{\partial \alpha_j} \int_0^1 \!\!\cdots\! \int_0^1 \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu_1 \cdots d\mu_K$$

$$= K(\alpha)\, \frac{\partial}{\partial \alpha_j} \frac{1}{K(\alpha)} = -\frac{\partial \ln K(\alpha)}{\partial \alpha_j} = -\frac{\partial}{\partial \alpha_j}\left[\ln \Gamma(\alpha_0) - \sum_k \ln \Gamma(\alpha_k)\right]$$

$$\Rightarrow \quad \mathbb{E}[\ln \mu_j] = \psi(\alpha_j) - \psi(\alpha_0)$$

where $\psi(\alpha) \equiv \dfrac{d \ln \Gamma(\alpha)}{d\alpha}$ is the digamma function.
Dirichlet Distribution: Normalization

 To show the normalization, we use induction. The case $M = 2$ was shown earlier for the Beta distribution.

 Assume that the Dirichlet normalization formula is valid for $M - 1$ terms. We will show the formula for $M$ terms:

$$p_M(\mu_1, \ldots, \mu_{M-1}) = C_M \prod_{k=1}^{M-1} \mu_k^{\alpha_k - 1} \left(1 - \sum_{j=1}^{M-1} \mu_j\right)^{\alpha_M - 1}$$

 Let us integrate out $\mu_{M-1}$, substituting $\mu_{M-1} = t\left(1 - \sum_{j=1}^{M-2} \mu_j\right)$:

$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \int_0^{1 - \sum_{j=1}^{M-2} \mu_j} \left(\prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1}\right) \mu_{M-1}^{\alpha_{M-1} - 1} \left(1 - \sum_{j=1}^{M-2} \mu_j - \mu_{M-1}\right)^{\alpha_M - 1} d\mu_{M-1}$$

$$= C_M \left(\prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1}\right) \left(1 - \sum_{j=1}^{M-2} \mu_j\right)^{\alpha_{M-1} + \alpha_M - 1} \int_0^1 t^{\alpha_{M-1} - 1} (1 - t)^{\alpha_M - 1}\, dt$$
Dirichlet Distribution: Normalization

$$p_{M-1}(\mu_1, \ldots, \mu_{M-2}) = C_M \left(\prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1}\right) \left(1 - \sum_{j=1}^{M-2} \mu_j\right)^{\alpha_{M-1} + \alpha_M - 1} \frac{\Gamma(\alpha_{M-1})\, \Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)}$$

 The last step above comes from the normalization of the Beta distribution.

 What we have above is an $(M - 1)$-term Dirichlet distribution with coefficients $\alpha_1, \ldots, \alpha_{M-2}, \alpha_{M-1} + \alpha_M$. Since we assumed that the normalization formula is valid for $(M - 1)$ terms, we must have:

$$1 = C_M\, \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{M-2})\, \Gamma(\alpha_{M-1} + \alpha_M)}{\Gamma(\alpha_1 + \ldots + \alpha_M)}\, \frac{\Gamma(\alpha_{M-1})\, \Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)}$$

$$\Rightarrow \quad C_M = \frac{\Gamma(\alpha_1 + \ldots + \alpha_M)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{M-2})\, \Gamma(\alpha_{M-1})\, \Gamma(\alpha_M)}$$
Dirichlet Distribution

 Using the multinomial as the "likelihood" and the Dirichlet as "the conjugate prior", we arrive at the following "posterior":

$$p(\mu \mid \mathcal{D}) \propto \underbrace{p(\mathcal{D} \mid \mu)}_{\text{Multinomial}}\, \underbrace{p(\mu)}_{\text{Dirichlet}} \ \Rightarrow\ p(\mu \mid \mathcal{D}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$

which is a Dirichlet distribution $\mathrm{Dir}(\mu \mid \alpha_1 + m_1, \ldots, \alpha_K + m_K)$.

 The normalization factor is computed easily from the normalization factor of the Dirichlet as:

$$p(\mu \mid \mathcal{D}) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k + N\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k + m_k)} \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$$

 $\alpha_k$ can be interpreted as "the effective number of prior observations of $x_k = 1$".
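 Because the posterior is again a Dirichlet, the conjugate update is just an addition of counts. A minimal NumPy sketch (prior parameters, true probabilities, and sample size are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(8)
alpha = np.array([2.0, 2.0, 2.0])          # Dirichlet prior Dir(alpha)
true_mu = np.array([0.5, 0.3, 0.2])

counts = rng.multinomial(100, true_mu)     # m_k: observed category counts
alpha_post = alpha + counts                # conjugacy: posterior is Dir(alpha + m)

print(counts, alpha_post)
print(alpha_post / alpha_post.sum())       # posterior mean, pulled toward the data
```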
Dirichlet Distribution

 Examples of the Dirichlet distribution over $(\mu_1, \mu_2, \mu_3)$, which can be plotted in 2D since $\mu_3 = 1 - \mu_1 - \mu_2$: uniform, broad centered at $(1/3, 1/3, 1/3)$, and narrow centered at $(1/3, 1/3, 1/3)$.

 $\alpha_0 = \alpha_1 + \ldots + \alpha_K$ controls how peaked the distribution is.

 The $\alpha_k$ control the location of the peak.
Dirichlet Distribution

 [Figure: the Dirichlet distribution over $(\mu_1, \mu_2, \mu_3)$, where the horizontal axes are $\mu_1$ and $\mu_2$ and the vertical axis is the density, shown for $\{\alpha_k\} = 0.1$, $\{\alpha_k\} = 1$, and $\{\alpha_k\} = 10$. MatLab code available.]
Dirichlet Distribution

 [Figure: the Dirichlet distribution over $(\mu_1, \mu_2, \mu_3)$, where the horizontal axes are $\mu_1$ and $\mu_2$ and the vertical axis is the density, shown for $\{\alpha_k\} = (0.1, 0.1, 0.1)$, $(2, 2, 2)$, and $(10, 10, 10)$. If $\alpha_k < 1$ for all $k$, we obtain spikes at the corners of the simplex. Run visDirichletGui and dirichlet3dPlot from PMTK.]
Dirichlet Distribution

 [Figure: samples from a 5-dimensional symmetric Dirichlet distribution. With $\{\alpha_k\} = (5, 5, \ldots, 5)$ the sampled probability vectors are close to uniform over the five categories; with $\{\alpha_k\} = (0.1, 0.1, \ldots, 0.1)$ they are sparse, with most of the mass concentrated on a few components. Run dirichletHistogramDemo from PMTK.]
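 The same effect is easy to reproduce without the PMTK demo. A minimal NumPy sketch drawing a few samples for each concentration setting (dimension and sample count arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
for alpha in (5.0, 0.1):
    samples = rng.dirichlet([alpha] * 5, size=3)
    print(f"alpha = {alpha}:")
    print(np.round(samples, 3))   # alpha=5: near-uniform; alpha=0.1: sparse spikes
```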
Dirichlet Distribution

 In closing, we have the following properties (to prove them you only need the normalization of the Dirichlet distribution,

$$\int \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu = \frac{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)}{\Gamma(\alpha_0)}, \qquad \alpha_0 = \alpha_1 + \ldots + \alpha_K,$$

and the property $\Gamma(x + 1) = x\Gamma(x)$):

$$\mathbb{E}[\mu_k] = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{mode}[\mu_k] = \frac{\alpha_k - 1}{\alpha_0 - K}, \qquad \mathrm{var}[\mu_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}, \qquad \mathrm{cov}[\mu_j, \mu_l] = -\frac{\alpha_j \alpha_l}{\alpha_0^2 (\alpha_0 + 1)} \ (j \ne l)$$

where $\alpha_0 = \sum_{k=1}^{K} \alpha_k$.

 Often we use the symmetric choice $\alpha_k = \alpha / K$. In this case:

$$\mathbb{E}[\mu_k] = \frac{1}{K}, \qquad \mathrm{var}[\mu_k] = \frac{K - 1}{K^2 (\alpha + 1)}$$

 Increasing $\alpha$ increases the precision (decreases the variance) of the distribution.
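 The mean and variance formulas can be checked by Monte Carlo. A minimal NumPy sketch (the parameter vector is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(10)
alpha = np.array([1.0, 2.0, 3.0])
a0 = alpha.sum()
s = rng.dirichlet(alpha, size=1_000_000)

print(s.mean(axis=0), alpha / a0)                                  # E[mu_k] = alpha_k / alpha_0
print(s.var(axis=0), alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))    # var[mu_k]
```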
