
STAT3006

Tutorial 2
1. In introductory statistics courses, you will have come across rules describing
what happens to the mean and variance under linear combinations of univariate
random variables. You have likely seen the following.
Given random variables X1 and X2 and constants a and b, consider the linear
combination

Y = aX1 + bX2.

The mean of the linear combination is the linear combination of the means,

E(Y) = aE(X1) + bE(X2).

The variance of the linear combination is as follows:

Var(Y) = a^2 Var(X1) + b^2 Var(X2) + 2ab Cov(X1, X2).

Derive the corresponding expressions for the mean and covariance matrix of a linear
combination AX + BY of p-dimensional random vectors X and Y (assumed dependent),
where A and B are constant matrices.
Hints: let Z = (X^T, Y^T)^T, consider a suitable matrix C and the properties of
CZ. Assume E(X) = µX, E(Y) = µY, Cov(X, X) = ΣXX, Cov(Y, Y) = ΣYY,
Cov(X, Y) = ΣXY.
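
As a quick sanity check of the univariate rules above (not a substitute for the
derivation), the variance formula can be verified numerically in R; the constants
and dependence structure below are arbitrary choices:

set.seed(1)
n <- 1e5
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # dependent on x1 by construction
a <- 2; b <- -3
y <- a * x1 + b * x2
var(y)                                                    # empirical variance
a^2 * var(x1) + b^2 * var(x2) + 2 * a * b * cov(x1, x2)   # formula value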

2. Random mixture distributions


It is sometimes useful to simulate data from a multivariate normal distribution,
e.g. as part of simulating from a mixture distribution. However, if one wants to
simulate data from a randomly chosen but valid multivariate normal distribution,
this is more challenging.
Bayesian statistics provides one possible answer, with a variety of prior distributions
available for, e.g., proportions, means and covariance matrices. Particular sets of
parameters can be sampled from these distributions, and these are sufficient to
describe a randomly generated multivariate normal distribution.
Here the usual choices of prior distribution would be a Dirichlet distribution
for proportions, a normal distribution with high variances for the means and
a Wishart or inverse-Wishart distribution for the covariance matrices. Choosing
hyper-parameters for the last of these can be difficult.
One alternative is to produce a random covariance matrix by filling it with random
values and then correcting it so that it forms a valid covariance matrix. We
know that it has to be symmetric. That is easily achieved by filling in only the
diagonal and upper triangle of the matrix and then copying the upper triangle
to the lower triangle. The other requirement is that the matrix be positive
semi-definite, although in practice it needs to be positive definite to be of much use.

So that we don't have to think about the scaling of one variable relative to another,
it is often desirable to make the covariance matrix a correlation matrix, i.e. to
have all the variance terms (the diagonal entries) equal to 1. Then we could fill
in all the upper triangular terms with, for example, values drawn from a standard
normal distribution N(0, 1), rejecting any values with magnitude above 1 since
these cannot appear in a correlation matrix. Alternatively, we could draw from the
continuous uniform distribution U[−1, 1]. Having copied these values to the lower
triangle, we would have a symmetric matrix, but one with a low chance of being a
valid correlation or covariance matrix.
However, this can be corrected, albeit at the expense of not knowing the distri-
bution of the resulting matrices or their elements. Assume that the symmetric
matrix produced so far is B. This matrix will have eigenvalues and eigenvectors.
λ is an eigenvalue of B and v is an eigenvector of B if

Bv = λv

A statement of this type holds for every eigenvalue of B.


Now consider the eigenvalues of B + aI, a > 0.

(B + aI)v = Bv + av = λv + av = (λ + a)v

So (λ + a) is an eigenvalue of B + aI, with v being the corresponding eigenvector.


For B to be positive definite, all of its eigenvalues must be positive. If that is
already true for the randomly generated symmetric matrix, then we can use it
directly as a covariance matrix. If not, we take the smallest eigenvalue, λ1, and
replace B with B + (−λ1 + ϵ)I, where ϵ > 0 is a small value. The resulting
symmetric matrix has all positive eigenvalues and hence is positive definite and
suitable as a covariance matrix.
Note that the eigenvalues of B + (−λ1 + ϵ)I are those of B, each increased by
(−λ1 + ϵ), while the eigenvectors are the same as those of B.
A particular attraction of this method is that any zeroes which one desires to
put in the covariance matrix (or the inverse covariance matrix, if generating that
instead) remain after the eigenvalue "correction", since the shift only alters the
diagonal.
Note that it is easy to convert this into a correlation matrix if desired using the
conversion equation (1.17) in the lecture notes, implemented in R via cov2cor().
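
As a concrete illustration, here is a minimal R sketch of the procedure just
described; the function name random_corr and the default eps are our choices, not
part of the tutorial:

random_corr <- function(p, eps = 1e-6) {
  B <- diag(p)                                      # 1s on the diagonal
  B[upper.tri(B)] <- runif(p * (p - 1) / 2, -1, 1)  # random upper triangle
  B[lower.tri(B)] <- t(B)[lower.tri(B)]             # mirror it, so B is symmetric
  lam_min <- min(eigen(B, symmetric = TRUE, only.values = TRUE)$values)
  if (lam_min <= 0)
    B <- B + (-lam_min + eps) * diag(p)             # shift all eigenvalues above zero
  cov2cor(B)                                        # rescale the diagonal back to 1
}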
In the following, you will use this method as part of simulating from a mixture
distribution.

(a) Randomly generate three valid 3×3 correlation matrices. These will be used
as the covariance matrices of three components in a mixture.
(b) Draw three random three-dimensional vectors where each element is drawn
from a standard normal distribution. These will be the means of the three
components.
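
A possible sketch for (a) and (b), reusing the random_corr() function from the
sketch above (the object names Sigmas and mus are ours):

set.seed(3006)
Sigmas <- replicate(3, random_corr(3), simplify = FALSE)  # (a) three 3 x 3 correlation matrices
mus <- replicate(3, rnorm(3), simplify = FALSE)           # (b) three random mean vectors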

(c) Draw one observation from a Dirichlet distribution to use as the mixture
proportions. The Dirichlet distribution is the multivariate analogue of the
beta distribution - it produces a set of values which sum to 1. The following
code should work.

install.packages("MCMCpack")        # provides the rdirichlet() function
library(MCMCpack)
props <- rdirichlet(1, c(1, 1, 1))  # one draw of three proportions summing to 1

(d) Draw 100 observations from this mixture distribution.
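
One way to implement (d), assuming the objects mus, Sigmas and props from the
sketches above; mvrnorm() is from the MASS package:

library(MASS)
n <- 100
z <- sample(1:3, n, replace = TRUE, prob = as.vector(props))  # latent component labels
x <- t(sapply(z, function(k) mvrnorm(1, mu = mus[[k]], Sigma = Sigmas[[k]])))  # 100 x 3 data matrix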


(e) Plot two contours of the marginal distributions of each mixture component
in each possible pair of dimensions. Make sure to use the proportions in
doing this, and use the same weighted density contour levels for each component.
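
A sketch of (e) for the first pair of dimensions, using dmvnorm() from the mvtnorm
package; the grid range and the two contour levels are arbitrary choices, reused
for every component:

library(mvtnorm)
g <- seq(min(x) - 1, max(x) + 1, length.out = 200)
plot(x[, 1], x[, 2], col = "grey", xlab = "dimension 1", ylab = "dimension 2")
for (k in 1:3) {
  dens <- outer(g, g, function(u, v)
    props[k] * dmvnorm(cbind(u, v), mean = mus[[k]][1:2], sigma = Sigmas[[k]][1:2, 1:2]))
  contour(g, g, dens, levels = c(0.01, 0.05), add = TRUE, col = k)  # weighted density contours
}
# repeat with index pairs c(1, 3) and c(2, 3) for the other two panels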
(f) Choose the two clusters which seem to be closest together. For these, conduct
a Hotelling's T^2 test (assuming common covariance matrices) to compare their
means. Use R to determine the sample means and common sample covariance matrix
and calculate the F statistic. Show all working.
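
A sketch of the computation in (f), assuming xa and xb hold the observations
assigned to the two chosen clusters as matrices with 3 columns (these object names
are ours):

n1 <- nrow(xa); n2 <- nrow(xb); p <- ncol(xa)
d <- colMeans(xa) - colMeans(xb)                                  # difference in sample means
Sp <- ((n1 - 1) * cov(xa) + (n2 - 1) * cov(xb)) / (n1 + n2 - 2)   # pooled sample covariance
T2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(Sp) %*% d)      # Hotelling's T^2
Fstat <- (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2             # ~ F(p, n1 + n2 - p - 1) under H0
pval <- pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)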
(g) Try K-means clustering of the data with K varying from 1 to 5.
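
For (g), base R's kmeans() can be run over the candidate values of K; nstart
restarts the algorithm from several random initialisations to guard against poor
local optima:

km_fits <- lapply(1:5, function(K) kmeans(x, centers = K, nstart = 25))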
(h) Use the Gap statistic method to try to choose the optimal number of clusters.
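
For (h), one option is clusGap() from the cluster package (B is the number of
reference data sets; both it and nstart are arbitrary choices here):

library(cluster)
gap <- clusGap(x, FUNcluster = kmeans, K.max = 5, B = 100, nstart = 25)
plot(gap)   # inspect the gap curve against K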
(i) Try mixture model clustering of the data with 1 to 5 clusters.
(j) Use BIC to try to choose the optimal number of clusters for the mixture.
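
For (i) and (j), the mclust package fits Gaussian mixture models over a range of
component counts and reports BIC for each; note that mclust's convention is that
larger BIC is better:

library(mclust)
fit <- Mclust(x, G = 1:5)   # mixture model clustering with 1 to 5 components
plot(fit, what = "BIC")     # BIC across numbers of components and covariance models
fit$G                       # number of components chosen by BIC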
