
MSc in Statistics

Statistical Genetics – Bioinformatics


Unit 3: EM algorithm and Finite Mixture Models

Panagiotis Papastamoulis
Assistant Professor, Department of Statistics, AUEB

[email protected]

1/2/2021

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 1 / 61


Overview

1 The Expectation-Maximization (EM) algorithm


A motivating genetics example
Why the EM algorithm works
Remarks

2 Mixtures of Distributions

3 EM algorithm for finite mixtures

4 Examples
Poisson Mixture
ChIP-Seq data: zero inflated + overdispersed data
Remarks

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 2 / 61


Introduction
All relevant statistical information about the unknown parameters
is contained in the observed data likelihood

L(θ; x)

or the observed data posterior distribution

π(θ | x)

In real-world problems, these tend to be complicated functions of θ
Special computational tools are required in order to extract
meaningful summaries, such as parameter estimates and
standard errors
In this unit, we will augment the observed data by taking into
account unobserved (or missing) data
The key ideas behind the EM algorithm and data augmentation
are the same:
– to solve a difficult incomplete-data problem by repeatedly solving
tractable complete-data problems.
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 3 / 61
Some examples

Multivariate datasets where subjects did not respond to some
variables (e.g. non-response in surveys)
Recording errors
Censored data: e.g. observing lifetimes of units until failure,
where the observation period cannot exceed a given threshold
Discretization of continuous data
Important variables not measured (e.g. too expensive)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 4 / 61


Expectation–Maximization (EM) Algorithm
Iterative method to find a mode of the likelihood function L(θ; x)
Applicable in cases where the observed data x can be augmented
by some missing (that is, unobserved) data z
The basic idea:
– consider the complete data (x, z)
– if we observed this, we would be maximizing the complete
likelihood

L_c(θ; x, z)

– which would typically be simpler than L(θ; x)
– but we only observe x, not (x, z)
Expectation (E) step: calculate the expected value of
log L_c(θ; x, Z) given a current set of parameter values
Maximization (M) step: maximize the expected log-likelihood with
respect to the parameters
Iterate E and M steps until convergence

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 5 / 61


EM algorithm
1 Give some starting values θ^(0)
2 Iterate for t = 1, 2, . . . until a fixed point is reached:
1 Expectation step: compute

Q(\theta, \theta^{(t-1)}) := E_{Z \mid x;\, \theta^{(t-1)}}\left[\log L_c(\theta; x, Z)\right]

2 Maximization step: set

\theta^{(t)} = \arg\max_{\theta} Q(\theta, \theta^{(t-1)})

Notes:
The expectation at Step 2.1 refers to the expectation of a function
of Z with respect to the conditional distribution Z | X = x,
assuming that θ = θ^(t−1), with density f_{Z|X}(z | x; θ^(t−1))
E.g. for discrete Z

E_{Z \mid x;\, \theta^{(t-1)}}\left[\log L_c(\theta; x, Z)\right] = \sum_{z \in \mathcal{Z}} \log\{L_c(\theta; x, z)\}\, P(Z = z \mid x;\, \theta^{(t-1)})

It depends on θ^(t−1) (hence the subscript in the expectation)


P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 6 / 61
Classical genetics problem: blood group
Consider a random sample of size n
Observe phenotype counts
x = {x_A, x_B, x_O, x_AB}
n = x_A + x_B + x_O + x_AB
Genotype is not observed
Under Hardy-Weinberg equilibrium:

Genotype   Phenotype   P(Genotype)
AA         A           θ_A²
AO         A           2 θ_A θ_O
BB         B           θ_B²
BO         B           2 θ_B θ_O
OO         O           θ_O²
AB         AB          2 θ_A θ_B

(image from: https://en.wikipedia.org/wiki/ABO_blood_group_system)

Problem: Estimate the underlying
allele proportions θ = (θ_A, θ_B, θ_O)ᵀ
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 7 / 61
Allele proportions
θ_A = P(parent → A), θ_B = P(parent → B), θ_O = P(parent → O)
For genotype AA
– P(father → A) = θ_A
– P(mother → A) = θ_A
– Under independence: P(AA) = θ_A²
For genotype AO
– One scenario is: P(father → A) = θ_A and P(mother → O) = θ_O
– Another scenario is: P(father → O) = θ_O and P(mother → A) = θ_A
– Under independence + taking into account both scenarios:
P(AO) = 2 θ_A θ_O
The overall probability of observing phenotype A is

P(phenotype A) = P(genotype AA) + P(genotype AO) = θ_A² + 2 θ_A θ_O

The probability of observing the remaining phenotypes (B, O and AB)
is derived in a similar manner
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 8 / 61
Dataset (synthetic)

Random sample of size 1000


            x_A   x_B   x_O   x_AB
frequency   228   161   579   32

Task: estimate the underlying allele proportions given the observed
phenotype frequencies.

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 9 / 61


Classical genetics problem: blood group
Multinomial distribution:

X = (X_A, X_B, X_O, X_AB)ᵀ ∼ Mult(n; π)

where

\pi = \pi(\theta) = \left(\theta_A^2 + 2\theta_A\theta_O,\; \theta_B^2 + 2\theta_B\theta_O,\; \theta_O^2,\; 2\theta_A\theta_B\right)^\top

Unknown parameters: θ = (θ_A, θ_B, θ_O)ᵀ, where
– 0 ≤ θ_A ≤ 1
– 0 ≤ θ_B ≤ 1
– 0 ≤ θ_O ≤ 1
– θ_A + θ_B + θ_O = 1

Likelihood of the observed data:

L(\theta; x) = \frac{n!}{x_A!\,x_B!\,x_O!\,x_{AB}!}\left(\theta_A^2 + 2\theta_A\theta_O\right)^{x_A}\left(\theta_B^2 + 2\theta_B\theta_O\right)^{x_B}\theta_O^{2x_O}\left(2\theta_A\theta_B\right)^{x_{AB}}

Maximization of log L is not analytically tractable
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 10 / 61


Complete data formulation

Let z = (z_AA, z_AO, z_BB, z_BO)ᵀ be the unobserved frequencies of
genotypes AA, AO, BB and BO, respectively
It should hold that

z_AA + z_AO = x_A
z_BB + z_BO = x_B

(and of course they should be integers)

Complete data vector:

(x, z) = (x_A, x_B, x_O, x_AB, z_AA, z_AO, z_BB, z_BO)ᵀ

If we had observed (x, z) the problem would be easy!

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 11 / 61


Joint distribution of complete data

What is the joint distribution of (Z, X)?

(Z, X) = (Z_{AA}, Z_{AO}, Z_{BB}, Z_{BO}, X_O, X_{AB}, X_A, X_B)^\top
\sim \mathrm{Mult}\left(n;\ \left(\theta_A^2,\ 2\theta_A\theta_O,\ \theta_B^2,\ 2\theta_B\theta_O,\ \theta_O^2,\ 2\theta_A\theta_B\right)\right) \times I_{\{x_A = z_{AA} + z_{AO}\}}\, I_{\{x_B = z_{BB} + z_{BO}\}}

The indicator functions denote degenerate distributions at a single
point
This is due to the previous constraints:
– X_A = Z_AA + Z_AO
– X_B = Z_BB + Z_BO

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 12 / 61


Complete likelihood

Assume for instance that we have observed (x, z)

The complete likelihood is written as

L_c(\theta; x, z) := \frac{n!}{z_{AA}!\,z_{AO}!\,z_{BB}!\,z_{BO}!\,x_O!\,x_{AB}!}\;\theta_A^{2z_{AA}}\,(2\theta_A\theta_O)^{z_{AO}}\,\theta_B^{2z_{BB}}\,(2\theta_B\theta_O)^{z_{BO}}\,\theta_O^{2x_O}\,(2\theta_A\theta_B)^{x_{AB}}
\propto \theta_A^{2z_{AA}+z_{AO}+x_{AB}}\;\theta_B^{2z_{BB}+z_{BO}+x_{AB}}\;\theta_O^{z_{AO}+z_{BO}+2x_O}

Notice that this is easy to maximize with respect to θ:

\log L_c(\theta; x, z) = c + (2z_{AA}+z_{AO}+x_{AB})\log\theta_A + (2z_{BB}+z_{BO}+x_{AB})\log\theta_B + (z_{AO}+z_{BO}+2x_O)\log\theta_O

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 13 / 61


Exercise

Use the method of Lagrange multipliers to show that log L_c(θ; x, z) is
maximized at

\hat{\theta}_A = \frac{2z_{AA}+z_{AO}+x_{AB}}{2n} = \frac{z_{AA}+x_A+x_{AB}}{2n}
\hat{\theta}_B = \frac{2z_{BB}+z_{BO}+x_{AB}}{2n} = \frac{z_{BB}+x_B+x_{AB}}{2n}
\hat{\theta}_O = \frac{z_{AO}+z_{BO}+2x_O}{2n} = \frac{x_A - z_{AA} + x_B - z_{BB} + 2x_O}{2n}.

Notice however that these expressions are not computable:
we have not observed z_AA, z_BB. (A solution sketch follows below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 14 / 61
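A sketch of one possible solution to the exercise, in LaTeX notation (the Lagrangian enforces the constraint θ_A + θ_B + θ_O = 1):

\Lambda(\theta, \lambda) = (2z_{AA}+z_{AO}+x_{AB})\log\theta_A + (2z_{BB}+z_{BO}+x_{AB})\log\theta_B + (z_{AO}+z_{BO}+2x_O)\log\theta_O - \lambda\,(\theta_A+\theta_B+\theta_O-1)

Setting \partial\Lambda/\partial\theta_A = 0 gives \theta_A = (2z_{AA}+z_{AO}+x_{AB})/\lambda, and similarly for \theta_B and \theta_O. Adding the three stationarity equations and imposing \theta_A+\theta_B+\theta_O = 1 gives \lambda = 2(z_{AA}+z_{AO}+z_{BB}+z_{BO}+x_O+x_{AB}) = 2n, which yields the stated maximizers.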


Idea: take the expectation of the complete
log-likelihood

Since log L_c(θ; x, z) depends on the unobserved data z, what
happens if we compute the expectation

E\left[\log L_c(\theta; x, Z)\right] = E\left[c + (2Z_{AA}+Z_{AO}+x_{AB})\log\theta_A + (2Z_{BB}+Z_{BO}+x_{AB})\log\theta_B + (Z_{AO}+Z_{BO}+2x_O)\log\theta_O \mid x, \theta\right]
= c + \left(2E\{Z_{AA}\mid x,\theta\} + E\{Z_{AO}\mid x,\theta\} + x_{AB}\right)\log\theta_A
\quad + \left(2E\{Z_{BB}\mid x,\theta\} + E\{Z_{BO}\mid x,\theta\} + x_{AB}\right)\log\theta_B
\quad + \left(E\{Z_{AO}\mid x,\theta\} + E\{Z_{BO}\mid x,\theta\} + 2x_O\right)\log\theta_O

with respect to Z = (Z_AA, Z_AO = x_A − Z_AA, Z_BB, Z_BO = x_B − Z_BB)
given x and θ?

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 15 / 61


Conditional distribution
Given x and θ, the conditional distribution of Z is given by two
independent binomials:

(Z_{AA}, Z_{BB}) \mid (x, \theta) \sim \mathcal{B}\!\left(x_A,\ \frac{\theta_A^2}{\theta_A^2 + 2\theta_A\theta_O}\right) \otimes \mathcal{B}\!\left(x_B,\ \frac{\theta_B^2}{\theta_B^2 + 2\theta_B\theta_O}\right)

Conditional pmf:

f_{Z\mid X}(z\mid x;\theta) = \binom{x_A}{z_{AA}} \left(\frac{\theta_A^2}{\theta_A^2+2\theta_A\theta_O}\right)^{z_{AA}} \left(1 - \frac{\theta_A^2}{\theta_A^2+2\theta_A\theta_O}\right)^{x_A - z_{AA}}
\times \binom{x_B}{z_{BB}} \left(\frac{\theta_B^2}{\theta_B^2+2\theta_B\theta_O}\right)^{z_{BB}} \left(1 - \frac{\theta_B^2}{\theta_B^2+2\theta_B\theta_O}\right)^{x_B - z_{BB}}
\times I_{\mathcal{Z}}(z_{AA}, z_{AO}, z_{BB}, z_{BO})

where \mathcal{Z} := \{0, \ldots, x_A\} \times \{x_A - z_{AA}\} \times \{0, \ldots, x_B\} \times \{x_B - z_{BB}\}

Conditional expectations:

E(Z_{AA}\mid x, \theta) = x_A\,\frac{\theta_A^2}{\theta_A^2 + 2\theta_A\theta_O} =: \hat{z}_{AA}
E(Z_{BB}\mid x, \theta) = x_B\,\frac{\theta_B^2}{\theta_B^2 + 2\theta_B\theta_O} =: \hat{z}_{BB}
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 16 / 61
Expected complete log-likelihood

E_{Z\mid x;\theta}\left[\log L_c(\theta; x, Z)\right] = c + \left(2E\{Z_{AA}\mid x,\theta\} + E\{Z_{AO}\mid x,\theta\} + x_{AB}\right)\log\theta_A
\quad + \left(2E\{Z_{BB}\mid x,\theta\} + E\{Z_{BO}\mid x,\theta\} + x_{AB}\right)\log\theta_B
\quad + \left(E\{Z_{AO}\mid x,\theta\} + E\{Z_{BO}\mid x,\theta\} + 2x_O\right)\log\theta_O
= c + (\hat{z}_{AA} + x_A + x_{AB})\log\theta_A + (\hat{z}_{BB} + x_B + x_{AB})\log\theta_B + (x_A - \hat{z}_{AA} + x_B - \hat{z}_{BB} + 2x_O)\log\theta_O

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 17 / 61


Maximize the expected complete log-likelihood

Clearly, the maximum of E[log L_c(θ; x, Z)] is attained at:

\hat{\theta}_A = \frac{\hat{z}_{AA} + x_A + x_{AB}}{2n}
\hat{\theta}_B = \frac{\hat{z}_{BB} + x_B + x_{AB}}{2n}
\hat{\theta}_O = 1 - \hat{\theta}_A - \hat{\theta}_B.

Notice that these depend on θ (through the ẑ's)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 18 / 61


EM algorithm for blood group

1 Give some initial values θ_A^(0), θ_B^(0), θ_O^(0) = 1 − θ_A^(0) − θ_B^(0)
2 Iterate for t = 1, 2, . . .
1 Expectation step: set

\hat{z}_{AA} = x_A\,\frac{\left(\theta_A^{(t-1)}\right)^2}{\left(\theta_A^{(t-1)}\right)^2 + 2\theta_A^{(t-1)}\theta_O^{(t-1)}}, \qquad \hat{z}_{BB} = x_B\,\frac{\left(\theta_B^{(t-1)}\right)^2}{\left(\theta_B^{(t-1)}\right)^2 + 2\theta_B^{(t-1)}\theta_O^{(t-1)}}

2 Maximization step: set

\theta_A^{(t)} = \frac{\hat{z}_{AA} + x_A + x_{AB}}{2n}
\theta_B^{(t)} = \frac{\hat{z}_{BB} + x_B + x_{AB}}{2n}
\theta_O^{(t)} = 1 - \theta_A^{(t)} - \theta_B^{(t)}.
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 19 / 61


Results

t    θ_A^(t)     θ_B^(t)     θ_O^(t)     log L(θ^(t); x)
1    0.3319727   0.4283183   0.2397090   -1097.2232789
2    0.1766420   0.1344842   0.6888738   -32.0243787
3    0.1429550   0.1036589   0.7533860   -9.9195482
4    0.1398785   0.1016816   0.7584399   -9.7705883
5    0.1396249   0.1015572   0.7588179   -9.7697265
6    0.1396045   0.1015490   0.7588464   -9.7697214
7    0.1396029   0.1015485   0.7588486   -9.7697214

Convergence criterion:

log L(θ^(t); x) − log L(θ^(t−1); x) < 10^(-6)

The EM algorithm converges in 7 iterations

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 20 / 61


Results

[Figure: observed-data log-likelihood (left) and allele proportion estimates for A, B, O (right) against EM iteration 1–7]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 21 / 61


log-likelihood surface

[Figure: contour plot of the observed-data log-likelihood surface over (θ_A, θ_B)]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 22 / 61


Implementation in R
x <- c(228, 161, 579, 32)
names(x) <- c('A', 'B', 'O', 'AB')
n <- sum(x)   # sample size, used in the M-step

# observed-data log-likelihood: multinomial with phenotype probabilities pi(theta)
loglike <- function(theta){
  p_theta <- c(theta['A']^2 + 2*theta['A'] * theta['O'],
               theta['B']^2 + 2*theta['B'] * theta['O'],
               (theta['O'])^2, 2*theta['A'] * theta['B'])
  return(dmultinom(x, prob = p_theta, log = TRUE))
}

maxIter <- 100
theta <- matrix(NA, maxIter, 3)
colnames(theta) <- c('A', 'B', 'O')
loglikelihood <- numeric(maxIter)

# random starting values, normalized so that they sum to one
iter <- 1
theta[iter,] <- runif(3)
theta[iter,] <- theta[iter,]/sum(theta[iter,])
loglikelihood[iter] <- loglike(theta[iter,])
diff_logL <- 99999

while((iter < maxIter) & (diff_logL > 1e-6)){
  iter <- iter + 1
  # E-step: expected genotype counts z_AA, z_BB given the current theta
  z_AA <- x['A'] * theta[iter-1,'A']^2/(theta[iter-1,'A']^2 +
            2*theta[iter-1,'A']*theta[iter-1,'O'])
  z_BB <- x['B'] * theta[iter-1,'B']^2/(theta[iter-1,'B']^2 +
            2*theta[iter-1,'B']*theta[iter-1,'O'])
  # M-step: update the allele proportions
  theta[iter,'A'] <- (z_AA + x['A'] + x['AB'])/(2*n)
  theta[iter,'B'] <- (z_BB + x['B'] + x['AB'])/(2*n)
  theta[iter,'O'] <- 1 - theta[iter,'A'] - theta[iter,'B']
  loglikelihood[iter] <- loglike(theta[iter,])
  diff_logL <- loglikelihood[iter] - loglikelihood[iter-1]
}

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 23 / 61
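After the loop terminates, the estimates and the log-likelihood path can be inspected with, e.g.:

round(theta[1:iter, ], 7)          # allele proportion estimates per iteration
round(loglikelihood[1:iter], 7)    # observed-data log-likelihood per iteration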


Why the EM algorithm works

Each step of the EM algorithm increases the observed
log-likelihood:

log L(θ^(t+1); x) ≥ log L(θ^(t); x)

Why?

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 24 / 61


Connection between observed and complete likelihood
Conditional distribution of the missing data Z:

f_{Z\mid X}(z\mid x;\theta) = \frac{f_{X,Z}(x, z; \theta)}{f_X(x; \theta)}

Thus

\log L(\theta; x) = \log L_c(\theta; x, z) - \log f_{Z\mid X}(z\mid x;\theta)

Take the expectation with respect to Z ∼ f_{Z|X}(z | x; θ^(t)), i.e. assuming
that θ = θ^(t):

\log L(\theta; x) = E_{Z\mid x;\theta^{(t)}}\{\log L_c(\theta; x, Z)\} - E_{Z\mid x;\theta^{(t)}}\{\log f_{Z\mid X}(Z\mid x;\theta)\} =: Q(\theta, \theta^{(t)}) - H(\theta, \theta^{(t)})

Recall that at the M-step we maximize Q(θ, θ^(t)) with respect to θ

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 25 / 61


Log-likelihood difference between successive iterations

Thus

\log L(\theta^{(t+1)}; x) - \log L(\theta^{(t)}; x) = Q(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t+1)}, \theta^{(t)}) - \left[Q(\theta^{(t)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)})\right]
= Q(\theta^{(t+1)}, \theta^{(t)}) - Q(\theta^{(t)}, \theta^{(t)}) - \left[H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)})\right]

Clearly: Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) ≥ 0, by the M-step

It suffices to show that

H(\theta^{(t+1)}, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) \leq 0,

where

H(\theta, \theta^{(t)}) := E_{Z\mid x;\theta^{(t)}}\{\log f_{Z\mid X}(Z\mid x;\theta)\}, \qquad \theta \in \Theta

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 26 / 61


H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ≤ 0

For any θ ∈ Θ,

H(\theta, \theta^{(t)}) - H(\theta^{(t)}, \theta^{(t)}) = E_{Z\mid x;\theta^{(t)}}\{\log f_{Z\mid X}(Z\mid x;\theta)\} - E_{Z\mid x;\theta^{(t)}}\{\log f_{Z\mid X}(Z\mid x;\theta^{(t)})\}
= E_{Z\mid x;\theta^{(t)}}\left[\log \frac{f_{Z\mid X}(Z\mid x;\theta)}{f_{Z\mid X}(Z\mid x;\theta^{(t)})}\right]
\leq \log E_{Z\mid x;\theta^{(t)}}\left[\frac{f_{Z\mid X}(Z\mid x;\theta)}{f_{Z\mid X}(Z\mid x;\theta^{(t)})}\right]  (by Jensen's inequality)
= \log \int_{\mathcal{Z}} \frac{f_{Z\mid X}(z\mid x;\theta)}{f_{Z\mid X}(z\mid x;\theta^{(t)})}\, f_{Z\mid X}(z\mid x;\theta^{(t)})\, dz
= \log \int_{\mathcal{Z}} f_{Z\mid X}(z\mid x;\theta)\, dz = \log 1 = 0, \qquad \forall\, \theta \in \Theta.

The proof is completed.
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 27 / 61
Remarks

The EM algorithm is not just another numerical optimization


method. It gives deeper statistical insights for the problem at
hand, by exploiting the missing data structure
It can also be used for
– penalized likelihood problems
– estimating the mode of a posterior distribution

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 28 / 61


Remarks

The EM algorithm is similar to the Gibbs sampler (an MCMC algorithm).


Both techniques use data augmentation:
The EM algorithm computes the conditional expectation of the
missing data (deterministic step)
The Gibbs sampler simulates the missing data from the full
conditional posterior distribution (stochastic step)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 29 / 61


Remarks

The M-step can be performed using numerical techniques


(e.g. Newton-Raphson)
There are many variants of the EM algorithm, e.g.:
– Expectation-Conditional Maximization (ECM)
– Stochastic EM (SEM)
– Monte Carlo EM (MCEM)
– Generalized EM (GEM)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 30 / 61


Comments
The EM algorithm exhibits monotonic convergence (each step
increases the observed log-likelihood), but
– Depending on the starting values, it may converge to a local mode
– Multiple runs from different starting values may improve the result
– Clever initialization schemes have been suggested
The convergence is slow compared to other algorithms
(e.g. Newton-Raphson), but
– This means that more iterations are required for the EM
algorithm to converge
– It does not mean that the algorithm is slower than
Newton-Raphson: each EM iteration may be cheaper than the
computational cost of a Newton-Raphson iteration
– Remember that in Newton-Raphson we must invert the
Hessian matrix at each iteration
The standard errors of the estimates are not directly available
from the EM, but they can be estimated using
– Monte Carlo estimates of the missing information principle
– the bootstrap (a sketch follows below)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 31 / 61
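As an illustration of the bootstrap option above, a minimal parametric-bootstrap sketch in R for the blood-group example. It assumes a hypothetical helper em_abo(x) (not defined in these slides) that wraps the EM loop of the R implementation shown earlier and returns the estimated allele proportions for a phenotype count vector x:

# parametric bootstrap standard errors for the blood-group EM (sketch)
# em_abo(x) is a hypothetical wrapper around the EM loop shown earlier,
# returning c(A = theta_A_hat, B = theta_B_hat, O = theta_O_hat)
set.seed(1)
x <- c(A = 228, B = 161, O = 579, AB = 32)
n <- sum(x)
theta_hat <- em_abo(x)                              # EM estimate on the observed data
p_hat <- c(theta_hat['A']^2 + 2*theta_hat['A']*theta_hat['O'],
           theta_hat['B']^2 + 2*theta_hat['B']*theta_hat['O'],
           theta_hat['O']^2, 2*theta_hat['A']*theta_hat['B'])
B <- 200                                            # number of bootstrap replicates
boot <- t(replicate(B, {
  x_star <- setNames(as.vector(rmultinom(1, size = n, prob = p_hat)), names(x))
  em_abo(x_star)                                    # re-run the EM on the simulated sample
}))
apply(boot, 2, sd)                                  # bootstrap standard errors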
Introduction to Mixture Models

Finite Mixture Models


Infinite mixture models
EM algorithm/MCMC
Applications
– Zero-inflated ChIP-Seq datasets
– Heterogeneous RNA-Seq datasets
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 32 / 61
Some toy examples of normal mixtures

0.5N (−2, 1) + 0.5N (2, 1)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 33 / 61


Some toy examples of normal mixtures

0.5N (0, 1) + 0.5N (0, 32 )

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 33 / 61


Some toy examples of normal mixtures

\tfrac{49}{100}\,N\!\left(-1, \left(\tfrac{2}{3}\right)^2\right) + \tfrac{49}{100}\,N\!\left(1, \left(\tfrac{2}{3}\right)^2\right) + \sum_{j=0}^{6} \tfrac{1}{350}\,N\!\left(j-3, \left(\tfrac{1}{100}\right)^2\right)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 33 / 61


Some toy examples of normal mixtures

0.45\,N_2\!\left(\begin{pmatrix}-2\\-2\end{pmatrix}, \begin{pmatrix}2 & -.5\\ -.5 & 1\end{pmatrix}\right) + 0.45\,N_2\!\left(\begin{pmatrix}2\\2\end{pmatrix}, \begin{pmatrix}2 & .5\\ .5 & 1\end{pmatrix}\right) + 0.1\,N_2\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}.5 & 0\\ 0 & .5\end{pmatrix}\right)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 33 / 61


Semi-parametric way of estimation

Titterington et al (1985):
provided the number of component densities is not bounded
above, certain forms of mixture can be used to provide arbitrarily
close approximation to a given probability distribution

- - - Laplace(0,1) - - - Logistic(0,1) - - - t4
—– .5N (0, 0.29) + .5N (0, 3.8) —– .47N (0, 1.45) + .53N (0, 5.3) —– .8N (0, 0.8) + .2N (0, 7)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 34 / 61


Mixture of Normal Distributions
Data:

X_i \sim \sum_{k=1}^{K} p_k\, N(\mu_k, \sigma_k^2), \qquad i = 1, \ldots, n \text{ independently}

Number of components K ≥ 1: a fixed integer
p_k > 0: weight of component k, with

p \in \mathcal{P}_K := \left\{p_k > 0,\ k = 1, \ldots, K-1 : \sum_{k=1}^{K-1} p_k < 1,\ p_K := 1 - \sum_{k=1}^{K-1} p_k\right\}

Unknown parameters: (μ, σ², p) ∈ R^K × (0, +∞)^K × P_K

Likelihood (highly intractable):

L(\mu, \sigma^2, p; x) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k\, f(x_i; \mu_k, \sigma_k^2)

where f(⋅; μ_k, σ_k²) is the pdf of N(μ_k, σ_k²).
(A small R sketch of evaluating this likelihood follows below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 35 / 61
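Although maximizing this likelihood is hard, evaluating it is cheap. A minimal R sketch, using the toy mixture 0.5N(−2, 1) + 0.5N(2, 1) from the earlier slide (the data are simulated here and are not part of the course material):

# evaluate the normal-mixture log-likelihood at a given parameter value
mix_loglik <- function(x, p, mu, sigma){
  dens <- sapply(seq_along(p), function(k) p[k] * dnorm(x, mu[k], sigma[k]))
  sum(log(rowSums(dens)))   # sum_i log( sum_k p_k f(x_i; mu_k, sigma_k^2) )
}
set.seed(1)
x <- c(rnorm(100, -2, 1), rnorm(100, 2, 1))   # data from 0.5 N(-2,1) + 0.5 N(2,1)
mix_loglik(x, p = c(0.5, 0.5), mu = c(-2, 2), sigma = c(1, 1))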
Mixture of multivariate Normals

For multivariate x = (x_1, . . . , x_p)ᵀ assume a mixture of K
multivariate normal distributions:

x \sim \sum_{k=1}^{K} p_k\, N_p(\mu_k, \Sigma_k)

– μ_k ∈ R^p
– Σ_k: positive semi-definite p × p matrix

Example: Old Faithful Geyser Data
– Mixture of K = 2 bivariate normal distributions
(a fitting sketch follows below)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 36 / 61
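For the Old Faithful example, one convenient way to fit such a model in practice is the mclust package (an assumption here: it is installed; it is not used elsewhere in these slides). A brief sketch:

# fit a K = 2 bivariate normal mixture to the Old Faithful data with mclust
library(mclust)
data(faithful)                      # eruption durations and waiting times
fit <- Mclust(faithful, G = 2)      # EM-based fit with two components
summary(fit, parameters = TRUE)     # mixing proportions, means, covariance matrices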


Finite mixture of K distributions
Data:

X_i \sim \sum_{k=1}^{K} p_k\, f_k(x; \theta_k), \qquad i = 1, \ldots, n \text{ independently}

Number of components K > 1: a fixed integer
f_k: k-th component of the mixture
p_k > 0: weight of component k, with

p \in \mathcal{P}_K := \left\{p_k > 0,\ k = 1, \ldots, K-1 : \sum_{k=1}^{K-1} p_k < 1,\ p_K := 1 - \sum_{k=1}^{K-1} p_k\right\}

Unknown parameters: (θ, p) ∈ Θ × P_K
Likelihood (highly intractable):

L(\theta, p; x) = \prod_{i=1}^{n} \sum_{k=1}^{K} p_k\, f_k(x_i; \theta_k)

where f_k(⋅; θ_k) is the pdf of the k-th mixture component.
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 37 / 61


Mixtures as missing data models
Let Z_i = (Z_i1, . . . , Z_iK)ᵀ ∼ Mult(1, p) with

P(Z_i = z_i) = \prod_{j=1}^{K} p_j^{z_{ij}}, \qquad \text{independent for } i = 1, \ldots, n

where z_i = (z_i1, . . . , z_iK)ᵀ with z_ij ∈ {0, 1} and z_i1 + . . . + z_iK = 1.

Example with K = 3:
– P{Z_i = (0, 0, 1)} = p_3
– prior probability of assigning observation i to the 3rd mixture
component

Conditional distribution of X_i given Z_i:

X_i \mid (Z_i = z_i) \sim \prod_{j=1}^{K} f_j(\cdot; \theta_j)^{z_{ij}}, \qquad \text{independent for } i = 1, \ldots, n.

E.g. K = 3: X_i | {Z_i = (0, 0, 1)} ∼ f_3(⋅; θ_3)
(A simulation sketch based on this construction follows below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 39 / 61
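The latent-allocation formulation also gives a direct recipe for simulating mixture data: draw Z_i first, then X_i from the selected component. A small R sketch for an illustrative K = 3 normal mixture (the component values are arbitrary choices, not taken from the slides):

# simulate n observations from a K = 3 normal mixture via latent allocations
set.seed(1)
n <- 500
p <- c(0.2, 0.5, 0.3)                                  # mixing proportions
mu <- c(-3, 0, 4); sigma <- c(1, 0.5, 1.5)             # component parameters
z <- sample(1:3, size = n, replace = TRUE, prob = p)   # Z_i ~ Mult(1, p), coded 1..K
x <- rnorm(n, mean = mu[z], sd = sigma[z])             # X_i | Z_i = j ~ N(mu_j, sigma_j^2)
table(z) / n                                           # empirical allocation frequencies, close to p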


Mixtures as missing data models

Z = (Z 1 , . . . , Z n ) are unobserved (latent allocation variables)


If they were observed, the model would be ‘‘de-mixed’’
It is straightforward to show that the marginal distribution of X_i is
the mixture density

f_{X_i}(x) = \sum_{j=1}^{K} p_j\, f_j(x; \theta_j)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 40 / 61


Posterior (membership) probabilities

From Bayes' theorem we easily obtain that

P(Z_{ij} = 1 \mid x_i, p, \theta) = \frac{p_j\, f_j(x_i; \theta_j)}{\sum_{k=1}^{K} p_k\, f_k(x_i; \theta_k)} =: w_{ij}

independently for i = 1, . . . , n. This means that the conditional (or
posterior) distribution is

Z_i := (Z_{i1}, \ldots, Z_{iK})^\top \mid (x_i, p, \theta) \sim \mathrm{Mult}(1, w_i), \qquad \text{independent for } i = 1, \ldots, n \qquad (1)

where w_i := (w_i1, . . . , w_iK)
However we do not know p and θ
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 41 / 61


Complete Likelihood

Assuming that we have observed the pairs

(x_1, z_1), . . . , (x_n, z_n)

the complete (unobserved) likelihood is defined as

L_c(p, \theta; x, z) = \prod_{i=1}^{n} \prod_{j=1}^{K} \left\{p_j\, f_j(x_i; \theta_j)\right\}^{z_{ij}}
\log L_c(p, \theta; x, z) = \sum_{i=1}^{n} \sum_{j=1}^{K} z_{ij}\left\{\log p_j + \log f_j(x_i; \theta_j)\right\}

Although the Z's are unobserved, we can treat them as missing data
and compute their expectation, given the (unknown) parameter
values
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 42 / 61


EM algorithm (for finite mixtures)
1 Give some starting values p^(0), θ^(0)
2 Iterate for t = 1, 2, . . . until a fixed point is reached:
1 Expectation step: compute

Q(\{p, \theta\}, \{p^{(t-1)}, \theta^{(t-1)}\}) := E_{Z\mid x,\, p^{(t-1)},\, \theta^{(t-1)}}\left[\log L_c(p, \theta; x, Z)\right]
= \sum_{j=1}^{K} w_{\cdot j} \log p_j + \sum_{i=1}^{n}\sum_{j=1}^{K} w_{ij} \log f_j(x_i; \theta_j)

where

w_{\cdot j} := \sum_{i=1}^{n} w_{ij}, \qquad j = 1, \ldots, K

and Z = (Z_1, . . . , Z_n)ᵀ is distributed according to (1), given
p^(t−1), θ^(t−1) (and x).
2 Maximization step: set

\left(p^{(t)}, \theta^{(t)}\right) = \arg\max_{p, \theta} Q(\{p, \theta\}, \{p^{(t-1)}, \theta^{(t-1)}\})
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 43 / 61


Maximization step
Update of mixing proportions:

p_j^{(t)} = \frac{w_{\cdot j}}{n} = \frac{\sum_{i=1}^{n} w_{ij}}{n}, \qquad j = 1, \ldots, K

The update for the component-specific parameters (θ_1, . . . , θ_K)
depends on the parametric form of f_j(⋅)
E.g. when f_j(x; θ_j) is the pdf of N(μ_j, σ_j²), then for j = 1, . . . , K

\mu_j^{(t)} = \frac{\sum_{i=1}^{n} w_{ij}\, x_i}{\sum_{i=1}^{n} w_{ij}}
\left(\sigma_j^2\right)^{(t)} = \frac{\sum_{i=1}^{n} w_{ij}\left(x_i - \mu_j^{(t)}\right)^2}{\sum_{i=1}^{n} w_{ij}}

Question: what does the case K = 1 collapse to? Something familiar?
(Other examples follow; a full R sketch of this EM is given below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 44 / 61
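Putting the E- and M-steps together for a univariate normal mixture, a minimal R sketch (fixed K, simple random starting values; one possible implementation, not the only one):

# EM for a K-component (K >= 2) univariate normal mixture (minimal sketch)
em_norm_mix <- function(x, K, maxIter = 500, tol = 1e-6){
  n <- length(x)
  # starting values: equal weights, K randomly chosen data points as means
  p <- rep(1/K, K); mu <- sample(x, K); sigma <- rep(sd(x), K)
  logL_old <- -Inf
  for(it in 1:maxIter){
    # E-step ingredients: p_j * f(x_i; mu_j, sigma_j^2)
    dens <- sapply(1:K, function(j) p[j] * dnorm(x, mu[j], sigma[j]))
    logL <- sum(log(rowSums(dens)))
    if(logL - logL_old < tol) break
    logL_old <- logL
    w <- dens / rowSums(dens)               # posterior membership probabilities w_ij
    # M-step: weighted updates of proportions, means and standard deviations
    p <- colSums(w) / n
    mu <- sapply(1:K, function(j) sum(w[, j] * x) / sum(w[, j]))
    sigma <- sapply(1:K, function(j) sqrt(sum(w[, j] * (x - mu[j])^2) / sum(w[, j])))
  }
  list(p = p, mu = mu, sigma = sigma, logL = logL, iterations = it)
}

For instance, em_norm_mix(x, K = 3) applied to the data simulated in the earlier latent-allocation sketch recovers weights, means and standard deviations close to the simulating values, up to the ordering of the components.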
Example: Poisson mixture

Simulate n = 80 observations from

0.2 P(1) + 0.8 P(5)

Observed data:

[Figure: barplot of the observed count frequencies (counts 0–9)]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 45 / 61


Example: Poisson mixture

EM algorithm for K = 2 Poisson components

1 Give some starting values p_1^(0), p_2^(0) = 1 − p_1^(0), λ_1^(0), λ_2^(0)
2 Iterate for t = 1, 2, . . . until a fixed point is reached:
1 For i = 1, . . . , n and j = 1, 2 set

w_{ij} = \frac{p_j^{(t-1)}\, e^{-\lambda_j^{(t-1)}} \left(\lambda_j^{(t-1)}\right)^{x_i} / x_i!}{p_1^{(t-1)}\, e^{-\lambda_1^{(t-1)}} \left(\lambda_1^{(t-1)}\right)^{x_i} / x_i! \; + \; p_2^{(t-1)}\, e^{-\lambda_2^{(t-1)}} \left(\lambda_2^{(t-1)}\right)^{x_i} / x_i!}

2 For j = 1, 2 set

p_j^{(t)} = \frac{\sum_{i=1}^{n} w_{ij}}{n}, \qquad \lambda_j^{(t)} = \frac{\sum_{i=1}^{n} w_{ij}\, x_i}{\sum_{i=1}^{n} w_{ij}}

(An R sketch of these two steps follows below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 46 / 61
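A minimal R translation of the two steps above (a sketch: the data are re-simulated here, so they will not reproduce the exact counts or iteration numbers shown on the next slides):

# EM for a two-component Poisson mixture (sketch of the algorithm above)
set.seed(1)
z <- rbinom(80, 1, 0.8)                        # latent allocations (used only to simulate)
x <- rpois(80, lambda = ifelse(z == 1, 5, 1))  # n = 80 counts from 0.2 P(1) + 0.8 P(5)
p <- c(0.5, 0.5); lambda <- c(1, 10)           # starting values
logL_old <- -Inf
repeat{
  # E-step: w_ij proportional to p_j * Poisson(x_i; lambda_j)
  dens <- cbind(p[1] * dpois(x, lambda[1]), p[2] * dpois(x, lambda[2]))
  logL <- sum(log(rowSums(dens)))
  if(logL - logL_old < 1e-6) break
  logL_old <- logL
  w <- dens / rowSums(dens)
  # M-step: update mixing proportions and Poisson means
  p <- colSums(w) / length(x)
  lambda <- colSums(w * x) / colSums(w)
}
round(c(p = p, lambda = lambda, logL = logL), 3)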


Example: Poisson mixture

[Figure: log-likelihood, mixing proportion and λ estimates against EM iteration]

t    p_1^(t)  p_2^(t)  λ_1^(t)  λ_2^(t)  log-likelihood
1    0.40     0.60     4.35     7.13     -206.36
48   0.21     0.79     1.08     4.90     -180.00

Convergence criterion: log L(p^(t), λ^(t); x) − log L(p^(t−1), λ^(t−1); x) < 10^(-6)

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 47 / 61



Example: Poisson mixture

[Figure: log-likelihood, mixing proportion and λ estimates against EM iteration]

t    p_1^(t)  p_2^(t)  λ_1^(t)  λ_2^(t)  log-likelihood
1    0.06     0.94     8.81     4.13     -188.91
87   0.79     0.21     4.90     1.08     -180.00

Convergence criterion: log L(p^(t), λ^(t); x) − log L(p^(t−1), λ^(t−1); x) < 10^(-6)
Did you notice anything weird?

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 48 / 61


Label switching

        p̂_1    p̂_2    λ̂_1    λ̂_2    log-likelihood
run 1   0.21    0.79    1.08    4.90    -180.00
run 2   0.79    0.21    4.90    1.08    -180.00

Typically, the parameters of mixture models are not identifiable
The likelihood is invariant to permutations of the parameters
This means that the likelihood has (a multiple of) K! symmetric
modes (where K is the number of components)
This behaviour is not of great importance in the EM algorithm
– The EM algorithm will converge to one among the symmetric modes
of the likelihood (not necessarily the main modes though...)
– depending on the starting values
But it causes many problems in other settings, such as:
– standard results from asymptotic theory do not apply
– it burdens the Bayesian estimation of mixture models (many
solutions exist though)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 49 / 61
Example: Poisson mixture

x      f(x)     f̂(x)
0 0.079 0.077
1 0.101 0.106
2 0.104 0.112
3 0.125 0.130
4 0.143 0.145
5 0.141 0.139
6 0.117 0.113
7 0.084 0.079
8 0.052 0.049
9 0.029 0.026
10 0.015 0.013
11 0.007 0.006
12 0.003 0.002

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 50 / 61


Example: Poisson mixture

[Figure: true pmf versus estimated pmf of the Poisson mixture, for counts 0–9]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 51 / 61


Example: Poisson mixture

0.79 P(4.90) + 0.21 P(1.08)

[Figure: estimated mixture pmf with its two weighted components, 0.79 P(4.90) and 0.21 P(1.08), for counts 0–9]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 52 / 61


Example: ChIP-Seq dataset
Data from Kuan et al (2011): A Statistical Framework for the
Analysis of ChIP-Seq Data. JASA, 106(495):891-903
The data can be obtained from the
mosaicsExample package in Bioconductor
Sequences of pieces of DNA that are obtained from chromatin
immunoprecipitation (ChIP)
This technology enables the mapping of the locations along
genomic DNA of transcription factors, nucleosomes, histone
modifications, chromatin remodeling enzymes, chaperones,
polymerases and other proteins.
See the source code file for chapter 4 of the Modern Statistics for
Modern Biology book to obtain the data
> describe(bincts$tagCount)
         n  mean   sd median min max range skew kurtosis
X1  256523   1.8 2.48      1   0  66    66 3.23    25.02

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 53 / 61


Example: ChIP-Seq dataset
[Figure: histogram of bin-level tag counts (0–66)]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 54 / 61


Example: ChIP-Seq dataset
[Figure: histogram of bin-level tag counts, frequency on log scale]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 55 / 61


Example: ChIP-Seq dataset (various models)
Poisson:
X ∼ P(λ)
Negative Binomial:
X ∼ NB(μ, r)
Zero Inflated Poisson models (k components):

X \sim p_1\, I(x = 0) + \sum_{j=2}^{k} p_j\, f(x; \lambda_j)

– So it is a mixture of k distributions where
– the first component is a degenerate distribution at zero
– the j-th component is the P(λ_j) distribution
– p_j ≥ 0 with Σ_{j=1}^{k} p_j = 1
– λ := (λ_2, . . . , λ_k) ∈ (0, ∞)^{k−1}
– This model will be estimated by the EM algorithm, for k = 2, . . . , 6
(a sketch of the corresponding E-step weights follows below).
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 56 / 61
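Relative to the plain Poisson-mixture EM, only the E-step weight of the degenerate component changes. A hedged R sketch of those weights (zip_weights is an illustrative helper name, not from the slides):

# E-step weights for the ZIP(k) model (sketch)
# p:      vector of length k, p[1] = weight of the point mass at zero
# lambda: vector of length k - 1, Poisson means of components 2, ..., k
zip_weights <- function(x, p, lambda){
  dens <- cbind(p[1] * as.numeric(x == 0),                       # degenerate component at 0
                sapply(seq_along(lambda),
                       function(j) p[j + 1] * dpois(x, lambda[j])))
  dens / rowSums(dens)                                           # w_ij, i = 1..n, j = 1..k
}
# M-step (as before): p_j = mean of column j of the weights;
# lambda_j = weighted mean of x using column j, for j = 2, ..., k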


Example: ChIP-Seq dataset
[Figure: empirical pmf of tag counts, probability on log scale]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 57 / 61


Example: ChIP-Seq dataset

[Figure: empirical pmf of tag counts (probability on log scale) with the fitted Poisson, ZIP(2)–ZIP(6) and negative binomial models]

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 57 / 61




Initialization is crucial for the EM algorithm

[Figure: log-likelihood values obtained at the last iteration of the EM
algorithm (Poisson GLM mixture model), initialized by two different
initialization schemes, for a number of mixture components varying
from 1 to 30. The right-hand panel is a zoomed version of the left-hand
panel. x-axis: number of mixture components (K).]
(A simple multi-start safeguard is sketched below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 58 / 61
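A simple safeguard in this spirit is to run the EM from several random starting values and keep the run with the highest log-likelihood. A sketch, reusing the em_norm_mix() helper sketched after the M-step slide (for the Poisson GLM mixture of the comparison above, the same idea applies with the corresponding EM):

# multi-start EM: keep the best of several random initializations (sketch)
best_em <- function(x, K, nStarts = 20){
  runs <- lapply(1:nStarts, function(s) em_norm_mix(x, K))
  runs[[ which.max(sapply(runs, function(r) r$logL)) ]]
}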


Estimating the number of mixture components
Model-selection problem (not easy)
– The number of things we don't know is something that we don't
know...
Common approach: use penalized likelihood criteria
– Information criteria (e.g. AIC, BIC) are often
used to assess the order of a mixture model, but:
– mixtures of distributions are really bizarre statistical models, so the
standard asymptotic theory does not apply
– they usually tend to overfit
– Integrated Completed Likelihood (ICL)
Other approaches: Bayesian estimation
– MCMC sampling on the space of mixture models with different
numbers of components
– Reversible Jump MCMC
– Birth-Death MCMC
– Allocation Sampler
– Overfitting Mixture Models
(A BIC computation sketch follows below.)
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 59 / 61
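As an illustration of the penalized-likelihood route, BIC for a fitted K-component model is −2 log L̂ + d_K log n, with d_K the number of free parameters (d_K = 3K − 1 for a univariate normal mixture). A sketch, assuming data x and the em_norm_mix() helper sketched earlier; the overfitting caveat above still applies:

# compare K = 2, ..., 6 univariate normal mixtures by BIC (smaller is better)
bic <- sapply(2:6, function(K){
  fit <- em_norm_mix(x, K)                        # helper sketched after the M-step slide
  -2 * fit$logL + (3 * K - 1) * log(length(x))    # penalty: d_K log(n), with d_K = 3K - 1
})
names(bic) <- paste0("K=", 2:6)
bic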


Closing remarks (Mixture models)
Mixture distributions comprise a finite or infinite number of
components, possibly of different distributional types, that can
describe different features of data
They facilitate much more careful description of complex systems,
as evidenced by the enthusiasm with which they have been
adopted in such diverse areas as astronomy, ecology,
bioinformatics, computer science, economics, engineering,
robotics and biostatistics
Historically, mixture models have provided an endless benchmark
for assessing various statistical techniques, from the EM
algorithm to reversible jump methodology.
Mixture models are the basis for formulating more general latent
variable models, such as
– Hidden Markov Models
– Markov Random Fields
They provide the means for model-based clustering of complex
datasets (next unit).
P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 60 / 61
Recommended Bibliography

Dempster, A.P., Laird, N.M., and Rubin, D.B.
Maximum Likelihood from Incomplete Data via the EM Algorithm.
JRSS B, 1977

McLachlan, G. and Peel, D.
Finite Mixture Models.
Wiley Series in Probability and Statistics, 2000

Marin, J.M., Mengersen, K., and Robert, C.P.
Bayesian Modelling and Inference on Mixtures of Distributions.
Handbook of Statistics, 2005

Holmes, S. and Huber, W.
Chapter 4 in Modern Statistics for Modern Biology.
Cambridge University Press, 2019

P. Papastamoulis Statistical Genetics – Bioinformatics 1/2/2021 61 / 61
