EM and Mixture Models
Panagiotis Papastamoulis
Assistant Professor, Department of Statistics, AUEB
1/2/2021
2  Mixtures of Distributions
4  Examples
   - Poisson Mixture
   - ChIP-Seq data: zero-inflated and overdispersed data
Remarks
- Inference relies on the observed-data likelihood L(θ; x) or the observed-data posterior distribution π(θ | x)
- In real-world problems, these tend to be complicated functions of θ
- Special computational tools are required in order to extract meaningful summaries, such as parameter estimates and standard errors
- In this unit, we will augment the observed data by taking into account unobserved (or missing) data
- The key ideas behind the EM algorithm and data augmentation are the same:
  to solve a difficult incomplete-data problem by repeatedly solving tractable complete-data problems.
Some examples
Notes:
- The expectation at Step 2.1 refers to the expectation of a function of Z with respect to the conditional distribution Z | X = x, assuming that θ = θ^(t−1), i.e. f_{Z|X}(z | x; θ^(t−1))
- E.g. for discrete Z, E_{Z|x; θ^(t−1)}{g(Z)} = Σ_z g(z) f_{Z|X}(z | x; θ^(t−1))
For genotype AO:
One scenario is:
- P(father → A) = θ_A
- P(mother → O) = θ_O
Another scenario is:
- P(father → O) = θ_O
- P(mother → A) = θ_A
Under independence and taking into account both scenarios:
P(AO) = 2 θ_A θ_O

The overall probability of observing phenotype A is
P(phenotype A) = P(genotype AA) + P(genotype AO) = θ_A² + 2 θ_A θ_O

The probability of observing the remaining phenotypes (B, O and AB) is derived in a similar manner.
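As a quick sanity check (not on the slide): the four phenotype probabilities sum to one, since
(θ_A² + 2 θ_A θ_O) + (θ_B² + 2 θ_B θ_O) + θ_O² + 2 θ_A θ_B = (θ_A + θ_B + θ_O)² = 1.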
Dataset (synthetic)
The latent genotype counts satisfy
z_AA + z_AO = x_A,    z_BB + z_BO = x_B

The complete-data likelihood is
L_c(θ; x, z) := n! / (z_AA! z_AO! z_BB! z_BO! x_O! x_AB!) × θ_A^{2 z_AA} (2 θ_A θ_O)^{z_AO} θ_B^{2 z_BB} (2 θ_B θ_O)^{z_BO} θ_O^{2 x_O} (2 θ_A θ_B)^{x_AB}
             ∝ θ_A^{2 z_AA + z_AO + x_AB} θ_B^{2 z_BB + z_BO + x_AB} θ_O^{z_AO + z_BO + 2 x_O}
Maximising the complete-data likelihood gives
θ̂_A = (ẑ_AA + x_A + x_AB) / (2n)
θ̂_B = (ẑ_BB + x_B + x_AB) / (2n)
θ̂_O = 1 − θ̂_A − θ̂_B.
E-step: given θ^(t−1), compute the expected genotype counts
ẑ_AA = x_A (θ_A^(t−1))² / [ (θ_A^(t−1))² + 2 θ_A^(t−1) θ_O^(t−1) ],    ẑ_BB = x_B (θ_B^(t−1))² / [ (θ_B^(t−1))² + 2 θ_B^(t−1) θ_O^(t−1) ]

M-step:
θ_A^(t) = (ẑ_AA + x_A + x_AB) / (2n)
θ_B^(t) = (ẑ_BB + x_B + x_AB) / (2n)
θ_O^(t) = 1 − θ_A^(t) − θ_B^(t).
[Figure: observed-data log-likelihood (left) and allele proportions A, B, O (right) against EM iterations 1–7]
[Figure: contour plot of the observed-data log-likelihood as a function of (θ_A, θ_B)]
# EM for the allele proportions: initialisation
theta <- matrix(NA, 1000, 3)       # one row per iteration: (theta_A, theta_B, theta_O)
loglikelihood <- numeric(1000)
iter <- 1
theta[iter,] <- runif(3)                          # random starting value ...
theta[iter,] <- theta[iter,]/sum(theta[iter,])    # ... normalised to sum to 1
loglikelihood[iter] <- loglike(theta[iter,])      # loglike(): observed-data log-likelihood
diff_logL <- 99999                                # initial value of the stopping criterion
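To make the fragment above self-contained, here is a minimal sketch of the rest of the EM scheme, implementing the E- and M-steps shown earlier. The phenotype counts xA, xB, xO, xAB are illustrative placeholders (the synthetic dataset is not reproduced in this excerpt), and loglike() is one plausible definition of the observed-data log-likelihood; both should be defined before running the initialisation above.

# placeholder phenotype counts (replace with the actual synthetic dataset)
xA <- 100; xB <- 50; xO <- 120; xAB <- 30
n <- xA + xB + xO + xAB

# observed-data log-likelihood, using the phenotype probabilities derived earlier
loglike <- function(th) {
  pA  <- th[1]^2 + 2 * th[1] * th[3]
  pB  <- th[2]^2 + 2 * th[2] * th[3]
  pO  <- th[3]^2
  pAB <- 2 * th[1] * th[2]
  xA * log(pA) + xB * log(pB) + xO * log(pO) + xAB * log(pAB)
}

# EM iterations (run after the initialisation above)
while (diff_logL > 1e-6) {
  th <- theta[iter,]
  # E-step: expected counts of the unobserved genotypes AA and BB
  zAA <- xA * th[1]^2 / (th[1]^2 + 2 * th[1] * th[3])
  zBB <- xB * th[2]^2 / (th[2]^2 + 2 * th[2] * th[3])
  # M-step: complete-data MLE evaluated at the expected counts
  iter <- iter + 1
  theta[iter, 1] <- (zAA + xA + xAB) / (2 * n)
  theta[iter, 2] <- (zBB + xB + xAB) / (2 * n)
  theta[iter, 3] <- 1 - theta[iter, 1] - theta[iter, 2]
  loglikelihood[iter] <- loglike(theta[iter,])
  diff_logL <- loglikelihood[iter] - loglikelihood[iter - 1]
}
theta[iter,]   # estimated allele proportions (theta_A, theta_B, theta_O)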
f_{Z|X}(z | x; θ) = f_{X,Z}(x, z; θ) / f_X(x; θ)

Thus
log L(θ; x) = E_{Z|x; θ^(t)} {log L_c(θ; x, Z)} − E_{Z|x; θ^(t)} {log f_{Z|X}(Z | x; θ)}
            =: Q(θ, θ^(t)) − H(θ, θ^(t))
Thus
log L(θ^(t+1); x) − log L(θ^(t); x)
   = Q(θ^(t+1), θ^(t)) − H(θ^(t+1), θ^(t)) − [ Q(θ^(t), θ^(t)) − H(θ^(t), θ^(t)) ]
   = Q(θ^(t+1), θ^(t)) − Q(θ^(t), θ^(t)) − [ H(θ^(t+1), θ^(t)) − H(θ^(t), θ^(t)) ]

where
H(θ, θ^(t)) := E_{Z|x; θ^(t)} log f_{Z|X}(Z | x; θ),    θ ∈ Θ
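A short note added for completeness (standard EM theory): the second bracket is never positive, because by Jensen's inequality
H(θ, θ^(t)) − H(θ^(t), θ^(t)) = E_{Z|x; θ^(t)} log [ f_{Z|X}(Z | x; θ) / f_{Z|X}(Z | x; θ^(t)) ]
                              ≤ log E_{Z|x; θ^(t)} [ f_{Z|X}(Z | x; θ) / f_{Z|X}(Z | x; θ^(t)) ] = log 1 = 0
for every θ. Since the M-step chooses θ^(t+1) so that Q(θ^(t+1), θ^(t)) ≥ Q(θ^(t), θ^(t)), it follows that log L(θ^(t+1); x) ≥ log L(θ^(t); x): an EM iteration cannot decrease the observed-data log-likelihood.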
0.45 N₂( (−2, −2)ᵀ, [ 2  −0.5; −0.5  1 ] )
  + 0.45 N₂( (2, 2)ᵀ, [ 2  0.5; 0.5  1 ] )
  + 0.1 N₂( (0, 0)ᵀ, [ 0.5  0; 0  0.5 ] )
Titterington et al (1985):
provided the number of component densities is not bounded
above, certain forms of mixture can be used to provide arbitrarily
close approximation to a given probability distribution
[Figure: dashed curves: Laplace(0,1), Logistic(0,1) and t₄ densities; solid curves: approximating normal mixtures 0.5 N(0, 0.29) + 0.5 N(0, 3.8), 0.47 N(0, 1.45) + 0.53 N(0, 5.3), 0.8 N(0, 0.8) + 0.2 N(0, 7)]
Example:
x ∼ Σ_{k=1}^K p_k N_p(μ_k, Σ_k)
- μ_k ∈ R^p
- Σ_k positive semi-definite p × p matrix
- Old Faithful Geyser Data
- Mixture of K = 2 bivariate normal distributions
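As an aside (not from the slides), such a fit can be reproduced in R with, e.g., the mclust package; the snippet below is one possible way to do it, using the built-in faithful data:

library(mclust)                     # model-based clustering via normal mixtures
fit <- Mclust(faithful, G = 2)      # two-component bivariate normal mixture
summary(fit, parameters = TRUE)     # mixing proportions, means, covariance matrices
plot(fit, what = "classification")  # data coloured by estimated component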
Latent allocation vectors Z_i = (Z_i1, ..., Z_iK):
P(Z_i = z_i) = ∏_{j=1}^K p_j^{z_ij},    independent for i = 1, ..., n,
where z_ij = 1 if observation i is generated from component j and z_ij = 0 otherwise.

Conditional distribution of X_i given Z_i:
X_i | (Z_i = z_i) ∼ ∏_{j=1}^K f_j(⋅; θ_j)^{z_ij},    independent for i = 1, ..., n.

Marginally,
f_{X_i}(x) = Σ_{j=1}^K p_j f_j(x; θ_j)

and the conditional probability that observation i arose from component j is
P(Z_ij = 1 | x_i, p, θ) = p_j f_j(x_i; θ_j) / Σ_{k=1}^K p_k f_k(x_i; θ_k) =: w_ij
The complete data are (x_1, z_1), ..., (x_n, z_n), with complete-data likelihood
L_c(p, θ; x, z) = ∏_{i=1}^n ∏_{j=1}^K { p_j f_j(x_i; θ_j) }^{z_ij}

log L_c(p, θ; x, z) = Σ_{i=1}^n Σ_{j=1}^K z_ij { log p_j + log f_j(x_i; θ_j) }

Maximising the expected complete-data log-likelihood gives the mixing-proportion update
p_j^(t) = w_⋅j / n,    where w_⋅j := Σ_{i=1}^n w_ij,    j = 1, ..., K

The update of (θ_1, ..., θ_K) depends on the parametric form of f_j(⋅).
E.g. when f_j(x; θ_j) is the pdf of N(μ_j, σ_j²), then for j = 1, ..., K
μ_j^(t) = Σ_{i=1}^n w_ij x_i / w_⋅j,    σ_j^{2(t)} = Σ_{i=1}^n w_ij (x_i − μ_j^(t))² / w_⋅j.
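The formulas above translate directly into code. Below is a minimal sketch of one EM iteration for a univariate K-component normal mixture (the function name em_step_normal and the argument names are illustrative):

# one EM iteration for a univariate normal mixture
# x: data vector; p, mu, sigma2: current mixing proportions, means and variances (length K)
em_step_normal <- function(x, p, mu, sigma2) {
  n <- length(x); K <- length(p)
  # E-step: w[i, j] = P(Z_ij = 1 | x_i, p, theta)
  dens <- sapply(1:K, function(j) p[j] * dnorm(x, mu[j], sqrt(sigma2[j])))
  w <- dens / rowSums(dens)
  # M-step: weighted updates of p, mu and sigma2
  w.j    <- colSums(w)
  p      <- w.j / n
  mu     <- colSums(w * x) / w.j
  sigma2 <- colSums(w * (x - matrix(mu, n, K, byrow = TRUE))^2) / w.j
  list(p = p, mu = mu, sigma2 = sigma2,
       loglik = sum(log(rowSums(dens))))   # log-likelihood at the input parameter values
}
# iterate em_step_normal() until the increase in loglik drops below a small threshold (e.g. 1e-6)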
0.2 P(1) + 0.8 P(5)
[Figure: observed data: frequencies of counts 0–9]
2. For j = 1, 2 set
   p_j^(t) = w_⋅j / n,    λ_j^(t) = Σ_{i=1}^n w_ij x_i / w_⋅j
[Figure: log-likelihood, lambda and mixing proportion estimates against EM iteration]
convergence criterion: log L(p^(t), θ^(t)) − log L(p^(t−1), θ^(t−1)) < 10⁻⁶
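A minimal sketch of this EM scheme in R for the two-component Poisson mixture, following the weight and update formulas above (the sample size, seed and starting values are illustrative):

# simulate data from 0.2 P(1) + 0.8 P(5) and fit a two-component Poisson mixture by EM
set.seed(1)
n <- 200
x <- ifelse(runif(n) < 0.2, rpois(n, 1), rpois(n, 5))
p <- c(0.5, 0.5); lambda <- c(1, 2)                    # starting values
loglik <- sum(log(p[1] * dpois(x, lambda[1]) + p[2] * dpois(x, lambda[2])))
repeat {
  # E-step: w[i, j] = P(Z_ij = 1 | x_i, p, lambda)
  dens <- cbind(p[1] * dpois(x, lambda[1]), p[2] * dpois(x, lambda[2]))
  w <- dens / rowSums(dens)
  # M-step: p_j = w_.j / n and lambda_j = sum_i w_ij x_i / w_.j
  w.j <- colSums(w)
  p <- w.j / n
  lambda <- colSums(w * x) / w.j
  newloglik <- sum(log(p[1] * dpois(x, lambda[1]) + p[2] * dpois(x, lambda[2])))
  if (newloglik - loglik < 1e-6) break                 # convergence criterion as on the slide
  loglik <- newloglik
}
p; lambda                                              # estimated mixing proportions and means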
[Figure: log-likelihood, lambda and mixing proportion estimates against EM iteration (another run)]
convergence criterion: log L(p^(t), θ^(t)) − log L(p^(t−1), θ^(t−1)) < 10⁻⁶
Did you notice anything weird?
x      f(x)      f̂(x)
0 0.079 0.077
1 0.101 0.106
2 0.104 0.112
3 0.125 0.130
4 0.143 0.145
5 0.141 0.139
6 0.117 0.113
7 0.084 0.079
8 0.052 0.049
9 0.029 0.026
10 0.015 0.013
11 0.007 0.006
12 0.003 0.002
[Figure: true pmf versus estimated pmf over counts 0–9; the fitted mixture includes the component 0.21 P(1.08)]
[Figure: histogram of the observed counts (left: raw frequencies, right: log scale)]
A zero-inflated mixture model:
X ∼ p_1 I(x = 0) + Σ_{j=2}^k p_j f(x; λ_j),
i.e. a point mass at zero plus k − 1 Poisson components.
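For illustration, the pmf of such a zero-inflated Poisson mixture can be evaluated as in the sketch below (the function name dzipmix and the example parameter values are hypothetical):

# pmf of a zero-inflated Poisson mixture:
# p[1] is the weight of the point mass at zero; p[j], lambda[j] (j >= 2) are the Poisson components
dzipmix <- function(x, p, lambda) {
  out <- p[1] * (x == 0)
  for (j in 2:length(p)) out <- out + p[j] * dpois(x, lambda[j])
  out
}
# e.g. a model with a spike at zero and two Poisson components
dzipmix(0:10, p = c(0.3, 0.3, 0.4), lambda = c(NA, 1, 6))   # lambda[1] is unused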
[Figure: fitted Poisson, ZIP(2)–ZIP(6) and negative binomial models; probability (log scale) against count]