Abstract
The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating the parameters of statistical models in the case of incomplete or hidden data. EM assumes that there is a relationship between hidden data and observed data, which can be a joint distribution or a mapping function, and this in turn implies a relationship between parameter estimation and data imputation. If missing data, which contains missing values, is considered as hidden data, it is very natural to handle missing data with the EM algorithm. Handling missing data is not a new research topic, but this report focuses on the theoretical basis, with detailed mathematical proofs, for filling in missing values with EM. Besides, the multinormal distribution and the multinomial distribution are the two sample statistical models considered for holding missing values.
Keywords- Expectation Maximization (EM), Missing Data, Multinormal Distribution, Multinomial Distribution
I. INTRODUCTION
The literature on the expectation maximization (EM) algorithm in this report is mainly extracted from the preeminent article "Maximum Likelihood from Incomplete Data via the EM Algorithm" by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin (Dempster, Laird, & Rubin, 1977). For convenience, let DLR refer to these three authors. The preprint "Tutorial on EM algorithm" (Nguyen, 2020) by Loc Nguyen is also referenced in this report.
Now we skim through an introduction to the EM algorithm. Suppose there are two spaces X and Y, in which X is the hidden space whereas Y is the observed space. We do not know X, but there is a mapping from X to Y so that we can survey X by observing Y. The mapping is a many-one function φ: X → Y, and we denote φ–1(Y) = {X ∈ X: φ(X) = Y} as all X ∈ X such that φ(X) = Y. We also denote X(Y) = φ–1(Y). Let f(X | Θ) be the probability density function (PDF) of random variable X ∈ X and let g(Y | Θ) be the PDF of random variable Y ∈ Y. Note, Y is also called the observation. Equation 1.1 specifies g(Y | Θ) as the integral of f(X | Θ) over φ–1(Y).
g(Y|Θ) = ∫_{φ⁻¹(Y)} f(X|Θ) dX    (1.1)
Where Θ is the probabilistic parameter represented as a column vector, Θ = (θ1, θ2,…, θr)T, in which each θi is a particular parameter. If X and Y are discrete, equation 1.1 is re-written as follows:
g(Y|Θ) = Σ_{X∈φ⁻¹(Y)} f(X|Θ)
According to the viewpoint of Bayesian statistics, Θ is also a random variable. As a convention, let Ω be the domain of Θ such that Θ ∈ Ω, where the dimension of Ω is r. For example, the normal distribution has two particular parameters, the mean μ and the variance σ2, and so we have Θ = (μ, σ2)T. Note that Θ can degrade into a scalar as Θ = θ. The conditional PDF of X given Y, denoted k(X | Y, Θ), is specified by equation 1.2.
k(X|Y, Θ) = f(X|Θ) / g(Y|Θ)    (1.2)
According to DLR (Dempster, Laird, & Rubin, 1977, p. 1), X is called complete data, and the term "incomplete data" implies the existence of X and Y where X is not observed directly and X is only known through the many-one mapping φ: X → Y. In general, we only know Y, f(X | Θ), and k(X | Y, Θ), and so our purpose is to estimate Θ based on such Y, f(X | Θ), and k(X | Y, Θ). Like the MLE approach, the EM algorithm also maximizes the likelihood function to estimate Θ, but the likelihood function in EM concerns Y, and there are also some different aspects of EM which will be described later. Pioneers of the EM algorithm first assumed that f(X | Θ) belongs to the exponential family, noting that many popular distributions such as the normal, multinomial, and Poisson distributions belong to the exponential family. Although DLR (Dempster, Laird, & Rubin, 1977) proposed a generality of the EM algorithm in which f(X | Θ) distributes arbitrarily, we should still pay some attention to the exponential family. The exponential family (Wikipedia, Exponential family, 2016) refers to a set of probabilistic distributions whose PDFs have the same exponential form according to equation 1.3 (Dempster, Laird, & Rubin, 1977, p. 3):
𝑓(𝑋|Θ) = 𝑏(𝑋) exp(Θ𝑇 𝜏(𝑋))⁄𝑎(Θ) (1.3)
Where b(X) is a function of X called the base measure, and τ(X) is a vector function of X called the sufficient statistic. For example, the sufficient statistic of the normal distribution is τ(X) = (X, XXT)T. Equation 1.3 expresses the canonical form of the exponential family. Recall that Ω is the domain of Θ such that Θ ∈ Ω. Suppose that Ω is a convex set. If Θ is restricted only to Ω, then f(X | Θ) specifies a regular exponential family. If Θ lies in a curved sub-manifold Ω0 of Ω, then f(X | Θ) specifies a curved exponential family. The term a(Θ) is the partition function over variable X, which is used for normalization:
a(Θ) = ∫_X b(X) exp(ΘT τ(X)) dX
As usual, a PDF is known in a popular form, but its exponential family form (the canonical form of the exponential family) specified by equation 1.3 looks unlike the popular form although they are the same. Therefore, the parameter in popular form differs from the parameter in exponential family form. For example, the multinormal distribution with theoretical mean μ and covariance matrix Σ of random variable X = (x1, x2,…, xn)T has the following PDF in popular form:
f(X|μ, Σ) = (2π)^(−n/2) |Σ|^(−1/2) exp(−(1/2)(X − μ)T Σ−1 (X − μ))
Hence, the parameter in popular form is Θ = (μ, Σ)T. The exponential family form of such PDF is:

f(X|θ1, θ2) = (2π)^(−n/2) exp((θ1, θ2)(X, XXT)T) ⁄ exp(−(1/4)θ1T θ2−1 θ1 − (1/2)log|−2θ2|)

Where,
Θ = (θ1, θ2)T
θ1 = Σ−1 μ
θ2 = −(1/2) Σ−1
b(X) = (2π)^(−n/2)
τ(X) = (X, XXT)T
a(Θ) = exp(−(1/4)θ1T θ2−1 θ1 − (1/2)log|−2θ2|)
The exponential family form is used to represent all distributions belonging to the exponential family in a canonical form. The parameter in exponential family form is called the exponential family parameter. As a convention, the parameter Θ mentioned in the EM algorithm is often the exponential family parameter if the PDF belongs to the exponential family and there is no additional information.
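To make the difference between the two parameterizations concrete, the following Python sketch (ours, not part of the report; function names are illustrative) converts (μ, Σ) into the exponential family parameters θ1 = Σ−1μ and θ2 = −(1/2)Σ−1 given above, and back:

```python
# A minimal sketch (not from the report) of the mapping between the popular
# parameters (mu, Sigma) of the multinormal distribution and its exponential
# family (natural) parameters theta1 = Sigma^{-1} mu, theta2 = -(1/2) Sigma^{-1}.
import numpy as np

def to_natural(mu, Sigma):
    """Popular parameters -> exponential family parameters."""
    Sigma_inv = np.linalg.inv(Sigma)
    return Sigma_inv @ mu, -0.5 * Sigma_inv

def to_popular(theta1, theta2):
    """Exponential family parameters -> popular parameters."""
    Sigma = np.linalg.inv(-2.0 * theta2)   # Sigma = (-2 * theta2)^{-1}
    return Sigma @ theta1, Sigma           # mu = Sigma * theta1

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(to_popular(*to_natural(mu, Sigma)))  # recovers (mu, Sigma)
```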
The expectation maximization (EM) algorithm has many iterations, and each iteration has two steps: the expectation step (E-step) calculates the sufficient statistic of the hidden data based on the observed data and the current parameter, whereas the maximization step (M-step) re-estimates the parameter. When DLR proposed the EM algorithm (Dempster, Laird, & Rubin, 1977), they first assumed that the PDF f(X | Θ) of the hidden space belongs to the exponential family. The E-step and M-step at the tth iteration are described in table 1.1 (Dempster, Laird, & Rubin, 1977, p. 4), in which the current estimate is Θ(t), with the note that f(X | Θ) belongs to the regular exponential family.
E-step:
We calculate the current value τ(t) of the sufficient statistic τ(X) from the observed Y and the current parameter Θ(t) according to the following equation:
τ(t) = E(τ(X)|Y, Θ(t)) = ∫_{φ⁻¹(Y)} k(X|Y, Θ(t)) τ(X) dX
M-step:
Based on τ(t), we determine the next parameter Θ(t+1) as the solution of the following equation:
E(τ(X)|Θ) = ∫_X f(X|Θ) τ(X) dX = τ(t)
Note, Θ(t+1) will become the current parameter at the next iteration (the (t+1)th iteration).
Table 1.1. E-step and M-step of EM algorithm given regular exponential PDF f(X|Θ)
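The loop in table 1.1 can be sketched in Python as follows. This is only a schematic illustration under our own naming: expected_tau_given_Y (the E-step) and solve_for_theta (the solver of E(τ(X)|Θ) = τ(t) in the M-step) are hypothetical callables that depend on the concrete model.

```python
# Schematic sketch of table 1.1 for a regular exponential family model.
# expected_tau_given_Y and solve_for_theta are model-specific placeholders.
import numpy as np

def em_exponential_family(Y, theta0, expected_tau_given_Y, solve_for_theta,
                          max_iter=100, tol=1e-8):
    theta = theta0
    for _ in range(max_iter):
        tau_t = expected_tau_given_Y(Y, theta)        # E-step: tau^(t) = E(tau(X) | Y, Theta^(t))
        theta_next = solve_for_theta(tau_t)           # M-step: solve E(tau(X) | Theta) = tau^(t)
        if np.linalg.norm(theta_next - theta) < tol:  # stop when successive estimates agree
            return theta_next
        theta = theta_next
    return theta
```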
The EM algorithm stops if two successive estimates are equal, Θ* = Θ(t) = Θ(t+1), at some tth iteration. At that time we conclude that Θ* is the optimal estimate of the EM process. As a convention, the estimate of parameter Θ resulting from the EM process is denoted Θ* instead of Θ̂ in order to emphasize that Θ* is the solution of an optimization problem.
For further research, DLR gave a preeminent generality of the EM algorithm (Dempster, Laird, & Rubin, 1977, pp. 6-11) in which f(X | Θ) specifies an arbitrary distribution. In other words, there is no requirement of the exponential family. They define the conditional expectation Q(Θ' | Θ) according to equation 1.4 (Dempster, Laird, & Rubin, 1977, p. 6).
Q(Θ'|Θ) = E(log(f(X|Θ'))|Y, Θ) = ∫_{φ⁻¹(Y)} k(X|Y, Θ) log(f(X|Θ')) dX    (1.4)
Xobs = (xj : j ∈ M̄)T = (xm̄1, xm̄2, …, xm̄|M̄|)T    (2.7)
Obviously, the dimension of Xmis is |M| and the dimension of Xobs is |M̄| = n − |M|. Note, when composing X from Xobs and Xmis as X = {Xobs, Xmis}, a right re-arrangement of the elements in both Xobs and Xmis is required.
Let Z = (z1, z2,…, zn)T be the n-dimension random variable in which each element zj is a binary random variable indicating whether xj is missing. Random variable Z is also called the missingness variable.
zj = 1 if xj is missing, and zj = 0 if xj is existent    (2.8)
For example, given X = (x1, x2, x3, x4, x5)T, when X is observed as X = (x1=1, x2=?, x3=4, x4=?, x5=9)T, we have Xobs = (x1=1, x3=4, x5=9)T, Xmis = (x2=?, x4=?)T, and Z = (z1=0, z2=1, z3=0, z4=1, z5=0)T.
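As a small illustration (ours), the following Python sketch recovers Z, M, and M̄ for this example, assuming missing values are encoded as NaN; note that NumPy indices are 0-based while the text uses 1-based indices.

```python
import numpy as np

X = np.array([1.0, np.nan, 4.0, np.nan, 9.0])  # X = (1, ?, 4, ?, 9)^T
Z = np.isnan(X).astype(int)                    # Z = (0, 1, 0, 1, 0)^T, equation 2.8
M = np.where(Z == 1)[0]                        # missing indices (0-based)
M_bar = np.where(Z == 0)[0]                    # observed indices (0-based)
X_obs = X[M_bar]                               # X_obs = (1, 4, 9)^T
print(Z, M, M_bar, X_obs)
```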
Generally, when X is replaced by a sample 𝒳 = {X1, X2,…, XN} whose Xi (s) are iid, let 𝒵 = {Z1, Z2,…, ZN} be the set of missingness variables associated with 𝒳. All Zi (s) are iid too. 𝒳 and 𝒵 can be represented as matrices. Given Xi, its associative quantities are Zi, Mi, and M̄i. Let X = {Xobs, Xmis} be the random variable representing every Xi. Let Z be the random variable representing every Zi. As a convention, Xobs(i) and Xmis(i) refer to the Xobs part and the Xmis part of Xi. We have:
Xi = {Xobs(i), Xmis(i)} = (xi1, xi2, …, xin)T
Xmis(i) = (xi,mi1, xi,mi2, …, xi,mi|Mi|)T
Xobs(i) = (xi,m̄i1, xi,m̄i2, …, xi,m̄i|M̄i|)T    (2.9)
For example, given the sample 𝒳 = {X1, X2, X3, X4} shown in the following table, we have Z1 = (z11=0, z12=1, z13=0, z14=1)T, Z2 = (z21=1, z22=0, z23=1, z24=0)T, Z3 = (z31=0, z32=0, z33=1, z34=1)T, and Z4 = (z41=1, z42=1, z43=0, z44=0)T. All Zi (s) are iid too.
      x1  x2  x3  x4          z1  z2  z3  z4
X1     1   ?   3   ?      Z1   0   1   0   1
X2     ?   2   ?   4      Z2   1   0   1   0
X3     1   2   ?   ?      Z3   0   0   1   1
X4     ?   ?   3   4      Z4   1   1   0   0
Of course, we have Xobs(1) = (x11=1, x13=3)T, Xmis(1) = (x12=?, x14=?)T, Xobs(2) = (x22=2, x24=4)T, Xmis(2) = (x21=?, x23=?)T, Xobs(3) = (x31=1, x32=2)T, Xmis(3) = (x33=?, x34=?)T, Xobs(4) = (x43=3, x44=4)T, and Xmis(4) = (x41=?, x42=?)T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, M̄2 = {m̄21=2, m̄22=4}, M3 = {m31=3, m32=4}, M̄3 = {m̄31=1, m̄32=2}, M4 = {m41=1, m42=2}, and M̄4 = {m̄41=3, m̄42=4}.
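For the whole sample, the same bookkeeping can be done row by row. The sketch below (ours, with missing cells encoded as NaN and 0-based indices) reproduces Zi, Mi, and M̄i for the table above:

```python
import numpy as np

sample = np.array([[1, np.nan, 3, np.nan],
                   [np.nan, 2, np.nan, 4],
                   [1, 2, np.nan, np.nan],
                   [np.nan, np.nan, 3, 4]])
Z = np.isnan(sample).astype(int)              # missingness matrix (Z_1, ..., Z_4)
M = [np.where(z == 1)[0] for z in Z]          # M_i: missing indices of X_i
M_bar = [np.where(z == 0)[0] for z in Z]      # M-bar_i: observed indices of X_i
print(Z)
print(M[0], M_bar[0])   # M_1 = [1 3], M-bar_1 = [0 2] (0-based versions of {2,4} and {1,3})
```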
Both X and Z are associated with their own PDFs, as follows:
f(X|Θ) = f(Xobs, Xmis|Θ)
f(Z|Xobs, Xmis, Φ)    (2.10)
Where Θ and Φ are the parameters of the PDFs of X = {Xobs, Xmis} and Z, respectively. The goal of handling missing data is to estimate Θ and Φ given X. The sufficient statistic of X = {Xobs, Xmis} is composed of the sufficient statistic of Xobs and the sufficient statistic of Xmis.
τ(X) = τ(Xobs, Xmis) = {τ(Xobs), τ(Xmis)}    (2.11)
How to compose τ(X) from τ(Xobs) and τ(Xmis) depends on the distribution type of the PDF f(X|Θ).
The joint PDF of X and Z is the main object of handling missing data, which is defined as follows:
𝑓(𝑋, 𝑍|Θ, Φ) = 𝑓(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 , 𝑍|Θ, Φ) = 𝑓(𝑍|𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 , Φ)𝑓(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 |Θ) (2.12)
The PDF of Xobs is defined as the integral of f(X|Θ) over Xmis:
f(Xobs|Θ) = ∫_{Xmis} f(Xobs, Xmis|Θ) dXmis    (2.13)
The PDF of Xmis is the conditional PDF of Xmis given Xobs:
f(Xmis|Xobs, ΘM) = f(Xmis|Xobs, Θ) = f(X|Θ) / f(Xobs|Θ) = f(Xobs, Xmis|Θ) / f(Xobs|Θ)    (2.14)
The notation ΘM implies that the parameter ΘM of the PDF f(Xmis | Xobs, ΘM) is derived from the parameter Θ of the PDF f(X|Θ); it is a function of Θ and Xobs, ΘM = u(Θ, Xobs). Thus, ΘM is not a new parameter and it depends on the distribution type.
ΘM = u(Θ, Xobs)    (2.15)
How to determine u(Θ, Xobs) depends on the distribution type of the PDF f(X|Θ).
There are three types of missing data, which depend on the relationship between Xobs, Xmis, and Z (Josse, Jiang, Sportisse, & Robin, 2018):
- Missing data (X or 𝒳) is Missing Completely At Random (MCAR) if the probability of Z is independent from both Xobs and Xmis such that f(Z | Xobs, Xmis, Φ) = f(Z | Φ), as simulated in the sketch after this list.
- Missing data (X or 𝒳) is Missing At Random (MAR) if the probability of Z depends only on Xobs such that f(Z | Xobs, Xmis, Φ) = f(Z | Xobs, Φ).
- Missing data (X or 𝒳) is Missing Not At Random (MNAR) in all other cases, where f(Z | Xobs, Xmis, Φ) depends on both Xobs and Xmis.
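As a quick illustration of the MCAR case (our sketch, not from the report), the missingness below is drawn independently of the data, so f(Z | Xobs, Xmis, Φ) reduces to f(Z | Φ) with Φ being a single missing probability:

```python
import numpy as np

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 4))        # a complete sample, only for illustration
phi = 0.2                                 # probability that any entry is missing
Z = rng.random(X_full.shape) < phi        # missingness independent of X_obs and X_mis
X_mcar = np.where(Z, np.nan, X_full)      # MCAR data with NaN marking missing cells
```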
There are two main approaches for handling missing data (Josse, Jiang, Sportisse, & Robin, 2018):
- Using some statistical models such as EM to estimate the parameter with missing data.
- Imputing plausible values for missing values to obtain some complete samples (copies) from the missing data. Later on, every complete sample is used to produce an estimate of the parameter by some estimation method, for example, MLE or MAP. Finally, all estimates are synthesized to produce the best estimate.
Here we focus on the first approach, using EM to estimate the parameter with missing data. Without loss of generality, given sample 𝒳 = {X1, X2,…, XN} in which all Xi (s) are iid, by applying equation 1.8 for GEM with the joint PDF f(Xobs, Xmis, Z | Θ, Φ), we consider {Xobs, Z} as the observed part and Xmis as the hidden part. Let X = {Xobs, Xmis} be the random variable representing all Xi (s). Let Xobs(i) denote the observed part Xobs of Xi and let Zi be the missingness variable corresponding to Xi. By following equation 1.8, the expectation Q(Θ', Φ' | Θ, Φ) becomes:
Q(Θ', Φ'|Θ, Φ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), Zi, Θ, Φ) * log(f(Xobs(i), Xmis, Zi|Θ', Φ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * log(f(Xobs(i), Xmis|Θ', Φ') * f(Zi|Xobs(i), Xmis, Θ', Φ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * log(f(Xobs(i), Xmis|Θ') * f(Zi|Xobs(i), Xmis, Φ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * (log(f(Xobs(i), Xmis|Θ')) + log(f(Zi|Xobs(i), Xmis, Φ'))) dXmis
So Q(Θ', Φ'|Θ, Φ) splits into the two components Q1(Θ'|Θ) and Q2(Φ'|Θ) such that Q(Θ', Φ'|Θ, Φ) = Q1(Θ'|Θ) + Q2(Φ'|Θ), where:
Q1(Θ'|Θ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(f(Xobs(i), Xmis|Θ')) dXmis
Q2(Φ'|Θ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(f(Zi|Xobs(i), Xmis, Φ')) dXmis
Note, the unknowns of Q(Θ', Φ' | Θ, Φ) are Θ' and Φ'. Because it is not easy to maximize Q(Θ', Φ' | Θ, Φ) with regard to Θ' and Φ', we assume that the PDF f(X|Θ) belongs to the exponential family.
𝑓(𝑋|Θ) = 𝑓(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 |Θ) = 𝑏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 ) ∗ exp((Θ)𝑇 𝜏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 ))⁄𝑎(Θ) (2.17)
Note,
𝑏(𝑋) = 𝑏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 )
𝜏(𝑋) = 𝜏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 ) = {𝜏(𝑋𝑜𝑏𝑠 ), 𝜏(𝑋𝑚𝑖𝑠 )}
It is easy to deduce that
𝑓(𝑋𝑚𝑖𝑠 |𝑋𝑜𝑏𝑠 , Θ𝑀 ) = 𝑏(𝑋𝑚𝑖𝑠 ) exp((Θ𝑀 )𝑇 𝜏(𝑋𝑚𝑖𝑠 ))⁄𝑎(Θ𝑀 ) (2.18)
Therefore,
f(Xmis|Xobs(i), ΘMi) = b(Xmis) exp((ΘMi)T τ(Xmis)) ⁄ a(ΘMi)
We have:
Q1(Θ'|Θ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(f(Xobs(i), Xmis|Θ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * log(b(Xobs(i), Xmis) exp((Θ')T τ(Xobs(i), Xmis)) ⁄ a(Θ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * (log(b(Xobs(i), Xmis)) + (Θ')T τ(Xobs(i), Xmis) − log(a(Θ'))) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis + (Θ')T Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) τ(Xobs(i), Xmis) dXmis − N log(a(Θ'))
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis + (Θ')T Σ_{i=1..N} {τ(Xobs(i)) ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) dXmis, ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) τ(Xmis) dXmis} − N log(a(Θ'))
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) τ(Xmis) dXmis} − N log(a(Θ'))
Therefore, equation 2.19 specifies Q1(Θ'|Θ) given that f(X|Θ) belongs to the exponential family.
Q1(Θ'|Θ) = Σ_{i=1..N} E(log(b(Xobs(i), Xmis))|ΘMi) + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)} − N log(a(Θ'))    (2.19)
Where,
E(log(b(Xobs(i), Xmis))|ΘMi) = ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis    (2.20)
ΘMi(t) = u(Θ(t), Mi)
E(τ(Xmis)|ΘMi(t)) = ∫_{Xmis} f(Xmis|Xobs(i), ΘMi(t)) τ(Xmis) dXmis
Equation 2.22 is a variant of equation 2.11 when f(X|Θ) belongs to the exponential family, but how to compose τ(X) from τ(Xobs) and τ(Xmis) is not exactly determined yet.
As a result, at the M-step of some tth iteration, given τ(t) and Θ(t), the next parameter Θ(t+1) is a solution of the following equation:
E(τ(X)|Θ) = τ(t)    (2.23)
Moreover, at the M-step of some tth iteration, the next parameter Φ(t+1) is a maximizer of Q2(Φ | Θ(t)) given Θ(t).
f(Xobs|Θobs) = (2π)^(−|M̄|/2) |Σobs|^(−1/2) exp(−(1/2)(Xobs − μobs)T (Σobs)−1 (Xobs − μobs))    (2.33)
Therefore,
f(Xobs(i)|Θobs(i)) = (2π)^(−|M̄i|/2) |Σobs(i)|^(−1/2) exp(−(1/2)(Xobs(i) − μobs(i))T (Σobs(i))−1 (Xobs(i) − μobs(i)))
Where,
μobs(i) = (μm̄i1, μm̄i2, …, μm̄i|M̄i|)T
Σobs(i) = ( σm̄i1m̄i1    σm̄i1m̄i2    ⋯  σm̄i1m̄i|M̄i|
            σm̄i2m̄i1    σm̄i2m̄i2    ⋯  σm̄i2m̄i|M̄i|
            ⋮           ⋮           ⋱  ⋮
            σm̄i|M̄i|m̄i1  σm̄i|M̄i|m̄i2  ⋯  σm̄i|M̄i|m̄i|M̄i| )    (2.34)
Obviously, Θobs(i) is extracted from Θ given the indicator M̄i or Mi. Note, σm̄ij m̄ik is the covariance of xm̄ij and xm̄ik.
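In code, extracting Θobs(i) (and the corresponding blocks for the missing part) from Θ = (μ, Σ) given M̄i and Mi is simple index slicing. The sketch below is ours (names illustrative); it also returns the cross-covariance block that appears later in example 3.1 as V^mis_obs(i):

```python
import numpy as np

def split_parameter(mu, Sigma, M_i, M_bar_i):
    """Extract the observed/missing blocks of (mu, Sigma) for record i."""
    mu_obs = mu[M_bar_i]                          # mu_obs(i)
    mu_mis = mu[M_i]                              # mu_mis(i)
    Sigma_obs = Sigma[np.ix_(M_bar_i, M_bar_i)]   # Sigma_obs(i), equation 2.34
    Sigma_mis = Sigma[np.ix_(M_i, M_i)]           # Sigma_mis(i)
    V_mis_obs = Sigma[np.ix_(M_i, M_bar_i)]       # cross-covariance block
    return mu_obs, Sigma_obs, mu_mis, Sigma_mis, V_mis_obs
```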
We have:
f(Xmis(i)|Θmis(i)) = ∫_{Xobs(i)} f(Xobs(i), Xmis(i)|Θ) dXobs(i)
It is necessary to calculate the sufficient statistic with the normal PDF f(Xi|Θ), which means that we need to define what τ1(t) and τ2(t) are. The sufficient statistic of Xobs(i) is:
τ(Xobs(i)) = (Xobs(i), Xobs(i)(Xobs(i))T)T
The sufficient statistic of Xmis(i) is:
τ(Xmis(i)) = (Xmis(i), Xmis(i)(Xmis(i))T)T
We also have:
E(τ(Xmis)|ΘMi(t)) = ∫_{Xmis} f(Xmis|ΘMi(t)) τ(Xmis) dXmis = (μMi(t), ΣMi(t) + μMi(t)(μMi(t))T)T
Due to
E(Xmis(i)(Xmis(i))T|ΘMi(t)) = ΣMi(t) + μMi(t)(μMi(t))T
Where μMi(t) and ΣMi(t) are μMi and ΣMi at the current iteration, respectively. By referring to equation 2.38, we have
μMi(t) = (μMi(t)(mi1), μMi(t)(mi2), …, μMi(t)(mi|Mi|))T
And
ΣMi(t) + μMi(t)(μMi(t))T = ( σ̃11(t)(i)    σ̃12(t)(i)    ⋯  σ̃1|Mi|(t)(i)
                             σ̃21(t)(i)    σ̃22(t)(i)    ⋯  σ̃2|Mi|(t)(i)
                             ⋮             ⋮            ⋱  ⋮
                             σ̃|Mi|1(t)(i)  σ̃|Mi|2(t)(i)  ⋯  σ̃|Mi||Mi|(t)(i) )
Where,
σ̃uv(t)(i) = ΣMi(t)(miu, miv) + μMi(t)(miu)μMi(t)(miv)
Therefore, τ1(t) is a vector and τ2(t) is a matrix, and the sufficient statistic of X at the E-step of some tth iteration, given the current parameter Θ(t), is defined as follows:
τ(t) = (τ1(t), τ2(t))T
τ1(t) = (x̄1(t), x̄2(t), …, x̄n(t))T
τ2(t) = ( s11(t)  s12(t)  ⋯  s1n(t)
          s21(t)  s22(t)  ⋯  s2n(t)
          ⋮       ⋮       ⋱  ⋮
          sn1(t)  sn2(t)  ⋯  snn(t) )    (2.39)
Each x̄j(t) is calculated as follows:
x̄j(t) = (1/N) Σ_{i=1..N} { xij if j ∉ Mi; μMi(t)(j) if j ∈ Mi }    (2.40)
Please see equation 2.35 and equation 2.38 to know μMi(t)(j). Each suv(t) is calculated as follows:
suv(t) = svu(t) = (1/N) Σ_{i=1..N} {
    xiu xiv                                       if u ∉ Mi and v ∉ Mi
    xiu μMi(t)(miv)                               if u ∉ Mi and v ∈ Mi
    μMi(t)(miu) xiv                               if u ∈ Mi and v ∉ Mi
    ΣMi(t)(miu, miv) + μMi(t)(miu)μMi(t)(miv)     if u ∈ Mi and v ∈ Mi }    (2.41)
Equation 2.39 is an instance of equation 2.11, which composes τ(X) from τ(Xobs) and τ(Xmis) when f(X|Θ) distributes normally. Following is the proof of equation 2.41. If u ∉ Mi and v ∉ Mi, then the partial statistic xiuxiv is kept intact because xiu and xiv are in Xobs and hence constant with regard to f(Xmis | ΘMi(t)). If u ∉ Mi and v ∈ Mi, then the partial statistic xiuxiv is replaced by the expectation E(xiuxiv|ΘMi(t)) = xiu E(xiv|ΘMi(t)) = xiu μMi(t)(miv), because xiu is constant; the case u ∈ Mi and v ∉ Mi is symmetric. If u ∈ Mi and v ∈ Mi, then xiuxiv is replaced by E(xiuxiv|ΘMi(t)) = ΣMi(t)(miu, miv) + μMi(t)(miu)μMi(t)(miv).
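The E-step of equations 2.39-2.41 can be sketched in Python as follows. This is our own sketch: it assumes that equations 2.35 and 2.38 (not reproduced in this excerpt) are the standard conditional-normal formulas μMi = μmis + V^mis_obs(Σobs)−1(Xobs − μobs) and ΣMi = Σmis − V^mis_obs(Σobs)−1V^obs_mis, which is how they are used in example 3.1, and it encodes missing cells as NaN.

```python
import numpy as np

def e_step_multinormal(sample, mu, Sigma):
    """Return tau1^(t) (eq. 2.40) and tau2^(t) (eq. 2.41) for the current (mu, Sigma)."""
    N, n = sample.shape
    tau1 = np.zeros(n)
    tau2 = np.zeros((n, n))
    for X in sample:
        M = np.where(np.isnan(X))[0]        # missing indices M_i
        O = np.where(~np.isnan(X))[0]       # observed indices M-bar_i
        x_filled, cov = X.copy(), np.zeros((n, n))
        if len(M) > 0:
            S_oo = Sigma[np.ix_(O, O)]
            S_mo = Sigma[np.ix_(M, O)]
            # conditional mean and covariance of X_mis given X_obs (assumed eq. 2.35 and 2.38)
            mu_M = mu[M] + S_mo @ np.linalg.solve(S_oo, X[O] - mu[O])
            Sigma_M = Sigma[np.ix_(M, M)] - S_mo @ np.linalg.solve(S_oo, S_mo.T)
            x_filled[M] = mu_M              # missing x_ij replaced by mu_Mi^(t)(j)
            cov[np.ix_(M, M)] = Sigma_M     # second-moment correction for missing pairs
        tau1 += x_filled                    # accumulates eq. 2.40
        tau2 += np.outer(x_filled, x_filled) + cov   # accumulates eq. 2.41
    return tau1 / N, tau2 / N
```

For the multinormal model, the M-step solving equation 2.23 would then take μ(t+1) = τ1(t) and Σ(t+1) = τ2(t) − τ1(t)(τ1(t))T, since E(X|Θ) = μ and E(XXT|Θ) = Σ + μμT.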
Σ_{j=1..n} pj = 1
Σ_{j=1..n} xj = K
xj ∈ {0, 1, …, K}
Note, xj is the number of trials generating nominal value j. Therefore,
f(Xi|Θ) = f(Xobs(i), Xmis(i)|Θ) = (K! ⁄ Π_{j=1..n}(xij!)) Π_{j=1..n} pj^xij
Where,
Σ_{j=1..n} xij = K
xij ∈ {0, 1, …, K}
The most important task here is to define equation 2.11 and equation 2.15 in order to compose τ(X) from τ(Xobs) and τ(Xmis) and to extract ΘM from Θ when f(X|Θ) is the multinomial PDF.
Let Θmis be the parameter of the marginal PDF of Xmis; we have:
f(Xmis|Θmis) = (Kmis! ⁄ Π_{mj∈M}(xmj!)) Π_{j=1..|M|} (pmj ⁄ Pmis)^xmj    (2.46)
Therefore,
f(Xmis(i)|Θmis(i)) = (Kmis(i)! ⁄ Π_{mj∈Mi}(ximj!)) Π_{j=1..|Mi|} (pmij ⁄ Pmis(i))^ximj
Where,
Θmis(i) = (pmi1 ⁄ Pmis(i), pmi2 ⁄ Pmis(i), …, pmi|Mi| ⁄ Pmis(i))T
The sufficient statistic of Xobs(i) is:
τ(Xobs(i)) = (xim̄1, xim̄2, …, xim̄|M̄i|)T
The sufficient statistic of Xmis(i) with regard to f(Xmis(i)|Xobs(i), ΘMi) is:
τ(Xmis(i)) = (xim1, xim2, …, xim|Mi|, Σ_{j=1..|M̄i|} xm̄ij)T
We also have:
E(τ(Xmis)|ΘMi(t)) = ∫_{Xmis} f(Xmis|Xobs, ΘMi(t)) τ(Xmis) dXmis = (Kpm1, Kpm2, …, Kpm|Mi|, Σ_{j=1..|M̄i|} Kpm̄ij)T
Therefore, the sufficient statistic of X at the E-step of some tth iteration given the current parameter Θ(t) = (p1(t), p2(t),…, pn(t))T is defined as follows:
τ(t) = (x̄1(t), x̄2(t), …, x̄n(t))T
x̄j(t) = (1/N) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j    (2.52)
Equation 2.52 is an instance of equation 2.11, which composes τ(X) from τ(Xobs) and τ(Xmis) when f(X|Θ) is the multinomial PDF.
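A small Python sketch (ours, with missing counts encoded as NaN) of the E-step in equation 2.52:

```python
import numpy as np

def e_step_multinomial(sample, p_t, K):
    """tau^(t) of equation 2.52: missing x_ij are replaced by K * p_j^(t)."""
    filled = np.where(np.isnan(sample), K * p_t, sample)
    return filled.sum(axis=0) / sample.shape[0]
```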
At the M-step of some tth iteration, we need to maximize Q1(Θ'|Θ) with the following constraint:
Σ_{j=1..n} pj = 1
According to equation 2.19, we have:
Q1(Θ'|Θ) = Σ_{i=1..N} E(log(b(Xobs(i), Xmis))|ΘMi) + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)} − N log(a(Θ'))
Where the quantities b(Xobs(i), Xmis) and a(Θ') belong to the PDF f(X|Θ) of X. Because of the constraint Σ_{j=1..n} pj = 1, we use the Lagrange duality method to maximize Q1(Θ'|Θ). The Lagrange function la(Θ', λ | Θ) is the sum of Q1(Θ'|Θ) and the constraint term, as follows:
la(Θ', λ|Θ) = Q1(Θ'|Θ) + λ(1 − Σ_{j=1..n} pj')
= Σ_{i=1..N} E(log(b(Xobs(i), Xmis))|ΘMi) + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)} − N log(a(Θ')) + λ(1 − Σ_{j=1..n} pj')
Where Θ' = (p1', p2',…, pn')T. Note, λ ≥ 0 is called the Lagrange multiplier. Of course, la(Θ', λ | Θ) is a function of Θ' and λ. The next parameter Θ(t+1) that maximizes Q1(Θ'|Θ) is a solution of the equation formed by setting the first-order derivatives of the Lagrange function with regard to Θ' and λ to zero.
The first-order partial derivative of la(Θ’, λ | Θ) with regard to Θ’ is:
∂la(Θ', λ|Θ)/∂Θ' = Σ_{i=1..N} (E(τ(Xobs(i), Xmis)|ΘMi))T − N log'(a(Θ')) − (λ, λ, …, λ)T
= Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)}T − N log'(a(Θ')) − (λ, λ, …, λ)T
By referring to table 1.2, we have:
log'(a(Θ')) = (E(τ(X)|Θ'))T = ∫_X f(X|Θ')(τ(X))T dX
Thus,
∂la(Θ', λ|Θ)/∂Θ' = Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)}T − N(E(τ(X)|Θ'))T − (λ, λ, …, λ)T
The first-order partial derivative of la(Θ’, λ | Θ) with regard to λ is:
∂la(Θ', λ|Θ)/∂λ = 1 − Σ_{j=1..n} pj'
Therefore, at the M-step of some tth iteration, given the current parameter Θ(t) = (p1(t), p2(t),…, pn(t))T, the next parameter Θ(t+1) is a solution of the following equation system:
Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi(t))}T − N(E(τ(X)|Θ))T − (λ, λ, …, λ)T = 0T
1 − Σ_{j=1..n} pj = 0
This implies:
E(τ(X)|Θ) = τ(t) − (λ/N, λ/N, …, λ/N)T
Σ_{j=1..n} pj = 1
Where,
τ(t) = (1/N) Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi(t))}
Due to
E(τ(X)|Θ) = ∫_X τ(X) f(X|Θ) dX = (Kp1, Kp2, …, Kpn)T
τ(t) = (x̄1(t), x̄2(t), …, x̄n(t))T
x̄j(t) = (1/N) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
We obtain n equations Kpj = −λ/N + x̄j(t) and one constraint Σ_{j=1..n} pj = 1. Therefore, we have:
pj = −λ/(KN) + (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
Summing the n equations above, we have:
1 = Σ_{j=1..n} pj = −nλ/(KN) + (1/(KN)) Σ_{i=1..N} ( Σ_{j=1..|M̄i|} xim̄ij + Σ_{j=1..|Mi|} Kpmij(t) )
Suppose every missing value ximij is estimated by Kpmij(t) such that:
Σ_{j=1..|Mi|} Kpmij(t) = Σ_{j=1..|Mi|} ximij
We obtain:
1 = −nλ/(KN) + (1/(KN)) Σ_{i=1..N} ( Σ_{j=1..|M̄i|} xim̄ij + Σ_{j=1..|Mi|} ximij ) = −nλ/(KN) + (1/(KN)) Σ_{i=1..N} K = −nλ/(KN) + 1
This implies
λ = 0
Such that
pj = (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
Therefore, at the M-step of some tth iteration, given the current parameter Θ(t) = (p1(t), p2(t),…, pn(t))T, the next parameter Θ(t+1) is specified by the following equation.
pj(t+1) = (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j    (2.53)
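Since equation 2.53 is just the average of equation 2.52 divided by K, one GEM iteration for the multinomial model fits in a few lines. The sketch below is ours (missing counts encoded as NaN, function name illustrative):

```python
import numpy as np

def gem_step_multinomial(sample, p_t, K):
    """One iteration: impute missing counts by K*p_j^(t), then apply equation 2.53."""
    filled = np.where(np.isnan(sample), K * p_t, sample)   # E-step imputation, eq. 2.52
    return filled.sum(axis=0) / (K * sample.shape[0])      # p^(t+1), eq. 2.53
```

Iterating this function until two successive parameter vectors coincide (or, as noted above, stopping after the first iteration) implements table 2.3.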
In general, given sample 𝒳 = {X1, X2,…, XN}, whose Xi (s) are iid, that is MCAR data, and given that f(X|Θ) is the multinomial PDF of K trials, GEM for handling missing data is summarized in table 2.3.
M-step:
Given τ(t) and Θ(t) = (p1(t), p2(t),…, pn(t))T, the next parameter Θ(t+1) is specified by equation 2.53.
pj(t+1) = (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
Table 2.3. E-step and M-step of GEM algorithm for handling missing data given multinomial PDF
In table 2.3, the E-step is implied in how the M-step is performed. As aforementioned, in practice we can stop GEM after its first iteration is done, which is reasonable enough to handle missing data. The next section includes two examples of handling missing data with the multinormal distribution and the multinomial distribution.
V^mis_obs(1) = ( σ21(1) = 0, σ23(1) = 0; σ41(1) = 0, σ43(1) = 0 )
V^obs_mis(1) = ( σ12(1) = 0, σ14(1) = 0; σ32(1) = 0, σ34(1) = 0 )
μM1(1) = μmis(1) + (V^mis_obs(1))(Σobs(1))−1(Xobs(1) − μobs(1)) = (μM1(1)(2) = 0, μM1(1)(4) = 0)T
ΣM1(1) = Σmis(1) − (V^mis_obs(1))(Σobs(1))−1(V^obs_mis(1)) = ( ΣM1(1)(2,2) = 1, ΣM1(1)(2,4) = 0; ΣM1(1)(4,2) = 0, ΣM1(1)(4,4) = 1 )
x̄1(1) = (1/2)(x11 + μM2(1)(1)) = 0.5
x̄2(1) = (1/2)(μM1(1)(2) + x22) = 1
x̄3(1) = (1/2)(x13 + μM2(1)(3)) = 1.5
x̄4(1) = (1/2)(μM1(1)(4) + x24) = 2
p(2) = (c(Z1) + c(Z2)) / (4*2) = (2 + 2) / (4*2) = 0.5
At the 2nd iteration, E-step, we have:
Xobs(1) = (x1=1, x3=3)T
μmis(1) = (μ2(2)=1, μ4(2)=2)T
Σmis(1) = ( σ22(2)=1.5, σ24(2)=2; σ42(2)=2, σ44(2)=4.5 )
μobs(1) = (μ1(2)=0.5, μ3(2)=1.5)T
Σobs(1) = ( σ11(2)=0.75, σ13(2)=0.75; σ31(2)=0.75, σ33(2)=2.75 )
V^mis_obs(1) = ( σ21(2)=−0.5, σ23(2)=−1.5; σ41(2)=−1, σ43(2)=−3 )
V^obs_mis(1) = ( σ12(2)=−0.5, σ14(2)=−1; σ32(2)=−1.5, σ34(2)=−3 )
μM1(2) = μmis(1) + (V^mis_obs(1))(Σobs(1))−1(Xobs(1) − μobs(1)) = (μM1(2)(2) ≅ 0.17, μM1(2)(4) ≅ 0.33)T
ΣM1(2) = Σmis(1) − (V^mis_obs(1))(Σobs(1))−1(V^obs_mis(1)) = ( ΣM1(2)(2,2) ≅ 0.67, ΣM1(2)(2,4) ≅ 0.33; ΣM1(2)(4,2) ≅ 0.33, ΣM1(2)(4,4) ≅ 1.17 )
V^obs_mis(2) = ( σ21(2)=−0.5, σ23(2)=−1.5; σ41(2)=−1, σ43(2)=−3 )
μM2(2) = μmis(2) + (V^mis_obs(2))(Σobs(2))−1(Xobs(2) − μobs(2)) = (μM2(2)(1) ≅ 0.05, μM2(2)(3) ≅ 0.14)T
ΣM2(2) = Σmis(2) − (V^mis_obs(2))(Σobs(2))−1(V^obs_mis(2)) = ( ΣM2(2)(1,1) ≅ 0.52, ΣM2(2)(1,3) ≅ 0.07; ΣM2(2)(3,1) ≅ 0.07, ΣM2(2)(3,3) ≅ 0.7 )
x̄1(2) = (1/2)(x11 + μM2(2)(1)) ≅ 0.52
x̄2(2) = (1/2)(μM1(2)(2) + x22) ≅ 1.1
x̄3(2) = (1/2)(x13 + μM2(2)(3)) ≅ 1.57
x̄4(2) = (1/2)(μM1(2)(4) + x24) ≅ 2.17
p(3) = (c(Z1) + c(Z2)) / (4*2) = (2 + 2) / (4*2) = 0.5
Because the sample is too small for GEM to converge to an exact maximizer within a small enough number of iterations, we can stop GEM at the second iteration with Θ(3) = Θ* = (μ*, Σ*)T and Φ(3) = Φ* = p*, when the difference between Θ(2) and Θ(3) is insignificant.
μ* = (μ1* = 0.52, μ2* = 1.1, μ3* = 1.57, μ4* = 2.17)T
Σ* = ( σ11* = 0.49    σ12* = −0.44   σ13* = 0.72    σ14* = −0.96
       σ21* = −0.44   σ22* = 1.17    σ23* = −1.31   σ24* = 1.85
       σ31* = 0.72    σ32* = −1.31   σ33* = 2.4     σ34* = −2.63
       σ41* = −0.96   σ42* = 1.85    σ43* = −2.63   σ44* = 3.94 )
p* = 0.5
As aforementioned, because Xmis is a part of X and f(Xmis | ΘM) is derived directly from f(X|Θ), in practice we can stop GEM after its first iteration is done, which is reasonable enough to handle missing data.
An interesting application of handling missing data, as mentioned earlier, is to fill in or predict missing values. For instance, the missing part Xmis(1) of X1 = (x11=1, x12=?, x13=3, x14=?)T is filled in with μM1* according to equation 2.44 as follows:
x12 = μ2* = 1.1
x14 = μ4* = 2.17
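A minimal sketch (ours) of this fill-in step, reproducing the two values above by replacing missing components with the corresponding components of the estimated mean:

```python
import numpy as np

mu_star = np.array([0.52, 1.1, 1.57, 2.17])      # estimated mean from example 3.1
X1 = np.array([1.0, np.nan, 3.0, np.nan])        # X1 with missing x12 and x14
X1_filled = np.where(np.isnan(X1), mu_star, X1)  # fill missing values
print(X1_filled)                                 # [1.   1.1  3.   2.17]
```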
It is necessary to have an example illustrating how to handle missing data with the multinomial PDF.
Example 3.2. Given a sample of size two, 𝒳 = {X1, X2}, in which X1 = (x11=1, x12=?, x13=3, x14=?)T and X2 = (x21=?, x22=2, x23=?, x24=4)T are iid.
      x1  x2  x3  x4
X1     1   ?   3   ?
X2     ?   2   ?   4
Of course, we have Xobs(1) = (x11=1, x13=3)T, Xmis(1) = (x12=?, x14=?)T, Xobs(2) = (x22=2, x24=4)T, and Xmis(2) = (x21=?, x23=?)T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, and M̄2 = {m̄21=2, m̄22=4}. Let X be the random variable representing every Xi. Suppose f(X|Θ) is the multinomial PDF of 10 trials. We will estimate Θ = (p1, p2, p3, p4)T. The parameters p1, p2, p3, and p4 are all initialized arbitrarily to 0.25 as follows:
Θ(1) = (p1(1)=0.25, p2(1)=0.25, p3(1)=0.25, p4(1)=0.25)T
At the 1st iteration, M-step, we have:
p1(2) = (1/(10*2))(1 + 10*0.25) = 0.175
p2(2) = (1/(10*2))(10*0.25 + 2) = 0.225
p3(2) = (1/(10*2))(3 + 10*0.25) = 0.275
p4(2) = (1/(10*2))(10*0.25 + 4) = 0.325
We stop GEM after the first iteration is done, which results in the estimate Θ(2) = Θ* = (p1*, p2*, p3*, p4*)T as follows:
𝑝1∗ = 0.175
𝑝2∗ = 0.225
𝑝3∗ = 0.275
𝑝4∗ = 0.325
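These numbers can be checked directly against equation 2.53 with the following short sketch (ours, missing counts encoded as NaN):

```python
import numpy as np

sample = np.array([[1, np.nan, 3, np.nan],
                   [np.nan, 2, np.nan, 4]], dtype=float)
K, p1 = 10, np.full(4, 0.25)                          # 10 trials, uniform initial parameter
filled = np.where(np.isnan(sample), K * p1, sample)   # missing x_ij -> K * p_j^(1)
p2 = filled.sum(axis=0) / (K * sample.shape[0])       # equation 2.53
print(p2)   # [0.175 0.225 0.275 0.325], matching Theta^(2) above
```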
IV. CONCLUSIONS
In general, GEM is a powerful tool to handle missing data. The handling itself is not so difficult; the most important point is how to extract the parameter ΘM of the conditional PDF f(Xmis | Xobs, ΘM) from the whole parameter Θ of the PDF f(X|Θ), with the note that only f(X|Θ) is defined first and then f(Xmis | Xobs, ΘM) is derived from f(X|Θ). Therefore, equation 2.15 is the cornerstone of this method. Note, equations 2.35 and 2.51 are instances of equation 2.15 when f(X|Θ) is the multinormal PDF or the multinomial PDF.
REFERENCES
[1] Burden, R. L., & Faires, D. J. (2011). Numerical Analysis (9th ed.). (M. Julet, Ed.) Brooks/Cole Cengage Learning.
[2] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. (M. Stone, Ed.) Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1-38.
[3] Hardle, W., & Simar, L. (2013). Applied Multivariate Statistical Analysis. Berlin, Germany: Research Data Center, School of Business and Economics, Humboldt University.
[4] Josse, J., Jiang, W., Sportisse, A., & Robin, G. (2018). Handling missing values. Inria. Julie Josse. Retrieved October 12, 2020, from http://juliejosse.com/wp-content/uploads/2018/07/LectureNotesMissing.html
[5] Nguyen, L. (2020). Tutorial on EM algorithm. MDPI Preprints. doi:10.20944/preprints201802.0131.v8
[6] Ta, P. D. (2014). Numerical Analysis Lecture Notes. Vietnam Institute of Mathematics, Numerical Analysis and Scientific Computing. Hanoi: Vietnam Institute of Mathematics.
[7] Wikipedia. (2014, August 4). Karush–Kuhn–Tucker conditions. (Wikimedia Foundation) Retrieved November 16, 2014, from http://en.wikipedia.org/wiki/Karush–Kuhn–Tucker_conditions
[8] Wikipedia. (2016). Exponential family. (Wikimedia Foundation) Retrieved from https://en.wikipedia.org/wiki/Exponential_family