

GRD Journal for Engineering | Volume 6 | Issue 11 | October 2021
ISSN: 2455-5703

Handling Missing Data with Expectation Maximization Algorithm
Loc Nguyen
Independent Scholar
Department of Applied Science
Loc Nguyen's Academic Network

Abstract
The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating the parameters of statistical models in the case of incomplete or hidden data. EM assumes that there is a relationship between hidden data and observed data, which can be a joint distribution or a mapping function; this in turn implies an implicit relationship between parameter estimation and data imputation. If missing data, which contains missing values, is considered as hidden data, it is very natural to handle missing data with the EM algorithm. Handling missing data is not new research, but this report focuses on the theoretical basis, with detailed mathematical proofs, for filling in missing values with EM. Besides, the multinormal distribution and the multinomial distribution are the two sample statistical models considered for holding missing values.
Keywords- Expectation Maximization (EM), Missing Data, Multinormal Distribution, Multinomial Distribution

I. INTRODUCTION
The literature on the expectation maximization (EM) algorithm in this report is mainly extracted from the preeminent article "Maximum Likelihood from Incomplete Data via the EM Algorithm" by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin (Dempster, Laird, & Rubin, 1977). For convenience, let DLR refer to these three authors. The preprint "Tutorial on EM algorithm" (Nguyen, 2020) by Loc Nguyen is also referenced in this report.
Now we skim through an introduction to the EM algorithm. Suppose there are two spaces 𝑿 and 𝒀, in which 𝑿 is the hidden space whereas 𝒀 is the observed space. We do not know X, but there is a mapping from X to Y so that we can survey X by observing Y. The mapping is a many-one function φ: 𝑿 → 𝒀, and we denote φ⁻¹(Y) = {X ∈ 𝑿: φ(X) = Y} as the set of all X ∈ 𝑿 such that φ(X) = Y. We also denote X(Y) = φ⁻¹(Y). Let f(X | Θ) be the probability density function (PDF) of random variable X ∈ 𝑿 and let g(Y | Θ) be the PDF of random variable Y ∈ 𝒀. Note, Y is also called an observation. Equation 1.1 specifies g(Y | Θ) as the integral of f(X | Θ) over φ⁻¹(Y):
g(Y|Θ) = ∫_{φ⁻¹(Y)} f(X|Θ) dX    (1.1)
Where Θ is the probabilistic parameter represented as a column vector, Θ = (θ1, θ2,…, θr)^T, in which each θi is a particular parameter. If X and Y are discrete, equation 1.1 is re-written as follows:
g(Y|Θ) = Σ_{X ∈ φ⁻¹(Y)} f(X|Θ)
According to the viewpoint of Bayesian statistics, Θ is also a random variable. As a convention, let Ω be the domain of Θ such that Θ ∈ Ω, and the dimension of Ω is r. For example, the normal distribution has two particular parameters, mean μ and variance σ^2, and so we have Θ = (μ, σ^2)^T. Note that Θ can degrade into a scalar as Θ = θ. The conditional PDF of X given Y, denoted k(X | Y, Θ), is specified by equation 1.2:
k(X|Y, Θ) = f(X|Θ) / g(Y|Θ)    (1.2)
According to DLR (Dempster, Laird, & Rubin, 1977, p. 1), X is called the complete data, and the term "incomplete data" implies the existence of X and Y where X is not observed directly and X is only known through the many-one mapping φ: 𝑿 → 𝒀. In general, we only know Y, f(X | Θ), and k(X | Y, Θ), and so our purpose is to estimate Θ based on such Y, f(X | Θ), and k(X | Y, Θ). Like the MLE approach, the EM algorithm also maximizes a likelihood function to estimate Θ, but the likelihood function in EM concerns Y, and there are some different aspects of EM which will be described later. The pioneers of the EM algorithm first assumed that f(X | Θ) belongs to the exponential family, with the note that many popular distributions such as the normal, multinomial, and Poisson distributions belong to the exponential family. Although DLR (Dempster, Laird, & Rubin, 1977) proposed a generalized EM algorithm in which f(X | Θ) is distributed arbitrarily, we should discuss the exponential family a little bit. The exponential family (Wikipedia, Exponential family, 2016) refers to a set of probabilistic distributions whose PDFs have the same exponential form according to equation 1.3 (Dempster, Laird, & Rubin, 1977, p. 3):
f(X|Θ) = b(X) exp(Θ^T τ(X)) / a(Θ)    (1.3)


Where b(X) is a function of X called the base measure, and τ(X) is a vector function of X called the sufficient statistic. For example, the sufficient statistic of the normal distribution is τ(X) = (X, XX^T)^T. Equation 1.3 expresses the canonical form of the exponential family. Recall that Ω is the domain of Θ such that Θ ∈ Ω. Suppose that Ω is a convex set. If Θ is restricted only to Ω then f(X | Θ) specifies a regular exponential family. If Θ lies in a curved sub-manifold Ω0 of Ω then f(X | Θ) specifies a curved exponential family. The term a(Θ) is the partition function over the variable X, which is used for normalization:
a(Θ) = ∫_X b(X) exp(Θ^T τ(X)) dX
As usual, a PDF is known in its popular form, but its exponential family form (the canonical form specified by equation 1.3) looks unlike the popular form although they are the same. Therefore, the parameter in popular form differs from the parameter in exponential family form.
For example, the multinormal distribution with theoretical mean μ and covariance matrix Σ of random variable X = (x1, x2,…, xn)^T has the following PDF in popular form:
f(X|μ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(X − μ)^T Σ^{−1} (X − μ))
Hence, the parameter in popular form is Θ = (μ, Σ)^T. The exponential family form of this PDF is:
f(X|θ1, θ2) = (2π)^{−n/2} exp((θ1, θ2)(X, XX^T)^T) / exp(−(1/4) θ1^T θ2^{−1} θ1 − (1/2) log|−2θ2|)
Where,
Θ = (θ1, θ2)^T
θ1 = Σ^{−1} μ
θ2 = −(1/2) Σ^{−1}
b(X) = (2π)^{−n/2}
τ(X) = (X, XX^T)^T
a(Θ) = exp(−(1/4) θ1^T θ2^{−1} θ1 − (1/2) log|−2θ2|)
The exponential family form is used to represent all distributions belonging to the exponential family in canonical form. The parameter in exponential family form is called the exponential family parameter. As a convention, the parameter Θ mentioned in the EM algorithm is the exponential family parameter if the PDF belongs to the exponential family and there is no additional information.
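To make the relation between the two parameterizations concrete, the following is a small numerical sketch (not part of the original text; numpy and the function names are illustrative assumptions) that converts the popular multinormal parameters (μ, Σ) into the exponential family parameters θ1 = Σ^{−1}μ and θ2 = −(1/2)Σ^{−1} and back:

import numpy as np

def to_natural(mu, Sigma):
    # popular parameters -> exponential family (natural) parameters
    prec = np.linalg.inv(Sigma)            # Sigma^{-1}
    return prec @ mu, -0.5 * prec          # theta1, theta2

def to_popular(theta1, theta2):
    # exponential family parameters -> popular parameters
    Sigma = np.linalg.inv(-2.0 * theta2)   # Sigma = (-2 theta2)^{-1}
    return Sigma @ theta1, Sigma           # mu, Sigma

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
theta1, theta2 = to_natural(mu, Sigma)
mu_back, Sigma_back = to_popular(theta1, theta2)
assert np.allclose(mu, mu_back) and np.allclose(Sigma, Sigma_back)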
The expectation maximization (EM) algorithm runs many iterations, and each iteration has two steps: the expectation step (E-step) calculates the sufficient statistic of the hidden data based on the observed data and the current parameter, whereas the maximization step (M-step) re-estimates the parameter. When DLR proposed the EM algorithm (Dempster, Laird, & Rubin, 1977), they first assumed that the PDF f(X | Θ) of the hidden space belongs to the exponential family. The E-step and M-step at the t-th iteration are described in table 1.1 (Dempster, Laird, & Rubin, 1977, p. 4), in which the current estimate is Θ^(t), with the note that f(X | Θ) belongs to a regular exponential family.
E-step:
We calculate the current value τ^(t) of the sufficient statistic τ(X) from the observed Y and the current parameter Θ^(t) according to the following equation:
τ^(t) = E(τ(X) | Y, Θ^(t)) = ∫_{φ⁻¹(Y)} k(X | Y, Θ^(t)) τ(X) dX
M-step:
Basing on τ^(t), we determine the next parameter Θ^(t+1) as the solution of the following equation:
E(τ(X) | Θ) = ∫_X f(X | Θ) τ(X) dX = τ^(t)
Note, Θ^(t+1) will become the current parameter at the next iteration (the (t+1)-th iteration).
Table 1.1. E-step and M-step of EM algorithm given regular exponential PDF f(X|Θ)
The EM algorithm stops if two successive estimates are equal, Θ* = Θ^(t) = Θ^(t+1), at some t-th iteration. At that time we conclude that Θ* is the optimal estimate of the EM process. As a convention, the estimate of the parameter Θ resulting from the EM process is denoted Θ* instead of Θ̂ in order to emphasize that Θ* is the solution of an optimization problem.
For further research, DLR gave a preeminent generalization of the EM algorithm (Dempster, Laird, & Rubin, 1977, pp. 6-11) in which f(X | Θ) follows an arbitrary distribution. In other words, there is no requirement of the exponential family. They define the conditional expectation Q(Θ' | Θ) according to equation 1.4 (Dempster, Laird, & Rubin, 1977, p. 6):
Q(Θ'|Θ) = E(log(f(X|Θ')) | Y, Θ) = ∫_{φ⁻¹(Y)} k(X|Y, Θ) log(f(X|Θ')) dX    (1.4)


If X and Y are discrete, equation 1.4 can be re-written as follows:
Q(Θ'|Θ) = E(log(f(X|Θ')) | Y, Θ) = Σ_{X ∈ φ⁻¹(Y)} k(X|Y, Θ) log(f(X|Θ'))
The two steps of generalized EM (GEM) algorithm aim to maximize Q(Θ | Θ(t)) at some tth iteration as seen in table 1.2 (Dempster,
Laird, & Rubin, 1977, p. 6).
E-step:
The expectation Q(Θ | Θ^(t)) is determined based on the current parameter Θ^(t), according to equation 1.4. Actually, Q(Θ | Θ^(t)) is formulated as a function of Θ.
M-step:
The next parameter Θ^(t+1) is a maximizer of Q(Θ | Θ^(t)) with respect to Θ. Note that Θ^(t+1) will become the current parameter at the next iteration (the (t+1)-th iteration).
Table 1.2. E-step and M-step of GEM algorithm
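As a rough illustration of table 1.2, the following model-agnostic sketch runs the GEM loop; the helpers build_Q and argmax_Q are hypothetical placeholders that a concrete application would supply (they are not defined in the article), and the stopping rule follows the convention that two successive estimates coincide:

import numpy as np

def gem(theta0, build_Q, argmax_Q, max_iter=100, tol=1e-8):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        Q = build_Q(theta)                                  # E-step: form Q(theta' | theta^(t))
        theta_next = np.asarray(argmax_Q(Q), dtype=float)   # M-step: theta^(t+1) = argmax Q
        if np.max(np.abs(theta_next - theta)) < tol:        # successive estimates agree
            return theta_next
        theta = theta_next
    return theta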
DLR proved that the GEM algorithm converges at some t-th iteration. At that time, Θ* = Θ^(t+1) = Θ^(t) is the optimal estimate of the EM process, which is a maximizer of the likelihood function L(Θ):
Θ* = argmax_Θ L(Θ)
It is deduced from the E-step and M-step that Q(Θ | Θ^(t)) increases after every iteration. How to maximize Q(Θ | Θ^(t)) is an optimization problem which depends on the application. For example, the estimate Θ^(t+1) can be the solution of the equation created by setting the first-order derivative of Q(Θ | Θ^(t)) with regard to Θ to zero, DQ(Θ | Θ^(t)) = 0^T. If solving such an equation is too complex or impossible, some popular methods to solve the optimization problem are Newton-Raphson (Burden & Faires, 2011, pp. 67-71), gradient descent (Ta, 2014), and Lagrange duality (Wikipedia, Karush–Kuhn–Tucker conditions, 2014).
In practice, Y is observed as N particular observations Y1, Y2,…, YN. Let 𝒴 = {Y1, Y2,…, YN} be the observed sample of size N, with the note that all Yi (s) are mutually independent and identically distributed (iid). Given an observation Yi, there is an associated random variable Xi. All Xi (s) are iid and they are not existent in fact. Each Xi ∈ 𝑿 is a random variable like X. Of course, the domain of each Xi is 𝑿. Let 𝒳 = {X1, X2,…, XN} be the set of associated random variables. Because all Xi (s) are iid, the joint PDF of 𝒳 is determined as follows:
f(𝒳|Θ) = f(X1, X2, …, XN | Θ) = ∏_{i=1}^{N} f(Xi|Θ)
Because all Xi (s) are iid and each Yi is associated with Xi, the conditional joint PDF of 𝒳 given 𝒴 is determined as follows:
k(𝒳|𝒴, Θ) = k(X1, X2, …, XN | Y1, Y2, …, YN, Θ) = ∏_{i=1}^{N} k(Xi | Y1, Y2, …, YN, Θ) = ∏_{i=1}^{N} k(Xi | Yi, Θ)
The conditional expectation Q(Θ' | Θ) given the samples 𝒳 and 𝒴 is re-written according to equation 1.5:
Q(Θ'|Θ) = Σ_{i=1}^{N} ∫_{φ⁻¹(Yi)} k(X|Yi, Θ) log(f(X|Θ')) dX    (1.5)
Equation 1.5 is proved in (Nguyen, 2020, pp. 45-47). In case f(X | Θ) and k(X | Yi, Θ) belong to the exponential family, equation 1.5 becomes equation 1.6 with an observed sample 𝒴 = {Y1, Y2,…, YN}:
Q(Θ'|Θ) = (Σ_{i=1}^{N} E(log(b(X)) | Yi, Θ)) + (Θ')^T (Σ_{i=1}^{N} τ_{Θ,Yi}) − N log(a(Θ'))    (1.6)
Where,
E(log(b(X)) | Yi, Θ) = ∫_{φ⁻¹(Yi)} k(X|Yi, Θ) log(b(X)) dX
τ_{Θ,Yi} = E(τ(X) | Yi, Θ) = ∫_{φ⁻¹(Yi)} k(X|Yi, Θ) τ(X) dX
DLR (Dempster, Laird, & Rubin, 1977, p. 1) called X the complete data because the mapping φ: 𝑿 → 𝒀 is a many-one function. There is another case in which the complete space 𝒁 consists of the hidden space 𝑿 and the observed space 𝒀, with the note that 𝑿 and 𝒀 are separated. There is no explicit mapping φ from X to Y, but there exists a PDF of Z ∈ 𝒁 as the joint PDF of X ∈ 𝑿 and Y ∈ 𝒀:
f(Z|Θ) = f(X, Y|Θ)
The PDF of Y becomes:
f(Y|Θ) = ∫_X f(X, Y|Θ) dX
The PDF f(Y|Θ) is equivalent to the PDF g(Y|Θ) mentioned in equation 1.1. Although there is no explicit mapping from X to Y, the PDF of Y above implies an implicit mapping from Z to Y. The conditional PDF of Z given Y (equivalently, of X given Y) is specified according to Bayes' rule as follows:


f(Z|Y, Θ) = f(X, Y|Y, Θ) = f(X|Y, Θ) = f(X, Y|Θ) / f(Y|Θ) = f(X, Y|Θ) / ∫_X f(X, Y|Θ) dX
The conditional PDF f(X|Y, Θ) is equivalent to the conditional PDF k(X|Y, Θ) mentioned in equation 1.2. Of course, given Y, we always have:
∫_X f(X|Y, Θ) dX = 1
Equation 1.7 specifies the conditional expectation Q(Θ' | Θ) in the case that there is no explicit mapping from X to Y but there exists the joint PDF of X and Y:
Q(Θ'|Θ) = ∫_X f(Z|Y, Θ) log(f(Z|Θ')) dX = ∫_X f(X|Y, Θ) log(f(X, Y|Θ')) dX    (1.7)
Where,
f(X|Y, Θ) = f(X, Y|Θ) / f(Y|Θ) = f(X, Y|Θ) / ∫_X f(X, Y|Θ) dX
Note, X is separated from Y and the complete data Z = (X, Y) is composed of X and Y. For equation 1.7, the existence of the joint PDF f(X, Y | Θ) can be replaced by the existence of the conditional PDF f(Y|X, Θ) and the prior PDF f(X|Θ) due to:
f(X, Y|Θ) = f(Y|X, Θ) f(X|Θ)
In applied statistics, equation 1.4 is often replaced by equation 1.7 because specifying the joint PDF f(X, Y | Θ) is more practical than specifying the mapping φ: 𝑿 → 𝒀. However, equation 1.4 is more general than equation 1.7 because the requirement of the joint PDF for equation 1.7 is stricter than the requirement of the explicit mapping for equation 1.4. In the case that X and Y are discrete, equation 1.7 becomes:
Q(Θ'|Θ) = Σ_X P(X|Y, Θ) log(P(X, Y|Θ'))
In practice, suppose Y is observed as a sample 𝒴 = {Y1, Y2,…, YN} of size N, with the note that all Yi (s) are mutually independent and identically distributed (iid). The observed sample 𝒴 is associated with a hidden set (latent set) 𝒳 = {X1, X2,…, XN} of size N. All Xi (s) are iid and they are not existent in fact. Let X ∈ 𝑿 be the random variable representing every Xi. Of course, the domain of X is 𝑿. Equation 1.8 specifies the conditional expectation Q(Θ' | Θ) given such 𝒴:
Q(Θ'|Θ) = Σ_{i=1}^{N} ∫_X f(X|Yi, Θ) log(f(X, Yi|Θ')) dX    (1.8)
Equation 1.8 is a variant of equation 1.5 in the case that there is no explicit mapping between Xi and Yi but there exists the same joint PDF of Xi and Yi. If both X and Y are discrete, equation 1.8 becomes:
Q(Θ'|Θ) = Σ_{i=1}^{N} Σ_X P(X|Yi, Θ) log(P(X, Yi|Θ'))    (1.9)
If X is discrete and Y is continuous such that f(X, Y | Θ) = P(X|Θ) f(Y | X, Θ) then, according to the total probability rule, we have:
f(Y|Θ) = Σ_X P(X|Θ) f(Y|X, Θ)
Note, when only X is discrete, its PDF f(X|Θ) becomes the probability P(X|Θ). Therefore, equation 1.10 is a variant of equation 1.8, as follows:
Q(Θ'|Θ) = Σ_{i=1}^{N} Σ_X P(X|Yi, Θ) log(P(X|Θ') f(Yi|X, Θ'))    (1.10)
Where P(X | Yi, Θ) is determined by Bayes' rule, as follows:
P(X|Yi, Θ) = P(X|Θ) f(Yi|X, Θ) / Σ_X P(X|Θ) f(Yi|X, Θ)
Equation 1.10 is the basis for estimating a probabilistic mixture model with the EM algorithm, which is not the main subject of this report. Now we consider how to apply EM to handling missing data, for which equation 1.8 is most relevant. The goal of maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, and EM is to estimate a statistical model based on a sample. Whereas MLE and MAP require complete data, EM accepts hidden data or incomplete data. Therefore, EM is appropriate for handling missing data, which contains missing values. Indeed, estimating a parameter with missing data is very natural for EM, but it is necessary to have a new viewpoint in which missing data is considered as hidden data (X). Moreover, the GEM version with a joint probability (without a mapping function; please see equation 1.7 and equation 1.8) is used and some changes are required. Handling missing data, which is the main subject of this report, is described in the next section.


II. HANDLING MISSING DATA


Let X = (x1, x2,…, xn)^T be an n-dimension random variable whose n elements are partial random variables xj (s). Suppose X is composed of two parts, an observed part Xobs and a missing part Xmis, such that X = {Xobs, Xmis}. Note, Xobs and Xmis are considered as random variables.
X = {Xobs, Xmis} = (x1, x2, …, xn)^T    (2.1)
When X is observed, Xobs and Xmis are determined. For example, given X = (x1, x2, x3, x4, x5)^T, when X is observed as X = (x1=1, x2=?, x3=4, x4=?, x5=9)^T, where the question mark "?" denotes a missing value, Xobs and Xmis are determined as Xobs = (x1=1, x3=4, x5=9)^T and Xmis = (x2=?, x4=?)^T. When X is observed as X = (x1=?, x2=3, x3=4, x4=?, x5=?)^T then Xobs and Xmis are determined as Xobs = (x2=3, x3=4)^T and Xmis = (x1=?, x4=?, x5=?)^T. Let M be the set of indices j such that xj is missing when X is observed. M is called the missing index set.
M = {j: xj is missing}, j ∈ {1, …, n}    (2.2)
Suppose
M = {m1, m2, …, m|M|}    (2.3)
Where each mi ∈ {1, …, n} and mi ≠ mj for i ≠ j. Let M̄ be the complementary set of M with respect to {1, 2,…, n}. M̄ is called the existent index set.
M̄ = {j: xj is existent}, j ∈ {1, …, n}    (2.4)
M or M̄ can be empty. They are mutual because M̄ can be defined based on M and vice versa:
M ∪ M̄ = {1, 2, …, n}
M ∩ M̄ = ∅
Suppose
M̄ = {m̄1, m̄2, …, m̄|M̄|}    (2.5)
Where each m̄i ∈ {1, …, n}, m̄i ≠ m̄j for i ≠ j, and |M| + |M̄| = n. We have:
Xmis = (xj: j ∈ M)^T = (x_{m1}, x_{m2}, …, x_{m|M|})^T    (2.6)
Xobs = (xj: j ∈ M̄)^T = (x_{m̄1}, x_{m̄2}, …, x_{m̄|M̄|})^T    (2.7)
Obviously, the dimension of Xmis is |M| and the dimension of Xobs is |M̄| = n − |M|. Note, when composing X from Xobs and Xmis as X = {Xobs, Xmis}, a right re-arrangement of the elements in both Xobs and Xmis is required.
Let Z = (z1, z2,…, zn)^T be an n-dimension random variable whose each element zj is a binary random variable indicating whether xj is missing. The random variable Z is also called the missingness variable.
zj = 1 if xj is missing, zj = 0 if xj is existent    (2.8)
For example, given X = (x1, x2, x3, x4, x5)^T, when X is observed as X = (x1=1, x2=?, x3=4, x4=?, x5=9)^T, we have Xobs = (x1=1, x3=4, x5=9)^T, Xmis = (x2=?, x4=?)^T, and Z = (z1=0, z2=1, z3=0, z4=1, z5=0)^T.
Generally, when X is replaced by a sample 𝒳 = {X1, X2,…, XN} whose Xi (s) are iid, let 𝒵 = {Z1, Z2,…, ZN} be the set of missingness variables associated with 𝒳. All Zi (s) are iid too. 𝒳 and 𝒵 can be represented as matrices. Given Xi, its associated quantities are Zi, Mi, and M̄i. Let X = {Xobs, Xmis} be the random variable representing every Xi. Let Z be the random variable representing every Zi. As a convention, Xobs(i) and Xmis(i) refer to the Xobs part and the Xmis part of Xi. We have:
Xi = {Xobs(i), Xmis(i)} = (xi1, xi2, …, xin)^T
Xmis(i) = (x_{i m_{i1}}, x_{i m_{i2}}, …, x_{i m_{i|Mi|}})^T
Xobs(i) = (x_{i m̄_{i1}}, x_{i m̄_{i2}}, …, x_{i m̄_{i|M̄i|}})^T    (2.9)
Mi = {m_{i1}, m_{i2}, …, m_{i|Mi|}}
M̄i = {m̄_{i1}, m̄_{i2}, …, m̄_{i|M̄i|}}
Zi = (zi1, zi2, …, zin)^T
For example, given a sample of size 4, 𝒳 = {X1, X2, X3, X4} in which X1 = (x11=1, x12=?, x13=3, x14=?)^T, X2 = (x21=?, x22=2, x23=?, x24=4)^T, X3 = (x31=1, x32=2, x33=?, x34=?)^T, and X4 = (x41=?, x42=?, x43=3, x44=4)^T are iid. Therefore, we also have Z1 = (z11=0, z12=1, z13=0, z14=1)^T, Z2 = (z21=1, z22=0, z23=1, z24=0)^T, Z3 = (z31=0, z32=0, z33=1, z34=1)^T, and Z4 = (z41=1, z42=1, z43=0, z44=0)^T. All Zi (s) are iid too.
     x1  x2  x3  x4         z1  z2  z3  z4
X1    1   ?   3   ?     Z1   0   1   0   1
X2    ?   2   ?   4     Z2   1   0   1   0
X3    1   2   ?   ?     Z3   0   0   1   1
X4    ?   ?   3   4     Z4   1   1   0   0
Of course, we have Xobs(1) = (x11=1, x13=3)^T, Xmis(1) = (x12=?, x14=?)^T, Xobs(2) = (x22=2, x24=4)^T, Xmis(2) = (x21=?, x23=?)^T, Xobs(3) = (x31=1, x32=2)^T, Xmis(3) = (x33=?, x34=?)^T, Xobs(4) = (x43=3, x44=4)^T, and Xmis(4) = (x41=?, x42=?)^T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, M̄2 = {m̄21=2, m̄22=4}, M3 = {m31=3, m32=4}, M̄3 = {m̄31=1, m̄32=2}, M4 = {m41=1, m42=2}, and M̄4 = {m̄41=3, m̄42=4}.
Both X and Z are associated with their own PDFs, as follows:
f(X|Θ) = f(Xobs, Xmis|Θ)
f(Z|Xobs, Xmis, Φ)    (2.10)
Where Θ and Φ are the parameters of the PDFs of X = {Xobs, Xmis} and Z, respectively. The goal of handling missing data is to estimate Θ and Φ given X. The sufficient statistic of X = {Xobs, Xmis} is composed of the sufficient statistic of Xobs and the sufficient statistic of Xmis:
τ(X) = τ(Xobs, Xmis) = {τ(Xobs), τ(Xmis)}    (2.11)
How to compose τ(X) from τ(Xobs) and τ(Xmis) depends on the distribution type of the PDF f(X|Θ).
The joint PDF of X and Z is the main object of handling missing data, which is defined as follows:
f(X, Z|Θ, Φ) = f(Xobs, Xmis, Z|Θ, Φ) = f(Z|Xobs, Xmis, Φ) f(Xobs, Xmis|Θ)    (2.12)
The PDF of Xobs is defined as the integral of f(X|Θ) over Xmis:
f(Xobs|Θ) = ∫_{Xmis} f(Xobs, Xmis|Θ) dXmis    (2.13)
The conditional PDF of Xmis given Xobs is:
f(Xmis|Xobs, ΘM) = f(Xmis|Xobs, Θ) = f(X|Θ) / f(Xobs|Θ) = f(Xobs, Xmis|Θ) / f(Xobs|Θ)    (2.14)
The notation ΘM implies that the parameter ΘM of the PDF f(Xmis | Xobs, ΘM) is derived from the parameter Θ of the PDF f(X|Θ); it is a function of Θ and Xobs, ΘM = u(Θ, Xobs). Thus, ΘM is not a new parameter, and its form depends on the distribution type:
ΘM = u(Θ, Xobs)    (2.15)
How to determine u(Θ, Xobs) depends on the distribution type of the PDF f(X|Θ).
There are three types of missing data, depending on the relationship between Xobs, Xmis, and Z (Josse, Jiang, Sportisse, & Robin, 2018):
- Missing data (X or 𝒳) is Missing Completely At Random (MCAR) if the probability of Z is independent of both Xobs and Xmis, such that f(Z | Xobs, Xmis, Φ) = f(Z | Φ).
- Missing data (X or 𝒳) is Missing At Random (MAR) if the probability of Z depends only on Xobs, such that f(Z | Xobs, Xmis, Φ) = f(Z | Xobs, Φ).
- Missing data (X or 𝒳) is Missing Not At Random (MNAR) in all other cases, that is, when f(Z | Xobs, Xmis, Φ) genuinely depends on Xmis.
There are two main approaches for handling missing data (Josse, Jiang, Sportisse, & Robin, 2018):
- Using some statistical models such as EM to estimate the parameter with missing data.
- Imputing plausible values for the missing values to obtain some complete samples (copies) from the missing data. Later on, every complete sample is used to produce an estimate of the parameter by some estimation method, for example MLE or MAP. Finally, all estimates are synthesized to produce the best estimate.
Here we focus on the first approach, using EM to estimate the parameter with missing data. Without loss of generality, given a sample 𝒳 = {X1, X2,…, XN} in which all Xi (s) are iid, by applying equation 1.8 for GEM with the joint PDF f(Xobs, Xmis, Z | Θ, Φ), we consider {Xobs, Z} as the observed part and Xmis as the hidden part. Let X = {Xobs, Xmis} be the random variable representing all Xi (s). Let Xobs(i) denote the observed part Xobs of Xi and let Zi be the missingness variable corresponding to Xi. By following equation 1.8, the expectation Q(Θ', Φ' | Θ, Φ) becomes:
Q(Θ', Φ'|Θ, Φ) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Zi, Θ, Φ) log(f(Xobs(i), Xmis, Zi|Θ', Φ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ) log(f(Xobs(i), Xmis, Zi|Θ', Φ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Xobs(i), Xmis, Zi|Θ', Φ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Xobs(i), Xmis|Θ', Φ') f(Zi|Xobs(i), Xmis, Θ', Φ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Xobs(i), Xmis|Θ') f(Zi|Xobs(i), Xmis, Φ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) (log(f(Xobs(i), Xmis|Θ')) + log(f(Zi|Xobs(i), Xmis, Φ'))) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Xobs(i), Xmis|Θ')) dXmis
+ Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Zi|Xobs(i), Xmis, Φ')) dXmis
In short, Q(Θ', Φ' | Θ, Φ) is specified as follows:
Q(Θ', Φ'|Θ, Φ) = Q1(Θ'|Θ) + Q2(Φ'|Θ)    (2.16)
Where,
Q1(Θ'|Θ) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Xobs(i), Xmis|Θ')) dXmis
Q2(Φ'|Θ) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Zi|Xobs(i), Xmis, Φ')) dXmis
Note, the unknowns of Q(Θ', Φ' | Θ, Φ) are Θ' and Φ'. Because it is not easy to maximize Q(Θ', Φ' | Θ, Φ) with regard to Θ' and Φ' in general, we assume that the PDF f(X|Θ) belongs to the exponential family:
f(X|Θ) = f(Xobs, Xmis|Θ) = b(Xobs, Xmis) exp(Θ^T τ(Xobs, Xmis)) / a(Θ)    (2.17)
Note,
b(X) = b(Xobs, Xmis)
τ(X) = τ(Xobs, Xmis) = {τ(Xobs), τ(Xmis)}
It is easy to deduce that
f(Xmis|Xobs, ΘM) = b(Xmis) exp((ΘM)^T τ(Xmis)) / a(ΘM)    (2.18)
Therefore,
f(Xmis|Xobs(i), Θ_{Mi}) = b(Xmis) exp((Θ_{Mi})^T τ(Xmis)) / a(Θ_{Mi})
We have:
Q1(Θ'|Θ) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(f(Xobs(i), Xmis|Θ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(b(Xobs(i), Xmis) exp((Θ')^T τ(Xobs(i), Xmis)) / a(Θ')) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) (log(b(Xobs(i), Xmis)) + (Θ')^T τ(Xobs(i), Xmis) − log(a(Θ'))) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(b(Xobs(i), Xmis)) dXmis + (Θ')^T Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) τ(Xobs(i), Xmis) dXmis − N log(a(Θ'))
Because τ(Xobs(i), Xmis) = {τ(Xobs(i)), τ(Xmis)} and ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) τ(Xobs(i)) dXmis = τ(Xobs(i)) ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) dXmis = τ(Xobs(i)), we obtain:
Q1(Θ'|Θ) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(b(Xobs(i), Xmis)) dXmis + (Θ')^T Σ_{i=1}^{N} {τ(Xobs(i)), ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) τ(Xmis) dXmis} − N log(a(Θ'))
Therefore, equation 2.19 specifies Q1(Θ'|Θ) given that f(X|Θ) belongs to the exponential family:
Q1(Θ'|Θ) = Σ_{i=1}^{N} E(log(b(Xobs(i), Xmis))|Θ_{Mi}) + (Θ')^T Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})} − N log(a(Θ'))    (2.19)
Where,
E(log(b(Xobs(i), Xmis))|Θ_{Mi}) = ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) log(b(Xobs(i), Xmis)) dXmis    (2.20)
E(τ(Xmis)|Θ_{Mi}) = ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}) τ(Xmis) dXmis    (2.21)
At the M-step of some t-th iteration, the next parameter Θ^(t+1) is the solution of the equation created by setting the first-order derivative of Q1(Θ'|Θ) to zero. The first-order derivative of Q1(Θ'|Θ) is:
∂Q1(Θ'|Θ)/∂Θ' = Σ_{i=1}^{N} (E(τ(Xobs(i), Xmis)|Θ_{Mi}))^T − N log'(a(Θ')) = Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})}^T − N log'(a(Θ'))
By referring to table 1.2, we have:
log'(a(Θ')) = (E(τ(X)|Θ'))^T = ∫_X f(X|Θ') (τ(X))^T dX
Where,
f(X|Θ) = f(Xobs, Xmis|Θ) = b(Xobs, Xmis) exp(Θ^T τ(Xobs, Xmis)) / a(Θ)
b(X) = b(Xobs, Xmis)
τ(X) = τ(Xobs, Xmis) = {τ(Xobs), τ(Xmis)}
Thus, the next parameter Θ^(t+1) is the solution of the following equation:
∂Q1(Θ'|Θ)/∂Θ' = Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})}^T − N (E(τ(X)|Θ'))^T = 0^T
This implies that the next parameter Θ^(t+1) is the solution of the following equation:
E(τ(X)|Θ') = (1/N) Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})}
As a result, at the E-step of some t-th iteration, given the current parameter Θ^(t), the sufficient statistic of X is calculated as follows:
τ^(t) = (1/N) Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi}^(t))}    (2.22)
Where,
Θ_{Mi}^(t) = u(Θ^(t), Xobs(i))
E(τ(Xmis)|Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}^(t)) τ(Xmis) dXmis
Equation 2.22 is a variant of equation 2.11 when f(X|Θ) belongs to the exponential family, but how to compose τ(X) from τ(Xobs) and τ(Xmis) is not yet determined exactly.
As a result, at the M-step of some t-th iteration, given τ^(t) and Θ^(t), the next parameter Θ^(t+1) is a solution of the following equation:
E(τ(X)|Θ) = τ^(t)    (2.23)
Moreover, at the M-step of some t-th iteration, the next parameter Φ^(t+1) is a maximizer of Q2(Φ | Θ^(t)) given Θ^(t), as follows:
Φ^(t+1) = argmax_Φ Q2(Φ|Θ^(t))    (2.24)
Where,
Q2(Φ|Θ^(t)) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}^(t)) log(f(Zi|Xobs(i), Xmis, Φ)) dXmis    (2.25)
How to maximize Q2(Φ | Θ^(t)) depends on the distribution type of Zi, that is, on the formulation of the PDF f(Z | Xobs, Xmis, Φ). For some reasons, such as accelerating estimation speed or deliberately ignoring the missingness variable Z, the next parameter Φ^(t+1) may not be estimated.
In general, the two steps of the GEM algorithm for handling missing data at some t-th iteration are summarized in table 2.1, with the assumption that the PDF of the missing data f(X|Θ) belongs to the exponential family.
E-step:
Given the current parameter Θ^(t), the sufficient statistic τ^(t) is calculated according to equation 2.22:
τ^(t) = (1/N) Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi}^(t))}
Where,
Θ_{Mi}^(t) = u(Θ^(t), Xobs(i))
E(τ(Xmis)|Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}^(t)) τ(Xmis) dXmis
M-step:
Given τ^(t) and Θ^(t), the next parameter Θ^(t+1) is a solution of equation 2.23:
E(τ(X)|Θ) = τ^(t)
Given Θ^(t), the next parameter Φ^(t+1) is a maximizer of Q2(Φ | Θ^(t)) according to equation 2.24:
Φ^(t+1) = argmax_Φ Q2(Φ|Θ^(t))
Where,
Q2(Φ|Θ^(t)) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Xobs(i), Θ_{Mi}^(t)) log(f(Zi|Xobs(i), Xmis, Φ)) dXmis
Table 2.1. E-step and M-step of GEM algorithm for handling missing data given exponential PDF
The GEM algorithm converges at some t-th iteration. At that time, Θ* = Θ^(t+1) = Θ^(t) and Φ* = Φ^(t+1) = Φ^(t) are the optimal estimates. If the missingness variable Z is ignored for some reason, the parameter Φ is not estimated. Because Xmis is a part of X and f(Xmis | Xobs, ΘM) is derived directly from f(X|Θ), in practice we can stop GEM after its first iteration is done, which is often reasonable enough to handle missing data.
An interesting application of handling missing data is to fill in or predict missing values. For instance, suppose the estimate resulting from GEM is Θ*; the missing values represented by τ(Xmis) are filled in by the expectation of τ(Xmis) as follows:
τ(Xmis) = E(τ(Xmis)|Θ*_M)    (2.26)
Where,
Θ*_M = u(Θ*, Xobs)
Now we survey the popular case in which the sample 𝒳 = {X1, X2,…, XN}, whose Xi (s) are iid, is MCAR data, f(X|Θ) is a multinormal PDF, and the missingness variable Z follows a binomial distribution of n trials. Let X = {Xobs, Xmis} be the random variable representing every Xi. Suppose the dimension of X is n. Let Z be the random variable representing every Zi. According to equation 2.9, recall that


Xi = {Xobs(i), Xmis(i)} = (xi1, xi2, …, xin)^T
Xmis(i) = (x_{i m_{i1}}, x_{i m_{i2}}, …, x_{i m_{i|Mi|}})^T
Xobs(i) = (x_{i m̄_{i1}}, x_{i m̄_{i2}}, …, x_{i m̄_{i|M̄i|}})^T
Mi = {m_{i1}, m_{i2}, …, m_{i|Mi|}}
M̄i = {m̄_{i1}, m̄_{i2}, …, m̄_{i|M̄i|}}
Zi = (zi1, zi2, …, zin)^T
The PDF of X is:
f(X|Θ) = f(Xobs, Xmis|Θ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(X − μ)^T Σ^{−1} (X − μ))    (2.27)
Therefore,
f(Xi|Θ) = f(Xobs(i), Xmis(i)|Θ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(Xi − μ)^T Σ^{−1} (Xi − μ))
The PDF of Z is:
f(Z|Φ) = p^{c(Z)} (1 − p)^{n−c(Z)}    (2.28)
Therefore,
f(Zi|Φ) = p^{c(Zi)} (1 − p)^{n−c(Zi)}
Where Θ = (μ, Σ)^T and Φ = p. Note, given the PDF f(X | Θ), μ is the mean and Σ is the covariance matrix whose element σ_{uv} is the covariance of x_u and x_v:
μ = (μ1, μ2, …, μn)^T
Σ = (σ_{uv}), the n×n matrix whose (u, v) entry is σ_{uv}    (2.29)
Suppose the probability of missingness at every partial random variable xj is p and it is independent of Xobs and Xmis. The quantity c(Z) is the number of zj (s) in Z that equal 1. For example, if Z = (1, 0, 1, 0)^T then c(Z) = 2. The most important task here is to instantiate equation 2.11 and equation 2.15, in order to compose τ(X) from τ(Xobs), τ(Xmis) and to extract ΘM from Θ, when f(X|Θ) is normally distributed.
The conditional PDF of Xmis given Xobs is also a multinormal PDF:
f(Xmis|ΘM) = f(Xmis|Xobs, ΘM) = f(Xmis|Xobs, Θ) = (2π)^{−|M|/2} |Σ_M|^{−1/2} exp(−(1/2)(Xmis − μ_M)^T Σ_M^{−1} (Xmis − μ_M))    (2.30)
Therefore,
f(Xmis(i)|Θ_{Mi}) = f(Xmis(i)|Xobs(i), Θ_{Mi}) = f(Xmis(i)|Xobs(i), Θ) = (2π)^{−|Mi|/2} |Σ_{Mi}|^{−1/2} exp(−(1/2)(Xmis(i) − μ_{Mi})^T Σ_{Mi}^{−1} (Xmis(i) − μ_{Mi}))
Where Θ_{Mi} = (μ_{Mi}, Σ_{Mi})^T. We denote
f(Xmis(i)|Θ_{Mi}) = f(Xmis(i)|Xobs(i), Θ_{Mi})
because f(Xmis(i)|Xobs(i), Θ_{Mi}) depends only on Θ_{Mi} within the normal PDF, whereas Θ_{Mi} depends on Xobs(i). Determining the function Θ_{Mi} = u(Θ, Xobs(i)) is now necessary to extract the parameter Θ_{Mi} from Θ given Xobs(i) when f(Xi|Θ) is a normal distribution.
Let Θmis = (μmis, Σmis)^T be the parameter of the marginal PDF of Xmis; we have:
f(Xmis|Θmis) = (2π)^{−|M|/2} |Σmis|^{−1/2} exp(−(1/2)(Xmis − μmis)^T (Σmis)^{−1} (Xmis − μmis))    (2.31)
Therefore,
f(Xmis(i)|Θmis(i)) = (2π)^{−|Mi|/2} |Σmis(i)|^{−1/2} exp(−(1/2)(Xmis(i) − μmis(i))^T (Σmis(i))^{−1} (Xmis(i) − μmis(i)))
Where,
μmis(i) = (μ_{m_{i1}}, μ_{m_{i2}}, …, μ_{m_{i|Mi|}})^T
Σmis(i) = (σ_{m_{iu} m_{iv}}), the |Mi|×|Mi| matrix whose (u, v) entry is σ_{m_{iu} m_{iv}}    (2.32)
Obviously, Θmis(i) is extracted from Θ given the indicator Mi. Note, σ_{m_{iu} m_{iv}} is the covariance of x_{m_{iu}} and x_{m_{iv}}.
Let Θobs = (μobs, Σobs)^T be the parameter of the marginal PDF of Xobs; we have:
f(Xobs|Θobs) = (2π)^{−|M̄|/2} |Σobs|^{−1/2} exp(−(1/2)(Xobs − μobs)^T (Σobs)^{−1} (Xobs − μobs))    (2.33)
Therefore,
f(Xobs(i)|Θobs(i)) = (2π)^{−|M̄i|/2} |Σobs(i)|^{−1/2} exp(−(1/2)(Xobs(i) − μobs(i))^T (Σobs(i))^{−1} (Xobs(i) − μobs(i)))
Where,
μobs(i) = (μ_{m̄_{i1}}, μ_{m̄_{i2}}, …, μ_{m̄_{i|M̄i|}})^T
Σobs(i) = (σ_{m̄_{iu} m̄_{iv}}), the |M̄i|×|M̄i| matrix whose (u, v) entry is σ_{m̄_{iu} m̄_{iv}}    (2.34)
Obviously, Θobs(i) is extracted from Θ given the indicator M̄i or Mi. Note, σ_{m̄_{iu} m̄_{iv}} is the covariance of x_{m̄_{iu}} and x_{m̄_{iv}}.
We have:
f(Xmis(i)|Θmis(i)) = ∫_{Xobs(i)} f(Xobs(i), Xmis(i)|Θ) dXobs(i)
f(Xobs(i)|Θobs(i)) = ∫_{Xmis(i)} f(Xobs(i), Xmis(i)|Θ) dXmis(i)
f(Xmis(i)|Θ_{Mi}) = f(Xmis(i)|Xobs(i), Θ) = f(Xobs(i), Xmis(i)|Θ) / f(Xobs(i)|Θobs(i))
Therefore, it is easy to form the parameter Θ_{Mi} = (μ_{Mi}, Σ_{Mi})^T from Θmis(i) = (μmis(i), Σmis(i))^T and Θobs(i) = (μobs(i), Σobs(i))^T as follows (Hardle & Simar, 2013, pp. 156-157):
Θ_{Mi} = u(Θ, Xobs(i)):
μ_{Mi} = μmis(i) + V_obs^mis(i) (Σobs(i))^{−1} (Xobs(i) − μobs(i))
Σ_{Mi} = Σmis(i) − V_obs^mis(i) (Σobs(i))^{−1} V_mis^obs(i)    (2.35)
Where Θmis(i) = (μmis(i), Σmis(i))^T and Θobs(i) = (μobs(i), Σobs(i))^T are specified by equation 2.32 and equation 2.34. Moreover, the k×l matrix V_obs^mis(i) (k = |Mi| and l = |M̄i|), which carries the covariances between Xmis and Xobs, is defined as follows:
V_obs^mis(i) = (σ_{m_{iu} m̄_{iv}}), the |Mi|×|M̄i| matrix whose (u, v) entry is σ_{m_{iu} m̄_{iv}}    (2.36)
Note, σ_{m_{iu} m̄_{iv}} is the covariance of x_{m_{iu}} and x_{m̄_{iv}}. The l×k matrix V_mis^obs(i), which carries the covariances between Xobs and Xmis, is defined as follows:
V_mis^obs(i) = (σ_{m̄_{iu} m_{iv}}), the |M̄i|×|Mi| matrix whose (u, v) entry is σ_{m̄_{iu} m_{iv}}; that is, V_mis^obs(i) = (V_obs^mis(i))^T    (2.37)
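A small sketch of the mapping u(Θ, Xobs(i)) in equation 2.35, assuming μ and Σ are numpy arrays and the missing and existent index sets are given as 0-based index arrays (the function name is illustrative, not from the article):

import numpy as np

def conditional_params(mu, Sigma, x_obs, M, Mbar):
    # Return mu_M, Sigma_M of f(X_mis | X_obs, Theta_M) per equation 2.35
    mu_mis, mu_obs = mu[M], mu[Mbar]
    S_mis = Sigma[np.ix_(M, M)]          # Sigma_mis(i), eq. 2.32
    S_obs = Sigma[np.ix_(Mbar, Mbar)]    # Sigma_obs(i), eq. 2.34
    V_mo = Sigma[np.ix_(M, Mbar)]        # V^mis_obs(i), eq. 2.36
    S_obs_inv = np.linalg.inv(S_obs)
    mu_M = mu_mis + V_mo @ S_obs_inv @ (x_obs - mu_obs)
    Sigma_M = S_mis - V_mo @ S_obs_inv @ V_mo.T    # V^obs_mis(i) = (V^mis_obs(i))^T, eq. 2.37
    return mu_M, Sigma_M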
Therefore, equation 2.35, which extracts Θ_{Mi} from Θ given Xobs(i), is an instance of equation 2.15. For convenience, let
μ_{Mi} = (μ_{Mi}(m_{i1}), μ_{Mi}(m_{i2}), …, μ_{Mi}(m_{i|Mi|}))^T
Σ_{Mi} = (Σ_{Mi}(m_{iu}, m_{iv})), the |Mi|×|Mi| matrix whose (u, v) entry is Σ_{Mi}(m_{iu}, m_{iv})    (2.38)
Equation 2.38 is the result of equation 2.35. Given Xmis(i) = (x_{m_{i1}}, x_{m_{i2}}, …, x_{m_{i|Mi|}})^T, μ_{Mi}(m_{ij}) is the estimated partial mean of x_{m_{ij}} and Σ_{Mi}(m_{iu}, m_{iv}) is the estimated partial covariance of x_{m_{iu}} and x_{m_{iv}} given the conditional PDF f(Xmis | Θ_{Mi}).
At the E-step of some t-th iteration, given the current parameter Θ^(t), the sufficient statistic of X is calculated according to equation 2.22. Let,
τ^(t) = (τ1^(t), τ2^(t))^T = (1/N) Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi}^(t))}


It is necessary to calculate this sufficient statistic with the normal PDF f(Xi|Θ), which means that we need to define what τ1^(t) and τ2^(t) are. The sufficient statistic of Xobs(i) is:
τ(Xobs(i)) = (Xobs(i), Xobs(i)(Xobs(i))^T)^T
The sufficient statistic of Xmis(i) is:
τ(Xmis(i)) = (Xmis(i), Xmis(i)(Xmis(i))^T)^T
We also have:
E(τ(Xmis)|Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) τ(Xmis) dXmis = (μ_{Mi}^(t), Σ_{Mi}^(t) + μ_{Mi}^(t)(μ_{Mi}^(t))^T)^T
Due to
E(Xmis(i)(Xmis(i))^T | Θ_{Mi}^(t)) = Σ_{Mi}^(t) + μ_{Mi}^(t)(μ_{Mi}^(t))^T
Where μ_{Mi}^(t) and Σ_{Mi}^(t) are μ_{Mi} and Σ_{Mi} at the current iteration, respectively. By referring to equation 2.38, we have
μ_{Mi}^(t) = (μ_{Mi}^(t)(m_{i1}), μ_{Mi}^(t)(m_{i2}), …, μ_{Mi}^(t)(m_{i|Mi|}))^T
And
Σ_{Mi}^(t) + μ_{Mi}^(t)(μ_{Mi}^(t))^T = (σ̃_{uv}^(t)(i)), the |Mi|×|Mi| matrix whose entries are
σ̃_{uv}^(t)(i) = Σ_{Mi}^(t)(m_{iu}, m_{iv}) + μ_{Mi}^(t)(m_{iu}) μ_{Mi}^(t)(m_{iv})
Therefore, τ1^(t) is a vector and τ2^(t) is a matrix, and the sufficient statistic of X at the E-step of some t-th iteration, given the current parameter Θ^(t), is defined as follows:
τ^(t) = (τ1^(t), τ2^(t))^T
τ1^(t) = (x̄1^(t), x̄2^(t), …, x̄n^(t))^T
τ2^(t) = (s_{uv}^(t)), the n×n matrix whose (u, v) entry is s_{uv}^(t)    (2.39)
Each x̄j^(t) is calculated as follows:
x̄j^(t) = (1/N) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; μ_{Mi}^(t)(j) if j ∈ Mi }    (2.40)
Please see equation 2.35 and equation 2.38 for μ_{Mi}^(t)(j). Each s_{uv}^(t) is calculated as follows:
s_{uv}^(t) = s_{vu}^(t) = (1/N) Σ_{i=1}^{N} {
  x_{iu} x_{iv}  if u ∉ Mi and v ∉ Mi;
  x_{iu} μ_{Mi}^(t)(m_{iv})  if u ∉ Mi and v ∈ Mi;
  μ_{Mi}^(t)(m_{iu}) x_{iv}  if u ∈ Mi and v ∉ Mi;
  Σ_{Mi}^(t)(m_{iu}, m_{iv}) + μ_{Mi}^(t)(m_{iu}) μ_{Mi}^(t)(m_{iv})  if u ∈ Mi and v ∈ Mi
}    (2.41)
Equation 2.39 is an instance of equation 2.11, which composes τ(X) from τ(Xobs) and τ(Xmis) when f(X|Θ) is normally distributed.
Following is the proof of equation 2.41.
If u ∉ Mi and v ∉ Mi then the partial statistic x_{iu} x_{iv} is kept intact because x_{iu} and x_{iv} are in Xobs and are constant with regard to f(Xmis | Θ_{Mi}^(t)). If u ∉ Mi and v ∈ Mi then the partial statistic x_{iu} x_{iv} is replaced by the expectation E(x_{iu} x_{iv} | Θ_{Mi}^(t)) as follows:
E(x_{iu} x_{iv} | Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) x_{iu} x_{iv} dXmis = x_{iu} ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) x_{iv} dXmis = x_{iu} μ_{Mi}^(t)(m_{iv})
If u ∈ Mi and v ∉ Mi then the partial statistic x_{iu} x_{iv} is replaced by the expectation E(x_{iu} x_{iv} | Θ_{Mi}^(t)) as follows:
E(x_{iu} x_{iv} | Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) x_{iu} x_{iv} dXmis = x_{iv} ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) x_{iu} dXmis = μ_{Mi}^(t)(m_{iu}) x_{iv}
If u ∈ Mi and v ∈ Mi then the partial statistic x_{iu} x_{iv} is replaced by the expectation E(x_{iu} x_{iv} | Θ_{Mi}^(t)) as follows:
E(x_{iu} x_{iv} | Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) x_{iu} x_{iv} dXmis = Σ_{Mi}^(t)(m_{iu}, m_{iv}) + μ_{Mi}^(t)(m_{iu}) μ_{Mi}^(t)(m_{iv}) ∎
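Equations 2.39-2.41 can be coded compactly because all four cases of equation 2.41 collapse into one outer product plus the conditional covariance; the following sketch assumes NaN-coded missing values and reuses the conditional_params sketch given earlier (both are assumptions of this illustration):

import numpy as np

def e_step_normal(X, mu, Sigma):
    # Return tau1 (eq. 2.40) and tau2 (eq. 2.41) for a NaN-coded sample X
    N, n = X.shape
    tau1 = np.zeros(n)
    tau2 = np.zeros((n, n))
    for i in range(N):
        row = X[i]
        M = np.where(np.isnan(row))[0]
        Mbar = np.where(~np.isnan(row))[0]
        x_hat = row.copy()
        S_hat = np.zeros((n, n))             # conditional covariance, nonzero only on M x M
        if M.size > 0:
            mu_M, Sigma_M = conditional_params(mu, Sigma, row[Mbar], M, Mbar)
            x_hat[M] = mu_M                  # missing values replaced by conditional means
            S_hat[np.ix_(M, M)] = Sigma_M
        tau1 += x_hat
        tau2 += np.outer(x_hat, x_hat) + S_hat   # covers all four cases of eq. 2.41
    return tau1 / N, tau2 / N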
At the M-step of some t-th iteration, given τ^(t) and Θ^(t), the next parameter Θ^(t+1) = (μ^(t+1), Σ^(t+1))^T is a solution of equation 2.23:
E(τ(X)|Θ) = τ^(t)
Due to
E(τ(X)|Θ) = (μ, Σ + μμ^T)^T
equation 2.23 becomes:
μ = τ1^(t)
Σ = τ2^(t) − μμ^T
Which means that
μj^(t+1) = x̄j^(t)
σ_{uv}^(t+1) = σ_{vu}^(t+1) = s_{uv}^(t) − x̄u^(t) x̄v^(t)    ∀ j, u, v    (2.42)
Please see equation 2.40 and equation 2.41 for x̄j^(t) and s_{uv}^(t).
Moreover, at the M-step of some t-th iteration, the next parameter Φ^(t+1) = p^(t+1) is a maximizer of Q2(Φ | Θ^(t)) given Θ^(t), according to equation 2.24:
Φ^(t+1) = argmax_Φ Q2(Φ|Θ^(t))
Because the PDF of Zi is:
f(Zi|Φ) = p^{c(Zi)} (1 − p)^{n−c(Zi)}
Q2(Φ|Θ^(t)) becomes:
Q2(Φ|Θ^(t)) = Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) log(f(Zi|Xobs(i), Xmis, Φ)) dXmis
= Σ_{i=1}^{N} ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) log(f(Zi|Φ)) dXmis
= Σ_{i=1}^{N} log(f(Zi|Φ)) ∫_{Xmis} f(Xmis|Θ_{Mi}^(t)) dXmis
= Σ_{i=1}^{N} log(f(Zi|Φ)) = Σ_{i=1}^{N} log(p^{c(Zi)} (1 − p)^{n−c(Zi)})
= Σ_{i=1}^{N} (c(Zi) log(p) + (n − c(Zi)) log(1 − p))
The next parameter Φ^(t+1) = p^(t+1) is the solution of the equation created by setting the first-order derivative of Q2(Φ|Θ^(t)) to zero, which means that:
∂Q2(Φ|Θ^(t))/∂p = Σ_{i=1}^{N} (c(Zi)/p − (n − c(Zi))/(1 − p)) = (1/(p(1 − p))) ((Σ_{i=1}^{N} c(Zi)) − npN) = 0
It is easy to deduce that the next parameter p^(t+1) is:
p^(t+1) = (Σ_{i=1}^{N} c(Zi)) / (nN)    (2.43)
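Because Z is ignored except for its rate in this MCAR model, equation 2.43 reduces to the overall missing rate of the sample; a one-line sketch (numpy assumed):

import numpy as np
Z = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0]])       # missingness matrix of example 3.1 in the next section
p_next = Z.sum() / Z.size          # (sum_i c(Z_i)) / (n N) = 0.5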
In general, given a sample 𝒳 = {X1, X2,…, XN} whose Xi (s) are iid, which is MCAR data, where f(X|Θ) is a multinormal PDF and the missingness variable Z follows a binomial distribution of n trials, GEM for handling missing data is summarized in table 2.2.
E-step:
Given the current parameter Θ^(t) = (μ^(t), Σ^(t))^T, the sufficient statistic τ^(t) is calculated according to equation 2.39, equation 2.40, and equation 2.41:
τ^(t) = (τ1^(t), τ2^(t))^T
τ1^(t) = (x̄1^(t), x̄2^(t), …, x̄n^(t))^T
τ2^(t) = (s_{uv}^(t)), the n×n matrix whose (u, v) entry is s_{uv}^(t)
x̄j^(t) = (1/N) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; μ_{Mi}^(t)(j) if j ∈ Mi }
s_{uv}^(t) = s_{vu}^(t) = (1/N) Σ_{i=1}^{N} {
  x_{iu} x_{iv}  if u ∉ Mi and v ∉ Mi;
  x_{iu} μ_{Mi}^(t)(m_{iv})  if u ∉ Mi and v ∈ Mi;
  μ_{Mi}^(t)(m_{iu}) x_{iv}  if u ∈ Mi and v ∉ Mi;
  Σ_{Mi}^(t)(m_{iu}, m_{iv}) + μ_{Mi}^(t)(m_{iu}) μ_{Mi}^(t)(m_{iv})  if u ∈ Mi and v ∈ Mi
}
Where μ_{Mi}^(t) and Σ_{Mi}^(t) are specified in equation 2.35 and equation 2.38.
M-step:
Given τ^(t) and Θ^(t), the next parameter Θ^(t+1) = (μ^(t+1), Σ^(t+1))^T is specified by equation 2.42:
μj^(t+1) = x̄j^(t)
σ_{uv}^(t+1) = σ_{vu}^(t+1) = s_{uv}^(t) − x̄u^(t) x̄v^(t)    ∀ j, u, v
Given Θ^(t), the next parameter Φ^(t+1) = p^(t+1) is specified by equation 2.43:
p^(t+1) = (Σ_{i=1}^{N} c(Zi)) / (nN)
Where c(Zi) is the number of z_{ij} (s) in Zi that equal 1.
Table 2.2. E-step and M-step of GEM algorithm for handling missing data given normal PDF
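A runnable sketch of table 2.2 (multinormal case, MCAR), reusing the e_step_normal sketch above; the function name, the initialization, and the stopping rule are illustrative assumptions rather than prescriptions of the article:

import numpy as np

def gem_normal_missing(X, max_iter=50, tol=1e-6):
    # Estimate (mu, Sigma, p) from a NaN-coded sample X
    N, n = X.shape
    mu, Sigma = np.zeros(n), np.eye(n)              # arbitrary initialization, as in example 3.1
    for _ in range(max_iter):
        tau1, tau2 = e_step_normal(X, mu, Sigma)    # E-step, eqs. 2.39-2.41
        mu_next = tau1                              # M-step, eq. 2.42
        Sigma_next = tau2 - np.outer(tau1, tau1)
        done = (np.max(np.abs(mu_next - mu)) < tol and
                np.max(np.abs(Sigma_next - Sigma)) < tol)
        mu, Sigma = mu_next, Sigma_next
        if done:
            break
    p = np.isnan(X).sum() / (n * N)                 # Phi = p, eq. 2.43 (closed form)
    return mu, Sigma, p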
As aforementioned, an interesting application of handling missing data is to fill in or predict missing values. For instance, suppose the estimate resulting from GEM is Θ* = (μ*, Σ*)^T; the missing part Xmis = (x_{m1}, x_{m2}, …, x_{m|M|})^T is replaced by μ*_M as follows:
x_{mj} = μ*_M(mj), ∀ mj ∈ M    (2.44)
Note, μ*_M, which is extracted from μ*, is the estimated mean of the conditional PDF of Xmis (given Xobs) according to equation 2.35. Moreover, μ*_M(mj) is the estimated partial mean of x_{mj} given the conditional PDF f(Xmis | Θ*_M); please see equation 2.38 for more details about μ*_M. As aforementioned, in practice we can stop GEM after its first iteration is done, which is often reasonable enough to handle missing data.
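A usage sketch of equation 2.44, assuming the estimates μ*, Σ* returned by the gem_normal_missing sketch and the conditional_params helper defined earlier:

import numpy as np

def impute(X, mu_star, Sigma_star):
    X_filled = X.copy()
    for i in range(X.shape[0]):
        row = X[i]
        M = np.where(np.isnan(row))[0]
        Mbar = np.where(~np.isnan(row))[0]
        if M.size > 0:
            mu_M, _ = conditional_params(mu_star, Sigma_star, row[Mbar], M, Mbar)
            X_filled[i, M] = mu_M       # x_mj := mu*_M(m_j) for every m_j in M
    return X_filled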
Now we survey another interesting case in which the sample 𝒳 = {X1, X2,…, XN}, whose Xi (s) are iid, is MCAR data and f(X|Θ) is a multinomial PDF of K trials. We ignore the missingness variable Z here because its treatment is included in the case of the multinormal PDF. Let X = {Xobs, Xmis} be the random variable representing every Xi. Suppose the dimension of X is n. According to equation 2.9, recall that
Xi = {Xobs(i), Xmis(i)} = (xi1, xi2, …, xin)^T
Xmis(i) = (x_{i m_{i1}}, x_{i m_{i2}}, …, x_{i m_{i|Mi|}})^T
Xobs(i) = (x_{i m̄_{i1}}, x_{i m̄_{i2}}, …, x_{i m̄_{i|M̄i|}})^T
Mi = {m_{i1}, m_{i2}, …, m_{i|Mi|}}
M̄i = {m̄_{i1}, m̄_{i2}, …, m̄_{i|M̄i|}}
The PDF of X is:
f(X|Θ) = f(Xobs, Xmis|Θ) = (K! / ∏_{j=1}^{n} (xj!)) ∏_{j=1}^{n} pj^{xj}    (2.45)
Where the xj are integers and Θ = (p1, p2,…, pn)^T is the set of probabilities such that
Σ_{j=1}^{n} pj = 1
Σ_{j=1}^{n} xj = K
xj ∈ {0, 1, …, K}
Note, xj is the number of trials generating nominal value j. Therefore,
f(Xi|Θ) = f(Xobs(i), Xmis(i)|Θ) = (K! / ∏_{j=1}^{n} (x_{ij}!)) ∏_{j=1}^{n} pj^{x_{ij}}
Where,
Σ_{j=1}^{n} x_{ij} = K
x_{ij} ∈ {0, 1, …, K}
The most important task here is to instantiate equation 2.11 and equation 2.15, in order to compose τ(X) from τ(Xobs), τ(Xmis) and to extract ΘM from Θ, when f(X|Θ) is a multinomial PDF.
Let Θmis be the parameter of the marginal PDF of Xmis; we have:
f(Xmis|Θmis) = (Kmis! / ∏_{mj ∈ M} (x_{mj}!)) ∏_{j=1}^{|M|} (p_{mj} / Pmis)^{x_{mj}}    (2.46)
Therefore,
f(Xmis(i)|Θmis(i)) = (Kmis(i)! / ∏_{mj ∈ Mi} (x_{i mj}!)) ∏_{j=1}^{|Mi|} (p_{m_{ij}} / Pmis(i))^{x_{i m_{ij}}}
Where,
Θmis(i) = (p_{m_{i1}} / Pmis(i), p_{m_{i2}} / Pmis(i), …, p_{m_{i|Mi|}} / Pmis(i))^T
Pmis(i) = Σ_{j=1}^{|Mi|} p_{m_{ij}}    (2.47)
Kmis(i) = Σ_{j=1}^{|Mi|} x_{i m_{ij}}
Obviously, Θmis(i) is extracted from Θ given the indicator Mi.
Let Θobs be the parameter of the marginal PDF of Xobs; we have:
f(Xobs|Θobs) = (Kobs! / ∏_{m̄j ∈ M̄} (x_{m̄j}!)) ∏_{j=1}^{|M̄|} (p_{m̄j} / Pobs)^{x_{m̄j}}    (2.48)
Therefore,
f(Xobs(i)|Θobs(i)) = (Kobs(i)! / ∏_{m̄j ∈ M̄i} (x_{i m̄j}!)) ∏_{j=1}^{|M̄i|} (p_{m̄_{ij}} / Pobs(i))^{x_{i m̄_{ij}}}
Where,
Θobs(i) = (p_{m̄_{i1}} / Pobs(i), p_{m̄_{i2}} / Pobs(i), …, p_{m̄_{i|M̄i|}} / Pobs(i))^T
Pobs(i) = Σ_{j=1}^{|M̄i|} p_{m̄_{ij}}    (2.49)
Kobs(i) = Σ_{j=1}^{|M̄i|} x_{i m̄_{ij}}
Obviously, Θobs(i) is extracted from Θ given the indicator M̄i or Mi.
The conditional PDF of Xmis given Xobs is calculated based on the PDF of X and the marginal PDF of Xobs as follows:
f(Xmis|ΘM) = f(Xmis|Xobs, Θ) = f(Xobs, Xmis|Θ) / f(Xobs|Θobs)
= [ (K! / ∏_{j=1}^{n} (xj!)) ∏_{j=1}^{n} pj^{xj} ] / [ (Kobs! / ∏_{j=1}^{|M̄|} (x_{m̄j}!)) ∏_{j=1}^{|M̄|} (p_{m̄j} / Pobs)^{x_{m̄j}} ]
= (K! ∏_{j=1}^{|M̄|} (x_{m̄j}!)) / (Kobs! ∏_{j=1}^{n} (xj!)) ∗ (∏_{j=1}^{n} pj^{xj}) / (∏_{j=1}^{|M̄|} (p_{m̄j} / Pobs)^{x_{m̄j}})
= (K! / (Kobs! ∏_{j=1}^{|M|} (x_{mj}!))) ∗ (∏_{j=1}^{|M|} p_{mj}^{x_{mj}}) ∗ (∏_{j=1}^{|M̄|} p_{m̄j}^{x_{m̄j}} (Pobs / p_{m̄j})^{x_{m̄j}})
= (K! / (Kobs! ∏_{j=1}^{|M|} (x_{mj}!))) ∗ (∏_{j=1}^{|M|} p_{mj}^{x_{mj}}) ∗ (∏_{j=1}^{|M̄|} (Pobs)^{x_{m̄j}})
= (K! / (Kobs! ∏_{j=1}^{|M|} (x_{mj}!))) ∗ (∏_{j=1}^{|M|} p_{mj}^{x_{mj}}) ∗ ((Pobs)^{Kobs})
This implies that the conditional PDF of Xmis given Xobs is a multinomial PDF of K trials:
f(Xmis|Xobs, ΘM) = f(Xmis|Xobs, Θ) = (K! / (Kobs! ∏_{j=1}^{|M|} (x_{mj}!))) ∗ (∏_{j=1}^{|M|} p_{mj}^{x_{mj}}) ∗ ((Pobs)^{Kobs})    (2.50)
𝐾𝑜𝑏𝑠 ! ∏𝑗=1 (𝑥𝑚𝑗 !) 𝑗=1
Therefore,
|𝑀𝑖 |
𝐾! 𝑥𝑖𝑚
𝑗 𝐾𝑜𝑏𝑠 (𝑖)
𝑓(𝑋𝑚𝑖𝑠 (𝑖)|𝑋𝑜𝑏𝑠 (𝑖), Θ𝑀𝑖 ) = 𝑓(𝑋𝑚𝑖𝑠 (𝑖)|𝑋𝑜𝑏𝑠 (𝑖), Θ) = |𝑀 |
∗ (∏ 𝑝𝑚𝑖𝑗 ) ∗ ((𝑃𝑜𝑏𝑠 (𝑖)) )
𝐾𝑜𝑏𝑠 (𝑖)! ∏𝑗=1𝑖 (𝑥𝑖𝑚𝑗 !) 𝑗=1
Where
̅ 𝑖|
|𝑀

𝑃𝑜𝑏𝑠 (𝑖) = ∑ 𝑝𝑚̅𝑖𝑗


𝑗=1
̅ 𝑖|
|𝑀

𝐾𝑜𝑏𝑠 (𝑖) = ∑ 𝑥𝑚̅𝑖𝑗


𝑗=1
Obviously, the parameter Θ𝑀𝑖 of the conditional PDF 𝑓(𝑋𝑚𝑖𝑠 (𝑖)|𝑋𝑜𝑏𝑠 (𝑖), Θ𝑀𝑖 ) is:
𝑝𝑚1
𝑝𝑚2

Θ = 𝑢(Θ, 𝑋 (𝑖)) = 𝑝 𝑚 𝑘 (2.51)
𝑀𝑖 𝑜𝑏𝑠
̅ 𝑖|
|𝑀

𝑃𝑜𝑏𝑠 (𝑖) = ∑ 𝑝𝑚̅𝑖𝑗


( 𝑗=1 )
Therefore, equation 2.51 to extract Θ𝑀𝑖 from Θ given Xobs(i) is an instance of equation 2.15. It is easy to check that
|𝑀𝑖 |

∑ 𝑥𝑚𝑖𝑗 + 𝐾𝑜𝑏𝑠 (𝑖) = 𝐾𝑚𝑖𝑠 (𝑖) + 𝐾𝑜𝑏𝑠 (𝑖) = 𝐾


𝑗=1
|𝑀𝑖 | |𝑀𝑖 | ̅ 𝑖|
|𝑀 𝑛

∑ 𝑝𝑚𝑖𝑗 + 𝑃𝑜𝑏𝑠 (𝑖) = ∑ 𝑝𝑚𝑖𝑗 + ∑ 𝑝𝑚̅𝑖𝑗 = ∑ 𝑝𝑗 = 1


𝑗=1 𝑗=1 𝑗=1 𝑗=1
At the E-step of some t-th iteration, given the current parameter Θ^(t) = (p1^(t), p2^(t),…, pn^(t))^T, the sufficient statistic of X is calculated according to equation 2.22. Let,
τ^(t) = (1/N) Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi}^(t))}
The sufficient statistic of Xobs(i) is:
τ(Xobs(i)) = (x_{i m̄_{i1}}, x_{i m̄_{i2}}, …, x_{i m̄_{i|M̄i|}})^T
The sufficient statistic of Xmis(i) with regard to f(Xmis(i)|Xobs(i), Θ_{Mi}) is:
τ(Xmis(i)) = (x_{i m_{i1}}, x_{i m_{i2}}, …, x_{i m_{i|Mi|}}, Σ_{j=1}^{|M̄i|} x_{i m̄_{ij}})^T
We also have:
E(τ(Xmis)|Θ_{Mi}^(t)) = ∫_{Xmis} f(Xmis|Xobs, Θ_{Mi}^(t)) τ(Xmis) dXmis = (K p_{m_{i1}}^(t), K p_{m_{i2}}^(t), …, K p_{m_{i|Mi|}}^(t), Σ_{j=1}^{|M̄i|} K p_{m̄_{ij}}^(t))^T
Therefore, the sufficient statistic of X at the E-step of some t-th iteration, given the current parameter Θ^(t) = (p1^(t), p2^(t),…, pn^(t))^T, is defined as follows:
τ^(t) = (x̄1^(t), x̄2^(t), …, x̄n^(t))^T
x̄j^(t) = (1/N) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }    ∀ j    (2.52)
Equation 2.52 is an instance of equation 2.11, which composes τ(X) from τ(Xobs) and τ(Xmis) when f(X|Θ) is a multinomial PDF.
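A sketch of equation 2.52, assuming counts are NaN-coded where missing and that every complete record sums to K (both are assumptions of this illustration):

import numpy as np

def e_step_multinomial(X, p, K):
    # tau^(t): observed counts kept, missing counts replaced by K * p_j^(t)
    N, n = X.shape
    tau = np.zeros(n)
    for i in range(N):
        row = X[i].copy()
        miss = np.isnan(row)
        row[miss] = K * p[miss]          # x_ij := K p_j^(t) for j in M_i
        tau += row
    return tau / N                       # (x̄_1^(t), ..., x̄_n^(t))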
At the M-step of some t-th iteration, we need to maximize Q1(Θ'|Θ) under the constraint
Σ_{j=1}^{n} pj = 1
According to equation 2.19, we have:
Q1(Θ'|Θ) = Σ_{i=1}^{N} E(log(b(Xobs(i), Xmis))|Θ_{Mi}) + (Θ')^T Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})} − N log(a(Θ'))
Where the quantities b(Xobs(i), Xmis) and a(Θ') belong to the PDF f(X|Θ) of X. Because of the constraint Σ_{j=1}^{n} pj = 1, we use the Lagrange duality method to maximize Q1(Θ'|Θ). The Lagrange function la(Θ', λ | Θ) is the sum of Q1(Θ'|Θ) and the constraint term λ(1 − Σ_{j=1}^{n} pj'), as follows:
la(Θ', λ|Θ) = Q1(Θ'|Θ) + λ(1 − Σ_{j=1}^{n} pj')
= Σ_{i=1}^{N} E(log(b(Xobs(i), Xmis))|Θ_{Mi}) + (Θ')^T Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})} − N log(a(Θ')) + λ(1 − Σ_{j=1}^{n} pj')
Where Θ' = (p1', p2',…, pn')^T. Note, λ ≥ 0 is called the Lagrange multiplier. Of course, la(Θ', λ | Θ) is a function of Θ' and λ. The next parameter Θ^(t+1) that maximizes Q1(Θ'|Θ) is the solution of the equation formed by setting the first-order derivatives of the Lagrange function with regard to Θ' and λ to zero.
The first-order partial derivative of la(Θ', λ | Θ) with regard to Θ' is:
∂la(Θ', λ|Θ)/∂Θ' = Σ_{i=1}^{N} (E(τ(Xobs(i), Xmis)|Θ_{Mi}))^T − N log'(a(Θ')) − (λ, λ, …, λ)
= Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})}^T − N log'(a(Θ')) − (λ, λ, …, λ)
By referring to table 1.2, we have:


log'(a(Θ')) = (E(τ(X)|Θ'))^T = ∫_X f(X|Θ') (τ(X))^T dX
Thus,
∂la(Θ', λ|Θ)/∂Θ' = Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi})}^T − N (E(τ(X)|Θ'))^T − (λ, λ, …, λ)
The first-order partial derivative of la(Θ', λ | Θ) with regard to λ is:
∂la(Θ', λ|Θ)/∂λ = 1 − Σ_{j=1}^{n} pj'
Therefore, at the M-step of some t-th iteration, given the current parameter Θ^(t) = (p1^(t), p2^(t),…, pn^(t))^T, the next parameter Θ^(t+1) is the solution of the following system:
Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi}^(t))}^T − N (E(τ(X)|Θ))^T − (λ, λ, …, λ) = 0^T
1 − Σ_{j=1}^{n} pj = 0
This implies:
E(τ(X)|Θ) = τ^(t) − (λ/N, λ/N, …, λ/N)^T
Σ_{j=1}^{n} pj = 1
Where,
τ^(t) = (1/N) Σ_{i=1}^{N} {τ(Xobs(i)), E(τ(Xmis)|Θ_{Mi}^(t))}
Due to
E(τ(X)|Θ) = ∫_X τ(X) f(X|Θ) dX = (Kp1, Kp2, …, Kpn)^T
τ^(t) = (x̄1^(t), x̄2^(t), …, x̄n^(t))^T
x̄j^(t) = (1/N) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }    ∀ j
we obtain n equations Kpj = −λ/N + x̄j^(t) and one constraint Σ_{j=1}^{n} pj = 1. Therefore, we have:
pj = −λ/(KN) + (1/(KN)) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }    ∀ j
Summing the n equations above, we have:
1 = Σ_{j=1}^{n} pj = −nλ/(KN) + (1/(KN)) Σ_{i=1}^{N} (Σ_{j=1}^{n} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }) = −nλ/(KN) + (1/(KN)) Σ_{i=1}^{N} (Σ_{j=1}^{|M̄i|} x_{i m̄_{ij}} + Σ_{j=1}^{|Mi|} K p_{m_{ij}}^(t))
Suppose every missing value x_{i m_{ij}} is estimated by K p_{m_{ij}}^(t) such that:
Σ_{j=1}^{|Mi|} K p_{m_{ij}}^(t) = Σ_{j=1}^{|Mi|} x_{i m_{ij}}
We obtain:
1 = −nλ/(KN) + (1/(KN)) Σ_{i=1}^{N} (Σ_{j=1}^{|M̄i|} x_{i m̄_{ij}} + Σ_{j=1}^{|Mi|} x_{i m_{ij}}) = −nλ/(KN) + (1/(KN)) Σ_{i=1}^{N} K = −nλ/(KN) + 1
This implies
λ = 0
Such that
pj = (1/(KN)) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }    ∀ j
Therefore, at the M-step of some t-th iteration, given the current parameter Θ^(t) = (p1^(t), p2^(t),…, pn^(t))^T, the next parameter Θ^(t+1) is specified by the following equation:
pj^(t+1) = (1/(KN)) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }    ∀ j    (2.53)
In general, given a sample 𝒳 = {X1, X2,…, XN} whose Xi (s) are iid, which is MCAR data, where f(X|Θ) is a multinomial PDF of K trials, GEM for handling missing data is summarized in table 2.3.
M-step:
Given τ^(t) and Θ^(t) = (p1^(t), p2^(t),…, pn^(t))^T, the next parameter Θ^(t+1) is specified by equation 2.53:
pj^(t+1) = (1/(KN)) Σ_{i=1}^{N} { x_{ij} if j ∉ Mi; K pj^(t) if j ∈ Mi }    ∀ j
Table 2.3. E-step and M-step of GEM algorithm for handling missing data given multinomial PDF
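A runnable sketch of table 2.3, reusing the e_step_multinomial sketch above; each iteration applies equation 2.53, which is simply τ^(t)/K, and the initialization is an illustrative assumption:

import numpy as np

def gem_multinomial_missing(X, K, max_iter=50, tol=1e-9):
    n = X.shape[1]
    p = np.full(n, 1.0 / n)                     # arbitrary initial probabilities
    for _ in range(max_iter):
        tau = e_step_multinomial(X, p, K)       # E-step, eq. 2.52
        p_next = tau / K                        # M-step, eq. 2.53
        if np.max(np.abs(p_next - p)) < tol:
            return p_next
        p = p_next
    return p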
In table 2.3, the E-step is implied in how the M-step is performed. As aforementioned, in practice we can stop GEM after its first iteration is done, which is often reasonable enough to handle missing data. The next section includes two examples of handling missing data with the multinormal distribution and the multinomial distribution.

III. NUMERICAL EXAMPLES


It is necessary to have an example illustrating how to handle missing data with the multinormal PDF.
Example 3.1. Given a sample of size two, 𝒳 = {X1, X2} in which X1 = (x11=1, x12=?, x13=3, x14=?)^T and X2 = (x21=?, x22=2, x23=?, x24=4)^T are iid. Therefore, we also have Z1 = (z11=0, z12=1, z13=0, z14=1)^T and Z2 = (z21=1, z22=0, z23=1, z24=0)^T. All Zi (s) are iid too.
     x1  x2  x3  x4         z1  z2  z3  z4
X1    1   ?   3   ?     Z1   0   1   0   1
X2    ?   2   ?   4     Z2   1   0   1   0
Of course, we have Xobs(1) = (x11=1, x13=3)^T, Xmis(1) = (x12=?, x14=?)^T, Xobs(2) = (x22=2, x24=4)^T and Xmis(2) = (x21=?, x23=?)^T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, and M̄2 = {m̄21=2, m̄22=4}. Let X and Z be the random variables representing every Xi and every Zi, respectively. Suppose f(X|Θ) is a multinormal PDF and the missingness variable Z follows a binomial distribution of 4 trials according to equation 2.27 and equation 2.28. The dimension of X is 4. We will estimate Θ = (μ, Σ)^T and Φ = p based on 𝒳.
μ = (μ1, μ2, μ3, μ4)^T
Σ = the 4×4 symmetric covariance matrix with entries σ_{uv}, u, v = 1, …, 4
The parameters μ and Σ are initialized arbitrarily as the zero vector and the identity matrix, whereas p is initialized to 0.5, as follows:
μ^(1) = (μ1^(1)=0, μ2^(1)=0, μ3^(1)=0, μ4^(1)=0)^T
Σ^(1) = I, the 4×4 identity matrix (σ_{uu}^(1) = 1 and σ_{uv}^(1) = 0 for u ≠ v)
p^(1) = 0.5
At the 1st iteration, E-step, we have:
Xobs(1) = (x1=1, x3=3)^T
μmis(1) = (μ2^(1)=0, μ4^(1)=0)^T
Σmis(1) = [[σ22^(1)=1, σ24^(1)=0], [σ42^(1)=0, σ44^(1)=1]]
μobs(1) = (μ1^(1)=0, μ3^(1)=0)^T
Σobs(1) = [[σ11^(1)=1, σ13^(1)=0], [σ31^(1)=0, σ33^(1)=1]]
V_obs^mis(1) = [[σ21^(1)=0, σ23^(1)=0], [σ41^(1)=0, σ43^(1)=0]]
V_mis^obs(1) = [[σ12^(1)=0, σ14^(1)=0], [σ32^(1)=0, σ34^(1)=0]]
μ_{M1}^(1) = μmis(1) + V_obs^mis(1) (Σobs(1))^{−1} (Xobs(1) − μobs(1)) = (μ_{M1}^(1)(2)=0, μ_{M1}^(1)(4)=0)^T
Σ_{M1}^(1) = Σmis(1) − V_obs^mis(1) (Σobs(1))^{−1} V_mis^obs(1) = [[Σ_{M1}^(1)(2,2)=1, Σ_{M1}^(1)(2,4)=0], [Σ_{M1}^(1)(4,2)=0, Σ_{M1}^(1)(4,4)=1]]
Xobs(2) = (x2=2, x4=4)^T
μmis(2) = (μ1^(1)=0, μ3^(1)=0)^T
Σmis(2) = [[σ11^(1)=1, σ13^(1)=0], [σ31^(1)=0, σ33^(1)=1]]
μobs(2) = (μ2^(1)=0, μ4^(1)=0)^T
Σobs(2) = [[σ22^(1)=1, σ24^(1)=0], [σ42^(1)=0, σ44^(1)=1]]
V_obs^mis(2) = [[σ12^(1)=0, σ14^(1)=0], [σ32^(1)=0, σ34^(1)=0]]
V_mis^obs(2) = [[σ21^(1)=0, σ23^(1)=0], [σ41^(1)=0, σ43^(1)=0]]
μ_{M2}^(1) = μmis(2) + V_obs^mis(2) (Σobs(2))^{−1} (Xobs(2) − μobs(2)) = (μ_{M2}^(1)(1)=0, μ_{M2}^(1)(3)=0)^T
Σ_{M2}^(1) = Σmis(2) − V_obs^mis(2) (Σobs(2))^{−1} V_mis^obs(2) = [[Σ_{M2}^(1)(1,1)=1, Σ_{M2}^(1)(1,3)=0], [Σ_{M2}^(1)(3,1)=0, Σ_{M2}^(1)(3,3)=1]]

x̄1^(1) = (1/2)(x11 + μ_{M2}^(1)(1)) = 0.5
x̄2^(1) = (1/2)(μ_{M1}^(1)(2) + x22) = 1
x̄3^(1) = (1/2)(x13 + μ_{M2}^(1)(3)) = 1.5
x̄4^(1) = (1/2)(μ_{M1}^(1)(4) + x24) = 2
s11^(1) = (1/2)((x11)^2 + (Σ_{M2}^(1)(1,1) + (μ_{M2}^(1)(1))^2)) = 1
s12^(1) = s21^(1) = (1/2)(x11 μ_{M1}^(1)(2) + μ_{M2}^(1)(1) x22) = 0
s13^(1) = s31^(1) = (1/2)(x11 x13 + (Σ_{M2}^(1)(1,3) + μ_{M2}^(1)(1) μ_{M2}^(1)(3))) = 1.5
s14^(1) = s41^(1) = (1/2)(x11 μ_{M1}^(1)(4) + μ_{M2}^(1)(1) x24) = 0
s22^(1) = (1/2)((Σ_{M1}^(1)(2,2) + (μ_{M1}^(1)(2))^2) + (x22)^2) = 2.5
s23^(1) = s32^(1) = (1/2)(μ_{M1}^(1)(2) x13 + x22 μ_{M2}^(1)(3)) = 0
s24^(1) = s42^(1) = (1/2)((Σ_{M1}^(1)(2,4) + μ_{M1}^(1)(2) μ_{M1}^(1)(4)) + x22 x24) = 4
s33^(1) = (1/2)((x13)^2 + (Σ_{M2}^(1)(3,3) + (μ_{M2}^(1)(3))^2)) = 5
s34^(1) = s43^(1) = (1/2)(x13 μ_{M1}^(1)(4) + μ_{M2}^(1)(3) x24) = 0
s44^(1) = (1/2)((Σ_{M1}^(1)(4,4) + (μ_{M1}^(1)(4))^2) + (x24)^2) = 8.5
At 1st iteration, M-step, we have:
μ1^(2) = x̄1^(1) = 0.5
μ2^(2) = x̄2^(1) = 1
μ3^(2) = x̄3^(1) = 1.5
μ4^(2) = x̄4^(1) = 2
σ11^(2) = s11^(1) − (x̄1^(1))^2 = 0.75
σ12^(2) = σ21^(2) = s12^(1) − x̄1^(1) x̄2^(1) = −0.5
σ13^(2) = σ31^(2) = s13^(1) − x̄1^(1) x̄3^(1) = 0.75
σ14^(2) = σ41^(2) = s14^(1) − x̄1^(1) x̄4^(1) = −1
σ22^(2) = s22^(1) − (x̄2^(1))^2 = 1.5
σ23^(2) = σ32^(2) = s23^(1) − x̄2^(1) x̄3^(1) = −1.5
σ24^(2) = σ42^(2) = s24^(1) − x̄2^(1) x̄4^(1) = 2
σ33^(2) = s33^(1) − (x̄3^(1))^2 = 2.75
σ34^(2) = σ43^(2) = s34^(1) − x̄3^(1) x̄4^(1) = −3
σ44^(2) = s44^(1) − (x̄4^(1))^2 = 4.5

p^(2) = (c(Z1) + c(Z2)) / (4∗2) = (2 + 2) / (4∗2) = 0.5
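The M-step is thus a direct re-estimation from the expected statistics. A minimal sketch under the same assumptions (Z holds the missingness indicators with c(Zi) ones per record, and trials = 4 is the number of Bernoulli trials of the missingness variable):

import numpy as np

def m_step(x_bar, S, Z, trials=4):
    mu_new = x_bar                             # mu_j = x_bar_j
    Sigma_new = S - np.outer(x_bar, x_bar)     # sigma_jk = s_jk - x_bar_j * x_bar_k
    p_new = Z.sum() / (trials * Z.shape[0])    # (c(Z1) + c(Z2)) / (4 * 2) = 0.5 here
    return mu_new, Sigma_new, p_new

Applied to the first-iteration statistics it returns μ^(2), Σ^(2), and p^(2) exactly as computed above.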
At 2nd iteration, E-step, we have:
Xobs(1) = (x1 = 1, x3 = 3)T
μmis(1) = (μ2^(2) = 1, μ4^(2) = 2)T
Σmis(1) = (σ22^(2) = 1.5  σ24^(2) = 2
           σ42^(2) = 2    σ44^(2) = 4.5)
μobs(1) = (μ1^(2) = 0.5, μ3^(2) = 1.5)T
Σobs(1) = (σ11^(2) = 0.75  σ13^(2) = 0.75
           σ31^(2) = 0.75  σ33^(2) = 2.75)
Vobs^mis(1) = (σ21^(2) = −0.5  σ23^(2) = −1.5
               σ41^(2) = −1    σ43^(2) = −3)
Vmis^obs(1) = (σ12^(2) = −0.5  σ14^(2) = −1
               σ32^(2) = −1.5  σ34^(2) = −3)
μM1^(2) = μmis(1) + Vobs^mis(1) (Σobs(1))^(-1) (Xobs(1) − μobs(1)) = (μM1^(2)(2) ≅ 0.17, μM1^(2)(4) ≅ 0.33)T
ΣM1^(2) = Σmis(1) − Vobs^mis(1) (Σobs(1))^(-1) Vmis^obs(1) = (ΣM1^(2)(2,2) ≅ 0.67  ΣM1^(2)(2,4) ≅ 0.33
                                                              ΣM1^(2)(4,2) ≅ 0.33  ΣM1^(2)(4,4) ≅ 1.17)
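As a quick usage check of the earlier conditional_moments sketch, plugging μ^(2) and Σ^(2) in for record X1 reproduces these values up to rounding (a hypothetical snippet reusing that helper):

import numpy as np

mu2 = np.array([0.5, 1.0, 1.5, 2.0])                    # mu^(2)
Sigma2 = np.array([[ 0.75, -0.5,  0.75, -1.0],
                   [-0.5,   1.5, -1.5,   2.0],
                   [ 0.75, -1.5,  2.75, -3.0],
                   [-1.0,   2.0, -3.0,   4.5]])         # Sigma^(2)
x1 = np.array([1.0, np.nan, 3.0, np.nan])               # X1 = (1, ?, 3, ?)
mu_M1, Sigma_M1 = conditional_moments(x1, mu2, Sigma2)  # about (0.17, 0.33) and ((0.67, 0.33), (0.33, 1.17))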

Xobs(2) = (x2 = 2, x4 = 4)T
μmis(2) = (μ1^(2) = 0.5, μ3^(2) = 1.5)T
Σmis(2) = (σ11^(2) = 0.75  σ13^(2) = 0.75
           σ31^(2) = 0.75  σ33^(2) = 2.75)
μobs(2) = (μ2^(2) = 1, μ4^(2) = 2)T
Σobs(2) = (σ22^(2) = 1.5  σ24^(2) = 2
           σ42^(2) = 2    σ44^(2) = 4.5)
Vobs^mis(2) = (σ12^(2) = −0.5  σ14^(2) = −1
               σ32^(2) = −1.5  σ34^(2) = −3)


Vmis^obs(2) = (σ21^(2) = −0.5  σ23^(2) = −1.5
               σ41^(2) = −1    σ43^(2) = −3)
μM2^(2) = μmis(2) + Vobs^mis(2) (Σobs(2))^(-1) (Xobs(2) − μobs(2)) = (μM2^(2)(1) ≅ 0.05, μM2^(2)(3) ≅ 0.14)T
ΣM2^(2) = Σmis(2) − Vobs^mis(2) (Σobs(2))^(-1) Vmis^obs(2) = (ΣM2^(2)(1,1) ≅ 0.52  ΣM2^(2)(1,3) ≅ 0.07
                                                              ΣM2^(2)(3,1) ≅ 0.07  ΣM2^(2)(3,3) ≅ 0.7)

x̄1^(2) = (1/2)(x11 + μM2^(2)(1)) ≅ 0.52
x̄2^(2) = (1/2)(μM1^(2)(2) + x22) ≅ 1.1
x̄3^(2) = (1/2)(x13 + μM2^(2)(3)) ≅ 1.57
x̄4^(2) = (1/2)(μM1^(2)(4) + x24) ≅ 2.17
s11^(2) = (1/2)((x11)^2 + (ΣM2^(2)(1,1) + (μM2^(2)(1))^2)) ≅ 0.76
s12^(2) = s21^(2) = (1/2)(x11 μM1^(2)(2) + μM2^(2)(1) x22) ≅ 0.13
s13^(2) = s31^(2) = (1/2)(x11 x13 + (ΣM2^(2)(1,3) + μM2^(2)(1) μM2^(2)(3))) ≅ 1.54
s14^(2) = s41^(2) = (1/2)(x11 μM1^(2)(4) + μM2^(2)(1) x24) ≅ 0.17
s22^(2) = (1/2)((ΣM1^(2)(2,2) + (μM1^(2)(2))^2) + (x22)^2) ≅ 2.35
s23^(2) = s32^(2) = (1/2)(μM1^(2)(2) x13 + x22 μM2^(2)(3)) ≅ 0.39
s24^(2) = s42^(2) = (1/2)((ΣM1^(2)(2,4) + μM1^(2)(2) μM1^(2)(4)) + x22 x24) ≅ 4.19
s33^(2) = (1/2)((x13)^2 + (ΣM2^(2)(3,3) + (μM2^(2)(3))^2)) ≅ 4.86
s34^(2) = s43^(2) = (1/2)(x13 μM1^(2)(4) + μM2^(2)(3) x24) ≅ 0.77
s44^(2) = (1/2)((ΣM1^(2)(4,4) + (μM1^(2)(4))^2) + (x24)^2) ≅ 8.64
At 2nd iteration, M-step, we have:
μ1^(3) = x̄1^(2) ≅ 0.52
μ2^(3) = x̄2^(2) ≅ 1.1
μ3^(3) = x̄3^(2) ≅ 1.57
μ4^(3) = x̄4^(2) ≅ 2.17
σ11^(3) = s11^(2) − (x̄1^(2))^2 ≅ 0.49
σ12^(3) = σ21^(3) = s12^(2) − x̄1^(2) x̄2^(2) ≅ −0.44
σ13^(3) = σ31^(3) = s13^(2) − x̄1^(2) x̄3^(2) ≅ 0.72
σ14^(3) = σ41^(3) = s14^(2) − x̄1^(2) x̄4^(2) ≅ −0.96
σ22^(3) = s22^(2) − (x̄2^(2))^2 ≅ 1.17
σ23^(3) = σ32^(3) = s23^(2) − x̄2^(2) x̄3^(2) ≅ −1.31
σ24^(3) = σ42^(3) = s24^(2) − x̄2^(2) x̄4^(2) ≅ 1.85
σ33^(3) = s33^(2) − (x̄3^(2))^2 ≅ 2.4
σ34^(3) = σ43^(3) = s34^(2) − x̄3^(2) x̄4^(2) ≅ −2.63
σ44^(3) = s44^(2) − (x̄4^(2))^2 ≅ 3.94


p^(3) = (c(Z1) + c(Z2)) / (4∗2) = (2 + 2) / (4∗2) = 0.5
Because the sample is too small for GEM to converge to an exact maximizer within a small number of iterations, we can stop GEM at the second iteration, taking Θ^(3) = Θ* = (μ*, Σ*)T and Φ^(3) = Φ* = p*, since the difference between Θ^(2) and Θ^(3) is already insignificant.
μ* = (μ1* = 0.52, μ2* = 1.1, μ3* = 1.57, μ4* = 2.17)T
Σ* = (σ11* = 0.49   σ12* = −0.44  σ13* = 0.72   σ14* = −0.96
      σ21* = −0.44  σ22* = 1.17   σ23* = −1.31  σ24* = 1.85
      σ31* = 0.72   σ32* = −1.31  σ33* = 2.4    σ34* = −2.63
      σ41* = −0.96  σ42* = 1.85   σ43* = −2.63  σ44* = 3.94)
p* = 0.5
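Putting the earlier sketches together, the two iterations worked out above could be driven as follows (a sketch only; it reuses the hypothetical helpers conditional_moments, expected_statistics, and m_step as well as the arrays X, Z, mu, Sigma from the first sketch):

for t in range(2):                                   # the two GEM iterations carried out above
    x_bar, S = expected_statistics(X, mu, Sigma)     # E-step: expected sufficient statistics
    mu, Sigma, p = m_step(x_bar, S, Z)               # M-step: re-estimate mu, Sigma, p
# mu, Sigma, p now hold the parameter values after two GEM iterations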
As aforementioned, because Xmis is a part of X and f(Xmis | Xobs, ΘM) is derived directly from f(X|Θ), in practice we can stop GEM after its first iteration is done, which is reasonable enough to handle missing data.
As aforementioned, an interesting application of handling missing data is to fill in or predict missing values. For instance, the missing part Xmis(1) of X1 = (x11=1, x12=?, x13=3, x14=?)T is fulfilled by μM1* according to equation 2.44 as follows:
x12 = μ2* = 1.1
x14 = μ4* = 2.17
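A minimal imputation sketch (Python/NumPy, same encoding as before) that replaces every missing coordinate by the corresponding estimated mean, mirroring the values filled in above:

import numpy as np

mu_star = np.array([0.52, 1.1, 1.57, 2.17])    # mu* estimated by GEM
X = np.array([[1.0, np.nan, 3.0, np.nan],
              [np.nan, 2.0, np.nan, 4.0]])
X_filled = np.where(np.isnan(X), mu_star, X)   # x12 <- 1.1, x14 <- 2.17, x21 <- 0.52, x23 <- 1.57
print(X_filled)

One could equally plug the converged Θ* into the conditional moments of Xmis(i) given Xobs(i) and fill with those conditional means instead; the sketch only mirrors the numbers above.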
It is also necessary to give an example illustrating how to handle missing data with multinomial PDF.
Example 3.2. Given sample of size two, 𝒳 = {X1, X2 } in which X1 = (x11=1, x12=?, x13=3, x14=?)T and X2 = (x21=?, x22=2, x23=?,
x24=4)T are iid.
x1 x2 x3 x4
X1 1 ? 3 ?
X2 ? 2 ? 4
Of course, we have Xobs(1) = (x11=1, x13=3)T, Xmis(1) = (x12=?, x14=?)T, Xobs(2) = (x22=2, x24=4)T, and Xmis(2) = (x21=?, x23=?)T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, and M̄2 = {m̄21=2, m̄22=4}. Let X be the random variable representing every Xi. Suppose f(X|Θ) is multinomial PDF of 10 trials. We will estimate Θ = (p1, p2, p3, p4)T. The parameters p1, p2, p3, and p4 are initialized arbitrarily as 0.25 as follows:
Θ^(1) = (p1^(1) = 0.25, p2^(1) = 0.25, p3^(1) = 0.25, p4^(1) = 0.25)T
At 1st iteration, M-step, we have:
p1^(2) = (1 + 10∗0.25) / (10∗2) = 0.175
p2^(2) = (10∗0.25 + 2) / (10∗2) = 0.225
p3^(2) = (3 + 10∗0.25) / (10∗2) = 0.275
p4^(2) = (10∗0.25 + 4) / (10∗2) = 0.325
We stop GEM after the first iteration is done, which results in the estimate Θ^(2) = Θ* = (p1*, p2*, p3*, p4*)T as follows:
p1* = 0.175
p2* = 0.225
p3* = 0.275
p4* = 0.325
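A minimal sketch of this multinomial update (Python/NumPy; replacing each missing count by its expectation 10∗pj in the E-step is our reading of the formulas above):

import numpy as np

N, n = 10, 2                               # trials per record, number of records
X = np.array([[1.0, np.nan, 3.0, np.nan],  # X1 = (1, ?, 3, ?)
              [np.nan, 2.0, np.nan, 4.0]]) # X2 = (?, 2, ?, 4)
p = np.full(4, 0.25)                       # p^(1)
X_hat = np.where(np.isnan(X), N * p, X)    # expected counts for missing cells: 10 * 0.25 = 2.5
p_new = X_hat.sum(axis=0) / (N * n)        # p^(2) = (0.175, 0.225, 0.275, 0.325)
print(p_new)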

IV. CONCLUSIONS
In general, GEM is a powerful tool to handle missing data. The task is not so difficult except for one key point: how to extract the parameter ΘM of the conditional PDF f(Xmis | Xobs, ΘM) from the whole parameter Θ of the PDF f(X|Θ), with the note that only f(X|Θ) is defined first and f(Xmis | Xobs, ΘM) is then derived from f(X|Θ). Therefore, equation 2.15 is the cornerstone of this method. Note that equations 2.35 and 2.51 are instances of equation 2.15 when f(X|Θ) is multinormal PDF and multinomial PDF, respectively.

