Abstract
The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating the parameters of statistical models in the case of incomplete or hidden data. EM assumes that there is a relationship between hidden data and observed data, which can be a joint distribution or a mapping function, and this in turn implies a relationship between parameter estimation and data imputation. If missing data, which contains missing values, is considered as hidden data, it is very natural to handle missing data with the EM algorithm. Handling missing data is not a new research topic, but this report focuses on the theoretical basis, with detailed mathematical proofs, for filling in missing values with EM. Besides, the multinormal distribution and the multinomial distribution are the two sample statistical models considered for holding missing values.
Keywords- Expectation Maximization (EM), Missing Data, Multinormal Distribution, Multinomial Distribution
I. INTRODUCTION
The literature on the expectation maximization (EM) algorithm in this report is mainly extracted from the preeminent article "Maximum Likelihood from Incomplete Data via the EM Algorithm" by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin (Dempster, Laird, & Rubin, 1977). For convenience, let DLR refer to these three authors. The preprint "Tutorial on EM algorithm" (Nguyen, 2020) by Loc Nguyen is also referenced in this report.
Now we skim through an introduction to the EM algorithm. Suppose there are two spaces X and Y, in which X is the hidden space whereas Y is the observed space. We do not know X, but there is a mapping from X to Y so that we can survey X by observing Y. The mapping is a many-one function φ: X → Y, and we denote φ–1(Y) = {X ∈ X: φ(X) = Y} as all X ∈ X such that φ(X) = Y. We also denote X(Y) = φ–1(Y). Let f(X | Θ) be the probability density function (PDF) of random variable X ∈ X and let g(Y | Θ) be the PDF of random variable Y ∈ Y. Note, Y is also called the observation. Equation 1.1 specifies g(Y | Θ) as the integral of f(X | Θ) over φ–1(Y).
g(Y|Θ) = ∫_{φ⁻¹(Y)} f(X|Θ) dX    (1.1)
Where Θ is the probabilistic parameter represented as a column vector, Θ = (θ1, θ2,…, θr)T, in which each θi is a particular parameter. If X and Y are discrete, equation 1.1 is re-written as follows:
g(Y|Θ) = Σ_{X∈φ⁻¹(Y)} f(X|Θ)
According to the viewpoint of Bayesian statistics, Θ is also a random variable. As a convention, let Ω be the domain of Θ such that Θ ∈ Ω, where the dimension of Ω is r. For example, the normal distribution has two particular parameters, the mean μ and the variance σ2, and so we have Θ = (μ, σ2)T. Note that Θ can degrade into a scalar as Θ = θ. The conditional PDF of X given Y, denoted k(X | Y, Θ), is specified by equation 1.2.
k(X|Y, Θ) = f(X|Θ) / g(Y|Θ)    (1.2)
According to DLR (Dempster, Laird, & Rubin, 1977, p. 1), X is called complete data, and the term "incomplete data" implies the existence of X and Y where X is not observed directly and X is only known through the many-one mapping φ: X → Y. In general, we only know Y, f(X | Θ), and k(X | Y, Θ), and so our purpose is to estimate Θ based on such Y, f(X | Θ), and k(X | Y, Θ). Like the MLE approach, the EM algorithm also maximizes the likelihood function to estimate Θ, but the likelihood function in EM concerns Y, and there are also some different aspects of EM which will be described later. Pioneers of the EM algorithm first assumed that f(X | Θ) belongs to the exponential family, noting that many popular distributions such as the normal, multinomial, and Poisson distributions belong to the exponential family. Although DLR (Dempster, Laird, & Rubin, 1977) proposed a generality of the EM algorithm in which f(X | Θ) distributes arbitrarily, we should still pay some attention to the exponential family. The exponential family (Wikipedia, Exponential family, 2016) refers to a set of probabilistic distributions whose PDFs have the same exponential form according to equation 1.3 (Dempster, Laird, & Rubin, 1977, p. 3):
𝑓(𝑋|Θ) = 𝑏(𝑋) exp(Θ𝑇 𝜏(𝑋))⁄𝑎(Θ) (1.3)
Where b(X) is a function of X called the base measure, and τ(X) is a vector function of X called the sufficient statistic. For example, the sufficient statistic of the normal distribution is τ(X) = (X, XXT)T. Equation 1.3 expresses the canonical form of the exponential family. Recall that Ω is the domain of Θ such that Θ ∈ Ω. Suppose that Ω is a convex set. If Θ is restricted only to Ω, then f(X | Θ) specifies a regular exponential family. If Θ lies in a curved sub-manifold Ω0 of Ω, then f(X | Θ) specifies a curved exponential family. The term a(Θ) is the partition function over variable X, which is used for normalization:
a(Θ) = ∫_X b(X) exp(ΘT τ(X)) dX
As usual, a PDF is known in a popular form, but its exponential family form (the canonical form of the exponential family) specified by equation 1.3 looks unlike the popular form although they are the same. Therefore, the parameter in popular form differs from the parameter in exponential family form. For example, the multinormal distribution with theoretical mean μ and covariance matrix Σ of random variable X = (x1, x2,…, xn)T has the following PDF in popular form:
f(X|μ, Σ) = (2π)^(−n/2) |Σ|^(−1/2) exp(−(1/2)(X − μ)T Σ−1 (X − μ))
Hence, the parameter in popular form is Θ = (μ, Σ)T. The exponential family form of such PDF is:

f(X|θ1, θ2) = (2π)^(−n/2) exp((θ1, θ2)(X, XXT)T) ⁄ exp(−(1/4)θ1T θ2−1 θ1 − (1/2)log|−2θ2|)

Where,
Θ = (θ1, θ2)T
θ1 = Σ−1 μ
θ2 = −(1/2) Σ−1
b(X) = (2π)^(−n/2)
τ(X) = (X, XXT)T
a(Θ) = exp(−(1/4)θ1T θ2−1 θ1 − (1/2)log|−2θ2|)
The exponential family form is used to represent all distributions belonging to the exponential family in a canonical form. The parameter in exponential family form is called the exponential family parameter. As a convention, the parameter Θ mentioned in the EM algorithm is often the exponential family parameter if the PDF belongs to the exponential family and there is no additional information.
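To make the difference between the two parameterizations concrete, the following Python sketch (ours, not part of the report; function names are illustrative) converts (μ, Σ) into the exponential family parameters θ1 = Σ−1μ and θ2 = −(1/2)Σ−1 given above, and back:

```python
# A minimal sketch (not from the report) of the mapping between the popular
# parameters (mu, Sigma) of the multinormal distribution and its exponential
# family (natural) parameters theta1 = Sigma^{-1} mu, theta2 = -(1/2) Sigma^{-1}.
import numpy as np

def to_natural(mu, Sigma):
    """Popular parameters -> exponential family parameters."""
    Sigma_inv = np.linalg.inv(Sigma)
    return Sigma_inv @ mu, -0.5 * Sigma_inv

def to_popular(theta1, theta2):
    """Exponential family parameters -> popular parameters."""
    Sigma = np.linalg.inv(-2.0 * theta2)   # Sigma = (-2 * theta2)^{-1}
    return Sigma @ theta1, Sigma           # mu = Sigma * theta1

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
print(to_popular(*to_natural(mu, Sigma)))  # recovers (mu, Sigma)
```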
The expectation maximization (EM) algorithm has many iterations, and each iteration has two steps: the expectation step (E-step) calculates the sufficient statistic of the hidden data based on the observed data and the current parameter, whereas the maximization step (M-step) re-estimates the parameter. When DLR proposed the EM algorithm (Dempster, Laird, & Rubin, 1977), they first assumed that the PDF f(X | Θ) of the hidden space belongs to the exponential family. The E-step and M-step at the tth iteration are described in table 1.1 (Dempster, Laird, & Rubin, 1977, p. 4), in which the current estimate is Θ(t), with the note that f(X | Θ) belongs to the regular exponential family.
E-step:
We calculate the current value τ(t) of the sufficient statistic τ(X) from the observed Y and the current parameter Θ(t) according to the following equation:
τ(t) = E(τ(X)|Y, Θ(t)) = ∫_{φ⁻¹(Y)} k(X|Y, Θ(t)) τ(X) dX
M-step:
Based on τ(t), we determine the next parameter Θ(t+1) as the solution of the following equation:
E(τ(X)|Θ) = ∫_X f(X|Θ) τ(X) dX = τ(t)
Note, Θ(t+1) will become the current parameter at the next iteration (the (t+1)th iteration).
Table 1.1. E-step and M-step of EM algorithm given regular exponential PDF f(X|Θ)
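The loop in table 1.1 can be sketched in Python as follows. This is only a schematic illustration under our own naming: expected_tau_given_Y (the E-step) and solve_for_theta (the solver of E(τ(X)|Θ) = τ(t) in the M-step) are hypothetical callables that depend on the concrete model.

```python
# Schematic sketch of table 1.1 for a regular exponential family model.
# expected_tau_given_Y and solve_for_theta are model-specific placeholders.
import numpy as np

def em_exponential_family(Y, theta0, expected_tau_given_Y, solve_for_theta,
                          max_iter=100, tol=1e-8):
    theta = theta0
    for _ in range(max_iter):
        tau_t = expected_tau_given_Y(Y, theta)        # E-step: tau^(t) = E(tau(X) | Y, Theta^(t))
        theta_next = solve_for_theta(tau_t)           # M-step: solve E(tau(X) | Theta) = tau^(t)
        if np.linalg.norm(theta_next - theta) < tol:  # stop when successive estimates agree
            return theta_next
        theta = theta_next
    return theta
```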
The EM algorithm stops if two successive estimates are equal, Θ* = Θ(t) = Θ(t+1), at some tth iteration. At that time we conclude that Θ* is the optimal estimate of the EM process. As a convention, the estimate of parameter Θ resulting from the EM process is denoted Θ* instead of Θ̂ in order to emphasize that Θ* is the solution of an optimization problem.
For further research, DLR gave a preeminent generality of the EM algorithm (Dempster, Laird, & Rubin, 1977, pp. 6-11) in which f(X | Θ) specifies an arbitrary distribution. In other words, there is no requirement of the exponential family. They define the conditional expectation Q(Θ' | Θ) according to equation 1.4 (Dempster, Laird, & Rubin, 1977, p. 6).
Q(Θ'|Θ) = E(log(f(X|Θ'))|Y, Θ) = ∫_{φ⁻¹(Y)} k(X|Y, Θ) log(f(X|Θ')) dX    (1.4)
Xobs = (xj : j ∈ M̄)T = (xm̄1, xm̄2, …, xm̄|M̄|)T    (2.7)
Obviously, the dimension of Xmis is |M| and the dimension of Xobs is |M̄| = n − |M|. Note, when composing X from Xobs and Xmis as X = {Xobs, Xmis}, a right re-arrangement of the elements in both Xobs and Xmis is required.
Let Z = (z1, z2,…, zn)T be the n-dimension random variable in which each element zj is a binary random variable indicating whether xj is missing. Random variable Z is also called the missingness variable.
zj = 1 if xj is missing, and zj = 0 if xj is existent    (2.8)
For example, given X = (x1, x2, x3, x4, x5)T, when X is observed as X = (x1=1, x2=?, x3=4, x4=?, x5=9)T, we have Xobs = (x1=1, x3=4, x5=9)T, Xmis = (x2=?, x4=?)T, and Z = (z1=0, z2=1, z3=0, z4=1, z5=0)T.
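As a small illustration (ours), the following Python sketch recovers Z, M, and M̄ for this example, assuming missing values are encoded as NaN; note that NumPy indices are 0-based while the text uses 1-based indices.

```python
import numpy as np

X = np.array([1.0, np.nan, 4.0, np.nan, 9.0])  # X = (1, ?, 4, ?, 9)^T
Z = np.isnan(X).astype(int)                    # Z = (0, 1, 0, 1, 0)^T, equation 2.8
M = np.where(Z == 1)[0]                        # missing indices (0-based)
M_bar = np.where(Z == 0)[0]                    # observed indices (0-based)
X_obs = X[M_bar]                               # X_obs = (1, 4, 9)^T
print(Z, M, M_bar, X_obs)
```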
Generally, when X is replaced by a sample 𝒳 = {X1, X2,…, XN} whose Xi (s) are iid, let 𝒵 = {Z1, Z2,…, ZN} be the set of missingness variables associated with 𝒳. All Zi (s) are iid too. 𝒳 and 𝒵 can be represented as matrices. Given Xi, its associative quantities are Zi, Mi, and M̄i. Let X = {Xobs, Xmis} be the random variable representing every Xi. Let Z be the random variable representing every Zi. As a convention, Xobs(i) and Xmis(i) refer to the Xobs part and the Xmis part of Xi. We have:
Xi = {Xobs(i), Xmis(i)} = (xi1, xi2, …, xin)T
Xmis(i) = (xi,mi1, xi,mi2, …, xi,mi|Mi|)T
Xobs(i) = (xi,m̄i1, xi,m̄i2, …, xi,m̄i|M̄i|)T    (2.9)
For example, given the sample 𝒳 = {X1, X2, X3, X4} shown in the following table, we have Z1 = (z11=0, z12=1, z13=0, z14=1)T, Z2 = (z21=1, z22=0, z23=1, z24=0)T, Z3 = (z31=0, z32=0, z33=1, z34=1)T, and Z4 = (z41=1, z42=1, z43=0, z44=0)T. All Zi (s) are iid too.
      x1  x2  x3  x4          z1  z2  z3  z4
X1     1   ?   3   ?      Z1   0   1   0   1
X2     ?   2   ?   4      Z2   1   0   1   0
X3     1   2   ?   ?      Z3   0   0   1   1
X4     ?   ?   3   4      Z4   1   1   0   0
Of course, we have Xobs(1) = (x11=1, x13=3)T, Xmis(1) = (x12=?, x14=?)T, Xobs(2) = (x22=2, x24=4)T, Xmis(2) = (x21=?, x23=?)T, Xobs(3) = (x31=1, x32=2)T, Xmis(3) = (x33=?, x34=?)T, Xobs(4) = (x43=3, x44=4)T, and Xmis(4) = (x41=?, x42=?)T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, M̄2 = {m̄21=2, m̄22=4}, M3 = {m31=3, m32=4}, M̄3 = {m̄31=1, m̄32=2}, M4 = {m41=1, m42=2}, and M̄4 = {m̄41=3, m̄42=4}.
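For the whole sample, the same bookkeeping can be done row by row. The sketch below (ours, with missing cells encoded as NaN and 0-based indices) reproduces Zi, Mi, and M̄i for the table above:

```python
import numpy as np

sample = np.array([[1, np.nan, 3, np.nan],
                   [np.nan, 2, np.nan, 4],
                   [1, 2, np.nan, np.nan],
                   [np.nan, np.nan, 3, 4]])
Z = np.isnan(sample).astype(int)              # missingness matrix (Z_1, ..., Z_4)
M = [np.where(z == 1)[0] for z in Z]          # M_i: missing indices of X_i
M_bar = [np.where(z == 0)[0] for z in Z]      # M-bar_i: observed indices of X_i
print(Z)
print(M[0], M_bar[0])   # M_1 = [1 3], M-bar_1 = [0 2] (0-based versions of {2,4} and {1,3})
```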
Both X and Z are associated with their own PDFs, as follows:
f(X|Θ) = f(Xobs, Xmis|Θ)
f(Z|Xobs, Xmis, Φ)    (2.10)
Where Θ and Φ are the parameters of the PDFs of X = {Xobs, Xmis} and Z, respectively. The goal of handling missing data is to estimate Θ and Φ given X. The sufficient statistic of X = {Xobs, Xmis} is composed of the sufficient statistic of Xobs and the sufficient statistic of Xmis.
τ(X) = τ(Xobs, Xmis) = {τ(Xobs), τ(Xmis)}    (2.11)
How to compose τ(X) from τ(Xobs) and τ(Xmis) depends on the distribution type of the PDF f(X|Θ).
The joint PDF of X and Z is the main object of handling missing data, which is defined as follows:
𝑓(𝑋, 𝑍|Θ, Φ) = 𝑓(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 , 𝑍|Θ, Φ) = 𝑓(𝑍|𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 , Φ)𝑓(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 |Θ) (2.12)
The PDF of Xobs is defined as the integral of f(X|Θ) over Xmis:
f(Xobs|Θ) = ∫_{Xmis} f(Xobs, Xmis|Θ) dXmis    (2.13)
The PDF of Xmis is the conditional PDF of Xmis given Xobs:
f(Xmis|Xobs, ΘM) = f(Xmis|Xobs, Θ) = f(X|Θ) / f(Xobs|Θ) = f(Xobs, Xmis|Θ) / f(Xobs|Θ)    (2.14)
The notation ΘM implies that the parameter ΘM of the PDF f(Xmis | Xobs, ΘM) is derived from the parameter Θ of the PDF f(X|Θ); it is a function of Θ and Xobs, ΘM = u(Θ, Xobs). Thus, ΘM is not a new parameter and it depends on the distribution type.
ΘM = u(Θ, Xobs)    (2.15)
How to determine u(Θ, Xobs) depends on the distribution type of the PDF f(X|Θ).
There are three types of missing data, which depend on the relationship between Xobs, Xmis, and Z (Josse, Jiang, Sportisse, & Robin, 2018):
- Missing data (X or 𝒳) is Missing Completely At Random (MCAR) if the probability of Z is independent from both Xobs and Xmis such that f(Z | Xobs, Xmis, Φ) = f(Z | Φ), as simulated in the sketch after this list.
- Missing data (X or 𝒳) is Missing At Random (MAR) if the probability of Z depends only on Xobs such that f(Z | Xobs, Xmis, Φ) = f(Z | Xobs, Φ).
- Missing data (X or 𝒳) is Missing Not At Random (MNAR) in all other cases, where f(Z | Xobs, Xmis, Φ) depends on both Xobs and Xmis.
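As a quick illustration of the MCAR case (our sketch, not from the report), the missingness below is drawn independently of the data, so f(Z | Xobs, Xmis, Φ) reduces to f(Z | Φ) with Φ being a single missing probability:

```python
import numpy as np

rng = np.random.default_rng(0)
X_full = rng.normal(size=(100, 4))        # a complete sample, only for illustration
phi = 0.2                                 # probability that any entry is missing
Z = rng.random(X_full.shape) < phi        # missingness independent of X_obs and X_mis
X_mcar = np.where(Z, np.nan, X_full)      # MCAR data with NaN marking missing cells
```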
There are two main approaches for handling missing data (Josse, Jiang, Sportisse, & Robin, 2018):
- Using some statistical models such as EM to estimate the parameter with missing data.
- Imputing plausible values for missing values to obtain some complete samples (copies) from the missing data. Later on, every complete sample is used to produce an estimate of the parameter by some estimation method, for example, MLE or MAP. Finally, all estimates are synthesized to produce the best estimate.
Here we focus on the first approach, using EM to estimate the parameter with missing data. Without loss of generality, given sample 𝒳 = {X1, X2,…, XN} in which all Xi (s) are iid, by applying equation 1.8 for GEM with the joint PDF f(Xobs, Xmis, Z | Θ, Φ), we consider {Xobs, Z} as the observed part and Xmis as the hidden part. Let X = {Xobs, Xmis} be the random variable representing all Xi (s). Let Xobs(i) denote the observed part Xobs of Xi and let Zi be the missingness variable corresponding to Xi. By following equation 1.8, the expectation Q(Θ', Φ' | Θ, Φ) becomes:
Q(Θ', Φ'|Θ, Φ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), Zi, Θ, Φ) * log(f(Xobs(i), Xmis, Zi|Θ', Φ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * log(f(Xobs(i), Xmis|Θ', Φ') * f(Zi|Xobs(i), Xmis, Θ', Φ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * log(f(Xobs(i), Xmis|Θ') * f(Zi|Xobs(i), Xmis, Φ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * (log(f(Xobs(i), Xmis|Θ')) + log(f(Zi|Xobs(i), Xmis, Φ'))) dXmis
So Q(Θ', Φ'|Θ, Φ) splits into the two components Q1(Θ'|Θ) and Q2(Φ'|Θ) such that Q(Θ', Φ'|Θ, Φ) = Q1(Θ'|Θ) + Q2(Φ'|Θ), where:
Q1(Θ'|Θ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(f(Xobs(i), Xmis|Θ')) dXmis
Q2(Φ'|Θ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(f(Zi|Xobs(i), Xmis, Φ')) dXmis
Note, the unknowns of Q(Θ', Φ' | Θ, Φ) are Θ' and Φ'. Because it is not easy to maximize Q(Θ', Φ' | Θ, Φ) with regard to Θ' and Φ', we assume that the PDF f(X|Θ) belongs to the exponential family.
𝑓(𝑋|Θ) = 𝑓(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 |Θ) = 𝑏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 ) ∗ exp((Θ)𝑇 𝜏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 ))⁄𝑎(Θ) (2.17)
Note,
𝑏(𝑋) = 𝑏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 )
𝜏(𝑋) = 𝜏(𝑋𝑜𝑏𝑠 , 𝑋𝑚𝑖𝑠 ) = {𝜏(𝑋𝑜𝑏𝑠 ), 𝜏(𝑋𝑚𝑖𝑠 )}
It is easy to deduce that
𝑓(𝑋𝑚𝑖𝑠 |𝑋𝑜𝑏𝑠 , Θ𝑀 ) = 𝑏(𝑋𝑚𝑖𝑠 ) exp((Θ𝑀 )𝑇 𝜏(𝑋𝑚𝑖𝑠 ))⁄𝑎(Θ𝑀 ) (2.18)
Therefore,
f(Xmis|Xobs(i), ΘMi) = b(Xmis) exp((ΘMi)T τ(Xmis)) ⁄ a(ΘMi)
We have:
Q1(Θ'|Θ) = Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(f(Xobs(i), Xmis|Θ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * log(b(Xobs(i), Xmis) exp((Θ')T τ(Xobs(i), Xmis)) ⁄ a(Θ')) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) * (log(b(Xobs(i), Xmis)) + (Θ')T τ(Xobs(i), Xmis) − log(a(Θ'))) dXmis
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis + (Θ')T Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) τ(Xobs(i), Xmis) dXmis − N log(a(Θ'))
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis + (Θ')T Σ_{i=1..N} {τ(Xobs(i)) ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) dXmis, ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) τ(Xmis) dXmis} − N log(a(Θ'))
= Σ_{i=1..N} ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) τ(Xmis) dXmis} − N log(a(Θ'))
Therefore, equation 2.19 specifies Q1(Θ'|Θ) given that f(X|Θ) belongs to the exponential family.
Q1(Θ'|Θ) = Σ_{i=1..N} E(log(b(Xobs(i), Xmis))|ΘMi) + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)} − N log(a(Θ'))    (2.19)
Where,
E(log(b(Xobs(i), Xmis))|ΘMi) = ∫_{Xmis} f(Xmis|Xobs(i), ΘMi) log(b(Xobs(i), Xmis)) dXmis    (2.20)
ΘMi(t) = u(Θ(t), Mi)
E(τ(Xmis)|ΘMi(t)) = ∫_{Xmis} f(Xmis|Xobs(i), ΘMi(t)) τ(Xmis) dXmis
Equation 2.22 is a variant of equation 2.11 when f(X|Θ) belongs to the exponential family, but how to compose τ(X) from τ(Xobs) and τ(Xmis) is not exactly determined yet.
As a result, at the M-step of some tth iteration, given τ(t) and Θ(t), the next parameter Θ(t+1) is a solution of the following equation:
E(τ(X)|Θ) = τ(t)    (2.23)
Moreover, at the M-step of some tth iteration, the next parameter Φ(t+1) is a maximizer of Q2(Φ | Θ(t)) given Θ(t).
f(Xobs|Θobs) = (2π)^(−|M̄|/2) |Σobs|^(−1/2) exp(−(1/2)(Xobs − μobs)T (Σobs)−1 (Xobs − μobs))    (2.33)
Therefore,
f(Xobs(i)|Θobs(i)) = (2π)^(−|M̄i|/2) |Σobs(i)|^(−1/2) exp(−(1/2)(Xobs(i) − μobs(i))T (Σobs(i))−1 (Xobs(i) − μobs(i)))
Where,
μobs(i) = (μm̄i1, μm̄i2, …, μm̄i|M̄i|)T
Σobs(i) = ( σm̄i1m̄i1    σm̄i1m̄i2    ⋯  σm̄i1m̄i|M̄i|
            σm̄i2m̄i1    σm̄i2m̄i2    ⋯  σm̄i2m̄i|M̄i|
            ⋮           ⋮           ⋱  ⋮
            σm̄i|M̄i|m̄i1  σm̄i|M̄i|m̄i2  ⋯  σm̄i|M̄i|m̄i|M̄i| )    (2.34)
Obviously, Θobs(i) is extracted from Θ given the indicator M̄i or Mi. Note, σm̄ij m̄ik is the covariance of xm̄ij and xm̄ik.
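In code, extracting Θobs(i) (and the corresponding blocks for the missing part) from Θ = (μ, Σ) given M̄i and Mi is simple index slicing. The sketch below is ours (names illustrative); it also returns the cross-covariance block that appears later in example 3.1 as V^mis_obs(i):

```python
import numpy as np

def split_parameter(mu, Sigma, M_i, M_bar_i):
    """Extract the observed/missing blocks of (mu, Sigma) for record i."""
    mu_obs = mu[M_bar_i]                          # mu_obs(i)
    mu_mis = mu[M_i]                              # mu_mis(i)
    Sigma_obs = Sigma[np.ix_(M_bar_i, M_bar_i)]   # Sigma_obs(i), equation 2.34
    Sigma_mis = Sigma[np.ix_(M_i, M_i)]           # Sigma_mis(i)
    V_mis_obs = Sigma[np.ix_(M_i, M_bar_i)]       # cross-covariance block
    return mu_obs, Sigma_obs, mu_mis, Sigma_mis, V_mis_obs
```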
We have:
f(Xmis(i)|Θmis(i)) = ∫_{Xobs(i)} f(Xobs(i), Xmis(i)|Θ) dXobs(i)
It is necessary to calculate the sufficient statistic with the normal PDF f(Xi|Θ), which means that we need to define what τ1(t) and τ2(t) are. The sufficient statistic of Xobs(i) is:
τ(Xobs(i)) = (Xobs(i), Xobs(i)(Xobs(i))T)T
The sufficient statistic of Xmis(i) is:
τ(Xmis(i)) = (Xmis(i), Xmis(i)(Xmis(i))T)T
We also have:
E(τ(Xmis)|ΘMi(t)) = ∫_{Xmis} f(Xmis|ΘMi(t)) τ(Xmis) dXmis = (μMi(t), ΣMi(t) + μMi(t)(μMi(t))T)T
Due to
E(Xmis(i)(Xmis(i))T|ΘMi(t)) = ΣMi(t) + μMi(t)(μMi(t))T
Where μMi(t) and ΣMi(t) are μMi and ΣMi at the current iteration, respectively. By referring to equation 2.38, we have
μMi(t) = (μMi(t)(mi1), μMi(t)(mi2), …, μMi(t)(mi|Mi|))T
And
ΣMi(t) + μMi(t)(μMi(t))T = ( σ̃11(t)(i)    σ̃12(t)(i)    ⋯  σ̃1|Mi|(t)(i)
                             σ̃21(t)(i)    σ̃22(t)(i)    ⋯  σ̃2|Mi|(t)(i)
                             ⋮             ⋮            ⋱  ⋮
                             σ̃|Mi|1(t)(i)  σ̃|Mi|2(t)(i)  ⋯  σ̃|Mi||Mi|(t)(i) )
Where,
σ̃uv(t)(i) = ΣMi(t)(miu, miv) + μMi(t)(miu)μMi(t)(miv)
Therefore, τ1(t) is a vector and τ2(t) is a matrix, and the sufficient statistic of X at the E-step of some tth iteration, given the current parameter Θ(t), is defined as follows:
τ(t) = (τ1(t), τ2(t))T
τ1(t) = (x̄1(t), x̄2(t), …, x̄n(t))T
τ2(t) = ( s11(t)  s12(t)  ⋯  s1n(t)
          s21(t)  s22(t)  ⋯  s2n(t)
          ⋮       ⋮       ⋱  ⋮
          sn1(t)  sn2(t)  ⋯  snn(t) )    (2.39)
Each x̄j(t) is calculated as follows:
x̄j(t) = (1/N) Σ_{i=1..N} { xij if j ∉ Mi; μMi(t)(j) if j ∈ Mi }    (2.40)
Please see equation 2.35 and equation 2.38 to know μMi(t)(j). Each suv(t) is calculated as follows:
suv(t) = svu(t) = (1/N) Σ_{i=1..N} {
    xiu xiv                                       if u ∉ Mi and v ∉ Mi
    xiu μMi(t)(miv)                               if u ∉ Mi and v ∈ Mi
    μMi(t)(miu) xiv                               if u ∈ Mi and v ∉ Mi
    ΣMi(t)(miu, miv) + μMi(t)(miu)μMi(t)(miv)     if u ∈ Mi and v ∈ Mi }    (2.41)
Equation 2.39 is an instance of equation 2.11, which composes τ(X) from τ(Xobs) and τ(Xmis) when f(X|Θ) distributes normally. Following is the proof of equation 2.41. If u ∉ Mi and v ∉ Mi, then the partial statistic xiuxiv is kept intact because xiu and xiv are in Xobs and hence constant with regard to f(Xmis | ΘMi(t)). If u ∉ Mi and v ∈ Mi, then the partial statistic xiuxiv is replaced by the expectation E(xiuxiv|ΘMi(t)) = xiu E(xiv|ΘMi(t)) = xiu μMi(t)(miv), because xiu is constant; the case u ∈ Mi and v ∉ Mi is symmetric. If u ∈ Mi and v ∈ Mi, then xiuxiv is replaced by E(xiuxiv|ΘMi(t)) = ΣMi(t)(miu, miv) + μMi(t)(miu)μMi(t)(miv).
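The E-step of equations 2.39-2.41 can be sketched in Python as follows. This is our own sketch: it assumes that equations 2.35 and 2.38 (not reproduced in this excerpt) are the standard conditional-normal formulas μMi = μmis + V^mis_obs(Σobs)−1(Xobs − μobs) and ΣMi = Σmis − V^mis_obs(Σobs)−1V^obs_mis, which is how they are used in example 3.1, and it encodes missing cells as NaN.

```python
import numpy as np

def e_step_multinormal(sample, mu, Sigma):
    """Return tau1^(t) (eq. 2.40) and tau2^(t) (eq. 2.41) for the current (mu, Sigma)."""
    N, n = sample.shape
    tau1 = np.zeros(n)
    tau2 = np.zeros((n, n))
    for X in sample:
        M = np.where(np.isnan(X))[0]        # missing indices M_i
        O = np.where(~np.isnan(X))[0]       # observed indices M-bar_i
        x_filled, cov = X.copy(), np.zeros((n, n))
        if len(M) > 0:
            S_oo = Sigma[np.ix_(O, O)]
            S_mo = Sigma[np.ix_(M, O)]
            # conditional mean and covariance of X_mis given X_obs (assumed eq. 2.35 and 2.38)
            mu_M = mu[M] + S_mo @ np.linalg.solve(S_oo, X[O] - mu[O])
            Sigma_M = Sigma[np.ix_(M, M)] - S_mo @ np.linalg.solve(S_oo, S_mo.T)
            x_filled[M] = mu_M              # missing x_ij replaced by mu_Mi^(t)(j)
            cov[np.ix_(M, M)] = Sigma_M     # second-moment correction for missing pairs
        tau1 += x_filled                    # accumulates eq. 2.40
        tau2 += np.outer(x_filled, x_filled) + cov   # accumulates eq. 2.41
    return tau1 / N, tau2 / N
```

For the multinormal model, the M-step solving equation 2.23 would then take μ(t+1) = τ1(t) and Σ(t+1) = τ2(t) − τ1(t)(τ1(t))T, since E(X|Θ) = μ and E(XXT|Θ) = Σ + μμT.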
Σ_{j=1..n} pj = 1
Σ_{j=1..n} xj = K
xj ∈ {0, 1, …, K}
Note, xj is the number of trials generating nominal value j. Therefore,
f(Xi|Θ) = f(Xobs(i), Xmis(i)|Θ) = (K! ⁄ Π_{j=1..n}(xij!)) Π_{j=1..n} pj^xij
Where,
Σ_{j=1..n} xij = K
xij ∈ {0, 1, …, K}
The most important task here is to define equation 2.11 and equation 2.15 in order to compose τ(X) from τ(Xobs) and τ(Xmis) and to extract ΘM from Θ when f(X|Θ) is the multinomial PDF.
Let Θmis be the parameter of the marginal PDF of Xmis; we have:
f(Xmis|Θmis) = (Kmis! ⁄ Π_{mj∈M}(xmj!)) Π_{j=1..|M|} (pmj ⁄ Pmis)^xmj    (2.46)
Therefore,
f(Xmis(i)|Θmis(i)) = (Kmis(i)! ⁄ Π_{mj∈Mi}(ximj!)) Π_{j=1..|Mi|} (pmij ⁄ Pmis(i))^ximj
Where,
Θmis(i) = (pmi1 ⁄ Pmis(i), pmi2 ⁄ Pmis(i), …, pmi|Mi| ⁄ Pmis(i))T
The sufficient statistic of Xobs(i) is:
τ(Xobs(i)) = (xim̄1, xim̄2, …, xim̄|M̄i|)T
The sufficient statistic of Xmis(i) with regard to f(Xmis(i)|Xobs(i), ΘMi) is:
τ(Xmis(i)) = (xim1, xim2, …, xim|Mi|, Σ_{j=1..|M̄i|} xm̄ij)T
We also have:
E(τ(Xmis)|ΘMi(t)) = ∫_{Xmis} f(Xmis|Xobs, ΘMi(t)) τ(Xmis) dXmis = (Kpm1, Kpm2, …, Kpm|Mi|, Σ_{j=1..|M̄i|} Kpm̄ij)T
Therefore, the sufficient statistic of X at the E-step of some tth iteration given the current parameter Θ(t) = (p1(t), p2(t),…, pn(t))T is defined as follows:
τ(t) = (x̄1(t), x̄2(t), …, x̄n(t))T
x̄j(t) = (1/N) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j    (2.52)
Equation 2.52 is an instance of equation 2.11, which composes τ(X) from τ(Xobs) and τ(Xmis) when f(X|Θ) is the multinomial PDF.
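A small Python sketch (ours, with missing counts encoded as NaN) of the E-step in equation 2.52:

```python
import numpy as np

def e_step_multinomial(sample, p_t, K):
    """tau^(t) of equation 2.52: missing x_ij are replaced by K * p_j^(t)."""
    filled = np.where(np.isnan(sample), K * p_t, sample)
    return filled.sum(axis=0) / sample.shape[0]
```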
At the M-step of some tth iteration, we need to maximize Q1(Θ'|Θ) with the following constraint:
Σ_{j=1..n} pj = 1
According to equation 2.19, we have:
Q1(Θ'|Θ) = Σ_{i=1..N} E(log(b(Xobs(i), Xmis))|ΘMi) + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)} − N log(a(Θ'))
Where the quantities b(Xobs(i), Xmis) and a(Θ') belong to the PDF f(X|Θ) of X. Because of the constraint Σ_{j=1..n} pj = 1, we use the Lagrange duality method to maximize Q1(Θ'|Θ). The Lagrange function la(Θ', λ | Θ) is the sum of Q1(Θ'|Θ) and the constraint term, as follows:
la(Θ', λ|Θ) = Q1(Θ'|Θ) + λ(1 − Σ_{j=1..n} pj')
= Σ_{i=1..N} E(log(b(Xobs(i), Xmis))|ΘMi) + (Θ')T Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)} − N log(a(Θ')) + λ(1 − Σ_{j=1..n} pj')
Where Θ' = (p1', p2',…, pn')T. Note, λ ≥ 0 is called the Lagrange multiplier. Of course, la(Θ', λ | Θ) is a function of Θ' and λ. The next parameter Θ(t+1) that maximizes Q1(Θ'|Θ) is a solution of the equation formed by setting the first-order derivatives of the Lagrange function with regard to Θ' and λ to zero.
The first-order partial derivative of la(Θ’, λ | Θ) with regard to Θ’ is:
∂la(Θ', λ|Θ)/∂Θ' = Σ_{i=1..N} (E(τ(Xobs(i), Xmis)|ΘMi))T − N log'(a(Θ')) − (λ, λ, …, λ)T
= Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)}T − N log'(a(Θ')) − (λ, λ, …, λ)T
By referring to table 1.2, we have:
log'(a(Θ')) = (E(τ(X)|Θ'))T = ∫_X f(X|Θ')(τ(X))T dX
Thus,
∂la(Θ', λ|Θ)/∂Θ' = Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi)}T − N(E(τ(X)|Θ'))T − (λ, λ, …, λ)T
The first-order partial derivative of la(Θ’, λ | Θ) with regard to λ is:
∂la(Θ', λ|Θ)/∂λ = 1 − Σ_{j=1..n} pj'
Therefore, at the M-step of some tth iteration, given the current parameter Θ(t) = (p1(t), p2(t),…, pn(t))T, the next parameter Θ(t+1) is a solution of the following equation system:
Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi(t))}T − N(E(τ(X)|Θ))T − (λ, λ, …, λ)T = 0T
1 − Σ_{j=1..n} pj = 0
This implies:
E(τ(X)|Θ) = τ(t) − (λ/N, λ/N, …, λ/N)T
Σ_{j=1..n} pj = 1
Where,
τ(t) = (1/N) Σ_{i=1..N} {τ(Xobs(i)), E(τ(Xmis)|ΘMi(t))}
Due to
E(τ(X)|Θ) = ∫_X τ(X) f(X|Θ) dX = (Kp1, Kp2, …, Kpn)T
τ(t) = (x̄1(t), x̄2(t), …, x̄n(t))T
x̄j(t) = (1/N) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
We obtain n equations Kpj = −λ/N + x̄j(t) and one constraint Σ_{j=1..n} pj = 1. Therefore, we have:
pj = −λ/(KN) + (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
Summing the n equations above, we have:
1 = Σ_{j=1..n} pj = −nλ/(KN) + (1/(KN)) Σ_{i=1..N} ( Σ_{j=1..|M̄i|} xim̄ij + Σ_{j=1..|Mi|} Kpmij(t) )
Suppose every missing value ximij is estimated by Kpmij(t) such that:
Σ_{j=1..|Mi|} Kpmij(t) = Σ_{j=1..|Mi|} ximij
We obtain:
1 = −nλ/(KN) + (1/(KN)) Σ_{i=1..N} ( Σ_{j=1..|M̄i|} xim̄ij + Σ_{j=1..|Mi|} ximij ) = −nλ/(KN) + (1/(KN)) Σ_{i=1..N} K = −nλ/(KN) + 1
This implies
λ = 0
Such that
pj = (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
Therefore, at the M-step of some tth iteration, given the current parameter Θ(t) = (p1(t), p2(t),…, pn(t))T, the next parameter Θ(t+1) is specified by the following equation.
pj(t+1) = (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j    (2.53)
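Since equation 2.53 is just the average of equation 2.52 divided by K, one GEM iteration for the multinomial model fits in a few lines. The sketch below is ours (missing counts encoded as NaN, function name illustrative):

```python
import numpy as np

def gem_step_multinomial(sample, p_t, K):
    """One iteration: impute missing counts by K*p_j^(t), then apply equation 2.53."""
    filled = np.where(np.isnan(sample), K * p_t, sample)   # E-step imputation, eq. 2.52
    return filled.sum(axis=0) / (K * sample.shape[0])      # p^(t+1), eq. 2.53
```

Iterating this function until two successive parameter vectors coincide (or, as noted above, stopping after the first iteration) implements table 2.3.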
In general, given sample 𝒳 = {X1, X2,…, XN}, whose Xi (s) are iid, that is MCAR data, and given that f(X|Θ) is the multinomial PDF of K trials, GEM for handling missing data is summarized in table 2.3.
M-step:
Given τ(t) and Θ(t) = (p1(t), p2(t),…, pn(t))T, the next parameter Θ(t+1) is specified by equation 2.53.
pj(t+1) = (1/(KN)) Σ_{i=1..N} { xij if j ∉ Mi; Kpj(t) if j ∈ Mi }  ∀j
Table 2.3. E-step and M-step of GEM algorithm for handling missing data given multinomial PDF
In table 2.3, the E-step is implied in how the M-step is performed. As aforementioned, in practice we can stop GEM after its first iteration is done, which is reasonable enough to handle missing data. The next section includes two examples of handling missing data with the multinormal distribution and the multinomial distribution.
V^mis_obs(1) = ( σ21(1) = 0, σ23(1) = 0; σ41(1) = 0, σ43(1) = 0 )
V^obs_mis(1) = ( σ12(1) = 0, σ14(1) = 0; σ32(1) = 0, σ34(1) = 0 )
μM1(1) = μmis(1) + (V^mis_obs(1))(Σobs(1))−1(Xobs(1) − μobs(1)) = (μM1(1)(2) = 0, μM1(1)(4) = 0)T
ΣM1(1) = Σmis(1) − (V^mis_obs(1))(Σobs(1))−1(V^obs_mis(1)) = ( ΣM1(1)(2,2) = 1, ΣM1(1)(2,4) = 0; ΣM1(1)(4,2) = 0, ΣM1(1)(4,4) = 1 )
x̄1(1) = (1/2)(x11 + μM2(1)(1)) = 0.5
x̄2(1) = (1/2)(μM1(1)(2) + x22) = 1
x̄3(1) = (1/2)(x13 + μM2(1)(3)) = 1.5
x̄4(1) = (1/2)(μM1(1)(4) + x24) = 2
p(2) = (c(Z1) + c(Z2)) / (4*2) = (2 + 2) / (4*2) = 0.5
At the 2nd iteration, E-step, we have:
Xobs(1) = (x1=1, x3=3)T
μmis(1) = (μ2(2)=1, μ4(2)=2)T
Σmis(1) = ( σ22(2)=1.5, σ24(2)=2; σ42(2)=2, σ44(2)=4.5 )
μobs(1) = (μ1(2)=0.5, μ3(2)=1.5)T
Σobs(1) = ( σ11(2)=0.75, σ13(2)=0.75; σ31(2)=0.75, σ33(2)=2.75 )
V^mis_obs(1) = ( σ21(2)=−0.5, σ23(2)=−1.5; σ41(2)=−1, σ43(2)=−3 )
V^obs_mis(1) = ( σ12(2)=−0.5, σ14(2)=−1; σ32(2)=−1.5, σ34(2)=−3 )
μM1(2) = μmis(1) + (V^mis_obs(1))(Σobs(1))−1(Xobs(1) − μobs(1)) = (μM1(2)(2) ≅ 0.17, μM1(2)(4) ≅ 0.33)T
ΣM1(2) = Σmis(1) − (V^mis_obs(1))(Σobs(1))−1(V^obs_mis(1)) = ( ΣM1(2)(2,2) ≅ 0.67, ΣM1(2)(2,4) ≅ 0.33; ΣM1(2)(4,2) ≅ 0.33, ΣM1(2)(4,4) ≅ 1.17 )
V^obs_mis(2) = ( σ21(2)=−0.5, σ23(2)=−1.5; σ41(2)=−1, σ43(2)=−3 )
μM2(2) = μmis(2) + (V^mis_obs(2))(Σobs(2))−1(Xobs(2) − μobs(2)) = (μM2(2)(1) ≅ 0.05, μM2(2)(3) ≅ 0.14)T
ΣM2(2) = Σmis(2) − (V^mis_obs(2))(Σobs(2))−1(V^obs_mis(2)) = ( ΣM2(2)(1,1) ≅ 0.52, ΣM2(2)(1,3) ≅ 0.07; ΣM2(2)(3,1) ≅ 0.07, ΣM2(2)(3,3) ≅ 0.7 )
x̄1(2) = (1/2)(x11 + μM2(2)(1)) ≅ 0.52
x̄2(2) = (1/2)(μM1(2)(2) + x22) ≅ 1.1
x̄3(2) = (1/2)(x13 + μM2(2)(3)) ≅ 1.57
x̄4(2) = (1/2)(μM1(2)(4) + x24) ≅ 2.17
p(3) = (c(Z1) + c(Z2)) / (4*2) = (2 + 2) / (4*2) = 0.5
Because the sample is too small for GEM to converge to an exact maximizer within a small enough number of iterations, we can stop GEM at the second iteration with Θ(3) = Θ* = (μ*, Σ*)T and Φ(3) = Φ* = p*, when the difference between Θ(2) and Θ(3) is insignificant.
μ* = (μ1* = 0.52, μ2* = 1.1, μ3* = 1.57, μ4* = 2.17)T
Σ* = ( σ11* = 0.49    σ12* = −0.44   σ13* = 0.72    σ14* = −0.96
       σ21* = −0.44   σ22* = 1.17    σ23* = −1.31   σ24* = 1.85
       σ31* = 0.72    σ32* = −1.31   σ33* = 2.4     σ34* = −2.63
       σ41* = −0.96   σ42* = 1.85    σ43* = −2.63   σ44* = 3.94 )
p* = 0.5
As aforementioned, because Xmis is a part of X and f(Xmis | ΘM) is derived directly from f(X|Θ), in practice we can stop GEM after its first iteration is done, which is reasonable enough to handle missing data.
An interesting application of handling missing data, as mentioned earlier, is to fill in or predict missing values. For instance, the missing part Xmis(1) of X1 = (x11=1, x12=?, x13=3, x14=?)T is filled in with μM1* according to equation 2.44 as follows:
x12 = μ2* = 1.1
x14 = μ4* = 2.17
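A minimal sketch (ours) of this fill-in step, reproducing the two values above by replacing missing components with the corresponding components of the estimated mean:

```python
import numpy as np

mu_star = np.array([0.52, 1.1, 1.57, 2.17])      # estimated mean from example 3.1
X1 = np.array([1.0, np.nan, 3.0, np.nan])        # X1 with missing x12 and x14
X1_filled = np.where(np.isnan(X1), mu_star, X1)  # fill missing values
print(X1_filled)                                 # [1.   1.1  3.   2.17]
```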
It is necessary to have an example illustrating how to handle missing data with the multinomial PDF.
Example 3.2. Given a sample of size two, 𝒳 = {X1, X2}, in which X1 = (x11=1, x12=?, x13=3, x14=?)T and X2 = (x21=?, x22=2, x23=?, x24=4)T are iid.
      x1  x2  x3  x4
X1     1   ?   3   ?
X2     ?   2   ?   4
Of course, we have Xobs(1) = (x11=1, x13=3)T, Xmis(1) = (x12=?, x14=?)T, Xobs(2) = (x22=2, x24=4)T, and Xmis(2) = (x21=?, x23=?)T. We also have M1 = {m11=2, m12=4}, M̄1 = {m̄11=1, m̄12=3}, M2 = {m21=1, m22=3}, and M̄2 = {m̄21=2, m̄22=4}. Let X be the random variable representing every Xi. Suppose f(X|Θ) is the multinomial PDF of 10 trials. We will estimate Θ = (p1, p2, p3, p4)T. The parameters p1, p2, p3, and p4 are all initialized arbitrarily to 0.25 as follows:
Θ(1) = (p1(1)=0.25, p2(1)=0.25, p3(1)=0.25, p4(1)=0.25)T
At the 1st iteration, M-step, we have:
p1(2) = (1/(10*2))(1 + 10*0.25) = 0.175
p2(2) = (1/(10*2))(10*0.25 + 2) = 0.225
p3(2) = (1/(10*2))(3 + 10*0.25) = 0.275
p4(2) = (1/(10*2))(10*0.25 + 4) = 0.325
We stop GEM after the first iteration is done, which results in the estimate Θ(2) = Θ* = (p1*, p2*, p3*, p4*)T as follows:
𝑝1∗ = 0.175
𝑝2∗ = 0.225
𝑝3∗ = 0.275
𝑝4∗ = 0.325
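These numbers can be checked directly against equation 2.53 with the following short sketch (ours, missing counts encoded as NaN):

```python
import numpy as np

sample = np.array([[1, np.nan, 3, np.nan],
                   [np.nan, 2, np.nan, 4]], dtype=float)
K, p1 = 10, np.full(4, 0.25)                          # 10 trials, uniform initial parameter
filled = np.where(np.isnan(sample), K * p1, sample)   # missing x_ij -> K * p_j^(1)
p2 = filled.sum(axis=0) / (K * sample.shape[0])       # equation 2.53
print(p2)   # [0.175 0.225 0.275 0.325], matching Theta^(2) above
```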
IV. CONCLUSIONS
In general, GEM is a powerful tool to handle missing data. The handling itself is not so difficult; the most important point is how to extract the parameter ΘM of the conditional PDF f(Xmis | Xobs, ΘM) from the whole parameter Θ of the PDF f(X|Θ), with the note that only f(X|Θ) is defined first and then f(Xmis | Xobs, ΘM) is derived from f(X|Θ). Therefore, equation 2.15 is the cornerstone of this method. Note, equations 2.35 and 2.51 are instances of equation 2.15 when f(X|Θ) is the multinormal PDF or the multinomial PDF.
REFERENCES
[1] Burden, R. L., & Faires, D. J. (2011). Numerical Analysis (9th ed.). (M. Julet, Ed.) Brooks/Cole Cengage Learning.
[2] Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. (M. Stone, Ed.) Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1-38.
[3] Hardle, W., & Simar, L. (2013). Applied Multivariate Statistical Analysis. Berlin, Germany: Research Data Center, School of Business and Economics, Humboldt University.
[4] Josse, J., Jiang, W., Sportisse, A., & Robin, G. (2018). Handling missing values. Inria. Julie Josse. Retrieved October 12, 2020, from http://juliejosse.com/wp-content/uploads/2018/07/LectureNotesMissing.html
[5] Nguyen, L. (2020). Tutorial on EM algorithm. MDPI Preprints. doi:10.20944/preprints201802.0131.v8
[6] Ta, P. D. (2014). Numerical Analysis Lecture Notes. Vietnam Institute of Mathematics, Numerical Analysis and Scientific Computing. Hanoi: Vietnam Institute of Mathematics.
[7] Wikipedia. (2014, August 4). Karush–Kuhn–Tucker conditions. (Wikimedia Foundation) Retrieved November 16, 2014, from http://en.wikipedia.org/wiki/Karush–Kuhn–Tucker_conditions
[8] Wikipedia. (2016). Exponential family. (Wikimedia Foundation) Retrieved from https://en.wikipedia.org/wiki/Exponential_family