Parametric MMD Estimation With Missing Values - Robustness To Missingness and Data Model Misspecification
Research Institute for Statistics and Information Science,
University of Geneva
Abstract
In the missing data literature, the Maximum Likelihood Estimator (MLE) is celebrated
for its ignorability property under missing at random (MAR) data. However, its sensitivity
to misspecification of the (complete) data model, even under MAR, remains a significant
limitation. This issue is further exacerbated by the fact that the MAR assumption may not
always be realistic, introducing an additional source of potential misspecification through
the missingness mechanism. To address this, we propose a novel M-estimation procedure
based on the Maximum Mean Discrepancy (MMD), which is provably robust to both model
misspecification and deviations from the assumed missingness mechanism. Our approach
offers strong theoretical guarantees and improved reliability in complex settings. We establish
the consistency and asymptotic normality of the estimator under missingness completely at
random (MCAR), provide an efficient stochastic gradient descent algorithm, and derive error
bounds that explicitly separate the contributions of model misspecification and missingness
bias. Furthermore, we analyze missing not at random (MNAR) scenarios where our estimator
maintains controlled error, including a Huber setting where both the missingness mechanism
and the data model are contaminated. Our contributions refine the understanding of the
limitations of the MLE and provide a robust and principled alternative for handling missing
data.
1 Introduction
Missing values present a persistent challenge across various fields. When data contain missing
values, some dimensions of the observations are masked, requiring consideration not only of
the data model but also of the missingness mechanism. Due to the pervasiveness of this issue,
numerous approaches have been developed to obtain reliable estimates despite missing data.
These include imputation methods (Van Buuren (2007); Van Buuren (2018); Näf et al. (2024)),
weighted estimators (Li et al. (2011); Shpitser (2016); Sun and Tchetgen (2018); Cantoni and de
Luna (2020); Malinsky et al. (2022)), and likelihood-based procedures (Rubin
(1976, 1996); Takai and Kano (2013); Golden et al. (2019); Sportisse et al. (2024), among
others). Among these, likelihood-based methods - particularly Maximum Likelihood Estimation
(MLE) - are widely used, as they provide consistent estimates under the missing at random
(MAR) condition. Introduced by Rubin (1976), MAR roughly states that the probability of
missingness depends only on observed data. Rubin further showed that, under an additional
parameter distinctness assumption, maximizing the likelihood while ignoring the missingness
mechanism remains valid. This result established MAR as a standard assumption in missing data
analysis, as evidenced by the extensive literature discussing this condition (Seaman et al. (2013);
Farewell et al. (2021); Mealli and Rubin (2015); Näf et al. (2024)). Beyond MAR, a stronger
assumption is missing completely at random (MCAR), where the probability of missingness
is entirely independent of the data. Conversely, when the missingness mechanism depends on
unobserved information, it falls under missing not at random (MNAR), which is generally more
challenging to handle.
Despite its popularity, the MAR condition has been questioned, both due to its potential
lack of realism (see e.g., Robins and Gill (1997); Malinsky et al. (2022); Ren
et al. (2023) and references therein) and because it is challenging to handle outside of MLE.
For instance, Ma et al. (2024) show that MAR can lead to parameter non-identifiability even
in simple cases, while Näf et al. (2024) highlight the complex distribution shifts that can arise
under MAR. These issues cast doubt on MAR as a default assumption for missing data. As
an alternative, Ma et al. (2024) propose a framework in which the missingness mechanism is
assumed to be MCAR, but with Huber-style MNAR deviations affecting at most a fraction ε
of the data. This approach is particularly relevant given that the missingness mechanism is
typically unknown. Indeed, since MAR is fundamentally untestable (Robins and Gill (1997);
Molenberghs et al. (2008)), identifying the true missing data mechanism without strong domain
knowledge is often infeasible. More generally, the framework of Ma et al. (2024) allows for
contamination in both the assumed data model and the missingness mechanism. This flexibility
is crucial, as MLE lacks robustness against model misspecification and contamination. These
issues become even more problematic in the presence of missing data, where misspecifications can
stem not only from the data model but also from an incorrectly specified missingness mechanism.
A promising alternative to MLE is Maximum Mean Discrepancy (MMD) estimation, which
selects parameters by minimizing the MMD between the postulated and observed distributions,
rather than relying on the Kullback-Leibler divergence as in MLE. This approach is known to ex-
hibit desirable robustness properties (Briol et al. (2019); Chérief-Abdellatif and Alquier (2022)).
Building on the idea of safeguarding against deviations in both the assumed data distribution
and the missingness mechanism, we extend MMD estimation to the case of missing values. Our
first step is to reformulate the MMD estimator as an M-estimator, which allows us to adapt it
to the more challenging setting of missing data. We then establish that the resulting estimator
is consistent and asymptotically normal under MCAR, while remaining robust to certain de-
viations from both the assumed data model and the missingness mechanism. Specifically, we
show that under regularity conditions, the MMD estimator converges to the parameter asso-
ciated with the closest distribution (in MMD) to the true conditional within the model class.
We derive explicit bounds on this MMD distance, which split into two components: one corre-
sponding to errors from model misspecification and the other accounting for deviations from the
MCAR assumption in the missingness mechanism. To facilitate practical implementation, we
extend an existing stochastic gradient descent (SGD) algorithm for MMD estimation to handle
missing values. This allows for a straightforward adaptation of the algorithm to a wide range
of parametric models, maintaining its generality, flexibility, and computational efficiency while
providing rigorous robustness guarantees.
The paper is organized as follows. After discussing related work and notation, we first give
some background on MMD and maximum likelihood estimation and motivate our approach
in Section 2. Section 3 then presents our estimator and its asymptotic properties. Section 4
discusses the general robustness guarantees of the new estimator. Finally, Section 5 provides
several examples.
Our work extends this line of research by applying MMD-based estimation to the context
of missing data, demonstrating that it provides a natural and robust alternative to traditional
methods while avoiding explicit specification of the missingness mechanism.
1.2 Notation
In missing data analysis, the underlying random vector X with values in Rd gets masked by
another random vector M taking values in {0, 1}d , where Mj = 0 indicates that Xj is observed
while Mj = 1 means that Xj is missing. Under i.i.d. data sampling of (X1 , M1 ), . . . , (Xn , Mn ),
this induces at most 2d possible missingness patterns, with each observation being associated
with a specific pattern. Each observation from a pattern m can be seen as a masked sample
from the distribution X conditional on M = m, leading to potentially different distributions in
each pattern, exacerbating the information loss through the non-observed values. We consider
the following notation:
• PM is the marginal distribution of the mask variable M. Given its discrete nature, we introduce the probability mass function P[M = m] for a given pattern m ∈ {0, 1}^d, so that for every measurable set A ⊂ {0, 1}^d, PM[A] = Σ_{m∈A} P[M = m].
• P_{X|M=m} is the conditional distribution of X given M = m, with density
$$p_{X|M=m}(x) = \frac{p_{X,M}(x, m)}{P[M = m]},$$
where p_{X,M} denotes the joint density of (X, M). The corresponding conditional Markov kernel of X given M is then defined as P_{X|M}.
• We finally introduce a complete data model {Pθ }θ with density pθ w.r.t. Lebesgue’s mea-
sure. The model is well-specified when there is a true parameter θ∗ such that PX = Pθ∗ .
For m ∈ {0, 1}^d, X^(m) is the subvector of X corresponding to the variables such that mj = 0 (the observed variables), and P_X^(m)/p_X^(m), P_{X|M=m}^(m)/p_{X|M=m}^(m), P_θ^(m)/p_θ^(m) are the corresponding distributions/densities. We call a missingness mechanism missing completely at random (MCAR) if P_{X,M} = P_X × P_M. A missingness mechanism is called missing at random (MAR) if for all m in the support of P_M,
P[M = m | X = x] = P[M = m | X^(m) = x^(m)] for P_X-almost every x.
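As a small illustration of this masking convention, here is a hypothetical toy snippet in Python (the arrays and names are illustrative and not part of the paper):

import numpy as np

X = np.array([[1.2, 0.5, -0.7],
              [2.0, -0.3, 0.9]])
M = np.array([[0, 1, 0],
              [0, 0, 1]])    # M[i, j] = 1 means X[i, j] is missing, 0 means it is observed

# The observed subvector X_i^(M_i) keeps the coordinates j with M[i, j] = 0.
for x, m in zip(X, M):
    print(x[m == 0])         # first row: [1.2, -0.7]; second row: [2.0, -0.3]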
2 Background
We present in this section the background required to understand the rest of the paper. We first recall the definition of the MMD and introduce related concepts. Then, we properly define the minimum distance estimator based on the MMD in the complete-data setting, which we briefly compare to the MLE. Finally, we provide a brief survey of the literature on the MLE in the presence of missing data.
We assume in this work that the kernel is bounded (say by 1 without loss of generality), that is
|k(x, y)| ≤ 1 for any x, y ∈ X . Associated with such a kernel is a Reproducing Kernel Hilbert
Space (RKHS), denoted (H, ⟨·, ·⟩H ), which satisfies the fundamental reproducing property: for
any function f ∈ H and any x ∈ X,
$$f(x) = \langle f, k(x, \cdot)\rangle_{\mathcal H}.$$
We refer the interested reader to (Hsing and Eubank, 2015, Chapter 2.7) for a more detailed
introduction to RKHS theory.
A key concept in kernel methods is the kernel mean embedding, which provides a way to
represent probability measures as elements of the RKHS. Given a probability distribution P
over X, its mean embedding is defined as
$$\Phi(P) := \mathbb{E}_{X\sim P}[k(X, \cdot)] \in \mathcal H.$$
This embedding generalizes the feature map Φ(X) := k(X, ·) used in kernel-based learning
algorithms such as Support Vector Machines. A major advantage of this approach is that
expectations with respect to P can be expressed as inner products in H, i.e.,
$$\mathbb{E}_{X\sim P}[f(X)] = \langle f, \Phi(P)\rangle_{\mathcal H} \quad \text{for any } f \in \mathcal H.$$
Using these embeddings, one can define a discrepancy measure between probability distributions
known as the Maximum Mean Discrepancy (MMD). For two distributions P1 and P2, the MMD is given by
$$D(P_1, P_2) = \|\Phi(P_1) - \Phi(P_2)\|_{\mathcal H},$$
which can be rewritten explicitly as
$$D^2(P_1, P_2) = \mathbb{E}_{X, X'\sim P_1}[k(X, X')] - 2\,\mathbb{E}_{X\sim P_1,\, Y\sim P_2}[k(X, Y)] + \mathbb{E}_{Y, Y'\sim P_2}[k(Y, Y')].$$
A kernel k is called characteristic if the mapping P ↦ Φ(P) is injective, ensuring that D(P1, P2) = 0 if and only if P1 = P2, thus providing a proper metric. Many conditions ensuring that a kernel
is characteristic are discussed in Section 3.3.1 of Muandet et al. (2017), including examples such
as the widely used Gaussian kernel:
$$k(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{\gamma^2}\right). \qquad (3)$$
For the remainder of this work, we assume that k is characteristic and dimension-independent,
meaning the same function can be applied to x(m) and y (m) on R|m| for any m ∈ {0, 1}d , e.g.
k(x(m) , y (m) ) = exp(−∥x(m) − y (m) ∥2 /γ 2 ) for the Gaussian kernel. Note that H, Φ and D (·, ·)
all implicitly depend on the kernel k.
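To make these definitions concrete, here is a short Python sketch (illustrative, not part of the paper) that computes the plug-in estimate of the squared MMD between two samples with the Gaussian kernel of (3); applying it to matching subvectors illustrates the dimension-independence used above:

import numpy as np

def gaussian_kernel_matrix(X, Y, gamma=1.0):
    # k(x, y) = exp(-||x - y||_2^2 / gamma^2) evaluated on all pairs of rows of X and Y.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / gamma ** 2)

def mmd_squared(X, Y, gamma=1.0):
    # Plug-in (V-statistic) estimate of D^2(P1, P2) from samples X ~ P1 and Y ~ P2:
    # E[k(X, X')] - 2 E[k(X, Y)] + E[k(Y, Y')].
    Kxx = gaussian_kernel_matrix(X, X, gamma)
    Kyy = gaussian_kernel_matrix(Y, Y, gamma)
    Kxy = gaussian_kernel_matrix(X, Y, gamma)
    return Kxx.mean() - 2.0 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))          # sample from N(0, I_3)
Y = rng.standard_normal((500, 3)) + 0.5    # sample shifted by 0.5 in every coordinate
print(mmd_squared(X, Y))                   # full vectors
print(mmd_squared(X[:, :2], Y[:, :2]))     # the same kernel applied to a two-dimensional marginal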
Given an i.i.d. sample X1, . . . , Xn from PX, the MMD estimator selects the parameter whose model distribution is closest, in MMD, to the empirical distribution:
$$\theta_n^{\mathrm{MMD}} = \arg\min_{\theta\in\Theta} D\Big(P_\theta,\ \frac{1}{n}\sum_{i=1}^n \delta_{\{X_i\}}\Big),$$
assuming this minimum exists. An approximate minimizer can be used instead when the exact
minimizer does not exist. The MMD estimator has several favorable properties. First, it can be
shown that θnMMD is consistent and asymptotically normal under appropriate conditions:
Theorem 2.1 (Informal variant of Proposition 1 and Theorem 2 in Briol et al. (2019)). Under
appropriate regularity conditions (including the existence of the involved quantities), θnMMD is
strongly consistent and asymptotically normal:
$$\theta_n^{\mathrm{MMD}} \xrightarrow[n\to+\infty]{P_X\text{-a.s.}} \theta_\infty^{\mathrm{MMD}}, \qquad \sqrt{n}\,\big(\theta_n^{\mathrm{MMD}} - \theta_\infty^{\mathrm{MMD}}\big) \xrightarrow[n\to+\infty]{\mathcal{L}} \mathcal{N}\big(0,\ B^{-1}\Sigma B^{-1}\big),$$
where
$$\theta_\infty^{\mathrm{MMD}} = \arg\min_{\theta\in\Theta} D(P_\theta, P_X), \qquad B = \nabla^2_{\theta,\theta} D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}, P_X\big),$$
and
$$\Sigma = \mathbb{V}_{X\sim P_X}\Big[\nabla_\theta D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}, \delta_{\{X\}}\big)\Big].$$
Hence, when the model is correctly specified, that is PX = Pθ∗ ∈ {Pθ}θ, then θ_n^MMD converges to the true parameter θ∗, i.e. θ_∞^MMD = θ∗, as soon as the model is identifiable, that is θ ≠ θ′ ⟹ Pθ ≠ Pθ′. In case of misspecification, i.e. when PX ∉ {Pθ}θ, θ_n^MMD converges towards the
parameter corresponding to the best (in the MMD sense) approximation of the true distribution
PX in the model. A formal statement of Theorem 2.1 can be found in Briol et al. (2019) for
generative models. The extension to the general case is straightforward.
What is particularly interesting with the MMD is that it is a robustness-inducing distance.
For instance, when estimating a univariate Gaussian mean θ∗ ∈ R with a proportion ε of data
points that are adversarially contaminated, then the limit θ_∞^MMD of θ_n^MMD will be no further (in Euclidean distance) from θ∗ than ε, while the limit of the MLE can be made arbitrarily far
away from θ∗ (see the results of Section 4.1 in Chérief-Abdellatif and Alquier (2022) for a formal
statement). This is the case because the MLE
$$\theta_n^{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p_\theta(X_i), \qquad (5)$$
where pθ is the density of Pθ w.r.t. Lebesgue's measure, converges a.s. to θ_∞^ML, whereby
$$\theta_\infty^{\mathrm{ML}} = \arg\min_{\theta} \mathrm{KL}\big(P_X \,\|\, P_\theta\big),$$
and the KL divergence is not bounded and does not induce the desirable robustness properties
inherent to the MMD. Consider for example the case where the data are realizations of n i.i.d.
random variables X1 , . . . , Xn actually drawn from the mixture:
where the distribution of interest N (θ∗ , 1) is contaminated by another univariate normal N (ξ, 1).
Then θ_∞^ML = (1 − ϵ)θ∗ + ϵξ, which can be quite far from θ∗ when |ϵ(ξ − θ∗)| is large compared to θ∗. This illustrates the non-robustness of the MLE to small deviations from the parametric assumption.
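A quick numerical check of this limit for the Gaussian location model, where the MLE is the sample mean (the snippet below is purely illustrative and not part of the paper):

import numpy as np

rng = np.random.default_rng(0)
theta_star, xi, eps, n = 0.0, 10.0, 0.05, 200_000

# Sample from the contaminated mixture (1 - eps) N(theta_star, 1) + eps N(xi, 1).
is_outlier = rng.random(n) < eps
X = np.where(is_outlier, rng.normal(xi, 1.0, n), rng.normal(theta_star, 1.0, n))

# For the Gaussian location model the MLE is the sample mean;
# its limit is (1 - eps) * theta_star + eps * xi, far from theta_star = 0 when eps * xi is large.
print(X.mean(), (1 - eps) * theta_star + eps * xi)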
remains valid, i.e. that θ_n^ML is equivalent to the maximum likelihood estimator computed using the full available dataset {(X_i^(M_i), M_i)}_{i=1}^n, provided that:
• the assumed missingness mechanism is Missing At Random,
• the parameters of the data model and those of the missingness mechanism are distinct.
It is important to note that θnML differs from the standard MLE in general, and different ter-
minology is often used to emphasize this distinction. Specifically, θnML is sometimes referred
to as the ignoring MLE, while the regular MLE is known as the full-information MLE. How-
ever, under the above conditions, both optimization procedures yield the same estimator. A
key advantage of θnML is that it eliminates the need to specify the missingness mechanism. By
implicitly treating each X_i^(M_i) as a sample from a marginal of PX (while it is in fact a marginal
of PX|M ), it leads to a simpler optimization problem while still adhering to the principles of
maximum likelihood estimation.
The asymptotics of θnML under MAR have been formally studied by Takai and Kano (2013).
Perhaps surprisingly, the often cited parameter distinctness condition is not necessary for con-
sistency and asymptotic normality, although the asymptotic variance of the ignoring MLE θnML
is suboptimal compared to the full-information MLE variance when the parameter distinctness
does not hold. The analysis has been extended to the MNAR setting in a recent paper Golden
et al. (2019), as summarized below.
Theorem 2.2 (Informal variant of Theorems 1 and 2 in Golden et al. (2019)). Under appro-
priate regularity conditions (including the existence of the involved quantities), θnML is strongly
consistent and asymptotically normal:
$$\theta_n^{\mathrm{ML}} \xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}} \theta_\infty^{\mathrm{ML}}, \qquad \sqrt{n}\,\big(\theta_n^{\mathrm{ML}} - \theta_\infty^{\mathrm{ML}}\big) \xrightarrow[n\to+\infty]{\mathcal{L}} \mathcal{N}\big(0,\ A^{-1} V A^{-1}\big),$$
where
$$\theta_\infty^{\mathrm{ML}} = \arg\min_{\theta\in\Theta} \mathbb{E}_{M\sim P_M}\Big[\mathrm{KL}\big(P_{X|M}^{(M)} \,\big\|\, P_\theta^{(M)}\big)\Big],$$
$$A = \mathbb{E}_{(X,M)\sim P_{X,M}}\Big[\nabla^2_{\theta,\theta} \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)\Big],$$
and
$$V = \mathbb{E}_{(X,M)\sim P_{X,M}}\Big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)\cdot \nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)^{T}\Big].$$
While θ_∞^ML is not formulated using the KL in Golden et al. (2019) but using the expected density (see their Assumption 5), it is straightforward to see the correspondence between the two quantities. We refer the reader to Golden et al. (2019) for a precise statement of the regularity conditions. As in the case without missing values, θ_∞^ML is a KL minimizer, but it now tries to match the observed marginal of the conditional PX|M on average over the pattern M. Remarkably, when the model is well-specified, that is PX = Pθ∗, and the missingness mechanism is MAR, then we recover θ_∞^ML = θ∗. Unfortunately, the KL also implies that when
the complete-data model is misspecified, the ignoring MLE under missingness suffers from similar
shortcomings as in the complete-data case. Inspired by the form the MLE estimator takes, we
introduce in the next section an MMD estimator for the case of missing values.
3 MMD M-estimator for missing data
In this section, we define our MMD estimator in the presence of missing data. We shortly discuss
its computation using stochastic gradient-based algorithms, and provide an asymptotic analysis
of the estimator.
In analogy with the complete-data case, we define
$$\theta_n^{\mathrm{MMD}} = \arg\min_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n D^2\Big(P_\theta^{(M_i)},\ \delta_{\{X_i^{(M_i)}\}}\Big).$$
Contrary to the complete-data setting, θ_n^MMD is no longer a minimum distance estimator, but simply an M-estimator. Notice that each MMD distance considered in the sum is defined over a different space R^{|M_i|} × R^{|M_i|}, which poses no issues due to the dimension-independence of the kernel k. The same remark actually applies to the KL divergence used in the definition of θ_∞^ML.
To apply a gradient-based method, the first step is to compute the gradient of this criterion with
respect to θ. The formula is provided in the following proposition, which is a straightforward
adaptation of Proposition 5.1 in Chérief-Abdellatif and Alquier (2022):
Proposition 3.1. Assume that each Pθ has a density pθ w.r.t. Lebesgue's measure. Assume that for any x, the map θ ↦ pθ(x) is differentiable w.r.t. θ, and that there exists a nonnegative function g(x, x′) such that for any θ ∈ Θ, |k(x, x′)∇θ[pθ(x)pθ(x′)]| ≤ g(x, x′) and ∫ g(x, x′) dx dx′ < ∞. Then, for any pattern m ∈ {0, 1}^d and any x,
$$\nabla_\theta D^2\Big(P_\theta^{(m)}, \delta_{\{x^{(m)}\}}\Big) = 2\,\mathbb{E}_{Y, Y'\sim P_\theta}\Big[\Big(k\big(Y^{(m)}, Y'^{(m)}\big) - k\big(x^{(m)}, Y^{(m)}\big)\Big)\,\nabla_\theta \log p_\theta(Y)\Big],$$
where Y and Y′ are independent draws from Pθ.
While this gradient may be challenging to compute explicitly and not be available in closed-
form, we can remark as Briol et al. (2019); Chérief-Abdellatif and Alquier (2022) in the complete-
data setting that the gradient is an expectation w.r.t. the model. Hence, if we can evaluate
∇θ [log pθ (x)] and if {Pθ }θ is a generative model, then we can approximate the gradient using
i.i.d.
a simple unbiased Monte Carlo estimator: first simulate (Y1 , . . . , YS ) ∼ Pθ , and then use the
following estimate for the gradient at θ
$$\frac{2}{nS}\sum_{i=1}^{n}\sum_{j=1}^{S}\left\{\frac{1}{S-1}\sum_{j'\neq j} k\big(Y_j^{(M_i)}, Y_{j'}^{(M_i)}\big) - k\big(X_i^{(M_i)}, Y_j^{(M_i)}\big)\right\}\nabla_\theta \log\big[p_\theta(Y_j)\big].$$
This unbiased estimate can then be used to perform a stochastic gradient descent (SGD), leading
to Algorithm 1. We incorporated a projection step to consider the case where Θ is strictly
included in Rp , and implicitly assumed that Θ is closed and convex. ΠΘ denotes the orthogonal
projection onto Θ.
Algorithm 1 Projected Stochastic Gradient Algorithm for MMD under missing data
Input: a dataset (X_1^(M_1), . . . , X_n^(M_n)), a model {Pθ : θ ∈ Θ ⊂ R^p}, a kernel k, a sequence of step sizes {η_t}_{t≥1}, an integer S, a stopping time T.
Initialize θ^(0) ∈ Θ, t = 0.
for t = 1, . . . , T do
    draw (Y_1, . . . , Y_S) i.i.d. from P_{θ^(t−1)}
    $$\theta^{(t)} = \Pi_\Theta\left(\theta^{(t-1)} - \frac{2\eta_t}{nS}\sum_{i=1}^{n}\sum_{j=1}^{S}\left\{\frac{1}{S-1}\sum_{j'\neq j} k\big(Y_j^{(M_i)}, Y_{j'}^{(M_i)}\big) - k\big(X_i^{(M_i)}, Y_j^{(M_i)}\big)\right\}\nabla_\theta \log\big[p_{\theta^{(t-1)}}(Y_j)\big]\right)$$
end for
This algorithm provides a basic approach to computing our estimator, primarily relying on
standard Monte Carlo estimation combined with stochastic gradient descent in the context of
generative models. As detailed in Section 6, this algorithm might be improved. In particular, a
fully generative approach as in Briol et al. (2019) is possible.
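As a concrete illustration of Algorithm 1, here is a minimal Python sketch for the toy model Pθ = N(θ, I_d), for which ∇θ log pθ(y) = y − θ, with the dimension-independent Gaussian kernel and Θ = R^d (so that the projection Π_Θ is the identity). The function names, the constant step size and the toy data are illustrative assumptions, not part of the paper.

import numpy as np

def gaussian_kernel(x, y, gamma=np.sqrt(2.0)):
    # Dimension-independent Gaussian kernel k(x, y) = exp(-||x - y||_2^2 / gamma^2).
    return np.exp(-np.sum((x - y) ** 2, axis=-1) / gamma ** 2)

def mmd_sgd_gaussian_mean(X, M, n_steps=2000, S=32, step=0.1, gamma=np.sqrt(2.0), seed=0):
    # Projected SGD of Algorithm 1 for P_theta = N(theta, I_d).
    # X: (n, d) array (entries at missing positions are ignored); M: (n, d) mask, 1 = missing.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    obs = (M == 0)
    theta = np.zeros(d)                            # theta^(0); Theta = R^d, so Pi_Theta is the identity
    for _ in range(n_steps):
        Y = theta + rng.standard_normal((S, d))    # (Y_1, ..., Y_S) i.i.d. from N(theta, I_d)
        score = Y - theta                          # grad_theta log p_theta(Y_j) for the Gaussian model
        grad = np.zeros(d)
        for i in range(n):
            o = obs[i]
            if not o.any():
                continue                           # a fully missing row carries no information
            Kyy = gaussian_kernel(Y[:, None, o], Y[None, :, o], gamma)   # k(Y_j^(Mi), Y_j'^(Mi)), (S, S)
            np.fill_diagonal(Kyy, 0.0)
            kyy = Kyy.sum(axis=1) / (S - 1)                              # average over j' != j
            kxy = gaussian_kernel(X[i, o][None, :], Y[:, o], gamma)      # k(X_i^(Mi), Y_j^(Mi)), (S,)
            grad += ((kyy - kxy)[:, None] * score).sum(axis=0)
        theta = theta - (2.0 * step / (n * S)) * grad
    return theta

# Toy usage: data from N(1, I_2) with an MCAR mask observing each entry with probability 0.5.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2)) + 1.0
M = (rng.random((500, 2)) > 0.5).astype(int)
print(mmd_sgd_gaussian_mean(X, M))   # should be close to (1, 1)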
Our assumptions differ slightly from those typically used in the MMD literature. Existing approaches generally fall into two categories: assumptions on the density (Key et al. (2021); Alquier and Gerber (2024b)) or on the generator (Briol et al. (2019)). Some works (Oates (2022); Alquier et al. (2023)) have proposed more general conditions that encompass both perspectives, though these can be challenging to establish due to the diverse nature of the underlying random objects.
In this work, we adopt a similarly general approach by working directly with the model
distribution. This allows us to capture subtle differences between complete and missing data
settings. Nevertheless, standard assumptions ensuring the consistency and asymptotic normality
of the MMD estimator in the complete data case continue to guarantee the consistency of our
estimator, as further detailed below.
Assumption 3.1. Θ is compact and θ_∞^MMD is uniquely defined as
$$\theta_\infty^{\mathrm{MMD}} = \arg\min_{\theta\in\Theta} \mathbb{E}_{M\sim P_M}\Big[D^2\big(P_\theta^{(M)}, P_{X|M}^{(M)}\big)\Big].$$
Assumption 3.2. The map θ ↦ Φ(P_θ^(m)) is continuous on Θ for each m ∈ {0, 1}^d in the support of PM.
These conditions ensuring the consistency of θnMMD towards θ∞ MMD are almost the same
conditions as for the standard MMD estimator, with two notable differences:
• The limit is no longer the minimizer of the MMD distance to the true whole distribution
PX as in the complete data setting. It now corresponds to the best average (over M ∼ PM )
approximation of the observed marginal of the conditional PX|M in terms of MMD distance,
as for the ignoring MLE, and is equal to θ∗ as soon as the model is well-specified and
the missingness mechanism is MCAR. This setting of the mechanism is crucial, as under
M(N)AR, distribution shifts are possible when moving from one missingness pattern to
the next. In fact, as discussed in Näf et al. (2024) and others, MAR can be defined as a
restriction on the arbitrary distribution shifts that are allowed under MNAR. Intuitively, if these distribution shifts from one pattern to the next are not too extreme, θ_∞^MMD will remain close to θ∗.
We need a few more assumptions to establish the asymptotic normality of our estimator:
Assumption 3.3. θ_∞^MMD is an interior point of Θ.
Assumption 3.4. There exists an open neighborhood O ⊂ Θ of θ_∞^MMD such that the maps θ ↦ Φ(P_θ^(m)) are twice (Fréchet) continuously differentiable on O for each m ∈ {0, 1}^d in the support of PM, with interchangeability of integration and differentiation.
Assumption 3.5. There exists a compact set K ⊂ O whose interior contains θ_∞^MMD and such that for any pair (j, k) and any m ∈ {0, 1}^d in the support of PM,
$$\sup_{\theta\in K}\left\|\frac{\partial^2 \Phi(P_\theta^{(m)})}{\partial\theta_j\,\partial\theta_k}\right\|_{\mathcal H} < +\infty.$$
Assumption 3.6. The matrix
$$H = \mathbb{E}_{M\sim P_M}\Big[\nabla^2_{\theta,\theta} D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}, P_{X|M}^{(M)}\big)\Big]$$
is nonsingular.
Assumption 3.7. The two matrices
$$\Sigma_1 = \mathbb{E}_{M\sim P_M}\Big[\mathbb{V}_{X\sim P_{X|M}}\big[\nabla_\theta D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}, \delta_{\{X^{(M)}\}}\big)\big]\Big] \quad\text{and}\quad \Sigma_2 = \mathbb{V}_{M\sim P_M}\Big[\nabla_\theta D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}, P_{X|M}^{(M)}\big)\Big]$$
are nonsingular.
Once again, we recover almost exactly the main conditions as for the standard MMD esti-
mators with four notable differences:
• The boundedness assumption of the second order derivative holds once again for each
marginal rather than for the whole distribution.
• The Hessian matrix B = ∇²_{θ,θ}D²(Pθ∗, PX) involving the distance to PX that was used in the definition of the asymptotic variance of θn without missing data is replaced by the expected (over M ∼ PM) value of the Hessian matrix ∇²_{θ,θ}D²(P_{θ_∞^MMD}^(M), P_{X|M}^(M)) involving the distance to the observed marginal of the conditional PX|M.
• The new version of the variance matrix Σ = V_{X∼PX}[∇θ D²(Pθ∗, δ{X})] that was used for the asymptotic variance in the complete-data case now splits into the two matrices Σ1 and Σ2 of Assumption 3.7, accounting respectively for the within-pattern variability of the observations and for the variability of the missingness patterns themselves.
We can now state an asymptotic normality result that holds for any missingness mechanism (even for MNAR data) and that provides a n^{−1/2} rate of convergence:
Theorem 3.2. If Conditions 3.1 to 3.7 are fulfilled, then √n(θ_n^MMD − θ_∞^MMD) is asymptotically normal:
$$\sqrt{n}\,\big(\theta_n^{\mathrm{MMD}} - \theta_\infty^{\mathrm{MMD}}\big) \xrightarrow[n\to+\infty]{\mathcal{L}} \mathcal{N}\big(0,\ H^{-1}(\Sigma_1 + \Sigma_2)\,H^{-1}\big).$$
Interestingly, the decomposition of the central asymptotic variance term Σ1 + Σ2 for our
estimator θnMMD is not explicit in the case of the MLE under missing data (see Theorem 2.2).
This arises because, in the absence of missing data, the central term V in the asymptotic variance
of the MLE can be alternatively formulated as the variance of the score:
$$\mathbb{E}_{X\sim P_X}\Big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}(X)\cdot \nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}(X)^{T}\Big] = \mathbb{V}_{X\sim P_X}\Big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}(X)\Big].$$
However, this identity no longer holds in the presence of missing data, since the expectation E_{X∼PX}[∇θ log p_{θ_∞^ML}(X)] becomes E_{X∼P_{X|M}}[∇θ log p_{θ_∞^ML}^(M)(X^(M))], which is not zero. As a result, expressing V as an expectation rather than a variance under missing data obscures the additional term. Instead, V can be decomposed as the average (over M ∼ PM) conditional variance of the score plus an explicit additional term:
$$\begin{aligned} V &= \mathbb{E}_{(X,M)\sim P_{X,M}}\Big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)\cdot \nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)^{T}\Big]\\ &= \mathbb{E}_{M\sim P_M}\Big[\mathbb{V}_{X\sim P_{X|M}}\big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)\big]\Big]\\ &\quad + \mathbb{E}_{M\sim P_M}\Big[\mathbb{E}_{X\sim P_{X|M}}\big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)\big]\cdot \mathbb{E}_{X\sim P_{X|M}}\big[\nabla_\theta \log p_{\theta_\infty^{\mathrm{ML}}}^{(M)}\big(X^{(M)}\big)\big]^{T}\Big]. \end{aligned}$$
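For completeness, this decomposition is the conditional variance identity applied pattern by pattern: writing s = ∇θ log p_{θ_∞^ML}^(M)(X^(M)) for the observed-data score,
$$\mathbb{E}\big[s\,s^{T}\mid M\big] = \mathbb{V}_{X\sim P_{X|M}}\big[s\mid M\big] + \mathbb{E}_{X\sim P_{X|M}}\big[s\mid M\big]\,\mathbb{E}_{X\sim P_{X|M}}\big[s\mid M\big]^{T},$$
and taking the expectation over M ∼ PM yields the display above.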
Having established these results, we now analyze the robustness of our new estimator.
4 Robustness
In the previous section, we analyzed the asymptotic behavior of our estimator θnMMD , and de-
termined its limit value. While it is evident that under a well-specified model PX = Pθ∗ and a
MCAR mechanism, θ_n^MMD consistently converges to the true parameter θ∗, the situation is less clear for an M(N)AR mechanism. Specifically, it remains uncertain whether the limit θ_∞^MMD remains close to θ∗ in the presence of a non-ignorable missingness mechanism, that is here an
M(N)AR mechanism. In this section, we examine the implications of an M(N)AR mechanism
on the behavior of θnMMD and even briefly consider its impact on the maximum likelihood esti-
mator θnML . Additionally, we address the scenario of a misspecified complete-data model, where
PX ∈/ {Pθ : θ ∈ Θ}, with a specific focus on contamination models accounting for the presence
of outliers in the data.
Theorem 4.1 (Robustness of the MLE limit to an MNAR missingness mechanism when d = 1).
Assume that d = 1 and that the model is well-specified, PX = Pθ∗. Then:
$$\mathrm{TV}^2\big(P_{\theta_\infty^{\mathrm{MLE}}}, P_{\theta^*}\big) \le 2\cdot\frac{\mathbb{V}_{X\sim P_X}[\pi(X)]}{\pi^2},$$
where V_{X∼PX}[π(X)] is the variance (w.r.t. X ∼ PX) of the missingness mechanism π(X) = P[M = 0|X], and π = P[M = 0] = E_{X∼PX}[π(X)].
This theorem establishes that the discrepancy between θ_∞^ML and θ∗, quantified by the total variation (TV) distance between Pθ∗ and P_{θ_∞^MLE}, is upper bounded by the relative variance of
the missingness mechanism with respect to X ∼ PX . This relative variance serves as a measure
of the level of misspecification to the M(C)AR assumption: a large variance indicates that the
missingness mechanism deviates significantly from being M(C)AR, whereas a small variance
implies it is nearly M(C)AR. In the extreme case where the missingness mechanism is exactly M(C)AR - meaning that ignoring it is valid - we recover the equality θ_∞^ML = θ∗ provided the model is identifiable, that is θ ≠ θ′ ⟹ Pθ ≠ Pθ′.
However, this result has two key limitations: (i) it is unclear whether it extends beyond the
one-dimensional case, where MCAR and MAR are distinct concepts; and (ii) the result fails
entirely when the assumption PX = Pθ∗ no longer holds. In particular, no guarantees can be
derived in scenarios involving data contamination, where the true data distribution PX deviates
from the target distribution Pθ∗ and instead represents a contaminated version of it - this issue
arises even in the absence of missing values.
Theorem 4.2 (Robustness of the MMD limit to M(N)AR and model misspecification). Assuming that PX = Pθ∗, we have:
$$\mathbb{E}_{M\sim P_M}\Big[D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}, P_{\theta^*}^{(M)}\big)\Big] \le 4\cdot \mathbb{E}_{M\sim P_M}\left[\frac{\mathbb{V}_{X\sim P_X}[\pi_M(X)]}{\pi_M^2}\right]. \qquad (7)$$
Furthermore, if PX ∉ {Pθ : θ ∈ Θ}:
$$\mathbb{E}_{M\sim P_M}\Big[D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}, P_X^{(M)}\big)\Big] \le 2\cdot\inf_{\theta\in\Theta}\mathbb{E}_{M\sim P_M}\Big[D^2\big(P_\theta^{(M)}, P_X^{(M)}\big)\Big] + 4\cdot\mathbb{E}_{M\sim P_M}\left[\frac{\mathbb{V}_{X\sim P_X}[\pi_M(X)]}{\pi_M^2}\right]. \qquad (8)$$
This theorem leverages the expected (w.r.t. the missing pattern M ∼ PM) squared MMD distance between the observed marginals P_{θ_∞^MMD}^(M) and P_{θ∗}^(M) as a meaningful metric to assess some notion of distance between θ_∞^MMD and θ∗. While this is not a formal distance between P_{θ_∞^MMD} and P_{θ∗} in the strict mathematical sense, it provides a measure of divergence between them, which is informative as soon as fully observed data points occur with positive probability (i.e. P(M = 0) > 0). Moreover, in the special case of one-dimensional data, the expected squared
MMD distance reduces to the standard MMD distance D2 (Pθ , Pθ′ ) weighted by the probability of
observing the datapoint. Finally, it is worth noting that, under standard regularity conditions
(see, e.g. Theorem 4 in Alquier and Gerber (2024b)), the expected squared MMD distance
between distributions can be related to the Euclidean distance between their parameters. This
further supports the use of this metric as a meaningful proxy for comparing θ∞ MMD and θ ∗ . For
instance, the expected squared MMD distance between two normal distributions N (θ, Id ) and
N (θ′ , Id ) is equivalent (up to a factor) to the weighted Euclidean distance EM [∥θ(M ) − θ′(M ) ∥22 ]
when θ and θ′ are close enough.
Inequality (7) in Theorem 4.2 is noteworthy as it guarantees robustness to any M(N)AR mechanism; when the missingness mechanism is MCAR - ensuring the validity of ignorability - the bound becomes zero, and we recover the consistency of θ_n^MMD towards θ_∞^MMD = θ∗.
This result generalizes the bound established in the one-dimensional case for the MLE, offering
a quantification of the deviation from MCAR based on the variability of the missingness mech-
anism with respect to the observed data. Finally, Inequality (8) is particularly remarkable as
it guarantees robustness to both sources of misspecification and characterizes them separately:
the first term in the bound quantifies robustness to complete-data misspecification (which van-
ishes when PX = Pθ∗ ), while the second term captures robustness to deviation-to-MCAR (which
vanishes when the missingness mechanism is MCAR). Note that this inequality holds without
any assumption on the model. It is also worth noting that the variance appearing in the bounds
could be tightened further, replacing it with an absolute mean deviation over the observed vari-
ables X (m) only. However, we present the bound using the variance for the sake of clarity and
simplicity.
We finally derive, in the one-dimensional case where the expected squared MMD distance can be written as a proper distance, a non-asymptotic inequality on the MMD estimator θ_n^MMD which holds in expectation over the sample S = {(X_i, M_i)}_{1≤i≤n}.
Theorem 4.3 (Robustness of the MMD estimator to MNAR and misspecification when d = 1).
With the notations π(X) = P[M = 0|X] and π = P[M = 0] = EX∼PX [π(X)], we have:
$$\mathbb{E}_{\mathcal S}\Big[D\big(P_{\theta_n^{\mathrm{MMD}}}, P_X\big)\Big] \le \inf_{\theta\in\Theta} D(P_\theta, P_X) + \frac{2\sqrt{\mathbb{V}_{X\sim P_X}[\pi(X)]}}{\pi} + \frac{2\sqrt{2}}{\sqrt{n\cdot\pi}}.$$
This bound offers a detailed breakdown of the three types of errors involved. The first
term represents the misspecification error arising from using an incorrect model. The second
term reflects the error due to incorrectly assuming that the missingness mechanism follows the
M(C)AR assumption. Finally, the third term accounts for the estimation error, which stems
from the finite size of the dataset and the presence of unobserved data points, effectively reducing
the available sample size to nπ.
Theorem 4.4 (Robustness of the MMD limit to M(N)AR and Huber’s contamination model).
We have, if PX = (1 − ϵ)Pθ∗ + ϵQX:
$$\mathbb{E}_{M\sim P_M^*}\Big[D^2\big(P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}, P_{\theta^*}^{(M)}\big)\Big] \le \frac{24\epsilon^2}{1-\epsilon} + 64\cdot\mathbb{E}_{M\sim P_M^*}\left[\frac{1}{\pi_M^{*2}}\right]\left(\frac{\epsilon}{1-\epsilon}\right)^2 + 16\cdot\mathbb{E}_{M\sim P_M^*}\left[\frac{\mathbb{V}_{X\sim P_{\theta^*}}[\pi_M(X)]}{\pi_M^{*2}}\right].$$
This theorem ensures robustness to both M(N)AR data and the presence of outliers, with the expected squared MMD distance bounded by the squared proportion of outliers ϵ² weighted by E_M[π_M^{∗−2}]/(1 − ϵ)², except in cases where the deviation-to-MCAR missingness term is significant and dominates. This result is remarkable, as such robustness is unattainable for the MLE, where even a single outlier can cause the estimate to deteriorate arbitrarily. We emphasize that π_m^∗, and thus P_M^∗, are defined here as expectations w.r.t. the distribution of interest P_θ∗, not under PX. The second term in the bound, E_M[π_M^{∗−2}]ϵ²/(1 − ϵ)², reflects this shift in the marginal of M, while the first term, ϵ²/(1 − ϵ), corresponds to the error in the specification of the marginal of X. We believe that this last term can be tightened to ϵ² and that the factor 1/(1 − ϵ) is just an artifact of the proof, although we are only able to establish this in dimension one, as shown below. This is in any case only a minor improvement, since this model misspecification term is always bounded by the pattern shift error E_M[π_M^{∗−2}]ϵ²/(1 − ϵ)².
In the one-dimensional setting, we can derive a similar bound with tighter constants for the
MMD estimator θnMMD directly:
Theorem 4.5 (Robustness of the MMD estimator to MNAR and Huber’s contamination model
when d = 1). With the notations π(X) = P[M = 0|X] and π ∗ = EX∼Pθ∗ [π(X)], we have for
PX = (1 − ϵ)Pθ∗ + ϵQX :
$$\mathbb{E}_{\mathcal S}\Big[D\big(P_{\theta_n^{\mathrm{MMD}}}, P_{\theta^*}\big)\Big] \le 4\cdot\epsilon + \frac{8\cdot\epsilon}{\pi^*(1-\epsilon)} + \frac{2\sqrt{\mathbb{V}_{X\sim P_{\theta^*}}[\pi(X)]}}{\pi^*} + \frac{2\sqrt{2}}{\sqrt{n\pi^*(1-\epsilon)}}. \qquad (9)$$
The rate achieved by the estimator is max{ϵ/(π∗(1 − ϵ)), V[π(X)]^{1/2}/π∗, (nπ∗(1 − ϵ))^{−1/2}}
with respect to the MMD distance. We will provide explicit rates with respect to the Euclidean
distance for a Gaussian model in the next section. Once again, the first term ϵ corresponds to the
mismatch error between PX and Pθ∗ , and is always dominated by the second term, ϵ/π ∗ (1 − ϵ),
which is the pattern shift error induced by the contamination. The third term measures the
deviation-to-M(C)AR level, while the last one is the convergence rate under M(C)AR using the
effective sample size nπ ∗ (1 − ϵ) from the M(C)AR component.
5 Examples
We investigate in this section explicit applications of our theory for specific examples of
MNAR missingness mechanisms. We provide three of them: the first one involves a truncation
mechanism; the second one is a Huber contamination setting for the missingness mechanism as
explored in Ma et al. (2024); while the third one allows for adversarial contamination of the
missingness mechanism.
Corollary 5.1. Under the setting adopted in Example 5.1, we have when PX = Pθ∗ :
$$D\big(P_{\theta_\infty^{\mathrm{MMD}}}, P_{\theta^*}\big) \le 4\cdot\varepsilon.$$
In the particular case where the model is Gaussian Pθ = N (θ, 1), we have an explicit formula
for the MMD distance between Pθ and Pθ′ using a Gaussian kernel as in (3):
$$D^2(P_\theta, P_{\theta'}) = 2\left(\frac{\gamma^2}{4+\gamma^2}\right)^{\frac{1}{2}}\left(1 - \exp\left(-\frac{(\theta-\theta')^2}{4+\gamma^2}\right)\right),$$
so that
$$|\theta - \theta'| = \sqrt{-(4+\gamma^2)\,\log\left(1 - \frac{1}{2}\left(\frac{4+\gamma^2}{\gamma^2}\right)^{\frac{1}{2}} D^2(P_\theta, P_{\theta'})\right)}.$$
Hence, if the model is well-specified Pθ∗ = PX, Corollary 5.1 becomes, for small values of ε:
$$\big|\theta_\infty^{\mathrm{MMD}} - \theta^*\big| \le \sqrt{-(4+\gamma^2)\,\log\left(1 - 8\left(\frac{4+\gamma^2}{\gamma^2}\right)^{\frac{1}{2}}\varepsilon^2\right)} \underset{\varepsilon\to 0}{\sim} \sqrt{8(4+\gamma^2)\left(\frac{4+\gamma^2}{\gamma^2}\right)^{\frac{1}{2}}}\cdot\varepsilon,$$
and for γ = √2 (which minimizes the above linear factor), we have:
$$\big|\theta_\infty^{\mathrm{MMD}} - \theta^*\big| \le \sqrt{-6\,\log\big(1 - 8\sqrt{3}\,\varepsilon^2\big)} \underset{\varepsilon\to 0}{\sim} \sqrt{48\sqrt{3}}\cdot\varepsilon.$$
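The closed-form expression above is easy to sanity-check numerically; the following Python sketch (illustrative, not from the paper) compares it with a Monte Carlo estimate of D²(Pθ, Pθ′) for the Gaussian kernel:

import numpy as np

def mmd_sq_closed_form(theta, theta_p, gamma):
    # D^2(P_theta, P_theta') for P_theta = N(theta, 1) and k(x, y) = exp(-(x - y)^2 / gamma^2).
    c = gamma ** 2 / (4.0 + gamma ** 2)
    return 2.0 * np.sqrt(c) * (1.0 - np.exp(-(theta - theta_p) ** 2 / (4.0 + gamma ** 2)))

def mmd_sq_monte_carlo(theta, theta_p, gamma, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x, xp = theta + rng.standard_normal(n), theta + rng.standard_normal(n)
    y, yp = theta_p + rng.standard_normal(n), theta_p + rng.standard_normal(n)
    k = lambda a, b: np.exp(-(a - b) ** 2 / gamma ** 2)
    return k(x, xp).mean() - 2.0 * k(x, y).mean() + k(y, yp).mean()

gamma = np.sqrt(2.0)
print(mmd_sq_closed_form(0.0, 0.5, gamma), mmd_sq_monte_carlo(0.0, 0.5, gamma))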
Notice that we have almost the same guarantee for the MLE, which is available in closed-form:
$$\theta_n^{\mathrm{MLE}} = \frac{1}{\sum_i (1 - M_i)}\sum_{i: M_i = 0} X_i,$$
where ϕ is the density of the standard Gaussian and I = (a, b). In the particular case where I = (a, +∞) (left-truncation), we thus have
$$\theta_\infty^{\mathrm{MLE}} = \frac{\phi(a)}{1 - \Phi(a)},$$
which is slightly worse than the MMD estimator by a logarithmic factor, but with a better
constant.
5.2 Huber’s Contamination Mechanism
Example 5.2. We now collect a dataset {Xi}i composed of i.i.d. copies of X and only observe {X_i^(M_i)}i. While we believe that the missing data mechanism is MCAR with constant (w.r.t. X) probability P[M = m|X] = αm a.s. (with Σm αm = 1) and thus ignore the missingness mechanism, the true missingness mechanism is actually M(N)AR, with a proportion ε of outliers in the missingness mechanism coming from a contamination process Q[M = m|X]:
$$P[M = m \mid X = x] = (1 - \varepsilon)\cdot\alpha_m + \varepsilon\cdot Q[M = m \mid X = x] \quad \text{for any } x.$$
In the following we define the expectation E_{M∼(αm)} with respect to the MCAR uncontaminated process, that is E_{M∼(αm)}[f(M)] = Σm f(m)αm.
Theorem 5.2. Under the setting adopted in Example 5.2, assuming PX = (1 − ϵ)Pθ∗ + ϵQX ,
we have:
$$\mathbb{E}_{M\sim(\alpha_m)}\Big[D^2\big(P_{\theta^*}^{(M)}, P_{\theta_\infty^{\mathrm{MMD}}}^{(M)}\big)\Big] \le \frac{24\epsilon^2}{1-\varepsilon} + 8\cdot\mathbb{E}_{M\sim(\alpha_m)}\left[\frac{1}{\alpha_M^2}\right]\cdot\left(\frac{\varepsilon}{1-\varepsilon}\right)^2.$$
Contrary to the general Theorem 4.4, the complete-data contamination ratio ϵ is only present
in the first term accounting for model misspecification error. This is due to a finer analysis of
the proofs in this specific mechanism contamination setting. We believe once again that the
1/(1 − ε) factor in front of the ϵ2 error may be removed, as it is the case in the one-dimensional
setting below. We also provide a non-asymptotic bound in the one-dimensional case:
Theorem 5.3. Under the setting adopted in Example 5.2, assuming PX = (1 − ϵ)Pθ∗ + ϵQX ,
we have:
$$\mathbb{E}_{\mathcal S}\Big[D\big(P_{\theta_n^{\mathrm{MMD}}}, P_X\big)\Big] \le 4\cdot\epsilon + \frac{2\cdot\varepsilon}{\alpha(1-\varepsilon)} + \frac{2\sqrt{2}}{\sqrt{n\alpha(1-\varepsilon)}}.$$
An application of this result in the univariate Gaussian model would lead to a rate of order
max(ε/α(1 − ε), {nα(1 − ε)}−1/2 ) (for |θnMMD − θ∗ |) when ε = ϵ, which aligns with the minimax
optimal (high-probability) rate established in Table 1 of Ma et al. (2024). The two contamina-
tion settings are however different, since we are interested here in the separate contamination
of the complete data marginal distribution and of the missingness mechanism, while Ma et al.
(2024) rather focus on the contamination of the joint distribution. Nevertheless, while our con-
tamination model for ϵ = ε does not directly correspond to the so-called arbitrary contamination
setting of Ma et al. (2024), our framework coincides with their realizable contamination scenario
when ϵ = 0, meaning that PX = Pθ∗. Interestingly, the authors highlighted a surprising result in this case - which seems to be unique to the Gaussian model: they demonstrated that consistency towards θ∗ can still be achieved as n → ∞, even if ε > 0. They also provided a simple estimator - the average of the extreme values - that satisfies such consistency. While this phenomenon is indeed surprising, it comes at a cost: the convergence is logarithmic in the sample size, specifically of order {log(nα(1 − ε))}^{−1/2}. While our MMD estimator does not achieve consistency towards θ∗, its convergence rate towards θ_∞^MMD is of order {nα(1 − ε)}^{−1/2}, thus outperforming the average of extremes estimator in this regard for finite samples. Figure 1 illustrates this behavior
for the two estimators in a Gaussian mean example: while the MMD quickly suffers an error of
ε = 0.1 as n increases, the average of extremes moves very slowly to the true value θ∗ = 0.
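For concreteness, a minimal Python sketch of the Huber-contaminated setup of Figure 1 (X ∼ N(0, 1), MCAR observation probability α = 0.5 contaminated with probability ε = 0.1 by the MNAR rule M = 0 iff X > 0) is given below; the function names and the midpoint-of-range implementation of the average-of-extremes estimator are illustrative assumptions, and the MMD fit itself is omitted:

import numpy as np

def huber_contaminated_mask(X, alpha=0.5, eps=0.1, rng=None):
    # With probability 1 - eps, use the MCAR rule P[M = 0] = alpha;
    # with probability eps, use the MNAR rule M = 0 iff X > 0.
    rng = rng if rng is not None else np.random.default_rng(0)
    mcar = (rng.random(X.shape) > alpha).astype(int)   # M = 1 (missing) with probability 1 - alpha
    mnar = (X <= 0).astype(int)                        # M = 0 iff X > 0
    use_mnar = rng.random(X.shape) < eps
    return np.where(use_mnar, mnar, mcar)

def average_of_extremes(observed):
    # Midpoint of the observed range, as a stand-in for the average-of-extremes estimator.
    return 0.5 * (observed.min() + observed.max())

rng = np.random.default_rng(1)
X = rng.standard_normal(10_000)            # theta_star = 0, no contamination of the data model
M = huber_contaminated_mask(X, rng=rng)
obs = X[M == 0]
print("complete-case mean:  ", obs.mean())               # biased at roughly the eps level
print("average of extremes: ", average_of_extremes(obs)) # approaches 0 only very slowly with n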
Furthermore, another advantage of our estimator is its robustness to adversarial contamina-
tion in the missingness mechanism, as established in the next subsection.
5.3 Adversarial Contamination Mechanism
Example 5.3. Let us now introduce a more sophisticated adversarial contamination setting for the missingness mechanism in the one-dimensional case: from the dataset {Xi}i composed of i.i.d. copies of X ∼ PX, only a subsample {Xi : Mi = 0}i is observed according to an MCAR missingness mechanism P[M = 0|X = x] = α for any x. However, the missingness indicators Mi have been modified by an adversary, and we finally observe {Xi : M̃i = 0}i, where at most |{i : Mi = 0}| · ε/α (with 0 < ε < α) of the indicators M̃i differ from the original Mi.
The choice |{i : Mi = 0}| · ε/α with ε < α is such that the adversary cannot remove all the examples from the finally observed dataset by switching all the Mi's equal to 0 to the value M̃i = 1, and such that there are on average (over the sample {Mi}i) at most nε contaminated values (recall
is valid for any finite value of the sample size n:
Theorem 5.4. Under the setting adopted in Example 5.3, assuming PX = (1 − ϵ)Pθ∗ + ϵQX ,
we have:
$$\mathbb{E}_{\mathcal S}\Big[D\big(P_{\theta^*}, P_{\theta_n^{\mathrm{MMD}}}\big)\Big] \le 4\cdot\epsilon + \frac{6\cdot\varepsilon}{\alpha-\varepsilon} + \frac{2\sqrt{2}}{\sqrt{n\cdot\alpha}}.$$
Note that the rate is different between Huber’s and adversarial contamination frameworks,
which are by nature very different, and two terms vary: (i) First, the contamination error term,
which slightly deteriorates and becomes ε/(α − ε) instead of ε/α(1 − ε), and accounts for the
expected number of outliers relative to the size of the uncontaminated observed dataset. In
both settings, there are (on average) nε outliers, but while there are nα(1 − ε) uncontaminated
observed examples in Huber’s model, this number becomes n(α − ε) in the adversarial setting.
(ii) Second, the rate of convergence slightly improves and becomes {nα}−1/2 instead of {nα(1 −
ε)}−1/2 . This difference arises because, in Huber’s model, only a fraction 1 − ε of the total
number n of datapoints has been generated following the MCAR mechanism of interest, making
the effective MCAR sample size nα(1 − ε). The contaminated fraction ε follows a different
generating process and does not contribute to the MCAR sample. This is not the case in
the adversarial model, where all of the initially sampled datapoints come from the MCAR
mechanism.
Consequences of this subtle difference can be important in the study of minimax theory for
population mean estimation problems, with completely different minimax rates when compared
to Huber’s setting. It is very unlikely that consistency towards θ∗ is possible any longer in the
realizable setting where PX = Pθ∗ , and very likely that optimal estimators for Huber’s model
behave extremely bad in this adversarial model. For instance, the average of extreme procedure
becomes a terrible estimator under adversarial contamination of the mechanism: it is sufficient
for the adversary to remove the ε/α · |{i : Mi = 0}| − 1 smallest points and observe the largest
one for a symmetric distribution to mislead the average of extremes estimator, no matter the
value of the contamination ratio. At the opposite, the MMD estimator still behaves correctly
in this setting, with an error which is (almost) linear in ε. Figure 1 illustrates this behavior:
while the MMD is stable around ε = 0.1, the average of extremes moves farther away from the
true value of the mean as n increases. The minimax optimality in the adversarial contamination
setting is an open question.
Figure 1: MLE, MMD (with Gaussian kernel) and Average of Extremes Estimator for a Gaus-
sian mean estimation problem following Example 5.2 (Huber contamination) in light blue and
Example 5.3 (Adversarial Contamination) in red. The shaded regions correspond to the empir-
ical 50% quantile range obtained by rerunning the experiment 1000 times for each n. In both
settings, we simulate X ∼ N (0, 1) without contamination. For the case of Huber contamination,
the MCAR mechanism with α = 0.5 probability of missingness is contaminated by an MNAR
mechanism for which M = 0 iff X > 0. For the adversarial mechanism, each Mi is first gener-
ated following an MCAR mechanism with α = 0.5 probability of missingness, and in a second
step, the Mi's corresponding to the 2ε · |{i : Mi = 0}| − 1 smallest values are set to 1, while the Mi corresponding to the largest value is set to 0. In both cases, ε = 0.1. Implementation was performed in R with the help of the R package regMMD (Alquier and Gerber, 2024a).
6 Conclusion
In this paper, we adapted the concept of MMD estimation to missing values in order to obtain a parametric estimation procedure with provable robustness against misspecifications of the data model as well as of the missingness mechanism.
There are several directions for further improvements. Though we introduced a convenient
SGD algorithm to use our methodology in practice, our considerations were mostly of theoretical
nature, and more empirical analysis of the performance of the estimator should be done. In
addition, there is significant room for further development of the algorithm itself by leveraging
techniques exploited in the complete data setting. For instance, quasi-Monte Carlo methods, as
explored in Niu et al. (2023); Alquier et al. (2023), could enhance efficiency. Another promising
improvement involves replacing standard gradient methods with natural gradient methods, as
done in Briol et al. (2019), where the estimation problem is reformulated as a gradient flow using
the statistical Riemannian geometry induced by the MMD metric on the parameter space. Such
a stochastic natural gradient algorithm could lead to substantial computational gains compared
to traditional stochastic gradient descent.
Finally, as also mentioned in the introduction, our approach might also be combined with weighting approaches for M-estimators under missing values, which might render the estimator consistent under a weaker condition on the missingness mechanism than MCAR. However, in this case the robustness properties would need to be reevaluated, and the advantages of this approach would need to be carefully weighed against the more complex estimation procedure.
Acknowledgements
This work is part of the DIGPHAT project which was supported by a grant from the French
government, managed by the National Research Agency (ANR), under the France 2030 program,
with reference ANR-22-PESN-0017. BECA acknowledges funding from the ANR grant project
BACKUP ANR-23-CE40-0018-01.
References
Alquier, P., Chérief-Abdellatif, B.-E., Derumigny, A., and Fermanian, J.-D. (2023). Estimation of copulas
via maximum mean discrepancy. Journal of the American Statistical Association, 118(543):1997–2012.
Alquier, P. and Gerber, M. (2024a). regMMD: Robust Regression and Estimation Through Maximum
Mean Discrepancy Minimization. R package version 0.0.1.
Alquier, P. and Gerber, M. (2024b). Universal robust regression via maximum mean discrepancy.
Biometrika, 111(1):71–92.
Bharti, A., Briol, F.-X., and Pedersen, T. (2021). A general method for calibrating stochastic radio
channel models with kernels. IEEE transactions on antennas and propagation, 70(6):3986–4001.
Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying MMD GANs. In
International Conference on Learning Representations.
Briol, F.-X., Barp, A., Duncan, A. B., and Girolami, M. (2019). Statistical inference for generative
models with maximum mean discrepancy. Preprint arXiv:1906.05944.
Cantoni, E. and de Luna, X. (2020). Semiparametric inference with missing data: Robustness to outliers
and model misspecification. Econometrics and Statistics, 16:108–120.
Chao, M. T. and Strawderman, W. E. (1972). Negative moments of positive random variables. Journal
of the American Statistical Association, 67(338):429–431.
Chérief-Abdellatif, B.-E. and Alquier, P. (2020). MMD-Bayes: Robust bayesian estimation via maximum
mean discrepancy. In Symposium on Advances in Approximate Bayesian Inference, pages 1–21. PMLR.
Chérief-Abdellatif, B.-E. and Alquier, P. (2022). Finite sample properties of parametric MMD estimation:
robustness to misspecification and dependence. Bernoulli, 28(1):181–213.
Malinsky, D., Shpitser, I., and Tchetgen Tchetgen, E. J. (2022). Semiparametric inference for nonmonotone missing-not-at-random data: The no self-censoring model. Journal of the American Statistical Association, 117(539):1415–1423.
Dellaporta, C., Knoblauch, J., Damoulas, T., and Briol, F.-X. (2022). Robust bayesian inference for
simulator-based models via the MMD posterior bootstrap. In International Conference on Artificial
Intelligence and Statistics, pages 943–970. PMLR.
Dussap, B., Blanchard, G., and Chérief-Abdellatif, B.-E. (2023). Label shift quantification with robust-
ness guarantees via distribution feature matching. In Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, pages 69–85. Springer.
Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via
maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906.
Farewell, D. M., Daniel, R. M., and Seaman, S. R. (2021). Missing at random: a stochastic process
perspective. Biometrika, 109(1):227–241.
Frahm, G., Nordhausen, K., and Oja, H. (2020). M-estimation with incomplete and dependent multi-
variate data. Journal of Multivariate Analysis, 176:104569.
Golden, R. M., Henley, S. S., White, H., and Kashner, T. M. (2019). Consequences of model misspecifi-
cation for maximum likelihood estimation with missing data. Econometrics, 7(3).
Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. J. (2007). A kernel method for the
two-sample-problem. In Advances in neural information processing systems, pages 513–520.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample
test. The Journal of Machine Learning Research, 13(1):723–773.
Hsing, T. and Eubank, R. (2015). Theoretical Foundations of Functional Data Analysis, with an Intro-
duction to Linear Operators. Wiley Series in Probability and Statistics. Wiley.
Iyer, A., Nath, S., and Sarawagi, S. (2014). Maximum mean discrepancy for class ratio estimation:
Convergence bounds and kernel selection. In International Conference on Machine Learning, pages
530–538. PMLR.
Kajihara, T., Kanagawa, M., Yamazaki, K., and Fukumizu, K. (2018). Kernel recursive abc: Point
estimation with intractable likelihood. In International Conference on Machine Learning, pages 2400–
2409. PMLR.
Key, O., Gretton, A., Briol, F.-X., and Fernandez, T. (2021). Composite goodness-of-fit tests with kernels.
arXiv preprint arXiv:2111.10275.
Legramanti, S., Durante, D., and Alquier, P. (2025). Concentration of discrepancy-based approximate
bayesian computation via rademacher complexity. The Annals of Statistics, 53(1):37–60.
Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. (2017). Mmd gan: Towards deeper
understanding of moment matching network. Advances in neural information processing systems, 30.
Li, L., Shen, C., Li, X., and Robins, J. M. (2011). On weighting approaches for missing data. Stat
Methods Med Res, 22(1):14–30.
Li, Y., Swersky, K., and Zemel, R. (2015). Generative moment matching networks. In International
conference on machine learning, pages 1718–1727.
Ma, T., Verchand, K. A., Berrett, T. B., Wang, T., and Samworth, R. J. (2024). Estimation beyond
missing (completely) at random. Preprint arXiv:2410.10704.
Mealli, F. and Rubin, D. B. (2015). Clarifying missing at random and related definitions, and implications
when coupled with exchangeability. Biometrika, 102(4):995–1000.
Mitrovic, J., Sejdinovic, D., and Teh, Y.-W. (2016). DR-ABC: Approximate Bayesian computation
with kernel-based distribution regression. In International Conference on Machine Learning, pages
1482–1491.
Molenberghs, G., Beunckens, C., Sotto, C., and Kenward, M. G. (2008). Every missingness not at random
model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society.
Series B (Statistical Methodology), 70(2):371–388.
Muandet, K., Fukumizu, K., Sriperumbudur, B., and Schölkopf, B. (2017). Kernel mean embedding of
distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1–2):1–141.
Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of
econometrics, 4:2111–2245.
Niu, Z., Meier, J., and Briol, F.-X. (2023). Discrepancy-based inference for intractable generative models
using quasi-monte carlo. Electronic Journal of Statistics, 17(1):1411–1456.
Näf, J., Scornet, E., and Josse, J. (2024). What is a good imputation under MAR missingness? Preprint
arXiv:2403.19196.
Pacchiardi, L., Khoo, S., and Dutta, R. (2024). Generalized bayesian likelihood-free inference. Electronic
Journal of Statistics, 18(2):3628–3686.
Park, M., Jitkrittum, W., and Sejdinovic, D. (2016). K2-ABC: Approximate Bayesian computation with
kernel embeddings. In Artificial intelligence and statistics, pages 398–407.
Ren, B., Lipsitz, S. R., Weiss, R. D., and Fitzmaurice, G. M. (2023). Multiple imputation for non-
monotone missing not at random data using the no self-censoring model. Stat Methods Med Res,
32(10):1973–1993.
Robins, J. M. and Gill, R. D. (1997). Non-response models for the analysis of non-monotone ignorable
missing data. Stat Med, 16(1-3):39–56.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489.
Seaman, S., Galati, J., Jackson, D., and Carlin, J. (2013). What is meant by “Missing at Random”?
Statistical Science, 28(2):257–268.
Shpitser, I. (2016). Consistent estimation of functions of data missing non-monotonically and not at
random. In Advances in Neural Information Processing Systems, volume 29.
Sportisse, A., Marbac, M., Laporte, F., Celeux, G., Boyer, C., Josse, J., and Biernacki, C. (2024).
Model-based clustering with missing not at random data. Statistics and Computing, 34(4):135.
Sun, B. and Tchetgen, E. J. T. (2018). On inverse probability weighting for nonmonotone missing at
random data. Journal of the American Statistical Association, 113(521):369–379.
Sutherland, D. J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., and Gretton, A. (2016).
Generative models and model criticism via optimized maximum mean discrepancy. In International
Conference on Learning Representations.
Takai, K. and Kano, Y. (2013). Asymptotic inference with incomplete data. Communications in Statistics
- Theory and Methods, 42(17):3174–3190.
Teymur, O., Gorham, J., Riabiz, M., and Oates, C. (2021). Optimal quantisation of probability measures
using maximum mean discrepancy. In International Conference on Artificial Intelligence and Statistics,
pages 1027–1035.
Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specifi-
cation. Stat Methods Med Res, 16(3):219–242.
Van Buuren, S. (2018). Flexible Imputation of Missing Data. Second Edition. Chapman & Hall/CRC
Press.
Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. (2013). Domain adaptation under target and
conditional shift. In International conference on machine learning, pages 819–827.
A Proofs
A.1 Proofs from Section 3
In order to simplify the proofs, we first introduce the following notations:
$$\widehat{L}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(X_i, M_i; \theta) = \frac{1}{n}\sum_{i=1}^n D^2\Big(P_\theta^{(M_i)}, \delta_{\{X_i^{(M_i)}\}}\Big) - \frac{1}{n}\sum_{i=1}^n \big\|\Phi\big(X_i^{(M_i)}\big)\big\|_{\mathcal H}^2$$
and
$$L(\theta) = \mathbb{E}_{(X,M)}[\ell(X, M; \theta)] = \mathbb{E}_{(X,M)}\Big[D^2\big(P_\theta^{(M)}, \delta_{\{X^{(M)}\}}\big)\Big] - \mathbb{E}_{(X,M)}\Big[\big\|\Phi\big(X^{(M)}\big)\big\|_{\mathcal H}^2\Big],$$
where
$$\ell(x, m; \theta) = \big\langle \Phi(P_\theta^{(m)}), \Phi(P_\theta^{(m)})\big\rangle_{\mathcal H} - 2\big\langle \Phi(P_\theta^{(m)}), \Phi(x^{(m)})\big\rangle_{\mathcal H}.$$
Hence, we have
$$\widehat{\theta}_n = \arg\min_{\theta\in\Theta}\widehat{L}_n(\theta) = \arg\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n D^2\Big(P_\theta^{(M_i)}, \delta_{\{X_i^{(M_i)}\}}\Big)$$
and
$$\theta_\infty^{\mathrm{MMD}} = \arg\min_{\theta\in\Theta} L(\theta) = \arg\min_{\theta\in\Theta} \mathbb{E}_M\Big[D^2\big(P_\theta^{(M)}, P_{X|M}^{(M)}\big)\Big].$$
Lemma A.1. We have
$$\sup_{\theta\in\Theta}\big|\widehat{L}_n(\theta) - L(\theta)\big| \xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}} 0.$$
Proof. For any θ ∈ Θ, we can write
$$\begin{aligned}\widehat{L}_n(\theta) - L(\theta) &= \sum_{m\in\{0,1\}^d}\left\{\frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i = m) - P[M = m]\right\}\big\|\Phi(P_\theta^{(m)})\big\|_{\mathcal H}^2\\ &\quad + 2\sum_{m\in\{0,1\}^d}\left\langle \Phi(P_\theta^{(m)}),\; P[M=m]\,\Phi\big(P_{X|M=m}^{(m)}\big) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i = m)\,\Phi\big(X_i^{(m)}\big)\right\rangle_{\mathcal H}.\end{aligned}$$
Then, using the triangle inequality, Cauchy-Schwarz’s inequality and the boundedness of the
kernel:
$$\begin{aligned}\sup_{\theta\in\Theta}\big|\widehat{L}_n(\theta) - L(\theta)\big| &\le \sum_{m\in\{0,1\}^d}\left|P[M=m] - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\right|\,\sup_{\theta\in\Theta}\big\|\Phi(P_\theta^{(m)})\big\|_{\mathcal H}^2\\ &\quad + 2\sum_{m\in\{0,1\}^d}\sup_{\theta\in\Theta}\big\|\Phi(P_\theta^{(m)})\big\|_{\mathcal H}\cdot\left\|P[M=m]\,\Phi\big(P_{X|M=m}^{(m)}\big) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\,\Phi\big(X_i^{(m)}\big)\right\|_{\mathcal H}\\ &\le \underbrace{\sum_{m}\left|P[M=m] - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\right|}_{\xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}}\;0} + 2\underbrace{\sum_{m}\left\|P[M=m]\,\Phi\big(P_{X|M=m}^{(m)}\big) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\,\Phi\big(X_i^{(m)}\big)\right\|_{\mathcal H}}_{\xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}}\;0},\end{aligned}$$
where the two consistency properties in the last line are direct consequences of the strong law
of large numbers. This concludes the proof.
We also provide a uniform law of large numbers for second-order derivatives that will be
useful to establish the asymptotic normality of the estimator.
Lemma A.2. If Conditions 3.4 and 3.5 are fulfilled, then the Hessians of both the empirical loss L̂n and of L exist on K, and each coefficient ∂²L̂n/∂θk∂θj almost surely converges uniformly to ∂²L/∂θk∂θj, i.e. for any pair (j, k),
$$\sup_{\theta\in K}\left|\frac{\partial^2 \widehat{L}_n(\theta)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\theta)}{\partial\theta_k\,\partial\theta_j}\right| \xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}} 0.$$
Proof. For any pair (j, k),
$$\begin{aligned}\sup_{\theta\in K}\left|\frac{\partial^2 \widehat{L}_n(\theta)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\theta)}{\partial\theta_k\,\partial\theta_j}\right| &= \sup_{\theta\in K}\Bigg|\sum_{m\in\{0,1\}^d}\left(\frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m) - P[M=m]\right) 2\left(\left\langle \frac{\partial^2 \Phi(P_\theta^{(m)})}{\partial\theta_k\,\partial\theta_j}, \Phi(P_\theta^{(m)})\right\rangle_{\mathcal H} + \left\langle \frac{\partial \Phi(P_\theta^{(m)})}{\partial\theta_j}, \frac{\partial \Phi(P_\theta^{(m)})}{\partial\theta_k}\right\rangle_{\mathcal H}\right)\\ &\qquad\qquad + 2\sum_{m\in\{0,1\}^d}\left\langle \frac{\partial^2 \Phi(P_\theta^{(m)})}{\partial\theta_k\,\partial\theta_j},\; P[M=m]\,\Phi\big(P_{X|M=m}^{(m)}\big) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\,\Phi\big(X_i^{(m)}\big)\right\rangle_{\mathcal H}\Bigg|\\ &\le 2\sum_{m\in\{0,1\}^d}\underbrace{\left|P[M=m] - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\right|}_{\xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}}\;0}\cdot\underbrace{\sup_{\theta\in K}\left\{\left\|\frac{\partial^2 \Phi(P_\theta^{(m)})}{\partial\theta_k\,\partial\theta_j}\right\|_{\mathcal H}\cdot\big\|\Phi(P_\theta^{(m)})\big\|_{\mathcal H} + \left\|\frac{\partial \Phi(P_\theta^{(m)})}{\partial\theta_j}\right\|_{\mathcal H}\cdot\left\|\frac{\partial \Phi(P_\theta^{(m)})}{\partial\theta_k}\right\|_{\mathcal H}\right\}}_{<+\infty}\\ &\quad + 2\sum_{m\in\{0,1\}^d}\underbrace{\sup_{\theta\in K}\left\|\frac{\partial^2 \Phi(P_\theta^{(m)})}{\partial\theta_k\,\partial\theta_j}\right\|_{\mathcal H}}_{<+\infty}\cdot\underbrace{\left\|P[M=m]\,\Phi\big(P_{X|M=m}^{(m)}\big) - \frac{1}{n}\sum_{i=1}^n \mathbb{1}(M_i=m)\,\Phi\big(X_i^{(m)}\big)\right\|_{\mathcal H}}_{\xrightarrow[n\to+\infty]{P_{X,M}\text{-a.s.}}\;0},\end{aligned}$$
where we used the assumptions to ensure the existence of the involved quantities and the bound-
edness of the supremum in the last line. Note that the uniform boundedness of the first-order
derivative follows from the uniform boundedness of the second-order derivative and the fact that
K is compact.
Proof of Theorem 3.1. Invoking Theorem 2.1 in Newey and McFadden (1994), Assumptions 3.1 and 3.2 along with Lemma A.1 imply the strong consistency of the minimizer θ̂n of L̂n towards the unique minimizer of L.
that θ_n^MMD belongs to such a neighborhood for n large enough. As Assumption 3.3 holds, the first-order condition is
$$0 = \nabla_\theta \widehat{L}_n\big(\theta_n^{\mathrm{MMD}}\big) = \nabla_\theta \widehat{L}_n\big(\theta_\infty^{\mathrm{MMD}}\big) + \nabla^2_{\theta,\theta}\widehat{L}_n(\bar\theta_n)\,\big(\theta_n^{\mathrm{MMD}} - \theta_\infty^{\mathrm{MMD}}\big), \qquad (10)$$
where θ̄_n is a random vector whose components lie between those of θ_∞^MMD and θ_n^MMD. Let us now study the asymptotic behavior of the Hessian matrix H_n = ∇²_{θ,θ}L̂_n(θ̄_n) and of ∇_θL̂_n(θ_∞^MMD).
Consistency of H_n: Notice first that, as θ̄_n lies between θ_n^MMD and θ_∞^MMD componentwise, θ̄_n → θ_∞^MMD (P_{X,M}-a.s.) as n → +∞, and the second-order derivatives of L̂_n exist at θ̄_n ∈ K for n large enough (which ensures the existence of H_n). For any pair (j, k),
$$\left|\frac{\partial^2 \widehat{L}_n(\bar\theta_n)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\theta_\infty^{\mathrm{MMD}})}{\partial\theta_k\,\partial\theta_j}\right| \le \left|\frac{\partial^2 \widehat{L}_n(\bar\theta_n)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\bar\theta_n)}{\partial\theta_k\,\partial\theta_j}\right| + \left|\frac{\partial^2 L(\bar\theta_n)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\theta_\infty^{\mathrm{MMD}})}{\partial\theta_k\,\partial\theta_j}\right| \le \sup_{\theta\in K}\left|\frac{\partial^2 \widehat{L}_n(\theta)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\theta)}{\partial\theta_k\,\partial\theta_j}\right| + \left|\frac{\partial^2 L(\bar\theta_n)}{\partial\theta_k\,\partial\theta_j} - \frac{\partial^2 L(\theta_\infty^{\mathrm{MMD}})}{\partial\theta_k\,\partial\theta_j}\right|.$$
According to Lemma A.2 (3.4 and 3.5 are satisfied), the first term in the bound is going to zero
almost surely, as well as the second one due to the continuity of ∂ 2 L(·; )/∂θk ∂θj at θ∞
MMD (3.4).
Asymptotic normality of $\nabla_\theta\widehat L_n(\theta_\infty^{MMD})$: We start by noticing that, for any $\theta$ and any $j$,
$$\frac{\partial\widehat L_n(\theta)}{\partial\theta_j} - \frac{\partial L(\theta)}{\partial\theta_j} = \frac{1}{n}\sum_{i=1}^n\frac{\partial\ell(X_i, M_i;\theta)}{\partial\theta_j} - \frac{\partial\,\mathbb E_{(X,M)}\left[\ell(X, M;\theta)\right]}{\partial\theta_j}.$$
Since $\nabla_\theta L(\theta_\infty^{MMD}) = 0$, $\nabla_\theta\widehat L_n(\theta_\infty^{MMD})$ is an average of i.i.d. centered terms, and the central limit theorem yields the convergence in distribution of $\sqrt n\,\nabla_\theta\widehat L_n(\theta_\infty^{MMD})$ to a centered Gaussian vector whose covariance matrix decomposes as $\Sigma_1 + \Sigma_2$.
Finally, as $B$ is nonsingular, the matrix $H_n$ is a.s. invertible for $n$ sufficiently large, and using Slutsky's lemma in Equation (10), we get
$$\sqrt n\left(\theta_n^{MMD} - \theta_\infty^{MMD}\right) = -H_n^{-1}\,\sqrt n\,\nabla_\theta\widehat L_n\big(\theta_\infty^{MMD}\big) \xrightarrow[n\to+\infty]{\ \mathcal L\ } \mathcal N\left(0,\ B^{-1}\Sigma_1 B^{-1} + B^{-1}\Sigma_2 B^{-1}\right).$$
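In practice, the limiting covariance $B^{-1}\Sigma_1 B^{-1} + B^{-1}\Sigma_2 B^{-1}$ is typically approximated by a plug-in (sandwich) estimator. The sketch below illustrates only this generic plug-in step; it assumes the user supplies per-observation gradients and an averaged Hessian of the loss at the estimate (hypothetical inputs, not produced here), and it estimates the total covariance $\Sigma_1 + \Sigma_2$ without separating the two contributions.

```python
# Minimal plug-in ("sandwich") estimate of the asymptotic covariance
# B^{-1} (Sigma_1 + Sigma_2) B^{-1}, assuming the user supplies
# per-observation gradients g_i = grad_theta ell(X_i, M_i; theta_hat)
# (an n x p array) and the averaged Hessian B_hat (p x p) of the loss.
import numpy as np

def sandwich_covariance(per_sample_grads, hessian_hat):
    g = np.asarray(per_sample_grads)
    B_hat = np.asarray(hessian_hat)
    # Empirical covariance of the gradients estimates Sigma = Sigma_1 + Sigma_2.
    Sigma_hat = np.cov(g, rowvar=False)
    B_inv = np.linalg.inv(B_hat)
    return B_inv @ Sigma_hat @ B_inv   # covariance of sqrt(n)(theta_n - theta_inf)
```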
$$\mathrm{KL}\left(P_{X|M=0}\,\middle\|\,P_X\right) \le \mathbb E_{X\sim P_X}\left[\frac{\pi(X)-\pi}{\pi}\cdot\frac{\pi(X)}{\pi}\right] = \mathbb E_{X\sim P_X}\left[\left(\frac{\pi(X)}{\pi}-1\right)^2\right] = \mathbb E_{X\sim P_X}\left[\frac{\pi(X)^2-\pi^2}{\pi^2}\right] = \frac{\mathbb V_{X\sim P_X}[\pi(X)]}{\pi^2}.$$
Then, using respectively the triangle inequality, Pinsker's inequality and the definition of $\theta_\infty^{MLE}$:
$$\mathrm{TV}\left(P_{\theta^*}, P_{\theta_\infty^{MLE}}\right) \le \mathrm{TV}\left(P_{X|M=0}, P_{\theta^*}\right) + \mathrm{TV}\left(P_{X|M=0}, P_{\theta_\infty^{MLE}}\right)$$
$$\le \sqrt{\tfrac12\,\mathrm{KL}\left(P_{X|M=0}\,\middle\|\,P_{\theta^*}\right)} + \sqrt{\tfrac12\,\mathrm{KL}\left(P_{X|M=0}\,\middle\|\,P_{\theta_\infty^{MLE}}\right)}$$
$$\le \sqrt{2\,\mathrm{KL}\left(P_{X|M=0}\,\middle\|\,P_{\theta^*}\right)} = \sqrt{2\,\mathrm{KL}\left(P_{X|M=0}\,\middle\|\,P_X\right)} \le \sqrt{\frac{2\,\mathbb V_{X\sim P_X}[\pi(X)]}{\pi^2}},$$
which ends the proof.
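The bound just obtained depends only on the first two moments of $\pi(X)$. As an illustration (the mechanism and data distribution below are our arbitrary choices, not ones from the paper), it can be evaluated by Monte Carlo for a logistic missingness probability:

```python
# Monte Carlo evaluation of the bound sqrt(2 * Var[pi(X)]) / E[pi(X)]
# appearing above, for an illustrative logistic missingness probability
# pi(x) = sigmoid(a + b * x) and X ~ N(0, 1) (both choices are arbitrary).
import numpy as np

def tv_bound(a=0.0, b=0.5, n_mc=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_mc)
    pi_x = 1.0 / (1.0 + np.exp(-(a + b * x)))   # P[M = 0 | X = x]
    pi_bar = pi_x.mean()                        # P[M = 0]
    return np.sqrt(2.0 * pi_x.var()) / pi_bar

print(tv_bound(b=0.0))   # MCAR case: pi(X) is constant, so the bound is 0
print(tv_bound(b=0.5))   # mild deviation from MCAR: small, nonzero bound
```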
Proof of Theorem 4.2. Denoting by $\pi_m(X) = \mathbb P[M=m\mid X]$ the missingness mechanism and by $\pi_m = \mathbb P[M=m]$, we start from the definition of the MMD distance: for any pattern $m$,
$$\mathbb D\left(P_X^{(m)}, P_{X|M=m}^{(m)}\right) = \left\|\mathbb E_{X^{(m)}\sim P_X^{(m)}}\left[k(X^{(m)},\cdot)\right] - \mathbb E_{X^{(m)}\sim P_{X|M=m}^{(m)}}\left[k(X^{(m)},\cdot)\right]\right\|_{\mathcal H}$$
$$= \left\|\mathbb E_{X\sim P_X}\left[k(X^{(m)},\cdot)\right] - \mathbb E_{X\sim P_{X|M=m}}\left[k(X^{(m)},\cdot)\right]\right\|_{\mathcal H}$$
$$= \left\|\mathbb E_{X\sim P_X}\left[k(X^{(m)},\cdot)\right] - \mathbb E_{X\sim P_X}\left[\frac{\pi_m(X)}{\pi_m}\,k(X^{(m)},\cdot)\right]\right\|_{\mathcal H}$$
$$= \left\|\mathbb E_{X\sim P_X}\left[\frac{\pi_m(X) - \pi_m}{\pi_m}\cdot k(X^{(m)},\cdot)\right]\right\|_{\mathcal H}$$
$$\le \mathbb E_{X\sim P_X}\left[\frac{|\pi_m(X) - \pi_m|}{\pi_m}\cdot\left\|k(X^{(m)},\cdot)\right\|_{\mathcal H}\right] \le \frac{\mathbb E_{X\sim P_X}\left[|\pi_m(X) - \pi_m|\right]}{\pi_m} \le \frac{\sqrt{\mathbb V_{X\sim P_X}[\pi_m(X)]}}{\pi_m}.$$
In the case where $P_X = P_{\theta^*}$, we can conclude using the triangle inequality and the definition of $\theta_\infty^{MMD}$:
$$\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_{\theta^*}^{(M)}\right)\right] \le 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_{X|M}^{(M)}\right)\right] + 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{X|M}^{(M)}, P_{\theta^*}^{(M)}\right)\right]$$
$$\le 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_{X|M}^{(M)}\right)\right] + 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{X|M}^{(M)}, P_{\theta^*}^{(M)}\right)\right]$$
$$= 4\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_X^{(M)}, P_{X|M}^{(M)}\right)\right] \le 4\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb V_{X\sim P_X}[\pi_M(X)]}{\pi_M^2}\right],$$
and taking the infimum in the r.h.s. over θ ∈ Θ ends the proof.
Proof of Theorem 4.3. In this proof, $P_n$ denotes the empirical distribution of the complete cases, $P_n = \frac{1}{|\{i: M_i=0\}|}\sum_{i: M_i=0}\delta_{X_i}$. We implicitly assume from now on that the quantity $|\{i: M_i=0\}|$ in the denominator is actually equal to $\max(1, |\{i: M_i=0\}|)$ so that $P_n$ is well-defined. We then have, for any $\theta\in\Theta$:
$$\mathbb D\left(P_{\theta_n^{MMD}}, P_X\right) \le \mathbb D\left(P_{\theta_n^{MMD}}, P_n\right) + \mathbb D\left(P_X, P_n\right) \le \mathbb D\left(P_\theta, P_n\right) + \mathbb D\left(P_X, P_n\right)$$
$$\le \mathbb D\left(P_\theta, P_X\right) + 2\,\mathbb D\left(P_X, P_n\right) \le \mathbb D\left(P_\theta, P_X\right) + 2\,\mathbb D\left(P_X, P_{X|M=0}\right) + 2\,\mathbb D\left(P_{X|M=0}, P_n\right)$$
$$\le \mathbb D\left(P_\theta, P_X\right) + \frac{2\sqrt{\mathbb V_{X\sim P_X}[\pi(X)]}}{\pi} + 2\,\mathbb D\left(P_{X|M=0}, P_n\right),$$
where the bound on the second term directly follows from the proof of Theorem 4.2. The third
term can be controlled in expectation as follows. We first write:
$$\mathbb D^2\left(P_{X|M=0}, P_n\right) = \left\|\mathbb E_{X\sim P_{X|M=0}}[\Phi(X)] - \frac{1}{|\{i:M_i=0\}|}\sum_{i:M_i=0}\Phi(X_i)\right\|_{\mathcal H}^2$$
$$= \left\|\frac{1}{|\{i:M_i=0\}|}\sum_{i:M_i=0}\left\{\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\}\right\|_{\mathcal H}^2$$
$$= \frac{1}{|\{i:M_i=0\}|^2}\sum_{i:M_i=0}\left\|\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\|_{\mathcal H}^2$$
$$\qquad + \frac{2}{|\{i:M_i=0\}|^2}\sum_{i<j:\,M_i=M_j=0}\left\langle\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)],\ \Phi(X_j) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\rangle_{\mathcal H}.$$
The key point now lies in the fact that conditional on {Mi }1≤i≤n , all the Xi ’s such that Mi =
0 are i.i.d. with common distribution PX|M =0 . Then, in expectation over the sample S =
{(Xi , Mi )}1≤i≤n , the first term in the last line becomes:
$$\mathbb E_S\left[\frac{1}{|\{i:M_i=0\}|^2}\sum_{i:M_i=0}\left\|\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\|_{\mathcal H}^2\right]$$
$$= \mathbb E_{\{M_i\}_{1\le i\le n}\sim P_M}\left[\frac{1}{|\{i:M_i=0\}|^2}\sum_{i:M_i=0}\mathbb E_{X_i\sim P_{X|M=0}}\left\|\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\|_{\mathcal H}^2\right]$$
$$= \mathbb E_{\{M_i\}_{1\le i\le n}\sim P_M}\left[\frac{1}{|\{i:M_i=0\}|}\cdot\mathbb E_{X\sim P_{X|M=0}}\left\|\Phi(X) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\|_{\mathcal H}^2\right]$$
$$\le \mathbb E_{\{M_i\}_{1\le i\le n}\sim P_M}\left[\frac{1}{|\{i:M_i=0\}|}\cdot\mathbb E_{X\sim P_{X|M=0}}\left[\|\Phi(X)\|_{\mathcal H}^2\right]\right] \le \mathbb E_{\{M_i\}_{1\le i\le n}\sim P_M}\left[\frac{1}{|\{i:M_i=0\}|}\right] \le \frac{2}{n\cdot\mathbb P[M=0]},$$
where the first inequality comes from the standard bound of the variance by the second-order moment, the second one from the boundedness of the kernel, and the last one from a standard bound on the negative moments of a binomial random variable (Chao and Strawderman, 1972), namely $\mathbb E[1/(1+\mathrm{Bin}(n,p))] \le 1/(np)$. Note that we also used the inequality $1/x \le 2/(1+x)$, valid for $x \ge 1$.
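Both elementary inequalities invoked here are easy to check numerically; the following short script (purely a sanity check, with arbitrary values of $n$ and $p$) does so by simulation:

```python
# Quick Monte Carlo sanity check (not from the paper) of the two inequalities
# used above: E[1 / max(1, N)] <= 2 * E[1 / (1 + N)] and
# E[1 / (1 + N)] <= 1 / (n * p) for N ~ Binomial(n, p).
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 200, 0.3, 200_000
N = rng.binomial(n, p, size=reps)

lhs = np.mean(1.0 / np.maximum(1, N))
mid = np.mean(1.0 / (1.0 + N))
print(lhs <= 2 * mid)          # 1/max(1,N) <= 2/(1+N) holds pointwise
print(mid <= 1.0 / (n * p))    # Chao and Strawderman (1972)-type bound
```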
Similarly, we have:
$$\mathbb E_S\left[\frac{2}{|\{i:M_i=0\}|^2}\sum_{i<j:\,M_i=M_j=0}\left\langle\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)],\ \Phi(X_j) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right\rangle_{\mathcal H}\right]$$
$$= \mathbb E_{\{M_i\}_{1\le i\le n}\sim P_M}\left[\frac{2}{|\{i:M_i=0\}|^2}\sum_{i<j:\,M_i=M_j=0}\left\langle\mathbb E_{X_i\sim P_{X|M=0}}\left[\Phi(X_i) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right],\ \mathbb E_{X_j\sim P_{X|M=0}}\left[\Phi(X_j) - \mathbb E_{X\sim P_{X|M=0}}[\Phi(X)]\right]\right\rangle_{\mathcal H}\right]$$
$$= \mathbb E_{\{M_i\}_{1\le i\le n}\sim P_M}\left[\frac{2}{|\{i:M_i=0\}|^2}\sum_{i<j:\,M_i=M_j=0}\langle 0, 0\rangle_{\mathcal H}\right] = 0.$$
Gathering the previous bounds and applying Jensen's inequality to $\mathbb E_S\left[\mathbb D\left(P_{X|M=0}, P_n\right)\right]$, we obtain, with the choice $\theta = \theta^*$,
$$\mathbb E_S\left[\mathbb D\left(P_{\theta_n^{MMD}}, P_X\right)\right] \le \mathbb D\left(P_{\theta^*}, P_X\right) + \frac{2\sqrt{\mathbb V_{X\sim P_X}[\pi(X)]}}{\pi} + 2\sqrt{\frac{2}{n\cdot\mathbb P[M=0]}},$$
which concludes the proof.
Proof of Theorem 4.4. Theorem 4.2 (using the squared absolute mean deviation M instead of
the variance V) leads to:
$$\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_{\theta^*}^{(M)}\right)\right] \le 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_X^{(M)}\right)\right] + 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_X^{(M)}, P_{\theta^*}^{(M)}\right)\right]$$
$$\le 2\left(2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb M_{X\sim P_X}[\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right]\right) + 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_X^{(M)}, P_{\theta^*}^{(M)}\right)\right]$$
$$= 4\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 8\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb M_{X\sim P_X}[\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right] + 2\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_X^{(M)}, P_{\theta^*}^{(M)}\right)\right]$$
$$\le 6\cdot\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_X^{(M)}\right)\right] + 8\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb M_{X\sim P_X}[\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right].$$
Under the Huber contamination $P_X = (1-\epsilon)P_{\theta^*} + \epsilon Q_X$, the boundedness of the kernel gives $\mathbb D\left(P_{\theta^*}^{(m)}, P_X^{(m)}\right) = \epsilon\,\mathbb D\left(P_{\theta^*}^{(m)}, Q_X^{(m)}\right) \le 2\epsilon$ for every pattern $m$, so that the first term satisfies
$$\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_X^{(M)}\right)\right] \le 4\epsilon^2,$$
while the other deviation term is bounded as in the previous proof: with the notation $\pi_m^* = \mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)]$, we have for any pattern $m$,
$$\frac{\mathbb M_{X\sim P_X}[\pi_m(X)]}{\mathbb E_{X\sim P_X}[\pi_m(X)]^2} = \frac{\mathbb M_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}[\pi_m(X)]}{\mathbb E_{X\sim P_X}[\pi_m(X)]^2} \le \frac{2(1-\epsilon)^2\,\mathbb M_{X\sim P_{\theta^*}}[\pi_m(X)] + 8\epsilon^2}{\mathbb E_{X\sim P_X}[\pi_m(X)]^2},$$
since for any measurable function $f$ taking values in $(0,1)$, $\mathbb E_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}[f(X)] = (1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[f(X)] + \epsilon\,\mathbb E_{X\sim Q_X}[f(X)]$ and
$$\mathbb M_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}[f(X)]^{1/2} = \mathbb E_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}\left[\left|f(X) - \left\{(1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[f(X)] + \epsilon\,\mathbb E_{X\sim Q_X}[f(X)]\right\}\right|\right]$$
$$= \mathbb E_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}\left[\left|f(X) - \mathbb E_{X\sim P_{\theta^*}}[f(X)] + \epsilon\left(\mathbb E_{X\sim P_{\theta^*}}[f(X)] - \mathbb E_{X\sim Q_X}[f(X)]\right)\right|\right]$$
$$\le \mathbb E_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}\left[\left|f(X) - \mathbb E_{X\sim P_{\theta^*}}[f(X)]\right|\right] + \epsilon\left|\mathbb E_{X\sim P_{\theta^*}}[f(X)] - \mathbb E_{X\sim Q_X}[f(X)]\right|$$
$$= (1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}\left[\left|f(X) - \mathbb E_{X\sim P_{\theta^*}}[f(X)]\right|\right] + \epsilon\,\mathbb E_{X\sim Q_X}\left[\left|f(X) - \mathbb E_{X\sim P_{\theta^*}}[f(X)]\right|\right] + \epsilon\left|\mathbb E_{X\sim P_{\theta^*}}[f(X)] - \mathbb E_{X\sim Q_X}[f(X)]\right|$$
$$\le (1-\epsilon)\cdot\mathbb M_{X\sim P_{\theta^*}}[f(X)]^{1/2} + 2\epsilon,$$
where the last step uses that $f$ and its expectations lie in $(0,1)$, so that each of the two remainder terms is bounded by $\epsilon$; squaring and using $(a+b)^2 \le 2a^2 + 2b^2$ then gives the bound above.
Thus we get:
$$\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_{\theta^*}^{(M)}\right)\right] \le 6\cdot\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_X^{(M)}\right)\right] + 8\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb M_{X\sim P_X}[\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right]$$
$$\le 24\epsilon^2 + 16(1-\epsilon)^2\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb M_{X\sim P_{\theta^*}}[\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right] + 64\epsilon^2\cdot\mathbb E_{M\sim P_M}\left[\frac{1}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right]$$
$$= 24\epsilon^2 + 16(1-\epsilon)^2\cdot\sum_m \mathbb E_{X\sim P_X}[\pi_m(X)]\,\frac{\mathbb M_{X\sim P_{\theta^*}}[\pi_m(X)]}{\mathbb E_{X\sim P_X}[\pi_m(X)]^2} + 64\epsilon^2\cdot\sum_m \mathbb E_{X\sim P_X}[\pi_m(X)]\,\frac{1}{\mathbb E_{X\sim P_X}[\pi_m(X)]^2}$$
$$= 24\epsilon^2 + 16(1-\epsilon)^2\cdot\sum_m \frac{\mathbb M_{X\sim P_{\theta^*}}[\pi_m(X)]}{\mathbb E_{X\sim P_X}[\pi_m(X)]} + 64\epsilon^2\cdot\sum_m \frac{1}{\mathbb E_{X\sim P_X}[\pi_m(X)]}$$
$$\le 24\epsilon^2 + 16(1-\epsilon)^2\cdot\sum_m \frac{\mathbb M_{X\sim P_{\theta^*}}[\pi_m(X)]}{(1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)]} + 64\epsilon^2\cdot\sum_m \frac{1}{(1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)]}$$
$$= 24\epsilon^2 + 16(1-\epsilon)\cdot\mathbb E_{M\sim P_M^*}\left[\frac{\mathbb M_{X\sim P_{\theta^*}}[\pi_M(X)]}{\pi_M^{*\,2}}\right] + \frac{64\epsilon^2}{1-\epsilon}\cdot\mathbb E_{M\sim P_M^*}\left[\frac{1}{\pi_M^{*\,2}}\right],$$
since $\mathbb E_{X\sim P_X}[\pi_m(X)] = (1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)] + \epsilon\,\mathbb E_{X\sim Q_X}[\pi_m(X)] \ge (1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)]$. Finally, using the identity (valid for any nonnegative function $f$):
$$\mathbb E_{M\sim P_M}[f(M)] = \sum_m \mathbb E_{X\sim P_X}[\pi_m(X)]\, f(m) = \sum_m \mathbb E_{X\sim(1-\epsilon)P_{\theta^*}+\epsilon Q_X}[\pi_m(X)]\, f(m)$$
$$= (1-\epsilon)\sum_m \mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)]\, f(m) + \epsilon\sum_m \mathbb E_{X\sim Q_X}[\pi_m(X)]\, f(m)$$
$$\ge (1-\epsilon)\sum_m \mathbb E_{X\sim P_{\theta^*}}[\pi_m(X)]\, f(m) = (1-\epsilon)\sum_m \pi_m^*\, f(m) = (1-\epsilon)\,\mathbb E_{M\sim P_M^*}[f(M)],$$
we get
$$\mathbb E_{M\sim P_M^*}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_{\theta^*}^{(M)}\right)\right] \le \frac{1}{1-\epsilon}\,\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta_\infty^{MMD}}^{(M)}, P_{\theta^*}^{(M)}\right)\right]$$
$$\le \frac{24\epsilon^2}{1-\epsilon} + 16\cdot\mathbb E_{M\sim P_M^*}\left[\frac{\mathbb M_{X\sim P_{\theta^*}}[\pi_M(X)]}{\pi_M^{*\,2}}\right] + \frac{64\epsilon^2}{(1-\epsilon)^2}\cdot\mathbb E_{M\sim P_M^*}\left[\frac{1}{\pi_M^{*\,2}}\right],$$
which concludes the proof.
Proof of Theorem 4.5. The proof follows the same lines as the proof of Theorem 4.4. Theorem 4.3 (using the squared absolute mean deviation $\mathbb M$ instead of the variance $\mathbb V$) leads to:
$$\mathbb D\left(P_{\theta^*}, P_{\theta_n^{MMD}}\right) \le \mathbb D\left(P_{\theta^*}, P_X\right) + \mathbb D\left(P_X, P_{X|M=0}\right) + \mathbb D\left(P_{X|M=0}, P_n\right) + \mathbb D\left(P_n, P_{\theta_n^{MMD}}\right)$$
$$\le \mathbb D\left(P_{\theta^*}, P_X\right) + \mathbb D\left(P_X, P_{X|M=0}\right) + \mathbb D\left(P_{X|M=0}, P_n\right) + \mathbb D\left(P_n, P_{\theta^*}\right)$$
$$\le 2\,\mathbb D\left(P_{\theta^*}, P_X\right) + 2\,\mathbb D\left(P_X, P_{X|M=0}\right) + 2\,\mathbb D\left(P_{X|M=0}, P_n\right),$$
where, as in the proof of Theorem 4.3, the third term is bounded in expectation by
$$\mathbb E_S\left[\mathbb D\left(P_{X|M=0}, P_n\right)\right] \le \sqrt{\frac{2}{n\cdot\mathbb P[M=0]}} \le \sqrt{\frac{2}{n\cdot(1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[\pi(X)]}},$$
since $\mathbb P[M=0] = \mathbb E_{X\sim P_X}[\pi(X)] \ge (1-\epsilon)\,\mathbb E_{X\sim P_{\theta^*}}[\pi(X)]$.
Proof of Theorem 5.2. With the notation $\widetilde\pi_m(X) = Q[M = m \mid X]$, a finer application of Theorem 4.2 gives
$$\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_X^{(M)}, P_{\theta_\infty^{MMD}}^{(M)}\right)\right] \le 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\cdot\mathbb E_{M\sim P_M}\left[\frac{\mathbb V_{X\sim P_X}[\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right]$$
$$= 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\cdot\mathbb E_{M\sim P_M}\left[\frac{\varepsilon^2\cdot\mathbb V_{X\sim P_X}[\widetilde\pi_M(X)]}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right]$$
$$\le 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\varepsilon^2\cdot\mathbb E_{M\sim P_M}\left[\frac{1}{\mathbb E_{X\sim P_X}[\pi_M(X)]^2}\right]$$
$$= 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\varepsilon^2\cdot\sum_m \mathbb E_{X\sim P_X}[\pi_m(X)]\,\frac{1}{\mathbb E_{X\sim P_X}[\pi_m(X)]^2}$$
$$= 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\varepsilon^2\cdot\sum_m \frac{1}{\mathbb E_{X\sim P_X}[\pi_m(X)]}$$
$$= 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\varepsilon^2\cdot\sum_m \frac{1}{(1-\varepsilon)\alpha_m + \varepsilon\,\mathbb E_{X\sim P_X}[\widetilde\pi_m(X)]}$$
$$\le 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + 4\varepsilon^2\cdot\sum_m \frac{1}{(1-\varepsilon)\alpha_m}$$
$$= 2\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + \frac{4\varepsilon^2}{1-\varepsilon}\cdot\mathbb E_{M\sim(\alpha_m)}\left[\frac{1}{\alpha_M^2}\right],$$
and thus
$$\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_{\theta_\infty^{MMD}}^{(M)}\right)\right] \le 2\cdot\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_X^{(M)}\right)\right] + 2\cdot\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_X^{(M)}, P_{\theta_\infty^{MMD}}^{(M)}\right)\right]$$
$$\le 8\epsilon^2 + 4\cdot\inf_{\theta\in\Theta}\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_\theta^{(M)}, P_X^{(M)}\right)\right] + \frac{8\varepsilon^2}{1-\varepsilon}\cdot\mathbb E_{M\sim(\alpha_m)}\left[\frac{1}{\alpha_M^2}\right]$$
$$\le 8\epsilon^2 + 4\cdot\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_X^{(M)}\right)\right] + \frac{8\varepsilon^2}{1-\varepsilon}\cdot\mathbb E_{M\sim(\alpha_m)}\left[\frac{1}{\alpha_M^2}\right]$$
$$\le 24\epsilon^2 + \frac{8\varepsilon^2}{1-\varepsilon}\cdot\mathbb E_{M\sim(\alpha_m)}\left[\frac{1}{\alpha_M^2}\right].$$
Finally, using the identity (valid for any nonnegative function $f$)
$$\mathbb E_{M\sim P_M}[f(M)] = \sum_m \mathbb E_{X\sim P_X}[\pi_m(X)]\, f(m) = \sum_m \mathbb E_{X\sim P_X}\left[(1-\varepsilon)\alpha_m + \varepsilon\,\widetilde\pi_m(X)\right] f(m)$$
$$= (1-\varepsilon)\sum_m \alpha_m f(m) + \varepsilon\sum_m \mathbb E_{X\sim P_X}[\widetilde\pi_m(X)]\, f(m) \ge (1-\varepsilon)\sum_m \alpha_m f(m) = (1-\varepsilon)\,\mathbb E_{M\sim(\alpha_m)}[f(M)],$$
we have
$$\mathbb E_{M\sim(\alpha_m)}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_{\theta_\infty^{MMD}}^{(M)}\right)\right] \le \frac{1}{1-\varepsilon}\cdot\mathbb E_{M\sim P_M}\left[\mathbb D^2\left(P_{\theta^*}^{(M)}, P_{\theta_\infty^{MMD}}^{(M)}\right)\right] \le \frac{24\epsilon^2}{1-\varepsilon} + \frac{8\varepsilon^2}{(1-\varepsilon)^2}\cdot\mathbb E_{M\sim(\alpha_m)}\left[\frac{1}{\alpha_M^2}\right].$$
Proof of Theorem 5.3. This is a straightforward application of a refined version of Theorem 4.5. Recall that Theorem 4.5 reads:
$$\mathbb E_S\left[\mathbb D\left(P_{\theta_n^{MMD}}, P_X\right)\right] \le 4\epsilon + \frac{8\epsilon}{\mathbb E_{X\sim P_{\theta^*}}[\pi(X)]\,(1-\epsilon)} + \frac{2\sqrt{\mathbb V_{X\sim P_{\theta^*}}[\pi(X)]}}{\mathbb E_{X\sim P_{\theta^*}}[\pi(X)]} + \frac{2\sqrt2}{\sqrt{n\,\mathbb E_{X\sim P_{\theta^*}}[\pi(X)]\,(1-\epsilon)}},$$
where the quantities $\mathbb E_{X\sim P_{\theta^*}}[\pi(X)]\,(1-\epsilon)$ in the denominators come as rough lower bounds on $\mathbb E_{X\sim P_X}[\pi(X)]$. Hence, using those quantities directly as established in the proof of Theorem 4.5, we have:
$$\mathbb E_S\left[\mathbb D\left(P_{\theta_n^{MMD}}, P_X\right)\right] \le 4\epsilon + \frac{2\sqrt{\mathbb V_{X\sim P_X}[\pi(X)]}}{\mathbb E_{X\sim P_X}[\pi(X)]} + \frac{2\sqrt2}{\sqrt{n\,\mathbb E_{X\sim P_X}[\pi(X)]}}.$$
The arguments $\mathbb E_{X\sim P_X}[\pi(X)] \ge \alpha(1-\varepsilon)$, $\mathbb E_{X\sim P_{\theta^*}}[\pi(X)] \ge \alpha(1-\varepsilon)$ and $\sqrt{\mathbb V_{X\sim P_X}[\pi(X)]} \le \varepsilon$, where $\pi(X) = (1-\varepsilon)\cdot\alpha + \varepsilon\cdot Q[M=0\mid X]$, then lead to:
$$\mathbb E_S\left[\mathbb D\left(P_{\theta_n^{MMD}}, P_X\right)\right] \le 4\epsilon + \frac{2\varepsilon}{\alpha(1-\varepsilon)} + \frac{2\sqrt2}{\sqrt{n\,\alpha(1-\varepsilon)}}.$$
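The displayed bound separates a sampling term, vanishing in $n$, from two contamination floors. Plugging in hypothetical values of $\epsilon$, $\varepsilon$ and $\alpha$ (ours, purely illustrative) makes this explicit:

```python
# Illustrative evaluation (not from the paper) of the finite-sample bound
#   4*eps_data + 2*eps_miss/(alpha*(1-eps_miss)) + 2*sqrt(2)/sqrt(n*alpha*(1-eps_miss))
# as a function of n, for hypothetical contamination levels.
import numpy as np

def bound(n, eps_data=0.01, eps_miss=0.02, alpha=0.5):
    return (4 * eps_data
            + 2 * eps_miss / (alpha * (1 - eps_miss))
            + 2 * np.sqrt(2) / np.sqrt(n * alpha * (1 - eps_miss)))

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(bound(n), 4))
# The bound decreases in n but plateaus at 4*eps_data + 2*eps_miss/(alpha*(1-eps_miss)).
```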
With the properties $\big|\,|\{i : \widetilde M_i = 0\}| - |\{i : M_i = 0\}|\,\big| \le (\varepsilon/\alpha)\cdot|\{i : M_i = 0\}|$, $|\{i : \widetilde M_i = 0\}| \ge (1 - \varepsilon/\alpha)\cdot|\{i : M_i = 0\}|$, $|\{i : M_i = 0, \widetilde M_i = 1\}| \le (\varepsilon/\alpha)\cdot|\{i : M_i = 0\}|$ and $|\{i : M_i = 1, \widetilde M_i = 0\}| \le (\varepsilon/\alpha)\cdot|\{i : M_i = 0\}|$, we have:
$$\mathbb D\left(P_n, \widetilde P_n\right) = \left\|\frac{1}{|\{i:M_i=0\}|}\sum_{i:M_i=0}\Phi(X_i) - \frac{1}{|\{i:\widetilde M_i=0\}|}\sum_{i:\widetilde M_i=0}\Phi(X_i)\right\|_{\mathcal H}$$
$$= \left\|\sum_{i=1}^n\left(\frac{\mathbb 1(M_i=0)}{|\{i:M_i=0\}|} - \frac{\mathbb 1(\widetilde M_i=0)}{|\{i:\widetilde M_i=0\}|}\right)\Phi(X_i)\right\|_{\mathcal H}$$
$$\le \sum_{i=1}^n\left|\frac{\mathbb 1(M_i=0)}{|\{i:M_i=0\}|} - \frac{\mathbb 1(\widetilde M_i=0)}{|\{i:\widetilde M_i=0\}|}\right|\,\|\Phi(X_i)\|_{\mathcal H} \le \sum_{i=1}^n\left|\frac{\mathbb 1(M_i=0)}{|\{i:M_i=0\}|} - \frac{\mathbb 1(\widetilde M_i=0)}{|\{i:\widetilde M_i=0\}|}\right|$$
$$= \sum_{i=1}^n\frac{\big|\mathbb 1(M_i=0)\,|\{i:\widetilde M_i=0\}| - \mathbb 1(\widetilde M_i=0)\,|\{i:M_i=0\}|\big|}{|\{i:M_i=0\}|\cdot|\{i:\widetilde M_i=0\}|}$$
$$= \sum_{i:M_i=0,\,\widetilde M_i=1}\frac{|\{i:\widetilde M_i=0\}|}{|\{i:M_i=0\}|\cdot|\{i:\widetilde M_i=0\}|} + \sum_{i:M_i=1,\,\widetilde M_i=0}\frac{|\{i:M_i=0\}|}{|\{i:M_i=0\}|\cdot|\{i:\widetilde M_i=0\}|} + \sum_{i:M_i=\widetilde M_i=0}\frac{\big|\,|\{i:\widetilde M_i=0\}| - |\{i:M_i=0\}|\,\big|}{|\{i:M_i=0\}|\cdot|\{i:\widetilde M_i=0\}|}$$
$$= \frac{|\{i:M_i=0,\,\widetilde M_i=1\}|}{|\{i:M_i=0\}|} + \frac{|\{i:M_i=1,\,\widetilde M_i=0\}|}{|\{i:\widetilde M_i=0\}|} + \frac{|\{i:M_i=\widetilde M_i=0\}|\cdot\big|\,|\{i:\widetilde M_i=0\}| - |\{i:M_i=0\}|\,\big|}{|\{i:M_i=0\}|\cdot|\{i:\widetilde M_i=0\}|}$$
$$\le \frac{|\{i:M_i=0,\,\widetilde M_i=1\}|}{|\{i:M_i=0\}|} + \frac{|\{i:M_i=1,\,\widetilde M_i=0\}|}{|\{i:\widetilde M_i=0\}|} + \frac{|\{i:M_i=\widetilde M_i=0\}|\cdot(\varepsilon/\alpha)\cdot|\{i:M_i=0\}|}{|\{i:M_i=0\}|\cdot|\{i:\widetilde M_i=0\}|}$$
$$= \frac{|\{i:M_i=0,\,\widetilde M_i=1\}|}{|\{i:M_i=0\}|} + \frac{|\{i:M_i=1,\,\widetilde M_i=0\}|}{|\{i:\widetilde M_i=0\}|} + \frac{|\{i:M_i=\widetilde M_i=0\}|\cdot\varepsilon/\alpha}{|\{i:\widetilde M_i=0\}|}$$
$$\le \frac{|\{i:M_i=0,\,\widetilde M_i=1\}|}{|\{i:M_i=0\}|} + \frac{|\{i:M_i=1,\,\widetilde M_i=0\}|}{(1-\varepsilon/\alpha)\cdot|\{i:M_i=0\}|} + \frac{|\{i:M_i=\widetilde M_i=0\}|\cdot\varepsilon/\alpha}{(1-\varepsilon/\alpha)\cdot|\{i:M_i=0\}|}$$
$$= \frac{(1-\varepsilon/\alpha)\cdot|\{i:M_i=0,\,\widetilde M_i=1\}| + |\{i:M_i=1,\,\widetilde M_i=0\}| + |\{i:M_i=\widetilde M_i=0\}|\cdot\varepsilon/\alpha}{(1-\varepsilon/\alpha)\cdot|\{i:M_i=0\}|}$$
$$\le \frac{(1-\varepsilon/\alpha)\cdot(\varepsilon/\alpha)\cdot|\{i:M_i=0\}| + (\varepsilon/\alpha)\cdot|\{i:M_i=0\}| + |\{i:M_i=0\}|\cdot\varepsilon/\alpha}{(1-\varepsilon/\alpha)\cdot|\{i:M_i=0\}|}$$
$$= \frac{(1-\varepsilon/\alpha)\cdot\varepsilon/\alpha + \varepsilon/\alpha + \varepsilon/\alpha}{1-\varepsilon/\alpha} = \frac{(3-\varepsilon/\alpha)\cdot\varepsilon/\alpha}{1-\varepsilon/\alpha} \le \frac{3\varepsilon}{\alpha-\varepsilon}.$$
Once again, if $P_X = (1-\epsilon)P_{\theta^*} + \epsilon Q_X$:
$$\mathbb D\left(P_{\theta^*}, P_{\theta_n^{MMD}}\right) \le \mathbb D\left(P_{\theta^*}, P_X\right) + \mathbb D\left(P_X, \widetilde P_n\right) + \mathbb D\left(\widetilde P_n, P_{\theta_n^{MMD}}\right)$$
$$\le \mathbb D\left(P_{\theta^*}, P_X\right) + \mathbb D\left(P_X, \widetilde P_n\right) + \mathbb D\left(\widetilde P_n, P_{\theta^*}\right) \le 2\,\mathbb D\left(P_{\theta^*}, P_X\right) + 2\,\mathbb D\left(P_X, \widetilde P_n\right)$$
$$\le 2\,\mathbb D\left(P_{\theta^*}, P_X\right) + 2\,\mathbb D\left(P_X, P_n\right) + 2\,\mathbb D\left(P_n, \widetilde P_n\right) \le 4\epsilon + 2\,\mathbb D\left(P_X, P_n\right) + 2\,\mathbb D\left(P_n, \widetilde P_n\right).$$
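As a sanity check of the counting argument used above for $\mathbb D(P_n, \widetilde P_n)$ (which only requires $\|\Phi(X_i)\|_{\mathcal H}\le 1$), the following script compares the MMD between the two complete-case embeddings with the purely combinatorial bound, using a Gaussian kernel and arbitrary simulated data and mask corruption:

```python
# Sanity check (illustrative, not from the paper): MMD between the two
# complete-case empirical embeddings is below the counting bound
#   |{M=0, Mt=1}|/|{M=0}| + |{M=1, Mt=0}|/|{Mt=0}|
#     + |{M=Mt=0}| * | |{Mt=0}| - |{M=0}| | / (|{M=0}| * |{Mt=0}|).
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.standard_normal((n, 2))
M = (rng.random(n) < 0.4).astype(int)            # 0 = fully observed
Mt = M.copy()
flip = rng.random(n) < 0.05                      # corrupt 5% of the masks
Mt[flip] = 1 - Mt[flip]

def k(a, b):                                     # Gaussian kernel, k(x, x) = 1
    return np.exp(-np.sum((a[:, None] - b[None, :]) ** 2, axis=-1) / 2.0)

A, B = X[M == 0], X[Mt == 0]
mmd = np.sqrt(k(A, A).mean() + k(B, B).mean() - 2 * k(A, B).mean())

n0, nt0 = len(A), len(B)
both = np.sum((M == 0) & (Mt == 0))
bound = (np.sum((M == 0) & (Mt == 1)) / n0
         + np.sum((M == 1) & (Mt == 0)) / nt0
         + both * abs(nt0 - n0) / (n0 * nt0))
print(mmd <= bound)   # expected: True
```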