
Journal of Machine Learning Research 24 (2023) 1-16 Submitted 3/20; Revised 12/22; Published 1/23

On Distance and Kernel Measures of Conditional Dependence

Tianhong Sheng [email protected]
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA

Bharath K. Sriperumbudur [email protected]
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA

Editor: John Shawe-Taylor

Abstract
Measuring conditional dependence is one of the important tasks in statistical inference and
is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian
network learning, and others. In this work, we explore the connection between conditional
dependence measures induced by distances on a metric space and reproducing kernels
associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel
pairs, we show that the distance-based conditional dependence measures are equivalent to the
kernel-based measures. On the other hand, we also show that some popular kernel
conditional dependence measures based on the Hilbert-Schmidt norm of a certain conditional
cross-covariance operator do not have a simple distance representation, except in
some limiting cases.
Keywords: Conditional independence test, distance covariance, energy distance, Hilbert-
Schmidt independence criterion, reproducing kernel Hilbert space

1. Introduction

Measuring conditional dependence between random variables plays a fundamental role in
many statistical inference tasks such as causal discovery (Pearl, 2000; Spirtes et al., 2000),
supervised dimensionality reduction (Cook and Li, 2002; Fukumizu et al., 2004), conditional
independence testing (Su and White, 2007; Gretton et al., 2012), and others. Formally,
for random variables (X, Y, Z), X is said to be conditionally independent of Y given Z,
denoted as X ⊥⊥ Y | Z, if P_{XY|Z} = P_{X|Z} P_{Y|Z} a.s.-P_Z, where the notation P_{X|Z} denotes a
regular conditional probability defined as P_{X|Z}(·) = E[1(X ∈ ·) | Z] a.s.-P_Z, with P_Z being
the marginal distribution of Z. Given a distance measure D on the space of probability
measures, D(P_{XY|Z}, P_{X|Z} P_{Y|Z}) measures the degree of conditional dependence between X
and Y given Z, with X ⊥⊥ Y | Z if and only if D(P_{XY|Z}, P_{X|Z} P_{Y|Z}) = 0 a.s.-P_Z. Some popular
choices for D include the Kullback-Leibler divergence (more generally f -divergence), total
variation distance, Hellinger distance, Wasserstein distance, among others.

© 2023 Tianhong Sheng and Bharath K. Sriperumbudur.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v24/20-238.html.

Recently, a class of distances on probability measures induced by a Euclidean metric
on R^d—more generally by metrics of strongly negative type—called the energy distance
(Székely and Rizzo, 2004) and distance covariance (Székely et al., 2007; Székely and Rizzo,
2009; Lyons, 2013), has gained popularity in nonparametric hypothesis testing (e.g., two-sample
and independence testing) because of their computational simplicity and elegant
interpretation. Wang et al. (2015) extended distance covariance to conditional distributions
on R^d to obtain a measure of conditional dependence, called conditional distance covariance
(CdCov), which has been applied in conditional independence testing. We refer to this class
of probability metrics as distance-based measures and point the reader to Section 3 for
preliminaries on distance-based measures.
On the other hand, in the machine learning literature, measures of dependence have
been formulated based on embedding of probability distributions into a reproducing kernel
Hilbert space (RKHS; Aronszajn, 1950). This embedding into an RKHS makes it possible to capture the
properties of distributions and has been used in many applications including homogeneity,
independence, and conditional independence testing (for example, see Muandet et al., 2017,
and references therein). Formally, given a probability measure ν defined on a measurable
space X and an RKHS H_k with reproducing kernel k, ν can be embedded into H_k as

    ν ↦ ∫_X k(·, x) dν(x) := µ_k(ν) ∈ H_k,

where µ_k(ν) is called the mean element or kernel mean embedding of ν. Using this notion,
the kernel distance, also called the maximum mean discrepancy (MMD), between two
probability distributions P and Q is defined as the distance between their mean elements
(Gretton et al., 2007), i.e., D(P, Q) = ‖µ_k(P) − µ_k(Q)‖_{H_k}. The kernel embedding and the
kernel distance are well-studied in the literature and their mathematical theory is well-
developed (Sriperumbudur et al., 2010, 2011; Sriperumbudur, 2016; Szabó and Sriperumbudur,
2018; Simon-Gabriel and Schölkopf, 2018; Simon-Gabriel et al., 2020). Generalizing this
notion of kernel embedding to distributions defined on product spaces yields a kernel measure
of dependence, called the Hilbert-Schmidt independence criterion (HSIC; Gretton et al., 2005,
Gretton et al., 2008, Smola et al., 2007), which can then be used as a measure of conditional
dependence by applying it to conditional probability distributions (Fukumizu et al., 2004,
2008). Fukumizu et al. (2004); Gretton et al. (2005) provided an alternate interpretation for
HSIC in terms of the Hilbert-Schmidt norm of a certain cross-covariance operator, based on
which the Hilbert-Schmidt norm of a conditional cross-covariance operator (which we refer to as
HSC̈IC) is then proposed as a measure of conditional dependence. We point the reader to
Sections 4 and 5 for details and refer to this class of probability metrics as kernel-based
measures.
Sejdinovic et al. (2013) established an equivalence between distance-based and kernel-
based dependence measures (i.e., distance covariance and HSIC) by showing that a repro-
ducing kernel that defines HSIC induces a semi-metric of negative type which in turn defines
the distance covariance (Székely et al., 2007, 2009), and vice-versa. However, despite the
striking similarity, the relationship between conditional distance covariance and related
kernel measures is not known. The goal of this work is to investigate the relationship
between distance and kernel-based measures of conditional independence, and in particular,


understand whether these measures are equivalent (i.e., the distance measure can be obtained
from the kernel measure and vice-versa).
As our contributions, first, in Theorem 1 (Section 4.2), we generalize the conditional
distance covariance of Wang et al. (2015) to arbitrary metric spaces of negative type—we
call this generalized CdCov (gCdCov)—and develop a kernel measure of conditional
dependence (which we refer to as HSCIC) that is equivalent to gCdCov. Therefore, it follows from
Theorem 1 that CdCov introduced by Wang et al. (2015) is a special case of the HSCIC. In
fact, the HSCIC we obtain is exactly the conditional dependence measure recently proposed
by Park and Muandet (2020). Second, in Theorem 2 (Section 5), we consider the kernel
measure of conditional dependence based on the Hilbert-Schmidt norm of the conditional
cross-covariance operator (i.e., HSC̈IC) and obtain its distance-based interpretation. We
show that this distance-based version of HSC̈IC does not have an elegant interpretation,
except in limiting cases where it is related to CdCov and gCdCov (see Corollaries 3 and 4).
The paper is organized as follows. Definitions and notation that are widely used
throughout the paper are collected in Section 2. The preliminaries on distance-based and
kernel-based measures are presented in Sections 3 and 4.1, respectively, while the main results
are presented in Sections 4.2 and 5.

2. Definitions & Notation


For a non-empty set X, a function ρ : X × X → [0, ∞) is called a semi-metric on X if it
satisfies (i) ρ(x, x′) = 0 ⇔ x = x′ and (ii) ρ(x, x′) = ρ(x′, x). Then (X, ρ) is said to be a
semi-metric space. The semi-metric space (X, ρ) is said to be of negative type if for all n ≥ 2,
{x_i}_{i=1}^n ⊂ X and {α_i}_{i=1}^n ⊂ R with ∑_{i=1}^n α_i = 0, we have ∑_{i=1}^n ∑_{j=1}^n α_i α_j ρ(x_i, x_j) ≤ 0. (X, ρ) is
said to be of strongly negative type if ∫∫ ρ(x, y) dµ(x) dµ(y) < 0 for all finite signed measures
µ ≠ 0 with µ(X) = 0. A real-valued symmetric function k : X × X → R is called a positive
definite (pd) kernel if, for all n ∈ N, {α_i}_{i=1}^n ⊂ R and {x_i}_{i=1}^n ⊂ X, we have
∑_{i,j=1}^n α_i α_j k(x_i, x_j) ≥ 0. A function k : X × X → R, (x, y) ↦ k(x, y), is a reproducing
kernel of the Hilbert space (H_k, ⟨·, ·⟩_{H_k}) of functions if and only if (i) ∀x ∈ X, k(·, x) ∈ H_k
and (ii) ∀x ∈ X, ∀f ∈ H_k, ⟨k(·, x), f⟩_{H_k} = f(x) hold. If such a k exists, then H_k is called
a reproducing kernel Hilbert space.
X , Y and Z denote Polish spaces endowed with Borel σ-algebras. X, Y and Z denote
random elements in X , Y and Z , respectively. Ẍ is defined as (X, Z), which is a random
element in X × Z . The probability law of a random variable X is denoted by PX , the
joint probability law of random variables X and Z is denoted by PXZ and the regular
conditional probability of X given Z is defined as PX|Z (·) = E[1(X ∈ ·)|Z] a.s.-PZ such
that P_{X|Z=z} is a probability measure on X for all z ∈ Z. The symbol X ⊥⊥ Y | Z indicates
the conditional independence of X and Y given Z. φX and φY denote the characteristic
functions of X and Y respectively and their joint characteristic function is denoted as φXY .
The conditional characteristic functions of X, Y and (X, Y ) given Z are denoted as φX|Z ,
φY |Z and φXY |Z respectively. A measurable, positive definite kernel on X is denoted as kX
and its corresponding RKHS as HX . Similarly we define kY , HY , kZ , HZ , kẌ and HẌ .
In this paper we assume that all involved RKHS’s are separable.
The space of r-integrable functions w.r.t. a σ-finite measure µ on R^d is denoted as
L^r(R^d, µ), and if µ is the Lebesgue measure on R^d, we denote it as L^r(R^d).


3. Conditional Distance Covariance


Distance covariance was proposed by Székely et al. (2007) as a new measure of dependence
between Euclidean random vectors in arbitrary dimension. An interesting feature of distance
covariance is that unlike the classical covariance, it is zero only if the random vectors are
independent. Formally, the distance covariance (dCov) between two random vectors is
defined as the weighted L2 norm between the joint characteristic function and the product
of marginal characteristic functions, i.e.,
    V²(X, Y) = ‖φ_{XY} − φ_X φ_Y‖²_{L²(w)} = (1/(c_p c_q)) ∫∫ |φ_{XY}(t, s) − φ_X(t) φ_Y(s)|² / (‖t‖^{p+1} ‖s‖^{q+1}) dt ds,

where φ_{XY} denotes the joint characteristic function of random variables X ∈ R^p and
Y ∈ R^q, with φ_X and φ_Y denoting their respective marginal characteristic functions. Here
c_p = π^{(p+1)/2}/Γ((p+1)/2), c_q = π^{(q+1)/2}/Γ((q+1)/2) and w(t, s) = ‖t‖^{−p−1} ‖s‖^{−q−1} with ‖t‖² = ∑_{i=1}^p t_i² for
t = (t_1, . . . , t_p). A particular advantage of distance covariance is its compact representation
in terms of certain expectations of pairwise Euclidean distances (Székely et al., 2007):

    V²(X, Y) = E[E[‖X − X′‖ ‖Y − Y′‖ | X, Y]] + E‖X − X′‖ E‖Y − Y′‖ − 2 E[E[‖X − X′‖ | X] E[‖Y − Y′‖ | Y]],   (1)

where (X′, Y′) is an i.i.d. copy of (X, Y), which leads to straightforward empirical estimates by replacing
the expectations with empirical estimators. Such an estimator has been used as a test statistic
in independence testing and the resulting test is shown to be consistent if the marginal
distributions have finite first moment (Székely et al., 2007). As a natural generalization, Lyons
(2013) extended (1) to metric spaces of negative type and showed that the corresponding
distance covariance—obtained by replacing the Euclidean metric by a metric of strongly
negative type—is zero if and only if X and Y are independent.
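As an illustration of how (1) leads to a plug-in estimate, the following minimal Python/NumPy sketch computes the V-statistic version of distance covariance by replacing every expectation in (1) with a sample mean over pairwise Euclidean distance matrices. The function name dcov_sq and the toy data are illustrative and not taken from the paper.

import numpy as np

def dcov_sq(X, Y):
    """V-statistic estimate of V^2(X, Y) in (1) from paired samples.

    X is an (n, p) array and Y an (n, q) array; every expectation in (1)
    is replaced by a sample mean over the pairwise distance matrices
    a_ij = ||X_i - X_j|| and b_ij = ||Y_i - Y_j||."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    term1 = np.mean(a * b)                            # E[ ||X-X'|| ||Y-Y'|| ]
    term2 = np.mean(a) * np.mean(b)                   # E||X-X'|| E||Y-Y'||
    term3 = np.mean(a.mean(axis=1) * b.mean(axis=1))  # E[ E[||X-X'|| | X] E[||Y-Y'|| | Y] ]
    return term1 + term2 - 2.0 * term3

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 2))
    y_dep = x[:, :1] ** 2 + 0.1 * rng.normal(size=(500, 1))  # dependent on x
    y_ind = rng.normal(size=(500, 1))                        # independent of x
    print(dcov_sq(x, y_dep), dcov_sq(x, y_ind))              # first value is markedly larger

Replacing the Euclidean norms with semi-metrics of negative type in this sketch gives the corresponding estimate of the generalized distance covariance of Lyons (2013).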
Extending the idea of distance covariance, recently, Wang et al. (2015) proposed a
conditional version to measure conditional independence between random vectors of arbitrary
dimension. To elaborate, let X ∈ Rp , Y ∈ Rq and Z ∈ Rr be random vectors. The
conditional distance covariance (CdCov) V(X, Y |Z) between random vectors X and Y with
finite moments given Z is defined as
    V²(X, Y | Z) = ‖φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}‖²_{L²(w)} = (1/(c_p c_q)) ∫∫ |φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)|² / (‖t‖^{p+1} ‖s‖^{q+1}) dt ds,

where

    φ_{XY|Z}(t, s) = E[e^{√−1⟨t,X⟩ + √−1⟨s,Y⟩} | Z],   φ_{X|Z}(t) = φ_{XY|Z}(t, 0) and φ_{Y|Z}(s) = φ_{XY|Z}(0, s).

As a crucial property, CdCov is zero P_Z-almost surely if and only if X ⊥⊥ Y | Z. Similar
to distance covariance, one advantage of this measure is that its sample version can be
expressed elegantly as a V - or U -statistic, based on which Wang et al. (2015) proposed a
statistically consistent conditional independence test.
The conditional distance covariance defined above can also be computed in terms of the
conditional expectations of pairwise Euclidean distances:
    V²(X, Y | Z) = E[E[‖X − X′‖ ‖Y − Y′‖ | X, Y, Z] | Z] + E[‖X − X′‖ | Z] E[‖Y − Y′‖ | Z]
                   − 2 E[E[‖X − X′‖ | X, Z] E[‖Y − Y′‖ | Y, Z] | Z],   (2)


where (X, Y) and (X′, Y′) are independent copies given Z. In a similar spirit to Lyons
(2013), CdCov can be extended to metric spaces of negative type through conditional
expectations so that (2) can be written as
    V²_{ρ_X, ρ_Y}(X, Y | Z) = E[E[ρ_X(X, X′) ρ_Y(Y, Y′) | X, Y, Z] | Z]
                              + E[ρ_X(X, X′) | Z] E[ρ_Y(Y, Y′) | Z]
                              − 2 E[E[ρ_X(X, X′) | X, Z] E[ρ_Y(Y, Y′) | Y, Z] | Z]   (3)
                            =: G[ρ_X(X, X′) ρ_Y(Y, Y′)] =: G ∘ [ρ_X ρ_Y],   (4)

where ρ_X and ρ_Y are metrics of strongly negative type defined on the spaces X and Y
respectively, with E[ρ_X²(X, x_0) | Z] < ∞ a.s.-P_Z and E[ρ_Y²(Y, y_0) | Z] < ∞ a.s.-P_Z for some
x_0 ∈ X and y_0 ∈ Y. The moment conditions ensure that the expectations are finite. When
ρ_X and ρ_Y are of strongly negative type, then clearly (3) is zero if and only if X ⊥⊥ Y | Z.
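To make (2) and (3) concrete, the sketch below estimates V²(X, Y | Z = z) by replacing each conditional expectation with a Nadaraya-Watson weighted average, in the spirit of (though not identical to) the estimator of Wang et al. (2015). The Gaussian smoothing kernel in Z and the bandwidth are assumed, illustrative choices; swapping the Euclidean norms for other semi-metrics of negative type yields the corresponding gCdCov estimate.

import numpy as np

def cdcov_sq_at(X, Y, Z, z, bandwidth=0.5):
    """Plug-in estimate of V^2(X, Y | Z = z) based on (2).

    Conditional expectations given Z = z are approximated by weighted
    averages with Nadaraya-Watson weights w_i proportional to
    exp(-||Z_i - z||^2 / (2 h^2)); the Gaussian kernel and the bandwidth h
    are illustrative choices, not prescribed by the paper."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    w = np.exp(-np.sum((Z - z) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    w = w / w.sum()                        # weights approximating P(. | Z = z)
    W = np.outer(w, w)                     # pair weights for (i, j)
    term1 = np.sum(W * a * b)              # E[ ||X-X'|| ||Y-Y'|| | Z = z ]
    term2 = np.sum(W * a) * np.sum(W * b)  # E[||X-X'|| | Z] E[||Y-Y'|| | Z]
    term3 = np.sum(w * (a @ w) * (b @ w))  # E[ E[||X-X'|| | X, Z] E[||Y-Y'|| | Y, Z] | Z ]
    return term1 + term2 - 2.0 * term3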

4. Kernel Measures of Conditional Dependence


First, in Section 4.1, we present preliminaries on RKHS embedding of probability measures
and introduce kernel measures of dependence. Based on this discussion, in Section 4.2, we
develop a kernel measure of conditional dependence (which we call the Hilbert-Schmidt conditional
independence criterion—HSCIC) that is related to gCdCov (and therefore CdCov) discussed
in Section 3. We also present an interpretation for gCdCov through conditional cross-
covariance operator formulation for HSCIC.

4.1 RKHS embedding of probabilities


In the machine learning literature, the notion of embedding probability measures in an
RKHS has gained lot of attention and has been applied in goodness-of-fit (Balasubramanian
et al., 2021), two-sample (Gretton et al., 2007, 2012), independence (Gretton et al., 2008) and
conditional independence (Fukumizu et al., 2008; Zhang et al., 2011) testing. To elaborate,
given a probability measure P such that ∫_X √(k(x, x)) dP(x) < ∞, its RKHS embedding
(Smola et al., 2007) is defined as

    P ↦ µ_P := ∫_X k(·, x) dP(x) ∈ H_k,

where Hk is an RKHS with k as the reproducing kernel. Based on this embedding, a distance
on the space of probabilities can be defined through the distance between the embeddings,
i.e., D_k(P, Q) = ‖µ_P − µ_Q‖_{H_k}, called the kernel distance or maximum mean discrepancy
(Gretton et al., 2007). If the map P ↦ µ_P is injective, then the kernel k that induces µ_P is
said to be characteristic (Fukumizu et al., 2009; Sriperumbudur et al., 2010), and therefore
D_k(P, Q) induces a metric on M_k^{1/2}(X) := {P ∈ M_+^1(X) : ∫_X √(k(x, x)) dP(x) < ∞},
where M_+^1(X) denotes the set of all probability measures on X. Using the reproducing
property of the kernel, it can be shown that
    D_k²(P, Q) = E_{XX′} k(X, X′) + E_{YY′} k(Y, Y′) − 2 E_{XY} k(X, Y),

where X, X′ ~ P and Y, Y′ ~ Q are i.i.d. Extending this distance to probability measures on
product spaces, particularly the joint measure P_{XY} and product of marginals P_X P_Y, yields
a measure of dependence between two random variables X and Y defined on measurable
spaces X and Y, called the Hilbert-Schmidt independence criterion (HSIC), which is
defined (Gretton et al., 2005) as

    D²_{k_X k_Y}(P_{XY}, P_X P_Y) = E_{XY} E_{X′Y′}[k_X(X, X′) k_Y(Y, Y′)]
                                    + E_X E_{X′}[k_X(X, X′)] E_Y E_{Y′}[k_Y(Y, Y′)]
                                    − 2 E_{XY}[E_{X′} k_X(X, X′) E_{Y′} k_Y(Y, Y′)]   (5)
                                  = ∫ (k_X k_Y)(x, y, x′, y′) d[P_{XY} − P_X P_Y]²(x, y, x′, y′).

If the kernels k_X and k_Y are characteristic, then HSIC characterizes independence (Szabó
and Sriperumbudur, 2018), i.e., D_{k_X k_Y}(P_{XY}, P_X P_Y) = 0 if and only if X ⊥⊥ Y. An empirical
version of (5) has been used as a test statistic in independence testing and the resultant test
is shown to be consistent against all alternatives as long as k_X and k_Y are characteristic
(Gretton et al., 2008). An interesting connection between kernel-based HSIC and distance-based
dCov was established by Sejdinovic et al. (2013): dCov in (1) is in fact a special case of
HSIC, and HSIC is equivalent to the generalized dCov introduced by Lyons (2013). This
result provides a unifying framework for the distance and kernel-based dependence measures.
With this background, in the rest of the paper, we explore the relation between distance
and kernel-based measures of conditional dependence.
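As a concrete illustration, the biased (V-statistic) estimate of (5) has a well-known Gram-matrix form: with K and L the Gram matrices of k_X and k_Y on the sample and H = I − (1/n)11^T the empirical centering matrix, HSIC is estimated by tr(KHLH)/n². The sketch below uses Gaussian kernels as an assumed choice of characteristic kernels.

import numpy as np

def gaussian_gram(A, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a - a'||^2 / (2 sigma^2))."""
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased (V-statistic) estimate of D^2_{kX kY}(P_XY, P_X P_Y) in (5),
    computed as tr(K H L H) / n^2 with the empirical centering matrix H."""
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2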

4.2 Hilbert-Schmidt conditional independence criterion


For appropriate choice of kernels and distances, the following result provides a kernel-
equivalent of gCdCov, which we refer to as the Hilbert-Schmidt conditional independence
criterion (HSCIC).
Theorem 1 Let (X, ρ_X) and (Y, ρ_Y) be semi-metric spaces of negative type. Suppose
E[ρ_X²(X, x_0) | Z] < ∞ and E[ρ_Y²(Y, y_0) | Z] < ∞ a.s.-P_Z for some x_0 ∈ X, y_0 ∈ Y. If k_X
and k_Y are pd kernels on X and Y that are distance-induced, i.e.,

    k_X(x, x′) = ρ_X(x, θ) + ρ_X(x′, θ) − ρ_X(x, x′)

and

    k_Y(y, y′) = ρ_Y(y, θ′) + ρ_Y(y′, θ′) − ρ_Y(y, y′)

for some θ ∈ X and θ′ ∈ Y, then

    V²_{ρ_X, ρ_Y}(X, Y | Z) = G ∘ [ρ_X ρ_Y] = D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}),   a.s.-P_Z   (6)

with

    D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}) = G ∘ [k_X k_Y],

where G is defined in (4).
On the other hand, let k_X and k_Y be pd kernels on X and Y respectively. Suppose
E[k_X²(X, X) | Z] < ∞ and E[k_Y²(Y, Y) | Z] < ∞ a.s.-P_Z. If ρ_X and ρ_Y are semi-metrics on
X and Y that are kernel-induced, i.e.,

    ρ_X(x, x′) = (k_X(x, x) + k_X(x′, x′))/2 − k_X(x, x′)

and

    ρ_Y(y, y′) = (k_Y(y, y) + k_Y(y′, y′))/2 − k_Y(y, y′),

then (6) holds.

Proof Suppose k_X and k_Y are distance-induced. Then

    D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}) = G ∘ [k_X k_Y] = G[k_X(X, X′) k_Y(Y, Y′)]
    = G[(ρ_X(X, θ) + ρ_X(X′, θ) − ρ_X(X, X′))(ρ_Y(Y, θ′) + ρ_Y(Y′, θ′) − ρ_Y(Y, Y′))]
    = G[ρ_X(X, X′) ρ_Y(Y, Y′)] = V²_{ρ_X, ρ_Y}(X, Y | Z)

a.s.-P_Z, where we used the fact that G[g(X, Y, X′, Y′)] = 0 a.s.-P_Z when g does not depend
on one or more of its arguments (for example, a constant function). On the other hand,
suppose ρ_X and ρ_Y are kernel-induced. Clearly they are of negative type. Then

    V²_{ρ_X, ρ_Y}(X, Y | Z) = G[ρ_X(X, X′) ρ_Y(Y, Y′)]
    = G[((k_X(X, X) + k_X(X′, X′))/2 − k_X(X, X′))((k_Y(Y, Y) + k_Y(Y′, Y′))/2 − k_Y(Y, Y′))]
    = G[k_X(X, X′) k_Y(Y, Y′)] = D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}),

a.s.-P_Z, where we again used the above mentioned facts about G.


Note that, while the quantities θ and θ′ induce a family of kernels as θ and θ′ range
over X and Y respectively, all these kernels are equivalent in the sense that they induce
the same HSCIC, as shown by the equivalence in (6). This means that CdCov is induced by
kernels of the form k_X(x, x′) = ‖x − θ‖ + ‖x′ − θ‖ − ‖x − x′‖, x, x′ ∈ R^p, and k_Y(y, y′) =
‖y − θ′‖ + ‖y′ − θ′‖ − ‖y − y′‖, y, y′ ∈ R^q, with θ = θ′ = 0 being a popular choice—this choice
leads to the covariance function of a fractional Brownian motion.
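The two constructions appearing in Theorem 1 are simple to write down. The sketch below implements the distance-induced kernel and the kernel-induced semi-metric for the Euclidean case discussed above; the choice θ = 0 and the helper names are illustrative.

import numpy as np

def distance_induced_kernel(x, xp, theta=None):
    """k(x, x') = rho(x, theta) + rho(x', theta) - rho(x, x') with rho the
    Euclidean metric; theta = 0 recovers the kernel that induces CdCov."""
    if theta is None:
        theta = np.zeros_like(x)
    return (np.linalg.norm(x - theta) + np.linalg.norm(xp - theta)
            - np.linalg.norm(x - xp))

def kernel_induced_semimetric(k, x, xp):
    """rho(x, x') = (k(x, x) + k(x', x')) / 2 - k(x, x') for a pd kernel k."""
    return 0.5 * (k(x, x) + k(xp, xp)) - k(x, xp)

# Example: a Gaussian kernel and the semi-metric of negative type it induces.
gauss = lambda a, b: float(np.exp(-0.5 * np.sum((a - b) ** 2)))
x, xp = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(distance_induced_kernel(x, xp), kernel_induced_semimetric(gauss, x, xp))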
We would like to mention that a concurrent and independent work by Park and Muandet
(2020) proposed a criterion with the same name HSCIC, which is defined as the distance
between the conditional mean embedding of PXY |Z and the product of marginal conditional
mean embeddings of PX|Z and PY |Z , where the conditional mean embedding of PX|Z is
denoted by µPX|Z and µPX|Z = E[kX (X, ·)|Z] (the conditional mean embedding of PY |Z and
PXY |Z can be similarly defined). It is easy to verify that

    V²_{ρ_X, ρ_Y}(X, Y | Z) = G[ρ_X(X, X′) ρ_Y(Y, Y′)] = G[k_X(X, X′) k_Y(Y, Y′)]
    = ‖E[k_X(X, ·) ⊗ k_Y(Y, ·) | Z] − E[k_X(X, ·) | Z] ⊗ E[k_Y(Y, ·) | Z]‖²_{H_X ⊗ H_Y}
    = ‖µ_{P_{XY|Z}} − µ_{P_{X|Z}} ⊗ µ_{P_{Y|Z}}‖²_{H_X ⊗ H_Y}.   (7)
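Expression (7) suggests an estimator in the spirit of Park and Muandet (2020): estimate the conditional mean embeddings by kernel ridge regression on Z and evaluate the squared RKHS distance in closed form through Gram matrices. The sketch below is a minimal version of this idea; the Gaussian kernels, the ridge parameter lam and the bandwidth sigma are assumed choices, and the function is not the exact estimator of Park and Muandet (2020).

import numpy as np

def hscic_sq_at(X, Y, Z, z, lam=1e-3, sigma=1.0):
    """Estimate of the squared norm in (7) at Z = z.

    The conditional mean embeddings are estimated via kernel ridge
    regression on Z with weights beta(z) = (K_Z + n*lam*I)^{-1} k_Z(., z);
    the resulting squared distance only involves Gram matrices."""
    def gram(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    n = X.shape[0]
    K_x, K_y, K_z = gram(X, X), gram(Y, Y), gram(Z, Z)
    k_z = gram(Z, z[None, :])[:, 0]
    beta = np.linalg.solve(K_z + n * lam * np.eye(n), k_z)
    joint = beta @ ((K_x * K_y) @ beta)                 # squared norm of estimated mu_{P_XY|Z=z}
    cross = np.sum(beta * (K_x @ beta) * (K_y @ beta))  # inner product with mu_{P_X|Z=z} (x) mu_{P_Y|Z=z}
    prod = (beta @ K_x @ beta) * (beta @ K_y @ beta)    # squared norm of the product embedding
    return joint - 2.0 * cross + prod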

While HSCIC is a natural measure of conditional dependence, in the kernel literature,
however, a different measure has been widely used (Fukumizu et al., 2004, 2008; Zhang et al.,
2011), which is based on the Hilbert-Schmidt norm of a certain conditional cross-covariance
operator. Before we introduce the conditional cross-covariance operator and these other


measures of conditional dependence (which we do in Section 5), first we will briefly discuss
how HSIC is related to the Hilbert-Schmidt norm of a cross-covariance operator so that its
extension to the conditional version is natural.
For random variables X ∼ PX and Y ∼ PY with joint distribution PXY such that
E[kX (X, X)] < ∞ and E[kY (Y, Y )] < ∞, there exists a unique bounded linear operator,
called the cross-covariance operator (Baker, 1973; Fukumizu et al., 2004), ΣY X : HkX →
HkY such that ∀ f ∈ HkX , g ∈ HkY ,
    ⟨g, Σ_{YX} f⟩_{H_{k_Y}} = E[f(X) g(Y)] − E[f(X)] E[g(Y)].

In fact, using the reproducing property that f(x) = ⟨f, k_X(·, x)⟩_{H_{k_X}}, ∀x ∈ X, and g(y) =
⟨g, k_Y(·, y)⟩_{H_{k_Y}}, ∀y ∈ Y, it follows that

    Σ_{YX} = ∫∫ k_Y(·, y) ⊗ k_X(·, x) dP_{XY}(x, y) − ∫ k_Y(·, y) dP_Y(y) ⊗ ∫ k_X(·, x) dP_X(x),   (8)

where ⊗ denotes the tensor product. Clearly, ΣY X is a natural generalization of the finite-
dimensional covariance matrix between two random vectors X ∈ Rp and Y ∈ Rq . Based on
(8) and the reproducing property, it can be verified that
    ‖Σ_{YX}‖²_{HS} = ‖∫∫ k_X(·, x) ⊗ k_Y(·, y) d(P_{XY} − P_X P_Y)(x, y)‖²_{HS}
    = ∫∫ ∫∫ ⟨k_X(·, x) ⊗ k_Y(·, y), k_X(·, x′) ⊗ k_Y(·, y′)⟩_{HS} d(P_{XY} − P_X P_Y)(x, y) d(P_{XY} − P_X P_Y)(x′, y′)
    = D²_{k_X k_Y}(P_{XY}, P_X P_Y),   (9)
where ‖·‖_{HS} denotes the Hilbert-Schmidt norm. Since HSCIC is a conditional version
of HSIC and since the latter is the Hilbert-Schmidt norm of the cross-covariance operator,
it is natural to extend Σ_{YX} to its conditional version as a P_Z-measurable bounded linear
operator Σ̇_{YX|Z} : H_{k_X} → H_{k_Y} such that ∀ f ∈ H_{k_X}, g ∈ H_{k_Y},

    ⟨g, Σ̇_{YX|Z} f⟩_{H_{k_Y}} = E[f(X) g(Y) | Z] − E[f(X) | Z] E[g(Y) | Z],   a.s.-P_Z,

thereby yielding

    Σ̇_{YX|Z} = E[k_X(·, X) ⊗ k_Y(·, Y) | Z] − E[k_X(·, X) | Z] ⊗ E[k_Y(·, Y) | Z].

Similar to (9), it is easy to verify that

    ‖Σ̇_{YX|Z}‖²_{HS} = D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z})

a.s.-P_Z. Therefore, if k_X and k_Y are characteristic, then X ⊥⊥ Y | Z ⟺ Σ̇_{YX|Z} = 0, P_Z-a.s.
However, in the kernel literature, to the best of our knowledge, besides the concurrent
and independent work by Park and Muandet (2020) in which a quantity similar to HSCIC
is proposed, HSCIC has not been used as a measure of conditional independence probably
because it is a random operator. We can obtain a single measure of conditional dependence
by considering the expectation of HSCIC over Z ∼ PZ , i.e.,
    D_{P_Z}(P_{XY|Z}, P_{X|Z} P_{Y|Z}) := E_Z[‖Σ̇_{YX|Z}‖²_{HS}].   (10)

This single measure of conditional dependence, obtained by averaging HSCIC over Z, is not
discussed in Park and Muandet (2020).
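Continuing the earlier sketch, the single measure in (10) can be estimated by averaging the pointwise HSCIC estimate over the observed Z_i; the helper hscic_sq_at is the illustrative function defined after (7), not part of the paper.

import numpy as np

def d_pz(X, Y, Z, lam=1e-3, sigma=1.0):
    """Monte Carlo estimate of (10): the average of the pointwise HSCIC estimate
    over the sample Z_1, ..., Z_n, using the illustrative hscic_sq_at above."""
    return float(np.mean([hscic_sq_at(X, Y, Z, z, lam=lam, sigma=sigma) for z in Z]))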


5. Relation between RKHS and Distance-based Conditional Dependence Measures
Instead of Σ̇Y X|Z , Fukumizu et al. (2004) considered an alternate operator, called the
conditional cross-covariance operator, which is defined as follows. Suppose EX [kX (X, X)] <
∞, EY [kY (Y, Y )] < ∞ and EZ [kZ (Z, Z)] < ∞. Then there exists a unique bounded linear
operator ΣY X|Z such that
    ⟨g, Σ_{YX|Z} f⟩_{H_{k_Y}} = E[f(X) g(Y)] − E[E[f(X) | Z] E[g(Y) | Z]] = E[Cov(f(X), g(Y) | Z)]

for all f ∈ H_{k_X} and g ∈ H_{k_Y}. As above, using the reproducing property, it can be shown
that

    Σ_{YX|Z} = E[E[k_Y(·, Y) ⊗ k_X(·, X) | Z] − E[k_X(·, X) | Z] ⊗ E[k_Y(·, Y) | Z]] = E_Z[Σ̇_{YX|Z}].

However, unlike Σ̇Y X|Z , the conditional cross-covariance operator ΣY X|Z does not character-
ize conditional independence since ΣY X|Z = 0—assuming kX and kY to be characteristic—
only implies PXY = EZ [PX|Z PY |Z ] and not Σ̇Y X|Z = 0, a.s.-PZ (Fukumizu et al., 2004,
Theorem 8). Therefore, Fukumizu et al. (2004, Corollary 9) considered Z as a part of X
by defining Ẍ := (X, Z) and showed that Σ_{YẌ|Z} = 0 if and only if X ⊥⊥ Y | Z, assuming
k_X, k_Y and k_Z to be characteristic. This is indeed the case since if k_X, k_Y and k_Z are
characteristic, then Σ_{YẌ|Z} = 0 implies E_Z[Σ̇_{YẌ|Z}] = 0 and therefore

    E[E[1{X ∈ A, Y ∈ B, Z ∈ C} | Z]] − E[E[1{X ∈ A, Z ∈ C} | Z] E[1{Y ∈ B} | Z]]
    = E[1{X ∈ A, Y ∈ B, Z ∈ C}] − E[E[1{X ∈ A, Z ∈ C} | Z] E[1{Y ∈ B} | Z]]
    = E[E[1{X ∈ A, Y ∈ B} | Z] 1{Z ∈ C}] − E[E[1{X ∈ A} | Z] E[1{Y ∈ B} | Z] 1{Z ∈ C}]
    = E[(P_{XY|Z}(A × B | Z) − P_{X|Z}(A | Z) P_{Y|Z}(B | Z)) 1{Z ∈ C}] = 0,

for all A ∈ B_X, B ∈ B_Y and C ∈ B_Z, where B_X, B_Y and B_Z are the Borel σ-algebras
associated with X, Y and Z respectively. This implies

    P_{XY|Z}(A × B | Z) − P_{X|Z}(A | Z) P_{Y|Z}(B | Z) = 0, a.s.-P_Z,

implying X ⊥⊥ Y | Z, a.s.-P_Z. Hence ‖Σ_{YẌ|Z}‖²_{HS} can be used as a measure of conditional
independence, which we refer to as HSC̈IC.
The goal of this section is to explore the distance counterpart of HSC̈IC and understand
how it is related to CdCov, gCdCov, and DPZ defined in (10). To this end, we first provide
an expression for kΣY Ẍ|Z k2HS in terms of kernels, using which we obtain an expression in
terms of distances.
Theorem 2 Suppose E_X[k_X²(X, X)] < ∞, E_Y[k_Y²(Y, Y)] < ∞ and E_Z[k_Z²(Z, Z)] < ∞.
Denote Ẍ = (X, Z). Then

    ‖Σ_{YẌ|Z}‖²_{HS} = E_Z E_{Z′}[k_Z(Z, Z′) ⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS}]
                     = E_Z E_{Z′}[k_Z(Z, Z′) h(Z, Z′)],   (11)


where h(Z, Z′) := F_{YX|Z} F_{Y′X′|Z′}[k_X(X, X′) k_Y(Y, Y′)], F_{YX|Z} := E_{XY|Z} − E_{Y|Z} E_{X|Z} and
E_{XY|Z} := E[· | Z] (E_{Y|Z} and E_{X|Z} are defined similarly).
Suppose k_X and k_Y are distance-induced, i.e.,

    k_X(x, x′) = ρ_X(x, θ) + ρ_X(x′, θ) − ρ_X(x, x′) and k_Y(y, y′) = ρ_Y(y, θ′) + ρ_Y(y′, θ′) − ρ_Y(y, y′)

for some θ ∈ X and θ′ ∈ Y. Then h(Z, Z′) = F_{YX|Z} F_{Y′X′|Z′}[ρ_X(X, X′) ρ_Y(Y, Y′)].

Proof Note that

    Σ_{YẌ|Z} = E[Σ̇_{YẌ|Z}]
    = E[E[k_Y(·, Y) ⊗ (k_X k_Z)(·, Ẍ) | Z]] − E[E[k_Y(·, Y) | Z] ⊗ E[(k_X k_Z)(·, Ẍ) | Z]]
    = E[E[k_Y(·, Y) ⊗ k_X(·, X) | Z] ⊗ k_Z(·, Z)] − E[E[k_Y(·, Y) | Z] ⊗ E[k_X(·, X) | Z] ⊗ k_Z(·, Z)]
    = E_Z[Σ̇_{YX|Z} ⊗ k_Z(·, Z)].

Therefore,

    ‖Σ_{YẌ|Z}‖²_{HS} = ‖E[Σ̇_{YX|Z} ⊗ k_Z(·, Z)]‖²_{HS}
    = ⟨E_Z[Σ̇_{YX|Z} ⊗ k_Z(·, Z)], E_Z[Σ̇_{YX|Z} ⊗ k_Z(·, Z)]⟩_{HS}
    = E_Z E_{Z′} ⟨Σ̇_{YX|Z} ⊗ k_Z(·, Z), Σ̇_{YX|Z′} ⊗ k_Z(·, Z′)⟩_{HS}
    = E_Z E_{Z′} [⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS} ⟨k_Z(·, Z), k_Z(·, Z′)⟩_{H_{k_Z}}]
    = E_Z E_{Z′} [⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS} k_Z(Z, Z′)].   (12)

Note that Σ̇_{YX|Z} = F_{YX|Z}[k_Y(·, Y) ⊗ k_X(·, X)]. Therefore,

    ⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS} = ⟨F_{YX|Z}[k_Y(·, Y) ⊗ k_X(·, X)], F_{YX|Z′}[k_Y(·, Y) ⊗ k_X(·, X)]⟩_{HS}
    = ⟨F_{YX|Z}[k_Y(·, Y) ⊗ k_X(·, X)], F_{Y′X′|Z′}[k_Y(·, Y′) ⊗ k_X(·, X′)]⟩_{HS}
    = F_{YX|Z} F_{Y′X′|Z′} [⟨k_Y(·, Y) ⊗ k_X(·, X), k_Y(·, Y′) ⊗ k_X(·, X′)⟩_{HS}]
    = F_{YX|Z} F_{Y′X′|Z′} [⟨k_Y(·, Y), k_Y(·, Y′)⟩_{H_{k_Y}} ⟨k_X(·, X), k_X(·, X′)⟩_{H_{k_X}}]
    = F_{YX|Z} F_{Y′X′|Z′} [k_X(X, X′) k_Y(Y, Y′)] = h(Z, Z′),

using which in (12) yields the result. If k_X and k_Y are distance-induced, then using the
fact that F_{YX|Z} F_{Y′X′|Z′}[g(X, X′, Y, Y′)] = 0 when g does not depend on one or more of its
arguments—basically, the same argument that we carried out in the proof of Theorem 1—we
have

    h(Z, Z′) = F_{YX|Z} F_{Y′X′|Z′}[ρ_X(X, X′) ρ_Y(Y, Y′)],

and the result follows.
While h(Z, Z′) has a distance interpretation as shown in Theorem 2, ‖Σ_{YẌ|Z}‖²_{HS} does not
have an elegant representation in terms of distances. Suppose k_Z is also distance-induced,
i.e., k_Z(z, z′) = ρ_Z(z, θ″) + ρ_Z(z′, θ″) − ρ_Z(z, z′) for some θ″ ∈ Z. Then

    ‖Σ_{YẌ|Z}‖²_{HS} = ∫∫ h(z, z′) k_Z(z, z′) dP_Z(z) dP_Z(z′)
                     = ∫∫ [ρ_Z(z, θ″) + ρ_Z(z′, θ″) − ρ_Z(z, z′)] h(z, z′) dP_Z(z) dP_Z(z′).   (13)

Unfortunately, (13) cannot be related in a simple manner to gCdCov or HSCIC. However,
some simplifications occur based on certain assumptions on k_Z, as shown in the following
corollaries. Under an appropriate choice of k_Z, Corollary 3 shows HSC̈IC to be asymptotically
equivalent to the weighted average of HSCIC (equivalently, the weighted average of gCdCov)
defined in (10), while Corollary 4 shows the asymptotic equivalence between HSC̈IC and
CdCov.

Corollary 3 Suppose the assumptions of Theorem 2 hold and P_Z has a density p_Z w.r.t. the
Lebesgue measure on R^d such that h(z, ·) p_Z is uniformly continuous and bounded for all
z ∈ R^d. For t > 0, let

    k_Z(z, z′) = (1/t^d) ψ((z − z′)/t),   z, z′ ∈ R^d,

where ψ ∈ L¹(R^d) is a bounded continuous positive definite function with ∫_{R^d} ψ(z) dz = 1.
Then

    lim_{t→0} ‖Σ_{YẌ|Z}‖²_{HS} = E_Z[‖Σ̇_{YX|Z}‖²_{HS} p_Z(Z)] = D²_{P_Z}(P_{XY|Z}, P_{X|Z} P_{Y|Z}).

Proof Define ψ_t(z) := t^{−d} ψ(z/t). From (11), it follows that

    ‖Σ_{YẌ|Z}‖²_{HS} = E_Z E_{Z′}[ψ_t(Z − Z′) h(Z, Z′)]
    = ∫ p_Z(z) [∫ ψ_t(z − z′) h(z, z′) p_Z(z′) dz′] dz
    = ∫ p_Z(z) (ψ_t ∗ (h(z, ·) p_Z))(z) dz,

where ∗ denotes convolution. Taking the limit on both sides as t → 0 and applying the
dominated convergence theorem, we obtain

    lim_{t→0} ‖Σ_{YẌ|Z}‖²_{HS} = lim_{t→0} ∫ p_Z(z) (ψ_t ∗ (h(z, ·) p_Z))(z) dz = ∫ p_Z(z) lim_{t→0} (ψ_t ∗ (h(z, ·) p_Z))(z) dz.

The result follows from Folland (1999, Theorem 8.14), which yields lim_{t→0} (ψ_t ∗ (h(z, ·) p_Z))(z) =
h(z, z) p_Z(z) for all z ∈ R^d, and by noting that h(Z, Z) = ‖Σ̇_{YX|Z}‖²_{HS}.

Corollary 4 Suppose the assumptions of Theorem 2 hold with ρ_X(x, x′) = ‖x − x′‖, x, x′ ∈
R^p and ρ_Y(y, y′) = ‖y − y′‖, y, y′ ∈ R^q. Let k_Z(z, z′) = η(z) η(z′), z, z′ ∈ R^d for some
real-valued function η on R^d and

    E_Z[|η(Z)| ‖φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}‖_{L²(w)}] < ∞.   (14)

Then

    ‖Σ_{YẌ|Z}‖²_{HS} = ‖E_Z[η(Z)(φ_{XY|Z} − φ_{X|Z} φ_{Y|Z})]‖²_{L²(w)},   (15)

where w(t, s) = (c_p c_q)^{−1} ‖t‖^{−p−1} ‖s‖^{−q−1}, t ∈ R^p, s ∈ R^q. In particular, for t > 0 and some
a ∈ R^d, if η(z) = t^{−d} θ((a − z)/t), z ∈ R^d, where θ is a bounded continuous function with
∫ θ(z) dz = 1, and P_Z has a bounded uniformly continuous density p_Z on R^d such that

    ess sup_Z ∫ |φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)|² dw(t, s) < ∞,   (16)

then

    lim_{t→0} ‖Σ_{YẌ|Z}‖²_{HS} = p_Z²(a) V²(X, Y | Z = a).   (17)

Proof In the following, we show that

    h(Z, Z′) = ⟨φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}, φ_{XY|Z′} − φ_{X|Z′} φ_{Y|Z′}⟩_{L²(w)}   (18)

and therefore (15) follows by using (18) in (11) with k_Z(z, z′) = η(z) η(z′) and applying the
dominated convergence theorem through (14). We now prove (18). Consider

    ⟨φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}, φ_{XY|Z′} − φ_{X|Z′} φ_{Y|Z′}⟩_{L²(w)}
    = ∫∫ w(t, s) [φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)] [φ_{XY|Z′}(t, s) − φ_{X|Z′}(t) φ_{Y|Z′}(s)]* dt ds
    = ∫∫ w(t, s) Λ(t, s, Z, Z′) dt ds,   (19)

where * denotes complex conjugation and

    Λ(t, s, Z, Z′) = [φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)] [φ_{XY|Z′}(t, s) − φ_{X|Z′}(t) φ_{Y|Z′}(s)]*
    = [E[e^{i(⟨t,X⟩+⟨s,Y⟩)} | Z] − E[e^{i⟨t,X⟩} | Z] E[e^{i⟨s,Y⟩} | Z]] · [E[e^{i(⟨t,X⟩+⟨s,Y⟩)} | Z′] − E[e^{i⟨t,X⟩} | Z′] E[e^{i⟨s,Y⟩} | Z′]]*
    = E_{XY|Z} E_{X′Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)} − E_{XY|Z} E_{X′|Z′} E_{Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)}
      − E_{X|Z} E_{Y|Z} E_{X′Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)} + E_{X|Z} E_{Y|Z} E_{X′|Z′} E_{Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)}
    = F_{YX|Z} F_{Y′X′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)},   (20)

where F_{YX|Z} := E_{XY|Z} − E_{Y|Z} E_{X|Z}. Using (20) in (19), we obtain

    ∫∫ w(t, s) Λ(t, s, Z, Z′) dt ds = F_{YX|Z} F_{Y′X′|Z′} ∫∫ cos⟨t, X − X′⟩ cos⟨s, Y − Y′⟩ w(t, s) dt ds   (21)

by noting that sin⟨t, X − X′⟩ and sin⟨s, Y − Y′⟩ are odd functions w.r.t. t and s respectively.
Since cos⟨t, X − X′⟩ cos⟨s, Y − Y′⟩ = 1 − (1 − cos⟨t, X − X′⟩) − (1 − cos⟨s, Y − Y′⟩) + (1 −
cos⟨t, X − X′⟩)(1 − cos⟨s, Y − Y′⟩) and

    F_{YX|Z} F_{Y′X′|Z′}[f(X, X′, Y, Y′)] = 0

for f(X, X′, Y, Y′) = 1, f(X, X′, Y, Y′) = 1 − cos⟨t, X − X′⟩ and f(X, X′, Y, Y′) = 1 −
cos⟨s, Y − Y′⟩, (21) reduces to

    ∫∫ w(t, s) Λ(t, s, Z, Z′) dt ds = F_{YX|Z} F_{Y′X′|Z′} ∫∫ [(1 − cos⟨t, X − X′⟩)/(c_p ‖t‖^{p+1})] [(1 − cos⟨s, Y − Y′⟩)/(c_q ‖s‖^{q+1})] dt ds
    = F_{YX|Z} F_{Y′X′|Z′}[‖X − X′‖ ‖Y − Y′‖]
    = h(Z, Z′),

where the last equality follows from Lemma 1 of Székely et al. (2007) through ∫ (1 − cos⟨t, x⟩)/(c_p ‖t‖^{p+1}) dt =
‖x‖, thereby proving the result in (15). By defining θ_t(z) := t^{−d} θ(z/t), we have

    E_Z[η(Z)(φ_{XY|Z} − φ_{X|Z} φ_{Y|Z})] = (θ_t ∗ ((φ_{XY|Z=·} − φ_{X|Z=·} φ_{Y|Z=·}) p_Z))(a),

which by (Folland, 1999, Theorem 8.14) converges to (φ_{XY|Z=a} − φ_{X|Z=a} φ_{Y|Z=a}) p_Z(a) as
t → 0. Using these in (15) along with the dominated convergence theorem combined with (16)
yields (17).

Remark 5 Informally, the result of Corollary 3 can be obtained by choosing k_Z(z, z′) =
δ(z − z′), z, z′ ∈ R^d, where δ(·) is the Dirac distribution. Since such a choice does not
correspond to a valid reproducing kernel—the Dirac distribution is not a function but a
distribution that does not belong to an RKHS—the rigorous argument involves considering a
family of kernels indexed by a bandwidth t which, in the limit t → 0, achieves the behavior of
the Dirac distribution. A similar argument applies to Corollary 4 as well.

6. Discussion
Conditional distance covariance is a commonly used metric for measuring conditional
dependence in the statistics community. In the machine learning community, a conditional
dependence measure based on reproducing kernels is popularly used in applications such as
conditional independence testing. In this work, we have explored the connection between
these two conditional dependence measures where we showed the distance-based measure to
be a limiting version of the kernel-based measure, where we may view conditional distance
covariance as a member of a much larger class of kernel-based conditional dependence
measures. This may enable to design more powerful conditional independence tests by
choosing a richer class of kernels.
Having understood the relation between these various measures of conditional dependence,
an important question is the statistical behavior of conditional independence tests based
on these measures. Fukumizu et al. (2004, Proposition 5) provides an alternate representation
for the conditional covariance operator Σ_{YẌ|Z} in terms of only covariance operators (this
is reminiscent of the situation when (X, Y, Z) are jointly normal so that the conditional
covariance matrix can be represented in terms of the joint covariance matrices) as
Σ_{YẌ|Z} = Σ_{YẌ} − Σ_{YZ} Σ̃_{ZZ}^{−1} Σ_{ZẌ}, where Σ̃_{ZZ}^{−1} is the right inverse of Σ_{ZZ} on (Ker(Σ_{ZZ}))^⊥.
The advantage of this alternate form is that Σ_{YẌ|Z} can be estimated from i.i.d. data
(X_i, Y_i, Z_i)_{i=1}^n ~ P_{XYZ} by simply estimating the (cross-)covariance operators Σ_{YẌ}, Σ_{YZ},
Σ_{ZẌ}, and replacing Σ̃_{ZZ}^{−1} by the inverse of a regularized version of an empirical estimator
of Σ_{ZZ}. Using these, a plug-in (biased) estimator ‖Σ̂_{YẌ|Z}‖²_{HS} of HSC̈IC (i.e., ‖Σ_{YẌ|Z}‖²_{HS})
can be shown to be consistent and to have a computational complexity of O(n³), where
Σ̂_{YẌ|Z} := Σ̂_{YẌ} − Σ̂_{YZ}(Σ̂_{ZZ} + λI)^{−1} Σ̂_{ZẌ} and λ > 0—these claims can be proved using the
ideas in Fukumizu et al. (2008), where such claims are proved for a normalized version of
Σ_{YẌ|Z}. Similar results are shown for the kernel version of HSCIC (see (7)) by Park and
Muandet (2020). To elaborate, Park and Muandet (2020, Section 5.2) proposed a biased
estimator of HSCIC (see the r.h.s. of (7)), which is based on Gram matrices on X, Y and Z
and an associated regularized inverse, yielding a computational complexity of O(n³). On the
other hand, Wang et al. (2015) proposed a (biased) estimator of CdCov—the same idea can
be used to estimate gCdCov and therefore HSCIC—based on a Nadaraya-Watson type density
estimator of P_{XY|Z}, where it can be shown that HSCIC can be consistently estimated with
a computational complexity of O(n³). This means that all these different estimators of HSCIC
and HSC̈IC are consistent and have the same computational complexity. However, the
statistical performance of these estimators as test statistics for testing conditional independence
remains open.
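As a rough illustration of the plug-in computation described above, the sketch below evaluates ‖Σ̂_{YẌ|Z}‖²_HS through centered Gram matrices, using the identity that, with K̃ denoting a doubly centered Gram matrix and R = nλ(K̃_Z + nλI)^{-1}, the statistic equals tr(R K̃_Y R K̃_Ẍ)/n². The Gaussian kernels, the product kernel k_X k_Z on Ẍ, and the regularization λ are assumed choices; this is a sketch of the idea, not the authors' implementation.

import numpy as np

def hscic_ddot_sq(X, Y, Z, lam=1e-3, sigma=1.0):
    """Plug-in (biased) estimate of the squared HS norm of
    Sigma_hat_{Y Xddot | Z} = Sigma_hat_{Y Xddot}
                              - Sigma_hat_{YZ} (Sigma_hat_{ZZ} + lam I)^{-1} Sigma_hat_{Z Xddot}.

    In Gram-matrix form, with doubly centered Gram matrices Kt_* = H K_* H,
    H = I - (1/n) 11^T, and R = n*lam*(Kt_Z + n*lam*I)^{-1}, the statistic is
    tr(R Kt_Y R Kt_Xddot) / n^2; the overall cost is O(n^3)."""
    def gram(A):
        sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kt_xddot = H @ (gram(X) * gram(Z)) @ H   # kernel on Xddot = (X, Z) taken as k_X * k_Z
    Kt_y = H @ gram(Y) @ H
    Kt_z = H @ gram(Z) @ H
    R = n * lam * np.linalg.inv(Kt_z + n * lam * np.eye(n))
    return np.trace(R @ Kt_y @ R @ Kt_xddot) / n ** 2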

Acknowledgements
BKS is partially supported by National Science Foundation (NSF) award DMS-1713011 and
CAREER award DMS-1945396.

References
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

K. Balasubramanian, T. Li, and M. Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 22(1):1–45, 2021.

R. D. Cook and B. Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455–474, 2002.

G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley-Interscience, New York, USA, 1999.

K. Fukumizu, F. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496. Curran Associates, Inc., 2008.

K. Fukumizu, A. Gretton, B. Schölkopf, and B. K. Sriperumbudur. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480. Curran Associates, Inc., 2009.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, ALT'05, pages 63–77, Berlin, Heidelberg, 2005. Springer-Verlag.

A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 585–592. Curran Associates, Inc., 2008.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

R. Lyons. Distance covariance in metric spaces. The Annals of Probability, 41(5):3284–3305, 2013.

K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

J. Park and K. Muandet. A measure-theoretic approach to kernel conditional mean embeddings. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21247–21259. Curran Associates, Inc., 2020.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, USA, 2000.

D. Sejdinovic, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research, 19(44):1–29, 2018.

C.-J. Simon-Gabriel, A. Barp, B. Schölkopf, and L. Mackey. Metrizing weak convergence with maximum mean discrepancies. 2020. https://arxiv.org/pdf/2006.09268.pdf.

A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson. Causation, Prediction, and Search. MIT Press, Cambridge, MA, USA, 2000.

B. K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.

L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807–834, 2007.

Z. Szabó and B. K. Sriperumbudur. Characteristic and universal tensor product kernels. Journal of Machine Learning Research, 18(233):1–29, 2018.

G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, (5), 2004.

G. Székely and M. Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

G. J. Székely, M. L. Rizzo, et al. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

X. Wang, W. Pan, W. Hu, Y. Tian, and H. Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.

K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 804–813. AUAI Press, 2011.
