
Journal of Machine Learning Research 24 (2023) 1-16 Submitted 3/20; Revised 12/22; Published 1/23

On Distance and Kernel Measures of Conditional Dependence

Tianhong Sheng [email protected]
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA

Bharath K. Sriperumbudur [email protected]
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA

Editor: John Shawe-Taylor

Abstract
Measuring conditional dependence is one of the important tasks in statistical inference and
is fundamental in causal discovery, feature selection, dimensionality reduction, Bayesian
network learning, and others. In this work, we explore the connection between conditional
dependence measures induced by distances on a metric space and reproducing kernels
associated with a reproducing kernel Hilbert space (RKHS). For certain distance and kernel
pairs, we show that the distance-based conditional dependence measures are equivalent to the
kernel-based measures. On the other hand, we also show that some popular kernel
conditional dependence measures based on the Hilbert-Schmidt norm of a certain conditional
cross-covariance operator do not have a simple distance representation, except in
some limiting cases.
Keywords: Conditional independence test, distance covariance, energy distance, Hilbert-
Schmidt independence criterion, reproducing kernel Hilbert space

1. Introduction

Measuring conditional dependence between random variables plays a fundamental role in
many statistical inference tasks such as causal discovery (Pearl, 2000; Spirtes et al., 2000),
supervised dimensionality reduction (Cook and Li, 2002; Fukumizu et al., 2004), conditional
independence testing (Su and White, 2007; Gretton et al., 2012), and others. Formally,
for random variables (X, Y, Z), X is said to be conditionally independent of Y given Z,
denoted as X ⊥⊥ Y | Z, if P_{XY|Z} = P_{X|Z} P_{Y|Z} a.s.-P_Z, where the notation P_{X|Z} denotes a
regular conditional probability defined as P_{X|Z}(·) = E[1(X ∈ ·) | Z] a.s.-P_Z, with P_Z being
the marginal distribution of Z. Given a distance measure D on the space of probability
measures, D(P_{XY|Z}, P_{X|Z} P_{Y|Z}) measures the degree of conditional dependence between X
and Y given Z, with X ⊥⊥ Y | Z if and only if D(P_{XY|Z}, P_{X|Z} P_{Y|Z}) = 0 a.s.-P_Z. Some popular
choices for D include the Kullback-Leibler divergence (more generally f -divergence), total
variation distance, Hellinger distance, Wasserstein distance, among others.

© 2023 Tianhong Sheng and Bharath K. Sriperumbudur.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v24/20-238.html.

Recently, a class of distances on probability measures induced by a Euclidean metric
on R^d—more generally by metrics of strongly negative type—called the energy distance
(Székely and Rizzo, 2004) and distance covariance (Székely et al., 2007; Székely and Rizzo,
2009; Lyons, 2013), has gained popularity in nonparametric hypothesis testing (e.g., two-sample
and independence testing) because of their computational simplicity and elegant
interpretation. Wang et al. (2015) extended distance covariance to conditional distributions
on R^d to obtain a measure of conditional dependence, called conditional distance covariance
(CdCov), which has been applied in conditional independence testing. We refer to this class
of probability metrics as distance-based measures and point the reader to Section 3 for
preliminaries on distance-based measures.
On the other hand, in the machine learning literature, measures of dependence have
been formulated based on embedding of probability distributions into a reproducing kernel
Hilbert space (RKHS; Aronszajn, 1950). This embedding into an RKHS makes it possible to capture the
properties of distributions and has been used in many applications including homogeneity,
independence, and conditional independence testing (for example, see Muandet et al., 2017,
and references therein). Formally, given a probability measure ν defined on a measurable
space X and an RKHS H_k with reproducing kernel k, ν can be embedded into H_k as

    ν ↦ ∫_X k(·, x) dν(x) := µ_k(ν) ∈ H_k,

where µ_k(ν) is called the mean element or kernel mean embedding of ν. Using this notion,
the kernel distance, also called the maximum mean discrepancy (MMD), between two
probability distributions P and Q is defined as the distance between their mean elements
(Gretton et al., 2007), i.e., D(P, Q) = ‖µ_k(P) − µ_k(Q)‖_{H_k}. The kernel embedding and the
kernel distance are well-studied in the literature and their mathematical theory is well-
developed (Sriperumbudur et al., 2010, 2011; Sriperumbudur, 2016; Szabó and Sriperumbudur,
2018; Simon-Gabriel and Schölkopf, 2018; Simon-Gabriel et al., 2020). Generalizing this
notion of kernel embedding to distributions defined on product spaces yields a kernel measure
of dependence, called the Hilbert-Schmidt independence criterion (HSIC; Gretton et al., 2005,
Gretton et al., 2008, Smola et al., 2007), which can then be used as a measure of conditional
dependence by applying it to conditional probability distributions (Fukumizu et al., 2004,
2008). Fukumizu et al. (2004); Gretton et al. (2005) provided an alternate interpretation for
HSIC in terms of the Hilbert-Schmidt norm of a certain cross-covariance operator, based on
which the Hilbert-Schmidt norm of a conditional cross-covariance operator (which we refer to as
HSC̈IC) is then proposed as a measure of conditional dependence. We point the reader to
Sections 4 and 5 for details and refer to this class of probability metrics as kernel-based
measures.
Sejdinovic et al. (2013) established an equivalence between distance-based and kernel-
based dependence measures (i.e., distance covariance and HSIC) by showing that a repro-
ducing kernel that defines HSIC induces a semi-metric of negative type which in turn defines
the distance covariance (Székely et al., 2007, 2009), and vice-versa. However, despite the
striking similarity, the relationship between conditional distance covariance and related
kernel measures is not known. The goal of this work is to investigate the relationship
between distance and kernel-based measures of conditional independence, and in particular,


understand whether these measures are equivalent (i.e., the distance measure can be obtained
from the kernel measure and vice-versa).
As our contributions, first, in Theorem 1 (Section 4.2), we generalize the conditional
distance covariance of Wang et al. (2015) to arbitrary metric spaces of negative type—we
call this generalized CdCov (gCdCov)—and develop a kernel measure of conditional
dependence (which we refer to as HSCIC) that is equivalent to gCdCov. Therefore, it follows from
Theorem 1 that CdCov introduced by Wang et al. (2015) is a special case of the HSCIC. In
fact, the HSCIC we obtain is exactly the conditional dependence measure recently proposed
by Park and Muandet (2020). Second, in Theorem 2 (Section 5), we consider the kernel
measure of conditional dependence based on the Hilbert-Schmidt norm of the conditional
cross-covariance operator (i.e., HSC̈IC) and obtain its distance-based interpretation. We
show that this distance-based version of HSC̈IC does not have an elegant interpretation,
except in limiting cases where it is related to CdCov and gCdCov (see Corollaries 3 and 4).
The paper is organized as follows. Definitions and notation that are widely used
throughout the paper are collected in Section 2. The preliminaries on distance-based and
kernel-based measures are presented in Sections 3 and 4.1, respectively, while the main results
are presented in Sections 4.2 and 5.

2. Definitions & Notation


For a non-empty set X, a function ρ : X × X → [0, ∞) is called a semi-metric on X if it
satisfies (i) ρ(x, x′) = 0 ⇔ x = x′ and (ii) ρ(x, x′) = ρ(x′, x). Then (X, ρ) is said to be a
semi-metric space. The semi-metric space (X, ρ) is said to be of negative type if for all n ≥ 2,
{x_i}_{i=1}^n ⊂ X and {α_i}_{i=1}^n ⊂ R with ∑_{i=1}^n α_i = 0, we have ∑_{i=1}^n ∑_{j=1}^n α_i α_j ρ(x_i, x_j) ≤ 0. (X, ρ) is
said to be of strongly negative type if ∫∫ ρ(x, y) dµ(x) dµ(y) < 0 for all finite signed measures
µ ≠ 0 with µ(X) = 0. A real-valued symmetric function k : X × X → R is called a positive
definite (pd) kernel if, for all n ∈ N, {α_i}_{i=1}^n ⊂ R and {x_i}_{i=1}^n ⊂ X, we have
∑_{i,j=1}^n α_i α_j k(x_i, x_j) ≥ 0. A function k : X × X → R, (x, y) ↦ k(x, y), is a reproducing
kernel of the Hilbert space (H_k, ⟨·, ·⟩_{H_k}) of functions if and only if (i) ∀x ∈ X, k(·, x) ∈ H_k
and (ii) ∀x ∈ X, ∀f ∈ H_k, ⟨k(·, x), f⟩_{H_k} = f(x) hold. If such a k exists, then H_k is called
a reproducing kernel Hilbert space.
X , Y and Z denote Polish spaces endowed with Borel σ-algebras. X, Y and Z denote
random elements in X , Y and Z , respectively. Ẍ is defined as (X, Z), which is a random
element in X × Z . The probability law of a random variable X is denoted by PX , the
joint probability law of random variables X and Z is denoted by PXZ and the regular
conditional probability of X given Z is defined as PX|Z (·) = E[1(X ∈ ·)|Z] a.s.-PZ such
that P_{X|Z=z} is a probability measure on X for all z ∈ Z. The symbol X ⊥⊥ Y | Z indicates
the conditional independence of X and Y given Z. φX and φY denote the characteristic
functions of X and Y respectively and their joint characteristic function is denoted as φXY .
The conditional characteristic functions of X, Y and (X, Y ) given Z are denoted as φX|Z ,
φY |Z and φXY |Z respectively. A measurable, positive definite kernel on X is denoted as kX
and its corresponding RKHS as HX . Similarly we define kY , HY , kZ , HZ , kẌ and HẌ .
In this paper we assume that all involved RKHS’s are separable.
The space of r-integrable functions w.r.t. a σ-finite measure µ on R^d is denoted as
L^r(R^d, µ), and if µ is the Lebesgue measure on R^d, we denote it as L^r(R^d).


3. Conditional Distance Covariance


Distance covariance was proposed by Székely et al. (2007) as a new measure of dependence
between Euclidean random vectors in arbitrary dimension. An interesting feature of distance
covariance is that unlike the classical covariance, it is zero only if the random vectors are
independent. Formally, the distance covariance (dCov) between two random vectors is
defined as the weighted L2 norm between the joint characteristic function and the product
of marginal characteristic functions, i.e.,
    V²(X, Y) = ‖φ_{XY} − φ_X φ_Y‖²_{L²(w)} = (1/(c_p c_q)) ∫∫ |φ_{XY}(t, s) − φ_X(t) φ_Y(s)|² / (‖t‖^{p+1} ‖s‖^{q+1}) dt ds,

where φ_{XY} denotes the joint characteristic function of random variables X ∈ R^p and
Y ∈ R^q, with φ_X and φ_Y denoting their respective marginal characteristic functions. Here
c_p = π^{(p+1)/2}/Γ((p+1)/2), c_q = π^{(q+1)/2}/Γ((q+1)/2) and w(t, s) = ‖t‖^{−p−1} ‖s‖^{−q−1} with ‖t‖² = ∑_{i=1}^p t_i² for
t = (t_1, . . . , t_p). A particular advantage of distance covariance is its compact representation
in terms of certain expectations of pairwise Euclidean distances (Székely et al., 2007):

    V²(X, Y) = E[E[‖X − X′‖ ‖Y − Y′‖ | X, Y]] + E‖X − X′‖ E‖Y − Y′‖ − 2 E[E[‖X − X′‖ | X] E[‖Y − Y′‖ | Y]],   (1)

where (X′, Y′) is an i.i.d. copy of (X, Y), which leads to straightforward empirical estimates by replacing
the expectations with empirical estimators. Such an estimator has been used as a test statistic
in independence testing and the resulting test is shown to be consistent if the marginal
distributions have finite first moment (Székely et al., 2007). As a natural generalization, Lyons
(2013) extended (1) to metric spaces of negative type and showed that the corresponding
distance covariance—obtained by replacing the Euclidean metric by a metric of strongly
negative type—is zero if and only if X and Y are independent.
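As an illustration of how (1) leads to a plug-in estimate, the following minimal Python/NumPy sketch computes the V-statistic version of distance covariance by replacing every expectation in (1) with a sample mean over pairwise Euclidean distance matrices. The function name dcov_sq and the toy data are illustrative and not taken from the paper.

import numpy as np

def dcov_sq(X, Y):
    """V-statistic estimate of V^2(X, Y) in (1) from paired samples.

    X is an (n, p) array and Y an (n, q) array; every expectation in (1)
    is replaced by a sample mean over the pairwise distance matrices
    a_ij = ||X_i - X_j|| and b_ij = ||Y_i - Y_j||."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    term1 = np.mean(a * b)                            # E[ ||X-X'|| ||Y-Y'|| ]
    term2 = np.mean(a) * np.mean(b)                   # E||X-X'|| E||Y-Y'||
    term3 = np.mean(a.mean(axis=1) * b.mean(axis=1))  # E[ E[||X-X'|| | X] E[||Y-Y'|| | Y] ]
    return term1 + term2 - 2.0 * term3

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(500, 2))
    y_dep = x[:, :1] ** 2 + 0.1 * rng.normal(size=(500, 1))  # dependent on x
    y_ind = rng.normal(size=(500, 1))                        # independent of x
    print(dcov_sq(x, y_dep), dcov_sq(x, y_ind))              # first value is markedly larger

Replacing the Euclidean norms with semi-metrics of negative type in this sketch gives the corresponding estimate of the generalized distance covariance of Lyons (2013).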
Extending the idea of distance covariance, recently, Wang et al. (2015) proposed a
conditional version to measure conditional independence between random vectors of arbitrary
dimension. To elaborate, let X ∈ Rp , Y ∈ Rq and Z ∈ Rr be random vectors. The
conditional distance covariance (CdCov) V(X, Y |Z) between random vectors X and Y with
finite moments given Z is defined as
    V²(X, Y | Z) = ‖φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}‖²_{L²(w)} = (1/(c_p c_q)) ∫∫ |φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)|² / (‖t‖^{p+1} ‖s‖^{q+1}) dt ds,

where

    φ_{XY|Z}(t, s) = E[e^{√−1⟨t,X⟩ + √−1⟨s,Y⟩} | Z],   φ_{X|Z}(t) = φ_{XY|Z}(t, 0) and φ_{Y|Z}(s) = φ_{XY|Z}(0, s).

As a crucial property, CdCov is zero P_Z-almost surely if and only if X ⊥⊥ Y | Z. Similar
to distance covariance, one advantage of this measure is that its sample version can be
expressed elegantly as a V - or U -statistic, based on which Wang et al. (2015) proposed a
statistically consistent conditional independence test.
The conditional distance covariance defined above can also be computed in terms of the
conditional expectations of pairwise Euclidean distances:
    V²(X, Y | Z) = E[E[‖X − X′‖ ‖Y − Y′‖ | X, Y, Z] | Z] + E[‖X − X′‖ | Z] E[‖Y − Y′‖ | Z]
                   − 2 E[E[‖X − X′‖ | X, Z] E[‖Y − Y′‖ | Y, Z] | Z],   (2)


where (X, Y) and (X′, Y′) are independent copies given Z. In a similar spirit to Lyons
(2013), CdCov can be extended to metric spaces of negative type through conditional
expectations so that (2) can be written as
    V²_{ρ_X, ρ_Y}(X, Y | Z) = E[E[ρ_X(X, X′) ρ_Y(Y, Y′) | X, Y, Z] | Z]
                              + E[ρ_X(X, X′) | Z] E[ρ_Y(Y, Y′) | Z]
                              − 2 E[E[ρ_X(X, X′) | X, Z] E[ρ_Y(Y, Y′) | Y, Z] | Z]   (3)
                            =: G[ρ_X(X, X′) ρ_Y(Y, Y′)] =: G ∘ [ρ_X ρ_Y],   (4)

where ρ_X and ρ_Y are metrics of strongly negative type defined on the spaces X and Y
respectively, with E[ρ_X²(X, x_0) | Z] < ∞ a.s.-P_Z and E[ρ_Y²(Y, y_0) | Z] < ∞ a.s.-P_Z for some
x_0 ∈ X and y_0 ∈ Y. The moment conditions ensure that the expectations are finite. When
ρ_X and ρ_Y are of strongly negative type, then clearly (3) is zero if and only if X ⊥⊥ Y | Z.
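To make (2) and (3) concrete, the sketch below estimates V²(X, Y | Z = z) by replacing each conditional expectation with a Nadaraya-Watson weighted average, in the spirit of (though not identical to) the estimator of Wang et al. (2015). The Gaussian smoothing kernel in Z and the bandwidth are assumed, illustrative choices; swapping the Euclidean norms for other semi-metrics of negative type yields the corresponding gCdCov estimate.

import numpy as np

def cdcov_sq_at(X, Y, Z, z, bandwidth=0.5):
    """Plug-in estimate of V^2(X, Y | Z = z) based on (2).

    Conditional expectations given Z = z are approximated by weighted
    averages with Nadaraya-Watson weights w_i proportional to
    exp(-||Z_i - z||^2 / (2 h^2)); the Gaussian kernel and the bandwidth h
    are illustrative choices, not prescribed by the paper."""
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    w = np.exp(-np.sum((Z - z) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    w = w / w.sum()                        # weights approximating P(. | Z = z)
    W = np.outer(w, w)                     # pair weights for (i, j)
    term1 = np.sum(W * a * b)              # E[ ||X-X'|| ||Y-Y'|| | Z = z ]
    term2 = np.sum(W * a) * np.sum(W * b)  # E[||X-X'|| | Z] E[||Y-Y'|| | Z]
    term3 = np.sum(w * (a @ w) * (b @ w))  # E[ E[||X-X'|| | X, Z] E[||Y-Y'|| | Y, Z] | Z ]
    return term1 + term2 - 2.0 * term3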

4. Kernel Measures of Conditional Dependence


First, in Section 4.1, we present preliminaries on RKHS embedding of probability measures
and introduce kernel measures of dependence. Based on this discussion, in Section 4.2, we
develop a kernel measure of conditional dependence (which we call the Hilbert-Schmidt conditional
independence criterion—HSCIC) that is related to gCdCov (and therefore CdCov) discussed
in Section 3. We also present an interpretation for gCdCov through conditional cross-
covariance operator formulation for HSCIC.

4.1 RKHS embedding of probabilities


In the machine learning literature, the notion of embedding probability measures in an
RKHS has gained lot of attention and has been applied in goodness-of-fit (Balasubramanian
et al., 2021), two-sample (Gretton et al., 2007, 2012), independence (Gretton et al., 2008) and
conditional independence (Fukumizu et al., 2008; Zhang et al., 2011) testing. To elaborate,
given a probability measure P such that ∫_X √(k(x, x)) dP(x) < ∞, its RKHS embedding
(Smola et al., 2007) is defined as

    P ↦ µ_P := ∫_X k(·, x) dP(x) ∈ H_k,

where Hk is an RKHS with k as the reproducing kernel. Based on this embedding, a distance
on the space of probabilities can be defined through the distance between the embeddings,
i.e., D_k(P, Q) = ‖µ_P − µ_Q‖_{H_k}, called the kernel distance or maximum mean discrepancy
(Gretton et al., 2007). If the map P ↦ µ_P is injective, then the kernel k that induces µ_P is
said to be characteristic (Fukumizu et al., 2009; Sriperumbudur et al., 2010), and therefore
D_k(P, Q) induces a metric on M_k^{1/2}(X) := {P ∈ M_+^1(X) : ∫_X √(k(x, x)) dP(x) < ∞},
where M_+^1(X) denotes the set of all probability measures on X. Using the reproducing
property of the kernel, it can be shown that
    D_k²(P, Q) = E_{XX′} k(X, X′) + E_{YY′} k(Y, Y′) − 2 E_{XY} k(X, Y),

where X, X′ ~ P and Y, Y′ ~ Q are i.i.d. Extending this distance to probability measures on
product spaces, particularly the joint measure P_{XY} and product of marginals P_X P_Y, yields
a measure of dependence between two random variables X and Y defined on measurable
spaces X and Y, called the Hilbert-Schmidt independence criterion (HSIC), which is
defined (Gretton et al., 2005) as

    D²_{k_X k_Y}(P_{XY}, P_X P_Y) = E_{XY} E_{X′Y′}[k_X(X, X′) k_Y(Y, Y′)]
                                    + E_X E_{X′}[k_X(X, X′)] E_Y E_{Y′}[k_Y(Y, Y′)]
                                    − 2 E_{XY}[E_{X′} k_X(X, X′) E_{Y′} k_Y(Y, Y′)]   (5)
                                  = ∫ (k_X k_Y)(x, y, x′, y′) d[P_{XY} − P_X P_Y]²(x, y, x′, y′).

If the kernels k_X and k_Y are characteristic, then HSIC characterizes independence (Szabó
and Sriperumbudur, 2018), i.e., D_{k_X k_Y}(P_{XY}, P_X P_Y) = 0 if and only if X ⊥⊥ Y. An empirical
version of (5) has been used as a test statistic in independence testing and the resultant test
is shown to be consistent against all alternatives as long as k_X and k_Y are characteristic
(Gretton et al., 2008). An interesting connection between kernel-based HSIC and distance-based
dCov was established by Sejdinovic et al. (2013): dCov in (1) is in fact a special case of
HSIC, and HSIC is equivalent to the generalized dCov introduced by Lyons (2013). This
result provides a unifying framework for the distance and kernel-based dependence measures.
With this background, in the rest of the paper, we explore the relation between distance
and kernel-based measures of conditional dependence.
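As a concrete illustration, the biased (V-statistic) estimate of (5) has a well-known Gram-matrix form: with K and L the Gram matrices of k_X and k_Y on the sample and H = I − (1/n)11^T the empirical centering matrix, HSIC is estimated by tr(KHLH)/n². The sketch below uses Gaussian kernels as an assumed choice of characteristic kernels.

import numpy as np

def gaussian_gram(A, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||a - a'||^2 / (2 sigma^2))."""
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased (V-statistic) estimate of D^2_{kX kY}(P_XY, P_X P_Y) in (5),
    computed as tr(K H L H) / n^2 with the empirical centering matrix H."""
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2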

4.2 Hilbert-Schmidt conditional independence criterion


For appropriate choice of kernels and distances, the following result provides a kernel-
equivalent of gCdCov, which we refer to as the Hilbert-Schmidt conditional independence
criterion (HSCIC).
Theorem 1 Let (X, ρ_X) and (Y, ρ_Y) be semi-metric spaces of negative type. Suppose
E[ρ_X²(X, x_0) | Z] < ∞ and E[ρ_Y²(Y, y_0) | Z] < ∞ a.s.-P_Z for some x_0 ∈ X, y_0 ∈ Y. If k_X
and k_Y are pd kernels on X and Y that are distance-induced, i.e.,

    k_X(x, x′) = ρ_X(x, θ) + ρ_X(x′, θ) − ρ_X(x, x′)

and

    k_Y(y, y′) = ρ_Y(y, θ′) + ρ_Y(y′, θ′) − ρ_Y(y, y′)

for some θ ∈ X and θ′ ∈ Y, then

    V²_{ρ_X, ρ_Y}(X, Y | Z) = G ∘ [ρ_X ρ_Y] = D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}),   a.s.-P_Z   (6)

with

    D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}) = G ∘ [k_X k_Y],

where G is defined in (4).
On the other hand, let k_X and k_Y be pd kernels on X and Y respectively. Suppose
E[k_X²(X, X) | Z] < ∞ and E[k_Y²(Y, Y) | Z] < ∞ a.s.-P_Z. If ρ_X and ρ_Y are semi-metrics on
X and Y that are kernel-induced, i.e.,

    ρ_X(x, x′) = (k_X(x, x) + k_X(x′, x′))/2 − k_X(x, x′)

and

    ρ_Y(y, y′) = (k_Y(y, y) + k_Y(y′, y′))/2 − k_Y(y, y′),

then (6) holds.

Proof Suppose k_X and k_Y are distance-induced. Then

    D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}) = G ∘ [k_X k_Y] = G[k_X(X, X′) k_Y(Y, Y′)]
    = G[(ρ_X(X, θ) + ρ_X(X′, θ) − ρ_X(X, X′))(ρ_Y(Y, θ′) + ρ_Y(Y′, θ′) − ρ_Y(Y, Y′))]
    = G[ρ_X(X, X′) ρ_Y(Y, Y′)] = V²_{ρ_X, ρ_Y}(X, Y | Z)

a.s.-P_Z, where we used the fact that G[g(X, Y, X′, Y′)] = 0 a.s.-P_Z when g does not depend
on one or more of its arguments (for example, a constant function). On the other hand,
suppose ρ_X and ρ_Y are kernel-induced. Clearly they are of negative type. Then

    V²_{ρ_X, ρ_Y}(X, Y | Z) = G[ρ_X(X, X′) ρ_Y(Y, Y′)]
    = G[((k_X(X, X) + k_X(X′, X′))/2 − k_X(X, X′))((k_Y(Y, Y) + k_Y(Y′, Y′))/2 − k_Y(Y, Y′))]
    = G[k_X(X, X′) k_Y(Y, Y′)] = D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z}),

a.s.-P_Z, where we again used the above mentioned facts about G.


Note that, while the quantities θ and θ′ induce a family of kernels as θ and θ′ range
over X and Y respectively, all these kernels are equivalent in the sense that they induce
the same HSCIC, as shown by the equivalence in (6). This means that CdCov is induced by
kernels of the form k_X(x, x′) = ‖x − θ‖ + ‖x′ − θ‖ − ‖x − x′‖, x, x′ ∈ R^p, and k_Y(y, y′) =
‖y − θ′‖ + ‖y′ − θ′‖ − ‖y − y′‖, y, y′ ∈ R^q, with θ = θ′ = 0 being a popular choice—this choice
leads to the covariance function of a fractional Brownian motion.
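The two constructions appearing in Theorem 1 are simple to write down. The sketch below implements the distance-induced kernel and the kernel-induced semi-metric for the Euclidean case discussed above; the choice θ = 0 and the helper names are illustrative.

import numpy as np

def distance_induced_kernel(x, xp, theta=None):
    """k(x, x') = rho(x, theta) + rho(x', theta) - rho(x, x') with rho the
    Euclidean metric; theta = 0 recovers the kernel that induces CdCov."""
    if theta is None:
        theta = np.zeros_like(x)
    return (np.linalg.norm(x - theta) + np.linalg.norm(xp - theta)
            - np.linalg.norm(x - xp))

def kernel_induced_semimetric(k, x, xp):
    """rho(x, x') = (k(x, x) + k(x', x')) / 2 - k(x, x') for a pd kernel k."""
    return 0.5 * (k(x, x) + k(xp, xp)) - k(x, xp)

# Example: a Gaussian kernel and the semi-metric of negative type it induces.
gauss = lambda a, b: float(np.exp(-0.5 * np.sum((a - b) ** 2)))
x, xp = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(distance_induced_kernel(x, xp), kernel_induced_semimetric(gauss, x, xp))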
We would like to mention that a concurrent and independent work by Park and Muandet
(2020) proposed a criterion with the same name HSCIC, which is defined as the distance
between the conditional mean embedding of PXY |Z and the product of marginal conditional
mean embeddings of PX|Z and PY |Z , where the conditional mean embedding of PX|Z is
denoted by µPX|Z and µPX|Z = E[kX (X, ·)|Z] (the conditional mean embedding of PY |Z and
PXY |Z can be similarly defined). It is easy to verify that

    V²_{ρ_X, ρ_Y}(X, Y | Z) = G[ρ_X(X, X′) ρ_Y(Y, Y′)] = G[k_X(X, X′) k_Y(Y, Y′)]
    = ‖E[k_X(X, ·) ⊗ k_Y(Y, ·) | Z] − E[k_X(X, ·) | Z] ⊗ E[k_Y(Y, ·) | Z]‖²_{H_X ⊗ H_Y}
    = ‖µ_{P_{XY|Z}} − µ_{P_{X|Z}} ⊗ µ_{P_{Y|Z}}‖²_{H_X ⊗ H_Y}.   (7)
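Expression (7) suggests an estimator in the spirit of Park and Muandet (2020): estimate the conditional mean embeddings by kernel ridge regression on Z and evaluate the squared RKHS distance in closed form through Gram matrices. The sketch below is a minimal version of this idea; the Gaussian kernels, the ridge parameter lam and the bandwidth sigma are assumed choices, and the function is not the exact estimator of Park and Muandet (2020).

import numpy as np

def hscic_sq_at(X, Y, Z, z, lam=1e-3, sigma=1.0):
    """Estimate of the squared norm in (7) at Z = z.

    The conditional mean embeddings are estimated via kernel ridge
    regression on Z with weights beta(z) = (K_Z + n*lam*I)^{-1} k_Z(., z);
    the resulting squared distance only involves Gram matrices."""
    def gram(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    n = X.shape[0]
    K_x, K_y, K_z = gram(X, X), gram(Y, Y), gram(Z, Z)
    k_z = gram(Z, z[None, :])[:, 0]
    beta = np.linalg.solve(K_z + n * lam * np.eye(n), k_z)
    joint = beta @ ((K_x * K_y) @ beta)                 # squared norm of estimated mu_{P_XY|Z=z}
    cross = np.sum(beta * (K_x @ beta) * (K_y @ beta))  # inner product with mu_{P_X|Z=z} (x) mu_{P_Y|Z=z}
    prod = (beta @ K_x @ beta) * (beta @ K_y @ beta)    # squared norm of the product embedding
    return joint - 2.0 * cross + prod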

While HSCIC is a natural measure of conditional dependence, in the kernel literature,
however, a different measure has been widely used (Fukumizu et al., 2004, 2008; Zhang et al.,
2011), which is based on the Hilbert-Schmidt norm of a certain conditional cross-covariance
operator. Before we introduce the conditional cross-covariance operator and these other


measures of conditional dependence (which we do in Section 5), first we will briefly discuss
how HSIC is related to the Hilbert-Schmidt norm of a cross-covariance operator so that its
extension to the conditional version is natural.
For random variables X ∼ PX and Y ∼ PY with joint distribution PXY such that
E[kX (X, X)] < ∞ and E[kY (Y, Y )] < ∞, there exists a unique bounded linear operator,
called the cross-covariance operator (Baker, 1973; Fukumizu et al., 2004), ΣY X : HkX →
HkY such that ∀ f ∈ HkX , g ∈ HkY ,
    ⟨g, Σ_{YX} f⟩_{H_{k_Y}} = E[f(X) g(Y)] − E[f(X)] E[g(Y)].

In fact, using the reproducing property that f(x) = ⟨f, k_X(·, x)⟩_{H_{k_X}}, ∀x ∈ X, and g(y) =
⟨g, k_Y(·, y)⟩_{H_{k_Y}}, ∀y ∈ Y, it follows that

    Σ_{YX} = ∫∫ k_Y(·, y) ⊗ k_X(·, x) dP_{XY}(x, y) − ∫ k_Y(·, y) dP_Y(y) ⊗ ∫ k_X(·, x) dP_X(x),   (8)

where ⊗ denotes the tensor product. Clearly, ΣY X is a natural generalization of the finite-
dimensional covariance matrix between two random vectors X ∈ Rp and Y ∈ Rq . Based on
(8) and the reproducing property, it can be verified that
    ‖Σ_{YX}‖²_{HS} = ‖∫∫ k_X(·, x) ⊗ k_Y(·, y) d(P_{XY} − P_X P_Y)(x, y)‖²_{HS}
    = ∫∫ ∫∫ ⟨k_X(·, x) ⊗ k_Y(·, y), k_X(·, x′) ⊗ k_Y(·, y′)⟩_{HS} d(P_{XY} − P_X P_Y)(x, y) d(P_{XY} − P_X P_Y)(x′, y′)
    = D²_{k_X k_Y}(P_{XY}, P_X P_Y),   (9)
where ‖·‖_{HS} denotes the Hilbert-Schmidt norm. Since HSCIC is a conditional version
of HSIC and since the latter is the Hilbert-Schmidt norm of the cross-covariance operator,
it is natural to extend Σ_{YX} to its conditional version as a P_Z-measurable bounded linear
operator Σ̇_{YX|Z} : H_{k_X} → H_{k_Y} such that ∀ f ∈ H_{k_X}, g ∈ H_{k_Y},

    ⟨g, Σ̇_{YX|Z} f⟩_{H_{k_Y}} = E[f(X) g(Y) | Z] − E[f(X) | Z] E[g(Y) | Z],   a.s.-P_Z,

thereby yielding

    Σ̇_{YX|Z} = E[k_X(·, X) ⊗ k_Y(·, Y) | Z] − E[k_X(·, X) | Z] ⊗ E[k_Y(·, Y) | Z].

Similar to (9), it is easy to verify that

    ‖Σ̇_{YX|Z}‖²_{HS} = D²_{k_X k_Y}(P_{XY|Z}, P_{X|Z} P_{Y|Z})

a.s.-P_Z. Therefore, if k_X and k_Y are characteristic, then X ⊥⊥ Y | Z ⟺ Σ̇_{YX|Z} = 0, P_Z-a.s.
However, in the kernel literature, to the best of our knowledge, besides the concurrent
and independent work by Park and Muandet (2020) in which a quantity similar to HSCIC
is proposed, HSCIC has not been used as a measure of conditional independence probably
because it is a random operator. We can obtain a single measure of conditional dependence
by considering the expectation of HSCIC over Z ∼ PZ , i.e.,
    D_{P_Z}(P_{XY|Z}, P_{X|Z} P_{Y|Z}) := E_Z[‖Σ̇_{YX|Z}‖²_{HS}].   (10)

This single measure of conditional dependence, obtained by averaging HSCIC over Z, is not
discussed in Park and Muandet (2020).
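Continuing the earlier sketch, the single measure in (10) can be estimated by averaging the pointwise HSCIC estimate over the observed Z_i; the helper hscic_sq_at is the illustrative function defined after (7), not part of the paper.

import numpy as np

def d_pz(X, Y, Z, lam=1e-3, sigma=1.0):
    """Monte Carlo estimate of (10): the average of the pointwise HSCIC estimate
    over the sample Z_1, ..., Z_n, using the illustrative hscic_sq_at above."""
    return float(np.mean([hscic_sq_at(X, Y, Z, z, lam=lam, sigma=sigma) for z in Z]))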


5. Relation between RKHS and Distance-based Conditional Dependence Measures
Instead of Σ̇Y X|Z , Fukumizu et al. (2004) considered an alternate operator, called the
conditional cross-covariance operator, which is defined as follows. Suppose EX [kX (X, X)] <
∞, EY [kY (Y, Y )] < ∞ and EZ [kZ (Z, Z)] < ∞. Then there exists a unique bounded linear
operator ΣY X|Z such that
    ⟨g, Σ_{YX|Z} f⟩_{H_{k_Y}} = E[f(X) g(Y)] − E[E[f(X) | Z] E[g(Y) | Z]] = E[Cov(f(X), g(Y) | Z)]

for all f ∈ H_{k_X} and g ∈ H_{k_Y}. As above, using the reproducing property, it can be shown
that

    Σ_{YX|Z} = E[E[k_Y(·, Y) ⊗ k_X(·, X) | Z] − E[k_X(·, X) | Z] ⊗ E[k_Y(·, Y) | Z]] = E_Z[Σ̇_{YX|Z}].

However, unlike Σ̇Y X|Z , the conditional cross-covariance operator ΣY X|Z does not character-
ize conditional independence since ΣY X|Z = 0—assuming kX and kY to be characteristic—
only implies PXY = EZ [PX|Z PY |Z ] and not Σ̇Y X|Z = 0, a.s.-PZ (Fukumizu et al., 2004,
Theorem 8). Therefore, Fukumizu et al. (2004, Corollary 9) considered Z as a part of X
by defining Ẍ := (X, Z) and showed that Σ_{YẌ|Z} = 0 if and only if X ⊥⊥ Y | Z, assuming
k_X, k_Y and k_Z to be characteristic. This is indeed the case since if k_X, k_Y and k_Z are
characteristic, then Σ_{YẌ|Z} = 0 implies E_Z[Σ̇_{YẌ|Z}] = 0 and therefore

    E[E[1{X ∈ A, Y ∈ B, Z ∈ C} | Z]] − E[E[1{X ∈ A, Z ∈ C} | Z] E[1{Y ∈ B} | Z]]
    = E[1{X ∈ A, Y ∈ B, Z ∈ C}] − E[E[1{X ∈ A, Z ∈ C} | Z] E[1{Y ∈ B} | Z]]
    = E[E[1{X ∈ A, Y ∈ B} | Z] 1{Z ∈ C}] − E[E[1{X ∈ A} | Z] E[1{Y ∈ B} | Z] 1{Z ∈ C}]
    = E[(P_{XY|Z}(A × B | Z) − P_{X|Z}(A | Z) P_{Y|Z}(B | Z)) 1{Z ∈ C}] = 0,

for all A ∈ B_X, B ∈ B_Y and C ∈ B_Z, where B_X, B_Y and B_Z are the Borel σ-algebras
associated with X, Y and Z respectively. This implies

    P_{XY|Z}(A × B | Z) − P_{X|Z}(A | Z) P_{Y|Z}(B | Z) = 0, a.s.-P_Z,

implying X ⊥⊥ Y | Z, a.s.-P_Z. Hence ‖Σ_{YẌ|Z}‖²_{HS} can be used as a measure of conditional
independence, which we refer to as HSC̈IC.
The goal of this section is to explore the distance counterpart of HSC̈IC and understand
how it is related to CdCov, gCdCov, and DPZ defined in (10). To this end, we first provide
an expression for kΣY Ẍ|Z k2HS in terms of kernels, using which we obtain an expression in
terms of distances.
Theorem 2 Suppose E_X[k_X²(X, X)] < ∞, E_Y[k_Y²(Y, Y)] < ∞ and E_Z[k_Z²(Z, Z)] < ∞.
Denote Ẍ = (X, Z). Then

    ‖Σ_{YẌ|Z}‖²_{HS} = E_Z E_{Z′}[k_Z(Z, Z′) ⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS}]
                     = E_Z E_{Z′}[k_Z(Z, Z′) h(Z, Z′)],   (11)


where h(Z, Z′) := F_{YX|Z} F_{Y′X′|Z′}[k_X(X, X′) k_Y(Y, Y′)], F_{YX|Z} := E_{XY|Z} − E_{Y|Z} E_{X|Z} and
E_{XY|Z} := E[· | Z] (E_{Y|Z} and E_{X|Z} are defined similarly).
Suppose k_X and k_Y are distance-induced, i.e.,

    k_X(x, x′) = ρ_X(x, θ) + ρ_X(x′, θ) − ρ_X(x, x′) and k_Y(y, y′) = ρ_Y(y, θ′) + ρ_Y(y′, θ′) − ρ_Y(y, y′)

for some θ ∈ X and θ′ ∈ Y. Then h(Z, Z′) = F_{YX|Z} F_{Y′X′|Z′}[ρ_X(X, X′) ρ_Y(Y, Y′)].

Proof Note that

    Σ_{YẌ|Z} = E[Σ̇_{YẌ|Z}]
    = E[E[k_Y(·, Y) ⊗ (k_X k_Z)(·, Ẍ) | Z]] − E[E[k_Y(·, Y) | Z] ⊗ E[(k_X k_Z)(·, Ẍ) | Z]]
    = E[E[k_Y(·, Y) ⊗ k_X(·, X) | Z] ⊗ k_Z(·, Z)] − E[E[k_Y(·, Y) | Z] ⊗ E[k_X(·, X) | Z] ⊗ k_Z(·, Z)]
    = E_Z[Σ̇_{YX|Z} ⊗ k_Z(·, Z)].

Therefore,

    ‖Σ_{YẌ|Z}‖²_{HS} = ‖E[Σ̇_{YX|Z} ⊗ k_Z(·, Z)]‖²_{HS}
    = ⟨E_Z[Σ̇_{YX|Z} ⊗ k_Z(·, Z)], E_Z[Σ̇_{YX|Z} ⊗ k_Z(·, Z)]⟩_{HS}
    = E_Z E_{Z′} ⟨Σ̇_{YX|Z} ⊗ k_Z(·, Z), Σ̇_{YX|Z′} ⊗ k_Z(·, Z′)⟩_{HS}
    = E_Z E_{Z′} [⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS} ⟨k_Z(·, Z), k_Z(·, Z′)⟩_{H_{k_Z}}]
    = E_Z E_{Z′} [⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS} k_Z(Z, Z′)].   (12)

Note that Σ̇_{YX|Z} = F_{YX|Z}[k_Y(·, Y) ⊗ k_X(·, X)]. Therefore,

    ⟨Σ̇_{YX|Z}, Σ̇_{YX|Z′}⟩_{HS} = ⟨F_{YX|Z}[k_Y(·, Y) ⊗ k_X(·, X)], F_{YX|Z′}[k_Y(·, Y) ⊗ k_X(·, X)]⟩_{HS}
    = ⟨F_{YX|Z}[k_Y(·, Y) ⊗ k_X(·, X)], F_{Y′X′|Z′}[k_Y(·, Y′) ⊗ k_X(·, X′)]⟩_{HS}
    = F_{YX|Z} F_{Y′X′|Z′} [⟨k_Y(·, Y) ⊗ k_X(·, X), k_Y(·, Y′) ⊗ k_X(·, X′)⟩_{HS}]
    = F_{YX|Z} F_{Y′X′|Z′} [⟨k_Y(·, Y), k_Y(·, Y′)⟩_{H_{k_Y}} ⟨k_X(·, X), k_X(·, X′)⟩_{H_{k_X}}]
    = F_{YX|Z} F_{Y′X′|Z′} [k_X(X, X′) k_Y(Y, Y′)] = h(Z, Z′),

using which in (12) yields the result. If k_X and k_Y are distance-induced, then using the
fact that F_{YX|Z} F_{Y′X′|Z′}[g(X, X′, Y, Y′)] = 0 when g does not depend on one or more of its
arguments—basically, the same argument that we carried out in the proof of Theorem 1—we
have

    h(Z, Z′) = F_{YX|Z} F_{Y′X′|Z′}[ρ_X(X, X′) ρ_Y(Y, Y′)],

and the result follows.
While h(Z, Z′) has a distance interpretation as shown in Theorem 2, ‖Σ_{YẌ|Z}‖²_{HS} does not
have an elegant representation in terms of distances. Suppose k_Z is also distance-induced,
i.e., k_Z(z, z′) = ρ_Z(z, θ″) + ρ_Z(z′, θ″) − ρ_Z(z, z′) for some θ″ ∈ Z. Then

    ‖Σ_{YẌ|Z}‖²_{HS} = ∫∫ h(z, z′) k_Z(z, z′) dP_Z(z) dP_Z(z′)
                     = ∫∫ [ρ_Z(z, θ″) + ρ_Z(z′, θ″) − ρ_Z(z, z′)] h(z, z′) dP_Z(z) dP_Z(z′).   (13)

Unfortunately, (13) cannot be related in a simple manner to gCdCov or HSCIC. However,
some simplifications occur based on certain assumptions on k_Z, as shown in the following
corollaries. Under an appropriate choice of k_Z, Corollary 3 shows HSC̈IC to be asymptotically
equivalent to the weighted average of HSCIC (equivalently, the weighted average of gCdCov)
defined in (10), while Corollary 4 shows the asymptotic equivalence between HSC̈IC and
CdCov.

Corollary 3 Suppose the assumptions of Theorem 2 hold and P_Z has a density p_Z w.r.t. the
Lebesgue measure on R^d such that h(z, ·) p_Z is uniformly continuous and bounded for all
z ∈ R^d. For t > 0, let

    k_Z(z, z′) = (1/t^d) ψ((z − z′)/t),   z, z′ ∈ R^d,

where ψ ∈ L¹(R^d) is a bounded continuous positive definite function with ∫_{R^d} ψ(z) dz = 1.
Then

    lim_{t→0} ‖Σ_{YẌ|Z}‖²_{HS} = E_Z[‖Σ̇_{YX|Z}‖²_{HS} p_Z(Z)] = D²_{P_Z}(P_{XY|Z}, P_{X|Z} P_{Y|Z}).

Proof Define ψ_t(z) := t^{−d} ψ(z/t). From (11), it follows that

    ‖Σ_{YẌ|Z}‖²_{HS} = E_Z E_{Z′}[ψ_t(Z − Z′) h(Z, Z′)]
    = ∫ p_Z(z) [∫ ψ_t(z − z′) h(z, z′) p_Z(z′) dz′] dz
    = ∫ p_Z(z) (ψ_t ∗ (h(z, ·) p_Z))(z) dz,

where ∗ denotes convolution. Taking the limit on both sides as t → 0 and applying the
dominated convergence theorem, we obtain

    lim_{t→0} ‖Σ_{YẌ|Z}‖²_{HS} = lim_{t→0} ∫ p_Z(z) (ψ_t ∗ (h(z, ·) p_Z))(z) dz = ∫ p_Z(z) lim_{t→0} (ψ_t ∗ (h(z, ·) p_Z))(z) dz.

The result follows from Folland (1999, Theorem 8.14), which yields lim_{t→0} (ψ_t ∗ (h(z, ·) p_Z))(z) =
h(z, z) p_Z(z) for all z ∈ R^d, and by noting that h(Z, Z) = ‖Σ̇_{YX|Z}‖²_{HS}.

Corollary 4 Suppose the assumptions of Theorem 2 hold with ρ_X(x, x′) = ‖x − x′‖, x, x′ ∈
R^p and ρ_Y(y, y′) = ‖y − y′‖, y, y′ ∈ R^q. Let k_Z(z, z′) = η(z) η(z′), z, z′ ∈ R^d for some
real-valued function η on R^d and

    E_Z[|η(Z)| ‖φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}‖_{L²(w)}] < ∞.   (14)

Then

    ‖Σ_{YẌ|Z}‖²_{HS} = ‖E_Z[η(Z)(φ_{XY|Z} − φ_{X|Z} φ_{Y|Z})]‖²_{L²(w)},   (15)

where w(t, s) = (c_p c_q)^{−1} ‖t‖^{−p−1} ‖s‖^{−q−1}, t ∈ R^p, s ∈ R^q. In particular, for t > 0 and some
a ∈ R^d, if η(z) = t^{−d} θ((a − z)/t), z ∈ R^d, where θ is a bounded continuous function with
∫ θ(z) dz = 1, and P_Z has a bounded uniformly continuous density p_Z on R^d such that

    ess sup_Z ∫ |φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)|² dw(t, s) < ∞,   (16)

then

    lim_{t→0} ‖Σ_{YẌ|Z}‖²_{HS} = p_Z²(a) V²(X, Y | Z = a).   (17)

Proof In the following, we show that

    h(Z, Z′) = ⟨φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}, φ_{XY|Z′} − φ_{X|Z′} φ_{Y|Z′}⟩_{L²(w)}   (18)

and therefore (15) follows by using (18) in (11) with k_Z(z, z′) = η(z) η(z′) and applying the
dominated convergence theorem through (14). We now prove (18). Consider

    ⟨φ_{XY|Z} − φ_{X|Z} φ_{Y|Z}, φ_{XY|Z′} − φ_{X|Z′} φ_{Y|Z′}⟩_{L²(w)}
    = ∫∫ w(t, s) [φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)] [φ_{XY|Z′}(t, s) − φ_{X|Z′}(t) φ_{Y|Z′}(s)]* dt ds
    = ∫∫ w(t, s) Λ(t, s, Z, Z′) dt ds,   (19)

where * denotes complex conjugation and

    Λ(t, s, Z, Z′) = [φ_{XY|Z}(t, s) − φ_{X|Z}(t) φ_{Y|Z}(s)] [φ_{XY|Z′}(t, s) − φ_{X|Z′}(t) φ_{Y|Z′}(s)]*
    = [E[e^{i(⟨t,X⟩+⟨s,Y⟩)} | Z] − E[e^{i⟨t,X⟩} | Z] E[e^{i⟨s,Y⟩} | Z]] · [E[e^{i(⟨t,X⟩+⟨s,Y⟩)} | Z′] − E[e^{i⟨t,X⟩} | Z′] E[e^{i⟨s,Y⟩} | Z′]]*
    = E_{XY|Z} E_{X′Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)} − E_{XY|Z} E_{X′|Z′} E_{Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)}
      − E_{X|Z} E_{Y|Z} E_{X′Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)} + E_{X|Z} E_{Y|Z} E_{X′|Z′} E_{Y′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)}
    = F_{YX|Z} F_{Y′X′|Z′} e^{i(⟨t,X−X′⟩+⟨s,Y−Y′⟩)},   (20)

where F_{YX|Z} := E_{XY|Z} − E_{Y|Z} E_{X|Z}. Using (20) in (19), we obtain

    ∫∫ w(t, s) Λ(t, s, Z, Z′) dt ds = F_{YX|Z} F_{Y′X′|Z′} ∫∫ cos⟨t, X − X′⟩ cos⟨s, Y − Y′⟩ w(t, s) dt ds   (21)

by noting that sin⟨t, X − X′⟩ and sin⟨s, Y − Y′⟩ are odd functions w.r.t. t and s respectively.
Since cos⟨t, X − X′⟩ cos⟨s, Y − Y′⟩ = 1 − (1 − cos⟨t, X − X′⟩) − (1 − cos⟨s, Y − Y′⟩) + (1 −
cos⟨t, X − X′⟩)(1 − cos⟨s, Y − Y′⟩) and

    F_{YX|Z} F_{Y′X′|Z′}[f(X, X′, Y, Y′)] = 0

for f(X, X′, Y, Y′) = 1, f(X, X′, Y, Y′) = 1 − cos⟨t, X − X′⟩ and f(X, X′, Y, Y′) = 1 −
cos⟨s, Y − Y′⟩, (21) reduces to

    ∫∫ w(t, s) Λ(t, s, Z, Z′) dt ds = F_{YX|Z} F_{Y′X′|Z′} ∫∫ [(1 − cos⟨t, X − X′⟩)/(c_p ‖t‖^{p+1})] [(1 − cos⟨s, Y − Y′⟩)/(c_q ‖s‖^{q+1})] dt ds
    = F_{YX|Z} F_{Y′X′|Z′}[‖X − X′‖ ‖Y − Y′‖]
    = h(Z, Z′),

where the last equality follows from Lemma 1 of Székely et al. (2007) through ∫ (1 − cos⟨t, x⟩)/(c_p ‖t‖^{p+1}) dt =
‖x‖, thereby proving the result in (15). By defining θ_t(z) := t^{−d} θ(z/t), we have

    E_Z[η(Z)(φ_{XY|Z} − φ_{X|Z} φ_{Y|Z})] = (θ_t ∗ ((φ_{XY|Z=·} − φ_{X|Z=·} φ_{Y|Z=·}) p_Z))(a),

which by (Folland, 1999, Theorem 8.14) converges to (φ_{XY|Z=a} − φ_{X|Z=a} φ_{Y|Z=a}) p_Z(a) as
t → 0. Using these in (15) along with the dominated convergence theorem combined with (16)
yields (17).

Remark 5 Informally, the result of Corollary 3 can be obtained by choosing k_Z(z, z′) =
δ(z − z′), z, z′ ∈ R^d, where δ(·) is the Dirac distribution. Since such a choice does not
correspond to a valid reproducing kernel—the Dirac distribution is not a function but a
distribution that does not belong to an RKHS—the rigorous argument involves considering a
family of kernels indexed by a bandwidth t which, in the limit t → 0, achieves the behavior of
the Dirac distribution. A similar argument applies to Corollary 4 as well.

6. Discussion
Conditional distance covariance is a commonly used metric for measuring conditional
dependence in the statistics community. In the machine learning community, a conditional
dependence measure based on reproducing kernels is popularly used in applications such as
conditional independence testing. In this work, we have explored the connection between
these two conditional dependence measures where we showed the distance-based measure to
be a limiting version of the kernel-based measure, where we may view conditional distance
covariance as a member of a much larger class of kernel-based conditional dependence
measures. This may enable to design more powerful conditional independence tests by
choosing a richer class of kernels.
Having understood the relation between these various measures of conditional dependence,
an important question is the statistical behavior of conditional independence tests based
on these measures. Fukumizu et al. (2004, Proposition 5) provides an alternate representation
for the conditional covariance operator Σ_{YẌ|Z} in terms of only covariance operators (this
is reminiscent of the situation when (X, Y, Z) are jointly normal so that the conditional
covariance matrix can be represented in terms of the joint covariance matrices) as
Σ_{YẌ|Z} = Σ_{YẌ} − Σ_{YZ} Σ̃_{ZZ}^{−1} Σ_{ZẌ}, where Σ̃_{ZZ}^{−1} is the right inverse of Σ_{ZZ} on (Ker(Σ_{ZZ}))^⊥.
The advantage of this alternate form is that Σ_{YẌ|Z} can be estimated from i.i.d. data
(X_i, Y_i, Z_i)_{i=1}^n ~ P_{XYZ} by simply estimating the (cross-)covariance operators Σ_{YẌ}, Σ_{YZ},
Σ_{ZẌ}, and replacing Σ̃_{ZZ}^{−1} by the inverse of a regularized version of an empirical estimator
of Σ_{ZZ}. Using these, a plug-in (biased) estimator ‖Σ̂_{YẌ|Z}‖²_{HS} of HSC̈IC (i.e., ‖Σ_{YẌ|Z}‖²_{HS})
can be shown to be consistent and to have a computational complexity of O(n³), where
Σ̂_{YẌ|Z} := Σ̂_{YẌ} − Σ̂_{YZ}(Σ̂_{ZZ} + λI)^{−1} Σ̂_{ZẌ} and λ > 0—these claims can be proved using the
ideas in Fukumizu et al. (2008), where such claims are proved for a normalized version of
Σ_{YẌ|Z}. Similar results are shown for the kernel version of HSCIC (see (7)) by Park and
Muandet (2020). To elaborate, Park and Muandet (2020, Section 5.2) proposed a biased
estimator of HSCIC (see the r.h.s. of (7)), which is based on Gram matrices on X, Y and Z
and an associated regularized inverse, yielding a computational complexity of O(n³). On the
other hand, Wang et al. (2015) proposed a (biased) estimator of CdCov—the same idea can
be used to estimate gCdCov and therefore HSCIC—based on a Nadaraya-Watson type density
estimator of P_{XY|Z}, where it can be shown that HSCIC can be consistently estimated with
a computational complexity of O(n³). This means that all these different estimators of HSCIC
and HSC̈IC are consistent and have the same computational complexity. However, the
statistical performance of these estimators as test statistics for testing conditional independence
remains open.
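As a rough illustration of the plug-in computation described above, the sketch below evaluates ‖Σ̂_{YẌ|Z}‖²_HS through centered Gram matrices, using the identity that, with K̃ denoting a doubly centered Gram matrix and R = nλ(K̃_Z + nλI)^{-1}, the statistic equals tr(R K̃_Y R K̃_Ẍ)/n². The Gaussian kernels, the product kernel k_X k_Z on Ẍ, and the regularization λ are assumed choices; this is a sketch of the idea, not the authors' implementation.

import numpy as np

def hscic_ddot_sq(X, Y, Z, lam=1e-3, sigma=1.0):
    """Plug-in (biased) estimate of the squared HS norm of
    Sigma_hat_{Y Xddot | Z} = Sigma_hat_{Y Xddot}
                              - Sigma_hat_{YZ} (Sigma_hat_{ZZ} + lam I)^{-1} Sigma_hat_{Z Xddot}.

    In Gram-matrix form, with doubly centered Gram matrices Kt_* = H K_* H,
    H = I - (1/n) 11^T, and R = n*lam*(Kt_Z + n*lam*I)^{-1}, the statistic is
    tr(R Kt_Y R Kt_Xddot) / n^2; the overall cost is O(n^3)."""
    def gram(A):
        sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kt_xddot = H @ (gram(X) * gram(Z)) @ H   # kernel on Xddot = (X, Z) taken as k_X * k_Z
    Kt_y = H @ gram(Y) @ H
    Kt_z = H @ gram(Z) @ H
    R = n * lam * np.linalg.inv(Kt_z + n * lam * np.eye(n))
    return np.trace(R @ Kt_y @ R @ Kt_xddot) / n ** 2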

Acknowledgements
BKS is partially supported by National Science Foundation (NSF) award DMS-1713011 and
CAREER award DMS-1945396.

References
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

K. Balasubramanian, T. Li, and M. Yuan. On the optimality of kernel-embedding based goodness-of-fit tests. Journal of Machine Learning Research, 22(1):1–45, 2021.

R. D. Cook and B. Li. Dimension reduction for conditional mean in regression. The Annals of Statistics, 30(2):455–474, 2002.

G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley-Interscience, New York, USA, 1999.

K. Fukumizu, F. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496. Curran Associates, Inc., 2008.

K. Fukumizu, A. Gretton, B. Schölkopf, and B. K. Sriperumbudur. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480. Curran Associates, Inc., 2009.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, ALT'05, pages 63–77, Berlin, Heidelberg, 2005. Springer-Verlag.

A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 585–592. Curran Associates, Inc., 2008.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

R. Lyons. Distance covariance in metric spaces. The Annals of Probability, 41(5):3284–3305, 2013.

K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

J. Park and K. Muandet. A measure-theoretic approach to kernel conditional mean embeddings. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 21247–21259. Curran Associates, Inc., 2020.

J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, New York, USA, 2000.

D. Sejdinovic, B. K. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.

C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Journal of Machine Learning Research, 19(44):1–29, 2018.

C.-J. Simon-Gabriel, A. Barp, B. Schölkopf, and L. Mackey. Metrizing weak convergence with maximum mean discrepancies. 2020. https://arxiv.org/pdf/2006.09268.pdf.

A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A. Servedio, and E. Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson. Causation, Prediction, and Search. MIT Press, Cambridge, MA, USA, 2000.

B. K. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.

L. Su and H. White. A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807–834, 2007.

Z. Szabó and B. K. Sriperumbudur. Characteristic and universal tensor product kernels. Journal of Machine Learning Research, 18(233):1–29, 2018.

G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, (5), 2004.

G. Székely and M. Rizzo. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, 2007.

G. J. Székely, M. L. Rizzo, et al. Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265, 2009.

X. Wang, W. Pan, W. Hu, Y. Tian, and H. Zhang. Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734, 2015.

K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 804–813. AUAI Press, 2011.
