Computing marginal and conditional divergences between decomposable models with applications in quantum computing and earth observation

Lee, Loong Kuan; Webb, Geoffrey I.; Schmidt, Daniel F.; Piatkowski, Nico

doi:10.1007/s10115-024-02191-7

Computing marginal and conditional divergences between decomposable models with applications in quantum computing and earth observation

Regular Paper
Open access
Published: 22 August 2024

Volume 66, pages 7527–7556, (2024)
Cite this article

Download PDF

You have full access to this open access article

Knowledge and Information Systems Aims and scope Submit manuscript

Computing marginal and conditional divergences between decomposable models with applications in quantum computing and earth observation

Download PDF

Loong Kuan Lee¹,
Geoffrey I. Webb²,
Daniel F. Schmidt² &
…
Nico Piatkowski¹

568 Accesses
Explore all metrics

Abstract

The ability to compute the exact divergence between two high-dimensional distributions is useful in many applications, but doing so naively is intractable. Computing the $\alpha \beta $-divergence—a family of divergences that includes the Kullback–Leibler divergence and Hellinger distance—between the joint distribution of two decomposable models, i.e., chordal Markov networks, can be done in time exponential in the treewidth of these models. Extending this result, we propose an approach to compute the exact $\alpha \beta $-divergence between any marginal or conditional distribution of two decomposable models. In order to do so tractably, we provide a decomposition over the marginal and conditional distributions of decomposable models. We then show how our method can be used to analyze distributional changes by first applying it to the benchmark image dataset QMNIST and a dataset containing observations from various areas at the Roosevelt Nation Forest and their cover type. Finally, based on our framework, we propose a novel way to quantify the error in contemporary superconducting quantum computers.

Bounds for phylogenetic network space metrics

Article Open access 23 August 2017

Reconstructing Phylogenetic Trees from Multipartite Quartet Systems

Article 23 February 2022

Machine Learning Versus Semidefinite Programming Approach to a Particular Problem of the Theory of Open Quantum Systems

Article 01 July 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The ability to analyze and quantify the differences between two high-dimensional distributions has many applications in the fields of machine learning and data science, for instance, in the study of problems with changing distributions, i.e., concept drift [46, 49], in the detection of anomalous regions in spatio-temporal data [4, 43], and in tasks related to the retrieval, classification, and visualization of time series data [11].

There has been some previous work done on tractably computing the divergence between two high-dimensional distributions modeled by chordal Markov networks (MNs), i.e., decomposable models (DMs) [29]. However, reducing the differences between two distributions to just one scalar value can be quite uninformative as it does not help us better understand why the distributions are different in the first place. Specifically in high dimensions, it is quite likely that the difference between two distributions will be largely attributable to the influence of a small proportion of the variables [49].

One way to better understand distributional changes in higher dimensions is to measure the divergence between the marginal univariate distributions over individual variables in the two distributions [49]. However, this does not capture changes in the relationships between variables [21]. We will see further evidence on the importance of measuring changes over multiple variables at once in Sect. 5.3.1 and Fig. 11.

Therefore, the goal of this work is to develop a method for computing the divergence between two DMs over arbitrary variable subsets. Doing so tractably is nontrivial, as computing the $\alpha \beta $-divergence involves a sum over an exponential number of elements with respect to the number of variables in the distributions. Therefore we need a “decomposition” of the sum in the $\alpha \beta $-divergence into a product of smaller—more tractable—sums. As we will see later on, this will require a decomposition over any marginal distribution of a DM into product of smaller marginal distributions as well. An example of such a decomposition can be found in Fig. 1. Providing such a decomposition in general is one of the our main contributions and the details can be found in Sects. 4.2 and 4.3.

To this end, we will develop the tools needed to find such a decomposition and also show that these tools can also be used to develop a method for computing the $\alpha \beta $-divergence between conditional distributions of DMs as well. More generally, our method can be used to compute any functional that can be expressed as a linear combination of the Functional $\mathcal {F}$ in Definition 3, between any marginal or conditional distributions of 2 DMs. In fact, it is the properties of $\mathcal {F}$, and the $\alpha \beta $-divergence being a linear combination of $\mathcal {F}$ for most values of $\alpha $ and $\beta $ that allows for the decomposition of the $\alpha \beta $-divergence over any distributions between two DMs.

In order to underpin the relevance of our work, we provide a completely novel error analysis of real-world quantum computing devices, based on our methodology. It allows us to gain insights into the specific errors of (subsets of) qubits, which is of utmost importance for practical quantum computing.

This paper is an extended version of the paper with the title “Computing marginal and conditional divergences between decomposable models with applications in quantum computing and earth observation” published in the IEEE International Conference of Data Mining 2023 [30].

2 Background and notation

For any set of n random variables, $\varvec{X}=\{X_{1},\ldots ,X_{n}\}$, let $\mathcal {X}_{\varvec{X}} = \mathcal {X}_{X_{1}} \times \ldots \times \mathcal {X}_{X_{n}}$ be the support of $\varvec{X}$, i.e., the set of values that $\varvec{X}$ can take, and $\varvec{x}\in \mathcal {X}_{\varvec{X}}$ be one of these values. Additionally, for some subset of variables $\varvec{Z}\subseteq \varvec{X}$, let ${\bar{\varvec{Z}}}$ be the set of variables in $\varvec{X}$ not in $\varvec{Z}$, ${\bar{\varvec{Z}}} = \varvec{X}{\setminus }\varvec{Z}$.

2.1 Statistical divergences

A divergence is a measure of the “dissimilarity” between two probability distributions. More formally, a divergence is a functional with the following properties:

Definition 1

(Divergence) Suppose P is the set of probability distributions with the same support. A divergence, D, is the function $D(\cdot \ ||\ \cdot ): P \times P \rightarrow \mathbb {R}$ such that $\forall \, \mathbb {P},\mathbb {Q}\in S: D(\mathbb {P}\ ||\ \mathbb {Q}) \ge 0$ and $\mathbb {P}= \mathbb {Q}\iff D(\mathbb {P}\ ||\ \mathbb {Q}) = 0$ .^{Footnote 1}

Some popular divergences include the Kullback–Leibler (KL) divergence [26], the Hellinger distance [19], and the Itakura-Saito (IS) divergence [15, 23]. Furthermore, these aforementioned divergences can be expressed by a more general family of divergences, known as the $\alpha \beta $-divergence.

Definition 2

($\alpha \beta $-divergence) The $\alpha \beta $-divergence, $D_{\textit{AB}}$, between two positive measures $\mathbb {P}$ and $\mathbb {Q}$ is defined as [12]:

$$\begin{aligned} D_{\textit{AB}}^{\alpha ,\beta }(\mathbb {P},\mathbb {Q}) = -\frac{1 }{\alpha \beta }\left( \mathbb {P}(\varvec{x})^{\alpha }\mathbb {Q}(\varvec{x})^{\beta } - \frac{\alpha }{\alpha + \beta }\mathbb {P}(\varvec{x})^{\alpha + \beta } - \frac{\beta }{\alpha + \beta }\mathbb {Q}(\varvec{x})^{\alpha + \beta } \right) \end{aligned}$$

(1)

for $\alpha ,\beta ,\alpha +\beta \ne 0$.

We can use l’Hôpital’s rule to extend the $\alpha \beta $-divergence by continuity in order to cover all values of $\alpha ,\beta \in \mathbb {R}$, thus avoiding singularity or indeterminacy for certain values of $\alpha $ and $\beta $ [12]. Therefore, the $\alpha \beta $-divergence can be expressed in the following more explicit form:

$$\begin{aligned} D_{\textit{AB}}^{\alpha ,\beta }(\mathbb {P},\mathbb {Q}) = \sum _{\varvec{x}\in \mathcal {X}}d_{\textit{AB}}^{\alpha ,\beta } \big (\mathbb {P}(\varvec{x}),\mathbb {Q}(\varvec{x})\big ) \end{aligned}$$

(2)

where $\alpha $ and $\beta $ are parameters, and

$$\begin{aligned}&d_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}(\varvec{x}), \mathbb {Q}(\varvec{x})) \nonumber \\&\quad = {\left\{ \begin{array}{ll} -\frac{1 }{\alpha \beta }\left( \mathbb {P}(\varvec{x})^{\alpha }\mathbb {Q}(\varvec{x})^{\beta } - \frac{\alpha }{\alpha + \beta }\mathbb {P}(\varvec{x})^{\alpha + \beta } - \frac{\beta }{\alpha + \beta }\mathbb {Q}(\varvec{x})^{\alpha + \beta } \right) , &{} \alpha ,\beta ,\alpha +\beta \ne 0\\ \frac{1}{\alpha ^{2}}\left( \mathbb {P}(\varvec{x})^{\alpha } \log \frac{\mathbb {P}(\varvec{x})^{\alpha }}{\mathbb {Q}(\varvec{x})^{\alpha }} - \mathbb {P}(\varvec{x})^{\alpha } + \mathbb {Q}(\varvec{x})^{\alpha } \right) ,&{} \alpha \ne 0, \beta =0\\ \frac{1}{\alpha ^{2}} \left( \log \frac{\mathbb {Q}(\varvec{x})^{\alpha }}{\mathbb {P}(\varvec{x})^{\alpha }} + \left( \frac{\mathbb {Q}(\varvec{x})^{\alpha }}{\mathbb {P}(\varvec{x})^{\alpha }}\right) ^{-1} - 1 \right) ,&{} \alpha =-\beta , \alpha \ne 0\\ \frac{1}{\beta ^{2}}\left( \mathbb {Q}(\varvec{x})^{\beta }\log \frac{\mathbb {Q}(\varvec{x})^{\beta }}{\mathbb {P}(\varvec{x})^{\beta }} - \mathbb {Q}(\varvec{x})^{\beta } + \mathbb {P}(\varvec{x})^{\beta } \right) , &{} \alpha =0, \beta \ne 0\\ \frac{1}{2}(\log \mathbb {P}(\varvec{x}) - \log \mathbb {Q}(\varvec{x}))^{2}, &{} \alpha ,\beta =0 \end{array}\right. } \end{aligned}$$

(3)

To make subsequent developments easier, the $\alpha \beta $-divergence, when either $\alpha $ or $\beta $ is 0, can be expressed in terms of a linear combination of the functional $\mathcal {F}$ in Definition 3 [29].

Definition 3

($\mathcal {F}$ Functional) Let L be any function with the property:

$$\begin{aligned} L\left( \prod _{x}x\right) =\sum _{x}L(x), \end{aligned}$$

(4)

and $\{g, h, g^{*}, h^{*}\}$ be a set of functionals with the property that, $\forall f \in \{g, h, g^{*}, h^{*}\}$:

$$\begin{aligned} f\left[ \prod _{\varvec{Z}\in \mathcal {A}} P_{\varvec{Z}}\right] = \prod _{\varvec{Z}\in \mathcal {A}} f\left[ P_{\varvec{Z}}\right] \end{aligned}$$

(5)

where $\mathcal {A}$ is a set of subsets of the variables in $\varvec{X}$, $\mathcal {A}\subset \mathcal {P}(\varvec{X})$, and $P_{\varvec{X}}$ is any distribution that can be expressed as a product of smaller distribution defined over the variable subsets in $\mathcal {A}$, $P_{\varvec{X}}=\prod _{\varvec{Z}\in \mathcal {A}}P_{\varvec{Z}}$. Then we define the functional $\mathcal {F}$ to be:

$$\begin{aligned} \mathcal {F}(\mathbb {P},\mathbb {Q}; g, h, g^{*}, h^{*}, L) = \sum _{\varvec{x}\in \mathcal {X}} g\big [\mathbb {P}\big ](\varvec{x}) \cdot h\big [\mathbb {Q}\big ](\varvec{x}) \cdot L\Bigl ( g^{*}[\mathbb {P}](\varvec{x}) \cdot h^{*}[\mathbb {Q}](\varvec{x})\Bigr ) \end{aligned}$$

(6)

Therefore, using the functional $\mathcal {F}$ from Definition 3, the $\alpha \beta $-divergence can be re-expressed as follows:

$$\begin{aligned} D_{\textit{AB}}^{\alpha ,\beta }(\mathbb {P},\mathbb {Q}) ={\left\{ \begin{array}{ll} \sum _{\varvec{x}\in \mathcal {X}} \frac{1}{2}\left( \log \mathbb {P}(\varvec{x}) - \log \mathbb {Q}(\varvec{x})\right) ^{2} ,&{} \alpha ,\beta =0\\ \sum _{i} c_{i} \mathcal {F}(\mathbb {P}, \mathbb {Q}; g_{i},h_{i},g^{*}_{i},h^{*}_{i}) ,&{}\text {otherwise} \end{array}\right. } \end{aligned}$$

(7)

which is more terse compared to Eq. (3).

So far we have discussed divergences between joint distributions, but computing these divergences between two marginal distributions, $\mathbb {P}_{\varvec{Z}}$ and $\mathbb {Q}_{\varvec{Z}}$, instead, is similar as we can just plug these marginal distributions into any divergence directly.

Unfortunately, defining the divergence between two conditional distributions, $\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}$ and $\mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}$, is not as straightforward as they are comprised of multiple distributions over $\varvec{Y}$, one of each value of $\varvec{z}\in \mathcal {Z}$. There is however a “natural” definition for the conditional KL-divergence as the two natural approaches to define it, lead to the same result [8]:

$$\begin{aligned} D(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}\mid \mathbb {P}_{\varvec{Z}})&:=D(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}} \mathbb {P}_{\varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}} \mathbb {P}_{\varvec{Z}}) \end{aligned}$$

(8)

$$\begin{aligned}&:= \mathbb {E}_{\varvec{Z}\sim \mathbb {P}} \left[ D(\mathbb {P}_{\varvec{Y}| \varvec{Z}} \mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})\right] \end{aligned}$$

(9)

Unfortunately, this equivalence is unique to the KL-divergence. For defining other conditional divergences, previous approaches have either exclusively taken the definition in Eq. (8) [8, 10, 47] or Eq. (9) [5, 14, 44]. For our purposes, we will use the definition in Eq. (9) which involves taking the expectation of $D(\mathbb {P}_{\varvec{Y}\mid \varvec{z}},\mathbb {Q}_{\varvec{Y}\mid \varvec{z}})$ with respect to $\mathbb {P}_{\varvec{Z}}$.

Definition 4

(Conditional Divergence) Let $\varvec{Y}$ and $\varvec{Z}$ be two disjoint subsets of $\varvec{X}$: $\varvec{Y}\cup \varvec{Z}\subseteq \varvec{X}$, $\varvec{Y}\cap \varvec{Z}=\emptyset $. Then we define the conditional divergence of $\varvec{Y}$ given $\varvec{Z}$ to be:

$$\begin{aligned} D(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}} \mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}) = \mathbb {E}_{\varvec{Z}\sim \mathbb {P}} \left[ D(\mathbb {P}_{\varvec{Y}| \varvec{Z}} \mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})\right] = \sum _{\varvec{z}\in \mathcal {Z}} \mathbb {P}_{\varvec{Z}}(\varvec{z}) D(\mathbb {P}_{\varvec{Y}| \varvec{z}} \mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{z}}) \end{aligned}$$

(10)

or in other words the expectation of $D(\mathbb {P}_{\varvec{Y}| \varvec{Z}} \mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})$ with respect to the marginal distribution $\mathbb {P}_{\varvec{Z}}$

In order to work with marginal and conditional divergences, a principled way of defining and manipulating high-dimensional probability distributions is required. In the following section, we hence introduce the framework of probabilistic graphical models that underlies a multitude of machine learning techniques and allows us to simplify computations based on factorization properties.

2.2 Graphical models

An undirected graph, $\mathcal {G}_{} = (V(\mathcal {G}_{}), E(\mathcal {G}_{}))$, consists of $n=\left| V(\mathcal {G}_{})\right| $ vertices and edges $(u,v)\in E(\mathcal {G}_{})$. When a subset of vertices, $\mathcal {C}\subseteq V(\mathcal {G}_{})$, are fully connected, they form a clique. Furthermore, let $\varvec{\mathcal {C}}(\mathcal {G}_{})$ denote the set of all maximal cliques in $\mathcal {G}_{}$, where a maximal clique is a clique that is not contained in any other clique.

2.2.1 Markov network (MN)

By associating each vertex in $\mathcal {G}_{}$ to a random variable, $\varvec{X}= \{X_{v} \mid v \in V(\mathcal {G}_{})\}$, the graph $\mathcal {G}_{}$ will encode a set of conditional independencies between the variables $\varvec{X}$ [27]. For some probability distribution $\mathbb {P}$ over the variables $\varvec{X}$, $\mathbb {P}$ is strictly positive and follows the conditional independencies in $\mathcal {G}_{}$ if and only if it is a Gibbs distribution of the form [18]: $\mathbb {P}_{\varvec{X}}(\varvec{x})=\frac{1}{Z} \prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{})}\psi _{\mathcal {C}}(\varvec{x}_{\mathcal {C}})$, where Z is a normalizing constant, $Z=\sum _{\varvec{x}\in \mathcal {X}}\prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{})} \psi _{\mathcal {C}}(\varvec{x}_{\mathcal {C}})$, and $\psi _{\mathcal {C}}$ are positive factors over the domain of each maximal clique, $\psi _{\mathcal {C}}: \mathcal {X}_{\mathcal {C}} \rightarrow \mathbb {R}$. We denote such a MN with the notation $\mathcal {M}=(\mathcal {G}_{},\Psi )$, where $\Psi =\{\psi _{\mathcal {C}} \mid \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{})\}$. Markov networks are also known as Markov random fields.

2.2.2 Belief propagation

Unfortunately, computing Z directly is intractable due to the sum growing exponentially with respect to $\left| \varvec{X}\right| $. However, by finding a chordal graph $\mathcal {H}$ that is strictly larger than $\mathcal {G}_{}$, we can use an algorithm known as belief propagation on the clique tree of $\mathcal {H}$ with initial factors $\Psi $, to compute not only Z, but also the marginal distributions of $\mathbb {P}$ over each $\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {H})$. The complexity of belief propagation is $\mathcal {O}(2^{\omega (\mathcal {H})})$, where $\omega $ is the treewidth of $\mathcal {H}$, i.e., the size of the largest maximal clique in $\mathcal {H}$ minus one.

2.2.3 Decomposable model (DM)

When a MN has a chordal graph structure, it also has a closed-form joint distribution:

$$\begin{aligned} \mathbb {P}_{\mathcal {G}_{}}(\varvec{x}) = \frac{\prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{})} \mathbb {P}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}})}{\prod _{\mathcal {S}\in \varvec{\mathcal {S}}(\mathcal {G}_{})} \mathbb {P}_{\mathcal {S}}(\varvec{x}_{\mathcal {S}})} =\prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{})}\mathbb {P}^{\mathcal {T}}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) \end{aligned}$$

(11)

where $\varvec{\mathcal {S}}(\mathcal {G}_{})$ are the sets of variables between each adjacent maximal clique in the clique tree of $\mathcal {G}_{}$, and $\mathbb {P}^{\mathcal {T}}_{\mathcal {C}}$ is a conditional probability table (CPT) over clique $\mathcal {C}$.

For clarity of exposition, we will only use DM to refer to chordal MNs with the CPTs in Eq. (11) as factors.

3 Related work

There is not much previous work on computing the marginal or conditional divergences between probabilistic graphical models in general, let alone the marginal and conditional $\alpha \beta $-divergence. That said, the idea of isolating distributional changes to some subspace of the distribution is not new. For instance, Hoens et al. [22] employed multiple base learners, each trained over a random subset of variables. When a change occurs, this approach can either update the affected learners or reduce how much they will weigh in on the final output of the ensemble. In Alberghini et al. [1], the authors created an ensemble of AESAKNNS models [45]—a model already capable of handling distributional change on its own—on different subsets of variables and samples. Furthermore, unlike Hoens et al. [22], they varied the size of the variable subsets for model training as well and can dynamically change it over time.

In order to explicitly detect subspaces where distributional changes occur, Liu et al. [32] computed a statistic between the subspace of two samples, called the local drift degree, which follows the normal distribution when both samples follow the same distribution. More recently, Maia Polo et al. [33] proposed their own statistic, based on the Kullback–Leibler divergence, for detecting changes in the distribution over the covariate and class variables, as well as conditional distributions between them. This work represents the closest work to ours in that they proposed a way to estimate the KL divergence between trained probabilistic classifiers. However, their approach requires training separate classifiers for each variable subset we wish to estimate the KL divergence over, while in our work we will only use a single trained DM. Furthermore, their approach is limited to just estimating the KL divergence, while we will compute any $\alpha \beta $-divergence between DMs.

3.1 Previous work: computing joint divergence

Despite the lack of previous work on computing the marginal or conditional divergence between probabilistic graphical models, there is previous work done on computing the joint $\alpha \beta $-divergence between two DMs [29]. It is from this previous work that most of our methods will be built off upon.

It has been shown that the complexity of computing the joint $\alpha \beta $-divergence between decomposable models $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ and $\mathbb {Q}_{\mathcal {G}_\mathbb {Q}}$ is generally, in the worst case, exponential to the treewidth of the computation graph $\mathcal {U}(\mathcal {G}_\mathbb {P},\mathcal {G}_\mathbb {Q})$, as defined in Definition 5 [29].

Definition 5

(Computation graph, $\mathcal {U}$) Let $\mathcal {G}_{(1)}$ and $\mathcal {G}_{(2)}$ be two chordal graphs. Then denote $\mathcal {U}(\mathcal {G}_{(1)}, \mathcal {G}_{(2)})$ to be a chordal graph that contains all the vertices and edges in

$\mathcal {G}_{(1)}$ and $\mathcal {G}_{(2)}$. We call such a graph, the computation graph of $\mathcal {G}_{(1)}$ and $\mathcal {G}_{(2)}$.

Specifically, when $\alpha ,\beta =0$, the joint distributions of $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ and $\mathbb {Q}_{\mathcal {G}_\mathbb {Q}}$ immediately decompose the log term in the $\alpha \beta $-divergence for this case, allowing for a complexity that is exponential to the maximum treewidth between the 2 DMs.

On the other hand, when either $\alpha $ or $\beta $ is nonzero, the decomposition of the functional $\mathcal {F}$ from Definition 3, and therefore the $\alpha \beta $-divergence in this case, is not as straightforward. Substituting the joint distribution of $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ and $\mathbb {Q}_{\mathcal {G}_\mathbb {Q}}$ into $\mathcal {F}$ eventually results in the following equation:

$$\begin{aligned} \begin{aligned} \mathcal {F}( {\mathbb {P}},{\mathbb {Q}}\;;\;g, h, g^{*}, h^{*}) =&\sum _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_\mathbb {P})}\sum _{\varvec{x}_{\mathcal {C}}\in \mathcal {X}_{\mathcal {C}}} L\left( g^{*}\left[ \mathbb {P}_{\mathcal {C}}^{\mathcal {T}}\right] (\varvec{x}_{\mathcal {C}})\right) \textit{SP}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) \\&+\sum _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_\mathbb {Q})}\sum _{\varvec{x}_{\mathcal {C}}\in \mathcal {X}_{\mathcal {C}}} L\left( h^{*}\left[ \mathbb {Q}_{\mathcal {C}}^{\mathcal {T}}\right] (\varvec{x}_{\mathcal {C}})\right) \textit{SP}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) \end{aligned} \end{aligned}$$

(12)

where

$$\begin{aligned}&\textit{SP}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) =\sum _{\varvec{x}\in \mathcal {X}_{X-\mathcal {C}}}\left[ \prod _{\mathcal {C}\in \varvec{\mathcal {C}}_{\mathbb {P}}} g\left[ \mathbb {P}_{\mathcal {C}}^{\mathcal {T}}\right] (\varvec{x}_{\mathcal {C}},\varvec{x}) \right] \left[ \prod _{\mathcal {C}\in \varvec{\mathcal {C}}_{\mathbb {Q}}} h\left[ \mathbb {Q}_{\mathcal {C}}^{\mathcal {T}}\right] (\varvec{x}_{\mathcal {C}},\varvec{x}) \right] \end{aligned}$$

(13)

Therefore, computing $\mathcal {F}$ between two decomposable models relies on the ability to obtain the sum-product $\textit{SP}_{\mathcal {C}}$, from Eq. (13), for all maximal cliques, $\mathcal {C}$, in the computation graph $\mathcal {U}(\mathbb {P}_{\mathcal {G}_\mathbb {P}}, \mathbb {Q}_{\mathcal {G}_\mathbb {Q}})$. These sum-products can then be further marginalized to obtain sum-products over maximal cliques of $\mathcal {G}_{\mathbb {P}}$ and $\mathcal {G}_{\mathbb {Q}}$ in Eq. (12). Specifically, Lee et al. [29] obtained these sum-products using belief propagation on the junction tree of $\mathcal {U}(\mathbb {P}_{\mathcal {G}_\mathbb {P}},\mathbb {Q}_{\mathcal {G}_\mathbb {Q}})$, with factors $\{g[\mathbb {P}_{\mathcal {C}}^{\mathcal {T}}] \mid \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{\mathbb {P}})\} \cup \{h[\mathbb {Q}_{\mathcal {C}}^{\mathcal {T}}] \mid \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{\mathbb {Q}})\}$.

4 Proposed methodology

4.1 Multi-graph aggregated sum-products (MGASPs)

For the purposes of developing the methods in this paper, we will generalize the problem of obtaining $\textit{SP}_{\mathcal {C}}$ for all $\mathcal {C}$ in $\varvec{\mathcal {C}}(\mathcal {H})$ as a problem of obtaining the sum-products over two sets of factors, $\Phi _{(1)}$ and $\Phi _{(2)}$, defined over the maximal cliques over two chordal graphs, $\mathcal {G}_{(1)}$ and $\mathcal {G}_{(2)}$, respectively.

$$\begin{aligned} \forall \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {H}): \textit{SP}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) = \sum _{\varvec{x}\in \mathcal {X}_{\varvec{X}- \mathcal {C}}} \prod _{\phi \in \Phi _{(1)}\cup \Phi _{(2)}} \phi \left( \varvec{x}_{\mathcal {C}},\varvec{x}\right) \end{aligned}$$

(14)

We can then further abstract the problem by observing that $\textit{SP}_{\mathcal {C}}$ can be seen as a sum-product over the factors of a new chordal graph constructed by merging the chordal MNs $\mathcal {M}_{(1)}=(\mathcal {G}_{(1)}, \Phi _{(1)})$ and $\mathcal {M}_{(2)}=(\mathcal {G}_{(2)}, \Phi _{(2)})$ by taking their product according to Definition 6.

Definition 6

(Product of Markov networks) Assume we have two MNs $\mathcal {M}_{(1)}=(\mathcal {G}_{(1)}, \Phi _{(1)})$ and $\mathcal {M}_{(2)}=(\mathcal {G}_{(2)}, \Phi _{(2)})$. Then, taking the product of $\mathcal {M}_{(1)}$ and $\mathcal {M}_{(2)}$, $\mathcal {M}_{(1)}\circ \mathcal {M}_{(2)}$, involves:

1.
first obtaining the chordal graph $\mathcal {H}=\mathcal {U}(\mathcal {G}_{(1)},\mathcal {G}_{(2)})$,
2.
then creating the factor set $\Psi =\Phi _{(1)}\cup \Phi _{(2)}$,

resulting in a new chordal MN, $\mathcal {M}_{\mathcal {H}}=(\mathcal {H},\Psi )$.

Once $\mathcal {M}_{\mathcal {H}}$ is obtained, like Lee et al. [29] we can then obtain $\forall \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {H}) \,: \, \textit{SP}_{\mathcal {C}}$ using belief propagation on the clique tree of $\mathcal {M}_{\mathcal {H}}$ with initial factors $\Psi $. However, in order to aid future developments, we shall further extend this result to obtain the sum-products over m different factor sets, and not just two.

Lemma 1

(MGASPs) Let $\mathcal {M}_{(1)}=(\mathcal {G}_{(1)}, \Phi _{(1)})$, $\ldots $, $\mathcal {M}_{(m)}=(\mathcal {G}_{(m)}, \Phi _{(m)})$ be m MNs, possibly defined over different sets of variables. Then, using the notion of products between MNs from Definition 6, we can obtain the following sum-products:

$$\begin{aligned} \forall \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {H}) \,:\, \textit{SP}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) = \sum _{\varvec{x}\in \mathcal {X}_{\varvec{X}- \mathcal {C}}} \prod _{\phi \in \bigcup _{i=1}^{m}\Phi _{(i)}} \phi \left( \varvec{x}_{\mathcal {C}},\varvec{x}\right) \end{aligned}$$

(15)

by carrying out belief propagation on the junction tree of $\mathcal {U}\bigl (\mathcal {G}_{(1)},\ldots ,\mathcal {G}_{(m)}\bigr )$ using the set of initial factors $\bigcup _{i=1}^{m}\Phi _{(i)}$.

Proof

Using Definition 6, we can first take the product of the m given MNs to obtain the chordal MN $\mathcal {M}_{\mathcal {H}}=(\mathcal {H},\Psi )$: $\mathcal {M}_{\mathcal {H}}=\mathcal {M}_{(1)}\circ \ldots \circ \mathcal {M}_{(m)}$ and $\Psi = \bigcup _{i=1}^{m}\Phi _{(i)}$. Then, recall that carrying out belief propagation on the junction tree of $\mathcal {M}_{\mathcal {H}}$ will result in the following beliefs over the maximal cliques of $\mathcal {H}$ [25, Corollary 10.2]:

$$\begin{aligned} \forall \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {H}) \,: \, \beta _{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) = \sum _{\varvec{x}\in \mathcal {X}_{\varvec{X}- \mathcal {C}}} \prod _{\psi \in \Psi } \psi (\varvec{x},\varvec{x}_{\mathcal {C}}) \end{aligned}$$

(16)

which is directly equivalent to the sum-products in Eq. (15) since $\Psi = \bigcup _{i=1}^{m}\Phi _{(i)}$. $\square $

4.2 Marginal divergence

The marginal distribution over the set of variables $\varvec{Z}$ for a given decomposable model $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ can be obtained by summing over the domain of the variables not in $\varvec{Z}$, ${\bar{\varvec{Z}}}=\varvec{X}\setminus \varvec{Z}$.

$$\begin{aligned} \mathbb {P}_{\varvec{Z}}(\varvec{z}) = \sum _{{\bar{\varvec{z}}}\in {\bar{\mathcal {Z}}}} \mathbb {P}_{\varvec{x}}({\bar{\varvec{z}}},\varvec{z}) = \sum _{{\bar{\varvec{z}}}\in {\bar{\mathcal {Z}}}} \prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_{\mathbb {P}})} \mathbb {P}^\mathcal {T}_{\mathcal {C}}({\bar{\varvec{z}}},\varvec{z}) \end{aligned}$$

(17)

However, in order to decompose the $\alpha \beta $-divergence between marginal distributions of two DMs, we need to find a factorization of the sum over ${\bar{\mathcal {Z}}}$ in these marginal distributions.

4.2.1 Factorizing the marginal distribution of a decomposable model (DM)

Factorizing the marginal distribution in Eq. (17) will require grouping together the CPTs in Eq. (17), and therefore maximal cliques of $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$, that share any variables in the set of variables to sum out, ${\bar{\varvec{Z}}}$. We shall call such a grouping of maximal cliques in $\mathcal {G}_\mathbb {P}$, an $ \varvec{\mathcal {N}} $-partition of $\mathcal {G}_\mathbb {P}$. See Sect. 4.2.1 for an example of such an $ \varvec{\mathcal {N}} $-partition (Fig. 2).

Definition 7

($ \varvec{\mathcal {N}} $-partition) For any chordal graph $\mathcal {G}_\mathbb {P}$ and marginal variables $\varvec{Z}$, with ${\bar{\varvec{Z}}}=\varvec{X}{\setminus }\varvec{Z}$, an $ \varvec{\mathcal {N}} $-partition of $\mathcal {G}_\mathbb {P}$, $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$, is a set of vertex sets, $\forall N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}: N\subset V(\mathcal {G}_\mathbb {P})$, with the following properties:

1.
$\{\varvec{\mathcal {C}}(\mathcal {G}_\mathbb {P}(N)) \mid N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}\}$ is a partition of the maximal cliques of $\mathcal {G}_\mathbb {P}$,
2.
$\{{\bar{\varvec{Z}}} \cap \varvec{X}_{N} \mid N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}, {\bar{\varvec{Z}}} \cap \varvec{X}_{N} \ne \emptyset \}$ is a partition of ${\bar{\varvec{Z}}}$,

One way to find an $ \varvec{\mathcal {N}} $-partition of $\mathcal {G}_\mathbb {P}$ is to group maximal cliques of $\mathcal {G}_\mathbb {P}$ that share any variables that are also in ${\bar{\varvec{Z}}}$.

Corollary 1

(Decomposition of Marginal Distributions) Using the $ \varvec{\mathcal {N}} $-partition from Definition 7, the marginal distribution over variables $\varvec{Z}$ of a decomposable model $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ can be factorized over the variables sets in $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$:

$$\begin{aligned} \mathbb {P}_{\varvec{Z}}(\varvec{z}) = \prod _{N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}} \sum _{{\bar{\varvec{z}}}\in {\bar{\mathcal {Z}}}_{N}} \prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_\mathbb {P}(N))} \mathbb {P}^\mathcal {T}_{\mathcal {C}}(\varvec{z}_{\mathcal {C}},{\bar{\varvec{z}}}_{\mathcal {C}}) =\prod _{N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}} \varphi _{N\cap \varvec{Z}}[\mathbb {P}](\varvec{z}_{N}) \end{aligned}$$

(18)

where for ease of notation, we use $\varphi _{ \varvec{\mathcal {N}} \cap {\varvec{Z}}}$ as shorthand for marginalizing factors in $N$ to remove any variables in ${\bar{\varvec{Z}}}$. See Definition 8 for a full definition of $\varphi _{ \varvec{\mathcal {N}} \cap {\varvec{Z}}}$.

Definition 8

($(N\cap \varvec{Z})$-Marginalized Factor, $\varphi _{N\cap \varvec{Z}}$) Through a slight abuse of notation, let $N\cap \varvec{Z}:= N\cap V(\varvec{Z})$.

Then we define the $(N\cap \varvec{Z})$-marginalized factor as such:

$$\begin{aligned} \varphi _{N\cap \varvec{Z}}[\mathbb {P}](\varvec{z}_{N}):= \sum _{{\bar{\varvec{z}}}\in {\bar{\mathcal {Z}}}_{N}} \prod _{\mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {G}_\mathbb {P}(N))} \mathbb {P}^\mathcal {T}_{\mathcal {C}}(\varvec{z}_{\mathcal {C}},{\bar{\varvec{z}}}_{\mathcal {C}}) \end{aligned}$$

(19)

Note that it is possible for $\varphi _{N\cap \varvec{Z}}$ to be defined over the empty set when $N\cap \varvec{Z}=\{\}\iff \varvec{X}_{N}\subseteq {\bar{\varvec{Z}}}$. In such situations, $\varphi _{N\cap \varvec{Z}}$ is just a scalar factor, i.e., a number in $\mathbb {R}$. For ease of exposition, we shall also define a function $\varvec{\varphi }$ that returns a set of $(N\cap \varvec{Z})$-marginalized factors given an $ \varvec{\mathcal {N}} $-partition $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$:

$$\begin{aligned} \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}):= \Bigl \{ \varphi _{N\cap \varvec{Z}}[\mathbb {P}] \Bigm | N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}} \Bigr \} \end{aligned}$$

(20)

To summarize, we finally now have factorization of $\mathbb {P}_{\varvec{Z}}$ in terms of a product of factors defined over each set in the partition $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$.

$$\begin{aligned} \mathbb {P}_{\varvec{Z}}(\varvec{z})&=\prod _{N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}} \varphi _{N\cap \varvec{Z}}[\mathbb {P}](\varvec{z}_{N}) \end{aligned}$$

(21)

These set of factors imply an new graph structure where an edge exists between any two vertices in $\varvec{Z}$ if there is any set in $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$ that contains both vertices together. We will call these new graphs based on the sets in an $ \varvec{\mathcal {N}} $-partition, $ \varvec{\mathcal {N}} $-graphs. See Sect. 4.2.1 for an example of an $ \varvec{\mathcal {N}} $-partition and its respective $ \varvec{\mathcal {N}} $-graph.

Definition 9

(Sets-to-Cliques graph constuctor, $\kappa ( \varvec{\mathcal {N}} )$) Let $ \varvec{\mathcal {N}} $ be a set of vertex-sets. Then $\kappa $ is a graph constructor that creates a graph $\mathcal {K}$ with the following vertices and edges:

$$\begin{aligned} V(\mathcal {K})= & {} \bigcup _{N\in \varvec{\mathcal {N}} } N\end{aligned}$$

(22)

$$\begin{aligned} E(\mathcal {K})= & {} \bigl \{ (u,v) \bigm | N\in \varvec{\mathcal {N}} , (v, u) \in N, v \ne u \bigr \} \end{aligned}$$

(23)

Therefore, the graph constructor $\kappa ( \varvec{\mathcal {N}} )$ ensures that each vertex set in $ \varvec{\mathcal {N}} $ is also as clique in the resulting graph $\mathcal {K}$.

Definition 10

($ \varvec{\mathcal {N}} $-graphs, $\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$) Let $\mathcal {G}_\mathbb {P}$ be a chordal graph with an $ \varvec{\mathcal {N}} $-partition $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$. Then $\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$ is a function that will take the graph constructed by $\kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$ from Definition 9 and remove any vertex or edges associated with any variables in ${\bar{\varvec{Z}}}$.

Theorem 1

(The $ \varvec{\mathcal {N}} $-graph created by $\Gamma $ is a chordal graph) Let $\mathcal {G}_\mathbb {P}$ be a chordal graph and $\varvec{Z}\subset \varvec{X}$. Then the $ \varvec{\mathcal {N}} $-graph, $\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$, is a chordal graph.

Proof

Constructing $\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$ involves two steps: (1) merging maximal cliques in the same partition via $\kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$, and (2) removing any vertices and edges associated with ${\bar{\varvec{Z}}}$. We will show the resulting graph after both steps are chordal.

Only adjacent maximal cliques in the clique tree of $\mathcal {G}_\mathbb {P}$ can be in the same partition in $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$. Merging adjacent maximal cliques in a clique tree will result in yet another clique tree, implying the graph $\kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$ is also chordal.
The deletion of all vertices and edges associated with ${\bar{\varvec{Z}}}$ is equivalent to the induced subgraph over the vertices $V(\varvec{Z})$. But any induced subgraph of a chordal graph is also a chordal graph [7].

Therefore, $\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$ is a chordal graph. $\square $

In conclusion, the factorization of $\mathbb {P}_{\varvec{Z}}$ essentially creates a new chordal Markov network consisting of the graph $\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$ and factors $\varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})$, $\mathcal {M}_{\mathbb {P},\varvec{Z}}=\bigl (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}),\varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})\bigr )$.

4.2.2 Computing the marginal $\alpha \beta $-divergence between DMs

With this factorization of the marginal distribution of a DM, and even a graphical representation of these factorizations, the problem of computing the $\alpha \beta $-divergence between the marginal distributions of two DMs becomes surprisingly straightforward as a direct equivalence can made to the problem of computing the joint divergence between two DMs.

Recall from Sect. 3.1 that when Lee et al. [29] tackled the problem of computing the joint divergence between $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ and $\mathbb {Q}_{\mathcal {G}_\mathbb {Q}}$, they represented the joint distribution of a DM as a set of CPTs defined over the maximal cliques of a chordal graph. These CPTs are fundamentally just ordinary factors over their respective variables, and Lee et al. [29] did not rely on any special property of these factors to compute the $\alpha \beta $-divergence between these representations of DMs.

Thanks to the factorization from Sect. 4.2.1, we have a similar problem setup to computing the joint divergence. Specifically, problem of computing the $\alpha \beta $-divergence between the marginal distributions $\mathbb {P}_{\varvec{Z}}$ and $\mathbb {Q}_{\varvec{Z}}$, of DMs $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ and $\mathbb {Q}_{\mathcal {G}_\mathbb {Q}}$, respectively, becomes a problem of computing the $\alpha \beta $-divergence between the chordal MNs: $\mathcal {M}_{\mathbb {P},\varvec{Z}}=\bigl (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}),\varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})\bigr )$ and $\mathcal {M}_{\mathbb {Q},\varvec{Z}}=\bigl (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}),\varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}})\bigr )$.

Therefore, the complexity of computing the $\alpha \beta $-divergence between factorizations of $\mathbb {P}_{\varvec{Z}}$ and $\mathbb {Q}_{\varvec{Z}}$ is just the complexity of the approach in Lee et al. [29]. However, that does not include the complexity of obtaining the factorization of $\mathbb {P}_{\varvec{Z}}$ and $\mathbb {Q}_{\varvec{Z}}$.

Theorem 2

The worst case complexity of obtaining the factorization from Sect. 4.2.1 of $\mathbb {P}_{\varvec{Z}}$ for some DM $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ is:

$$\begin{aligned} \mathcal {O}\left( |\varvec{X}| \cdot 2^{|N^{*}|}\right) \end{aligned}$$

(24)

where $ N^{*}:= \mathop {\mathrm{arg\, max}}\nolimits _{N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}} |N|$.

Proof

For any $N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$, obtaining the marginal factor $\varphi _{N\cap \varvec{Z}}[\mathbb {P}]$ requires us to essentially iterate through each value $\varvec{x}\in \mathcal {X}_{N}$ in the worst case, resulting in the complexity of $2^{|N|}$. This complexity is bounded by the complexity for the largest partition in $ \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$, $N^{*}:=\mathop {\mathrm{arg\, max}}\nolimits _{N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}} \left| N\right| $. Since we need to obtain $\varphi _{N\cap \varvec{Z}}[\mathbb {P}]$ for all $N\in \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}$, and $| \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}|$ in bounded by $|\varvec{X}|$, the final complexity is $\mathcal {O}\left( |\varvec{X}| \cdot 2^{|N^{*}|}\right) $. $\square $

Therefore, the complexity of computing the divergence between the marginal distribution over $\varvec{Z}$ of two decomposable models, $D_{AB}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Z}}||\mathbb {Q}_{\varvec{Z}})$, is:

$$\begin{aligned}&D_{AB}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Z}}||\mathbb {Q}_{\varvec{Z}}) \in \mathcal {O}\left( \left| \varvec{X}\right| \cdot 2^{|N^{*}|}\right) \cdot {\left\{ \begin{array}{ll} \mathcal {O}(|\varvec{X}|^{2} \cdot \omega _\text {max}2^{\omega _\text {max}+1}) &{} \alpha ,\beta =0\\ \mathcal {O}(|\varvec{X}|\cdot 2^{\omega (\mathcal {H})+1}) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

(25)

where $\omega _\text {max}= \max (\omega (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})), \omega (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})))$ and $\mathcal {H}=\mathcal {U}(\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}),\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}))$.

4.3 Conditional divergence

We will now show that having a factorization of any marginal distribution for a DM allows us to compute the conditional $\alpha \beta $-divergence between two DMs. Recall from Definition 4 the definition of the conditional $\alpha \beta $-divergence we will use, where $\varvec{Z}\cup \varvec{Y}\subseteq \varvec{X}$ and $\varvec{Z}\cap \varvec{Y}=\emptyset $, is:

$$\begin{aligned} D_{\textit{AB}}^{(\alpha ,\beta )}( \mathbb {P}_{\varvec{Y}\mid \varvec{Z}} \mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}) =\sum _{\varvec{z}\in \mathcal {Z}} \mathbb {P}_{\varvec{Z}}(\varvec{z}) D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{z}}) \end{aligned}$$

(10)

where

$$\begin{aligned} \begin{aligned} D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{z}})&= {\left\{ \begin{array}{ll} \sum _{\varvec{y}\in \mathcal {Y}} \frac{1}{2}\left( \log \mathbb {P}_{\varvec{Y}\mid \varvec{z}}(\varvec{y}) - \log \mathbb {Q}_{\varvec{Y}\mid \varvec{z}}(\varvec{y})\right) ^{2} &{} \alpha ,\beta =0\\ \sum _{i} c_{i} \mathcal {F}(\mathbb {P}_{\varvec{Y}\mid \varvec{z}}, \mathbb {Q}_{\varvec{Y}\mid \varvec{z}}; g_{i},h_{i},g^{*}_{i},h^{*}_{i}) &{}\text {else} \end{array}\right. } \end{aligned} \end{aligned}$$

(26)

To assist in future developments, we shall define a notion of taking the quotient between two MNs.

Definition 11

(Quotient of Markov networks) Assume we are given 2 MNs, $\mathcal {M}_{\varvec{X}}=(\mathcal {G}_{\varvec{X}}, \Phi _{\varvec{X}})$ and $\mathcal {M}_{\varvec{Z}}=(\mathcal {G}_{\varvec{Z}}, \Phi _{\varvec{Z}})$. Then the quotient $\mathcal {M}_{\varvec{X}}/\mathcal {M}_{\varvec{Z}}$ involves:

1.
first finding chordal graph $\mathcal {H}=\mathcal {U}(\mathcal {G}_{\varvec{X}},\mathcal {G}_{\varvec{Z}})$,
2.
then creating the factor set: $\Psi = \Phi _{\varvec{X}} \cup \bigl \{1/\phi \bigm | \phi \in \Phi _{\varvec{Z}} \bigr \} $

This results in the chordal MN $\mathcal {M}=(\mathcal {H},\Psi )$.

Lemma 2

(Sum-Quotient) Let $\mathcal {M}_{\varvec{X}}=(\mathcal {G}_{\varvec{X}}, \Phi _{\varvec{X}})$ and $\mathcal {M}_{\varvec{Z}}=(\mathcal {G}_{\varvec{Z}}, \Phi _{\varvec{Z}})$ be MNs where $\varvec{Z}\subset \varvec{X}$ and $\mathcal {M}_{\varvec{Z}}$ is a marginalization of $\mathcal {M}_{\varvec{X}}$. Then the sum-quotients:

$$\begin{aligned} \forall \mathcal {C}\in \varvec{\mathcal {C}}(\mathcal {H}) \,:\, \textit{SP}_{\mathcal {C}}(\varvec{x}_{\mathcal {C}}) = \sum _{\varvec{x}\in \mathcal {X}_{\varvec{X}- \mathcal {C}}} \frac{\prod _{\phi \in \Phi _{\varvec{X}}} \phi \left( \varvec{x}_{\mathcal {C}},\varvec{x}\right) }{\prod _{\phi \in \Phi _{\varvec{Z}}} \phi \left( \varvec{z}_{\mathcal {C}},\varvec{z}\right) } \end{aligned}$$

(27)

can be obtained using belief propagation over the clique tree of the MN resulting from the quotient $\mathcal {M}_{\varvec{X}}/\mathcal {M}_{\varvec{Z}}$.

Proof

The proof follows the exact same argument as the proof for Lemma 1. However, since quotients are involved, we need to ensure the quotient in Eq. (27) is always defined, assuming $0/0=0$ [25, Definition 10.7]. We know that any sum, and therefore marginalization, over a product of factors that results in zero, implies that the original product of factors is zero as well. Therefore, since $\mathcal {M}_{\varvec{Z}}$ is a marginalization of $\mathcal {M}_{\varvec{X}}$, whenever the denominator in Eq. (27) is 0, its numerator is as well. $\square $

Corollary 2

(Decomposition of Conditional Distribution) Let $\varvec{Z},\varvec{Y}\subset \varvec{X}$; $\varvec{Z}\cap \varvec{Y}=\emptyset $; and $\varvec{W}=\varvec{Y}\cup \varvec{Z}$. Then the conditional distribution of a DM, $\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}$, can be expressed as the quotient $\mathbb {P}_{\varvec{W}}/\mathbb {P}_{\varvec{Z}}$. Both $\mathbb {P}_{\varvec{W}}$ and $\mathbb {P}_{\varvec{Z}}$ can be expressed as the chordal MNs $\mathcal {M}_{\varvec{W}}=(\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}), \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}))$ and $\mathcal {M}_{\varvec{Z}}=(\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}), \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}))$. The quotient $\mathcal {M}_{\varvec{W}}/\mathcal {M}_{\varvec{Z}}$ then results in the MN $\mathcal {M}_{\varvec{Y}\mid \varvec{Z}}^{(\mathbb {P})}=(\mathcal {H}, \varphi ^{c}( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}))$ where

$$\begin{aligned} \mathcal {H}&= \mathcal {U}(\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}),\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})) = \kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}) \end{aligned}$$

(28)

$$\begin{aligned} \varphi ^{c}( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})&:= \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}) \cup \bigl \{ 1/\varphi \bigm | \varphi \in \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}) \bigr \} \end{aligned}$$

(29)

and the product of the factors in $\mathcal {M}_{\varvec{Y}\mid \varvec{Z}}^{(\mathbb {P})}$ is a decomposition of the conditional distribution $\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}$.

With Corollary 2 we now have all the tools needed to tackle computing $D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})$.

Theorem 3

Let $\varvec{Z},\varvec{Y}\subset \varvec{X}$; $\varvec{Z}\cap \varvec{Y}=\emptyset $; and $\varvec{W}=\varvec{Y}\cup \varvec{Z}$. The complexity of computing the conditional $\alpha \beta $-divergence between two DMs, $\mathbb {P}_{\mathcal {G}_\mathbb {P}}$ and $\mathbb {Q}_{\mathcal {G}_\mathbb {Q}}$, $D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})$, when either $\alpha $ or $\beta $ is 0 is:

$$\begin{aligned} D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})\in \mathcal {O}\left( \left| \varvec{X}\right| \cdot 2^{\omega (\mathcal {H}) + 1}\right) \end{aligned}$$

(30)

where $\mathcal {H}=\mathcal {U}(\kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}), \kappa ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}))$.

Proof

Using the product-quotient of MNs, we can re-express the factors in the log-term of the conditional $\alpha \beta $-divergence:

$$\begin{aligned}&D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})\\&\quad =\sum _{\varvec{z}\in \mathcal {Z}} \mathbb {P}_{\varvec{Z}}(\varvec{z}) \sum _{\varvec{y}\in \mathcal {Y}} g[\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}](\varvec{y}\mid \varvec{z}) h[\mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}](\varvec{y}\mid \varvec{z}) \log \bigl ( g^{*}[\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}](\varvec{y}\mid \varvec{z}) h^{*}[\mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}](\varvec{y}\mid \varvec{z}) \bigr ) \end{aligned}$$

as factors of the MN, $\mathcal {M}_{\log }=(\mathcal {H}_{\log }, \Psi _{\log })$, where

$$\begin{aligned} \mathcal {H}_{\log }= & {} \mathcal {U}(\kappa (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}), \kappa (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}))\\ \Psi _{\log }= & {} \bigl \{g^{*}[\varphi ] \bigm | \varphi \in \varvec{\varphi }^{c}( \varvec{\mathcal {N}} _{\mathbb {P}, \varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {P}, \varvec{Z}}) \bigr \} \cup \bigl \{h^{*}[\varphi ] \bigm | \varphi \in \varvec{\varphi }^{c}( \varvec{\mathcal {N}} _{\mathbb {Q}, \varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {Q}, \varvec{Z}}) \bigr \}. \end{aligned}$$

Therefore, the conditional $\alpha \beta $-divergence in this case can be expressed as such:

$$\begin{aligned} D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})&=\sum _{\psi \in \Psi } \sum _{\varvec{w}_{L}\in \mathcal {W}_{L}} \log \psi (\varvec{w}_{L}) \textit{SPQ}_{L}(\varvec{w}_{L}) \end{aligned}$$

where $L:=V(\psi )$ and

$$\begin{aligned} \textit{SPQ}_{L}(\varvec{w}_{L}) = \sum _{\varvec{w}\in \mathcal {W}_{\varvec{W}- L}} \mathbb {P}_{\varvec{Z}}(\varvec{z}) {g[\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}](\varvec{y},\varvec{z})} {h[\mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}](\varvec{y},\varvec{z})}. \end{aligned}$$

$\textit{SPQ}_{L}$ is then a sum-product-quotient that can be obtained by carrying out belief propagation on the clique tree of the following graph $\mathcal {H}$:

$$\begin{aligned} \mathcal {H}=\mathcal {U}(\kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}), \kappa ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}), \Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})) = \mathcal {U}(\kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}), \kappa ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}})) \end{aligned}$$

with factors

$$\begin{aligned} \Psi = \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}) \cup \bigl \{ g[\varphi ] \bigm | \varphi \in \varvec{\varphi }^{c}( \varvec{\mathcal {N}} _{\mathbb {P}, \varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {P}, \varvec{Z}}) \bigr \} \cup \bigl \{ h[\varphi ] \bigm | \varphi \in \varvec{\varphi }^{c}( \varvec{\mathcal {N}} _{\mathbb {Q}, \varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {Q}, \varvec{Z}}) \bigr \} \end{aligned}$$

which has the complexity of $\mathcal {O}(2^{\omega (\mathcal {H})+1})$.

Since we need to carry out this process for each factor in $\Psi _{\log }$, and $|\Psi _{\log }|\in \mathcal {O}(|\varvec{X}|)$, then the final complexity of computing $D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})$ when either $\alpha $ or $\beta $ is 0 is $\mathcal {O}(|\varvec{X}|\cdot 2^{\omega (\mathcal {H})+1})$. $\square $

Theorem 4

Let $\varvec{Y}\subset \varvec{X}$; $\varvec{Z}\cap \varvec{Y}=\emptyset $; and $\varvec{W}=\varvec{Y}\cup \varvec{Z}$. The complexity of $D_{\textit{AB}}^{(\alpha ,\beta )}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})$ when $\alpha ,\beta =0$ is:

$$\begin{aligned} D_{\textit{AB}}^{(0,0)}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})\in \mathcal {O}\left( \left| \varvec{X}\right| ^{2} \cdot 2^{\omega (\mathcal {H}) + 1}\right) \end{aligned}$$

(31)

where $\mathcal {H}=\mathcal {U}( \kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}),\kappa ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}))$.

Proof

Recall the conditional $\alpha \beta $-divergence when $\alpha ,\beta =0$:

$$\begin{aligned}&D_{\textit{AB}}^{(0,0)}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})=\sum _{\varvec{z}\in \mathcal {Z}} \mathbb {P}_{\varvec{Z}}(\varvec{z}) \sum _{\varvec{y}\in \mathcal {Y}} \frac{1}{2} \left( \log \mathbb {P}_{\varvec{Y}\mid \varvec{Z}}(\varvec{y}\mid \varvec{z}) - \log \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}(\varvec{y}\mid \varvec{z})\right) ^{2} \end{aligned}$$

We can then express the factors resulting from the division between the two conditional distributions as the factors of the MN $\mathcal {M}_{\log }=(\mathcal {G}_{},\Phi )$ where

$$\begin{aligned} \mathcal {G}_{}= & {} \mathcal {U}( \kappa ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}),\kappa ( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}))\\ \Phi= & {} \varvec{\varphi }^{c}( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}) \cup \{ 1 / \varphi \mid \varphi \in \varvec{\varphi }^{c}( \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{W}}, \varvec{\mathcal {N}} _{\mathbb {Q},\varvec{Z}}) \} \end{aligned}$$

Therefore:

$$\begin{aligned} D_{\textit{AB}}^{(0,0)}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}})&=\sum _{\varvec{z}\in \mathcal {Z}} \frac{\mathbb {P}_{\varvec{Z}}(\varvec{z})}{2} \sum _{\varvec{y}\in \mathcal {Y}} \Biggl (\sum _{\phi \in \Phi } \log \phi (\varvec{y},\varvec{z}) \Biggr )^{2}\\&=\frac{1}{2}\sum _{\phi ^{+}\in \Phi } \sum _{\phi ^{*}\in \Phi }\sum _{\varvec{z}\in \mathcal {Z}}\sum _{\varvec{y}\in \mathcal {Y}} \mathbb {P}_{\varvec{Z}}(\varvec{z}) \log \phi ^{+}(\varvec{y},\varvec{z}) \log \phi ^{*}(\varvec{y},\varvec{z}) \end{aligned}$$

The sum-product over $\mathcal {Z}$ and $\mathcal {Y}$, for each $(\phi ^{+},\phi ^{*})\in \Phi \times \Phi $, can then be expressed as a sum-product over the factors of the MN $\mathcal {M}=(\mathcal {H},\Psi )$, where

$$\begin{aligned} \mathcal {H}= & {} \mathcal {U}(\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}), \mathcal {G}_{}(V(\phi ^{+})), \mathcal {G}_{}(V(\phi ^{*})))\\ \Psi= & {} \{\log \phi ^{*}, \log \phi ^{+}\} \cup \varvec{\varphi }( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}) \end{aligned}$$

In order to find the complexity of computing this sum-product, we need to find a bound on the treewidth of $\mathcal {H}$. We know that the treewidth of the induced subgraphs $\mathcal {H}(V(\phi ^{*}))$ and $\mathcal {H}(V(\phi ^{+}))$ is bounded by the treewidth of $\mathcal {H}$. Furthermore, since $E(\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}}))\subseteq E(\mathcal {H})$, then $\omega (\Gamma ( \varvec{\mathcal {N}} _{\mathbb {P},\varvec{Z}})) \le \omega (\mathcal {H})$. Therefore the complexity of computing the sum-product over $\Psi $ is exponential with respect to $\omega (\mathcal {H})$.

Since we have to obtain this sum-product for each $(\phi ^{+},\phi ^{*})\in \Phi \times \Phi $, and $\left| \Phi \times \Phi \right| =\left| \Phi \right| ^{2}\le \left| \varvec{X}\right| ^{2}$, the final complexity is: $\mathcal {O}(D_{\textit{AB}}^{(0,0)}(\mathbb {P}_{\varvec{Y}\mid \varvec{Z}}\mid \mid \mathbb {Q}_{\varvec{Y}\mid \varvec{Z}}))\in \mathcal {O}(\left| \varvec{X}\right| ^{2}2^{\omega (\mathcal {H})+1})$ $\square $

Therefore, the complexity of computing the conditional $\alpha \beta $-divergence between two DMs in the worst case is just the complexity when both $\alpha $ and $\beta $ are 0, as expressed in Theorem 4.

5 Experimental results

5.1 Experiments with QMNIST

We demonstrate the basic functionality of our method by analyzing distributional changes within image data. Since image data are interpretable by most humans, this will allow us to verify any findings from the analysis of our method.

We consider the QMNIST dataset [50]. It was proposed as a reproducible recreation of the original MNIST dataset [9, 28], using data from the NIST Handprinted Forms and Characters Database, also known as NIST Special Database 19 [16].

An interesting fact about NIST Special Database 19 is that it consists of handwriting from both NIST employees and high school students. In QMNIST, images from NIST employees have the value 0 and 1 for the variable hsf, while images from students have a value of 4. In Fig. 3, by taking the mean over all the digits written by each group, we can observe that the handwriting of NIST employees tends to be more slanted compared to the handwriting of the students. The question now is, can our method pick up on these differences as well?

In order to test this, we need to learn 10 pairs of DMs, one for each digit—i.e., 0, 1, etc.—with each pair containing the DMs learned over the 2 writer groups—NIST employees and students. Since each image has a resolution of $28\times 28$ pixels, we have 784 pixels, and therefore 784 variables, to learn our DMs over. We do this by first learning the chordal graph structure for each DMs using any off the shelf structure learner. In this instance, we will use the Java library Chordalysis [39,40,41], specifically from which we employ the Stepwise Multiple Testing learner [48] with a p-value threshold of 0.05. Once we have the structure for each DMs, we learn their parameters using pgmpy [3], specifically its BayesianEstimator parameter estimator with Dirichlet priors and a pseudocount of 1. Once done, we now have 2 DMs for each type of digit.

By marginalizing the learned DMs over each of the $28\times 28$ pixels, and computing the marginal Hellinger distance between the marginalized DMs in each pair, we can immediately observe from Fig. 4 that the digit with the most pronounced difference is 1. Between the two writer groups, the top and bottom half of the digit 1 changes drastically, while the middle of the digit remains relatively unchanged. This corresponds to the difference between how a regular and slanted 1 is written.

We can also observe some clear changes to the top right and bottom half of digits 4 and 9, with these changes looking similar in both digits. This corresponds to the observation from Fig. 3 that both digits have a similar downward stroke on the right, which changes similarly between the two writer groups. Other examples of changes observable in Fig. 3 that are also highlighted by our method in Fig. 4 are: the digit 0 changing relatively more at the top and bottom left, $\{3,2,8\}$ mostly changing mostly in the top right and bottom left, and 5 changing mostly at the top and bottom—but not as much in the middle.

However, our method seems to have picked up on more subtle changes that a brief visual inspection of Fig. 3 could have missed. For instance, although in Fig. 3 it might look like digit 7 differs the most in the downward stroke at its bottom half, Fig. 4 indicates that the area with the greatest difference for 7 is in its top right instead.

5.2 Studying distributional changes within the cover type dataset

In the previous section, we only used the methods introduced in this paper to verify that the observations we made ourselves in the QMINST dataset [50] regarding the differences between the handwriting of NIST employees and high school students is reflected in changes between the distributions of these two groups. However, where the introduced divergence computation methods shine is when we wish to find and better understand distributional changes in datasets that are not as readily interpretable by humans as the QMNIST dataset. Specifically, for our second set of experiments, we will take a look at the covertype dataset [6] and, using the marginal divergence computation method introduced in this paper, find where within this dataset distributional changes occur and also better understand which variables contribute to these changes in distribution.

Table 1 Variables for each observation in the covertype dataset

Full size table

The covertype dataset contains observations from various wilderness areas located at the Roosevelt National Forest of northern Colorado [6]. Each observation represents a $30\times 30$ m cell in these wilderness areas, with the data of these observations consisting of data from the US Forest Service (USFS) and the US Geological Survey (USGS) regarding these observational cells. Specifically, each observation consists of the variables listed in Table 1.

One variable in the covertype dataset that we are particularly interested in studying is the variable Aspect. Aspect describes the compass direction, or azimuth, the slope in a given observation cell is facing. For example, an azimuth of $0^{\circ }$ implies a slope is facing North, while an azimuth of $90^{\circ }$ implies a Eastern facing slope. Specifically, we are interested in studying how the distribution over the other variables in Table 1 changes as the value of Aspect changes.

The reason we are interested in studying the effect Aspect has on distributional changes is due to Aspect having somewhat complicated and nonobvious effects on variables such as the properties of the slope’s soil [17]. Therefore, having a way to quickly determine at which azimuth of Aspect distributional changes occur and tools to analyze these distributional changes might help with the study of how variables like Aspect effect other variables we might be interested in like the slope’s soil type [34], the nutrients in its soil [20], and even the plant species it has [17].

However, before we continue, it is important to note that the covertype dataset contains observations from 4 different wilderness areas in the Roosevelt National Forest. Each of these wilderness areas does have differing mean values and distributions over variables such as Elevation and the different soil types. More importantly, different wilderness areas are more prominent over different Aspects. For instance, we can observe from Fig. 5a that Wilderness_Area 0 is prominent between azimuths of $45^{\circ }$–$135^{\circ }$, while Wilderness_Area 2 is more prominent from azimuths of $180^{\circ }$–$225^{\circ }$. These changes in the prominence of different wilderness areas over varying values of Aspect can cause any distributional changes we might find between different Aspects to be due to these changes in the prominence of different wilderness areas.

Therefore, for the rest of the analysis in this section, we will only consider observations from a single wilderness area. Specifically, we will use observations form the Cache la Poudre wilderness area. Cache la Poudre is the 4th wilderness area in the covertype dataset, and Fig. 5b plots the number of samples this Cache la Poudre has for different windows of Aspect.

Throughout this section, we will divide Aspect into 72 windows with data over 5 azimuth degrees each. Therefore, the Aspect windows we will use throughout this section are: $[0,4], [5,9], [10,14], \ldots , [355, 359]$. The reason we divide the possible values of Aspect into these 72 windows is to both provide sufficient samples to learn a decomposable model over each of these windows, as well as to smooth out plots over different Aspect values, such as in Fig. 5, to make them more interpretable.

After dividing the observations for Cache la Poudre into 72 windows based on their Aspect values, we can then learn a decomposable model over the observations in each window using the same approach as in Sect. 5.1 on all the variables in covertype other than both the Wilderness_Area variables and Aspect. However, prior to learning these 72 decomposable models, we first discretize the quantitative variables, like Elevation, of the observations over Cache la Poudre, into 5 bins via equal-frequency binning. Once we have learned a decomposable model for each of the 72 Aspect windows, we will use these models for the analysis in the rest of this section.

5.2.1 Analysis over all variables

In order to get a sense of how the distribution over all the variables modeled by our decomposable models might change over varying Aspects, we can first compute the joint divergence between decomposable models of Aspect windows that are adjacent to each other. Specifically, in this and the rest of the analysis in this section, we will use the Hellinger distance [19] as our divergence of choice. From Fig. 6a, we can observe that there are multiple peaks in the Hellinger distance between adjacent Aspect windows, especially between azimuth $315^{\circ }$ and $360^{\circ }$. Specifically, the largest peak in distance occurs between the Aspect windows [335, 339] and [340, 344] with a Hellinger distance of 0.3652.

Table 2 Top 5 variables with the highest univariate marginal Hellinger distance

Full size table

We can further inspect this spike in the Hellinger distance between windows [335, 339] and [340, 344] by computing the univariate marginal Hellinger distance over each variable in the decomposable models for the two Aspect windows. From Table 2, we can see that out of the five variables with the highest univariate Hellinger distance, the variable Hillshade_3pm has a univariate Hellinger distance that is multiple orders or magnitude greater than the variable with the second highest univariate Hellinger distance.

As a reminder, the hillshade index is a value calculated based on the various characteristics of the slope—such as its elevation, slope, and aspect—as well as the relative position of the sun from the slope. The resulting value is then used in shading and visualizing these slopes on two-dimensional maps, where a higher value implies that the slope is—theoretically—better illuminated. Therefore, it should be no surprise that the hillshade index at 3 pm can change greatly as the slope in the observation cell faces different cardinal directions.

To better understand how hillshade changes at different azimuths for Aspect, we can plot the mean hillshade index at 9 am, Noon, and 3 pm over the different Aspect windows to see how the hillshade index differs at different azimuths at different times of the day. From Fig. 6b, we can see that the hillshade at 3 pm is high in west, while the hillshade at 9 am is high in the east. Both this observations corresponds to the fact that the sun rises in the east and sets in the west. More interestingly, at both 9 am and Noon, the hillshade in the south is more than in the north. This again corresponds to the fact that the covertype dataset collected the hillshade value during the summer solstice.

That said, the link between Aspect and Hillshade is fairly straightforward and therefore our analysis so far has not been very insightful. Therefore, for the rest of this section, we will focus on studying how the marginal distribution over just the Soil_Type variables differ with varying Aspects.

5.2.2 Analysis over soil type variables

Similar to Fig. 6a in Sect. 5.2.1, we can first compute the marginal Hellinger distance over the 40 soil type variables between decomposable models of adjacent Aspect windows in order to try and get an overview of how the Soil_Type marginal distribution changes over different azimuths for Aspect. Thanks to the marginal divergence computation methods introduced in this paper, we can compute the marginal Hellinger distance directly without having to relearn new decomposable models over just the 40 Soil_Type variables for the 72 Aspect windows, which will save a significant amount of computation needed for this task.

From Fig. 7, we can see that although there are some spikes in the marginal Hellinger distance, with the largest spike occurring at azimuth $180^{\circ }$, the largest marginal Hellinger distance between decomposable models of adjacent Aspect windows is around a Hellinger distance of 0.040. This indicates that the distributional changes over the Soil_Type variables is in general more gradual. Therefore instead of just computing the marginal Hellinger distance between decomposable models of adjacent Aspect windows, we can try to compute this divergence between all possible pairs of Aspect windows, the results of which can be found in Fig. 8.

From Fig. 8, we can see definite regions of low and high marginal Hellinger distances, with the highest Hellinger distance computed being around 0.6. We can then try to cluster these regions of Aspect window with a low marginal Hellinger distance between each other to find groups of Aspect windows with a similar marginal distribution over the Soil_Type variables to each other. Specifically, we will use the AgglomerativeClustering method—in the Python library sklearn—with complete linkage and a distance_threshold of 0.3. As seen from Fig. 9a, we obtained 3 clusters over the 72 Aspect windows from this clustering step.

Table 3 Top 5 Soil_Types with the largest univariate marginal divergence between clusters: (a) 0 & 1, (b) 0 & 2, and (c) 1 & 2

Full size table

We can further investigate the source of the marginal distributional changes over Soil_Type between these clusters by computing the univariate marginal Hellinger distance over each Soil_Type variable between these clusters. From the results in Table 3, we can observe that the Soil_Types that consistently has a high univariate Hellinger distance is the Soil_Types: 9, 2, and 0. We can investigate how the distribution over these Soil_Types change over different Aspects by plotting the probability of each Soil_Type appearing in any one of the 72 Aspect windows. Since we already have decomposable models learned on each of these Aspect windows, obtaining the probability for each of these 3 Soil_Types only requires sending the relevant inference queries to these decomposable models.

From the results in Fig. 9b, we can see that the probability of Soil_Type_9 is high in the North-Western cardinalities, while the probability of Soil_Type_2 and Soil_Type_0 is the opposite, with a high probability of being present in the South-Eastern cardinalities. Unfortunately, further analysis into why the probabilities of these 3 Soil_Types differ so much between the North-Western and the South-Eastern cardinalities requires more specialized knowledge about the geography of the region, and is therefore out of scope for this paper. However, as we have shown, through the use of the methods introduced in this paper, it is possible to quickly find distributional changes within parts of a dataset that does not have an obvious explanation and therefore to allow more effort to be placed into better understanding these distributional changes.

5.3 Quantum error analysis

Our last set of experiments addresses the quantification of the effective error behavior of quantum circuits. A quantum circuit over n qubits is a $2^n \times 2^n$ unitary complex matrix C. It $\mathinner {|{\phi }\rangle }$—a $2^n$-dimensional, $\ell _2$-normalized complex vector. The computation of an n-qubit quantum computer with circuit C can hence be written as $\mathinner {|{\psi }\rangle }=C\mathinner {|{\phi }\rangle }$. Denoting the i-th standard basis unit vector by ${\varvec{e}}_i$, any quantum state can be decomposed as $\sum _{i=0}^{2^n-1} \alpha _i {\varvec{e}}_i$ with $\sum _{i=0}^{2^n-1} |\alpha _i|^2 = 1$. Clearly, the $2^n$-dimensional output vector is never realized on a real-world quantum computer, as this would require an exponential amount of memory. Instead, a quantum computer outputs a random n-bit string, where the bit string that represents the standard binary encoding of the unsigned integer i is sampled with probability $\mathbb {P}(i)=|(C\mathinner {|{\phi }\rangle })_i|^2$, according to the Born-rule. However, due to various sources of error, samples from real-world quantum computers are generated from a (sometimes heavily) distorted distribution $\mathbb {Q}$.

In what follows, we show how to apply marginal divergences $D_{\textit{AB}}^{(\alpha ,\beta )}( \mathbb {P}_{\varvec{Z}} \mid \mid \mathbb {Q}_{\varvec{Z}})$ for quantifying the error for specific (subsets of) qubits. To this end, we conduct two sets of experiments.

First, we consider simple single-qubit circuits, shown in Fig. 10. In Fig. 10a, C is the identity matrix and the input state $\mathinner {|{\phi }\rangle }$ is set to $\mathinner {|{0}\rangle }$, i.e., the zero state. This circuit is deterministic—an exact simulation of this circuit will always output 0. In Fig. 10b, $C=\begin{pmatrix}0&{}1\\ 1&{}0\end{pmatrix}$ is the so-called Pauli-X matrix and the input state $\mathinner {|{\phi }\rangle }$ is set to $\mathinner {|{0}\rangle }$. An exact simulation of this circuit will always output 1.

Nevertheless, on a real-world quantum computer, there is a small probability, that both circuits will deliver the incorrect result. Moreover, on quantum processors with more than one qubit, the error depends on the specific hardware qubit that is selected to run the circuit—each hardware qubit has its own error behavior. The precise error is typically unknown, but quantum hardware vendors deliver rough estimates of these quantities.

Here, we employ ‘ibmq_ehningen,’ an IBM Falcon r5.11 superconducting quantum processing unit (QPU) with 27 qubits. The corresponding single-qubit errors that IBM reports for the time we ran our experiments are provided in Fig. 11 (left). Wired connections between qubits are shown in Fig. 12. We run both circuits $N=10,000$ times on each of the 27 qubits, resulting in a total $2\times 27\times 10,000 = 5,40,000$ samples for the single-qubit experiment.

For the second set of experiments, we consider the quantum Fourier transform (QFT) [13]. QFT is an essential building block of many quantum algorithms, notably Shor’s algorithm for factoring and computing the discrete logarithm, as well as quantum phase estimation. We refer to Nielsen and Chuang [37] for an elaborate introduction to these topics.

Here, we run the QFT on 300 uniformly random inputs on the same 27 qubit QPU as before. Each circuit is sampled $N=10,000$ times, resulting in a total of $2\times 10^6$ samples for the QFT experiment.

For the sake of reproducibility, the circuits, data, and python scripts will be made available after acceptance of this manuscript.

5.3.1 Single-qubit divergence

During transpilation,^{Footnote 2} a sub-set of all available qubits is chosen to realize the logical circuit. The user can decide to consider only a specific sub-set of candidate qubits. Thus, reliable information about the effective error behavior of each qubit when used within a quantum circuit is of utmost importance.

Based on data from the single-qubit experiments, we estimate $\mathbb {Q}_i$, that is, the probability that the i-th qubit takes a specific value (0 or 1). Moreover, we obtain an empirical estimate of the ground truth $\mathbb {P}_i$ from the exact simulated results, which allows us to compute the Hellinger distance between $\mathbb {P}$ and $\mathbb {Q}$ for each qubit. We do this for both circuits from Fig. 10. The average qubit-wise Hellinger distance is reported in Fig. 11 (center).

The results help us to quantify the isolated error behavior of each qubit. We see that our results qualitatively agree with the errors reported by IBM, e.g., the top-5 qubits which show the largest divergence $\{5,11,20,23,25\}$ also accumulate the largest error mass as reported by IBM. Explaining all the discrepancies is neither possible—since the exact procedure for reproducing the IBM errors in unknown—nor required. The result clearly shows that the numbers reported by IBM quantify the isolated error behavior of each qubit. We will now see that one obtains a different picture when qubits are used within a larger circuit (which is the most common use case).

To this end, we consider the results from the QFT circuits. We learn 300 pairs of DMs on the samples using the same methodology as in Sect. 5.1. We then compute the marginal divergence for each qubit and report the average over all 300 QFT circuits in Fig. 11 (right). As these numbers are marginals of the full joint divergence, we now quantify how each qubit behaves on average when it is part of a larger circuit.

We see that the result mostly disagrees with both, the errors reported by IBM and the single-qubit divergences reported by us. First, we see that qubits 12, 13, and 14—the qubits directly in the middle of the coupling map shown in Fig. 12—which are qubits that, individually, did not have much error according to both our experiments in Fig. 11 (center) and the reported errors from IBM. However, it is not a coincidence that most of these high error qubits are concentrated in the middle of the circuit. Unlike in an idealized QPU, the qubits in the physical QPU we used are not all coupled with each other. Therefore, routing has to take place whenever a quantum circuit operates on pairs of qubits which are not physically connected. As a result, the states of the qubits in the middle of the circuit—qubits 12, 13, and 14—need to be pushed around and stored in other qubits more often. This is done in order to make way for routes between qubits of opposing sides of the circuit. The more frequently these states get pushed around, the greater the chance of an error occurring; hence, why these qubits in the middle exhibit the greatest amount of error. To the best of our knowledge no quantum hardware vendor reports such numbers, although they are highly important when logic qubits are transpiled to a circuit that is mapped to the actual hardware.

From Fig. 11 (right), we can also observe a spike in the Hellinger distance for qubits 25 and 26. While qubit 25 is a heavy hitter in Fig. 11, this does not apply for qubit 26. Moreover, qubit 26 is a “leaf” qubit (w.r.t Fig. 12) and hence not used heavily for routing. Therefore, our approach reveals a type of error that is not detected by standard methods, e.g., cross-talk or three-level system activity.

5.3.2 High-order divergence

We finish our analysis by investigating marginal divergence beyond single qubits. Figure 13 shows the result of computing the pairwise marginal Hellinger distance between the DMs in each pair, taking the mean over all the pairs. The resulting heatmap is symmetrical due to Hellinger distance being a symmetric divergence. The first immediate observation is the bright bands over pairs that include qubits that already have a high univariate marginal Hellinger distance—such as qubits 12, 13, 14, 25, and 26. Another interesting observation is the bright band that occurs across the diagonal of the heatmap. We believe this is due to higher-order interactions between coupled qubits that do not necessarily exist between qubits that are multiple jumps away from each other. There are multiple causes for these higher-order interactions, ranging from unintended physical processes—such as cross-talk [24, 38]—to limitations of current QPUs—such as states being swapped between qubits solely for the purposes of routing.

These higher-order interactions are not just limited to pairwise interactions. As an example, we have discovered 70 cases where a 3-tuple of qubits, A, has a greater Hellinger distance than some other 3-tuple of qubits, B, despite the maximum Hellinger distance over all the pairwise combinations of the qubits in A being less than the minimum over that of B. This clearly shows that our method is applicable for recovering high-order error behavior that cannot be deduced by simply investigating low-order terms.

6 Conclusion

In this work, we explained how to compute the marginal and conditional $\alpha \beta $-divergence between any two DMs. In the process, we showed how to decompose the marginal distribution of a DM over any subset of variables, and how this can be used to decompose the conditional distribution of a DM as well. In order to compute the marginal and conditional $\alpha \beta $-divergence based on these decompositions, we introduced a notion of the product and quotient between MNs.

An initial numerical experiment on the image benchmark dataset QMNIST [50] showed that our methods can be applied for analyzing the differences in handwriting between NIST employees and high school students. We then showed how our methods can be used to provide more useful insights into distributional changes in a dataset by applying it to the covertype dataset [6] and finding 3 types of soil whose presence changes greatly depending on the cardinal direction the slope being observed is facing.

Finally, we proposed a completely novel error analysis of contemporary quantum computers, based on marginal divergences. To this end, we collected more than 2.5 million samples from a 27 qubit superconducting quantum processor. From these data, we found that marginal divergences can yield useful insights into the effective error behavior of physical qubits. This procedure can be applied to every quantum computing device and is of utmost importance for selecting which hardware qubits will eventually run a quantum algorithm, e.g., for feature selection or probabilistic inference [35, 36, 42].

A restriction of our approach is the limitation to computing the $\alpha \beta $-divergence between DMs. Although this family of divergences is fairly broad, it does not include some well-known divergences, such as the Jensen-Shannon divergence [31]. Furthermore, another question that arises is which relations one should analyze, since their number grows combinatorially with respect to the size of the subsets we wish to analyze. Therefore more research is needed to develop advanced tools that can aid in the analysis of higher-dimensional subsets

7 Supplementary information

Code for all the methods and experiments can be found at the kais2024 branch of the code repository: https://gitlab.com/lklee/icdm2023.

Data availability

All data used are provided in the code repository linked in at the end of the paper.

Notes

Some authors also require that the quadratic part of the Taylor expansion of $D(p, p + dp)$ defines a Riemannian metric on P [2].
Transpilation is related to compilation: the qubits and operations used in a logical quantum circuit are mapped to the resources of some real-world QPU architecture.

References

Alberghini G, Barbon Junior S, Cano A (2022) Adaptive ensemble of self-adjusting nearest neighbor subspaces for multi-label drifting data streams. Neurocomputing 481:228–248. https://doi.org/10.1016/j.neucom.2022.01.075
Article Google Scholar
Amari SI (2016) Information geometry and its applications. Springer, Berlin
Book Google Scholar
Ankan A, Panda A (2015) pgmpy: probabilistic graphical models using python. In: Proceedings of the 14th python in science conference, pp 6–11. https://doi.org/10.25080/Majora-7b98e3ed-001
Barz B, Rodner E, Garcia YG et al (2019) Detecting regions of maximal divergence for spatio-temporal anomaly detection. IEEE Trans Pattern Anal Mach Intell 41(5):1088–1101. https://doi.org/10.1109/TPAMI.2018.2823766
Article Google Scholar
Bhattacharyya R, Chakraborty S (2018) Property testing of joint distributions using conditional samples. ACM Trans Comput Theory 10(4):16:1-16:20. https://doi.org/10.1145/3241377
Article MathSciNet Google Scholar
Blackard J (1998) Covertype. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N
Blair JRS, Peyton B (1993) An introduction to chordal graphs and clique trees. In: George A, Gilbert JR, Liu JWH (eds) Graph theory and sparse matrix computation. Springer, New York, NY, The IMA Volumes in Mathematics and its Applications, pp 1–29. https://doi.org/10.1007/978-1-4613-8369-7_1
Bleuler C, Lapidoth A, Pfister C (2020) Conditional rényi divergences and horse betting. Entropy 22(3):316. https://doi.org/10.3390/e22030316
Article Google Scholar
Bottou L, Cortes C, Denker J, et al (1994) Comparison of classifier methods: a case study in handwritten digit recognition. In: Proceedings of the 12th IAPR international conference on pattern recognition, Vol. 3—Conference C: signal processing (Cat. No.94CH3440-5), vol. 2, pp 77–82 . https://doi.org/10.1109/ICPR.1994.576879
Cai C, Verdú S (2019) Conditional rényi divergence saddlepoint and the maximization of $\alpha $-mutual information. Entropy 21(10):969. https://doi.org/10.3390/e21100969
Article Google Scholar
Chen Y, Ye J, Li J (2020) Aggregated wasserstein distance and state registration for hidden Markov models. IEEE Trans Pattern Anal Mach Intell 42(9):2133–2147. https://doi.org/10.1109/TPAMI.2019.2908635
Article Google Scholar
Cichocki A, Si Amari (2010) Families of alpha- beta- and gamma- divergences: flexible and robust measures of similarities. Entropy 12(6):1532–1568. https://doi.org/10.3390/e12061532
Article MathSciNet Google Scholar
Coppersmith D (2002) An approximate Fourier transform useful in quantum factoring. https://doi.org/10.48550/arXiv.quant-ph/0201067. arXiv:quant-ph/0201067
Csiszar I (1995) Generalized cutoff rates and Renyi’s information measures. IEEE Trans Inf Theory 41(1):26–34. https://doi.org/10.1109/18.370121
Article Google Scholar
Févotte C, Bertin N, Durrieu JL (2009) Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput 21(3):793–830. https://doi.org/10.1162/neco.2008.04-08-771
Article Google Scholar
Grother PJ, Hanaoka KK (1995) NIST special database 19: handprinted forms and characters database
Hamid M, Khuroo AA, Malik AH et al (2021) Elevation and aspect determine the differences in soil properties and plant species diversity on Himalayan mountain summits. Ecol Res 36(2):340–352. https://doi.org/10.1111/1440-1703.12202
Article Google Scholar
Hammersley JM, Clifford P (1971) Markov fields on finite graphs and lattices. Unpublished manuscript
Hellinger E (1909) Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. J für Die Reine und Angew Math 136:210–271
Article MathSciNet Google Scholar
Hicks RR, Frank PS (1984) Relationship of aspect to soil nutrients, species importance and biomass in a forested watershed in west Virginia. For Ecol Manag 8(3):281–291. https://doi.org/10.1016/0378-1127(84)90060-4
Article Google Scholar
Hinder F, Jakob J, Hammer B (2020) Analysis of drifting features. https://doi.org/10.48550/arXiv.2012.00499. arXiv:2012.00499 [cs, stat]
Hoens TR, Chawla NV, Polikar R (2011) Heuristic updatable weighted random subspaces for non-stationary environments. In: 2011 IEEE 11th international conference on data mining, pp 241–250. https://doi.org/10.1109/ICDM.2011.75
Itakura F, Saito S (1968) Analysis synthesis telephony based on the maximum likelihood method. In: Proceedings of the 6th international congress on acoustics, Tokyo, Japan, pp 17–20
Ketterer A, Wellens T (2023) Characterizing crosstalk of superconducting transmon processors. https://doi.org/10.48550/arXiv.2303.14103. arXiv:2303.14103 [quant-ph]
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques—adaptive computation and machine learning. MIT Press, Cambridge
Google Scholar
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86. https://doi.org/10.1214/aoms/1177729694
Article MathSciNet Google Scholar
Lauritzen SL (1996) Graphical models. No. 17 in Oxford statistical science series. Clarendon Press, Oxford University Press, Oxford, New York
LeCun Y, Cortes C, Burges CJC (1994) The MNIST database of handwritten digits. [Data set]. https://yann.lecun.com/exdb/mnist/. Accessed 22 June 2023
Lee LK, Piatkowski N, Petitjean F et al (2023) Computing divergences between discrete decomposable models. Proc AAAI Conf Artif Intell 37(10):12243–12251. https://doi.org/10.1609/aaai.v37i10.26443
Article Google Scholar
Lee LK, Webb GI, Schmidt DF, et al (2023b) Computing marginal and conditional divergences between decomposable models with applications. In: 2023 IEEE international conference on data mining (ICDM), pp 239–248. https://doi.org/10.1109/ICDM58522.2023.00033
Lin J (1991) Divergence measures based on the shannon entropy. IEEE Trans Inf Theory 37(1):145–151. https://doi.org/10.1109/18.61115
Article MathSciNet Google Scholar
Liu A, Song Y, Zhang G, et al (2017) Regional concept drift detection and density synchronized drift adaptation. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17, pp 2280–2286. https://doi.org/10.24963/ijcai.2017/317
Maia Polo F, Izbicki R, Lacerda EG et al (2023) A unified framework for dataset shift diagnostics. Inf Sci 649:119612. https://doi.org/10.1016/j.ins.2023.119612
Article Google Scholar
Mohammad A (2008) The effect of slope aspect on soil and vegetation characteristics in southern west bank. Bethlehem Univ J 27:9–25
Google Scholar
Mücke S, Piatkowski N (2022) Quantum-inspired structure-preserving probabilistic inference. In: IEEE congress on evolutionary computation (CEC). IEEE, pp 1–9. https://doi.org/10.1109/CEC55065.2022.9870260
Mücke S, Heese R, Müller S et al (2023) Feature selection on quantum computers. Quantum Mach Intell 5(1):1–16. https://doi.org/10.1007/s42484-023-00099-z
Article Google Scholar
Nielsen MA, Chuang IL (2016) Quantum computation and quantum information (10th anniversary edition). Cambridge University Press, Cambridge
Google Scholar
Perrin H, Scoquart T, Shnirman A, et al (2023) Mitigating crosstalk errors by randomized compiling: simulation of the BCS model on a superconducting quantum computer. https://doi.org/10.48550/arXiv.2305.02345. arXiv:2305.02345 [cond-mat, physics:quant-ph]
Petitjean F, Webb GI (2015) Scaling log-linear analysis to datasets with thousands of variables. In: Proceedings of the 2015 SIAM international conference on data mining. Society for industrial and applied mathematics, pp 469–477. https://doi.org/10.1137/1.9781611974010.53
Petitjean F, Webb GI, Nicholson AE (2013) Scaling log-linear analysis to high-dimensional data. In: 2013 IEEE 13th international conference on data mining. IEEE, Dallas, TX, USA, pp 597–606. https://doi.org/10.1109/ICDM.2013.17
Petitjean F, Allison L, Webb GI (2014) A statistically efficient and scalable method for log-linear analysis of high-dimensional data. In: 2014 IEEE international conference on data mining. IEEE, Shenzhen, China, pp 480–489. https://doi.org/10.1109/ICDM.2014.23
Piatkowski N, Zoufal C (2022) On quantum circuits for discrete graphical models. arXiv quant-ph(2206.00398)
Piatkowski N, Lee S, Morik K (2013) Spatio-temporal random fields: compressible representation and distributed estimation. Mach Learn 93(1):115–139. https://doi.org/10.1007/s10994-013-5399-7
Article MathSciNet Google Scholar
Poczos B, Schneider J (2012) Nonparametric estimation of conditional information and divergences. In: Artificial intelligence and statistics. PMLR, pp 914–923
Roseberry M, Krawczyk B, Djenouri Y et al (2021) Self-adjusting k nearest neighbors for continual learning from multi-label drifting data streams. Neurocomputing 442:10–25. https://doi.org/10.1016/j.neucom.2021.02.032
Article Google Scholar
Schlimmer JC, Granger RH (1986) Incremental learning from noisy data. Mach Learn 1(3):317–354. https://doi.org/10.1007/BF00116895
Article Google Scholar
Sibson R (1969) Information radius. Z für Wahrscheinlichkeitstheorie und Verwandte Gebiete 14(2):149–160. https://doi.org/10.1007/BF00537520
Article MathSciNet Google Scholar
Webb GI, Petitjean F (2016) A multiple test correction for streams and cascades of statistical hypothesis tests. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining—KDD ’16. ACM Press, San Francisco, California, USA, pp 1255–1264. https://doi.org/10.1145/2939672.2939775
Webb GI, Lee LK, Goethals B et al (2018) Analyzing concept drift and shift from sample data. Data Min Knowl Disc 32(5):1179–1199. https://doi.org/10.1007/s10618-018-0554-1
Article MathSciNet Google Scholar
Yadav C, Bottou L (2019) Cold case: the lost MNIST digits. In: Proceedings of the 33rd international conference on neural information processing systems. Curran Associates Inc., Red Hook, NY, USA, Article 1206, 13465–13474

Download references

Acknowledgements

This work was supported by the Australian Research Council award DP210100072 as well as the Australian Government Research Training Program (RTP) Scholarship. Additionally, this work has been funded by the Federal Ministry of Education & Research of Germany and the state of North-Rhine Westphalia as part of the Lamarr-Institute for Machine Learning and Artificial Intelligence. Access to quantum computing hardware has been funded by Fraunhofer Quantum Now.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Fraunhofer IAIS, Schloss Birlinghoven, 53757, Sankt Augustin, Germany
Loong Kuan Lee & Nico Piatkowski
Department of Data Science and AI, Monash University, Melbourne, VIC, 3800, Australia
Geoffrey I. Webb & Daniel F. Schmidt

Authors

Loong Kuan Lee
View author publications
You can also search for this author in PubMed Google Scholar
Geoffrey I. Webb
View author publications
You can also search for this author in PubMed Google Scholar
Daniel F. Schmidt
View author publications
You can also search for this author in PubMed Google Scholar
Nico Piatkowski
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

L.K.L and N.P wrote the main manuscript. G.I.W and D.F.S provided ideas and help to develop the marginal and conditional computation methods. L.K.L implemented the methods and ran all the experiments.

Corresponding author

Correspondence to Loong Kuan Lee.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Lee, L.K., Webb, G.I., Schmidt, D.F. et al. Computing marginal and conditional divergences between decomposable models with applications in quantum computing and earth observation. Knowl Inf Syst 66, 7527–7556 (2024). https://doi.org/10.1007/s10115-024-02191-7

Download citation

Received: 24 December 2023
Revised: 09 July 2024
Accepted: 19 July 2024
Published: 22 August 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s10115-024-02191-7

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Abstract
1 Introduction
2 Background and notation
3 Related work
4 Proposed methodology
5 Experimental results
6 Conclusion
7 Supplementary information
Data availability
Notes
References
Acknowledgements
Funding
Author information
Ethics declarations
Additional information
Rights and permissions
About this article

Fig. 1
View in article Full size image
Fig. 2
View in article Full size image
Fig. 3
View in article Full size image
Fig. 4
View in article Full size image
Fig. 5
View in article Full size image
Fig. 6
View in article Full size image
Fig. 7
View in article Full size image
Fig. 8
View in article Full size image
Fig. 9
View in article Full size image
Fig. 10
View in article Full size image
Fig. 11
View in article Full size image
Fig. 12
View in article Full size image
Fig. 13
View in article Full size image

Alberghini G, Barbon Junior S, Cano A (2022) Adaptive ensemble of self-adjusting nearest neighbor subspaces for multi-label drifting data streams. Neurocomputing 481:228–248. https://doi.org/10.1016/j.neucom.2022.01.075
Article Google Scholar
Amari SI (2016) Information geometry and its applications. Springer, Berlin
Book Google Scholar
Ankan A, Panda A (2015) pgmpy: probabilistic graphical models using python. In: Proceedings of the 14th python in science conference, pp 6–11. https://doi.org/10.25080/Majora-7b98e3ed-001
Barz B, Rodner E, Garcia YG et al (2019) Detecting regions of maximal divergence for spatio-temporal anomaly detection. IEEE Trans Pattern Anal Mach Intell 41(5):1088–1101. https://doi.org/10.1109/TPAMI.2018.2823766
Article Google Scholar
Bhattacharyya R, Chakraborty S (2018) Property testing of joint distributions using conditional samples. ACM Trans Comput Theory 10(4):16:1-16:20. https://doi.org/10.1145/3241377
Article MathSciNet Google Scholar
Blackard J (1998) Covertype. UCI Machine Learning Repository. https://doi.org/10.24432/C50K5N
Blair JRS, Peyton B (1993) An introduction to chordal graphs and clique trees. In: George A, Gilbert JR, Liu JWH (eds) Graph theory and sparse matrix computation. Springer, New York, NY, The IMA Volumes in Mathematics and its Applications, pp 1–29. https://doi.org/10.1007/978-1-4613-8369-7_1
Bleuler C, Lapidoth A, Pfister C (2020) Conditional rényi divergences and horse betting. Entropy 22(3):316. https://doi.org/10.3390/e22030316
Article Google Scholar
Bottou L, Cortes C, Denker J, et al (1994) Comparison of classifier methods: a case study in handwritten digit recognition. In: Proceedings of the 12th IAPR international conference on pattern recognition, Vol. 3—Conference C: signal processing (Cat. No.94CH3440-5), vol. 2, pp 77–82 . https://doi.org/10.1109/ICPR.1994.576879
Cai C, Verdú S (2019) Conditional rényi divergence saddlepoint and the maximization of $\alpha $-mutual information. Entropy 21(10):969. https://doi.org/10.3390/e21100969
Article Google Scholar
Chen Y, Ye J, Li J (2020) Aggregated wasserstein distance and state registration for hidden Markov models. IEEE Trans Pattern Anal Mach Intell 42(9):2133–2147. https://doi.org/10.1109/TPAMI.2019.2908635
Article Google Scholar
Cichocki A, Si Amari (2010) Families of alpha- beta- and gamma- divergences: flexible and robust measures of similarities. Entropy 12(6):1532–1568. https://doi.org/10.3390/e12061532
Article MathSciNet Google Scholar
Coppersmith D (2002) An approximate Fourier transform useful in quantum factoring. https://doi.org/10.48550/arXiv.quant-ph/0201067. arXiv:quant-ph/0201067
Csiszar I (1995) Generalized cutoff rates and Renyi’s information measures. IEEE Trans Inf Theory 41(1):26–34. https://doi.org/10.1109/18.370121
Article Google Scholar
Févotte C, Bertin N, Durrieu JL (2009) Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput 21(3):793–830. https://doi.org/10.1162/neco.2008.04-08-771
Article Google Scholar
Grother PJ, Hanaoka KK (1995) NIST special database 19: handprinted forms and characters database
Hamid M, Khuroo AA, Malik AH et al (2021) Elevation and aspect determine the differences in soil properties and plant species diversity on Himalayan mountain summits. Ecol Res 36(2):340–352. https://doi.org/10.1111/1440-1703.12202
Article Google Scholar
Hammersley JM, Clifford P (1971) Markov fields on finite graphs and lattices. Unpublished manuscript
Hellinger E (1909) Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. J für Die Reine und Angew Math 136:210–271
Article MathSciNet Google Scholar
Hicks RR, Frank PS (1984) Relationship of aspect to soil nutrients, species importance and biomass in a forested watershed in west Virginia. For Ecol Manag 8(3):281–291. https://doi.org/10.1016/0378-1127(84)90060-4
Article Google Scholar
Hinder F, Jakob J, Hammer B (2020) Analysis of drifting features. https://doi.org/10.48550/arXiv.2012.00499. arXiv:2012.00499 [cs, stat]
Hoens TR, Chawla NV, Polikar R (2011) Heuristic updatable weighted random subspaces for non-stationary environments. In: 2011 IEEE 11th international conference on data mining, pp 241–250. https://doi.org/10.1109/ICDM.2011.75
Itakura F, Saito S (1968) Analysis synthesis telephony based on the maximum likelihood method. In: Proceedings of the 6th international congress on acoustics, Tokyo, Japan, pp 17–20
Ketterer A, Wellens T (2023) Characterizing crosstalk of superconducting transmon processors. https://doi.org/10.48550/arXiv.2303.14103. arXiv:2303.14103 [quant-ph]
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques—adaptive computation and machine learning. MIT Press, Cambridge
Google Scholar
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86. https://doi.org/10.1214/aoms/1177729694
Article MathSciNet Google Scholar
Lauritzen SL (1996) Graphical models. No. 17 in Oxford statistical science series. Clarendon Press, Oxford University Press, Oxford, New York
LeCun Y, Cortes C, Burges CJC (1994) The MNIST database of handwritten digits. [Data set]. https://yann.lecun.com/exdb/mnist/. Accessed 22 June 2023
Lee LK, Piatkowski N, Petitjean F et al (2023) Computing divergences between discrete decomposable models. Proc AAAI Conf Artif Intell 37(10):12243–12251. https://doi.org/10.1609/aaai.v37i10.26443
Article Google Scholar
Lee LK, Webb GI, Schmidt DF, et al (2023b) Computing marginal and conditional divergences between decomposable models with applications. In: 2023 IEEE international conference on data mining (ICDM), pp 239–248. https://doi.org/10.1109/ICDM58522.2023.00033
Lin J (1991) Divergence measures based on the shannon entropy. IEEE Trans Inf Theory 37(1):145–151. https://doi.org/10.1109/18.61115
Article MathSciNet Google Scholar
Liu A, Song Y, Zhang G, et al (2017) Regional concept drift detection and density synchronized drift adaptation. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI-17, pp 2280–2286. https://doi.org/10.24963/ijcai.2017/317
Maia Polo F, Izbicki R, Lacerda EG et al (2023) A unified framework for dataset shift diagnostics. Inf Sci 649:119612. https://doi.org/10.1016/j.ins.2023.119612
Article Google Scholar
Mohammad A (2008) The effect of slope aspect on soil and vegetation characteristics in southern west bank. Bethlehem Univ J 27:9–25
Google Scholar
Mücke S, Piatkowski N (2022) Quantum-inspired structure-preserving probabilistic inference. In: IEEE congress on evolutionary computation (CEC). IEEE, pp 1–9. https://doi.org/10.1109/CEC55065.2022.9870260
Mücke S, Heese R, Müller S et al (2023) Feature selection on quantum computers. Quantum Mach Intell 5(1):1–16. https://doi.org/10.1007/s42484-023-00099-z
Article Google Scholar
Nielsen MA, Chuang IL (2016) Quantum computation and quantum information (10th anniversary edition). Cambridge University Press, Cambridge
Google Scholar
Perrin H, Scoquart T, Shnirman A, et al (2023) Mitigating crosstalk errors by randomized compiling: simulation of the BCS model on a superconducting quantum computer. https://doi.org/10.48550/arXiv.2305.02345. arXiv:2305.02345 [cond-mat, physics:quant-ph]
Petitjean F, Webb GI (2015) Scaling log-linear analysis to datasets with thousands of variables. In: Proceedings of the 2015 SIAM international conference on data mining. Society for industrial and applied mathematics, pp 469–477. https://doi.org/10.1137/1.9781611974010.53
Petitjean F, Webb GI, Nicholson AE (2013) Scaling log-linear analysis to high-dimensional data. In: 2013 IEEE 13th international conference on data mining. IEEE, Dallas, TX, USA, pp 597–606. https://doi.org/10.1109/ICDM.2013.17
Petitjean F, Allison L, Webb GI (2014) A statistically efficient and scalable method for log-linear analysis of high-dimensional data. In: 2014 IEEE international conference on data mining. IEEE, Shenzhen, China, pp 480–489. https://doi.org/10.1109/ICDM.2014.23
Piatkowski N, Zoufal C (2022) On quantum circuits for discrete graphical models. arXiv quant-ph(2206.00398)
Piatkowski N, Lee S, Morik K (2013) Spatio-temporal random fields: compressible representation and distributed estimation. Mach Learn 93(1):115–139. https://doi.org/10.1007/s10994-013-5399-7
Article MathSciNet Google Scholar
Poczos B, Schneider J (2012) Nonparametric estimation of conditional information and divergences. In: Artificial intelligence and statistics. PMLR, pp 914–923
Roseberry M, Krawczyk B, Djenouri Y et al (2021) Self-adjusting k nearest neighbors for continual learning from multi-label drifting data streams. Neurocomputing 442:10–25. https://doi.org/10.1016/j.neucom.2021.02.032
Article Google Scholar
Schlimmer JC, Granger RH (1986) Incremental learning from noisy data. Mach Learn 1(3):317–354. https://doi.org/10.1007/BF00116895
Article Google Scholar
Sibson R (1969) Information radius. Z für Wahrscheinlichkeitstheorie und Verwandte Gebiete 14(2):149–160. https://doi.org/10.1007/BF00537520
Article MathSciNet Google Scholar
Webb GI, Petitjean F (2016) A multiple test correction for streams and cascades of statistical hypothesis tests. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining—KDD ’16. ACM Press, San Francisco, California, USA, pp 1255–1264. https://doi.org/10.1145/2939672.2939775
Webb GI, Lee LK, Goethals B et al (2018) Analyzing concept drift and shift from sample data. Data Min Knowl Disc 32(5):1179–1199. https://doi.org/10.1007/s10618-018-0554-1
Article MathSciNet Google Scholar
Yadav C, Bottou L (2019) Cold case: the lost MNIST digits. In: Proceedings of the 33rd international conference on neural information processing systems. Curran Associates Inc., Red Hook, NY, USA, Article 1206, 13465–13474

Search

Navigation

Computing marginal and conditional divergences between decomposable models with applications in quantum computing and earth observation

Abstract

Similar content being viewed by others

Bounds for phylogenetic network space metrics

Reconstructing Phylogenetic Trees from Multipartite Quartet Systems

Machine Learning Versus Semidefinite Programming Approach to a Particular Problem of the Theory of Open Quantum Systems

1 Introduction

2 Background and notation

2.1 Statistical divergences

Definition 1

Definition 2

Definition 3

Definition 4

2.2 Graphical models

2.2.1 Markov network (MN)

2.2.2 Belief propagation

2.2.3 Decomposable model (DM)

3 Related work

3.1 Previous work: computing joint divergence

Definition 5

4 Proposed methodology

4.1 Multi-graph aggregated sum-products (MGASPs)

Definition 6

Lemma 1

Proof

4.2 Marginal divergence

4.2.1 Factorizing the marginal distribution of a decomposable model (DM)

Definition 7

Corollary 1

Definition 8

Definition 9

Definition 10

Theorem 1

Proof

4.2.2 Computing the marginal \(\alpha \beta \)-divergence between DMs

Theorem 2

Proof

4.3 Conditional divergence

Definition 11

Lemma 2

Proof

Corollary 2

Theorem 3

Proof

Theorem 4

Proof

5 Experimental results

5.1 Experiments with QMNIST

5.2 Studying distributional changes within the cover type dataset

5.2.1 Analysis over all variables

5.2.2 Analysis over soil type variables

5.3 Quantum error analysis

5.3.1 Single-qubit divergence

5.3.2 High-order divergence

6 Conclusion

7 Supplementary information

Data availability

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords