1 Introduction

Structure learning aims to uncover, from observational data, the underlying directed acyclic graphs (DAGs) that represent statistical or causal relationships between variables. The structure learning task has many applications in biology [38], economics [22], and interpretable machine learning [32]. Correspondingly, it is gaining scientific interest in various domains such as computer science, statistics, and bioinformatics [43]. One challenge of traditional structure learning methods such as GES [7] is the combinatorial search space of possible DAGs [8]. NO-TEARS [50] addresses this challenge by relaxing the formulation of the learning task into a continuous space and employing continuous optimization techniques. However, the continuous representation raises another challenge: enforcing the acyclicity constraint on the graphs.

In most continuous score-based methods [27, 28, 50, 51], graph acyclicity is enforced through a penalizing score, so that minimizing the score also reduces the cyclicity of the graphs. This type of approach requires a large number of running steps with complex penalization weight scheduling to ensure that the constraint holds, and the scheduling varies greatly depending on the setting. This lack of certainty affects the quality and restricts the applicability of the learned structures. Another approach is to embed the acyclicity constraint in the generative model of the graphs, as in [9], by utilizing weighted adjacency matrices that can be decomposed into combinations of a permutation matrix and a strictly lower triangular matrix. Our study is inspired by this approach and uses a direct constraint in the generation process instead of a post-hoc penalizing score.

There is a parallel branch of permutation-based causal discovery methods that allow the topological ordering to be found in polynomial time [6, 15, 37, 39], which can provide beneficial prior information. Inspired by these approaches, we propose a framework, Topological Ordering in Differentiable Bayesian Structure Learning with ACyclicity Assurance (TOBAC), to greatly reduce the difficulty of the acyclicity-constraining task. The framework performs conditional inference, with the prior knowledge provided by the topological orderings acting as the condition. In this study, we consider two possible approaches for strictly constraining the acyclicity of generated graphs.

The first approach is based on the independent factorization property of the adjacency matrix \(\textbf{A}_{\textbf{G}}\) of a DAG \(\textbf{G}\) into a permutation matrix \(\textbf{P}\), which can be obtained for each topological ordering \(\varvec{\pi }\), and a strictly upper triangular matrix \(\textbf{S}\), which represents the adjacency matrix when the ordering is correct and can be represented by a latent variable \(\textbf{Z}\). The factorization \(p\left( \textbf{G},\textbf{S},\textbf{P}\right) =p\left( \textbf{S}\right) p\left( \textbf{P}\right) p\left( \textbf{G}\mid \textbf{S},\textbf{P}\right) =p\left( \textbf{Z}\right) p\left( \varvec{\pi }\right) p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) =p\left( \textbf{G},\textbf{Z},\varvec{\pi }\right) \) allows us to infer \(\textbf{P}\) and \(\textbf{S}\) independently from \(\varvec{\pi }\) and \(\textbf{Z}\), respectively. In particular, decoupling these variables enables us to apply recent advances in learning topological orderings and in probabilistic model inference. For each permutation matrix \(\textbf{P}\), we can infer the DAG's strictly upper triangular matrix \(\textbf{S}\) and compute the adjacency matrix of an isomorphic DAG with \(\textbf{A}_{\textbf{G}}=\textbf{P}\textbf{S}\textbf{P}^{\top }\). To infer this DAG \(\textbf{G}\), we choose a recent graph inference approach in this field, DiBS [27], as our inference engine for \(\textbf{S}\). This variant is called permutation-based TOBAC or P-TOBAC.

The other approach that we propose is based on the property of the gradient flow [25] on the topological ordering \(\varvec{\pi }\), which allows for the construction of a corresponding mask \(\textbf{M}\). This mask can be applied directly to an adjacency matrix \(\textbf{E}\) to eliminate conflicting edges and generate the final adjacency matrix \(\textbf{A}_{\textbf{G}}=\textbf{M}\odot \textbf{E}\) with acyclicity assured. Similar to the previous approach, the adjacency matrix \(\textbf{E}\) can also be represented by a latent variable \(\textbf{Z}\). Additionally, this formulation enables the independent inference of \(\textbf{M}\) from \(\varvec{\pi }\) and \(\textbf{E}\) from \(\textbf{Z}\), which results in the identical factorization \(p\left( \textbf{G},\textbf{M},\textbf{E}\right) =p\left( \textbf{E}\right) p\left( \textbf{M}\right) p\left( \textbf{G}\mid \textbf{E},\textbf{M}\right) =p\left( \textbf{Z}\right) p\left( \varvec{\pi }\right) p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) =p\left( \textbf{G},\textbf{Z},\varvec{\pi }\right) \). This variant is called mask-based TOBAC or M-TOBAC.

Our approaches are evaluated through experiments on synthetic data and a real flow cytometry dataset [38] in linear and nonlinear Gaussian settings. The proposed structural-constraint framework yields better DAG predictions and achieves better performance than other approaches.

Fig. 1

The proposed generative model of the Bayesian networks in Topological Ordering in Differentiable Bayesian Structure Learning with Acyclicity Assurance (TOBAC). The graph \(\textbf{G}\) is generated from a latent variable \(\textbf{Z}\) and a topological ordering \(\varvec{\pi }\). The latent variable \(\textbf{Z}\) is utilized to represent the edges of \(\textbf{G}\), whereas the information from the topological ordering \(\varvec{\pi }\) is employed for strictly constraining the acyclicity. The variable \({\varvec{\Theta }}\) defines the parameters of the local conditional distributions of the nodes given their parents in \(\textbf{G}\). The observational data \(\mathcal {D}\) consisting of n observations are assumed to be generated from this generative model

Contributions The main contributions of this study are summarized as follows

  1.

    We address the limitations of post-hoc acyclicity constraint scores by strictly constraining the generative structure of the graphs. By utilizing the permutation-based decomposition of the adjacency matrix and the property of the gradient flow on the topological ordering, we can strictly guarantee the acyclicity constraint in Bayesian networks.

  2.

    We introduce TOBAC, a framework with two variants corresponding to two possible permutation-based and mask-based approaches for independently inferring and conditioning on the topological ordering (illustrated in Fig. 1). Our inference process guarantees the acyclicity of inferred graphs as well as reduces the inference complexity of the adjacency matrices.

  3.

    We demonstrate the effectiveness of TOBAC in comparison with related state-of-the-art Bayesian score-based methods on both synthetic and real-world data. Our approach obtains better performance on synthetic linear and nonlinear Gaussian data and on the real flow cytometry dataset.

This work builds upon and extends the work appearing in ICDM 2023 [45], which introduced the framework of permutation-based TOBAC (P-TOBAC). In this work, we propose an alternative mask-based approach for TOBAC (M-TOBAC) for constraining the acyclicity. This variant is more efficient when implemented because it avoids matrix multiplication operations. We also adapt the TOBAC framework to integrate the new variant and make the formulations less complicated and more general. Last, we evaluate our framework with its two implementation variants on both synthetic data, including data generated from linear and nonlinear Gaussian models, and real data, including the flow cytometry dataset [38] and the SynTReN dataset [5, 23], to demonstrate the effectiveness of our previous variant as well as the new one. An additional ablation study has also been performed to clearly demonstrate the limitations of the post-hoc acyclicity constraint and the advantages of our approach.

2 Background

Bayesian structure learning methods

Most structure learning approaches can be categorized by their learning objectives into two main categories. The majority of methods [6, 7, 23, 33, 37, 39, 50, 51] fall into the first category, where each method aims to recover a point estimate or its Markov equivalence class. In the other category, called Bayesian structure learning, the methods [9, 10, 14, 21, 27, 28, 30] aim to learn the posterior distribution over DAGs given the observational data, i.e., \(p\left( \textbf{G}\mid \mathcal {D}\right) \). These methods can quantify the epistemic uncertainty in cases such as limited sample sizes or non-identifiable models in causal discovery. Previous approaches to learning the posterior distribution are diverse, for example, using Markov chain Monte Carlo (MCMC) [30], bootstrapping [14] with PC [42] and GES [7], and exact methods with dynamic programming [21]. Recent Bayesian structure learning approaches use more advanced methods such as variational inference [9, 27, 28] or generative flow networks (GFlowNets) [2, 10, 11, 34].

Permutation-based methods

Searching over the space of the permutations of the variables is significantly faster than searching over the space of possible DAGs. Once the correct topological ordering of a DAG is found, a skeleton containing the possible relations can be constructed, and the DAG can be easily retrieved from this skeleton using the available conditional independence tests [4, 36, 41, 44, 46]. There have been many approaches for both linear [6, 49] and nonlinear additive noise models [15, 33, 37, 39]. EqVar [6] learns the topological orderings under the assumption of equal variances. NPVar [15] extends EqVar by replacing the error variances with the corresponding residual variances and modeling them in a nonparametric setting. Both of these methods iteratively identify nodes from the roots to the leaves of the causal DAG. Alternatively, SCORE [37] finds the leaves from a score estimated to match the gradient of the log probability distribution of the variables.

Discrete score-based methods

Score-based approaches in structure learning define scores to evaluate the generated graphs. The task in this setting translates to searching for graphs that maximize or minimize that score, depending on its configuration. For fully observational data, early methods perform the search in the discrete space of graphs, and the general aim is to find algorithms that can perform this search optimally. For example, GES [7] uses greedy search over equivalence classes of DAGs to maximize a score function. Recent developments efficiently search for the sparsest permutation, as in the Sparsest Permutation algorithm [36] and the Greedy Sparsest Permutation algorithm [41], over the vertices of a permutohedron representing the space of permutations of DAGs, and reduce the search space by contracting the vertices corresponding to the same DAG [41]. These methods are also adapted for interventional data in [44, 46]. Another recent approach in this category is the group of DAG-GFlowNet methods [10, 11, 34], which utilize GFlowNets to search over the states of DAGs and can approximate the posterior distribution using both observational and interventional data. Constraining the acyclicity in this setting is relatively simple, as graphs containing cycles can be flagged as invalid states and ignored whenever the search algorithms encounter them.

Continuous score-based methods

Structure learning approaches with continuous relaxation [9, 23, 27, 28, 47, 51] have been developing rapidly since NO-TEARS [50] was introduced. A continuous search space allows us to optimize or infer with gradient-based approaches and avoid searching over the large space of discrete DAGs. Bayesian inference with variational inference is one category of gradient-based approaches used in the latest frameworks [9, 27, 28]. BCD Nets [9] decomposes the weighted adjacency matrix into a permutation matrix and a strictly lower triangular matrix, and infers the probabilities of these matrices using the evidence lower bound (ELBO) of the variational inference problem. DiBS [27] models the probabilities of the edges using a bilinear generative model from the latent space and infers the posterior using Stein variational gradient descent.

Acyclicity constraint in continuous score-based methods

Despite the computational benefits compared to discrete methods, relaxing the search into a continuous space raises another problem: how the acyclicity can be constrained. The more popular approach is to represent the cyclicity of generated graphs as a regularization score. By penalizing this score until it reaches zero, the acyclicity of the graphs can be constrained. NO-TEARS [50] employs the property that if a graph is acyclic, the trace of the matrix exponential of its adjacency matrix attains its minimum value, which equals the number of nodes in the graph. The drawback of this constraint is the high computational complexity of the matrix exponential, which requires \(\mathcal {O}\left( d^{3}\right) \) numerical operations to evaluate. DAG-GNN [47] proposes an alternative constraint that is easier to implement and has the same computational complexity. NO-BEARS [24] presents another constraint based on the spectral radius of the matrix, which can be approximated with a computational complexity of \(\mathcal {O}\left( d^{2}\right) \). Despite being easy to implement, these constraints provide no acyclicity assurance and usually require a larger number of running steps when the data structure is more complicated.
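To make the contrast concrete, the following is a minimal NumPy/SciPy sketch (ours, not taken from any of the cited implementations) of two such penalties: the trace-of-matrix-exponential score described above and the polynomial score later given in Equation (32). Both return zero only when the adjacency matrix encodes a DAG.

```python
import numpy as np
from scipy.linalg import expm

def h_exponential(A):
    """Trace-of-matrix-exponential penalty: tr(exp(A)) - d is zero iff A is a DAG.
    (NO-TEARS applies this to the element-wise squared weighted adjacency matrix.)"""
    return np.trace(expm(A)) - A.shape[0]

def h_polynomial(A):
    """Polynomial alternative in the style of Equation (32): tr((I + A / d)^d) - d."""
    d = A.shape[0]
    return np.trace(np.linalg.matrix_power(np.eye(d) + A / d, d)) - d

dag = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [0., 0., 0.]])
cyclic = dag.copy()
cyclic[2, 0] = 1.0          # adding the edge X3 -> X1 closes a cycle

for name, A in [("dag", dag), ("cyclic", cyclic)]:
    print(name, h_exponential(A), h_polynomial(A))   # ~0 for the DAG, > 0 otherwise
```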

Another approach is to structurally constrain the generated graphs to be acyclic. The topological ordering naturally represents the acyclicity of a graph, as edges can only originate from predecessor nodes in the ordering to successor nodes. As a consequence, a directed graph is acyclic if and only if there exists a corresponding topological ordering of its nodes. There are different approaches to representing the relation between a DAG and a topological ordering. BCD Nets [9] decomposes the adjacency matrix of a DAG into a permutation matrix, which is one representation of the topological ordering, and a strictly lower triangular matrix, which embeds the relationships among the nodes. DAG-NoCurl [48] uses the gradient flow [25] to represent the edge directions from predecessors to successors in the form of a mask. In both methods, the topological ordering and the edges are learned simultaneously, which can render the learning process more complex because a slight change in the topological ordering can transform the generated DAG drastically.

Our framework belongs to the latter category, but we avoid the complexity of jointly inferring the topological ordering and the graph. In this study, we design an acyclicity-ensuring conditional inference process with the topological ordering as the condition. With this setting, we can effectively integrate the knowledge from the topological ordering into the inference to guarantee that generated graphs are acyclic. Additionally, with the provided prior knowledge, the number of possible graphs is reduced from \(2^{d^{2}}\) to \(2^{{d(d-1)}/{2}}\), which reduces the difficulty of the inference process and allows our framework to achieve higher inference efficiency.

3 TOBAC: topological ordering in differentiable Bayesian structure learning with acyclicity assurance

3.1 Permutation-based acyclicity constraint

3.1.1 Acyclicity assurance via decomposition of adjacency matrix

To analyze the decomposition of the adjacency matrix of a directed acyclic graph (DAG), we need to start from the topological orderings. A topological ordering (or, in short, an ordering) is a topological sort of the variables in a DAG. Given a DAG \(\textbf{G}\) with d nodes \(\textbf{X}=\left\{ X_{1},X_{2},\dots ,X_{d}\right\} \), let \(\varvec{\pi }=\left\{ \pi _{1},\pi _{2},\dots ,\pi _{d}\right\} ,\pi _{i}\ne \pi _{j},\forall i\ne j\) contain the corresponding ordering \(\pi _{i}\) of each node \(X_{i}\). \(\pi _{i}<\pi _{j}\) means that \(X_{i}\in nd\left( X_{j}\right) \) where \(nd\left( X_{j}\right) \) is the set containing the non-descendants of \(X_{j}\). We consider the canonical case where the ordering of the nodes is already correct, i.e., \(\varvec{{\pi }}^{*}=\left\{ \pi _{i}^{*}=i,\forall i\right\} \). In this case, because every node \(X_{i}\) is a non-descendant of \(X_{j}\) if \(i<j\), the adjacency matrix \(\textbf{A}_{\textbf{G}}\) of the DAG \(\textbf{G}\) will become a strictly upper triangular matrix \(\textbf{S}\in \left\{ 0,1\right\} ^{d\times d}\).

In order to generalize to any ordering, we use a permutation matrix \(\textbf{P}\left( \varvec{\pi }\right) \) that transforms the ordering \(\varvec{\pi }\) to \(\varvec{\pi }^{*}\) as

$$\begin{aligned} P\left( \varvec{\pi }\right) _{j,i}={\left\{ \begin{array}{ll} 1 &{} \text {if }i=\pi _{j},\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)

With this permutation matrix, we can always generate the adjacency matrix of an isomorphic DAG by

$$\begin{aligned} {\textbf{A}_{\textbf{G}}=\textbf{P}\left( \varvec{\pi }\right) \cdot \textbf{S}\cdot \textbf{P}\left( \varvec{\pi }\right) ^{\top }.} \end{aligned}$$
(2)

This formulation obtains the adjacency matrix \(\textbf{A}_\textbf{G}\) by shifting the corresponding rows and columns of \(\textbf{S}\) from the canonical ordering \(\varvec{{\pi }}^{*}\) to the ordering \(\varvec{{\pi }}\). As the canonical adjacency matrix \(\textbf{S}\) is acyclic, the derived adjacency matrix \(\textbf{A}_\textbf{G}\) will always satisfy acyclicity. By employing this decomposition in the generative process, the acyclicity of the inferred graphs is always satisfied.
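As a small illustration, the following NumPy sketch (ours; the ordering values are made up) builds the permutation matrix of Equation (1) from an ordering, applies Equation (2) to a random strictly upper triangular matrix, and checks that the resulting graph has zero cyclicity in the sense of Equation (32).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
pi = np.array([3, 1, 4, 2])            # hypothetical ordering values pi_i (1-indexed)

# Permutation matrix with P[j, i] = 1 iff i = pi_j (Equation (1)).
P = np.zeros((d, d))
P[np.arange(d), pi - 1] = 1.0

# Random strictly upper triangular adjacency matrix S in the canonical ordering.
S = np.triu(rng.integers(0, 2, size=(d, d)), k=1).astype(float)

A = P @ S @ P.T                        # Equation (2): adjacency of an isomorphic DAG

# Sanity check with the cyclicity score of Equation (32); it is exactly zero for a DAG.
h = np.trace(np.linalg.matrix_power(np.eye(d) + A / d, d)) - d
print(A)
print("cyclicity:", h)
```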

3.1.2 Representing the canonical adjacency matrix in a latent space

With every permutation matrix \(\textbf{P}\left( \varvec{\pi }\right) \), we only need to find the equivalent canonical adjacency matrix \(\textbf{S}\). Following the DiBS approach in [27], this matrix is sampled from a latent variable \(\textbf{Z}\) consisting of two embedding matrices \(\textbf{U}=\left[ \textbf{u}_{1},\textbf{u}_{2},\ldots ,\textbf{u}_{d-1}\right] , \textbf{u}_{i}\in \mathbb {R}^{k}\) and \(\textbf{V}=\left[ \textbf{v}_{1},\textbf{v}_{2},\ldots ,\textbf{v}_{d-1}\right] , \textbf{v}_{j}\in \mathbb {R}^{k}\). Due to the nature of strictly upper triangular matrices, only \(d-1\) vectors in each embedding are needed to construct \(\textbf{S}\) instead of d vectors as in DiBS. Following this configuration, the dimension k of the latent vectors is chosen to be greater than or equal to \(d-1\) to ensure that the generated graphs are not constrained in rank. We represent the probabilities of the values in \(\textbf{S}\) as follows

$$\begin{aligned} S_{\alpha }\left( \textbf{Z}\right) _{i,j}&:=p_{\alpha }\left( S_{i,j}=1\mid \textbf{u}_{i},\textbf{v}_{j-1}\right) \end{aligned}$$
(3)
$$\begin{aligned}&:={\left\{ \begin{array}{ll} \sigma _{\alpha }\left( \textbf{u}_{i}^{\top }\textbf{v}_{j-1}\right) &{} \text {if }j>i,\\ 0 &{} \text {otherwise;} \end{array}\right. } \end{aligned}$$
(4)

where \(\sigma _{\alpha }\left( x\right) =1/\left( 1+\exp \left( -\alpha x\right) \right) \) and the term \(\alpha \) will be increased each step to make the sigmoid function \(\sigma _{\alpha }\left( x\right) \) converge to the Heaviside step function \(\mathbbm {1}\left[ x>0\right] \). As \(\alpha \rightarrow \infty \), the converged generated \(\textbf{S}_{\infty }\) will become

$$\begin{aligned} S_{\infty }\left( \textbf{Z}\right) _{i,j}:={\left\{ \begin{array}{ll} \mathbbm {1}\left[ \textbf{u}_{i}^{\top }\textbf{v}_{j-1}>0\right] &{} \text {if }j>i,\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(5)

The probability of the elements in the adjacency matrix \(\textbf{A}_{\textbf{G}}\) given the latent variable \(\textbf{Z}\) and the permutation matrix \(\textbf{P}\) corresponding to the topological ordering \(\varvec{\pi }\) can be computed by

$$\begin{aligned} {A_{\alpha }\left( \textbf{Z},\varvec{\pi }\right) _{i,j}}&{:=p_{\alpha }\left( A_{i,j}=1\mid \textbf{Z},\varvec{\pi }\right) } \end{aligned}$$
(6)
$$\begin{aligned}&{:=\sum _{a=1}^{d}\sum _{b=1}^{d}P\left( \varvec{\pi }\right) _{i,b}\cdot S_{\alpha }\left( \textbf{Z}\right) _{b,a}\cdot P\left( \varvec{\pi }\right) _{a,j}^{\top }.} \end{aligned}$$
(7)
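The sketch below (our own, with hypothetical embeddings) mirrors Equations (3)-(5): it builds the soft strictly upper triangular matrix from the latent embeddings U and V and shows that it hardens into a binary matrix as \(\alpha \) grows.

```python
import numpy as np

def soft_canonical_adjacency(U, V, alpha):
    """S_alpha[i, j] = sigma_alpha(u_i^T v_{j-1}) for j > i, and 0 otherwise."""
    d = U.shape[0] + 1                       # U and V each hold d - 1 latent vectors
    S = np.zeros((d, d))
    for i in range(d - 1):                   # 0-indexed rows that can carry edges
        for j in range(i + 1, d):
            S[i, j] = 1.0 / (1.0 + np.exp(-alpha * U[i] @ V[j - 1]))
    return S

rng = np.random.default_rng(0)
d, k = 5, 6                                  # k >= d - 1 avoids a rank constraint
U = rng.normal(size=(d - 1, k))
V = rng.normal(size=(d - 1, k))

print(soft_canonical_adjacency(U, V, alpha=1.0))    # soft edge probabilities
print(soft_canonical_adjacency(U, V, alpha=1e3))    # ~Equation (5) as alpha grows
```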

3.2 Mask-based acyclicity constraint

The formulation in Equation (2) can not only guarantee the acyclicity of generated graphs but also reduce the dimensions of latent representations required for the edge matrix \(\textbf{S}\) as demonstrated in Sect. 3.1.2. However, the use of matrix multiplications in this formulation can affect the computational efficiency of the method. In this section, we propose an alternative implementation of the acyclicity constraint which allows the computation to be executed more efficiently.

We adapt the formulation from DAG-NoCurl [48] into our framework to avoid matrix multiplication operations. Instead of using the permutation approach, DAG-NoCurl uses a mask to eliminate edges that conflict with a topological ordering \(\varvec{\pi }\). Let \({\varvec{\pi }}=\left\{ \pi _{1},\pi _{2},\dots ,\pi _{d}\right\} ,\pi _{i}\ne \pi _{j},\forall i\ne j\) be the corresponding orderings of the nodes \(\textbf{X}=\left\{ X_{1},X_{2},\dots ,X_{d}\right\} \) of a graph \(\textbf{G}\) as presented in Sect. 3.1.1. A gradient flow [25] \(\text {grad}\varvec{\pi }\) on \(\varvec{\pi }\) is a function whose values are defined as follows

$$\begin{aligned} \left( \text {grad}{\varvec{\pi }}\right) \left( i,j\right) :=\pi _{j}-\pi _{i},1\le i,j\le d. \end{aligned}$$
(8)

The ordering value \(\pi _{i}\) of \(X_{i}\) will be smaller than the ordering value \(\pi _{j}\) of \(X_{j}\) if \(X_{i}\in nd\left( X_{j}\right) \). Hence, each value of \(\text {grad}\varvec{\pi }\) can have different properties depending on the topological ordering

$$\begin{aligned} \left( \text {grad}\varvec{\pi }\right) \left( i,j\right) {\left\{ \begin{array}{ll} >0 &{} \text {if }X_{i}\in nd\left( X_{j}\right) ,\\ =0 &{} \text {if }i=j,\\ <0 &{} \text {if }X_{j}\in nd\left( X_{i}\right) . \end{array}\right. } \end{aligned}$$
(9)

As the positive values of \(\left( \text {grad}\varvec{\pi }\right) \left( i,j\right) \) correspond to the edges of the full DAG that have the topological ordering \(\varvec{\pi }\), a mask \(\textbf{M}\left( \varvec{\pi }\right) \) to filter out conflicting edges can be defined as

$$\begin{aligned} M\left( \varvec{\pi }\right) _{i,j}:=\mathbbm {1}\left[ \left( \text {grad} \varvec{\pi }\right) \left( i,j\right) >0\right] . \end{aligned}$$
(10)

This mask can be applied to any adjacency matrix \(\textbf{E}\) of the graph \(\textbf{G}\) to constrain its acyclicity by filtering out the edges that conflict with the ordering \(\varvec{\pi }\), yielding the acyclicity-assured adjacency matrix as follows

$$\begin{aligned} {\textbf{A}_{\textbf{G}} = \textbf{M}\left( \varvec{\pi }\right) \odot \textbf{E},} \end{aligned}$$
(11)

where “\(\odot \)” denotes the element-wise multiplication.

The remaining task in this approach is inferring the matrix \(\textbf{E}\) via its latent representation. Similar to DiBS [27] and Equations (3)–(4), to represent the matrix \(\textbf{E}\) in the latent space, its soft representation \(\textbf{E}_{\alpha }\) is generated from a latent variable \(\textbf{Z}\) consisting of two embedding matrices \(\textbf{U}=\left[ \textbf{u}_{1},\textbf{u}_{2},\ldots ,\textbf{u}_{d}\right] , \textbf{u}_{i}\in \mathbb {R}^{k}\) and \(\textbf{V}=\left[ \textbf{v}_{1},\textbf{v}_{2},\ldots ,\textbf{v}_{d}\right] , \textbf{v}_{j}\in \mathbb {R}^{k}\) as follows

$$\begin{aligned} E_{\alpha }\left( \textbf{Z}\right) _{i,j}&:=p_{\alpha }\left( E_{i,j}=1\mid \textbf{u}_{i},\textbf{v}_{j}\right) \end{aligned}$$
(12)
$$\begin{aligned}&:=\sigma _{\alpha }\left( \textbf{u}_{i}^{\top }\textbf{v}_{j}\right) . \end{aligned}$$
(13)

In this setting, to prevent \(\textbf{E}_{\alpha }\) from being constrained in rank, we need \(k \ge d\). From this representation, as in Eq. 11, the probabilities of the edges in the directed and acyclic graph \(\textbf{G}\) can be computed by

$$\begin{aligned} {\textbf{A}_{\alpha }\left( \textbf{Z},\varvec{\pi }\right) :=\textbf{M} \left( \varvec{\pi }\right) \odot \textbf{E}_{\alpha }\left( \textbf{Z}\right) .} \end{aligned}$$
(14)

Compared to the two matrix multiplication operations in Equation (2) of the permutation-based approach in Sect. 3.1, this formulation is more efficient because it uses only a single element-wise operation while still assuring the acyclicity. However, implementing this setting requires the latent variable to comprise 2d vectors for representing the \(d^{2}\) elements in the matrix \(\textbf{E}\), instead of \(2\left( d-1\right) \) vectors for the \(d\left( d-1\right) /2\) elements in the strictly upper triangular matrix \(\textbf{S}\).
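A compact sketch of the mask-based construction (ours; the ordering and embeddings are hypothetical) following Equations (8)-(14): the mask is a simple pairwise comparison of ordering values, and the acyclicity-assured soft adjacency is a single element-wise product.

```python
import numpy as np

def ordering_mask(pi):
    """M[i, j] = 1 iff (grad pi)(i, j) = pi_j - pi_i > 0 (Equations (8)-(10))."""
    pi = np.asarray(pi)
    return (pi[None, :] - pi[:, None] > 0).astype(float)

def soft_edges(U, V, alpha):
    """E_alpha[i, j] = sigma_alpha(u_i^T v_j) (Equations (12)-(13))."""
    return 1.0 / (1.0 + np.exp(-alpha * U @ V.T))

rng = np.random.default_rng(0)
d, k = 4, 5                          # here k >= d avoids a rank constraint on E
pi = np.array([3, 1, 4, 2])          # hypothetical topological ordering values
U = rng.normal(size=(d, k))
V = rng.normal(size=(d, k))

A_soft = ordering_mask(pi) * soft_edges(U, V, alpha=1.0)   # Equation (14)
print(A_soft)                        # acyclic by construction: M zeroes conflicting edges
```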

3.3 Estimating the latent variable using Bayesian inference

In order to estimate the latent variable \(\textbf{Z}\), extending the approach from [27], we first consider the generative model given in Fig. 1. In this figure, a Bayesian network consists of a pair of variables \(\left( \textbf{G},{\varvec{\Theta }}\right) \) where \({\varvec{\Theta }}\) defines the parameters of the local conditional distributions at each variable given its parents in the DAG. This generative model is assumed to generate the observational data \(\mathcal {D}\) containing n observations of \(\textbf{X}\).

Given a topological ordering \({\varvec{\pi }}\), the generative model conditioned on \(\varvec{\pi }\) can be factorized as

$$\begin{aligned} p\left( \textbf{Z},\textbf{G},{\varvec{\Theta }},\mathcal {D}\mid \varvec{\pi }\right) =p\left( \textbf{Z}\right) p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) p\left( {\varvec{\Theta }}\mid \textbf{G}\right) p\left( \mathcal {D} \mid \textbf{G},{\varvec{\Theta }}\right) . \end{aligned}$$
(15)

For any function \(f\left( \textbf{G},{\varvec{\Theta }}\right) \) of interest, we can compute its expectation from the distribution \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) by inferring \(p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) with the following formula

$$\begin{aligned} \mathbb {E}_{p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) } \left[ f\left( \textbf{G},{\varvec{\Theta }}\right) \right] =\mathbb {E}_{p\left( \textbf{Z}, {\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) }\left[ \frac{\mathbb {E}_{p\left( \textbf{G} \mid \textbf{Z},\varvec{\pi }\right) }\left[ f\left( \textbf{G},{\varvec{\Theta }}\right) p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }{\mathbb {E}_{p\left( \textbf{G} \mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G} \right) \right] }\right] , \end{aligned}$$
(16)

where \(p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) =p\left( {\varvec{\Theta }} \mid \textbf{G}\right) p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) \) and \(p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) \) is computed using a graph prior (e.g., Erdős–Rényi [13] or scale-free [1]) on the soft graph in Equations (6) and (14). The function \(f\left( \textbf{G},{\varvec{\Theta }}\right) \) in Equation (16) acts as a placeholder for \(p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) \) in this study or any other functions depending on the setting of each structure learning task. The distributions of the parameters \(p\left( {\varvec{\Theta }}\mid \textbf{G}\right) \) and the data \(p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) \) are chosen differently for the linear and nonlinear Gaussian models. In the linear Gaussian model, the log probability of the parameters given the graph is

$$\begin{aligned} {\log p\left( {\varvec{\Theta }}\mid \textbf{G}\right) =\sum _{i,j} \left( A_{\textbf{G}}\right) _{i,j} \log \mathcal {N}\left( \theta _{i,j};\mu _{e},\sigma _{e}^{2}\right) ,} \end{aligned}$$
(17)

where \(\mu _{e}\) and \(\sigma _{e}\) are the mean and standard deviation of the Gaussian edge weights, and the log likelihood is as follows

$$\begin{aligned} \log p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) =\sum _{i=1}^{d}\log \mathcal {N}\left( X_{i};{\varvec{{\theta }}}_{i}^{\top }\textbf{X}_{\textrm{pa} \left( i\right) },\sigma _{obs}^{2}\right) , \end{aligned}$$
(18)

where \(\sigma _{obs}\) is the standard deviation of the additive observation noise at each node. In the nonlinear Gaussian model, we follow [27, 51] by using feed-forward neural networks (FFNs), denoted by \(\textrm{FFN}\left( \cdot ;{\varvec{\Theta }}\right) :\mathbb {R}^{d}\rightarrow \mathbb {R}\), to represent the relation of the variables with their parents (i.e., \(f_{i}\left( \textbf{X}_{pa\left( i\right) }\right) \) for each \(X_{i}\)). For each variable \(X_{i}\), the output is computed as

$$\begin{aligned} \textrm{FFN}\left( \textbf{x};{\varvec{\Theta }}^{\left( i\right) }\right) :={\varvec{\Theta }} ^{\left( i,L\right) }f_{\textrm{a}}\left( \dots {\varvec{\Theta }}^{\left( i,2\right) }f_{\textrm{a}} \left( {\varvec{\Theta }}^{\left( i,1\right) }\textbf{x}+{\varvec{\theta }}_{b}^{\left( i,1\right) } \right) +{\varvec{\theta }}_{b}^{\left( i,2\right) }\dots \right) +{\varvec{\theta }}_{b}^{\left( i,L\right) }, \end{aligned}$$
(19)

where \({\varvec{\Theta }}^{(i,l)}\in \mathbb {R}^{d_{l}\times d_{l-1}}\) is the weight matrix, \({\varvec{\theta }}_{b}^{\left( i,l\right) }\in \mathbb {R}^{d_{l}}\) is the bias vector, and \(f_{\textrm{a}}\) is the activation function. From the model in this setting, the log probability of the parameters is

$$\begin{aligned} {\begin{aligned}\log p\left( {\varvec{\Theta }}\mid \textbf{G}\right)&=\sum _{i=1}^{d}\Bigg (\sum _{a=1}^{d_{1}}\bigg (\log \mathcal {N}\left( \left( {\varvec{\theta }}_{b} ^{\left( i,1\right) }\right) _{a};0,\sigma _{p}^{2}\right) +\sum _{b=1}^{d}\left( A_{\textbf{G}} \right) _{i,b}^{\top }\log \mathcal {N}\left( {\varvec{\Theta }}_{a,b}^{\left( i,1\right) };0,\sigma _{p}^{2} \right) \bigg )\\&\quad +\sum _{l=2}^{L}\sum _{a=1}^{d_{l}}\bigg (\log \mathcal {N}\left( \left( {\varvec{\theta }}_{b} ^{(i,l)}\right) _{a};0,\sigma _{p}^{2}\right) +\sum _{b=1}^{d_{l-1}}\log \mathcal {N} \left( {\varvec{\Theta }}_{a,b}^{\left( i,l\right) };0,\sigma _{p}^{2}\right) \bigg )\Bigg ), \end{aligned}} \end{aligned}$$
(20)

where \(\sigma _{p}\) is the standard deviation of the Gaussian parameters. In the nonlinear Gaussian setting, the value of \(f_{i}\left( \textbf{X}_{pa\left( i\right) }\right) \) is assumed to be the mean of the distribution of each variable \(X_{i}\). As a result, the log likelihood of this model is as follows

$$\begin{aligned} {\log p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) =\sum _{i=1}^{d}\log \mathcal {N} \left( X_{i};\textrm{FFN}\left( \left( A_{\textbf{G}}\right) _{i}^{\top }\odot \textbf{X}; {\varvec{\Theta }}^{(i)}\right) ,\sigma _{obs}^{2}\right) ,} \end{aligned}$$
(21)

where “\(\odot \)” denotes the element-wise multiplication.
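As a concrete reference for the linear Gaussian case above, the following sketch (ours, with toy data; \(\mu _{e}\), \(\sigma _{e}\), and \(\sigma _{obs}\) are the hyperparameters named in Equations (17)-(18)) evaluates the two log-density terms for a given adjacency matrix and edge-weight matrix.

```python
import numpy as np
from scipy.stats import norm

def log_prob_theta_linear(A, Theta, mu_e=0.0, sigma_e=1.0):
    """Equation (17): only weights on existing edges of A contribute."""
    return float(np.sum(A * norm.logpdf(Theta, loc=mu_e, scale=sigma_e)))

def log_likelihood_linear(X, A, Theta, sigma_obs=0.1):
    """Equation (18): each node's mean is a linear combination of its parents."""
    means = X @ (A * Theta)              # column i aggregates the parents of X_i
    return float(np.sum(norm.logpdf(X, loc=means, scale=sigma_obs)))

rng = np.random.default_rng(0)
d, n = 3, 100
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])             # X1 -> X2, X1 -> X3, X2 -> X3
Theta = rng.normal(size=(d, d))          # edge weights (only masked entries matter)
X = rng.normal(size=(n, d))              # stand-in observations

print(log_prob_theta_linear(A, Theta), log_likelihood_linear(X, A, Theta))
```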

3.4 Particle variational inference of intractable posterior

The joint posterior distribution \(p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) is intractable. Stein variational gradient descent (SVGD) [26, 27] is a suitable method to approximate this joint posterior density due to its gradient-based approach. The SVGD algorithm iteratively transports a set of particles to match the target distribution, which is similar to the gradient descent algorithm in optimization.

Algorithm 1

SVGD algorithm for inference of \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \)

From the proposed generative model, we need to infer the log joint posterior density of \(\textbf{Z}\) and \({\varvec{\Theta }}\) using the corresponding gradient to variable \(\textbf{Z}\) given by

$$\begin{aligned} \nabla _{\textbf{Z}}\log p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) =\nabla _{\textbf{Z}}\log p\left( \textbf{Z}\right) +\frac{\nabla _{\textbf{Z}}\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }{\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }. \end{aligned}$$
(22)

The log latent prior distribution \(\log p\left( \textbf{Z}\right) \) is chosen as

$$\begin{aligned} \begin{aligned}\log p\left( \textbf{Z}\right)&:=\sum _{i,j}\log \mathcal {N}\left( U_{ij};0,\sigma _{z}^{2}\right) +\sum _{i,j}\log \mathcal {N}\left( V_{ij};0,\sigma _{z}^{2}\right) \\&\quad +\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) }\left[ \log p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) \right] -C, \end{aligned} \end{aligned}$$
(23)

where C is the log partitioning constant. Similarly, the gradient corresponding to variable \({\varvec{\Theta }}\) is given by

$$\begin{aligned} \nabla _{\varvec{\Theta }}\log p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) =\frac{\nabla _ {\varvec{\Theta }}\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) } \left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }{\mathbb {E}_{p\left( \textbf{G} \mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }}, \mathcal {D}\mid \textbf{G}\right) \right] }. \end{aligned}$$
(24)

The numerator of the second term in Equation (22) can be approximated using the Gumbel-softmax trick [20, 29] as follows

$$\begin{aligned} {\begin{aligned}&\nabla _{\textbf{Z}}\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) } \left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] \\&\quad \approx \mathbb {E}_{p\left( \textbf{L}\right) }\left[ \nabla _{\textbf{G}}p\left( {\varvec{\Theta }}, \mathcal {D}\mid \textbf{G}\right) \Big \vert _{\textbf{A}_{\textbf{G}}=\tilde{\textbf{A}}_{\tau } \left( \textbf{L},\textbf{Z},\varvec{\pi }\right) }\cdot \nabla _{\textbf{Z}}\tilde{\textbf{A}}_ {\tau }\left( \textbf{L},\textbf{Z},\varvec{\pi }\right) \right] , \end{aligned} } \end{aligned}$$
(25)

where \(\textbf{L}\sim \text {Logistic}\left( 0,1\right) ^{\left( d-1\right) \times \left( d-1\right) }\) and \(\tilde{\textbf{A}}_{\tau }\left( \textbf{L},\textbf{Z},\varvec{\pi }\right) :=\textbf{P}\left( \varvec{\pi }\right) \cdot \tilde{\textbf{S}}_{\tau }\left( \textbf{L},\textbf{Z}\right) \cdot \textbf{P}\left( \varvec{\pi }\right) ^{\top }\) in the permutation-based approach, and \(\textbf{L}\sim \text {Logistic}\left( 0,1\right) ^{d\times d}\) and \(\tilde{\textbf{A}}_{\tau }\left( \textbf{L},\textbf{Z},\varvec{\pi }\right) :=\textbf{M}\left( \varvec{\pi }\right) \odot \tilde{\textbf{E}}_{\tau }\left( \textbf{L},\textbf{Z}\right) \) in the mask-based approach. The element-wise definition of \(\tilde{\textbf{S}}_{\tau }\) is

$$\begin{aligned} \tilde{S}_{\tau }\left( \textbf{L},\textbf{Z}\right) _{i,j}:={\left\{ \begin{array}{ll} \sigma _{\tau }\left( L_{i,j-1}+\alpha \textbf{u}_{i}^{\top }\textbf{v}_{j-1}\right) &{} \text {if }j>i,\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(26)

Correspondingly, the elements of \(\tilde{\textbf{E}}_{\tau }\) are defined as

$$\begin{aligned} \tilde{E}_{\tau }\left( \textbf{L},\textbf{Z}\right) _{i,j}:=\sigma _{\tau }\left( L_{i,j}+ \alpha \textbf{u}_{i}^{\top }\textbf{v}_{j}\right) . \end{aligned}$$
(27)

In our experiments, we choose \(\tau =1\) in accordance with [27].
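The following sketch (ours, for the mask-based case with the same hypothetical ordering as before) shows how a differentiable soft graph sample is drawn with logistic noise in the spirit of Equations (25)-(27); gradients with respect to the latent embeddings can then flow through this sample.

```python
import numpy as np

def soft_graph_sample(U, V, pi, alpha, tau, rng):
    """One relaxed sample of the graph for the mask-based variant."""
    d = U.shape[0]
    L = rng.logistic(size=(d, d))                       # L ~ Logistic(0, 1)^{d x d}
    # Equation (27) with sigma_tau(x) = 1 / (1 + exp(-tau * x)).
    E_soft = 1.0 / (1.0 + np.exp(-tau * (L + alpha * U @ V.T)))
    pi = np.asarray(pi)
    M = (pi[None, :] - pi[:, None] > 0).astype(float)   # mask from Equation (10)
    return M * E_soft                                   # relaxed version of Equation (14)

rng = np.random.default_rng(0)
d, k = 4, 5
U, V = rng.normal(size=(d, k)), rng.normal(size=(d, k))
pi = np.array([3, 1, 4, 2])
print(soft_graph_sample(U, V, pi, alpha=1.0, tau=1.0, rng=rng))
```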

The kernel for SVGD proposed in [27] is

$$\begin{aligned} k\left( \left( \textbf{Z},{\varvec{\Theta }}\right) ,\left( \textbf{Z}^{\prime }, {\varvec{\Theta }}^{\prime }\right) \right) :=\exp \left( -\frac{1}{\gamma _{z}}\left\| \textbf{Z}-\textbf{Z}^{\prime }\right\| _{2}^{2}\right) +\exp \left( -\frac{1}{\gamma _{\theta }}\left\| {\varvec{\Theta }}-{\varvec{\Theta }}^{\prime }\right\| _{2}^{2}\right) . \end{aligned}$$
(28)

From this kernel, an incremental update for the mth particle of \(\textbf{Z}\) at the tth step is computed by

$$\begin{aligned} \begin{aligned}&\phi _{t}^{\textbf{Z}}\left( \textbf{Z}_{t}^{\left( m\right) },{\varvec{\Theta }}_{t}^{\left( m\right) }\right) \\&\quad =\frac{1}{M}\sum _{r=1}^{M}\Bigg [k\left( \left( \textbf{Z}_{t}^{\left( r\right) }, {\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t}^{\left( m\right) }, {\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \cdot \nabla _{\textbf{Z}_{t}^{\left( r\right) }}\log p\left( \textbf{Z}_{t}^{\left( r\right) },{\varvec{\Theta }}_{t}^{\left( r\right) } \mid \mathcal {D},\varvec{\pi }\right) \\&\qquad +\nabla _{\textbf{Z}_{t}^{\left( r\right) }}k\left( \left( \textbf{Z}_{t} ^{\left( r\right) },{\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t} ^{\left( m\right) },{\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \Bigg ]. \end{aligned} \end{aligned}$$
(29)

A similar update for \({\varvec{\Theta }}\) is obtained by replacing \(\nabla _{\textbf{Z}_{t}^{\left( r\right) }}\) with \(\nabla _{{\varvec{\Theta }}_{t}^{\left( r\right) }}\) as

$$\begin{aligned}&\phi _{t}^{{\varvec{\Theta }}}\left( \textbf{Z}_{t}^{\left( m\right) }, {\varvec{\Theta }}_{t}^{\left( m\right) }\right) \nonumber \\&\quad =\frac{1}{M}\sum _{r=1}^{M}\Bigg [k\left( \left( \textbf{Z}_{t}^{\left( r\right) }, {\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t}^{\left( m\right) }, {\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \cdot \nabla _{{\varvec{\Theta }}_{t} ^{\left( r\right) }}\log p\left( \textbf{Z}_{t}^{\left( r\right) },{\varvec{\Theta }}_{t} ^{\left( r\right) }\mid \mathcal {D},\varvec{\pi }\right) \nonumber \\&\qquad +\nabla _{{\varvec{\Theta }}_{t}^{\left( r\right) }}k\left( \left( \textbf{Z}_{t} ^{\left( r\right) },{\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t} ^{\left( m\right) },{\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \Bigg ]. \end{aligned}$$
(30)

The SVGD algorithm for inferring \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) is presented in Algorithm 1.
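To make the particle updates concrete, here is a minimal SVGD sketch (ours; it uses a toy two-dimensional standard normal target rather than the posterior in Equation (22)) implementing the kernel-weighted attraction and repulsion terms of Equation (29).

```python
import numpy as np

def svgd_step(Z, grad_log_p, gamma=1.0, step=0.5):
    """One SVGD update for particles Z (M x dim) with an RBF kernel."""
    M = Z.shape[0]
    phi = np.zeros_like(Z)
    for m in range(M):
        for r in range(M):
            k = np.exp(-np.sum((Z[r] - Z[m]) ** 2) / gamma)   # Equation (28)-style kernel
            grad_k = -2.0 / gamma * (Z[r] - Z[m]) * k         # gradient of k w.r.t. Z_r
            phi[m] += k * grad_log_p[r] + grad_k              # one summand of Equation (29)
    return Z + step * phi / M

# Toy target: standard normal, so grad log p(Z) = -Z. Particles start far away.
rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, size=(20, 2))
for _ in range(500):
    Z = svgd_step(Z, grad_log_p=-Z)
print(Z.mean(axis=0))                  # the particle mean drifts toward the target mean (0, 0)
```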

4 Experiments

4.1 Experimental settings

We compare our P-TOBAC and M-TOBAC approaches with related Bayesian score-based methods, including BCD Nets [9], DAG-GFlowNet (GFN) [10], and DiBS [27], on synthetic data and the flow cytometry data [38]. Unlike DiBS, which can learn the joint distribution \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D}\right) \) and infer nonlinear Gaussian networks, BCD Nets and DAG-GFlowNet are designed to work with linear Gaussian models. BCD Nets learns the parameters using the weighted adjacency matrix, and DAG-GFlowNet uses the BGe score [16].

Regarding the selection of the topological ordering for conditioning, we choose the topological ordering from EqVar [6] and the ground-truth (GT) ordering as the conditions for our approach. We also analyze the effect of the ordering on the performance by replacing it with the orderings from NPVar [15] and SCORE [37]. The DiBS+ and TOBAC+ notations in our experiments denote results with the weighted particle mixture of [27], which uses \(p\left( \textbf{G},{\varvec{\Theta }},\mathcal {D}\right) \) as the weight for each particle. This weight is employed as the unnormalized probability of each inferred particle when computing the expectation of the evaluation metrics.

Synthetic data and graph prior

We generate the data using the Erdős–Rényi (ER) structure [13] with degrees of 1 and 2. In all settings in Sect. 4.2, each inference is performed on \(n=100\) observations. For the ablation study in Sect. 4.4, synthetic data with \(d\in \left\{ 10,20,50\right\} \) variables and \(n\in \left\{ 100,500\right\} \) observations are utilized to analyze the effect of dimensionality and sample size on the performance. For the graph prior, BCD Nets uses the Horseshoe prior for its strictly lower triangular matrix \(\textbf{L}\), and GFN uses the prior from [12]. DiBS and our approach use the Erdős–Rényi graph prior, which takes the form \(p\left( \textbf{G}\right) \propto q^{\left\| \textbf{A}_{\textbf{G}}\right\| _{1}}\left( 1-q\right) ^{\left( {\begin{array}{c}d\\ 2\end{array}}\right) -\left\| \textbf{A}_{\textbf{G}}\right\| _{1}}\), where q is the probability of an independent edge being added to the DAG.
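For reference, this Erdős–Rényi log-prior reduces to a simple count of present and absent edges; a small sketch (ours, with an arbitrary edge probability q) is given below.

```python
import numpy as np
from math import comb, log

def er_log_prior(A, q):
    """log p(G) up to a constant: ||A||_1 log q + (C(d, 2) - ||A||_1) log(1 - q)."""
    n_edges = int(np.sum(A))
    return n_edges * log(q) + (comb(A.shape[0], 2) - n_edges) * log(1.0 - q)

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
print(er_log_prior(A, q=0.2))
```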

Fig. 2

Performance on synthetic data generated from linear Gaussian models with \(d=20\) variables, \(n=100\) observations, and 1000 sampling steps. Lower E-SHD and higher AUROC are preferable. Our TOBAC models with the orderings from EqVar [6] and the ground-truth orderings are compared with BCD Nets (denoted as BCD) [9], DAG-GFlowNet (denoted as GFN) [10], and DiBS [27]. The DiBS+, P-TOBAC+, and M-TOBAC+ models are the results with weighted particle mixture. All the methods except for DiBS are designed to ensure acyclicity, so the cyclicity score is not necessary. The different designs of BCD Nets and DAG-GFlowNet make a comparison by the negative log likelihood evaluation with the joint posterior distribution implausible. Our TOBAC models accomplish the lowest E-SHD scores and nearly the highest AUROC scores in both ER-1 (sparse graph) and ER-2 (denser graph) settings

Evaluation metrics

Following the evaluation metrics used by previous work [9, 10, 27], we evaluate the performance using the expected structural Hamming distance (E-SHD) and the area under the receiver operating characteristic curve (AUROC). We follow [27], where the E-SHD score is the expectation of the structural Hamming distance (SHD) between each \(\textbf{G}\) and the ground-truth \(\textbf{G}^{*}\) over the posterior distribution \(p\left( \textbf{G}\mid \mathcal {D}\right) \), which measures the expected number of edges that have been incorrectly predicted. This evaluation score is formulated as

$$\begin{aligned} {\text {E-SHD}\left( p,\textbf{G}^{*}\right) }&{:=\sum _{\textbf{G}}p\left( \textbf{G}\mid \mathcal {D}\right) \textrm{SHD}\left( \textbf{A}_{\textbf{G}}, \textbf{A}_{\textbf{G}^{*}}\right) }. \end{aligned}$$
(31)

The AUROC score is computed for each edge probability \(p\left( A_{i,j}=1\mid \mathcal {D}\right) \) of \(\textbf{A}_{\textbf{G}}\) in comparison with the corresponding ground-truth element in \(\textbf{A}_{\textbf{G}^{*}}\) [19]. In addition, we also evaluate the nonlinear Gaussian Bayesian networks from DiBS and ours by the cyclicity score and the average negative log likelihood [27]. The cyclicity score is proposed by [47] and is used in DiBS as the constraint for acyclicity. The score measuring the cyclicity or non-DAG-ness of a graph \(\textbf{G}\) is defined as

$$\begin{aligned} {h\left( \textbf{G}\right) :=\textrm{tr}\left[ \left( \textbf{I}+\frac{1}{d}\textbf{A}_ {\textbf{G}}\right) ^{d}\right] -d,} \end{aligned}$$
(32)

where \(h\left( \textbf{G}\right) =0\) if and only if \(\textbf{G}\) has no cycle, and the higher its value, the more cyclic \(\textbf{G}\) is. For the average negative log likelihood evaluation, a test dataset \(\mathcal {D}^{test}\) containing 100 held-out observations is also generated to compute the score as follows

$$\begin{aligned} \text {Neg.LL}\left( p,\mathcal {D}^{test}\right) :=-\sum _{\textbf{G},{\varvec{\Theta }}}p \left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D}\right) \log p\left( \mathcal {D}^{test}\mid \textbf{G},{\varvec{\Theta }}\right) . \end{aligned}$$
(33)

This score is designed to evaluate the model’s ability to predict future observations. Note that BCD Nets formulates the parameters with the weighted adjacency matrix, and DAG-GFlowNet only estimates the marginal posterior distribution \(p\left( \textbf{G}\mid \mathcal {D}\right) \) and uses BGe score [16] for the parameters. Hence, we cannot compute the joint posterior distribution \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D}\right) \) for the negative log likelihood evaluation. The reported results in the following sections are obtained from ten randomly generated datasets for each method and configuration.
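As a rough reference for how these metrics are computed from posterior samples, here is a small sketch (ours; the sample graphs, particle weights, and ground truth are toy stand-ins, and scikit-learn provides the AUROC).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shd(A, A_true):
    """Edge insertions, deletions, and reversals; a reversal counts once."""
    diff = np.abs(A - A_true)
    upper = np.triu(np.ones_like(diff), k=1).astype(bool)
    reversals = int(np.sum(((diff + diff.T) == 2) & upper))
    return int(np.sum(diff)) - reversals

def expected_shd(graphs, weights, A_true):
    """Equation (31) with the posterior approximated by weighted particles."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(sum(wi * shd(G, A_true) for wi, G in zip(w, graphs)))

rng = np.random.default_rng(0)
d = 5
A_true = np.zeros((d, d), dtype=int)
A_true[0, 1] = A_true[1, 2] = A_true[0, 3] = 1           # toy ground-truth DAG

graphs = [np.triu(rng.integers(0, 2, size=(d, d)), k=1) for _ in range(8)]
weights = np.ones(len(graphs))                           # uniform particle mixture

edge_probs = np.mean(graphs, axis=0)                     # marginal p(A_ij = 1 | D)
off_diag = ~np.eye(d, dtype=bool)
print(expected_shd(graphs, weights, A_true),
      roc_auc_score(A_true[off_diag].ravel(), edge_probs[off_diag].ravel()))
```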

Fig. 3

Expected structural Hamming distance on synthetic data with different sample sizes and dimensionalities. Lower E-SHD is preferable. Our TOBAC models with the orderings from EqVar [6] and the ground-truth orderings are compared with BCD Nets (denoted as BCD) [9], DAG-GFlowNet (denoted as GFN) [10], and DiBS [27]. The DiBS+, P-TOBAC+, and M-TOBAC+ models are the results with weighted particle mixture. Our models can learn more accurate graphs with more stable results when the data's dimension increases. Most approaches are not affected by the number of observations, except for DAG-GFlowNet (GFN) [10], which cannot learn as effectively due to its inability to handle the higher peakedness of the posterior distribution

Fig. 4

Performance on synthetic data generated from nonlinear Gaussian models with \(d=20\) variables, \(n=100\) observations, and 1000 sampling steps. Lower Cyclicity, E-SHD, and Neg.LL, and higher AUROC are preferable. Our TOBAC models with the orderings from EqVar [6] and the ground-truth orderings are compared with DiBS [27]. The DiBS+, P-TOBAC+, and M-TOBAC+ models are the results with weighted particle mixture. Note that BCD Nets [9] and DAG-GFlowNet [10] are only designed for linear Gaussian models, so these methods are not included in this experiment. In comparison with DiBS, our TOBAC models can guarantee the acyclicity of the learned graphs. Our models also perform better on most evaluation metrics and show more stable results with smaller variances, especially the ones with weighted particle mixture

4.2 Performance on synthetic data

Linear Gaussian models

Figure 2 illustrates the performance of the methods with linear Gaussian models. We find that our P-TOBAC and M-TOBAC models outperform the other approaches. TOBAC models accomplish the lowest E-SHD scores and nearly the highest AUROC scores in both ER-1 (sparse graph) and ER-2 (denser graph) settings. The more efficient variant M-TOBAC achieves performance similar to P-TOBAC in this setting. In comparison with DiBS, the results demonstrate that introducing prior knowledge, namely the EqVar and ground-truth orderings, into the inference process can increase the performance significantly. As the graph becomes denser, as in the ER-2 settings, DiBS models have high variances in their results, whereas our models are as stable as the other approaches. Compared with DAG-GFlowNet, which uses GFlowNets [2] to infer the posterior distribution, our models achieve a lower E-SHD score and a substantially higher AUROC score in the denser graph setting.

Effect of sample sizes and dimensionalities of the data

We study the effect of different sample sizes and data dimensionalities on TOBAC models and related approaches utilizing linear Gaussian models. As we can see in Fig. 3, the number of observations does not have much effect on the performance in most cases. In the case of GFN, as also described by Deleu et al. [10], when the number of observations increases, this model cannot perform as well as with a smaller sample size due to the higher peakedness of the posterior distribution. These results show that Bayesian approaches can sufficiently handle the uncertainty caused by a smaller amount of data. In contrast, as the number of dimensions in the data rises, the effect on the inference performance diverges among the approaches. Graphs with 50 nodes inferred by TOBAC are more accurate in comparison with other approaches.

Nonlinear Gaussian models

The performance results for nonlinear Gaussian models are depicted in Fig. 4. The cyclicity scores clearly show that the acyclicity constraint of DiBS is not as effective as our approach. In addition to the certainty of acyclicity, from the E-SHD, AUROC, and negative log likelihood scores, we can see that our approaches infer better graphs than DiBS. As EqVar is not formulated for nonlinear Gaussian models, the results with EqVar orderings are weaker than the ones with the ground-truth orderings. However, the performance of these models is still better than that of DiBS. This improvement in performance emphasizes the benefit of the orderings while guaranteeing the acyclicity at every number of sampling steps. With the orderings given, optimizing the graph parameters becomes significantly easier. As a result, the log likelihood of the observational data is improved. Furthermore, the plots in the ER-2 setting indicate that TOBAC variants with weighted particle mixture are significantly more stable than DiBS when inferring denser graphs.

4.3 Performance on real data

Flow cytometry dataset

Table 1 Performance on the flow cytometry dataset [38]

The flow cytometry dataset includes \(n=853\) observational continuous data points of \(d=11\) phosphoproteins from the first perturbation condition in [38]. The graphs are inferred with linear Gaussian models and Erdős–Rényi (ER) graph priors at 1000 sampling steps. As we can observe in Table 1, our P-TOBAC and P-TOBAC+ approaches with the ground-truth ordering achieve the lowest E-SHD values at \(14.7\pm 0.35\) and \(12.7\pm 0.82\), respectively. Additionally, the AUROC scores of M-TOBAC and P-TOBAC with the ground-truth ordering are the highest at \(0.710\pm 0.033\) and \(0.697\pm 0.029\), respectively, both higher than DiBS's AUROC of \(0.630\pm 0.0360\). As the relationships in this data are complex, the TOBAC models with the ground-truth ordering accomplish better performance than the ones with the ordering from EqVar. However, all models with the EqVar ordering still surpass the other related approaches. This demonstrates the advantage of integrating information from the topological orderings in our framework.

SynTReN pseudo-real dataset

This pseudo-real dataset was generated in [23] using the software designed to generate Synthetic Transcriptional Regulatory Networks (SynTReN) [5]. The dataset comprises 10 subsets of \(n=500\) observations, each generated from a subgraph of SynTReN with \(d=20\) nodes. Each ground-truth subgraph has from 19 to 34 edges, with an average of 23.5 edges. Similar to the experiment on the flow cytometry dataset, we also infer these graphs using linear Gaussian models and ER graph priors. The overall results are summarized in Table 2. The detailed results are available in . All of our P-TOBAC and M-TOBAC models with ground-truth orderings achieve the highest AUROC score at \(0.905 \pm 0.0316\). This score significantly exceeds the results of all related baseline methods. In addition, our M-TOBAC+ model has the second lowest E-SHD result of \(33.73 \pm 5.581\), which is better than most baseline methods, except BCD Nets.

Table 2 Overall performance on the SynTReN dataset [5, 23]
Fig. 5

Performance with different numbers of sampling steps in the linear setting with \(d=20\) variables and \(n=100\) observations. Lower Cyclicity and E-SHD, and higher AUROC are preferable. Our TOBAC models are compared with DiBS [27], which uses post-hoc penalization to constrain the acyclicity. The post-hoc constraint is not effective even after 2000 sampling steps. Introducing the prior knowledge of topological ordering into the inference process of TOBAC can both assure the acyclicity and enhance the performance

Fig. 6

Performance with topological orderings from different methods in the linear setting with \(d=20\) variables, \(n=100\) observations, and 1000 sampling steps. Lower E-SHD and Neg.LL, and higher AUROC are preferable. The TOBAC models with the orderings from EqVar [6], NPVar [15], and SCORE [37] are compared with the models with ground-truth orderings. The P-TOBAC+ and M-TOBAC+ models are the results with weighted particle mixture. The experiments are configured with equal variance, so our models with orderings from EqVar [6] can accomplish similar results to the ones with the ground-truth orderings. The assumption of nonlinearity in the setting of SCORE makes the models with its orderings less stable

4.4 Ablation study

Effectiveness of the assured acyclicity constraint

Fig. 5 exhibits the effectiveness of our proposed acyclicity constraint compared to the post-hoc constraint in DiBS [27]. Although the number of cycles in the inferred graphs does decrease as the number of sampling steps increases, after 2000 iterations the number of remaining cycles is still considerable. By introducing the prior knowledge from the topological ordering as the acyclicity constraint, our TOBAC models can assure the acyclicity of the generated graphs at any number of sampling steps. Furthermore, incorporating the ordering also improves the inference performance and allows the TOBAC models to obtain better results with fewer sampling steps. This enhancement is clearly demonstrated in the figure: at only 500 sampling steps, all results from TOBAC surpass those from DiBS at 2000 sampling steps. Moreover, because the ordering limits the search space, the results acquired from TOBAC are generally more stable, with lower variances.

Topological orderings from other approaches

We summarize the results of the graphs inferred by TOBAC models with the ground-truth ordering and with the orderings learned by several approaches, namely EqVar [6], NPVar [15], and SCORE [37], in Fig. 6. In this setting, the variances of the variables are equal, which matches the EqVar assumption. As a consequence, the results suggest that TOBAC with EqVar can achieve performance scores close to those of TOBAC with the ground-truth ordering on synthetic data. Although SCORE is a state-of-the-art approach to learning orderings, the results obtained with its orderings are more unstable than those of the other, simpler methods. This instability may be due to the nonlinear assumption of the structural equation model in SCORE, which does not fit this linear Gaussian setting well.

Uniform and weighted particle mixture

From the observed results, especially when the data are generated from nonlinear Gaussian models, the weighted particle mixture can reduce the E-SHD and Neg.LL scores. However, it also causes a reduction in the AUROC scores. The lower performance of the uniform mixture may be due to its crude approximation of the posterior probability mass function of the particles, which is replaced by a more meaningful weighting in the weighted mixture [27].

Permutation-based and mask-based constraints

In most of the acquired results, there is no significant discrepancy between the permutation-based and mask-based constraints. Both approaches can assure the acyclicity of the generated graphs and improve the inference performance. Having more parameters to infer in the mask-based variant does not affect its performance. Combined with its computational efficiency, this effectiveness makes the mask-based implementation the ideal choice of constraint for assuring the acyclicity.

5 Conclusion

In this work, we have presented TOBAC—a framework for strictly constraining the acyclicity of the inferred graphs and integrating the knowledge from the topological orderings into the inference process. We propose two possible permutation-based and mask-based implementations of this framework from the decomposition of the adjacency matrix and the property of the gradient flow on the topological ordering. Our work uses continuous representations of the edge matrices and approximates the posterior using particle variational inference. Our proposed framework makes the inference process less complicated and enhances the accuracy of inferred graphs and parameters. Accordingly, our work can outperform most related Bayesian score-based approaches on both synthetic and real observational data. In future work, we will explore methods to learn the posterior distribution of the topological orderings with observational data, which will allow our approach to infer more diverse posterior distributions while still ensuring the acyclicity of generated graphs.