1 Introduction

Structure learning aims to uncover, from observational data, the underlying directed acyclic graphs (DAGs) that represent statistical or causal relationships between variables. The structure learning task has many applications in biology [38], economics [22], and interpretable machine learning [32]. Correspondingly, it is gaining scientific interest in various domains such as computer science, statistics, and bioinformatics [43]. One challenge of traditional structure learning methods such as GES [7] is the combinatorial search space of possible DAGs [8]. NO-TEARS [50] addresses this challenge by relaxing the formulation of the learning task into a continuous space and employing continuous optimization techniques. However, the continuous representation raises another challenge: enforcing the acyclicity constraint on the graphs.

In most continuous score-based methods [27, 28, 50, 51], graph acyclicity is enforced through a penalizing score, so that minimizing the score also reduces the cyclicity of the graphs. This type of approach requires a large number of running steps with complex penalization weight scheduling to ensure that the constraint holds, and the scheduling varies greatly depending on the setting. This lack of certainty affects the quality and restricts the applicability of the learned structures. Another approach is to embed the acyclicity constraint in the generative model of the graphs, as in [9], by utilizing weighted adjacency matrices that can be decomposed into combinations of a permutation matrix and a strictly lower triangular matrix. Our study is inspired by this approach and uses a direct constraint in the generation process instead of a post-hoc penalizing score.

There is a parallel branch of permutation-based causal discovery methods that allow the topological ordering to be found in polynomial time [6, 15, 37, 39], which can provide beneficial prior information. Inspired by these approaches, we propose a framework, Topological Ordering in Differentiable Bayesian Structure Learning with ACyclicity Assurance (TOBAC), to greatly reduce the difficulty of the acyclicity-constraining task. The framework performs conditional inference, with the prior knowledge provided by the topological orderings acting as the condition. In this study, we consider two possible approaches for strictly constraining the acyclicity of generated graphs.

The first approach is based on the independent factorization property of the adjacency matrix \(\textbf{A}_{\textbf{G}}\) of a DAG \(\textbf{G}\) into a permutation matrix \(\textbf{P}\), which can be obtained for each topological ordering \(\varvec{\pi }\), and a strictly upper triangular matrix \(\textbf{S}\), which represents the adjacency matrix when the ordering is correct and can be represented by a latent variable \(\textbf{Z}\). The factorization \(p\left( \textbf{G},\textbf{S},\textbf{P}\right) =p\left( \textbf{S}\right) p\left( \textbf{P}\right) p\left( \textbf{G}\mid \textbf{S},\textbf{P}\right) =p\left( \textbf{Z}\right) p\left( \varvec{\pi }\right) p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) =p\left( \textbf{G},\textbf{Z},\varvec{\pi }\right) \) allows us to infer \(\textbf{P}\) and \(\textbf{S}\) independently from \(\varvec{\pi }\) and \(\textbf{Z}\), respectively. In particular, decoupling these variables enables us to apply recent advances in learning topological orderings and in probabilistic model inference. For each permutation matrix \(\textbf{P}\), we can infer the DAG's strictly upper triangular matrix \(\textbf{S}\) and compute the adjacency matrix of an isomorphic DAG with \(\textbf{A}_{\textbf{G}}=\textbf{P}\textbf{S}\textbf{P}^{\top }\). To infer this DAG \(\textbf{G}\), we choose a recent graph inference approach in this field, DiBS [27], as our inference engine for \(\textbf{S}\). This variant is called permutation-based TOBAC or P-TOBAC.

The other approach that we propose is based on the property of the gradient flow [25] on the topological ordering \(\varvec{\pi }\), which allows for the construction of a corresponding mask \(\textbf{M}\). This mask can be applied directly to an adjacency matrix \(\textbf{E}\) to eliminate conflicting edges and generate the final adjacency matrix \(\textbf{A}_{\textbf{G}}=\textbf{M}\odot \textbf{E}\) with acyclicity assured. Similar to the previous approach, the adjacency matrix \(\textbf{E}\) can also be represented by a latent variable \(\textbf{Z}\). Additionally, this formulation enables the independent inference of \(\textbf{M}\) from \(\varvec{\pi }\) and \(\textbf{E}\) from \(\textbf{Z}\), which results in the identical factorization \(p\left( \textbf{G},\textbf{M},\textbf{E}\right) =p\left( \textbf{E}\right) p\left( \textbf{M}\right) p\left( \textbf{G}\mid \textbf{E},\textbf{M}\right) =p\left( \textbf{Z}\right) p\left( \varvec{\pi }\right) p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) =p\left( \textbf{G},\textbf{Z},\varvec{\pi }\right) \). This variant is called mask-based TOBAC or M-TOBAC.

Our approaches are evaluated through experiments on synthetic data and a real flow cytometry dataset [38] in linear and nonlinear Gaussian settings. The proposed structural-constraint framework yields better DAG predictions and achieves better performance than other approaches.

Fig. 1

The proposed generative model of the Bayesian networks in Topological Ordering in Differentiable Bayesian Structure Learning with Acyclicity Assurance (TOBAC). The graph \(\textbf{G}\) is generated from a latent variable \(\textbf{Z}\) and a topological ordering \(\varvec{\pi }\). The latent variable \(\textbf{Z}\) is utilized to represent the edges of \(\textbf{G}\), whereas the information from the topological ordering \(\varvec{\pi }\) is employed for strictly constraining the acyclicity. The variable \({\varvec{\Theta }}\) defines the parameters of the local conditional distributions of the nodes given their parents in \(\textbf{G}\). The observational data \(\mathcal {D}\) consisting of n observations are assumed to be generated from this generative model

Contributions The main contributions of this study are summarized as follows

  1.

    We address the limitations of post-hoc acyclicity constraint scores by strictly constraining the generative structure of the graphs. By utilizing the permutation-based decomposition of the adjacency matrix and the property of the gradient flow on the topological ordering, we can strictly guarantee the acyclicity constraint in Bayesian networks.

  2.

    We introduce TOBAC, a framework with two variants corresponding to two possible permutation-based and mask-based approaches for independently inferring and conditioning on the topological ordering (illustrated in Fig. 1). Our inference process guarantees the acyclicity of inferred graphs as well as reduces the inference complexity of the adjacency matrices.

  3.

    We demonstrate the effectiveness of TOBAC in comparison with related state-of-the-art Bayesian score-based methods on both synthetic and real-world data. Our approach obtains better performance on synthetic linear and nonlinear Gaussian data and on the real flow cytometry dataset.

This work builds upon and extends the work appearing in ICDM 2023 [45], which introduced the framework of permutation-based TOBAC (P-TOBAC). In this work, we propose an alternative mask-based approach for TOBAC (M-TOBAC) for constraining the acyclicity. This variant is more efficient when implemented because it avoids matrix multiplication operations. We also adapt the TOBAC framework to integrate the new variant and make the formulations less complicated and more general. Last, we evaluate our framework with its two implementation variants on both synthetic data, including data generated from linear and nonlinear Gaussian models, and real data, including the flow cytometry dataset [38] and the SynTReN dataset [5, 23], to demonstrate the effectiveness of our previous variant as well as the new one. An additional ablation study has also been performed to clearly demonstrate the limitations of the post-hoc acyclicity constraint and the advantages of our approach.

2 Background

Bayesian structure learning methods

Most structure learning approaches can be categorized by their learning objectives into two main categories. The majority of methods [6, 7, 23, 33, 37, 39, 50, 51] fall into the first category, where each method aims to recover a point estimate or its Markov equivalence class. In the other category, called Bayesian structure learning, the methods [9, 10, 14, 21, 27, 28, 30] aim to learn the posterior distribution over DAGs given the observational data, i.e., \(p\left( \textbf{G}\mid \mathcal {D}\right) \). These methods can quantify the epistemic uncertainty in cases such as limited sample sizes or non-identifiable models in causal discovery. Previous approaches to learning the posterior distribution are diverse, for example, using Markov chain Monte Carlo (MCMC) [30], bootstrapping [14] with PC [42] and GES [7], and exact methods with dynamic programming [21]. Recent Bayesian structure learning approaches use more advanced methods such as variational inference [9, 27, 28] or generative flow networks (GFlowNets) [2, 10, 11, 34].

Permutation-based methods

Searching over the space of the permutations of the variables is significantly faster than searching over the space of possible DAGs. Once the correct topological ordering of a DAG is found, a skeleton containing the possible relations can be constructed, and the DAG can be easily retrieved from this skeleton using the available conditional independence tests [4, 36, 41, 44, 46]. There have been many approaches for both linear [6, 49] and nonlinear additive noise models [15, 33, 37, 39]. EqVar [6] learns the topological orderings under the assumption of equal variances. NPVar [15] extends EqVar by replacing the error variances with the corresponding residual variances and modeling them in a nonparametric setting. Both of these methods iteratively identify nodes from the roots to the leaves of the causal DAG. Alternatively, SCORE [37] finds the leaves from a score estimated to match the gradient of the log probability distribution of the variables.

Discrete score-based methods

Score-based approaches in structure learning define scores to evaluate the generated graphs. The task in this setting translates to searching for graphs that maximize or minimize that score, depending on its configuration. For fully observational data, early methods perform the search in the discrete space of graphs, and the general aim is to find algorithms that can perform this search optimally. For example, GES [7] uses greedy search over equivalence classes of DAGs to maximize a score function. Recent developments efficiently search for the sparsest permutation, as in the Sparsest Permutation algorithm [36] and the Greedy Sparsest Permutation algorithm [41], over the vertices of a permutohedron representing the space of permutations of DAGs, and reduce the search space by contracting the vertices corresponding to the same DAG [41]. These methods are also adapted for interventional data in [44, 46]. Another recent approach in this category is the group of DAG-GFlowNet methods [10, 11, 34], which utilize GFlowNets to search over the states of DAGs and can approximate the posterior distribution using both observational and interventional data. Constraining the acyclicity in this setting is relatively simple, as graphs containing cycles can be flagged as invalid states and ignored whenever the search algorithms encounter them.

Continuous score-based methods

Structure learning approaches with continuous relaxation [9, 23, 27, 28, 47, 51] have been developing rapidly since NO-TEARS [50] was introduced. A continuous search space allows us to optimize or infer with gradient-based approaches and avoid searching over the large space of discrete DAGs. Bayesian inference with variational inference is one category of gradient-based approaches used in the latest frameworks [9, 27, 28]. BCD Nets [9] decomposes the weighted adjacency matrix into a permutation matrix and a strictly lower triangular matrix, and infers the probabilities of these matrices using the evidence lower bound (ELBO) of the variational inference problem. DiBS [27] models the probabilities of the edges using a bilinear generative model from the latent space and infers the posterior using Stein variational gradient descent.

Acyclicity constraint in continuous score-based methods

Despite the computational benefits compared to discrete methods, relaxing the search into a continuous space raises another problem: how the acyclicity can be constrained. The more popular approach is to represent the cyclicity of generated graphs as a regularization score. By penalizing this score until it reaches zero, the acyclicity of the graphs can be constrained. NO-TEARS [50] employs the property that if a graph is acyclic, the trace of the matrix exponential of its adjacency matrix attains its minimum value, which equals the number of nodes in the graph. The drawback of this constraint is the high computational complexity of the matrix exponential, which requires \(\mathcal {O}\left( d^{3}\right) \) numerical operations to evaluate. DAG-GNN [47] proposes an alternative constraint that is easier to implement and has the same computational complexity. NO-BEARS [24] presents another constraint based on the spectral radius of the matrix, which can be approximated with a computational complexity of \(\mathcal {O}\left( d^{2}\right) \). Despite being easy to implement, these constraints provide no acyclicity assurance and usually require a larger number of running steps when the data structure is more complicated.
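To make the contrast concrete, the following is a minimal NumPy/SciPy sketch (ours, not taken from any of the cited implementations) of two such penalties: the trace-of-matrix-exponential score described above and the polynomial score later given in Equation (32). Both return zero only when the adjacency matrix encodes a DAG.

```python
import numpy as np
from scipy.linalg import expm

def h_exponential(A):
    """Trace-of-matrix-exponential penalty: tr(exp(A)) - d is zero iff A is a DAG.
    (NO-TEARS applies this to the element-wise squared weighted adjacency matrix.)"""
    return np.trace(expm(A)) - A.shape[0]

def h_polynomial(A):
    """Polynomial alternative in the style of Equation (32): tr((I + A / d)^d) - d."""
    d = A.shape[0]
    return np.trace(np.linalg.matrix_power(np.eye(d) + A / d, d)) - d

dag = np.array([[0., 1., 1.],
                [0., 0., 1.],
                [0., 0., 0.]])
cyclic = dag.copy()
cyclic[2, 0] = 1.0          # adding the edge X3 -> X1 closes a cycle

for name, A in [("dag", dag), ("cyclic", cyclic)]:
    print(name, h_exponential(A), h_polynomial(A))   # ~0 for the DAG, > 0 otherwise
```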

Another approach is to structurally constrain the generated graphs to be acyclic. The topological ordering naturally represents the acyclicity of a graph, as edges can only originate from predecessor nodes in the ordering to successor nodes. As a consequence, a directed graph is acyclic if and only if there exists a corresponding topological ordering of its nodes. There are different approaches to representing the relation between a DAG and a topological ordering. BCD Nets [9] decomposes the adjacency matrix of a DAG into a permutation matrix, which is one representation of the topological ordering, and a strictly lower triangular matrix, which embeds the relationships among the nodes. DAG-NoCurl [48] uses the gradient flow [25] to represent the edge directions from predecessors to successors in the form of a mask. In both methods, the topological ordering and the edges are learned simultaneously, which can render the learning process more complex because a slight change in the topological ordering can transform the generated DAG drastically.

Our framework belongs to the latter category, but we avoid the complexity of jointly inferring the topological ordering and the graph. In this study, we design an acyclicity-ensuring conditional inference process with the topological ordering as the condition. With this setting, we can effectively integrate the knowledge from the topological ordering into the inference to guarantee that generated graphs are acyclic. Additionally, with the provided prior knowledge, the number of possible graphs is reduced from \(2^{d^{2}}\) to \(2^{{d(d-1)}/{2}}\), which reduces the difficulty of the inference process and allows our framework to achieve higher inference efficiency.

3 TOBAC: topological ordering in differentiable Bayesian structure learning with acyclicity assurance

3.1 Permutation-based acyclicity constraint

3.1.1 Acyclicity assurance via decomposition of adjacency matrix

To analyze the decomposition of the adjacency matrix of a directed acyclic graph (DAG), we need to start from the topological orderings. A topological ordering (or, in short, an ordering) is a topological sort of the variables in a DAG. Given a DAG \(\textbf{G}\) with d nodes \(\textbf{X}=\left\{ X_{1},X_{2},\dots ,X_{d}\right\} \), let \(\varvec{\pi }=\left\{ \pi _{1},\pi _{2},\dots ,\pi _{d}\right\} ,\pi _{i}\ne \pi _{j},\forall i\ne j\) contain the corresponding ordering \(\pi _{i}\) of each node \(X_{i}\). \(\pi _{i}<\pi _{j}\) means that \(X_{i}\in nd\left( X_{j}\right) \) where \(nd\left( X_{j}\right) \) is the set containing the non-descendants of \(X_{j}\). We consider the canonical case where the ordering of the nodes is already correct, i.e., \(\varvec{{\pi }}^{*}=\left\{ \pi _{i}^{*}=i,\forall i\right\} \). In this case, because every node \(X_{i}\) is a non-descendant of \(X_{j}\) if \(i<j\), the adjacency matrix \(\textbf{A}_{\textbf{G}}\) of the DAG \(\textbf{G}\) will become a strictly upper triangular matrix \(\textbf{S}\in \left\{ 0,1\right\} ^{d\times d}\).

In order to generalize to any ordering, we use a permutation matrix \(\textbf{P}\left( \varvec{\pi }\right) \) that transforms the ordering \(\varvec{\pi }\) to \(\varvec{\pi }^{*}\) as

$$\begin{aligned} P\left( \varvec{\pi }\right) _{j,i}={\left\{ \begin{array}{ll} 1 &{} \text {if }i=\pi _{j},\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(1)

With this permutation matrix, we can always generate the adjacency matrix of an isomorphic DAG by

$$\begin{aligned} {\textbf{A}_{\textbf{G}}=\textbf{P}\left( \varvec{\pi }\right) \cdot \textbf{S}\cdot \textbf{P}\left( \varvec{\pi }\right) ^{\top }.} \end{aligned}$$
(2)

This formulation obtains the adjacency matrix \(\textbf{A}_\textbf{G}\) by shifting the corresponding rows and columns of \(\textbf{S}\) from the canonical ordering \(\varvec{{\pi }}^{*}\) to the ordering \(\varvec{{\pi }}\). As the canonical adjacency matrix \(\textbf{S}\) is acyclic, the derived adjacency matrix \(\textbf{A}_\textbf{G}\) will always satisfy acyclicity. By employing this decomposition in the generative process, the acyclicity of the inferred graphs is always satisfied.
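As a small illustration, the following NumPy sketch (ours; the ordering values are made up) builds the permutation matrix of Equation (1) from an ordering, applies Equation (2) to a random strictly upper triangular matrix, and checks that the resulting graph has zero cyclicity in the sense of Equation (32).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
pi = np.array([3, 1, 4, 2])            # hypothetical ordering values pi_i (1-indexed)

# Permutation matrix with P[j, i] = 1 iff i = pi_j (Equation (1)).
P = np.zeros((d, d))
P[np.arange(d), pi - 1] = 1.0

# Random strictly upper triangular adjacency matrix S in the canonical ordering.
S = np.triu(rng.integers(0, 2, size=(d, d)), k=1).astype(float)

A = P @ S @ P.T                        # Equation (2): adjacency of an isomorphic DAG

# Sanity check with the cyclicity score of Equation (32); it is exactly zero for a DAG.
h = np.trace(np.linalg.matrix_power(np.eye(d) + A / d, d)) - d
print(A)
print("cyclicity:", h)
```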

3.1.2 Representing the canonical adjacency matrix in a latent space

With every permutation matrix \(\textbf{P}\left( \varvec{\pi }\right) \), we only need to find the equivalent canonical adjacency matrix \(\textbf{S}\). Following the DiBS approach in [27], this matrix is sampled from a latent variable \(\textbf{Z}\) consisting of two embedding matrices \(\textbf{U}=\left[ \textbf{u}_{1},\textbf{u}_{2},\ldots ,\textbf{u}_{d-1}\right] , \textbf{u}_{i}\in \mathbb {R}^{k}\) and \(\textbf{V}=\left[ \textbf{v}_{1},\textbf{v}_{2},\ldots ,\textbf{v}_{d-1}\right] , \textbf{v}_{j}\in \mathbb {R}^{k}\). Due to the nature of strictly upper triangular matrices, only \(d-1\) vectors in each embedding are needed to construct \(\textbf{S}\) instead of d vectors as in DiBS. Following this configuration, the dimension k of the latent vectors is chosen to be greater than or equal to \(d-1\) to ensure that the generated graphs are not constrained in rank. We represent the probabilities of the values in \(\textbf{S}\) as follows

$$\begin{aligned} S_{\alpha }\left( \textbf{Z}\right) _{i,j}&:=p_{\alpha }\left( S_{i,j}=1\mid \textbf{u}_{i},\textbf{v}_{j-1}\right) \end{aligned}$$
(3)
$$\begin{aligned}&:={\left\{ \begin{array}{ll} \sigma _{\alpha }\left( \textbf{u}_{i}^{\top }\textbf{v}_{j-1}\right) &{} \text {if }j>i,\\ 0 &{} \text {otherwise;} \end{array}\right. } \end{aligned}$$
(4)

where \(\sigma _{\alpha }\left( x\right) =1/\left( 1+\exp \left( -\alpha x\right) \right) \) and the term \(\alpha \) will be increased each step to make the sigmoid function \(\sigma _{\alpha }\left( x\right) \) converge to the Heaviside step function \(\mathbbm {1}\left[ x>0\right] \). As \(\alpha \rightarrow \infty \), the converged generated \(\textbf{S}_{\infty }\) will become

$$\begin{aligned} S_{\infty }\left( \textbf{Z}\right) _{i,j}:={\left\{ \begin{array}{ll} \mathbbm {1}\left[ \textbf{u}_{i}^{\top }\textbf{v}_{j-1}>0\right] &{} \text {if }j>i,\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(5)

The probability of the elements in the adjacency matrix \(\textbf{A}_{\textbf{G}}\) given the latent variable \(\textbf{Z}\) and the permutation matrix \(\textbf{P}\) corresponding to the topological ordering \(\varvec{\pi }\) can be computed by

$$\begin{aligned} {A_{\alpha }\left( \textbf{Z},\varvec{\pi }\right) _{i,j}}&{:=p_{\alpha }\left( A_{i,j}=1\mid \textbf{Z},\varvec{\pi }\right) } \end{aligned}$$
(6)
$$\begin{aligned}&{:=\sum _{a=1}^{d}\sum _{b=1}^{d}P\left( \varvec{\pi }\right) _{i,b}\cdot S_{\alpha }\left( \textbf{Z}\right) _{b,a}\cdot P\left( \varvec{\pi }\right) _{a,j}^{\top }.} \end{aligned}$$
(7)
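The sketch below (our own, with hypothetical embeddings) mirrors Equations (3)-(5): it builds the soft strictly upper triangular matrix from the latent embeddings U and V and shows that it hardens into a binary matrix as \(\alpha \) grows.

```python
import numpy as np

def soft_canonical_adjacency(U, V, alpha):
    """S_alpha[i, j] = sigma_alpha(u_i^T v_{j-1}) for j > i, and 0 otherwise."""
    d = U.shape[0] + 1                       # U and V each hold d - 1 latent vectors
    S = np.zeros((d, d))
    for i in range(d - 1):                   # 0-indexed rows that can carry edges
        for j in range(i + 1, d):
            S[i, j] = 1.0 / (1.0 + np.exp(-alpha * U[i] @ V[j - 1]))
    return S

rng = np.random.default_rng(0)
d, k = 5, 6                                  # k >= d - 1 avoids a rank constraint
U = rng.normal(size=(d - 1, k))
V = rng.normal(size=(d - 1, k))

print(soft_canonical_adjacency(U, V, alpha=1.0))    # soft edge probabilities
print(soft_canonical_adjacency(U, V, alpha=1e3))    # ~Equation (5) as alpha grows
```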

3.2 Mask-based acyclicity constraint

The formulation in Equation (2) can not only guarantee the acyclicity of generated graphs but also reduce the dimensions of latent representations required for the edge matrix \(\textbf{S}\) as demonstrated in Sect. 3.1.2. However, the use of matrix multiplications in this formulation can affect the computational efficiency of the method. In this section, we propose an alternative implementation of the acyclicity constraint which allows the computation to be executed more efficiently.

We adapt the formulation from DAG-NoCurl [48] into our framework to avoid matrix multiplication operations. Instead of using the permutation approach, DAG-NoCurl uses a mask to eliminate edges that conflict with a topological ordering \(\varvec{\pi }\). Let \({\varvec{\pi }}=\left\{ \pi _{1},\pi _{2},\dots ,\pi _{d}\right\} ,\pi _{i}\ne \pi _{j},\forall i\ne j\) be the corresponding orderings of the nodes \(\textbf{X}=\left\{ X_{1},X_{2},\dots ,X_{d}\right\} \) of a graph \(\textbf{G}\) as presented in Sect. 3.1.1. A gradient flow [25] \(\text {grad}\varvec{\pi }\) on \(\varvec{\pi }\) is a function whose values are defined as follows

$$\begin{aligned} \left( \text {grad}{\varvec{\pi }}\right) \left( i,j\right) :=\pi _{j}-\pi _{i},1\le i,j\le d. \end{aligned}$$
(8)

The ordering value \(\pi _{i}\) of \(X_{i}\) will be smaller than the ordering value \(\pi _{j}\) of \(X_{j}\) if \(X_{i}\in nd\left( X_{j}\right) \). Hence, each value of \(\text {grad}\varvec{\pi }\) can have different properties depending on the topological ordering

$$\begin{aligned} \left( \text {grad}\varvec{\pi }\right) \left( i,j\right) {\left\{ \begin{array}{ll} >0 &{} \text {if }X_{i}\in nd\left( X_{j}\right) ,\\ =0 &{} \text {if }i=j,\\ <0 &{} \text {if }X_{j}\in nd\left( X_{i}\right) . \end{array}\right. } \end{aligned}$$
(9)

As the positive values of \(\left( \text {grad}\varvec{\pi }\right) \left( i,j\right) \) correspond to the edges of the full DAG that have the topological ordering \(\varvec{\pi }\), a mask \(\textbf{M}\left( \varvec{\pi }\right) \) to filter out conflicting edges can be defined as

$$\begin{aligned} M\left( \varvec{\pi }\right) _{i,j}:=\mathbbm {1}\left[ \left( \text {grad} \varvec{\pi }\right) \left( i,j\right) >0\right] . \end{aligned}$$
(10)

This mask can be applied to any adjacency matrix \(\textbf{E}\) of the graph \(\textbf{G}\) to constrain its acyclicity by filtering out the edges that conflict with the ordering \(\varvec{\pi }\), yielding the acyclicity-assured adjacency matrix as follows

$$\begin{aligned} {\textbf{A}_{\textbf{G}} = \textbf{M}\left( \varvec{\pi }\right) \odot \textbf{E},} \end{aligned}$$
(11)

where “\(\odot \)” denotes the element-wise multiplication.

The remaining task in this approach is inferring the matrix \(\textbf{E}\) via its latent representation. Similar to DiBS [27] and Equations (3)–(4), to represent the matrix \(\textbf{E}\) in the latent space, its soft representation \(\textbf{E}_{\alpha }\) is generated from a latent variable \(\textbf{Z}\) consisting of two embedding matrices \(\textbf{U}=\left[ \textbf{u}_{1},\textbf{u}_{2},\ldots ,\textbf{u}_{d}\right] , \textbf{u}_{i}\in \mathbb {R}^{k}\) and \(\textbf{V}=\left[ \textbf{v}_{1},\textbf{v}_{2},\ldots ,\textbf{v}_{d}\right] , \textbf{v}_{j}\in \mathbb {R}^{k}\) as follows

$$\begin{aligned} E_{\alpha }\left( \textbf{Z}\right) _{i,j}&:=p_{\alpha }\left( E_{i,j}=1\mid \textbf{u}_{i},\textbf{v}_{j}\right) \end{aligned}$$
(12)
$$\begin{aligned}&:=\sigma _{\alpha }\left( \textbf{u}_{i}^{\top }\textbf{v}_{j}\right) . \end{aligned}$$
(13)

In this setting, to prevent \(\textbf{E}_{\alpha }\) from being constrained in rank, we need \(k \ge d\). From this representation, as in Eq. 11, the probabilities of the edges in the directed and acyclic graph \(\textbf{G}\) can be computed by

$$\begin{aligned} {\textbf{A}_{\alpha }\left( \textbf{Z},\varvec{\pi }\right) :=\textbf{M} \left( \varvec{\pi }\right) \odot \textbf{E}_{\alpha }\left( \textbf{Z}\right) .} \end{aligned}$$
(14)

Compared to the two matrix multiplication operations in Equation (2) of the permutation-based approach in Sect. 3.1, this formulation is more efficient because it uses only a single element-wise operation while still assuring the acyclicity. However, implementing this setting requires the latent variable to comprise 2d vectors for representing the \(d^{2}\) elements in the matrix \(\textbf{E}\), instead of \(2\left( d-1\right) \) vectors for the \(d\left( d-1\right) /2\) elements in the strictly upper triangular matrix \(\textbf{S}\).
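A compact sketch of the mask-based construction (ours; the ordering and embeddings are hypothetical) following Equations (8)-(14): the mask is a simple pairwise comparison of ordering values, and the acyclicity-assured soft adjacency is a single element-wise product.

```python
import numpy as np

def ordering_mask(pi):
    """M[i, j] = 1 iff (grad pi)(i, j) = pi_j - pi_i > 0 (Equations (8)-(10))."""
    pi = np.asarray(pi)
    return (pi[None, :] - pi[:, None] > 0).astype(float)

def soft_edges(U, V, alpha):
    """E_alpha[i, j] = sigma_alpha(u_i^T v_j) (Equations (12)-(13))."""
    return 1.0 / (1.0 + np.exp(-alpha * U @ V.T))

rng = np.random.default_rng(0)
d, k = 4, 5                          # here k >= d avoids a rank constraint on E
pi = np.array([3, 1, 4, 2])          # hypothetical topological ordering values
U = rng.normal(size=(d, k))
V = rng.normal(size=(d, k))

A_soft = ordering_mask(pi) * soft_edges(U, V, alpha=1.0)   # Equation (14)
print(A_soft)                        # acyclic by construction: M zeroes conflicting edges
```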

3.3 Estimating the latent variable using Bayesian inference

In order to estimate the latent variable \(\textbf{Z}\), extending the approach from [27], we first consider the generative model given in Fig. 1. In this figure, a Bayesian network consists of a pair of variables \(\left( \textbf{G},{\varvec{\Theta }}\right) \) where \({\varvec{\Theta }}\) defines the parameters of the local conditional distributions at each variable given its parents in the DAG. This generative model is assumed to generate the observational data \(\mathcal {D}\) containing n observations of \(\textbf{X}\).

Given a topological ordering \({\varvec{\pi }}\), the generative model conditioned on \(\varvec{\pi }\) can be factorized as

$$\begin{aligned} p\left( \textbf{Z},\textbf{G},{\varvec{\Theta }},\mathcal {D}\mid \varvec{\pi }\right) =p\left( \textbf{Z}\right) p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) p\left( {\varvec{\Theta }}\mid \textbf{G}\right) p\left( \mathcal {D} \mid \textbf{G},{\varvec{\Theta }}\right) . \end{aligned}$$
(15)

For any function \(f\left( \textbf{G},{\varvec{\Theta }}\right) \) of interest, we can compute its expectation from the distribution \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) by inferring \(p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) with the following formula

$$\begin{aligned} \mathbb {E}_{p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) } \left[ f\left( \textbf{G},{\varvec{\Theta }}\right) \right] =\mathbb {E}_{p\left( \textbf{Z}, {\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) }\left[ \frac{\mathbb {E}_{p\left( \textbf{G} \mid \textbf{Z},\varvec{\pi }\right) }\left[ f\left( \textbf{G},{\varvec{\Theta }}\right) p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }{\mathbb {E}_{p\left( \textbf{G} \mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G} \right) \right] }\right] , \end{aligned}$$
(16)

where \(p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) =p\left( {\varvec{\Theta }} \mid \textbf{G}\right) p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) \) and \(p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) \) is computed using a graph prior (e.g., Erdős–Rényi [13] or scale-free [1]) on the soft graph in Equations (6) and (14). The function \(f\left( \textbf{G},{\varvec{\Theta }}\right) \) in Equation (16) acts as a placeholder for \(p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) \) in this study or any other functions depending on the setting of each structure learning task. The distributions of the parameters \(p\left( {\varvec{\Theta }}\mid \textbf{G}\right) \) and the data \(p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) \) are chosen differently for the linear and nonlinear Gaussian models. In the linear Gaussian model, the log probability of the parameters given the graph is

$$\begin{aligned} {\log p\left( {\varvec{\Theta }}\mid \textbf{G}\right) =\sum _{i,j} \left( A_{\textbf{G}}\right) _{i,j} \log \mathcal {N}\left( \theta _{i,j};\mu _{e},\sigma _{e}^{2}\right) ,} \end{aligned}$$
(17)

where \(\mu _{e}\) and \(\sigma _{e}\) are the mean and standard deviation of the Gaussian edge weights, and the log likelihood is as follows

$$\begin{aligned} \log p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) =\sum _{i=1}^{d}\log \mathcal {N}\left( X_{i};{\varvec{{\theta }}}_{i}^{\top }\textbf{X}_{\textrm{pa} \left( i\right) },\sigma _{obs}^{2}\right) , \end{aligned}$$
(18)

where \(\sigma _{obs}\) is the standard deviation of the additive observation noise at each node. In the nonlinear Gaussian model, we follow [27, 51] by using feed-forward neural networks (FFNs), denoted by \(\textrm{FFN}\left( \cdot ;{\varvec{\Theta }}\right) :\mathbb {R}^{d}\rightarrow \mathbb {R}\), to represent the relation of the variables with their parents (i.e., \(f_{i}\left( \textbf{X}_{pa\left( i\right) }\right) \) for each \(X_{i}\)). For each variable \(X_{i}\), the output is computed as

$$\begin{aligned} \textrm{FFN}\left( \textbf{x};{\varvec{\Theta }}^{\left( i\right) }\right) :={\varvec{\Theta }} ^{\left( i,L\right) }f_{\textrm{a}}\left( \dots {\varvec{\Theta }}^{\left( i,2\right) }f_{\textrm{a}} \left( {\varvec{\Theta }}^{\left( i,1\right) }\textbf{x}+{\varvec{\theta }}_{b}^{\left( i,1\right) } \right) +{\varvec{\theta }}_{b}^{\left( i,2\right) }\dots \right) +{\varvec{\theta }}_{b}^{\left( i,L\right) }, \end{aligned}$$
(19)

where \({\varvec{\Theta }}^{(i,l)}\in \mathbb {R}^{d_{l}\times d_{l-1}}\) is the weight matrix, \({\varvec{\theta }}_{b}^{\left( i,l\right) }\in \mathbb {R}^{d_{l}}\) is the bias vector, and \(f_{\textrm{a}}\) is the activation function. From the model in this setting, the log probability of the parameters is

$$\begin{aligned} {\begin{aligned}\log p\left( {\varvec{\Theta }}\mid \textbf{G}\right)&=\sum _{i=1}^{d}\Bigg (\sum _{a=1}^{d_{1}}\bigg (\log \mathcal {N}\left( \left( {\varvec{\theta }}_{b} ^{\left( i,1\right) }\right) _{a};0,\sigma _{p}^{2}\right) +\sum _{b=1}^{d}\left( A_{\textbf{G}} \right) _{i,b}^{\top }\log \mathcal {N}\left( {\varvec{\Theta }}_{a,b}^{\left( i,1\right) };0,\sigma _{p}^{2} \right) \bigg )\\&\quad +\sum _{l=2}^{L}\sum _{a=1}^{d_{l}}\bigg (\log \mathcal {N}\left( \left( {\varvec{\theta }}_{b} ^{(i,l)}\right) _{a};0,\sigma _{p}^{2}\right) +\sum _{b=1}^{d_{l-1}}\log \mathcal {N} \left( {\varvec{\Theta }}_{a,b}^{\left( i,l\right) };0,\sigma _{p}^{2}\right) \bigg )\Bigg ), \end{aligned}} \end{aligned}$$
(20)

where \(\sigma _{p}\) is the standard deviation of the Gaussian parameters. In the nonlinear Gaussian setting, the value of \(f_{i}\left( \textbf{X}_{pa\left( i\right) }\right) \) is assumed to be the mean of the distribution of each variable \(X_{i}\). As a result, the log likelihood of this model is as follows

$$\begin{aligned} {\log p\left( \mathcal {D}\mid \textbf{G},{\varvec{\Theta }}\right) =\sum _{i=1}^{d}\log \mathcal {N} \left( X_{i};\textrm{FFN}\left( \left( A_{\textbf{G}}\right) _{i}^{\top }\odot \textbf{X}; {\varvec{\Theta }}^{(i)}\right) ,\sigma _{obs}^{2}\right) ,} \end{aligned}$$
(21)

where “\(\odot \)” denotes the element-wise multiplication.
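As a concrete reference for the linear Gaussian case above, the following sketch (ours, with toy data; \(\mu _{e}\), \(\sigma _{e}\), and \(\sigma _{obs}\) are the hyperparameters named in Equations (17)-(18)) evaluates the two log-density terms for a given adjacency matrix and edge-weight matrix.

```python
import numpy as np
from scipy.stats import norm

def log_prob_theta_linear(A, Theta, mu_e=0.0, sigma_e=1.0):
    """Equation (17): only weights on existing edges of A contribute."""
    return float(np.sum(A * norm.logpdf(Theta, loc=mu_e, scale=sigma_e)))

def log_likelihood_linear(X, A, Theta, sigma_obs=0.1):
    """Equation (18): each node's mean is a linear combination of its parents."""
    means = X @ (A * Theta)              # column i aggregates the parents of X_i
    return float(np.sum(norm.logpdf(X, loc=means, scale=sigma_obs)))

rng = np.random.default_rng(0)
d, n = 3, 100
A = np.array([[0., 1., 1.],
              [0., 0., 1.],
              [0., 0., 0.]])             # X1 -> X2, X1 -> X3, X2 -> X3
Theta = rng.normal(size=(d, d))          # edge weights (only masked entries matter)
X = rng.normal(size=(n, d))              # stand-in observations

print(log_prob_theta_linear(A, Theta), log_likelihood_linear(X, A, Theta))
```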

3.4 Particle variational inference of intractable posterior

The joint posterior distribution \(p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) is intractable. Stein variational gradient descent (SVGD) [26, 27] is a suitable method to approximate this joint posterior density due to its gradient-based approach. The SVGD algorithm iteratively transports a set of particles to match the target distribution, which is similar to the gradient descent algorithm in optimization.

Algorithm 1

SVGD algorithm for inference of \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \)

From the proposed generative model, we need to infer the log joint posterior density of \(\textbf{Z}\) and \({\varvec{\Theta }}\) using the corresponding gradient to variable \(\textbf{Z}\) given by

$$\begin{aligned} \nabla _{\textbf{Z}}\log p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) =\nabla _{\textbf{Z}}\log p\left( \textbf{Z}\right) +\frac{\nabla _{\textbf{Z}}\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }{\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }. \end{aligned}$$
(22)

The log latent prior distribution \(\log p\left( \textbf{Z}\right) \) is chosen as

$$\begin{aligned} \begin{aligned}\log p\left( \textbf{Z}\right)&:=\sum _{i,j}\log \mathcal {N}\left( U_{ij};0,\sigma _{z}^{2}\right) +\sum _{i,j}\log \mathcal {N}\left( V_{ij};0,\sigma _{z}^{2}\right) \\&\quad +\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) }\left[ \log p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) \right] -C, \end{aligned} \end{aligned}$$
(23)

where C is the log partitioning constant. Similarly, the gradient corresponding to variable \({\varvec{\Theta }}\) is given by

$$\begin{aligned} \nabla _{\varvec{\Theta }}\log p\left( \textbf{Z},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) =\frac{\nabla _ {\varvec{\Theta }}\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) } \left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] }{\mathbb {E}_{p\left( \textbf{G} \mid \textbf{Z},\varvec{\pi }\right) }\left[ p\left( {\varvec{\Theta }}, \mathcal {D}\mid \textbf{G}\right) \right] }. \end{aligned}$$
(24)

The numerator of the second term in Equation (22) can be approximated using the Gumbel-softmax trick [20, 29] as follows

$$\begin{aligned} {\begin{aligned}&\nabla _{\textbf{Z}}\mathbb {E}_{p\left( \textbf{G}\mid \textbf{Z},\varvec{\pi }\right) } \left[ p\left( {\varvec{\Theta }},\mathcal {D}\mid \textbf{G}\right) \right] \\&\quad \approx \mathbb {E}_{p\left( \textbf{L}\right) }\left[ \nabla _{\textbf{G}}p\left( {\varvec{\Theta }}, \mathcal {D}\mid \textbf{G}\right) \Big \vert _{\textbf{A}_{\textbf{G}}=\tilde{\textbf{A}}_{\tau } \left( \textbf{L},\textbf{Z},\varvec{\pi }\right) }\cdot \nabla _{\textbf{Z}}\tilde{\textbf{A}}_ {\tau }\left( \textbf{L},\textbf{Z},\varvec{\pi }\right) \right] , \end{aligned} } \end{aligned}$$
(25)

where \(\textbf{L}\sim \text {Logistic}\left( 0,1\right) ^{\left( d-1\right) \times \left( d-1\right) }\) and \(\tilde{\textbf{A}}_{\tau }\left( \textbf{L},\textbf{Z},\varvec{\pi }\right) :=\textbf{P}\left( \varvec{\pi }\right) \cdot \tilde{\textbf{S}}_{\tau }\left( \textbf{L},\textbf{Z}\right) \cdot \textbf{P}\left( \varvec{\pi }\right) ^{\top }\) in the permutation-based approach, and \(\textbf{L}\sim \text {Logistic}\left( 0,1\right) ^{d\times d}\) and \(\tilde{\textbf{A}}_{\tau }\left( \textbf{L},\textbf{Z},\varvec{\pi }\right) :=\textbf{M}\left( \varvec{\pi }\right) \odot \tilde{\textbf{E}}_{\tau }\left( \textbf{L},\textbf{Z}\right) \) in the mask-based approach. The element-wise definition of \(\tilde{\textbf{S}}_{\tau }\) is

$$\begin{aligned} \tilde{S}_{\tau }\left( \textbf{L},\textbf{Z}\right) _{i,j}:={\left\{ \begin{array}{ll} \sigma _{\tau }\left( L_{i,j-1}+\alpha \textbf{u}_{i}^{\top }\textbf{v}_{j-1}\right) &{} \text {if }j>i,\\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(26)

Correspondingly, the elements of \(\tilde{\textbf{E}}_{\tau }\) are defined as

$$\begin{aligned} \tilde{E}_{\tau }\left( \textbf{L},\textbf{Z}\right) _{i,j}:=\sigma _{\tau }\left( L_{i,j}+ \alpha \textbf{u}_{i}^{\top }\textbf{v}_{j}\right) . \end{aligned}$$
(27)

In our experiments, we choose \(\tau =1\) in accordance with [27].
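The following sketch (ours, for the mask-based case with the same hypothetical ordering as before) shows how a differentiable soft graph sample is drawn with logistic noise in the spirit of Equations (25)-(27); gradients with respect to the latent embeddings can then flow through this sample.

```python
import numpy as np

def soft_graph_sample(U, V, pi, alpha, tau, rng):
    """One relaxed sample of the graph for the mask-based variant."""
    d = U.shape[0]
    L = rng.logistic(size=(d, d))                       # L ~ Logistic(0, 1)^{d x d}
    # Equation (27) with sigma_tau(x) = 1 / (1 + exp(-tau * x)).
    E_soft = 1.0 / (1.0 + np.exp(-tau * (L + alpha * U @ V.T)))
    pi = np.asarray(pi)
    M = (pi[None, :] - pi[:, None] > 0).astype(float)   # mask from Equation (10)
    return M * E_soft                                   # relaxed version of Equation (14)

rng = np.random.default_rng(0)
d, k = 4, 5
U, V = rng.normal(size=(d, k)), rng.normal(size=(d, k))
pi = np.array([3, 1, 4, 2])
print(soft_graph_sample(U, V, pi, alpha=1.0, tau=1.0, rng=rng))
```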

The kernel for SVGD proposed in [27] is

$$\begin{aligned} k\left( \left( \textbf{Z},{\varvec{\Theta }}\right) ,\left( \textbf{Z}^{\prime }, {\varvec{\Theta }}^{\prime }\right) \right) :=\exp \left( -\frac{1}{\gamma _{z}}\left\| \textbf{Z}-\textbf{Z}^{\prime }\right\| _{2}^{2}\right) +\exp \left( -\frac{1}{\gamma _{\theta }}\left\| {\varvec{\Theta }}-{\varvec{\Theta }}^{\prime }\right\| _{2}^{2}\right) . \end{aligned}$$
(28)

From this kernel, an incremental update for the mth particle of \(\textbf{Z}\) at the tth step is computed by

$$\begin{aligned} \begin{aligned}&\phi _{t}^{\textbf{Z}}\left( \textbf{Z}_{t}^{\left( m\right) },{\varvec{\Theta }}_{t}^{\left( m\right) }\right) \\&\quad =\frac{1}{M}\sum _{r=1}^{M}\Bigg [k\left( \left( \textbf{Z}_{t}^{\left( r\right) }, {\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t}^{\left( m\right) }, {\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \cdot \nabla _{\textbf{Z}_{t}^{\left( r\right) }}\log p\left( \textbf{Z}_{t}^{\left( r\right) },{\varvec{\Theta }}_{t}^{\left( r\right) } \mid \mathcal {D},\varvec{\pi }\right) \\&\qquad +\nabla _{\textbf{Z}_{t}^{\left( r\right) }}k\left( \left( \textbf{Z}_{t} ^{\left( r\right) },{\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t} ^{\left( m\right) },{\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \Bigg ]. \end{aligned} \end{aligned}$$
(29)

A similar update for \({\varvec{\Theta }}\) is obtained by replacing \(\nabla _{\textbf{Z}_{t}^{\left( r\right) }}\) with \(\nabla _{{\varvec{\Theta }}_{t}^{\left( r\right) }}\) as

$$\begin{aligned}&\phi _{t}^{{\varvec{\Theta }}}\left( \textbf{Z}_{t}^{\left( m\right) }, {\varvec{\Theta }}_{t}^{\left( m\right) }\right) \nonumber \\&\quad =\frac{1}{M}\sum _{r=1}^{M}\Bigg [k\left( \left( \textbf{Z}_{t}^{\left( r\right) }, {\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t}^{\left( m\right) }, {\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \cdot \nabla _{{\varvec{\Theta }}_{t} ^{\left( r\right) }}\log p\left( \textbf{Z}_{t}^{\left( r\right) },{\varvec{\Theta }}_{t} ^{\left( r\right) }\mid \mathcal {D},\varvec{\pi }\right) \nonumber \\&\qquad +\nabla _{{\varvec{\Theta }}_{t}^{\left( r\right) }}k\left( \left( \textbf{Z}_{t} ^{\left( r\right) },{\varvec{\Theta }}_{t}^{\left( r\right) }\right) ,\left( \textbf{Z}_{t} ^{\left( m\right) },{\varvec{\Theta }}_{t}^{\left( m\right) }\right) \right) \Bigg ]. \end{aligned}$$
(30)

The SVGD algorithm for inferring \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D},\varvec{\pi }\right) \) is presented in Algorithm 1.
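To make the particle updates concrete, here is a minimal SVGD sketch (ours; it uses a toy two-dimensional standard normal target rather than the posterior in Equation (22)) implementing the kernel-weighted attraction and repulsion terms of Equation (29).

```python
import numpy as np

def svgd_step(Z, grad_log_p, gamma=1.0, step=0.5):
    """One SVGD update for particles Z (M x dim) with an RBF kernel."""
    M = Z.shape[0]
    phi = np.zeros_like(Z)
    for m in range(M):
        for r in range(M):
            k = np.exp(-np.sum((Z[r] - Z[m]) ** 2) / gamma)   # Equation (28)-style kernel
            grad_k = -2.0 / gamma * (Z[r] - Z[m]) * k         # gradient of k w.r.t. Z_r
            phi[m] += k * grad_log_p[r] + grad_k              # one summand of Equation (29)
    return Z + step * phi / M

# Toy target: standard normal, so grad log p(Z) = -Z. Particles start far away.
rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, size=(20, 2))
for _ in range(500):
    Z = svgd_step(Z, grad_log_p=-Z)
print(Z.mean(axis=0))                  # the particle mean drifts toward the target mean (0, 0)
```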

4 Experiments

4.1 Experimental settings

We compare our P-TOBAC and M-TOBAC approaches with related Bayesian score-based methods, including BCD Nets [9], DAG-GFlowNet (GFN) [10], and DiBS [27], on synthetic data and the flow cytometry data [38]. Unlike DiBS, which can learn the joint distribution \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D}\right) \) and infer nonlinear Gaussian networks, BCD Nets and DAG-GFlowNet are designed to work with linear Gaussian models. BCD Nets learns the parameters using the weighted adjacency matrix, and DAG-GFlowNet uses the BGe score [16].

Regarding the selection of the topological ordering for conditioning, we choose the topological ordering from EqVar [6] and the ground-truth (GT) ordering as the conditions for our approach. We also analyze the effect of the ordering on the performance by replacing it with the orderings from NPVar [15] and SCORE [37]. The DiBS+ and TOBAC+ notations in our experiments denote results with the weighted particle mixture of [27], which uses \(p\left( \textbf{G},{\varvec{\Theta }},\mathcal {D}\right) \) as the weight for each particle. This weight is employed as the unnormalized probability of each inferred particle when computing the expectation of the evaluation metrics.

Synthetic data and graph prior

We generate the data using the Erdős–Rényi (ER) structure [13] with degrees of 1 and 2. In all settings in Sect. 4.2, each inference is performed on \(n=100\) observations. For the ablation study in Sect. 4.4, synthetic data with \(d\in \left\{ 10,20,50\right\} \) variables and \(n\in \left\{ 100,500\right\} \) observations are utilized to analyze the effect of dimensionality and sample size on the performance. For the graph prior, BCD Nets uses the Horseshoe prior for its strictly lower triangular matrix \(\textbf{L}\), and GFN uses the prior from [12]. DiBS and our approach use the Erdős–Rényi graph prior, which takes the form \(p\left( \textbf{G}\right) \propto q^{\left\| \textbf{A}_{\textbf{G}}\right\| _{1}}\left( 1-q\right) ^{\left( {\begin{array}{c}d\\ 2\end{array}}\right) -\left\| \textbf{A}_{\textbf{G}}\right\| _{1}}\), where q is the probability of an independent edge being added to the DAG.
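For reference, this Erdős–Rényi log-prior reduces to a simple count of present and absent edges; a small sketch (ours, with an arbitrary edge probability q) is given below.

```python
import numpy as np
from math import comb, log

def er_log_prior(A, q):
    """log p(G) up to a constant: ||A||_1 log q + (C(d, 2) - ||A||_1) log(1 - q)."""
    n_edges = int(np.sum(A))
    return n_edges * log(q) + (comb(A.shape[0], 2) - n_edges) * log(1.0 - q)

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
print(er_log_prior(A, q=0.2))
```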

Fig. 2

Performance on synthetic data generated from linear Gaussian models with \(d=20\) variables, \(n=100\) observations, and 1000 sampling steps. Lower E-SHD and higher AUROC are preferable. Our TOBAC models with the orderings from EqVar [6] and the ground-truth orderings are compared with BCD Nets (denoted as BCD) [9], DAG-GFlowNet (denoted as GFN) [10], and DiBS [27]. The DiBS+, P-TOBAC+, and M-TOBAC+ models are the results with weighted particle mixture. All the methods except for DiBS are designed to ensure acyclicity, so the cyclicity score is not necessary. The different designs of BCD Nets and DAG-GFlowNet make a comparison by the negative log likelihood evaluation with the joint posterior distribution implausible. Our TOBAC models accomplish the lowest E-SHD scores and nearly the highest AUROC scores in both ER-1 (sparse graph) and ER-2 (denser graph) settings

Evaluation metrics

Following the evaluation metrics used by previous work [9, 10, 27], we evaluate the performance using the expected structural Hamming distance (E-SHD) and the area under the receiver operating characteristic curve (AUROC). We follow [27], where the E-SHD score is the expectation of the structural Hamming distance (SHD) between each \(\textbf{G}\) and the ground-truth \(\textbf{G}^{*}\) over the posterior distribution \(p\left( \textbf{G}\mid \mathcal {D}\right) \), which measures the expected number of edges that have been incorrectly predicted. This evaluation score is formulated as

$$\begin{aligned} {\text {E-SHD}\left( p,\textbf{G}^{*}\right) }&{:=\sum _{\textbf{G}}p\left( \textbf{G}\mid \mathcal {D}\right) \textrm{SHD}\left( \textbf{A}_{\textbf{G}}, \textbf{A}_{\textbf{G}^{*}}\right) }. \end{aligned}$$
(31)

The AUROC score is computed for each edge probability \(p\left( A_{i,j}=1\mid \mathcal {D}\right) \) of \(\textbf{A}_{\textbf{G}}\) in comparison with the corresponding ground-truth element in \(\textbf{A}_{\textbf{G}^{*}}\) [19]. In addition, we also evaluate the nonlinear Gaussian Bayesian networks from DiBS and ours by the cyclicity score and the average negative log likelihood [27]. The cyclicity score is proposed by [47] and is used in DiBS as the constraint for acyclicity. The score measuring the cyclicity or non-DAG-ness of a graph \(\textbf{G}\) is defined as

$$\begin{aligned} {h\left( \textbf{G}\right) :=\textrm{tr}\left[ \left( \textbf{I}+\frac{1}{d}\textbf{A}_ {\textbf{G}}\right) ^{d}\right] -d,} \end{aligned}$$
(32)

where \(h\left( \textbf{G}\right) =0\) if and only if \(\textbf{G}\) has no cycle, and the higher its value, the more cyclic \(\textbf{G}\) is. For the average negative log likelihood evaluation, a test dataset \(\mathcal {D}^{test}\) containing 100 held-out observations is also generated to compute the score as follows

$$\begin{aligned} \text {Neg.LL}\left( p,\mathcal {D}^{test}\right) :=-\sum _{\textbf{G},{\varvec{\Theta }}}p \left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D}\right) \log p\left( \mathcal {D}^{test}\mid \textbf{G},{\varvec{\Theta }}\right) . \end{aligned}$$
(33)

This score is designed to evaluate the model’s ability to predict future observations. Note that BCD Nets formulates the parameters with the weighted adjacency matrix, and DAG-GFlowNet only estimates the marginal posterior distribution \(p\left( \textbf{G}\mid \mathcal {D}\right) \) and uses BGe score [16] for the parameters. Hence, we cannot compute the joint posterior distribution \(p\left( \textbf{G},{\varvec{\Theta }}\mid \mathcal {D}\right) \) for the negative log likelihood evaluation. The reported results in the following sections are obtained from ten randomly generated datasets for each method and configuration.
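As a rough reference for how these metrics are computed from posterior samples, here is a small sketch (ours; the sample graphs, particle weights, and ground truth are toy stand-ins, and scikit-learn provides the AUROC).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def shd(A, A_true):
    """Edge insertions, deletions, and reversals; a reversal counts once."""
    diff = np.abs(A - A_true)
    upper = np.triu(np.ones_like(diff), k=1).astype(bool)
    reversals = int(np.sum(((diff + diff.T) == 2) & upper))
    return int(np.sum(diff)) - reversals

def expected_shd(graphs, weights, A_true):
    """Equation (31) with the posterior approximated by weighted particles."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(sum(wi * shd(G, A_true) for wi, G in zip(w, graphs)))

rng = np.random.default_rng(0)
d = 5
A_true = np.zeros((d, d), dtype=int)
A_true[0, 1] = A_true[1, 2] = A_true[0, 3] = 1           # toy ground-truth DAG

graphs = [np.triu(rng.integers(0, 2, size=(d, d)), k=1) for _ in range(8)]
weights = np.ones(len(graphs))                           # uniform particle mixture

edge_probs = np.mean(graphs, axis=0)                     # marginal p(A_ij = 1 | D)
off_diag = ~np.eye(d, dtype=bool)
print(expected_shd(graphs, weights, A_true),
      roc_auc_score(A_true[off_diag].ravel(), edge_probs[off_diag].ravel()))
```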

Fig. 3

Expected structural Hamming distance on synthetic data with different sample sizes and dimensionalities. Lower E-SHD is preferable. Our TOBAC models with the orderings from EqVar [6] and the ground-truth orderings are compared with BCD Nets (denoted as BCD) [9], DAG-GFlowNet (denoted as GFN) [10], and DiBS [27]. The DiBS+, P-TOBAC+, and M-TOBAC+ models are the results with weighted particle mixture. Our models can learn more accurate graphs with more stable results when the data's dimension increases. Most approaches are not affected by the number of observations, except for DAG-GFlowNet (GFN) [10], which cannot learn as effectively due to its inability to handle the higher peakedness of the posterior distribution

Fig. 4

Performance on synthetic data generated from nonlinear Gaussian models with \(d=20\) variables, \(n=100\) observations, and 1000 sampling steps. Lower Cyclicity, E-SHD, and Neg.LL, and higher AUROC are preferable. Our TOBAC models with the orderings from EqVar [6] and the ground-truth orderings are compared with DiBS [27]. The DiBS+, P-TOBAC+, and M-TOBAC+ models are the results with weighted particle mixture. Note that BCD Nets [9] and DAG-GFlowNet [10] are only designed for linear Gaussian models, so these methods are not included in this experiment. In comparison with DiBS, our TOBAC models can guarantee the acyclicity of the learned graphs. Our models also perform better on most evaluation metrics and show more stable results with smaller variances, especially the ones with weighted particle mixture

4.2 Performance on synthetic data

Linear Gaussian models

Figure 2 illustrates the performance of the methods with linear Gaussian models. We find that our P-TOBAC and M-TOBAC models outperform the other approaches. TOBAC models accomplish the lowest E-SHD scores and nearly the highest AUROC scores in both ER-1 (sparse graph) and ER-2 (denser graph) settings. The more efficient variant M-TOBAC achieves performance similar to P-TOBAC in this setting. In comparison with DiBS, the results demonstrate that introducing prior knowledge, namely the EqVar and ground-truth orderings, into the inference process can increase the performance significantly. As the graph becomes denser, as in the ER-2 settings, DiBS models have high variances in their results, whereas our models are as stable as the other approaches. Compared with DAG-GFlowNet, which uses GFlowNets [2] to infer the posterior distribution, our models achieve a lower E-SHD score and a substantially higher AUROC score in the denser graph setting.

Effect of sample sizes and dimensionalities of the data

We study the effect of different sample sizes and data dimensionalities on TOBAC models and related approaches utilizing linear Gaussian models. As we can see in Fig. 3, the number of observations does not have much effect on the performance in most cases. In the case of GFN, as also described by Deleu et al. [10], when the number of observations increases, this model cannot perform as well as with a smaller sample size due to the higher peakedness of the posterior distribution. These results show that Bayesian approaches can sufficiently handle the uncertainty caused by a smaller amount of data. In contrast, as the number of dimensions in the data rises, the effect on the inference performance diverges among the approaches. Graphs with 50 nodes inferred by TOBAC are more accurate in comparison with other approaches.

Nonlinear Gaussian models

The performance results for nonlinear Gaussian models are depicted in Fig. 4. The cyclicity scores clearly show that the acyclicity constraint of DiBS is not as effective as our approach. In addition to the certainty of acyclicity, from the E-SHD, AUROC, and negative log likelihood scores, we can see that our approaches infer better graphs than DiBS. As EqVar is not formulated for nonlinear Gaussian models, the results with EqVar orderings are weaker than the ones with the ground-truth orderings. However, the performance of these models is still better than that of DiBS. This improvement in performance emphasizes the benefit of the orderings while guaranteeing the acyclicity at every number of sampling steps. With the orderings given, optimizing the graph parameters becomes significantly easier. As a result, the log likelihood of the observational data is improved. Furthermore, the plots in the ER-2 setting indicate that TOBAC variants with weighted particle mixture are significantly more stable than DiBS when inferring denser graphs.

4.3 Performance on real data

Flow cytometry dataset

Table 1 Performance on the flow cytometry dataset [38]

The flow cytometry dataset includes \(n=853\) observational continuous data points of \(d=11\) phosphoproteins from the first perturbation condition in [38]. The graphs are inferred with linear Gaussian models and Erdős–Rényi (ER) graph priors at 1000 sampling steps. As we can observe in Table 1, our P-TOBAC and P-TOBAC+ approaches with the ground-truth ordering achieve the lowest E-SHD values at \(14.7\pm 0.35\) and \(12.7\pm 0.82\), respectively. Additionally, the AUROC scores of M-TOBAC and P-TOBAC with the ground-truth ordering are the highest at \(0.710\pm 0.033\) and \(0.697\pm 0.029\), respectively, both higher than DiBS's AUROC of \(0.630\pm 0.0360\). As the relationships in this data are complex, the TOBAC models with the ground-truth ordering accomplish better performance than the ones with the ordering from EqVar. However, all models with the EqVar ordering still surpass the other related approaches. This demonstrates the advantage of integrating information from the topological orderings in our framework.

SynTReN pseudo-real dataset

This pseudo-real dataset was generated in [23] using the software designed to generate Synthetic Transcriptional Regulatory Networks (SynTReN) [5]. The dataset comprises 10 subsets of \(n=500\) observations, each generated from a subgraph of SynTReN with \(d=20\) nodes. Each ground-truth subgraph has from 19 to 34 edges, with an average of 23.5 edges. Similar to the experiment on the flow cytometry dataset, we also infer these graphs using linear Gaussian models and ER graph priors. The overall results are summarized in Table 2. The detailed results are available in . All of our P-TOBAC and M-TOBAC models with ground-truth orderings achieve the highest AUROC score at \(0.905 \pm 0.0316\). This score significantly exceeds the results of all related baseline methods. In addition, our M-TOBAC+ model has the second lowest E-SHD result of \(33.73 \pm 5.581\), which is better than most baseline methods, except BCD Nets.

Table 2 Overall performance on the SynTReN dataset [5, 23]
Fig. 5

Performance with different numbers of sampling steps in the linear setting with \(d=20\) variables and \(n=100\) observations. Lower Cyclicity and E-SHD, and higher AUROC are preferable. Our TOBAC models are compared with DiBS [27], which uses post-hoc penalization to constrain the acyclicity. The post-hoc constraint is not effective even after 2000 sampling steps. Introducing the prior knowledge of topological ordering into the inference process of TOBAC can both assure the acyclicity and enhance the performance

Fig. 6

Performance with topological orderings from different methods in the linear setting with \(d=20\) variables, \(n=100\) observations, and 1000 sampling steps. Lower E-SHD and Neg.LL, and higher AUROC are preferable. The TOBAC models with the orderings from EqVar [6], NPVar [15], and SCORE [37] are compared with the models with ground-truth orderings. The P-TOBAC+ and M-TOBAC+ models are the results with weighted particle mixture. The experiments are configured with equal variance, so our models with orderings from EqVar [6] can accomplish similar results to the ones with the ground-truth orderings. The assumption of nonlinearity in the setting of SCORE makes the models with its orderings less stable

4.4 Ablation study

Effectiveness of the assured acyclicity constraint

Fig. 5 exhibits the effectiveness of our proposed acyclicity constraint compared to the post-hoc constraint in DiBS [27]. Although the number of cycles in the inferred graphs does decrease as the number of sampling steps increases, after 2000 iterations the number of remaining cycles is still considerable. By introducing the prior knowledge from the topological ordering as the acyclicity constraint, our TOBAC models can assure the acyclicity of the generated graphs at any number of sampling steps. Furthermore, incorporating the ordering also improves the inference performance and allows the TOBAC models to obtain better results with fewer sampling steps. This enhancement is clearly demonstrated in the figure: at only 500 sampling steps, all results from TOBAC surpass those from DiBS at 2000 sampling steps. Moreover, because the ordering limits the search space, the results acquired from TOBAC are generally more stable, with lower variances.

Topological orderings from other approaches

We summarize the results of the graphs inferred by TOBAC models with the ground-truth ordering and with the orderings learned by several approaches, namely EqVar [6], NPVar [15], and SCORE [37], in Fig. 6. In this setting, the variances of the variables are equal, which matches the EqVar assumption. As a consequence, the results suggest that TOBAC with EqVar can achieve performance scores close to those of TOBAC with the ground-truth ordering on synthetic data. Although SCORE is a state-of-the-art approach to learning orderings, the results obtained with its orderings are more unstable than those of the other, simpler methods. This instability may be due to the nonlinear assumption of the structural equation model in SCORE, which does not fit this linear Gaussian setting well.

Uniform and weighted particle mixture

From the observed results, especially when the data are generated from nonlinear Gaussian models, the weighted particle mixture can reduce the E-SHD and Neg.LL scores. However, it also causes a reduction in the AUROC scores. The lower performance of the uniform mixture may be due to its crude approximation of the posterior probability mass function of the particles, which is replaced by a more meaningful weighting in the weighted mixture [27].

Permutation-based and mask-based constraints

In most of the acquired results, there is no significant discrepancy between the permutation-based and mask-based constraints. Both approaches can assure the acyclicity of the generated graphs and improve the inference performance. Having more parameters to infer in the mask-based variant does not affect its performance. Combined with its computational efficiency, this effectiveness makes the mask-based implementation the ideal choice of constraint for assuring the acyclicity.

5 Conclusion

In this work, we have presented TOBAC—a framework for strictly constraining the acyclicity of the inferred graphs and integrating the knowledge from the topological orderings into the inference process. We propose two possible permutation-based and mask-based implementations of this framework from the decomposition of the adjacency matrix and the property of the gradient flow on the topological ordering. Our work uses continuous representations of the edge matrices and approximates the posterior using particle variational inference. Our proposed framework makes the inference process less complicated and enhances the accuracy of inferred graphs and parameters. Accordingly, our work can outperform most related Bayesian score-based approaches on both synthetic and real observational data. In future work, we will explore methods to learn the posterior distribution of the topological orderings with observational data, which will allow our approach to infer more diverse posterior distributions while still ensuring the acyclicity of generated graphs.