0% found this document useful (0 votes)

12 views24 pages

Cheng 2023 CUTS

The document presents CUTS, a novel neural Granger causal discovery algorithm designed to handle irregular time-series data by jointly imputing missing data and constructing causal graphs. It utilizes a two-stage iterative framework that includes a Delayed Supervision Graph Neural Network for data imputation and a causal graph fitting stage, demonstrating superior performance compared to existing methods. This approach aims to enhance causal discovery in real-world applications where data imperfections are common.

Uploaded by

j.lowhorn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views24 pages

Cheng 2023 CUTS

Uploaded by

j.lowhorn

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 24

Published as a conference paper at ICLR 2023

CUTS: N EURAL C AUSAL D ISCOVERY

FROM I RREGULAR T IME -S ERIES DATA

Yuxiao Cheng1 Runzhao Yang1 Tingxiong Xiao1 Zongren Li3

12∗ 3∗ 12∗
Jinli Suo Kunlun He Qionghai Dai
1
Department of Automation, Tsinghua University
2
Institute for Brain and Cognitive Science, Tsinghua University (THUIBCS)
3
Chinese PLA General Hospital
{cyx22,yangrz20,xtx22}@mails.tsinghua.edu.cn
{lizongren,kunlunhe}@plagh.org {jlsuo,qhdai}@tsinghua.edu.cn
arXiv:2302.07458v1 [cs.LG] 15 Feb 2023

A BSTRACT
Causal discovery from time-series data has been a central task in machine learn-
ing. Recently, Granger causality inference is gaining momentum due to its good
explainability and high compatibility with emerging deep neural networks. How-
ever, most existing methods assume structured input data and degenerate greatly
when encountering data with randomly missing entries or non-uniform sampling
frequencies, which hampers their applications in real scenarios. To address this is-
sue, here we present CUTS, a neural Granger causal discovery algorithm to jointly
impute unobserved data points and build causal graphs, via plugging in two mutu-
ally boosting modules in an iterative framework: (i) Latent data prediction stage:
designs a Delayed Supervision Graph Neural Network (DSGNN) to hallucinate
and register irregular data which might be of high dimension and with complex
distribution; (ii) Causal graph fitting stage: builds a causal adjacency matrix with
imputed data under sparse penalty. Experiments show that CUTS effectively in-
fers causal graphs from irregular time-series data, with significantly superior per-
formance to existing methods. Our approach constitutes a promising step towards
applying causal discovery to real applications with non-ideal observations.

1 I NTRODUCTION
Causal interpretation of the observed time-series data can help answer fundamental causal questions
and advance scientific discoveries in various disciplines such as medical and financial fields. To
enable causal reasoning and counterfactual prediction, researchers in the past decades have been
dedicated to discovering causal graphs from observed time-series and made large progress (Ger-
hardus & Runge, 2020; Tank et al., 2022; Khanna & Tan, 2020; Wu et al., 2022; Pamfil et al., 2020;
Löwe et al., 2022; Runge, 2021). This task is called causal discovery or causal structure learning,
which usually formulates causal relationships as Directed Acyclic Graphs (DAGs). Among these
causal discovery methods, Granger causality (Granger, 1969; Marinazzo et al., 2008) is attracting
wide attentions and demonstrates advantageous due to its high explainability and compatibility with
emerging deep neural networks (Tank et al., 2022; Khanna & Tan, 2020; Nauta et al., 2019)).
In spite of the progress, actually most existing causal discovery methods assume well structured
time-series, i.e., completely sampled with an identical dense frequency. However, in real-world
scenarios the observed time-series might suffer from random data missing (White et al., 2011) or
be with non-uniform periods. The former is usually caused by sensor limitations or transmission
loss, while the latter occurs when multiple sensors are of distinct sampling frequencies. Robustness
to such data imperfections is urgently demanded, but has not been well explored yet so far. When
confronted with unobserved data points, some straightforward solutions fill the points with zero
padding, interpolation, or other imputation algorithms, such as Gaussian Process Regression or
neural-network-based approaches (Cini et al., 2022; Cao et al., 2018; Luo et al., 2018). We will
show in the experiments section that addressing missing entries via performing such trivial data
imputation in a pre-processing manner would lead to hampered causal conclusions.
∗
Corresponding author

1
Published as a conference paper at ICLR 2023

To push causal discovery towards real applications, we attempt to infer reliable causal graphs from
irregular time-series data. Fortunately, for data that are assumed to be generated with certain causal
structural models (Pamfil et al., 2020; Tank et al., 2022), a well designed neural network can fill a
small proportion of missing entries decently given a plausible causal graph, which would conversely
improve the causal discovery, and so forth. Leveraging this benefit, we propose to conduct causal
discovery and data completion in a mutually boosting manner under an iterative framework, instead
of sequential processing. Specifically, the algorithm alternates between two stages, i.e., (a) Latent
data prediction stage that hallucinates missing entries with a delayed supervision graph neural net-
work (DSGNN) and (b) Causal graph fitting stage inferring causal graphs from filled data under
sparse constraint utilizing the extended nonlinear Granger Causality scheme. We name our algo-
rithm Causal discovery from irregUlar Time-Series (CUTS), and the main contributions are listed
as follows:

• We proposed CUTS, a novel framework for causal discovery from irregular time-series
data, which to our best knowledge is the first to address the issues of irregular time-series
in causal discovery under this paradigm. Theoretically CUTS can recover the correct causal
graph with fair assumptions, as proved in Theorem 1.
• In the data imputation stage we design a deep neural network DSGNN, which success-
fully imputes the unobserved entries in irregular time-series data and boosts the subsequent
causal discovery stage and latter iterations.
• We conduct extensive experiments to show our superior performance to state-of-the-art
causal discovery methods combined with widely used data imputation methods, the ad-
vantages of mutually-boosting strategies over sequential processing, and the robustness of
CUTS (in Appendix Section A.4).

2 R ELATED W ORKS

Causal Structural Learning / Causal Discovery. Causal Structural Learning (or Causal Dis-
covery) is a fundamental and challenging task in the field of causality and machine learning, which
can be categorized into four classes. (i) Constraint-based approaches which build causal graphs
by conditional independence tests. Two most widely used algorithms are PC (Spirtes & Glymour,
1991) and Fast Causal Inference (FCI) (Spirtes et al., 2000) which is later extended by Entner &
Hoyer (2010) to time-series data. Recently, Runge et al. propose PCMCI to combine the above
two constraint-based algorithms with linear/nonlinear conditional independence tests (Gerhardus &
Runge, 2020; Runge, 2018b) and achieve high scalability on large scale time-series data. (ii) Score-
based learning algorithms based on penalized Neural Ordinary Differential Equations (Bellot et al.,
2022) or acyclicity constraint (Pamfil et al., 2020). (iii) Convergent Cross Mapping (CCM) firstly
proposed by Sugihara et al. (2012) that tackles the problems of nonseparable weakly connected
dynamic systems by reconstructing nonlinear state space. Later, CCM is extended to situation of
synchrony (Ye et al., 2015), confounding (Benkő et al., 2020) or sporadic time series (Brouwer
et al., 2021). (iv) Approaches based on Additive Noise Model that infer causal graph based on ad-
ditive noise assumption (Shimizu et al., 2006; Hoyer et al., 2008). Recently Hoyer et al. (2008)
extend ANM to nonlinear models with almost any nonlinearities. (v) Granger causality approach
proposed by Granger (1969) which has been widely used to analyze the temporal causal relation-
ships by testing the aid of a time-series on predicting another time-series. Granger causal analysis
originally assumes that linear models and the causal structures can be discovered by fitting a Vector
Autoregressive (VAR) model. Later, the Granger causality idea was extended to nonlinear situations
(Marinazzo et al., 2008). Thanks to its high compatibility with the emerging deep neural network,
Granger causal analysis is gaining momentum and is used in our work for incorporating a neural
network imputing irregular data with high complexities.
Neural Granger Causal Discovery. With the rapid progress and wide applications of deep Neural
Networks (NNs), researchers begin to utilize RNN (or other NNs) to infer nonlinear Granger causal-
ity. Wu et al. (2022) used individual pair-wise Granger causal tests, while Tank et al. (2022) inferred
Granger causality directly from component-wise NNs by enforcing sparse input layers. Building on
Tank et al. (2022)’s idea, Khanna & Tan (2020) explored the possibility of inferring Granger causal-
ity with Statistical Recurrent Units (SRUs, Oliva et al. (2017)). Later, Löwe et al. (2022) extends the
neural Granger causality idea to causal discovery on multiple samples with different causal relation-

2
Published as a conference paper at ICLR 2023

ships but similar dynamics. However, all these approaches assume fully observed time-series and
show inferior results given irregular data, which is shown in the experiments section. In this work,
we leverage this Neural Granger Causal Discovery idea and build a two-stage iterative scheme to
impute the unobserved data points and discover causal graphs jointly.
Causal Discovery from Irregular Time-series. Irregular time-series are very common in real
scenarios, causal discovery addressing such data remains somewhat under-explored. When con-
fronted with data missing, directly conducting causal inference might suffer from significant error
(Runge, 2018a; Hyttinen et al., 2016). Although joint data imputation and causal discovery has
been explored in static settings (Tu et al., 2019; Gain & Shpitser, 2018; Morales-Alvarez et al.,
2022; Geffner et al., 2022), it is still under explored in time series causal discovery. There are
mainly two solutions—either discovering causal relations with available observed incomplete data
(Gain & Shpitser, 2018; Strobl et al., 2018) or filling missing values before causal discovery (Wang
et al., 2020; Huang et al., 2020). To infer causal graphs from partially observed time-series, sev-
eral algorithms are proposed, such as Expectation-Maximization approach (Gong et al., 2015), La-
tent Convergent Cross Mapping (Brouwer et al., 2021), Neural-ODE based approach (Bellot et al.,
2022), Partial Canonical Correlation Analysis (Partial CCA), Generalized Lasso Granger (GLG)
(Iseki et al., 2019), etc. Some other researchers introduce data imputation before causal discovery
and have made progress recently. For example, Cao et al. (2018) learn to impute values via itera-
tively applying RNN and Cini et al. (2022) use Graph Neural Networks, while a recently proposed
data completion method by Chen et al. (2022) uses Gaussian Process Regression. In this paper, we
use a deep neural network similar to Cao et al. (2018)’s work, but differently, we propose to impute
missing data points and discover causal graphs jointly instead of sequentially. Moreover, these two
processes mutually improve each other and achieve high performance.

3 P ROBLEM F ORMULATION

3.1 N ONLINEAR S TRUCTURAL C AUSAL M ODELS WITH IRREGULAR O BSERVATION

Let us denote by X = {x1:L,i }N i=1 a uniformly sampled observation of a dynamic system, in which
xt represents the sample vector at time point t and consists of N variables {xt,i }, with t ∈ {1, ..., L}
and i ∈ {1, ..., N }. In this paper, we adopt the representation proposed by Tank et al. (2022) and
Khanna & Tan (2020), and assume each sampled variable xt,i be generated by the following model
xt,i = fi (xt−τ :t−1,1 , xt−τ :t−1,2 , ..., xt−τ :t−1,N ) + et,i , i = 1, 2, ..., N. (1)
Here τ denotes the maximal time lag. In this paper, we focus on dealing with causal inference from
irregular time series, and use a bi-value observation mask ot,i to label the missing entries, i.e., the
∆
et,i = xt,i · ot,i . In this paper we
observed vector equals to its latent version when ot,i equals to 1: x
consider two types of recurrent data missing in practical observations:
Random Missing. The ith data point in the observations are missing with a certain probability pi ,
here in our experiments the missing probability follows Bernoulli distribution ot,i ∼ Ber(1 − pi ).

P∞ Ti . We model the
Periodic Missing. Different variables are sampled with their own periods
sampling process for ith variable with an observation function ot,i = n=0 δ(t − nTi ), Ti =
1, 2, ... with δ(·) denoting the Dirac’s delta function.

3.2 N ONLINEAR G RANGER C AUSALITY

For a dynamic system, time-series i Granger causes time-series j when the past values of time-series
xi aid in the prediction of the current and future status of time-series xj . The standard Granger
causality is defined for linear relation scenarios, but recently extended to nonlinear relations:

Definition 1 Time-series i Granger cause j if and only if there exists x0t−τ :t−1,i 6= xt−τ :t−1,i ,

fj (xt−τ :t−1,1 , ..., x0t−τ :t−1,i , ..., xt−τ :t−1,N ) 6=

(2)
fj (xt−τ :t−1,1 , ..., xt−τ :t−1,i , ..., xt−τ :t−1,N )
i.e., the past data points of time-series i influence the prediction of xt,j .

3
Published as a conference paper at ICLR 2023

(a) 𝒏𝟏 iterations 𝒏𝟐 iterations 𝒏𝟑 iterations

Non-imputed and Non-supervisory Imputed but Non-supervisory Imputed and Supervisory

(b)
Latent Data Prediction Stage Causal Graph Fitting Stage
Unobserved Value
𝑓 𝑓
Observed Value
Data Imputation
𝑓 𝑓
𝑴
… … Imputed Value

𝑓 𝑓 … … 𝑴
Neural Network
ℒ Delayed Supervision ℒ … … 𝑴

Figure 1: Illustration of the proposed CUTS, with a 3-variable example. (a) Illustration of our learn-
ing strategy described in Section 4.3, with three groups of iterations being of the same alternation
scheme shown in (b) but different settings in data imputation and supervised model learning. (b)
Illustration of each iteration in CUTS. The dynamics reflected by the observed time-series x1 and
x2 are described by DSGNN in the Latent data prediction stage (left). With the modeled dynam-
ics, unobserved data points are imputed (center) and fed into the Causal graph fitting stage for an
improved graph inference (right).
Granger causality is highly compatible with neural networks (NN). Considering the universal ap-
proximation ability of NN (Hornik et al., 1989), it is possible to fit a causal relationship function
with component-wise MLPs or RNNs. Imposing a sparsity regularizer onto the weights of net-
work connections, as mentioned by Tank et al. (2022) and Khanna & Tan (2020), NNs can learn
the causal relationships among all N variables. The inferred pair-wise Granger causal relationships
can then be aggregated into a Directed Acyclic Graph (DAG), represented as an adjacency matrix
A = {aij }N i,j=1 , where aij = 1 denotes time-series i Granger causes j and aij = 0 means oth-
erwise. This paradigm is well explored and shows convincing empirical evidence in recent years
(Tank et al., 2022; Khanna & Tan, 2020; Löwe et al., 2022).
Although Granger causality is not necessarily the true causality, Peters et al. (2017) provide justifi-
cation of (time-invariant) Granger causality when assuming no unobserved variables and no instan-
taneous effects, as is mentioned by Löwe et al. (2022) and Vowels et al. (2021).
In this paper, we propose a new inference approach to successfully identify causal relationships from
irregular time-series data.

4 I RREGULAR T IME - SERIES C AUSAL D ISCOVERY

CUTS implements the causal graph as a set of Causal Probability Graphs (CPGs) G =
τ
hX , {Mτ }τmax
=0 i where the element mτ,ij ∈ Mτ represents the probability of causal influence from
xt−τ,i to xt,j , i.e. mτ,ij = p(xt−τ,i → xt,j ). Since we assume no instantaneous effects, time-
series i Granger cause j if and only if there exist causal relations on at least one time lag, we define
our discovered causal graph Ã to be the maximum value across all time lags τ ∈ {1, ..., τmax }
ãi,j = max (m1,ij , ..., mτmax ,ij ) . (3)
Specifically, if ãi,j is penalized to zero (or below certain threshold), we deduce that time-series i
does not influence the prediction of time-series j, i.e., i does not Granger cause j.
During training, we alternatively learn the prediction model and CPG matrix, which are respec-
tively implemented by Latent data prediction stage and Causal graph fitting stage. Besides, proper
learning strategies are designed to facilitate convergence.

4.1 L ATENT DATA P REDICTION S TAGE

The proposed Latent data prediction stage is designed to fit the data generation function for time-
series i with a neural network fφi , which takes into account its parent nodes in the causal graph.
Here we propose Delayed Supervision Graph Neural Network (DSGNN) for imputing the missing
entries in the observation.

4
Published as a conference paper at ICLR 2023

The inputs to DSGNN include all the historical data points (with a maximum time lag τmax )
xt−τ :t−1,i and the discovered CPGs. During training we sample the causal graph with Bernoulli
distribution, in a similar manner to Lippe et al. (2021)’s work, and the prediction x̂ is the output of
the neural network fφi
x̂t,i = fφi (X S) = fφi (xt−τ :t−1,1 s1:τ,1i , ..., xt−τ :t−1,N s1:τ,N i ), (4)
τ =τ
where S = {Sτ }τ =1max , and sτ,ij ∼ Ber(1 − mτ,ij ) and denotes the Hadamard product. S
is sampled for each training sample in a mini-batch. The fitting is done under supervision from
the observed data points. Specifically, we update the network parameters φi by minimizing the
following loss function
N
X hL2 (x̂1:L,i , x̃1:L,i ) , o1:L,i i
Lpred X̃ , X̂ , O = 1 (5)
i=1 L ho1:L,i , o1:L,i i
where oi denotes the observation mask, h·i is the dot product, and L2 represents the MSE loss
function. Then, the data imputation is performed with the following equation
(m) (m)
(
(m+1) (1 − α)x̃t,i + αx̂t,i ot,i = 0 and m ≥ n1
x̃t,i = (6)
0
x̃t,i ot,i = 1 or m < n1
(0)
Here m indexes the iteration steps, and x̃t,i denotes the initial data (unobserved entries filled with
zero order holder). α is selected to prevent the abrupt change of imputed data. For the missing
(m) (m)
points, their predicted value x̂t,i is unsupervised with L but updated to x̃t,i to obtain a “delayed”
error in causal graph inference. Moreover, we impute the missing values with the help of discovered
CPG G (sampled with Bernoulli Distribution), as illustrated in Figure 1 (b), which is proved to
significantly improve performance in experiments.

4.2 C AUSAL G RAPH D ISCOVERY S TAGE

After imputing the missing time-series, we proceed to learn CPG in the Causal graph fitting stage, to
determine the causal probability p(xt−τ,i → xt,j ) = mτ,ij , we model this likelihood with mτ,ij =
σ(θτ,ij ) where σ(·) denotes the sigmoid function and θ is the learned parameter set. Since we
assume no instantaneous effect, it is unnecessary to learn the edge direction in CPG.
In this stage we optimize the graph parameters θ by minimizing the following objective

Lgraph X̃ , X̂ , O, θ = Lpred X̃ , X̂ , O + λ||σ(θ)||1 , (7)
where Lpred is the squared error loss penalizing prediction error defined in Equation (5) and || · ||1
being the L1 regularizer to enforce sparse connections on the learned CPG. If ∀τ ∈ [1, τmax ], θτ,ij
are penalized to −∞ (and mτ,ij → 0), then we deduce that time-series i does not Granger cause j.

4.3 T HE L EARNING S TRATEGY.

The overall learning process consists of n = n1 + n2 + n3 epochs, which is illustrated in Figure 1

(a): in the first n1 epochs DSGNN and CPG are optimized without data imputation (missing entries
are set with initial guess); in the next n2 epochs the iterative model learning continues with data
imputation, but the imputed data are not used for model supervision; for the last n3 epochs the
learned CPG is refined based on supervision from all the data points (including the imputed ones).
Fine-tuning. The main training process is the alternation between Latent data prediction stage and
Causal graph fitting stage. Considering that after sufficient iterations (here n1 + n2 ) the unobserved
data points can be reliably imputed with the discovery of causal relations, and we can incorporate
these predicted points to supervise the model and fine-tune the parameters to improve the perfor-
mance further. In the last n3 epochs CPG is optimized with the loss function

Lf t X̃ , X̂ = L2 (x̂1:L,i , x̃1:L,i ) + λ||σ(θ)||1 . (8)

Parameter Settings. During training the τ value for Gumbel Softmax is initially set to a relatively
high value and annealed to a low value in the first n1 +n2 epochs and then reset for the last n3 epochs.
The learning rates for Latent data prediction stage and Causal graph fitting stage are respectively
set as lrdata and lrgraph and gradually scheduled to 0.1lrdata and 0.1lrgraph during all n1 + n2 + n3
epochs. The detailed hyperparameter settings are listed in Appendix Section A.3.

5
Published as a conference paper at ICLR 2023

4.4 C ONVERGENCE C ONDITIONS FOR G RANGER C AUSALITY.

We show in Theorem 1 that under certain assumptions, the discovered causal adjacency matrix will
converge to the true Granger causal matrix.

Theorem 1 Given a time-series dataset X = {x1:L,i }N

i=1 generated with Equation 1, we have

1. ∃λ, ∀τ ∈ {1, .., τmax }, causal probability matrix element mτ,ij = σ(θτ,ij ) converges to 0
if time-series i does not Granger cause j, and
2. ∃τ ∈ {1, .., τmax }, mτ,ij converges to 1 if time-series i Granger cause j,

if the following two conditions hold:

1. DSGNN fφi in Latent data prediction stage model generative function fi with an error
smaller than arbitrarily small value eNN,i ;
2. ∃λ0 , ∀i, j = 1, ..., N, kfφj (X Sτ,ij=1 ) − fφj (X Sτ,ij=0 )k22 > λ0 , where Sτ,ij=l is set
S with element sτ,ij = l.

The implications behind these two conditions can be intuitively explained. Assumption 1 is intrinsi-
cally the Universal Approximation Theorem (Hornik et al., 1989) of neural network, i.e., the network
is of an appropriate structure and fed with sufficient training data. Assumption 2 means there exists
a threshold λ0 to binarize kfφi (X Sτ,ij=1 ) − fφi (X Sτ,ij=0 )k, serving as an indicator as to
whether time-series j contributes to prediction of i.
The proof of Theorem 1 is detailed in Appendix Section A.1. Although the convergence condition is
relevant to the appropriate setting of λ, we will show in Appendix Section A.4.6 that our algorithm
is robust to the setting changes of λ over a wide range.

5 E XPERIMENTS
Datasets. We evaluate the performance of the proposed causal discovery approach CUTS on
both numerical simulation and real-scenario inspired data. The simulated datasets come from a
linear Vector Autoregressive (VAR) model and a nonlinear Lorenz-96 model (Karimi & Paul, 2010),
while the real-scenario inspired datasets are from NetSim (Smith et al., 2011), an fMRI dataset
describing the connecting dynamics of 15 human brain regions. The irregular observations are
generated according to the following mechanisms: Random Missing (RM) is simulated by sampling
over a uniform distribution with missing probability pi ; Periodic Missing (PM) is simulated with
sampling period Ti randomly chosen for each time-series with the maximum period being Tmax .
For statistical quantitative evaluation of different causal discovery algorithms, we take average over
multiple pi and Ti in our experiments.
Baseline Algorithms. To demonstrate the superiority of our approach, we compare with five
baseline algorithms: (i) Neural Granger Causality (NGC, Tank et al. (2022)), which utilizes MLPs
and RNNs combined with weight penalties to infer Granger causal relationships, in the experiments
we use the component-wise MLP model; (ii) economy-SRU (eSRU, Khanna & Tan (2020)), a vari-
ant of SRU that is less prone to over-fitting when inferring Granger causality; (iii) PCMCI (proposed
by Runge et al.), a non-Granger-causality-based method in which we use conditional independence
tests provided along with its repository1 , i.e., ParCorr (linear partial correlation) for conditional
independence tests for linear scenarios and GPDC (Gaussian Process regression and Distance Cor-
relation Rasmussen (2003) Székely et al. (2007)) for nonlinear scenarios. (iv) Latent Convergent
Cross Mapping (LCCM, Brouwer et al. (2021)), a CCM-based approach that also tackles the irreg-
ular time-series problem. (v) Neural Graphical Model (NGM, Bellot et al. (2022)) which is based
on Neural Ordinary Differential Equations (Neural-ODE) to solve the irregular time-series problem.
In terms of quantitative evaluation, we use area under the ROC curve (AUROC) as the criterion.
For NGC, AUROC values are computed by running the algorithm with λ varying within a range
of values. For eSRU, PCMCI, LCCM, and NGM, the AUROC values are obtained with different
thresholds. For a fair comparison, we applied parameter searching to determine the hyperparameters
1
https://github.com/jakobrunge/tigramite

6
Published as a conference paper at ICLR 2023

VAR Simulation Lorenz-96 Simulation

Time-series Groundtruth CPG Time-series Groundtruth CPG

Figure 2: Examples of our simulated VAR and Lorenz-96 datasets, with two of the total 10 gener-
ated time-series from the groundtruth CPG plotted as orange and blue solid lines, while the non-
uniformly sampled points are labeled with scattered points.

of the baseline algorithms with the best performance. For baseline algorithms unable to handle ir-
regular time-series data, i.e., NGC, PCMCI, and eSRU, we imputed the irregular time-series before
feeding them to causal discovery modules, and use three data imputation algorithms, i.e., Zero-
order Holder (ZOH), Gaussian Process Regression (GP), and Multivariate Time Series Imputation
by Graph Neural Network (GRIN, Cini et al. (2022)).

5.1 VAR S IMULATION DATASETS

VAR datasets are simulated following

τX
max

xt = Aτ xt−τ + et , (9)
τ =1

where the matrix Aτ is the sparse autoregressive coefficients for time lag τ . Time-series i
Granger cause time-series j if ∃τ ∈ {1, ..., τmax } , aτ,ij > 0. The objective of causal dis-
covery is to reconstruct the non-zero elements in causal graph A (where each element of A
aij = max(a1,ij , ..., aτmax ,ij )) with Ã. We set τmax = 3, N = 10 and time-series length
L = 10000 in this experiment. For missing mechanisms, we set p = 0.3, 0.6, respectively for
Random Missing and Tmax = 2, 4 respectively for Periodic Missing. Experimental results are
shown in the upper half of Table 1. We can see that CUTS beats PCMCI, NGC, and eSRU com-
bined with ZOH, GP, and GRIN in most cases, except for the case of VAR with random missing
(p = 0.3) where PCMCI + GRIN is better by only a small margin (+0.0012). The superiority is
especially prominent when with a larger percentage of missing values (p = 0.6 for random missing
and Tmax = 4 for periodic missing). Differently, data imputation algorithms GP and GRIN provide
performance gain in some scenarios but fail to boost causal discovery in others. This indicates that
simply combining previous data imputation algorithms with causal discovery algorithms cannot give
stable and promising results, and is thus less practical than our approach. We also beat LCCM and
NGM which originally tackles the irregular time series problem by a clear margin. This hampered
performance may be attributed to the fact that LCCM and NGM both utilize Neural-ODE to model
the dynamics and do not cope with VAR datasets well.

5.2 L ORENZ -96 S IMULATION DATASETS

Lorenz-96 datasets are simulated according to

dxt,i
= −xt,i−1 (xt,i−2 − xt,i+1 ) − xt,i + F, (10)
dt
where −xt,i−1 (xt,i−2 −xt,i+1 ) is the advection term, xt,i is the diffusion term, and F is the external
forcing (a larger F implies a more chaotic system). In this Lorenz-96 model each time-series xi is
affected by historical values of four time-series xi−2 , xi−1 , xi , xi+1 , and each row in the ground
truth causal graph A has four non-zero elements. Here we set the maximal time-series length L =
1000, N = 10, force constant F = 10 and show experimental results for F = 40 in the Appendix
Section A.4.7. From the results in the lower half of Table 1, one can draw similar conclusions to
those on VAR datasets: CUTS outperforms baseline causal discovery methods either with or without
data imputation.

5.3 N ET S IM DATASETS

7
Published as a conference paper at ICLR 2023

Table 1: Performance comparison of CUTS with (i) PCMCI, eSRU, NGC combined with imputation
method ZOH, GP, GRIN and (ii) LCCM, NGM which do not need data imputation. Experiments
are performed on VAR and Lorenz-96 datasets in terms of AUROC. Results are averaged over 10
randomly generated datasets.
VAR with Random Missing VAR with Periodic Missing
Methods Imputation
p = 0.3 p = 0.6 Tmax = 2 Tmax = 4
ZOH 0.9904 ± 0.0078 0.9145 ± 0.0204 0.9974 ± 0.0040 0.9787 ± 0.0196
PCMCI GP 0.9930 ± 0.0072 0.8375 ± 0.0651 0.9977 ± 0.0038 0.9332 ± 0.1071
GRIN 0.9983 ± 0.0028 0.9497 ± 0.0132 0.9989 ± 0.0017 0.9774 ± 0.0169
ZOH 0.9899 ± 0.0105 0.9325 ± 0.0266 0.9808 ± 0.0117 0.9439 ± 0.0264
NGC GP 0.9821 ± 0.0097 0.5392 ± 0.1176 0.9833 ± 0.0108 0.7350 ± 0.2260
GRIN 0.8186 ± 0.1720 0.5918 ± 0.1170 0.8621 ± 0.0661 0.6677 ± 0.1350
ZOH 0.9760 ± 0.0113 0.8464 ± 0.0299 0.9580 ± 0.0276 0.9214 ± 0.0257
eSRU GP 0.9747 ± 0.0096 0.8988 ± 0.0301 0.9587 ± 0.0191 0.8166 ± 0.1085
GRIN 0.9677 ± 0.0134 0.8399 ± 0.0242 0.9740 ± 0.0150 0.8574 ± 0.0869
LCCM 0.6851 ± 0.0411 0.6530 ± 0.0212 0.6462 ± 0.0225 0.6388 ± 0.0170
NGM 0.7608 ± 0.0910 0.6350 ± 0.0770 0.8596 ± 0.0353 0.7968 ± 0.0305
CUTS (Proposed) 0.9971 ± 0.0026 0.9766 ± 0.0074 0.9992 ± 0.0016 0.9958 ± 0.0069
Lorenz-96 with Random Missing Lorenz-96 with Periodic Missing
Methods Imputation
p = 0.3 p = 0.6 Tmax = 2 Tmax = 4
ZOH 0.8173 ± 0.0491 0.7275 ± 0.0534 0.7229 ± 0.0348 0.7178 ± 0.0668
PCMCI GP 0.7545 ± 0.0585 0.7862 ± 0.0379 0.7782 ± 0.0406 0.7676 ± 0.0360
GRIN 0.8695 ± 0.0301 0.7544 ± 0.0404 0.7299 ± 0.0545 0.7277 ± 0.0947
ZOH 0.9933 ± 0.0058 0.9526 ± 0.0220 0.9903 ± 0.0096 0.9776 ± 0.0120
NGC GP 0.9941 ± 0.0064 0.5000 ± 0.0000 0.9949 ± 0.0050 0.7774 ± 0.2300
GRIN 0.9812 ± 0.0105 0.7222 ± 0.0680 0.9640 ± 0.0193 0.8430 ± 0.0588
ZOH 0.9968 ± 0.0038 0.9089 ± 0.0261 0.9958 ± 0.0031 0.9815 ± 0.0148
eSRU GP 0.9977 ± 0.0035 0.9597 ± 0.0169 0.9990 ± 0.0015 0.9628 ± 0.0371
GRIN 0.9937 ± 0.0071 0.9196 ± 0.0251 0.9873 ± 0.0110 0.8400 ± 0.1451
LCCM 0.7168 ± 0.0245 0.6685 ± 0.0311 0.7064 ± 0.0324 0.7129 ± 0.0235
NGM 0.9180 ± 0.0199 0.7712 ± 0.0456 0.9751 ± 0.0112 0.9171 ± 0.0189
CUTS (Proposed) 0.9996 ± 0.0005 0.9705 ± 0.0118 1.0000 ± 0.0000 0.9959 ± 0.0042

To validate the performance of CUTS on real- Table 2: Quantitative results on NetSim dataset.
scenario data, We use data from 10 humans in Results averaged over 10 human brain subjects.
NetSim datasets2 , which is generated with syn- NetSim with Random Missing
thesized dynamics of brain region connectivity Met. Imp.
p = 0.1 p = 0.2
and unknown to us and the algorithm. The total ZOH 0.7625 ± 0.0539 0.7455 ± 0.0675
length of each time-series data L = 200 and the PCMCI GP 0.7462 ± 0.0396 0.7551 ± 0.0451
number of time-series N = 15. By testing our GRIN 0.7475 ± 0.0517 0.7353 ± 0.0611
CUTS on this dataset we show that our algo- ZOH 0.7656 ± 0.0576 0.7668 ± 0.0403
rithm is capable of discovering causal relations NGC GP 0.7506 ± 0.0532 0.7545 ± 0.0518
with irregular time-series data for scientific dis- GRIN 0.6744 ± 0.0743 0.5826 ± 0.0476
covery. However, L = 200 is a small data size, ZOH 0.6384 ± 0.0473 0.6592 ± 0.0248
therefore we only perform experiments with the eSRU GP 0.6147 ± 0.0454 0.6330 ± 0.0449
Random Missing situation. Experimental re- GRIN 0.6141 ± 0.0529 0.5818 ± 0.0588
sults shown in Table 2 tell that our approach LCCM 0.7711 ± 0.0301 0.7594 ± 0.0246
beats all existing methods on both missing pro- NGM 0.7417 ± 0.0380 0.7215 ± 0.0330
portions. CUTS 0.7948 ± 0.0381 0.7699 ± 0.0550

5.4 A BLATION S TUDIES

Besides demonstrating the advantageous performance of the final results, we further conducted a
series of ablation studies to quantitatively evaluate the contributions of the key technical designs or
learning strategies in CUTS. Due to page limit, we only show experiments on Lorenz-96 datasets
with Random Missing settings in this section, and leave the other results in the Appendix Section
A.4.2.
Causal Discovery Boosts Data Imputation. To validate that Latent data prediction stage helps
Causal graph fitting stage, we reset CPGs Mτm to all-one matrices in Latent data prediction
2
Shared at https://www.fmrib.ox.ac.uk/datasets/netsim/sims.tar.gz

8
Published as a conference paper at ICLR 2023

Table 3: Quantitative results of ablation studies. “CUTS (Full)” denotes the default settings in this
paper. Here we run experiments on Lorenz-96 datasets. Ablation study results on other datasets are
provided in Appendix Section A.4.2.
Lorenz-96 with Random Missing Lorenz-96 with Periodic Missing
Methods
p = 0.3 p = 0.6 Tmax = 2 Tmax = 4
CUTS (Full) 0.9996 ± 0.0005 0.9705 ± 0.0118 1.0000 ± 0.0000 0.9959 ± 0.0042
ZOH for Imputation 0.9799 ± 0.0071 0.8731 ± 0.0312 0.9981 ± 0.0021 0.9865 ± 0.0128
GP for Imputation 0.9863 ± 0.0058 0.8575 ± 0.0536 0.9965 ± 0.0036 0.9550 ± 0.0407
GRIN for Imputation 0.9793 ± 0.0126 0.8983 ± 0.0299 0.9869 ± 0.0101 0.9325 ± 0.0415
No Imputation 0.9898 ± 0.0045 0.9206 ± 0.0216 0.9968 ± 0.0032 0.9797 ± 0.0204
Remove CPG for Imput. 0.9972 ± 0.0021 0.9535 ± 0.0167 0.9989 ± 0.0011 0.9926 ± 0.0045
No Finetuning Stage 0.9957 ± 0.0036 0.9665 ± 0.0096 0.9980 ± 0.0025 0.9794 ± 0.0124

stage and then x̂t,i is predicted with all time-series instead of only the parent nodes. This ex-
periment is shown as “Remove CPG for Imput.” in Table 6. It is observed that introducing CPGs in
data imputation is especially helpful with large quantities of missing values (p = 0.6 for Random
Missing or Tmax = 4 for Periodic Missing). Comparing with the scores in the first row, we can see
that introducing CPGs in data imputation boosts AUROC by 0.0011 ∼ 0.0170.
Data Imputation Boosts Causal Discovery. To show that Causal graph fitting stage helps Latent
data prediction stage, we disable data imputation operation defined in Equation 6, i.e., α = 0. In
other words, Causal graph fitting stage is performed with just the initially filled data (Appendix
Section A.3.2), with the results shown as “No Imputation” in Table 6. Compared with the first row,
we can see that introducing data imputation boosts AUROC by 0.0032 ∼ 0.0499. We further replace
our data imputation module with baseline modules (ZOH, GP, GRIN) to show the effectiveness of
our design. It is observed that our algorithm beats “ZOH for Imputation”, “GP for Imputation”,
“GRIN for Imputation” in most scenarios.
Finetuning Stage Raises Performance. We disable the finetuning stage and find that the per-
formance drops slightly, as shown in the “No Finetuning Stage” row in Table 6. In other words, the
finetuning stage indeed helps to refine the causal discovery process.

5.5 A DDITIONAL E XPERIMENTS

We further conduct additional experiments in Appendix to show experiments on more datasets (Ap-
pendix Section A.4.1), ablation study for choice of epoch numbers (Appendix Section A.4.3), ab-
lation study results on VAR and NetSim datasets (Appendix Section A.4.2), performance on 3-
dimensional temporal causal graph (Appendix Section A.4.4), CUTS’s performance superiority on
regular time-series (Appendix Section A.4.5), robustness to different noise levels (Appendix Section
A.4.8), robustness to hyperparameter settings (Appendix Section A.4.6), and results on Lorenz-96
with forcing constant F = 40 (Appendix Section A.4.7). We further provide implementation details
and hyperparameters settings of CUTS and baseline algorithms in Appendix Section A.3, and the
pseudocode of our approach in Appendix Section A.5.

6 C ONCLUSIONS

In this paper we propose CUTS, a time-series causal discovery method applicable for scenarios
with irregular observations with the help of nonlinear Granger causality. We conducted a series of
experiments on multiple datasets with Random Missing as well as Periodic Missing. Compared with
previous methods, CUTS utilizes two alternating stages to discover causal relations and achieved
superior performance. We show in the ablation section that these two stages mutually boost each
other to achieve an improved performance. Moreover, our CUTS is widely applicable for time-
series with different lengths, scales well to large sets of variables, and is robust to noise. Our code
is publicly available at https://github.com/jarrycyx/unn.
In this work we assume no latent confounder and no instantaneous effect for Granger causality. Our
future works includes: (i) Causal discovery in the presence of latent confounder or instantaneous
effect. (ii) Time-series imputation with causal models.

9
Published as a conference paper at ICLR 2023

R EPRODUCIBILITY S TATEMENT
For the purpose of reproducibility, we include the source code in the supplementary files, and will
published on GitHub upon acceptance. Datasets generation process is also included in source code.
Moreover, we provide all hyperparameters used for all methods in Appendix Section A.4.6. The
experiments are deployed on a server with Intel Core CPU and NVIDIA RTX3090 GPU.

ACKNOWLEDGMENTS
This work is jointly funded by Ministry of Science and Technology of China (Grant No.
2020AAA0108202), National Natural Science Foundation of China (Grant No. 61931012 and
62088102), Beijing Natural Science Foundation (Grant No. Z200021), and Project of Medical En-
gineering Laboratory of Chinese PLA General Hospital (Grant No. 2022SYSZZKY21).

R EFERENCES
Alexis Bellot, Kim Branson, and Mihaela van der Schaar. Neural graphical modelling in continuous-
time: Consistency guarantees and algorithms. In International Conference on Learning Repre-
sentations, February 2022.
Zsigmond Benkő, Ádám Zlatniczki, Marcell Stippinger, Dániel Fabó, András Sólyom, Loránd
Erőss, András Telcs, and Zoltán Somogyvári. Complete Inference of Causal Relations between
Dynamical Systems, February 2020.
Edward De Brouwer, Adam Arany, Jaak Simm, and Yves Moreau. Latent Convergent Cross Map-
ping. In International Conference on Learning Representations, March 2021.
Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, and Yitan Li. BRITS: Bidirectional recurrent
imputation for time series. In Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc., 2018.
Haonan Chen, Bo Yuan Chang, Mohamed A. Naiel1, Georges Younes, Steven Wardell, Stan
Kleinikkink, and John S. Zelek. Causal discovery from sparse time-series data using echo state
network, January 2022.
Andrea Cini, Ivan Marisca, and Cesare Alippi. Filling the G ap s: Multivariate time series im-
putation by graph neural networks. In International Conference on Learning Representations,
February 2022.
Doris Entner and Patrik O. Hoyer. On causal discovery from time series data using FCI. Probabilistic
graphical models, pp. 121–128, 2010.
Alexander Gain and Ilya Shpitser. Structure learning under missing data. In International Confer-
ence on Probabilistic Graphical Models, pp. 121–132. PMLR, 2018.
Tomas Geffner, Javier Antoran, Adam Foster, Wenbo Gong, Chao Ma, Emre Kiciman, Amit Sharma,
Angus Lamb, Martin Kukla, Nick Pawlowski, Miltiadis Allamanis, and Cheng Zhang. Deep End-
to-end Causal Inference, June 2022.
Andreas Gerhardus and Jakob Runge. High-recall causal discovery for autocorrelated time series
with latent confounders. In Advances in Neural Information Processing Systems, volume 33, pp.
12615–12625. Curran Associates, Inc., 2020.
Mingming Gong, Kun Zhang, Bernhard Schoelkopf, Dacheng Tao, and Philipp Geiger. Discover-
ing temporal causal relations from subsampled data. In Proceedings of the 32nd International
Conference on Machine Learning, pp. 1898–1906. PMLR, June 2015.
C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods.
Econometrica, 37(3):424–438, 1969. ISSN 0012-9682. doi: 10.2307/1912791.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are uni-
versal approximators. Neural networks, 2(5):359–366, 1989.

10
Published as a conference paper at ICLR 2023

Patrik Hoyer, Dominik Janzing, Joris M Mooij, Jonas Peters, and Bernhard Schölkopf. Nonlinear
causal discovery with additive noise models. In Advances in Neural Information Processing
Systems, volume 21. Curran Associates, Inc., 2008.
Xiaoshui Huang, Fujin Zhu, Lois Holloway, and Ali Haidar. Causal discovery from incomplete data
using an encoder and reinforcement learning, June 2020.
Antti Hyttinen, Sergey Plis, Matti Järvisalo, Frederick Eberhardt, and David Danks. Causal discov-
ery from subsampled time series data by constraint optimization. In Proceedings of the Eighth
International Conference on Probabilistic Graphical Models, pp. 216–227. PMLR, August 2016.
Akane Iseki, Yusuke Mukuta, Yoshitaka Ushiku, and Tatsuya Harada. Estimating the causal effect
from partially observed time series. Proceedings of the AAAI Conference on Artificial Intelligence,
33(01):3919–3926, July 2019. ISSN 2374-3468. doi: 10.1609/aaai.v33i01.33013919.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax.
November 2016. doi: 10.48550/arXiv.1611.01144.
A. Karimi and M. R. Paul. Extensive chaos in the lorenz-96 model. Chaos: An Interdisciplinary
Journal of Nonlinear Science, 20(4):043105, December 2010. ISSN 1054-1500. doi: 10.1063/1.
3496397.
Saurabh Khanna and Vincent Y. F. Tan. Economy statistical recurrent units for inferring nonlinear
granger causality. In International Conference on Learning Representations, March 2020.
Phillip Lippe, Taco Cohen, and Efstratios Gavves. Efficient neural causal discovery without acyclic-
ity constraints. In International Conference on Learning Representations, September 2021.
Sindy Löwe, David Madras, Richard Zemel, and Max Welling. Amortized causal discovery: Learn-
ing to infer causal graphs from time-series data. In Proceedings of the First Conference on Causal
Learning and Reasoning, pp. 509–525. PMLR, June 2022.
Yonghong Luo, Xiangrui Cai, Ying ZHANG, Jun Xu, and Yuan xiaojie. Multivariate Time Series
Imputation with Generative Adversarial Networks. In Advances in Neural Information Processing
Systems, volume 31. Curran Associates, Inc., 2018.
Daniele Marinazzo, Mario Pellicoro, and Sebastiano Stramaglia. Kernel-granger causality and the
analysis of dynamical networks. Physical review E, 77(5):056215, 2008.
Pablo Morales-Alvarez, Wenbo Gong, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Nick
Pawlowski, Miltiadis Allamanis, and Cheng Zhang. Simultaneous Missing Value Imputation and
Structure Learning with Groups, February 2022.
Meike Nauta, Doina Bucur, and Christin Seifert. Causal discovery with attention-based convo-
lutional neural networks. Machine Learning and Knowledge Extraction, 1(1):312–340, March
2019. ISSN 2504-4990. doi: 10.3390/make1010019.
Junier B. Oliva, Barnabás Póczos, and Jeff Schneider. The statistical recurrent unit. In Proceedings
of the 34th International Conference on Machine Learning, pp. 2671–2680. PMLR, July 2017.
Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Geor-
gatzis, Paul Beaumont, and Bryon Aragam. DYNOTEARS: Structure learning from time-series
data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and
Statistics, pp. 1595–1605. PMLR, June 2020.
Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Founda-
tions and Learning Algorithms. The MIT Press, 2017. ISBN 978-0-262-03731-0 978-0-262-
34429-6.
Robert J. Prill, Daniel Marbach, Julio Saez-Rodriguez, Peter K. Sorger, Leonidas G. Alexopoulos,
Xiaowei Xue, Neil D. Clarke, Gregoire Altan-Bonnet, and Gustavo Stolovitzky. Towards a Rig-
orous Assessment of Systems Biology Models: The DREAM3 Challenges. PLOS ONE, 5(2):
e9202, 2010. ISSN 1932-6203. doi: 10.1371/journal.pone.0009202.

11
Published as a conference paper at ICLR 2023

Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine
Learning, pp. 63–71. Springer, 2003.

J. Runge. Causal network reconstruction from time series: From theoretical assumptions to practical
estimation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 28(7):075310, July 2018a.
ISSN 1054-1500. doi: 10.1063/1.5025050.

Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional

mutual information. In Proceedings of the Twenty-First International Conference on Artificial
Intelligence and Statistics, pp. 938–947. PMLR, March 2018b.

Jakob Runge. Necessary and sufficient graphical conditions for optimal adjustment sets in causal
graphical models with hidden variables. In Advances in Neural Information Processing Systems,
volume 34, pp. 15762–15773. Curran Associates, Inc., 2021.

Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting
and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5
(11):eaau4996. doi: 10.1126/sciadv.aau4996.

Shohei Shimizu, Patrik O. Hoyer, Aapo Hyv&#228, rinen, and Antti Kerminen. A Linear Non-
Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, 7(72):
2003–2030, 2006. ISSN 1533-7928.

Stephen M. Smith, Karla L. Miller, Gholamreza Salimi-Khorshidi, Matthew Webster, Christian F.

Beckmann, Thomas E. Nichols, Joseph D. Ramsey, and Mark W. Woolrich. Network modelling
methods for FMRI. NeuroImage, 54(2):875–891, January 2011. ISSN 1053-8119. doi: 10.1016/
j.neuroimage.2010.08.063.

Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social
science computer review, 9(1):62–72, 1991.

Peter Spirtes, Clark N. Glymour, Richard Scheines, and David Heckerman. Causation, Prediction,
and Search. MIT press, 2000.

Eric V. Strobl, Shyam Visweswaran, and Peter L. Spirtes. Fast causal inference with non-random
missingness by test-wise deletion. International journal of data science and analytics, 6(1):47–
62, 2018.

George Sugihara, Robert May, Hao Ye, Chih-hao Hsieh, Ethan Deyle, Michael Fogarty, and Stephan
Munch. Detecting Causality in Complex Ecosystems. Science, 338(6106):496–500, October
2012. doi: 10.1126/science.1227079.

Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by
correlation of distances. The Annals of Statistics, 35(6):2769–2794, December 2007. ISSN 0090-
5364, 2168-8966. doi: 10.1214/009053607000000505.

Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B. Fox. Neural granger causality.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2022. ISSN
1939-3539. doi: 10.1109/TPAMI.2021.3065601.

Ruibo Tu, Cheng Zhang, Paul Ackermann, Karthika Mohan, Hedvig Kjellström, and Kun Zhang.
Causal discovery in the presence of missing data. In Proceedings of the Twenty-Second Interna-
tional Conference on Artificial Intelligence and Statistics, pp. 1762–1770. PMLR, April 2019.

Matthew J. Vowels, Necati Cihan Camgoz, and Richard Bowden. D’ya like dags? A survey on
structure learning and causal discovery, March 2021.

Yuhao Wang, Vlado Menkovski, Hao Wang, Xin Du, and Mykola Pechenizkiy. Causal discovery
from incomplete data: A deep learning approach, January 2020.

Ian R. White, Patrick Royston, and Angela M. Wood. Multiple imputation using chained equations:
Issues and guidance for practice. Statistics in medicine, 30(4):377–399, 2011.

12
Published as a conference paper at ICLR 2023

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement

learning. Machine learning, 8(3):229–256, 1992.
Alexander P. Wu, Rohit Singh, and Bonnie Berger. Granger causal inference on dags identifies
genomic loci regulating transcription. In International Conference on Learning Representations,
March 2022.
Hao Ye, Ethan R. Deyle, Luis J. Gilarranz, and George Sugihara. Distinguishing time-delayed causal
interactions using convergent cross mapping. Scientific Reports, 5(1):14750, October 2015. ISSN
2045-2322. doi: 10.1038/srep14750.

13
Published as a conference paper at ICLR 2023

A A PPENDIX

C ONTENTS

A.1 Convergence Conditions for Granger Causality . . . . . . . . . . . . . . . . . . . 14

A.1.1 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
A.1.2 The Effects of Data Imputation . . . . . . . . . . . . . . . . . . . . . . . . 15
A.1.3 Convergence of Sigmoidal Gradients . . . . . . . . . . . . . . . . . . . . 16
A.2 An Example for Irregular Time-series Causal Discovery . . . . . . . . . . . . . . . 16
A.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A.3.1 Gumbel Softmax for Causal Graph Fitting . . . . . . . . . . . . . . . . . . 17
A.3.2 Initial Data Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A.3.3 Hyperparameters Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A.4 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
A.4.1 DREAM-3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
A.4.2 Ablation Study on VAR and NetSim datasets . . . . . . . . . . . . . . . . 19
A.4.3 Ablation Study for Epoch Numbers . . . . . . . . . . . . . . . . . . . . . 19
A.4.4 Performance on Temporal Causal Graph . . . . . . . . . . . . . . . . . . . 19
A.4.5 Causal Discovery with Structured Time-series Data . . . . . . . . . . . . . 19
A.4.6 Robustness to Hyperparameters Settings . . . . . . . . . . . . . . . . . . . 19
A.4.7 Lorenz-96 Datasets with F=40 . . . . . . . . . . . . . . . . . . . . . . . . 19
A.4.8 Robustness to Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
A.5 Pseudocode for CUTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
A.6 MSE Curve for Data Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

A.1 C ONVERGENCE C ONDITIONS FOR G RANGER C AUSALITY

A.1.1 P ROOF OF T HEOREM 1

We proved that in Theorem 1 our CUTS can discover the correct Granger causality with the follow-
ing assumptions:

1. DSGNN fφi in Latent data prediction stage model generative function fi with an error
smaller than arbitrarily small value eNN,i ;

2. ∃λ0 , ∀i, j = 1, ..., N, kfφj (X Sτ,ij=1 ) − fφj (X Sτ,ij=0 )k22 > λ0 , where Sτ,ij=l is
set S with element sτ,ij = l.

In Causal graph fitting stage the loss function

N
X hL2 (x̂1:L,i , x̃1:L,i ), oi i
Lgraph (X̃ , X̂ , O, θ) = 1 + λ||σ(θ)||1
i=1 L ho1:L,i , o1:L,i i
(11)
N X
X L
2
= ci ot,i (xt,i − fφj (X S)) + λ||σ(θ)||1
i=1 t=1

14
Published as a conference paper at ICLR 2023

where sτ,ij ∼ Ber(σ(θτ,ij )), ci = ho1:L,iL,o1:L,i i . We use the REINFORCE (Williams, 1992) trick
and mτ,ij 0 s gradient is calculated as
∂ ∂
Es [Lgraph ] = Esτ,ij [ci ot,i (xt,j − fφj (X S))2 log psτ,ij ] + λσ 0 (θτ,ij )
∂θτ.ij τ,ij ∂θτ,ij
1
= λσ 0 (θτ,ij ) + σ(θτ,ij )ci ot,i (xt,j − fφj (X Sτ,ij=1 ))2 σ 0 (θτ,ij )
σ(θτ,ij )
1
+ (1 − σ(θτ,ij ))ci ot,i (xt,j − fφj (X Sτ,ij=0 ))2 σ 0 (θτ,ij )
σ(θτ,ij ) − 1
= σ 0 (θτ,ij )(ci ot,i (xt,j − fφj (X Sτ,ij=1 ))2
− ci ot,i (xt,j − fφj (X Sτ,ij=0 ))2 + λ).
(12)

Where Sτ,ij=l denotes S = {Sτ }ττmax =1 with sτ,ij set to l, and fφj (X Sτ,ij=1 ) = fφj (xt−τ :t−1,1
s1:τ,1i , ..., xt−τ :t−1,N s1:τ,N i ). According to Definition 1, time-series i does not Granger
cause j if ∀τ ∈ {1, ..., τmax }, xt−τ,i is invariant of the prediction of xt,j . Then we have
∀τ ∈ {1, ..., τmax }, fφj (..., xt−τ,i , ...) = fφj (..., 0, ...), i.e., fφj (X Sτ,ij=1 ) = fφj (X Sτ,ij=0 ).
Applying additive noise model (ANM, Equation 1) we can derive that
∂
Es [Lgraph ] = σ 0 (θτ,ij )(ci ot,i (e2t,j − e2t,j )) = λσ 0 (θτ,ij ) > 0. (13)
∂θτ,ij τ,ij

This is a sigmoidal gradient, whose convergence is analyzed in Section A.1.3. Likewise, we have
∃τ ∈ {1, ..., τmax }, fφj (X Sτ,ij=1 ) 6= fφj (X Sτ,ij=0 ) if time-series i Granger cause j, and
∃τ satisfying
∂
Es [Lgraph ] = σ 0 (θτ,ij )(ci ot,j ((xt,j − fφj (X Sτ,ij=1 ))2
∂θτ.ij τ,ij (14)
− (xt,j − fφj (X Sτ,ij=0 ))2 ) + λ).

Assuming that fφj (·) accurately models causal relations in fi (·) (i.e., DSGNN fφi in Latent data pre-
diction stage model generative function fi with an error smaller than arbitrarily small value eNN,i ),
applying Equation 1 we have
∂
Es [Lgraph ] = σ 0 (θτ,ij )(ci ot,j e2t,j − (xt,j − fφj (X Sτ,ij=0 ))2 + λ)

∂θτ,ij τ,ij
(15)
= σ 0 (θτ,ij ) ci ot,j (e2t,j − (et,j + ∆fi,j )2 ) + λ

= σ 0 (θτ,ij )(ci ot,j (−2et,j ∆fi,j − ∆2 fi,j ) + λ),

where noise term et,i ∼ N (0, σ), ∆fi,j = fφj (X Sτ,ij=1 ) − fφj (X Sτ,ij=0 ). This gradient
is expected to be negative when ∀i, j = 1, ..., N, E(ci ∆2 fi,j ) ≥ pλ0 > λ, where p is the missing
probability, i.e., E[ci ] = p (here we only consider the random missing scenario). Since we can
certainly find a λ satisfying the above inequality, θτ,ij will go towards +∞ with a properly chosen λ
and mτ,ij → 1. Moreover, we show in Appendix Section A.4.6 that CUTS is robust to a wide range
of λ values. When applies to real data we use Gumbel Softmax estimator for improved performance
(Jang et al., 2016).

A.1.2 T HE E FFECTS OF DATA I MPUTATION

To show why data imputation boosts causal discovery, we suppose xt−τ 0 ,j , a parent node of xt,i is
unobserved and imperfectly imputed with as x̂t−τ 0 ,j 6= xt−τ 0 ,j . If time-series i Granger cause j,
then f (..., x̂t−τ 0 ,j , ...) 6= f (..., xt−τ 0 ,j , ...). Let δτ 0 ,ij = f (..., xt−τ 0 ,j , ...) − f (..., x̂t−τ 0 ,j , ...), and
∂
Es [Lgraph ] = σ 0 (θτ,ij )(ci ot,i ((et,i + δτ 0 ,ij )2 − (et,i + δτ 0 ,ij + ∆fi,j )2 ) + λ)
∂θτ.ij τ,ij (16)
= σ 0 (θτ,ij )(ci ot,i (−2(et,i + δτ 0 ,ij )∆fi,j − ∆2 fi,j ) + λ)

15
Published as a conference paper at ICLR 2023

The expectation

∂
Eet,i Es [Lgraph ] = σ 0 (θτ,ij )(ci ot,i (−2δτ 0 ,ij ∆fi,j − ∆2 fi,j ) + λ)
∂θτ.ij τ,ij
As a result, if we cannot find a lower bound for δτ 0 ,ij , gradient for θτ,ij is not guaranteed to be
positive or negative and the true Granger causal relation cannot be recovered. On the other hand, if
xt−τ 0 ,j is appropriately imputed with |δτ 0 ,ij | ≤ δ < λ20 , we can find λ < pλ − pδ to insure negative
gradient and θτ,ij will go towards +∞.

A.1.3 C ONVERGENCE OF S IGMOIDAL G RADIENTS

We now analyze the descent algorithm for sigmoidal gradients with learning rate α (for simplicity
we denote θτ,ij as θ):
θk = θk−1 + αλσ 0 (θk−1 )
This is a monotonic increasing sequence. We show that this sequence converges to +∞, ∀α > 0.
If this is not the case, ∃M > 0,s.t. ∀i > 0, we have θi ≤ M , since this sequence is monotonic
increasing, we have
e−θk e−θk e−M
θk+1 = θk + αλ ≥ θ k + αλ ≥ θ k + αλ
(1 + e−θk )2 (1 + e−θ0 ) (1 + e−θ0 )
then ∃k, s.t. θk > M , this contradicts with ”∀i > 0, θi ≤ M ”, then we have θk → +∞ and
for any finite number M , θk can converge to ≥ M in finite steps. And likewise sequence θk =
θk−1 − αλσ 0 (θk−1 ) converges to ≤ −M in finite steps. This enables us to choose a threshold to
classify causal and non-causal edges.

A.2 A N E XAMPLE FOR I RREGULAR T IME - SERIES C AUSAL D ISCOVERY

In this section we provide a simple example for irregular causal discovery and show that our algo-
rithm is capable of recovering causal graphs from irregular time-series. Suppose we have a dataset
with 3 time-series x1 , x2 , x3 , which are generated with
xt,1 = et,1 , xt,2 = f2 (xt−1,2 ) + et,2 , xt,3 = f3 (xt−1,1 , xt−1,2 ) + et,3 , (17)
where e1 , e2 , e3 are the noise terms and follow N (0, σ). We assume only x2 is randomly sampled
with missing probability p2
ot,1 = 1, ot,2 ∼ Ber(1 − p2 ), ot,3 = 1, (18)
where Ber(·) denotes the Bernoulli distribution. Then the groundtruth causal relations can be illus-
trated in Figure 3 (left). We use a DSGNN fφ2 to fit f2 supervised on observed data points of x2 ,
i.e., minφ2 L2 (xt,2 , fφ2 (xt−1,1 )), ∀t, s.t. ot,2 = 1. Given fφ2 , the unobserved values of x2 can be
imputed with x̂t,2 = fφ2 (xt,1 ) and we fit f3 (·) with fφ3 (·) in Latent data prediction stage:
arg min L2 (xt,3 , fφ3 (xt−1,1 , x̂t−1,2 ))
φ3
(19)
= arg min L2 (xt,3 , fφ3 (xt−1,1 , fφ2 (xt−2,1 ))),
φ3

and CPGs Mτ is optimized in Causal graph fitting stage with

3
X
arg min L2 (xt,3 s1,13 , fφ3 (xt−1,1 s1,23 , fφ2 (xt−2,1 ), xt−1,3 s1,33 )) + λ σ(s1,i3 ), (20)
M1 i=1

where s1,ij is sampled with Gumbel Softmax technique denoted with Equation 21. Since xt−1,3 is
invariant to the prediction of xt,3 given xt,1 and xt,2 , s1,33 can be penalized to zero with a proper
λ. Here we conduct an experiment to verify this example. We set L = 10000, random missing
probability p2 = 0.2. The illustration of the discovered causal relations is Figure 3. Results show
that CUTS without data imputation tends to ignore causal relations from x2 (with missing values)
to other time-series. This causal relation x2 → x3 are instead “replaced” by x3 → x3 , which leads
to incorrect causal discovery results.

16
Published as a conference paper at ICLR 2023

𝑡−2 𝑡−1 𝑡 𝑡−2 𝑡−1 𝑡 𝑡−2 𝑡−1 𝑡

𝒙𝟏 𝒙𝟏 𝒙𝟏
True Positive

𝒙𝟐 𝒙𝟐 𝒙𝟐 False Positive

False Negative
𝒙𝟑 𝒙𝟑 𝒙𝟑

Causal Discovery with CUTS

True Causal Relations Causal Discovery with CUTS
(No Data Imputation)

Figure 3: An three-time-series example demonstrating the advantages of introducing data imputa-

tion, with the groundtruth causal graph in the left column. The recovered causal graph without data
imputation (middle column) shows some false positive and false negative edges, while CUTS (right
column) exhibits perfect results.

A.3 I MPLEMENTATION D ETAILS

A.3.1 G UMBEL S OFTMAX FOR C AUSAL G RAPH F ITTING

In our proposed CUTS, causal relations are modeled with Causal Probability Graph (CPGs), which
describe the possibility of Granger causal relations. However, the distributions of CPGs are discrete
and cannot be updated directly with neural networks in Causal graph fitting stage. To achieve
a continuous approximation of the discrete distribution, we leverage Gumbel Softmax technique
(Jang et al., 2016), which can be denoted as

exp((log(mτ,ij ) + g)/τ )
sτ,ij = , (21)
exp((log(mτ,ij ) + g)/τ ) + exp((log(1 − mτ,ij ) + g)/τ )

where g = − log(− log(u)), u ∼ Uniform(0, 1). The parameter τ is set according to the “Gumbel
tau” item in Table 4. During training we first set a relatively large value of τ and decrease it slowly.

A.3.2 I NITIAL DATA F ILLING

The missing data points are filled with Zero-Order Holder (ZOH) before the iterative learning pro-
cess to provide an initial guess x̃(0) . An intuitive solution for initial filling is Linear Interpolation,
but it would hamper successive causal discovery. For example, if xt−2,i and xt,i are observed and
(0)
xt−1,i is missing, xt−1,i is filled as x̃t−1,i = 12 (xt−2,i + xt,i ), then xt,i can be directly predicted
(0)
with 2x̃t−1,i − xt−2,i and other time-series cannot help the prediction of xt,i even if there exists
Granger causal relationships. To show the limitation of filling with linear interpolation, we con-
ducted ablation study on VAR datasets with Random Missing (p = 0.6). In this experiment, initial
data filling with ZOH achieves AUROC of (0.9766 ± 0.0074) while that with Linear interpolation
achieves an inferior accuracy (0.9636 ± 0.0145). This validates that Zero-order Holder is a better
option than linear interpolation as an initial filling implementation.

A.3.3 H YPERPARAMETERS S ETTINGS

To fit data generation function fi we use a DSGNN fφi for each time-series i. Each DSGNN contains
a Multilayer Perceptron (MLP). The layer numbers and hidden layer feature numbers are shown in
Table 4. For activation function we use LeakyReLU (with negative slope of 0.05). During training
we use Adam optimizer and different learning rate for Latent data prediction stage and Causal graph
fitting stage (shown as “Stage 1 Lr” and “Stage 2 Lr” in Table 4) with learning rate scheduler. The
input step for fφi also denotes the chosen max time lag for causal discovery. For VAR and Lorenz-
96 datasets we already know the max time lag of the underlying dynamics (τmax = 3), while for
NetSim datasets this parameter is chosen empirically.

17
Published as a conference paper at ICLR 2023

Table 4: Hyperparameters settings of CUTS in the aforementioned experiments. “a1 → a2 ” means

parameters exponentially increase/decrease from a1 to a2 .
Methods Hyperparam. VAR Lorenz NetSim DREAM-3
n1 5 50 200 20
n2 15 150 600 30
n3 30 300 200 50
α 0.1 0.01 0.01 0.01
Input step 3 3 5 5
Batch size 128 128 128 128
CUTS Hidden features 128 128 128 128
Network layers 3 3 3 5
Weight decay 0.001 0 0.001 0
Stage 1 Lr 10−4 → 10−5 10−4 → 10−5 10−4 → 10−5 10−4 → 10−5
Stage 2 Lr 10−2 → 10−3 10−2 → 10−3 10−2 → 10−3 10−2 → 10−3
Gumbel τ 1 → 0.1 1 → 0.1 1 → 0.1 1 → 0.1
λ 0.1 0.3 5 5

Table 5: Hyperparameters settings of the baseline causal discovery and data imputation algorithms.
Methods Hyperparameters VAR Lorenz NetSim DREAM-3
τmax 3 3 5 5
PCMCI P Cα 0.05 0.05 0.05 0.05
CI Test ParCorr GPDC ParCorr ParCorr
µ1 0.1 0.1 0.1 0.7
Learning rate 0.01 0.01 0.001 0.001
eSRU
Batch size 250 250 100 100
Epochs 2000 2000 2000 2000
Learning rate 0.05 0.05 0.05 0.05
NGC λridge 0.01 0.01 0.01 0.01
λ Sweeping Range 0.02 → 0.2 0.02 → 0.2 0.04 → 0.4 0.02 → 0.01
Epochs 200 200 200 200
GRIN Batch size 128 128 128 128
Window 3 3 3 3
Epochs 50 50 50 50
LCCM Batch size 10 10 10 10
Hidden size 20 20 20 20
Steps 2000 2000 2000 2000
Horizon 5 5 5 5
NGM
GL reg 0.05 0.05 0.05 0.05
Chunk num 100 100 100 46

For baseline algorithm we choose parameters mainly according to the original paper or official
repository (PCMCI3 , eSRU4 , NGC5 , GRIN6 ). For fair comparison, we applied parameter searching
to determine the key hyperparameters of the baseline algorithms with best performance. Tuned
parameters are listed in Table 5.

3
https://github.com/jakobrunge/tigramite
4
https://github.com/sakhanna/SRU for GCI
5
https://github.com/iancovert/Neural-GC
6
https://github.com/Graph-Machine-Learning-Group/grin

18
Published as a conference paper at ICLR 2023

A.4 A DDITIONAL E XPERIMENTS

A.4.1 DREAM-3 E XPERIMENTS

DREAM-3 (Prill et al., 2010) is a gene expression and regulation dataset mentioned in many causal
discovery works as quantitative benchmarks (Khanna & Tan, 2020; Tank et al., 2022). This dataset
contains 5 models, each representing measurements of 100 gene expression levels. Each measured
trajectory has a time length of T = 21. This is too low to perform random missing or periodic
missing experiments, so with DREAM-3 we only compare our approach with baselines in regular
time-series scenarios. The results are shown in Table 11.

A.4.2 A BLATION S TUDY ON VAR AND N ET S IM DATASETS

Besides the ablation studies on Lorenz-96 datasets shown in Table 3, we additionally show those
on VAR and NetSim in Tables 6 and 7. In Table 6, one can see that “CUTS (Full)” beats other
configurations in most scenarios, and the advantage is more obvious with higher missing percentage
(p = 0.6 for Random Missing and Tmax = 4 for Periodic Missing). On the NetSim datasets with a
too small data size L = 200, “CUTS (Full)” beats other configurations at a small missing probability
(p = 0.1).

A.4.3 A BLATION S TUDY FOR E POCH N UMBERS

In our proposed CUTS, each step can be recognized as a refinement of causal discovery, with builds
upon previous imputation results. Since the data imputation and causal discovery mutually boost
each other, the performance may be affected by different settings of learning steps. In Table 8 we
conduct experiments to show the impact of different epoch numbers on VAR, Lorenz-96, and Netsim
datasets. We set n1 , n2 , n3 proportional to original settings.

A.4.4 P ERFORMANCE ON T EMPORAL C AUSAL G RAPH

In the previous experiments, we calculate causal summary graphs with ãi,j = max{mτ,ij }ττmax =1 ,
i.e., maximal causal effects along time axis. Our CUTS also supports discovery of 3-dimensional
temporal graph {mτ,ij }. We conduct experiments to investigate our performance for temporal causal
graph discovery. The results are shown in Table 10.

A.4.5 C AUSAL D ISCOVERY WITH S TRUCTURED T IME - SERIES DATA

We show in this section that CUTS is able to recover causal relations not only with irregular time-
series but also with regular time-series, which is widely used for performance comparison in previ-
ous works. We again tested our algorithm on VAR, Lorenz-96, and NetSim datasets, and the results
are shown in Table 11. It is observed that our algorithm shows superior performance to baseline
methods.

A.4.6 ROBUSTNESS TO H YPERPARAMETERS S ETTINGS

We show that CUTS is robust to changes of hyperparameters settings, with experiment results listed
in Table 12. For existing Granger-causality based methods such as NGC (Tank et al., 2022) and
eSRU (Khanna & Tan, 2020), parameters λ and the maximum time lag τmax are often required to
be tuned precisely. Empirically, λ is chosen to balance between the sparsity of the inferred causal
relationship and data prediction accuracy, and τmax is chosen according to the estimated maximum
time lag. In this work we find our CUTS gives similar causal discovery results across a wide range
of λ (0.01 ∼ 0.3) and τmax (3 ∼ 9).

A.4.7 L ORENZ -96 DATASETS WITH F=40

We further conducted experiments with external forcing constant F = 40 on Lorenz-96 datasets
instead of F = 10 in Section 5.2. We show that our approach produces promising results with
p = 0.3 for random missing and Tmax = 2 for periodic missing, as shown in Table 13 with AUROC
score higher than 0.9.

19
Published as a conference paper at ICLR 2023

Table 6: Quantitation results of ablation studies on VAR dataset. “CUTS (Full)” denotes the default
settings in this paper. The highest scores (or multiple ones with ignorable gaps) of each column are
bolded for clearer illustration.

VAR with Random Missing VAR with Periodic Missing

Methods
p = 0.3 p = 0.6 Tmax = 2 Tmax = 4
CUTS (Full) 0.9971 ± 0.0026 0.9766 ± 0.0074 0.9992 ± 0.0016 0.9958 ± 0.0069
ZOH for Imputation 0.9908 ± 0.0065 0.9109 ± 0.0328 0.9974 ± 0.0020 0.9782 ± 0.0197
GP for Imputation 0.9964 ± 0.0026 0.9240 ± 0.0327 0.9980 ± 0.0018 0.9442 ± 0.0429
GRIN for Imputation 0.9963 ± 0.0047 0.9014 ± 0.0273 0.9992 ± 0.0012 0.9818 ± 0.0174
No Imputation 0.9945 ± 0.0038 0.9624 ± 0.0132 0.9968 ± 0.0032 0.9797 ± 0.0204
Remove CPG for Imput. 0.9975 ± 0.0020 0.9624 ± 0.0132 0.9991 ± 0.0016 0.9906 ± 0.0123
No Finetuning Stage 0.9960 ± 0.0073 0.9736 ± 0.0074 0.9974 ± 0.0032 0.9835 ± 0.0160

Table 7: Quantitation results of ablation studies on NetSim dataset. “CUTS (Full)” denotes the
default settings in this paper.

NetSim with Random Missing

Methods
p = 0.1 p = 0.2
CUTS (Full) 0.7948 ± 0.0381 0.7699 ± 0.0550
ZOH for Imputation 0.7937 ± 0.0349 0.7878 ± 0.0361
GP for Imputation 0.7845 ± 0.0362 0.7890 ± 0.0443
GRIN for Imputation 0.7745 ± 0.0452 0.7553 ± 0.0513
No Imputation 0.7650 ± 0.0272 0.7164 ± 0.0343
Remove CPG for Imput. 0.7912 ± 0.0389 0.7878 ± 0.0361
No Finetuning Stage 0.7650 ± 0.0272 0.7164 ± 0.0343

A.4.8 ROBUSTNESS TO N OISE

We experimentally show that CUTS is robust to noise, as shown in Table 9. We choose the non-
linear Lorenz-96 datasets for this experiment (L = 1000, F = 10) and set additive Gaussian white
noise with standard deviation σ = 0.1, 0.3, 1, respectively.

A.5 P SEUDOCODE FOR CUTS

We provide the pseudocode of two boosting modules of the proposed CUTS in Algorithm 1 and
2 respectively, and the whole iterative framework in 3. Detailed implementation is provided in
supplementary materials and will be uploaded to GitHub soon.

Algorithm 1 Latent data prediction stage

Input: Time series dataset {x1:L,1 , ..., x1:L,N }; observation mask {o1:L,1 , ..., o1:L,N };
Adam optimizer Adam(·)
Output: DSGNNs parameters {φ1 , ..., φN }
for i = 1 to N do
x̂t,i ← fφi (xt−τ :t−1,i s1:τ,ij ), sτ,ij ∼ Ber(1 − mτ,ij )
PN hL (x̂ ,x̃ ),o i
Lpred (X̃ , X̂ , O) = i=1 21 ho1:L,i ,o1:L,i i i
L 1:L,i 1:L,i
φi ← Adam(φi , Lpred )
end for

20
Published as a conference paper at ICLR 2023

Table 8: Quantitative comparison on learning step numbers, in terms of AUROC. We set n1 , n2 , n3

proportional to original settings, e.g., if original settings is n1 = 50, n2 = 250, n3 = 200 then “50%
Steps” means n1 = 25, n2 = 125, n3 = 100.

VAR with Random Missing VAR with Periodic Missing

Methods
p = 0.3 p = 0.6 Tmax = 2 Tmax = 4
25% Steps 0.9912 ± 0.0041 0.9492 ± 0.0119 0.9951 ± 0.0047 0.9818 ± 0.0158
50% Steps 0.9949 ± 0.0034 0.9640 ± 0.0087 0.9978 ± 0.0028 0.9894 ± 0.0125
75% Steps 0.9965 ± 0.0027 0.9729 ± 0.0075 0.9985 ± 0.0023 0.9921 ± 0.0105
100% Steps 0.9971 ± 0.0026 0.9766 ± 0.0074 0.9992 ± 0.0016 0.9958 ± 0.0069
Lorenz-96 with Random Missing Lorenz-96 with Periodic Missing
Methods
p = 0.3 p = 0.6 Tmax = 2 Tmax = 4
25% Steps 0.9811 ± 0.0069 0.9052 ± 0.0208 0.9924 ± 0.0050 0.9655 ± 0.0216
50% Steps 0.9952 ± 0.0022 0.9456 ± 0.0153 0.9987 ± 0.0015 0.9855 ± 0.0112
75% Steps 0.9987 ± 0.0016 0.9613 ± 0.0128 0.9998 ± 0.0005 0.9930 ± 0.0062
100% Steps 0.9996 ± 0.0005 0.9705 ± 0.0118 1.0000 ± 0.0000 0.9959 ± 0.0042
NetSim with Random Missing
Methods
p = 0.1 p = 0.2
25% Steps 0.7737 ± 0.0346 0.7403 ± 0.0355
50% Steps 0.7963 ± 0.0399 0.7699 ± 0.0443
75% Steps 0.7961 ± 0.0390 0.7714 ± 0.0503
100% Steps 0.7948 ± 0.0381 0.7699 ± 0.0550

Table 9: Accuracy of CUTS on Lorenz-96 datasets with different noise levels. The accuracy is
calculated in terms of AUROC.

Lorenz-96 with Random Missing

Methods Noise σ
p = 0.3 p = 0.6
0.1 1.0000 ± 0.0000 0.9843 ± 0.0073
CUTS 0.3 1.0000 ± 0.0001 0.9825 ± 0.0080
1 0.9999 ± 0.0002 0.9722 ± 0.0108

A.6 MSE C URVE FOR DATA I MPUTATION

The Mean Square Error (MSE) of the imputed time-series, imputed time-series without the help of
causal graph, and the groundtruth time-series during the whole training process are shown in Figure
4. We can see that under all configurations our approach successfully imputes missing values with
significantly lower MSE compared to initially filled values. Furthermore, in most settings imputing
time-series without the help of causal graph are prone to overfit. The imputed time-series then
boost the subsequent causal discovery module, and discovered causal graph help to prevent overfit
in imputation.

21
Published as a conference paper at ICLR 2023

Table 10: Quantitative comparison for 3-dimensional temporal causal graph discovery on VAR
datasets, in terms of AUROC.

VAR with Random Missing

Methods
p=0 p = 0.3 p = 0.6
CUTS 0.9979 ± 0.0018 0.9848 ± 0.0053 0.9170 ± 0.0127
VAR with Periodic Missing
Methods
Tmax = 1 Tmax = 2 Tmax = 4
CUTS 0.9973 ± 0.0024 0.9938 ± 0.0036 0.9612 ± 0.0286

Table 11: Accuracy of CUTS and five other baseline causal discovery algorithms on VAR, Lorenz-
96, NetSim, and DREAM-3 datasets without missing values. The accuracy is calculated in terms of
AUROC.

Methods Lorenz-96 VAR NetSim DREAM-3

PCMCI 0.7515 ± 0.0381 0.9999 ± 0.0002 0.7692 ± 0.0414 0.5517 ± 0.0261
NGC 0.9967 ± 0.0058 0.9988 ± 0.0015 0.7616 ± 0.0504 0.5579 ± 0.0313
eSRU 0.9996 ± 0.0005 0.9949 ± 0.0040 0.6817 ± 0.0263 0.5587 ± 0.0335
LCCM 0.9967 ± 0.0058 0.9988 ± 0.0015 0.7616 ± 0.0504 0.5046 ± 0.0318
NGM 0.9996 ± 0.0005 0.9949 ± 0.0040 0.6817 ± 0.0263 0.5477 ± 0.0252
CUTS 1.0000 ± 0.0000 0.9999 ± 0.0002 0.8277 ± 0.0435 0.5915 ± 0.0344

Table 12: Accuracy of causal discovery results of CUTS under different hyperparameters λ and
τmax settings.

λ AUROC τmax AUROC

0.01 0.9962 ± 0.0029 3 0.9971 ± 0.0026
CUTS 0.03 0.9964 ± 0.0029 6 0.9972 ± 0.0032
0.1 0.9971 ± 0.0026 9 0.9972 ± 0.0042
0.3 0.9962 ± 0.0027

Table 13: Comparison of CUTS with (i) PCMCI, eSRU, NGC combined with imputation method
ZOH, GP, GRIN and (ii) LCCM, NGM which does not need data imputation. Results are averaged
over 4 randomly generated datasets.

Random Missing Periodic Missing

Method Imputation
p = 0.3 Tmax = 2
ZOH 0.7995 ± 0.0361 0.8164 ± 0.0313
PCMCI GP 0.8124 ± 0.0221 0.7871 ± 0.0323
GRIN 0.8193 ± 0.0329 0.7816 ± 0.0361
ZOH 0.8067 ± 0.0267 0.8558 ± 0.0248
NGC GP 0.8350 ± 0.0314 0.8250 ± 0.0257
GRIN 0.6293 ± 0.0523 0.7114 ± 0.0129
ZOH 0.8883 ± 0.0131 0.9463 ± 0.0208
eSRU GP 0.9499 ± 0.0061 0.8893 ± 0.0160
GRIN 0.9417 ± 0.0199 0.9494 ± 0.0129
LCCM 0.6437 ± 0.0267 0.6215 ± 0.0343
NGM 0.6734 ± 0.0403 0.7522 ± 0.0520
CUTS (Proposed) 0.9737 ± 0.0105 0.9289 ± 0.0145

22
Published as a conference paper at ICLR 2023

Algorithm 2 Causal graph fitting stage

Input: Time series dataset {x1:L,1 , ..., x1:L,N }; observation mask {o1:L,1 , ..., o1:L,N }; Adam op-
timizer Adam(·); Gumbel softmax function Gumbel(·) described with Equation 21
Output: Causal probability mτ,ji , ∀j = 1, ..., N
for i = 1 to N do
x̂t,i ← fφi (xt−τ :t−1,i s1:τ,ij ), sτ,ij = Gumbel(1 − mτ,ij )
Lgraph (X̃ , X̂ , O, θ) = Lpred (X̃ , X̂ , O) + λ||σ(θ)||1
θ ← Adam(θ, Lgraph )
end for

Algorithm 3 Causal Discovery from Irregular Time-series (CUTS)

Input: Time series dataset {x1:L,1 , ..., x1:L,N } with time-series length L; observation mask
{o1:L,1 , ..., o1:L,N }; Zero-order holder (ZOH) imputation algorithm ZOH(·); Adam optimizer
Adam(·)
Output: Discovered causal graph
(0)
Initialize x̃1:L,i = ZOH(x1:L,i ), Causal Probability Graphs Mτ = 0, ∀τ = 1, ..., τmax
# Warming up
for n1 iterations do
Update {φ1 , ..., φN } with Algorithm 1
Update Mτ with Algorithm 2
end for
# Causal discovery with data imputation
for n2 iterations do
Update {φ1 , ..., φN } with Algorithm 1
Update Mτ with Algorithm 2
for i = 1 to N do ( (m) (m)
(m+1) (1 − α)x̃t,i + αx̂t,i ot,i = 0
Data update: x̃t,i ← (0)
x̃t,i ot,i = 1
end for
end for
# Finetuning
Reset ot,i ← 1, ∀t = 1, ..., T, i = 1, ..., N
for n3 iterations do
Update {φ1 , ..., φN } with Algorithm 1
Update Mτ with Algorithm 2
end for
for i = 1 to N do
for j = 1 to N do
ãi,j = max (m1,ij , ..., mτmax ,ij )
end for
end for
return Discovered causal adjacency matrix Â where each elements is ãi,j .

23
Published as a conference paper at ICLR 2023

Figure 4: Average MSE curve of imputed data on VAR datasets with Random Missing / Periodic
Missing (top), Lorenz-96 datasets under Random Missing / Periodic Missing (middle), and NetSim
datasets with Random Missing (bottom).

Differential Equations With Boundary Value Problems 2nd Edition Polking Fast Access
No ratings yet
Differential Equations With Boundary Value Problems 2nd Edition Polking Fast Access
317 pages
Causal Discovery Tools
No ratings yet
Causal Discovery Tools
5 pages
Shorack GR Wellner Ja Empirical Processes With Applications
100% (1)
Shorack GR Wellner Ja Empirical Processes With Applications
1,000 pages
Valse 2
No ratings yet
Valse 2
89 pages
REVISI-Haberman, R. Mathematical Models - Mechanical Vibrations, Population Dynamics, and Trafic Flow - Compressed PDF
100% (2)
REVISI-Haberman, R. Mathematical Models - Mechanical Vibrations, Population Dynamics, and Trafic Flow - Compressed PDF
421 pages
Causal Discovery in Machine Learning - Theories and Applications
No ratings yet
Causal Discovery in Machine Learning - Theories and Applications
29 pages
Exploratory Causal Analysis With Time Series Data
No ratings yet
Exploratory Causal Analysis With Time Series Data
149 pages
Exploratory Causal Analysis With Time Series Data - James M. McCracken (Morgan & Claypool, 2016)
100% (1)
Exploratory Causal Analysis With Time Series Data - James M. McCracken (Morgan & Claypool, 2016)
134 pages
(Important) The Reconstruction of Equivalent Underlying Model Based On Direct Causalities For Multivariate Time Series
No ratings yet
(Important) The Reconstruction of Equivalent Underlying Model Based On Direct Causalities For Multivariate Time Series
25 pages
Nonlinear Causal Discovery Through A Sequential Edge Orientation Approach
No ratings yet
Nonlinear Causal Discovery Through A Sequential Edge Orientation Approach
42 pages
Conditional Independences and Causal Relations Implied by Sets of Equations
No ratings yet
Conditional Independences and Causal Relations Implied by Sets of Equations
62 pages
Shimizu 06 A
No ratings yet
Shimizu 06 A
28 pages
NeurIPS 2023 Pairwise Causality Guided Transformers For Event Sequences Paper Conference
No ratings yet
NeurIPS 2023 Pairwise Causality Guided Transformers For Event Sequences Paper Conference
14 pages
Peter Spirtes 2010
No ratings yet
Peter Spirtes 2010
20 pages
Causal Discovery With Attention-Based Convolutional Neural Networks
No ratings yet
Causal Discovery With Attention-Based Convolutional Neural Networks
28 pages
Zhou 等 - 2024 - Jacobian Regularizer-based Neural Granger Causality
No ratings yet
Zhou 等 - 2024 - Jacobian Regularizer-based Neural Granger Causality
20 pages
D'Ya Like DAGs? A Survey On Structure Learning and Causal Discovery
No ratings yet
D'Ya Like DAGs? A Survey On Structure Learning and Causal Discovery
35 pages
S. Kwon - On The Fifth-Order KDV Equation - Local Well-Posedness and Lack of Uniform Continuity of The Solution Map
No ratings yet
S. Kwon - On The Fifth-Order KDV Equation - Local Well-Posedness and Lack of Uniform Continuity of The Solution Map
33 pages
Switching Regression Models and Causal Inference - Potentially Bushit Paper Dressing Up Nicely in Math
No ratings yet
Switching Regression Models and Causal Inference - Potentially Bushit Paper Dressing Up Nicely in Math
46 pages
Activation Functions in Neural Networks: What Is Activation Function?
No ratings yet
Activation Functions in Neural Networks: What Is Activation Function?
11 pages
Causal AI Final
No ratings yet
Causal AI Final
71 pages
Score Matching Through The Roof Llinear, Nonlinear, and Latent Variables Causal Discovery 26th July 2024 (AAA)
No ratings yet
Score Matching Through The Roof Llinear, Nonlinear, and Latent Variables Causal Discovery 26th July 2024 (AAA)
27 pages
A Meta-Reinforcement Learning Algorithm For Causal Discovery
No ratings yet
A Meta-Reinforcement Learning Algorithm For Causal Discovery
18 pages
Local Causal Structure Learning in The Presence of Latent Variables
No ratings yet
Local Causal Structure Learning in The Presence of Latent Variables
20 pages
CiML v5 Book PDF
No ratings yet
CiML v5 Book PDF
152 pages
CORE: Towards Scalable and Efficient Causal Discovery With Reinforcement Learning
No ratings yet
CORE: Towards Scalable and Efficient Causal Discovery With Reinforcement Learning
11 pages
Deep Causal Learning
No ratings yet
Deep Causal Learning
35 pages
Causal Discovery From Subsampled Time Series With Proxy Variables
No ratings yet
Causal Discovery From Subsampled Time Series With Proxy Variables
22 pages
29034-Article Text-33088-1-2-20240324
No ratings yet
29034-Article Text-33088-1-2-20240324
9 pages
A Survey On Causal Discovery Methods For Temporal and Non-Temporal Data
No ratings yet
A Survey On Causal Discovery Methods For Temporal and Non-Temporal Data
55 pages
Learning Temporally Causal Latent Processes From General Temporal Data
No ratings yet
Learning Temporally Causal Latent Processes From General Temporal Data
31 pages
Deci
No ratings yet
Deci
31 pages
A Practical Approach To Causal Inference Over Time
No ratings yet
A Practical Approach To Causal Inference Over Time
20 pages
Causal Inference Using LLM-Guided Discovery
No ratings yet
Causal Inference Using LLM-Guided Discovery
20 pages
Causal Discovery From Time-Series Data With Short-Term Invariance-Based Convolutional Neural Network
No ratings yet
Causal Discovery From Time-Series Data With Short-Term Invariance-Based Convolutional Neural Network
19 pages
Digital SAT Practice Test #7
No ratings yet
Digital SAT Practice Test #7
10 pages
Sample, Estimate, Aggregate: A Recipe For Causal Discovery Foundation Models
No ratings yet
Sample, Estimate, Aggregate: A Recipe For Causal Discovery Foundation Models
32 pages
A Review and Roadmap of Deep Causal Model From Different Causal Structures and Representations
No ratings yet
A Review and Roadmap of Deep Causal Model From Different Causal Structures and Representations
35 pages
Linear Scaling Causal Discovery From High-Dimensional Time Series by Dynamical Community Detection
No ratings yet
Linear Scaling Causal Discovery From High-Dimensional Time Series by Dynamical Community Detection
13 pages
NN Causality
No ratings yet
NN Causality
17 pages
Schmid - Stability and Transition in Shear Flows PDF
No ratings yet
Schmid - Stability and Transition in Shear Flows PDF
561 pages
Nonlinear Dynamic Granger Causality Analysis Framework For Root-Cause Diagnosis of Quality-Related Faults in Manufacturing Processes
No ratings yet
Nonlinear Dynamic Granger Causality Analysis Framework For Root-Cause Diagnosis of Quality-Related Faults in Manufacturing Processes
10 pages
Chu 2021
No ratings yet
Chu 2021
6 pages
Chen 2020
No ratings yet
Chen 2020
6 pages
Multi-Lag and Multi-Type Temporal Causality Inference and Analysis For
No ratings yet
Multi-Lag and Multi-Type Temporal Causality Inference and Analysis For
19 pages
1929 Causal Discovery With Reinforc
No ratings yet
1929 Causal Discovery With Reinforc
17 pages
NCS21 - 01 - Introduction To Nonlinear Control
No ratings yet
NCS21 - 01 - Introduction To Nonlinear Control
18 pages
Inferring The Time-Varying Coupling of Dynamical Systems With Temporal
No ratings yet
Inferring The Time-Varying Coupling of Dynamical Systems With Temporal
17 pages
1 - Neural Granger Causality
No ratings yet
1 - Neural Granger Causality
13 pages
Lower Bounds On Information Requirements For Causa
No ratings yet
Lower Bounds On Information Requirements For Causa
7 pages
3 - Stacked DL Non-Linear GC
No ratings yet
3 - Stacked DL Non-Linear GC
12 pages
2 - Nonlinear GC Nature
No ratings yet
2 - Nonlinear GC Nature
11 pages
Sample-Efficient Reinforcement Learning Via Counterfactual-Based Data Augmentation
No ratings yet
Sample-Efficient Reinforcement Learning Via Counterfactual-Based Data Augmentation
14 pages
Calerman Linearization
No ratings yet
Calerman Linearization
6 pages
Establishing Markov Equivalence in Cyclic Directed Graphs
No ratings yet
Establishing Markov Equivalence in Cyclic Directed Graphs
19 pages
Causality Bernhard Schölkopf
No ratings yet
Causality Bernhard Schölkopf
169 pages
Nonlinear Causal Discovery With Additive Noise Models - Hoyer, P. O., Janzing, D., Mooij, J., Peters, J., & Schölkopf, B.
No ratings yet
Nonlinear Causal Discovery With Additive Noise Models - Hoyer, P. O., Janzing, D., Mooij, J., Peters, J., & Schölkopf, B.
8 pages
A New Homotopy For Seeking All Real Roots of A Nonlinear Equation
No ratings yet
A New Homotopy For Seeking All Real Roots of A Nonlinear Equation
9 pages
Commun Nonlinear Sci Numer Simulat: Xuegeng Mao, Pengjian Shang
No ratings yet
Commun Nonlinear Sci Numer Simulat: Xuegeng Mao, Pengjian Shang
10 pages
Causal-Learn: Causal Discovery in Python
No ratings yet
Causal-Learn: Causal Discovery in Python
8 pages
Time Series Data Imputation - A Survey On Deep Learning Approaches
No ratings yet
Time Series Data Imputation - A Survey On Deep Learning Approaches
9 pages
VB Bhatia 1
67% (3)
VB Bhatia 1
25 pages
Sample Solution 1
No ratings yet
Sample Solution 1
10 pages
Introduction To Operations Research PDF
No ratings yet
Introduction To Operations Research PDF
15 pages
Solving Inverse Non-Linear Fractional de
No ratings yet
Solving Inverse Non-Linear Fractional de
10 pages
Causal Discovery Using Proxy Variables
No ratings yet
Causal Discovery Using Proxy Variables
13 pages
Discovery of Non-Gaussian Linear Causal Models Using ICA
No ratings yet
Discovery of Non-Gaussian Linear Causal Models Using ICA
8 pages
328fabd5daddb3ae693f7ff00b2dab0a_1c5a7eb974f8d7803fd15c52c72e499e
No ratings yet
328fabd5daddb3ae693f7ff00b2dab0a_1c5a7eb974f8d7803fd15c52c72e499e
9 pages
Ortiz 1985
No ratings yet
Ortiz 1985
16 pages
Syllabus Structure (R-2007) at B.E. (Instrumentation Engineering) Semester-VII
No ratings yet
Syllabus Structure (R-2007) at B.E. (Instrumentation Engineering) Semester-VII
28 pages
Ssi
No ratings yet
Ssi
57 pages
Statistical Tests For Detecting Granger Causality
No ratings yet
Statistical Tests For Detecting Granger Causality
14 pages
Mei (1999)
No ratings yet
Mei (1999)
15 pages
Non-Gaussian Methods For Causal Structure Learning: Shohei Shimizu
No ratings yet
Non-Gaussian Methods For Causal Structure Learning: Shohei Shimizu
11 pages
Dynamic Analysis of Hyd - Actuator
No ratings yet
Dynamic Analysis of Hyd - Actuator
80 pages
Linear or Not Linear Function Card Sort
No ratings yet
Linear or Not Linear Function Card Sort
8 pages
Causal Discovery With General Non-Linear Relationships Using Non-Linear ICA
No ratings yet
Causal Discovery With General Non-Linear Relationships Using Non-Linear ICA
10 pages
Rotors: Modal Testing of
No ratings yet
Rotors: Modal Testing of
35 pages
Introduction To Fe Modeling of Composite Beam To CFST Column Connection - Phase - 1
No ratings yet
Introduction To Fe Modeling of Composite Beam To CFST Column Connection - Phase - 1
60 pages
LP 4 Linear and Nonlinear Text
100% (4)
LP 4 Linear and Nonlinear Text
8 pages
Syllabus For Gate Chemical Engineering
No ratings yet
Syllabus For Gate Chemical Engineering
4 pages
Time Series Analysis With The Causality Workbench
No ratings yet
Time Series Analysis With The Causality Workbench
25 pages
Tuned Mass Damper Week 4
No ratings yet
Tuned Mass Damper Week 4
11 pages
4 Simultaneous Equations
No ratings yet
4 Simultaneous Equations
9 pages
Causal Discovery From Temporally Aggregated Time Series
No ratings yet
Causal Discovery From Temporally Aggregated Time Series
10 pages
Multi-Dimensional Causal Discovery
No ratings yet
Multi-Dimensional Causal Discovery
7 pages
Deterministic Chaos Theory and Its Applications To Materials Science
No ratings yet
Deterministic Chaos Theory and Its Applications To Materials Science
9 pages
Theoretical Solutions For Turbulence Generated by Vibrating Grids and by Wall Flows Using The K
No ratings yet
Theoretical Solutions For Turbulence Generated by Vibrating Grids and by Wall Flows Using The K
10 pages
Rustum Choksi, Yasumasa Nishiura and An-Chang Shi - Self-Assembly of Block Copolymers: Theoretical Models and Mathematical Challenges
No ratings yet
Rustum Choksi, Yasumasa Nishiura and An-Chang Shi - Self-Assembly of Block Copolymers: Theoretical Models and Mathematical Challenges
12 pages
Uncertainty Theories and Multisensor Data Fusion
From Everand
Uncertainty Theories and Multisensor Data Fusion
Alain Appriou
No ratings yet

Cheng 2023 CUTS

Uploaded by

Cheng 2023 CUTS

Uploaded by

Published as a conference paper at ICLR 2023

CUTS: N EURAL C AUSAL D ISCOVERY

Yuxiao Cheng1 Runzhao Yang1 Tingxiong Xiao1 Zongren Li3

3.1 N ONLINEAR S TRUCTURAL C AUSAL M ODELS WITH IRREGULAR O BSERVATION

3.2 N ONLINEAR G RANGER C AUSALITY

fj (xt−τ :t−1,1 , ..., x0t−τ :t−1,i , ..., xt−τ :t−1,N ) 6=

(a) 𝒏𝟏 iterations 𝒏𝟐 iterations 𝒏𝟑 iterations

4 I RREGULAR T IME - SERIES C AUSAL D ISCOVERY

4.1 L ATENT DATA P REDICTION S TAGE

4.2 C AUSAL G RAPH D ISCOVERY S TAGE

4.3 T HE L EARNING S TRATEGY.

The overall learning process consists of n = n1 + n2 + n3 epochs, which is illustrated in Figure 1

4.4 C ONVERGENCE C ONDITIONS FOR G RANGER C AUSALITY.

Theorem 1 Given a time-series dataset X = {x1:L,i }N

if the following two conditions hold:

VAR Simulation Lorenz-96 Simulation

Time-series Groundtruth CPG Time-series Groundtruth CPG

5.1 VAR S IMULATION DATASETS

VAR datasets are simulated following

5.2 L ORENZ -96 S IMULATION DATASETS

Lorenz-96 datasets are simulated according to

5.4 A BLATION S TUDIES

5.5 A DDITIONAL E XPERIMENTS

Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional

Stephen M. Smith, Karla L. Miller, Gholamreza Salimi-Khorshidi, Matthew Webster, Christian F.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement

A.1 Convergence Conditions for Granger Causality . . . . . . . . . . . . . . . . . . . 14

A.1 C ONVERGENCE C ONDITIONS FOR G RANGER C AUSALITY

A.1.1 P ROOF OF T HEOREM 1

In Causal graph fitting stage the loss function

= σ 0 (θτ,ij )(ci ot,j (−2et,j ∆fi,j − ∆2 fi,j ) + λ),

A.1.2 T HE E FFECTS OF DATA I MPUTATION

A.1.3 C ONVERGENCE OF S IGMOIDAL G RADIENTS

A.2 A N E XAMPLE FOR I RREGULAR T IME - SERIES C AUSAL D ISCOVERY

and CPGs Mτ is optimized in Causal graph fitting stage with

𝑡−2 𝑡−1 𝑡 𝑡−2 𝑡−1 𝑡 𝑡−2 𝑡−1 𝑡

Causal Discovery with CUTS

Figure 3: An three-time-series example demonstrating the advantages of introducing data imputa-

A.3 I MPLEMENTATION D ETAILS

A.3.1 G UMBEL S OFTMAX FOR C AUSAL G RAPH F ITTING

A.3.2 I NITIAL DATA F ILLING

A.3.3 H YPERPARAMETERS S ETTINGS

Table 4: Hyperparameters settings of CUTS in the aforementioned experiments. “a1 → a2 ” means

A.4 A DDITIONAL E XPERIMENTS

A.4.1 DREAM-3 E XPERIMENTS

A.4.2 A BLATION S TUDY ON VAR AND N ET S IM DATASETS

A.4.3 A BLATION S TUDY FOR E POCH N UMBERS

A.4.4 P ERFORMANCE ON T EMPORAL C AUSAL G RAPH

A.4.5 C AUSAL D ISCOVERY WITH S TRUCTURED T IME - SERIES DATA

A.4.6 ROBUSTNESS TO H YPERPARAMETERS S ETTINGS

A.4.7 L ORENZ -96 DATASETS WITH F=40

VAR with Random Missing VAR with Periodic Missing

NetSim with Random Missing

A.4.8 ROBUSTNESS TO N OISE

A.5 P SEUDOCODE FOR CUTS

Algorithm 1 Latent data prediction stage

Table 8: Quantitative comparison on learning step numbers, in terms of AUROC. We set n1 , n2 , n3

VAR with Random Missing VAR with Periodic Missing

Lorenz-96 with Random Missing

A.6 MSE C URVE FOR DATA I MPUTATION

VAR with Random Missing

Methods Lorenz-96 VAR NetSim DREAM-3

λ AUROC τmax AUROC

Random Missing Periodic Missing

Algorithm 2 Causal graph fitting stage

Algorithm 3 Causal Discovery from Irregular Time-series (CUTS)

You might also like