
Causal Discovery from Time-Series Data with Short-Term Invariance-Based Convolutional Neural Networks

A Preprint

arXiv:2408.08023v1 [cs.LG] 15 Aug 2024

Rujia Shen
Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
[email protected]

Boran Wang
The Artificial Intelligence Institute, Harbin Institute of Technology, Shenzhen, Guangdong, China
[email protected]

Chao Zhao
The Department of Computer Science, The University of North Carolina at Chapel Hill, North Carolina, USA
[email protected]

Yi Guan
Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
[email protected]

Jingchi Jiang
The Artificial Intelligence Institute, Harbin Institute of Technology, Harbin, Heilongjiang, China
[email protected]

August 16, 2024

Abstract

Causal discovery from time-series data aims to capture both intra-slice (contemporaneous) and inter-slice (time-lagged) causality between variables within the temporal chain, which is crucial for various scientific disciplines. Compared to causal discovery from non-time-series data, causal discovery from time-series data necessitates more serialized samples with a larger number of observed time steps. To address this challenge, we propose STIC, a novel gradient-based causal discovery approach that focuses on Short-Term Invariance using Convolutional neural networks to uncover causal relationships from time-series data. Specifically, STIC leverages both the short-term time and mechanism invariance of causality within each window observation, which possesses the property of independence, to enhance sample efficiency. Furthermore, we construct two causal convolution kernels, corresponding to short-term time and mechanism invariance respectively, to estimate the window causal graph. To demonstrate the necessity of convolutional neural networks for causal discovery from time-series data, we theoretically derive the equivalence between convolution and the underlying generative principle of time-series data under the assumption that the additive noise model is identifiable. Experimental evaluations conducted on both synthetic and FMRI benchmark datasets demonstrate that STIC significantly outperforms baselines and achieves state-of-the-art performance, particularly when the datasets contain a limited number of observed time steps. Code is available at https://github.com/HITshenrj/STIC.

Keywords Causal discovery · Time-series data · Time invariance · Mechanism invariance · Convolutional neural
networks

1 Introduction

Causality behind time-series data plays a significant role in various aspects of everyday life and scientific inquiry.
Questions like "What factors in the past have led to the current rise in blood glucose?" or "How long will my headache be alleviated if I take that pill?" require an understanding of the relationships among observed variables, such as the relation between people's health status and their medical interventions [Cowls and Schroeder, 2015, Pawlowski et al., 2020]. People usually expect to find cyclical and invariant principles in a changing world, which we call causal
et al., 2020]. People usually expect to find cyclical and invariant principles in a changing world, which we call causal
relationships [Chan et al., 2024, Entner and Hoyer, 2010]. These relationships can be represented as a directed acyclic
graph (DAG), where nodes represent observed variables and edges represent causal relationships between variables
with time lags. This underlying graph structure forms the factual foundation for causal reasoning and is essential for
addressing such queries [Pearl, 2009].
Current causal discovery approaches utilize intra-slice and inter-slice information of time-series data, leveraging
techniques such as conditional independence, smooth score functions, and auto-regression. These methods can be
broadly classified into three categories: Constraint-based methods [Entner and Hoyer, 2010, Runge et al., 2019, Runge,
2020], Score-based methods [Pamfil et al., 2020], and Granger-based methods [Nauta et al., 2019, Cheng et al., 2022,
2023]. Constraint-based methods rely on conditional independence tests to infer causal relationships between variables.
These methods perform independence tests between pairs of variables under different conditional sets to determine
whether a causal relation exists. However, due to the difficulty of sampling, real-world data often suffers from the
limited length of observed time steps, making it challenging for statistical conditional independence tests to fully
capture causal relationships [Zhang et al., 2011, Zhang and Suzuki, 2023]. Additionally, these methods often rely on
strong yet unrealistic assumptions, such as Gaussian noise, when searching for statistical conditional independence
[Spirtes and Zhang, 2016, Wang and Michoel, 2017]. Score-based methods regard causal discovery as a constrained
optimization problem using augmented Lagrangian procedures. They assign a score function that captures properties of
the causal graph, such as acyclicity, and minimize the score function to identify potential causal graphs. While these
methods offer simplicity in optimization, they rely heavily on acyclicity regularization and often lack guarantees for
finding the correct causal graph, potentially leading to suboptimal solutions [Varando, 2020, Lippe et al., 2021, Zhang
et al., 2023]. Granger-based methods, inspired by [Granger, 1969, Granger and Hatanaka, 2015], offer an intriguing
perspective on causal discovery. These methods utilize auto-regression algorithms under the assumption of additive
noise to assess if one time series can predict another, thereby identifying causal relationships. However, they tend to
exhibit lower precision when working with limited observed time steps.
To overcome the limitations of existing approaches, such as low sample efficiency in constraint-based methods, suboptimal solutions from acyclicity regularizers in score-based methods, and low precision with limited observed time steps in Granger-based methods, we propose a novel Short-Term Invariance-based Convolutional causal discovery
approach (STIC). STIC leverages the properties of short-term invariance to enhance the sample efficiency and accuracy
of causal discovery. More concretely, by sliding a window along the entire time-series data, STIC constructs batches of
window observations that possess invariant characteristics and improves sample utilization. Unlike existing score-based
methods, our model does not rely on predefined acyclicity constraints to avoid local optimization. As the window
observations move along the temporal chain, the structure of the window causal graph exhibits periodic patterns,
demonstrating short-term time invariance. Simultaneously, the conditional probabilities of causal effects between
variables remain unchanged as the window observations slide, indicating short-term mechanism invariance. The
contributions of our work can be summarized as follows:

• We propose STIC, the Short-Term Invariance-based Convolutional causal discovery approach, which leverages
the properties of short-term invariance to enhance the sample efficiency and accuracy of causal discovery.

• STIC uses the time-invariance block to capture the causal relationships among variables, while employing the
mechanism-invariance block for the transform function.

• To dynamically capture the contemporaneous and time-lagged causal structures of the observed variables, we
establish the equivalence between the convolution of the space-domain (contemporaneous) and time-domain
(time-lagged) components, and the multivariate Fourier transform (the underlying generative mechanism) of
time-series data.

• We conduct experiments to evaluate the performance of STIC on synthetic and benchmark datasets. The
experimental results show that STIC achieves the state-of-the-art results on synthetic time-series datasets,
even when dealing with relatively limited observed time steps. Experiments demonstrate that our approach
outperforms baseline methods in causal discovery from time-series data.

[Figure 1 shows the window causal graph unrolled over three time slices $t$, $t+1$, $t+2$ for the variables $X_1, \ldots, X_5$.]

Figure 1: An example showing the correspondence among the given observed variables, the underlying window causal graph, and the window causal matrix. The observed dataset consists of $d = 5$ observed variables, and the true maximum lag $\tilde{\tau}$ is 2. In $\mathcal{W} \in \mathbb{R}^{5 \times 5 \times 3}$, each $\mathcal{W}^{\tau}_{i,j}$ represents the causal effect of $X_i$ on $X_j$ with $\tau$ time lags. For example, the blue lines in the window causal graph indicate the following three causal effects with time lag $\tau = 2$ at any time step $t$: $\mathcal{W}^{2}_{1,3} = 1 \Rightarrow X_1 \xrightarrow{2} X_3$; $\mathcal{W}^{2}_{1,5} = 1 \Rightarrow X_1 \xrightarrow{2} X_5$; $\mathcal{W}^{2}_{4,2} = 1 \Rightarrow X_4 \xrightarrow{2} X_2$. Moreover, the red lines indicate the causal relationships with time lag $\tau = 1$: $\mathcal{W}^{1}_{3,2} = 1 \Rightarrow X_3 \xrightarrow{1} X_2$; $\mathcal{W}^{1}_{3,4} = 1 \Rightarrow X_3 \xrightarrow{1} X_4$; $\mathcal{W}^{1}_{3,5} = 1 \Rightarrow X_3 \xrightarrow{1} X_5$; $\mathcal{W}^{1}_{5,4} = 1 \Rightarrow X_5 \xrightarrow{1} X_4$. Finally, the green lines represent contemporaneous causal relationships: $\mathcal{W}^{0}_{1,2} = 1 \Rightarrow X_1 \xrightarrow{0} X_2$; $\mathcal{W}^{0}_{1,4} = 1 \Rightarrow X_1 \xrightarrow{0} X_4$; $\mathcal{W}^{0}_{5,2} = 1 \Rightarrow X_5 \xrightarrow{0} X_2$.

2 Background
In this section, we introduce the background of causal discovery from time-series data. Firstly, we show all symbols and
their definitions in Section 2.1. Secondly, in Section 2.2, we present the problem definition and formal representation of the
window causal graph. Thirdly, in Section 2.3, we introduce the concepts of short-term time invariance and mechanism
invariance. Building upon these concepts, we derive an independence property specific to window causal graph.
Fourthly, in Section 2.4, we delve into the theoretical aspects of our approach. Specifically, we establish the equivalence
between the convolution operation and the underlying generative mechanism of the observed time-series data. This
theoretical grounding provides a solid basis for the proposed STIC approach. Finally, in Section 2.5, we introduce
Granger causality, an auto-regressive approach to causal discovery from time-series data.

2.1 Symbol Summarization

Firstly, to better represent the symbols used in Section 2, we arrange a table to summarize and show their definitions, as
shown in Table 1.

2.2 Problem Definition

Table 1: Summary of symbol definitions in Section 2.

| Symbol | Description |
| --- | --- |
| $d$ | The number of observed variables |
| $T$ | The length of observed time steps |
| $X_i^t$ | The observed value of the $i$-th variable at the $t$-th time step |
| $X_i = \{X_i^1, \cdots, X_i^T\} \in \mathbb{R}^T$ | The observed values of the $i$-th variable within all $T$ time steps |
| $\mathcal{X} = \{X_1, \cdots, X_d\} \in \mathbb{R}^{d \times T}$ | The observed dataset |
| $\tilde{\tau}$ | The maximum time lag |
| $\mathcal{G}$ | The underlying window causal graph |
| $V = \{X_1, ..., X_d\}$ | The nodes within the graph $\mathcal{G}$ |
| $E$ | The contemporaneous and time-lagged relationships among nodes $V$ |
| $\mathcal{W} \in \mathbb{R}^{d \times d \times (\tilde{\tau}+1)}$ | The window causal matrix |
| $X_i \xrightarrow{\tau} X_j$ | The causal relationship with $\tau$ lags between $X_i$ and $X_j$ |
| $Pa_t^{\tau}(\cdot)$ | The set of parents of a variable with $\tau$ time lags at time step $t$ |
| $Pa_t^{\cdot}(\cdot)$ | The set of parents of a variable with all time lags ranging from 0 to $\tilde{\tau}$ at time step $t$ |
| $\perp\!\!\!\perp_t^{\tau}$ | The conditional independence with $\tau$ time lags at time step $t$ |
| $Pa_{\mathcal{G}}(\mathcal{X})$ | The relationships among $\mathcal{X}$ in the window causal graph $\mathcal{G}$ |
| $\mathcal{E}$ | The noise term |
| $f$ | The underlying functions among $\mathcal{X}$ |
| $\mathcal{F}(\mathcal{X})$ | The multivariate Fourier transform of $\mathcal{X}$ |
| $\omega$ | The angular frequency |
| $\hat{f}, h, g$ | The functions in intermediate processes |
| $*$ | The convolution operation |
| $\propto$ | The directly proportional relationship |
| $\sigma_\tau^2(X_i \mid \mathcal{X})$ | The variance of predicting $X_i$ using $\mathcal{X}$ with $\tau$ time lags |

Let $\mathcal{X} = \{X_1, \cdots, X_d\} \in \mathbb{R}^{d \times T}$ denote an observed dataset, which consists of $d$ observed continuous time-series variables. Each variable $X_i$ is represented as a time sequence $X_i = \{X_i^1, \cdots, X_i^T\}$ of length $T$. Here, each $X_i^t$ corresponds to the observed value of the $i$-th variable $X_i$ at the $t$-th time step. Unlike graph embedding algorithms [Cheng et al., 2020, 2021], which aim to learn time-series representations, the objective of causal discovery is to uncover the underlying structure within time-series data, which represents boolean relationships between observed variables. Furthermore, following the Consistency Throughout Time assumption [Spirtes et al., 2000, Zhang and Spirtes, 2002, Robins et al., 2003, Kalisch and Bühlman, 2007, Entner and Hoyer, 2010, Assaad et al., 2022], the objective of causal discovery from time-series data is to uncover the underlying window causal graph $\mathcal{G}$ as an invariant causal structure. The true window causal graph for $\mathcal{X}$ encompasses both intra-slice causality with 0 time lags and inter-slice causality with time lags ranging from 1 to $\tilde{\tau}$. Here, $\tilde{\tau}$ denotes the maximum time lag. Mathematically, the window causal graph is defined as a finite Directed Acyclic Graph (DAG) denoted by $\mathcal{G} = (V, E)$. The set $V = \{X_1, ..., X_d\}$ represents the nodes within the graph $\mathcal{G}$, wherein each node corresponds to an observed variable $X_i$. The set $E$ represents the contemporaneous and time-lagged relationships among these nodes, encompassing all $2^{(\tilde{\tau}+1) \times d}$ possible combinations. The window causal graph is often represented by the window causal matrix, which is defined as follows.

Definition 1 (Window Causal Matrix) The window causal graph $\mathcal{G}$, which captures both contemporaneous and time-lagged causality, can be effectively represented using a three-dimensional boolean matrix $\mathcal{W} \in \mathbb{R}^{d \times d \times (\tilde{\tau}+1)}$. Each entry $\mathcal{W}^{\tau}_{i,j}$ in the boolean matrix corresponds to the causal relationship between variables $X_i$ and $X_j$ with $\tau$ time lags. To be more specific, if $\mathcal{W}^{\tau=0}_{i,j} = 1$, it signifies the presence of an intra-slice causal relationship between $X_i$ and $X_j$, meaning they influence each other at the same time step. On the other hand, if $\mathcal{W}^{\tau>0}_{i,j} = 1$, it indicates that $X_i$ causally affects $X_j$ with $\tau$ time lags.

Figure 1 provides a visual example of a window causal graph along with its corresponding matrix defined in Definition 1. As shown in Figure 1, a time-series causal relationship of the form $X_i \xrightarrow{\tau} X_j$ can be represented as $\mathcal{W}^{\tau}_{i,j} = 1$. Conversely, $\mathcal{W}^{\tau}_{i,j} = 1$ in the boolean matrix indicates that the value $X_i^t$ at any time step $t$ influences the value $X_j^{t+\tau}$, $\tau$ time lags later.
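To make Definition 1 concrete, the following sketch encodes the window causal matrix of Figure 1 as a boolean tensor in NumPy. The edges are read off the figure; the helper `add_edge` and the 1-indexed variable convention are our own choices for illustration.

```python
import numpy as np

d, max_lag = 5, 2                                # five variables, true maximum lag 2 (Figure 1)
W = np.zeros((d, d, max_lag + 1), dtype=bool)    # W[i, j, tau]: X_{i+1} -> X_{j+1} with tau lags

def add_edge(tau, i, j):
    """Mark the causal effect of X_i on X_j with tau time lags (variables are 1-indexed)."""
    W[i - 1, j - 1, tau] = True

for i, j in [(1, 2), (1, 4), (5, 2)]:            # green edges: contemporaneous, tau = 0
    add_edge(0, i, j)
for i, j in [(3, 2), (3, 4), (3, 5), (5, 4)]:    # red edges: tau = 1
    add_edge(1, i, j)
for i, j in [(1, 3), (1, 5), (4, 2)]:            # blue edges: tau = 2
    add_edge(2, i, j)

print(W.shape)       # (5, 5, 3), i.e. W in R^{5x5x3}
print(W[2, 1, 1])    # True: W^1_{3,2} = 1, X_3 -> X_2 with one time lag
```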

2.3 Short-Term Causal Invariance

There has been an assertion that causal relationships typically exhibit short-term time and mechanism invariance across
extensive time scales [Entner and Hoyer, 2010, Liu et al., 2023, Zhang et al., 2017]. These two aspects of invariance are
commonly regarded as fundamental assumptions of causal invariance in causal discovery from time-series data. In the
following, we will present the definitions for these two forms of invariance.

Definition 2 (Short-Term Time Invariance) Given $\mathcal{X} \in \mathbb{R}^{d \times T}$, for any $X_i$, $X_j$, $\tau \ge 0$: if $X_i \in Pa_t^{\tau}(X_j)$ at time $t$, then $X_i \in Pa_{t'}^{\tau}(X_j)$ at time $t' \neq t$ within a short period of time, where $Pa_t^{\tau}(\cdot)$ denotes the set of parents of a variable with $\tau$ time lags at time step $t$.


Short-term time invariance refers to the stability of parent-child relationships over time. In other words, it implies that the dependencies between variables remain consistent regardless of specific time points. For instance, consider Figure 1: if $X_5$ is a parent of $X_4$ with time lag $\tau = 1$ at $t$, then $X_5$ will also be a parent of $X_4$ with time lag $\tau = 1$ at $t' = t + 1$; similarly, when $\tau = 0$, if $X_5$ is a parent of $X_2$ at $t$, then $X_5$ will be a parent of $X_2$ whether at $t' = t + 1$ or $t' = t + 2$.
Definition 3 (Short-Term Mechanism Invariance) For any $X_i$, the conditional probability distribution $P(X_i \mid Pa_t^{\cdot}(X_i))$ remains constant across the short-term temporal chain. In other words, for any time steps $t$ and $t'$, it holds that $P(X_i \mid Pa_t^{\cdot}(X_i)) = P(X_i \mid Pa_{t'}^{\cdot}(X_i))$, where $Pa_t^{\cdot}(X_i)$ denotes the set of parents of $X_i$ with all time lags ranging from 0 to $\tilde{\tau}$ at time step $t$.

In particular, based on Definition 3, short-term mechanism invariance implies that conditional probability distributions remain constant over time. For instance, in Figure 1, we have $Pa_t^{\cdot}(X_2) = \{X_3, X_1, X_5\} = Pa_{t+1}^{\cdot}(X_2)$. Then, we have $P(X_2 \mid Pa_t^{\cdot}(X_2)) = P(X_2 \mid Pa_{t+1}^{\cdot}(X_2))$.
Building upon the definitions of short-term time invariance and mechanism invariance, we can derive the following
lemma, which characterizes the invariant nature of independence among variables. Inspired by causal invariance [Entner
and Hoyer, 2010], we further provide a detailed proof procedure as outlined below.
Lemma 1 (Independence Property) Let $\mathcal{X} \in \mathbb{R}^{d \times T}$ be the observed dataset. If we have $X_i \perp\!\!\!\perp_t^{\tau} X_j \mid X_k, ..., X_l$, then we have $X_i \perp\!\!\!\perp_{t'}^{\tau} X_j \mid X_k, ..., X_l$, where $\perp\!\!\!\perp_t^{\tau}$ denotes conditional independence with $\tau$ time lags at time step $t$.

Proof 1 Due to the short-term time invariance of the relationships among variables and the short-term mechanism invariance of conditional probabilities, different values $X_i^t$ and $X_i^{t'}$ of $X_i$ are mapped to the same variable $X_i$ in the window causal graph $\mathcal{G}$. Consequently, $Pa_t^{\tau}(X_i)$ and $Pa_{t'}^{\tau}(X_i)$ correspond to the same variable set. Thus, if the condition $X_i \perp\!\!\!\perp_t^{\tau} X_j \mid X_k, ..., X_l$ holds, then $X_i \perp\!\!\!\perp_{\mathcal{G}}^{\tau} X_j \mid X_k, ..., X_l$ holds in the window causal graph $\mathcal{G}$, which further implies $X_i \perp\!\!\!\perp_{t'}^{\tau} X_j \mid X_k, ..., X_l$.

This lemma establishes that, in an identifiable window causal graph, the independence property remains invariant
with time translation. Leveraging this insight, we can transform the observed time series into window observations to
perform causal discovery while maintaining the invariance conditions, as outlined in Section 3.1.

2.4 Necessity of Convolution

Granger demonstrated, through the Cramer representation and the spectral representation of the covariance sequence [Granger, 1969, Mills and Granger, 2013, Granger and Hatanaka, 2015], that time-series data can be decomposed into a sum of uncorrelated components. Inspired by these representations and the concept of the graph Fourier transform [Shuman et al., 2013, Sandryhaila and Moura, 2013, Sardellitti et al., 2017], we propose considering an underlying function $\mathcal{X} = f(Pa_{\mathcal{G}}(\mathcal{X}), \mathcal{W}) + \mathcal{E}$, where $Pa_{\mathcal{G}}(\mathcal{X})$ denotes the relationships among $\mathcal{X}$ in the window causal graph $\mathcal{G}$ and $\mathcal{E}$ is the noise term, to describe the generative process of the observed dataset $\mathcal{X} = \{X_1, \cdots, X_d\} \in \mathbb{R}^{d \times T}$ with an underlying window causal matrix $\mathcal{W} \in \mathbb{R}^{d \times d \times (\tilde{\tau}+1)}$. We can then decompose $f(Pa_{\mathcal{G}}(\mathcal{X}), \mathcal{W})$ into Fourier integral forms:

$$\mathcal{X} = f(Pa_{\mathcal{G}}(\mathcal{X}), \mathcal{W}) + \mathcal{E} = \hat{f}(s, t) + \mathcal{E} \quad (1)$$

Here, $s$ and $t$ denote the spatial and temporal projections, respectively, of $f(Pa_{\mathcal{G}}(\mathcal{X}), \mathcal{W})$. Equation 1 is derived from the observation that the contemporaneous part of time-series data corresponds to the spatial domain, while the time-lagged part corresponds to the temporal domain. Therefore, we employ the multivariate Fourier transform,

$$\mathcal{F}(\mathcal{X}) = \iint_{-\infty}^{\infty} \hat{f}(x, y; s, t)\, e^{-i\omega(sx+ty)}\, dx\, dy \;\propto\; \iint_{-\infty}^{\infty} h(\hat{s})\, g(\hat{t})\, e^{-i\omega(\hat{s}+\hat{t})}\, d\hat{s}\, d\hat{t} \quad (2)$$

where $\hat{s}$ represents the spatial-domain component, $\hat{t}$ represents the temporal-domain component, and $\omega$ represents the angular frequency, along with the transform functions $\hat{f}$, $h$ and $g$.


Figure 2: An illustration of the STIC framework. Let $\mathcal{X} = \{X_1, \cdots, X_d\} \in \mathbb{R}^{d \times T}$ be the observed dataset, representing $d$ observed continuous time series of the same length $T$. First, we convert the observations of the first $T-1$ time steps, $\mathcal{X}^{1:T-1} = \{X_1^{1:T-1}, \cdots, X_d^{1:T-1}\} \in \mathbb{R}^{d \times (T-1)}$, into a window representation $W \in \mathbb{R}^{d \times \hat{\tau} \times c}$ using a sliding window with a predefined window length $\hat{\tau}$ and step length 1, where $c = T - \hat{\tau}$. Time-Invariance Block ($B_t$): to better discover the causal structure from $\mathcal{X}$, we apply a convolution kernel $K_t \in \mathbb{R}^{d \times \hat{\tau}}$ to $W$ and obtain the common representation $K_t \odot W_\psi$ of $\mathcal{X}$ for each window observation $W_\psi$. Afterwards, we pass the commonality through an FNN to obtain a predicted window causal matrix $\hat{\mathcal{W}} \in \mathbb{R}^{d \times d \times \hat{\tau}}$. Mechanism-Invariance Block ($B_m$): to identify the numerical transform in the window causal graph, we use another convolution kernel $K_m \in \mathbb{R}^{d \times \hat{\tau}}$ in each $B_m$ to transform $W$, and output $\overline{W} \in \mathbb{R}^{d \times \hat{\tau} \times c}$ as the prediction of $f(W)$. Next, we take the Hadamard product of each $\overline{W}^{\tau}_{\psi} \in \mathbb{R}^d$ in $\overline{W}$ and each $\hat{\mathcal{W}}^{\tau} \in \mathbb{R}^{d \times d}$ in $\hat{\mathcal{W}}$ to obtain the prediction $\hat{\mathcal{X}}^{\hat{\tau}+\psi}$, until we get all of $\hat{\mathcal{X}} \in \mathbb{R}^{d \times c}$. Finally, we calculate the Mean Squared Error (MSE) loss between $\hat{\mathcal{X}}$ and $\mathcal{X}$, and adopt gradient descent to optimize the parameters within the time-invariance and mechanism-invariance blocks.

The first line corresponds to applying the Fourier transform to both sides of Equation 1. In the second line, inspired by the Time-Independent Schrödinger Equation [Zabusky, 1968, Rana and Liao, 2019], we assume that $\hat{f}(x, y; s, t)$ can be decomposed into the spatial and temporal domains, i.e., $\hat{f}(x, y; s, t) = h(\hat{s})g(\hat{t})$. Next, by utilizing the convolution theorem [Zayed, 1998] for tempered distributions, which states that under suitable conditions the Fourier transform of a convolution of two functions (or signals) is the pointwise product of their Fourier transforms, i.e., $\mathcal{F}(h * g) = \mathcal{F}(h) \cdot \mathcal{F}(g)$, where $\mathcal{F}(\cdot)$ represents the Fourier transform, we convert the convolution formula into the following expression:

$$\mathcal{F}[h(\hat{s}) * g(\hat{t})] \;\propto\; \mathcal{F}(h(\hat{s})) \cdot \mathcal{F}(g(\hat{t})) \;\propto\; \left( \int_{-\infty}^{\infty} h(\hat{s})\, e^{-i\omega\hat{s}}\, d\hat{s} \right) \left( \int_{-\infty}^{\infty} g(\hat{t})\, e^{-i\omega\hat{t}}\, d\hat{t} \right) \;\propto\; \mathcal{F}(\mathcal{X}) \quad (3)$$

The first line of Formula 3 is obtained through the convolution theorem, while the second line expands $\mathcal{F}(h(\hat{s}))$ and $\mathcal{F}(g(\hat{t}))$ using the Fourier transform. The third line is derived from Equation 2. This indicates that the observed dataset $\mathcal{X}$ can be obtained by convolving a convolution kernel carrying the temporal information with the spatial details, which we handle through the two kinds of invariance. We posit that the convolution operation precisely aligns with the functional causal data generation mechanism, i.e., $\mathcal{X} \propto h(\hat{s}) * g(\hat{t})$. Conversely, the convolution operation can be used to analytically model the generation mechanism of functional time-series data. Therefore, we employ the convolution operation to extract the functional causal relationships within the window causal graph. In conclusion, the equivalence between the generation mechanism of time-series causal data and convolution operations serves as motivation to incorporate convolution operations into our STIC framework.
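The convolution theorem invoked above is easy to verify numerically. The following minimal sketch (our own illustration, using a discrete circular convolution so that the identity holds exactly under the FFT) checks that $\mathcal{F}(h * g) = \mathcal{F}(h) \cdot \mathcal{F}(g)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
h = rng.normal(size=n)   # stand-in for the spatial component h(s_hat)
g = rng.normal(size=n)   # stand-in for the temporal component g(t_hat)

# Circular convolution computed directly: (h * g)[k] = sum_m h[m] g[(k - m) mod n].
# np.roll(g[::-1], k + 1)[m] equals g[(k - m) mod n], so the sum below is exact.
conv = np.array([np.sum(h * np.roll(g[::-1], k + 1)) for k in range(n)])

# ... and computed via the convolution theorem: F(h * g) = F(h) . F(g)
conv_fft = np.fft.ifft(np.fft.fft(h) * np.fft.fft(g)).real

print(np.allclose(conv, conv_fft))  # True
```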


Table 2: Summary of symbol definitions in Section 3.

| Symbol | Description |
| --- | --- |
| $\tau$ | The predefined maximum time lag |
| $\hat{\tau}$ | The predefined window length, $\hat{\tau} = \tau + 1$ |
| $W \in \mathbb{R}^{d \times \hat{\tau} \times c}$ | The window representation, where $c = T - \hat{\tau}$ |
| $B_t$ | The time-invariance block |
| $K_t \in \mathbb{R}^{d \times \hat{\tau}}$ | The convolution kernel in the time-invariance block |
| $\odot$ | The Hadamard product |
| $\hat{\mathcal{W}} \in \mathbb{R}^{d \times d \times \hat{\tau}}$ | The predicted window causal matrix |
| $B_m$ | The mechanism-invariance block |
| $K_m \in \mathbb{R}^{d \times \hat{\tau}}$ | The convolution kernel in the mechanism-invariance block |
| $\overline{W} \in \mathbb{R}^{d \times \hat{\tau} \times c}$ | The prediction of $f(W)$ |
| $\hat{\mathcal{X}} \in \mathbb{R}^{d \times c}$ | The prediction of the observed dataset |
| $f_1 : \mathbb{R}^{c \times d \times \hat{\tau}} \to \mathbb{R}^{d \times d \times \hat{\tau}}$ | The feed-forward neural network |
| $\hat{\mathcal{W}}^{\tau}_{i,j}$ | The estimated binary existence of the causal effect of $X_i$ on $X_j$ with $\tau$ time lags |
| $p$ | The threshold used to eliminate edges with low probability of existence |
| $f_2 : \mathbb{R}^{d \times \hat{\tau}} \to \mathbb{R}^{d \times \hat{\tau}}$ | The estimated transformation function |

2.5 Granger Causality

Granger causality [Granger, 1969, Pavasant et al., 2021, Assaad et al., 2022] is a method that utilizes numerical calculations to assess causality by measuring fitting loss and variance. Formally, we say that a variable $X_i$ Granger-causes another variable $X_j$ when the past values of $X_i$ before time $t$ (i.e., $X_i^1, \cdots, X_i^{t-1}$) enhance the prediction of $X_j$ at time $t$ (i.e., $X_j^t$) compared to considering only the past values of $X_j$. The definition of Granger causality is as follows:

Definition 4 (Granger Causality) Let $\mathcal{X} = \{X_1, \cdots, X_d\} \in \mathbb{R}^{d \times T}$ be an observed dataset containing $d$ variables. If $\sigma_\tau^2(X_j \mid \mathcal{X}) < \sigma_\tau^2(X_j \mid \mathcal{X} - X_i)$, where $\sigma_\tau^2(X_j \mid \mathcal{X})$ denotes the variance of predicting $X_j$ using $\mathcal{X}$ with $\tau$ time lags, we say that $X_i$ causes $X_j$, which is represented by $\mathcal{W}^{\tau}_{i,j} = 1$.

In simpler terms, Granger causality states that $X_i$ Granger-causes $X_j$ if past values of $X_i$ provide unique and statistically significant information for predicting future values of $X_j$ (i.e., $X_j^t$). Therefore, following the definition of Granger causality, we can approach causal discovery as an autoregressive problem.
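As an illustration of Definition 4, the sketch below compares the residual variance of predicting $X_j$ with and without $X_i$ using a plain least-squares autoregression. The toy data and the linear fit are our own simplifications; STIC itself replaces this linear regression with the convolutional blocks of Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
T, lag = 500, 2

# Toy data: x Granger-causes y with a lag of 2.
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.8 * x[t - 2] + 0.1 * y[t - 1] + 0.1 * rng.normal()

def residual_variance(target, predictors, lag):
    """Variance of the least-squares residual when predicting target from lagged predictors."""
    rows = [np.concatenate([p[t - lag:t] for p in predictors])
            for t in range(lag, len(target))]
    A = np.asarray(rows)
    b = target[lag:]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.var(b - A @ coef)

full = residual_variance(y, [x, y], lag)        # sigma^2_tau(X_j | X)
restricted = residual_variance(y, [y], lag)     # sigma^2_tau(X_j | X - X_i)
print(full < restricted)                        # True: x Granger-causes y
```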

3 Method
In this section, we introduce STIC, which involves four components: Window Representation, Time-Invariance Block,
Mechanism-Invariance Block, and Parallel Blocks for Joint Training. The process is depicted in Figure 2. Firstly, we
transform the observed time series into a window representation format, leveraging Lemma 1. Next, we input the
window representation into both the time-invariance block and the mechanism-invariance block (Bt and Bm in Figure
2). Finally, we conduct joint training using the extracted features from two kinds of parallel blocks. In particular,
the time-invariance block Bt generates the estimated window causal matrix Ŵ. To better represent the symbols used
in Section 3, we also arrange a table to summarize and show their definitions, as shown in Table 2. The subsequent
subsections provide a detailed explanation of the key components of STIC.

3.1 Window Representation

The observed dataset X ∈ Rd×T contains d observed continuous time series (variables) with T time steps. We
also define a predefined maximum time lag as τ . To ensure that the entire causal contemporaneous and time-lagged
influence is observed, we calculate the minimum length of the window that can capture this influence as τ̂ = τ + 1.
To construct the window observations, we select the observed values from the first $T-1$ time steps, i.e., $\mathcal{X}^{1:T-1} = \{X_1^{1:T-1}, \cdots, X_d^{1:T-1}\} \in \mathbb{R}^{d \times (T-1)}$. Using a sliding window approach along the temporal chain of observations, we create window observations of length $\hat{\tau}$ and width $d$, with a step size of 1. This process results in $c = T - \hat{\tau}$ window
[Figure 3 shows the sliding-window construction over $X_1^{1:T-1}, X_2^{1:T-1}, \ldots, X_d^{1:T-1}$, producing windows $W_1, W_2, \ldots, W_c$ that are then concatenated.]

Figure 3: Window representation. First, we get $c$ matrices $W_\psi$ by sliding a window with predefined window length $\hat{\tau}$ and step size 1, where each $W_\psi \in \mathbb{R}^{d \times \hat{\tau}}$, $\psi = 1, ..., c$, represents the data observed in the window. Then, we concatenate the obtained $W_\psi$ together to get the final window representation $W \in \mathbb{R}^{d \times \hat{\tau} \times c}$.

observations Wψ where ψ = 1, ..., c. These window observations are referred to as the window representation W , as
illustrated in Figure 3.
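A minimal PyTorch sketch of this construction is shown below; the use of `Tensor.unfold` is our implementation choice, and tensor names follow Table 2.

```python
import torch

def window_representation(X: torch.Tensor, tau_hat: int) -> torch.Tensor:
    """Slide a window of length tau_hat (step 1) over the first T-1 steps of X in R^{d x T}.

    Returns W in R^{d x tau_hat x c} with c = T - tau_hat window observations.
    """
    d, T = X.shape
    X_head = X[:, : T - 1]                       # first T-1 time steps
    # unfold over the time dimension: (d, c, tau_hat) -> (d, tau_hat, c)
    W = X_head.unfold(dimension=1, size=tau_hat, step=1).permute(0, 2, 1)
    assert W.shape == (d, tau_hat, T - tau_hat)
    return W

X = torch.randn(5, 100)                 # d = 5 variables, T = 100 time steps
W = window_representation(X, tau_hat=3)
print(W.shape)                          # torch.Size([5, 3, 97]), i.e. c = 97 windows
```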

3.2 Time-Invariance Block

According to Definition 2, the causal relationships among variables remain unchanged as time progresses. Exploiting this property, we can extract shared information from the window representation $W$ and utilize it to obtain the estimated window causal matrix $\hat{\mathcal{W}}$. Inspired by convolutional neural networks used in causal discovery [Nauta et al., 2019], we introduce an invariance-based convolutional network structure, denoted $B_t$, to incorporate the temporal information within the window representation $W$. For each window observation $W_\psi \in \mathbb{R}^{d \times \hat{\tau}}$, we employ the following formula to aggregate similar information among the time series within the window observations:

$$\hat{\mathcal{W}} = f_1(K_t \odot W_1, ..., K_t \odot W_c) \quad (4)$$

Here, the shared $K_t \in \mathbb{R}^{d \times \hat{\tau}}$ represents a learnable extraction kernel utilized to extract information from each window observation. The symbol $\odot$ denotes the Hadamard product between matrices, and $f_1$ refers to a neural network structure. By applying the Hadamard product with the shared kernel $K_t$, the resulting output exhibits similar characteristics across the time series. Moreover, $K_t$ serves as a time-invariant feature extractor, capturing recurring patterns that appear in the input series and aiding in forecasting short-term future values of the target variable. In Granger causality, these learned patterns reflect causal relationships between time series, which are essential for causal discovery [Nauta, 2018]. To ensure the generality of STIC, we employ a simple feed-forward neural network (FNN) $f_1 : \mathbb{R}^{c \times d \times \hat{\tau}} \to \mathbb{R}^{d \times d \times \hat{\tau}}$ to extract shared information from each $K_t \odot W_\psi$, $\psi = 1, ..., c$. Furthermore, we impose a constraint to prohibit self-loops in the estimated window causal matrix $\hat{\mathcal{W}}$ when the time lag is zero. That is:


$$\hat{\mathcal{W}}^{\tau}_{i,j} = \begin{cases} 0 & \text{if } i = j \text{ and } \tau = 0 \\ 0 & \text{if } \hat{\mathcal{W}}^{\tau}_{i,j} < p \\ 1 & \text{otherwise} \end{cases} \quad (5)$$

where $\hat{\mathcal{W}}^{\tau}_{i,j}$ represents the estimated binary existence of the causal effect of $X_i$ on $X_j$ with a time delay of $\tau \in \{0, ..., \tau\}$, and $p$ is a threshold used to eliminate edges with a low probability of existence.
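A minimal PyTorch sketch of $B_t$ under these definitions is given below. The hidden width, the sigmoid output, and the flattening of the $c$ window features before $f_1$ are our own assumptions for illustration; the paper only fixes the signature $f_1 : \mathbb{R}^{c \times d \times \hat{\tau}} \to \mathbb{R}^{d \times d \times \hat{\tau}}$ and the constraint of Equation 5.

```python
import torch
import torch.nn as nn

class TimeInvarianceBlock(nn.Module):
    """B_t: a shared kernel K_t extracts common window features; an FNN maps them to W_hat."""

    def __init__(self, d: int, tau_hat: int, c: int, hidden: int = 64):
        super().__init__()
        self.K_t = nn.Parameter(torch.randn(d, tau_hat))   # shared extraction kernel
        self.f1 = nn.Sequential(                           # f_1: R^{c*d*tau_hat} -> R^{d*d*tau_hat}
            nn.Flatten(start_dim=0),
            nn.Linear(c * d * tau_hat, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d * d * tau_hat),
            nn.Sigmoid(),                                  # soft edge scores in [0, 1]
        )
        self.d, self.tau_hat = d, tau_hat

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        # W: (d, tau_hat, c); Hadamard product K_t ⊙ W_psi for every window psi at once
        features = self.K_t.unsqueeze(-1) * W              # (d, tau_hat, c)
        W_hat = self.f1(features.permute(2, 0, 1))         # stack as (c, d, tau_hat), map to edges
        return W_hat.view(self.d, self.d, self.tau_hat)

def threshold(W_hat: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Equation 5: remove weak edges and zero-lag self-loops from the soft scores."""
    binary = (W_hat >= p).float()
    idx = torch.arange(W_hat.size(0))
    binary[idx, idx, 0] = 0.0                              # no self-loops when tau = 0
    return binary
```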

[Figure 4: line plots of F1 (left panels) and precision (right panels) versus the number of observed variables $d$, one panel per $T \in \{1000, 500, 200, 100\}$; legend: PCMCI, PCMCI+, DYNOTEARS, TCDF, CUTS, CUTS+, STIC (ours).]

Figure 4: The results of F1 (detailed in Figure 4 left) and precision (detailed in Figure 4 right) evaluated on linear Gaussian datasets with varying numbers of variables (d) and observed time steps (T). The observed data $\mathcal{X}$ is generated by sampling d time series with T observed time steps from a linear Gaussian distribution. We consider different values of d ranging from 5 to 20, and varying observed time steps T including 100, 200, 500, and 1000.


3.3 Mechanism-Invariance Block

As stated in Definition 3, the causal conditional probability relationships among the time series remain unchanged as time varies. Consequently, the causal functions between variables also remain constant over time. With this in mind, our objective in $B_m$ is to find a unified transform function $f_2 : \mathbb{R}^{d \times \hat{\tau}} \to \mathbb{R}^{d \times \hat{\tau}}$ that accommodates all window observations. To achieve this goal, as depicted in Figure 2, we employ a convolution kernel $K_m \in \mathbb{R}^{d \times \hat{\tau}}$ as $f_2$. This kernel performs a Hadamard product operation with each window $W_\psi \in \mathbb{R}^{d \times \hat{\tau}}$ in $W$, where $\psi = 1, ..., c$. Subsequently, we employ the Parametric Rectified Linear Unit (PReLU) activation function [Zhu et al., 2017] to obtain the output $\overline{W}_\psi \in \mathbb{R}^{d \times \hat{\tau}}$:

$$\overline{W}_\psi = \mathrm{PReLU}(K_m \odot W_\psi) \quad (6)$$

Each $\overline{W}_\psi$ represents the transformed matrix obtained from the window observation $W_\psi$ by the unified transform function $f_2$ implemented with the convolution kernel $K_m$, and is finally used to predict $\hat{\mathcal{X}}^{\hat{\tau}+\psi}$. Note that this transform function $f_2$ can also be composed of $N$ different but equal-dimensional kernels $K_m^1, ..., K_m^N \in \mathbb{R}^{d \times \hat{\tau}}$, which are nested to perform complex nonlinear transformations. After $f_2$, the values inside each window $W_\psi \in \mathbb{R}^{d \times \hat{\tau}}$ are passed through a $\hat{\mathcal{W}}$-selected column summation to predict $\hat{\mathcal{X}}^{\hat{\tau}+\psi} \in \mathbb{R}^d$.
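A corresponding sketch of $B_m$, using a single kernel $K_m$ as $f_2$ (stacking several such layers would give the nested nonlinear variant described above):

```python
import torch
import torch.nn as nn

class MechanismInvarianceBlock(nn.Module):
    """B_m: one shared kernel K_m plus PReLU acts as the unified transform f_2 on every window."""

    def __init__(self, d: int, tau_hat: int):
        super().__init__()
        self.K_m = nn.Parameter(torch.randn(d, tau_hat))  # shared transform kernel
        self.prelu = nn.PReLU()

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        # W: (d, tau_hat, c) -> W_bar: (d, tau_hat, c); Equation 6 applied to every window at once
        return self.prelu(self.K_m.unsqueeze(-1) * W)
```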

3.4 Parallel Blocks for Joint Training

So far, we have obtained the estimated window causal matrix $\hat{\mathcal{W}}$ using $B_t$, and the transformed matrix $\overline{W}$ with $B_m$. We use convolutional neural networks in both $B_t$ and $B_m$. Their structures are similar, but their functions and purposes are different. In $B_t$, we focus on the shared underlying unified structure of all window observations. Following Definition 2 of short-term time invariance, we choose a convolutional neural network structure with translation invariance [Kayhan and Gemert, 2020, Singh et al., 2023]. We expect that $f_1$, with $K_t$ as its main component, can extract the invariant structure of the window representation $W$. In $B_m$, we focus on the convolution kernel $K_m$, which is expected to serve as a unified transform function $f_2$ satisfying Definition 3 of short-term mechanism invariance and performing complex nonlinear transformations.


Based on Definition 4 described in Section 2.5, after obtaining the estimated window causal matrix and the transform functions between variables, we can combine the outputs from the time-invariance and mechanism-invariance blocks and use a $\hat{\mathcal{W}}$-selected column summation to predict $\hat{\mathcal{X}}$. We consider that the time-invariance block facilitates the identification of parent-child relationships between variables, formalized as $\hat{\mathcal{W}}$, while the mechanism-invariance block helps to explore the generative mechanisms, i.e., the transform functions. Consequently, we can naturally combine the outputs $\hat{\mathcal{W}}$ and $\overline{W}$. Specifically, by utilizing $\hat{\mathcal{W}}$ and the computed $\overline{W}_\psi$, $\psi = 1, ..., c$, we can ultimately obtain the estimates $\hat{\mathcal{X}}^{\hat{\tau}+\psi}$, namely the $\hat{\mathcal{W}}$-selected column summation,

$$\hat{\mathcal{X}}^{\hat{\tau}+\psi} = \sum_{\tau=0}^{\tau} \overline{W}^{\tau}_{\psi} \odot \hat{\mathcal{W}}^{\tau} \quad (7)$$

Here, we consider each $\tau \in \{0, ..., \tau\}$ and combine the estimated window causal matrix $\hat{\mathcal{W}}$ with the corresponding transformed window observations $\overline{W}_\psi$ obtained through $B_m$ to obtain the values of $\hat{\mathcal{X}}^{\hat{\tau}+\psi}$. Our ultimate goal is to find a window causal matrix $\hat{\mathcal{W}}$ that satisfies the conditions by optimizing the mean squared error (MSE) loss $L$ between the predicted $\hat{\mathcal{X}}$ and the ground truth $\mathcal{X}$ at each time step $t$. The final auto-regressive equation is expressed as follows:

$$L = \sum_{t=\hat{\tau}+1}^{T} \sum_{i=1}^{d} \| X_i^t - \hat{X}_i^t \|_2^2 \quad (8)$$

We adopt the gradient $\nabla L$ to optimize the parameters within the time-invariance and mechanism-invariance blocks.
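Putting the two blocks together, the sketch below implements the $\hat{\mathcal{W}}$-selected column summation of Equation 7 and the MSE objective of Equation 8. Using the soft (pre-threshold) $\hat{\mathcal{W}}$ inside the loss, so that gradients can flow, is our own assumption; `bt` and `bm` refer to the block sketches above.

```python
import torch

def predict_next(W_bar_psi: torch.Tensor, W_hat: torch.Tensor) -> torch.Tensor:
    """Equation 7: X_hat^{tau_hat+psi}[j] = sum_tau sum_i W_bar_psi[i, tau] * W_hat[i, j, tau]."""
    return torch.einsum("it,ijt->j", W_bar_psi, W_hat)

def train_step(X, W, bt, bm, optimizer, tau_hat):
    """One gradient step on the MSE loss of Equation 8 over all c windows."""
    d, T = X.shape
    c = T - tau_hat
    W_hat = bt(W)                      # (d, d, tau_hat) soft edge scores from B_t
    W_bar = bm(W)                      # (d, tau_hat, c) transformed windows from B_m
    X_hat = torch.stack([predict_next(W_bar[:, :, psi], W_hat) for psi in range(c)], dim=1)
    loss = ((X[:, tau_hat:] - X_hat) ** 2).sum()   # predictions cover steps tau_hat+1 .. T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```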

4 Experiment Results
In this section, we present a comprehensive series of experiments on both synthetic and benchmark datasets to verify
the effectiveness of the proposed STIC. Following the experimental setup of [Runge et al., 2019, Runge, 2020], we
compare STIC against the constraint-based approaches such as PCMCI [Runge et al., 2019] and PCMCI+ [Runge,
2020], the score-based approaches such as DYNOTEARS [Pamfil et al., 2020], and the Granger-based approaches
TCDF [Nauta et al., 2019], CUTS [Cheng et al., 2022] and CUTS+ [Cheng et al., 2023].
Our causal discovery algorithm is implemented using PyTorch. The source code for our algorithm is publicly available at https://github.com/HITshenrj/STIC. Both the time-invariance block and mechanism-invariance block are implemented using
convolutional neural networks.
Firstly, we conducted experiments on synthetic datasets, encompassing both linear and non-linear cases. The methods of
generating synthetic datasets for both linear and non-linear cases will be introduced separately in Section 4.2. Secondly,
we proceeded to perform experiments on benchmark datasets to demonstrate the practical value of our model in Section
4.3. Thirdly, to evaluate the sensitivity of hyper-parameters, such as the learning rate (default 1e-5), the predefined $\tau$
(default 0.4d) and the threshold p (default 0.3), we conducted ablation experiments as detailed in Section 4.4.
We employ two evaluation metrics to assess the quality of the estimated causal matrix: the F1 score and precision. A higher F1 score indicates a more comprehensive estimation of the window causal matrix, while a higher precision indicates that a larger fraction of the identified causal edges are correct. In this paper, we consider causal edges with different time lags for the same pair of variables as distinct causal edges. Specifically, if there exists a causal edge from $X_i$ to $X_j$ with a time lag of $\tau_1$, and another causal edge from $X_i$ to $X_j$ with a time lag of $\tau_2$, where $i \neq j$ and $\tau_1 \neq \tau_2$, we regard these as two separate causal edges. Since the maximum time lag must be predefined in STIC, we truncate the estimated $\hat{\mathcal{W}} \in \mathbb{R}^{d \times d \times (\tau+1)}$ to $\hat{\mathcal{W}} \in \mathbb{R}^{d \times d \times (\tilde{\tau}+1)}$ and then compute the evaluation metrics. We handle the other baselines requiring a predefined maximum time lag parameter (such as PCMCI, PCMCI+, DYNOTEARS, CUTS, CUTS+) in the same manner.
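Under this convention, where the same pair $(X_i, X_j)$ with different lags counts as distinct edges, precision and F1 can be computed directly on the flattened boolean tensors. A sketch, assuming both matrices have already been truncated to the same maximum lag:

```python
import numpy as np

def lagged_edge_metrics(W_true: np.ndarray, W_est: np.ndarray):
    """Precision and F1 over (i, j, tau) edges; inputs are boolean tensors of shape (d, d, lags)."""
    tp = np.logical_and(W_est, W_true).sum()
    fp = np.logical_and(W_est, ~W_true).sum()
    fn = np.logical_and(~W_est, W_true).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, f1
```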

4.1 Baselines

We select six state-of-the-art causal discovery methods as baselines for comparison:

• PCMCI [Runge et al., 2019] is a notable work that extends the PC algorithm [Kalisch and Bühlman, 2007] for causal discovery from time-series data. The source code for PCMCI is available at https://github.com/jakobrunge/tigramite.

[Figure 5: bar charts of F1 and precision for PCMCI, PCMCI+, DYNOTEARS, TCDF, CUTS, CUTS+, and STIC (ours).]

Figure 5: The results of F1 and precision evaluated on nonlinear Gaussian datasets. We fix the number of variables d = 5 and the observed time steps T = 1000.

PCMCI divides the causal discovery process into two components: the identification of relevant sets through conditional independence tests and the direction determination. It assumes causal stationarity, the absence of contemporaneous causal links, and no hidden variables. Specifically, the PC-stable algorithm [Colombo et al., 2014] is employed to remove irrelevant conditions through iterative independence tests. Furthermore, the Multivariate Conditional Independence test is conducted to address false-positive control in scenarios with highly interdependent time series.

• PCMCI+ [Runge, 2020] improves upon PCMCI by reducing the number of independence tests and optimizing the selection of conditional sets, resulting in superior effectiveness and efficiency in the same experimental setting. The source code is also available at https://github.com/jakobrunge/tigramite. PCMCI+ overcomes the limitation of the "no contemporaneous causal links" assumption in PCMCI. PCMCI+ expedites the selection of conditional sets by testing all time-lagged pairs conditional on only the strongest $p$ adjacencies in each $p$-iteration, without evaluating all $p$-dimensional subsets of adjacencies. Moreover, intra-slice sets are introduced to further refine the determination of all structures.

• DYNOTEARS [Pamfil et al., 2020] represents a groundbreaking advancement in the field of causal discovery from time-series data by transforming the combinatorial graph search problem into a continuous optimization problem. The details of this work can be found in the repository at https://github.com/ckassaad/causal_discovery_for_time_series. This approach characterizes the acyclicity constraint as a smooth equality constraint through the minimization of a penalized loss while adhering to the acyclicity constraint.

• TCDF [Nauta et al., 2019] is an outstanding work that utilizes attention-based convolutional neural networks (CNNs) to explore causal relationships between time series and the time delay between cause and effect. The code for TCDF can be accessed at https://github.com/M-Nauta/TCDF. By leveraging Granger causality, TCDF predicts one time series based on other time series and its own historical values, employing CNNs to identify and analyze causal relationships within time-series data.

• CUTS [Cheng et al., 2022] is an outstanding neural Granger causal discovery algorithm for jointly imputing unobserved data points and building causal graphs, incorporating two mutually boosting modules (latent data prediction and causal graph fitting) in an iterative framework. After hallucinating and registering unstructured data, which might be of high dimension and with complex distribution, CUTS builds a causal adjacency matrix with imputed data under a sparsity penalty. The code for CUTS is available at https://github.com/jarrycyx/UNN/tree/main/CUTS. CUTS is a promising step toward applying causal discovery to real-world applications with non-ideal observations.

• CUTS+ [Cheng et al., 2023] builds on the Granger-causality-based causal discovery method CUTS and increases scalability through coarse-to-fine discovery and message-passing-based methods. The code for CUTS+ can be accessed at https://github.com/jarrycyx/UNN/tree/main/CUTS_Plus. CUTS+ significantly improves causal discovery performance on high-dimensional data with various types of irregular sampling.


4.2 Experiments on Synthetic Datasets

We generate synthetic datasets in the following manner. Firstly, we consider several typical challenges [Runge et al.,
2019, Runge, 2020] with contemporaneous and time-lagged causal dependencies, following an additive noise model.
We set the ground truth maximum time lag to 0.4d and initialize the existence of each edge in the true window causal
matrix $\mathcal{W}$ with a probability of 50%. For each variable $X_i$, its relation to its parents $Pa_{\mathcal{G}}(X_i)$ is defined as $X_i = f_i(Pa_{\mathcal{G}}(X_i)) + \varepsilon_i$, where $f_i$ represents the ground truth transformation function between $X_i$'s parents $Pa_{\mathcal{G}}(X_i)$ and $X_i$. If $X_j \in Pa^{\tau}_{\mathcal{G}}(X_i)$, then in the ground truth causal matrix $\mathcal{W}$, $\mathcal{W}^{\tau}_{j,i} = 1$. Secondly, for linear datasets, each $f_i$ is
defined by a weighted linear function, while for nonlinear datasets, each fi is defined using a weighted cosine function.
We sample the weights from a uniform distribution, such that if a causal edge exists, the corresponding weight in the
additive noise model is sampled from the interval U (−2, −0.5] ∪ [0.5, 2) to ensure non-zero values. For non-causal
edges, the weight is set to 0. The noise term εi follows either a standard normal distribution N (0, 1) or is uniformly
sampled from the interval U [0, 1]. These data-generating procedures are similar to those used by the PCMCI family
[Runge et al., 2019, Runge, 2020] and CUTS family [Cheng et al., 2022, 2023].
In the following, we present different results on linear Gaussian datasets (Section 4.2.1), nonlinear Gaussian datasets
(Section 4.2.2), and linear uniform datasets (Section 4.2.3) to demonstrate the superiority of our model. Specifically, to
reduce the impact of random initialization, we conduct 10 experiments for each type of datasets and report the mean
and variance of the experimental results.
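For reference, a sketch of a linear Gaussian instance of this generating process is shown below. The edge probability, the weight interval $U(-2, -0.5] \cup [0.5, 2)$, and the standard normal noise follow the text; restricting to time-lagged effects and rescaling the weights to keep the simulation stable are our own simplifications, and the authors' generator may differ in detail.

```python
import numpy as np

def generate_linear_gaussian(d=5, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    max_lag = max(1, int(0.4 * d))               # ground-truth maximum time lag 0.4d
    # Ground-truth window causal matrix: each edge exists with probability 50%.
    W = rng.random((d, d, max_lag + 1)) < 0.5
    # Weights from U(-2, -0.5] ∪ [0.5, 2) on existing edges, 0 elsewhere.
    weights = rng.uniform(0.5, 2.0, size=W.shape) * rng.choice([-1.0, 1.0], size=W.shape)
    weights[~W] = 0.0
    weights /= d * max_lag                       # our rescaling to keep the recursion stable
    X = rng.normal(size=(d, T))                  # initialised with the N(0, 1) noise terms
    for t in range(max_lag, T):
        for tau in range(1, max_lag + 1):        # time-lagged effects only, for simplicity
            X[:, t] += weights[:, :, tau].T @ X[:, t - tau]
    return X, W

X, W_true = generate_linear_gaussian()
print(X.shape, W_true.shape)                     # (5, 1000) (5, 5, 3)
```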

4.2.1 Linear Gaussian Datasets

The data generation process for linear Gaussian datasets follows the relationship Xi = wi P aG (Xi ) + εi , where εi is
sampled from a standard normal distribution N (0, 1). To demonstrate the capability of our model in causal discovery
from time-series data on datasets of varying sizes, we compare STIC with baselines under different conditions, including
different numbers of variables (d = {5, 10, 15, 20}) and different lengths of time steps (T = {100, 200, 500, 1000}).
The results are summarized in Figure 4. Figure 4 left presents the variation of F1 score as the number of variables
increases, while Figure 4 right shows the variation of precision with the number of variables. A comprehensive analysis
of the experiments requires the joint consideration of both Figure 4 left and right. From a macroscopic perspective, our
proposed STIC achieves the highest F1 scores on linear Gaussian datasets, while precision reaches the state-of-the-art
levels in most cases. We will compare the performance of STIC and the baselines from two aspects of analysis:
Aspect 1: The relationship between the number of variables d and the model when T remains constant. When the
number of observed time steps is fixed at T = 1000, corresponding to the top-left graphs in Figure 4 left and right, we observe that
as the number of variables increases, the F1 scores of all causal discovery methods tend to decrease. However, our
proposed STIC achieves an average F1 score of 0.86, 0.77, 0.79, 0.77 and an average precision of 0.80, 0.62, 0.74, 0.72
across the four different numbers of variables, surpassing other strong baselines. By comparing the line plots in the
corresponding positions of Figure 4 left and right, especially when T = 100, corresponding to the bottom-right graphs
in Figure 4 left and right, we find that our proposed STIC achieves an average F1 score of 0.76, 0.76, 0.77, 0.65 and an
average precision of 0.66, 0.70, 0.77, 0.65 across the four different numbers of variables, significantly outperforming
other strong baselines.
In the case of fixed observed time steps, as the number of variables increases, constraint-based approaches such as
PCMCI and PCMCI+ suffer from severe performance degradation because they require significant prior knowledge
to determine the threshold p, which governs the presence of causal edges. For score-based methods,
the DYNOTEARS method shows relatively stable performance as the number of variables increases, but it does not
achieve the optimal performance among all methods. As for Granger-based methods, CUTS and CUTS+ often suffer
from poor performance due to the inability to recognize time lags. Our proposed STIC and the TCDF method achieve
competitive results in terms of F1 scores. However, our method exhibits higher precision.
We attribute this superior performance to the window representation employed in STIC. By repeatedly extracting features
from observed time series in different window observations, such representation acts as a form of data augmentation
and aggregation. It enables a macroscopic view of common characteristics among multiple window observations,
facilitating the learning of more accurate causal structures. Thus, our STIC model achieves optimal performance when
the number of variables d changes.
Aspect 2: The relationship between the observed time steps T and the model when d remains constant. When
examining the impact of observed time steps T on the models while keeping the number of variables constant, we
observe that our STIC method consistently maintains an F1 score of approximately 0.7 across different values of T .
However, PCMCI+ and DYNOTEARS exhibit a significant decline in their F1 scores as T decreases. For instance, at
T = 1000, PCMCI+ and DYNOTEARS perform similarly to our STIC method, but at T = 100, their F1 scores drop
to half of that achieved by our STIC method. For PCMCI, it consistently falls behind our STIC method, regardless

[Figure 6: line plots of F1 and precision versus the number of observed variables $d$; legend: PCMCI, PCMCI+, DYNOTEARS, TCDF, CUTS, CUTS+, STIC (ours).]

Figure 6: The results of F1 and precision evaluated on linear uniform datasets with fixed observed time steps (T = 1000) and the number of variables (d) ranging from 5 to 20.

Table 3: The results of F1 and precision evaluated on the FMRI dataset. In terms of the average of both F1 and precision, STIC outperforms the other baselines. Moreover, STIC shows better stability in terms of variance.

| | Metric | PCMCI | PCMCI+ | DYNOTEARS | TCDF | CUTS | CUTS+ | STIC (ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| d = 5 | F1 | 0.38±0.004 | 0.37±0.009 | 0.43±0.009 | 0.42±0.007 | 0.35±0.001 | 0.36±0.034 | 0.45±0.003↑ |
| | precision | 0.31±0.014 | 0.33±0.016 | 0.31±0.017 | 0.44±0.006 | 0.22±0.007 | 0.29±0.035 | 0.70±0.030↑ |
| d = 10 | F1 | 0.24±0.001 | 0.31±0.001 | 0.20±0.002 | 0.42±0.001 | 0.11±0.001 | 0.44±0.001 | 0.47±0.001↑ |
| | precision | 0.34±0.001 | 0.37±0.00 | 0.33±0.004 | 0.44±0.003 | 0.20±0.000 | 0.51±0.001 | 0.60±0.023↑ |
| d = 15 | F1 | 0.27±0.003 | 0.35±0.007 | 0.19±0.011 | 0.35±0.002 | 0.14±0.001 | 0.26±0.020 | 0.53±0.006↑ |
| | precision | 0.19±0.002 | 0.20±0.001 | 0.11±0.010 | 0.43±0.032 | 0.08±0.002 | 0.26±0.003 | 0.80±0.005↑ |

of changes in T . While TCDF achieves a relatively consistent level of performance, it exhibits lower performance
compared to our model. Furthermore, we find that after treating different time lags as different causal edges, the F1
scores and precisions of CUTS and CUTS+ are maintained at a relatively low level.
For constraint-based approaches, the PCMCI and PCMCI+ algorithms perform poorly because, as the number of samples
decreases, the statistical significance of conditional independence cannot fully capture the causal relationships between
variables. As for score-based methods, DYNOTEARS does not perform well on linear data. One possible reason is that
DYNOTEARS heavily relies on acyclicity in its search, which may not converge to the correct causal graph. Regarding
Granger-based methods, we believe that they are overly conservative and fail to accurately predict all correct causal
edges.
In contrast, our STIC model is capable of predicting a greater number of causal edges, which is crucial for discovering
new knowledge. This superior performance can be attributed to the design of the convolutional time-invariance block.
This design allows for the extraction of more causal structure features from limited observed data, enabling a more
accurate exploration of potential causal relationships even with a small number of samples. Consequently, our STIC
model effectively addresses the challenge of causal discovery in low-sample scenarios, i.e., improving sample efficiency.

4.2.2 Nonlinear Gaussian Datasets


In this section, we perform experiments on nonlinear Gaussian datasets to evaluate the performance of STIC. We set the
number of variables (d = 5) and the observed time steps (T = 1000). For each Xi , its relationship with its parents
P aG (Xi ) is defined using the cosine function, and the noise term εi follows the standard normal distribution.
The performance of STIC and the baselines is visualized in Figure 5. It can be observed that STIC achieves an F1 score
of 0.44, which is higher than all baselines (PCMCI: 0.41, PCMCI+: 0.43, DYNOTEARS: 0.22, TCDF: 0.41, CUTS:
0.24, CUTS+: 0.37). Thus, STIC achieves a higher F1 score despite having lower precision compared to
the other baselines. For constraint-based methods (PCMCI and PCMCI+), one possible reason for achieving similar F1
scores with our proposed STIC is that the length of observed time steps is set to 1000, which is sufficient for statistical
independence tests. Thus, the conditional independence tests can directly operate on the data without being affected by
noise. Regarding score-based methods, we believe that DYNOTEARS uses a simple network that may not effectively

Table 4: The results of the ablation study on the linear Gaussian datasets with the number of variables (d = 5). Columns group by learning rate (lr), predefined maximum time lag (τ), and threshold (p).

| Metric | lr = 1e-4 | lr = 1e-5 | lr = 1e-6 | τ = 2 | τ = 3 | τ = 4 | p = 0.1 | p = 0.3 | p = 0.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F1 | 0.77±0.005 | 0.76±0.008 | 0.78±0.004 | 0.76±0.008 | 0.68±0.004 | 0.60±0.003 | 0.43±0.001 | 0.76±0.008 | 0.80±0.020 |
| precision | 0.66±0.006 | 0.66±0.013 | 0.68±0.016 | 0.66±0.013 | 0.53±0.005 | 0.45±0.003 | 0.27±0.001 | 0.66±0.013 | 0.89±0.019 |

capture nonlinear transforms, leading to lower F1 scores. For Granger-based methods, although TCDF achieves a
comparable F1 score to STIC (and even higher precision), the variance of STIC is significantly lower. This indicates
that TCDF is highly unstable, and there is a considerable amount of uncertainty in causal discovery. One possible
reason for this is that TCDF does not incorporate window representation like STIC, which could lead to inefficient
training of the convolutional neural network. We find that CUTS and CUTS+ are not very good at causal discovery
on nonlinear Gaussian datasets, and both F1 and precision are lower than those of our STIC. One possible reason is that both
models rely on graph neural networks and treat learnable graph structures as estimated causal graphs. However, the
graph structure in the graph neural network is full of correlational relationships rather than causal relationships, so the
output graph structure does not contain only causal relationships, resulting in a decrease in both F1 and precision. We
believe that the robustness of our proposed STIC lies in the mechanism-invariance block, which repeatedly verifies the
functional causal relationships within each single window, effectively reducing model instability.

4.2.3 Linear Uniform Datasets

The linear uniform datasets are generated with observed time steps (T = 1000) and varied numbers of variables (d = {5, 10, 15, 20}). For each $X_i$, $f_i$ is set as a linear function, while the noise term $\varepsilon_i$ follows a uniform distribution U[0, 1].
The performance of STIC and baselines are shown in Figure 6. STIC outperforms baselines in terms of F1 score
and precision in most cases, especially when the number of time series d is large. For constraint-based methods,
PCMCI and PCMCI+ perform poorly in terms of F1 score and precision when the number of variables is relatively
large (d = {10, 15, 20}). We consider that since conditional independence tests serve as strict indicators of causal
relationships, they may fail due to the limited number of time steps and the presence of uniform noise. Moreover,
PCMCI cannot determine intra-slice causal relationships and performs much worse than our STIC model in terms of F1
score and precision. For score-based methods, DYNOTEARS identifies causal relationships by fitting auto-regressive
coefficients between variables, treating them as estimated causal relationships. However, due to the strong influence
of noise, DYNOTEARS fails to recognize causal relationships in the linear uniform datasets. Interestingly, TCDF,
which shows competitive performance compared to our STIC model in Section 4.2.1, performs particularly poorly
on the linear uniform datasets. From the high precision and low F1 score of TCDF, we can deduce that the uniform
distribution introduces many incorrectly estimated causal edges during the process of estimating temporal causality
based on Granger causality using past values of other variables. The F1 score and precision of CUTS and CUTS+ further
support the idea that Granger causality is not well suited to linear uniform datasets. One possible reason is that, for a uniform distribution, the inverse transformation still exists, which degrades the performance of recovering causality from correlation.

4.3 Experiments on Benchmark Datasets

In this section, we utilize the FMRI benchmark, a common neuroscientific benchmark dataset based on Functional Magnetic Resonance Imaging [Smith et al., 2011], to explore and discover brain blood-flow patterns. The dataset contains 28 different underlying brain networks with varying numbers of observed variables (d = {5, 10, 15}). For each of the 28 brain networks, we observe 200 time steps for causal discovery. The results are reported in Table 3.

The results demonstrate that STIC achieves the highest average F1 scores for all numbers of observed variables, surpassing the average F1 scores of PCMCI, PCMCI+, DYNOTEARS, TCDF, CUTS and CUTS+. Moreover, in terms of precision,
STIC achieves significantly higher precisions than those of the other baselines. For constraint-based methods, such as
PCMCI and PCMCI+, their poor performance on the FMRI datasets may be attributed to the short length of observed
time steps, which affects their ability to accurately test for conditional independence. Regarding DYNOTEARS, we
believe that acyclicity regularizers still limit its performance. In comparison, our STIC model outperforms TCDF,
CUTS and CUTS+ by utilizing a window representation, which enhances the representation of observed data within
each window. This enables more accurate learning of common causal features and structures across multiple windows.


4.4 Ablation Study

We conduct ablation experiments on the linear Gaussian datasets with the number of variables (d = 5) to investigate the impact of different hyper-parameters on the experimental results, such as the learning rate (default: 1e-5), the predefined maximum time lag (default: 0.4d = 2), and the threshold p (default: 0.3). Specifically, we vary the learning rate by increasing it to 1e-4 and decreasing it to 1e-6. We also increase the predefined maximum lag to τ = 3 and τ = 4, respectively, and change the threshold to p = 0.1 or p = 0.5. The empirical results are summarized in Table 4.

• The learning rate lr: The experiments reveal that manipulating the learning rate, either by increasing or
decreasing it, has little effect on the F1 score and precision. This finding suggests that our convolutional neural
network architecture is not sensitive to changes in the learning rate, simplifying the parameter tuning process.
• The predefined maximum time lag τ : However, increasing the predefined maximum lag τ gradually
deteriorates performance. We speculate that this decline occurs because, with a longer lag, the window for
observations expands, potentially causing STIC to learn multi-hop causal edges ($X_i \xrightarrow{\tau_1} X_j \xrightarrow{\tau_2} X_k$) as
single-hop causal edges ($X_i \xrightarrow{\tau_1 + \tau_2} X_k$). Addressing this issue could be a focus for future research.
• The threshold p: Furthermore, comparing the default setting to STIC with p = 0.1, we observe a significant
decline in both F1 score and precision when the threshold is lower. When comparing the default setting to
STIC with p = 0.5, we find that while the F1 score remains relatively stable, precision notably improves when
the threshold is increased. These findings indicate that lowering the threshold weakens the model’s
ability to filter out spurious causal edges, while raising it retains only the most confidently estimated
edges as causal, resulting in increased precision but a similar F1 score. Thus, the threshold plays a
pivotal role in deciding how many causal edges are discovered, and a trade-off needs to be made; a minimal
sketch of this thresholding step follows the list.
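As referenced in the threshold discussion above, the following is a minimal sketch of how such a threshold p can be applied: entries of the estimated window causal matrix are retained as causal edges only if their normalized score exceeds p. The min-max normalization and the function name are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch (illustrative): binarize an estimated edge-score matrix with
# threshold p; raising p keeps fewer, higher-confidence edges.
import numpy as np

def binarize(scores: np.ndarray, p: float = 0.3) -> np.ndarray:
    """scores: estimated edge scores; returns a 0/1 causal graph."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    return (s > p).astype(int)
```

Under this reading, p = 0.1 admits many low-scoring (often spurious) edges, while p = 0.5 keeps only high-scoring ones, matching the precision pattern in Table 4.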

5 Discussion
This study presents two kinds of short-term invariance-based convolutional neural networks for discovering causality
from time-series data. Major findings include: (1) our gradient-based methods effectively discover causality from
time-series data; (2) the convolutional neural network based on short-term invariance improves the sample efficiency of
causal discovery; and (3) our proposed STIC demonstrates significantly superior performance compared to baseline causal
discovery algorithms. In this section, we discuss these results in detail.

5.1 What contributes to the effectiveness of STIC?

5.1.1 Why can STIC find causal relationships?


Numerous gradient-based methods have been developed, such as DYNOTEARS within score-based approaches [Pamfil
et al., 2020], and TCDF [Nauta et al., 2019], CUTS [Cheng et al., 2022] and CUTS+ [Cheng et al., 2023] within Granger-
based approaches. Like our proposed STIC, these gradient-based methods aim to optimize the estimated causal matrix
by maximizing or minimizing constrained objective functions. With the rapid advancement and widespread adoption of deep
neural networks (NNs), researchers have begun employing NNs to infer nonlinear Granger causality, demonstrating
the effectiveness of gradient-based methods in causal discovery [Tank et al., 2021, Wu et al., 2021, Khanna and Tan,
2019]. In our approach, we maintain the assumptions and constrained functions of Granger causality so that our
method remains effective in discovering causal relationships.
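For intuition, the sketch below compresses the neural Granger idea cited above (in the spirit of Tank et al., 2021), not STIC itself: a small network predicts one target series from the lagged values of all series under a group penalty on each input series, and candidate parents are read off from the surviving input-weight groups. The architecture, sizes, and penalty weight are illustrative assumptions.

```python
# Sketch of neural Granger causality for a single target series j (illustrative).
import torch
import torch.nn as nn

def granger_weight_norms(X: torch.Tensor, j: int, tau: int = 2,
                         lam: float = 0.05, epochs: int = 500) -> torch.Tensor:
    """X: (T, d) series; returns one input-weight norm per candidate parent."""
    T, d = X.shape
    # Rows of `lagged` hold the tau-step history of all d series before time t.
    lagged = torch.stack([X[t - tau:t].reshape(-1) for t in range(tau, T)])
    target = X[tau:, j:j + 1]
    net = nn.Sequential(nn.Linear(tau * d, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        # Group the first-layer weights by input series: one norm per variable.
        group = net[0].weight.reshape(16, tau, d).norm(dim=(0, 1))
        loss = ((net(lagged) - target) ** 2).mean() + lam * group.sum()
        loss.backward()
        opt.step()
    # Large norms mark series whose history the network needs, i.e., candidate parents.
    return net[0].weight.reshape(16, tau, d).norm(dim=(0, 1)).detach()
```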

5.1.2 Why can STIC find the true causality?


As time progresses, the values of observed variables change due to statistical shifts in distributions. However, the causal
relationships between the variables remain the same. For example, carbohydrate intake may lead to an increase in blood
glucose, but the specific magnitude of the increase may vary with covariates such as body weight. This “lead” property is
an instance of invariance, which is widely used as an indicator of causal relationships [Magliacane et al., 2018, Rojas-Carulla et al., 2018, Santos,
2021, Li et al., 2021]. In this paper, we observe that some causal relationships may also vary over time. Therefore, we
make a more reasonable assumption, namely short-term time invariance and mechanism invariance [Entner and Hoyer,
2010, Liu et al., 2023, Zhang et al., 2017]. Building on these two forms of short-term invariance, we posit that both the
window causal matrix W and the transform functions f remain unchanged in the short term. For example, within a few
days (short-term), since covariates affecting blood glucose levels, such as body weight, remain nearly constant, the
increase in blood glucose levels due to carbohydrate intake is also essentially constant. The short-term mechanism
invariance proposed in this paper is also considered an invariant principle [Liu et al., 2022]. Building on these forms
of invariance, a natural extension is the introduction of parallel time-invariance and mechanism-invariance blocks for
joint training, as proposed in this paper. Through the theoretical validation of convolution in Section 2.4, we further
affirm the applicability of convolution to causal discovery from time-series data. Additionally, Granger causality is
commonly employed to examine short-term causal relationships [Ahmad et al., 2005], which further aligns with our
assumptions. Under the premise of theoretical soundness and practical applicability, our STIC framework proves highly
effective.
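A toy sketch of these two invariances, under purely illustrative choices of W, f, window length, and noise scale, is the following: within a single window the causal matrix W and the mechanism f are held fixed at every step, while the additive noise, and hence the observed values, still varies.

```python
# Toy sketch (illustrative): within one window, W and f are constant; only the
# noise changes, so the causal mechanism is short-term invariant.
import numpy as np

rng = np.random.default_rng(1)
d, window_len = 3, 6
W = (rng.random((d, d)) < 0.3).astype(float)  # causal matrix, fixed in-window
f = np.tanh                                    # mechanism, fixed in-window

x = rng.normal(size=d)
window = [x]
for _ in range(window_len - 1):
    x = f(W @ x) + 0.1 * rng.normal(size=d)    # same W and f at every step
    window.append(x)
```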

5.2 What contributes to the exceptional performance of STIC?

5.2.1 High F1 scores and precisions

The experiments conducted on both synthetic and FMRI benchmark datasets in Section 4 demonstrate that our STIC
model achieves state-of-the-art F1 scores and precision in most cases. We attribute the performance improvement
to the incorporation of the window representation, the time-invariance block, and the mechanism-invariance block. The
window representation serves as a form of data augmentation and aggregation, providing a macroscopic understanding
of common features across multiple window observations, thereby facilitating the learning of more accurate causal
structures. The time-invariance block extracts common features from multiple window observations and achieves
effective information aggregation, enhancing sample efficiency and enabling the model to achieve high performance.
The mechanism-invariance block, with nested convolution kernels, iteratively examines the functional transform within
each individual window, enabling complex nonlinear transformations. With improved accuracy in both causal structures
and complex nonlinear transformations, STIC demonstrates exceptional performance.

5.2.2 High sample efficiency

The window representation, introduced in Section 3.1, facilitates the segmentation of the entire observed dataset
$X \in \mathbb{R}^{d \times T}$ into $c = T - \tau - 1$ partitions, leveraging only a predefined hyperparameter $\tau$. Each window observation
$W_\psi$, where $\psi = 1, \ldots, c$, is ensured to satisfy both short-term time invariance and mechanism invariance simultaneously.
This representation method, similar to batch training techniques [Liang et al., 2006, Li et al., 2014, Hong et al., 2020],
optimizes data utilization, thus enhancing sample efficiency. Moreover, another pivotal aspect contributing to sample
efficiency is the novel invariance-based convolutional neural network design. This architecture enables the extraction of
richer causal structure features from limited observed data, facilitating more accurate exploration of potential causal
relationships even with a limited length of observed time steps. Consequently, our STIC model effectively tackles the
challenge of causal discovery in low-sample scenarios, thereby improving sample efficiency.
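A minimal sketch of this segmentation is given below, assuming each window spans τ + 2 consecutive time steps so that exactly c = T − τ − 1 windows fit; the window length is our illustrative reading of this count, not a verbatim excerpt of the implementation.

```python
# Sketch (illustrative) of the window representation: slide a window of length
# tau + 2 over X in R^{d x T}, yielding c = T - tau - 1 overlapping windows.
import numpy as np

def windows(X: np.ndarray, tau: int) -> np.ndarray:
    """X: (d, T) observations; returns an array of shape (c, d, tau + 2)."""
    d, T = X.shape
    c = T - tau - 1
    return np.stack([X[:, psi:psi + tau + 2] for psi in range(c)])

X = np.random.randn(4, 50)
print(windows(X, tau=2).shape)  # (47, 4, 4): c = 50 - 2 - 1 windows
```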

6 Conclusion, Limitations and Future Works

This paper introduces STIC, a novel method designed for causal discovery from time-series data by leveraging both
short-term time invariance and mechanism invariance. STIC employs sliding windows in conjunction with convolutional
neural networks to incorporate these two kinds of invariance, and then transforms the search for the window causal
matrix into a continuous auto-regressive problem. The compatibility between causal structures in time series and
convolutional neural networks is supported by our theoretical analysis, reinforcing the rationale behind STIC’s design.
Our experimental results on synthetic and benchmark datasets demonstrate the efficiency and stability of STIC,
particularly when dealing with datasets that have shorter lengths of observed time steps. It showcases the effectiveness
of the short-term invariance-based approach in capturing temporal causal structures.
However, STIC has certain limitations that require further investigation. Firstly, STIC relies on an additive-noise
assumption; while it demonstrates effectiveness under this assumption, it becomes constrained when faced with
non-additive noise. Future research should aim to develop more comprehensive approaches capable of handling various
types of non-additive noise. Secondly, the manually predefined maximum lag used in STIC may pose limitations.
Ablation experiments indicate that this hyperparameter
setting can lead to the model learning multi-hop causal edges as single-hop causal edges. To overcome this, future
research should explore more advanced blocks, such as attention mechanisms, to further enhance the performance of
STIC.
In summary, STIC represents a promising research direction for addressing the challenges of causal discovery from
time-series data, and we hope that STIC can discover new causal knowledge and provide new research ideas for medical
and other fields.


Acknowledgments
This study was supported in part by a grant from the National Key Research and Development Program of China
[2021ZD0110900] and the National Natural Science Foundation of China [72293584].

References
Josh Cowls and Ralph Schroeder. Causation, correlation, and big data in social science research. Policy & Internet, 7
(4):447–472, 2015.
Nick Pawlowski, Daniel Coelho de Castro, and Ben Glocker. Deep structural causal models for tractable counterfactual
inference. Advances in Neural Information Processing Systems, 33:857–869, 2020.
Kit Yan Chan, Ka Fai Cedric Yiu, Dowon Kim, and Ahmed Abu-Siada. Fuzzy clustering-based deep learning for
short-term load forecasting in power grid systems using time-varying and time-invariant features. Sensors, 24(5):
1391, 2024.
Doris Entner and Patrik O Hoyer. On causal discovery from time series data using fci. Probabilistic graphical models,
pages 121–128, 2010.
Judea Pearl. Causality. Cambridge University Press, 2009.
Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, and Dino Sejdinovic. Detecting and quantifying
causal associations in large nonlinear time series datasets. Science advances, 5(11):eaau4996, 2019.
Jakob Runge. Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets.
In Conference on Uncertainty in Artificial Intelligence, pages 1388–1397. PMLR, 2020.
Roxana Pamfil, Nisara Sriwattanaworachai, Shaan Desai, Philip Pilgerstorfer, Konstantinos Georgatzis, Paul Beaumont,
and Bryon Aragam. Dynotears: Structure learning from time-series data. In International Conference on Artificial
Intelligence and Statistics, pages 1595–1605. PMLR, 2020.
Meike Nauta, Doina Bucur, and Christin Seifert. Causal discovery with attention-based convolutional neural networks.
Machine Learning and Knowledge Extraction, 1(1):312–340, 2019.
Yuxiao Cheng, Runzhao Yang, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts: Neural
causal discovery from irregular time-series data. In The Eleventh International Conference on Learning Representa-
tions, 2022.
Yuxiao Cheng, Lianglong Li, Tingxiong Xiao, Zongren Li, Jinli Suo, Kunlun He, and Qionghai Dai. Cuts+: High-
dimensional causal discovery from irregular time-series. arXiv preprint arXiv:2305.05890, 2023.
K Zhang, J Peters, D Janzing, and B Schölkopf. Kernel-based conditional independence test and application in causal
discovery. In 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 804–813. AUAI Press,
2011.
Bingyuan Zhang and Joe Suzuki. Extending hilbert–schmidt independence criterion for testing conditional independence.
Entropy, 25(3):425, 2023.
Peter Spirtes and Kun Zhang. Causal discovery and inference: concepts and recent methodological advances. In Applied
informatics, volume 3, pages 1–28. SpringerOpen, 2016.
Lingfei Wang and Tom Michoel. Efficient and accurate causal inference with hidden confounders from genome-
transcriptome variation data. PLoS computational biology, 13(8):e1005703, 2017.
Gherardo Varando. Learning dags without imposing acyclicity. arXiv preprint arXiv:2006.03005, 2020.
Phillip Lippe, Taco Cohen, and Efstratios Gavves. Efficient neural causal discovery without acyclicity constraints. In
International Conference on Learning Representations, 2021.
An Zhang, Fangfu Liu, Wenchang Ma, Zhibo Cai, Xiang Wang, and Tat-seng Chua. Boosting differentiable causal
discovery via adaptive sample reweighting. arXiv preprint arXiv:2303.03187, 2023.
Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica:
journal of the Econometric Society, pages 424–438, 1969.
Clive William John Granger and Michio Hatanaka. Spectral Analysis of Economic Time Series (PSME-1), volume 2066.
Princeton University Press, 2015.
Ziqiang Cheng, Yang Yang, Wei Wang, Wenjie Hu, Yueting Zhuang, and Guojie Song. Time2graph: Revisiting time
series modeling with dynamic shapelets. In Proceedings of the AAAI conference on artificial intelligence, volume 34,
pages 3617–3624, 2020.


Ziqiang Cheng, Yang Yang, Shuo Jiang, Wenjie Hu, Zhangchi Ying, Ziwei Chai, and Chunping Wang. Time2graph+:
Bridging time series and graph representation learning via multiple attentions. IEEE Transactions on Knowledge and
Data Engineering, 2021.
Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT Press, 2000.
Jiji Zhang and Peter Spirtes. Strong faithfulness and uniform consistency in causal inference. In Proceedings of the
Nineteenth conference on Uncertainty in Artificial Intelligence, pages 632–639, 2002.
James M Robins, Richard Scheines, Peter Spirtes, and Larry Wasserman. Uniform consistency in causal inference.
Biometrika, 90(3):491–515, 2003.
Markus Kalisch and Peter Bühlman. Estimating high-dimensional directed acyclic graphs with the pc-algorithm.
Journal of Machine Learning Research, 8(3), 2007.
Charles K Assaad, Emilie Devijver, and Eric Gaussier. Survey and evaluation of causal discovery methods for time
series. Journal of Artificial Intelligence Research, 73:767–819, 2022.
Mingzhou Liu, Xinwei Sun, Lingjing Hu, and Yizhou Wang. Causal discovery from subsampled time series with proxy
variables. arXiv preprint arXiv:2305.05276, 2023.
Kun Zhang, Biwei Huang, Jiji Zhang, Clark Glymour, and Bernhard Schölkopf. Causal discovery from nonstation-
ary/heterogeneous data: Skeleton estimation and orientation determination. In IJCAI: Proceedings of the Conference,
volume 2017, page 1347. NIH Public Access, 2017.
Terence C Mills and Clive Granger. Granger: Spectral analysis, causality, forecasting, model interpretation and
non-linearity. A Very British Affair: Six Britons and the Development of Time Series Analysis During the 20th
Century, pages 288–342, 2013.
David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field
of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains.
IEEE signal processing magazine, 30(3):83–98, 2013.
Aliaksei Sandryhaila and José MF Moura. Discrete signal processing on graphs: Graph fourier transform. In 2013
IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6167–6170. IEEE, 2013.
Stefania Sardellitti, Sergio Barbarossa, and Paolo Di Lorenzo. On the graph fourier transform for directed graphs. IEEE
Journal of Selected Topics in Signal Processing, 11(6):796–811, 2017.
Norman J Zabusky. Solitons and bound states of the time-independent schrödinger equation. Physical review, 168(1):
124, 1968.
Jyotirmoy Rana and Shijun Liao. On time independent schrödinger equations in quantum mechanics by the homotopy
analysis method. Theoretical and Applied Mechanics Letters, 9(6):376–381, 2019.
Ahmed I Zayed. A convolution and product theorem for the fractional fourier transform. IEEE Signal processing letters,
5(4):101–103, 1998.
Nat Pavasant, Masayuki Numao, and Ken-ichi Fukui. Spatio-temporal change detection using granger sequence
pattern. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial
Intelligence, pages 5202–5203, 2021.
Meike Nauta. Temporal causal discovery and structure learning with attention-based convolutional neural networks.
Master’s thesis, University of Twente, 2018.
Guibo Zhu, Zhaoxiang Zhang, Xu-Yao Zhang, and Cheng-Lin Liu. Diverse neuron type selection for convolutional
neural networks. In IJCAI, pages 3560–3566, 2017.
Osman Semih Kayhan and Jan C van Gemert. On translation invariance in cnns: Convolutional layers can exploit
absolute spatial location. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 14274–14285, 2020.
Jaspreet Singh, Chandan Singh, and Ankur Rana. Orthogonal transforms for learning invariant representations in
equivariant neural networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision, pages 1523–1530, 2023.
Diego Colombo, Marloes H Maathuis, et al. Order-independent constraint-based causal structure learning. J. Mach.
Learn. Res., 15(1):3741–3782, 2014.
Stephen M Smith, Karla L Miller, Gholamreza Salimi-Khorshidi, Matthew Webster, Christian F Beckmann, Thomas E
Nichols, Joseph D Ramsey, and Mark W Woolrich. Network modelling methods for fmri. Neuroimage, 54(2):
875–891, 2011.


Alex Tank, Ian Covert, Nicholas Foti, Ali Shojaie, and Emily B Fox. Neural granger causality. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 44(8):4267–4279, 2021.
Alexander P Wu, Rohit Singh, and Bonnie Berger. Granger causal inference on dags identifies genomic loci regulating
transcription. In International Conference on Learning Representations, 2021.
Saurabh Khanna and Vincent YF Tan. Economy statistical recurrent units for inferring nonlinear granger causality. In
International Conference on Learning Representations, 2019.
Sara Magliacane, Thijs Van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain
adaptation by using causal inference to predict invariant conditional distributions. Advances in neural information
processing systems, 31, 2018.
Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer
learning. The Journal of Machine Learning Research, 19(1):1309–1342, 2018.
Luis Gustavo Moneda dos Santos. Domain generalization, invariance and the Time Robust Forest. PhD thesis,
Universidade de São Paulo, 2021.
Zhenghui Li, Zhiming Ao, and Bin Mo. Revisiting the valuable roles of global financial assets for international stock
markets: Quantile coherence and causality-in-quantiles approaches. Mathematics, 9(15):1750, 2021.
Yuejiang Liu, Riccardo Cadei, Jonas Schweizer, Sherwin Bahmani, and Alexandre Alahi. Towards robust and adaptive
motion forecasting: A causal representation perspective. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 17081–17092, 2022.
Khan Masood Ahmad, Shahid Ashraf, and Shahid Ahmed. Is the indian stock market integrated with the us and
japanese markets? an empirical analysis. South Asia Economic Journal, 6(2):193–206, 2005.
Nan-Ying Liang, Guang-Bin Huang, Paramasivan Saratchandran, and Narasimhan Sundararajan. A fast and accurate
online sequential learning algorithm for feedforward networks. IEEE Transactions on neural networks, 17(6):
1411–1423, 2006.
Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization.
In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
661–670, 2014.
Danfeng Hong, Lianru Gao, Jing Yao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. Graph convolutional
networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 59(7):
5966–5978, 2020.
