Causal Discovery From Time-Series Data With Short-Term Invariance-Based Convolutional Neural Network
A Preprint
Jingchi Jiang
The Artificial Intelligence Institute
Harbin Institute of Technology
Harbin, Heilongjiang, China
[email protected]
Abstract
Causal discovery from time-series data aims to capture both intra-slice (contemporaneous) and inter-slice (time-lagged) causality between variables within the temporal chain, which is crucial for various scientific disciplines. Compared to causal discovery from non-time-series data, causal discovery from time-series data requires more serialized samples with a larger number of observed time steps. To address this challenge, we propose STIC, a novel gradient-based causal discovery approach that focuses on Short-Term Invariance using Convolutional neural networks to uncover causal relationships from time-series data. Specifically, STIC leverages both the short-term time and mechanism invariance of causality within each window observation, which possesses the property of independence, to enhance sample efficiency. Furthermore, we construct two causal convolution kernels, corresponding to short-term time and mechanism invariance respectively, to estimate the window causal graph. To demonstrate the necessity of convolutional neural networks for causal discovery from time-series data, we theoretically derive the equivalence between convolution and the underlying generative principle of time-series data under the assumption that the additive noise model is identifiable. Experimental evaluations conducted on both synthetic and fMRI benchmark datasets demonstrate that STIC significantly outperforms baselines and achieves state-of-the-art performance, particularly when the datasets contain a limited number of observed time steps. Code is available at https://github.com/HITshenrj/STIC.
Keywords Causal discovery · Time-series data · Time invariance · Mechanism invariance · Convolutional neural
networks
1 Introduction
Causality behind time-series data plays a significant role in various aspects of everyday life and scientific inquiry.
Questions like “What factors in the past have led to the current rise in blood glucose?” or “How long will it take for my headache to be alleviated if I take that pill?” require an understanding of the relationships among observed variables, such as the relation between people’s health status and their medical interventions [Cowls and Schroeder, 2015, Pawlowski et al., 2020]. People usually expect to find cyclical and invariant principles in a changing world, which we call causal
relationships [Chan et al., 2024, Entner and Hoyer, 2010]. These relationships can be represented as a directed acyclic
graph (DAG), where nodes represent observed variables and edges represent causal relationships between variables
with time lags. This underlying graph structure forms the factual foundation for causal reasoning and is essential for
addressing such queries [Pearl, 2009].
Current causal discovery approaches utilize intra-slice and inter-slice information of time-series data, leveraging
techniques such as conditional independence, smooth score functions, and auto-regression. These methods can be
broadly classified into three categories: Constraint-based methods [Entner and Hoyer, 2010, Runge et al., 2019, Runge,
2020], Score-based methods [Pamfil et al., 2020], and Granger-based methods [Nauta et al., 2019, Cheng et al., 2022, 2023]. Constraint-based methods rely on conditional independence tests to infer causal relationships between variables.
These methods perform independence tests between pairs of variables under different conditional sets to determine
whether a causal relation exists. However, due to the difficulty of sampling, real-world data often suffers from the
limited length of observed time steps, making it challenging for statistical conditional independence tests to fully
capture causal relationships [Zhang et al., 2011, Zhang and Suzuki, 2023]. Additionally, these methods often rely on
strong yet unrealistic assumptions, such as Gaussian noise, when searching for statistical conditional independence
[Spirtes and Zhang, 2016, Wang and Michoel, 2017]. Score-based methods regard causal discovery as a constrained
optimization problem using augmented Lagrangian procedures. They assign a score function that captures properties of
the causal graph, such as acyclicity, and minimize the score function to identify potential causal graphs. While these
methods offer simplicity in optimization, they rely heavily on acyclicity regularization and often lack guarantees for
finding the correct causal graph, potentially leading to suboptimal solutions [Varando, 2020, Lippe et al., 2021, Zhang
et al., 2023]. Granger-based methods, inspired by [Granger, 1969, Granger and Hatanaka, 2015], offer an intriguing
perspective on causal discovery. These methods utilize auto-regression algorithms under the assumption of additive
noise to assess if one time series can predict another, thereby identifying causal relationships. However, they tend to
exhibit lower precision when working with limited observed time steps.
To overcome the limitations of existing approaches, namely the low sample efficiency of constraint-based methods, the suboptimal solutions induced by acyclicity regularizers in score-based methods, and the low precision of Granger-based methods under limited observed time steps, we propose a novel Short-Term Invariance-based Convolutional causal discovery
approach (STIC). STIC leverages the properties of short-term invariance to enhance the sample efficiency and accuracy
of causal discovery. More concretely, by sliding a window along the entire time-series data, STIC constructs batches of
window observations that possess invariant characteristics, improving sample utilization. Unlike existing score-based methods, our model does not rely on predefined acyclicity constraints, thereby avoiding local optima. As the window
observations move along the temporal chain, the structure of the window causal graph exhibits periodic patterns,
demonstrating short-term time invariance. Simultaneously, the conditional probabilities of causal effects between
variables remain unchanged as the window observations slide, indicating short-term mechanism invariance. The
contributions of our work can be summarized as follows:
• We propose STIC, the Short-Term Invariance-based Convolutional causal discovery approach, which leverages
the properties of short-term invariance to enhance the sample efficiency and accuracy of causal discovery.
• STIC uses the time-invariance block to capture the causal relationships among variables, while employing the
mechanism-invariance block for the transform function.
• To dynamically capture the contemporaneous and time-lagged causal structures of the observed variables, we
establish the equivalence between the convolution of the space-domain (contemporaneous) and time-domain
(time-lagged) components, and the multivariate Fourier transform (the underlying generative mechanism) of
time-series data.
• We conduct experiments to evaluate the performance of STIC on synthetic and benchmark datasets. The
experimental results show that STIC achieves the state-of-the-art results on synthetic time-series datasets,
even when dealing with relatively limited observed time steps. Experiments demonstrate that our approach
outperforms baseline methods in causal discovery from time-series data.
Figure 1: An example showing the correspondence among the given observed variables, the underlying window causal graph, and the window causal matrix. The observed dataset consists of d = 5 observed variables, and the true maximum lag τ̃ is 2. In W ∈ R^{5×5×3}, each entry W^τ_{i,j} represents the causal effect of Xi on Xj with τ time lags. For example, the blue lines in the window causal graph indicate the following three causal effects with time lag τ = 2 at any time step t: W^2_{1,3} = 1 (X1 → X3), W^2_{1,5} = 1 (X1 → X5), and W^2_{4,2} = 1 (X4 → X2). Moreover, the red lines indicate the causal relationships with time lag τ = 1: W^1_{3,2} = 1 (X3 → X2), W^1_{3,4} = 1 (X3 → X4), W^1_{3,5} = 1 (X3 → X5), and W^1_{5,4} = 1 (X5 → X4). Finally, the green lines represent contemporaneous causal relationships: W^0_{1,2} = 1 (X1 → X2), W^0_{1,4} = 1 (X1 → X4), and W^0_{5,2} = 1 (X5 → X2).
2 Background
In this section, we introduce the background of causal discovery from time-series data. Firstly, we show all symbols and
their definitions in Section 2.1. Secondly, in Section 2.2, we present the problem definition and formal representation of
window causal graph. Thirdly, in Section 2.3, we introduce the concepts of short-term time invariance and mechanism
invariance. Building upon these concepts, we derive an independence property specific to window causal graph.
Fourthly, in Section 2.4, we delve into the theoretical aspects of our approach. Specifically, we establish the equivalence
between the convolution operation and the underlying generative mechanism of the observed time-series data. This
theoretical grounding provides a solid basis for the proposed STIC approach. Finally, in Section 2.5, we introduce
Granger causality, an auto-regressive approach to causal discovery from time-series data.
Firstly, to better present the symbols used in Section 2, we summarize their definitions in Table 1.
Let the observed dataset be denoted as X = {X1, · · · , Xd} ∈ R^{d×T}, consisting of d observed continuous time-series variables. Each variable Xi is represented as a time sequence Xi = {X^1_i, · · · , X^T_i} of length T, where each X^t_i corresponds to the observed value of the i-th variable Xi at the t-th time step. Unlike graph embedding algorithms [Cheng et al., 2020, 2021], which aim to learn time-series representations, the objective of causal discovery is to uncover the underlying structure within time-series data, which represents boolean relationships between observed variables.
Furthermore, following the Consistency Throughout Time assumption [Spirtes et al., 2000, Zhang and Spirtes, 2002,
Robins et al., 2003, Kalisch and Bühlman, 2007, Entner and Hoyer, 2010, Assaad et al., 2022], the objective of causal
discovery from time-series data is to uncover the underlying window causal graph G as an invariant causal structure.
The true window causal graph for X encompasses both intra-slice causality with 0 time lags and inter-slice causality with time lags ranging from 1 to τ̃, where τ̃ denotes the maximum time lag. Mathematically, the window causal graph is defined as a finite Directed Acyclic Graph (DAG) denoted by G = (V, E). The set V = {X1, ..., Xd} represents the nodes within the graph G, wherein each node corresponds to an observed variable Xi. The set E represents the
contemporaneous and time-lagged relationships among these nodes, encompassing all 2^{(τ̃+1)×d} possible combinations.
The window causal graph is often represented by the window causal matrix, which is defined as follows.
Definition 1 (Window Causal Matrix) The window causal graph G, which captures both contemporaneous and time-lagged causality, can be effectively represented using a three-dimensional boolean matrix W ∈ R^{d×d×(τ̃+1)}. Each entry W^τ_{i,j} in the boolean matrix corresponds to the causal relationship between variables Xi and Xj with τ time lags. To be more specific, if W^{τ=0}_{i,j} = 1, it signifies the presence of an intra-slice causal relationship between Xi and Xj, meaning they influence each other at the same time step. On the other hand, if W^{τ>0}_{i,j} = 1, it indicates that Xi causally affects Xj with τ time lags.
Figure 1 provides a visual example of a window causal graph along with its corresponding matrix as defined in Definition 1. As shown in Figure 1, a time-series causal relationship of the form Xi → Xj with τ time lags can be represented as W^τ_{i,j} = 1. Conversely, W^τ_{i,j} = 1 in the boolean matrix indicates that the value X^t_i at any time step t influences the value X^{t+τ}_j.
There has been an assertion that causal relationships typically exhibit short-term time and mechanism invariance across
extensive time scales [Entner and Hoyer, 2010, Liu et al., 2023, Zhang et al., 2017]. These two aspects of invariance are
commonly regarded as fundamental assumptions of causal invariance in causal discovery from time-series data. In the
following, we will present the definitions for these two forms of invariance.
Definition 2 (Short-Term Time Invariance) Given X ∈ R^{d×T}, for any Xi, Xj, and τ ≥ 0, if Xi ∈ Pa^τ_t(Xj) at time t, then Xi ∈ Pa^τ_{t′}(Xj) at any time t′ ≠ t within a short period of time, where Pa^τ_t(·) denotes the set of parents of a variable with τ time lags at time step t.
Short-term time invariance refers to the stability of parent-child relationships over time. In other words, it implies that
the dependencies between variables remain consistent regardless of specific time points. For instance, considering
Figure 1: if X5 is a parent of X4 with time lag τ = 1 at time t, then X5 will also be a parent of X4 with time lag τ = 1 at t′ = t + 1; similarly, when τ = 0, if X5 is a parent of X2 at t, then X5 will be a parent of X2 at both t′ = t + 1 and t′ = t + 2.
Definition 3 (Short-Term Mechanism Invariance) For any Xi, the conditional probability distribution P(Xi | Pa^·_t(Xi)) remains constant across the short-term temporal chain. In other words, for any time steps t and t′, it holds that P(Xi | Pa^·_t(Xi)) = P(Xi | Pa^·_{t′}(Xi)), where Pa^·_t(Xi) denotes the set of parents of Xi with all time lags ranging from 0 to τ̃ at time step t.
In particular, based on Definition 3, short-term mechanism invariance implies that conditional probability distributions
remain constant over time. For instance, in Figure 1, we have Pa^·_t(X2) = {X3, X1, X5} = Pa^·_{t+1}(X2). Then, we have P(X2 | Pa^·_t(X2)) = P(X2 | Pa^·_{t+1}(X2)).
Building upon the definitions of short-term time invariance and mechanism invariance, we can derive the following
lemma, which characterizes the invariant nature of independence among variables. Inspired by causal invariance [Entner
and Hoyer, 2010], we further provide a detailed proof procedure as outlined below.
Lemma 1 (Independence Property) Let X ∈ R^{d×T} be the observed dataset. If Xi ⊥⊥^τ_t Xj | Xk, ..., Xl, then Xi ⊥⊥^τ_{t′} Xj | Xk, ..., Xl, where ⊥⊥^τ_t denotes conditional independence with τ time lags at time step t.
Proof 1 Due to the short-term time invariance of the relationships among variables and the short-term mechanism invariance of conditional probabilities, different values X^t_i and X^{t′}_i of Xi are mapped to the same variable Xi in the window causal graph G. Consequently, Pa^τ_t(Xi) and Pa^τ_{t′}(Xi) correspond to the same variable set. Thus, if the condition Xi ⊥⊥^τ_t Xj | Xk, ..., Xl holds, then Xi ⊥⊥^τ_G Xj | Xk, ..., Xl holds in the window causal graph G, which further implies Xi ⊥⊥^τ_{t′} Xj | Xk, ..., Xl.
This lemma establishes that, in an identifiable window causal graph, the independence property remains invariant
with time translation. Leveraging this insight, we can transform the observed time series into window observations to
perform causal discovery while maintaining the invariance conditions, as outlined in Section 3.1.
Granger demonstrated, through the Cramer representation and the spectral representation of the covariance sequence
[Granger, 1969, Mills and Granger, 2013, Granger and Hatanaka, 2015], that time-series data can be decomposed
into a sum of uncorrelated components. Inspired by these representations and the concept of graph Fourier transform
[Shuman et al., 2013, Sandryhaila and Moura, 2013, Sardellitti et al., 2017], we propose considering an underlying function X = f(Pa_G(X), W) + E, where Pa_G(X) denotes the relationships among X in the window causal graph G and E is the noise term, to describe the generative process of the observed dataset X = {X1, · · · , Xd} ∈ R^{d×T}, with an underlying window causal matrix W ∈ R^{d×d×(τ̃+1)}. We can then decompose f(Pa_G(X), W) into Fourier integral forms:
X = f(Pa_G(X), W) + E
  = f̂(s, t) + E        (1)
Here, s and t denote the spatial and temporal projections, respectively, of f(Pa_G(X), W). Equation 1 is derived
from the observation that the contemporaneous part in time-series data corresponds to the spatial domain, while the
time-lagged part corresponds to the temporal domain. Therefore, we employ the multivariate Fourier transform,
F(X) = ∫∫_{−∞}^{∞} f̂(x, y; s, t) e^{−iω(sx+ty)} dx dy
     ∝ ∫∫_{−∞}^{∞} h(ŝ) g(t̂) e^{−iω(ŝ+t̂)} dŝ dt̂        (2)
where ŝ represents the spatial domain component, t̂ represents the temporal domain component, and ω represents the angular frequency, with transform functions f̂, h, and g. The first line corresponds to applying the Fourier transform
Figure 2: An illustration of the STIC framework. Let X = {X1, · · · , Xd} ∈ R^{d×T} be the observed dataset, representing d observed continuous time series of the same length T. First, we convert the observations of the first T − 1 time steps, X^{1:T−1} = {X^{1:T−1}_1, · · · , X^{1:T−1}_d} ∈ R^{d×(T−1)}, into a window representation W ∈ R^{d×τ̂×c} using a sliding window with a predefined window length τ̂ and step length 1, where c = T − τ̂. Time-Invariance Block (Bt): In order to better discover the causal structure from X, we use a convolution kernel Kt ∈ R^{d×τ̂} to act on W, and obtain the common representation Kt ⊙ Wψ of X for each window observation Wψ. Afterwards, we pass the commonality through an FNN to obtain a predicted window causal matrix Ŵ ∈ R^{d×d×τ̂}. Mechanism-Invariance Block (Bm): To identify the numerical transform in the window causal graph, we use another convolution kernel Km ∈ R^{d×τ̂} in each Bm to transform W. Then we output W̄ ∈ R^{d×τ̂×c} as the prediction of f(W). Next, we take the Hadamard product of each W̄^τ_ψ ∈ R^d in W̄ and each Ŵ^τ ∈ R^{d×d} in Ŵ to get the predicted X̂^{τ̂+ψ}, until we obtain all of X̂ ∈ R^{d×c}. Finally, we calculate the Mean Squared Error (MSE) loss between X̂ and X, and adopt gradient descent to optimize the parameters within the time-invariance and mechanism-invariance blocks.
to both sides of Equation 1. In the second line, inspired by the Time-Independent Schrödinger Equation [Zabusky,
1968, Rana and Liao, 2019], we assume that f (x, y; s, t) can be decomposed into the spatial and temporal domains, i.e.,
f̂(x, y; s, t) = h(ŝ)g(t̂). Next, by utilizing the convolution theorem [Zayed, 1998] for tempered distributions, which states that under suitable conditions the Fourier transform of a convolution of two functions (or signals) is the pointwise product of their Fourier transforms, i.e., F(h ∗ g) = F(h) · F(g), where F(·) represents the Fourier transform, we convert the convolution formula into the following expression:

F(h(ŝ) ∗ g(t̂)) = F(h(ŝ)) · F(g(t̂))
              = ∫_{−∞}^{∞} h(ŝ) e^{−iωŝ} dŝ · ∫_{−∞}^{∞} g(t̂) e^{−iωt̂} dt̂        (3)
              ∝ F(X)
The first line of Equation 3 is obtained through the convolution theorem, while the second line expands F(h(ŝ)) and F(g(t̂)) using the Fourier transform. The third line is derived from Equation 2. This indicates that the observed dataset X can be obtained by convolving a kernel carrying the temporal information with the spatial details, which we handle through the two kinds of invariance, respectively. We posit that the convolution operation precisely aligns with the functional causal data generation mechanism, i.e., X ∝ h(ŝ) ∗ g(t̂). Conversely, the convolution operation can be used to analytically model the generation mechanism of functional time-series data. Therefore, we employ the convolution operation to extract the functional causal relationships within the window causal graph. In conclusion, the equivalence between the generation mechanism of time-series causal data and convolution operations motivates us to incorporate convolution operations into our STIC framework.
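The convolution theorem that drives Equation 3 is easy to check numerically. The following sketch (numpy assumed; an illustration of the identity F(h ∗ g) = F(h) · F(g), not part of STIC itself) compares the Fourier transform of a discrete convolution against the pointwise product of the individual transforms:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=64)   # stand-in for the spatial component h(s_hat)
g = rng.normal(size=64)   # stand-in for the temporal component g(t_hat)

conv = np.convolve(h, g)  # linear convolution, length 64 + 64 - 1 = 127

# On a common FFT length, F(h * g) equals F(h) . F(g) pointwise
n = len(conv)
lhs = np.fft.fft(conv, n)                 # transform of the convolution
rhs = np.fft.fft(h, n) * np.fft.fft(g, n) # product of the transforms
print(np.allclose(lhs, rhs))              # True, up to floating-point error
```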
Granger causality [Granger, 1969, Pavasant et al., 2021, Assaad et al., 2022] is a method that utilizes numerical calculations to assess causality by measuring fitting loss and variance. Formally, we say that a variable Xi Granger-causes another variable Xj when the past values of Xi before time t (i.e., X^1_i, · · · , X^{t−1}_i) enhance the prediction of Xj at time t (i.e., X^t_j) compared to considering only the past values of Xj. The definition of Granger causality is as follows:
Definition 4 (Granger Causality) Let X = {X1, · · · , Xd} ∈ R^{d×T} be an observed dataset containing d variables. If σ²_τ(Xj | X) < σ²_τ(Xj | X − Xi), where σ²_τ(Xj | X) denotes the variance of predicting Xj using X with τ time lags, we say that Xi causes Xj, which is represented by W^τ_{i,j} = 1.
In simpler terms, Granger causality states that Xi Granger-causes Xj if past values of Xi (i.e., X^{t′}_i with t′ < t) provide unique and statistically significant information for predicting future values of Xj (i.e., X^t_j). Therefore, following the definition of Granger causality, we can approach causal discovery as an autoregressive problem.
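As a minimal illustration of Definition 4, the sketch below (assuming a linear autoregressive model fitted by least squares; all names and data are illustrative) compares the residual variance of predicting Xj from the full set X against the variance obtained after removing Xi:

```python
import numpy as np

def residual_variance(target, predictors, lag):
    """Fit target[t] from predictors[:, t-lag:t] by least squares; return residual variance."""
    T = target.shape[0]
    rows = [predictors[:, t - lag:t].ravel() for t in range(lag, T)]
    A = np.column_stack([np.ones(T - lag), np.array(rows)])
    y = target[lag:]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.var(y - A @ coef)

rng = np.random.default_rng(1)
T, lag = 500, 2
x_i = rng.normal(size=T)
x_j = np.zeros(T)
for t in range(lag, T):                 # X_i drives X_j with a time lag of 2
    x_j[t] = 0.8 * x_i[t - 2] + 0.1 * rng.normal()

X_full = np.vstack([x_i, x_j])          # full conditioning set X
X_rest = x_j[None, :]                   # X minus X_i

var_full = residual_variance(x_j, X_full, lag)
var_rest = residual_variance(x_j, X_rest, lag)
print(var_full < var_rest)              # True: X_i Granger-causes X_j
```

Because Xj is driven by the lag-2 values of Xi in this toy example, including Xi in the conditioning set strictly reduces the prediction variance, so Definition 4 would report W^2_{i,j} = 1.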
3 Method
In this section, we introduce STIC, which involves four components: Window Representation, Time-Invariance Block,
Mechanism-Invariance Block, and Parallel Blocks for Joint Training. The process is depicted in Figure 2. Firstly, we
transform the observed time series into a window representation format, leveraging Lemma 1. Next, we input the
window representation into both the time-invariance block and the mechanism-invariance block (Bt and Bm in Figure
2). Finally, we conduct joint training using the extracted features from two kinds of parallel blocks. In particular,
the time-invariance block Bt generates the estimated window causal matrix Ŵ. To better present the symbols used in Section 3, we summarize their definitions in Table 2. The subsequent subsections provide a detailed explanation of the key components of STIC.
The observed dataset X ∈ R^{d×T} contains d observed continuous time series (variables) with T time steps. We also specify a predefined maximum time lag τ. To ensure that the entire contemporaneous and time-lagged causal influence is observed, we calculate the minimum length of the window that can capture this influence as τ̂ = τ + 1. To construct the window observations, we select the observed values from the first T − 1 time steps, i.e., X^{1:T−1} = {X^{1:T−1}_1, · · · , X^{1:T−1}_d} ∈ R^{d×(T−1)}. Using a sliding window approach along the temporal chain of observations, we create window observations of length τ̂ and width d, with a step size of 1. This process results in c = T − τ̂ window
Figure 3: Window representation. First, we obtain c matrices Wψ by sliding a window with predefined window length τ̂ and step size 1, where each Wψ ∈ R^{d×τ̂}, ψ = 1, ..., c, represents the data observed in the window. Then, we concatenate the obtained Wψ together to get the final window representation W ∈ R^{d×τ̂×c}.
observations Wψ where ψ = 1, ..., c. These window observations are referred to as the window representation W , as
illustrated in Figure 3.
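A minimal sketch of this windowing step (numpy assumed; purely illustrative) is given below:

```python
import numpy as np

def window_representation(X, tau_hat):
    """Slide a length-tau_hat window (step 1) over the first T-1 time steps of X in R^{d x T}."""
    d, T = X.shape
    X_head = X[:, : T - 1]        # use the first T-1 time steps
    c = T - tau_hat               # number of window observations
    # W[:, :, psi] is the window observation W_psi in R^{d x tau_hat}
    W = np.stack([X_head[:, psi : psi + tau_hat] for psi in range(c)], axis=-1)
    return W                      # shape (d, tau_hat, c)

X = np.random.randn(5, 100)       # d = 5 variables, T = 100 time steps
W = window_representation(X, tau_hat=3)
print(W.shape)                    # (5, 3, 97), i.e. c = T - tau_hat
```

Each slice W[:, :, ψ] is the window observation Wψ ∈ R^{d×τ̂}, and stacking along the last axis yields the window representation W ∈ R^{d×τ̂×c} of Figure 3.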
According to Definition 2, the causal relationships among variables remain unchanged as time progresses. Exploiting this property, we can extract shared information from the window representation W and utilize it to obtain the estimated window causal matrix Ŵ. Inspired by convolutional neural networks used in causal discovery [Nauta et al., 2019], we introduce an invariance-based convolutional network structure denoted as Bt to incorporate temporal information within the window representation W. For each window observation Wψ ∈ R^{d×τ̂}, we employ the following formula to aggregate similar information among the time series within the window observations:

Ŵ = f1(Kt ⊙ W1, · · · , Kt ⊙ Wc)        (4)
Here, the shared Kt ∈ R^{d×τ̂} represents a learnable extraction kernel utilized to extract information from each window observation, the symbol ⊙ denotes the Hadamard product between matrices, and f1 refers to a neural network structure. By applying the Hadamard product with the shared kernel Kt, the resulting output exhibits similar characteristics across the time series. Moreover, Kt serves as a time-invariant feature extractor, capturing recurring patterns that appear in the input series and aiding in forecasting short-term future values of the target variable. In Granger causality, these learned patterns reflect causal relationships between time series, which are essential for causal discovery [Nauta, 2018]. To ensure the generality of STIC, we employ a simple feed-forward neural network (FNN) f1 : R^{c×d×τ̂} → R^{d×d×τ̂} to extract shared information from each Kt ⊙ Wψ, ψ = 1, ..., c. Furthermore, we impose a constraint to prohibit self-loops
in the estimated window causal matrix Ŵ when the time lag is zero. That is:

Ŵ^τ_{i,j} = { 0, if i = j and τ = 0;
              0, if Ŵ^τ_{i,j} < p;        (5)
              1, otherwise }

where p is a predefined threshold.
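Putting these pieces together, a minimal PyTorch sketch of the time-invariance block is shown below; the hidden width, the threshold p, and the exact aggregation layout are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class TimeInvarianceBlock(nn.Module):
    """Shared kernel K_t (Hadamard product), then an FNN f1: R^{c x d x tau_hat} -> R^{d x d x tau_hat}."""
    def __init__(self, d, tau_hat, c, hidden=64):
        super().__init__()
        self.Kt = nn.Parameter(torch.randn(d, tau_hat))   # learnable extraction kernel
        self.f1 = nn.Sequential(                          # simple feed-forward network
            nn.Flatten(),
            nn.Linear(c * d * tau_hat, hidden),
            nn.ReLU(),
            nn.Linear(hidden, d * d * tau_hat),
        )
        self.d, self.tau_hat = d, tau_hat

    def forward(self, W):                                      # W: (d, tau_hat, c)
        common = self.Kt.unsqueeze(-1) * W                     # K_t ⊙ W_psi for every window
        scores = self.f1(common.permute(2, 0, 1).unsqueeze(0)) # aggregate across all c windows
        return scores.view(self.d, self.d, self.tau_hat)       # estimated matrix, pre-threshold

def binarize(W_hat, p=0.5):
    """Apply Equation 5: threshold at p and forbid contemporaneous self-loops."""
    W_bin = (W_hat >= p).float()
    idx = torch.arange(W_bin.shape[0])
    W_bin[idx, idx, 0] = 0.0               # no self-loops when the time lag is zero
    return W_bin

# Usage (illustrative): block = TimeInvarianceBlock(d=5, tau_hat=3, c=97)
# W_hat = binarize(block(torch.randn(5, 3, 97)))
```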
[Results figure: Precision and F1 scores versus the number of observed variables d; compared methods include PC.]