Article
Causal Discovery with Attention-Based
Convolutional Neural Networks
Meike Nauta *, Doina Bucur and Christin Seifert
Faculty of EEMCS, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands;
[email protected] (D.B.); [email protected] (C.S.)
* Correspondence: [email protected]
Received: 5 November 2018; Accepted: 27 December 2018; Published: 7 January 2019
Abstract: Having insight into the causal associations in a complex system facilitates decision making,
e.g., for medical treatments, urban infrastructure improvements or financial investments. The amount of observational data is growing, which enables the discovery of causal relationships between variables from observations of their behaviour over time. Existing methods for causal discovery from time series
data do not yet exploit the representational power of deep learning. We therefore present the Temporal
Causal Discovery Framework (TCDF), a deep learning framework that learns a causal graph structure
by discovering causal relationships in observational time series data. TCDF uses attention-based
convolutional neural networks combined with a causal validation step. By interpreting the internal
parameters of the convolutional networks, TCDF can also discover the time delay between a cause
and the occurrence of its effect. Our framework learns temporal causal graphs, which can include
confounders and instantaneous effects. Experiments on financial and neuroscientific benchmarks
show state-of-the-art performance of TCDF in discovering causal relationships in continuous time
series data. Furthermore, we show that TCDF can circumstantially discover the presence of hidden
confounders. Our broadly applicable framework can be used to gain novel insights into the causal
dependencies in a complex system, which is important for reliable predictions, knowledge discovery
and data-driven decision making.
Keywords: convolutional neural network; time series; causal discovery; attention; machine learning
1. Introduction
What makes a stock’s price increase? What influences the water level of a river? Although machine
learning has been successfully applied to predict these variables, most predictive models (such as
decision trees and neural networks) cannot answer those causal questions: they make predictions on
the basis of correlations alone, but correlation does not imply causation [1]. Measures of correlation are
symmetrical, since correlation only tells us that there exists a relation between variables. In contrast,
causation is usually asymmetrical and therefore gives the directionality of a relation. Correlation which
is not causation often arises if two variables have a common cause, or if there is a spurious correlation
such that the values of two unrelated variables are coincidentally statistically correlated.
Most machine learning methods, including Neural Networks, aim for a high prediction accuracy
encoding only correlations. A predictive model based on correlations alone cannot guarantee robust
relationships, making it impossible to foresee when a predictive model will stop working [2], unless the
correlation function is carefully modelled to ensure stability (e.g., [3]). If a model instead learns causal relationships, it can make more robust predictions. In addition to making forecasts, the goal in
many sciences is often to understand the mechanisms by which variables come to take on their values,
and to predict what the values would be if the naturally occurring mechanisms were subject to outside
manipulations [4]. Those mechanisms can be understood by discovering causal associations between
events. Knowledge of the underlying causes allows us to develop effective policies to prevent or
produce a particular outcome [2].
The traditional way to discover causal relations is to manipulate the value of a variable by using
interventions or real-life experiments. All other influencing factors of the target variable can be held
fixed, to test whether a manipulation of a potential cause changes the target variable. However,
such experiments and interventions are often costly, time-consuming, unethical or even impossible
to carry out. With the current advances in digital sensing, the amount of observational data grows,
allowing us to do causal discovery [5], i.e., reveal (hypothetical) causal information by analysing this
data. Causal discovery helps to interpret data, formulate and test hypotheses, prioritize experiments,
and build or improve theories or models. Since humans use causal beliefs and reasoning to generate
explanations [6], causal discovery is also an important topic in the rapidly evolving field of Explainable
Artificial Intelligence (XAI) that aims to construct interpretable and transparent algorithms that can
explain how they arrive at their decisions [7].
The notion of time aids the discovery of the directionality of a causal relationship, since a cause
generally happens before the effect. Most algorithms that have been developed to discover causal relationships from multivariate temporal observational data are based on statistical measures, which rely on
idealized assumptions that rarely hold in practice, e.g., that the time series data is linear, stationary or noise-free [8,9], or that the underlying causal structure contains no (hidden) common causes and no instantaneous effects [10,11]. Furthermore, existing methods are usually only designed to discover
causal associations, and they cannot be used for prediction.
We exploit the representational power of deep learning by using Attention-based Deep Neural
Networks (DNNs) for both time series prediction and temporal causal discovery. DNNs are able
to discover complex underlying phenomena by learning and generalizing from examples without
knowledge of generalization rules, and have a high degree of error resilience, which makes them less sensitive to noise in the data [12].
Our framework, called Temporal Causal Discovery Framework (TCDF), consists of multiple
convolutional neural networks (CNNs), where each network receives all observed time series as input.
One network is trained to predict one time series, based on the past values of all time series in a dataset.
While a CNN performs supervised prediction, it trains its internal parameters using backpropagation.
We suggest using these internal parameters for unsupervised causal discovery and delay discovery.
More specifically, TCDF applies attention mechanisms that allow us to learn to which time series a CNN attends when predicting a time series. After training the attention-based CNNs, TCDF validates
whether a potential cause (found by the attention mechanism) is an actual cause of the predicted time
series by applying a causal validation step. In this validation step, we intervene on a time series to test if
it is causally related with a predicted time series. All validated causal relationships are included in a
temporal causal graph. TCDF also includes a novel method to learn the time delay between cause and
effect from a CNN, by interpreting the network’s internal parameters. In summary:
• We present a new temporal causal discovery method (TCDF) that uses attention-based CNNs to
discover causal relationships in time series data, to discover the time delay between each cause
and effect, and to construct a temporal causal graph of causal relationships with delays.
• We evaluate TCDF and several other temporal causal discovery methods on two benchmarks:
financial data describing stock returns, and FMRI data measuring brain blood flow.
The remainder of the paper is organized as follows. Section 2 presents a formal problem statement.
Section 3 surveys the existing temporal causal discovery methods, the recent advances in non-temporal
causal discovery with deep learning, time series prediction methods based on CNNs, and describes
various causal validation methods. Section 4 presents our Temporal Causal Discovery Framework.
The evaluation is detailed in Section 5. Section 6 discusses hyperparameter tuning and experiment
limitations. The conclusions, including future work, are in Section 7.
2. Problem Statement
Temporal causal discovery from observational data can be defined as follows. Given a dataset X containing N observed continuous time series of the same length T (i.e., X = {X_1, X_2, ..., X_N} ∈ ℝ^{N×T}), the goal is to discover the causal relationships between all N time series in X and the time delay between cause and effect, and to model both in a temporal causal graph. In the directed causal graph G = (V, E), vertex v_i ∈ V represents an observed time series X_i and each directed edge e_{i,j} ∈ E from vertex v_i to v_j denotes a causal relationship where time series X_i causes an effect in X_j. Furthermore, we denote by p = ⟨v_i, ..., v_j⟩ a path in G from v_i to v_j. In a temporal causal graph, every edge e_{i,j} is annotated with a weight d(e_{i,j}) that denotes the time delay between the occurrence of cause X_i and the occurrence of effect X_j. An example is shown in Figure 1.
Figure 1. A temporal causal graph learnt from multivariate observational time series data. A graph
node models one time series. A directed edge denotes a causal relationship and is annotated with the
time delay between cause and effect.
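For illustration, such a temporal causal graph can be represented as a mapping from directed edges to delays. The sketch below is ours; the node names and delay values are hypothetical, chosen only to mirror the style of Figure 1.

```python
# Hypothetical sketch: a temporal causal graph as a mapping from directed
# edges (cause, effect) to the annotated delay d(e_ij) in time steps.
TemporalCausalGraph = dict[tuple[str, str], int]

graph: TemporalCausalGraph = {
    ("X1", "X2"): 6,   # X1 causes X2 with a delay of 6 time steps
    ("X1", "X3"): 3,   # X1 causes X3 with a delay of 3 time steps
    ("X2", "X2"): 1,   # self-causation of X2 with delay 1
    ("X3", "X4"): 4,   # X3 causes X4 with a delay of 4 time steps
}
```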
Causal discovery methods have major challenges if the underlying causal model is complex:
• The method should distinguish direct from indirect causes. Vertex v_i is seen as an indirect cause of v_j if e_{i,j} ∉ G and if there is a two-edge path p = ⟨v_i, v_k, v_j⟩ ∈ G (Figure 2a). Pairwise methods,
i.e., methods that only find causal relationships between two variables, are often unable to make
this distinction [10]. In contrast, multivariate methods take all variables into account to distinguish
between direct and indirect causality [11].
• The method should learn instantaneous causal effects, where the delay between cause and effect
is 0 time steps. Neglecting instantaneous influences can lead to misleading interpretations [13].
In practice, instantaneous effects mostly occur when cause and effect refer to the same time step
that cannot be causally ordered a priori, because of a too coarse time scale.
• The presence of a confounder, a common cause of at least two variables, is a well-known challenge
for causal discovery methods (Figure 2b). Although confounders are quite common in real-world
situations, they complicate causal discovery since the confounder’s effects (X2 and X3 in Figure 2b)
are correlated, but are not causally related. Especially when the delays between the confounder
and its effects are not equal, one should be careful to not incorrectly include a causal relationship
between the confounder’s effects (the grey edge in Figure 2b).
• A particular challenge occurs when a confounder is not observed (a hidden (or latent)
confounder). Although it might not even be known how many hidden confounders exist, it is
important that a causal discovery method can hypothesise the existence of a hidden confounder
to prevent learning an incorrect causal relation between its effects.
Figure 2. Temporal causal graphs showing causal relationships and delays between cause and effect.
3. Related Work
Section 3.1 discusses existing approaches for temporal causal discovery and classifies a selection
of recent temporal causal discovery algorithms along various dimensions. From this overview, we can
conclude that there are no other temporal causal discovery methods based on deep learning. Therefore,
Section 3.2 describes deep learning approaches for non-temporal causal discovery. Since TCDF
discovers causal relationships by predicting time series using CNNs, Section 3.3 discusses related
network architectures for time series prediction. Section 3.4 briefly discusses the attention mechanism.
Table 1. Causal discovery methods for time series data, classified along various dimensions: multivariate, non-linear, continuous, non-stationary, noise, confounders, hidden confounders, instantaneous effects, and delay.
Granger Causality (GC) [24] is one of the earliest methods developed to quantify the causal effects between two time series. Time series X_i Granger causes time series X_j if the future value of X_j (at time t + 1) can be better predicted by using both the values of X_i and X_j up to time t than by using only the past values of X_j itself. Since pairwise methods cannot correctly handle indirect causal relationships,
conditional Granger causality takes a third time series into account [25]. However, in practice not
all relevant variables may be observed and GC cannot correctly deal with unmeasured time series,
including hidden confounders [4]. In the system identification domain, this limitation is overcome with
sparse plus low-rank (S + L) networks that include an extra layer in a causal graph to explicitly model
hidden variables (called factors) [26]. Furthermore, GC only captures the linear interdependencies
between time series. Various extensions have been made to nonlinear and higher-order causality,
e.g., [27,28]. A recent extension that outperforms other GC methods is based on conditional copula, which dissociates the marginal distributions from their joint density distribution to focus only on the statistical dependence between variables [10].
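As an aside, the core idea of (bivariate, linear) Granger causality can be sketched in a few lines. This toy version is ours: it compares squared residuals of two least-squares fits instead of performing a proper statistical test.

```python
import numpy as np

def granger_causes(x_i: np.ndarray, x_j: np.ndarray, lag: int = 1) -> bool:
    """Toy bivariate Granger check: does adding the past of x_i reduce the
    squared error of predicting x_j, compared to x_j's own past alone?
    (A real test, e.g. an F-test, would assess statistical significance.)"""
    target = x_j[lag:]
    own_past = x_j[:-lag].reshape(-1, 1)
    joint_past = np.column_stack([x_j[:-lag], x_i[:-lag]])

    def sse(design: np.ndarray) -> float:
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return float(np.sum((target - design @ coef) ** 2))

    return sse(joint_past) < sse(own_past)
```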
Constraint-based Time Series approaches are often adapted versions of non-temporal causal
graph discovery algorithms. The temporal precedence constraint reduces the search space of the causal
structure [29]. The well-known algorithms PC and FCI both have a time series version: PCMCI [8] and
tsFCI [21]. PC [30] makes use of a series of tests to efficiently explore the whole space of Directed Acyclic
Graphs (DAGs). FCI [30] can, contrary to PC, deal with hidden confounders by using independence
tests. Both temporal algorithms require stationary data. Additive Non-linear Time Series Model
(ANLTSM) [20] does causal discovery in both linear and non-linear time series data, and can also deal
with hidden confounders. It uses statistical tests based on additive model regression.
Structural Equation Model approaches assume that a causal system can be represented by a
Structural Equation Model (SEM) that describes a variable X_j as a function of the other variables X_{−j} and an error term ε_X to account for additive noise, such that X_j := f(X_{−j}, ε_X) [29]. It assumes that the set X_{−j} is jointly independent. TiMINo [22] discovers a causal relationship from X_i to X_j (i ≠ j) if the coefficient of X_i^t is nonzero for any t in the model for X_j. Self-causation is not discovered. TiMINo remains undecided if the direct
causes of Xi are not independent, instead of drawing possibly wrong conclusions. TiMINo is not
suitable for large datasets, since small differences between the data and the fitted model may lead to
failed independence tests. VAR-LiNGAM [13] is a restricted SEM. It makes additional assumptions on
the data distribution and combines a non-Gaussian instantaneous model with autoregressive models.
Information-theoretic approaches for temporal causal discovery exist, such as (mutual) shifted
directed information [23] and transfer entropy [11]. Their main advantage is that they are model free
and are able to detect both linear and non-linear dependencies [19]. The universal idea is that X_i is likely a cause of X_j, i ≠ j, if X_j can be better sequentially compressed given the past of both X_i and X_j than given the past of X_j alone. Transfer entropy cannot, contrary to directed information [31],
deal with non-stationary time series. Partial Symbolic Transfer Entropy (PSTE) [11] overcomes this
limitation, but is not effective when only linear causal relationships are present.
Causal Significance is a causal discovery framework that calculates a causal significance measure
α(c, e) for a specific cause-effect pair by isolating the impact of cause c on effect e [9]. It also discovers the time delay and impact of a causal relationship. The method assumes that causal relationships are linear and additive, and that all causes are observed. However, the authors experimentally demonstrate that low false discovery and false negative rates are achieved even if some assumptions are violated.
Our Deep Learning approach uses neural networks to learn a function for time series prediction.
Although learning such a function is comparable to SEM, the interpretation of coefficients is different
(Section 4.2). Furthermore, we apply a validation step that is to some extent comparable to conditional
Granger causality. Instead of removing a variable, we randomly permute its values (Section 4.3). The emerging field of explainable machine learning enables DNN interpretation [7]. Feature importance scores produced by an interpretable LSTM have already been shown to be highly in line with results from the Granger causality test [32]. Multiple deep learning models exist for non-temporal causal discovery: Variational
Autoencoders [33] to estimate causal effects, Causal Generative Neural Networks to learn functional
causal models [34] and the Structural Agnostic Model (SAM) [35] for causal graph reconstruction.
Although called ‘causal filters’ by the authors, SAM uses an attention mechanism by multiplying each
observed input variable by a trainable score, comparable to the TCDF approach. Contrary to TCDF,
SAM does not perform a causal validation step. Non-temporal methods however cannot be applied to
time series data, since they do not check the temporal precedence assumption (cause precedes effect).
4. Temporal Causal Discovery Framework
Figure 3. Overview of Temporal Causal Discovery Framework (TCDF). With time series data as input,
TCDF performs four steps (gray boxes) using the technique described in the white box and outputs a
temporal causal graph.
More specifically, TCDF consists of N independent attention-based CNNs, all with the same
architecture but a different target time series. An overview of TCDF containing multiple networks is
shown in Figure 4. This shows that the goal of the j-th network N_j is to predict its target time series X_j by minimizing the loss L between the actual values of X_j and the predicted X̂_j. The input to network N_j consists of an N × T dataset X containing N equal-sized time series of length T. Row X_j from the dataset corresponds to the target time series, while all other rows in the dataset, X_{−j}, are the so-called exogenous time series.
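A minimal training sketch for one such network is given below. The choice of Adam and mean squared error is an assumption made for illustration, as is the shifting of the target channel so that only past values of X_j are visible (cf. the channel definition in Section 4.1).

```python
import torch
import torch.nn.functional as F

def train_network(model, X: torch.Tensor, j: int, epochs: int = 1000,
                  lr: float = 0.01) -> float:
    """Sketch: train the j-th network N_j to predict target series X_j from
    all N input series. Optimiser, loss and hyperparameters are illustrative."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    inputs = X.clone().unsqueeze(0)             # (1, N, T)
    inputs[0, j] = torch.roll(inputs[0, j], 1)  # target channel sees only the past:
    inputs[0, j, 0] = 0.0                       # X_j channel = [0, X_j^1, ..., X_j^{T-1}]
    target = X[j].reshape(1, 1, -1)             # (1, 1, T)
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = F.mse_loss(model(inputs), target)
        loss.backward()
        optimiser.step()
    return loss.item()
```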
Figure 4. TCDF with N independent CNNs N_1, ..., N_n, all having time series X_1, ..., X_n of length T as input (N is equal to the number of time series in the input data set). N_j predicts X_j and also outputs, besides X̂_j, the kernel weights W_j and attention scores a_j. After attention interpretation, causal validation and delay discovery, TCDF constructs a temporal causal graph.
When network N_j is trained to predict X_j, the attention scores a_j of the attention mechanism explain to which time series network N_j attends when predicting X_j. Since the network uses the attended time series for prediction, this time series must contain information that is useful for prediction, implying that this time series is potentially causally associated with the target time series X_j. By including
the target time series in the input as well, the attention mechanism can also learn self-causation.
We designed a specific architecture for these attention-based CNNs that allows TCDF to discover
these potential causes. We call our networks Attention-based Dilated Depthwise Separable Temporal
Convolutional Networks (AD-DSTCNs).
The rest of this section is structured as follows: Section 4.1 describes the architecture of
AD-DSTCNs. Section 4.2 presents our algorithm to detect potential causes of a predicted time series.
Section 4.3 describes our Permutation Importance Validation Method (PIVM) to validate potential
causes. For delay discovery, TCDF uses the kernel weights W j of each AD-DSTCN N j , which will
be discussed in more detail in Section 4.4. TCDF merges the results of all networks to construct a
Temporal Causal Graph that shows the discovered causal relationships and their delays.
4.1.1. Dilations
In a TCN with only one layer (i.e., no hidden layers), the receptive field (the number of time steps
seen by the sliding kernel) is equal to the user-specified kernel size K. To successfully discover a
causal relationship, the receptive field should be at least as large as the delay between cause and effect.
To increase the receptive field, one can increase the kernel size or add hidden layers to the network.
A convolutional network with a 1D kernel has a receptive field that grows linearly in the number of
layers, which is computationally expensive when a large receptive field is needed. More formally,
the receptive field R of a CNN is

R_CNN = 1 + (L + 1)(K − 1) = 1 + ∑_{l=0}^{L} (K − 1),    (1)
with K the user-specified kernel size and L the number of hidden layers. L = 0 gives a network without
hidden layers, where one convolution in a channel maps an input time series to the output.
TCN, inspired by the well-known WaveNet architecture [45], employs dilated convolutions instead.
A dilated convolution applies a kernel over an area larger than its size by skipping input values with
Mach. Learn. Knowl. Extr. 2019, 1, 19 9 of 28
a certain step size f . This step size f , called dilation factor, increases exponentially depending on the
chosen dilation coefficient c, such that f = cl for layer l. An example of dilated convolutions is shown
in Figure 5.
Figure 5. Dilated TCN to predict X2 , with L = 3 hidden layers, kernel size K = 2 (shown as arrows)
and dilation coefficient c = 2, leading to a receptive field R = 16. A PReLU activation function is
applied after each convolution. To predict the first values (shown as dashed arrows), zero padding is
added to the left of the sequence. Weights are shared across layers, indicated by the identical colors.
With an exponentially increasing dilation factor f , a network with stacked dilated convolutions
can operate on a coarser scale without loss of resolution or coverage. The receptive field R of a kernel
in a 1D Dilated TCN (D-TCN) is

R_D-TCN = 1 + ∑_{l=0}^{L} (K − 1) · c^l.    (2)
This shows that dilated convolutions support an exponential increase of the receptive field while
the number of parameters grows only linearly, which is especially useful when there is a large delay
between cause and effect.
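The two formulas can be checked with a few lines of code (a small sketch of our own; parameter names follow Equations (1) and (2)):

```python
def receptive_field(K: int, L: int, c: int = 2, dilated: bool = True) -> int:
    """Receptive field of a TCN following Eqs. (1) and (2).
    K: kernel size, L: number of hidden layers, c: dilation coefficient."""
    if not dilated:
        return 1 + (L + 1) * (K - 1)                        # Eq. (1): linear in L
    return 1 + sum((K - 1) * c ** l for l in range(L + 1))  # Eq. (2): exponential in L

# The example of Figure 5: K = 2, L = 3 hidden layers, c = 2 gives R = 16.
assert receptive_field(K=2, L=3, c=2) == 16
```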
However, the disadvantage of this approach is that the output from each convolutional layer is always
one-dimensional, meaning that the input time series are mixed. This mixing of inputs hinders causal
discovery when a deep network architecture is desired.
To allow for multivariate causal discovery, we extend the univariate TCN architecture to
a one-dimensional depthwise separable architecture in which the input time series stay separated.
The depthwise separable convolution is introduced in [46] and became popular with Google’s Xception
architecture for image classification [47]. It consists of depthwise convolutions, where channels are kept
separate by applying a different kernel to each input channel, followed by a 1 × 1 pointwise convolution
that merges together the resulting output channels [47]. This is different from normal convolutional
architectures that have only one kernel per layer. A depthwise separable architecture improves
accuracy and convergence speed [47], and the separate channels allow us to correctly interpret the
relation between an input time series and the target time series, without mixing the inputs.
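A minimal PyTorch sketch of one such layer is shown below. It is not the authors' exact implementation, but illustrates how groups=N keeps the N time series in separate channels, how padding followed by truncation keeps the convolution causal, and how trainable per-series attention scores can gate the input.

```python
import torch
import torch.nn as nn

class AttentionDepthwiseSeparableConv1d(nn.Module):
    """Sketch of one attention-gated, dilated, depthwise separable 1D layer."""

    def __init__(self, n_series: int, kernel_size: int, dilation: int):
        super().__init__()
        self.attention = nn.Parameter(torch.ones(n_series, 1))  # one score per series
        # groups=n_series: one kernel per input channel, channels stay separate.
        self.depthwise = nn.Conv1d(n_series, n_series, kernel_size,
                                   dilation=dilation, groups=n_series,
                                   padding=(kernel_size - 1) * dilation)
        # 1x1 pointwise convolution merges the channels into one prediction.
        self.pointwise = nn.Conv1d(n_series, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, N, T)
        out = self.depthwise(x * self.attention)
        out = out[:, :, :x.size(2)]                      # drop right padding: causal
        return self.pointwise(out)                       # (batch, 1, T)
```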
Our TCDF architecture consists of N channels, one for each input time series. In network
N_j, channel j corresponds to the target time series X_j = [0, X_j^1, X_j^2, ..., X_j^{T−1}] and all other channels correspond to the exogenous time series X_{i≠j} = [X_i^1, X_i^2, ..., X_i^{T−1}, X_i^T]. An overview of this architecture
is shown in Figure 6, including the attention mechanism that is discussed next.
[Figure 6: architecture of an AD-DSTCN. Each input series enters its own channel, passes through dilated depthwise convolutions with hidden layers and residual connections, and a pointwise convolution merges the channels into the output prediction X̂_j.]
We add a residual connection in each channel after each convolution from the input of the
convolution to the output (first layer excluded), as shown in Figure 6.
We denote by h_j the set of attention scores in a_j to which the HardSoftmax function is applied. TCDF creates a set of potential causes P_j for each time series X_j ∈ X. Time series X_i is considered a potential cause of the target time series X_j if its score h_{i,j} ∈ h_j is greater than 0.
We created an algorithm that determines τ_j by finding the largest gap between the attention scores in a_j. The algorithm ranks the attention scores from high to low and searches for the largest gap g between two adjacent attention scores a_{i,j} and a_{k≠i,j}. The threshold τ_j is then equal to the attention score on the left side of the gap. This approach is graphically shown in Figure 7, is subject to the two requirements listed below, and is sketched in code after the list. We denote by G the list of gaps [g_0, ..., g_{N−1}].
Figure 7. Threshold τ_j is set equal to the attention score at the left side of the largest gap g_k where k ≠ 0 and k < |G|/2. In this example, τ_j is set equal to the third largest attention score.
• Since a temporal causal graph is usually sparse, we require that the gap selected for τ_j lies in the first half of G (if N > 5) to ensure that the algorithm does not include low attention scores in the selection. At most 50% of the input time series can then be a potential cause of target X_j. By this requirement, we limit the number of time series labeled as potential causes. Although this number can be configured, we experimentally estimated that 50% gives good results.
• We require that the gap for τ_j cannot be in the first position (i.e., between the highest and second-highest attention score). This ensures that the algorithm does not truncate to zero the scores for time series which were actually a cause of the target time series, but were weaker than the top scorer. Thus, the potential causes P_j for target X_j will include at least two time series.
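A sketch of this selection procedure, with both requirements, could look as follows (the function name and array handling are ours):

```python
import numpy as np

def potential_causes(a_j: np.ndarray) -> list[int]:
    """Sketch of the largest-gap heuristic selecting potential causes of X_j.
    Returns the indices i whose attention score a_{i,j} is at least tau_j."""
    n = len(a_j)
    if n < 3:
        return list(range(n))                       # too few series for the heuristic
    order = np.argsort(a_j)[::-1]                   # score indices, high to low
    ranked = a_j[order]
    gaps = ranked[:-1] - ranked[1:]                 # G = [g_0, g_1, ...]
    limit = len(gaps) // 2 if n > 5 else len(gaps)  # gap must lie in first half of G
    candidates = range(1, max(limit, 2))            # skip g_0: keep at least two series
    k = max(candidates, key=lambda i: gaps[i])      # largest admissible gap
    tau_j = ranked[k]                               # score on the left side of the gap
    return [int(i) for i in order if a_j[i] >= tau_j]
```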
Note that a HardSoftmax score > 0 could also be the result of a spurious correlation. However,
since it is impossible to judge whether a correlation is spurious purely on the analysis of observational
data, TCDF does not take the possibility of a spurious correlation into account. After causal discovery
from observational data, it is up to a domain expert to judge or test whether a discovered causal
relationship is correct. Section 6 presents a more extensive discussion on this topic.
By comparing all attention scores, we create a set of potential causes for each time series. Then,
we will use our Permutation Importance Validation Method (PIVM) to validate if a potential cause is a
true cause. More specifically, TCDF will apply PIVM to distinguish between case 2a and 2b, between
case 3a and 3b and between case 4a and 4b.
Since we use a temporal convolutional network architecture, there is no information leakage from
future to past. Therefore, we comply with the temporal precedence assumption. The second aspect
is usually defined in terms of interventions. More specifically, an observed time series X_i is a cause of another observed time series X_j if there exists an intervention on X_i such that, if all other time series X_{−i} ∈ X are held fixed, X_i and X_j are associated [52]. However, such controlled experiments
in which other time series are held fixed may not be feasible in many time series applications (e.g.,
stock markets). In those cases, a data-driven causal validation measure can act as intervention method.
A causal validation measure models the difference in evaluation score between the real input data
and an intervened dataset in which a potential cause is manipulated to evaluate whether this changes
the effect.
TCDF uses Permutation Importance (PI) [53] as causal validation method. This feature importance
method measures how much an error score increases when the values of a variable are randomly
permuted [53]. According to van der Laan [54], the importance of a variable can be interpreted as
causal effect if the observed data structure is chronologically ordered, consistent and contains no
hidden confounding or randomization. (If the last assumption is violated, the variable importance
measures can still be applied, and subsequent experiments can determine until what degree the
variable importance is causal [54].) Permuting a time series’ values removes chronologicity and
therefore breaks a potential causal relationship between cause and effect. Only if the loss of a network increases significantly when a variable is permuted is that variable considered a cause of the predicted variable.
A closely related measure is the Causal Quantitative Input Influence measure of [55]. They construct
an intervened distribution by retaining the marginal distribution over all other inputs from the dataset
and randomly sampling the input of interest from its prior distribution. Instead of intervening on
variables, the “destruction of edges” [56] (intervening on the edges) in a Bayesian network can be
used to validate and quantify causal strength by calculating the relative entropy between the old and
intervened distribution. The method excludes instantaneous effects.
Note that the Permutation Importance method is a more adequate causal validation method than
simply removing a potential cause from the dataset. Removing a correlated variable may lead to worse
predictions, but this does not necessarily mean that the correlated variable is a cause of the predicted
variable. For example, suppose that a dataset contains one variable with values in [0, 1], and all other
variables in the dataset have values in [5000, 15, 000]. If the predicted variable lies within [0, 1], a neural
network might base its prediction on the variable having the same range of values. Removing it from
the dataset then leads to a higher loss, even if the variable was not a cause of the predicted variable.
If the values of such an effect are permuted, the loss will probably not change significantly, since the network still has access to the chronological order of the values of confounder X1 to predict X3. TCDF will then conclude that only X1 is a true cause of X3.
To determine whether an increase in loss between the original dataset and the intervened dataset
is ‘significant’, one could require a certain percentage of increase. However, the required increase in loss
is dependent on the dataset. A network applied to a dataset with clear patterns will, during training,
decrease its loss more compared to one trained on a dataset without clear patterns. TCDF includes a
small algorithm, called the Permutation Importance Validation Method (PIVM), to determine when an
increase in loss between the original dataset and the intervened dataset is relatively significant. This is
based on the initial loss at the first epoch, and uses a user-specified parameter s ∈ [0, 1] denoting a
significance measure. We experimentally found that a significance of s = 0.8 gives good results, but the
user can specify any other value in [0, 1].
TCDF trains a network N_j for E epochs on the original dataset and measures the decrease in ground loss between epoch 1 and epoch E: ΔL_G = L_G^1 − L_G^E. This denotes the improvement in loss that N_j can achieve by training on the input data. Subsequently, TCDF applies the trained network N_j to an intervened dataset in which the values of X_i ∈ P_j are randomly permuted, and measures the loss L_I. It then calculates ΔL_I = L_G^1 − L_I, the difference between the initial loss at the first epoch and the loss when the trained network is applied to the permuted dataset.
If this difference ΔL_I is greater than ΔL_G · s, then ΔL_I is still large, meaning that the loss L_I has not increased significantly compared to the trained loss L_G^E. TCDF then concludes that the permuted variable X_i ∈ P_j is not a true cause of X_j. On the other hand, if ΔL_I is small (≤ ΔL_G · s), then the permuted dataset leads to a loss L_I that is substantially larger than L_G^E and relatively close to (or greater than) the initial loss at the first epoch. TCDF therefore concludes that X_i ∈ P_j is a true cause of X_j.
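In code, the decision rule of PIVM might look as follows (the model and loss bookkeeping interface is assumed for illustration):

```python
import numpy as np

def pivm_is_true_cause(model, loss_fn, X: np.ndarray, j: int, i: int,
                       loss_epoch_1: float, loss_epoch_E: float,
                       s: float = 0.8, seed: int = 0) -> bool:
    """Sketch of PIVM's decision rule for one potential cause X_i of X_j.
    loss_epoch_1: L_G^1, loss on the original data at epoch 1.
    loss_epoch_E: L_G^E, loss after training for E epochs.
    s: significance parameter in [0, 1]; 0.8 worked well experimentally."""
    delta_g = loss_epoch_1 - loss_epoch_E          # improvement gained by training
    X_perm = X.copy()
    np.random.default_rng(seed).shuffle(X_perm[i]) # intervention: permute X_i in time
    loss_i = loss_fn(model(X_perm), X[j])          # L_I on the intervened dataset
    delta_i = loss_epoch_1 - loss_i
    # delta_i >  delta_g * s: permuting barely hurt  -> X_i is not a true cause.
    # delta_i <= delta_g * s: the loss shot back up  -> X_i is a true cause.
    return delta_i <= delta_g * s
```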
More specifically, TCDF will discover that the delay of this incorrectly learnt causal relationship equals the difference between the delays from the confounder to its two effects. Figure 8a shows an example of this situation. The same reasoning applies when the delay from the confounder to X_i is greater than the delay to X_j (case 2b).
(a) TCDF will incorrectly discover a causal relationship from X2 to X3 when the delay from X1 to X2 is smaller than the delay from X1 to X3. (b) TCDF will discover a 2-cycle between X2 and X3 where both delays equal 0, such that there should exist a hidden confounder between X2 and X3.
Figure 8. How TCDF deals, in theory, with hidden confounders (denoted by squares). A black square
indicates that the hidden confounder is discovered by TCDF; a grey square indicates that it is not
discovered. Black edges indicate causal relationships that will be included in the learnt temporal causal graph G_L; grey edges will not be included in G_L.
However, TCDF will not discover a causal relationship when the hidden confounder has equal delays to its effects X_i and X_j (case 4b), and can even conclude that there should be a hidden confounder that causes both X_i and X_j. Because the confounder has equal delays to X_i and X_j, the delays from X_i to X_j and from X_j to X_i will both be 0. The zero delays give away the presence of a hidden confounder, since there cannot exist a 2-cycle in which both time series have an instantaneous effect on each other. Recall that an instantaneous effect means that there is an effect within 1 measured time step. If both time series caused each other instantaneously, there would be an infinite causal influence between the time series within 1 time step, which is impossible. Therefore, TCDF will conclude that X_i and X_j are not causally related, and that there exists a hidden confounder between X_i and X_j. Figure 8b shows an example of this situation.
The advantage of our approach is that TCDF not only concludes that two variables are not causally related, but can also detect the presence of a hidden confounder.
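This check reduces to scanning the learnt graph for instantaneous 2-cycles. A sketch, using our own representation of the learnt graph as an edge-to-delay mapping:

```python
def flag_hidden_confounders(edges: dict[tuple[int, int], int]) -> list[tuple[int, int]]:
    """Sketch: find pairs (i, j) forming a 2-cycle where both delays are 0.
    Mutual instantaneous causation is impossible, so both edges are dropped
    and a hidden confounder between X_i and X_j is hypothesised instead."""
    flagged = []
    for (i, j), delay in list(edges.items()):
        if i < j and delay == 0 and edges.get((j, i)) == 0:
            flagged.append((i, j))
            del edges[(i, j)], edges[(j, i)]
    return flagged
```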
Figure 9. Discovering the delay between cause X1 and target X2, both having T = 16. Starting from the top convolutional layer, the algorithm traverses the path with the highest kernel weights. Eventually, the algorithm ends in input value X_1^10, indicating a delay of 16 − 10 = 6 time steps.
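The traversal in Figure 9 can be sketched as follows. This is our simplified reading of the procedure: per layer, follow the kernel tap with the highest weight and accumulate its dilated offset into the past; the kernel weights in the usage line are illustrative, not the figure's exact values.

```python
def discover_delay(kernel_weights: list[list[float]], c: int = 2) -> int:
    """Sketch of delay discovery for one input channel.
    kernel_weights: per layer, top layer first, the K depthwise kernel weights
    of this channel. The last kernel position points at the current time step;
    position k looks (K - 1 - k) * c^l steps further into the past."""
    delay = 0
    n_layers = len(kernel_weights)
    for depth, weights in enumerate(kernel_weights):
        l = n_layers - 1 - depth                    # dilation factor f = c^l per layer
        k_star = max(range(len(weights)), key=lambda k: weights[k])
        delay += (len(weights) - 1 - k_star) * c ** l
    return delay

# Mirroring Figure 9 (K = 2, four layers, c = 2): offsets 0 + 4 + 2 + 0 = 6.
assert discover_delay([[0.3, 0.8], [1.2, 0.4], [0.5, 0.1], [1.1, 1.3]]) == 6
```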
5. Experiments
To evaluate our framework, we apply TCDF to two benchmarks, each consisting of multiple
simulated datasets for which the true underlying causal structures are known. The benchmarks are
discussed in Section 5.1. The ground truth allows us to evaluate the accuracy of TCDF. We compare
the performance of TCDF with that of three existing temporal causal discovery methods described in
Section 5.2. Besides causal discovery accuracy, we evaluate prediction performance, delay discovery
accuracy and effectiveness of the causal validation step PIVM. We also evaluate how the architecture
of AD-DSTCNs influences the discovery of correct causal relationships. However, since it would be
impractical to test all parameter settings, we only vary the number of hidden layers L. As a side
experiment, we evaluate how TCDF handles hidden confounders. The evaluation measures for these
evaluations are described in Section 5.3. Results of all experiments are presented in Section 5.4.
Table 2. Summary of evaluation benchmarks. Delays between cause and effect are not available in FMRI.
Figure 10. Example datasets and causal graphs: simulation 17 from FMRI (top), graph 20-1A from
FINANCE (bottom). A colored line corresponds to one time series (node) in the causal graph.
The first benchmark, called FINANCE, contains datasets for 10 different causal structures of financial
markets [2]. For our experiments, we exclude the dataset without any causal relationships (since
this would result in an F1-score of 0). The datasets are created using the Fama-French Three-Factor
Model [57] that can be used to describe stock returns based on the three factors ‘volatility’, ‘size’ and
‘value’. A portfolio’s return Xit depends on these three factors at time t plus a portfolio-specific error
term [2]. We use one of the two 4000-day observation periods for each financial portfolio.
To evaluate the ability to detect hidden confounders, we created the benchmark FINANCE HIDDEN
containing four datasets. Each dataset corresponds to either dataset ‘20-1A’ or ‘40-1-3’ from FINANCE
except that one time series is hidden by replacing all its values by 0. Figure 11 shows the underlying
causal structures, in which a grey node denotes a hidden confounder. As can be seen, we test TCDF
on hidden confounders with both equal delays and unequal delays to its effects. To evaluate the
predictive ability of TCDF, we created training data sets corresponding to the first 80% of the data sets
and utilized the remaining 20% for testing. These data sets are referred to as FINANCE TRAIN/TEST.
Figure 11. Adapted ground truth for the hidden confounder experiment, showing graphs 20-1A (left)
and 40-1-3 (right) from FINANCE. Only one grey node was removed per experiment.
The second benchmark, called FMRI, contains realistic, simulated BOLD (Blood-oxygen-level
dependent) datasets for 28 different underlying brain networks [58]. BOLD FMRI measures the neural
activity of different regions of interest in the brain based on the change of blood flow. Each region
(i.e., node in the brain network) has its own associated time series. Since not all existing methods can
handle 50 time series, we excluded one dataset with 50 nodes. For each of the remaining 27 brain
networks, we selected one dataset (scanning session) out of multiple available. All time series have a
hidden external input, white noise, and are fed through a non-linear balloon model [59].
Since FMRI contains only six (out of 27) datasets with ‘long’ time series, we create an extra
benchmark that is a subset of FMRI. This subset contains only datasets in which the time series have at least 1000 time steps (therefore denoted as FMRI T > 1000) and which are coincidentally all stationary. To evaluate the predictive ability of TCDF, we created a training and test set corresponding to the first 80% and last 20% of the datasets, respectively, referred to as FMRI TRAIN/TEST and FMRI T > 1000 TRAIN/TEST.
where E(G) is the set of all edges in graph G. These TP and FP measures evaluate G_L only based on the direct causes in G_G. However, an indirect cause also has, although indirectly, a causal influence on the effect. Counting an indirect cause as a False Positive would not be objective (see Figure 12a,c for an example). We therefore construct the full ground-truth graph G_F from the ground truth graph G_G by adding edges that correspond to indirect causal relationships. This means that the full ground truth graph G_F contains a directed edge e_{i,j} for each directed path ⟨v_i, v_k, v_j⟩ in ground truth graph G_G. An example is given in Figure 12. Note that we do not adapt the False Negatives calculation, since methods should not be punished for excluding indirect causal relationships in their graph. Comparing the full ground-truth graph with the learnt graph, we obtain the following measures:

TP' = |E(G_F) ∩ E(G_L)|,  FP' = |E(G_L) \ E(G_F)|,  F1 = 2TP / (2TP + FP + FN),  F1' = 2TP' / (2TP' + FP' + FN).
Figure 12. Example with three variables showing that G_L has TP = 0, FP = 1 (e_{1,3}), TP' = 1 (e_{1,3}), FP' = 0 and FN = 2 (e_{1,2} and e_{2,3}). Therefore, F1 = 0 and F1' = 0.5.
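These measures can be sketched in code. The transitive-closure step below is our reading of "adding edges for indirect causal relationships", and the final line reproduces the example of Figure 12.

```python
def f1_scores(gt: set[tuple[int, int]], learnt: set[tuple[int, int]]) -> tuple[float, float]:
    """Sketch of F1 and F1'. gt holds the direct edges of G_G; the full graph
    G_F is built by adding an edge for every directed path (transitive closure)."""
    full = set(gt)
    while True:
        extra = {(i, j) for (i, k) in full for (k2, j) in full
                 if k == k2 and i != j} - full
        if not extra:
            break
        full |= extra
    tp, fp = len(gt & learnt), len(learnt - gt)
    fn = len(gt - learnt)                     # FN is always computed w.r.t. G_G
    tp_f, fp_f = len(full & learnt), len(learnt - full)
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    f1_f = 2 * tp_f / (2 * tp_f + fp_f + fn) if (tp_f + fp_f + fn) else 1.0
    return f1, f1_f

# Figure 12: ground truth X1 -> X2 -> X3; the learnt graph contains only X1 -> X3.
assert f1_scores({(1, 2), (2, 3)}, {(1, 3)}) == (0.0, 0.5)
```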
We evaluate the discovered delay d(e_{i,j} ∈ G_L) between cause X_i and effect X_j by comparing it to the full ground truth delay d(e_{i,j} ∈ G_F). By comparing it to the full ground truth, we not only evaluate
the delay of direct causal relationships, but can also evaluate if the discovered delay of indirect causal
relationships is correct. The ground truth delay of an indirect causal relationship is the sum of the
delays of its direct relationships. We only evaluate the delay of True Positive edges since the other
edges do not exist in both the full ground truth graph G F and the learnt graph G L . We measure the
percentage of delays on correctly discovered edges w.r.t. the full ground-truth graph.
We summarize the PIVM effectiveness by calculating the relative increase (or decrease) of the
F1-score and F1’-score when PIVM is applied compared to when it is not. The goal of the Permutation
Importance Validation Method (PIVM) is to label a subset of the potential causes as true causes.
We evaluate whether a causal discovery method discovers the existence of a hidden confounder
between two time series by applying it to the FINANCE HIDDEN benchmark and counting how many
hidden confounders were discovered. As discussed in Section 4.3.2, TCDF should be able to discover
the existence of a hidden confounder between two time series Xi and X j when the confounder has
equal delays to its effects X_i and X_j. If the confounder has unequal delays to its effects, we expect that TCDF will discover an incorrect causal relationship between X_i and X_j. We therefore not only evaluate
how many hidden confounders were discovered, but also how many incorrect causal relationships
were learnt between the confounder and its effects.
5.4. Results
In this section we present the results obtained by the four compared methods for causal discovery
and delay discovery. Additionally we show the impact of applying PIVM. The section ends with a
small case study showing that TCDF can circumstantially detect hidden confounders.
Table 3. Time series prediction performance of TCDF, in terms of the mean absolute scaled error
(MASE) averaged across all datasets, plus its standard deviation. Best results are highlighted in bold.
Table 4 shows F1 and F1’-scores obtained by the four compared methods for causal discovery.
Recall that the F1-score evaluates only direct causal relationships (i.e., an edge from vertex v_i to vertex v_j in the ground truth graph). The F1'-score also evaluates the indirect causal relationships, such that a learnt edge from v_i to v_j can correspond with a path p = ⟨v_i, v_k, v_j⟩ in the ground truth graph.
Table 4. Causal discovery overview for all data sets and all methods. Showing macro-averaged F1 and
F1’ scores and standard deviations. The highest score per benchmark is highlighted in bold.
                FINANCE (9 Data Sets)        FMRI (27 Data Sets)          FMRI T > 1000 (6 Data Sets)
                F1           F1'             F1           F1'             F1           F1'
TCDF (L = 0)    0.64 ± 0.06  0.77 ± 0.08     0.60 ± 0.09  0.63 ± 0.09     0.68 ± 0.05  0.68 ± 0.05
TCDF (L = 1)    0.65 ± 0.09  0.78 ± 0.10     0.58 ± 0.15  0.62 ± 0.14     0.65 ± 0.13  0.68 ± 0.11
TCDF (L = 2)    0.64 ± 0.09  0.77 ± 0.09     0.55 ± 0.13  0.63 ± 0.11     0.70 ± 0.09  0.73 ± 0.08
PCMCI           0.55 ± 0.22  0.56 ± 0.22     0.63 ± 0.10  0.67 ± 0.11     0.67 ± 0.04  0.67 ± 0.04
tsFCI           0.37 ± 0.11  0.37 ± 0.12     0.49 ± 0.22  0.49 ± 0.22     0.48 ± 0.28  0.48 ± 0.28
TiMINo          0.13 ± 0.05  0.21 ± 0.10     0.23 ± 0.12  0.37 ± 0.14     0.23 ± 0.11  0.37 ± 0.15
When comparing the overall performance on the FINANCE benchmark, TCDF outperforms the
other methods. Especially the F1’-score of TCDF is much higher, indicating that a substantial part
of the False Positives of TCDF are correct indirect causes. Since Deep Learning models have many
parameters that need to be fit during training and therefore usually need more data than models with
a less complex hypothesis space [62], TCDF performs slightly worse on the FMRI benchmark compared
to FINANCE because of some short time series in FMRI. Whereas all datasets in FINANCE contain 4000
time steps, FMRI contains only six (out of 27) datasets with more than 1000 time steps. The results
Mach. Learn. Knowl. Extr. 2019, 1, 19 21 of 28
for TCDF when applied only to datasets with T > 1000 are therefore better than the overall average
from all datasets. For FMRI T > 1000, our results are slightly better than the performance of PCMCI,
and TCDF clearly outperforms tsFCI and TiMINo. PCMCI is not affected by time series length and
performs comparably for both FMRI benchmarks. TiMINo performs very poorly when applied to
FINANCE and only slightly better on FMRI, which is mainly due to a large number of False Positives.
TiMINo’s poor results are in line with results from the authors, who already stated that TiMINo is not
suitable for high-dimensional data [22]. In contrast, where TiMINo discovers many incorrect causal
relationships, tsFCI seems to be too conservative, missing many causal relationships in all benchmarks.
Our poor results of tsFCI correspond with poor results of tsFCI in experiments done by the authors on
continuous data [21]. In terms of computation time, PCMCI and tsFCI are faster than TCDF for both
benchmarks, as shown in Table 5.
Table 5. Run time in seconds, averaged over all datasets in the benchmark. TCDF (without parallelism) and TiMINo are run on a Ubuntu 16.04.4 LTS computer with an Intel® Xeon® E5-2683-v4 CPU and an NVIDIA TitanX 12 GB GPU. PCMCI and tsFCI are run on a Windows 10 1803 computer with an Intel® Core™ i7-5500U CPU.
Table 6 shows the evaluation results for discovering the time delay between cause and effect.
Since FMRI does not explicitly include delays and therefore does not have a delay ground truth,
we only evaluate FINANCE. PCMCI discovered all delays correctly, closely followed by tsFCI and TCDF.
Note that TiMINo only outputs causal relationships without delays. This experiment suggests that
our delay discovery algorithm performs well not only without hidden layers (which makes the delay
discovery relatively easy), but still keeps the percentage of correctly discovered delays relatively high
when the number of hidden layers L (and therefore the number of kernels, the receptive field and
maximum delay) is increased. Thus, the number of hidden layers seems of almost no influence for the
accuracy of the delay discovery.
Table 6. Delay discovery overview for all data sets of the FINANCE benchmark (nine datasets). Showing
macro-averaged percentage of delays that are correctly discovered w.r.t. the full ground truth,
and standard deviation. TiMINo does not discover delays.
            TCDF (L = 0)     TCDF (L = 1)     TCDF (L = 2)     PCMCI             tsFCI            TiMINo
FINANCE     97.79% ± 2.56    96.42% ± 3.68    95.49% ± 4.15    100.00% ± 0.00    98.77% ± 3.49    n.a.
Table 7. Impact of causal validation step. Showing macro-averaged F1 scores and standard deviation
for TCDF with PIVM and TCDF without PIVM. ∆ shows the change in F1-score or F1’-score in percent.
                         FINANCE (9 Data Sets)        FMRI (27 Data Sets)          FMRI T > 1000 (6 Data Sets)
                         F1           F1'             F1           F1'             F1           F1'
TCDF (L = 0)             0.64 ± 0.06  0.77 ± 0.08     0.60 ± 0.09  0.63 ± 0.09     0.68 ± 0.05  0.68 ± 0.05
TCDF (L = 0) w/o PIVM    0.22 ± 0.09  0.30 ± 0.13     0.60 ± 0.09  0.63 ± 0.09     0.68 ± 0.05  0.68 ± 0.05
∆ (PIVM)                 −66%         −61%            0%           0%              0%           0%
Table 8. Results of our TCDF (L = 1) applied to FINANCE HIDDEN. ‘Equal Delays’ denotes whether the
delays from the confounder (conf.) to the confounder’s effects are equal. Grey causal relationships
denote that the discovered relationship was not causal according to the ground truth.
Dataset   Hidden Conf.   Effects     Equal Delays   Conf. Discovered   Learnt Causal Relationships
20-1A     X16            X8, X5      ✓              ✓                  X16 → X8, X16 → X5
40-1-3    X7             X8, X3      ✓              ✓                  X7 → X8, X7 → X3
40-1-3    X0             X5, X6      ✗              ✗                  X5 → X6
40-1-3    X8             X23, X4     ✓              ✓                  X8 → X23, X8 → X4
40-1-3    X8             X15, X4     ✗              ✗                  -
40-1-3    X8             X24, X4     ✗              ✗                  -
40-1-3    X8             X24, X15    ✗              ✗                  -
40-1-3    X8             X24, X23    ✗              ✗                  X24 → X23
40-1-3    X8             X15, X23    ✗              ✗                  -
It can be seen in Table 8 that TCDF discovered all hidden confounders with equal delays
to the confounder’s effects, which corresponds with our expectations. In two out of six cases,
TCDF incorrectly learnt a causal relationship between the effects of a hidden confounder with unequal
delays. TCDF (correctly) did not detect a causal relationship between some effects of the hidden
confounder X8 , because the attention mechanism did not discover the potential causal relationships.
We think that a non-causal correlation that arises because of the hidden confounder with unequal
delays was too weak to be selected as potential cause by the attention mechanism, which indicates that
our attention interpretation method to select potential causes is effective and strict enough.
Table 9. Results of TCDF compared with PCMCI, tsFCI and TiMINo when applied to datasets with
hidden confounders. The first row denotes the number of incorrect causal relationships that were
discovered between the effects of the hidden confounders. The second row denotes the number of
hidden confounders that were located.
Whereas TCDF discovered two incorrect causal relationships because of a hidden confounder,
PCMCI did not discover any incorrect causal relationship. However, in contrast to TCDF, PCMCI does
not give any indication that two particular time series are correlated, or that there might be a hidden
confounder between these time series. tsFCI should handle hidden confounders by including a special
edge type (X_i ↔ X_j) that shows that X_i is not a cause of X_j and that X_j is not a cause of X_i. However,
the results of tsFCI in our experiment are not in accordance with the theoretical claims, since tsFCI did
not discover any hidden confounder. In three cases, it even discovered incorrect causal relationships.
In all cases except one, TiMINo discovered an indirect causal relationship.
This case study suggests that TCDF performs as expected by successfully discovering the presence
of a hidden confounder when the delays to the confounder’s effects are equal and, in some cases,
incorrectly discovering a causal relationship between the confounder’s effects when the delays to the
effects are unequal. Compared to other approaches, PCMCI performs better in terms of not discovering
any incorrect causal relationships between the confounder’s effects, but TCDF is the only method
capable of locating the presence of a hidden confounder.
5.5. Summary
Besides being accurate in predicting time series, TCDF correctly discovers most causal
relationships. TCDF outperforms the compared methods (PCMCI, tsFCI and TiMINo) in terms
of causal discovery accuracy when applied to FINANCE and FMRI T > 1000. Since a Deep Learning
method has many parameters to fit, TCDF performs slightly worse on short time series in FMRI.
In contrast, the accuracy of PCMCI is not affected by time series length. Although computation time is
not so relevant in the domain of knowledge extraction, PCMCI is faster than TCDF. TCDF discovers
roughly 95–97% of delays correctly, which is only slightly worse than PCMCI and tsFCI. TCDF is the
only method to locate the presence of a hidden confounder but, contrary to PCMCI, discovers in some
cases an incorrect causal relationship between a confounder’s effects.
6. Discussion
Since a causal discovery method based on observational data cannot physically intervene in a
system to check if manipulating the cause changes the effect, causal discovery methods are principally
used to discover and investigate hypotheses. Therefore, a temporal causal graph constructed by TCDF (or any other causal discovery method) should be interpreted as a hypothetical graph, learnt from observational time series data, which can subsequently be confirmed by a domain expert or by experimentation. This is especially relevant in the case of spurious correlations, where the values
of two unrelated variables are coincidentally statistically correlated. A causal discovery method
will probably label a spurious correlation as a causal relationship if there are no counterexamples
available. Only based on domain knowledge or experiments, one can conclude that the discovered
causal relationship is incorrect. However, whereas most researchers are aware that real-life experiments
are considered the “gold standard” for causal inference, manipulation of the independent variable
of interest will often be unfeasible, unethical, or simply impossible [63]. Thus, causal discovery from
observational data is often the better (or only) option.
As shown in the previous section, our Temporal Causal Discovery Framework can discover causal
relationships from time series data, including a correctly discovered time delay. In the following
sections, we will discuss the limitations of our approach as well as the sensitivity of TCDF to
hyperparameters.
6.1. Hyperparameters
In the experiments, we applied TCDF with different values for L, the number of hidden layers
in the depthwise convolutions. From Table 4, we can conclude that the F1-scores of the FINANCE
benchmark barely differ across different values for L. TCDF with L = 2 performs worst on FMRI because
the architecture is probably too complex for the dataset (there are too many parameters to fit) and the receptive field (and therefore the maximum delay) is unnecessarily large. The results for TCDF with L = 2 improve substantially when applied to time series having more than 1000 time steps. Thus, the best number of hidden layers depends on the dataset and mainly on the length of the time series. The number of hidden layers also influences the receptive field: TCDF with L = 2 and kernel size K = 4 has a receptive field of 64 time steps. Since the maximum delay in the FINANCE benchmark is three time steps, it might be more challenging for TCDF to discover the correct patterns. Interestingly,
increasing the number of hidden layers barely influences the number of correctly discovered delays.
The experiments show that despite the more complex delay discovery and the increased receptive
field, our delay discovery algorithm correctly discovers almost all delays.
The underlying causal structure is not known when TCDF is applied to actual data, so the number
of hidden layers is a hyperparameter that is difficult to choose. Since the receptive field should be
at least as large as the expected maximum delay in the dataset, our first rule of thumb would be
that when a large time delay is expected, more dilated hidden layers can be included, such that the
receptive field increases exponentially. Secondly, the length of the time series can give an indication of
the number of hidden layers. Short time series seem to require fewer hidden layers.
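Following this rule of thumb, a sketch could pick the smallest L whose receptive field (Equation (2)) covers an expected maximum delay. The choice of a dilation coefficient equal to the kernel size is an assumption inferred from the experiments (where L = 2 and K = 4 give R = 64).

```python
def min_hidden_layers(max_delay: int, K: int = 4, c: int = 4) -> int:
    """Sketch: smallest number of hidden layers L whose receptive field,
    R = 1 + sum_{l=0}^{L} (K - 1) * c^l (Eq. (2)), covers max_delay."""
    L = 0
    while 1 + sum((K - 1) * c ** l for l in range(L + 1)) < max_delay:
        L += 1
    return L

# With K = c = 4, as in the experiments, L = 2 suffices for delays up to 64.
assert min_hidden_layers(max_delay=64) == 2
```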
In our experiments, TCDF performs reasonably well on all benchmarks with L = 0. For future
work, it is interesting to study whether this also holds for datasets with a much larger time delay
between cause and effect, since a larger receptive field is required there. We also note that TCDF
with L = 0 (i.e., no hidden layers in the depthwise convolution) is conceptually equivalent to a 2D
convolution with one channel and a 2D kernel with a height equal to the number of time series. It is
interesting to study whether such a simple 2D convolutional architecture with an attention mechanism
would give significantly different results than our AD-DSTCN architecture with L = 0.
Besides the varying number of hidden layers, there are a few other hyperparameters that can be optimized: the number of epochs, the learning rate and the loss function. We leave this for future work.
which was mainly due to some short time series in the benchmark. When evaluating TCDF only on time
series with at least 1000 time steps, TCDF outperforms the other compared methods. By interpreting
the networks’ internal parameters, TCDF discovered roughly 95–97% of the time delays correctly,
which is only slightly worse than the delay discovery accuracy of other methods. In a small case study,
we showed that TCDF can circumstantially detect the presence of hidden confounders.
Future work includes hyperparameter optimization, and applying TCDF to more datasets with
different noise levels, (non-)stationarity and various time delays. We might be able to increase
performance by improving the attention interpretation or studying other causal validation methods.
Author Contributions: M.N. conceived the overall idea, designed the framework, created the software and
processed related work. D.B. supervised the study and designed the graph evaluation measures. M.N. conducted
the experiments in collaboration with C.S. M.N. wrote the paper and it was structured and revised by C.S. D.B.
and C.S. reviewed the writing. All authors read and approved the final paper.
Funding: This research received no external funding.
Acknowledgments: The authors would like to thank Maurice van Keulen for the valuable feedback.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Kleinberg, S. Why: A Guide to Finding and Using Causes; O’Reilly: Springfield, MA, USA, 2015.
2. Kleinberg, S. Causality, Probability, and Time; Cambridge University Press: Cambridge, UK, 2013.
3. Zorzi, M.; Sepulchre, R. AR Identification of Latent-Variable Graphical Models. IEEE Trans. Autom. Control
2016, 61, 2327–2340. [CrossRef]
4. Spirtes, P. Introduction to causal inference. J. Mach. Learn. Res. 2010, 11, 1643–1662.
5. Zhang, K.; Schölkopf, B.; Spirtes, P.; Glymour, C. Learning causality and causality-related learning:
Some recent progress. Natl. Sci. Rev. 2017, 5, 26–29. [CrossRef]
6. Danks, D. The Psychology of Causal Perception and Reasoning. In The Oxford Handbook of Causation;
Beebee, H.; Hitchcock, C.; Menzies, P., Eds.; Oxford University Press: Oxford, UK, 2009; Chapter 21, pp. 447–470.
7. Abdul, A.; Vermeulen, J.; Wang, D.; Lim, B.Y.; Kankanhalli, M. Trends and trajectories for explainable,
accountable and intelligible systems: An hci research agenda. In Proceedings of the 2018 CHI Conference
on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; ACM: New York, NY,
USA, 2018; p. 582.
8. Runge, J.; Sejdinovic, D.; Flaxman, S. Detecting causal associations in large nonlinear time series datasets.
arXiv 2017, arXiv:1702.07007.
9. Huang, Y.; Kleinberg, S. Fast and Accurate Causal Inference from Time Series Data. In Proceedings of the
FLAIRS Conference, Hollywood, FL, USA, 18–20 May 2015; pp. 49–54.
10. Hu, M.; Liang, H. A copula approach to assessing Granger causality. NeuroImage 2014, 100, 125–134.
[CrossRef]
11. Papana, A.; Kyrtsou, C.; Kugiumtzis, D.; Diks, C. Detecting causality in non-stationary time series using
partial symbolic transfer entropy: Evidence in financial data. Comput. Econ. 2016, 47, 341–365. [CrossRef]
12. Müller, B.; Reinhardt, J.; Strickland, M.T. Neural Networks: An Introduction; Springer: Berlin/Heidelberg,
Germany, 2012.
13. Hyvärinen, A.; Shimizu, S.; Hoyer, P.O. Causal modelling combining instantaneous and lagged effects:
An identifiable model based on non-Gaussianity. In Proceedings of the 25th International Conference on
Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 424–431.
14. Malinsky, D.; Danks, D. Causal discovery algorithms: A practical guide. Philos. Compass 2018, 13, e12470.
[CrossRef]
15. Quinn, C.J.; Coleman, T.P.; Kiyavash, N.; Hatsopoulos, N.G. Estimating the directed information to infer
causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. 2011, 30, 17–44. [CrossRef]
16. Gevers, M.; Bazanella, A.S.; Parraga, A. On the identifiability of dynamical networks. IFAC-PapersOnLine
2017, 50, 10580–10585. [CrossRef]
17. Friston, K.; Moran, R.; Seth, A.K. Analysing connectivity with Granger causality and dynamic causal
modelling. Curr. Opin. Neurobiol. 2013, 23, 172–178. [CrossRef]
18. Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms; MIT Press:
Cambridge, MA, USA, 2017.
19. Papana, A.; Kyrtsou, K.; Kugiumtzis, D.; Diks, C. Identifying Causal Relationships in Case of Non-Stationary
Time Series; Technical Report; Universiteit van Amsterdam: Amsterdam, The Netherlands, 2014.
20. Chu, T.; Glymour, C. Search for additive nonlinear time series causal models. J. Mach. Learn. Res. 2008,
9, 967–991.
21. Entner, D.; Hoyer, P.O. On causal discovery from time series data using FCI. In Proceedings of the
Fifth European Workshop on Probabilistic Graphical Models, Helsinki, Finland, 13–15 September 2010;
pp. 121–128.
22. Peters, J.; Janzing, D.; Schölkopf, B. Causal inference on time series using restricted structural equation
models. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2013;
pp. 154–162.
23. Jiao, J.; Permuter, H.H.; Zhao, L.; Kim, Y.H.; Weissman, T. Universal estimation of directed information.
IEEE Trans. Inf. Theory 2013, 59, 6220–6242. [CrossRef]
24. Granger, C.W. Investigating causal relations by econometric models and cross-spectral methods. Econom. J.
Econom. Soc. 1969, 37, 424–438. [CrossRef]
25. Chen, Y.; Bressler, S.L.; Ding, M. Frequency decomposition of conditional Granger causality and application
to multivariate neural field potential data. J. Neurosci. Methods 2006, 150, 228–237. [CrossRef]
26. Zorzi, M.; Chiuso, A. Sparse plus low rank network identification: A nonparametric approach. Automatica
2017, 76, 355–366. [CrossRef]
27. Marinazzo, D.; Pellicoro, M.; Stramaglia, S. Kernel method for nonlinear Granger causality. Phys. Rev. Lett.
2008, 100, 144103. [CrossRef]
28. Luo, Q.; Ge, T.; Grabenhorst, F.; Feng, J.; Rolls, E.T. Attention-dependent modulation of cortical taste circuits
revealed by Granger causality with signal-dependent noise. PLoS Comput. Biol. 2013, 9, e1003265. [CrossRef]
29. Spirtes, P.; Zhang, K. Causal discovery and inference: Concepts and recent methodological advances.
In Applied Informatics; Springer: Berlin, Germany, 2016; Volume 3, p. 3.
30. Spirtes, P.; Glymour, C.N.; Scheines, R. Causation, Prediction, and Search; MIT Press: Cambridge, MA,
USA, 2000.
31. Liu, Y.; Aviyente, S. The relationship between transfer entropy and directed information. In Proceedings of
the Statistical Signal Processing Workshop (SSP), Ann Arbor, MI, USA, 5–8 August 2012; pp. 73–76.
32. Guo, T.; Lin, T.; Lu, Y. An Interpretable LSTM Neural Network for Autoregressive Exogenous Model.
In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada,
30 April–3 May 2018.
33. Louizos, C.; Shalit, U.; Mooij, J.M.; Sontag, D.; Zemel, R.; Welling, M. Causal effect inference with deep
latent-variable models. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA,
USA, 2017; pp. 6446–6456.
34. Goudet, O.; Kalainathan, D.; Caillou, P.; Guyon, I.; Lopez-Paz, D.; Sebag, M. Causal Generative Neural
Networks. arXiv 2018, arXiv:1711.08936v2.
35. Kalainathan, D.; Goudet, O.; Guyon, I.; Lopez-Paz, D.; Sebag, M. SAM: Structural Agnostic Model, Causal
Discovery and Penalized Adversarial Learning. arXiv 2018, arXiv:1803.04929.
36. Bai, S.; Kolter, J.Z.; Koltun, V. Convolutional Sequence Modeling Revisited. In Proceedings of the
International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
37. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult.
IEEE Trans. Neural Netw. 1994, 5, 157–166. [CrossRef]
38. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence
Learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia,
6–11 August 2017; Volume 70, pp. 1243–1252.
39. Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. Conditional
image generation with pixelCNN decoders. In Advances in Neural Information Processing Systems; The MIT
Press: Cambridge, MA, USA, 2016; pp. 4790–4798.
40. Borovykh, A.; Bohte, S.; Oosterlee, C.W. Conditional time series forecasting with convolutional neural
networks. In Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence; Springer: Berlin, Germany,
2017; pp. 729–730.
41. Binkowski, M.; Marti, G.; Donnat, P. Autoregressive Convolutional Neural Networks for Asynchronous
Time Series. arXiv 2017, arXiv:1703.04122.
42. Walther, D.; Rutishauser, U.; Koch, C.; Perona, P. On the usefulness of attention for object recognition.
In Proceedings of the Workshop on Attention and Performance in Computational Vision at ECCV, Prague,
Czech Republic, 15 May 2004; pp. 96–103.
43. Yin, W.; Schütze, H.; Xiang, B.; Zhou, B. ABCNN: Attention-Based Convolutional Neural Network for
Modeling Sentence Pairs. Trans. Assoc. Comput. Linguist. 2016, 4, 259–272. [CrossRef]
44. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on
imagenet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision
(ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
45. Van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.;
Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
46. Sifre, L.; Mallat, S. Rigid-Motion Scattering for Image Classification. 2014. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.672.7091&rep=rep1&type=pdf (accessed on 15 October 2018).
47. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017;
pp. 1800–1807.
48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016;
pp. 770–778.
49. Martins, A.; Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label
classification. In Proceedings of the International Conference on Machine Learning, New York, NY, USA,
19–24 June 2016; pp. 1614–1623.
50. Shen, T.; Zhou, T.; Long, G.; Jiang, J.; Wang, S.; Zhang, C. Reinforced Self-Attention Network: A Hybrid of
Hard and Soft Attention for Sequence Modeling. In Proceedings of the Twenty-Seventh International Joint
Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; pp. 4345–4352.
51. Eichler, M. Causal inference in time series analysis. In Causality: Statistical Perspectives and Applications; Wiley:
Hoboken, NJ, USA, 2012; pp. 327–354.
52. Woodward, J. Making Things Happen: A Theory of Causal Explanation; Oxford University Press: Oxford,
UK, 2005.
53. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
54. Van der Laan, M.J. Statistical inference for variable importance. Int. J. Biostat. 2006, 2. [CrossRef]
55. Datta, A.; Sen, S.; Zick, Y. Algorithmic transparency via quantitative input influence: Theory and experiments
with learning systems. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Jose, CA,
USA, 23–25 May 2016; pp. 598–617.
56. Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat. 2013,
41, 2324–2358. [CrossRef]
57. Fama, E.F.; French, K.R. The cross-section of expected stock returns. J. Financ. 1992, 47, 427–465. [CrossRef]
58. Smith, S.M.; Miller, K.L.; Salimi-Khorshidi, G.; Webster, M.; Beckmann, C.F.; Nichols, T.E.; Ramsey, J.D.;
Woolrich, M.W. Network modelling methods for FMRI. Neuroimage 2011, 54, 875–891. [CrossRef]
59. Buxton, R.B.; Wong, E.C.; Frank, L.R. Dynamics of blood flow and oxygenation changes during brain
activation: The balloon model. Magn. Reson. Med. 1998, 39, 855–864. [CrossRef]
60. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International
Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
61. Hyndman, R.; Khandakar, Y. Automatic Time Series Forecasting: The forecast Package for R. J. Stat. Softw.
2008, 27, 1–22. [CrossRef]
62. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available
online: http://www.deeplearningbook.org (accessed on 3 December 2018).
63. Rohrer, J.M. Thinking clearly about correlations and causation: Graphical causal models for observational
data. Adv. Methods Pract. Psychol. Sci. 2018, 1, 27–42. [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).