
The Mathematics of Causality

Miquel Noguer i Alonso


Artificial Intelligence Finance Institute
November 28, 2024

Abstract
This paper provides a comprehensive exploration of causality through a mathematical lens, integrating classical frameworks and contemporary advancements in machine learning. We highlight the role of Structural Causal Models (SCMs) and the Potential Outcomes Framework in establishing foundational principles for causal inference. Furthermore, recent breakthroughs in transformer-based models reveal their ability to encode latent causal structures within their attention mechanisms during gradient descent training. This work bridges the gap between traditional causal inference and modern computational approaches, offering insights into the interplay between classical models and deep learning architectures. Applications in time-series analysis, graphical models, and natural language processing are discussed, alongside challenges and future directions for integrating these paradigms.

1 Introduction
Causality is foundational for disentangling correlation from causation in various fields. Recent advancements highlight the role of transformers, particularly their self-attention mechanisms, in capturing causal structures. This work integrates classical causal frameworks (Pearl [2009]), modern machine learning approaches (Chernozhukov et al. [2018]), insights from Nichani et al. [2024] on how transformers learn latent causal graphs, and causal perspectives on factor investing from López de Prado [2022].

2 Classical Causal Models
Causality is fundamental to scientific reasoning, providing tools to distinguish
correlation from causation. Classical causal models offer rigorous frameworks
for analyzing cause-and-effect relationships. This section discusses two key
paradigms: Structural Causal Models (SCMs) and the Potential Outcomes
Framework, highlighting their theoretical foundations, key concepts, and ap-
plications.

2.1 Structural Causal Models (SCMs)


Structural Causal Models (SCMs), introduced by Pearl [2009], combine structural equations with directed acyclic graphs (DAGs) to encode causal mechanisms. SCMs formalize how variables interact, supporting causal inference through graph-based reasoning and counterfactual analysis.

2.1.1 Components of SCMs


An SCM is defined by:

• A set of variables X = {X1, X2, ..., Xn}, partitioned into observed variables and unobserved latent variables.

• A set of structural equations fi for each variable Xi, expressed as:

Xi = fi(Pai, Ui)

where Pai ⊆ X are the parents of Xi in the DAG, and Ui are exogenous noise variables independent of each other.

• A directed acyclic graph (DAG) G = (X, E), where E is the set of edges. Each directed edge Xi → Xj encodes a causal relationship.
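These components can be made concrete with a minimal simulation sketch. The four-variable DAG, the linear mechanisms, and the coefficients below are hypothetical illustrations, not taken from the paper; they show how exogenous noise and structural equations, evaluated in topological order, generate data, and how an intervention rewires one equation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Exogenous noise variables U_i, mutually independent.
u1, u2, u3, u4 = rng.normal(size=(4, n))

# Structural equations X_i = f_i(Pa_i, U_i), evaluated in topological
# order of a hypothetical DAG: X1 -> X2, X1 -> X3, {X2, X3} -> X4.
x1 = u1
x2 = 0.8 * x1 + u2
x3 = -0.5 * x1 + u3
x4 = x2 + 0.3 * x3 + u4

# An intervention do(X2 = 1) replaces X2's equation with a constant,
# severing its dependence on the parent X1; downstream equations are reused.
x2_do = np.ones(n)
x4_do = x2_do + 0.3 * x3 + u4
```

Under observation the mean of X4 stays near zero, while under do(X2 = 1) it shifts toward 1, reflecting the causal effect propagated through the modified equation.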

2.1.2 Key Assumptions


SCMs operate under three fundamental assumptions:

• Causal Sufficiency: All relevant variables affecting the system are included.

• Faithfulness: Statistical independencies in the data are reflected in the causal graph.

• Markov Property: Each variable is independent of its non-descendants given its parents in the DAG.

Figure 1: A Directed Acyclic Graph (DAG) illustrating causal relationships among variables X1, X2, X3, and X4.

2.1.3 Causal Queries in SCMs


SCMs enable three types of causal queries:

1. Association: Observed statistical relationships, e.g., P(Y | X).

2. Intervention: Effects of external manipulations, modeled using the do-operator: P(Y | do(X = x)).

3. Counterfactuals: Hypothetical scenarios, answering "what would have happened if...".

2.1.4 Interventions and Do-Calculus


Interventions modify the structural equations of an SCM by fixing a variable
X to a value x, removing its dependence on its parents:
P(Y | do(X = x)) = Σ_Z P(Y | X = x, Z) P(Z)

where Z satisfies the backdoor criterion.
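The backdoor adjustment can be evaluated directly from a joint distribution table. The binary joint distribution below is a hypothetical illustration; the sketch compares the interventional quantity P(Y | do(X)) with the plain observational conditional P(Y | X), which differ when Z confounds X and Y:

```python
import numpy as np

# Hypothetical joint distribution P(Z, X, Y) over binary variables,
# indexed as p[z, x, y]; Z is a confounder satisfying the backdoor criterion.
p = np.array([[[0.20, 0.05], [0.05, 0.10]],
              [[0.05, 0.10], [0.10, 0.35]]])

def p_y_do_x(p, x, y):
    """Backdoor adjustment: P(Y=y | do(X=x)) = sum_z P(Y=y | X=x, Z=z) P(Z=z)."""
    total = 0.0
    for z in (0, 1):
        p_z = p[z].sum()                            # P(Z = z)
        p_y_given_xz = p[z, x, y] / p[z, x].sum()   # P(Y = y | X = x, Z = z)
        total += p_y_given_xz * p_z
    return total

interventional = p_y_do_x(p, x=1, y=1)              # P(Y=1 | do(X=1))
observational = p[:, 1, 1].sum() / p[:, 1, :].sum() # P(Y=1 | X=1)
```

The gap between the two numbers is exactly the confounding bias that the do-operator removes.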

2.2 Potential Outcomes Framework
The Potential Outcomes Framework, pioneered by Rubin [1974], provides an
alternative formulation of causality based on counterfactual reasoning. It is
widely used for treatment effect estimation in experimental and observational
studies.

2.2.1 Setup
For each individual, let:

• Y (1) be the potential outcome under treatment (T = 1).

• Y (0) be the potential outcome under control (T = 0).

The observed outcome is:

Y = T · Y (1) + (1 − T ) · Y (0)

2.2.2 Key Assumptions


1. Consistency: The observed outcome matches the potential outcome under the observed treatment:

Y = Y(T)

2. Ignorability (Unconfoundedness): Treatment assignment is independent of potential outcomes, conditional on observed covariates:

T ⊥⊥ {Y(1), Y(0)} | X

3. Positivity: Each individual has a non-zero probability of receiving either treatment:

0 < P(T = 1 | X) < 1

2.2.3 Average Treatment Effect (ATE)


The causal effect of treatment is captured by the Average Treatment Effect
(ATE):
ATE = E[Y (1) − Y (0)]

2.2.4 Estimation Techniques
• Randomized Controlled Trials (RCTs): Ensure ignorability through randomization.

• Propensity Score Matching: Matches treated and control units with similar covariates.

• Inverse Probability Weighting (IPW): Reweights the population to account for differences in treatment probabilities.
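A minimal simulation sketch of the IPW estimator (the confounder model, propensity function, and true effect of 2.0 are all hypothetical choices for illustration): each unit is reweighted by the inverse of its treatment probability, which recovers the ATE even though treatment assignment is confounded:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Confounder X affects both treatment assignment and the outcome.
x = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-x))   # P(T=1 | X); strictly in (0, 1), so positivity holds
t = rng.binomial(1, propensity)

# Potential outcomes with a constant treatment effect of 2.0.
y0 = x + rng.normal(size=n)
y1 = y0 + 2.0
y = t * y1 + (1 - t) * y0               # observed outcome Y = T*Y(1) + (1-T)*Y(0)

# Naive difference in means is biased upward by confounding.
naive = y[t == 1].mean() - y[t == 0].mean()

# Horvitz-Thompson IPW estimate of the ATE, using the true propensity score.
ate_ipw = np.mean(t * y / propensity - (1 - t) * y / (1 - propensity))
```

In practice the propensity score is unknown and must itself be estimated, e.g. by logistic regression on the covariates; here the true score is used to isolate the reweighting idea.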

3 Causality in Transformers
Transformers trained via gradient descent effectively encode causal relationships in their attention mechanisms. This section explores how gradient dynamics align attention weights with causal structures, supported by theoretical insights, empirical results, and illustrative examples.

3.1 Mathematical Foundations of Gradient Dynamics


The cross-entropy loss function for a sequence X = {x1, x2, ..., xn} is:

L = − Σ_{t=1}^{n} log P(xt | X<t, Θ)

where Θ represents the parameters of the transformer. Self-attention computes weights Aij for token xj when processing token xi:

Aij = exp(Qi Kj^T / √dk) / Σ_{k=1}^{n} exp(Qi Kk^T / √dk)
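The attention-weight formula can be sketched in a few lines of NumPy (the dimensions and random projection matrices are arbitrary illustrations, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_model, d_k = 5, 8, 4

# Token embeddings and learned projection matrices W_Q, W_K.
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries Q_i
K = X @ W_K   # keys K_j

# A_ij = softmax over j of (Q_i K_j^T / sqrt(d_k)).
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Each row of A is a probability distribution over source tokens: rows sum to one, and larger dot products Qi Kj^T translate into larger weights Aij.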

3.1.1 Proposition: Gradient Alignment with Causal Structures


For data generated from a causal graph G, the gradient of the attention
matrix A aligns with the adjacency matrix of G. Specifically:

∇Aij L ∝ I (Xi ; Xj )

where I (Xi ; Xj ) is the mutual information between tokens Xi and Xj .

3.1.2 Proof Sketch
By the data processing inequality:

I(Xi; Xj) ≥ I(Xi; X̂j)

where X̂j is the model’s internal representation. Gradient updates to Aij maximize P(Xj | Xi), aligning attention weights with causal dependencies.

3.2 Mutual Information and Attention Weights


Mutual information is a fundamental measure in information theory that
quantifies the dependency between two random variables. In the context of
transformers, mutual information plays a critical role in determining how
attention weights align with the causal relationships encoded in data. This
subsection provides a detailed explanation of how mutual information drives
the learning dynamics of attention weights and how this process facilitates
the encoding of causal structures.

3.2.1 Definition of Mutual Information


For two random variables Xi and Xj , the mutual information I (Xi ; Xj ) is
defined as:
I (Xi ; Xj ) = H (Xi ) − H (Xi | Xj )
where:
• H (Xi ) is the entropy of Xi , representing the uncertainty of Xi .
• H (Xi | Xj ) is the conditional entropy of Xi given Xj , representing the
remaining uncertainty of Xi after observing Xj .
High mutual information indicates a strong dependency between Xi and Xj ,
which is a key property in identifying causal relationships.
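A small self-contained sketch of this quantity (the joint distributions below are hypothetical), using the equivalent identity I(Xi; Xj) = H(Xi) + H(Xj) − H(Xi, Xj), which follows from the definition above since H(Xi | Xj) = H(Xi, Xj) − H(Xj):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(Xi; Xj) = H(Xi) + H(Xj) - H(Xi, Xj) for a joint table joint[i, j]."""
    p_i = joint.sum(axis=1)   # marginal distribution of Xi
    p_j = joint.sum(axis=0)   # marginal distribution of Xj
    return entropy(p_i) + entropy(p_j) - entropy(joint.ravel())

# Hypothetical joint distributions over two binary tokens.
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])            # strong dependency
independent = np.outer([0.5, 0.5], [0.5, 0.5])  # no dependency

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)
```

The strongly dependent pair yields substantial mutual information, while the independent pair yields exactly zero, matching the interpretation of I(Xi; Xj) as the reduction in uncertainty about Xi from observing Xj.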

3.2.2 Attention Mechanism and Mutual Information


The self-attention mechanism in transformers computes attention weights Aij as:

Aij = exp(Qi Kj^T / √dk) / Σ_{k=1}^{n} exp(Qi Kk^T / √dk)

where:

• Qi = WQ Xi, Kj = WK Xj, and Vj = WV Xj are the query, key, and value vectors for tokens Xi and Xj, respectively.

• dk is the dimensionality of the key vectors.

The attention weights Aij determine the contribution of token Xj to the representation of token Xi. Tokens with higher mutual information I(Xi; Xj) are expected to have higher attention weights.

3.2.3 Gradient Alignment with Mutual Information


The learning dynamics of transformers are governed by gradient descent on the cross-entropy loss:

L = − Σ_{t=1}^{n} log P(xt | X<t, Θ)

where Θ represents the model parameters. The gradient of the loss with respect to the attention weights Aij is:

∇Aij L ∝ ∂I(Xi; Xj) / ∂Aij

This relationship ensures that the learning process prioritizes dependencies with higher mutual information.

By the data processing inequality, the mutual information between Xi and the model’s internal representation X̂j is bounded by:

I(Xi; Xj) ≥ I(Xi; X̂j)

The model maximizes I(Xi; X̂j), aligning Aij with the causal relationship between Xi and Xj.

3.2.4 Role of Softmax in Amplifying Dependencies


The softmax operation ensures that the most relevant tokens dominate the attention mechanism:

∂Aij / ∂(Qi Kj^T) = Aij (1 − Aij)

For token pairs with strong mutual information I(Xi; Xj), the dot product Qi Kj^T is larger, resulting in amplified attention weights Aij. This amplification focuses the model’s resources on causally significant relationships.
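The derivative Aij(1 − Aij) is the diagonal term of the softmax Jacobian, and can be checked numerically with a finite difference; the score vector below is an arbitrary illustration:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5, -1.0])
j = 1                                   # perturb only the score of token j
A = softmax(scores)

# Finite-difference approximation of dA_j / d(score_j).
eps = 1e-6
bumped = scores.copy()
bumped[j] += eps
numeric = (softmax(bumped)[j] - A[j]) / eps

# Analytic diagonal of the softmax Jacobian.
analytic = A[j] * (1 - A[j])
```

The two values agree to finite-difference precision, confirming the saturation behavior: as Aij approaches 0 or 1 the derivative vanishes, so already-dominant weights stabilize.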

3.2.5 Implications for Causal Encoding


The interplay between mutual information and attention weights allows transformers to encode the structure of causal graphs during training. By learning to prioritize dependencies with high mutual information, attention layers align their weights with the adjacency matrices of the underlying causal graph. This alignment has been empirically observed in studies on synthetic datasets, such as Markov chains and DAGs, where attention weights converge to reflect parent-child relationships.

Mutual information serves as a driving force in the learning dynamics of transformers, influencing attention weights through gradient descent and the softmax operation. This alignment enables transformers to identify and encode causal relationships, bridging the gap between classical causal inference and modern deep learning architectures.

3.3 Empirical Validation of Gradient Dynamics


3.3.1 Experiment 1: Markov Chains
A dataset was generated using a Markov process Xt → Xt+1 . After training:
• Attention weights aligned with the Markov chain adjacency matrix.

• Gradients were strongest for causally connected token pairs (t, t + 1).
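The data-generating setup of Experiment 1 can be sketched as follows (the 3-state transition matrix and trajectory length are hypothetical choices, and no transformer is trained here); the sketch verifies the statistical signal the attention layer is reported to latch onto, namely that adjacent pairs (t, t + 1) carry the chain's structure:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 3-state Markov chain X_t -> X_{t+1}.
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Sample one long trajectory from the chain.
T = 30_000
x = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = rng.choice(3, p=P[x[t - 1]])

# The causal structure is recoverable from adjacent pairs (t, t+1):
# empirical transition frequencies converge to the true matrix P.
counts = np.zeros((3, 3))
np.add.at(counts, (x[:-1], x[1:]), 1)
P_hat = counts / counts.sum(axis=1, keepdims=True)
```

Because the dependency between Xt and Xt+1 dominates all longer-range pairs, a model whose gradients track mutual information concentrates its attention on exactly these adjacent positions.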

3.3.2 Experiment 2: Directed Acyclic Graphs (DAGs)


Data generated from a DAG exhibited:
• Attention alignment with parent-child relationships.

• Accurate predictions for nodes with multiple parents.

3.3.3 Heatmaps
Heatmaps of attention scores and gradients confirmed the alignment between
attention layers and causal graph structure.

Figure 2: Markov chain modeling asset price transitions among Low, Medium, and High states, with transition probabilities P12, P23, and P31.

3.4 Role of Softmax in Causal Learning


The softmax operation amplifies significant dependencies, ensuring:

Aij → 1 if Qi Kj^T ≫ Qi Kk^T for k ≠ j.

This mechanism focuses the model’s attention on causally relevant tokens.


Gradient dynamics in transformers align attention mechanisms with causal
structures. Theoretical and empirical evidence shows that attention layers
effectively encode dependencies, paving the way for integrating transformers
into causal inference frameworks.

4 Conclusion
Causality, as a scientific and mathematical discipline, has undergone a profound transformation with the advent of modern machine learning models. Classical causal frameworks, such as Structural Causal Models and the Potential Outcomes Framework, have long provided rigorous methodologies for disentangling correlation from causation. These approaches remain indispensable for defining causal relationships and guiding experimental design.

Recent advancements in transformer-based architectures have extended these classical methods, showcasing the capacity of self-attention mechanisms to encode latent causal structures. The work by Nichani et al. [2024] demonstrates how gradient descent dynamics in transformers align their attention layers with the adjacency matrices of causal graphs, enabling models to uncover complex dependencies in sequential data. Induction heads further enhance this capability, allowing for in-context learning and accurate predictions in diverse applications.

Despite these advancements, challenges remain. Theoretical questions about the scalability of transformers to arbitrary DAGs, handling multiparent dependencies, and ensuring interpretability warrant further investigation. Additionally, bridging the gap between deep learning’s computational power and the theoretical guarantees of classical causal frameworks presents exciting opportunities for future research.

By integrating the strengths of classical and modern approaches, this work paves the way for a unified framework for causal inference, combining rigorous theoretical underpinnings with the adaptability and scalability of machine learning. This synthesis holds promise for applications in time-series forecasting, policy evaluation, and understanding complex systems across various domains.

References
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, et al. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21:C1–C68, 2018.

Marcos López de Prado. Causal factor investing: Can factor investing become scientific? SSRN Electronic Journal, December 2022. doi: 10.2139/ssrn.4205613. URL https://ssrn.com/abstract=4205613.

Eshaan Nichani, Alex Damian, and Jason D. Lee. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009.

Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
