The Mathematics of Causality
Abstract
This paper provides a comprehensive exploration of causality through
a mathematical lens, integrating classical frameworks and contem-
porary advancements in machine learning. We highlight the role of
Structural Causal Models (SCMs) and the Potential Outcomes Frame-
work in establishing foundational principles for causal inference. Fur-
thermore, recent breakthroughs in transformer-based models reveal
their ability to encode latent causal structures within their atten-
tion mechanisms during gradient descent training. This work bridges
the gap between traditional causal inference and modern computa-
tional approaches, offering insights into the interplay between classi-
cal models and deep learning architectures. Applications in time-series
analysis, graphical models, and natural language processing are dis-
cussed, alongside challenges and future directions for integrating these
paradigms.
1 Introduction
Causality is foundational for disentangling correlation from causation in many fields. Recent advancements highlight the role of transformers, particularly their self-attention mechanisms, in capturing causal structures. This work integrates classical causal frameworks (Pearl [2009]), modern machine learning approaches (Chernozhukov et al. [2018]), insights from Nichani et al. [2024] on how transformers learn latent causal graphs, and related perspectives from López de Prado [2022].
2 Classical Causal Models
Causality is fundamental to scientific reasoning, providing tools to distinguish
correlation from causation. Classical causal models offer rigorous frameworks
for analyzing cause-and-effect relationships. This section discusses two key
paradigms: Structural Causal Models (SCMs) and the Potential Outcomes
Framework, highlighting their theoretical foundations, key concepts, and ap-
plications.
2.1 Structural Causal Models
A Structural Causal Model (Pearl [2009]) specifies each variable as a function of its direct causes and an exogenous noise term:

Xi = fi(Pai, Ui)

where Pai ⊆ X are the parents of Xi in the DAG, and the Ui are mutually independent exogenous noise variables.
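The sampling procedure implied by these structural equations can be sketched in code. The DAG, coefficients, and noise distributions below are illustrative assumptions, not taken from the paper: each Xi is computed from its parents plus independent Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    """Draw n samples from a toy linear SCM with DAG
    X1 -> X2, X1 -> X3, {X2, X3} -> X4 (coefficients are illustrative)."""
    u1, u2, u3, u4 = rng.normal(size=(4, n))   # exogenous noise U_i
    x1 = u1                                    # X1 = f1(U1)
    x2 = 0.8 * x1 + u2                         # X2 = f2(X1, U2)
    x3 = -0.5 * x1 + u3                        # X3 = f3(X1, U3)
    x4 = 0.6 * x2 + 0.4 * x3 + u4              # X4 = f4(X2, X3, U4)
    return np.stack([x1, x2, x3, x4])

data = sample_scm(10_000)
# X1 and X4 are correlated only through the directed paths of the DAG.
print(np.corrcoef(data)[0, 3])
```

Because X1 influences X4 only through X2 and X3, intervening on those mediators would sever the dependence, which is exactly the kind of reasoning the SCM formalism licenses.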
[Figure: example DAG over X1, X2, X3, X4]
2.2 Potential Outcomes Framework
The Potential Outcomes Framework, pioneered by Rubin [1974], provides an
alternative formulation of causality based on counterfactual reasoning. It is
widely used for treatment effect estimation in experimental and observational
studies.
2.2.1 Setup
For each individual, let T ∈ {0, 1} denote the treatment indicator, Y(1) and Y(0) the potential outcomes under treatment and control, and X pre-treatment covariates. The observed outcome satisfies the consistency relation

Y = T · Y(1) + (1 − T) · Y(0)

or equivalently Y = Y(T). Ignorability (unconfoundedness) assumes that treatment assignment is independent of the potential outcomes given covariates:

T ⊥⊥ {Y(1), Y(0)} | X
2.2.4 Estimation Techniques
• Randomized Controlled Trials (RCTs): Ensure ignorability through
randomization.
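A small simulation makes the RCT logic concrete. The data-generating process below is a hypothetical example (a constant treatment effect of 2.0): randomization makes T independent of {Y(1), Y(0)}, so the simple difference in means recovers the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical potential outcomes with a true average treatment effect of 2.0.
y0 = rng.normal(loc=1.0, scale=1.0, size=n)
y1 = y0 + 2.0

# Randomized assignment: T is independent of {Y(1), Y(0)}, so ignorability holds.
t = rng.integers(0, 2, size=n)

# Consistency: only the potential outcome for the assigned arm is observed.
y = t * y1 + (1 - t) * y0

# Difference in means is an unbiased ATE estimator under randomization.
ate_hat = y[t == 1].mean() - y[t == 0].mean()
print(ate_hat)  # close to 2.0
```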
3 Causality in Transformers
Transformers trained via gradient descent effectively encode causal relation-
ships in their attention mechanisms. This section explores how gradient
dynamics align attention weights with causal structures, supported by theo-
retical insights, empirical results, and illustrative examples.
∇Aij L ∝ I(Xi; Xj)
3.1.2 Proof Sketch
By the data processing inequality:
I(Xi; Xj) ≥ I(Xi; X̂j)
The attention weight assigned by token Xi to token Xj is computed via scaled dot-product attention:

Aij = softmax( Qi Kj^T / √dk )

where:
• Qi = WQ Xi , Kj = WK Xj , and Vj = WV Xj are the query, key, and value vectors for tokens Xi and Xj , respectively.
• dk is the dimensionality of the key vectors.
The attention weights Aij determine the contribution of token Xj to the representation of token Xi . Tokens with higher mutual information I(Xi; Xj) are expected to have higher attention weights.
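Scaled dot-product attention over these query and key vectors can be sketched directly. The dimensions and random matrices below are illustrative stand-ins for learned projections WQ and WK:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_k, seq_len = 8, 4, 5

# Random projections standing in for the learned W_Q and W_K (illustrative).
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))
X = rng.normal(size=(seq_len, d_model))    # token representations X_1..X_5

Q = X @ W_Q.T                              # rows are Q_i = W_Q X_i
K = X @ W_K.T                              # rows are K_j = W_K X_j
scores = Q @ K.T / np.sqrt(d_k)            # scaled dot products Q_i K_j^T / sqrt(d_k)

# Row-wise softmax turns scores into attention weights A_ij.
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Each row of A is a probability distribution over source tokens.
assert np.allclose(A.sum(axis=1), 1.0)
```

The softmax makes each row of A a distribution, so a token pair whose dot product grows during training takes attention mass away from the other pairs in that row.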
The model is trained by minimizing a loss L(Θ), where Θ represents the model parameters. The gradient of the loss with respect to the attention weights Aij is

∇Aij L ∝ ∂I(Xi; Xj) / ∂Aij
This relationship ensures that the learning process prioritizes dependencies
with higher mutual information.
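To make the role of mutual information tangible, the quantity I(Xi; Xj) can be estimated from samples. The histogram-based plug-in estimator below is a rough sketch (not the paper's method), applied to a hypothetical strongly dependent pair versus an independent pair:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X; Y) in nats from paired samples (rough sketch)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)    # marginal of X
    py = pxy.sum(axis=0, keepdims=True)    # marginal of Y
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(3)
x = rng.normal(size=20_000)
y_dep = 0.9 * x + 0.1 * rng.normal(size=20_000)   # strongly dependent on x
y_ind = rng.normal(size=20_000)                    # independent of x

# The dependent pair carries far more mutual information than the independent one.
print(mutual_information(x, y_dep), mutual_information(x, y_ind))
```

Under the proportionality above, a learning signal driven by such differences pushes attention toward the high-information pair and away from the independent one.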
By the data processing inequality, the mutual information between Xi and the model’s internal representation X̂j is bounded by

I(Xi; Xj) ≥ I(Xi; X̂j)

The model maximizes I(Xi; X̂j), aligning Aij with the causal relationship between Xi and Xj.
For token pairs with strong mutual information I (Xi ; Xj ), the dot product
Qi KjT is larger, resulting in amplified attention weights Aij . This amplifica-
tion focuses the model’s resources on causally significant relationships.
• Gradients were strongest for causally connected token pairs (t, t + 1).
3.3.3 Heatmaps
Heatmaps of attention scores and gradients confirmed the alignment between
attention layers and causal graph structure.
[Figure: graph over token pairs P12, P23, P31 with edges labeled low, medium, and high]
4 Conclusion
Causality, as a scientific and mathematical discipline, has undergone a pro-
found transformation with the advent of modern machine learning models.
Classical causal frameworks, such as Structural Causal Models and the Po-
tential Outcomes Framework, have long provided rigorous methodologies for
disentangling correlation from causation. These approaches remain indis-
pensable for defining causal relationships and guiding experimental design.
Recent advancements in transformer-based architectures have extended
these classical methods, showcasing the capacity of self-attention mechanisms
to encode latent causal structures. The work by Nichani et al. [2024] demon-
strates how gradient descent dynamics in transformers align their attention
layers with the adjacency matrices of causal graphs, enabling models to un-
cover complex dependencies in sequential data. Induction heads further en-
hance this capability, allowing for in-context learning and accurate predic-
tions in diverse applications.
Despite these advancements, challenges remain. Theoretical questions
about the scalability of transformers to arbitrary DAGs, handling multipar-
ent dependencies, and ensuring interpretability warrant further investigation.
Additionally, bridging the gap between deep learning’s computational power
and the theoretical guarantees of classical causal frameworks presents excit-
ing opportunities for future research.
By integrating the strengths of classical and modern approaches, this
work paves the way for a unified framework for causal inference, combining
rigorous theoretical underpinnings with the adaptability and scalability of
machine learning. This synthesis holds promise for applications in time-series
forecasting, policy evaluation, and understanding complex systems across
various domains.
References
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, et al. Dou-
ble/debiased machine learning for treatment and structural parameters.
The Econometrics Journal, 21:C1–C68, 2018.
Eshaan Nichani, Alex Damian, and Jason D. Lee. How transformers learn
causal structure with gradient descent. arXiv preprint arXiv:2402.14735,
2024.
Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge Uni-
versity Press, 2nd edition, 2009.
Donald B. Rubin. Estimating causal effects of treatments in randomized and
nonrandomized studies. Journal of Educational Psychology, 66(5):688–701,
1974.