
The Mathematics of Causality

Miquel Noguer i Alonso


Artificial Intelligence Finance Institute
November 28, 2024

Abstract
This paper provides a comprehensive exploration of causality through a mathematical lens, integrating classical frameworks and contemporary advancements in machine learning. We highlight the role of Structural Causal Models (SCMs) and the Potential Outcomes Framework in establishing foundational principles for causal inference. Furthermore, recent breakthroughs in transformer-based models reveal their ability to encode latent causal structures within their attention mechanisms during gradient descent training. This work bridges the gap between traditional causal inference and modern computational approaches, offering insights into the interplay between classical models and deep learning architectures. Applications in time-series analysis, graphical models, and natural language processing are discussed, alongside challenges and future directions for integrating these paradigms.

1 Introduction
Causality is foundational for disentangling correlation from causation in various fields. Recent advancements highlight the role of transformers, particularly their self-attention mechanisms, in capturing causal structures. This work integrates classical causal frameworks (Pearl [2009]), modern machine learning approaches (Chernozhukov et al. [2018]), insights from Nichani et al. [2024] on how transformers learn latent causal graphs, and causal perspectives on factor investing from López de Prado [2022].

2 Classical Causal Models
Causality is fundamental to scientific reasoning, providing tools to distinguish
correlation from causation. Classical causal models offer rigorous frameworks
for analyzing cause-and-effect relationships. This section discusses two key
paradigms: Structural Causal Models (SCMs) and the Potential Outcomes
Framework, highlighting their theoretical foundations, key concepts, and ap-
plications.

2.1 Structural Causal Models (SCMs)


Structural Causal Models (SCMs), introduced by Pearl [2009], combine structural equations with directed acyclic graphs (DAGs) to encode causal mechanisms. SCMs formalize how variables interact, supporting causal inference through graph-based reasoning and counterfactual analysis.

2.1.1 Components of SCMs


An SCM is defined by:

• A set of variables X = {X1, X2, ..., Xn}, partitioned into observed variables and unobserved latent variables.

• A set of structural equations fi for each variable Xi, expressed as:

Xi = fi(Pai, Ui)

where Pai ⊆ X are the parents of Xi in the DAG, and Ui are exogenous noise variables independent of each other.

• A directed acyclic graph (DAG) G = (X, E), where E is the set of edges. Each directed edge Xi → Xj encodes a causal relationship.
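These components can be made concrete with a minimal simulation sketch. The four-variable DAG, the linear mechanisms, and the coefficients below are hypothetical illustrations, not taken from the paper; they show how exogenous noise and structural equations, evaluated in topological order, generate data, and how an intervention rewires one equation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Exogenous noise variables U_i, mutually independent.
u1, u2, u3, u4 = rng.normal(size=(4, n))

# Structural equations X_i = f_i(Pa_i, U_i), evaluated in topological
# order of a hypothetical DAG: X1 -> X2, X1 -> X3, {X2, X3} -> X4.
x1 = u1
x2 = 0.8 * x1 + u2
x3 = -0.5 * x1 + u3
x4 = x2 + 0.3 * x3 + u4

# An intervention do(X2 = 1) replaces X2's equation with a constant,
# severing its dependence on the parent X1; downstream equations are reused.
x2_do = np.ones(n)
x4_do = x2_do + 0.3 * x3 + u4
```

Under observation the mean of X4 stays near zero, while under do(X2 = 1) it shifts toward 1, reflecting the causal effect propagated through the modified equation.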

2.1.2 Key Assumptions


SCMs operate under three fundamental assumptions:

• Causal Sufficiency: All relevant variables affecting the system are included.

• Faithfulness: Statistical independencies in the data are reflected in the causal graph.

• Markov Property: Each variable is independent of its non-descendants given its parents in the DAG.

Figure 1: A Directed Acyclic Graph (DAG) illustrating causal relationships among variables X1, X2, X3, and X4.

2.1.3 Causal Queries in SCMs


SCMs enable three types of causal queries:

1. Association: Observed statistical relationships, e.g., P(Y | X).

2. Intervention: Effects of external manipulations, modeled using the do-operator: P(Y | do(X = x)).

3. Counterfactuals: Hypothetical scenarios, answering "what would have happened if...".

2.1.4 Interventions and Do-Calculus


Interventions modify the structural equations of an SCM by fixing a variable
X to a value x, removing its dependence on its parents:
P(Y | do(X = x)) = Σ_Z P(Y | X = x, Z) P(Z)

where Z satisfies the backdoor criterion.
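The backdoor adjustment can be evaluated directly from a joint distribution table. The binary joint distribution below is a hypothetical illustration; the sketch compares the interventional quantity P(Y | do(X)) with the plain observational conditional P(Y | X), which differ when Z confounds X and Y:

```python
import numpy as np

# Hypothetical joint distribution P(Z, X, Y) over binary variables,
# indexed as p[z, x, y]; Z is a confounder satisfying the backdoor criterion.
p = np.array([[[0.20, 0.05], [0.05, 0.10]],
              [[0.05, 0.10], [0.10, 0.35]]])

def p_y_do_x(p, x, y):
    """Backdoor adjustment: P(Y=y | do(X=x)) = sum_z P(Y=y | X=x, Z=z) P(Z=z)."""
    total = 0.0
    for z in (0, 1):
        p_z = p[z].sum()                            # P(Z = z)
        p_y_given_xz = p[z, x, y] / p[z, x].sum()   # P(Y = y | X = x, Z = z)
        total += p_y_given_xz * p_z
    return total

interventional = p_y_do_x(p, x=1, y=1)              # P(Y=1 | do(X=1))
observational = p[:, 1, 1].sum() / p[:, 1, :].sum() # P(Y=1 | X=1)
```

The gap between the two numbers is exactly the confounding bias that the do-operator removes.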

2.2 Potential Outcomes Framework
The Potential Outcomes Framework, pioneered by Rubin [1974], provides an
alternative formulation of causality based on counterfactual reasoning. It is
widely used for treatment effect estimation in experimental and observational
studies.

2.2.1 Setup
For each individual, let:

• Y (1) be the potential outcome under treatment (T = 1).

• Y (0) be the potential outcome under control (T = 0).

The observed outcome is:

Y = T · Y (1) + (1 − T ) · Y (0)

2.2.2 Key Assumptions


1. Consistency: The observed outcome matches the potential outcome under the observed treatment:

Y = Y(T)

2. Ignorability (Unconfoundedness): Treatment assignment is independent of potential outcomes, conditional on observed covariates:

T ⊥⊥ {Y(1), Y(0)} | X

3. Positivity: Each individual has a non-zero probability of receiving either treatment:

0 < P(T = 1 | X) < 1

2.2.3 Average Treatment Effect (ATE)


The causal effect of treatment is captured by the Average Treatment Effect
(ATE):
ATE = E[Y (1) − Y (0)]

2.2.4 Estimation Techniques
• Randomized Controlled Trials (RCTs): Ensure ignorability through randomization.

• Propensity Score Matching: Matches treated and control units with similar covariates.

• Inverse Probability Weighting (IPW): Reweights the population to account for differences in treatment probabilities.
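A minimal simulation sketch of the IPW estimator (the confounder model, propensity function, and true effect of 2.0 are all hypothetical choices for illustration): each unit is reweighted by the inverse of its treatment probability, which recovers the ATE even though treatment assignment is confounded:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Confounder X affects both treatment assignment and the outcome.
x = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-x))   # P(T=1 | X); strictly in (0, 1), so positivity holds
t = rng.binomial(1, propensity)

# Potential outcomes with a constant treatment effect of 2.0.
y0 = x + rng.normal(size=n)
y1 = y0 + 2.0
y = t * y1 + (1 - t) * y0               # observed outcome Y = T*Y(1) + (1-T)*Y(0)

# Naive difference in means is biased upward by confounding.
naive = y[t == 1].mean() - y[t == 0].mean()

# Horvitz-Thompson IPW estimate of the ATE, using the true propensity score.
ate_ipw = np.mean(t * y / propensity - (1 - t) * y / (1 - propensity))
```

In practice the propensity score is unknown and must itself be estimated, e.g. by logistic regression on the covariates; here the true score is used to isolate the reweighting idea.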

3 Causality in Transformers
Transformers trained via gradient descent effectively encode causal relationships in their attention mechanisms. This section explores how gradient dynamics align attention weights with causal structures, supported by theoretical insights, empirical results, and illustrative examples.

3.1 Mathematical Foundations of Gradient Dynamics


The cross-entropy loss function for a sequence X = {x1, x2, ..., xn} is:

L = − Σ_{t=1}^{n} log P(xt | X<t, Θ)

where Θ represents the parameters of the transformer. Self-attention computes weights Aij for token xj when processing token xi:

Aij = exp(Qi Kj^T / √dk) / Σ_{k=1}^{n} exp(Qi Kk^T / √dk)
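The attention-weight formula can be sketched in a few lines of NumPy (the dimensions and random projection matrices are arbitrary illustrations, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d_model, d_k = 5, 8, 4

# Token embeddings and learned projection matrices W_Q, W_K.
X = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # queries Q_i
K = X @ W_K   # keys K_j

# A_ij = softmax over j of (Q_i K_j^T / sqrt(d_k)).
scores = Q @ K.T / np.sqrt(d_k)
scores -= scores.max(axis=1, keepdims=True)   # subtract row max for numerical stability
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

Each row of A is a probability distribution over source tokens: rows sum to one, and larger dot products Qi Kj^T translate into larger weights Aij.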

3.1.1 Proposition: Gradient Alignment with Causal Structures


For data generated from a causal graph G, the gradient of the attention
matrix A aligns with the adjacency matrix of G. Specifically:

∇Aij L ∝ I (Xi ; Xj )

where I (Xi ; Xj ) is the mutual information between tokens Xi and Xj .

3.1.2 Proof Sketch
By the data processing inequality:

I(Xi; Xj) ≥ I(Xi; X̂j)

where X̂j is the model’s internal representation. Gradient updates to Aij maximize P(Xj | Xi), aligning attention weights with causal dependencies.

3.2 Mutual Information and Attention Weights


Mutual information is a fundamental measure in information theory that
quantifies the dependency between two random variables. In the context of
transformers, mutual information plays a critical role in determining how
attention weights align with the causal relationships encoded in data. This
subsection provides a detailed explanation of how mutual information drives
the learning dynamics of attention weights and how this process facilitates
the encoding of causal structures.

3.2.1 Definition of Mutual Information


For two random variables Xi and Xj , the mutual information I (Xi ; Xj ) is
defined as:
I (Xi ; Xj ) = H (Xi ) − H (Xi | Xj )
where:
• H (Xi ) is the entropy of Xi , representing the uncertainty of Xi .
• H (Xi | Xj ) is the conditional entropy of Xi given Xj , representing the
remaining uncertainty of Xi after observing Xj .
High mutual information indicates a strong dependency between Xi and Xj ,
which is a key property in identifying causal relationships.
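A small self-contained sketch of this quantity (the joint distributions below are hypothetical), using the equivalent identity I(Xi; Xj) = H(Xi) + H(Xj) − H(Xi, Xj), which follows from the definition above since H(Xi | Xj) = H(Xi, Xj) − H(Xj):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(joint):
    """I(Xi; Xj) = H(Xi) + H(Xj) - H(Xi, Xj) for a joint table joint[i, j]."""
    p_i = joint.sum(axis=1)   # marginal distribution of Xi
    p_j = joint.sum(axis=0)   # marginal distribution of Xj
    return entropy(p_i) + entropy(p_j) - entropy(joint.ravel())

# Hypothetical joint distributions over two binary tokens.
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])            # strong dependency
independent = np.outer([0.5, 0.5], [0.5, 0.5])  # no dependency

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)
```

The strongly dependent pair yields substantial mutual information, while the independent pair yields exactly zero, matching the interpretation of I(Xi; Xj) as the reduction in uncertainty about Xi from observing Xj.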

3.2.2 Attention Mechanism and Mutual Information


The self-attention mechanism in transformers computes attention weights Aij as:

Aij = exp(Qi Kj^T / √dk) / Σ_{k=1}^{n} exp(Qi Kk^T / √dk)

where:

• Qi = WQ Xi, Kj = WK Xj, and Vj = WV Xj are the query, key, and value vectors for tokens Xi and Xj, respectively.

• dk is the dimensionality of the key vectors.

The attention weights Aij determine the contribution of token Xj to the representation of token Xi. Tokens with higher mutual information I(Xi; Xj) are expected to have higher attention weights.

3.2.3 Gradient Alignment with Mutual Information


The learning dynamics of transformers are governed by gradient descent on the cross-entropy loss:

L = − Σ_{t=1}^{n} log P(xt | X<t, Θ)

where Θ represents the model parameters. The gradient of the loss with respect to the attention weights Aij is:

∇Aij L ∝ ∂I(Xi; Xj) / ∂Aij

This relationship ensures that the learning process prioritizes dependencies with higher mutual information.

By the data processing inequality, the mutual information between Xi and the model’s internal representation X̂j is bounded by:

I(Xi; Xj) ≥ I(Xi; X̂j)

The model maximizes I(Xi; X̂j), aligning Aij with the causal relationship between Xi and Xj.

3.2.4 Role of Softmax in Amplifying Dependencies


The softmax operation ensures that the most relevant tokens dominate the attention mechanism:

∂Aij / ∂(Qi Kj^T) = Aij (1 − Aij)

For token pairs with strong mutual information I(Xi; Xj), the dot product Qi Kj^T is larger, resulting in amplified attention weights Aij. This amplification focuses the model’s resources on causally significant relationships.
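The derivative Aij(1 − Aij) is the diagonal term of the softmax Jacobian, and can be checked numerically with a finite difference; the score vector below is an arbitrary illustration:

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5, -1.0])
j = 1                                   # perturb only the score of token j
A = softmax(scores)

# Finite-difference approximation of dA_j / d(score_j).
eps = 1e-6
bumped = scores.copy()
bumped[j] += eps
numeric = (softmax(bumped)[j] - A[j]) / eps

# Analytic diagonal of the softmax Jacobian.
analytic = A[j] * (1 - A[j])
```

The two values agree to finite-difference precision, confirming the saturation behavior: as Aij approaches 0 or 1 the derivative vanishes, so already-dominant weights stabilize.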

3.2.5 Implications for Causal Encoding


The interplay between mutual information and attention weights allows transformers to encode the structure of causal graphs during training. By learning to prioritize dependencies with high mutual information, attention layers align their weights with the adjacency matrices of the underlying causal graph. This alignment has been empirically observed in studies on synthetic datasets, such as Markov chains and DAGs, where attention weights converge to reflect parent-child relationships.

Mutual information serves as a driving force in the learning dynamics of transformers, influencing attention weights through gradient descent and the softmax operation. This alignment enables transformers to identify and encode causal relationships, bridging the gap between classical causal inference and modern deep learning architectures.

3.3 Empirical Validation of Gradient Dynamics


3.3.1 Experiment 1: Markov Chains
A dataset was generated using a Markov process Xt → Xt+1 . After training:
• Attention weights aligned with the Markov chain adjacency matrix.

• Gradients were strongest for causally connected token pairs (t, t + 1).
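The data-generating setup of Experiment 1 can be sketched as follows (the 3-state transition matrix and trajectory length are hypothetical choices, and no transformer is trained here); the sketch verifies the statistical signal the attention layer is reported to latch onto, namely that adjacent pairs (t, t + 1) carry the chain's structure:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 3-state Markov chain X_t -> X_{t+1}.
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Sample one long trajectory from the chain.
T = 30_000
x = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = rng.choice(3, p=P[x[t - 1]])

# The causal structure is recoverable from adjacent pairs (t, t+1):
# empirical transition frequencies converge to the true matrix P.
counts = np.zeros((3, 3))
np.add.at(counts, (x[:-1], x[1:]), 1)
P_hat = counts / counts.sum(axis=1, keepdims=True)
```

Because the dependency between Xt and Xt+1 dominates all longer-range pairs, a model whose gradients track mutual information concentrates its attention on exactly these adjacent positions.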

3.3.2 Experiment 2: Directed Acyclic Graphs (DAGs)


Data generated from a DAG exhibited:
• Attention alignment with parent-child relationships.

• Accurate predictions for nodes with multiple parents.

3.3.3 Heatmaps
Heatmaps of attention scores and gradients confirmed the alignment between
attention layers and causal graph structure.

Figure 2: Markov chain modeling asset price transitions among Low, Medium, and High states, with transition probabilities P12, P23, and P31.

3.4 Role of Softmax in Causal Learning


The softmax operation amplifies significant dependencies, ensuring:

Aij → 1 if Qi Kj^T ≫ Qi Kk^T for k ≠ j.

This mechanism focuses the model’s attention on causally relevant tokens.


Gradient dynamics in transformers align attention mechanisms with causal
structures. Theoretical and empirical evidence shows that attention layers
effectively encode dependencies, paving the way for integrating transformers
into causal inference frameworks.

4 Conclusion
Causality, as a scientific and mathematical discipline, has undergone a profound transformation with the advent of modern machine learning models. Classical causal frameworks, such as Structural Causal Models and the Potential Outcomes Framework, have long provided rigorous methodologies for disentangling correlation from causation. These approaches remain indispensable for defining causal relationships and guiding experimental design.

Recent advancements in transformer-based architectures have extended these classical methods, showcasing the capacity of self-attention mechanisms to encode latent causal structures. The work by Nichani et al. [2024] demonstrates how gradient descent dynamics in transformers align their attention layers with the adjacency matrices of causal graphs, enabling models to uncover complex dependencies in sequential data. Induction heads further enhance this capability, allowing for in-context learning and accurate predictions in diverse applications.

Despite these advancements, challenges remain. Theoretical questions about the scalability of transformers to arbitrary DAGs, handling multiparent dependencies, and ensuring interpretability warrant further investigation. Additionally, bridging the gap between deep learning’s computational power and the theoretical guarantees of classical causal frameworks presents exciting opportunities for future research.

By integrating the strengths of classical and modern approaches, this work paves the way for a unified framework for causal inference, combining rigorous theoretical underpinnings with the adaptability and scalability of machine learning. This synthesis holds promise for applications in time-series forecasting, policy evaluation, and understanding complex systems across various domains.

References
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, et al. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21:C1–C68, 2018.

Marcos López de Prado. Causal factor investing: Can factor investing become scientific? SSRN Electronic Journal, December 2022. doi: 10.2139/ssrn.4205613. URL https://ssrn.com/abstract=4205613.

Eshaan Nichani, Alex Damian, and Jason D. Lee. How transformers learn causal structure with gradient descent. arXiv preprint arXiv:2402.14735, 2024.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009.

Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
