
Reinforcement Learning under Latent Confounding

Piotr Kolodziejski
Supervisor: Sebastian Tschiatschek, PhD
July 17, 2023

Abstract
In this project, I explore concepts related to reinforcement learning and latent confounding,
i.e., aspects influencing the dynamics and rewards of a system that cannot be observed by the
learner directly. A deep dive into the basics of RL, and more specifically Q-Learning, is presented.
I also outline the concepts surrounding the intersection of RL and Causal Inference.

1 Introduction
This paper can be split into five major parts. In the first part I present an implementation of a
custom reinforcement learning algorithm, more specifically Q-Learning, together with a corresponding self-built
environment. In the second part, two closely related TD-learning algorithms, SARSA and Q-Learning,
are used to solve a more complex environment from gymnasium (the maintained fork of OpenAI's gym) and are
compared. I also present an investigation of causality and some results obtained with the dowhy library. The last
two sections cover the basics of causality and the concepts surrounding reinforcement learning and
causal inference. Before I move on to the first result, I would like to introduce the basics of RL. The
following section also sheds light on the nature of the problem I am trying to solve.

1.1 Semi-formal introduction to reinforcement learning


An agent interacting with the environment is in state s_t at time t. It then
takes an action a_t drawn from a policy π(a_t|s_t), receives a reward r_t and transitions to a new state
s_{t+1}. The agent continues interacting with the environment and accumulating rewards until it reaches
a terminal state. The aim is to maximize this cumulative reward:


R = Σ_{t=0}^{∞} γ^t r_t

where γ discounts future rewards - rewards closer in time are more valuable to the biological
entities that RL is (to some extent) mimicking. If the reward system is inherently random we need to
use the language of probability, and the optimal action a in a given state s becomes:

a_s^∗ = argmax_a E[R | s_0 = s, a_0 = a]

where R is a random variable. This is the optimal action a for the given state s, i.e. the action that maximizes the
expected return. A policy π^∗ is optimal if it maximizes this expectation for all possible states.

1.2 Bellman equation


The setting in which the state transition and the collected reward are conditioned only on the
current state and action is called a Markov decision process. Thanks to this property we
can express the above search for an optimal policy in a recursive fashion.

Q^π(s, a) = R(s, a) + γ Σ_{s′} T(s′|s, a) Σ_{a′ ∈ A(s′)} π(a′|s′) Q^π(s′, a′)

where T(s′|s, a) is the transition distribution. Note that primed variables refer to the next time step.
This equation serves as a basis for many algorithmic solutions - for example Q-learning.

1.3 Q-Learning
Q-Learning is a specific type of temporal-difference algorithm that learns to approximate a Q-
function. It finds an optimal policy by maximizing the expected reward over all steps of the
algorithm. It follows exactly the same logic as outlined above. The Q-function measures the quality of a
state-action pair, i.e. the expected reward for taking an action in a given state.

Q:S×A→R (1)

There are three parameters that need to be taken into account:

Learning rate - alpha - determines how much the new information should affect our beliefs so
far.

Discount factor - gamma - discounts future rewards: a reward t steps ahead is weighted by γ^t, so its
influence decays the further it lies in the future. The closer γ is to 1, the more far-sighted the agent;
the closer it is to 0, the more short-sighted.

There also need to be some initial conditions, such as a randomly initialized Q-table, which changes
over time until we reach a satisfactory table of state-action values.

After looking t steps into the future (in our case one step), the algorithm decides on the next action. The
factor γ^t weights this step; normally it is set so that 0 < γ < 1. At each step the agent takes an action,
enters a new state, receives a reward r depending on the new state and the possible actions there, and Q
is updated. Below is the basic value-iteration update for the Q-function.
Q_new(s, a) = Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]          (2)

where Q_new(s, a) is the new Q-value, α the learning rate, R(s, a) the reward, γ the discount rate, and
max_{a′} Q(s′, a′) the maximum predicted value given the new state and all possible actions in it.
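A minimal sketch of update (2) in Python (the variable names and the tabular layout are my own illustrative choices, not the exact implementation used later):

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One application of update (2) to a tabular Q-function."""
    # Maximum predicted value over all actions in the new state
    best_next = np.max(q_table[next_state])
    # TD target and TD error
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state, action]
    # Move the current estimate a fraction alpha towards the target
    q_table[state, action] += alpha * td_error
    return q_table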

1.4 SARSA
The state-action-reward-state-action algorithm (SARSA) learns a Markov decision process policy. It
follows a certain policy and learns the Q-values associated with it, i.e. it makes a promise about
taking a certain action in the future and keeps it. Q-learning, on the other hand, updates its estimate
of the optimal Q-function based on the maximum Q-value in the next state (even if it does not
take the associated action there), i.e. it is optimistic. We say that SARSA is on-policy, while
Q-Learning is off-policy.

Q_new(s_t, a_t) = Q(s_t, a_t) + α · ( R(s_t, a_t) + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )          (3)

In terms of outcomes, Q-Learning takes more risks but can also achieve results closer to optimal,
while SARSA is more conservative and takes a safer approach to the problem.
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next
state and the greedy action. In other words, it estimates the return (total discounted future reward)
of state-action pairs assuming a greedy policy were followed, despite the fact that it is not following a
greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next
state and the current policy's action. It estimates the return of state-action pairs assuming the
current policy continues to be followed.
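The difference between the two updates is easiest to see side by side. A short sketch (the function names are my own; note that the SARSA target needs the action the policy actually chooses in the next state):

import numpy as np

def q_learning_target(q_table, reward, next_state, gamma=0.95):
    # Off-policy: bootstrap from the greedy (maximum) value in the next state
    return reward + gamma * np.max(q_table[next_state])

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.95):
    # On-policy: bootstrap from the action the current policy actually takes next
    return reward + gamma * q_table[next_state, next_action]

def td_update(q_table, state, action, target, alpha=0.1):
    # Shared TD step used by both algorithms
    q_table[state, action] += alpha * (target - q_table[state, action])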

2 Custom environment for Q-Learning


2.1 Problem and outcome
There are three entities in our environment - a hiker, a house and a tree. The goal is to train the hiker
to reach the house without hitting the tree. To achieve that in Python one can use a Q-table.
It is a structure that holds state-action values and is updated throughout training - using the logic
outlined in the introduction - so that the model knows which action is most beneficial in a given
state. The reward after each step can be negative (there are penalties) or positive (for reaching the
destination). An example environment is rendered in Fig. 1. Every 10000 episodes a new
window is presented to the user and the hiker tries to reach the house in real time. Fig. 1 also shows
how the algorithm progresses through the learning process. Those results were produced with the most
important hyperparameters set to: numEpisodes = 150000, learnRate (α) = 0.1 and discount (γ) = 0.95.

2.2 Environment in Python


The environment is a grid in which the entity can move in four directions without ever leaving the grid.
There is a reward/penalty system in place to ensure that the state-action pairs in the Q-table are properly
updated during consecutive steps. Every step without a reward is slightly penalised, while hitting a
tree corresponds to a large penalty and termination of the episode in the "fail" state. Reaching the
house corresponds to a large reward and termination of the episode in the "success" state. The ratio of
successes to failures over consecutive iterations, i.e. the learning process, is plotted. Below we can see
the results of the learning and that the success percentage increases quickly. Q-learning can in fact identify
an optimal action-selection policy given infinite exploration. As we can see on the right, progress slows
down with consecutive steps but does not stop.
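A sketch of how such an environment's step logic could look (the grid size, reward values and method names below are my own illustrative choices, not the exact values used in the experiments):

import random

class HikerEnv:
    """Toy grid world with a hiker, a house (goal) and a tree (obstacle)."""

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=10):
        self.size = size
        self.reset()

    def reset(self):
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        self.hiker, self.house, self.tree = random.sample(cells, 3)
        return self.hiker

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.hiker[0] + dr, 0), self.size - 1)  # clamp so the hiker never leaves the grid
        c = min(max(self.hiker[1] + dc, 0), self.size - 1)
        self.hiker = (r, c)
        if self.hiker == self.tree:      # large penalty, episode ends in the "fail" state
            return self.hiker, -100, True
        if self.hiker == self.house:     # large reward, episode ends in the "success" state
            return self.hiker, +100, True
        return self.hiker, -1, False     # small penalty for every step without a reward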

Figure 1: Example environment frame (left), success to fail ratio in every epoch (center), success
percentage ratio (right).

3 SARSA and Q-Learning in gymnasium environment


3.1 Gymnasium environment
The Taxi problem involves navigating to a passenger in a grid world, picking them up and dropping
them off at one of four locations.
The Taxi environment is a 5x5 grid in which the task is to locate the passenger and drop them
off at one of the specified locations - red, green, yellow or blue. If we reach the passenger's desired
destination the episode ends. Both the taxi and the passenger are located randomly in the grid at the
start. The reward for a successful drop-off is +20, while missing the passenger or dropping them off at
the wrong location is penalised with -10. There is also a penalty of -1 for excessive movement,
i.e. if there is no reward, a small penalty is assigned. This is a mechanism similar to the one I implemented in the
hiker, house and tree scenario.

There are six actions available in the action space:


0: Move south
1: Move north
2: Move east
3: Move west
4: Pick the passenger up
5: Drop the passenger off

As for the observation space, there are 500 states: 25 (5x5 taxi positions) * 5 (passenger positions -
the passenger can also be inside the taxi) * 4 (drop-off locations).

For this experiment I adapted my previous solution to the gymnasium library environment.
As one can see, the SARSA algorithm still performs well. The hyperparameters used for this result
are: numEpisodes = 100000, (α) = 0.1 and (γ) = 0.98.
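A minimal sketch of such a SARSA loop on gymnasium's Taxi-v3 (the ε-greedy exploration scheme and the value of epsilon are my own assumptions; the actual submitted code may differ):

import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, num_episodes = 0.1, 0.98, 0.1, 100_000

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily w.r.t. the current Q-table
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[state]))

for episode in range(num_episodes):
    state, _ = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(next_state)
        # SARSA update: bootstrap from the action the policy will actually take next
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state, next_action] - q_table[state, action]
        )
        state, action = next_state, next_action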

4 Investigating causality
4.1 Basics of causal inference
Note: all of the code and plots associated with the causality investigation below are in Jupyter notebooks.
Suppose that we want to find the causal effect of taking an action A on an outcome O. To define
the causal effect, consider two worlds: World 1 (the real world), in which the action A was taken and O
observed, and World 2 (the counterfactual world), in which the action A was not taken but everything else
stayed the same. The causal effect is the difference between the values of O attained in the real world
and in the counterfactual world. Changing the action A while keeping the rest of the world the same is
called an intervention; this gives us a clear view of whether there is a causal effect of A on O.
More formally, the causal effect is a difference of expected values, where do(·) denotes setting A by intervention:

E[O | do(A = 1)] − E[O | do(A = 0)]

There are four major steps in investigating causal inference:

4.2 Modelling
In the first step we express domain knowledge as a causal model (which can be a graph). In order to
estimate the causal effect we also need to include two kinds of variables: confounders (which cause both
the action and the outcome) and instruments (which cause the action but do not directly affect the outcome).

4.3 Identification
Here we want to check whether the target quantity can be estimated from observed variables.
There are two methods of identification. The first is the backdoor criterion: if we observe all confounders
of the action A and the outcome O, then the causal effect can be identified by conditioning on those
confounders:

E[O | do(A = 1)] = E_C[ E[O | A = 1, C] ]

where C is the set of all confounders. The second method uses instrumental variables: we can estimate
the effect even when none of the confounders of A and O are observed. The influence of an instrument I
on O can be broken up into two parts: its effect on the action and the effect of the action on the outcome,
which gives

E[O | do(A = 1)] − E[O | do(A = 0)] = (E[O | I = 1] − E[O | I = 0]) / (E[A | I = 1] − E[A | I = 0])
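A small illustrative sketch of both identification strategies on simulated data (the data-generating process below is entirely made up for demonstration and is not the data used in the notebooks):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Simulated world: C confounds A and O, I is an instrument for A; the true effect of A on O is 2.0
C = rng.binomial(1, 0.5, n)
I = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.2 + 0.3 * C + 0.4 * I)
O = 2.0 * A + 1.5 * C + rng.normal(0, 1, n)
df = pd.DataFrame({"C": C, "I": I, "A": A, "O": O})

# Backdoor adjustment: average the within-stratum contrast over the confounder distribution
strata_effects, weights = [], []
for _, g in df.groupby("C"):
    strata_effects.append(g.loc[g.A == 1, "O"].mean() - g.loc[g.A == 0, "O"].mean())
    weights.append(len(g) / len(df))
backdoor_effect = np.average(strata_effects, weights=weights)

# Instrumental-variable (Wald) estimator: instrument's effect on O divided by its effect on A
wald_effect = (df.loc[df.I == 1, "O"].mean() - df.loc[df.I == 0, "O"].mean()) / (
    df.loc[df.I == 1, "A"].mean() - df.loc[df.I == 0, "A"].mean()
)

print(backdoor_effect, wald_effect)  # both should be close to 2.0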

4.4 Estimation
In this step we perform a statistical estimation to compute the target quantity identified in the previous
step. Two methods are used in the two estimators that I have submitted: one estimates the effect using
linear regression and the other uses a gradient-based sklearn method.

4.5 Refutation
The last step is to check the reliability of our estimator. We need to check whether the assumptions made
during estimator development resulted in a working solution. We seek to disprove the correctness
of this solution using the properties of a correct one. For example, one could test whether the estimator returns an
unexpected outcome when the action variable A is replaced by a random variable, all other things being
equal.
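A compact sketch of these four steps with the dowhy library (the column names, the list of confounders and the refuter chosen here are illustrative assumptions; the submitted notebooks may use different settings):

import pandas as pd
from dowhy import CausalModel

# Assumed dataset: action A, outcome O and an observed confounder C
df = pd.read_csv("data.csv")

# 1. Modelling: encode domain knowledge (C confounds both A and O)
model = CausalModel(data=df, treatment="A", outcome="O", common_causes=["C"])

# 2. Identification: check whether the effect is identifiable (e.g. via the backdoor criterion)
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# 3. Estimation: compute the identified quantity, here with a linear-regression estimator
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)

# 4. Refutation: replace the treatment with a placebo and check that the estimated effect vanishes
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(refutation)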

Results can be found in the Jupyter notebooks.

5 Concepts of Causal Reinforcement Learning


5.1 Structural causal models
An SCM is a formalism used to describe causal relations between variables. It has a close connection
to decision making: mathematically, one can think of reinforcement learning as a sequential decision-making
process, and this connection will be important for establishing a link between the two. SCMs are built
from three blocks:

i) A set of variables that describe the state of the world we are in. Outcome
variables and explanatory variables are observed variables - they describe events that are measured
in a certain dataset. In the example below those are age, treatment and disease; this was already
laid out in the Jupyter notebook corresponding to the investigation of causality (Section 4) above. Unobserved
variables are those for which there is no observational data.

ii) Causal relationships, which describe the causal effects variables have on one another. For example:

Treatment ← f_T(Age, Unobserved)
Disease ← f_D(Treatment, Age, Unobserved)

iii) A probability distribution defined over the unobserved variables. Put more formally: an SCM is a
tuple (V, U, F, P(U)) where
V = {v_1, ..., v_n} are the observed variables,
U = {u_1, ..., u_n} are the unobserved variables,
F = {f_1, ..., f_n} are the functions that determine the V's,
P(U) is a probability distribution over U.
The SCM formalism is tightly linked to directed acyclic graphs as it obeys the same constraints.
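A small sketch of this SCM as executable Python (the particular functional forms and probabilities are invented for illustration only):

import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n=10_000):
    """Sample from a toy SCM with observed Age, Treatment, Disease and an unobserved cause U."""
    u = rng.normal(0, 1, n)                # P(U): distribution over the unobserved variable
    age = rng.integers(20, 80, n)          # exogenous observed variable
    # Treatment <- f_T(Age, U): older and higher-U individuals are treated more often
    treatment = rng.binomial(1, 1 / (1 + np.exp(-(0.03 * (age - 50) + u))))
    # Disease <- f_D(Treatment, Age, U): treatment lowers disease risk
    disease = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (age - 50) - 1.0 * treatment + u))))
    return age, treatment, disease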

5.2 Causal graphs


There is a connection between the SCMs and DAGs mentioned above - the former is usually represented in
the form of the latter. Causal relationships between outcome and explanatory variables are represented
by solid arrows.
The direction is important: an arrow always points in the direction of causality. For an
unobserved variable hypothetically causing an observed variable we use a dashed line.

5.3 Problem with causal inference


SCMs are almost never fully observed. What is crucial, however, is that they can be used to
represent the environments in which RL agents take actions. Furthermore, the agent can learn the
causal graph that represents the SCM. This is the key idea linking RL and CI
together.

5.4 Partially observable Markov decision process


A POMDP is a process in which the dynamics of the system still follow an MDP, but the agent cannot directly observe
the underlying state. The logic is almost the same as that outlined for the MDP above (1.2), but with some
important differences. With each transition to a new state, occurring with a probability described by the
transition distribution, the agent receives an observation that depends on the new state s and the action it just
took. The agent does not directly observe the entire state and thus must make decisions based on
a belief, i.e. without knowledge of the true state of the environment. By receiving observations,
however, the agent may update its probability distribution over the current state, i.e. update its beliefs
to match the true distribution more closely and make better decisions in the future. The key difference
from an MDP is that an MDP does not include an observation set - its observations are equal to the true
state of the environment, which in practice is often not applicable.
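A minimal sketch of such a belief update (a discrete Bayes filter; the transition and observation models are placeholders that one would have to supply):

import numpy as np

def update_belief(belief, action, observation, transition, observation_model):
    """
    belief:            array of shape (n_states,), current P(state)
    transition:        array T[a, s, s'] = P(s' | s, a)
    observation_model: array Z[s', o] = P(o | s')
    """
    # Predict step: propagate the belief through the transition distribution
    predicted = belief @ transition[action]
    # Correct step: weight by the likelihood of the received observation
    updated = predicted * observation_model[:, observation]
    # Normalise so the belief remains a probability distribution
    return updated / updated.sum()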

5.5 RL and causality


Where the state is only partially observable but obeys the Markov property, we have a POMDP.
RL systems must then discover the hidden state of the environment from observations. Model-based RL
involves building a causal model of the environment, and both model-based and model-free RL rely on
inferring latent causes in partially observable settings. There is active ongoing research on how to
resolve the problems arising when learning the underlying structure of such systems using the best of the RL and
CI formalisms. The topics include:

- Generalized Policy Learning
- Learning when to intervene
- Counterfactual decision making (explored in the Jupyter notebooks)
- Robustness of causal assumptions (explored in the Jupyter notebooks)
- Causal imitation learning
and others.

References
1. Causality for Machine Learning, Bernhard Schölkopf (https://arxiv.org/abs/1911.10500)

2. Data Driven Science and Engineering, Steven L. Brunton, J. Nathan Kutz (https://faculty.washington.edu/sbrunton/da)

3. Gymnasium environment documentation (https://gymnasium.farama.org)

4. dowhy library documentation (https://github.com/py-why/dowhy/tree/main/docs)

5. Reinforcement learning and causal models, Samuel J. Gershman, 2015

6. Probabilistic and Causal Inference: The Works of Judea Pearl, H. Geffner, R. Dechter, J. Y. Halpern, 2022
