Piotr Kolodziejski
Supervisor: Sebastian Tschiatschek, PhD
July 17, 2023
Abstract
In this project, I explore concepts related to reinforcement learning and latent confounding, i.e., factors that influence the dynamics and rewards of a system but cannot be observed directly by the learner. A deep dive into the basics of RL and, more specifically, Q-Learning is presented. I also outline concepts surrounding the intersection of RL and Causal Inference.
1 Introduction
This paper can be split into five major parts. In the first part I present an implementation of a custom reinforcement learning algorithm, more specifically Q-Learning, together with a corresponding self-built environment. In the second part two closely related TD-learning algorithms, SARSA and Q-Learning, are used to solve a more complex environment from OpenAI's gymnasium (the maintained fork of gym) and are compared. I also present an investigation of causality and some results obtained with the dowhy library. The last two parts cover the basics of causality and concepts surrounding the intersection of reinforcement learning and causal inference. Before I move on to the first result, I would like to introduce the basics of RL. The following section also sheds light on the nature of the problem I am trying to solve.
The learner's objective is to maximize the discounted return

R = \sum_{t=0}^{\infty} \gamma^t r_t

where γ discounts future rewards: rewards closer in time are more valuable to the biological entities that RL mimics (to some extent). If the reward system is inherently random, we need the language of probability, and the optimal action a in a given state s becomes

a_s^* = \arg\max_a \, E[R \mid s_0 = s, a_0 = a]

where R is a random variable. This is the action that, for the given state s, maximizes the expected return. A policy π* is optimal if it maximizes this expectation for all possible states.
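As a quick numerical illustration of the return formula above (the rewards here are made up purely for the example), it can be computed in a few lines of Python:

```python
# Toy example: R = sum_t gamma^t * r_t for a short, made-up reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # hypothetical r_0, r_1, r_2, r_3
R = sum(gamma**t * r for t, r in enumerate(rewards))
print(R)  # 1.0 + 0.0 + 1.62 + 0.729 = 3.349
```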
We can express the above search for an optimal policy in a recursive fashion:

Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \sum_{a' \in A(s')} \pi(a' \mid s') \, Q^{\pi}(s', a')

where T(s' | s, a) is the transition distribution. Note that primed variables refer to the next time step. This equation serves as a basis for many algorithmic solutions, for example Q-learning.
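To make the recursion concrete, below is a minimal sketch that evaluates Q^π on a small tabular MDP by repeatedly applying this Bellman backup; the MDP itself (T, R, and the uniform policy) is randomly generated purely for illustration.

```python
import numpy as np

# Hypothetical tabular MDP, made up for illustration.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = rng.random((n_states, n_actions))                             # R(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy pi(a|s)
gamma = 0.9

Q = np.zeros((n_states, n_actions))
for _ in range(200):              # iterate the Bellman backup to (approximate) convergence
    V = (pi * Q).sum(axis=1)      # V(s') = sum_{a'} pi(a'|s') Q(s', a')
    Q = R + gamma * T @ V         # Q(s,a) = R(s,a) + gamma * sum_{s'} T(s'|s,a) V(s')
```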
1.3 Q-Learning
Q-Learning is a specific type of Temporal-Difference algorithm that learns to approximate the Q-function. It finds an optimal policy by maximizing the expected reward over all steps of the algorithm. It follows exactly the same logic as outlined above. The Q-function measures the quality of a state-action pair, i.e., the expected return for taking an action in a given state:

Q : S \times A \to \mathbb{R}   (1)
There are a few parameters that need to be taken into account:
Learning rate (α) - determines how much the new information should affect our beliefs so far.
Discount factor (γ) - discounts future rewards. Because the weight γ^t decays with every additional step into the future, rewards that are far away contribute less and less, so the agent is effectively short-sighted beyond a certain horizon.
There also need to be some initial conditions, such as a randomly initialized Q-table, which changes over time until we reach a satisfactory table of state-action values.
The algorithm looks t steps into the future (in our case one step) before deciding the next move, and the factor γ^t weights that step; normally 0 < γ < 1. At each step the agent takes an action, enters a new state, receives a reward r that depends on the new state and the actions possible there, and Q is updated. Below is the basic value-iteration style update for the Q-function.
Q^{\text{new}}(s, a) = Q(s, a) + \alpha \left[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]   (2)

where α is the learning rate, γ is the discount rate, R(s, a) is the reward, and \max_{a'} Q(s', a') is the maximum predicted reward given the new state s' and all possible actions a'.
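As an illustration, the update in (2) translates into a few lines of Python. The snippet below is a minimal sketch with hypothetical names and table sizes (q_table, state, action, next_state), not necessarily the exact form used in my implementation.

```python
import numpy as np

# Hypothetical table sizes, chosen for illustration only.
n_states, n_actions = 500, 6
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.98

def q_learning_update(state, action, reward, next_state):
    # TD target uses the best action in the next state, as in equation (2).
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```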
1.4 SARSA
The State-Action-Reward-State-Action algorithm (SARSA) learns a Markov decision process policy. It follows a certain policy and learns the Q-values associated with it, i.e., it makes a promise about taking a certain action in the future and keeps it. Q-learning, on the other hand, updates its estimate of the optimal function Q* based on the maximum reward attainable in the next state (even if it does not take the associated action in that state), i.e., it is optimistic. We say that SARSA is on-policy, while Q-Learning is off-policy.
In terms of outcomes, Q-Learning takes more risks but can also achieve results closer to optimal, while SARSA is more conservative and takes a safer approach to the problem.
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state and the greedy action. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed, despite the fact that it is not following a greedy policy.
The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state and the current policy's action. It estimates the return for state-action pairs assuming the current policy continues to be followed.
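To make the distinction concrete, here is a minimal sketch of the SARSA update, reusing the hypothetical q_table, alpha and gamma from the earlier snippet; contrast its TD target with the one in q_learning_update above.

```python
def sarsa_update(state, action, reward, next_state, next_action):
    # On-policy: the TD target uses the action the behaviour policy actually takes next,
    # whereas q_learning_update above assumes the greedy action regardless of what is taken.
    td_target = reward + gamma * q_table[next_state, next_action]
    q_table[state, action] += alpha * (td_target - q_table[state, action])
```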
Figure 1: Example environment frame (left), success-to-fail ratio in every epoch (center), success percentage (right).
dropping someone off in the wrong location (-10). There is also a penalty for excessive movement (-1), i.e., if there is no reward, a small penalty is assigned; a similar mechanism to the one I implemented in the hiker, house, tree scenario.
As for the observation space, there are 500 states: 5x5 (taxi positions) × 5 (passenger positions, which can coincide with the taxi, i.e., the passenger sits inside) × 4 (drop-off positions).
For this experiment I have adapted my previous solution to the gymnasium library environment. As one can see, the SARSA algorithm still performs well. The hyperparameters used for this result are: numEpisodes = 100000, α = 0.1 and γ = 0.98.
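For reference, below is a minimal sketch of how such a SARSA run on the Taxi environment can be set up with gymnasium. It is a simplified illustration with an ε-greedy behaviour policy (the exploration setting is an assumption), not the exact code from my implementation.

```python
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))  # 500 states x 6 actions
alpha, gamma, epsilon, num_episodes = 0.1, 0.98, 0.1, 100_000

def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        return env.action_space.sample()   # explore
    return int(np.argmax(q_table[state]))  # exploit

for _ in range(num_episodes):
    state, _ = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        next_action = epsilon_greedy(next_state)
        # SARSA: the target uses the action actually chosen in the next state.
        td_target = reward + gamma * q_table[next_state, next_action]
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state, action = next_state, next_action
        done = terminated or truncated
```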
4 Investigating causality
4.1 Basics of causal inference
Note: all of the code and plots associated with the causality investigation below can be found in the accompanying Jupyter notebooks.
Suppose that we want to find the causal effect of taking an action A on the outcome O. To define the causal effect, consider two worlds: World 1 (the real world), in which the action A is taken and O is observed, and World 2 (the counterfactual world), in which the action A is not taken but everything else stays the same. The causal effect is the difference between the values of O attained in the real world and in the counterfactual world; this gives us a clear view of whether A has a causal effect on O. Changing the action A while keeping the rest of the world the same is called an intervention. More formally, the causal effect is a difference of expected values,

E[O \mid A = 1] - E[O \mid A = 0]
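As a toy illustration, under the (strong) assumption of no confounding this difference of expectations can be read off directly from observational data; the tiny dataframe below is made up purely for the example.

```python
import pandas as pd

# Made-up observational data with a binary action A and an outcome O.
df = pd.DataFrame({"A": [1, 1, 0, 0, 1, 0],
                   "O": [5.0, 6.0, 3.0, 2.5, 5.5, 3.5]})

# Naive estimate of E[O | A=1] - E[O | A=0]; it equals the causal effect
# only if there is no confounding (see the identification step below).
effect = df.loc[df["A"] == 1, "O"].mean() - df.loc[df["A"] == 0, "O"].mean()
print(effect)
```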
4.2 Modelling
In the first step we represent the domain knowledge as a causal model (which can be a graph). In order to estimate the causal effect we also need to include two kinds of variables: confounders (which cause both the action and the outcome) and instruments (which cause the action but do not directly affect the outcome).
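A minimal sketch of this step with the dowhy library is shown below; the dataframe and the variable names (action, outcome, age) are hypothetical placeholders rather than the exact ones from my notebooks.

```python
import pandas as pd
from dowhy import CausalModel

# Hypothetical observational dataset.
df = pd.DataFrame({"action":  [1, 0, 1, 0, 1, 0],
                   "age":     [34, 51, 29, 62, 45, 38],
                   "outcome": [1, 0, 1, 0, 1, 1]})

# Step 1 (modelling): encode domain knowledge as a causal model.
model = CausalModel(data=df,
                    treatment="action",
                    outcome="outcome",
                    common_causes=["age"])  # confounders
model.view_model()  # optional: renders the causal graph (needs plotting dependencies)
```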
4.3 Identification
Here we want to check whether the target causal quantity can be estimated from the observed variables. There are two methods of identification. The first is the backdoor criterion: if we observe all confounders of the action A and the outcome O, then the causal effect can be identified by conditioning on all of those confounders,

E[O \mid f(A = 1)] = E_{C}\left[\, E[O \mid A = 1, C] \,\right]

where C is the set of all confounders and f(A = 1) denotes intervening to set A = 1. The second method uses instrumental variables: we can estimate the effect even when none of the confounders of A and O are observed. The influence of the instrument I on O can be broken up into two parts, the effect of the instrument on the action and the effect of the action on the outcome, which gives

E[O \mid f(A = 1)] - E[O \mid f(A = 0)] = \frac{E[O \mid f(I = 1)] - E[O \mid f(I = 0)]}{E[A \mid f(I = 1)] - E[A \mid f(I = 0)]}
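With dowhy, this step is a single call on the model defined above; the proceed_when_unidentifiable flag simply tells the library to continue despite warnings about possible unobserved confounding.

```python
# Step 2 (identification): derive an estimand (backdoor, instrumental variable, ...)
# for the target causal effect from the graph.
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
```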
4.4 Estimation
In this step we perform a statistical estimation to compute the target quantity identified in the previous step. Two methods are used in the two estimators that I have submitted: one estimates the effect using linear regression and the other uses a gradient-based scikit-learn method.
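A sketch of the linear-regression variant with dowhy is shown below; the gradient-based estimator is configured analogously with a different method_name.

```python
# Step 3 (estimation): compute the identified estimand from the data,
# here with dowhy's built-in backdoor linear-regression estimator.
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.linear_regression")
print(estimate.value)
```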
4.5 Refutation
The last step is to check the reliability of our estimator. We need to check whether the assumptions made during estimator development resulted in a working solution, and we seek to disprove the correctness of this solution using the properties of a correct one. For example, one could replace the action variable A with a random (placebo) variable, keep everything else the same, and check whether the estimated effect drops to approximately zero, as it should for a correct estimator.
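In dowhy this check corresponds to the placebo-treatment refuter; a minimal sketch, continuing the objects from the previous snippets:

```python
# Step 4 (refutation): replace the treatment with a random placebo variable;
# a reliable estimator should now report an effect close to zero.
refutation = model.refute_estimate(identified_estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(refutation)
```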
A structural causal model (SCM) consists of three components:
i) A set of variables that describe the state of the world we are in. Outcome variables and explanatory variables are observed variables: they describe events that are measured in a given dataset. In the example below those would be age, treatment and disease. This was already laid out in the Jupyter notebook corresponding to the investigation of causality (Section 4) above. Unobserved variables are those for which there is no observational data.
ii) Causal relationships, which describe the causal effect the variables have on one another, for example
Disease ← f_D(Treatment, Age, Unobserved)
iii) A probability distribution defined over the unobserved variables. Put more formally, an SCM is a tuple (V, U, F, P(u)) where
V = {v1, ..., vn} are the observable variables,
U = {u1, ..., un} are the unobserved variables,
F = {f1, ..., fn} are the functions that determine the Vs, and
P(u) is a probability distribution over U.
The SCM formalism is tightly linked to directed acyclic graphs as it obeys the same constraints.
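As a small illustration of the formalism, the sketch below samples from an SCM with the three variables of the example above; the functional forms, coefficients and noise distributions are entirely made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    # Unobserved background variables U, drawn from the (made-up) distribution P(u).
    u = rng.normal(size=n)
    # Observable variables V, each determined by a function in F.
    age = rng.integers(20, 80, size=n)                            # Age: exogenous in this toy model
    treatment = (rng.random(n) < 0.5).astype(int)                 # Treatment <- f_T(noise)
    disease = (0.03 * age - 1.5 * treatment + u > 0).astype(int)  # Disease <- f_D(Treatment, Age, U)
    return age, treatment, disease

age, treatment, disease = sample_scm(1000)
```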