Reinforcement Learning under Latent Confounding
Piotr Kolodziejski
Supervisor: Sebastian Tschiatschek, PhD
June 30, 2023
Abstract
In this project, I study reinforcement learning under latent confounding, i.e., aspects that
influence the dynamics and rewards of the system but cannot be observed directly by the
learner. In the end, only results concerning the gradient-based method have been established.
1 Introduction
This paper can be split into three major parts. In the first part I present an implementation of a
custom reinforcement learning environment, solved with Q-Learning. In the second part, two
closely related TD-learning algorithms, SARSA and Q-Learning, are applied to a more complex
environment from the gymnasium library (the maintained fork of OpenAI's gym) and compared. Lastly, I present the
investigation of causality and the results on the resilience of a gradient-based estimation method
when the action is replaced by a random variable (the refutation test).
Before I move on to the first result, I would like to introduce the basics of Q-Learning. The following
section also sheds light on the nature of the problem I am trying to solve.
1.1 Q-Learning
I assume a certain level of familiarity with the subject, and thus do not introduce Markov Decision
Processes or Temporal Difference Learning. Rather I go straight to the point.
Q-Learning is a specific type of TD algorithm that learns to approximate a Q-function, which
measures the quality of a state-action pair:

\[ Q : S \times A \rightarrow \mathbb{R} \tag{1} \]

A reward received t steps into the future is weighted by the factor γ^t, where γ is called the discount
rate; since the algorithm looks only one step ahead, the weight here is simply γ. Normally it
is set so that 0 < γ < 1, which means that a reward nearer in time matters more than one
further in the future. At each step the agent takes an action, receives a reward r, enters a new state, and
Q is updated. The key idea is the Bellman equation, which amounts to a basic value iteration for the Q-function.
\[ Q^{\text{new}}(s, a) = Q(s, a) + \alpha \big[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big] \tag{2} \]

Here Q^{new}(s, a) is the new Q-value, α the learning rate, γ the discount rate, R(s, a) the reward,
and max_{a'} Q(s', a') the maximum predicted value given the new state s' and all possible actions a'.
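As a concrete illustration, the sketch below applies the update in Eq. (2) to a tabular Q-function stored as a NumPy array indexed by (state, action). The function name and the default values of alpha and gamma are placeholders, not code from the project.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Apply one Q-Learning update (Eq. 2) to a tabular Q-function."""
    # Maximum predicted value of the new state over all possible actions.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q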
1.2 SARSA
The State-Action-Reward-State-Action algorithm (SARSA) learns a Markov decision process policy. It
follows a certain policy and learns the Q-values associated with it, i.e., it makes a promise about
taking a certain action in the future and keeps it. Q-Learning, on the other hand, updates an estimate
of the optimal function Q* based on the maximum estimated value in the next state (even if it does not
take the associated action in that state), i.e., it is optimistic. We say that SARSA is on-policy, while
Q-Learning is off-policy.

Figure 1: Example environment frame (left), success-to-fail ratio in every epoch (center), success
percentage (right).
\[ Q^{\text{new}}(s_t, a_t) = Q(s_t, a_t) + \alpha \big( R(s_t, a_t) + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big) \tag{3} \]
In terms of outcomes, Q-Learning takes more risks but can also achieve results closer to optimal,
while SARSA is more conservative and takes a safer approach to the problem.
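For comparison, here is a minimal sketch of the SARSA update in Eq. (3) in the same tabular setting as the sketch at the end of Section 1.1; the function name and default hyperparameters are again placeholders. The only difference is that it bootstraps from the action a_next that the policy will actually take, rather than from the maximum.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """Apply one SARSA update (Eq. 3); on-policy, so it uses the next action actually taken."""
    td_target = r + gamma * Q[s_next, a_next]  # no max over actions here
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q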
2 Custom environment for Q-Learning
2.1 Problem and outcome
There are three entities in our environment: a hiker, a house, and a tree. The goal is to train the hiker
to reach the house without hitting the tree. In Python this can be achieved with a Q-table,
a structure that holds the values of state-action pairs and is updated throughout training, using the Bellman
equation, so that the model knows which action is most beneficial in a given state. The reward
after each step can be negative (there are penalties) or positive (for reaching the destination). We can
see an example environment rendered in Fig. 1. Every 10000 episodes a new window is presented to the
user and the hiker tries to reach the house in real time. Fig. 1 also shows how the algorithm progresses
in the learning process. Those results were produced with the most important hyperparameters set to
numEpisodes = 150000, learnRate (α) = 0.1 and discount (γ) = 0.95.
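The training loop below is a minimal sketch of how such a Q-table can be trained. The HikerEnv interface (num_states, num_actions, reset(), step()), the reward values mentioned in the comment, and the epsilon-greedy exploration rate are assumptions made purely for illustration; the hyperparameter defaults match the ones reported above.

import numpy as np

def train(env, num_episodes=150_000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-Learning on a hypothetical HikerEnv (hiker, house, tree on a grid)."""
    Q = np.zeros((env.num_states, env.num_actions))  # the Q-table
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.random() < epsilon:
                a = np.random.randint(env.num_actions)
            else:
                a = int(np.argmax(Q[s]))
            # Assumed reward scheme: bonus for the house, penalty for the tree, small step cost.
            s_next, r, done = env.step(a)
            # Bellman update from Eq. (2).
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q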
3 SARSA and Q-Learning in gymnasium environment
For this experiment I have adapted my previous solution to an environment from the gymnasium library.
The algorithm still performs well in this setting. The hyperparameters used for this result are
numEpisodes = 100000, learnRate (α) = 0.1 and discount (γ) = 0.98.
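The sketch below shows how the same tabular approach transfers to the gymnasium API, using SARSA. The report does not name the exact environment, so the discrete toy-text task CliffWalking-v0 and the epsilon value are stand-ins; the other hyperparameters are the ones listed above. Replacing Q[s_next, a_next] with np.max(Q[s_next]) turns this into the Q-Learning variant.

import numpy as np
import gymnasium as gym

env = gym.make("CliffWalking-v0")  # placeholder environment, not necessarily the one used
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, num_episodes = 0.1, 0.98, 0.1, 100_000

def epsilon_greedy(s):
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[s]))

for _ in range(num_episodes):
    s, _info = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _info = env.step(a)
        a_next = epsilon_greedy(s_next)
        # SARSA: bootstrap from the action that will actually be taken next.
        target = r if terminated else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
        done = terminated or truncated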
4 Investigating causality
4.1 Basics of causal inference
Note: All of the code and plots associated with the causality investigation below are in Jupyter notebooks.
Suppose that we want to find the causal effect of taking an action A on the outcome O. To define
the causal effect, consider two worlds: World 1 (the real world), in which the action A is taken and the
outcome O is observed, and World 2 (the counterfactual world), in which the action A is not taken but
everything else stays the same. The causal effect is the difference between the values of O attained in
the real world and in the counterfactual world. This gives a clear view on whether there is some
causal effect of A on O. Changing the action A while keeping the rest of the world the same is
called an intervention. More formally, the causal effect is a difference of expected values:

\[ E[O \mid A = 1] - E[O \mid A = 0] \]
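To make this definition concrete, the short simulation below generates both worlds for a hypothetical data-generating process (all numbers are illustrative, not from the project) and contrasts the true causal effect with a naive difference of conditional means, which is biased because a confounder influences both the action and the outcome.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: a confounder influences both A and O.
confounder = rng.normal(size=n)
action = (confounder + rng.normal(size=n) > 0).astype(int)

# Potential outcomes in the two worlds.
outcome_world1 = 2.0 + confounder + rng.normal(size=n)  # action taken (A = 1)
outcome_world2 = 0.5 + confounder + rng.normal(size=n)  # action not taken (A = 0)

# Causal effect: average difference between the two worlds (here ~1.5).
true_effect = (outcome_world1 - outcome_world2).mean()

# In practice only one world per individual is observed.
observed = np.where(action == 1, outcome_world1, outcome_world2)
naive = observed[action == 1].mean() - observed[action == 0].mean()

print(f"true effect ~ {true_effect:.2f}, naive difference of means ~ {naive:.2f}")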
There are four major steps in investigating causal inference:
4.2 Modelling
In the first step we represent domain knowledge as a causal model (for example, a graph). In order to
estimate the causal effect we also need to include two kinds of variables: confounders (which cause both
the action and the outcome) and instruments (which cause the action but do not directly affect the outcome).
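As an illustration of this step, the sketch below builds a causal model with the dowhy library on synthetic data. The data-generating process and the column names (Treatment, Outcome, w0, w1, chosen to mirror the variable names appearing in the output of Section 4.6) are assumptions, not the exact dataset used in the notebooks.

import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000
w0, w1 = rng.normal(size=n), rng.normal(size=n)                  # confounders
treatment = (w0 + w1 + rng.normal(size=n) > 0).astype(int)       # action A
outcome = 1.0 * treatment + 2.0 * w0 - w1 + rng.normal(size=n)   # outcome O
df = pd.DataFrame({"Treatment": treatment, "Outcome": outcome, "w0": w0, "w1": w1})

# Step 1 (Modelling): encode domain knowledge as a causal model.
model = CausalModel(
    data=df,
    treatment="Treatment",
    outcome="Outcome",
    common_causes=["w0", "w1"],  # observed confounders
)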
4.3 Identification
Here we want to check whether the target quantity can be estimated using observed variables.
There are two main methods of identification. The first is the backdoor criterion: if we observe all
confounders of the action A and the outcome O, then the causal effect can be identified by conditioning
on all of those confounders,

\[ E[O \mid do(A = 1)] = E_{C}\big[\, E[O \mid A = 1, C] \,\big], \]

where C is the set of all confounders. The second method uses instrumental variables: we can estimate
the effect even when none of the confounders of A and O are observed. The influence of the instrument
I on O can be broken up into two parts, the influence of the instrument on the action and the influence
of the action on the outcome, which gives

\[ E[O \mid do(A = 1)] - E[O \mid do(A = 0)] = \frac{E[O \mid do(I = 1)] - E[O \mid do(I = 0)]}{E[A \mid do(I = 1)] - E[A \mid do(I = 0)]}. \]
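Continuing the dowhy sketch from the Modelling step, identification is a single call that checks (here via the backdoor criterion over w0 and w1) whether the effect can be expressed in terms of observed variables:

# Step 2 (Identification): can the causal effect be written in terms of observables?
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)  # lists backdoor / instrumental-variable estimands, if any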
4.4 Estimation
In this step we perform a statistical estimation to compute the target quantity identified in the previous
step. Two methods are used in the two estimators that I have submitted: one is based on linear
regression and the other uses a gradient-based method from sklearn. Unfortunately, I did
not manage to establish a connection between reinforcement learning based estimators and unobserved
confounders, which was supposed to be the last step of this project.
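Continuing the same sketch, the linear case can be estimated with dowhy's backdoor linear-regression method. The nonlinear case in the report uses sklearn's HistGradientBoostingRegressor; the exact wiring of that regressor into the estimator is not reproduced here.

# Step 3 (Estimation): compute the identified quantity from data.
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",  # linear-regression-based estimator
)
print(estimate.value)  # should be close to the true effect of 1.0 in this synthetic data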
4.5 Refutation
The last step is to check the reliability of our estimator. We need to verify whether the assumptions
made during estimator development resulted in a working solution; we seek to disprove the correctness
of this solution using the properties of a correct one. One such test checks whether the estimator returns
the expected outcome (an effect close to zero) when the action variable A is replaced by a random variable,
all other things kept the same.
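Finally, dowhy's placebo-treatment refuter replaces the action with a random variable and re-runs the estimation; for a sound estimator the new effect should be close to zero, as in the output shown in Section 4.6. This continues the same sketch as above.

# Step 4 (Refutation): swap the treatment for a placebo and check the effect vanishes.
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
)
print(refutation)  # reports the original effect, the new (placebo) effect and a p-value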
4.6 Results (jupyter notebook)
I have modelled both linear and non-linear datasets. The linear regression based estimator, unsurprisingly,
yielded satisfying results for the linear case. In the nonlinear case the HistGradientBoostingRegressor() solver
from the sklearn library has been used. It estimates a value close to the real one and
responds properly to the refutation test, i.e., the placebo (new) effect can be rounded to 0. The
reinforcement learning based estimators have not been evaluated.
Nonlinear case - gradient-based solver:
*** Causal Estimate ***

Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

Estimand : 1
Estimand name: backdoor
Estimand expression: d/d[Treatment] (E[Outcome|w1,w0])
Estimand assumption 1, Unconfoundedness: If U→Treatment and U→Outcome then
P(Outcome|Treatment,w1,w0,U) = P(Outcome|Treatment,w1,w0)

Realized estimand: b: Outcome~Treatment+w1+w0
Target units: ate

Estimate
Mean value: 1.2674648531032078
Effect estimates: [[1.26746485]]

Refute: Use a Placebo Treatment
Estimated effect: 0.9880992921829223
New effect: 0.00030557816700916565
p value: 0.2763353222059542
References
1. Causality for Machine Learning, Bernhard Schölkopf (https://arxiv.org/abs/1911.10500)
2. Data Driven Science and Engineering, Steven L. Brunton, J. Nathan Kutz, (https://faculty.washington.edu/sbrunton/da
3. Gymnasium environment documentation (https://gymnasium.farama.org)
4. dowhy library documentation (https://github.com/py-why/dowhy/tree/main/docs)