
Reinforcement Learning under Latent Confounding

Piotr Kolodziejski
Supervisor: Sebastian Tschiatschek, PhD
July 17, 2023

Abstract
In this project, I explore concepts related to reinforcement learning and latent confounding,
i.e., aspects influencing the dynamics and rewards of a system that cannot be observed by the
learner directly. A deep dive into the basics of RL, and more specifically Q-Learning, is presented.
I also outline the concepts surrounding the intersection of RL and Causal Inference.

1 Introduction
This paper can be split into five major parts. In the first part I present an implementation of a
custom reinforcement learning algorithm, more specifically Q-Learning, together with a corresponding self-built
environment. In the second part, two closely related TD-learning algorithms, SARSA and Q-Learning,
are used to solve a more complex environment from gymnasium (the maintained fork of OpenAI's gym) and are
compared. I also present an investigation of causality and some results obtained with the dowhy library. The last
two sections cover the basics of causality and the concepts surrounding reinforcement learning and
causal inference. Before I move on to the first result, I would like to introduce the basics of RL. The
following section also sheds light on the nature of the problem I am trying to solve.

1.1 Semi-formal introduction to reinforcement learning


An agent interacting with the environment is in state s_t at time t. It then
takes an action a_t drawn from a policy π(a_t|s_t), receives a reward r_t and transitions to a new state
s_{t+1}. The agent continues interacting with the environment and accumulating rewards until it reaches
a terminal state. The aim is to maximize this cumulative reward:


R = Σ_{t=0}^{∞} γ^t r_t

where γ discounts future rewards - rewards closer in time are more valuable to the biological
entities that RL is (to some extent) mimicking. If the reward system is inherently random we need to
use the language of probability, and the optimal action a in a given state s becomes:

a_s^∗ = argmax_a E[R | s_0 = s, a_0 = a]

where R is a random variable. This is the optimal action a for the given state s, i.e. the action that maximizes the
expected return. A policy π^∗ is optimal if it maximizes this expectation for all possible states.

1.2 Bellman equation


The setting in which the state transition and the collected reward are conditioned only on the
current state and action is called a Markov decision process. Thanks to this property we
can express the above search for an optimal policy in a recursive fashion.

Q^π(s, a) = R(s, a) + γ Σ_{s′} T(s′|s, a) Σ_{a′ ∈ A(s′)} π(a′|s′) Q^π(s′, a′)

where T(s′|s, a) is the transition distribution. Note that primed variables refer to the next time step.
This equation serves as a basis for many algorithmic solutions - for example Q-learning.

1.3 Q-Learning
Q-Learning is a specific type of temporal-difference algorithm that learns to approximate a Q-
function. It finds an optimal policy by maximizing the expected reward over all steps of the
algorithm. It follows exactly the same logic as outlined above. The Q-function measures the quality of a
state-action pair, i.e. the expected reward for taking an action in a given state.

Q:S×A→R (1)

There are three parameters that need to be taken into account:

Learning rate - alpha - determines how much the new information should affect our beliefs so
far.

Discount factor - gamma - discounts future rewards: a reward t steps ahead is weighted by γ^t, so its
influence decays the further it lies in the future. The closer γ is to 1, the more far-sighted the agent;
the closer it is to 0, the more short-sighted.

There also need to be some initial conditions, such as a randomly initialized Q-table, which changes
over time until we reach a satisfactory table of state-action values.

After looking t steps into the future (in our case one step), the algorithm decides on the next action. The
factor γ^t weights this step; normally it is set so that 0 < γ < 1. At each step the agent takes an action,
enters a new state, receives a reward r depending on the new state and the possible actions there, and Q
is updated. Below is the basic value-iteration update for the Q-function.
Q_new(s, a) = Q(s, a) + α [ R(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a) ]          (2)

where Q_new(s, a) is the new Q-value, α the learning rate, R(s, a) the reward, γ the discount rate, and
max_{a′} Q(s′, a′) the maximum predicted value given the new state and all possible actions in it.
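A minimal sketch of update (2) in Python (the variable names and the tabular layout are my own illustrative choices, not the exact implementation used later):

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One application of update (2) to a tabular Q-function."""
    # Maximum predicted value over all actions in the new state
    best_next = np.max(q_table[next_state])
    # TD target and TD error
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state, action]
    # Move the current estimate a fraction alpha towards the target
    q_table[state, action] += alpha * td_error
    return q_table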

1.4 SARSA
The state-action-reward-state-action algorithm (SARSA) learns a Markov decision process policy. It
follows a certain policy and learns the Q-values associated with it, i.e. it makes a promise about
taking a certain action in the future and keeps it. Q-learning, on the other hand, updates its estimate
of the optimal Q-function based on the maximum Q-value in the next state (even if it does not
take the associated action there), i.e. it is optimistic. We say that SARSA is on-policy, while
Q-Learning is off-policy.

Q_new(s_t, a_t) = Q(s_t, a_t) + α · ( R(s_t, a_t) + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )          (3)

In terms of outcomes, Q-Learning takes more risks but can also achieve results closer to optimal,
while SARSA is more conservative and takes a safer approach to the problem.
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next
state and the greedy action. In other words, it estimates the return (total discounted future reward)
of state-action pairs assuming a greedy policy were followed, despite the fact that it is not following a
greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next
state and the current policy's action. It estimates the return of state-action pairs assuming the
current policy continues to be followed.
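The difference between the two updates is easiest to see side by side. A short sketch (the function names are my own; note that the SARSA target needs the action the policy actually chooses in the next state):

import numpy as np

def q_learning_target(q_table, reward, next_state, gamma=0.95):
    # Off-policy: bootstrap from the greedy (maximum) value in the next state
    return reward + gamma * np.max(q_table[next_state])

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.95):
    # On-policy: bootstrap from the action the current policy actually takes next
    return reward + gamma * q_table[next_state, next_action]

def td_update(q_table, state, action, target, alpha=0.1):
    # Shared TD step used by both algorithms
    q_table[state, action] += alpha * (target - q_table[state, action])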

2 Custom environment for Q-Learning


2.1 Problem and outcome
There are three entities in our environment - a hiker, a house and a tree. The goal is to train the hiker
to reach the house without hitting the tree. To achieve that in Python one can use a Q-table.
It is a structure that holds state-action values and is updated throughout training - using the logic
outlined in the introduction - so that the model knows which action is most beneficial in a given
state. The reward after each step can be negative (there are penalties) or positive (for reaching the
destination). An example environment is rendered in Fig. 1. Every 10000 episodes a new
window is presented to the user and the hiker tries to reach the house in real time. Fig. 1 also shows
how the algorithm progresses through the learning process. Those results were produced with the most
important hyperparameters set to: numEpisodes = 150000, learnRate (α) = 0.1 and discount (γ) = 0.95.

2.2 Environment in Python


The environment is a grid in which the entity can move in four directions without ever leaving the grid.
There is a reward/penalty system in place to ensure that the state-action pairs in the Q-table are properly
updated during consecutive steps. Every step without a reward is slightly penalised, while hitting a
tree corresponds to a large penalty and termination of the episode in the "fail" state. Reaching the
house corresponds to a large reward and termination of the episode in the "success" state. The ratio of
successes to failures over consecutive iterations, i.e. the learning process, is plotted. Below we can see
the results of the learning and that the success percentage increases quickly. Q-learning can in fact identify
an optimal action-selection policy given infinite exploration. As we can see on the right, progress slows
down with consecutive steps but does not stop.
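A sketch of how such an environment's step logic could look (the grid size, reward values and method names below are my own illustrative choices, not the exact values used in the experiments):

import random

class HikerEnv:
    """Toy grid world with a hiker, a house (goal) and a tree (obstacle)."""

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=10):
        self.size = size
        self.reset()

    def reset(self):
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        self.hiker, self.house, self.tree = random.sample(cells, 3)
        return self.hiker

    def step(self, action):
        dr, dc = self.MOVES[action]
        r = min(max(self.hiker[0] + dr, 0), self.size - 1)  # clamp so the hiker never leaves the grid
        c = min(max(self.hiker[1] + dc, 0), self.size - 1)
        self.hiker = (r, c)
        if self.hiker == self.tree:      # large penalty, episode ends in the "fail" state
            return self.hiker, -100, True
        if self.hiker == self.house:     # large reward, episode ends in the "success" state
            return self.hiker, +100, True
        return self.hiker, -1, False     # small penalty for every step without a reward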

Figure 1: Example environment frame (left), success to fail ratio in every epoch (center), success
percentage ratio (right).

3 SARSA and Q-Learning in gymnasium environment


3.1 Gymnasium environment
The Taxi problem involves navigating to a passenger in a grid world, picking them up and dropping
them off at one of four locations.
The Taxi environment is a 5x5 grid in which the task is to locate the passenger and drop them
off at one of the specified locations - red, green, yellow or blue. If we reach the passenger's desired
destination the episode ends. Both the taxi and the passenger are located randomly in the grid at the
start. The reward for a successful drop-off is +20, while missing the passenger or dropping them off at
the wrong location is penalised with -10. There is also a penalty of -1 for excessive movement,
i.e. if there is no reward, a small penalty is assigned. This is a mechanism similar to the one I implemented in the
hiker, house and tree scenario.

There are six actions available in the action space:


0: Move south
1: Move north
2: Move east
3: Move west
4: Pick the passenger up
5: Drop the passenger off

As for the observation space, there are 500 states: 25 (5x5 taxi positions) * 5 (passenger positions -
the passenger can also be inside the taxi) * 4 (drop-off locations).

For this experiment I adapted my previous solution to the gymnasium library environment.
As one can see, the SARSA algorithm still performs well. The hyperparameters used for this result
are: numEpisodes = 100000, (α) = 0.1 and (γ) = 0.98.
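A minimal sketch of such a SARSA loop on gymnasium's Taxi-v3 (the ε-greedy exploration scheme and the value of epsilon are my own assumptions; the actual submitted code may differ):

import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, num_episodes = 0.1, 0.98, 0.1, 100_000

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily w.r.t. the current Q-table
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(q_table[state]))

for episode in range(num_episodes):
    state, _ = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(next_state)
        # SARSA update: bootstrap from the action the policy will actually take next
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state, next_action] - q_table[state, action]
        )
        state, action = next_state, next_action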

4 Investigating causality
4.1 Basics of causal inference
Note: all of the code and plots associated with the causality investigation below are in Jupyter notebooks.
Suppose that we want to find the causal effect of taking an action A on an outcome O. To define
the causal effect, consider two worlds: World 1 (the real world), in which the action A was taken and O
observed, and World 2 (the counterfactual world), in which the action A was not taken but everything else
stayed the same. The causal effect is the difference between the values of O attained in the real world
and in the counterfactual world. Changing the action A while keeping the rest of the world the same is
called an intervention; this gives us a clear view of whether there is a causal effect of A on O.
More formally, the causal effect is a difference of expected values, where do(·) denotes setting A by intervention:

E[O | do(A = 1)] − E[O | do(A = 0)]

There are four major steps in investigating causal inference:

4.2 Modelling
In the first step we express domain knowledge as a causal model (which can be a graph). In order to
estimate the causal effect we also need to include two kinds of variables: confounders (which cause both
the action and the outcome) and instruments (which cause the action but do not directly affect the outcome).

4.3 Identification
Here we want to check whether the target quantity can be estimated from observed variables.
There are two methods of identification. The first is the backdoor criterion: if we observe all confounders
of the action A and the outcome O, then the causal effect can be identified by conditioning on those
confounders:

E[O | do(A = 1)] = E_C[ E[O | A = 1, C] ]

where C is the set of all confounders. The second method uses instrumental variables: we can estimate
the effect even when none of the confounders of A and O are observed. The influence of an instrument I
on O can be broken up into two parts: its effect on the action and the effect of the action on the outcome,
which gives

E[O | do(A = 1)] − E[O | do(A = 0)] = (E[O | I = 1] − E[O | I = 0]) / (E[A | I = 1] − E[A | I = 0])
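A small illustrative sketch of both identification strategies on simulated data (the data-generating process below is entirely made up for demonstration and is not the data used in the notebooks):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Simulated world: C confounds A and O, I is an instrument for A; the true effect of A on O is 2.0
C = rng.binomial(1, 0.5, n)
I = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.2 + 0.3 * C + 0.4 * I)
O = 2.0 * A + 1.5 * C + rng.normal(0, 1, n)
df = pd.DataFrame({"C": C, "I": I, "A": A, "O": O})

# Backdoor adjustment: average the within-stratum contrast over the confounder distribution
strata_effects, weights = [], []
for _, g in df.groupby("C"):
    strata_effects.append(g.loc[g.A == 1, "O"].mean() - g.loc[g.A == 0, "O"].mean())
    weights.append(len(g) / len(df))
backdoor_effect = np.average(strata_effects, weights=weights)

# Instrumental-variable (Wald) estimator: instrument's effect on O divided by its effect on A
wald_effect = (df.loc[df.I == 1, "O"].mean() - df.loc[df.I == 0, "O"].mean()) / (
    df.loc[df.I == 1, "A"].mean() - df.loc[df.I == 0, "A"].mean()
)

print(backdoor_effect, wald_effect)  # both should be close to 2.0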

4.4 Estimation
In this step we perform a statistical estimation to compute the target quantity identified in the previous
step. Two methods are used in the two estimators that I have submitted: one estimates the effect using
linear regression and the other uses a gradient-based sklearn method.

4.5 Refutation
The last step is to check the reliability of our estimator. We need to check whether the assumptions made
during estimator development resulted in a working solution. We seek to disprove the correctness
of this solution using the properties of a correct one. For example, one could test whether the estimator returns an
unexpected outcome when the action variable A is replaced by a random variable, all other things being
equal.
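A compact sketch of these four steps with the dowhy library (the column names, the list of confounders and the refuter chosen here are illustrative assumptions; the submitted notebooks may use different settings):

import pandas as pd
from dowhy import CausalModel

# Assumed dataset: action A, outcome O and an observed confounder C
df = pd.read_csv("data.csv")

# 1. Modelling: encode domain knowledge (C confounds both A and O)
model = CausalModel(data=df, treatment="A", outcome="O", common_causes=["C"])

# 2. Identification: check whether the effect is identifiable (e.g. via the backdoor criterion)
estimand = model.identify_effect(proceed_when_unidentifiable=True)

# 3. Estimation: compute the identified quantity, here with a linear-regression estimator
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)

# 4. Refutation: replace the treatment with a placebo and check that the estimated effect vanishes
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(refutation)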

Results can be found in the Jupyter notebooks.

5 Concepts of Causal Reinforcement Learning


5.1 Structural causal models
An SCM is a formalism used to describe causal relations between variables. It has a close connection
to decision making: mathematically, one can think of reinforcement learning as a sequential decision-making
process, and this connection will be important for establishing a link between the two. SCMs are built
from three blocks:

i) A set of variables that describe the state of the world we are in. Outcome
variables and explanatory variables are observed variables - they describe events that are measured
in a certain dataset. In the example below those are age, treatment and disease; this was already
laid out in the Jupyter notebook corresponding to the investigation of causality (Section 4) above. Unobserved
variables are those for which there is no observational data.

ii) Causal relationships, which describe the causal effects variables have on one another. For example:

Treatment ← f_T(Age, Unobserved)
Disease ← f_D(Treatment, Age, Unobserved)

iii) A probability distribution defined over the unobserved variables. Put more formally: an SCM is a
tuple (V, U, F, P(U)) where
V = {v_1, ..., v_n} are the observed variables,
U = {u_1, ..., u_n} are the unobserved variables,
F = {f_1, ..., f_n} are the functions that determine the V's,
P(U) is a probability distribution over U.
The SCM formalism is tightly linked to directed acyclic graphs as it obeys the same constraints.
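A small sketch of this SCM as executable Python (the particular functional forms and probabilities are invented for illustration only):

import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n=10_000):
    """Sample from a toy SCM with observed Age, Treatment, Disease and an unobserved cause U."""
    u = rng.normal(0, 1, n)                # P(U): distribution over the unobserved variable
    age = rng.integers(20, 80, n)          # exogenous observed variable
    # Treatment <- f_T(Age, U): older and higher-U individuals are treated more often
    treatment = rng.binomial(1, 1 / (1 + np.exp(-(0.03 * (age - 50) + u))))
    # Disease <- f_D(Treatment, Age, U): treatment lowers disease risk
    disease = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (age - 50) - 1.0 * treatment + u))))
    return age, treatment, disease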

5.2 Causal graphs


There is a connection between the SCMs and DAGs mentioned above - the former is usually represented in
the form of the latter. Causal relationships between outcome and explanatory variables are represented
by solid arrows.
The direction is important: an arrow always points in the direction of causality. For an
unobserved variable hypothetically causing an observed variable we use a dashed line.

5.3 Problem with causal inference


SCMs are almost never fully observed. What is crucial, however, is that they can be used to
represent the environments in which RL agents take actions. Furthermore, the agent can learn the
causal graph that represents the SCM. This is the key idea linking RL and CI
together.

5.4 Partially observable Markov decision process


A POMDP is a process in which the dynamics of the system still follow an MDP, but the agent cannot directly observe
the underlying state. The logic is almost the same as that outlined for the MDP above (1.2), but with some
important differences. With each transition to a new state, occurring with a probability described by the
transition distribution, the agent receives an observation that depends on the new state s and the action it just
took. The agent does not directly observe the entire state and thus must make decisions based on
a belief, i.e. without knowledge of the true state of the environment. By receiving observations,
however, the agent may update its probability distribution over the current state, i.e. update its beliefs
to match the true distribution more closely and make better decisions in the future. The key difference
from an MDP is that an MDP does not include an observation set - its observations are equal to the true
state of the environment, which in practice is often not applicable.
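A minimal sketch of such a belief update (a discrete Bayes filter; the transition and observation models are placeholders that one would have to supply):

import numpy as np

def update_belief(belief, action, observation, transition, observation_model):
    """
    belief:            array of shape (n_states,), current P(state)
    transition:        array T[a, s, s'] = P(s' | s, a)
    observation_model: array Z[s', o] = P(o | s')
    """
    # Predict step: propagate the belief through the transition distribution
    predicted = belief @ transition[action]
    # Correct step: weight by the likelihood of the received observation
    updated = predicted * observation_model[:, observation]
    # Normalise so the belief remains a probability distribution
    return updated / updated.sum()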

5.5 RL and causality


Where the state is only partially observable but obeys the Markov property, we have a POMDP.
RL systems must then discover the hidden state of the environment from observations. Model-based RL
involves building a causal model of the environment, and both model-based and model-free RL rely on
inferring latent causes in partially observable settings. There is active ongoing research on how to
resolve the problems arising when learning the underlying structure of such systems using the best of the RL and
CI formalisms. The topics include:

- Generalized Policy Learning
- Learning when to intervene
- Counterfactual decision making (explored in the Jupyter notebooks)
- Robustness of causal assumptions (explored in the Jupyter notebooks)
- Causal imitation learning
and others.

References
1. Causality for Machine Learning, Bernhard Schölkopf (https://arxiv.org/abs/1911.10500)

2. Data Driven Science and Engineering, Steven L. Brunton, J. Nathan Kutz (https://faculty.washington.edu/sbrunton/da)

3. Gymnasium environment documentation (https://gymnasium.farama.org)

4. dowhy library documentation (https://github.com/py-why/dowhy/tree/main/docs)

5. Reinforcement learning and causal models, Samuel J. Gershman, 2015

6. Probabilistic and Causal Inference: The Works of Judea Pearl, H. Geffner, R. Dechter, J. Y. Halpern, 2022
