Reinforcement Learning under Latent Confounding
Piotr Kolodziejski
Supervisor: Sebastian Tschiatschek, PhD
June 30, 2023
Abstract
In this project, I study reinforcement learning under latent confounding, i.e., aspects that
influence the dynamics and rewards of the system but cannot be observed directly by the
learner. In the end, only results concerning the gradient-based method have been established.
1 Introduction
This paper can be split into three major parts. In the first part I present an implementation of a
custom reinforcement learning environment, solved with Q-Learning. In the second part, two
closely related TD-learning algorithms, SARSA and Q-Learning, are applied to a more complex
environment from the gymnasium library (the maintained fork of OpenAI's gym) and compared. Lastly, I present the
investigation of causality and the results on the resilience of a gradient-based estimation method
when the action is replaced by a random variable (the refutation test).
Before I move on to the first result, I would like to introduce the basics of Q-Learning. The following
section also sheds light on the nature of the problem I am trying to solve.
1.1 Q-Learning
I assume a certain level of familiarity with the subject, and thus do not introduce Markov Decision
Processes or Temporal Difference Learning. Rather I go straight to the point.
Q-Learning is a specific type of TD algorithm that learns to approximate a Q-function, which
measures the quality of a state-action pair:

\[ Q : S \times A \rightarrow \mathbb{R} \tag{1} \]

A reward received t steps into the future is weighted by the factor γ^t, where γ is called the discount
rate; since the algorithm looks only one step ahead, the weight here is simply γ. Normally it
is set so that 0 < γ < 1, which means that a reward nearer in time matters more than one
further in the future. At each step the agent takes an action, receives a reward r, enters a new state, and
Q is updated. The key idea is the Bellman equation, which amounts to a basic value iteration for the Q-function.
\[ Q^{\text{new}}(s, a) = Q(s, a) + \alpha \big[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big] \tag{2} \]

Here Q^{new}(s, a) is the new Q-value, α the learning rate, γ the discount rate, R(s, a) the reward,
and max_{a'} Q(s', a') the maximum predicted value given the new state s' and all possible actions a'.
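As a concrete illustration, the sketch below applies the update in Eq. (2) to a tabular Q-function stored as a NumPy array indexed by (state, action). The function name and the default values of alpha and gamma are placeholders, not code from the project.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Apply one Q-Learning update (Eq. 2) to a tabular Q-function."""
    # Maximum predicted value of the new state over all possible actions.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q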
1.2 SARSA
The State-Action-Reward-State-Action algorithm (SARSA) learns a Markov decision process policy. It
follows a certain policy and learns the Q-values associated with it, i.e., it makes a promise about
taking a certain action in the future and keeps it. Q-Learning, on the other hand, updates an estimate
of the optimal function Q* based on the maximum estimated value in the next state (even if it does not
take the associated action in that state), i.e., it is optimistic. We say that SARSA is on-policy, while
Q-Learning is off-policy.

Figure 1: Example environment frame (left), success-to-fail ratio in every epoch (center), success
percentage (right).
\[ Q^{\text{new}}(s_t, a_t) = Q(s_t, a_t) + \alpha \big( R(s_t, a_t) + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big) \tag{3} \]
In terms of outcomes, Q-Learning takes more risks but can also achieve results closer to optimal,
while SARSA is more conservative and takes a safer approach to the problem.
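For comparison, here is a minimal sketch of the SARSA update in Eq. (3) in the same tabular setting as the sketch at the end of Section 1.1; the function name and default hyperparameters are again placeholders. The only difference is that it bootstraps from the action a_next that the policy will actually take, rather than from the maximum.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """Apply one SARSA update (Eq. 3); on-policy, so it uses the next action actually taken."""
    td_target = r + gamma * Q[s_next, a_next]  # no max over actions here
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q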
2 Custom environment for Q-Learning
2.1 Problem and outcome
There are three entities in our environment: a hiker, a house, and a tree. The goal is to train the hiker
to reach the house without hitting the tree. In Python this can be achieved with a Q-table,
a structure that holds the values of state-action pairs and is updated throughout training, using the Bellman
equation, so that the model knows which action is most beneficial in a given state. The reward
after each step can be negative (there are penalties) or positive (for reaching the destination). We can
see an example environment rendered in Fig. 1. Every 10000 episodes a new window is presented to the
user and the hiker tries to reach the house in real time. Fig. 1 also shows how the algorithm progresses
in the learning process. Those results were produced with the most important hyperparameters set to
numEpisodes = 150000, learnRate (α) = 0.1 and discount (γ) = 0.95.
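The training loop below is a minimal sketch of how such a Q-table can be trained. The HikerEnv interface (num_states, num_actions, reset(), step()), the reward values mentioned in the comment, and the epsilon-greedy exploration rate are assumptions made purely for illustration; the hyperparameter defaults match the ones reported above.

import numpy as np

def train(env, num_episodes=150_000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-Learning on a hypothetical HikerEnv (hiker, house, tree on a grid)."""
    Q = np.zeros((env.num_states, env.num_actions))  # the Q-table
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if np.random.random() < epsilon:
                a = np.random.randint(env.num_actions)
            else:
                a = int(np.argmax(Q[s]))
            # Assumed reward scheme: bonus for the house, penalty for the tree, small step cost.
            s_next, r, done = env.step(a)
            # Bellman update from Eq. (2).
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q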
3 SARSA and Q-Learning in gymnasium environment
For this experiment I have adapted my previous solution to an environment from the gymnasium library.
The algorithm still performs well in this setting. The hyperparameters used for this result are
numEpisodes = 100000, learnRate (α) = 0.1 and discount (γ) = 0.98.
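The sketch below shows how the same tabular approach transfers to the gymnasium API, using SARSA. The report does not name the exact environment, so the discrete toy-text task CliffWalking-v0 and the epsilon value are stand-ins; the other hyperparameters are the ones listed above. Replacing Q[s_next, a_next] with np.max(Q[s_next]) turns this into the Q-Learning variant.

import numpy as np
import gymnasium as gym

env = gym.make("CliffWalking-v0")  # placeholder environment, not necessarily the one used
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, num_episodes = 0.1, 0.98, 0.1, 100_000

def epsilon_greedy(s):
    if np.random.random() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[s]))

for _ in range(num_episodes):
    s, _info = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, terminated, truncated, _info = env.step(a)
        a_next = epsilon_greedy(s_next)
        # SARSA: bootstrap from the action that will actually be taken next.
        target = r if terminated else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
        done = terminated or truncated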
4 Investigating causality
4.1 Basics of causal inference
Note: All of the code and plots associated with the causality investigation below are in Jupyter notebooks.
Suppose that we want to find the causal effect of taking an action A on the outcome O. To define
the causal effect, consider two worlds: World 1 (the real world), in which the action A is taken and the
outcome O is observed, and World 2 (the counterfactual world), in which the action A is not taken but
everything else stays the same. The causal effect is the difference between the values of O attained in
the real world and in the counterfactual world. This gives a clear view on whether there is some
causal effect of A on O. Changing the action A while keeping the rest of the world the same is
called an intervention. More formally, the causal effect is a difference of expected values:

\[ E[O \mid A = 1] - E[O \mid A = 0] \]
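To make this definition concrete, the short simulation below generates both worlds for a hypothetical data-generating process (all numbers are illustrative, not from the project) and contrasts the true causal effect with a naive difference of conditional means, which is biased because a confounder influences both the action and the outcome.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: a confounder influences both A and O.
confounder = rng.normal(size=n)
action = (confounder + rng.normal(size=n) > 0).astype(int)

# Potential outcomes in the two worlds.
outcome_world1 = 2.0 + confounder + rng.normal(size=n)  # action taken (A = 1)
outcome_world2 = 0.5 + confounder + rng.normal(size=n)  # action not taken (A = 0)

# Causal effect: average difference between the two worlds (here ~1.5).
true_effect = (outcome_world1 - outcome_world2).mean()

# In practice only one world per individual is observed.
observed = np.where(action == 1, outcome_world1, outcome_world2)
naive = observed[action == 1].mean() - observed[action == 0].mean()

print(f"true effect ~ {true_effect:.2f}, naive difference of means ~ {naive:.2f}")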
There are four major steps in investigating causal inference:
4.2 Modelling
In the first step we represent domain knowledge as a causal model (for example, a graph). In order to
estimate the causal effect we also need to include two kinds of variables: confounders (which cause both
the action and the outcome) and instruments (which cause the action but do not directly affect the outcome).
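As an illustration of this step, the sketch below builds a causal model with the dowhy library on synthetic data. The data-generating process and the column names (Treatment, Outcome, w0, w1, chosen to mirror the variable names appearing in the output of Section 4.6) are assumptions, not the exact dataset used in the notebooks.

import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(0)
n = 5_000
w0, w1 = rng.normal(size=n), rng.normal(size=n)                  # confounders
treatment = (w0 + w1 + rng.normal(size=n) > 0).astype(int)       # action A
outcome = 1.0 * treatment + 2.0 * w0 - w1 + rng.normal(size=n)   # outcome O
df = pd.DataFrame({"Treatment": treatment, "Outcome": outcome, "w0": w0, "w1": w1})

# Step 1 (Modelling): encode domain knowledge as a causal model.
model = CausalModel(
    data=df,
    treatment="Treatment",
    outcome="Outcome",
    common_causes=["w0", "w1"],  # observed confounders
)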
4.3 Identification
Here we want to check whether the target quantity can be estimated using observed variables.
There are two main methods of identification. The first is the backdoor criterion: if we observe all
confounders of the action A and the outcome O, then the causal effect can be identified by conditioning
on all of those confounders,

\[ E[O \mid do(A = 1)] = E_{C}\big[\, E[O \mid A = 1, C] \,\big], \]

where C is the set of all confounders. The second method uses instrumental variables: we can estimate
the effect even when none of the confounders of A and O are observed. The influence of the instrument
I on O can be broken up into two parts, the influence of the instrument on the action and the influence
of the action on the outcome, which gives

\[ E[O \mid do(A = 1)] - E[O \mid do(A = 0)] = \frac{E[O \mid do(I = 1)] - E[O \mid do(I = 0)]}{E[A \mid do(I = 1)] - E[A \mid do(I = 0)]}. \]
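Continuing the dowhy sketch from the Modelling step, identification is a single call that checks (here via the backdoor criterion over w0 and w1) whether the effect can be expressed in terms of observed variables:

# Step 2 (Identification): can the causal effect be written in terms of observables?
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)  # lists backdoor / instrumental-variable estimands, if any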
4.4 Estimation
In this step we perform a statistical estimation to compute the target quantity identified in the previous
step. Two methods are used in the two estimators that I have submitted: one is based on linear
regression and the other uses a gradient-based method from sklearn. Unfortunately, I did
not manage to establish a connection between reinforcement learning based estimators and unobserved
confounders, which was supposed to be the last step of this project.
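Continuing the same sketch, the linear case can be estimated with dowhy's backdoor linear-regression method. The nonlinear case in the report uses sklearn's HistGradientBoostingRegressor; the exact wiring of that regressor into the estimator is not reproduced here.

# Step 3 (Estimation): compute the identified quantity from data.
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",  # linear-regression-based estimator
)
print(estimate.value)  # should be close to the true effect of 1.0 in this synthetic data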
4.5 Refutation
The last step is to check the reliability of our estimator. We need to verify whether the assumptions
made during estimator development resulted in a working solution; we seek to disprove the correctness
of this solution using the properties of a correct one. One such test checks whether the estimator returns
the expected outcome (an effect close to zero) when the action variable A is replaced by a random variable,
all other things kept the same.
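Finally, dowhy's placebo-treatment refuter replaces the action with a random variable and re-runs the estimation; for a sound estimator the new effect should be close to zero, as in the output shown in Section 4.6. This continues the same sketch as above.

# Step 4 (Refutation): swap the treatment for a placebo and check the effect vanishes.
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="placebo_treatment_refuter",
)
print(refutation)  # reports the original effect, the new (placebo) effect and a p-value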
4.6 Results (jupyter notebook)
I have modelled both linear and non-linear datasets. The linear regression based estimator, unsurprisingly,
yielded satisfying results for the linear case. In the nonlinear case the HistGradientBoostingRegressor() solver
from the sklearn library has been used. It estimates a value close to the real one and
responds properly to the refutation test, i.e., the placebo (new) effect can be rounded to 0. The
reinforcement learning based estimators have not been evaluated.
Nonlinear case - gradient-based solver:
*** Causal Estimate ***

Identified estimand
Estimand type: EstimandType.NONPARAMETRIC_ATE

Estimand : 1
Estimand name: backdoor
Estimand expression: d/d[Treatment] (E[Outcome|w1,w0])
Estimand assumption 1, Unconfoundedness: If U→Treatment and U→Outcome then
P(Outcome|Treatment,w1,w0,U) = P(Outcome|Treatment,w1,w0)

Realized estimand: b: Outcome~Treatment+w1+w0
Target units: ate

Estimate
Mean value: 1.2674648531032078
Effect estimates: [[1.26746485]]

Refute: Use a Placebo Treatment
Estimated effect: 0.9880992921829223
New effect: 0.00030557816700916565
p value: 0.2763353222059542
References
1. Causality for Machine Learning, Bernhard Schölkopf (https://arxiv.org/abs/1911.10500)
2. Data Driven Science and Engineering, Steven L. Brunton, J. Nathan Kutz, (https://faculty.washington.edu/sbrunton/da
3. Gymnasium environment documentation (https://gymnasium.farama.org)
4. dowhy library documentation (https://github.com/py-why/dowhy/tree/main/docs)