Reinforcement Learning Dissertation
Chapter 1

Experiments

1.1 Motivation

With the ultimate aim of leveraging equivariance for model-based RL, a series of experiments was carried out sequentially to investigate how to exploit known symmetries in a model-based environment.

Model-based RL is inherently more complex than actor-critic or value-based methods: not only does a policy need to be learned, but a model of the environment dynamics must be learned as well.

Due to the increased complexity of model-based methods, and the benefit of implementing existing methods while also providing a baseline for further experimentation, a model-free implementation was created.

1.2 Baseline

The baseline chosen was a proximal policy optimization (PPO) agent from PureJaxRL [Lu et al., 2022]. This baseline provides a training framework that trains a single agent concurrently on multiple environments.

The training regime is outlined below in pseudocode.



Algorithm 1 PureJaxRL PPO Agent Training Structure

Initialize agent: actor-critic $\pi_\theta$, $v_\phi$
Initialize replay buffer: $\mathcal{D}$
for Num Updates do
    Gain experience for Num Timesteps
    Store trajectories: $\mathcal{D}$.append($(S, A, S', R)$)
    Calculate GAE estimates from the stored timesteps
    for Num Epochs do
        Split GAE estimates into minibatches
        Minibatch SGD with Adam on $\pi_\theta$, $v_\phi$    ▷ optimizing the PPO losses
    end for
end for
return $\mathcal{D}$
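To make this structure concrete, the sketch below shows one way the GAE step of the loop might be written in JAX, using a reverse `lax.scan` over a single rollout. The function name `gae_estimates`, the array shapes, and the toy rollout are illustrative assumptions, not the PureJaxRL implementation itself.

```python
import jax
import jax.numpy as jnp

def gae_estimates(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout, via a reverse scan.

    rewards, values, dones: arrays of shape [T]; last_value: V(s_T) after the
    final transition. Returns (advantages, return_targets), both of shape [T].
    """
    next_values = jnp.concatenate([values[1:], jnp.array([last_value])])
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    deltas = rewards + gamma * next_values * (1.0 - dones) - values

    def step(carry, inputs):
        delta, done = inputs
        gae = delta + gamma * lam * (1.0 - done) * carry
        return gae, gae

    # reverse=True accumulates from the end of the trajectory backwards while
    # keeping the stacked outputs aligned with the original time axis.
    _, advantages = jax.lax.scan(step, jnp.zeros(()), (deltas, dones), reverse=True)
    return advantages, advantages + values

# Toy usage: a 5-step rollout with a terminal state at the end.
rewards = jnp.ones(5)
values = jnp.linspace(0.5, 1.5, 5)
dones = jnp.array([0.0, 0.0, 0.0, 0.0, 1.0])
advantages, targets = gae_estimates(rewards, values, last_value=0.0, dones=dones)
print(advantages.shape, targets.shape)  # (5,) (5,)
```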

1.3 Equivariant Actor-Critics

1.3.1 CartPole

To form a network equivariant to the group structure of CartPole, the actor network must be equivariant to both the identity and inversion operators. This report provides structures for equivariant G-CNNs for both CartPole and Catch that can easily be extended to other environments with known discrete symmetries.

In this section, the outline of the network design is described. In the Catch section 1.3.3, a more detailed description of how to extend the procedure to other groups is given. The group for CartPole contains two unique elements, in both state and action space. In state space the identity and inversion operators $e, r$ are,
\[
\ell^S_e = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad
\ell^S_r = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}. \tag{1.1}
\]

In the action space, the identity and inversion operators are,

\[
\ell^A_e = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\ell^A_r = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \tag{1.2}
\]

Thus, to parametrize an equivariant actor network, the network $\pi_\theta$ must satisfy,

\[
\pi_\theta(\ell^S_e s) = \ell^A_e \pi_\theta(s), \tag{1.3}
\]
\[
\pi_\theta(\ell^S_r s) = \ell^A_r \pi_\theta(s). \tag{1.4}
\]
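As a concrete check of these constraints, the snippet below builds the four representation matrices as arrays and verifies equations (1.3) and (1.4) numerically for an arbitrary policy function. The helper name `check_equivariance` and the random linear test policy are illustrative assumptions, not part of the report's implementation.

```python
import jax
import jax.numpy as jnp

# State-space representations of C2 = {e, r} for CartPole (equation 1.1).
L_S = {"e": jnp.eye(4), "r": -jnp.eye(4)}
# Action-space representations (equation 1.2): the identity and the swap.
L_A = {"e": jnp.eye(2), "r": jnp.array([[0.0, 1.0], [1.0, 0.0]])}

def check_equivariance(pi, s, atol=1e-5):
    """Check pi(L_S[g] @ s) == L_A[g] @ pi(s) for every group element g."""
    return all(
        bool(jnp.allclose(pi(L_S[g] @ s), L_A[g] @ pi(s), atol=atol))
        for g in ("e", "r")
    )

# Example: a generic linear policy is not equivariant, so the check fails.
W = jax.random.normal(jax.random.PRNGKey(0), (2, 4))
s = jax.random.normal(jax.random.PRNGKey(1), (4,))
print(check_equivariance(lambda x: W @ x, s))  # False
```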

Instead of using a G-CNN for the actor-critic, a simpler solution involves employing a network with only odd operations, such as the tanh activation, and excluding biases for all hidden representations. By definition, odd functions are equivariant to both the inversion and identity transformations, ensuring the network's equivariance. However, this does not address the issue of equivariance in the action space. To bridge this gap, a group convolution layer is incorporated to map between representations.

The network can be considered as a composition of $f_\theta : \mathcal{S} \to \mathbb{R}^{|H|}$, an odd embedding MLP, and $gc_\theta : \mathbb{R}^{|H|} \to \mathbb{R}^{|A|}$, a group convolution layer that "lifts" the equivariance to the action space. As such the parametric policy is,

\[
\pi_\theta(s) = gc_\theta(f_\theta(s)). \tag{1.5}
\]

The equivariance properties of the sub-networks with respect to the inversion operator are,

\[
f_\theta(-s) = -f_\theta(s), \tag{1.6}
\]
\[
gc_\theta(-x) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} gc_\theta(x), \tag{1.7}
\]

where $gc_\theta(x) = [P(A = a_0), P(A = a_1)]$ describes the distribution over the binary actions of CartPole.
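A minimal sketch of this construction in JAX is given below: a small bias-free tanh MLP for $f_\theta$ and a single-filter group convolution for $gc_\theta$. The layer sizes, initialisation, and function names are illustrative choices rather than the report's actual implementation.

```python
import jax
import jax.numpy as jnp

def init_params(key, state_dim=4, hidden=64):
    # No bias terms anywhere: a composition of linear maps and the odd tanh
    # activation is itself an odd function, so f(-s) = -f(s) by construction.
    k1, k2, k3 = jax.random.split(key, 3)
    return {
        "w1": jax.random.normal(k1, (state_dim, hidden)) / jnp.sqrt(state_dim),
        "w2": jax.random.normal(k2, (hidden, hidden)) / jnp.sqrt(hidden),
        "w_gc": jax.random.normal(k3, (hidden,)) / jnp.sqrt(hidden),  # group-conv filter
    }

def odd_embedding(params, s):
    # f_theta: odd embedding MLP (equation 1.6).
    h = jnp.tanh(s @ params["w1"])
    return jnp.tanh(h @ params["w2"])

def group_conv(params, x):
    # gc_theta: correlate one filter with x under each action of C2 = {e, r},
    # where r acts on the odd embedding by negation. The output is one logit
    # per group element, i.e. per CartPole action (equation 1.7).
    return jnp.stack([jnp.dot(params["w_gc"], x),    # identity element e
                      jnp.dot(params["w_gc"], -x)])  # inversion element r

def policy_logits(params, s):
    # pi_theta(s) = gc_theta(f_theta(s)) up to a softmax (equation 1.5).
    return group_conv(params, odd_embedding(params, s))

# Equivariance check: inverting the state swaps the action logits, and hence
# swaps the action probabilities after a softmax.
params = init_params(jax.random.PRNGKey(0))
s = jax.random.normal(jax.random.PRNGKey(1), (4,))
assert jnp.allclose(policy_logits(params, -s), policy_logits(params, s)[::-1], atol=1e-5)
```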

This G-CNN network for policy learning is the first novel contribution of this report. In comparison to the work of [Mondal et al., 2020], the policy-learning G-CNN is fully equivariant, rather than learning action values from an equivariant embedding. Additionally, the network parametrizes a policy directly rather than Q-values. Due to the network's end-to-end equivariance, agents parametrized by this network must take the same actions in states that are in the same orbit, which is not the case for the Q-value network of [Mondal et al., 2020], which only has an equivariant embedding.
Further, this network architecture has an advantage over Symmetrizer networks [van der Pol et al., 2020], in that it does not require large matrix inversions to solve for the parameters of the network, while still maintaining the same equivariance qualities.

1.3.2 Training Dynamics On the Cart-Pole Benchmark


With the network’s structure established, the benchmark task focuses on mastering
an expert policy within the CartPole environment. By default, CartPole imposes
a maximum episode length of 500 interactions. All non-terminal states yield a re-
ward of +1, setting the maximum episodic return at 500. As with most traditional
RL problems, the primary objective in CartPole is to optimize the agent’s episodic
return. Finding an expert policy in CartPole using standard deep learning methods
is relatively straightforward and primarily serves as an implementation benchmark.
For an equivariant network structure to truly enhance the quality of the learned pol-
icy, it should strive to approach the 500 episodic return benchmark with fewer MDP
interactions.
When comparing the learning dynamics of policy agents, it’s crucial to ensure
not just that the agent achieves expertise in the task but also that the policy learning
procedure remains stable across multiple random seeds. A random seed refers to
the initial state in which both the agent and environment begin. Keeping training
stability in mind, examining the performance under the least favourable random
seeds is informative about the training robustness. If an algorithm is particularly
sensitive to its initialization, its performance may be significantly impacted, and
may not converge over multiple random seeds.
For all experiments, we utilize a standard MLP for the critic without imposing
any equivariance constraints. While this setup might not yield optimal performance,
it’s essential to highlight that training is conducted across 128 random seeds to
ensure the stability of the agent’s learning.

Due to constraints on the equivariant network structures, it’s not always fea-
sible to maintain an identical number of parameters across networks. In instances
where the exact parameter count differs, we ensure that the depth of all networks re-
mains constant. We then adjust the width to achieve a parameter count that’s within
10% of the MLP baseline.
Refer to Figure 1.1 below, where the mean episodic return of three agents with distinct network architectures is illustrated. Despite the differences in their structures, all networks share the same training hyperparameters, as provided in the PureJaxRL [Lu et al., 2022] baselines. The three networks depicted are the MLP baseline from PureJaxRL, an implementation of the Symmetrizer network from [van der Pol et al., 2020], and the G-CNN policy network introduced in this report.

Figure 1.1: Left: Mean episodic returns for the CartPole agents across 128 random seeds, plotted against the number of experience time-steps in the MDP. Right: Mean cumulative episodic returns for the worst-performing tenth of the 128 random seeds, plotted against the number of experience time-steps in the MDP. Both plots are moving averages with windows of 10 time-steps, and two standard errors are plotted.

Both the Symmetrizer and the G-CNN are equivariant to the actions of the C2
group. This equivariance constraint requires a learned policy that respects the inversion symmetry present in CartPole. The equivariance should improve the sample efficiency of the agent, as any learning from one state additionally informs the agent about the policy for the other state in the orbit. This hypothesis is supported somewhat by the observed training dynamics.

Time-steps   Baseline    Symmetrizer   G-CNN

10,000       102 ± 5     116 ± 6       158 ± 8
100,000      260 ± 10    240 ± 10      330 ± 10
500,000      497 ± 1     500 ± 1       499 ± 1

Table 1.1: Cumulative episodic returns tabulated for the three network architectures. All episodic returns are reported with confidence intervals of two standard errors across 128 random seeds.

Over the first period of training, both the Symmetrizer and the G-CNN outperform the baseline. The Symmetrizer does not maintain this performance advantage. Our implementation uses the same network size and hyperparameters as the original paper. Despite this, the Symmetrizer agent fails to learn an expert policy in fewer steps than the baseline.

It should be noted that here the mean plus or minus two standard errors is plotted, in comparison to the median with upper and lower quartiles of cumulative returns plotted in [van der Pol et al., 2020]. The performance of the Symmetrizer is underwhelming, despite the implemented network being checked for equivariance. Further tuning of the hyperparameters may yield performance that improves upon the baseline's returns; however, this was not a primary concern of this report.

The G-CNN compares favourably with the baseline MLP implementation, while having slightly fewer parameters. It not only converges on average to an expert policy in fewer time-steps, but also has more favourable convergence behaviour under challenging initialization conditions. This can be seen on the right of Figure 1.1, where the bottom tenth percentile of cumulative returns still converges notably faster than that of the baseline.

Additionally, the mean episodic returns across all random seeds are tabulated at 10,000, 100,000, and 500,000 time-steps in Table 1.1. Upon closer inspection, the G-CNN agent incorporating the equivariant inductive bias significantly outperforms the baseline.

While both models have similar parameter counts, the structure of the G-CNNs
makes their forward passes more computationally intensive: the G-CNN architecture requires twice as many operations per forward pass, one set for each group action. However, in a scenario like CartPole, where the networks
are relatively small and inexpensive to evaluate—and where computation can be ef-
ficiently parallelized—both models can train 128 random seeds in under a minute.
This speed is achieved using the hyperparameters listed in the Appendix and exe-
cuted on a contemporary graphics card.
Encouraged by the promising results from the initial experiment, we decided to
explore a new environment to determine whether the equivariance constraint could
further enhance performance.

1.3.3 Catch
To demonstrate that G-CNN actor-critics can be extended to other environments with different group actions, we implemented the equivariant actor-critic for the Bsuite Catch environment [Osband et al., 2020]. For the Catch environment, constructing an equivariant network is challenging. Unlike the CartPole case, where individual layers can be made equivariant to the $C_2$ representation, in Catch the entire network must be built using group convolutions. As in previous scenarios, we impose an equivariance constraint on the actor $\pi_\theta(s)$.

In the context of Catch, the input state space is represented as $\mathcal{S} = [0, 1]^{50}$. Instead of detailing the cumbersome matrix representation of the reflection action $r$, we represent it as $\ell^S_r$. This action reflects the x-coordinate of both the ball and the paddle about the centre column of the grid. Thus, the equivariance constraint is expressed as:
\[
\pi_\theta(\ell^S_r s) = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \pi_\theta(s). \tag{1.8}
\]
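To make the reflection action and the constraint concrete, the sketch below assumes the standard Bsuite Catch observation, a 10 × 5 binary board flattened to 50 entries; the reflection flips the board left-to-right, and the action permutation of equation (1.8) swaps left and right while fixing the no-op. Names and shapes are illustrative.

```python
import jax.numpy as jnp

ROWS, COLS = 10, 5  # standard Bsuite Catch board

def reflect_state(s_flat):
    # l_r^S: reflect the x-coordinate of the ball and paddle by flipping the
    # board left-to-right, acting on the flattened 50-dimensional observation.
    return jnp.fliplr(s_flat.reshape(ROWS, COLS)).reshape(-1)

# l_r^A: the action permutation from equation (1.8); actions are
# (left, stay, right), and reflection swaps left and right.
L_A_r = jnp.array([[0.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])

def satisfies_constraint(pi, s_flat, atol=1e-5):
    # Equivariance constraint: pi(l_r^S s) == l_r^A pi(s).
    return jnp.allclose(pi(reflect_state(s_flat)), L_A_r @ pi(s_flat), atol=atol)

# The uniform policy trivially satisfies the constraint.
uniform = lambda s: jnp.full(3, 1.0 / 3.0)
print(satisfies_constraint(uniform, jnp.zeros(ROWS * COLS)))  # True
```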

To construct a G-CNN that is equivariant to the group actions, a group action on the hidden layers must be determined. Consider a group convolution input layer $f : \mathcal{S} \to \mathcal{H} \in \mathbb{R}^{|G| \times |H|}$, commonly referred to as a lifting layer. A lifting layer applies a group action to the input and then passes the transformed input through a dense/convolution layer $g : \mathcal{S} \to \mathbb{R}^{|H|}$. In the full layer the input is transformed by every group action and then passed through $g$. In the absence of any pooling, this gives a hidden representation whose shape is the hidden dimension of the equivalent dense/convolution layer plus a new axis the size of the group. To illustrate the new equivariance constraint, consider a single layer,

\[
g(s) = \vec{h} \in \mathbb{R}^{|H|}. \tag{1.9}
\]

When formed into a group convolution,

\[
f(s) = \begin{pmatrix} g(s) \\ g(\ell^S_1 s) \\ g(\ell^S_2 s) \\ \vdots \\ g(\ell^S_{|G|} s) \end{pmatrix}. \tag{1.10}
\]
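A lifting layer of this kind might be sketched in JAX as follows for the two-element group of Catch. The wrapped layer `g_apply` and the parameter shapes are assumed for illustration and are not the report's implementation.

```python
import jax
import jax.numpy as jnp

def g_apply(params, s):
    # The wrapped dense layer g: S -> R^{|H|} (here a single linear map + tanh).
    return jnp.tanh(s @ params["w"])

def lifting_layer(params, s, group_actions):
    # f: S -> R^{|G| x |H|}; apply every group action to the input, then pass
    # each transformed input through the same layer g (equation 1.10).
    return jnp.stack([g_apply(params, l(s)) for l in group_actions])

# Group actions of C2 on the flattened Catch board: identity and reflection.
ROWS, COLS = 10, 5
group_actions = [
    lambda s: s,                                              # identity e
    lambda s: jnp.fliplr(s.reshape(ROWS, COLS)).reshape(-1),  # reflection r
]

params = {"w": jax.random.normal(jax.random.PRNGKey(0), (ROWS * COLS, 32)) * 0.1}
s = jax.random.bernoulli(jax.random.PRNGKey(1), 0.1, (ROWS * COLS,)).astype(jnp.float32)
h = lifting_layer(params, s, group_actions)
print(h.shape)  # (2, 32): a new group axis of size |G| in front of the hidden axis
```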

Once a group action is applied to the input, the output values undergo permutation. While the exact permutation is contingent on the group, the permutations' structure can be readily determined due to group closure:

\[
f(\ell^S_n s) = \begin{pmatrix} g(\ell^S_n s) \\ g(\ell^S_n \ell^S_1 s) \\ \vdots \\ g(\ell^S_n \ell^S_{|G|} s) \end{pmatrix}
= \begin{pmatrix} g(\ell^S_i s) \\ g(\ell^S_j s) \\ \vdots \\ g(\ell^S_k s) \end{pmatrix}
= P_n f(s), \tag{1.11}
\]

where $P_n$ is a permutation matrix defined by the group; there is a unique permutation matrix for each group element. To construct subsequent layers that are also equivariant, we exploit the fact that the output is a permutation of vectors. Consider a subsequent hidden layer $h : \mathcal{H} \to \mathcal{H}'$ that must maintain equivariance to the group $G$. This is achieved by treating each $P_n$ as the group action and ensuring that the output is equivariant to its application. The new layer can be thought of as taking a vector of vectors, where it must be equivariant to the vectors' permutation,

\[
h\!\left( P_i \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{|G|} \end{pmatrix} \right)
= P_i \, h\!\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_{|G|} \end{pmatrix}. \tag{1.12}
\]
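One simple way to satisfy equation (1.12) is sketched below: each group channel is passed through a shared weight matrix, and a second weight matrix acts on the sum over channels. Because the channel sum is invariant to permutations, permuting the input channels permutes the output channels identically. This is an illustrative construction satisfying the constraint, not necessarily the exact hidden layer used in the report.

```python
import jax
import jax.numpy as jnp

def equivariant_hidden_layer(params, h):
    # h has shape (|G|, d_in). Each group channel gets the same per-channel
    # weight w1, plus a mixing term w2 applied to the channel sum. Since the
    # sum is invariant to permuting the channels, permuting the input channels
    # permutes the output channels identically: h(P v) = P h(v).
    per_channel = h @ params["w1"]              # (|G|, d_out)
    mixed = jnp.sum(h, axis=0) @ params["w2"]   # (d_out,)
    return jnp.tanh(per_channel + mixed[None, :])

# Check equation (1.12) for the two-element group: P swaps the two channels.
key1, key2, key3 = jax.random.split(jax.random.PRNGKey(0), 3)
params = {"w1": jax.random.normal(key1, (32, 32)) * 0.1,
          "w2": jax.random.normal(key2, (32, 32)) * 0.1}
v = jax.random.normal(key3, (2, 32))
P = jnp.array([[0.0, 1.0], [1.0, 0.0]])
lhs = equivariant_hidden_layer(params, P @ v)
rhs = P @ equivariant_hidden_layer(params, v)
assert jnp.allclose(lhs, rhs, atol=1e-5)
```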

Given this equivariant structure for each layer, when constructing a network out of an input layer and subsequent hidden layers that all adhere to the equivariance constraint, what remains is to identify an appropriate output. In scenarios such as an actor network operating within a discrete action space, the permutation representation proves ideal. For instance, in the Catch environment, the probability of moving left in state $s$ should mirror the probability of moving right in its reflected state $s' = \ell^S_r s$. Since the network produces logits (un-normalized log-probabilities), their permutation upon reflection possesses the desired properties. However, when a permutation does not fit the necessary group action, establishing equivariance becomes more intricate. We will delve into this challenge in subsequent discussions.

With the framework of an equivariant network in place, similar to the CartPole scenario, we established both an equivariant actor-critic and an MLP actor-critic. These were designed with two hidden layers and approximately equivalent parameter counts. For this experiment, the Symmetrizer was omitted due to the difficulties encountered when trying to optimize its performance in the preceding experiment. Given the simplicity of the Catch task relative to CartPole, agents were allotted 20,000 MDP interactions for learning. The MLP baseline once again employed the PureJaxRL baseline actor-critic, albeit with minor modifications to adapt it to the new environment. Both network architectures were assigned identical hyperparameters.

Upon examining the qualitative differences in the episodic return curves, we


observe a familiar yet more pronounced trend. For Catch, the inductive bias of equivariance seems especially beneficial.

Figure 1.2: Left: Mean episodic returns for the Catch agents across 128 random seeds, plotted against the number of interaction time-steps in the MDP. Right: Mean cumulative episodic returns for the worst-performing tenth of the 128 random seeds, plotted against the number of experience time-steps in the MDP. Both plots are moving averages with windows of 10 time-steps, and two standard errors are plotted.

Time-steps   Baseline        G-CNN

2,000        −0.51 ± 0.07    −0.06 ± 0.09
10,000       0.70 ± 0.06     0.95 ± 0.03
20,000       0.875 ± 0.04    0.96 ± 0.02

Table 1.2: Cumulative episodic returns tabulated for the two network architectures. All episodic returns are reported with confidence intervals of two standard errors across 128 random seeds.

Not only does the agent with the G-CNN converge to an expert policy more rapidly than its counterpart, but the bottom decile of random seeds also exhibits markedly improved performance. Furthermore, referring to the tabulated results in Table 1.2, the superiority of the equivariant network architecture is evident: it achieves a higher proficiency in Catch in merely half the time taken by the alternative.

1.4 Conclusion
Equivariant actors were tested in both environments and showed substantially more efficient learning. These results are in line with other experiments, where agents trained with inductive biases about the task's symmetry outperform conventional methods. Further, they also demonstrated the same improvement in cases with challenging initialization conditions.


The equivariance constraint itself, however, is quite limiting. For any given task the group must be known beforehand, and it must also be an exact symmetry of the task. In many settings these conditions are rarely met, especially for discrete symmetry groups.
Additionally, there is a forward-pass cost to the equivariant network structure, especially with larger groups, despite the same number of parameters being used. With the way G-CNNs are constructed, the number of operations in a forward pass of the network is O(|G|). As such, the inference time increases when including the inductive bias. However, because these operations can be executed in parallel, the increase in evaluation time is limited. For a rough estimate of the overhead the G-CNN adds, the implementations of the G-CNN for Catch and the MLP were benchmarked. The MLP's mean forward pass over 10,000 states was 27.5 ± 0.1 µs, compared with 29.9 ± 0.1 µs for the G-CNN: an 8% increase in evaluation time, with similar standard error across 1,000 tests.
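A benchmark of this kind might be reproduced with a sketch like the following, which times a jitted forward pass and calls `block_until_ready` to account for JAX's asynchronous dispatch. The stand-in network, batch size, and repeat count are illustrative, not the exact setup behind the numbers above.

```python
import time
import jax
import jax.numpy as jnp

def benchmark_forward(apply_fn, params, states, repeats=1000):
    """Return mean and standard error (in microseconds) of a jitted forward pass."""
    fn = jax.jit(apply_fn)
    fn(params, states).block_until_ready()  # compile once before timing
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(params, states).block_until_ready()  # wait for async dispatch to finish
        times.append((time.perf_counter() - start) * 1e6)
    times = jnp.array(times)
    return times.mean(), times.std() / jnp.sqrt(repeats)

# Illustrative stand-in network: a single dense layer over 10,000 Catch states.
params = {"w": jax.random.normal(jax.random.PRNGKey(0), (50, 3)) * 0.1}
states = jax.random.bernoulli(jax.random.PRNGKey(1), 0.1, (10_000, 50)).astype(jnp.float32)
mean_us, se_us = benchmark_forward(lambda p, s: s @ p["w"], params, states)
print(f"{mean_us:.1f} ± {se_us:.1f} µs per forward pass")
```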
Despite the mild performance overhead, if a task is known to possess a discrete group symmetry, constructing a network with an inductive bias that accounts for this is demonstrably useful in increasing the training efficiency and reliability of an agent, and should be exploited. However, the case of an exact symmetry is not universal.
The limited applicability of equivariant models motivates the next section of the report. In the case where the environment has an exact or approximate group-structured symmetry, building a world model for model-based RL that contains an inductive bias towards that symmetry may further improve the agent's learning dynamics, while not requiring the exact symmetry to be present.
Bibliography

[Lu et al., 2022] Lu, C., Kuba, J., Letcher, A., Metz, L., Schroeder de Witt, C.,
and Foerster, J. (2022). Discovered policy optimisation. Advances in Neural
Information Processing Systems, 35:16455–16468.

[Mondal et al., 2020] Mondal, A. K., Nair, P., and Siddiqi, K. (2020). Group equiv-
ariant deep reinforcement learning. arXiv preprint arXiv:2007.03437.

[Osband et al., 2020] Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E.,
Saraiva, A., McKinney, K., Lattimore, T., Szepesvári, C., Singh, S., Van Roy, B.,
Sutton, R., Silver, D., and van Hasselt, H. (2020). Behaviour suite for reinforce-
ment learning. In International Conference on Learning Representations.

[van der Pol et al., 2020] van der Pol, E., Worrall, D. E., van Hoof, H., Oliehoek, F. A., and Welling, M. (2020). MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33.
