Reinforcement Learning Dissertation
Experiments
1.1 Motivation
With the ultimate aim of leveraging equivariance for model-based RL, a series of experiments was carried out sequentially to investigate how to exploit known symmetries in a model-based environment.
1.2 Baseline
The baseline chosen was a proximal policy optimization agent from PureJaxRL [Lu et al., 2022]. This baseline provides a training framework that trains a single agent concurrently on multiple environments.
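As an illustration of this concurrent set-up (and not PureJaxRL's actual API), the sketch below steps one shared policy across a batch of environments with jax.vmap; the policy, dynamics, and shapes are placeholder assumptions.

```python
import jax
import jax.numpy as jnp

def policy_logits(params, obs):
    # Hypothetical linear policy: a 4-dim CartPole observation to 2 action logits.
    return obs @ params["w"] + params["b"]

def env_step(state, action):
    # Placeholder dynamics standing in for the real environment step function.
    return state + 0.01 * (2.0 * action - 1.0)

def rollout_step(params, states, key):
    # One synchronous step of a single shared agent across all environments.
    logits = jax.vmap(lambda s: policy_logits(params, s))(states)  # (num_envs, 2)
    actions = jax.random.categorical(key, logits)                  # (num_envs,)
    next_states = jax.vmap(env_step)(states, actions)              # (num_envs, 4)
    return next_states, actions

params = {"w": jnp.zeros((4, 2)), "b": jnp.zeros(2)}
states = jnp.zeros((8, 4))  # eight CartPole environments stepped in lock-step
states, actions = rollout_step(params, states, jax.random.PRNGKey(0))
```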
1.3 Equivariant Actor-Critics
1.3.1 CartPole
To form a network equivariant to the group structure of CartPole, the actor network must be equivariant to both the identity and inversion operators. This report provides structures for equivariant G-CNNs for both CartPole and Catch, which can easily be extended to other environments with known discrete symmetries.
In this section, the outline of the network design is described. In the Catch section 1.3.3, a more detailed description of how to extend the procedure to other groups is outlined. The group for CartPole contains two unique elements, acting on both the state and action space. In state space, the identity and inversion operators e, r are
\[
\ell^S_e = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad
\ell^S_r = \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}. \tag{1.1}
\]
In the action space, the identity and inversion operators are
\[
\ell^A_e = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\ell^A_r = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \tag{1.2}
\]
The policy network must then satisfy the equivariance constraints
\[
\pi_\theta(\ell^S_e s) = \ell^A_e \pi_\theta(s), \tag{1.3}
\]
\[
\pi_\theta(\ell^S_r s) = \ell^A_r \pi_\theta(s). \tag{1.4}
\]
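To make these constraints concrete, the sketch below writes the four representations of (1.1)-(1.2) as arrays and measures how far an arbitrary actor is from satisfying (1.3)-(1.4); the helper name equivariance_gap is illustrative and not taken from the implementation.

```python
import jax.numpy as jnp

# State- and action-space representations of C2 for CartPole, eqs. (1.1)-(1.2).
L_S_E, L_S_R = jnp.eye(4), -jnp.eye(4)
L_A_E, L_A_R = jnp.eye(2), jnp.array([[0., 1.],
                                      [1., 0.]])

def equivariance_gap(policy, s):
    # Largest violation of (1.3)-(1.4) at state s; zero for an equivariant actor.
    gap_e = jnp.abs(policy(L_S_E @ s) - L_A_E @ policy(s)).max()
    gap_r = jnp.abs(policy(L_S_R @ s) - L_A_R @ policy(s)).max()
    return jnp.maximum(gap_e, gap_r)
```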
Instead of using a G-CNN for the actor-critic, a simpler solution involves employing a network with only odd operations, such as the tanh activation, and excluding biases for all hidden representations. By definition, odd functions are equivariant to both the inversion and identity transformations, ensuring the network's equivariance. However, this doesn't address the issue of equivariance in the action space. To bridge this gap, a group convolution layer is incorporated to map between representations.
The equivariance properties of the sub-networks with respect to the inversion operator are
\[
f_\theta(-s) = -f_\theta(s), \tag{1.6}
\]
\[
gc_\theta(-x) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} gc_\theta(x). \tag{1.7}
\]
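A minimal sketch of this construction is given below; the layer widths, parameter names, and initialisation are illustrative assumptions rather than the exact networks trained in the experiments. The two assertions check (1.6) and (1.7) directly: the bias-free tanh stack is odd, and the C2 group convolution swaps its two logits when its input is negated.

```python
import jax
import jax.numpy as jnp

def f(params, s):
    # Odd sub-network: no biases and odd activations, so f(-s) == -f(s).
    h = s
    for W in params["hidden"]:
        h = jnp.tanh(h @ W)
    return h

def gc(params, h):
    # C2 group convolution: the same filter w is applied to every
    # group-transformed copy of h, here {h, -h}, giving one logit per action.
    w = params["w"]
    return jnp.stack([w @ h, w @ (-h)])

def actor_logits(params, s):
    return gc(params, f(params, s))

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = {
    "hidden": [jax.random.normal(k1, (4, 32)), jax.random.normal(k2, (32, 32))],
    "w": jax.random.normal(k3, (32,)),
}
s = jax.random.normal(key, (4,))
P = jnp.array([[0., 1.], [1., 0.]])
# Checks of (1.6) and (1.7): inverting the state swaps the two action logits.
assert jnp.allclose(f(params, -s), -f(params, s), atol=1e-5)
assert jnp.allclose(actor_logits(params, -s), P @ actor_logits(params, s), atol=1e-5)
```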
This G-CNN network for policy learning is the first novel contribution of this report. In comparison to the work of [Mondal et al., 2020], the policy-learning G-CNN is fully equivariant, rather than learning action values from an equivariant embedding. Additionally, the network parametrizes a policy directly rather than Q-values. Due to the network's end-to-end equivariance, in contrast to the Q-value approach, the symmetry is guaranteed to hold for the policy itself rather than only for an intermediate embedding.
Due to constraints on the equivariant network structures, it's not always feasible to maintain an identical number of parameters across networks. In instances where the exact parameter count differs, we ensure that the depth of all networks remains constant. We then adjust the width to achieve a parameter count that's within 10% of the MLP baseline.
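The parameter-matching check itself is straightforward; a generic sketch over parameter pytrees is shown below, with the helper names being illustrative.

```python
import jax

def n_params(params):
    # Total number of scalar parameters in a pytree of arrays.
    return sum(leaf.size for leaf in jax.tree_util.tree_leaves(params))

def widths_match(candidate_params, baseline_params, tol=0.10):
    # True when the candidate is within 10% of the baseline's parameter count.
    n_c, n_b = n_params(candidate_params), n_params(baseline_params)
    return abs(n_c - n_b) <= tol * n_b
```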
Refer to Figure 1.1 below, where the mean episodic return of three agents with distinct network architectures is illustrated. Despite the differences in their structures, all networks share the same training hyperparameters, as provided in the PureJaxRL [Lu et al., 2022] baselines. The three networks depicted are the MLP baseline from PureJaxRL, an implementation of the Symmetrizer network from [van der Pol et al., 2020], and the G-CNN policy network introduced in this report.
Figure 1.1: Left: Mean episodic returns for the CartPole agents across 128 random seeds, plotted against the number of experience time-steps in the MDP. Right: The mean cumulative episodic returns of the worst-performing 128 random seeds against the number of experience time-steps in the MDP. Both of the plots are moving averages, with windows of 10 time-steps. Additionally, all plots have two standard errors plotted.
Both the Symmetrizer and the G-CNN are equivariant to the actions of the C2 group. This equivariance constraint requires a learned policy that respects the inversion symmetry present in CartPole. The equivariance should improve the sample efficiency of the agent, as any learning from one state additionally informs the agent about the policy for the other state in the orbit. This hypothesis is somewhat supported by the observed training dynamics. Over the first period of training, both the Symmetrizer and the G-CNN outperform the baseline. The Symmetrizer, however, does not maintain this performance advantage. Our implementation uses the same network size and the same network hyperparameters as the original paper. Despite this, the Symmetrizer agent fails to learn an expert policy in fewer steps than the baseline.

Table 1.1: Cumulative episodic returns tabulated for the three network architectures. All episodic returns are recorded with confidence intervals of two standard errors across 128 random seeds.
It should be noted that here the mean plus or minus two standard deviations is plotted, in comparison to the median and the upper and lower quartiles of cumulative returns plotted in [van der Pol et al., 2020]. The performance of the Symmetrizer is underwhelming, despite the implemented network being checked for equivariance. Further tuning of the hyperparameters may yield performance that improves upon the baseline's returns; however, this was not a primary concern in this report.
Additionally, the mean episodic returns across all random seeds are tabulated at 10,000, 100,000, and 500,000 time-steps in Table 1.1. Upon closer inspection, the G-CNN agent incorporating the equivariant inductive bias significantly outperforms the baseline.
While both models have similar parameter counts, the structure of the G-CNNs makes their forward passes more computationally intensive. This is due to the G-CNN evaluating its layers once for every group element, so each hidden representation carries an additional axis of the size of the group.
1.3.3 Catch
To demonstrate that G-CNN actor-critics can be extended to other environments with different group actions, we implemented the equivariant actor-critic for the Bsuite Catch environment [Osband et al., 2020]. For the Catch environment, constructing an equivariant network is more challenging. Unlike the CartPole case, where layers can be made equivariant to the C2 representation, in Catch the entire network must be built from group convolutions. As in previous scenarios, we impose an equivariance constraint on the actor $\pi_\theta(s)$.
In the context of Catch, the input state space is $S = [0,1]^{50}$. Instead of detailing the cumbersome matrix representation of the reflection action $r$, we denote it $\ell^S_r$. This action maps the $x$-coordinate of both the ball and the paddle to its mirror image about the central column of the board. Thus, the equivariance constraint is expressed as:
\[
\pi_\theta(\ell^S_r s) = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix} \pi_\theta(s). \tag{1.8}
\]
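To make the reflection and its action representation concrete, a small sketch follows; the 10x5 board shape is the standard Bsuite Catch observation, and the helper names are illustrative rather than taken from the implementation.

```python
import jax.numpy as jnp

# Action-space representation of the reflection r, as in eq. (1.8).
P_A = jnp.array([[0., 0., 1.],
                 [0., 1., 0.],
                 [1., 0., 0.]])

def reflect_state(s):
    # Mirror the flattened observation about the board's central column.
    board = s.reshape(10, 5)  # assumed Bsuite Catch layout: 10 rows x 5 columns
    return board[:, ::-1].reshape(-1)
```

An actor satisfying (1.8) then obeys actor(reflect_state(s)) = P_A @ actor(s) for every state s, which is the property checked on the implemented network.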
In each group convolution layer, the input is first transformed by every group action and then passed through the base layer g. In the absence of any pooling, this gives a hidden representation with the hidden dimensions of the equivalent dense/convolutional layer, plus a new axis that is the size of the group. To illustrate the new equivariance constraint, consider a single layer whose output stacks g evaluated on every transformed copy of the input,
\[
f(s) = \begin{pmatrix} g(\ell^S_1 s) \\ \vdots \\ g(\ell^S_{|G|} s) \end{pmatrix}.
\]
Once a group action is applied to the input, the output values undergo permutation.
While the exact permutation is contingent on the group, the permutations’ structure
can be readily determined due to group closure:
\[
f(\ell^S_n s) = \begin{pmatrix} g(\ell^S_n \ell^S_1 s) \\ g(\ell^S_n \ell^S_2 s) \\ \vdots \\ g(\ell^S_n \ell^S_{|G|} s) \end{pmatrix}
= \begin{pmatrix} g(\ell^S_i s) \\ g(\ell^S_j s) \\ \vdots \\ g(\ell^S_k s) \end{pmatrix}
= P f(s). \tag{1.11}
\]
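A brief sketch of this lifting behaviour for the C2 reflection in Catch is given below; the base map g, the weight shapes, and the reflect_matrix helper are illustrative assumptions rather than the implemented network. Because the reflection is an involution, acting on the input only swaps the two group channels of the layer's output, exactly the permutation P of (1.11).

```python
import jax
import jax.numpy as jnp

def lift(g, group_actions, s):
    # Eq. (1.11): apply the base map g to every group-transformed copy of s.
    return jnp.stack([g(l @ s) for l in group_actions])  # shape (|G|, hidden)

def reflect_matrix(rows=10, cols=5):
    # Permutation matrix mirroring the column axis of a flattened board.
    idx = jnp.arange(rows * cols).reshape(rows, cols)[:, ::-1].reshape(-1)
    return jnp.eye(rows * cols)[idx]

R = reflect_matrix()
group = [jnp.eye(50), R]  # C2: identity and reflection
g = lambda x: jnp.tanh(x @ jax.random.normal(jax.random.PRNGKey(0), (50, 16)))

s = jax.random.normal(jax.random.PRNGKey(1), (50,))
# Acting with R permutes the group axis: the two channels are swapped.
assert jnp.allclose(lift(g, group, R @ s)[::-1], lift(g, group, s), atol=1e-5)
```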
Given this equivariant structure for each layer, when constructing a network out of an input layer and subsequent hidden layers that all adhere to the equivariance constraint, what remains is to identify an appropriate output. In scenarios such as an actor network operating within a discrete action space, the permutation representation proves ideal. For instance, in the Catch environment, the probability of moving left in state $s$ should equal the probability of moving right in its reflected state $s' = \ell^S_r s$. Since the network produces logits (un-normalized probabilities), their permutation upon reflection possesses the desired properties. However, when a permutation does not fit the necessary group action, establishing equivariance becomes more intricate. We delve into this challenge in subsequent discussions.
Figure 1.2: Left: Mean episodic returns for the Catch agents across 128 random seeds
plotted against number of interaction time-steps in the MDP. Right: The mean
cumulative episodic returns of the worst performing 128 random seeds against
number of experience time-steps in the MDP. Both of the plots are moving
averages, with windows of 10 time-steps. Additionally, two standard errors are
plotted.
Table 1.2: Cumulative episodic returns tabulated for the two network architectures. All
episodic returns are recorded with confidence intervals of two standard errors
across 128 random seeds.
1.4 Conclusion
Equivariant actors were tested in both environments and showed substantially more efficient learning. These results are in line with other experiments, in which agents trained with inductive biases about the task's symmetry outperform conventional methods. Further, they also demonstrated the same improvement in cases
References
[Lu et al., 2022] Lu, C., Kuba, J., Letcher, A., Metz, L., Schroeder de Witt, C., and Foerster, J. (2022). Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455-16468.

[Mondal et al., 2020] Mondal, A. K., Nair, P., and Siddiqi, K. (2020). Group equivariant deep reinforcement learning. arXiv preprint arXiv:2007.03437.

[Osband et al., 2020] Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvári, C., Singh, S., Van Roy, B., Sutton, R., Silver, D., and van Hasselt, H. (2020). Behaviour suite for reinforcement learning. In International Conference on Learning Representations.

[van der Pol et al., 2020] van der Pol, E., Worrall, D. E., van Hoof, H., Oliehoek, F. A., and Welling, M. (2020). MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33.