MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning
Abstract
This paper introduces MDP homomorphic networks for deep reinforcement learn-
ing. MDP homomorphic networks are neural networks that are equivariant under
symmetries in the joint state-action space of an MDP. Current approaches to deep
reinforcement learning do not usually exploit knowledge about such structure. By
building this prior knowledge into policy and value networks using an equivariance
constraint, we can reduce the size of the solution space. We specifically focus
on group-structured symmetries (invertible transformations). Additionally, we
introduce an easy method for constructing equivariant network layers numerically,
so the system designer need not solve the constraints by hand, as is typically done.
We construct MDP homomorphic MLPs and CNNs that are equivariant under either
a group of reflections or rotations. We show that such networks converge faster
than unstructured baselines on CartPole, a grid world and Pong.
1 Introduction
This paper considers learning decision-making systems that exploit symmetries in the structure of the
world. Deep reinforcement learning (DRL) is concerned with learning neural function approximators
for decision making strategies. While DRL algorithms have been shown to solve complex, high-
dimensional problems [35, 34, 26, 25], they are often used in problems with large state-action spaces,
and thus require many samples before convergence. Many tasks exhibit symmetries, easily recognized
by a designer of a reinforcement learning system. Consider the classic control task of balancing a
pole on a cart. Balancing a pole that falls to the right requires an equivalent, but mirrored, strategy to
one that falls to the left. See Figure 1. In this paper, we exploit knowledge of such symmetries in the
state-action space of Markov decision processes (MDPs) to reduce the size of the solution space.
We use the notion of MDP homomorphisms [32, 30] to formalize these symmetries. Intuitively, an
MDP homomorphism is a map between MDPs, preserving the essential structure of the original
MDP, while removing redundancies in the problem description, i.e., equivalent state-action pairs. The
removal of these redundancies results in a smaller state-action space, upon which we may more easily
build a policy. While earlier work has been concerned with discovering an MDP homomorphism for
a given MDP [32, 30, 27, 31, 6, 39], we are instead concerned with how to construct deep policies,
satisfying the MDP homomorphism. We call these models MDP homomorphic networks.
MDP homomorphic networks use experience from one state-action pair to improve the policy for all ‘equivalent’ pairs. See Section 2.1 for a definition. They do this by tying the weights for two states if they are equivalent under a transformation chosen by the designer, such as s and L[s] in Figure 1. Such weight-tying follows a similar principle to the use of convolutional networks [18], which are equivariant to translations of the input [11]. In particular, when equivalent state-action pairs can be related by an invertible transformation, which we refer to as group-structured, we show that the policy network belongs to the class of group-equivariant neural networks [11, 46]. Equivariant neural networks are a class of neural networks with built-in symmetries [11, 12, 46, 43, 41]. They are a generalization of convolutional neural networks (which exhibit translation symmetry) to transformation groups (group-structured equivariance) and transformation semigroups [47] (semigroup-structured equivariance). They have been shown to reduce sample complexity for classification tasks [46, 44] and also to be universal approximators of symmetric functions [48]. We borrow from the literature on group equivariant networks to design policies that tie weights for state-action pairs given their equivalence classes, with the goal of reducing the number of samples needed to find good policies. Furthermore, we can use the MDP homomorphism property to design not just policy networks, but also value networks and even environment models. MDP homomorphic networks are agnostic to the type of model-free DRL algorithm, as long as an appropriate transformation on the output is given. In this paper we focus on equivariant policy and invariant value networks. See Figure 1 for an example policy.

Figure 1: Example state-action space symmetry. Pairs (s, ←) and (L[s], →) (and by extension (s, →) and (L[s], ←)) are symmetric under a horizontal flip. Constraining the set of policies to those where π(s, ←) = π(L[s], →) reduces the size of the solution space.
An additional contribution of this paper is a novel numerical way of finding equivariant layers for
arbitrary transformation groups. The design of equivariant networks imposes a system of linear
constraint equations on the linear/convolutional layers [12, 11, 46, 43]. Solving these equations has
typically been done analytically by hand, which is a time-consuming and intricate process, barring
rapid prototyping. Rather than requiring analytical derivation, our method only requires that the
system designer specify input and output transformation groups of the form {state transformation,
policy transformation}. We provide Pytorch [29] implementations of our equivariant network layers,
and implementations of the transformations used in this paper. We also experimentally demonstrate
that exploiting equivalences in MDPs leads to faster learning of policies for DRL.
Our contributions are two-fold:
• We draw a connection between MDP homomorphisms and group equivariant networks,
proposing MDP homomorphic networks to exploit symmetries in decision-making problems;
• We introduce a numerical algorithm for the automated construction of equivariant layers.
2 Background
Here we outline the basics of the theory behind MDP homomorphisms and equivariance. We begin
with a brief outline of the concepts of equivalence, invariance, and equivariance, followed by a review
of the Markov decision process (MDP). We then review the MDP homomorphism, which builds a
map between ‘equivalent’ MDPs.
Equivalence If a function f : X → Y maps two points x, x′ ∈ X to the same value, f(x) = f(x′), we call x and x′ f-equivalent. For example, two states s, s′ sharing the
same optimal value V∗(s) = V∗(s′) would be V∗-equivalent or optimal value equivalent [30]. An
example of two optimal value equivalent states would be states s and L[s] in the CartPole example of
Figure 1. The set of all points f -equivalent to x is called the equivalence class of x.
Invariance and Symmetries Typically there exist very intuitive relationships between the points in
an equivalence class. In the CartPole example of Figure 1 this relationship is a horizontal flip about
the vertical axis. This is formalized with the transformation operator Lg : X → X , where g ∈ G and
G is a mathematical group. If Lg satisfies
f (x) = f (Lg [x]), for all g ∈ G, x ∈ X , (1)
then we say that f is invariant or symmetric to Lg and that {Lg }g∈G is a set of symmetries of f . We
can see that for the invariance equation to be satisfied, it must be that Lg can only map x to points
in its equivalence class. Note that in abstract algebra for Lg to be a true transformation operator, G
must contain an identity operation; that is Lg [x] = x for some g and all x. An interesting property
of transformation operators which leave f invariant is that they can be composed and still leave f
invariant, so Lg ◦ Lh is also a symmetry of f for all g, h ∈ G. In abstract algebra, this property is
known as a semigroup property. If Lg is always invertible, this is called a group property. In this
work, we experiment with group-structured transformation operators. For more information, see [14].
One extra helpful concept is that of orbits. If f is invariant to Lg , then it is invariant along the orbits
of G. The orbit Ox of point x is the set of points reachable from x via transformation operator Lg :
Ox ≜ {Lg[x] ∈ X | g ∈ G}.  (2)
Equivariance A related notion to invariance is equivariance. Given a transformation operator
Lg : X → X and a mapping f : X → Y, we say that f is equivariant [11, 46] to the transformation
if there exists a second transformation operator Kg : Y → Y in the output space of f such that
Kg [f (x)] = f (Lg [x]), for all g ∈ G, x ∈ X . (3)
The operators Lg and Kg can be seen to describe the same transformation, but in different spaces. In
fact, an equivariant map can be seen to map orbits to orbits. We also see that invariance is a special
case of equivariance, if we set Kg to the identity operator for all g. Given Lg and Kg , we can solve
for the collection of equivariant functions f satisfying the equivariance constraint. Moreover, for
linear transformation operators and linear f a rich theory already exists in which f is referred to
as an intertwiner [12]. In the equivariant deep learning literature, neural networks are built from
interleaving intertwiners and equivariant nonlinearities. As far as we are aware, most of these methods
are hand-designed per pair of transformation operators, with the exception of [13]. In this paper, we
introduce a computational method to solve for intertwiners given a pair of transformation operators.
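As a concrete illustration (our own minimal example, not from the original text), the following snippet checks Equations (1) and (3) numerically for a two-element reflection group acting on a 2D input by negation: the sum of squares is invariant, while a hand-built linear map is equivariant when its output operator swaps the two output coordinates.

import numpy as np

# Two-element group G = {e, g}: L_e = I, L_g = -I on the input space;
# K_e = I, K_g swaps the two output coordinates.
L = [np.eye(2), -np.eye(2)]
K = [np.eye(2), np.array([[0., 1.], [1., 0.]])]

def f_invariant(x):    # satisfies f(L_g[x]) = f(x), Eq. (1)
    return np.sum(x ** 2)

W = np.array([[1., 2.], [-1., -2.]])   # second row is the negation of the first
def f_equivariant(x):  # satisfies f(L_g[x]) = K_g[f(x)], Eq. (3)
    return W @ x

x = np.random.randn(2)
for Lg, Kg in zip(L, K):
    assert np.isclose(f_invariant(Lg @ x), f_invariant(x))
    assert np.allclose(f_equivariant(Lg @ x), Kg @ f_equivariant(x))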
A Markov decision process (MDP) is a tuple (S, A, R, T, γ), with state space S, action space A,
immediate reward function R : S × A → R, transition function T : S × A × S → R≥0 , and
discount factor γ ∈ [0, 1]. The goal of solving an MDP is to find a policy π ∈ Π, π : S × A → R≥0
(written π(a|s)), where π normalizes to unity over the action space, that maximizes the expected
return Rt = Eπ[ Σ_{k=0}^{T} γ^k r_{t+k+1} ]. The expected return from a state s under a policy π is given by
the value function V π . A related object is the Q-value Qπ , the expected return from a state s after
taking action a under π. V π and Qπ are governed by the well-known Bellman equations [5] (see
Supplementary). In an MDP, optimal policies π ∗ attain an optimal value V ∗ and corresponding
Q-value given by V∗(s) = max_{π∈Π} V^π(s) and Q∗(s, a) = max_{π∈Π} Q^π(s, a).
MDP with Symmetries Symmetries can appear in MDPs. For instance, in Figure 2 CartPole has a
reflection symmetry about the vertical axis. Here we define an MDP with symmetries. In an MDP
with symmetries there is a set of transformations on the state-action space, which leaves the reward
function and transition operator invariant. We define a state transformation and a state-dependent
action transformation as Lg : S → S and Kgs : A → A respectively. Invariance of the reward
function and transition function is then characterized as
R(s, a) = R(Lg[s], Kgs[a]),  for all g ∈ G, s ∈ S, a ∈ A,  (4)
T(s′|s, a) = T(Lg[s′] | Lg[s], Kgs[a]),  for all g ∈ G, s ∈ S, a ∈ A.  (5)
Written like this, we see that in an MDP with symmetries the reward function and transition operator
are invariant along orbits defined by the transformations (Lg , Kgs ).
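To make Equations (4) and (5) concrete, the sketch below (a toy example of our own, not the CartPole dynamics) builds a two-state, two-action MDP that is symmetric under swapping both states and actions, and verifies that the reward and transition functions are invariant under (Lg, Kgs).

import numpy as np

# Toy MDP with 2 states and 2 actions, symmetric under swapping state 0<->1 and action 0<->1.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # R[s, a]
T = np.zeros((2, 2, 2))           # T[s, a, s'] = probability of s' given (s, a)
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]
T[1, 1] = [0.1, 0.9]; T[1, 0] = [0.8, 0.2]

Lg = {0: 1, 1: 0}   # state transformation of the non-trivial group element
Kg = {0: 1, 1: 0}   # action transformation (state-independent here)

for s in range(2):
    for a in range(2):
        assert R[s, a] == R[Lg[s], Kg[a]]                          # Eq. (4)
        for s_next in range(2):
            assert T[s, a, s_next] == T[Lg[s], Kg[a], Lg[s_next]]  # Eq. (5)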
Figure 2: Example of a reduction in an MDP’s state-action space under an MDP homomorphism h.
Here ‘equivalence’ is represented by a reflection of the dynamics in the vertical axis. This equivalence
class is encoded by h by mapping all equivalent state-action pairs to the same abstract state-actions.
MDP Homomorphisms MDPs with symmetries are closely related to MDP homomorphisms, as
we explain below. First we define the latter. An MDP homomorphism h [32, 30] is a mapping from
one MDP M = (S, A, R, T, γ) to another M̄ = (S̄, Ā, R̄, T̄ , γ) defined by a surjective map from the
state-action space S × A to an abstract state-action space S̄ × Ā. In particular, h consists of a tuple
of surjective maps (σ, {αs |s ∈ S}), where we have the state map σ : S → S̄ and the state-dependent
action map αs : A → Ā. These maps are built to satisfy the following conditions
R̄(σ(s), αs(a)) ≜ R(s, a),  for all s ∈ S, a ∈ A,  (6)

T̄(σ(s′) | σ(s), αs(a)) ≜ Σ_{s″∈σ⁻¹(σ(s′))} T(s″|s, a),  for all s, s′ ∈ S, a ∈ A.  (7)
An exact MDP homomorphism provides a model equivalent abstraction [20]. Given an MDP
homomorphism h, two state-action pairs (s, a) and (s′, a′) are called h-equivalent if σ(s) = σ(s′)
and αs(a) = αs′(a′). Symmetries and MDP homomorphisms are connected in a natural way: If
an MDP has symmetries Lg and Kg , the above equations (4) and (5) hold. This means that we can
define a corresponding MDP homomorphism, which we define next.
3 Method
The focus of this section is on the design of MDP homomorphic networks: policy networks and value networks obeying the MDP homomorphism. In the first part of the method, we show that any policy network satisfying the MDP homomorphism property must be an equivariant neural network. In the second part of the method, we introduce a novel numerical technique for constructing group-equivariant networks, based on the transformation operators defining the equivalence of state-action pairs under the MDP homomorphism.
Lifted policies in symmetric MDPs with group-structured symmetries are invariant under the group of symmetries. (We use the terminology lifting to stay consistent with [30]; the lifted policy π↑ assigns to a state-action pair (s, a) the probability that the abstract policy π̄ gives to (σ(s), αs(a)), divided uniformly over the pre-image of the abstract action.) Consider the following: Take an MDP with symmetries defined by transformation operators (Lg, Kgs) for g ∈ G. Now, if we take s′ = Lg[s] and a′ = Kgs[a] for any g ∈ G, then (s′, a′) and (s, a) are h-equivalent under the corresponding MDP homomorphism h = (σ, {αs|s ∈ S}). So, writing ā = αs(a),

π↑(a|s) = π̄(αs(a)|σ(s)) / |{a ∈ αs⁻¹(ā)}| = π̄(αs′(a′)|σ(s′)) / |{a′ ∈ αs′⁻¹(ā)}| = π↑(a′|s′),  (10)

for all s ∈ S, a ∈ A and g ∈ G. In the first equality we have used the definition of the lifted policy. In the second equality, we have used the definition of h-equivalent state-action pairs, where σ(s) = σ(Lg[s]) and αs(a) = αs′(a′). In the third equality, we have reused the definition of the
lifted policy. Thus we see that, written in this way, the lifted policy is invariant under state-action
transformations (Lg , Kgs ). This equation is very general and applies for all group-structured state-
action transformations. For a finite action space, this statement of invariance can be re-expressed as a
statement of equivariance, by considering the vectorized policy.
Invariant Policies On Finite Action Spaces Are Equivariant Vectorized Policies For convenience
we introduce a vector of probabilities for each of the discrete actions under the policy
π(s) ≜ [π(a1|s), π(a2|s), ..., π(aN|s)]⊤,  (11)
where a1 , ..., aN are the N possible discrete actions in action space A. The action transformation Kgs
maps actions to actions invertibly. Thus applying an action transformation to the vectorized policy
permutes the elements. We write the corresponding permutation matrix as Kg . Note that
Kg⁻¹ π(s) ≜ [π(Kgs[a1]|s), π(Kgs[a2]|s), ..., π(Kgs[aN]|s)]⊤,  (12)

where writing the inverse Kg⁻¹ instead of Kg is required to maintain the property Kg Kh = Kgh.
The invariance of the lifted policy can then be written as π↑(s) = Kg⁻¹ π↑(Lg[s]), which can be
rearranged to the equivariance equation

Kg π↑(s) = π↑(Lg[s]),  for all g ∈ G, s ∈ S, a ∈ A.  (13)
This equation shows that the lifted policy must satisfy an equivariance constraint. In deep learning,
this has already been well-explored in the context of supervised learning [11, 12, 46, 47, 43]. Next,
we present a novel way to construct such networks.
Our goal is to build neural networks that follow Eq. 13; that is, we wish to find neural networks that
are equivariant under a set of state and policy transformations. Equivariant networks are common
in supervised learning [11, 12, 46, 47, 43, 41]. For instance, in semantic segmentation shifts and
rotations of the input image result in shifts and rotations in the segmentation. A neural network
consisting of only equivariant layers and non-linearities is equivariant as a whole, too [11] (see Appendix B for more details). Thus,
once we know how to build a single equivariant layer, we can simply stack such layers together. Note
that this is true regardless of the representation of the group, i.e. this works for spatial transformations
of the input, feature map permutations in intermediate layers, and policy transformations in the output
layer. For the experiments presented in this paper, we use the same group representations for the
intermediate layers as for the output, i.e. permutations. For finite groups, such as cyclic groups or
permutations, pointwise nonlinearities preserve equivariance [11].
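The fact that pointwise nonlinearities preserve equivariance for permutation representations can be checked directly: a permutation only reorders coordinates, so applying a ReLU before or after it gives the same result. A minimal check (our own illustration):

import numpy as np

relu = lambda v: np.maximum(v, 0.0)

Kg = np.array([[0., 1., 0.],     # a permutation matrix acting on the features
               [0., 0., 1.],
               [1., 0., 0.]])
z = np.random.randn(3)

# Pointwise nonlinearities commute with permutations: relu(Kg z) = Kg relu(z).
assert np.allclose(relu(Kg @ z), Kg @ relu(z))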
In the past, learnable equivariant layers were designed by hand for each transformation group
individually [11, 12, 46, 47, 44, 43, 41]. This is time-consuming and laborious. Here we present a
novel way to build learnable linear layers that satisfy equivariance automatically.
Equivariant Layers We begin with a single linear layer z′ = Wz + b, where W ∈ R^{Dout×Din} and b ∈ R^{Dout} is a bias. To simplify the math, we merge the bias into the weights, so W ↦ [W, b] and z ↦ [z, 1]⊤. We denote the space of the augmented weights as Wtotal. For a given pair of linear group transformation operators in matrix form (Lg, Kg), where Lg is the input transformation and Kg is the output transformation, we then have to solve the equation

Kg W z = W Lg z,  for all g ∈ G, z ∈ R^{Din+1}.  (14)
Since this equation is true for all z we can in fact drop z entirely. Our task now is to find all weights
W which satisfy Equation 14. We label this space of equivariant weights as W, defined as
W ≜ {W ∈ Wtotal | Kg W = W Lg, for all g ∈ G},  (15)
again noting that we have dropped z. To find the space W notice that for each g ∈ G the constraint
Kg W = WLg is in fact linear in W. Thus, to find W we need to solve a set of linear equations in W.
For this we introduce a construction, which we call a symmetrizer S(W). The symmetrizer is
S(W) ≜ (1/|G|) Σ_{g∈G} Kg⁻¹ W Lg.  (16)
S has three important properties, for which proofs are provided in Appendix A. First, S(W) is symmetric (S(W) ∈ W). Second, S fixes any symmetric W: (W ∈ W =⇒ S(W) = W). Third, S is idempotent: S(S(W)) = S(W). Together, these properties show that S projects arbitrary W ∈ Wtotal onto the equivariant subspace W.
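A minimal numpy sketch of the symmetrizer in Equation 16 (our own code; it assumes the group is given as two aligned lists of matrices, input operators Lg and output operators Kg):

import numpy as np

def symmetrize(W, L_ops, K_ops):
    # S(W) = 1/|G| * sum_g K_g^{-1} W L_g  (Eq. 16)
    return sum(np.linalg.inv(K) @ W @ L for K, L in zip(K_ops, L_ops)) / len(L_ops)

# CartPole-like flip group: L_g negates the 4D state, K_g swaps the two action logits.
L_ops = [np.eye(4), -np.eye(4)]
K_ops = [np.eye(2), np.array([[0., 1.], [1., 0.]])]

W = np.random.randn(2, 4)
S = symmetrize(W, L_ops, K_ops)
for L, K in zip(L_ops, K_ops):
    assert np.allclose(K @ S, S @ L)   # S(W) satisfies the constraint of Eq. 14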
Since W is the solution set for a set of simultaneous linear equations, W is a linear sub-
space of the space of all possible weights Wtotal . Thus each W ∈ W can be parametrized
as a linear combination of basis weights {Vi}_{i=1}^{r}, where r is the rank of the subspace and span({Vi}_{i=1}^{r}) = W. To find a basis for W, we take a Gram-Schmidt orthogonalization approach. We first sample weights in the total space Wtotal and then project them into the equivariant
subspace with the symmetrizer. We do this for multiple weight
matrices, which we then stack and feed through a singular value de-
composition to find a basis for the equivariant space. This procedure
is outlined in Algorithm 1. Any equivariant layer can then be written
as a linear combination of bases
W = Σ_{i=1}^{r} ci Vi,  (17)

where the ci's are learnable scalar coefficients, r is the rank of the equivariant space, and the matrices Vi are the basis vectors, formed from the reshaped right-singular vectors in the SVD. An example is shown in Figure 3. To run this procedure, all that is needed are the transformation operators Lg and Kg. Note that we do not need to know the explicit transformation matrices, but only to be able to perform the mappings W ↦ WLg and W ↦ Kg⁻¹W. For instance, some matrix Lg rotates an image patch, but we could equally implement WLg using a built-in rotation function. Code is available at https://github.com/ElisevanderPol/symmetrizer/.

Figure 3: Example of 4-way rotationally symmetric filters.
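The sketch below (our own, loosely following Algorithm 1; the names are ours and do not necessarily match the released symmetrizer code) samples random weight matrices, projects them with the symmetrizer, extracts an orthogonal basis {Vi} of the equivariant subspace via an SVD, and wraps the result in a layer whose weight is the learnable combination of Equation 17. The bias augmentation is omitted for brevity.

import numpy as np
import torch
import torch.nn as nn

def equivariant_basis(d_out, d_in, L_ops, K_ops, n_samples=100, tol=1e-6):
    # Sample weights, project them with the symmetrizer (Eq. 16), and extract a basis via SVD.
    samples = []
    for _ in range(n_samples):
        W = np.random.randn(d_out, d_in)
        S = sum(np.linalg.inv(K) @ W @ L for K, L in zip(K_ops, L_ops)) / len(L_ops)
        samples.append(S.reshape(-1))
    _, sing, Vt = np.linalg.svd(np.stack(samples), full_matrices=False)
    rank = int((sing > tol * sing.max()).sum())
    return Vt[:rank].reshape(rank, d_out, d_in)   # basis matrices V_i

class EquivariantLinear(nn.Module):
    # Linear layer whose weight is a learnable combination of fixed basis matrices (Eq. 17).
    def __init__(self, basis):
        super().__init__()
        self.register_buffer("basis", torch.tensor(basis, dtype=torch.float32))
        self.coef = nn.Parameter(torch.randn(basis.shape[0]) / basis.shape[0] ** 0.5)

    def forward(self, z):
        W = torch.einsum("i,ioj->oj", self.coef, self.basis)
        return z @ W.t()

# Example usage with the CartPole flip group: 4D states, 2 action logits.
L_ops = [np.eye(4), -np.eye(4)]
K_ops = [np.eye(2), np.array([[0., 1.], [1., 0.]])]
layer = EquivariantLinear(equivariant_basis(2, 4, L_ops, K_ops))

s = torch.randn(1, 4)
Lg = torch.tensor(-np.eye(4), dtype=torch.float32)
Kg = torch.tensor([[0., 1.], [1., 0.]])
assert torch.allclose(layer(s @ Lg.t()), layer(s) @ Kg.t(), atol=1e-5)  # Eq. 13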
4 Experiments
We evaluated three flavors of MDP homomorphic network—an MLP, a CNN, and an equivariant
feature extractor—on three RL tasks that exhibit group symmetry: CartPole, a grid world, and Pong.
Table 1: ENVIRONMENTS AND SYMMETRIES: We showcase a visual guide of the state and action spaces for each environment along with the effect of the transformations. Note, the symbols should not be taken to be hard mathematical statements; they are merely a visual guide for communication.

Environment | Space | Transformations
CartPole    | S: (x, θ, ẋ, θ̇)            | (x, θ, ẋ, θ̇), (−x, −θ, −ẋ, −θ̇)
            | A: (←, →)                   | (←, →), (→, ←)
Grid World  | S: {0, 1}^(21×21)           | Identity, rotate 90°, rotate 180°, rotate 270°
            | A: (∅, ↑, →, ↓, ←)          | (∅, ↑, →, ↓, ←), (∅, →, ↓, ←, ↑), (∅, ↓, ←, ↑, →), (∅, ←, ↑, →, ↓)
Pong        | S: {0, ..., 255}^(4×80×80)  | Identity, reflect
            | A: (∅, ∅, ↑, ↓, ↑, ↓)       | (∅, ∅, ↑, ↓, ↑, ↓), (∅, ∅, ↓, ↑, ↓, ↑)
We use RLPYT [36] for the algorithms. Hyperparameters (and the range considered), architectures,
and group implementation details are in the Supplementary Material. Code is available at https://github.com/ElisevanderPol/mdp-homomorphic-networks.
4.1 Environments
For each environment we show S and A with respective representations of the group transformations.
CartPole In the classic pole balancing task [3], we used a two-element group of reflections about the
y-axis. We used OpenAI’s Cartpole-v1 [7] implementation, which has a 4-dimensional observation
vector: (cart position x, pole angle θ, cart velocity ẋ, pole velocity θ̇). The (discrete) action space
consists of applying a force left and right (←, →). We chose this example for its simple symmetries.
Grid world We evaluated on a toroidal 7-by-7 predator-prey grid world with agent-centered coordi-
nates. The prey and predator are randomly placed at the start of each episode, lasting a maximum
of 100 time steps. The agent’s goal is to catch the prey, which takes a step in a random compass
direction with probability 0.15 and stands still otherwise. Upon catching the prey, the agent receives a
reward of +1, and -0.1 otherwise. The observation is a 21 × 21 binary image identifying the position
of the agent in the center and the prey in relative coordinates. See Figure 6a. This environment was
chosen due to its four-fold rotational symmetry.
Pong We evaluated on the RLPYT [36] implementation of Pong. In our experiments, the observation
consisted of the 4 last observed frames, with upper and lower margins cut off and downscaled to
an 80 × 80 grayscale image. In this setting, there is a flip symmetry over the horizontal axis: if
we flip the observations, the up and down actions also flip. A curious artifact of Pong is that it has
duplicate (up, down) actions, which means that to simplify matters, we mask out the policy values
for the second pair of (up, down) actions. We chose Pong because of its higher dimensional state
space. Finally, for Pong we additionally compare to two data augmentation baselines: first, stochastic data augmentation, where each state-action pair is randomly transformed (or not) before being fed to the network; and second, an equivariant version of [16], similar to [35], where both the state and the transformed state are input to the network, the output for the transformed state is appropriately transformed, and the two policies are averaged.
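For clarity, a schematic of the two baselines (our own paraphrase; the function and transform names are placeholders, not the RLPYT interfaces): stochastic augmentation transforms each state-action pair with probability 0.5 before it is fed to the network, while the full variant evaluates the network on both the state and its transformed version and averages the two policies after mapping the second one back through the action transformation.

import numpy as np
import torch

def stochastic_augment(state, action, transform_state, transform_action):
    # With probability 0.5, replace (s, a) by (L_g[s], K_g^s[a]) before training on it.
    if np.random.rand() < 0.5:
        return transform_state(state), transform_action(action)
    return state, action

def averaged_policy(policy_net, state, transform_state, K_inv):
    # Evaluate on s and L_g[s], map the second policy back with K_g^{-1}, and average.
    p = torch.softmax(policy_net(state), dim=-1)
    p_flip = torch.softmax(policy_net(transform_state(state)), dim=-1)
    return 0.5 * (p + p_flip @ K_inv.t())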
4.2 Models
We implemented MDP homomorphic networks on top of two base architectures: MLP and CNN
(exact architectures in Supplementary). We further experimented with an equivariant feature extractor,
appended by a non-equivariant network, to isolate where equivariance made the greatest impact.
Basis Networks We call networks whose weights are linear combinations of basis weights basis
networks. As an ablation study on all equivariant networks, we sought to measure the effects of the
basis training dynamics. We compared an equivariant basis against a pure nullspace basis, i.e. an
explicitly non-symmetric basis using the right-null vectors from the equivariant layer construction,
and a random basis, where we skip the symmetrization step in the layer construction and use the
full rank basis. Unless stated otherwise, we reduce the number of ‘channels’ in the basis networks
compared to the regular networks by dividing by the square root of the group size, ending up with a
comparable number of trainable parameters.
Figure 4: Average return over time steps. (a) CartPole, bases: Nullspace, Random, Equivariant (time steps ×500). (b) CartPole, architectures: MLP 4-64-128-2, MLP 4-128-128-2, Equivariant 4-64-64-2 (time steps ×500). (c) Pong: Nullspace, Random, Convolutional, Equivariant (time steps ×25000).
We show training curves for CartPole in Figures 4a-4b, for Pong in Figure 4c, and for the grid world in Figure 6. Across all experiments we observed that the MDP homomorphic network outperforms both the non-equivariant basis networks and the standard architectures in terms of convergence speed. This confirms our motivation that building in symmetry-preserving equivariant networks is beneficial, and is consistent with other results in the equivariance literature [4, 42, 44, 46]. While data augmentation can be used to create a larger dataset by exploiting symmetries, it does not directly lead to effective parameter sharing (as our approach does). Note that for Pong we only train for the first 15 million frames to highlight the difference at the beginning of training; in contrast, a typical training duration is 50-200 million frames [25, 36].

Figure 5: Data augmentation comparison on Pong (Stoch. Data Aug., Full Data Aug., Convolutional, Equivariant; average return vs. time steps ×25000).
For our ablation experiment, we wanted to control for the introduction of bases. It is not clear a
priori that a network with a basis has the same gradient descent dynamics as an equivalent ‘basisless’
network. We compared equivariant, non-equivariant, and random bases, as mentioned above. We
found the equivariant basis led to the fastest convergence. Figures 4a and 4c show that for CartPole
and Pong the nullspace basis converged faster than the random basis. In the grid world there was no
clear winner between the two. This is a curious result, requiring deeper investigation in a follow-up.
For a third experiment, we investigated what happens if we sacrifice complete equivariance of the
policy. This is attractive because it removes the need to find a transformation operator for a flattened
output feature map. Instead, we only maintained an equivariant feature extractor, compared against a
basic CNN feature extractor. The networks built on top of these extractors were MLPs. The results,
in Figure 4c, are two-fold: 1) Basis feature extractors converge faster than standard CNNs, and 2)
the equivariant feature extractor has the fastest convergence. We hypothesize that the equivariant feature extractor is fastest because it is easiest to learn an equivariant policy from equivariant features.
We have additionally compared an equivariant feature extractor to a regular convolutional network
on the Atari game Breakout, where the difference between the equivariant network and the regular
network is much less pronounced. For details, see Appendix C.
5 Related Work
Past work on MDP homomorphisms has often aimed at discovering the map itself based on knowledge
of the transition and reward function, and under the assumption of enumerable state spaces [30, 31,
32, 38]. Other work relies on learning the map from sampled experience from the MDP [39, 6,
23]. Exactly computing symmetries in MDPs is graph isomorphism complete [27] even with full
knowledge of the MDP dynamics. Rather than assuming knowledge of the transition and reward
function, and small and enumerable state spaces, in this work we take the inverse view: we assume that
we have an easily identifiable transformation of the joint state–action space and exploit this knowledge
to learn more efficiently.

Figure 6: GRID WORLD: Trained with A2C, all networks fine-tuned over 6 learning rates. 25%, 50% and 75% quantiles over 20 random seeds shown. (a) Showcase of symmetries. (b) Equivariant, nullspace, and random bases. (c) Plain CNN and equivariant CNN (average return vs. time steps ×10000).

Exploiting symmetries in deep RL has been previously explored in the
game of Go, in the form of symmetric filter weights [33, 8] or data augmentation [35]. Other work
on data augmentation increases sample efficiency and generalization on well-known benchmarks by
augmenting existing data points with state transformations such as random translations, cutout, color jitter
and random convolutions [16, 9, 17, 19]. In contrast, we encode symmetries into the neural network
weights, leading to more parameter sharing. Additionally, such data augmentation approaches tend to
take the invariance view, augmenting existing data with state transformations that leave the state’s
Q-values intact [16, 9, 17, 19] (the exception being [21] and [24], who augment trajectories rather
than just states). Similarly, permutation invariant networks are commonly used in approaches to
multi-agent RL [37, 22, 15]. We instead take the equivariance view, which accommodates a much
larger class of symmetries that includes transformations on the action space. Abdolhosseini et al. [1]
have previously manually constructed an equivariant network for a single group of symmetries in
a single RL problem, namely reflections in a bipedal locomotion task. Our MDP homomorphic
networks allow for automated construction of networks that are equivariant under arbitrary discrete
groups and are therefore applicable to a wide variety of problems.
From an equivariance point-of-view, the automatic construction of equivariant layers is new. [12]
comes close to specifying a procedure, outlining the system of equations to solve, but does not specify
an algorithm. The basic theory of group equivariant networks was outlined in [11, 12] and [10], with
notable implementations to 2D roto-translations on grids [46, 43, 41] and 3D roto-translations on
grids [45, 44, 42]. All of these works have relied on hand-constructed equivariant layers.
6 Conclusion
This paper introduced MDP homomorphic networks, a family of deep architectures for reinforcement
learning problems where symmetries have been identified. MDP homomorphic networks tie weights
over symmetric state-action pairs. This weight-tying leads to fewer degrees-of-freedom and in our
experiments we found that this translates into faster convergence. We used the established theory of
MDP homomorphisms to motivate the use of equivariant networks, thus formalizing the connection
between equivariant networks and symmetries in reinforcement learning. As an innovation, we also
introduced the first method to automatically construct equivariant network layers, given a specification
of the symmetries in question, thus removing a significant implementational obstacle. For future
work, we want to further understand the symmetrizer and its effect on learning dynamics, as well as
generalizing to problems that are not fully symmetric.
Acknowledgements

Elise van der Pol was funded by Robert Bosch GmbH. Daniel Worrall was funded
by Philips. F.A.O. received funding from the European Research Council (ERC)
under the European Union’s Horizon 2020 research and innovation programme
(grant agreement No. 758824 —INFLUENCE). Max Welling reports part-time
employment at Qualcomm AI Research.
8 Broader Impact
The goal of this paper is to make (deep) reinforcement learning techniques
more efficient at solving Markov decision processes (MDPs) by making use of prior knowledge about
symmetries. We do not expect the particular algorithm we develop to lead to immediate societal risks.
However, Markov decision processes are very general, and can e.g. be used to model problems in
autonomous driving, smart grids, and scheduling. Thus, solving such problems more efficiently can
in the long run cause positive or negative societal impact.
For example, making transportation or power grids more efficient, thereby making better use of
scarce resources, would be a significantly positive impact. Other potential applications, such as in
autonomous weapons, pose a societal risk [28]. Like many AI technologies, when used in automation,
our technology can have a positive impact (increased productivity) and a negative impact (decreased
demand) on labor markets.
More immediately, control strategies learned using RL techniques are hard to verify and validate.
Without proper precaution (e.g. [40]), employing such control strategies on physical systems thus runs
the risk of causing accidents involving people, e.g. due to reward misspecification, unsafe exploration,
or distributional shift [2].
References
[1] Farzad Abdolhosseini, Hung Yu Ling, Zhaoming Xie, Xue Bin Peng, and Michiel van de Panne. On
learning symmetric locomotion. In ACM SIGGRAPH Motion, Interaction, and Games. 2019.
[2] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete
problems in AI safety. arXiv:1606.06565, 2016.
[3] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can
solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, 1983.
[4] Erik J. Bekkers, Maxime W. Lafarge, Mitko Veta, Koen A.J. Eppenhof, Josien P.W. Pluim, and Remco
Duits. Roto-translation covariant convolutional networks for medical image analysis. In International
Conference on Medical Image Computing and Computer-Assisted Intervention, 2018.
[5] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[6] Ondrej Biza and Robert Platt. Online abstraction with MDP homomorphisms for deep learning. In
International Conference on Autonomous Agents and MultiAgent Systems, 2019.
[7] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
[8] Christopher Clark and Amos Storkey. Teaching deep convolutional neural networks to play Go.
arXiv:1412.3409, 2014.
[9] Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in
reinforcement learning. In International Conference on Machine Learning, 2019.
[10] Taco S. Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant CNNs on homogeneous
spaces. In Advances in Neural Information Processing Systems. 2019.
[11] Taco S. Cohen and Max Welling. Group equivariant convolutional networks. In International Conference
on Machine Learning, 2016.
[12] Taco S. Cohen and Max Welling. Steerable CNNs. In International Conference on Learning Representa-
tions, 2017.
[13] Nichita Diaconu and Daniel E. Worrall. Learning to convolve: A generalized weight-tying approach. In
International Conference on Machine Learning, 2019.
[14] David Steven Dummit and Richard M. Foote. Abstract Algebra. Wiley, 2004.
[15] Jiechuan Jiang, Chen Dun, Tiejun Huang, and Zongqing Lu. Graph convolutional reinforcement learning.
In International Conference on Learning Representations, 2020.
[16] Ilya Kostrikov, Denis Yarats, and Rob Fergus. Image augmentation is all you need: Regularizing deep
reinforcement learning from pixels. arXiv:2004.13649, 2020.
[17] Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforce-
ment learning with augmented data. arXiv:2004.14990, 2020.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 1998.
[19] Kimin Lee, Kibok Lee, Jinwoo Shin, and Honglak Lee. Network randomization: A simple technique for
generalization in deep reinforcement learning. In International Conference on Learning Representations,
2020.
[20] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for
MDPs. In International Symposium on Artificial Intelligence and Mathematics, 2006.
[21] Yijiong Lin, Jiancong Huang, Matthieu Zimmer, Yisheng Guan, Juan Rojas, and Paul Weng. Invariant
transform experience replay: Data augmentation for deep reinforcement learning. IEEE Robotics and
Automation Letters, 2020.
[22] Iou-Jen Liu, Raymond A. Yeh, and Alexander G. Schwing. PIC: Permutation invariant critic for multi-agent
deep reinforcement learning. In Conference on Robot Learning, 2019.
[23] Anuj Mahajan and Theja Tulabandhula. Symmetry learning for function approximation in reinforcement
learning. arXiv:1706.02999, 2017.
[24] Aditi Mavalankar. Goal-conditioned batch reinforcement learning for rotation invariant locomotion.
arXiv:2004.08356, 2020.
[25] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P.
Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
In International Conference on Machine Learning, 2016.
[26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare,
Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie,
Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis
Hassabis. Human-level control through deep reinforcement learning. In Nature, 2015.
[27] Shravan Matthur Narayanamurthy and Balaraman Ravindran. On the hardness of finding symmetries in
Markov decision processes. In International Conference on Machine learning, 2008.
[28] Future of Life Institute. Autonomous weapons: An open letter from AI & robotics researchers, 2015.
[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang,
Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie
Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In
Advances in Neural Information Processing Systems, 2019.
[30] Balaraman Ravindran and Andrew G. Barto. Symmetries and model minimization in Markov Decision
Processes. Technical report, University of Massachusetts, 2001.
[31] Balaraman Ravindran and Andrew G. Barto. SMDP homomorphisms: An algebraic approach to abstraction
in Semi Markov Decision Processes. In International Joint Conference on Artificial Intelligence, 2003.
[32] Balaraman Ravindran and Andrew G. Barto. Approximate homomorphisms: A framework for non-exact
minimization in Markov Decision Processes. In International Conference on Knowledge Based Computer
Systems, 2004.
[33] Nicol N. Schraudolph, Peter Dayan, and Terrence J. Sejnowski. Temporal difference learning of position
evaluation in the game of Go. In Advances in Neural Information Processing Systems, 1994.
[34] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. In arXiv:1707.06347, 2017.
[35] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game
of Go with deep neural networks and tree search. In Nature, 2016.
[36] Adam Stooke and Pieter Abbeel. rlpyt: A research code base for deep reinforcement learning in Pytorch.
In arXiv:1909.01500, 2019.
[37] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backprop-
agation. In Advances in Neural Information Processing Systems, 2016.
[38] Jonathan Taylor, Doina Precup, and Prakash Panagaden. Bounding performance loss in approximate MDP
homomorphisms. In Advances in Neural Information Processing Systems, 2008.
[39] Elise van der Pol, Thomas Kipf, Frans A. Oliehoek, and Max Welling. Plannable approximations to MDP
homomorphisms: Equivariance under actions. In International Conference on Autonomous Agents and
MultiAgent Systems, 2020.
[40] K. P. Wabersich and M. N. Zeilinger. Linear model predictive safety certification for learning-based control.
In IEEE Conference on Decision and Control, 2018.
[41] Maurice Weiler and Gabriele Cesa. General E(2)-equivariant steerable CNNs. In Advances in Neural
Information Processing Systems, 2019.
[42] Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S. Cohen. 3D steerable CNNs:
Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing
Systems. 2018.
[43] Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant
CNNs. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[44] Marysia Winkels and Taco S. Cohen. 3D G-CNNs for pulmonary nodule detection. In Medical Imaging
with Deep Learning Conference, 2018.
[45] Daniel E. Worrall and Gabriel J. Brostow. CubeNet: Equivariance to 3D rotation and translation. In
European Conference on Computer Vision (ECCV), 2018.
[46] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic
networks: Deep translation and rotation equivariance. In IEEE Conference on Computer Vision and Pattern
Recognition, 2017.
[47] Daniel E. Worrall and Max Welling. Deep scale-spaces: Equivariance over scale. In Advances in Neural
Information Processing Systems, 2019.
[48] Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. arXiv:1804.10306,
2018.
A The Symmetrizer
In this section we prove three properties of the symmetrizer: the symmetric property (S(W) ∈ W for all W ∈
Wtotal ), the fixing property (W ∈ W =⇒ S(W) = W) , and the idempotence property (S(S(W)) = S(W)
for all W ∈ Wtotal ).
The Symmetric Property Here we show that the symmetrizer S maps matrices W ∈ Wtotal to equivariant
matrices S(W) ∈ W. For this, we show that a symmetrized weight matrix S(W) from Equation 16 satisfies the
equivariance constraint of Equation 14.
The Fixing Property For the symmetrizer to be useful, we need to make sure that its range covers the
equivariant subspace W, and not just a subset of it; that is, we need to show that
W = {S(W) ∈ W|W ∈ Wtotal }. (27)
We show this by picking a matrix W ∈ W and showing that W ∈ W =⇒ S(W) = W.
The Idempotence Property Here we show that the symmetrizer S from Equation 16 is idempotent, i.e. S(S(W)) = S(W). Expanding both applications of S and relabeling the summation index (using Kg′⁻¹ Kg⁻¹ = K(gg′)⁻¹ and Lg Lg′ = Lgg′), the inner sum no longer depends on the outer group element, so

S(S(W)) = (1/|G|) Σ_{g′∈G} Kg′⁻¹ W Lg′   (sum over a constant)  (41)
        = S(W).  (42)

Thus S is idempotent, which together with the symmetric and fixing properties shows that S is a projection onto W.
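A quick numerical sanity check of the three properties (our own, using the two-element CartPole flip group as an example):

import numpy as np

L_ops = [np.eye(4), -np.eye(4)]                      # input operators L_g
K_ops = [np.eye(2), np.array([[0., 1.], [1., 0.]])]  # output operators K_g

def S(W):
    return sum(np.linalg.inv(K) @ W @ L for K, L in zip(K_ops, L_ops)) / len(L_ops)

W = np.random.randn(2, 4)
SW = S(W)

# Symmetric property: S(W) satisfies K_g S(W) = S(W) L_g for every g.
assert all(np.allclose(K @ SW, SW @ L) for K, L in zip(K_ops, L_ops))
# Fixing property: an already-equivariant matrix is left unchanged.
assert np.allclose(S(SW), SW)
# Idempotence: S(S(W)) = S(W).
assert np.allclose(S(S(W)), S(W))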
B Experimental Settings
B.1 Designing representations
In the main text we presented a method to construct a space of intertwiners W using the symmetrizer. This relies
on us already having chosen specific representations/transformation operators for the input, the output, and for
every intermediate layer of the MDP homomorphic networks. While for the input space (state space) and output
space (policy space), these transformation operators are easy to define, it is an open question how to design a
transformation operator for the intermediate layers of our networks. Here we give some rules of thumb that we
used, followed by the specific transformation operators we used in our experiments.
For each experiment we first identified the group G of transformations. In every case, this was a finite group of
size |G|, where the size is the number of elements in the group (number of distinct transformation operators).
For example, a simple flip group as in Pong has two elements, so |G| = 2. Note that the group size |G| does not
necessarily equal the size of the transformation operators, whose size is determined by the dimensionality of the
input/activation layer/policy.
Stacking Equivariant Layers If we stack equivariant layers, the resulting network is equivariant as a
whole too [11]. To see that this is the case, consider the following example. Assume we have network f ,
consisting of layers f1 and f2 , which satisfy the layer-wise equivariance constraints:
Pg [f1 (x)] = f1 (Lg [x]) (43)
Kg [f2 (x)] = f2 (Pg [x]) (44)
With Kg the output transformation of the network, Lg the input transformation, and Pg the intermediate
transformation. Now,
Kg [f (x)] = Kg [f2 (f1 (x))] (45)
= f2 (Pg [f1 (x)]) (f2 equivariance constraint) (46)
= f2 (f1 (Lg [x])) (f1 equivariance constraint) (47)
= f (Lg [x]) (48)
and so the whole network f is equivariant with regards to the input transformation Lg and the output transforma-
tion Kg . Note that this depends on the intermediate representation Pg being shared between layers, i.e. f1 ’s
output transformation is the same as f2 ’s input transformation.
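A sketch of this composition (our own example with the two-element flip group): two symmetrized linear layers share the intermediate permutation representation Pg, a pointwise ReLU sits in between, and the stacked map satisfies the end-to-end constraint of Equation 48.

import numpy as np

def symmetrize(W, in_ops, out_ops):
    return sum(np.linalg.inv(K) @ W @ L for K, L in zip(out_ops, in_ops)) / len(in_ops)

L_ops = [np.eye(4), -np.eye(4)]                      # input representation L_g
P_ops = [np.eye(2), np.array([[0., 1.], [1., 0.]])]  # shared intermediate representation P_g
K_ops = [np.eye(2), np.array([[0., 1.], [1., 0.]])]  # output representation K_g

W1 = symmetrize(np.random.randn(2, 4), L_ops, P_ops)  # f1: P_g W1 = W1 L_g
W2 = symmetrize(np.random.randn(2, 2), P_ops, K_ops)  # f2: K_g W2 = W2 P_g

relu = lambda v: np.maximum(v, 0.0)
f = lambda v: W2 @ relu(W1 @ v)   # ReLU is safe here: P_g is a permutation

x = np.random.randn(4)
for Lg, Kg in zip(L_ops, K_ops):
    assert np.allclose(Kg @ f(x), f(Lg @ x))   # Eq. 48: K_g[f(x)] = f(L_g[x])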
MLP-structured networks For MLP-structured networks (CartPole), typically the activations have
shape [batch_size, num_channels]. Instead we used a shape of [batch_size, num_channels,
representation_size], where for the intermediate layers representation_size=|G|+1 (we have a +1
because of the bias). The transformation operators we then apply to the activations are the permutation matrices of the group of size |G|, each appended with a 1 on the diagonal for the bias, acting on this last ‘representation dimension’. Thus a forward pass of a layer is computed as

y_{b,c_out,r_out} = Σ_{c_in=1}^{num_channels} Σ_{r_in=1}^{|G|+1} z_{b,c_in,r_in} W_{c_out,r_out,c_in,r_in},  (49)

where

W_{c_out,r_out,c_in,r_in} = Σ_{i=1}^{rank(W)} c_{i,c_out,c_in} V_{i,r_out,r_in}.  (50)
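In code, Equations 49 and 50 amount to two einsums (a sketch under our own naming; c holds the learnable coefficients and V the fixed basis produced by the symmetrizer):

import torch

def mlp_equivariant_forward(z, c, V):
    # z: [batch, c_in, r_in], c: [rank, c_out, c_in], V: [rank, r_out, r_in]
    W = torch.einsum("ioc,irs->orcs", c, V)      # Eq. 50: W[c_out, r_out, c_in, r_in]
    return torch.einsum("bcs,orcs->bor", z, W)   # Eq. 49: y[batch, c_out, r_out]

# Example shapes: |G|+1 = 3, so the representation size is 3.
y = mlp_equivariant_forward(torch.randn(8, 4, 3),     # activations
                            torch.randn(5, 6, 4),     # coefficients c_{i, c_out, c_in}
                            torch.randn(5, 3, 3))     # basis V_{i, r_out, r_in}
assert y.shape == (8, 6, 3)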
CNN-structured networks For CNN-structured networks (Pong and Grid World), typically the ac-
tivations have shape [batch_size, num_channels, height, width]. Instead we used a shape of
[batch_size, num_channels, representation_size, height, width], where for the intermediate
layers representation_size=|G|+1. The transformation operator we apply to the input of the layer is a spatial transformation on the height, width dimensions combined with a permutation on the representation dimension. This is because in the intermediate layers of the network the activations do not only transform in space, but also along the representation dimension of the tensor. The transformation operator we apply to the output of the layer is just a permutation on the representation dimension. Thus a forward pass of a layer is computed as

y_{b,c_out,r_out,h_out,w_out} = Σ_{c_in=1}^{num_channels} Σ_{r_in=1}^{|G|+1} Σ_{h_in,w_in} z_{b,c_in,r_in,h_out+h_in,w_out+w_in} W_{c_out,r_out,c_in,r_in,h_in,w_in},  (51)

where

W_{c_out,r_out,c_in,r_in,h_in,w_in} = Σ_{i=1}^{rank(W)} c_{i,c_out,c_in} V_{i,r_out,r_in,h_in,w_in}.  (52)
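One way to implement Equations 51 and 52 is to build the full weight with an einsum and then fold the (channel, representation) index pairs into ordinary convolution channels, so a standard conv2d carries out the sums (a sketch with our own names, not the exact released implementation):

import torch
import torch.nn.functional as F

def cnn_equivariant_forward(z, c, V):
    # z: [B, c_in, r_in, H, W], c: [rank, c_out, c_in], V: [rank, r_out, r_in, k, k]
    W = torch.einsum("ioc,irshw->orcshw", c, V)   # Eq. 52
    c_out, r_out, c_in, r_in, k, _ = W.shape
    B, H, Wd = z.shape[0], z.shape[-2], z.shape[-1]
    y = F.conv2d(z.reshape(B, c_in * r_in, H, Wd),
                 W.reshape(c_out * r_out, c_in * r_in, k, k),
                 padding=k // 2)                  # Eq. 51
    return y.reshape(B, c_out, r_out, *y.shape[-2:])

y = cnn_equivariant_forward(torch.randn(2, 4, 5, 16, 16),   # |G|+1 = 5 for the C4 group
                            torch.randn(7, 8, 4),
                            torch.randn(7, 5, 5, 3, 3))
assert y.shape == (2, 8, 5, 16, 16)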
Table 2: Final learning rates used in CartPole-v1 experiments.
Equivariant Nullspace Random MLP
0.01 0.005 0.001 0.001
B.2 Cartpole-v1
Group Representations For states:
Lge = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
Lg1 = [[−1, 0, 0, 0], [0, −1, 0, 0], [0, 0, −1, 0], [0, 0, 0, −1]]
For intermediate layers and policies:
Kπge = [[1, 0], [0, 1]],   Kπg1 = [[0, 1], [1, 0]]
For values we require an invariant rather than equivariant output. This invariance is implemented by defining
the output representations to be |G| identity matrices of the desired output dimensionality. For predicting state
values we required a 1-dimensional output, and we thus used |G| 1-dimensional identity matrices, i.e. for value
output V :
KVge = 1 , KVg1 = 1
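As a usage example (our own sketch; the bias / +1 entry of the representation is omitted), the representations above plug directly into the symmetrizer: the first layer maps states under Lg to features under the permutation representation, the policy head stays equivariant under Kπg, and the value head with the identity output representation becomes invariant.

import numpy as np

def symmetrize(W, in_ops, out_ops):
    return sum(np.linalg.inv(K) @ W @ L for K, L in zip(out_ops, in_ops)) / len(in_ops)

L = [np.eye(4), -np.eye(4)]                        # state representation
P = [np.eye(2), np.array([[0., 1.], [1., 0.]])]    # intermediate / policy representation
KV = [np.eye(1), np.eye(1)]                        # value representation (invariant output)

W1 = symmetrize(np.random.randn(2, 4), L, P)       # state -> features
Wpi = symmetrize(np.random.randn(2, 2), P, P)      # features -> policy logits
Wv = symmetrize(np.random.randn(1, 2), P, KV)      # features -> state value

s = np.random.randn(4)
feats, feats_flip = W1 @ s, W1 @ (L[1] @ s)
assert np.allclose(P[1] @ (Wpi @ feats), Wpi @ feats_flip)   # equivariant policy
assert np.allclose(Wv @ feats, Wv @ feats_flip)              # invariant value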
Hyperparameters For both the basis networks and the MLP, we used Xavier initialization. We
trained PPO using ADAM on 16 parallel environments and fine-tuned over the learning rates
{0.01, 0.05, 0.001, 0.005, 0.0001, 0.0003, 0.0005} by running 25 random seeds for each setting, and report
the best curve. The final learning rates used are shown in Table 2. Other hyperparameters were defaults in
RLPYT [36], except that we turn off learning rate decay.
Architecture
Basis networks:
Table 3: Final learning rates used in grid world experiments.
Equivariant Nullspace Random CNN
0.001 0.003 0.001 0.003
B.3 GridWorld
Group Representations For states we use numpy.rot90. The stack of weights is rolled.
For the intermediate representations:
Lge = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
Lg1 = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
Lg2 = [[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]],
Lg3 = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
For the policies:
Kπge = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1]],
Kπg1 = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0]],
Kπg2 = [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0]],
Kπg3 = [[1, 0, 0, 0, 0], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 1], [0, 1, 0, 0, 0]]
For the values:
KVge = 1 , KVg1 = 1 , KVg2 = 1 , KVg3 = 1
Hyperparameters For both the basis networks and the CNN, we used He initialization. We
trained A2C using ADAM on 16 parallel environments and fine-tuned over the learning rates
{0.00001, 0.00003, 0.0001, 0.0003, 0.001, 0.003} on 20 random seeds for each setting, and report the best
curve. The final learning rates used are shown in Table 3. Other hyperparameters were defaults in RLPYT [36].
Architecture
Basis networks:
CNN:
Table 4: Learning rates used in Pong experiments.
Equivariant Nullspace Random CNN
0.0002 0.0002 0.0002 0.0001
B.4 Pong
Group Representations For the states we use numpy’s indexing to flip the input, i.e.
w = w[..., ::-1, :], then the permutation on the representation dimension of the weights is a
numpy.roll, since the group is cyclic.
For the intermediate layers:
Lge = [[1, 0], [0, 1]],   Lg1 = [[0, 1], [1, 0]]
Hyperparameters For both the basis networks and the CNN, we used He initialization. We trained A2C
using ADAM on 4 parallel environments and fine-tuned over the learning rates {0.0001, 0.0002, 0.0003} on 15
random seeds for each setting, and report the best curve. The learning rates to fine-tune over were selected
to be close to where the baseline performed well in preliminary experiments. The final learning rates used are
shown in Table 4. Other hyperparameters were defaults in RLPYT [36].
Architecture
Basis Networks:
CNN:
Table 5: Learning rates used in Breakout experiments.
Equivariant CNN
0.0002 0.0002
Figure 7: BREAKOUT: Trained with A2C, all networks fine-tuned over 9 learning rates. 25%, 50% and 75% quantiles over 14 random seeds shown (Equivariant vs. Convolutional; average return vs. time steps ×25000).
C Breakout Experiments
We evaluated the effect of an equivariant basis extractor on Breakout, compared to a baseline convolutional
network. The hyperparameter settings and architecture were largely the same as those of Pong, except for the
input group representation, a longer training time, and that we considered a larger range of learning rates. To
ensure symmetric states, we remove the two small decorative blocks in the bottom corners.
Group Representations For the states we use numpy’s indexing to flip the input, i.e.
w = w[..., :, ::-1] (note the different axis than in Pong), then the permutation on the representation
dimension of the weights is a numpy.roll, since the group is cyclic.
For the intermediate layers:
Lge = [[1, 0], [0, 1]],   Lg1 = [[0, 1], [1, 0]]
Hyperparameters We used He initialization. We trained A2C using ADAM on 4 parallel environments and
fine-tuned over the learning rates {0.001, 0.005, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.00001, 0.00005}
on 15 random seeds for each setting, and report the best curve. The final learning rates used are shown in
Table 5. Other hyperparameters were defaults in RLPYT [36].
Results Figure 7 shows the result of the equivariant feature extractor versus the convolutional baseline. While
we again see an improvement over the standard convolutional approach, the difference is much less pronounced
than in CartPole, Pong or the grid world. It is not straightforward to say why. One factor could be that the equivariant feature extractor is not end-to-end MDP homomorphic: it instead outputs a type of MDP homomorphic state representation and learns a regular policy on top. As a result, the unconstrained final layers may negate some of
the advantages of the equivariant feature extractor. This may be more of an issue for Breakout than Pong, since
Breakout is a more complex game.
E Bellman Equations

V^π(s) = Σ_{a∈A} π(s, a) [ R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′) ],  (53)

Q^π(s, a) = R(s, a) + γ Σ_{s′∈S} T(s, a, s′) V^π(s′).  (54)