
Universal approximations of invariant maps by neural networks

Dmitry Yarotsky∗†
[email protected]

∗ Skolkovo Institute of Science and Technology, Nobelya Ulitsa 3, Moscow 121205, Russia
† Institute for Information Transmission Problems, Bolshoy Karetny 19 build. 1, Moscow 127051, Russia

arXiv:1804.10306v1 [cs.NE] 26 Apr 2018

Abstract
We describe generalizations of the universal approximation theorem for neural
networks to maps invariant or equivariant with respect to linear representations of
groups. Our goal is to establish network-like computational models that are both in-
variant/equivariant and provably complete in the sense of their ability to approximate
any continuous invariant/equivariant map. Our contribution is three-fold. First, in
the general case of compact groups we propose a construction of a complete invari-
ant/equivariant network using an intermediate polynomial layer. We invoke classical
theorems of Hilbert and Weyl to justify and simplify this construction; in particular,
we describe an explicit complete ansatz for approximation of permutation-invariant
maps. Second, we consider groups of translations and prove several versions of the
universal approximation theorem for convolutional networks in the limit of continuous
signals on euclidean spaces. Finally, we consider 2D signal transformations equivariant
with respect to the group SE(2) of rigid euclidean motions. In this case we introduce
the “charge–conserving convnet” – a convnet-like computational model based on the
decomposition of the feature space into isotypic representations of SO(2). We prove
this model to be a universal approximator for continuous SE(2)–equivariant signal
transformations.
Keywords: neural network, approximation, linear representation, invariance, equiv-
ariance, polynomial, polarization, convnet

Contents
1 Introduction
1.1 Motivation
1.2 Related work
1.3 Contribution of this paper


2 Compact groups and shallow approximations
2.1 Approximations based on symmetrization
2.2 Approximations based on polynomial invariants
2.3 Polarization and multiplicity reduction
2.4 The symmetric group SN

3 Translations and deep convolutional networks
3.1 Finite abelian groups and single convolutional layers
3.2 Continuum signals and deep convnets
3.3 Convnets with pooling

4 Charge-conserving convnets
4.1 Preliminary considerations
4.1.1 Pointwise characterization of SE(ν)–equivariant maps
4.1.2 Equivariant differentiation
4.1.3 Discretized differential operators
4.1.4 Polynomial approximations on SO(2)-modules
4.2 Charge-conserving convnet
4.3 The main result

5 Discussion

Bibliography

A Proof of Lemma 4.1

1 Introduction
1.1 Motivation
An important topic in learning theory is the design of predictive models properly reflecting
symmetries naturally present in the data (see, e.g., Burkhardt and Siggelkow [2001], Schulz-
Mirbach [1995], Reisert [2008]). Most commonly, in the standard context of supervised
learning, this means that our predictive model should be invariant with respect to a suitable
group of transformations: given an input object, we often know that its class or some
other property that we are predicting does not depend on the object representation (e.g.,
associated with a particular coordinate system), or for other reasons does not change under
certain transformations. In this case we would naturally like the predictive model to reflect
this independence. If f is our predictive model and Γ the group of transformations, we can
express the property of invariance by the identity f (Aγ x) = f (x), where Aγ x denotes the
action of the transformation γ ∈ Γ on the object x.
There is also a more general scenario where the output of f is another complex object
that is supposed to transform appropriately if the input object is transformed. This scenario

is especially relevant in the setting of multi-layered (or stacked) predictive models, if we want
to propagate the symmetry through the layers. In this case one speaks about equivariance,
and mathematically it is described by the identity f (Aγ x) = Aγ f (x), assuming that the
transformation γ acts in some way not only on inputs, but also on outputs of f . (For
brevity, here and in the sequel we will slightly abuse notation and denote any action of γ
by Aγ , though of course in general the input and output objects are different and γ acts
differently on them. It will be clear which action is meant in a particular context).
A well-known important example of equivariant transformations is given by the convolutional layers in neural networks, where the group Γ is the group of grid translations, Zd .
We find it convenient to roughly distinguish two conceptually different approaches to the
construction of invariant and equivariant models that we refer to as the symmetrization-
based one and the intrinsic one. The symmetrization-based approach consists in starting
from some asymmetric model, and symmetrizing it by a group averaging. On the other
hand, the intrinsic approach consists in imposing prior structural constraints on the model
that guarantee its symmetricity.
In the general mathematical context, the difference between the two approaches is best
illustrated with the example of symmetric polynomials in the variables x1 , . . . , xn , i.e., the
polynomials invariant with respect to arbitrary permutations of these variables. With the
symmetrization-based approach, we can obtain any invariant polynomial by starting with an
arbitrary polynomial f and symmetrizing it over the group of permutations Sn , i.e. by defining $f_{\mathrm{sym}}(x_1,\dots,x_n) = \frac{1}{n!}\sum_{\rho\in S_n} f(x_{\rho(1)},\dots,x_{\rho(n)})$. On the other hand, the intrinsic approach
is associated with the fundamental theorem of symmetric polynomials, which states that
any invariant polynomial fsym in n variables can be obtained as a superposition f (s1 , . . . , sn )
of some polynomial f and the elementary symmetric polynomials s1 , . . . , sn . Though both
approaches yield essentially the same result (an arbitrary symmetric polynomial), the two
constructions are clearly very different.
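To make the contrast concrete, here is a minimal symbolic sketch of both routes (assuming SymPy is available; the helper `symmetrize`, which rewrites a symmetric polynomial in terms of elementary symmetric polynomials, is a SymPy utility and not part of the paper):

```python
from itertools import permutations
import sympy as sp
from sympy.polys.polyfuncs import symmetrize

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = x1**2 * x2  # an arbitrary non-symmetric starting polynomial

# Symmetrization-based approach: average f over all 3! permutations.
f_sym = sp.Rational(1, 6) * sum(
    f.subs(dict(zip((x1, x2, x3), p)), simultaneous=True)
    for p in permutations((x1, x2, x3)))

# Intrinsic approach: rewrite f_sym as a polynomial in the elementary
# symmetric polynomials s1, s2, s3 (fundamental theorem of symmetric
# polynomials); the remainder component should be 0.
print(sp.expand(f_sym))
print(symmetrize(sp.expand(f_sym), x1, x2, x3, formal=True))
```

Both computations produce the same invariant polynomial, but by structurally different constructions, mirroring the discussion above.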
In practical machine learning, symmetrization is ubiquitous. It is often applied both on
the level of data and the level of models. This means that, first, prior to learning an invariant
model, one augments the available set of training examples (x, f (x)) by new examples of the
form (Aγ x, f (x)) (see, for example, Section B.2 of Thoma [2017] for a list of transformations
routinely used to augment datasets for image classification problems). Second, once some,
generally non-symmetric, predictive model $\hat f$ has been learned, it is symmetrized by setting $\hat f_{\mathrm{sym}}(x) = \frac{1}{|\Gamma_0|}\sum_{\gamma\in\Gamma_0}\hat f(A_\gamma x)$, where Γ0 is some subset of Γ (e.g., randomly sampled). This
can be seen as a manifestation of the symmetrization-based approach, and its practicality
probably stems from the fact that the real world symmetries are usually only approximate,
and in this approach one can easily account for their imperfections (e.g., by adjusting the
subset Γ0 ). On the other hand, the weight sharing in convolutional networks (Waibel et al. [1989], le Cun [1989]) can be seen as a manifestation of the intrinsic approach (since the translational symmetry is built into the architecture of the network from the outset), and convnets are ubiquitous in modern machine learning (LeCun et al. [2015]).
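As an illustration of the model-level averaging $\hat f_{\mathrm{sym}}$ just described, here is a hedged Python sketch; `model` and `transforms` are hypothetical placeholders standing in for an arbitrary learned predictor and a sampled subset Γ0 of group actions:

```python
import numpy as np

def symmetrize_predictor(model, transforms):
    """Return x -> (1/|G0|) * sum over gamma in G0 of model(A_gamma x)."""
    def f_sym(x):
        return np.mean([model(t(x)) for t in transforms], axis=0)
    return f_sym

# Example: approximate shift invariance of a predictor on R^d by averaging
# over a few sampled cyclic shifts (an imperfect, sampled symmetrization).
model = lambda x: np.tanh(x).sum()                      # stand-in non-invariant model
transforms = [lambda x, k=k: np.roll(x, k) for k in (0, 3, 7)]
f_sym = symmetrize_predictor(model, transforms)
```

Adjusting the sampled subset is exactly the mechanism mentioned above for accounting for approximate real-world symmetries.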
In this paper we will be interested in the theoretical opportunities of the intrinsic approach
in the context of approximations using neural-network-type models. Suppose, for example,

that f is an invariant map that we want to approximate with the usual ansatz of a perceptron with a single hidden layer, $\hat f(x_1,\dots,x_d) = \sum_{n=1}^{N} c_n \sigma\big(\sum_{k=1}^{d} w_{nk} x_k + h_n\big)$, with some nonlinear activation function σ. Obviously, this ansatz breaks the symmetry, in general. Our goal is
to modify this ansatz in such a way that, first, it does not break the symmetry and, second,
it is complete in the sense that it is not too specialized and any reasonable invariant map
can be arbitrarily well approximated by it. In Section 2 we show how this can be done by
introducing an extra polynomial layer into the model. In Sections 3, 4 we will consider more
complex, deep models (convnets and their modifications). We will understand completeness
in the sense of the universal approximation theorem for neural networks Pinkus [1999].
Designing invariant and equivariant models requires us to decide how the symmetry
information is encoded in the layers. A standard assumption, to which we also will adhere
in this paper, is that the group acts by linear transformations. Precisely, when discussing
invariant models we are looking for maps of the form
f : V → R, (1.1)
where V is a vector space carrying a linear representation R : Γ → GL(V ) of a group Γ.
More generally, in the context of multi-layer models
$f : V_1 \xrightarrow{\,f_1\,} V_2 \xrightarrow{\,f_2\,} \cdots$ (1.2)
we assume that the vector spaces Vk carry linear representations Rk : Γ → GL(Vk ) (the
“baseline architecture” of the model), and we must then ensure equivariance in each link.
Note that a linear action of a group on the input space V1 is a natural and general phe-
nomenon. In particular, the action is linear if V1 is a linear space of functions on some
domain, and the action is induced by (not necessarily linear) transformations of the domain.
Prescribing linear representations Rk is then a viable strategy to encode and upkeep the
symmetry in subsequent layers of the model.
From the perspective of approximation theory, we will be interested in finite computa-
tional models, i.e. including finitely many operations as performed on a standard computer.
Finiteness is important for potential studies of approximation rates (though such a study
is not attempted in the present paper). Compact groups have the nice property that their
irreducible linear representations are finite–dimensional. This allows us, in the case of such
groups, to modify the standard shallow neural network ansatz so as to obtain a compu-
tational model that is finite, fully invariant/equivariant and complete, see Section 2. On
the other hand, irreducible representations of non-compact groups such as Rν are infinite-
dimensional in general. As a result, finite computational models can be only approximately
Rν –invariant/equivariant. Nevertheless, we show in Sections 3, 4 that complete Rν – and
SE(ν)–equivariant models can be rigorously described in terms of appropriate limits of finite
models.

1.2 Related work


Our work can be seen as an extension of results on the universal approximation property of
neural networks (Cybenko [1989], Pinkus [1999], Leshno et al. [1993], Pinkus [1996], Hornik

[1993], Funahashi [1989], Hornik et al. [1989], Mhaskar and Micchelli [1992]) to the setting
of group invariant/equivariant maps and/or infinite-dimensional input spaces.
Our general results in Section 2 are based on classical results of the theory of polynomial
invariants (Hilbert [1890, 1893], Weyl [1946]).
An important element of constructing invariant and equivariant models is the extraction
of invariant and equivariant features. In the present paper we do not focus on this topic, but
it has been studied extensively, see e.g. general results along with applications to 2D and
3D pattern recognition in Schulz-Mirbach [1995], Reisert [2008], Burkhardt and Siggelkow
[2001], Skibbe [2013], Manay et al. [2006].
In a series of works reviewed in Cohen et al. [2017], the authors study expressiveness
of deep convolutional networks using hierarchical tensor decompositions and convolutional
arithmetic circuits. In particular, representation universality of several network structures
is examined in Cohen and Shashua [2016].
In a series of works reviewed in Poggio et al. [2017], the authors study expressiveness of
deep networks from the perspective of approximation theory and hierarchical decompositions
of functions. Learning of invariant data representations and its relation to information
processing in the visual cortex has been discussed in Anselmi et al. [2016].
In the series of papers Mallat [2012, 2016], Sifre and Mallat [2014], Bruna and Mallat
[2013], multiscale wavelet-based group invariant scattering operators and their applications
to image recognition have been studied.
There is a large body of work proposing specific constructions of networks for applied
group invariant recognition problems, in particular image recognition approximately invari-
ant with respect to the group of rotations or some of its subgroups: deep symmetry networks
of Gens and Domingos [2014], G-CNNs of Cohen and Welling [2016], networks with extra
slicing operations in Dieleman et al. [2016], RotEqNets of Marcos et al. [2016], networks
with warped convolutions in Henriques and Vedaldi [2016], Polar Transformer Networks of
Esteves et al. [2017].

1.3 Contribution of this paper


As discussed above, we will be interested in the following general question: assuming there
is a “ground truth” invariant or equivariant map f , how can we “intrinsically” approximate
it by a neural-network-like model? Our goal is to describe models that are finite, invariant/
equivariant (up to limitations imposed by the finiteness of the model) and provably complete
in the sense of approximation theory.
Our contribution is three-fold:

• In Section 2 we consider general compact groups and approximations by shallow net-


works. Using the classical polynomial invariant theory, we describe a general con-
struction of shallow networks with an extra polynomial layer which are exactly in-
variant/equivariant and complete (Propositions 2.3, 2.4). Then, we discuss how this
construction can be improved using the idea of polarization and a theorem of Weyl
(Propositions 2.5, 2.7). Finally, as a particular illustration of the “intrinsic” framework,

we consider maps invariant with respect to the symmetric group SN , and describe a
corresponding neural network model which is SN –invariant and complete (Theorem
2.4). This last result is based on another theorem of Weyl.

• In Section 3 we prove several versions of the universal approximation theorem for


convolutional networks and groups of translations. The main novelty of these results
is that we approximate maps f defined on the infinite–dimensional space of continuous
signals on Rν . Specifically, one of these versions (Theorem 3.1) states that a signal
transformation f : L2 (Rν , RdV ) → L2 (Rν , RdU ) can be approximated, in some natural
sense, by convnets without pooling if and only if f is continuous and translationally–
equivariant (here, by L2 (Rν , Rd ) we denote the space of square-integrable functions
Φ : Rν → Rd ). Another version (Theorem 3.2) states that a map f : L2 (Rν , RdV ) → R
can be approximated by convnets with pooling if and only if f is continuous.

• In Section 4 we describe a convnet-like model which is a universal approximator for


signal transformations f : L2 (R2 , RdV ) → L2 (R2 , RdU ) equivariant with respect to the
group SE(2) of rigid two-dimensional euclidean motions. We call this model charge–
conserving convnet, based on a 2D quantum mechanical analogy (conservation of the
total angular momentum). The crucial element of the construction is that the oper-
ation of the network is consistent with the decomposition of the feature space into
isotypic representations of SO(2). We prove in Theorem 4.1 that a transformation
f : L2 (R2 , RdV ) → L2 (R2 , RdU ) can be approximated by charge–conserving convnets if
and only if f is continuous and SE(2)–equivariant.

2 Compact groups and shallow approximations


In this section we give several results on invariant/equivariant approximations by neural
networks in the context of compact groups, finite-dimensional representations, and shallow
networks. We start by describing the standard group-averaging approach in Section 2.1. In
Section 2.2 we describe an alternative approach, based on the invariant theory. In Section
2.3 we show how one can improve this approach using polarization. Finally, in Section 2.4
we describe an application of this approach to the symmetric group SN .

2.1 Approximations based on symmetrization


We start by recalling the universal approximation theorem, which will serve as a “template” for our invariant and equivariant analogs. There are several versions of this theorem (see the survey Pinkus [1999]); we will use the general and easy-to-state version given in Pinkus [1999].

Theorem 2.1 (Pinkus [1999], Theorem 3.1). Let σ : R → R be a continuous activation function that is not a polynomial. Let V = Rd be a real finite dimensional vector space. Then any continuous map f : V → R can be approximated, in the sense of uniform convergence on compact sets, by maps $\hat f : V \to \mathbb{R}$ of the form

$$\hat f(x_1,\dots,x_d) = \sum_{n=1}^{N} c_n\, \sigma\Big(\sum_{s=1}^{d} w_{ns} x_s + h_n\Big) \qquad (2.1)$$

with some coefficients cn , wns , hn .
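For reference, the ansatz (2.1) is a one-hidden-layer network; a minimal NumPy sketch (parameter names are ours, and tanh stands in for an admissible non-polynomial σ):

```python
import numpy as np

def shallow_net(x, c, W, h, sigma=np.tanh):
    """Ansatz (2.1): f_hat(x) = sum_n c_n * sigma(sum_s W[n, s] * x_s + h_n).

    x: (d,) input; W: (N, d) inner weights; h: (N,) biases; c: (N,) outer coefficients.
    """
    return c @ sigma(W @ x + h)

# Usage: a random width-10 network on R^3.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = shallow_net(x, rng.normal(size=10), rng.normal(size=(10, 3)), rng.normal(size=10))
```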

Throughout the paper, we assume, as in Theorem 2.1, that σ : R → R is some (fixed) continuous activation function that is not a polynomial.
Also, as in this theorem, we will understand approximation in the sense of uniform approximation on compact sets, i.e. meaning that for any compact K ⊂ V and any ε > 0 one can find an approximating map $\hat f$ such that $|f(x) - \hat f(x)| \le \varepsilon$ (or $\|f(x) - \hat f(x)\| \le \varepsilon$ in the case of vector-valued f ) for all x ∈ K. In the case of finite-dimensional spaces V considered
in the present section, one can equivalently say that there is a sequence of approximating
maps fbn uniformly converging to f on any compact set. Later, in Sections 3, 4, we will
consider infinite-dimensional signal spaces V for which such an equivalence does not hold.
Nevertheless, we will use the concept of uniform approximation on compact sets as a guiding
principle in our precise definitions of approximation in that more complex setting.
Now suppose that the space V carries a linear representation R of a group Γ. Assuming
V is finite-dimensional, this means that R is a homomorphism of Γ to the group of linear
automorphisms of V :
R : Γ → GL(V ).
In the present section we will assume that Γ is a compact group, meaning, as is customary,
that Γ is a compact Hausdorff topological space and the group operations (multiplication
and inversion) are continuous. Accordingly, the representation R is also assumed to be
continuous. We remark that an important special case of compact groups are the finite
groups (with respect to the discrete topology).
One important property of compact groups is the existence of a unique, both left- and
right-invariant Haar measure normalized so that the total measure of Γ equals 1. Another
property is that any continuous representation of a compact group on a separable (but pos-
sibly infinite-dimensional) Hilbert space can be decomposed into a countable direct sum of
irreducible finite-dimensional representations. There are many group representation text-
books to which we refer the reader for details, see e.g. Vinberg [2012], Serre [2012], Simon
[1996]. Accordingly, in the present section we will restrict ourselves to finite-dimensional
representations. Later, in Sections 3 and 4, we will consider the noncompact groups Rν
and SE(ν) and their natural representations on the infinite-dimensional space L2 (Rν ), which
cannot be decomposed into countably many irreducibles.
Motivated by applications to neural networks, in this section and Section 3 we will con-
sider only representations over the field R of reals (i.e. with V a real vector space). Later,
in Section 4, we will consider complexified spaces as this simplifies the exposition of the
invariant theory for the group SO(2).

7
For brevity, we will call a vector space carrying a linear representation of a group Γ a Γ-module. We will denote by Rγ the linear automorphism obtained by applying R to γ ∈ Γ. The integral over the normalized Haar measure on a compact group Γ is denoted by $\int_\Gamma \cdot\, d\gamma$. We will denote vectors by boldface characters; scalar components of the vector x are denoted xk .
Recall that given a Γ-module V , we call a map f : V → R Γ-invariant (or simply
invariant) if f (Rγ x) = f (x) for all γ ∈ Γ and x ∈ V . We state now the basic result on
invariant approximation, obtained by symmetrization (group averaging).

Proposition 2.1. Let Γ be a compact group and V a finite-dimensional Γ-module. Then, any continuous invariant map f : V → R can be approximated by Γ-invariant maps $\hat f : V \to \mathbb{R}$ of the form

$$\hat f(\mathbf{x}) = \int_\Gamma \sum_{n=1}^{N} c_n\, \sigma\big(l_n(R_\gamma \mathbf{x}) + h_n\big)\, d\gamma, \qquad (2.2)$$

where $c_n, h_n \in \mathbb{R}$ are some coefficients and $l_n \in V^*$ are some linear functionals on V , i.e. $l_n(\mathbf{x}) = \sum_k w_{nk} x_k$.

Proof. It is clear that the map (2.2) is Γ–invariant, and we only need to prove the completeness part. Let K be a compact subset in V , and ε > 0. Consider the symmetrization of K defined by Ksym = ∪γ∈Γ Rγ(K). Note that Ksym is also a compact set, because it is the image of the compact set Γ × K under the continuous map (γ, x) 7→ Rγ x. We can use Theorem 2.1 to find a map f1 : V → R of the form $f_1(\mathbf{x}) = \sum_{n=1}^{N} c_n \sigma(l_n(\mathbf{x}) + h_n)$ such that $|f(\mathbf{x}) - f_1(\mathbf{x})| \le \varepsilon$ on Ksym . Now consider the Γ-invariant group–averaged map $\hat f(\mathbf{x}) = \int_\Gamma f_1(R_\gamma \mathbf{x})\, d\gamma$. Then for any x ∈ K,

$$|\hat f(\mathbf{x}) - f(\mathbf{x})| = \Big|\int_\Gamma \big(f_1(R_\gamma \mathbf{x}) - f(R_\gamma \mathbf{x})\big)\, d\gamma\Big| \le \int_\Gamma \big|f_1(R_\gamma \mathbf{x}) - f(R_\gamma \mathbf{x})\big|\, d\gamma \le \varepsilon,$$

where we have used the invariance of f and the fact that $|f_1(\mathbf{x}) - f(\mathbf{x})| \le \varepsilon$ for x ∈ Ksym .
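For a finite group the Haar integral in (2.2) is just an average over all group elements. A NumPy sketch under that assumption (we take the cyclic shift group acting on R^d as the example representation; parameter names are ours):

```python
import numpy as np

def invariant_net(x, c, W, h, reps, sigma=np.tanh):
    """Ansatz (2.2) for a finite group: average the shallow network (2.1)
    over all representation matrices R_gamma listed in `reps`."""
    return np.mean([c @ sigma(W @ (R @ x) + h) for R in reps], axis=0)

# Example: Z_d acting on R^d by cyclic coordinate shifts (permutation matrices).
d = 5
reps = [np.roll(np.eye(d), k, axis=0) for k in range(d)]
rng = np.random.default_rng(1)
x, c, W, h = rng.normal(size=d), rng.normal(size=8), rng.normal(size=(8, d)), rng.normal(size=8)
# Exact invariance under any shift of the input, since shifts permute the group:
assert np.allclose(invariant_net(x, c, W, h, reps),
                   invariant_net(np.roll(x, 2), c, W, h, reps))
```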
Now we establish a similar result for equivariant maps. Let V, U be two Γ-modules. For
brevity, we will denote by R the representation of Γ in either of them (it will be clear from the
context which one is meant). We call a map f : V → U Γ-equivariant if f (Rγ x) = Rγ f (x)
for all γ ∈ Γ and x ∈ V .

Proposition 2.2. Let Γ be a compact group and V and U two finite-dimensional Γ-modules. Then, any continuous Γ-equivariant map f : V → U can be approximated by Γ-equivariant maps $\hat f : V \to U$ of the form

$$\hat f(\mathbf{x}) = \int_\Gamma \sum_{n=1}^{N} R_\gamma^{-1} \mathbf{y}_n\, \sigma\big(l_n(R_\gamma \mathbf{x}) + h_n\big)\, d\gamma, \qquad (2.3)$$

with some coefficients hn ∈ R, linear functionals ln ∈ V ∗ , and vectors yn ∈ U .

8
Proof. The proof is analogous to the proof of Proposition 2.1. Fix any norm ‖ · ‖ in U . Given a compact set K and ε > 0, we construct the compact set Ksym = ∪γ∈Γ Rγ(K) as before. Next, we find f1 : V → U of the form $f_1(\mathbf{x}) = \sum_{n=1}^{N} \mathbf{y}_n \sigma(l_n(\mathbf{x}) + h_n)$ such that $\|f(\mathbf{x}) - f_1(\mathbf{x})\| \le \varepsilon$ on Ksym (we can do it, for example, by considering scalar components of f with respect to some basis in U , and approximating these components using Theorem 2.1). Finally, we define the symmetrized map by $\hat f(\mathbf{x}) = \int_\Gamma R_\gamma^{-1} f_1(R_\gamma \mathbf{x})\, d\gamma$. This map is Γ–equivariant, and, for any x ∈ K,

$$\|\hat f(\mathbf{x}) - f(\mathbf{x})\| = \Big\|\int_\Gamma \big(R_\gamma^{-1} f_1(R_\gamma \mathbf{x}) - R_\gamma^{-1} f(R_\gamma \mathbf{x})\big)\, d\gamma\Big\| \le \max_{\gamma\in\Gamma}\|R_\gamma\| \int_\Gamma \big\|f_1(R_\gamma \mathbf{x}) - f(R_\gamma \mathbf{x})\big\|\, d\gamma \le \varepsilon \max_{\gamma\in\Gamma}\|R_\gamma\|.$$

By continuity of R and compactness of Γ, $\max_{\gamma\in\Gamma}\|R_\gamma\| < \infty$, so we can approximate f by $\hat f$ on K with any accuracy.
Propositions 2.1, 2.2 present the “symmetrization–based” approach to constructing in-
variant/equivariant approximations relying on the shallow neural network ansatz (2.1). The
approximating expressions (2.2), (2.3) are Γ–invariant/equivariant and universal. Moreover,
in the case of finite groups the integrals in these expressions are finite sums, i.e. these ap-
proximations consist of finitely many arithmetic operations and evaluations of the activation
function σ. In the case of infinite groups, the integrals can be approximated by sampling
the group.
In the remainder of Section 2 we will pursue an alternative approach to symmetrize the
neural network ansatz, based on the theory of polynomial invariants.
We finish this subsection with the following general observation. Suppose that we have two Γ-modules U, V , and U can be decomposed into Γ–invariant submodules: $U = \bigoplus_\beta U_\beta^{m_\beta}$ (where mβ denotes the multiplicity of Uβ in U ). Then a map f : V → U is equivariant if and only if it is equivariant in each component Uβ of the output space. Moreover, if we denote by Equiv(V, U ) the space of continuous equivariant maps f : V → U , then

$$\mathrm{Equiv}\Big(V, \bigoplus_\beta U_\beta^{m_\beta}\Big) = \bigoplus_\beta \mathrm{Equiv}(V, U_\beta)^{m_\beta}. \qquad (2.4)$$

This shows that the task of describing equivariant maps f : V → U reduces to the task of
describing equivariant maps f : V → Uβ . In particular, describing vector-valued invariant
maps f : V → RdU reduces to describing scalar-valued invariant maps f : V → R.

2.2 Approximations based on polynomial invariants


The invariant theory seeks to describe polynomial invariants of group representations, i.e.
polynomial maps f : V → R such that f (Rγ x) ≡ f (x) for all x ∈ V . A fundamental result

of the invariant theory is Hilbert’s finiteness theorem Hilbert [1890, 1893] stating that for
completely reducible representations, all the polynomial invariants are algebraically gener-
ated by a finite number of such invariants. In particular, this holds for any representation
of a compact group.
Theorem 2.2 (Hilbert). Let Γ be a compact group and V a finite-dimensional Γ-module. Then there exist finitely many polynomial invariants f1 , . . . , fNinv : V → R such that any polynomial invariant r : V → R can be expressed as

$$r(\mathbf{x}) = \tilde r\big(f_1(\mathbf{x}), \dots, f_{N_{\mathrm{inv}}}(\mathbf{x})\big)$$

with some polynomial $\tilde r$ of Ninv variables.

See, e.g., Kraft and Procesi [2000] for a modern exposition of the invariant theory and Hilbert's theorem. We refer to the set $\{f_s\}_{s=1}^{N_{\mathrm{inv}}}$ from this theorem as a generating set of polynomial invariants (note that this set is not unique and Ninv may be different for different generating sets).
Thanks to the density of polynomials in the space of continuous functions, we can easily
combine Hilbert’s theorem with the universal approximation theorem to obtain a complete
invariant ansatz for invariant maps:
Proposition 2.3. Let Γ be a compact group, V a finite-dimensional Γ-module, and f1 , . . . , fNinv : V → R a finite generating set of polynomial invariants on V (existing by Hilbert's theorem). Then, any continuous invariant map f : V → R can be approximated by invariant maps $\hat f : V \to \mathbb{R}$ of the form

$$\hat f(\mathbf{x}) = \sum_{n=1}^{N} c_n\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} w_{ns} f_s(\mathbf{x}) + h_n\Big) \qquad (2.5)$$

with some parameter N and coefficients cn , wns , hn .


Proof. It is obvious that the expressions $\hat f$ are Γ-invariant, so we only need to prove the completeness part.

Let us first show that the map f can be approximated by an invariant polynomial. Let K be a compact subset in V , and, like before, consider the symmetrized set Ksym . By the Stone-Weierstrass theorem, for any ε > 0 there exists a polynomial r on V such that $|r(\mathbf{x}) - f(\mathbf{x})| \le \varepsilon$ for x ∈ Ksym . Consider the symmetrized function $r_{\mathrm{sym}}(\mathbf{x}) = \int_\Gamma r(R_\gamma \mathbf{x})\, d\gamma$. Then the function rsym is invariant and $|r_{\mathrm{sym}}(\mathbf{x}) - f(\mathbf{x})| \le \varepsilon$ for x ∈ K. On the other hand, rsym is a polynomial, since r(Rγ x) is a fixed degree polynomial in x for any γ.

Using Hilbert's theorem, we express $r_{\mathrm{sym}}(\mathbf{x}) = \tilde r(f_1(\mathbf{x}), \dots, f_{N_{\mathrm{inv}}}(\mathbf{x}))$ with some polynomial $\tilde r$.

It remains to approximate the polynomial $\tilde r(z_1,\dots,z_{N_{\mathrm{inv}}})$ by an expression of the form $\tilde f(z_1,\dots,z_{N_{\mathrm{inv}}}) = \sum_{n=1}^{N} c_n \sigma\big(\sum_{s=1}^{N_{\mathrm{inv}}} w_{ns} z_s + h_n\big)$ on the compact set $\{(f_1(\mathbf{x}),\dots,f_{N_{\mathrm{inv}}}(\mathbf{x})) \,|\, \mathbf{x} \in K\} \subset \mathbb{R}^{N_{\mathrm{inv}}}$. By Theorem 2.1, we can do it with any accuracy ε. Setting finally $\hat f(\mathbf{x}) = \tilde f(f_1(\mathbf{x}),\dots,f_{N_{\mathrm{inv}}}(\mathbf{x}))$, we obtain $\hat f$ of the required form such that $|\hat f(\mathbf{x}) - f(\mathbf{x})| \le 2\varepsilon$ for all x ∈ K.

Note that Proposition 2.3 is a generalization of Theorem 2.1; the latter is a special case
obtained if the group is trivial (Γ = {e}) or its representation is trivial (Rγ x ≡ x), and in
this case we can just take Ninv = d and fs (x) = xs .
In terms of neural network architectures, formula (2.5) can be viewed as a shallow neural
network with an extra polynomial layer that precedes the conventional linear combination
and nonlinear activation layers.
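Concretely, a model of the form (2.5) could look as follows. This is a sketch only: the generating invariants are problem-dependent and must be supplied; for the example we use the invariants x1², x1x2, x2², which generate the invariants (all even polynomials) of the group Z2 acting on R² by x ↦ −x:

```python
import numpy as np

def invariant_poly_net(x, invariants, c, W, h, sigma=np.tanh):
    """Ansatz (2.5): a fixed polynomial-invariant layer f_1, ..., f_Ninv followed
    by a conventional shallow network; invariance is exact by construction."""
    z = np.array([f(x) for f in invariants])
    return c @ sigma(W @ z + h)

# Generating invariants for Z_2 = {1, -1} acting on R^2 by x -> -x:
invs = [lambda x: x[0]**2, lambda x: x[0]*x[1], lambda x: x[1]**2]
rng = np.random.default_rng(2)
x, c, W, h = rng.normal(size=2), rng.normal(size=6), rng.normal(size=(6, 3)), rng.normal(size=6)
assert np.isclose(invariant_poly_net(x, invs, c, W, h),
                  invariant_poly_net(-x, invs, c, W, h))
```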
We extend now the obtained result to equivariant maps. Given two Γ-modules V and U ,
we say that a map f : V → U is polynomial if l ◦ f is a polynomial for any linear functional
l : U → R. We rely on the extension of Hilbert’s theorem to polynomial equivariants:

Lemma 2.1. Let Γ be a compact group and V and U two finite-dimensional Γ-modules. Then there exist finitely many polynomial invariants f1 , . . . , fNinv : V → R and polynomial equivariants g1 , . . . , gNeq : V → U such that any polynomial equivariant rsym : V → U can be represented in the form $r_{\mathrm{sym}}(\mathbf{x}) = \sum_{m=1}^{N_{\mathrm{eq}}} g_m(\mathbf{x})\, \tilde r_m\big(f_1(\mathbf{x}), \dots, f_{N_{\mathrm{inv}}}(\mathbf{x})\big)$ with some polynomials $\tilde r_m$.

Proof. We give a sketch of the proof, see e.g. Section 4 of Worfolk [1994] for details. A
polynomial equivariant rsym : V → U can be viewed as an invariant element of the space
R[V ]⊗U with the naturally induced action of Γ, where R[V ] denotes the space of polynomials
on V . The space R[V ] ⊗ U is in turn a subspace of the algebra R[V ⊕ U ∗ ], where U ∗ denotes
the dual of U . By Hilbert’s theorem, all invariant elements in R[V ⊕ U ∗ ] can be generated
as polynomials of finitely many invariant elements of this algebra. The algebra R[V ⊕ U ∗ ] is
graded by the degree of the U ∗ component, and the corresponding decomposition of R[V ⊕U ∗ ]
into the direct sum of U ∗ -homogeneous spaces indexed by the U ∗ -degree dU ∗ = 0, 1, . . . ,
is preserved by the group action. The finitely many polynomials generating all invariant
polynomials in R[V ⊕ U ∗ ] can also be assumed to be U ∗ -homogeneous. Let $\{f_s\}_{s=1}^{N_{\mathrm{inv}}}$ be those of these generating polynomials with dU ∗ = 0 and $\{g_s\}_{s=1}^{N_{\mathrm{eq}}}$ be those with dU ∗ = 1. Then, a polynomial in the generating invariants is U ∗ -homogeneous with dU ∗ = 1 if and only if it is a linear combination of monomials $g_s f_1^{n_1} f_2^{n_2} \cdots f_{N_{\mathrm{inv}}}^{n_{N_{\mathrm{inv}}}}$. This yields the representation stated in the lemma.

We will refer to the set $\{g_s\}_{s=1}^{N_{\mathrm{eq}}}$ as a generating set of polynomial equivariants.
The equivariant analog of Proposition 2.3 now reads:

Proposition 2.4. Let Γ be a compact group, V and U be two finite-dimensional Γ-modules. Let f1 , . . . , fNinv : V → R be a finite generating set of polynomial invariants and g1 , . . . , gNeq : V → U be a finite generating set of polynomial equivariants (existing by Lemma 2.1). Then, any continuous equivariant map f : V → U can be approximated by equivariant maps $\hat f : V \to U$ of the form

$$\hat f(\mathbf{x}) = \sum_{n=1}^{N} \sum_{m=1}^{N_{\mathrm{eq}}} c_{mn}\, g_m(\mathbf{x})\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} w_{mns} f_s(\mathbf{x}) + h_{mn}\Big)$$

with some parameter N and coefficients cmn , wmns , hmn .

Proof. The proof is similar to the proof of Proposition 2.3, with the difference that the polynomial map r is now vector-valued, its symmetrization is defined by $r_{\mathrm{sym}}(\mathbf{x}) = \int_\Gamma R_\gamma^{-1} r(R_\gamma \mathbf{x})\, d\gamma$, and Lemma 2.1 is used in place of Hilbert's theorem.

We remark that, in turn, Proposition 2.4 generalizes Proposition 2.3; the latter is a special
case obtained when U = R, and in this case we just take Neq = 1 and g1 = 1.

2.3 Polarization and multiplicity reduction


The main point of Propositions 2.3 and 2.4 is that the representations described there use finite generating sets of invariants and equivariants $\{f_s\}_{s=1}^{N_{\mathrm{inv}}}$, $\{g_m\}_{m=1}^{N_{\mathrm{eq}}}$ independent of the
function f being approximated. However, the obvious drawback of these results is their
non-constructive nature with regard to the functions fs , gm . In general, finding generating
sets is not easy. Moreover, the sizes Ninv , Neq of these sets in general grow rapidly with the
dimensions of the spaces V, U .
This issue can be somewhat ameliorated using polarization and Weyl's theorem. Suppose that a Γ–module V admits a decomposition into a direct sum of invariant submodules:

$$V = \bigoplus_\alpha V_\alpha^{m_\alpha}. \qquad (2.6)$$

Here, $V_\alpha^{m_\alpha}$ is a direct sum of mα submodules isomorphic to Vα :

$$V_\alpha^{m_\alpha} = V_\alpha \otimes \mathbb{R}^{m_\alpha} = \underbrace{V_\alpha \oplus \dots \oplus V_\alpha}_{m_\alpha}. \qquad (2.7)$$
Any finite-dimensional representation of a compact group is completely reducible and has


a decomposition of the form (2.6) with non-isomorphic irreducible submodules Vα . In this
case the decomposition (2.6) is referred to as the isotypic decomposition, and the subspaces
Vαmα are known as isotypic components. Such isotypic components and their multiplicities
mα are uniquely determined (though individually, the mα spaces Vα appearing in the direct
sum (2.7) are not uniquely determined, in general, as subspaces in V ).
For finite groups the number of non-isomorphic irreducibles α is finite. In this case, if the
module V is high-dimensional, then this necessarily means that (some of) the multiplicities
mα are large. This is not so, in general, for infinite groups, since infinite compact groups
have countably many non-isomorphic irreducible representations. Nevertheless, it is in any
case useful to simplify the structure of invariants for high–multiplicity modules, which is
what polarization and Weyl’s theorem do.
Below, we slightly abuse the terminology and speak of isotypic components and decom-
positions in the broader sense, assuming decompositions (2.6), (2.7) but not requiring the
submodules Vα to be irreducible or mutually non-isomorphic.
The idea of polarization is to generate polynomial invariants of a representation with large multiplicities from invariants of a representation with small multiplicities. Namely, note that in each isotypic component $V_\alpha^{m_\alpha}$ written as $V_\alpha \otimes \mathbb{R}^{m_\alpha}$ the group essentially acts only on the first factor, Vα . So, given two isotypic Γ-modules of the same type, $V_\alpha^{m_\alpha} = V_\alpha \otimes \mathbb{R}^{m_\alpha}$ and $V_\alpha^{m'_\alpha} = V_\alpha \otimes \mathbb{R}^{m'_\alpha}$, the group action commutes with any linear map $1_{V_\alpha} \otimes A : V_\alpha^{m_\alpha} \to V_\alpha^{m'_\alpha}$, where A acts on the second factor, $A : \mathbb{R}^{m_\alpha} \to \mathbb{R}^{m'_\alpha}$. Consequently, given two modules $V = \bigoplus_\alpha V_\alpha^{m_\alpha}$, $V' = \bigoplus_\alpha V_\alpha^{m'_\alpha}$ and a linear map $A_\alpha : V_\alpha^{m_\alpha} \to V_\alpha^{m'_\alpha}$ for each α, the linear operator A : V → V ′ defined by

$$A = \bigoplus_\alpha 1_{V_\alpha} \otimes A_\alpha \qquad (2.8)$$

will commute with the group action. In particular, if f is a polynomial invariant on V ′, then f ◦ A will be a polynomial invariant on V .
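In coordinates, the operator (2.8) is block-diagonal with Kronecker-product blocks. A sketch (assuming SciPy, and a basis ordering in which each isotypic component V_α ⊗ R^{m_α} is laid out V_α-major; the helper name is ours):

```python
import numpy as np
from scipy.linalg import block_diag

def polarization_operator(A_blocks, dims):
    """Assemble A = direct sum over alpha of (1_{V_alpha} (x) A_alpha), eq. (2.8).

    A_blocks[i]: an (m'_alpha x m_alpha) matrix acting on the multiplicity factor;
    dims[i]: dim V_alpha. The result commutes with the group action by construction.
    """
    return block_diag(*[np.kron(np.eye(d), A) for d, A in zip(dims, A_blocks)])

# Example: components with dim V_alpha = 2, 3 and multiplicities 4, 5, mapped
# down to multiplicities 2, 3 (m'_alpha = dim V_alpha, as in Weyl's theorem below).
rng = np.random.default_rng(3)
A = polarization_operator([rng.normal(size=(2, 4)), rng.normal(size=(3, 5))], [2, 3])
print(A.shape)  # (2*2 + 3*3, 2*4 + 3*5) = (13, 23)
```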
The fundamental theorem of Weyl states that it suffices to take m0α = dim Vα to generate
in this way a complete set of invariants for V . We will state this theorem in the following
form suitable for our purposes.
Theorem 2.3 (Weyl [1946], sections II.4-5). Let F be the set of polynomial invariants for a Γ-module $V' = \bigoplus_\alpha V_\alpha^{\dim V_\alpha}$. Suppose that a Γ-module V admits a decomposition $V = \bigoplus_\alpha V_\alpha^{m_\alpha}$ with the same Vα , but arbitrary multiplicities mα . Then the polynomials {f ◦ A}f ∈F linearly span the space of polynomial invariants on V , i.e. any polynomial invariant f on V can be expressed as $f(\mathbf{x}) = \sum_{t=1}^{T} f_t(A_t \mathbf{x})$ with some polynomial invariants ft on V ′.
Proof. A detailed exposition of polarization and a proof of Weyl's theorem based on the Capelli–Deruyts expansion can be found in Weyl's book or in Sections 7–9 of Kraft and Procesi [2000]. We sketch the main idea of the proof.

Consider first the case where V has only one isotypic component: $V = V_\alpha^{m_\alpha}$. We may assume without loss of generality that mα > dim Vα (otherwise the statement is trivial). It is also convenient to identify the space $V' = V_\alpha^{\dim V_\alpha}$ with the subspace of V spanned by the first dim Vα components Vα . It suffices to establish the claimed expansion for polynomials f multihomogeneous with respect to the decomposition V = Vα ⊕ . . . ⊕ Vα , i.e. homogeneous with respect to each of the mα components. For any such polynomial, the Capelli–Deruyts expansion represents f as a finite sum $f = \sum_n C_n B_n f$. Here Cn , Bn are linear operators on the space of polynomials on V , and they belong to the algebra generated by polarization operators on V . Moreover, for each n, the polynomial $\tilde f_n = B_n f$ depends only on variables from the first dim Vα components of $V = V_\alpha^{m_\alpha}$, i.e. $\tilde f_n$ is a polynomial on V ′. This polynomial is invariant, since polarization operators commute with the group action. Since Cn belongs to the algebra generated by polarization operators, we can then argue (see Proposition 7.4 in Kraft and Procesi [2000]) that Cn Bn f can be represented as a finite sum $C_n B_n f(\mathbf{x}) = \sum_k \tilde f_n\big((1_{V_\alpha} \otimes A_{kn})\mathbf{x}\big)$ with some mα × dim Vα matrices Akn . This implies the claim of the theorem in the case of a single isotypic component.

Generalization to several isotypic components is obtained by iteratively applying the Capelli–Deruyts expansion to each component.
Now we can give a more constructive version of Proposition 2.3:
Proposition 2.5. Let $(f_s)_{s=1}^{N_{\mathrm{inv}}}$ be a generating set of polynomial invariants for a Γ-module $V' = \bigoplus_\alpha V_\alpha^{\dim V_\alpha}$. Suppose that a Γ-module V admits a decomposition $V = \bigoplus_\alpha V_\alpha^{m_\alpha}$ with the same Vα , but arbitrary multiplicities mα . Then any continuous invariant map f : V → R can be approximated by invariant maps $\hat f : V \to \mathbb{R}$ of the form

$$\hat f(\mathbf{x}) = \sum_{t=1}^{T} c_t\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} w_{st} f_s(A_t \mathbf{x}) + h_t\Big) \qquad (2.9)$$

with some parameter T and coefficients ct , wst , ht , At , where each At is formed by an arbitrary collection of (mα × dim Vα )-matrices Aα as in (2.8).

Proof. We follow the proof of Proposition 2.3 and approximate the function f by an invariant polynomial rsym on a compact set Ksym ⊂ V . Then, using Theorem 2.3, we represent

$$r_{\mathrm{sym}}(\mathbf{x}) = \sum_{t=1}^{T} r_t(A_t \mathbf{x}) \qquad (2.10)$$

with some invariant polynomials rt on V ′. Then, by Proposition 2.3, for each t we can approximate rt (y) on At Ksym by an expression

$$\sum_{n=1}^{N} \tilde c_{nt}\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} \tilde w_{nst} f_s(\mathbf{y}) + \tilde h_{nt}\Big) \qquad (2.11)$$

with some $\tilde c_{nt}, \tilde w_{nst}, \tilde h_{nt}$. Combining (2.10) with (2.11), it follows that f can be approximated on Ksym by

$$\sum_{t=1}^{T} \sum_{n=1}^{N} \tilde c_{nt}\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} \tilde w_{nst} f_s(A_t \mathbf{x}) + \tilde h_{nt}\Big).$$

The final expression (2.9) is obtained now by removing the superfluous summation over n.
Proposition 2.5 is more constructive than Proposition 2.3 in the sense that the approximating ansatz (2.9) only requires us to know an isotypic decomposition $V = \bigoplus_\alpha V_\alpha^{m_\alpha}$ of the Γ-module under consideration and a generating set $(f_s)_{s=1}^{N_{\mathrm{inv}}}$ for the reference module $V' = \bigoplus_\alpha V_\alpha^{\dim V_\alpha}$. In particular, suppose that the group Γ is finite, so that there are only finitely many non-isomorphic irreducible modules Vα . Then, for any Γ–module V , the universal approximating ansatz (2.9) includes not more than CT dim V scalar weights, with some constant C depending only on Γ (since $\dim V = \sum_\alpha m_\alpha \dim V_\alpha$).
some constant C depending only on Γ (since dim V = α mα dim Vα ).
We remark that in terms of the network architecture, formula (2.9) can be interpreted as
the network (2.5) from Proposition 2.3 with an extra linear layer performing multiplication
of the input vector by At .
We establish now an equivariant analog of Proposition 2.5. We start with an equivariant
analog of Theorem 2.3.

Proposition 2.6. Let $V' = \bigoplus_\alpha V_\alpha^{\dim V_\alpha}$ and G be the space of polynomial equivariants g : V ′ → U . Suppose that a Γ-module V admits a decomposition $V = \bigoplus_\alpha V_\alpha^{m_\alpha}$ with the same Vα , but arbitrary multiplicities mα . Then, the functions {g ◦ A}g∈G linearly span the space of polynomial equivariants g : V → U , i.e. any such equivariant can be expressed as $g(\mathbf{x}) = \sum_{t=1}^{T} g_t(A_t \mathbf{x})$ with some polynomial equivariants gt : V ′ → U .

Proof. As mentioned in the proof of Lemma 2.1, polynomial equivariants g : V → U can be viewed as invariant elements of the extended polynomial algebra R[V ⊕ U ∗ ]. The proof of the theorem is then completely analogous to the proof of Theorem 2.3 and consists in applying the Capelli–Deruyts expansion to each isotypic component of the submodule V in V ⊕ U ∗ .
The equivariant analog of Proposition 2.5 now reads:

Proposition 2.7. Let $(f_s)_{s=1}^{N_{\mathrm{inv}}}$ be a generating set of polynomial invariants for a Γ-module $V' = \bigoplus_\alpha V_\alpha^{\dim V_\alpha}$, and $(g_s)_{s=1}^{N_{\mathrm{eq}}}$ be a generating set of polynomial equivariants mapping V ′ to a Γ-module U . Let $V = \bigoplus_\alpha V_\alpha^{m_\alpha}$ be a Γ-module with the same Vα . Then any continuous equivariant map f : V → U can be approximated by equivariant maps $\hat f : V \to U$ of the form

$$\hat f(\mathbf{x}) = \sum_{t=1}^{T} \sum_{m=1}^{N_{\mathrm{eq}}} c_{mt}\, g_m(A_t \mathbf{x})\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} w_{mst} f_s(A_t \mathbf{x}) + h_{mt}\Big) \qquad (2.12)$$

with some coefficients cmt , wmst , hmt , At , where each At is given by a collection of (mα × dim Vα )-matrices Aα as in (2.8).

Proof. As in the proof of Proposition 2.4, we approximate the function f by a polynomial equivariant rsym on a compact Ksym ⊂ V . Then, using Proposition 2.6, we represent

$$r_{\mathrm{sym}}(\mathbf{x}) = \sum_{t=1}^{T} r_t(A_t \mathbf{x}) \qquad (2.13)$$

with some polynomial equivariants rt : V ′ → U . Then, by Proposition 2.4, for each t we can approximate rt (x′) on At Ksym by expressions

$$\sum_{n=1}^{N} \sum_{m=1}^{N_{\mathrm{eq}}} \tilde c_{mnt}\, g_m(\mathbf{x}')\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} \tilde w_{mnst} f_s(\mathbf{x}') + \tilde h_{mnt}\Big). \qquad (2.14)$$

Using (2.13) and (2.14), f can be approximated on Ksym by expressions

$$\sum_{t=1}^{T} \sum_{n=1}^{N} \sum_{m=1}^{N_{\mathrm{eq}}} \tilde c_{mnt}\, g_m(A_t \mathbf{x})\, \sigma\Big(\sum_{s=1}^{N_{\mathrm{inv}}} \tilde w_{mnst} f_s(A_t \mathbf{x}) + \tilde h_{mnt}\Big).$$

We obtain the final form (2.12) by removing the superfluous summation over n.

We remark that Proposition 2.7 improves the earlier Proposition 2.4 in the equivariant
setting in the same sense in which Proposition 2.5 improves Proposition 2.3 in the invariant
setting: construction of a universal approximator in the case of arbitrary isotypic multi-
plicities is reduced to the construction with particular multiplicities by adding an extra
equivariant linear layer to the network.

2.4 The symmetric group SN


Even with the simplification resulting from polarization, the general results of the previous
section are not immediately useful, since one still needs to find the isotypic decomposition
of the analyzed Γ-modules and to find the relevant generating invariants and equivariants.
In this section we describe one particular case where the approximating expression can be
reduced to a fully explicit form.
Namely, consider the natural action of the symmetric group SN on RN :

$$R_\gamma \mathbf{e}_n = \mathbf{e}_{\gamma(n)},$$

where $\mathbf{e}_n \in \mathbb{R}^N$ is a coordinate vector and γ ∈ SN is a permutation.

Let $V = \mathbb{R}^N \otimes \mathbb{R}^M$ and consider V as a SN -module by assuming that the group acts on the first factor, i.e. γ acts on $\mathbf{x} = \sum_{n=1}^{N} \mathbf{e}_n \otimes \mathbf{x}_n \in V$ by

$$R_\gamma \sum_{n=1}^{N} \mathbf{e}_n \otimes \mathbf{x}_n = \sum_{n=1}^{N} \mathbf{e}_{\gamma(n)} \otimes \mathbf{x}_n.$$

We remark that this module appears, for example, in the following scenario. Suppose that f is a map defined on the set of sets $X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ of N vectors from RM . We can identify the set X with the element $\sum_{n=1}^{N} \mathbf{e}_n \otimes \mathbf{x}_n$ of V and in this way view f as defined on a subset of V . However, since the set X is unordered, it can also be identified with $\sum_{n=1}^{N} \mathbf{e}_{\gamma(n)} \otimes \mathbf{x}_n$ for any permutation γ ∈ SN . Accordingly, if the map f is to be extended to the whole V , then this extension needs to be invariant with respect to the above action of SN .
We describe now an explicit complete ansatz for SN -invariant approximations of functions on V . This is made possible by another classical theorem of Weyl and by a simple form of a generating set of permutation invariants on RN . We will denote by xnm the coordinates of x ∈ V with respect to the canonical basis in V :

$$\mathbf{x} = \sum_{n=1}^{N} \sum_{m=1}^{M} x_{nm}\, \mathbf{e}_n \otimes \mathbf{e}_m.$$

Theorem 2.4. Let $V = \mathbb{R}^N \otimes \mathbb{R}^M$ and f : V → R be a SN -invariant continuous map. Then f can be approximated by SN -invariant expressions

$$\hat f(\mathbf{x}) = \sum_{t=1}^{T_1} c_t\, \sigma\Big(\sum_{q=1}^{T_2} w_{qt} \sum_{n=1}^{N} \sigma\Big(b_q \sum_{m=1}^{M} a_{tm} x_{nm} + e_q\Big) + h_t\Big), \qquad (2.15)$$

with some parameters T1 , T2 and coefficients ct , wqt , bq , atm , eq , ht .

Proof. It is clear that expression (2.15) is SN -invariant and we only need to prove its completeness. The theorem of Weyl (Weyl [1946], Section II.3) states that a generating set of symmetric polynomials on V can be obtained by polarizing a generating set of symmetric polynomials $\{f_p\}_{p=1}^{N_{\mathrm{inv}}}$ defined on a single copy of RN . Arguing as in Proposition 2.5, it follows that any SN -invariant continuous map f : V → R can be approximated by expressions

$$\sum_{t=1}^{T_1} \tilde c_t\, \sigma\Big(\sum_{p=1}^{N_{\mathrm{inv}}} \tilde w_{pt} f_p(\tilde A_t \mathbf{x}) + \tilde h_t\Big),$$

where $\tilde A_t \mathbf{x} = \sum_{n=1}^{N} \sum_{m=1}^{M} \tilde a_{tm} x_{nm} \mathbf{e}_n$. A well-known generating set of symmetric polynomials on RN is the first N coordinate power sums:

$$f_p(\mathbf{y}) = \sum_{n=1}^{N} \tilde f_p(y_n), \quad \text{where } \mathbf{y} = (y_1,\dots,y_N), \ \tilde f_p(y_n) = y_n^p, \ p = 1,\dots,N.$$

It follows that f can be approximated by expressions

$$\sum_{t=1}^{T_1} \tilde c_t\, \sigma\Big(\sum_{p=1}^{N} \tilde w_{pt} \sum_{n=1}^{N} \tilde f_p\Big(\sum_{m=1}^{M} \tilde a_{tm} x_{nm}\Big) + \tilde h_t\Big). \qquad (2.16)$$

Using Theorem 2.1, we can approximate $\tilde f_p(y)$ by expressions $\sum_{q=1}^{T} \tilde d_{pq}\, \sigma(\tilde b_{pq} y + \tilde h_{pq})$. It follows that (2.16) can be approximated by

$$\sum_{t=1}^{T_1} \tilde c_t\, \sigma\Big(\sum_{p=1}^{N} \sum_{q=1}^{T} \tilde w_{pt} \tilde d_{pq} \sum_{n=1}^{N} \sigma\Big(\tilde b_{pq} \sum_{m=1}^{M} \tilde a_{tm} x_{nm} + \tilde h_{pq}\Big) + \tilde h_t\Big).$$

Replacing the double summation over p, q by a single summation over q, we arrive at (2.15).

Note that expression (2.15) resembles the formula of the usual (non-invariant) feedforward network with two hidden layers of sizes T1 and T2 :

$$\hat f(\mathbf{x}) = \sum_{t=1}^{T_1} c_t\, \sigma\Big(\sum_{q=1}^{T_2} w_{qt}\, \sigma\Big(\sum_{n=1}^{N} \sum_{m=1}^{M} a_{qnm} x_{nm} + e_q\Big) + h_t\Big).$$

Let us also compare ansatz (2.15) with the ansatz obtained by direct symmetrization (see Proposition 2.1), which in our case has the form

$$\hat f(\mathbf{x}) = \sum_{\gamma\in S_N} \sum_{t=1}^{T} c_t\, \sigma\Big(\sum_{n=1}^{N} \sum_{m=1}^{M} w_{\gamma(n),m,t}\, x_{nm} + h_t\Big).$$

From the application perspective, since |SN | = N !, at large N this expression has pro-
hibitively many terms and is therefore impractical without subsampling of SN , which would
break the exact SN -invariance. In contrast, ansatz (2.15) is complete, fully SN -invariant and
involves only O(T1 N (M + T2 )) arithmetic operations and evaluations of σ.
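A direct NumPy transcription of ansatz (2.15) (a sketch; shapes and parameter names are ours) makes both its two-hidden-layer structure and its exact permutation invariance, which comes from the inner sum over n, explicit:

```python
import numpy as np

def sn_invariant_net(X, c, w, b, a, e, h, sigma=np.tanh):
    """Ansatz (2.15). X: (N, M), a set of N vectors in R^M (row order must not matter).
    Shapes: a: (T1, M), b, e: (T2,), w: (T2, T1), c, h: (T1,)."""
    proj = X @ a.T                                            # (N, T1): sum_m a_tm x_nm
    inner = sigma(b[None, None, :] * proj[:, :, None] + e)    # (N, T1, T2)
    pooled = inner.sum(axis=0)                                # (T1, T2): invariant in n
    return c @ sigma((w.T * pooled).sum(axis=1) + h)          # outer layer over t

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
params = (rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=5),
          rng.normal(size=(4, 3)), rng.normal(size=5), rng.normal(size=4))
assert np.isclose(sn_invariant_net(X, *params),
                  sn_invariant_net(X[rng.permutation(6)], *params))
```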

3 Translations and deep convolutional networks
Convolutional neural networks (convnets, le Cun [1989]) play a key role in many modern ap-
plications of deep learning. Such networks operate on input data having grid-like structure
(usually, spatial or temporal) and consist of multiple stacked convolutional layers trans-
forming initial object description into increasingly complex features necessary to recognize
complex patterns in the data. The shape of earlier layers in the network mimics the shape
of input data, but later layers gradually become “thinner” geometrically while acquiring
“thicker” feature dimensions. We refer the reader to deep learning literature for details on
these networks, e.g. see Chapter 9 in Goodfellow et al. [2016] for an introduction.
There are several important concepts associated with convolutional networks, in partic-
ular weight sharing (which ensures approximate translation equivariance of the layers with
respect to grid shifts); locality of the layer operation; and pooling. Locality means that the
layer output at a certain geometric point of the domain depends only on a small neigh-
borhood of this point. Pooling is a grid subsampling that helps reshape the data flow by
removing excessive spatial detalization. Practical usefulness of convnets stems from the
interplay between these various elements of convnet design.
From the perspective of the main topic of the present work – group invariant/equivariant
networks – we are mostly interested in invariance/equivariance of convnets with respect to
Lie groups such as the group of translations or the group of rigid motions (to be considered
in Section 4), and we would like to establish relevant universal approximation theorems.
However, we first point out some serious difficulties that one faces when trying to formulate
and prove such results.
Lack of symmetry in finite computational models. Practically used convnets are
finite models; in particular they operate on discretized and bounded domains that do not
possess the full symmetry of the spaces Rd . While the translational symmetry is partially
preserved by discretization to a regular grid, and the group Rd can be in a sense approximated
by the groups (λZ)d or (λZn )d , one cannot reconstruct, for example, the rotational symmetry
in a similar way. If a group Γ is compact, then, as discussed in Section 2, we can still
obtain finite and fully Γ-invariant/equivariant computational models by considering finite-
dimensional representations of Γ, but this is not the case with noncompact groups such as
Rd . Therefore, in the case of the group Rd (and the group of rigid planar motions considered
later in Section 4), we will need to prove the desired results on invariance/eqiuvariance and
completeness of convnets only in the limit of infinitely large domain and infinitesimal grid
spacing.
Erosion of translation equivariance by pooling. Pooling reduces the translational
symmetry of the convnet model. For example, if a few first layers of the network define a
map equivariant with respect to the group (λZ)2 with some spacing λ, then after pooling
with stride m the result will only be equivariant with respect to the subgroup (mλZ)2 . (We
remark in this regard that in practical applications, weight sharing and accordingly trans-
lation equivariance are usually only important for earlier layers of convolutional networks.)
Therefore, we will consider separately the cases of convnets without or with pooling; the
Rd –equivariance will only apply in the former case.

In view of the above difficulties, in this section we will give several versions of the universal
approximation theorem for convnets, with different treatments of these issues.
In Section 3.1 we prove a universal approximation theorem for a single non-local convo-
lutional layer on a finite discrete grid with periodic boundary conditions (Proposition 3.1).
This basic result is a straightforward consequence of the general Proposition 2.2 when applied
to finite abelian groups.
In Section 3.2 we prove the main result of Section 3, Theorem 3.1. This theorem extends
Proposition 3.1 in several important ways. First, we will consider continuum signals, i.e.
assume that the approximated map is defined on functions on Rn rather than on functions
on a discrete grid. This extension will later allow us to rigorously formulate a universal
approximation theorem for rotations and euclidean motions in Section 4. Second, we will
consider stacked convolutional layers and assume each layer to act locally (as in convnets
actually used in applications). However, the setting of Theorem 3.1 will not involve pooling,
since, as remarked above, pooling destroys the translation equivariance of the model.
In Section 3.3 we prove Theorem 3.2, relevant for convnets most commonly used in
practice. Compared to the setting of Section 3.2, this computational model will be spatially
bounded, will include pooling, and will not assume translation invariance of the approximated
map.

3.1 Finite abelian groups and single convolutional layers


We consider a group
Γ = Zn1 × · · · × Znν , (3.1)
where Zn = Z/(nZ) is the cyclic group of order n. Note that the group Γ is abelian and
conversely, by the fundamental theorem of finite abelian groups, any such group can be
represented in the form (3.1).
We consider the “input” module V = RΓ ⊗ RdV and the “output” module U = RΓ ⊗ RdU ,
with some finite dimensions dV , dU and with the natural representation of Γ:

Rγ (eθ ⊗ v) = eθ+γ ⊗ v, γ, θ ∈ Γ, v ∈ RdV or RdU .

We will denote elements of V, U by boldface characters Φ and interpret them as dV - or dU -


component signals defined on the set Γ. For example, in the context of 2D image processing
we have ν = 2 and the group Γ = Zn1 × Zn2 corresponds to a discretized rectangular image
with periodic boundary conditions, where n1 , n2 are the geometric sizes of the image while
dV and dU are the numbers of input and output features, respectively (in particular, if the
input is a usual RGB image, then dV = 3).
Denote by Φθk the coefficients in the expansion of a vector Φ from V or U over the standard product bases in these spaces:

$$\mathbf{\Phi} = \sum_{\theta\in\Gamma} \sum_{k=1}^{d_V \text{ or } d_U} \Phi_{\theta k}\, \mathbf{e}_\theta \otimes \mathbf{e}_k. \qquad (3.2)$$

We describe now a complete equivariant ansatz for approximating Γ-equivariant maps
f : V → U . Thanks to decomposition (2.4), we may assume without loss that dU = 1. By
(3.2), any map f : V → U is then specified by the coefficients f (Φ)θ (≡ f (Φ)θ,1 ) ∈ R as Φ
runs over V and θ runs over Γ.
Proposition 3.1. Any continuous Γ-equivariant map f : V → U can be approximated by Γ-equivariant maps $\hat f : V \to U$ of the form

$$\hat f(\mathbf{\Phi})_\gamma = \sum_{n=1}^{N} c_n\, \sigma\Big(\sum_{\theta\in\Gamma} \sum_{k=1}^{d_V} w_{n\theta k} \Phi_{\gamma+\theta,k} + h_n\Big), \qquad (3.3)$$

where $\mathbf{\Phi} = \sum_{\gamma\in\Gamma} \sum_{k=1}^{d_V} \Phi_{\gamma k}\, \mathbf{e}_\gamma \otimes \mathbf{e}_k$, N is a parameter, and $c_n, w_{n\theta k}, h_n$ are some coefficients.

Proof. We apply Proposition 2.2 with $l_n(\mathbf{\Phi}) = \sum_{\theta'\in\Gamma} \sum_{k=1}^{d_V} w'_{n\theta' k} \Phi_{\theta' k}$ and $\mathbf{y}_n = \sum_{\kappa\in\Gamma} y_{n\kappa} \mathbf{e}_\kappa$, and obtain the ansatz

$$\hat f(\mathbf{\Phi}) = \sum_{\gamma'\in\Gamma} \sum_{n=1}^{N} \sum_{\kappa\in\Gamma} y_{n\kappa}\, \sigma\Big(\sum_{\theta'\in\Gamma} \sum_{k=1}^{d_V} w'_{n\theta' k} \Phi_{\theta'-\gamma',k} + h_n\Big) \mathbf{e}_{\kappa-\gamma'} = \sum_{\kappa\in\Gamma} \sum_{n=1}^{N} y_{n\kappa}\, \mathbf{a}_{\kappa n},$$

where

$$\mathbf{a}_{\kappa n} = \sum_{\gamma'\in\Gamma} \sigma\Big(\sum_{\theta'\in\Gamma} \sum_{k=1}^{d_V} w'_{n\theta' k} \Phi_{\theta'-\gamma',k} + h_n\Big) \mathbf{e}_{\kappa-\gamma'}. \qquad (3.4)$$

By linearity of the expression on the r.h.s. of (3.3), it suffices to check that each $\mathbf{a}_{\kappa n}$ can be written in the form

$$\sum_{\gamma\in\Gamma} \sigma\Big(\sum_{\theta\in\Gamma} \sum_{k=1}^{d_V} w_{n\theta k} \Phi_{\theta+\gamma,k} + h_n\Big) \mathbf{e}_\gamma.$$

But this expression results if we make in (3.4) the substitutions γ = κ − γ ′, θ = θ′ − κ and $w_{n\theta k} = w'_{n,\theta+\kappa,k}$.
The expression (3.3) resembles the standard convolutional layer without pooling as de-
scribed, e.g., in Goodfellow et al. [2016]. Specifically, this expression can be viewed as a
linear combination of N scalar filters obtained as compositions of linear convolutions with
pointwise non-linear activations. An important difference with the standard convolutional
layers is that the convolutions in (3.3) are non-local, in the sense that the weights wnθk do not
vanish at large θ. Clearly, this non-locality is inevitable if approximation is to be performed
with just a single convolutional layer.
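A sketch of the layer (3.3) for ν = 2 and dU = 1 (NumPy; a direct, non-FFT implementation with periodic boundary via np.roll, inefficient but transparent; the function name is ours):

```python
import numpy as np

def periodic_conv_layer(Phi, w, c, h, sigma=np.tanh):
    """Layer (3.3) on Z_{n1} x Z_{n2}. Phi: (n1, n2, dV); w: (N, n1, n2, dV);
    c, h: (N,). Returns an (n1, n2) signal; exactly shift-equivariant."""
    N, n1, n2, dV = w.shape
    out = np.zeros((n1, n2))
    for n in range(N):
        # pre-activation at gamma: sum_{theta,k} w[n, theta, k] * Phi[gamma + theta, k]
        pre = sum(w[n, t1, t2, k] * np.roll(Phi[..., k], (-t1, -t2), axis=(0, 1))
                  for t1 in range(n1) for t2 in range(n2) for k in range(dV))
        out += c[n] * sigma(pre + h[n])
    return out

# Equivariance check: shifting the input shifts the output.
rng = np.random.default_rng(5)
Phi = rng.normal(size=(4, 5, 3))
w, c, h = rng.normal(size=(2, 4, 5, 3)), rng.normal(size=2), rng.normal(size=2)
assert np.allclose(np.roll(periodic_conv_layer(Phi, w, c, h), (1, 2), axis=(0, 1)),
                   periodic_conv_layer(np.roll(Phi, (1, 2), axis=(0, 1)), w, c, h))
```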
We remark that it is possible to use Proposition 2.4 to describe an alternative complete Γ-
equivariant ansatz based on polynomial invariants and equivariants. However, this approach
seems to be less efficient because it is relatively difficult to specify a small explicit set of
generating polynomials for abelian groups (see, e.g. Schmid [1991] for a number of relevant
results). Nevertheless, we will use polynomial invariants of the abelian group SO(2) in our
construction of “charge-conserving convnet” in Section 4.

20
3.2 Continuum signals and deep convnets
In this section we extend Proposition 3.1 in several ways.
First, instead of the group Zn1 × · · · × Znν we consider the group Γ = Rν . Accordingly, we will consider infinite-dimensional Rν –modules

$$V = L^2(\mathbb{R}^\nu) \otimes \mathbb{R}^{d_V} \cong L^2(\mathbb{R}^\nu, \mathbb{R}^{d_V}), \qquad U = L^2(\mathbb{R}^\nu) \otimes \mathbb{R}^{d_U} \cong L^2(\mathbb{R}^\nu, \mathbb{R}^{d_U})$$

with some finite dV , dU . Here, $L^2(\mathbb{R}^\nu, \mathbb{R}^d)$ is the Hilbert space of maps $\Phi : \mathbb{R}^\nu \to \mathbb{R}^d$ with $\int_{\mathbb{R}^\nu} |\Phi(\gamma)|^2 d\gamma < \infty$, equipped with the standard scalar product $\langle\Phi,\Psi\rangle = \int_{\mathbb{R}^\nu} \Phi(\gamma)\cdot\Psi(\gamma)\, d\gamma$, where Φ(γ) · Ψ(γ) denotes the scalar product of Φ(γ) and Ψ(γ) in Rd . The group Rν is naturally represented on V, U by

$$R_\gamma \Phi(\theta) = \Phi(\theta - \gamma), \quad \Phi \in V, \ \gamma, \theta \in \mathbb{R}^\nu. \qquad (3.5)$$

Compared to the setting of the previous subsection, we interpret the modules V, U as carrying
now “infinitely extended” and “infinitely detailed” dV - or dU -component signals. We will be
interested in approximating arbitrary Rν –equivariant continuous maps f : V → U .
The second extension is that we will perform this approximation using stacked convolu-
tional layers with local action. Our approximation will be a finite computational model, and
to define it we first need to apply a discretization and a spatial cutoff to vectors from V and
U.
Let us first describe the discretization. For any grid spacing λ > 0, let Vλ be the subspace
in V formed by signals Φ : Rν → RdV constant on all cubes
$$Q^{(\lambda)}_k = \prod_{s=1}^{\nu}\Big[\big(k_s-\tfrac12\big)\lambda,\ \big(k_s+\tfrac12\big)\lambda\Big],$$

where k = (k1 , . . . , kν ) ∈ Zν . Let Pλ be the orthogonal projector onto Vλ in V :


$$P_\lambda\Phi(\gamma) = \frac{1}{\lambda^\nu}\int_{Q^{(\lambda)}_k}\Phi(\theta)\,d\theta,\qquad\text{where } Q^{(\lambda)}_k\ni\gamma. \tag{3.6}$$

A function $\Phi\in V_\lambda$ can naturally be viewed as a function on the lattice $(\lambda\mathbb{Z})^\nu$, so that we can also view $V_\lambda$ as a Hilbert space

$$V_\lambda\cong L^2((\lambda\mathbb{Z})^\nu,\mathbb{R}^{d_V}),$$

with the scalar product $\langle\Phi,\Psi\rangle = \lambda^\nu\sum_{\gamma\in(\lambda\mathbb{Z})^\nu}\Phi(\gamma)\cdot\Psi(\gamma)$. We define the subspaces $U_\lambda\subset U$ similarly to the subspaces $V_\lambda\subset V$.
Next, we define the spatial cutoff. For an integer $L\ge 0$ we denote by $Z_L$ the size-$(2L+1)$ cubic subset of the grid $\mathbb{Z}^\nu$:
$$Z_L = \{k\in\mathbb{Z}^\nu \mid \|k\|_\infty\le L\}, \tag{3.7}$$

where $k=(k_1,\ldots,k_\nu)\in\mathbb{Z}^\nu$ and $\|k\|_\infty = \max_{n=1,\ldots,\nu}|k_n|$. Let $\lfloor\cdot\rfloor$ denote the standard floor function. For any $\Lambda\ge 0$ (referred to as the spatial range or cutoff) we define the subspace $V_{\lambda,\Lambda}\subset V_\lambda$ by

$$V_{\lambda,\Lambda} = \{\Phi:(\lambda\mathbb{Z})^\nu\to\mathbb{R}^{d_V}\mid \Phi(\lambda k)=0 \text{ if } k\notin Z_{\lfloor\Lambda/\lambda\rfloor}\}\cong\{\Phi:\lambda Z_{\lfloor\Lambda/\lambda\rfloor}\to\mathbb{R}^{d_V}\} = L^2(\lambda Z_{\lfloor\Lambda/\lambda\rfloor},\mathbb{R}^{d_V}). \tag{3.8}$$

Clearly, $\dim V_{\lambda,\Lambda} = (2\lfloor\Lambda/\lambda\rfloor+1)^\nu d_V$. The subspaces $U_{\lambda,\Lambda}\subset U_\lambda$ are defined in a similar fashion. We will denote by $P_{\lambda,\Lambda}$ the linear operators orthogonally projecting $V$ to $V_{\lambda,\Lambda}$ or $U$ to $U_{\lambda,\Lambda}$.
In the following, we will assume that the convolutional layers have a finite receptive field
ZLrf – a set of the form (3.7) with some fixed Lrf > 0.
We can now describe our model of stacked convnets that will be used to approximate
maps f : V → U (see Fig.1). Namely, our approximation will be a composition of the form
$$\hat f: V\xrightarrow{P_{\lambda,\Lambda+(T-1)\lambda L_{\mathrm{rf}}}} V_{\lambda,\Lambda+(T-1)\lambda L_{\mathrm{rf}}}(\equiv W_1)\xrightarrow{\hat f_1} W_2\xrightarrow{\hat f_2}\cdots\xrightarrow{\hat f_T} W_{T+1}(\equiv U_{\lambda,\Lambda}). \tag{3.9}$$

Here, the first step Pλ,Λ+(T −1)λLrf is an orthogonal finite-dimensional projection implementing
the initial discretization and spatial cutoff of the signal. The maps fbt are convolutional layers
connecting intermediate spaces
$$W_t = \begin{cases}\{\Phi:\lambda Z_{\lfloor\Lambda/\lambda\rfloor+(T-t)L_{\mathrm{rf}}}\to\mathbb{R}^{d_t}\}, & t\le T,\\ \{\Phi:\lambda Z_{\lfloor\Lambda/\lambda\rfloor}\to\mathbb{R}^{d_t}\}, & t=T+1,\end{cases} \tag{3.10}$$

with some feature dimensions dt such that d1 = dV and dT +1 = dU . The first intermediate
space W1 is identified with the space Vλ,Λ+(T −1)λLrf (the image of the projector Pλ,Λ+(T −1)λLrf
applied to V ), while the end space WT +1 is identified with Uλ,Λ (the respective discretization
and cutoff of U ).
The convolutional layers are defined as follows. Let $(\Phi_{\gamma n})_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor+(T-t)L_{\mathrm{rf}}},\,n=1,\ldots,d_t}$ be the coefficients in the expansion of $\Phi\in W_t$ over the standard basis in $W_t$, as in (3.2). Then, for $t<T$ we define $\hat f_t$ using the conventional "linear convolution followed by nonlinear activation" formula,

$$\hat f_t(\Phi)_{\gamma n} = \sigma\Big(\sum_{\theta\in Z_{L_{\mathrm{rf}}}}\sum_{k=1}^{d_t} w^{(t)}_{n\theta k}\Phi_{\gamma+\theta,k} + h^{(t)}_n\Big),\qquad \gamma\in Z_{\lfloor\Lambda/\lambda\rfloor+(T-t-1)L_{\mathrm{rf}}},\ n=1,\ldots,d_{t+1}, \tag{3.11}$$

while in the last layer (t = T ) we drop nonlinearities and only form a linear combination of
values at the same point of the grid:
$$\hat f_T(\Phi)_{\gamma n} = \sum_{k=1}^{d_T} w^{(T)}_{nk}\Phi_{\gamma k} + h^{(T)}_n,\qquad \gamma\in Z_{\lfloor\Lambda/\lambda\rfloor},\ n=1,\ldots,d_U. \tag{3.12}$$

Figure 1: A one-dimensional (ν = 1) basic convnet with the receptive field parameter Lrf = 1.
The dots show feature spaces Rdt associated with particular points of the grid λZ.

Note that the grid size $\lfloor\Lambda/\lambda\rfloor+(T-t)L_{\mathrm{rf}}$ associated with the space $W_t$ is consistent with the rule (3.11), which evaluates the new signal $\hat f_t(\Phi)$ at each node of the grid as a function of the signal $\Phi$ in the $L_{\mathrm{rf}}$-neighborhood of that node (so that the domain $\lambda Z_{\lfloor\Lambda/\lambda\rfloor+(T-t)L_{\mathrm{rf}}}$ "shrinks" slightly as $t$ grows).

Note that we can interpret the map $\hat f$ as a map between $V$ and $U$, since $U_{\lambda,\Lambda}\subset U$.

Definition 3.1. A basic convnet is a map $\hat f: V\to U$ defined by (3.9), (3.11), (3.12), and characterized by parameters $\lambda, \Lambda, L_{\mathrm{rf}}, T, d_1,\ldots,d_{T+1}$ and coefficients $w^{(t)}_{n\theta k}$ and $h^{(t)}_n$.
Note that, defined in this way, a basic convnet is a finite computational model in the following sense: while being a map between the infinite-dimensional spaces $V$ and $U$, all the steps in $\hat f$ except the initial discretization and cutoff involve only finitely many arithmetic operations and evaluations of the activation function.
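For concreteness, the following NumPy sketch (ours; toy sizes and random untrained weights) assembles the forward pass (3.9), (3.11), (3.12) of a basic convnet for ν = 1:

```python
import numpy as np

def conv_layer(phi, w, h, sigma=np.tanh):
    """One layer of Eq. (3.11): local linear convolution + pointwise activation.
    phi: (n, d_in) signal on a 1D grid; w: (d_out, 2*L_rf + 1, d_in); h: (d_out,).
    The output grid loses L_rf points on each side (the "shrinking" domain)."""
    d_out, width, d_in = w.shape
    n_out = phi.shape[0] - (width - 1)
    out = np.empty((n_out, d_out))
    for g in range(n_out):
        patch = phi[g : g + width]                       # the L_rf-neighborhood
        out[g] = sigma(np.tensordot(w, patch, axes=[[1, 2], [0, 1]]) + h)
    return out

def basic_convnet(phi, ws, hs, w_last, h_last):
    """Composition (3.9): nonlinear layers (3.11), then the pointwise linear layer (3.12)."""
    for w, h in zip(ws, hs):
        phi = conv_layer(phi, w, h)
    return phi @ w_last.T + h_last                       # Eq. (3.12)

rng = np.random.default_rng(1)
T, L_rf, dims = 3, 1, [2, 4, 4, 1]                       # feature dimensions d_1..d_{T+1}
n_in = 2 * (5 + (T - 1) * L_rf) + 1                      # padded input grid, floor(Lambda/lambda) = 5
phi = rng.normal(size=(n_in, dims[0]))
ws = [rng.normal(size=(dims[t + 1], 2 * L_rf + 1, dims[t])) for t in range(T - 1)]
hs = [rng.normal(size=dims[t + 1]) for t in range(T - 1)]
out = basic_convnet(phi, ws, hs, rng.normal(size=(dims[T], dims[T - 1])), rng.normal(size=dims[T]))
print(out.shape)                                         # (11, 1): output grid x d_U
```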
We aim to prove an analog of Theorem 2.1, stating that any continuous Rν -equivariant
map f : V → U can be approximated by basic convnets in the topology of uniform conver-
gence on compact sets. However, there are some important caveats due to the fact that the
space V is now infinite-dimensional.
First, in contrast to the case of finite-dimensional spaces, balls in L2 (Rν , RdV ) are not
compact. The well-known general criterion states that in a complete metric space, and in
particular in $V = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$, a set is compact iff it is closed and totally bounded, i.e., for any $\epsilon>0$ it can be covered by finitely many $\epsilon$-balls.
The second point (related to the first) is that a finite-dimensional space is hemicompact,
i.e., there is a sequence of compact sets such that any other compact set is contained in one
of them. As a result, the space of maps f : Rn → Rm is first-countable with respect to the
topology of compact convergence, i.e. each point has a countable base of neighborhoods,
and a point f is a limit point of a set S if and only if there is a sequence of points in
S converging to f . In a general topological space, however, a limit point of a set S may

not be representable as the limit of a sequence of points from S. In particular, the space
L2 (Rν , RdV ) is not hemicompact and the space of maps f : L2 (Rν , RdV ) → L2 (Rν , RdU ) is not
first countable with respect to the topology of compact convergence, so that, in particular,
we must distinguish between the notions of limit points of the set of convnets and the limits
of sequences of convnets. We refer the reader, e.g., to the book Munkres [2000] for a general
discussion of this and other topological questions and in particular to §46 for a discussion of
compact convergence.
When defining a limiting map, we would like to require the convnets to increase their detalization $\frac{1}{\lambda}$ and range $\Lambda$. At the same time, we will regard the receptive field and its
range parameter Lrf as arbitrary but fixed (the current common practice in applications
is to use small values such as Lrf = 1 regardless of the size of the network; see, e.g., the
architecture of residual networks He et al. [2016] providing state-of-the-art performance on
image recognition tasks).
With all these considerations in mind, we introduce the following definition of a limit
point of convnets.

Definition 3.2. With $V = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$ and $U = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$, we say that a map $f: V\to U$ is a limit point of basic convnets if for any $L_{\mathrm{rf}}$, any compact set $K\subset V$, and any $\epsilon>0$, $\lambda_0>0$ and $\Lambda_0>0$ there exists a basic convnet $\hat f$ with the receptive field parameter $L_{\mathrm{rf}}$, spacing $\lambda\le\lambda_0$ and range $\Lambda\ge\Lambda_0$ such that $\sup_{\Phi\in K}\|\hat f(\Phi)-f(\Phi)\|<\epsilon$.

We can state now the main result of this section.

Theorem 3.1. A map f : V → U is a limit point of basic convnets if and only if f is


Rν –equivariant and continuous in the norm topology.

Before giving the proof of the theorem, we recall the useful notion of strong convergence of
linear operators on Hilbert spaces. Namely, if An is a sequence of bounded linear operators on
a Hilbert space and A is another such operator, then we say that the sequence An converges
strongly to A if An Φ converges to AΦ for any vector Φ from this Hilbert space. More
generally, strong convergence can be defined, by the same reduction, for any family {Aα } of
linear operators once the convergence of the family of vectors {Aα Φ} is specified.
An example of a strongly convergent family is the family of discretizing projectors $P_\lambda$ defined in (3.6). These projectors converge strongly to the identity as the grid spacing tends to 0: $P_\lambda\Phi\xrightarrow{\lambda\to 0}\Phi$. Another example is the family of projectors $P_{\lambda,\Lambda}$ projecting $V$ onto the subspace $V_{\lambda,\Lambda}$ of discretized and cut-off signals defined in (3.8). It is easy to see that $P_{\lambda,\Lambda}$ converge strongly to the identity as the spacing tends to 0 and the cutoff is lifted, i.e., as $\lambda\to 0$ and $\Lambda\to\infty$. Finally, our representations $R_\gamma$ defined in (3.5) are strongly continuous in the sense that $R_{\gamma'}$ converges strongly to $R_\gamma$ as $\gamma'\to\gamma$.
A useful standard tool in proving strong convergence is the continuity argument: if the family $\{A_\alpha\}$ is uniformly bounded, then the convergence $A_\alpha\Phi\to A\Phi$ holds for all vectors $\Phi$ from the Hilbert space once it holds for a dense subset of vectors. This follows by approximating any $\Phi$ with $\Psi$'s from the dense subset and applying the inequality $\|A_\alpha\Phi - A\Phi\|\le\|A_\alpha\Psi - A\Psi\| + (\|A_\alpha\|+\|A\|)\|\Phi-\Psi\|$. In the sequel, we will consider strong

convergence only in the settings where Aα are orthogonal projectors or norm-preserving
operators, so the continuity argument will be applicable.
Proof of Theorem 3.1.

Necessity (a limit point of basic convnets is Rν –equivariant and continuous).


The continuity of f follows by a standard argument from the uniform convergence on
compact sets and the continuity of convnets (see Theorem 46.5 in Munkres [2000]).
Let us prove the Rν –equivariance of f , i.e.

f (Rγ Φ) = Rγ f (Φ). (3.13)

Let DM = [−M, M ]ν ⊂ Rν with some M > 0, and PDM be the orthogonal projector in U
onto the subspace of signals supported on the set DM . Then PDM converges strongly to the
identity as M → +∞. Since Rγ is a bounded linear operator, (3.13) will follow if we prove
that for any $M$

$$P_{D_M}f(R_\gamma\Phi) = R_\gamma P_{D_M}f(\Phi). \tag{3.14}$$
Let $\epsilon>0$. Let $\gamma_\lambda\in(\lambda\mathbb{Z})^\nu$ be the nearest point to $\gamma\in\mathbb{R}^\nu$ on the grid $(\lambda\mathbb{Z})^\nu$. Then, since $R_{\gamma_\lambda}$ converges strongly to $R_\gamma$ as $\lambda\to 0$, there exists $\lambda_0$ such that for any $\lambda<\lambda_0$

$$\|R_{\gamma_\lambda}P_{D_M}f(\Phi) - R_\gamma P_{D_M}f(\Phi)\|\le\epsilon, \tag{3.15}$$

and

$$\|f(R_\gamma\Phi) - f(R_{\gamma_\lambda}\Phi)\|\le\epsilon, \tag{3.16}$$

where we have also used the already proven continuity of $f$.
Observe that the discretization/cutoff projectors $P_{\lambda,M}$ converge strongly to $P_{D_M}$ as $\lambda\to 0$, hence we can ensure that for any $\lambda<\lambda_0$ we also have

$$\|P_{D_M}f(R_\gamma\Phi) - P_{\lambda,M}f(R_\gamma\Phi)\|\le\epsilon,\qquad \|P_{\lambda,M}f(\Phi) - P_{D_M}f(\Phi)\|\le\epsilon. \tag{3.17}$$
Next, observe that basic convnets are partially translationally equivariant by our definition, in the sense that if the cutoff parameter $\Lambda$ of the convnet is sufficiently large then

$$P_{\lambda,M}\hat f(R_{\gamma_\lambda}\Phi) = R_{\gamma_\lambda}P_{\lambda,M}\hat f(\Phi). \tag{3.18}$$

Indeed, this identity holds as long as both sets $\lambda Z_{\lfloor M/\lambda\rfloor}$ and $\lambda Z_{\lfloor M/\lambda\rfloor}-\gamma_\lambda$ are subsets of $\lambda Z_{\lfloor\Lambda/\lambda\rfloor}$ (the domain where the convnet's output is defined, see (3.10)). This condition is satisfied if we require that $\Lambda>\Lambda_0$ with $\Lambda_0 = M + \lambda(1+\|\gamma\|_\infty)$.
Now, take the compact set $K = \{R_\theta\Phi \mid \theta\in N\}$, where $N\subset\mathbb{R}^\nu$ is some compact set including $0$ and all points $\gamma_\lambda$ for $\lambda<\lambda_0$. Then, by our definition of a limit point of basic convnets, there is a convnet $\hat f$ with $\lambda<\lambda_0$ and $\Lambda>\Lambda_0$ such that for all $\theta\in N$ (and in particular for $\theta=0$ or $\theta=\gamma_\lambda$)

$$\|f(R_\theta\Phi) - \hat f(R_\theta\Phi)\|<\epsilon. \tag{3.19}$$

We can now write a bound for the difference of the two sides of (3.14):

$$\begin{aligned}
\|P_{D_M}f(R_\gamma\Phi) - R_\gamma P_{D_M}f(\Phi)\|
&\le \|P_{D_M}f(R_\gamma\Phi) - P_{\lambda,M}f(R_\gamma\Phi)\| + \|P_{\lambda,M}f(R_\gamma\Phi) - P_{\lambda,M}f(R_{\gamma_\lambda}\Phi)\|\\
&\quad + \|P_{\lambda,M}f(R_{\gamma_\lambda}\Phi) - P_{\lambda,M}\hat f(R_{\gamma_\lambda}\Phi)\| + \|P_{\lambda,M}\hat f(R_{\gamma_\lambda}\Phi) - R_{\gamma_\lambda}P_{\lambda,M}\hat f(\Phi)\|\\
&\quad + \|R_{\gamma_\lambda}P_{\lambda,M}\hat f(\Phi) - R_{\gamma_\lambda}P_{\lambda,M}f(\Phi)\| + \|R_{\gamma_\lambda}P_{\lambda,M}f(\Phi) - R_{\gamma_\lambda}P_{D_M}f(\Phi)\|\\
&\quad + \|R_{\gamma_\lambda}P_{D_M}f(\Phi) - R_\gamma P_{D_M}f(\Phi)\|\\
&\le \|P_{D_M}f(R_\gamma\Phi) - P_{\lambda,M}f(R_\gamma\Phi)\| + \|f(R_\gamma\Phi) - f(R_{\gamma_\lambda}\Phi)\|\\
&\quad + \|f(R_{\gamma_\lambda}\Phi) - \hat f(R_{\gamma_\lambda}\Phi)\| + \|\hat f(\Phi) - f(\Phi)\|\\
&\quad + \|P_{\lambda,M}f(\Phi) - P_{D_M}f(\Phi)\| + \|R_{\gamma_\lambda}P_{D_M}f(\Phi) - R_\gamma P_{D_M}f(\Phi)\|\\
&\le 6\epsilon.
\end{aligned}$$

Here in the first step we split the difference into several parts, in the second step we used the identity (3.18) and the fact that $P_{\lambda,M}, R_{\gamma_\lambda}$ are linear operators with operator norm 1, and in the third step we applied the inequalities (3.15)-(3.17) and (3.19). Since $\epsilon$ was arbitrary, we have proved (3.14).

Sufficiency (an $\mathbb{R}^\nu$-equivariant and continuous map is a limit point of basic convnets). We start by proving a key lemma on the approximation capability of basic convnets in the special case when they have the degenerate output range, $\Lambda=0$. In this case, by (3.9), the output space $W_{T+1} = U_{\lambda,0}\cong\mathbb{R}^{d_U}$, and the first auxiliary space $W_1 = V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}\subset V$.

Lemma 3.1. Let $\lambda, T$ be fixed and $\Lambda=0$. Then any continuous map $f: V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}\to U_{\lambda,0}$ can be approximated by basic convnets having spacing $\lambda$, depth $T$, and range $\Lambda=0$.
Note that this is essentially a finite-dimensional approximation result, in the sense that
the input space Vλ,(T −1)λLrf is finite-dimensional and fixed. The approximation is achieved
by choosing sufficiently large feature dimensions dt and suitable weights in the intermediate
layers.
Proof. The idea of the proof is to divide the operation of the convnet into two stages. The
first stage is implemented by the first T − 2 layers and consists in approximate “contraction”
of the input vectors, while the second stage, implemented by the remaining two layers,
performs the actual approximation.
The contraction stage is required because the components of the input signal $\Phi_{\mathrm{in}}\in V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}\cong L^2(\lambda Z_{(T-1)L_{\mathrm{rf}}},\mathbb{R}^{d_V})$ are distributed over the large spatial domain $\lambda Z_{(T-1)L_{\mathrm{rf}}}$. In this stage we will map the input signal to the spatially localized space $W_{T-1}\cong L^2(\lambda Z_{L_{\mathrm{rf}}},\mathbb{R}^{d_{T-1}})$ so as to approximately preserve the information in the signal.
Regarding the second stage, observe that the last two layers of the convnet (starting from
WT −1 ) act on signals in WT −1 by an expression analogous to the one-hidden-layer network

from the basic universal approximation theorem (Theorem 2.1):

$$\big(\hat f_T\circ\hat f_{T-1}(\Phi)\big)_n = \sum_{k=1}^{d_T} w^{(T)}_{nk}\,\sigma\Big(\sum_{\theta\in Z_{L_{\mathrm{rf}}}}\sum_{m=1}^{d_{T-1}} w^{(T-1)}_{k\theta m}\Phi_{\theta m} + h^{(T-1)}_k\Big) + h^{(T)}_n. \tag{3.20}$$

This expression involves all components of $\Phi\in W_{T-1}$, and so we can conclude by Theorem 2.1 that by choosing a sufficiently large dimension $d_T$ and appropriate weights we can approximate an arbitrary continuous map from $W_{T-1}$ to $U_{\lambda,0}$.
Now, given a continuous map $f: V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}\to U_{\lambda,0}$, consider the map $g = f\circ I\circ P: W_{T-1}\to U_{\lambda,0}$, where $I$ is some linear isometric map from a subspace $W'_{T-1}\subset W_{T-1}$ to $V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}$, and $P$ is the projection in $W_{T-1}$ to $W'_{T-1}$. Such an isometric $I$ exists if $\dim W_{T-1}\ge\dim V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}$, which we can assume w.l.o.g. by choosing sufficiently large $d_{T-1}$. Then the map $g$ is continuous, and the previous argument shows that we can approximate $g$ using the second stage of the convnet. Therefore, we can also approximate the given map $f = g\circ I^{-1}$ by the whole convnet if we manage to exactly implement or approximate the isometry $I^{-1}$ in the contraction stage.
Implementing such an isometry would be straightforward if the first $T-2$ layers had no activation function (i.e., if $\sigma$ were the identity function in the nonlinear layers (3.11)). In this case for all $t=2,3,\ldots,T-1$ we can choose the feature dimensions $d_t = |Z_{L_{\mathrm{rf}}}|d_{t-1} = (2L_{\mathrm{rf}}+1)^{\nu(t-1)}d_V$ and set $h^{(t)}_n=0$ and

$$w^{(t)}_{n\theta k} = \begin{cases}1, & n=\psi_t(\theta,k),\\ 0, & \text{otherwise},\end{cases}$$

where ψt is some bijection between ZLrf × {1, . . . , dt } and {1, . . . , dt+1 }. In this way, each
component of the network input vector Φin gets copied, layer by layer, to subsequent layers
and eventually ends up among the components of the resulting vector in WT −1 (with some
repetitions due to multiple possible trajectories of copying).
However, since $\sigma$ is not an identity, copying needs to be approximated. Consider the first layer, $\hat f_1$. For each $\gamma\in Z_{L_{\mathrm{rf}}}$ and each $s\in\{1,\ldots,d_1\}$, consider the corresponding coordinate map

$$g_{\gamma s}: L^2(\lambda Z_{L_{\mathrm{rf}}},\mathbb{R}^{d_1})\to\mathbb{R},\qquad g_{\gamma s}:\Phi\mapsto\Phi_{\gamma s}.$$
By Theorem 2.1, the map gγs can be approximated with arbitrary accuracy on any compact
set in L2 (λZLrf , Rd1 ) by maps of the form
$$\Phi\mapsto\sum_{m=1}^{N} c_{\gamma s m}\,\sigma\Big(\sum_{\theta\in Z_{L_{\mathrm{rf}}}}\sum_{k=1}^{d_1} w_{\gamma s m\theta k}\Phi_{\theta k} + h_{\gamma s m}\Big), \tag{3.21}$$

where we may assume without loss of generality that $N$ is the same for all $\gamma, s$. We then set the second feature dimension $d_2 = N|Z_{L_{\mathrm{rf}}}|d_1 = N(2L_{\mathrm{rf}}+1)^\nu d_V$ and assign the weights $w_{\gamma s m\theta k}$ and $h_{\gamma s m}$ in (3.21) to be the weights $w^{(1)}_{n\theta k}$ and $h^{(1)}_n$ of the first convnet layer, where the index $n$ somehow enumerates the triplets $(\gamma,s,m)$. Defined in this way, the first convolutional layer $\hat f_1$ only partly reproduces the copy operation, since this layer does not include the
linear weighting corresponding to the external summation over m in (3.21). However, we
can include this weighting into the next layer, since this operation involves only values at
the same spatial location γ ∈ Zν , and prepending this operation to the convolutional layer
(3.21) does not change the functional form of the layer.
By repeating this argument for the subsequent layers $t=2,3,\ldots,T-2$, we can make the sequence of the first $T-2$ layers copy, arbitrarily accurately, all the components of the input vector $\Phi_{\mathrm{in}}$ into a vector $\Phi\in W_{T-1}$, up to some additional linear transformations that need to be included in the $(T-1)$'th layer (again, this is legitimate since prepending a linear operation does not change the functional form of the $(T-1)$'th layer). Thus, we can approximate $f = g\circ I^{-1}$ by arranging the first stage of the convnet to approximate $I^{-1}$ and the second to approximate $g$.
Returning to the proof of sufficiency, let $f: V\to U$ be an $\mathbb{R}^\nu$-equivariant continuous map that we need to approximate with accuracy $\epsilon$ on a compact set $K\subset V$ by a convnet with $\lambda<\lambda_0$ and $\Lambda>\Lambda_0$. For any $\lambda$ and $\Lambda$, define the map

$$f_{\lambda,\Lambda} = P_{\lambda,\Lambda}\circ f\circ P_\lambda.$$

Observe that we can find $\lambda<\lambda_0$ and $\Lambda>\Lambda_0$ such that

$$\sup_{\Phi\in K}\|f_{\lambda,\Lambda}(\Phi) - f(\Phi)\|\le\frac{\epsilon}{3}. \tag{3.22}$$

Indeed, this can be proved as follows. Denote by $B_\delta(\Phi)$ the radius-$\delta$ ball centered at $\Phi$. By compactness of $K$ and continuity of $f$ we can find finitely many signals $\Phi_n\in V$, $n=1,\ldots,N$, and some $\delta>0$ so that, first, $K\subset\cup_n B_{\delta/2}(\Phi_n)$, and second,

$$\|f(\Phi) - f(\Phi_n)\|\le\frac{\epsilon}{9},\qquad \Phi\in B_\delta(\Phi_n). \tag{3.23}$$
For any $\Phi\in K$, pick $n$ such that $\Phi\in B_{\delta/2}(\Phi_n)$; then

$$\begin{aligned}\|f_{\lambda,\Lambda}(\Phi) - f(\Phi)\| &\le \|P_{\lambda,\Lambda}f(P_\lambda\Phi) - P_{\lambda,\Lambda}f(\Phi_n)\| + \|P_{\lambda,\Lambda}f(\Phi_n) - f(\Phi_n)\| + \|f(\Phi_n) - f(\Phi)\|\\ &\le \|f(P_\lambda\Phi) - f(\Phi_n)\| + \|P_{\lambda,\Lambda}f(\Phi_n) - f(\Phi_n)\| + \frac{\epsilon}{9}. \end{aligned} \tag{3.24}$$
Since $\Phi\in B_{\delta/2}(\Phi_n)$, if $\lambda$ is sufficiently small then $P_\lambda\Phi\in B_\delta(\Phi_n)$ (by the strong convergence of $P_\lambda$ to the identity) and hence $\|f(P_\lambda\Phi) - f(\Phi_n)\|<\frac{\epsilon}{9}$, again by (3.23). Also, we can choose sufficiently small $\lambda$ and then sufficiently large $\Lambda$ so that $\|P_{\lambda,\Lambda}f(\Phi_n) - f(\Phi_n)\|<\frac{\epsilon}{9}$. Using these inequalities in (3.24), we obtain (3.22).
Having thus chosen $\lambda$ and $\Lambda$, observe that, by translation equivariance of $f$, the map $f_{\lambda,\Lambda}$ can be written as

$$f_{\lambda,\Lambda}(\Phi) = \sum_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor}} R_{\lambda\gamma}P_{\lambda,0}f(P_\lambda R_{-\lambda\gamma}\Phi),$$

where $P_{\lambda,0}$ is the projector $P_{\lambda,\Lambda}$ in the degenerate case $\Lambda=0$. Consider the map

$$f_{\lambda,\Lambda,T}(\Phi) = \sum_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor}} R_{\lambda\gamma}P_{\lambda,0}f(P_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}R_{-\lambda\gamma}\Phi).$$

Then, by choosing $T$ sufficiently large, we can ensure that

$$\sup_{\Phi\in K}\|f_{\lambda,\Lambda,T}(\Phi) - f_{\lambda,\Lambda}(\Phi)\|<\frac{\epsilon}{3}. \tag{3.25}$$
Indeed, this can be proved in the same way as (3.22), by using compactness of $K$, continuity of $f$, finiteness of $Z_{\lfloor\Lambda/\lambda\rfloor}$ and the strong convergence $P_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}R_{-\lambda\gamma}\Phi\xrightarrow{T\to\infty}P_\lambda R_{-\lambda\gamma}\Phi$.
Observe that $f_{\lambda,\Lambda,T}$ can be alternatively written as

$$f_{\lambda,\Lambda,T}(\Phi) = \sum_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor}} R_{\lambda\gamma}f_{\lambda,0,T}(R_{-\lambda\gamma}\Phi), \tag{3.26}$$

where

$$f_{\lambda,0,T}(\Phi) = P_{\lambda,0}f(P_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}\Phi).$$
We can view the map $f_{\lambda,0,T}$ as a map from $V_{\lambda,(T-1)\lambda L_{\mathrm{rf}}}$ to $U_{\lambda,0}$, which makes Lemma 3.1 applicable to $f_{\lambda,0,T}$. Hence, since $\cup_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor}}R_{-\lambda\gamma}K$ is compact, we can find a convnet $\hat f_0$ with spacing $\lambda$, depth $T$ and range $\Lambda=0$ such that

$$\|\hat f_0(\Phi) - f_{\lambda,0,T}(\Phi)\|<\frac{\epsilon}{3|Z_{\lfloor\Lambda/\lambda\rfloor}|},\qquad \Phi\in\bigcup_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor}}R_{-\lambda\gamma}K. \tag{3.27}$$

Consider the convnet $\hat f_\Lambda$ differing from $\hat f_0$ only by the range parameter $\Lambda$; such a convnet can be written in terms of $\hat f_0$ in the same way as $f_{\lambda,\Lambda,T}$ is written in terms of $f_{\lambda,0,T}$:

$$\hat f_\Lambda(\Phi) = \sum_{\gamma\in Z_{\lfloor\Lambda/\lambda\rfloor}} R_{\lambda\gamma}\hat f_0(R_{-\lambda\gamma}\Phi). \tag{3.28}$$

Combining (3.26), (3.27) and (3.28), we obtain

$$\sup_{\Phi\in K}\|\hat f_\Lambda(\Phi) - f_{\lambda,\Lambda,T}(\Phi)\|<\frac{\epsilon}{3}.$$

Combining this bound with bounds (3.22) and (3.25), we obtain the desired bound

$$\sup_{\Phi\in K}\|\hat f_\Lambda(\Phi) - f(\Phi)\|<\epsilon.$$

Theorem 3.1 suggests that our definition of limit points of basic convnets provides a reasonable rigorous framework for the analysis of convergence and invariance properties of convnet-like models in the limit of continuum, infinitely extended signals. We will use this definition and theorem as templates when considering convnets with pooling in the next subsection and charge-conserving convnets in Section 4.

Figure 2: A one-dimensional (ν = 1) convnet with downsampling having stride $s=2$ and the receptive field parameter $L_{\mathrm{rf}}=2$.

3.3 Convnets with pooling


As already mentioned, pooling erodes the equivariance of models with respect to transla-
tions. Therefore, we will consider convnets with pooling as universal approximators without
assuming the approximated maps to be translationally invariant. Also, rather than consider-
ing L2 (Rν , RdU )–valued maps, we will be interested in approximating simply R–valued maps,
i.e., those of the form f : V → R, where, as in Section 3.2, V = L2 (Rν , RdV ).
While the most popular kind of pooling in applications seems to be max-pooling, we will
only consider pooling by decimation (i.e., grid downsampling), which appears to be about as
efficient in practice (see Springenberg et al. [2014]). Compared to basic convnets of Section
3.2, convnets with downsampling then have a new parameter, stride, that we denote by s.
The stride can take values s = 1, 2, . . . and determines the geometry scaling when passing
information to the next convnet layer: if the current layer operates on a grid (λZ)ν , then
the next layer will operate on the subgrid (sλZ)ν . Accordingly, the current layer only needs
to perform the operations having outputs located in this subgrid. We will assume s to be
fixed and to be the same for all layers. Moreover, we assume that

s ≤ 2Lrf + 1, (3.29)

i.e., the stride is not larger than the size of the receptive field: this ensures that information
from each node of the current grid can reach the next layer.
Like the basic convnet of Section 3.2, a convnet with downsampling can be written as a
chain:
$$\hat f: V\xrightarrow{P_{\lambda,\lambda L_{1,T}}} V_{\lambda,\lambda L_{1,T}}(\equiv W_1)\xrightarrow{\hat f_1} W_2\xrightarrow{\hat f_2}\cdots\xrightarrow{\hat f_T} W_{T+1}(\cong\mathbb{R}). \tag{3.30}$$
Here the space Vλ,λL1,T is defined as in (3.8) (with Λ = λL1,T ) and Pλ,λL1,T is the orthogonal

projector to this subspace. The intermediate spaces are defined by

$$W_t = L^2(s^{t-1}\lambda Z_{L_{t,T}},\mathbb{R}^{d_t}).$$

The range parameters Lt,T are given by


$$L_{t,T} = \begin{cases}L_{\mathrm{rf}}(1+s+s^2+\ldots+s^{T-t-1}), & t<T,\\ 0, & t=T,\ T+1.\end{cases}$$

This choice of Lt,T is equivalent to the identities

Lt,T = sLt+1,T + Lrf , t = 1, . . . , T − 1,

expressing the domain transformation under downsampling.


The feature dimensions dt can again take any values, aside from the fixed values d1 = dV
and dT +1 = 1.
As the present convnet model is R–valued, in contrast to the basic convnet of Section
3.2, it does not have a separate output cutoff parameter Λ (we essentially have Λ = 0 now).
The geometry of the input domain λZL1,T is fully determined by stride s, the receptive field
parameter Lrf , grid spacing λ, and depth T . Thus, the architecture of the model is fully
specified by these parameters and feature dimensions d2 , . . . , dT .
The layer operation formulas differ from the formulas (3.11), (3.12) by the inclusion of downsampling:

$$\hat f_t(\Phi)_{\gamma n} = \sigma\Big(\sum_{\theta\in Z_{L_{\mathrm{rf}}}}\sum_{k=1}^{d_t} w^{(t)}_{n\theta k}\Phi_{s\gamma+\theta,k} + h^{(t)}_n\Big),\qquad \gamma\in Z_{L_{t+1,T}},\ n=1,\ldots,d_{t+1}, \tag{3.31}$$

$$\hat f_T(\Phi) = \sum_{k=1}^{d_T} w^{(T)}_{k}\,\Phi_{k} + h^{(T)}. \tag{3.32}$$

Summarizing, we define convnets with downsampling as follows.


Definition 3.3. A convnet with downsampling is a map $\hat f: V\to\mathbb{R}$ defined by (3.30), (3.31), (3.32), and characterized by parameters $s, \lambda, L_{\mathrm{rf}}, T, d_1,\ldots,d_T$ and coefficients $w^{(t)}_{n\theta k}$ and $h^{(t)}_n$.
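A minimal sketch of the downsampled layer rule (3.31) (our code; the index bookkeeping mirrors the recursion $L_{t,T} = sL_{t+1,T} + L_{\mathrm{rf}}$, and all names are illustrative):

```python
import numpy as np

def downsampling_layer(phi, w, h, s=2, sigma=np.tanh):
    """One layer of Eq. (3.31): the output at gamma reads the input around s*gamma.
    phi: (2*L_in + 1, d_in) on grid points -L_in..L_in; w: (d_out, 2*L_rf + 1, d_in).
    Returns a signal on -L_out..L_out with L_in = s*L_out + L_rf."""
    d_out, width, d_in = w.shape
    L_rf, L_in = (width - 1) // 2, (phi.shape[0] - 1) // 2
    L_out = (L_in - L_rf) // s
    out = np.empty((2 * L_out + 1, d_out))
    for i, g in enumerate(range(-L_out, L_out + 1)):
        patch = phi[L_in + s * g - L_rf : L_in + s * g + L_rf + 1]
        out[i] = sigma(np.tensordot(w, patch, axes=[[1, 2], [0, 1]]) + h)
    return out
```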
Next, we give a definition of a limit point of convnets with downsampling analogous to Definition 3.2 for basic convnets. In this definition, we require that the input domain grow in detalization $\frac1\lambda$ and in the spatial range $\lambda L_{1,T}$, while the stride and receptive field are fixed.

Definition 3.4. With $V = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$, we say that a map $f: V\to\mathbb{R}$ is a limit point of convnets with downsampling if for any $s$ and $L_{\mathrm{rf}}$ subject to Eq. (3.29), any compact set $K\subset V$, any $\epsilon>0$, $\lambda_0>0$ and $\Lambda_0>0$ there exists a convnet with downsampling $\hat f$ with stride $s$, receptive field parameter $L_{\mathrm{rf}}$, depth $T$, and spacing $\lambda\le\lambda_0$ such that $\lambda L_{1,T}\ge\Lambda_0$ and $\sup_{\Phi\in K}|\hat f(\Phi)-f(\Phi)|<\epsilon$.

The analog of Theorem 3.1 then reads:
Theorem 3.2. A map f : V → R is a limit point of convnets with downsampling if and
only if f is continuous in the norm topology.
Proof. The proof is completely analogous to, and in fact simpler than, the proof of Theorem
3.1, so we only sketch it.
The necessity only involves the claim of continuity and follows again by a basic topological
argument.
In the proof of sufficiency, an analog of Lemma 3.1 holds for convnets with downsampling, since, thanks to the constraint (3.29) on the stride, all points of the input domain $\lambda Z_{L_{1,T}}$ are connected by the network architecture to the output (though there are fewer connections now due to pooling), so that our construction of approximate copy operations remains valid. To approximate $f: V\to\mathbb{R}$ on a compact set $K$, first approximate it by a map $f\circ P_{\lambda,\lambda L_{1,T}}$ with a sufficiently small $\lambda$ and large $T$, then use the lemma to approximate $f\circ P_{\lambda,\lambda L_{1,T}}$ by a convnet.

4 Charge-conserving convnets
The goal of the present section is to describe a complete convnet-like model for approximating
arbitrary continuous maps equivariant with respect to rigid planar motions. A rigid motion
of Rν is an affine transformation preserving the distances and the orientation in Rν . The
group SE(ν) of all such motions can be described as a semidirect product of the translation
group Rν with the special orthogonal group SO(ν):
$$\mathrm{SE}(\nu) = \mathbb{R}^\nu\rtimes\mathrm{SO}(\nu).$$
An element of SE(ν) can be represented as a pair (γ, θ) with γ ∈ Rν and θ ∈ SO(ν). The
group operations are given by
(γ1 , θ1 )(γ2 , θ2 ) = (γ1 + θ1 γ2 , θ1 θ2 ),
(γ, θ)−1 = (−θ−1 γ, θ−1 ).
The group SE(ν) acts on Rν by
A(γ,θ) x = γ + θx.
It is easy to see that this action is compatible with the group operation, i.e., $A_{(0,1)} = \mathrm{Id}$ and $A_{(\gamma_1,\theta_1)}A_{(\gamma_2,\theta_2)} = A_{(\gamma_1,\theta_1)(\gamma_2,\theta_2)}$ (implying, in particular, $A^{-1}_{(\gamma,\theta)} = A_{(\gamma,\theta)^{-1}}$).

As in Section 3.2, consider the space V = L2 (Rν , RdV ). We can view this space as a
SE(ν)–module with the representation canonically associated with the action A:
R(γ,θ) Φ(x) = Φ(A(γ,θ)−1 x), (4.1)
where Φ : Rν → RdV and x ∈ Rν . We define in the same manner the module U =
L2 (Rν , RdU ). In the remainder of the paper we will be interested in approximating continuous
and SE(ν)-equivariant maps f : V → U . Let us first give some examples of such maps.

Linear maps. Assume for simplicity that $d_V = d_U = 1$ and consider a linear SE(ν)-equivariant map $f: L^2(\mathbb{R}^\nu)\to L^2(\mathbb{R}^\nu)$. Such a map can be written as a convolution $f(\Phi) = \Phi*\Psi_f$, where $\Psi_f$ is a radial signal, $\Psi_f(x) = \widetilde\Psi_f(|x|)$. In general, $\Psi_f$ should be understood in a distributional sense.
By applying the Fourier transform $\mathcal F$, the map $f$ can be equivalently described in the Fourier dual space as pointwise multiplication of the given signal by $\mathrm{const}\cdot\mathcal F\Psi_f$ (with the constant depending on the choice of the coefficient in the Fourier transform), so $f$ is SE(ν)-equivariant and continuous if and only if $\mathcal F\Psi_f$ is a radial function belonging to $L^\infty(\mathbb{R}^\nu)$. Note that in this argument we have tacitly complexified the space $L^2(\mathbb{R}^\nu,\mathbb{R})$ into $L^2(\mathbb{R}^\nu,\mathbb{C})$. The condition that $f$ preserves real-valuedness of the signal $\Phi$ translates into $\mathcal F\Psi_f(x) = \overline{\mathcal F\Psi_f(-x)}$, where the bar denotes complex conjugation.
Note that linear SE(ν)-equivariant differential operators, such as the Laplacian $\Delta$, are not included in our class of maps, since they are not even defined on the whole space $V = L^2(\mathbb{R}^\nu)$. However, if we consider a smoothed version of the Laplacian given by $f:\Phi\mapsto\Delta(\Phi*g_\epsilon)$, where $g_\epsilon$ is the variance-$\epsilon$ Gaussian kernel, then this map will be well-defined on the whole $V$, norm-continuous and SE(ν)-equivariant.
Pointwise maps. Consider a pointwise map $f: V\to U$ defined by $f(\Phi)(x) = f_0(\Phi(x))$, where $f_0:\mathbb{R}^{d_V}\to\mathbb{R}^{d_U}$ is some map. In this case $f$ is SE(ν)-equivariant. Note that if $f_0(0)\ne 0$, then $f$ is not well-defined on $V = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$, since $f(\Phi)\notin L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$ for the trivial signal $\Phi(x)\equiv 0$. An easy-to-check sufficient condition for $f$ to be well-defined and continuous on the whole $V$ is that $f_0(0)=0$ and $f_0$ be globally Lipschitz (i.e., $|f_0(x)-f_0(y)|\le c|x-y|$ for all $x,y\in\mathbb{R}^{d_V}$ and some $c<\infty$).
Our goal in this section is to describe a finite computational model that would be a
universal approximator for all continuous and SE(ν)–equivariant maps f : V → U . Following
the strategy of Section 3.2, we aim to define limit points of such finite models and then prove
that the limit points are exactly the continuous and SE(ν)–equivariant maps.
We focus on approximating L2 (Rν , RdU )-valued SE(ν)-equivariant maps rather than RdU -
valued SE(ν)-invariant maps because, as discussed in Section 3, we find it hard to reconcile
the SE(ν)-invariance with pooling.
Note that, as in the previous sections, there is a straightforward symmetrization-based
approach to constructing universal SE(ν)–equivariant models. In particular, the group SE(ν)
extends the group of translations Rν by the compact group SO(ν), and we can construct
SE(ν)–equivariant maps simply by symmetrizing Rν –equivariant maps over SO(ν), as in
Proposition 2.2.
Proposition 4.1. If a map fRν : V → U is continuous and Rν –equivariant, then the map
fSE(ν) : V → U defined by
$$f_{\mathrm{SE}(\nu)}(\Phi) = \int_{\mathrm{SO}(\nu)} R_{(0,\theta)^{-1}}\, f_{\mathbb{R}^\nu}(R_{(0,\theta)}\Phi)\,d\theta$$

is continuous and SE(ν)–equivariant.

Proof. The continuity of $f_{\mathrm{SE}(\nu)}$ follows by elementary arguments using the continuity of $f_{\mathbb{R}^\nu}: V\to U$, uniform boundedness of the operators $R_{(0,\theta)}$, and compactness of SO(ν). The SE(ν)-equivariance follows since for any $(\gamma,\theta_0)\in\mathrm{SE}(\nu)$ and $\Phi\in V$

$$\begin{aligned}
f_{\mathrm{SE}(\nu)}(R_{(\gamma,\theta_0)}\Phi) &= \int_{\mathrm{SO}(\nu)} R_{(0,\theta)^{-1}}\, f_{\mathbb{R}^\nu}(R_{(0,\theta)}R_{(\gamma,\theta_0)}\Phi)\,d\theta\\
&= \int_{\mathrm{SO}(\nu)} R_{(0,\theta)^{-1}}\, f_{\mathbb{R}^\nu}(R_{(\theta\gamma,1)}R_{(0,\theta\theta_0)}\Phi)\,d\theta\\
&= \int_{\mathrm{SO}(\nu)} R_{(0,\theta)^{-1}} R_{(\theta\gamma,1)}\, f_{\mathbb{R}^\nu}(R_{(0,\theta\theta_0)}\Phi)\,d\theta\\
&= \int_{\mathrm{SO}(\nu)} R_{(\gamma,\theta_0)} R_{(0,\theta\theta_0)^{-1}}\, f_{\mathbb{R}^\nu}(R_{(0,\theta\theta_0)}\Phi)\,d\theta\\
&= R_{(\gamma,\theta_0)} f_{\mathrm{SE}(\nu)}(\Phi).
\end{aligned}$$
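A rough numerical illustration of Proposition 4.1 (our sketch: the translation-equivariant map is an arbitrary toy choice, the SO(2) integral is approximated by a finite sum, and rotations on a pixel grid are themselves only approximate, so the symmetrized map is only approximately SE(2)-equivariant):

```python
import numpy as np
from scipy.ndimage import convolve, rotate

def f_translation_equivariant(phi):
    """A toy R^2-equivariant (but not rotation-equivariant) map:
    anisotropic convolution followed by a pointwise nonlinearity."""
    kernel = np.array([[0., 1., 0.],
                       [0., 0., 0.],
                       [0., -1., 0.]])
    return np.tanh(convolve(phi, kernel, mode='wrap'))

def f_symmetrized(phi, n_angles=36):
    """Proposition 4.1: average R_theta^{-1} f(R_theta Phi) over SO(2);
    here the integral is replaced by a finite sum of grid rotations."""
    acc = np.zeros_like(phi)
    for a in np.linspace(0., 360., n_angles, endpoint=False):
        rotated = rotate(phi, a, reshape=False, order=1, mode='wrap')
        acc += rotate(f_translation_equivariant(rotated), -a,
                      reshape=False, order=1, mode='wrap')
    return acc / n_angles
```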

This proposition implies, in particular, that SO(ν)-symmetrizations of merely Rν -


equivariant basic convnets considered in Section 3.2 can serve as universal SE(ν)–equivariant
approximators. However, like in the previous sections, we will be instead interested in an
intrinsically SE(ν)–equivariant network construction not involving explicit symmetrization
of the approximation over the group SO(ν). In particular, our approximators will not use
rotated grids.
Our construction relies heavily on the representation theory of the group SO(ν), and in
the present paper we restrict ourselves to the case ν = 2, in which the group SO(ν) is abelian
and the representation theory is much easier than in the general case.
Section 4.1 contains preliminary considerations suggesting the network construction ap-
propriate for our purpose. The formal detailed description of the model is given in Section 4.2.
In Section 4.3 we formulate and prove the main result of the section, the SE(2)–equivariant
universal approximation property of the model.

4.1 Preliminary considerations


In this section we explain the idea behind our construction of the universal SE(2)-equivariant
convnet (to be formulated precisely in Section 4.2). We start by showing in Section 4.1.1 that
an SE(2)-equivariant map $f: V\to U$ can be described using an SO(2)-invariant map $f_{\mathrm{loc}}: V\to\mathbb{R}^{d_U}$. Then, relying on this observation, in Section 4.1.2 we show that, heuristically, $f$ can be
reconstructed by first equivariantly extracting local “features” from the original signal using
equivariant differentiation, and then transforming these features using a SO(2)-invariant
pointwise map. In Section 4.1.3 we describe discretized differential operators and smoothing
operators that we require in order to formulate our model as a finite computation model
with sufficient regularity. Finally, in Section 4.1.4 we consider polynomial approximations
on SO(2)-modules.

4.1.1 Pointwise characterization of SE(ν)–equivariant maps
In this subsection we show that, roughly speaking, SE(ν)-equivariant maps $f: V\to U$ can be described in terms of SO(ν)-invariant maps $f_{\mathrm{loc}}: V\to\mathbb{R}^{d_U}$ obtained by observing the output signal at a fixed position.
(The proposition below has one technical subtlety: we consider signal values $\Phi(0)$ at a particular point $x=0$ for generic signals $\Phi$ from the space $L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$. Elements of this space are defined as equivalence classes of signals that can differ on sets of zero Lebesgue measure, so, strictly speaking, $\Phi(0)$ is not well-defined. We can circumvent this difficulty by fixing a particular canonical representative of the equivalence class, say

$$\Phi_{\mathrm{canon}}(x) = \begin{cases}\lim_{\epsilon\to 0}\frac{1}{|B_\epsilon(x)|}\int_{B_\epsilon(x)}\Phi(y)\,dy, & \text{if the limit exists},\\ 0, & \text{otherwise}.\end{cases}$$

Lebesgue’s differentiation theorem ensures that the limit exists and agrees with Φ almost
everywhere, so that Φcanon is indeed a representative of the equivalence class. This choice of
the representative is clearly SE(ν)-equivariant. In the proposition below, the signal value at
x = 0 can be understood as the value of such a canonical representative.)
Proposition 4.2. Let $f: L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})\to L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$ be an $\mathbb{R}^\nu$-equivariant map. Then $f$ is SE(ν)-equivariant if and only if $f(R_{(0,\theta)}\Phi)(0) = f(\Phi)(0)$ for all $\theta\in\mathrm{SO}(\nu)$ and $\Phi\in V$.
Proof. One direction of the statement is obvious: if f is SE(ν)–equivariant, then
f (R(0,θ) Φ)(0) = R(0,θ) f (Φ)(0) = f (Φ)(A(0,θ−1 ) 0) = f (Φ)(0).
Let us prove the opposite implication, i.e. that f (R(0,θ) Φ)(0) ≡ f (Φ)(0) implies the
SE(ν)–equivariance. We need to show that for all (γ, θ) ∈ SE(ν), Φ ∈ V and x ∈ Rν we
have
f (R(γ,θ) Φ)(x) = R(γ,θ) f (Φ)(x).
Indeed,

$$\begin{aligned}
f(R_{(\gamma,\theta)}\Phi)(x) &= R_{(-x,1)}f(R_{(\gamma,\theta)}\Phi)(0)\\
&= f(R_{(-x,1)}R_{(\gamma,\theta)}\Phi)(0)\\
&= f(R_{(0,\theta)}R_{(\theta^{-1}(\gamma-x),1)}\Phi)(0)\\
&= f(R_{(\theta^{-1}(\gamma-x),1)}\Phi)(0)\\
&= R_{(\theta^{-1}(\gamma-x),1)}f(\Phi)(0)\\
&= R_{(x,\theta)}R_{(\theta^{-1}(\gamma-x),1)}f(\Phi)(A_{(x,\theta)}0)\\
&= R_{(\gamma,\theta)}f(\Phi)(x),
\end{aligned}$$

where we used definition (4.1) (steps 1 and 6), the $\mathbb{R}^\nu$-equivariance of $f$ (steps 2 and 5), and the hypothesis of the lemma (step 4).
Now, if f : V → U is an SE(ν)–equivariant map, then we can define the SO(ν)–invariant
map floc : V → RdU by
floc (Φ) = f (Φ)(0). (4.2)

Conversely, suppose that floc : V → RdU is an SO(ν)–invariant map. Consider the map
f : V → {Ψ : Rν → RdU } defined by

f (Φ)(x) := floc (R(−x,1) Φ). (4.3)

In general, f (Φ) need not be in L2 (Rν , RdU ). Suppose, however, that this is the case for all
Φ ∈ V. Then f is clearly Rν -equivariant and, moreover, SE(ν)–equivariant, by the above
proposition.
Thus, under some additional regularity assumption, the task of reconstructing SE(ν)–
equivariant maps f : V → U is equivalent to the task of reconstructing SO(ν)–invariant
maps floc : V → RdU .

4.1.2 Equivariant differentiation


It is convenient to describe rigid motions of $\mathbb{R}^2$ by identifying this two-dimensional real space with the one-dimensional complex space $\mathbb{C}$. Then an element of SE(2) can be written as $(\gamma,\theta) = (x+iy, e^{i\phi})$ with some $x,y\in\mathbb{R}$ and $\phi\in[0,2\pi)$. The action of SE(2) on $\mathbb{R}^2\cong\mathbb{C}$ can be written as

$$A_{(x+iy,e^{i\phi})}z = x+iy+e^{i\phi}z,\qquad z\in\mathbb{C}.$$
Using the analogous notation $R_{(x+iy,e^{i\phi})}$ for the canonically associated representation of SE(2) in $V$ defined in (4.1), consider the generators of this representation:

$$J_x = i\lim_{\delta x\to 0}\frac{R_{(\delta x,1)}-\mathbf 1}{\delta x},\qquad J_y = i\lim_{\delta y\to 0}\frac{R_{(i\delta y,1)}-\mathbf 1}{\delta y},\qquad J_\phi = i\lim_{\delta\phi\to 0}\frac{R_{(0,e^{i\delta\phi})}-\mathbf 1}{\delta\phi}.$$
The generators can be explicitly written as

Jx = −i∂x , Jy = −i∂y , Jφ = −i∂φ = −i(x∂y − y∂x )

and obey the commutation relations

[Jx , Jy ] = 0, [Jx , Jφ ] = −iJy , [Jy , Jφ ] = iJx . (4.4)

We are interested in local transformations of signals $\Phi\in V$, so it is natural to consider the action of differential operators on the signals. We would like, however, to ensure the equivariance of this action. This can be done as follows. Consider the first-order operators

$$\partial_z = \tfrac12(\partial_x - i\partial_y),\qquad \partial_{\bar z} = \tfrac12(\partial_x + i\partial_y).$$
These operators commute with $J_x, J_y$, and have the following commutation relations with $J_\phi$:

$$[\partial_z, J_\phi] = \partial_z,\qquad [\partial_{\bar z}, J_\phi] = -\partial_{\bar z},$$

or, equivalently,

$$\partial_z J_\phi = (J_\phi+1)\partial_z,\qquad \partial_{\bar z}J_\phi = (J_\phi-1)\partial_{\bar z}. \tag{4.5}$$
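The relations (4.5) can also be checked directly by symbolic computation; here is a minimal SymPy verification (our sketch, not part of the original argument):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = sp.Function('f')(x, y)

dz   = lambda g: (sp.diff(g, x) - sp.I * sp.diff(g, y)) / 2       # d/dz
dzb  = lambda g: (sp.diff(g, x) + sp.I * sp.diff(g, y)) / 2       # d/dz-bar
Jphi = lambda g: -sp.I * (x * sp.diff(g, y) - y * sp.diff(g, x))  # rotation generator

# [dz, Jphi] = dz and [dzb, Jphi] = -dzb, i.e. Eq. (4.5); both lines should print 0
print(sp.simplify(dz(Jphi(f)) - Jphi(dz(f)) - dz(f)))
print(sp.simplify(dzb(Jphi(f)) - Jphi(dzb(f)) + dzb(f)))
```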

Let us define, for any $\mu\in\mathbb{Z}$,

$$J^{(\mu)}_\phi = J_\phi + \mu = \mu - i\partial_\phi.$$

Then the triple $(J_x, J_y, J^{(\mu)}_\phi)$ obeys the same commutation relations (4.4), i.e., constitutes another representation of the Lie algebra of the group SE(2). The corresponding representation of the group differs from the original representation (4.1) by the extra phase factor:

$$R^{(\mu)}_{(x+iy,e^{i\phi})}\Phi(\mathbf x) = e^{-i\mu\phi}\,\Phi(A_{(x+iy,e^{i\phi})^{-1}}\mathbf x). \tag{4.6}$$

The identities (4.5) imply $\partial_z J^{(\mu)}_\phi = J^{(\mu+1)}_\phi\partial_z$ and $\partial_{\bar z}J^{(\mu)}_\phi = J^{(\mu-1)}_\phi\partial_{\bar z}$. Since the operators $\partial_z, \partial_{\bar z}$ also commute with $J_x, J_y$, we see that the operators $\partial_z, \partial_{\bar z}$ can serve as ladder operators equivariantly mapping

$$\partial_z: V_\mu\to V_{\mu+1},\qquad \partial_{\bar z}: V_\mu\to V_{\mu-1}, \tag{4.7}$$

where $V_\mu$ is the space $L^2(\mathbb{R}^2,\mathbb{R}^{d_V})$ equipped with the representation (4.6). Thus, we can equivariantly differentiate signals as long as we appropriately switch the representation. In the sequel, we will for brevity refer to the parameter $\mu$ characterizing the representation as its global charge.
It is convenient to also consider another kind of charge, associated with the angular dependence of the signal with respect to rotations about fixed points; let us call it local charge $\eta$, in contrast to the above global charge $\mu$. Namely, for any fixed $x_0\in\mathbb{R}^2$, decompose the module $V_\mu$ as

$$V_\mu = \bigoplus_{\eta\in\mathbb{Z}} V^{(x_0)}_{\mu,\eta}, \tag{4.8}$$

where

$$V^{(x_0)}_{\mu,\eta} = R_{(x_0,1)}V^{(0)}_{\mu,\eta}, \tag{4.9}$$

and

$$V^{(0)}_{\mu,\eta} = \{\Phi\in V_\mu \mid \Phi(A_{(0,e^{i\phi})^{-1}}x) = e^{-i\eta\phi}\Phi(x)\ \forall\phi\}. \tag{4.10}$$
Writing $x_0 = (x_0, y_0)$, we can characterize $V^{(x_0)}_{\mu,\eta}$ as the eigenspace of the operator

$$J^{(x_0)}_\phi := R_{(x_0,1)}J_\phi R_{(x_0,1)^{-1}} = -i(x-x_0)\partial_y + i(y-y_0)\partial_x$$

corresponding to the eigenvalue $\eta$. The operator $J^{(x_0)}_\phi$ has the same commutation relations with $\partial_z, \partial_{\bar z}$ as $J_\phi$:

$$[\partial_z, J^{(x_0)}_\phi] = \partial_z,\qquad [\partial_{\bar z}, J^{(x_0)}_\phi] = -\partial_{\bar z}.$$
We can then describe the structure of the equivariant maps (4.7) with respect to decomposition (4.8) as follows: for any $x_0$, the decrease or increase of the global charge by the respective ladder operator is compensated by the opposite effect of this operator on the local charge, i.e., $\partial_z$ maps $V^{(x_0)}_{\mu,\eta}$ to $V^{(x_0)}_{\mu+1,\eta-1}$ while $\partial_{\bar z}$ maps $V^{(x_0)}_{\mu,\eta}$ to $V^{(x_0)}_{\mu-1,\eta+1}$:

$$\partial_z: V^{(x_0)}_{\mu,\eta}\to V^{(x_0)}_{\mu+1,\eta-1},\qquad \partial_{\bar z}: V^{(x_0)}_{\mu,\eta}\to V^{(x_0)}_{\mu-1,\eta+1}. \tag{4.11}$$

We interpret these identities as conservation of the total charge, µ + η. We remark that there
is some similarity between our total charge and the total angular momentum in quantum
mechanics; the total angular momentum there consists of the spin component and the orbital
component that are analogous to our global and local charge, respectively.
Now we give a heuristic argument showing how to express an arbitrary equivariant map $f: V\to U$ using our equivariant differentiation. As discussed in the previous subsection, the task of expressing $f$ reduces to expressing $f_{\mathrm{loc}}$ using formulas (4.2),(4.3). Let a signal $\Phi$ be analytic as a function of the real variables $x, y$; then it can be Taylor expanded as

$$\Phi = \sum_{a,b=0}^{\infty}\frac{1}{a!\,b!}\,\partial_z^a\partial_{\bar z}^b\Phi(0)\,\Phi_{a,b}, \tag{4.12}$$

with the basis signals $\Phi_{a,b}$ given by

$$\Phi_{a,b}(z) = z^a\bar z^b.$$

The signal $\Phi$ is fully determined by the coefficients $\partial_z^a\partial_{\bar z}^b\Phi(0)$, so the map $f_{\mathrm{loc}}$ can be expressed as a function of these coefficients:

$$f_{\mathrm{loc}}(\Phi) = \tilde f_{\mathrm{loc}}\big((\partial_z^a\partial_{\bar z}^b\Phi(0))_{a,b=0}^{\infty}\big). \tag{4.13}$$

At $x_0=0$, the signals $\Phi_{a,b}$ have local charge $\eta = a-b$, and, if viewed as elements of $V_{\mu=0}$, transform under rotations by

$$R_{(0,e^{i\phi})}\Phi_{a,b} = e^{-i(a-b)\phi}\Phi_{a,b}.$$

Accordingly, if we write $\Phi$ in the form $\Phi = \sum_{a,b}c_{a,b}\Phi_{a,b}$, then

$$R_{(0,e^{i\phi})}\Phi = \sum_{a,b} e^{-i(a-b)\phi}\,c_{a,b}\Phi_{a,b}.$$

It follows that the SO(2)-invariance of $f_{\mathrm{loc}}$ is equivalent to $\tilde f_{\mathrm{loc}}$ being invariant with respect to simultaneous multiplication of the arguments by the factors $e^{-i(a-b)\phi}$:

$$\tilde f_{\mathrm{loc}}\big((e^{-i(a-b)\phi}c_{a,b})_{a,b=0}^{\infty}\big) = \tilde f_{\mathrm{loc}}\big((c_{a,b})_{a,b=0}^{\infty}\big)\qquad\forall\phi.$$

Having determined the invariant map $\tilde f_{\mathrm{loc}}$, we can express the value of $f(\Phi)$ at an arbitrary point $x\in\mathbb{R}^2$ by

$$f(\Phi)(x) = \tilde f_{\mathrm{loc}}\big((\partial_z^a\partial_{\bar z}^b\Phi(x))_{a,b=0}^{\infty}\big). \tag{4.14}$$

Thus, the map $f$ can be expressed, at least heuristically, by first computing various derivatives of the signal and then applying to them the invariant map $\tilde f_{\mathrm{loc}}$, independently at each $x\in\mathbb{R}^2$.
The expression (4.14) has the following interpretation in terms of information flow and
the two different kinds of charges introduced above. Given an input signal Φ ∈ V and

$x\in\mathbb{R}^2$, the signal has global charge $\mu=0$, but, in general, contains multiple components having different values of the local charge $\eta$ with respect to $x$, according to the decomposition $V = V_{\mu=0} = \oplus_{\eta\in\mathbb{Z}}V^{(x)}_{0,\eta}$. By (4.11), a differential operator $\partial_z^a\partial_{\bar z}^b$ maps the space $V^{(x)}_{0,\eta}$ to the space $V^{(x)}_{a-b,\eta+b-a}$. However, if a signal $\Psi\in V^{(x)}_{a-b,\eta+b-a}$ is continuous at $x$, then $\Psi$ must vanish there unless $\eta+b-a=0$ (see the definitions (4.9),(4.10)), i.e., only information from the $V^{(x)}_{0,\eta}$-component of $\Phi$ with $\eta=a-b$ is observed in $\partial_z^a\partial_{\bar z}^b\Phi(x)$. Thus, at each point $x$, the differential operator $\partial_z^a\partial_{\bar z}^b$ can be said to transform information contained in $\Phi$ and associated with global charge $\mu=0$ and local charge $\eta=a-b$ into information associated with global charge $\mu=a-b$ and local charge $\eta=0$. This transformation is useful to us because the local charge only reflects the structure of the input signal, while the global charge is a part of the architecture of the computational model and can be used to directly control the information flow. The operators $\partial_z^a\partial_{\bar z}^b$ deliver to the point $x$ information about the signal values away from this point – similarly to how this is done by local convolutions in the convnets of Section 3 – but now this information flow is equivariant with respect to the action of SO(2).
By (4.14), the SE(2)-equivariant map $f$ can be heuristically decomposed into the family of SE(2)-equivariant differentiations producing "local features" $\partial_z^a\partial_{\bar z}^b\Phi(x)$, followed by the SO(2)-invariant map $\tilde f_{\mathrm{loc}}$ acting independently at each $x$. In the sequel, we use this decomposition as a general strategy in our construction of the finite convnet-like approximation model in Section 4.2 – the "charge-conserving convnet" – and in the proof of its universality in Section 4.3.
The Taylor expansion (4.12) is not rigorously applicable to generic signals $\Phi\in L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$. Therefore, we will add smoothing to our convnet-like model, to be performed before the differentiation operations. This will be discussed below in Section 4.1.3. Also, we will discuss there the discretization of the differential operators, in order to formulate the charge-conserving convnet as a finite computational model.
The invariant map $\tilde f_{\mathrm{loc}}$ can be approximated using invariant polynomials, as we discuss in Section 4.1.4 below. As discussed earlier in Section 2, invariant polynomials can be produced from a set of generating polynomials; however, in the present setting this set is rather large and grows rapidly as the charge increases, so it will be more efficient to just generate new invariant polynomials by multiplying general polynomials of lower degree subject to charge conservation. As a result, we will approximate the map $\tilde f_{\mathrm{loc}}$ by a series of multiplication layers in the charge-conserving convnet.

4.1.3 Discretized differential operators


Like in Section 3, we aim to formulate the approximation model as a computation which
is fully finite except for the initial discretization of the input signal. Therefore we need
to discretize the equivariant differential operators considered in Section 4.1.2. Given a dis-
cretized signal Φ : (λZ)2 → RdV on the grid of spacing λ, and writing grid points as

$\gamma = (\lambda\gamma_x,\lambda\gamma_y)\in(\lambda\mathbb{Z})^2$, we define the discrete derivatives $\partial_z^{(\lambda)}, \partial_{\bar z}^{(\lambda)}$ by

$$\partial_z^{(\lambda)}\Phi(\lambda\gamma_x,\lambda\gamma_y) = \frac{1}{4\lambda}\Big[\Phi\big(\lambda(\gamma_x+1),\lambda\gamma_y\big) - \Phi\big(\lambda(\gamma_x-1),\lambda\gamma_y\big) - i\Big(\Phi\big(\lambda\gamma_x,\lambda(\gamma_y+1)\big) - \Phi\big(\lambda\gamma_x,\lambda(\gamma_y-1)\big)\Big)\Big], \tag{4.15}$$

$$\partial_{\bar z}^{(\lambda)}\Phi(\lambda\gamma_x,\lambda\gamma_y) = \frac{1}{4\lambda}\Big[\Phi\big(\lambda(\gamma_x+1),\lambda\gamma_y\big) - \Phi\big(\lambda(\gamma_x-1),\lambda\gamma_y\big) + i\Big(\Phi\big(\lambda\gamma_x,\lambda(\gamma_y+1)\big) - \Phi\big(\lambda\gamma_x,\lambda(\gamma_y-1)\big)\Big)\Big]. \tag{4.16}$$

Since general signals $\Phi\in L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$ are not differentiable, we will smoothen them prior to differentiating. Smoothing will also be a part of the computational model and can be implemented by local operations as follows. Consider the discrete Laplacian $\Delta^{(\lambda)}$ defined by

$$\Delta^{(\lambda)}\Phi(\lambda\gamma_x,\lambda\gamma_y) = \frac{1}{\lambda^2}\Big[\Phi\big(\lambda(\gamma_x+1),\lambda\gamma_y\big) + \Phi\big(\lambda(\gamma_x-1),\lambda\gamma_y\big) + \Phi\big(\lambda\gamma_x,\lambda(\gamma_y+1)\big) + \Phi\big(\lambda\gamma_x,\lambda(\gamma_y-1)\big) - 4\Phi(\lambda\gamma_x,\lambda\gamma_y)\Big]. \tag{4.17}$$

Then, a single smoothing layer can be implemented by the positive definite operator $1+\frac{\lambda^2}{8}\Delta^{(\lambda)}$:

$$\Big(1+\frac{\lambda^2}{8}\Delta^{(\lambda)}\Big)\Phi(\lambda\gamma_x,\lambda\gamma_y) = \frac{1}{8}\Big[\Phi\big(\lambda(\gamma_x+1),\lambda\gamma_y\big) + \Phi\big(\lambda(\gamma_x-1),\lambda\gamma_y\big) + \Phi\big(\lambda\gamma_x,\lambda(\gamma_y+1)\big) + \Phi\big(\lambda\gamma_x,\lambda(\gamma_y-1)\big) + 4\Phi(\lambda\gamma_x,\lambda\gamma_y)\Big]. \tag{4.18}$$
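A minimal NumPy rendition (ours) of the stencils (4.15)-(4.18) and the composite operator (4.19) below; the function names are our own, and for simplicity the grid is made periodic via np.roll instead of letting the domain shrink as in the text:

```python
import numpy as np

def d_z(phi, lam):
    """Discrete derivative (4.15); phi is a complex 2D array on the grid (lam*Z)^2."""
    return ((np.roll(phi, -1, axis=0) - np.roll(phi, 1, axis=0))
            - 1j * (np.roll(phi, -1, axis=1) - np.roll(phi, 1, axis=1))) / (4 * lam)

def d_zbar(phi, lam):
    """Discrete derivative (4.16)."""
    return ((np.roll(phi, -1, axis=0) - np.roll(phi, 1, axis=0))
            + 1j * (np.roll(phi, -1, axis=1) - np.roll(phi, 1, axis=1))) / (4 * lam)

def smooth(phi):
    """One smoothing layer, Eq. (4.18): the operator 1 + (lam^2/8) * Delta^(lam)."""
    return (np.roll(phi, -1, axis=0) + np.roll(phi, 1, axis=0)
            + np.roll(phi, -1, axis=1) + np.roll(phi, 1, axis=1) + 4 * phi) / 8

def L_ab(phi, lam, a, b):
    """Eq. (4.19): ceil(4/lam^2) smoothing layers, then a discrete d_z's and b d_zbar's."""
    for _ in range(int(np.ceil(4 / lam ** 2))):
        phi = smooth(phi)
    for _ in range(a):
        phi = d_z(phi, lam)
    for _ in range(b):
        phi = d_zbar(phi, lam)
    return phi
```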

We will then replace the differential operators $\partial_z^a\partial_{\bar z}^b$ used in the heuristic argument in Section 4.1.2 by the discrete operators

$$L^{(a,b)}_\lambda = (\partial_z^{(\lambda)})^a(\partial_{\bar z}^{(\lambda)})^b\Big(1+\frac{\lambda^2}{8}\Delta^{(\lambda)}\Big)^{\lceil 4/\lambda^2\rceil}P_\lambda. \tag{4.19}$$

Here $P_\lambda$ is the discretization projector (3.6). The power $\lceil 4/\lambda^2\rceil$ (i.e., the number of smoothing layers) scales with $\lambda$ so that in the continuum limit $\lambda\to 0$ the operators $L^{(a,b)}_\lambda$ converge to convolution operators. Specifically, consider the function $\Psi_{a,b}:\mathbb{R}^2\to\mathbb{C}$:

$$\Psi_{a,b} = \frac{1}{2\pi}\,\partial_z^a\partial_{\bar z}^b\, e^{-|x|^2/2}, \tag{4.20}$$

where we identify $|x|^2\equiv z\bar z$. Define the operator $L^{(a,b)}_0$ by $L^{(a,b)}_0\Phi = \Phi*\Psi_{a,b}$, i.e.

$$L^{(a,b)}_0\Phi(x) = \int_{\mathbb{R}^2}\Phi(x-y)\Psi_{a,b}(y)\,d^2y. \tag{4.21}$$

Then we have the following lemma proved in Appendix A.

Lemma 4.1. Let $a, b$ be fixed nonnegative integers. For all $\lambda\in[0,1]$, consider the linear operators $L^{(a,b)}_\lambda$ as operators from $L^2(\mathbb{R}^2,\mathbb{R}^{d_V})$ to $L^\infty(\mathbb{R}^2,\mathbb{R}^{d_V})$. Then:

1. The operators $L^{(a,b)}_\lambda$ are bounded uniformly in $\lambda$;

2. As $\lambda\to 0$, the operators $L^{(a,b)}_\lambda$ converge strongly to the operator $L^{(a,b)}_0$. Moreover, this convergence is uniform on compact sets $K\subset V$ (i.e., $\lim_{\lambda\to 0}\sup_{\Phi\in K}\|L^{(a,b)}_\lambda\Phi - L^{(a,b)}_0\Phi\|_\infty = 0$).

This lemma is essentially just a slight modification of the Central Limit Theorem. It will be convenient to consider $L^\infty$ rather than $L^2$ in the target space because of the pointwise polynomial action of the layers following the smoothing and differentiation layers.

4.1.4 Polynomial approximations on SO(2)-modules


Our derivation of the approximating model in Section 4.1.2 was based on identifying the SO(2)-invariant map $f_{\mathrm{loc}}$ introduced in (4.2) and expressing it via $\tilde f_{\mathrm{loc}}$ by Eq. (4.13). It is convenient to approximate the map $\tilde f_{\mathrm{loc}}$ by invariant polynomials on appropriate SO(2)-modules, and in this section we state several general facts relevant for this purpose.

First, the following lemma is obtained immediately using symmetrization and the Weierstrass theorem (see, e.g., the proof of Proposition 2.5).

Lemma 4.2. Let f : W → R be a continuous SO(2)-invariant map on a real finite-


dimensional SO(2)-module W . Then f can be approximated by polynomial invariants on
W.

We therefore focus on constructing general polynomial invariants on SO(2)-modules. This


can be done in several ways; we will describe just one particular construction performed in
a “layerwise” fashion resembling convnet layers.
It is convenient to first consider the case of SO(2)-modules over the field C, since the
representation theory of the group SO(2) is especially easily described when the underlying
field is C. Let us identify elements of SO(2) with the unit complex numbers eiφ . Then all
complex irreducible representations of SO(2) are one-dimensional characters indexed by the
number ξ ∈ Z:
Reiφ x = eiξφ x. (4.22)
The representation R induces the dual representation acting on functions f (x):

Re∗iφ f (x) = f (Re−iφ x).

In particular, if zξ is the variable associated with the one-dimensional space where represen-
tation (4.22) acts, then it is transformed by the dual representation as

Re∗iφ zξ = e−iξφ zξ .

Now let $W$ be a general finite-dimensional SO(2)-module over $\mathbb{C}$. Then $W$ can be decomposed as

$$W = \bigoplus_\xi W_\xi, \tag{4.23}$$

where $W_\xi\cong\mathbb{C}^{d_\xi}$ is the isotypic component of the representation (4.22). Let $z_{\xi k}$, $k=1,\ldots,d_\xi$, denote the variables associated with the subspace $W_\xi$. If $f$ is a polynomial on $W$, we can write it as a linear combination of monomials:

$$f = \sum_{a=(a_{\xi k})} c_a\prod_{\xi,k} z_{\xi k}^{a_{\xi k}}. \tag{4.24}$$

Then the dual representation acts on $f$ by

$$R^*_{e^{i\phi}}f = \sum_{a=(a_{\xi k})} e^{-i\sum_{\xi,k}\xi a_{\xi k}\phi}\,c_a\prod_{\xi,k} z_{\xi k}^{a_{\xi k}}.$$

We see that a polynomial is invariant iff it consists of invariant monomials, and a monomial is invariant iff $\sum_{\xi,k}\xi a_{\xi k} = 0$.
We can generate an arbitrary SO(2)-invariant polynomial on $W$ in the following "layerwise" fashion. Suppose that $\{f_{t-1,\xi,n}\}_{n=1}^{N_{t-1}}$ is a collection of polynomials generated after $t-1$ layers so that

$$R^*_{e^{i\phi}}f_{t-1,\xi,n} = e^{-i\xi\phi}f_{t-1,\xi,n} \tag{4.25}$$

for all $\xi, n$. Consider new polynomials $\{f_{t,\xi,n}\}_{n=1}^{N_t}$ obtained from $\{f_{t-1,\xi,n}\}_{n=1}^{N_{t-1}}$ by applying the second degree expressions

$$f_{t,\xi,n} = w^{(t)}_{0,n}\mathbf 1_{\xi=0} + \sum_{n_1=1}^{N_{t-1}} w^{(t)}_{1,\xi,n,n_1}f_{t-1,\xi,n_1} + \sum_{\xi_1+\xi_2=\xi}\ \sum_{n_1=1}^{N_{t-1}}\sum_{n_2=1}^{N_{t-1}} w^{(t)}_{2,\xi_1,\xi_2,n,n_1,n_2}f_{t-1,\xi_1,n_1}f_{t-1,\xi_2,n_2} \tag{4.26}$$

with some (complex) coefficients $w^{(t)}_{0,n}, w^{(t)}_{1,\xi,n,n_1}, w^{(t)}_{2,\xi_1,\xi_2,n,n_1,n_2}$. The first term is present only for $\xi=0$. The third term includes the "charge conservation" constraint $\xi = \xi_1+\xi_2$. It is clear that once condition (4.25) holds for $\{f_{t-1,\xi,n}\}_{n=1}^{N_{t-1}}$, it also holds for $\{f_{t,\xi,n}\}_{n=1}^{N_t}$.

On the other hand, suppose that the initial set $\{f_{1,\xi,n}\}_{n=1}^{N_1}$ includes all variables $z_{\xi k}$. Then for any invariant polynomial $f$ on $W$, we can arrange the parameters $N_t$ and the coefficients in Eq. (4.26) so that at some $t$ we obtain $f_{t,\xi=0,1} = f$. Indeed, first note that thanks to the second term in Eq. (4.26) it suffices to show this for the case when $f$ is an invariant monomial (since any invariant polynomial is a linear combination of invariant monomials, and the second term allows us to form and pass forward such linear combinations). If $f$ is a constant, then it can be produced using the first term in Eq. (4.26). If $f$ is a monomial of a positive degree, then it can be produced by multiplying lower degree monomials, which is afforded by the third term in Eq. (4.26).
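The following toy script (ours: random weights, hypothetical names) runs two layers of the rule (4.26) and checks numerically that the charge-0 outputs are invariant under the dual SO(2) action $z_\xi\mapsto e^{-i\xi\phi}z_\xi$:

```python
import numpy as np

rng = np.random.default_rng(2)

def mult_layer(feats, n_out, charges):
    """One layer of Eq. (4.26): constant, linear, and charge-conserving quadratic
    terms; feats maps a charge xi to a complex feature vector."""
    out = {}
    for xi in charges:
        val = np.zeros(n_out, dtype=complex)
        if xi == 0:
            val += rng.normal(size=n_out)                               # constant term
        if xi in feats:
            val += rng.normal(size=(n_out, len(feats[xi]))) @ feats[xi] # linear term
        for xi1 in feats:                                               # quadratic, xi = xi1 + xi2
            xi2 = xi - xi1
            if xi2 in feats:
                w2 = rng.normal(size=(n_out, len(feats[xi1]), len(feats[xi2])))
                val += np.einsum('nab,a,b->n', w2, feats[xi1], feats[xi2])
        out[xi] = val
    return out

charges = range(-2, 3)
z = {xi: rng.normal(size=2) + 1j * rng.normal(size=2) for xi in charges}
phi = 0.7
z_rot = {xi: np.exp(-1j * xi * phi) * z[xi] for xi in charges}          # dual SO(2) action

state = rng.bit_generator.state                  # replay identical random weights twice
f1 = mult_layer(mult_layer(z, 3, charges), 3, charges)
rng.bit_generator.state = state
f2 = mult_layer(mult_layer(z_rot, 3, charges), 3, charges)
assert np.allclose(f1[0], f2[0])                 # charge-0 outputs are SO(2)-invariant
```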
Now we discuss the case of the underlying field R. In this case, apart from the trivial
one-dimensional representation, all irreducible representations of SO(2) are two-dimensional

and indexed by $\xi = 1, 2, \ldots$:

$$R_{e^{i\phi}}\begin{pmatrix}x\\ y\end{pmatrix} = \begin{pmatrix}\cos\xi\phi & \sin\xi\phi\\ -\sin\xi\phi & \cos\xi\phi\end{pmatrix}\begin{pmatrix}x\\ y\end{pmatrix}. \tag{4.27}$$

It is convenient to diagonalize such a representation, turning it into a pair of complex conjugate one-dimensional representations:

$$R_{e^{i\phi}}\begin{pmatrix}z\\ \bar z\end{pmatrix} = \begin{pmatrix}e^{-i\xi\phi} & 0\\ 0 & e^{i\xi\phi}\end{pmatrix}\begin{pmatrix}z\\ \bar z\end{pmatrix}, \tag{4.28}$$

where

$$z = x+iy,\qquad \bar z = x-iy.$$
More generally, any real SO(2)-module $W$ can be decomposed exactly as in (4.23) into isotypic components $W_\xi$ associated with complex characters, but with the additional constraints

$$\overline{W_\xi} = W_{-\xi}, \tag{4.29}$$

meaning that $d_\xi = d_{-\xi}$ and

$$W_\xi = W_{\pm\xi,\mathrm{Re}} + iW_{\pm\xi,\mathrm{Im}},\qquad W_{-\xi} = W_{\pm\xi,\mathrm{Re}} - iW_{\pm\xi,\mathrm{Im}}\qquad(\xi\ne 0)$$

with some real $d_\xi$-dimensional spaces $W_{\pm\xi,\mathrm{Re}}, W_{\pm\xi,\mathrm{Im}}$.
Any polynomial on $W$ can then be written in terms of real variables $z_{0,k}$ corresponding to $\xi=0$ and complex variables

$$z_{\xi,k} = x_{\xi k} + iy_{\xi k},\qquad z_{-\xi,k} = x_{\xi k} - iy_{\xi k}\qquad(\xi>0) \tag{4.30}$$

constrained by the relations

$$\overline{z_{\xi,k}} = z_{-\xi,k}.$$
Suppose that a polynomial $f$ on $W$ is expanded over monomials in $z_{\xi,k}$ as in Eq. (4.24). This expansion is unique (the coefficients are given by

$$c_a = \Big(\prod_{\xi,k}\frac{\partial_{z_{\xi,k}}^{a_{\xi,k}}}{a_{\xi,k}!}\Big)f(0),$$

where $\partial_{z_{\xi,k}} = \frac12(\partial_{x_{\xi k}} - i\partial_{y_{\xi k}})$ for $\xi>0$ and $\partial_{z_{\xi,k}} = \frac12(\partial_{x_{-\xi,k}} + i\partial_{y_{-\xi,k}})$ for $\xi<0$). This implies that the condition for the polynomial $f$ to be invariant on $W$ is the same as in the previously considered complex case: the polynomial must consist of invariant monomials, and a monomial is invariant iff $\sum_{\xi,k}\xi a_{\xi k} = 0$.
Therefore, in the case of real SO(2)-modules, any invariant polynomial can be generated using the same procedure described earlier for the complex case, i.e., by taking the complex extension of the module and iteratively generating (complex) polynomials $\{f_{t,\xi,n}\}_{n=1}^{N_t}$ using Eq. (4.26). The real part of a complex invariant polynomial on a real module is a real invariant polynomial. Thus, to ensure that in the case of real modules $W$ the procedure produces all real invariant polynomials, and only such polynomials, we can just add taking the real part of $f_{t,\xi=0,1}$ at the last step of the procedure.

4.2 Charge-conserving convnet
We can now describe precisely our convnet-like model for approximating arbitrary SE(2)-equivariant continuous maps $f: V\to U$, where $V = L^2(\mathbb{R}^2,\mathbb{R}^{d_V})$, $U = L^2(\mathbb{R}^2,\mathbb{R}^{d_U})$. The overview of the model is given in Fig. 3. Like the models of Section 3, the present model starts with the discretization projection followed by some finite computation. The model includes three groups of layers: smoothing layers ($L_{\mathrm{smooth}}$), differentiation layers ($L_{\mathrm{diff}}$) and multiplication layers ($L_{\mathrm{mult}}$). The parameters of the model are the lattice spacing $\lambda$, the cutoff range $\Lambda$ of the output, the dimension $d_{\mathrm{mult}}$ of the auxiliary spaces, and the numbers $T_{\mathrm{diff}}, T_{\mathrm{mult}}$ of differentiation and multiplication layers. The overall operation of the model can be described as the chain

$$\hat f: V\xrightarrow{P_{\lambda,\Lambda'}} V_{\lambda,\Lambda'}(\equiv W_1)\xrightarrow{L_{\mathrm{smooth}}} W_{\mathrm{smooth}}\xrightarrow{L_{\mathrm{diff}}} W_{\mathrm{diff}}\xrightarrow{L_{\mathrm{mult}}} U_{\lambda,\Lambda}. \tag{4.31}$$
We describe now all these layers in detail.

Initial projection. The initial discretization projection $P_{\lambda,\Lambda'}$ is defined as explained in Section 3 after Eq. (3.8). The input cutoff range $\Lambda'$ is given by $\Lambda' = \Lambda + (T_{\mathrm{diff}} + \lceil 4/\lambda^2\rceil)\lambda$. This padding ensures that the output cutoff range will be equal to the specified value $\Lambda$. With respect to the spatial grid structure, the space $W_1$ can be decomposed as

$$W_1 = \bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda'/\lambda\rfloor}}\mathbb{R}^{d_V},$$

where $Z_L$ is the cubic subset of the grid defined in (3.7).

Smoothing layers. The model contains $\lceil 4/\lambda^2\rceil$ smoothing layers performing the same elementary smoothing operation $1+\frac{\lambda^2}{8}\Delta^{(\lambda)}$:

$$L_{\mathrm{smooth}} = \Big(1+\frac{\lambda^2}{8}\Delta^{(\lambda)}\Big)^{\lceil 4/\lambda^2\rceil},$$

where the discrete Laplacian $\Delta^{(\lambda)}$ is defined as in Eq. (4.17). In each layer the value of the transformed signal at the current spatial position is determined by the values of the signal in the previous layer at this position and its 4 nearest neighbors, as given in Eq. (4.18). Accordingly, the domain size shrinks with each layer, so that the output space of $L_{\mathrm{smooth}}$ can be written as

$$W_{\mathrm{smooth}} = \bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda''/\lambda\rfloor}}\mathbb{R}^{d_V},$$

where $\Lambda'' = \Lambda' - \lceil 4/\lambda^2\rceil\lambda = \Lambda + T_{\mathrm{diff}}\lambda$.

Differentiation layers. The model contains $T_{\mathrm{diff}}$ differentiation layers computing the discretized derivatives $\partial_z^{(\lambda)}, \partial_{\bar z}^{(\lambda)}$ as defined in (4.15),(4.16). Like the smoothing layers, these derivatives shrink the domain, but additionally, as discussed in Section 4.1.2, they change the representation of the group SE(2) associated with the global charge $\mu$ (see Eq. (4.7)).
Figure 3: Architecture of the charge-conserving convnet. The top figure shows the informa-
tion flow in the fixed-charge subspaces of the feature space, while the bottom figure shows
the same flow in the spatial coordinates. The smoothing layers only act on spatial dimen-
sions, the multiplication layers only on feature dimensions, and the differentiation layers both
on spatial and feature dimensions. Operation of smoothing and differentiation layers only
involves nearest neighbors while in the multiplication layers the transitions are constrained
by the requirement of charge conservation. The smoothing and differentiation layers are
linear; the multiplication layers are not. The last multiplication layer only has zero-charge
(SO(2)–invariant) output.
Denoting the individual differentiation layers by $L_{\mathrm{diff},t}$, $t=1,\ldots,T_{\mathrm{diff}}$, their action can be described as the chain

$$L_{\mathrm{diff}}: W_{\mathrm{smooth}}\xrightarrow{L_{\mathrm{diff},1}} W_{\mathrm{diff},1}\xrightarrow{L_{\mathrm{diff},2}} W_{\mathrm{diff},2}\ \cdots\ \xrightarrow{L_{\mathrm{diff},T_{\mathrm{diff}}}} W_{\mathrm{diff},T_{\mathrm{diff}}}(\equiv W_{\mathrm{diff}})$$

We decompose each intermediate space $W_{\mathrm{diff},t}$ into subspaces characterized by the degree $s$ of the derivative and by the charge $\mu$:

$$W_{\mathrm{diff},t} = \bigoplus_{s=0}^{t}\bigoplus_{\mu=-s}^{s} W_{\mathrm{diff},t,s,\mu}. \tag{4.32}$$

Each $W_{\mathrm{diff},t,s,\mu}$ can be further decomposed as a direct sum over the grid points:

$$W_{\mathrm{diff},t,s,\mu} = \bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda/\lambda\rfloor+T_{\mathrm{diff}}-t}}\mathbb{C}^{d_V}. \tag{4.33}$$

Consider the operator $L_{\mathrm{diff},t}$ as a block matrix with respect to decomposition (4.32) of the input and output spaces $W_{\mathrm{diff},t-1}, W_{\mathrm{diff},t}$, and denote by $(L_{\mathrm{diff},t})_{(s_{t-1},\mu_{t-1})\to(s_t,\mu_t)}$ the respective blocks. Then we define

$$\big(L_{\mathrm{diff},t}\big)_{(s_{t-1},\mu_{t-1})\to(s_t,\mu_t)} = \begin{cases}\partial_z^{(\lambda)}, & \text{if } s_t = s_{t-1}+1,\ \mu_t = \mu_{t-1}+1,\\ \partial_{\bar z}^{(\lambda)}, & \text{if } s_t = s_{t-1}+1,\ \mu_t = \mu_{t-1}-1,\\ 1, & \text{if } s_t = s_{t-1},\ \mu_t = \mu_{t-1},\\ 0, & \text{otherwise}.\end{cases} \tag{4.34}$$
With this definition, the final space $W_{\mathrm{diff},T_{\mathrm{diff}}}$ contains all discrete derivatives $(\partial_z^{(\lambda)})^a(\partial_{\bar z}^{(\lambda)})^b\Phi$ of the smoothed signal $\Phi\in W_{\mathrm{smooth}}$ of degrees $s = a+b\le T_{\mathrm{diff}}$. Each such derivative can be obtained by arranging the elementary steps (4.34) in different orders, so that the derivative will actually appear in $W_{\mathrm{diff},T_{\mathrm{diff}}}$ with the coefficient $\frac{T_{\mathrm{diff}}!}{a!\,b!\,(T_{\mathrm{diff}}-a-b)!}$. This coefficient is not important for the subsequent exposition.

Multiplication layers. In contrast to the smoothing and differentiation layers, the


multiplication layers act strictly locally (pointwise). These layers implement products and
linear combinations of signals of the preceding layers subject to conservation of global charge,
based on the procedure of generation of invariant polynomials described in Section 4.1.4.
Denoting the individual layers by $L_{\mathrm{mult},t}$, $t=1,\ldots,T_{\mathrm{mult}}$, their action is described by the chain

$$L_{\mathrm{mult}}: W_{\mathrm{diff}}\xrightarrow{L_{\mathrm{mult},1}} W_{\mathrm{mult},1}\xrightarrow{L_{\mathrm{mult},2}} W_{\mathrm{mult},2}\ \cdots\ \xrightarrow{L_{\mathrm{mult},T_{\mathrm{mult}}}} W_{\mathrm{mult},T_{\mathrm{mult}}}\equiv U_{\lambda,\Lambda}.$$
Each space $W_{\mathrm{mult},t}$ except for the final one ($W_{\mathrm{mult},T_{\mathrm{mult}}}$) is decomposed into subspaces characterized by spatial position $\gamma\in(\lambda\mathbb{Z})^2$ and charge $\mu$:

$$W_{\mathrm{mult},t} = \bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda/\lambda\rfloor}}\bigoplus_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}} W_{\mathrm{mult},t,\gamma,\mu}. \tag{4.35}$$

Each space $W_{\mathrm{mult},t,\gamma,\mu}$ is a complex $d_{\mathrm{mult}}$-dimensional space, where $d_{\mathrm{mult}}$ is a parameter of the model:

$$W_{\mathrm{mult},t,\gamma,\mu} = \mathbb{C}^{d_{\mathrm{mult}}}.$$

The final space $W_{\mathrm{mult},T_{\mathrm{mult}}}$ is real, $d_U$-dimensional, and only has the charge-0 component:

$$W_{\mathrm{mult},T_{\mathrm{mult}}} = \bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda/\lambda\rfloor}} W_{\mathrm{mult},T_{\mathrm{mult}},\gamma,\mu=0},\qquad W_{\mathrm{mult},T_{\mathrm{mult}},\gamma,\mu=0} = \mathbb{R}^{d_U},$$

so that $W_{\mathrm{mult},T_{\mathrm{mult}}}$ can be identified with $U_{\lambda,\Lambda}$. The initial space $W_{\mathrm{diff}}$ can also be expanded in the form (4.35) by reshaping its components (4.32),(4.33):

$$W_{\mathrm{diff}} = \bigoplus_{s=0}^{T_{\mathrm{diff}}}\bigoplus_{\mu=-s}^{s} W_{\mathrm{diff},T_{\mathrm{diff}},s,\mu} = \bigoplus_{s=0}^{T_{\mathrm{diff}}}\bigoplus_{\mu=-s}^{s}\bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda/\lambda\rfloor}}\mathbb{C}^{d_V} = \bigoplus_{\gamma\in\lambda Z_{\lfloor\Lambda/\lambda\rfloor}}\bigoplus_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}} W_{\mathrm{mult},0,\gamma,\mu},$$

where

$$W_{\mathrm{mult},0,\gamma,\mu} = \bigoplus_{s=|\mu|}^{T_{\mathrm{diff}}}\mathbb{C}^{d_V}.$$

The multiplication layers $L_{\mathrm{mult},t}$ act separately and identically at each $\gamma\in\lambda\mathbb{Z}^2_{\lfloor\Lambda/\lambda\rfloor}$, i.e., without loss of generality these layers can be thought of as maps
$$L_{\mathrm{mult},t}: \oplus_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}} W_{\mathrm{mult},t-1,\gamma=0,\mu} \longrightarrow \oplus_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}} W_{\mathrm{mult},t,\gamma=0,\mu}.$$
To define $L_{\mathrm{mult},t}$, let us represent its input $\Phi\in\oplus_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}} W_{\mathrm{mult},t-1,\gamma=0,\mu}$ as
$$\Phi = \sum_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}}\Phi_\mu = \sum_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}}\sum_{n=1}^{d_{\mathrm{mult}}}\Phi_{\mu,n}e_{\mu,n},$$
where $e_{\mu,n}$ denote the basis vectors in $W_{\mathrm{mult},t-1,\gamma=0,\mu}$. We represent the output $\Psi\in\oplus_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}} W_{\mathrm{mult},t,\gamma=0,\mu}$ of $L_{\mathrm{mult},t}$ in the same way:
$$\Psi = \sum_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}}\Psi_\mu = \sum_{\mu=-T_{\mathrm{diff}}}^{T_{\mathrm{diff}}}\sum_{n=1}^{d_{\mathrm{mult}}}\Psi_{\mu,n}e_{\mu,n}.$$

Then, based on Eq.(4.26), for $t < T_{\mathrm{mult}}$ we define $L_{\mathrm{mult},t}\Phi = \Psi$ by
$$\Psi_{\mu,n} = w_{0,n}^{(t)}\mathbf{1}_{\mu=0} + \sum_{n_1=1}^{d_{\mathrm{mult}}} w_{1,\mu,n,n_1}^{(t)}\Phi_{\mu,n_1} + \sum_{\substack{-T_{\mathrm{diff}}\le\mu_1,\mu_2\le T_{\mathrm{diff}}\\ \mu_1+\mu_2=\mu}}\sum_{n_1=1}^{d_{\mathrm{mult}}}\sum_{n_2=1}^{d_{\mathrm{mult}}} w_{2,\mu_1,\mu_2,n,n_1,n_2}^{(t)}\Phi_{\mu_1,n_1}\Phi_{\mu_2,n_2}, \qquad (4.36)$$
with some complex weights $w_{0,n}^{(t)}, w_{1,\mu,n,n_1}^{(t)}, w_{2,\mu_1,\mu_2,n,n_1,n_2}^{(t)}$. In the final layer $t = T_{\mathrm{mult}}$ the network only needs to generate a real charge-0 (invariant) vector, so in this case $\Psi$ only has real $\mu = 0$ components:
$$\Psi_{0,n} = \mathrm{Re}\Big(w_{0,n}^{(t)} + \sum_{n_1=1}^{d_{\mathrm{mult}}} w_{1,0,n,n_1}^{(t)}\Phi_{0,n_1} + \sum_{\substack{-T_{\mathrm{diff}}\le\mu_1,\mu_2\le T_{\mathrm{diff}}\\ \mu_1+\mu_2=0}}\sum_{n_1=1}^{d_{\mathrm{mult}}}\sum_{n_2=1}^{d_{\mathrm{mult}}} w_{2,\mu_1,\mu_2,n,n_1,n_2}^{(t)}\Phi_{\mu_1,n_1}\Phi_{\mu_2,n_2}\Big). \qquad (4.37)$$
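A single charge-conserving multiplication layer (4.36) could be written, for one grid point, roughly as follows (a sketch with hypothetical weight containers w0, w1, w2; the real, charge-0 variant (4.37) differs only by restricting to $\mu=0$ and taking real parts).

```python
def mult_step(phi, w0, w1, w2, T):
    # phi: dict mu -> complex vector of shape (d_mult,), for mu in -T..T.
    # w0: (d_mult,) vector; w1[mu]: (d_mult, d_mult) matrix;
    # w2[(mu1, mu2)]: (d_mult, d_mult, d_mult) tensor, as in Eq. (4.36).
    psi = {}
    for mu in range(-T, T + 1):
        acc = (w0 if mu == 0 else 0) + w1[mu] @ phi[mu]
        for mu1 in range(-T, T + 1):
            mu2 = mu - mu1  # charge conservation: only mu1 + mu2 = mu contributes
            if -T <= mu2 <= T:
                acc = acc + np.einsum('nij,i,j->n', w2[(mu1, mu2)],
                                      phi[mu1], phi[mu2])
        psi[mu] = acc
    return psi
```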

This completes the description of the charge-conserving convnet. In the sequel, it will be convenient to consider a family of convnets having all parameters and weights in common except for the grid spacing $\lambda$. Observe that this parameter can be varied independently of all other parameters and weights ($\Lambda, d_{\mathrm{mult}}, T_{\mathrm{diff}}, T_{\mathrm{mult}}, w_{0,n}^{(t)}, w_{1,\mu,n,n_1}^{(t)}, w_{2,\mu_1,\mu_2,n,n_1,n_2}^{(t)}$). The parameter $\lambda$ affects the number of smoothing layers, and decreasing this parameter means that essentially the same convnet is applied at a higher resolution. Accordingly, we will call such a family a “multi-resolution convnet”.

Definition 4.1. A charge-conserving convnet is a map $\hat f: V\to U$ given in (4.31), characterized by the parameters $\lambda, \Lambda, d_{\mathrm{mult}}, T_{\mathrm{diff}}, T_{\mathrm{mult}}$ and the weights $w_{0,n}^{(t)}, w_{1,\mu,n,n_1}^{(t)}, w_{2,\mu_1,\mu_2,n,n_1,n_2}^{(t)}$, and constructed as described above. A multi-resolution charge-conserving convnet $\hat f_\lambda$ is obtained by arbitrarily varying the grid spacing parameter $\lambda$ in the charge-conserving convnet $\hat f$.

We comment now on why it is natural to call this model “charge-conserving”. As already explained in Section 4.1.2, if the intermediate spaces labeled by specific $\mu$'s are equipped with the special representations (4.6), then, up to the spatial cutoff, the differentiation layers $L_{\mathrm{diff}}$ are SE(2)-equivariant and conserve the “total charge” $\mu+\eta$, where $\eta$ is the “local charge” (see Eq.(4.11)). Clearly, the same can be said about the smoothing layers $L_{\mathrm{smooth}}$ which, in fact, separately conserve the global charge $\mu$ and the local charge $\eta$. Moreover, observe that the multiplication layers $L_{\mathrm{mult}}$, though nonlinear, are also equivariant and separately conserve the charges $\mu$ and $\eta$. Indeed, consider the transformations (4.36),(4.37). The first term in these transformations creates an SE(2)-invariant, $\mu=\eta=0$ signal. The second, linear term does not change $\mu$ or $\eta$ of the input signal. The third term creates products $\Psi_\mu = \Phi_{\mu_1}\Phi_{\mu_2}$, where $\mu = \mu_1+\mu_2$. This multiplication operation is equivariant with respect to the respective representations $R^{(\mu)}, R^{(\mu_1)}, R^{(\mu_2)}$ as defined in (4.6). Also, if the signals $\Phi_{\mu_1}, \Phi_{\mu_2}$ have local charges $\eta_1, \eta_2$ at a particular point $x$, then the product $\Phi_{\mu_1}\Phi_{\mu_2}$ has local charge $\eta = \eta_1+\eta_2$ at this point (see Eqs.(4.9),(4.10)).

4.3 The main result


To state our main result, we define a limit point of charge-conserving convnets.

Definition 4.2. With $V = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$ and $U = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$, we say that a map $f: V\to U$ is a limit point of charge-conserving convnets if for any compact set $K\subset V$, any $\epsilon>0$ and $\Lambda_0>0$ there exists a multi-resolution charge-conserving convnet $\hat f_\lambda$ with $\Lambda>\Lambda_0$ such that $\sup_{\Phi\in K}\|\hat f_\lambda(\Phi)-f(\Phi)\|\le\epsilon$ for all sufficiently small grid spacings $\lambda$.

Then our main result is the following theorem.

Theorem 4.1. Let $V = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})$ and $U = L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$. A map $f: V\to U$ is a limit point of charge-conserving convnets if and only if $f$ is SE(2)-equivariant and continuous in the norm topology.

Proof. To simplify the exposition, we will assume that dV = dU = 1; generalization of all
the arguments to vector-valued input and output signals is straightforward.
We start by observing that a multi-resolution family of charge-conserving convnets has a natural scaling limit as the lattice spacing $\lambda\to 0$:
$$\hat f_0(\Phi) = \lim_{\lambda\to 0}\hat f_\lambda(\Phi). \qquad (4.38)$$
Indeed, by (4.31), at $\lambda>0$ we can represent the convnet as the composition of maps
$$\hat f_\lambda = L_{\mathrm{mult}}\circ L_{\mathrm{diff}}\circ L_{\mathrm{smooth}}\circ P_{\lambda,\Lambda'}.$$


The part $L_{\mathrm{diff}}\circ L_{\mathrm{smooth}}\circ P_{\lambda,\Lambda'}$ of this computation implements several maps $L_\lambda^{(a,b)}$ introduced in (4.19). More precisely, by the definition of the differentiation layers in Section 4.2, the output space $W_{\mathrm{diff}}$ of the linear operator $L_{\mathrm{diff}}\circ L_{\mathrm{smooth}}\circ P_{\lambda,\Lambda'}$ can be decomposed into the direct sum (4.32) over several degrees $s$ and charges $\mu$. The respective components of $L_{\mathrm{diff}}\circ L_{\mathrm{smooth}}\circ P_{\lambda,\Lambda'}$ are, up to unimportant combinatorial coefficients, just the operators $L_\lambda^{(a,b)}$ with $a+b=s$, $a-b=\mu$:
$$L_{\mathrm{diff}}\circ L_{\mathrm{smooth}}\circ P_{\lambda,\Lambda'} = (\ldots, c_{a,b}L_\lambda^{(a,b)},\ldots), \qquad c_{a,b} = \tfrac{T_{\mathrm{diff}}!}{a!\,b!\,(T_{\mathrm{diff}}-a-b)!}, \qquad (4.39)$$
with the caveat that the output of $L_\lambda^{(a,b)}$ is spatially restricted to the bounded domain $[-\Lambda,\Lambda]^2$. By Lemma 4.1, as $\lambda\to 0$, the operators $L_\lambda^{(a,b)}$ converge to the operator $L_0^{(a,b)}$ defined in Eq.(4.21), so that for any $\Phi\in L^2(\mathbb{R}^\nu)$ the signals $L_\lambda^{(a,b)}\Phi$ are bounded functions on $\mathbb{R}^\nu$ and converge to $L_0^{(a,b)}\Phi$ in the uniform norm $\|\cdot\|_\infty$. Let us denote the limiting linear operator by $L_{\mathrm{conv}}$:
$$L_{\mathrm{conv}} = \lim_{\lambda\to 0} L_{\mathrm{diff}}\circ L_{\mathrm{smooth}}\circ P_{\lambda,\Lambda'}. \qquad (4.40)$$

The full limiting map $\hat f_0(\Phi)$ is then obtained by pointwise application (separately at each point $x\in[-\Lambda,\Lambda]^2$) of the multiplication layers $L_{\mathrm{mult}}$ to the signals $L_{\mathrm{conv}}\Phi$:
$$\hat f_0(\Phi) = L_{\mathrm{mult}}(L_{\mathrm{conv}}\Phi). \qquad (4.41)$$
For any $\Phi\in L^2(\mathbb{R}^\nu)$, this $\hat f_0(\Phi)$ is a well-defined bounded signal on the domain $[-\Lambda,\Lambda]^2$. It is bounded because the multiplication layers $L_{\mathrm{mult}}$ implement a continuous (polynomial) map, and because, as already mentioned, $L_{\mathrm{conv}}\Phi$ is a bounded signal. Since the domain $[-\Lambda,\Lambda]^2$ has a finite Lebesgue measure, we have $\hat f_0(\Phi)\in L^\infty([-\Lambda,\Lambda]^2)\subset L^2([-\Lambda,\Lambda]^2)$. By a similar argument, the convergence in (4.38) can be understood in the $L^\infty([-\Lambda,\Lambda]^2)$ or $L^2([-\Lambda,\Lambda]^2)$ sense, e.g.:
$$\|\hat f_0(\Phi)-\hat f_\lambda(\Phi)\|_{L^2([-\Lambda,\Lambda]^2)}\xrightarrow{\lambda\to 0}0, \qquad \Phi\in V. \qquad (4.42)$$
Below, we will use the scaling limit $\hat f_0$ as an intermediate approximator.
We will now prove the necessity and then the sufficiency parts of the theorem.

Necessity (a limit point $f$ is continuous and SE(2)-equivariant). As in the previous Theorems 3.1, 3.2, continuity of $f$ follows by standard topological arguments, and we only need to prove the SE(2)-equivariance.
Let us first prove the $\mathbb{R}^2$-equivariance of $f$. By the definition of a limit point, for any $\Phi\in V$, $x\in\mathbb{R}^2$, $\epsilon>0$ and $\Lambda_0>0$ there is a multi-resolution convnet $\hat f_\lambda$ with $\Lambda>\Lambda_0$ such that
$$\|\hat f_\lambda(\Phi)-f(\Phi)\|\le\epsilon, \qquad \|\hat f_\lambda(R_{(x,1)}\Phi)-f(R_{(x,1)}\Phi)\|\le\epsilon \qquad (4.43)$$
for all sufficiently small $\lambda$. Consider the scaling limit $\hat f_0 = \lim_{\lambda\to 0}\hat f_\lambda$ constructed above. As shown above, $\hat f_\lambda(\Phi)$ converges to $\hat f_0(\Phi)$ in the $L^2$ sense, so the inequalities (4.43) remain valid for $\hat f_0(\Phi)$:
$$\|\hat f_0(\Phi)-f(\Phi)\|\le\epsilon, \qquad \|\hat f_0(R_{(x,1)}\Phi)-f(R_{(x,1)}\Phi)\|\le\epsilon. \qquad (4.44)$$

The map $\hat f_0$ is not $\mathbb{R}^2$-equivariant only because its output is restricted to the domain $[-\Lambda,\Lambda]^2$, since otherwise both maps $L_{\mathrm{mult}}, L_{\mathrm{conv}}$ appearing in the superposition (4.41) are $\mathbb{R}^2$-equivariant. Therefore, for any $y\in\mathbb{R}^2$,
$$\hat f_0(R_{(x,1)}\Phi)(y) = R_{(x,1)}\hat f_0(\Phi)(y) = \hat f_0(\Phi)(y-x), \qquad \text{if } y, y-x\in[-\Lambda,\Lambda]^2. \qquad (4.45)$$
Consider the set
$$\Pi_{\Lambda,x} = \{y\in\mathbb{R}^2: y, y-x\in[-\Lambda,\Lambda]^2\} = [-\Lambda,\Lambda]^2\cap R_{(x,1)}([-\Lambda,\Lambda]^2).$$
The identity (4.45) implies that
$$P_{\Pi_{\Lambda,x}}\hat f_0(R_{(x,1)}\Phi) = P_{\Pi_{\Lambda,x}}R_{(x,1)}\hat f_0(\Phi), \qquad (4.46)$$

where $P_{\Pi_{\Lambda,x}}$ denotes the projection to the subspace $L^2(\Pi_{\Lambda,x})$ in $L^2(\mathbb{R}^2)$. For a fixed $x$, the projectors $P_{\Pi_{\Lambda,x}}$ converge strongly to the identity as $\Lambda\to\infty$, therefore we can choose $\Lambda$ sufficiently large so that
$$\|P_{\Pi_{\Lambda,x}}f(\Phi)-f(\Phi)\|\le\epsilon, \qquad \|P_{\Pi_{\Lambda,x}}f(R_{(x,1)}\Phi)-f(R_{(x,1)}\Phi)\|\le\epsilon. \qquad (4.47)$$
Then, assuming that the approximating convnet has a sufficiently large range $\Lambda$, we have
$$\begin{aligned}\|f(R_{(x,1)}\Phi)-R_{(x,1)}f(\Phi)\| &\le \|f(R_{(x,1)}\Phi)-P_{\Pi_{\Lambda,x}}f(R_{(x,1)}\Phi)\|\\ &\quad+\|P_{\Pi_{\Lambda,x}}f(R_{(x,1)}\Phi)-P_{\Pi_{\Lambda,x}}\hat f_0(R_{(x,1)}\Phi)\|\\ &\quad+\|P_{\Pi_{\Lambda,x}}\hat f_0(R_{(x,1)}\Phi)-P_{\Pi_{\Lambda,x}}R_{(x,1)}\hat f_0(\Phi)\|\\ &\quad+\|P_{\Pi_{\Lambda,x}}R_{(x,1)}\hat f_0(\Phi)-P_{\Pi_{\Lambda,x}}R_{(x,1)}f(\Phi)\|\\ &\quad+\|P_{\Pi_{\Lambda,x}}R_{(x,1)}f(\Phi)-R_{(x,1)}f(\Phi)\|\\ &\le 4\epsilon,\end{aligned}$$
where we used the bounds (4.44), (4.47), the equalities $\|P_{\Pi_{\Lambda,x}}\| = \|R_{(x,1)}\| = 1$, and the identity (4.46). Taking the limit $\epsilon\to 0$, we obtain the desired $\mathbb{R}^2$-equivariance of $f$.

To complete the proof of SE(2)-equivariance, we will show that for any $\theta\in SO(2)$ we have
$$R_{(0,\theta)}\hat f_0(\Phi)(x) = \hat f_0(R_{(0,\theta)}\Phi)(x), \qquad x\in\Pi_{\Lambda,\theta}, \qquad (4.48)$$
where
$$\Pi_{\Lambda,\theta} = [-\Lambda,\Lambda]^2\cap R_{(0,\theta)}([-\Lambda,\Lambda]^2).$$
Identity (4.48) is an analog of the identity (4.45) that we used to prove the $R_{(x,1)}$-equivariance of $f$. Once Eq.(4.48) is established, we can prove the $R_{(0,\theta)}$-equivariance of $f$ by arguing in the same way as we did above to prove the $R_{(x,1)}$-equivariance. After that, the $R_{(0,\theta)}$-equivariance and the $R_{(x,1)}$-equivariance together imply the full SE(2)-equivariance.
Note that by using the partial translation equivariance (4.45) and repeating the computation from Lemma 4.2, it suffices to prove the identity (4.48) only in the special case $x=0$:
$$\hat f_0(R_{(0,\theta)}\Phi)(0) = \hat f_0(\Phi)(0). \qquad (4.49)$$
Indeed, suppose that Eq.(4.49) is established and $\Lambda$ is sufficiently large so that $x, \theta^{-1}x\in[-\Lambda,\Lambda]^2$. Then,
$$\begin{aligned}\hat f_0(R_{(0,\theta)}\Phi)(x) &= R_{(-x,1)}\hat f_0(R_{(0,\theta)}\Phi)(0)\\ &= \hat f_0(R_{(-x,1)}R_{(0,\theta)}\Phi)(0)\\ &= \hat f_0(R_{(0,\theta)}R_{(-\theta^{-1}x,1)}\Phi)(0)\\ &= \hat f_0(R_{(-\theta^{-1}x,1)}\Phi)(0)\\ &= R_{(-\theta^{-1}x,1)}\hat f_0(\Phi)(0)\\ &= R_{(x,\theta)}R_{(-\theta^{-1}x,1)}\hat f_0(\Phi)(A_{(x,\theta)}0)\\ &= R_{(0,\theta)}\hat f_0(\Phi)(x),\end{aligned}$$
where we used general properties of the representation $R$ (steps 1, 6, 7), Eq.(4.49) (step 4), and the partial $\mathbb{R}^2$-equivariance (4.45) (steps 2 and 5, using the fact that $0, x, \theta^{-1}x\in[-\Lambda,\Lambda]^2$).
To establish Eq.(4.49), recall that, by Eq.(4.41), the value $\hat f_0(\Phi)(0)$ is obtained by first evaluating $L_{\mathrm{conv}}(\Phi)$ at $x=0$ and then applying to the resulting values the map $L_{\mathrm{mult}}$. By Eqs.(4.39),(4.40) and Lemma 4.1, we can write $L_{\mathrm{conv}}(\Phi)(0)$ as a vector with components
$$L_{\mathrm{conv}}(\Phi)(0) = (\ldots, c_{a,b}L_0^{(a,b)}\Phi(0),\ldots), \qquad (4.50)$$
where, by Eq.(4.21),
$$L_0^{(a,b)}\Phi(0) = \int_{\mathbb{R}^2}\Phi(-y)\Psi_{a,b}(y)\,d^2y,$$
and $\Psi_{a,b}$ is given by Eq.(4.20):
$$\Psi_{a,b} = \tfrac{1}{2\pi}\,\partial_z^a\partial_{\bar z}^b\, e^{-|x|^2/2}.$$
In the language of Section 4.1.2, $\Psi_{a,b}$ has local charge $\eta = b-a$:
$$\Psi_{a,b}(A_{(0,e^{-i\phi})}x) = e^{i(a-b)\phi}\Psi_{a,b}(x).$$

It follows that
$$\begin{aligned}L_0^{(a,b)}(R_{(0,e^{i\phi})}\Phi)(0) &= \int_{\mathbb{R}^2}R_{(0,e^{i\phi})}\Phi(-y)\Psi_{a,b}(y)\,d^2y\\ &= \int_{\mathbb{R}^2}\Phi(A_{(0,e^{-i\phi})}(-y))\Psi_{a,b}(y)\,d^2y\\ &= \int_{\mathbb{R}^2}\Phi(-y)\Psi_{a,b}(A_{(0,e^{i\phi})}y)\,d^2y\\ &= \int_{\mathbb{R}^2}\Phi(-y)e^{i(b-a)\phi}\Psi_{a,b}(y)\,d^2y\\ &= e^{i(b-a)\phi}L_0^{(a,b)}(\Phi)(0),\end{aligned}$$
i.e., $L_0^{(a,b)}(\Phi)(0)$ transforms under rotations $e^{i\phi}\in SO(2)$ as a character (4.22) with $\xi = b-a$.
Now consider the map $L_{\mathrm{mult}}$. Since each component in the decomposition (4.50) transforms as a character with $\xi = b-a$, the construction of $L_{\mathrm{mult}}$ in Section 4.2 (based on the procedure of generating invariant polynomials described in Section 4.1.4) guarantees that $L_{\mathrm{mult}}$ computes a function invariant with respect to SO(2), thus proving Eq.(4.49):
$$\hat f_0(R_{(0,\theta)}\Phi)(0) = L_{\mathrm{mult}}(L_{\mathrm{conv}}(R_{(0,\theta)}\Phi)(0)) = L_{\mathrm{mult}}(L_{\mathrm{conv}}(\Phi)(0)) = \hat f_0(\Phi)(0).$$
This completes the proof of the necessity part.

Sufficiency (a continuous SE(2)-equivariant map $f: V\to U$ can be approximated by charge-conserving convnets).
Given a continuous SE(2)-equivariant $f: V\to U$, a compact set $K\subset V$ and positive numbers $\epsilon, \Lambda_0$, we need to construct a multi-resolution charge-conserving convnet $\hat f = (\hat f_\lambda)$ with $\Lambda>\Lambda_0$ and the property $\sup_{\Phi\in K}\|\hat f_\lambda(\Phi)-f(\Phi)\|\le\epsilon$ for all sufficiently small $\lambda$. We construct the desired convnet by performing a series of reductions of this approximation problem.
1. Smoothing. For any $\epsilon_1>0$, consider the smoothed map $\tilde f_{\epsilon_1}: V\to U$ defined by
$$\tilde f_{\epsilon_1}(\Phi) = f(\Phi)*g_{\epsilon_1}, \qquad (4.51)$$
where
$$g_{\epsilon_1}(x) = \frac{1}{2\pi\epsilon_1}e^{-|x|^2/(2\epsilon_1)}.$$
The map $\tilde f_{\epsilon_1}$ is continuous and SE(2)-equivariant, as a composition of two continuous and SE(2)-equivariant maps. We can choose $\epsilon_1$ small enough so that for all $\Phi\in K$
$$\|\tilde f_{\epsilon_1}(\Phi)-f(\Phi)\|\le\frac{\epsilon}{10}. \qquad (4.52)$$
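On a discrete grid, the mollification (4.51) could be approximated as below (a sketch; the grid layout, the use of scipy.signal.fftconvolve and the boundary handling are illustrative choices, not part of the paper's construction).

```python
from scipy.signal import fftconvolve

def mollify(f_phi, eps1, lam):
    # Approximate f(Phi) * g_eps1 on a lam-spaced grid; the factor lam**2
    # accounts for the Lebesgue measure of one grid cell in the integral.
    n = f_phi.shape[0]
    x = lam * (np.arange(n) - n // 2)
    X, Y = np.meshgrid(x, x, indexing='ij')
    g = np.exp(-(X**2 + Y**2) / (2 * eps1)) / (2 * np.pi * eps1)
    return lam**2 * fftconvolve(f_phi, g, mode='same')
```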
The problem of approximating $f$ then reduces to the problem of approximating maps $\tilde f_{\epsilon_1}$ of the form (4.51).
2. Spatial cutoff. We can choose $\Lambda$ sufficiently large so that for all $\Phi\in K$
$$\|P_\Lambda\tilde f_{\epsilon_1}(\Phi)-\tilde f_{\epsilon_1}(\Phi)\|<\frac{\epsilon}{10}. \qquad (4.53)$$
We can do this because $\tilde f_{\epsilon_1}(K)$ is compact, as an image of a compact set under a continuous map, and because $P_\Lambda$ converge strongly to the identity as $\Lambda\to+\infty$. Thus, we only need to approximate the output signals $\tilde f_{\epsilon_1}(\Phi)$ on the domain $[-\Lambda,\Lambda]^2$.
3. Output localization. Define the map $\tilde f_{\epsilon_1,\mathrm{loc}}: V\to\mathbb{R}$ by
$$\tilde f_{\epsilon_1,\mathrm{loc}}(\Phi) = \tilde f_{\epsilon_1}(\Phi)(0) = \langle g_{\epsilon_1}, f(\Phi)\rangle_{L^2(\mathbb{R}^\nu)}. \qquad (4.54)$$
Since both $g_{\epsilon_1}, f(\Phi)\in L^2(\mathbb{R}^\nu)$, the map $\tilde f_{\epsilon_1,\mathrm{loc}}$ is well-defined, and it is continuous since $f$ is continuous.
By the translation equivariance of $f$ and hence of $\tilde f_{\epsilon_1}$, the map $\tilde f_{\epsilon_1}$ can be recovered from $\tilde f_{\epsilon_1,\mathrm{loc}}$ by
$$\tilde f_{\epsilon_1}(\Phi)(x) = \tilde f_{\epsilon_1}(R_{(-x,1)}\Phi)(0) = \tilde f_{\epsilon_1,\mathrm{loc}}(R_{(-x,1)}\Phi). \qquad (4.55)$$
By the SO(2)-equivariance of $\tilde f_{\epsilon_1}$, the map $\tilde f_{\epsilon_1,\mathrm{loc}}$ is SO(2)-invariant.
4. Nested finite-dimensional SO(2)-modules $V_\zeta$. For any nonnegative integers $a, b$ consider again the signal $\Psi_{a,b}$ introduced in Eq.(4.20). For any $\zeta = 1, 2, \ldots$, consider the subspace $V_\zeta\subset V$ spanned by the vectors $\mathrm{Re}(\Psi_{a,b})$ and $\mathrm{Im}(\Psi_{a,b})$ with $a+b\le\zeta$. These vectors form a total system in $V$ if $a, b$ take arbitrary nonnegative integer values. Accordingly, if we denote by $P_{V_\zeta}$ the orthogonal projection to $V_\zeta$ in $V$, then the operators $P_{V_\zeta}$ converge strongly to the identity as $\zeta\to\infty$.
The subspace $V_\zeta$ is a real finite-dimensional SO(2)-module. As discussed in Subsection 4.1.4, it is convenient to think of such modules as consisting of complex conjugate irreducible representations under the constraint (4.29). The complex extension of the real module $V_\zeta$ is spanned by the signals $\{\Psi_{a,b}\}_{a+b\le\zeta}$, so that $\Psi_{a,b}$ and $\Psi_{b,a}$ form a complex conjugate pair for $a\ne b$ (if $a=b$, then $\Psi_{a,b}$ is real). The natural representation (4.1) of SO(2) transforms the signal $\Psi_{a,b}$ as a character (4.22) with $\xi = a-b$ (in the language of Section 4.1.2, $\Psi_{a,b}$ has local charge $\eta = b-a$ w.r.t. $x=0$):
$$R_{(0,e^{i\phi})}\Psi_{a,b}(x) = \Psi_{a,b}(A_{(0,e^{-i\phi})}x) = e^{i(a-b)\phi}\Psi_{a,b}(x). \qquad (4.56)$$
The action of SO(2) on the real signals $\mathrm{Re}(\Psi_{a,b})$ and $\mathrm{Im}(\Psi_{a,b})$ can be related to its action on $\Psi_{a,b}$ and $\Psi_{b,a}$ as in Eqs.(4.27),(4.28).
5. Restriction to $V_\zeta$. Let $\tilde f_{\epsilon_1,\mathrm{loc},\zeta}: V_\zeta\to\mathbb{R}$ be the restriction of the map $\tilde f_{\epsilon_1,\mathrm{loc}}$ defined in Eq.(4.54) to the subspace $V_\zeta$:
$$\tilde f_{\epsilon_1,\mathrm{loc},\zeta} = \tilde f_{\epsilon_1,\mathrm{loc}}|_{V_\zeta}. \qquad (4.57)$$
Consider the map $\tilde f_{\epsilon_1,\zeta}: V\to U$ defined by projecting to $V_\zeta$ and translating the map $\tilde f_{\epsilon_1,\mathrm{loc},\zeta}$ to points $x\in[-\Lambda,\Lambda]^2$ as in the reconstruction formula (4.55):
$$\tilde f_{\epsilon_1,\zeta}(\Phi)(x) = \begin{cases}\tilde f_{\epsilon_1,\mathrm{loc},\zeta}(P_{V_\zeta}R_{(-x,1)}\Phi), & x\in[-\Lambda,\Lambda]^2,\\ 0, & \text{otherwise}.\end{cases} \qquad (4.58)$$

We claim that if $\zeta$ is sufficiently large then for all $\Phi\in K$
$$\|\tilde f_{\epsilon_1,\zeta}(\Phi)-P_\Lambda\tilde f_{\epsilon_1}(\Phi)\|<\frac{\epsilon}{10}. \qquad (4.59)$$
Indeed,
$$\|\tilde f_{\epsilon_1,\zeta}(\Phi)-P_\Lambda\tilde f_{\epsilon_1}(\Phi)\|\le 2\Lambda\sup_{\Phi_1\in K_1}|\tilde f_{\epsilon_1,\mathrm{loc}}(P_{V_\zeta}\Phi_1)-\tilde f_{\epsilon_1,\mathrm{loc}}(\Phi_1)|, \qquad (4.60)$$
where
$$K_1 = \{R_{(-x,1)}\Phi\,|\,(x,\Phi)\in[-\Lambda,\Lambda]^2\times K\}\subset V. \qquad (4.61)$$
The set $K_1$ is compact, by compactness of $K$ and strong continuity of $R$. Then, by compactness of $K_1$, the strong convergence $P_{V_\zeta}\Phi_1\xrightarrow{\zeta\to\infty}\Phi_1$ and the continuity of $\tilde f_{\epsilon_1,\mathrm{loc}}$, the r.h.s. of (4.60) becomes arbitrarily small as $\zeta\to\infty$.
It follows from (4.59) that the problem of approximating $f$ reduces to approximating the map $\tilde f_{\epsilon_1,\zeta}$ for a fixed finite $\zeta$.
6. Polynomial approximation. The map $\tilde f_{\epsilon_1,\mathrm{loc},\zeta}: V_\zeta\to\mathbb{R}$ defined in (4.57) is a continuous SO(2)-invariant map on the SO(2)-module $V_\zeta$. By Lemma 4.2, such a map can be approximated by invariant polynomials. Let $K_1\subset V$ be the compact set defined in Eq.(4.61). Note that $P_{V_\zeta}K_1$ is then a compact subset of $V_\zeta$. Let $\hat f_{\mathrm{loc}}: V_\zeta\to\mathbb{R}$ be an SO(2)-invariant polynomial such that for all $\Phi_2\in P_{V_\zeta}K_1$
$$|\hat f_{\mathrm{loc}}(\Phi_2)-\tilde f_{\epsilon_1,\mathrm{loc},\zeta}(\Phi_2)|\le\frac{\epsilon}{10\cdot 2\Lambda}. \qquad (4.62)$$
Consider now the map $\hat f_0: V\to U$ defined by
$$\hat f_0(\Phi)(x) = \begin{cases}\hat f_{\mathrm{loc}}(P_{V_\zeta}R_{(-x,1)}\Phi), & x\in[-\Lambda,\Lambda]^2,\\ 0, & \text{otherwise}.\end{cases} \qquad (4.63)$$
Using Eqs.(4.58) and (4.62), we have for all $x\in[-\Lambda,\Lambda]^2$ and $\Phi\in K$
$$|\hat f_0(\Phi)(x)-\tilde f_{\epsilon_1,\zeta}(\Phi)(x)|\le\frac{\epsilon}{10\cdot 2\Lambda}$$
and hence for all $\Phi\in K$
$$\|\hat f_0(\Phi)-\tilde f_{\epsilon_1,\zeta}(\Phi)\|<\frac{\epsilon}{10}. \qquad (4.64)$$

7. Identification with a convnet at $\lambda = 0$. We show now that the map $\hat f_0$ given in (4.63) can be written as the scaling limit ($\lambda\to 0$) of a multi-resolution charge-conserving convnet.

First note that the projector $P_{V_\zeta}$ can be written as
$$P_{V_\zeta}\Phi = \sum_{a,b:\,a+b\le\zeta}\langle\Psi'_{a,b},\Phi\rangle\Psi_{a,b},$$
where $\Psi'_{a,b}$ is the basis in $V_\zeta$ dual to the basis $\Psi_{a,b}$. Let $V_{\zeta,\xi}$ denote the isotypic component in $V_\zeta$ spanned by the vectors $\Psi_{a,b}$ with $a-b=\xi$. By Eq.(4.56), this notation is consistent with the notation of Section 4.1.4 where the number $\xi$ is used to specify the characters (4.22). By the unitarity of the representation $R$, different isotypic components are mutually orthogonal, so $\Psi'_{a,b}\in V_{\zeta,a-b}$ and we can expand
$$\Psi'_{a,b} = \sum_{\substack{0\le a',b'\le\zeta\\ a'-b'=a-b}}c_{a,b,a',b'}\Psi_{a',b'}$$
with some coefficients $c_{a,b,a',b'}$. Then we can write
$$\begin{aligned}P_{V_\zeta}R_{(-x,1)}\Phi &= \sum_{a,b:\,a+b\le\zeta}\langle\Psi'_{a,b},R_{(-x,1)}\Phi\rangle\Psi_{a,b}\\ &= \sum_{\xi=-\zeta}^{\zeta}\sum_{\substack{a,b:\,a+b\le\zeta\\ a-b=\xi}}\sum_{\substack{a',b':\,a'+b'\le\zeta\\ a'-b'=\xi}}c_{a,b,a',b'}\langle\Psi_{a',b'},R_{(-x,1)}\Phi\rangle\Psi_{a,b}\\ &= \sum_{\xi=-\zeta}^{\zeta}\sum_{\substack{a,b:\,a+b\le\zeta\\ a-b=\xi}}\sum_{\substack{a',b':\,a'+b'\le\zeta\\ a'-b'=\xi}}c_{a,b,a',b'}\Big(\int_{\mathbb{R}^2}\Phi(x+y)\overline{\Psi_{a',b'}(y)}\,d^2y\Big)\Psi_{a,b}\\ &= \sum_{\xi=-\zeta}^{\zeta}\sum_{\substack{a,b:\,a+b\le\zeta\\ a-b=\xi}}\sum_{\substack{a',b':\,a'+b'\le\zeta\\ a'-b'=\xi}}c_{a,b,a',b'}(-1)^{a'+b'}\Big(\int_{\mathbb{R}^2}\Phi(x-y)\Psi_{b',a'}(y)\,d^2y\Big)\Psi_{a,b}\\ &= \sum_{\xi=-\zeta}^{\zeta}\sum_{\substack{a,b:\,a+b\le\zeta\\ a-b=\xi}}\sum_{\substack{a',b':\,a'+b'\le\zeta\\ a'-b'=\xi}}c_{a,b,a',b'}(-1)^{a'+b'}\big(L_0^{(b',a')}\Phi(x)\big)\Psi_{a,b},\end{aligned} \qquad (4.65)$$

where in the last step we used the definition (4.21) of $L_0^{(a,b)}$.
We can now interpret the map $\hat f_0$ given by (4.63) as the $\lambda\to 0$ limit of a convnet of Section 4.2 in the following way.
First, by the above expansion, the part $P_{V_\zeta}R_{(-x,1)}$ of the map $\hat f_0$ computes various convolutions $L_0^{(b',a')}\Phi$ with $a'+b'\le\zeta$, $a'-b'=\xi$ — this corresponds to the $\lambda\to 0$ limit of the smoothing and differentiation layers of Section 4.2 with $T_{\mathrm{diff}} = \zeta$. The global charge parameter $\mu$ appearing in the decomposition (4.32) of the target spaces $W_{\mathrm{diff},t}$ of the differentiation layers corresponds to $-\xi(= b'-a')$ in the above formula, while the degree $s$ corresponds to $a'+b'$. The vectors $\Psi_{a,b}$ with $a-b=\xi$ over which we expand in (4.65) serve as a particular basis in the $\mu=-\xi$ component of $W_{\mathrm{diff},T_{\mathrm{diff}}}$.

Now, the invariant polynomial $\hat f_{\mathrm{loc}}$ appearing in (4.63) can be expressed as a polynomial in the variables associated with the isotypic components $V_{\zeta,\xi}$. These components are spanned by the vectors $\Psi_{a,b}$ with $a-b=\xi$. By Eq.(4.65), $\hat f_{\mathrm{loc}}(P_{V_\zeta}R_{(-x,1)}\Phi)$ can then be viewed as an invariant polynomial in the variables $L_0^{(b',a')}\Phi(x)$ that correspond to the isotypic components $V_{\zeta,\xi}$ with $\xi = a'-b'$. As shown in Section 4.1.4, this invariant polynomial can then be generated by the layerwise multiplication procedure (4.26) starting from the initial variables $L_0^{(b',a')}\Phi(x)$. This procedure is reproduced in the definition (4.36),(4.37) of the convnet multiplication layers. (The charge-conservation constraints are expressed in Eqs.(4.36),(4.37) in terms of $\mu$ rather than $\xi$, but $\mu=-\xi$, and the constraints are invariant with respect to changing the sign of all $\mu$'s.) Thus, if the number $T_{\mathrm{mult}}$ of multiplication layers and the dimensions $d_{\mathrm{mult}}$ of these layers are sufficiently large, then one can arrange the weights in these layers so as to exactly reproduce the map $\Phi\mapsto\hat f_{\mathrm{loc}}(P_{V_\zeta}R_{(-x,1)}\Phi)$.
8. Approximation by convnets with $\lambda>0$. It remains to show that the scaling limit $\hat f_0$ is approximated by the $\lambda>0$ convnets $\hat f_\lambda$ in the sense that if $\lambda$ is sufficiently small then for all $\Phi\in K$
$$\|\hat f_0(\Phi)-\hat f_\lambda(\Phi)\|<\frac{\epsilon}{10}. \qquad (4.66)$$
We have already shown earlier in Eq.(4.42) that for any $\Phi\in V$ the signals $\hat f_\lambda(\Phi)$ converge to $\hat f_0(\Phi)$ in the $L^2([-\Lambda,\Lambda]^2)$ sense. In fact, Lemma 4.1 implies that this convergence is uniform on any compact set $K\subset V$, which proves Eq.(4.66).
Summarizing all the above steps, we have constructed a multi-resolution charge-conserving convnet $\hat f_\lambda$ such that, by the inequalities (4.52),(4.53),(4.59),(4.64) and (4.66), we have $\sup_{\Phi\in K}\|\hat f_\lambda(\Phi)-f(\Phi)\|\le\epsilon$ for all sufficiently small $\lambda$. This completes the proof of the sufficiency part.

5 Discussion
We summarize and discuss the obtained results, and indicate potential directions of further
research.
In Section 2 we considered approximation of maps defined on finite-dimensional spaces
and described universal and exactly invariant/equivariant extensions of the usual shallow
neural network (Propositions 2.3, 2.4). These extensions are obtained by adding to the
network a special polynomial layer. This construction can be seen as an alternative to
the symmetrization of the network (similarly to how constructing symmetric polynomials
as functions of elementary symmetric polynomials is an alternative to symmetrizing non-
symmetric polynomials). A drawback (inherited from the theory of invariant polynomials)
of this construction is that it requires us to know appropriate sets of generating polynomial
invariants/equivariants, which is difficult in practice. This difficulty can be ameliorated using
polarization if the modules in question are decomposed into multiple copies of a few basic
modules (Propositions 2.5, 2.7), but this approach may still be too complicated in general for practical applications.

Nevertheless, in the case of the symmetric group SN we have derived an explicit com-
plete SN -invariant modification of the usual shallow neural network (Theorem 2.4). While
complete and exactly SN -invariant, this modification does not involve symmetrization over
SN . With its relatively small computational complexity, this modification thus presents a
viable alternative to the symmetrization-based approach.
One can expect that further progress in the design of invariant/equivariant models may be
achieved by using more advanced general constructions from the representation and invariant
theories. In particular, in Section 2 we have not considered products of representations,
but later in Section 4 we essentially use them in the abelian SO(2) setting when defining
the multiplication layers in the “charge-conserving convnet”.
In Section 3 we considered approximations of maps defined on the space V = L2 (Rν , RdV )
of dV -component signals on Rν . The crucial feature of this setting is the infinite-
dimensionality of the space V , which requires us to reconsider the notion of approxi-
mation. Inspired by classical finite-dimensional results Pinkus [1999], our approach in
Section 3 was to assume that a map f is defined on the whole L2 (Rν , RdV ) as a map
f : L2 (Rν , RdV ) → L2 (Rν , RdU ) or f : L2 (Rν , RdV ) → R, and consider its approximation
by finite models fb in a weak sense of comparison on compact subsets of V (see Defini-
tions 3.2 and 3.4). This approach has allowed us to prove reasonable universal approxi-
mation properties of standard convnets. Specifically, in Theorem 3.1 we prove that a map
f : L2 (Rν , RdV ) → L2 (Rν , RdU ) can be approximated by convnets without pooling if and
only if f is norm-continuous and Rν -equivariant. In Theorem 3.2 we prove that a map
f : L2 (Rν , RdV ) → R can be approximated by convnets with downsampling if and only if f
is norm-continuous.
In applications involving convnets (e.g., image recognition or segmentation), the approx-
imated maps f are considered only on small subsets of the full space V . Compact (or, more
generally, precompact) subsets have properties that seem to make them a reasonable general
abstraction for such subsets. In particular, a subset K ⊂ V is precompact if, for example, it
results from a continuous generative process involving finitely many bounded parameters; or
if K is a finite union of precompact subsets; or if for any  > 0 the set K can be covered by
finitely many -balls. From this perspective, it seems reasonable to consider restrictions of
maps f to compact sets, as we did in our weak notion of approximation in Section 3. At the
same time, it would be interesting to refine the notion of model convergence by considering
the structure of the sets K in more detail and relating it quantitatively to the approximation accuracy (in particular, paving the way to computing approximation rates).
In Section 4 we consider the task of constructing finite universal approximators for
maps f : L2 (R2 , RdV ) → L2 (R2 , RdU ) equivariant with respect to the group SE(2) of two-
dimensional rigid planar motions. We introduce a particular convnet-like model – “charge-
conserving convnet” – solving this task. We extend the topological framework of Section 3
to rigorously formulate the properties of equivariance and completeness to be proved. Our
main result, Theorem 4.1, shows that a map f : L2 (R2 , RdV ) → L2 (R2 , RdU ) can be ap-
proximated in the small-scale limit by finite charge-conserving convnets if and only if f is
norm-continuous and SE(2)-equivariant.

The construction of this convnet is based on splitting the feature space into isotypic
components characterized by a particular representation of the group SO(2) of proper 2D
rotations. The information flow in the model is constrained by what can be interpreted as
“charge conservation” (hence the name of the model). The model is essentially polynomial,
only including elementary arithmetic operations (+, −, ∗) arranged so as to satisfy these
constraints but otherwise achieve full expressivity.
While in Sections 3, 4 we have constructed intrinsically $\mathbb{R}^\nu$- and SO(2)-equivariant and complete approximators for maps $f: L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})\to L^2(\mathbb{R}^\nu,\mathbb{R}^{d_U})$, we have not been able to similarly construct intrinsically $\mathbb{R}^\nu$-invariant approximators for maps $f: L^2(\mathbb{R}^\nu,\mathbb{R}^{d_V})\to\mathbb{R}$. As noted in Section 3 and confirmed by Theorem 3.2, if we simply include pooling in
the convnet, it completely destroys the Rν -invariance in our continuum limit. It would be
interesting to further explore this issue.
The convnets considered in Section 3 have a rather conventional structure as sequences of
linear convolutional layers equipped with a nonlinear activation function [Goodfellow et al.,
2016]. In contrast, the charge-conserving convnets of Section 4 have a special and somewhat
artificial structure (three groups of layers of which the first two are linear and commuting;
no arbitrary nonlinearities). This structure was essential for our proof of the main Theorem
4.1, since these assumptions on the model allowed us to prove that the model is both SE(2)-
equivariant and complete. It would be interesting to extend this theorem to more general
approximation models.

Bibliography
Fabio Anselmi, Lorenzo Rosasco, and Tomaso Poggio. On invariance and selectivity in
representation learning. Information and Inference, 5(2):134–158, 2016.

Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE trans-
actions on pattern analysis and machine intelligence, 35(8):1872–1886, 2013.

Hans Burkhardt and S Siggelkow. Invariant features in pattern recognition–fundamentals


and applications. Nonlinear model-based image/video processing and analysis, pages 269–
307, 2001.

Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor
decompositions. In International Conference on Machine Learning, pages 955–963, 2016.

Nadav Cohen, Or Sharir, Yoav Levine, Ronen Tamari, David Yakira, and Amnon Shashua.
Analysis and design of convolutional networks via hierarchical tensor decompositions.
arXiv preprint arXiv:1705.02302, 2017.

Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of
The 33rd International Conference on Machine Learning, pages 2990–2999, 2016.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of
Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in
convolutional neural networks. arXiv preprint arXiv:1602.02660, 2016.

Carlos Esteves, Christine Allen-Blanchette, Xiaowei Zhou, and Kostas Daniilidis. Polar
transformer networks. arXiv preprint arXiv:1709.01889, 2017.

Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural


networks. Neural networks, 2(3):183–192, 1989.

Robert Gens and Pedro M Domingos. Deep symmetry networks. In Advances in neural
information processing systems, pages 2537–2545, 2014.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016.

João F Henriques and Andrea Vedaldi. Warped convolutions: Efficient invariance to spatial
transformations. arXiv preprint arXiv:1609.04382, 2016.

David Hilbert. Über die Theorie der algebraischen Formen. Mathematische annalen, 36(4):
473–534, 1890.

David Hilbert. Über die vollen Invariantensysteme. Mathematische Annalen, 42(3):313–373,


1893.

Kurt Hornik. Some new results on neural network approximation. Neural networks, 6(8):
1069–1072, 1993.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural networks, 2(5):359–366, 1989.

Hanspeter Kraft and Claudio Procesi. Classical invariant theory, a primer. Lecture Notes,
2000.

Yann le Cun. Generalization and network design strategies. In Connectionism in perspective,


pages 143–155. 1989.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436–444, 2015.

Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward
networks with a nonpolynomial activation function can approximate any function. Neural
Networks, 6(6):861–867, 1993.

Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Math-
ematics, 65(10):1331–1398, 2012.

Stéphane Mallat. Understanding deep convolutional networks. Phil. Trans. R. Soc. A, 374
(2065):20150203, 2016.

Siddharth Manay, Daniel Cremers, Byung-Woo Hong, Anthony J Yezzi, and Stefano Soatto.
Integral invariants for shape matching. IEEE Transactions on pattern analysis and ma-
chine intelligence, 28(10):1602–1618, 2006.

Diego Marcos, Michele Volpi, Nikos Komodakis, and Devis Tuia. Rotation equivariant vector
field networks. arXiv preprint arXiv:1612.09346, 2016.

Hrushikesh N Mhaskar and Charles A Micchelli. Approximation by superposition of sig-


moidal and radial basis functions. Advances in Applied mathematics, 13(3):350–373, 1992.

J.R. Munkres. Topology. Featured Titles for Topology Series. Prentice Hall, Incorporated,
2000. ISBN 9780131816299. URL https://books.google.ru/books?id=XjoZAQAAIAAJ.

Allan Pinkus. TDI-Subspaces of C(Rd ) and Some Density Problems from Neural Networks.
Journal of Approximation Theory, 85(3):269–287, 1996.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica,
8:143–195, 1999.

Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao.
Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A
review. International Journal of Automation and Computing, pages 1–17, 2017.

Marco Reisert. Group Integration Techniques in Pattern Analysis. PhD thesis, Albert-
Ludwigs-University, 2008.

Barbara J Schmid. Finite groups and invariant theory. In Topics in invariant theory, pages
35–66. Springer, 1991.

Hanns Schulz-Mirbach. Invariant features for gray scale images. In Mustererkennung 1995,
pages 1–14. Springer, 1995.

Jean-Pierre Serre. Linear representations of finite groups, volume 42. Springer Science &
Business Media, 2012.

Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for texture classification. arXiv
preprint arXiv:1403.1687, 2014.

Barry Simon. Representations of finite and compact groups. Number 10. American Mathe-
matical Soc., 1996.
Henrik Skibbe. Spherical Tensor Algebra for Biomedical Image Analysis. PhD thesis, Albert-
Ludwigs-University, 2013.
Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striv-
ing for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
Martin Thoma. Analysis and optimization of convolutional neural network architectures.
Master's thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, June 2017. URL
https://martin-thoma.com/msthesis/.
Ernest B Vinberg. Linear representations of groups. Birkhäuser, 2012.
A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition
using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 37(3):328–339, 1989.
H Weyl. The classical groups: their invariants and representations. Princeton mathematical
series, (1), 1946.
Patrick A Worfolk. Zeros of equivariant vector fields: Algorithms for an invariant approach.
Journal of Symbolic Computation, 17(6):487–511, 1994.

A Proof of Lemma 4.1


The proof is a slight modification of the standard proof of the Central Limit Theorem via the Fourier transform (the CLT can be directly used to prove the lemma in the case $a=b=0$, when $L_\lambda^{(a,b)}$ only includes diffusion factors).
To simplify notation, assume without loss of generality that $d_V = 1$ (in the general case the proof is essentially identical). We will use the appropriately discretized version of the Fourier transform (i.e., the Fourier series expansion). Given a discretized signal $\Phi: (\lambda\mathbb{Z})^2\to\mathbb{C}$, we define $F_\lambda\Phi$ as a function on $[-\frac{\pi}{\lambda},\frac{\pi}{\lambda}]^2$ by
$$F_\lambda\Phi(p) = \frac{\lambda^2}{2\pi}\sum_{\gamma\in(\lambda\mathbb{Z})^2}\Phi(\gamma)e^{-ip\cdot\gamma}.$$
Then, $F_\lambda: L^2((\lambda\mathbb{Z})^2,\mathbb{C})\to L^2([-\frac{\pi}{\lambda},\frac{\pi}{\lambda}]^2,\mathbb{C})$ is a unitary isomorphism, assuming that the scalar product in the input space is defined by $\langle\Phi,\Psi\rangle = \lambda^2\sum_{\gamma\in(\lambda\mathbb{Z})^2}\overline{\Phi(\gamma)}\Psi(\gamma)$ and in the output space by $\langle\Phi,\Psi\rangle = \int_{[-\frac{\pi}{\lambda},\frac{\pi}{\lambda}]^2}\overline{\Phi(p)}\Psi(p)\,d^2p$. Let $P_\lambda$ be the discretization projector (3.6). It is easy to check that $F_\lambda P_\lambda$ strongly converges to the standard Fourier transform as $\lambda\to 0$:
$$\lim_{\lambda\to 0}F_\lambda P_\lambda\Phi = F_0\Phi, \qquad \Phi\in L^2(\mathbb{R}^2,\mathbb{C}),$$

where
$$F_0\Phi(p) = \frac{1}{2\pi}\int_{\mathbb{R}^2}\Phi(\gamma)e^{-ip\cdot\gamma}\,d^2\gamma$$
and where we naturally embed $L^2([-\frac{\pi}{\lambda},\frac{\pi}{\lambda}]^2,\mathbb{C})\subset L^2(\mathbb{R}^2,\mathbb{C})$. Conversely, let $P'_\lambda$ denote the orthogonal projection onto the subspace $L^2([-\frac{\pi}{\lambda},\frac{\pi}{\lambda}]^2,\mathbb{C})$ in $L^2(\mathbb{R}^2,\mathbb{C})$:
$$P'_\lambda: \Phi\mapsto\Phi|_{[-\frac{\pi}{\lambda},\frac{\pi}{\lambda}]^2}. \qquad (A.1)$$
Then
$$\lim_{\lambda\to 0}F_\lambda^{-1}P'_\lambda\Phi = F_0^{-1}\Phi, \qquad \Phi\in L^2(\mathbb{R}^2). \qquad (A.2)$$

The Fourier transform gives us the spectral representation of the discrete differential operators (4.15),(4.16),(4.17) as operators of multiplication by a function:
$$F_\lambda\partial_z^{(\lambda)}\Phi = \Psi_{\partial_z^{(\lambda)}}\cdot F_\lambda\Phi, \qquad F_\lambda\partial_{\bar z}^{(\lambda)}\Phi = \Psi_{\partial_{\bar z}^{(\lambda)}}\cdot F_\lambda\Phi, \qquad F_\lambda\Delta^{(\lambda)}\Phi = \Psi_{\Delta^{(\lambda)}}\cdot F_\lambda\Phi,$$
where, denoting $p = (p_x,p_y)$,
$$\Psi_{\partial_z^{(\lambda)}}(p_x,p_y) = \frac{i}{2\lambda}(\sin\lambda p_x - i\sin\lambda p_y),$$
$$\Psi_{\partial_{\bar z}^{(\lambda)}}(p_x,p_y) = \frac{i}{2\lambda}(\sin\lambda p_x + i\sin\lambda p_y),$$
$$\Psi_{\Delta^{(\lambda)}}(p_x,p_y) = -\frac{4}{\lambda^2}\Big(\sin^2\frac{\lambda p_x}{2}+\sin^2\frac{\lambda p_y}{2}\Big).$$
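These symbols can be checked numerically against the finite-difference sketch of d_z given earlier in Section 4.2 (a toy verification with arbitrary test values of λ and p; interior points are compared to sidestep the periodic-boundary artifact of np.roll).

```python
# Toy check: on a plane wave e^{i p·γ}, d_z acts as multiplication by
# (i/(2λ))(sin λ p_x − i sin λ p_y), up to floating-point error.
lam, px, py, n = 0.1, 1.3, -0.7, 64
grid = lam * np.arange(n)
X, Y = np.meshgrid(grid, grid, indexing='ij')
wave = np.exp(1j * (px * X + py * Y))
symbol = 1j / (2 * lam) * (np.sin(lam * px) - 1j * np.sin(lam * py))
err = np.abs(d_z(wave, lam) - symbol * wave)[1:-1, 1:-1].max()
assert err < 1e-10
```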
The operator $L_\lambda^{(a,b)}$ defined in (4.19) can then be written as
$$F_\lambda L_\lambda^{(a,b)}\Phi = \Psi_{L_\lambda^{(a,b)}}\cdot F_\lambda P_\lambda\Phi,$$
where the function $\Psi_{L_\lambda^{(a,b)}}$ is given by
$$\Psi_{L_\lambda^{(a,b)}} = (\Psi_{\partial_z^{(\lambda)}})^a(\Psi_{\partial_{\bar z}^{(\lambda)}})^b\big(1+\tfrac{\lambda^2}{8}\Psi_{\Delta^{(\lambda)}}\big)^{\lceil 4/\lambda^2\rceil}.$$
We can then write $L_\lambda^{(a,b)}\Phi$ as a convolution of $P_\lambda\Phi$ with the kernel
$$\Psi_{a,b}^{(\lambda)} = \frac{1}{2\pi}F_\lambda^{-1}\Psi_{L_\lambda^{(a,b)}}$$
on the grid $(\lambda\mathbb{Z})^2$:
$$L_\lambda^{(a,b)}\Phi(\gamma) = \lambda^2\sum_{\theta\in(\lambda\mathbb{Z})^2}P_\lambda\Phi(\gamma-\theta)\Psi_{a,b}^{(\lambda)}(\theta), \qquad \gamma\in(\lambda\mathbb{Z})^2. \qquad (A.3)$$

Now consider the operator $L_0^{(a,b)}$ defined in (4.21). At each $x\in\mathbb{R}^2$, the value $L_0^{(a,b)}\Phi(x)$ can be written as a scalar product:
$$L_0^{(a,b)}\Phi(x) = \int_{\mathbb{R}^2}\Phi(x-y)\Psi_{a,b}(y)\,d^2y = \langle R_{-x}\widetilde\Phi,\Psi_{a,b}\rangle_{L^2(\mathbb{R}^2)}, \qquad (A.4)$$
where $\widetilde\Phi(x) = \Phi(-x)$, $\Psi_{a,b}$ is defined by (4.20), and $R_x$ is our standard representation of the group $\mathbb{R}^2$, $R_x\Phi(y) = \Phi(y-x)$. For $\lambda>0$, we can write $L_\lambda^{(a,b)}\Phi(x)$ in a similar form. Indeed, using (A.3) and naturally extending the discretized signal $\Psi_{a,b}^{(\lambda)}$ to the whole $\mathbb{R}^2$, we have
$$L_\lambda^{(a,b)}\Phi(\gamma) = \int_{\mathbb{R}^2}\Phi(\gamma-y)\Psi_{a,b}^{(\lambda)}(y)\,d^2y = \langle R_{-\gamma}\widetilde\Phi,\Psi_{a,b}^{(\lambda)}\rangle_{L^2(\mathbb{R}^2)}.$$
Then, for any $x\in\mathbb{R}^2$ we can write
$$L_\lambda^{(a,b)}\Phi(x) = \langle R_{-x+\delta x}\widetilde\Phi,\Psi_{a,b}^{(\lambda)}\rangle_{L^2(\mathbb{R}^2)}, \qquad (A.5)$$
where $-x+\delta x$ is the point of the grid $(\lambda\mathbb{Z})^2$ nearest to $-x$.


Now consider the formulas (A.4),(A.5) and observe that, by the Cauchy–Schwarz inequality and since $R$ is norm-preserving, to prove statement 1) of the lemma we only need to show that the functions $\Psi_{a,b}^{(\lambda)}, \Psi_{a,b}$ have uniformly bounded $L^2$-norms. For $\lambda>0$ we have
$$\begin{aligned}\|\Psi_{a,b}^{(\lambda)}\|_{L^2(\mathbb{R}^2)}^2 &= \Big\|\frac{1}{2\pi}F_\lambda^{-1}\Psi_{L_\lambda^{(a,b)}}\Big\|_{L^2(\mathbb{R}^2)}^2\\ &= \frac{1}{4\pi^2}\|\Psi_{L_\lambda^{(a,b)}}\|_{L^2(\mathbb{R}^2)}^2\\ &= \frac{1}{4\pi^2}\big\|(\Psi_{\partial_z^{(\lambda)}})^a(\Psi_{\partial_{\bar z}^{(\lambda)}})^b\big(1+\tfrac{\lambda^2}{8}\Psi_{\Delta^{(\lambda)}}\big)^{\lceil 4/\lambda^2\rceil}\big\|_{L^2(\mathbb{R}^2)}^2\\ &\le \frac{1}{4\pi^2}\int_{-\pi/\lambda}^{\pi/\lambda}\int_{-\pi/\lambda}^{\pi/\lambda}\Big(\frac{|p_x|+|p_y|}{2}\Big)^{2(a+b)}\exp\Big(-\lceil 4/\lambda^2\rceil\big(\sin^2\tfrac{\lambda p_x}{2}+\sin^2\tfrac{\lambda p_y}{2}\big)\Big)\,dp_x\,dp_y\\ &\le \frac{1}{4\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\Big(\frac{|p_x|+|p_y|}{2}\Big)^{2(a+b)}\exp\big(-\tfrac{4}{\pi^2}(p_x^2+p_y^2)\big)\,dp_x\,dp_y \qquad (A.6)\\ &<\infty,\end{aligned}$$
where we used the inequalities
$$|\sin t|\le|t|, \qquad |1+t|\le e^t \ (t>-1), \qquad |\sin t|\ge\tfrac{2|t|}{\pi} \ \big(t\in[-\tfrac{\pi}{2},\tfrac{\pi}{2}]\big).$$
Expression (A.6) provides a finite bound, uniform in $\lambda$, for the squared norms $\|\Psi_{a,b}^{(\lambda)}\|_2^2$. This bound also holds for $\|\Psi_{a,b}\|_2^2$.

Next, observe that to establish the strong convergence in statement 2) of the lemma, it suffices to show that
$$\lim_{\lambda\to 0}\|\Psi_{a,b}^{(\lambda)}-\Psi_{a,b}\|_{L^2(\mathbb{R}^2)} = 0. \qquad (A.7)$$
Indeed, by (A.4),(A.5), we would then have
$$\begin{aligned}\|L_\lambda^{(a,b)}\Phi-L_0^{(a,b)}\Phi\|_\infty &= \sup_{x\in\mathbb{R}^2}|\langle R_{-x+\delta x}\widetilde\Phi,\Psi_{a,b}^{(\lambda)}\rangle-\langle R_{-x}\widetilde\Phi,\Psi_{a,b}\rangle|\\ &= \sup_{x\in\mathbb{R}^2}|\langle R_{-x}(R_{\delta x}-1)\widetilde\Phi,\Psi_{a,b}^{(\lambda)}\rangle+\langle R_{-x}\widetilde\Phi,\Psi_{a,b}^{(\lambda)}-\Psi_{a,b}\rangle|\\ &\le \sup_{\|\delta x\|\le\lambda}\|R_{\delta x}\widetilde\Phi-\widetilde\Phi\|_2\,\sup_\lambda\|\Psi_{a,b}^{(\lambda)}\|_2+\|\widetilde\Phi\|_2\,\|\Psi_{a,b}^{(\lambda)}-\Psi_{a,b}\|_2\\ &\xrightarrow{\lambda\to 0}0\end{aligned}$$
thanks to the unitarity of $R$, the convergence $\lim_{\delta x\to 0}\|R_{\delta x}\widetilde\Phi-\widetilde\Phi\|_2 = 0$, the uniform boundedness of $\|\Psi_{a,b}^{(\lambda)}\|_2$ and the convergence (A.7).
To establish (A.7), we write
$$\Psi_{a,b}^{(\lambda)}-\Psi_{a,b} = \frac{1}{2\pi}\big(F_\lambda^{-1}\Psi_{L_\lambda^{(a,b)}}-F_0^{-1}\Psi_{L_0^{(a,b)}}\big),$$
where $\Psi_{L_0^{(a,b)}} = 2\pi F_0\Psi_{a,b}$. By the definition (4.20) of $\Psi_{a,b}$ and standard properties of the Fourier transform, the explicit form of the function $\Psi_{L_0^{(a,b)}}$ is
$$\Psi_{L_0^{(a,b)}}(p_x,p_y) = \Big(\frac{i(p_x-ip_y)}{2}\Big)^a\Big(\frac{i(p_x+ip_y)}{2}\Big)^b\exp\Big(-\frac{p_x^2+p_y^2}{2}\Big).$$
Observe that the function $\Psi_{L_0^{(a,b)}}$ is the pointwise limit of the functions $\Psi_{L_\lambda^{(a,b)}}$ as $\lambda\to 0$. The functions $|\Psi_{L_\lambda^{(a,b)}}|^2$ are bounded uniformly in $\lambda$ by the integrable function appearing in the integral (A.6). Therefore we can use the dominated convergence theorem and conclude that
$$\lim_{\lambda\to 0}\big\|\Psi_{L_\lambda^{(a,b)}}-P'_\lambda\Psi_{L_0^{(a,b)}}\big\|_2 = 0, \qquad (A.8)$$
where $P'_\lambda$ is the cut-off projector (A.1). We then have
$$\begin{aligned}\|\Psi_{a,b}^{(\lambda)}-\Psi_{a,b}\|_2 &= \frac{1}{2\pi}\big\|F_\lambda^{-1}\Psi_{L_\lambda^{(a,b)}}-F_0^{-1}\Psi_{L_0^{(a,b)}}\big\|_2\\ &\le \frac{1}{2\pi}\big\|F_\lambda^{-1}\big(\Psi_{L_\lambda^{(a,b)}}-P'_\lambda\Psi_{L_0^{(a,b)}}\big)\big\|_2+\frac{1}{2\pi}\big\|\big(F_\lambda^{-1}P'_\lambda-F_0^{-1}\big)\Psi_{L_0^{(a,b)}}\big\|_2\\ &\xrightarrow{\lambda\to 0}0\end{aligned}$$
by (A.8) and (A.2). We have thus proved (A.7).
It remains to show that the convergence $L_\lambda^{(a,b)}\Phi\to L_0^{(a,b)}\Phi$ is uniform on compact sets $K\subset V$. This follows by a version of the continuity argument. For any $\epsilon>0$, we can choose finitely many $\Phi_n$, $n = 1,\ldots,N$, such that for any $\Phi\in K$ there is some $\Phi_n$ for which $\|\Phi-\Phi_n\|<\epsilon$. Then $\|L_\lambda^{(a,b)}\Phi-L_0^{(a,b)}\Phi\|\le\|L_\lambda^{(a,b)}\Phi_n-L_0^{(a,b)}\Phi_n\|+2\epsilon\sup_{\lambda\ge 0}\|L_\lambda^{(a,b)}\|$. Since $\sup_{\lambda\ge 0}\|L_\lambda^{(a,b)}\|<\infty$ by statement 1) of the lemma, the desired uniform convergence for $\Phi\in K$ follows from the convergence for $\Phi_n$, $n = 1,\ldots,N$.
