
On the Expressive Power of Deep Neural Networks

Maithra Raghu 1 2   Ben Poole 3   Jon Kleinberg 1   Surya Ganguli 3   Jascha Sohl-Dickstein 2

Abstract

We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings show that: (1) The complexity of the computed function grows exponentially with depth. (2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights. (3) Trajectory regularization is a simpler alternative to batch normalization, with the same performance.

1. Introduction

Deep neural networks have proved astoundingly effective at a wide range of empirical tasks, from image classification (Krizhevsky et al., 2012) to playing Go (Silver et al., 2016), and even modeling human learning (Piech et al., 2015).

Despite these successes, understanding of how and why neural network architectures achieve their empirical successes is still lacking. This includes even the fundamental question of neural network expressivity: how the architectural properties of a neural network (depth, width, layer type) affect the resulting functions it can compute, and its ensuing performance.

This is a foundational question, and there is a rich history of prior work addressing expressivity in neural networks. However, it has been challenging to derive conclusions that provide both theoretical generality with respect to choices of architecture as well as meaningful insights into practical performance.

Indeed, the very first results on this question take a highly theoretical approach, from using functional analysis to show universal approximation results (Hornik et al., 1989; Cybenko, 1989), to analysing expressivity via comparisons to Boolean circuits (Maass et al., 1994) and studying network VC dimension (Bartlett et al., 1998). While these results provided theoretically general conclusions, the shallow networks they studied are very different from the deep models that have proven so successful in recent years.

In response, several recent papers have focused on understanding the benefits of depth for neural networks (Pascanu et al., 2013; Montufar et al., 2014; Eldan and Shamir, 2015; Telgarsky, 2015; Martens et al., 2013; Bianchini and Scarselli, 2014). These results are compelling and take modern architectural changes into account, but they only show that a specific choice of weights for a deeper network results in inapproximability by a shallow (typically one or two hidden layers) network.

In particular, the goal of this new line of work has been to establish lower bounds — showing separations between shallow and deep networks — and as such they are based on hand-coded constructions of specific network weights. Even if the weight values used in these constructions are robust to small perturbations (as in (Pascanu et al., 2013; Montufar et al., 2014)), the functions that arise from these constructions tend toward extremal properties by design, and there is no evidence that a network trained on data ever resembles such a function.

This has meant that a set of fundamental questions about neural network expressivity has remained largely unanswered. First, we lack a good understanding of the "typical" case rather than the worst case in these bounds for deep networks, and consequently have no way to evaluate whether the hand-coded extremal constructions provide a reflection of the complexity encountered in more standard settings. Second, we lack an understanding of upper bounds to match the lower bounds produced by this prior work; do the constructions used to date place us near the limit of the expressive power of neural networks, or are there still large gaps? Finally, if we had an understanding of these two issues, we might begin to draw connections between network expressivity and observed performance.

1 Cornell University  2 Google Brain  3 Stanford University. Correspondence to: Maithra Raghu <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).
Our contributions: Measures of Expressivity and their Applications. In this paper, we address this set of challenges by defining and analyzing an interrelated set of measures of expressivity for neural networks; our framework applies to a wide range of standard architectures, independent of specific weight choices. We begin our analysis at the start of training, after random initialization, and later derive insights connecting network expressivity and performance.

Our first measure of expressivity is based on the notion of an activation pattern: in a network where the units compute functions based on discrete thresholds, we can ask which units are above or below their thresholds (i.e. which units are "active" and which are not). For the range of standard architectures that we consider, the network is essentially computing a linear function once we fix the activation pattern; thus, counting the number of possible activation patterns provides a concrete way of measuring the complexity beyond linearity that the network provides. We give an upper bound on the number of possible activation patterns, over any setting of the weights. This bound is tight, as it matches the hand-constructed lower bounds of earlier work (Pascanu et al., 2013; Montufar et al., 2014).

Key to our analysis is the notion of a transition, in which changing an input x to a nearby input x + δ changes the activation pattern. We study the behavior of transitions as we pass the input along a one-dimensional parametrized trajectory x(t). Our central finding is that the trajectory length grows exponentially in the depth of the network.

Trajectory length serves as a unifying notion in our measures of expressivity, and it leads to insights into the behavior of trained networks. Specifically, we find that the exponential growth in trajectory length as a function of depth implies that small adjustments in parameters lower in the network induce larger changes than comparable adjustments higher in the network. We demonstrate this phenomenon through experiments on MNIST and CIFAR-10, where the network displays much less robustness to noise in the lower layers, and better performance when they are trained well. We also explore the effects of regularization methods on trajectory length as the network trains, and propose a less computationally intensive method of regularization, trajectory regularization, that offers the same performance as batch normalization.

The contributions of this paper are thus:

(1) Measures of expressivity: We propose easily computable measures of neural network expressivity that capture the expressive power inherent in different neural network architectures, independent of specific weight settings.

(2) Exponential trajectories: We find an exponential depth dependence displayed by these measures, through a unifying analysis in which we study how the network transforms its input by measuring trajectory length.

(3) All weights are not equal (the lower layers matter more): We show how these results on trajectory length suggest that optimizing weights in lower layers of the network is particularly important.

(4) Trajectory regularization: Based on understanding the effect of batch norm on trajectory length, we propose a new method of regularization, trajectory regularization, that offers the same advantages as batch norm, and is computationally more efficient.

In prior work (Poole et al., 2016), we studied the propagation of Riemannian curvature through random networks by developing a mean field theory approach. Here, we take an approach grounded in computational geometry, presenting measures with a combinatorial flavor and exploring the consequences during and after training.

2. Measures of Expressivity

Given a neural network of a certain architecture A (some depth, width, layer types), we have an associated function F_A(x; W), where x is an input and W represents all the parameters of the network. Our goal is to understand how the behavior of F_A(x; W) changes as A changes, for values of W that we might encounter during training, and across inputs x.

The first major difficulty comes from the high dimensionality of the input. Precisely quantifying the properties of F_A(x; W) over the entire input space is intractable. As a tractable alternative, we study simple one dimensional trajectories through input space. More formally:

Definition. Given two points x_0, x_1 ∈ R^m, we say x(t) is a trajectory (between x_0 and x_1) if x(t) is a curve parametrized by a scalar t ∈ [0, 1], with x(0) = x_0 and x(1) = x_1.

Simple examples of a trajectory would be a line (x(t) = t x_1 + (1 − t) x_0) or a circular arc (x(t) = cos(πt/2) x_0 + sin(πt/2) x_1), but in general x(t) may be more complicated, and potentially not expressible in closed form.

Armed with this notion of trajectories, we can begin to define measures of expressivity of a network F_A(x; W) over trajectories x(t).
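For concreteness, the fragment below (an illustrative NumPy sketch, not code from the paper; the endpoints, dimensionality, and step count are arbitrary choices) discretizes the two example trajectories above, which is how such trajectories can be swept in practice.

```python
import numpy as np

def line(x0, x1, num_points=1000):
    """Discretized line x(t) = t*x1 + (1 - t)*x0, for t in [0, 1]."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]
    return t * x1 + (1.0 - t) * x0

def circular_arc(x0, x1, num_points=1000):
    """Discretized arc x(t) = cos(pi*t/2)*x0 + sin(pi*t/2)*x1, for t in [0, 1]."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]
    return np.cos(np.pi * t / 2) * x0 + np.sin(np.pi * t / 2) * x1

# Example: a trajectory between two stand-in "datapoints" in R^784 (e.g. flattened images).
rng = np.random.RandomState(0)
x0, x1 = rng.randn(784), rng.randn(784)
trajectory = circular_arc(x0, x1)   # shape (1000, 784), one row per value of t
```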
2.1. Neuron Transitions and Activation Patterns

In (Montufar et al., 2014) the notion of a "linear region" is introduced. Given a neural network with piecewise linear activations (such as ReLU or hard tanh), the function it computes is also piecewise linear, a consequence of the fact that composing piecewise linear functions results in a piecewise linear function. So one way to measure the "expressive power" of different architectures A is to count the number of linear pieces (regions), which determines how nonlinear the function is.

In fact, a change in linear region is caused by a neuron transition in the output layer. More precisely:

Definition. For fixed W, we say a neuron with piecewise linear regions transitions between inputs x, x + δ if its activation function switches linear region between x and x + δ.

So a ReLU transition would be given by a neuron switching from off to on (or vice versa), and for hard tanh by switching between saturation at −1, its linear middle region, and saturation at 1. For any generic trajectory x(t), we can thus define T(F_A(x(t); W)) to be the number of transitions undergone by output neurons (i.e. the number of linear regions) as we sweep the input x(t). Instead of just concentrating on the output neurons, however, we can look at this pattern over the entire network. We call this an activation pattern:

Definition. We can define AP(F_A(x; W)) to be the activation pattern – a string of form {0, 1}^(num neurons) (for ReLUs) or {−1, 0, 1}^(num neurons) (for hard tanh) – of the network, encoding the linear region of the activation function of every neuron, for an input x and weights W.

Overloading notation slightly, we can also define (similarly to transitions) A(F_A(x(t); W)) as the number of distinct activation patterns as we sweep x along x(t). As each distinct activation pattern corresponds to a different linear function of the input, this combinatorial measure captures how much more expressive A is over a simple linear mapping.

Returning to Montufar et al., they provide a construction, i.e. a specific set of weights W_0, that results in an exponential increase of linear regions with the depth of the architecture. They also appeal to Zaslavsky's theorem (Stanley, 2011) from the theory of hyperplane arrangements to show that a shallow network, i.e. one hidden layer, with the same number of parameters as a deep network, has a much smaller number of linear regions than the number achieved by their choice of weights W_0 for the deep network.

More formally, letting A_1 be a fully connected network with one hidden layer, and A_l a fully connected network with the same number of parameters but l hidden layers, they show

$$\forall\, W \quad T\big(F_{A_1}([0,1]; W)\big) < T\big(F_{A_l}([0,1]; W_0)\big) \qquad (*)$$

We derive a much more general result by considering the 'global' activation patterns over the entire input space, and prove that for any fully connected network, with any number of hidden layers, we can upper bound the number of linear regions it can achieve, over all possible weight settings W. This upper bound is asymptotically tight, matched by the construction given in (Montufar et al., 2014). Our result can be written formally as:

Theorem 1 ((Tight) Upper Bound for Number of Activation Patterns). Let A^(n,k) denote a fully connected network with n hidden layers of width k, and inputs in R^m. Then the number of activation patterns A(F_{A^(n,k)}(R^m; W)) is upper bounded by O(k^(mn)) for ReLU activations, and O((2k)^(mn)) for hard tanh.

From this we can derive a chain of inequalities. Firstly, from the theorem above we find an upper bound of A(F_{A^(n,k)}(R^m; W)) over all W, i.e.

$$\forall\, W \quad \mathcal{A}\big(F_{A^{(n,k)}}(\mathbb{R}^m; W)\big) \leq U(n, k, m).$$

Next, suppose we have N neurons in total. Then we want to compare (for, wlog, ReLUs) quantities like U(n', N/n', m) for different n'.

But U(n', N/n', m) = O((N/n')^(mn')), and so, noting that the maximum of (a/x)^(mx) (for a > e) is attained at x = a/e (its log-derivative, m(log(a/x) − 1), vanishes exactly there), we get (for n, k > e), in comparison to (*),

$$U(1, N, m) < U\!\left(2, \tfrac{N}{2}, m\right) < \cdots < U\!\left(n-1, \tfrac{N}{n-1}, m\right) < U(n, k, m).$$

We prove this via an inductive proof on regions in a hyperplane arrangement. The proof can be found in the Appendix. As noted in the introduction, this result differs from earlier lower-bound constructions in that it is an upper bound that applies to all possible sets of weights. Via our analysis, we also prove

Theorem 2 (Regions in Input Space). Given the corresponding function of a neural network F_A(R^m; W) with ReLU or hard tanh activations, the input space is partitioned into convex polytopes, with F_A(R^m; W) corresponding to a different linear function on each region.

This result is of independent interest for optimization – a linear function over a convex polytope results in a well behaved loss function and an easy optimization problem. Understanding the density of these regions during the training process would likely shed light on properties of the loss surface, and improved optimization methods. A picture of a network's regions is shown in Figure 1.
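Before turning to the empirical counts of Section 2.1.1, the sketch below (illustrative NumPy only, not the paper's code; the network sizes, initialization scale, and the {−1, 0, 1} encoding of hard-tanh regions are choices made for this example) shows one way to count activation patterns and neuron transitions along a discretized trajectory for a random fully connected hard-tanh network.

```python
import numpy as np

def hard_tanh_regions(pre_act):
    """Linear region of each hard-tanh neuron: -1 (saturated low), 0 (linear), +1 (saturated high)."""
    return np.where(pre_act <= -1.0, -1, np.where(pre_act >= 1.0, 1, 0))

def sweep_activation_patterns(trajectory, n_layers=4, width=32,
                              sigma_w_sq=16.0, sigma_b_sq=0.0, seed=0):
    """Count distinct activation patterns and neuron transitions along a discretized trajectory x(t)."""
    rng = np.random.RandomState(seed)
    fan_ins = [trajectory.shape[1]] + [width] * (n_layers - 1)
    Ws = [rng.randn(width, f) * np.sqrt(sigma_w_sq / f) for f in fan_ins]  # W ~ N(0, sigma_w^2 / k)
    bs = [rng.randn(width) * np.sqrt(sigma_b_sq) for _ in fan_ins]

    patterns, transitions, prev = set(), 0, None
    for x in trajectory:
        h, regions = x, []
        for W, b in zip(Ws, bs):
            pre = W @ h + b
            regions.append(hard_tanh_regions(pre))
            h = np.clip(pre, -1.0, 1.0)                  # hard-tanh activation
        pattern = np.concatenate(regions)
        patterns.add(pattern.tobytes())
        if prev is not None:
            transitions += int(np.sum(pattern != prev))  # neurons that switched region
        prev = pattern
    return len(patterns), transitions

# Example: sweep a discretized circular arc between two random points in R^16.
rng = np.random.RandomState(1)
x0, x1 = rng.randn(16), rng.randn(16)
t = np.linspace(0.0, 1.0, 2000)[:, None]
arc = np.cos(np.pi * t / 2) * x0 + np.sin(np.pi * t / 2) * x1
print(sweep_activation_patterns(arc))
```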
Figure 1. Deep networks with piecewise linear activations subdivide input space into convex polytopes. We take a three hidden layer ReLU network, with input x ∈ R^2, and four units in each layer. The left pane shows activations for the first layer only. As the input is in R^2, neurons in the first hidden layer have an associated line in R^2, depicting their activation boundary. The left pane thus has four such lines. For the second hidden layer each neuron again has a line in input space corresponding to on/off, but this line is different for each region described by the first layer activation pattern. So in the centre pane, which shows activation boundary lines corresponding to second hidden layer neurons in green (and first hidden layer in black), we can see the green lines 'bend' at the boundaries. (The reason for this bending becomes apparent through the proof of Theorem 2.) Finally, the right pane adds the on/off boundaries for neurons in the third hidden layer, in purple. These lines can bend at both black and green boundaries, as the image shows. This final set of convex polytopes corresponds to all activation patterns for this network (with its current set of weights) over the unit square, with each polytope representing a different linear function.

Figure 2. The number of transitions seen for fully connected networks of different widths, depths and initialization scales, with a circular trajectory between MNIST datapoints. The number of transitions grows exponentially with the depth of the architecture, as seen in the left pane (transitions with increasing depth). The same rate of growth is not seen with increasing architecture width, plotted in the right pane. There is a surprising dependence on the scale of initialization, explained in Section 2.2.

Figure 3. Picture showing a trajectory increasing with the depth of a network. We start off with a circular trajectory (leftmost pane), and feed it through a fully connected tanh network with width 100. The pane second from left shows the image of the circular trajectory (projected down to two dimensions) after being transformed by the first hidden layer. Subsequent panes show projections of the latent image of the circular trajectory after being transformed by more hidden layers. The final pane shows the trajectory after being transformed by all the hidden layers.

2.1.1. Empirically Counting Transitions

We empirically tested the growth of the number of activations and transitions as we varied x along x(t) to understand their behavior. We found that for bounded non-linearities, especially tanh and hard-tanh, not only do we observe exponential growth with depth (as hinted at by the upper bound), but the scale of parameter initialization also affects the observations (Figure 2). We also experimented with sweeping the weights W of a layer through a trajectory W(t), and counting the different labellings output by the network. This 'dichotomies' measure is discussed further in the Appendix, and also exhibits the same growth properties (Figure 14).

2.2. Trajectory Length

In fact, there turns out to be a reason for the exponential growth with depth, and the sensitivity to initialization scale. Returning to our definition of trajectory, we can define an immediately related quantity, trajectory length.

Definition. Given a trajectory x(t), we define its length, l(x(t)), to be the standard arc length:

$$l(x(t)) = \int_t \left\lVert \frac{dx(t)}{dt} \right\rVert dt$$

Intuitively, the arc length breaks x(t) up into infinitesimal intervals and sums together the Euclidean length of these intervals.

If we let A^(n,k) denote, as before, fully connected networks with n hidden layers each of width k, and initialize with weights ∼ N(0, σ_w^2/k) (accounting for input scaling, as is typical) and biases ∼ N(0, σ_b^2), we find that:

Theorem 3 (Bound on Growth of Trajectory Length). Let F_A(x; W) be a ReLU or hard tanh random neural network and x(t) a one dimensional trajectory with x(t + δ) having a non-trivial perpendicular component to x(t) for all t, δ (i.e., not a line). Then, defining z^(d)(x(t)) = z^(d)(t) to be the image of the trajectory in layer d of the network, we have

(a) for ReLUs,

$$\mathbb{E}\big[\, l(z^{(d)}(t)) \,\big] \;\geq\; O\!\left(\frac{\sigma_w \sqrt{k}}{\sqrt{k+1}}\right)^{\!d} l(x(t))$$

(b) for hard tanh,

$$\mathbb{E}\big[\, l(z^{(d)}(t)) \,\big] \;\geq\; O\!\left(\frac{\sigma_w \sqrt{k}}{\sqrt{\sigma_w^2 + \sigma_b^2 + k\sqrt{\sigma_w^2 + \sigma_b^2}}}\right)^{\!d} l(x(t))$$
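The growth predicted by Theorem 3 can be estimated numerically. The sketch below (illustrative NumPy, not the paper's code; the width, depth, variance values, and the circular input trajectory are choices made here) propagates a finely discretized trajectory through a random hard-tanh network and approximates l(z^(d)(t)) at each layer by summing segment lengths.

```python
import numpy as np

def layerwise_trajectory_lengths(trajectory, n_layers=6, width=100,
                                 sigma_w_sq=4.0, sigma_b_sq=0.0, seed=0):
    """Arc length of the image z^(d)(t) of a discretized trajectory at each layer of a
    random hard-tanh network, approximated by summing consecutive segment lengths."""
    rng = np.random.RandomState(seed)
    arc_length = lambda z: float(np.sum(np.linalg.norm(np.diff(z, axis=0), axis=1)))

    z, fan_in = trajectory, trajectory.shape[1]
    lengths = [arc_length(z)]                                        # d = 0: the input trajectory
    for _ in range(n_layers):
        W = rng.randn(width, fan_in) * np.sqrt(sigma_w_sq / fan_in)  # W ~ N(0, sigma_w^2 / k)
        b = rng.randn(width) * np.sqrt(sigma_b_sq)
        z = np.clip(z @ W.T + b, -1.0, 1.0)                          # hard-tanh layer
        lengths.append(arc_length(z))
        fan_in = width
    return lengths

# Example: length ratios between successive layers grow roughly geometrically for large sigma_w^2.
rng = np.random.RandomState(1)
x0, x1 = rng.randn(32), rng.randn(32)
t = np.linspace(0.0, 1.0, 5000)[:, None]
arc = np.cos(np.pi * t / 2) * x0 + np.sin(np.pi * t / 2) * x1
lens = layerwise_trajectory_lengths(arc)
print([round(b / a, 2) for a, b in zip(lens, lens[1:])])
```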
That is, l(x(t)) grows exponentially with the depth of the network, but the width only appears as a base (of the exponent). This bound is in fact tight in the limits of large σ_w and k.

A schematic image depicting this can be seen in Figure 3, and the proof can be found in the Appendix. A rough outline is as follows: we look at the expected growth of the difference between a point z^(d)(t) on the curve and a small perturbation z^(d)(t + dt), from layer d to layer d + 1. Denoting this quantity ||δz^(d)(t)||, we derive a recurrence relating ||δz^(d+1)(t)|| and ||δz^(d)(t)|| which can be composed to give the desired growth rate.

The analysis is complicated by the statistical dependence on the image of the input z^(d+1)(t). So we instead form a recursion by looking at the component of the difference perpendicular to the image of the input in that layer, i.e. ||δz_⊥^(d+1)(t)||, which results in the condition on x(t) in the statement.

In Figures 4 and 12, we see the growth of an input trajectory for ReLU networks on CIFAR-10 and MNIST. The CIFAR-10 network is convolutional, but we observe that these layers also result in similar rates of trajectory length increase to the fully connected layers. We also see, as would be expected, that pooling layers act to reduce the trajectory length. We discuss upper bounds in the Appendix.

Figure 4. We look at trajectory growth with different initialization scales as a trajectory is propagated through a convolutional architecture for CIFAR-10, with ReLU activations. The analysis of Theorem 3 was for fully connected networks, but we see that trajectory growth holds (albeit with slightly higher scales) for convolutional architectures also. Note that the decrease in trajectory length seen in layers 3 and 7 is expected, as those layers are pooling layers.

Figure 5. The number of transitions is linear in trajectory length. Here we compare the empirical number of transitions to the length of the trajectory, for different depths of a hard-tanh network. We repeat this comparison for a variety of network architectures, with different network width k and weight variance σ_w^2.

For the hard tanh case (and more generally any bounded non-linearity), we can formally prove the relation of trajectory length and transitions under an assumption: assume that while we sweep x(t), all neurons are saturated unless transitioning between saturation endpoints, which happens very rapidly. (This is the case for e.g. large initialization scales.) Then we have:

Theorem 4 (Transitions proportional to trajectory length). Let F_{A^(n,k)} be a hard tanh network with n hidden layers each of width k. And let

$$g(k, \sigma_w, \sigma_b, n) = O\!\left(\frac{\sqrt{k}}{\sqrt{1 + \frac{\sigma_b^2}{\sigma_w^2}}}\right)^{\!n}$$

Then T(F_{A^(n,k)}(x(t); W)) = O(g(k, σ_w, σ_b, n)) for W initialized with weight and bias scales σ_w, σ_b.

Note that the expression for g(k, σ_w, σ_b, n) is exactly the expression given by Theorem 3 when σ_w is very large and dominates σ_b. We can also verify this experimentally in settings where the simplifying assumption does not hold, as in Figure 5.

3. Insights from Network Expressivity

Here we explore the insights gained from applying our measurements of expressivity, particularly trajectory length, to understand network performance. We examine the connection of expressivity and stability, and inspired by this, propose a new method of regularization, trajectory regularization, that offers the same advantages as the more computationally intensive batch normalization.

3.1. Expressivity and Network Stability

The analysis of network expressivity offers interesting takeaways related to the parameter and functional stability of a network. From the proof of Theorem 3, we saw that a perturbation to the input would grow exponentially in the depth of the network. It is easy to see that this analysis is not limited to the input layer, but can be applied to any layer. In this form, it would say:

A perturbation at a layer grows exponentially in the remaining depth after that layer.

This means that perturbations to weights in lower layers should be more costly than perturbations in the upper layers, due to the exponentially increasing magnitude of noise, and result in a much larger drop of accuracy.
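A minimal way to probe this claim is to perturb a single layer's weights and re-evaluate test accuracy, as the experiment in Figure 6 does. The sketch below is a hedged PyTorch fragment, not the paper's code: `net`, `layer`, and `loader` are assumed stand-ins for a trained model, one of its weight-bearing modules, and a test DataLoader.

```python
import torch

def accuracy_after_weight_noise(net, layer, loader, magnitude, device="cpu"):
    """Add Gaussian noise of a given magnitude to one layer's weights, measure accuracy, restore."""
    saved = layer.weight.detach().clone()
    correct = total = 0
    with torch.no_grad():
        layer.weight.add_(magnitude * torch.randn_like(layer.weight))
        for inputs, targets in loader:
            preds = net(inputs.to(device)).argmax(dim=1)
            correct += (preds == targets.to(device)).sum().item()
            total += targets.numel()
        layer.weight.copy_(saved)   # restore the trained weights before the next sweep
    return correct / total

# Sweeping `magnitude` for each layer in turn gives curves of the kind shown in Figure 6;
# the claim above predicts a much larger accuracy drop when the perturbed layer is early.
```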
Figure 6, in which we train a conv network on CIFAR-10 and add noise of varying magnitudes to exactly one layer, shows exactly this.

Figure 6. We pick a single layer of a conv net trained to high accuracy on CIFAR-10, and add noise of increasing magnitudes to that layer's weights, testing the network accuracy as we do so. We find that the initial (lower) layers of the network are least robust to noise – as the figure shows, adding noise of 0.25 magnitude to the first layer results in a 0.7 drop in accuracy, while the same amount of noise added to the fifth layer barely results in a 0.02 drop in accuracy. This pattern is seen for many different initialization scales, even for the (typical) scaling of σ_w^2 = 2 used in the experiment.

We also find that the converse (in some sense) holds: after initializing a network, we trained a single layer at different depths in the network, and found monotonically increasing performance as layers lower in the network were trained. This is shown in Figure 7 and Figure 17 in the Appendix.

Figure 7. Demonstration of the expressive power of remaining depth on MNIST. Here we plot train and test accuracy achieved by training exactly one layer of a fully connected neural net on MNIST. The different lines are generated by varying the hidden layer chosen to train. All other layers are kept frozen after random initialization. We see that training lower hidden layers leads to better performance. The networks had width k = 100, weight variance σ_w^2 = 1, and hard-tanh nonlinearities. Note that we only train from the second hidden layer (weights W^(1)) onwards, so that the number of parameters trained remains fixed.

3.2. Trajectory Length and Regularization: The Effect of Batch Normalization

Expressivity measures, especially trajectory length, can also be used to better understand the effect of regularization. One regularization technique that has been extremely successful for training neural networks is Batch Normalization (Ioffe and Szegedy, 2015).

By taking measures of trajectories during training, we find that without batch norm, trajectory length tends to increase during training, as shown in Figure 8 and Figure 18 in the Appendix. In these experiments, two networks were initialized with σ_w^2 = 2 and trained to high test accuracy on CIFAR-10 and MNIST. We see that in both cases, trajectory length increases as training progresses.

Figure 8. Training increases trajectory length even for a typical initialization value of σ_w^2 (here σ_w^2 = 2). We propagate a circular trajectory joining two CIFAR-10 datapoints through a conv net without batch norm, and look at how trajectory length changes through training. We see that training causes trajectory length to increase exponentially with depth (the exceptions being the pooling layers and the final fc layer, which halves the number of neurons). Note that at Step 0, the network is not in the exponential growth regime. We observe (discussed in Figure 9) that even networks that aren't initialized in the exponential growth regime can be pushed there through training.

A surprising observation is that σ_w^2 = 2 is not in the exponential growth regime at initialization for the CIFAR-10 architecture (Figure 8 at Step 0). But note that even with a smaller weight initialization, weight norms increase during training, shown in Figure 9, pushing typically initialized networks into the exponential growth regime.

While the initial growth of trajectory length enables greater functional expressivity, large trajectory growth in the learnt representation results in an unstable representation, witnessed in Figure 6. In Figure 10 we train another conv net on CIFAR-10, but this time with batch normalization. We see that the batch norm layers reduce trajectory length, helping stability.

3.3. Trajectory Regularization

Motivated by the fact that batch normalization decreases trajectory length and hence helps stability and generalization, we consider directly regularizing on trajectory length: we replace every batch norm layer used in the conv net in Figure 10 with a trajectory regularization layer. This layer adds to the loss λ · (current length / orig length), and then scales the outgoing activations by λ, where λ is a parameter to be learnt. In implementation, we typically scale the additional loss above by a constant (0.01) to reduce its magnitude in comparison to the classification loss.
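One possible reading of this layer is sketched below in PyTorch (not the paper's implementation). The minibatch is treated as a piecewise linear trajectory between adjacent datapoints, as described in the Figure 11 caption; taking "orig length" to be the length measured at the first forward pass, and the 0.01 weighting from the text, are assumptions made here.

```python
import torch
import torch.nn as nn

class TrajectoryRegularizer(nn.Module):
    """A hedged sketch of the trajectory-regularization layer described in Section 3.3."""

    def __init__(self, coeff=0.01):
        super().__init__()
        self.lamb = nn.Parameter(torch.tensor(1.0))  # learnt scaling, lambda in the text
        self.coeff = coeff                           # constant weighting of the extra loss term
        self.orig_length = None                      # assumption: set at the first forward pass
        self.penalty = torch.tensor(0.0)

    @staticmethod
    def _piecewise_length(x):
        # Length of the piecewise linear trajectory through adjacent datapoints in the minibatch.
        flat = x.flatten(start_dim=1)
        return (flat[1:] - flat[:-1]).norm(dim=1).sum()

    def forward(self, x):
        length = self._piecewise_length(x)
        if self.orig_length is None:
            self.orig_length = length.detach() + 1e-8
        self.penalty = self.coeff * self.lamb * (length / self.orig_length)
        return self.lamb * x                         # scale outgoing activations by lambda

# In a training step, the penalties are added to the classification loss, e.g.:
#   loss = criterion(net(inputs), targets)
#   loss = loss + sum(m.penalty for m in net.modules() if isinstance(m, TrajectoryRegularizer))
```

Because the same computation is performed at train and test time, no separate inference-mode statistics (as in batch norm) are needed.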
Our results, Figure 11, show that both trajectory regularization and batch norm perform comparably, and considerably better than not using batch norm. One advantage of using trajectory regularization is that we don't require different computations to be performed for train and test, enabling a more efficient implementation.

Figure 9. This figure shows how the weight scaling of a CIFAR-10 network evolves during training. The network was initialized with σ_w^2 = 2, which increases across all layers during training.

Figure 10. Growth of a circular trajectory between two datapoints with batch norm layers for a conv net on CIFAR-10. The network was initialized as typical, with σ_w^2 = 2. Note that the batch norm layers at Step 0 are poorly behaved due to division by a close-to-zero variance. But after just a few hundred gradient steps and continuing onwards, we see the batch norm layers (dotted lines) reduce trajectory length, stabilising the representation without sacrificing expressivity.

Figure 11. We replace each batch norm layer of the CIFAR-10 conv net with a trajectory regularization layer, described in Section 3.3. During training, trajectory length is easily calculated along a piecewise linear trajectory between adjacent datapoints in the minibatch. We see that trajectory regularization achieves the same performance as batch norm, albeit with slightly more train time. However, as trajectory regularization behaves the same during train and test time, it is simpler and more efficient to implement.

4. Discussion

Characterizing the expressiveness of neural networks, and understanding how expressiveness varies with the parameters of the architecture, has been a challenging problem due to the difficulty in identifying meaningful notions of expressivity and in linking their analysis to implications for these networks in practice. In this paper we have presented an interrelated set of expressivity measures; we have shown tight exponential bounds on the growth of these measures with the depth of the networks, and we have offered a unifying view of the analysis through the notion of trajectory length. Our analysis of trajectories provides insights for the performance of trained networks as well, suggesting that networks in practice may be more sensitive to small perturbations in weights at lower layers. We also used this to explore the empirical success of batch norm, and developed a new regularization method – trajectory regularization.

This work raises many interesting directions for future work. At a general level, continuing the theme of 'principled deep understanding', it would be interesting to link measures of expressivity to other properties of neural network performance. There is also a natural connection between adversarial examples (Goodfellow et al., 2014) and trajectory length: adversarial perturbations are only a small distance away in input space, but result in a large change in classification (the output layer). Understanding how trajectories between the original input and an adversarial perturbation grow might provide insights into this phenomenon. Another direction, partially explored in this paper, is regularizing based on trajectory length. A very simple version of this was presented, but further performance gains might be achieved through more sophisticated use of this method.

Acknowledgements

We thank Samy Bengio, Ian Goodfellow, Laurent Dinh, and Quoc Le for extremely helpful discussion.
References

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Wolfgang Maass, Georg Schnitger, and Eduardo D. Sontag. A comparison of the computational power of sigmoid and Boolean threshold circuits. Springer, 1994.

Peter L. Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10(8):2159–2173, 1998.

Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.

Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

James Martens, Arkadev Chattopadhya, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2877–2885, 2013.

Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.

Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

Richard Stanley. Hyperplane arrangements. Enumerative Combinatorics, 2011.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.

D. Kershaw. Some extensions of W. Gautschi's inequalities for the gamma function. Mathematics of Computation, 41(164):607–611, 1983.

Andrea Laforgia and Pierpaolo Natalini. On some inequalities for the gamma function. Advances in Dynamical Systems and Applications, 8(2):261–267, 2013.

Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.