
Path-SGD: Path-Normalized Optimization in Deep Neural Networks

Behnam Neyshabur¹, Ruslan Salakhutdinov² and Nathan Srebro¹
¹ Toyota Technological Institute at Chicago
² Department of Computer Science, University of Toronto

arXiv:1506.02617v1 [cs.LG] 8 Jun 2015

Abstract
We revisit the choice of SGD for training deep neural networks by reconsidering the appropriate
geometry in which to optimize the weights. We argue for a geometry invariant to rescaling of weights
that does not affect the output of the network, and suggest Path-SGD, which is an approximate steepest
descent method with respect to a path-wise regularizer related to max-norm regularization. Path-SGD is
easy and efficient to implement and leads to empirical gains over SGD and AdaGrad.

1 Introduction
Training deep networks is a challenging problem [16, 2], and various heuristics and optimization algorithms
have been suggested in order to improve the efficiency of training [5, 9, 4]. However, training deep
architectures is still considerably slow and the problem remains open. Many current training
methods rely on a good initialization followed by Stochastic Gradient Descent (SGD), sometimes
together with an adaptive stepsize or momentum term [16, 1, 6].
Revisiting the choice of gradient descent, we recall that optimization is inherently tied to a choice of geometry
or measure of distance, norm or divergence. Gradient descent, for example, is tied to the ℓ2 norm, as it
is the steepest descent with respect to the ℓ2 norm in the parameter space, while coordinate descent corresponds
to steepest descent with respect to the ℓ1 norm, and exp-gradient (multiplicative weight) updates are tied to
an entropic divergence. Moreover, at least when the objective function is convex, convergence behavior is
tied to the corresponding norms or potentials. For example, with gradient descent, or SGD, convergence
speeds depend on the ℓ2 norm of the optimum. The norm or divergence can be viewed as a regularizer for
the updates. There is therefore also a strong link between regularization for optimization and regularization
for learning: optimization may provide implicit regularization in terms of its corresponding geometry, and
for ideal optimization performance the optimization geometry should be aligned with the inductive bias driving
the learning [14].
Is the ℓ2 geometry on the weights the appropriate geometry for the space of deep networks? Or can we
suggest a geometry with more desirable properties that would enable faster optimization and perhaps also
better implicit regularization? As suggested above, this question is also linked to the choice of an appropriate
regularizer for deep networks.
Focusing on networks with RELU activations, we observe that scaling down the incoming edges to a
hidden unit and scaling up the outgoing edges by the same factor yields an equivalent network computing
the same function. Since predictions are invariant to such rescalings, it is natural to seek a geometry, and
corresponding optimization method, that is similarly invariant.

We consider here a geometry inspired by max-norm regularization (regularizing the maximum norm of
incoming weights into any unit), which seems to provide a better inductive bias compared to the ℓ2 norm
(weight decay) [3, 15]. But to achieve rescaling invariance, we use not the max-norm itself, but rather the
minimum max-norm over all rescalings of the weights. We discuss how this measure can be expressed as a
“path regularizer” and can be computed efficiently.
We therefore suggest a novel optimization method, Path-SGD, that is an approximate steepest descent
method with respect to path regularization. Path-SGD is rescaling-invariant, and we demonstrate that Path-SGD
outperforms gradient descent and AdaGrad for classification tasks on several benchmark datasets.

Notations  A feedforward neural network that computes a function f : R^D → R^C can be represented
by a directed acyclic graph (DAG) G(V, E) with D input nodes v_in[1], . . . , v_in[D] ∈ V, C output nodes
v_out[1], . . . , v_out[C] ∈ V, weights w : E → R and an activation function σ : R → R that is applied at the
internal nodes (hidden units). We denote the function computed by this network as f_{G,w,σ}. In this paper we
focus on the RELU (REctified Linear Unit) activation function σ_RELU(x) = max{0, x}. We refer to the depth
d of the network, which is the length of the longest directed path in G. For any 0 ≤ i ≤ d, we define V_in^i to
be the set of vertices whose longest path to an input unit has length i; V_out^i is defined similarly for paths to
output units. In layered networks, V_in^i = V_out^{d−i} is the set of hidden units in hidden layer i.

2 Rescaling and Unbalanceness


One of the special properties of the RELU activation function is non-negative homogeneity. That is, for any
scalar c ≥ 0 and any x ∈ R, we have σ_RELU(c · x) = c · σ_RELU(x). This interesting property allows the
network to be rescaled without changing the function computed by the network. We define the rescaling
function ρ_{c,v}(w), such that given the weights of the network w : E → R, a constant c > 0, and a node v, the
rescaling function multiplies the incoming edges and divides the outgoing edges of v by c. That is, ρ_{c,v}(w)
maps w to the weights w̃ of the rescaled network, where for any (u_1 → u_2) ∈ E:

    w̃(u_1→u_2) = c · w(u_1→u_2)  if u_2 = v;   (1/c) · w(u_1→u_2)  if u_1 = v;   w(u_1→u_2)  otherwise.        (1)

It is easy to see that the rescaled network computes the same function, i.e. f_{G,w,σ_RELU} = f_{G,ρ_{c,v}(w),σ_RELU}.
We say that two networks with weights w and w̃ are rescaling equivalent, denoted by w ∼ w̃, if and only
if one of them can be transformed into the other by applying a sequence of rescaling functions ρ_{c,v}.
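To make the rescaling function concrete, the following is a minimal NumPy sketch (ours, not the authors' code; the helper names forward and rescale are illustrative) that applies ρ_{c,v} to a hidden unit of a small two-layer RELU network and checks numerically that the rescaled network computes the same function:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(W1, W2, x):
    """f_{G,w,sigma_RELU}(x) for a network with one hidden layer."""
    return W2 @ relu(W1 @ x)

def rescale(W1, W2, c, v):
    """rho_{c,v}: multiply the incoming edges of hidden unit v by c
    and divide its outgoing edges by c, as in equation (1)."""
    W1c, W2c = W1.copy(), W2.copy()
    W1c[v, :] *= c    # incoming edges u -> v
    W2c[:, v] /= c    # outgoing edges v -> u
    return W1c, W2c

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))
x = rng.normal(size=3)
W1t, W2t = rescale(W1, W2, c=100.0, v=2)

# Rescaling-equivalent networks compute the same function.
assert np.allclose(forward(W1, W2, x), forward(W1t, W2t, x))
```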
Given a training set S = {(x_1, y_1), . . . , (x_n, y_n)}, our goal is to minimize the following objective
function:

    L(w) = (1/n) Σ_{i=1}^n ℓ(f_w(x_i), y_i).        (2)

Let w^(t) be the weights at step t of the optimization. We consider update steps of the form w^(t+1) =
w^(t) + Δw^(t+1). For example, for gradient descent, we have Δw^(t+1) = −η∇L(w^(t)), where η is the step-size.
In the stochastic setting, such as SGD or mini-batch gradient descent, we calculate the gradient on a
small subset of the training set.
Since rescaling-equivalent networks compute the same function, it is desirable to have an update rule that
is not affected by rescaling. We call an optimization method rescaling invariant if the updates of rescaling-equivalent
networks are rescaling equivalent. That is, if we start at either one of two rescaling-equivalent
weight vectors w̃^(0) ∼ w^(0), then after applying t update steps separately to w̃^(0) and w^(0), they remain
rescaling equivalent and we have w̃^(t) ∼ w^(t).

[Figure 1 appears here. Panels: (a) Training on MNIST; (b) Weight explosion in an unbalanced network under an SGD update vs. a rescaling; (c) Poor updates in an unbalanced network.]

Figure 1: (a): Evolution of the cross-entropy error function when training a feed-forward network on MNIST with
two hidden layers, each containing 4000 hidden units. The unbalanced initialization (blue curve) is generated by
applying a sequence of rescaling functions on the balanced initializations (red curve). (b): Updates for a simple case
where the input is x = 1, thresholds are set to zero (constant), the stepsize is 1, and the gradient with respect to output
is δ = −1. (c): Updated network for the case where the input is x = (1, 1), thresholds are set to zero (constant), the
stepsize is 1, and the gradient with respect to output is δ = (−1, −1).

Unfortunately, gradient descent is not rescaling invariant. The main problem with the gradient updates
is that scaling down the weight of an edge also scales up its gradient which, as we see later, is exactly
the opposite of what is expected from a rescaling-invariant update.
Furthermore, gradient descent performs very poorly on “unbalanced” networks. We say that a network
is balanced if the norms of the incoming weights to different units are roughly the same or within a small range.
For example, Figure 1(a) shows a huge gap in the performance of SGD initialized with a randomly generated
balanced network w^(0), when training on MNIST, compared to a network initialized with unbalanced
weights w̃^(0). Here w̃^(0) is generated by applying a sequence of random rescaling functions on w^(0) (and
therefore w^(0) ∼ w̃^(0)).
In an unbalanced network, gradient descent updates could blow up the smaller weights, while keeping
the larger weights almost unchanged. This is illustrated in Figure 1(b). If this were the only issue, one could
scale down all the weights after each update. However, in an unbalanced network, the relative changes in the
weights are also very different compared to a balanced network. For example, Figure 1(c) shows how two
rescaling equivalent networks could end up computing a very different function after only a single update.

3 Magnitude/Scale measures for deep networks
Following [12], we consider the grouping of weights going into each node of the network. This forms the
following generic group-norm type regularizer, parametrized by 1 ≤ p, q ≤ ∞:
  q/p 1/q
X  X p
µp,q (w) =  w(u→v)  . (3)

v∈V (u→v)∈E

Two simple cases of the above group-norm are p = q = 1 and p = q = 2, which correspond to overall ℓ1 regularization
and weight decay respectively. Another form of regularization that has been shown to be very effective in
RELU networks is max-norm regularization, which is the maximum over all units of the norm of the incoming
edges to the unit¹ [3, 15]. The max-norm corresponds to “per-unit” regularization when we set q = ∞ in
equation (3) and can be written in the following form:

    μ_{p,∞}(w) = sup_{v∈V} ( Σ_{(u→v)∈E} |w(u→v)|^p )^{1/p}        (4)
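For illustration, here is a small NumPy sketch (ours; the helper names mu_pq and mu_p_inf are hypothetical) that evaluates μ_{p,q} and μ_{p,∞} for a layered network in which row v of each weight matrix holds the incoming weights of unit v:

```python
import numpy as np

def unit_norms(weights, p):
    """Per-unit l_p norms of incoming weights, over all non-input units."""
    return np.concatenate(
        [np.sum(np.abs(W) ** p, axis=1) ** (1.0 / p) for W in weights])

def mu_pq(weights, p, q):
    """Group norm of equation (3)."""
    return np.sum(unit_norms(weights, p) ** q) ** (1.0 / q)

def mu_p_inf(weights, p):
    """Per-unit max-norm of equation (4), i.e. q = infinity."""
    return np.max(unit_norms(weights, p))

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
print(mu_pq(weights, p=2, q=2))   # p = q = 2: weight decay (overall l_2 norm)
print(mu_p_inf(weights, p=2))     # per-unit l_2 max-norm
```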

Weight decay is probably the most commonly used regularizer. Per-unit regularization, on the other hand,
might not seem ideal, as it is very extreme in the sense that the value of the regularizer is determined solely
by the unit with the highest norm. However, the situation is very different for networks with RELU activations
(and other activation functions with the non-negative homogeneity property). In these cases, per-unit ℓ2 regularization
has been shown to be very effective [15]. The main reason may be that RELU networks can be
rebalanced in such a way that all hidden units have the same norm. Hence, per-unit regularization is no
longer such a crude measure.
Since μ_{p,∞} is not rescaling invariant, and the values of the scale measure differ across rescaling-equivalent
networks, it is desirable to look for the minimum value of the regularizer among all rescaling-equivalent
networks. Surprisingly, for a feed-forward network, the minimum ℓp per-unit regularizer among
all rescaling-equivalent networks can be computed efficiently by a single forward step. To see this, we
consider the vector π(w), the path vector, where the number of coordinates of π(w) is equal to the total
number of paths from the input to output units, and each coordinate of π(w) is equal to the product of the
weights along a path from an input node to an output node. The ℓp-path regularizer is then defined as the
ℓp norm of π(w) [12]:
    φ_p(w) = ‖π(w)‖_p = ( Σ_{v_in[i] →e_1 v_1 →e_2 v_2 . . . →e_d v_out[j]} | Π_{k=1}^d w_{e_k} |^p )^{1/p}        (5)

The following lemma establishes that the ℓp-path regularizer corresponds to the minimum over all rescaling-equivalent
networks of the per-unit ℓp norm, raised to the power d:

Lemma 3.1 ([12]).  φ_p(w) = min_{w̃∼w} ( μ_{p,∞}(w̃) )^d

¹ This definition of max-norm is a bit different from the one used in the context of matrix factorization [13]. The latter is similar
to the minimum upper bound on the ℓ2 norms of both the outgoing edges of the input units and the incoming edges of the output units
in a two-layer feed-forward network.

The definition (5) of the `p -path regularizer involves an exponential number of terms. But it can be
computed efficiently by dynamic programming in a single forward step using the following equivalent form
as nested sums:
    φ_p(w) = ( Σ_{(v_{d−1}→v_out[j])∈E} |w(v_{d−1}→v_out[j])|^p  Σ_{(v_{d−2}→v_{d−1})∈E} · · ·  Σ_{(v_in[i]→v_1)∈E} |w(v_in[i]→v_1)|^p )^{1/p}

A straightforward consequence of Lemma 3.1 is that the `p path-regularizer φp is invariant to rescaling, i.e.
for any w̃ ∼ w, φp (w̃) = φp (w).
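As a sketch (ours, specialized to layered networks), the nested-sum form above turns into one matrix-vector product per layer; a brute-force path enumeration on a tiny network confirms the result:

```python
import numpy as np

def phi_p(weights, p=2.0):
    """l_p path regularizer via a single forward pass.
    weights[k][v, u] is the weight of edge u -> v between layers k, k+1;
    gamma[u] accumulates, over paths ending at u, the sum of prod_k |w_{e_k}|^p."""
    gamma = np.ones(weights[0].shape[1])      # gamma = 1 at the input nodes
    for W in weights:                         # one step per layer
        gamma = (np.abs(W) ** p) @ gamma
    return np.sum(gamma) ** (1.0 / p)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]

# Brute-force check: enumerate all input -> hidden -> output paths.
brute = sum(abs(weights[1][j, v] * weights[0][v, i]) ** 2
            for i in range(3) for v in range(4) for j in range(2))
assert np.isclose(phi_p(weights, p=2.0), brute ** 0.5)
```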

4 Path-SGD: An Approximate Path-Regularized Steepest Descent

Motivated by the empirical performance of max-norm regularization and the fact that the path-regularizer is
invariant to rescaling, we are interested in deriving the steepest descent direction with respect to the path
regularizer φ_p(w):
    w^(t+1) = arg min_w  η ⟨∇L(w^(t)), w⟩ + (1/2) ‖π(w) − π(w^(t))‖_p²        (6)

            = arg min_w  η ⟨∇L(w^(t)), w⟩ + (1/2) ( Σ_{v_in[i] →e_1 v_1 →e_2 v_2 . . . →e_d v_out[j]} | Π_{k=1}^d w_{e_k} − Π_{k=1}^d w_{e_k}^(t) |^p )^{2/p}

            = arg min_w  J^(t)(w)

The steepest descent step (6) is hard to calculate exactly. Instead, we will update each coordinate w_e independently
(and synchronously) based on (6). That is:

    w_e^(t+1) = arg min_{w_e} J^(t)(w)   s.t.   ∀ e′ ≠ e :  w_{e′} = w_{e′}^(t)        (7)

Taking the partial derivative with respect to w_e and setting it to zero, we obtain:

    0 = η (∂L/∂w_e)(w^(t)) − ( w_e − w_e^(t) ) ( Σ_{v_in[i] · · · →e . . . v_out[j]} Π_{e_k ≠ e} | w_{e_k}^(t) |^p )^{2/p}

where v_in[i] · · · →e . . . v_out[j] denotes the paths from any input unit i to any output unit j that include e.
Solving for w_e gives us the following update rule:

    ŵ_e^(t+1) = w_e^(t) − ( η / γ_p(w^(t), e) ) (∂L/∂w_e)(w^(t))        (8)

where γ_p(w, e) is given by

    γ_p(w, e) = ( Σ_{v_in[i] · · · →e . . . v_out[j]} Π_{e_k ≠ e} | w_{e_k} |^p )^{2/p}        (9)

We call the optimization using the update rule (8) path-normalized gradient descent. When used in stochastic
settings, we refer to it as Path-SGD.
Now that we know Path-SGD is an approximate steepest descent with respect to the path-regularizer, we
can ask whether or not this makes Path-SGD a rescaling-invariant optimization method. The next theorem
proves that Path-SGD is indeed rescaling invariant.

Algorithm 1 Path-SGD update rule
 1: ∀ v ∈ V_in^0 :  γ_in(v) = 1                                   ▷ Initialization
 2: ∀ v ∈ V_out^0 : γ_out(v) = 1
 3: for i = 1 to d do
 4:    ∀ v ∈ V_in^i :  γ_in(v) = Σ_{(u→v)∈E} γ_in(u) |w(u,v)|^p
 5:    ∀ v ∈ V_out^i : γ_out(v) = Σ_{(v→u)∈E} |w(v,u)|^p γ_out(u)
 6: end for
 7: ∀ (u→v) ∈ E :  γ(w^(t), (u, v)) = γ_in(u)^{2/p} γ_out(v)^{2/p}
 8: ∀ e ∈ E :  w_e^(t+1) = w_e^(t) − ( η / γ(w^(t), e) ) (∂L/∂w_e)(w^(t))        ▷ Update Rule
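For concreteness, here is a minimal NumPy sketch of Algorithm 1 (ours, not the authors' implementation), specialized to fully-connected layered networks; `grads` stands in for the gradients ∂L/∂w_e computed by backpropagation, and the function name path_sgd_update is illustrative:

```python
import numpy as np

def path_sgd_update(weights, grads, lr=0.01, p=2.0):
    """One step of Algorithm 1 for a layered network.
    weights[k][v, u]: weight of edge u -> v between layers k and k+1;
    grads[k]: dL/dw for the corresponding entries."""
    # Lines 1-6: gamma_in by a forward pass, gamma_out by a backward pass.
    gamma_in = [np.ones(weights[0].shape[1])]
    for W in weights:
        gamma_in.append((np.abs(W) ** p) @ gamma_in[-1])
    gamma_out = [np.ones(weights[-1].shape[0])]
    for W in reversed(weights):
        gamma_out.insert(0, (np.abs(W) ** p).T @ gamma_out[0])
    # Lines 7-8: per-edge scaling gamma(w, (u, v)) and the normalized step.
    new_weights = []
    for k, (W, G) in enumerate(zip(weights, grads)):
        gamma = np.outer(gamma_out[k + 1], gamma_in[k]) ** (2.0 / p)
        new_weights.append(W - lr * G / gamma)
    return new_weights
```

With p = 2 this corresponds to the ℓ2-Path-SGD used in the experiments of Section 5; the two extra passes over the weights cost about as much as backpropagation on a single example, matching the runtime discussion below.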

Theorem 4.1. Path-SGD is rescaling invariant.

Proof. It is sufficient to prove that, using the update rule (8), for any c > 0 and any v ∈ V, if w̃^(t) =
ρ_{c,v}(w^(t)), then w̃^(t+1) = ρ_{c,v}(w^(t+1)). For any edge e in the network, if e is neither an incoming nor an
outgoing edge of the node v, then w̃(e) = w(e), and since the gradient is also the same for edge e, we have
w̃_e^(t+1) = w_e^(t+1). However, if e is an incoming edge to v, we have w̃^(t)(e) = c w^(t)(e). Moreover,
since the outgoing edges of v are divided by c, we get γ_p(w̃^(t), e) = γ_p(w^(t), e)/c² and
(∂L/∂w_e)(w̃^(t)) = (1/c)(∂L/∂w_e)(w^(t)). Therefore,

    w̃_e^(t+1) = c w_e^(t) − ( c² η / γ_p(w^(t), e) ) · (1/c)(∂L/∂w_e)(w^(t))
              = c ( w_e^(t) − ( η / γ_p(w^(t), e) ) (∂L/∂w_e)(w^(t)) ) = c w_e^(t+1).

A similar argument proves the invariance of the Path-SGD update rule for outgoing edges of v. Therefore,
Path-SGD is rescaling invariant.
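The invariance can also be checked numerically. The following sketch continues the earlier snippets (reusing the illustrative rescale and path_sgd_update helpers and the two-layer network W1, W2 defined above); it relies on the fact, used in the proof, that under ρ_{c,v} the gradients of the incoming edges of v scale by 1/c and those of the outgoing edges by c:

```python
# Dummy gradients standing in for backpropagation on some loss.
G1, G2 = rng.normal(size=W1.shape), rng.normal(size=W2.shape)
c, v = 100.0, 2
G1t, G2t = G1.copy(), G2.copy()
G1t[v, :] /= c          # gradients of incoming edges of v scale by 1/c
G2t[:, v] *= c          # gradients of outgoing edges of v scale by c

# Update the original network, then rescale the result ...
U1, U2 = path_sgd_update([W1, W2], [G1, G2], lr=0.1)
R1, R2 = rescale(U1, U2, c, v)
# ... and compare with updating the rescaled network directly.
V1, V2 = path_sgd_update([W1t, W2t], [G1t, G2t], lr=0.1)
assert np.allclose(R1, V1) and np.allclose(R2, V2)
```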

Efficient Implementation: The Path-SGD update rule (8), in the way it is written, needs to consider all
the paths, the number of which is exponential in the depth of the network. However, it can be calculated in a time that is no
more than a forward-backward step on a single data point. That is, in a mini-batch setting with batch size B,
if the backpropagation on the mini-batch can be done in time BT , the running time of the Path-SGD on the
mini-batch will be roughly (B + 1)T – a very moderate runtime increase with typical mini-batch sizes of
hundreds or thousands of points. Algorithm 1 shows an efficient implementation of the Path-SGD update
rule.
We next compare Path-SGD to other optimization methods in both balanced and unbalanced settings.

5 Experiments
In this section, we compare ℓ2-Path-SGD to two commonly used optimization methods in deep learning,
SGD and AdaGrad. We conduct our experiments on four common benchmark datasets: the standard MNIST
dataset of handwritten digits [8]; the CIFAR-10 and CIFAR-100 datasets of tiny images of natural scenes [7];
and the Street View House Numbers (SVHN) dataset containing color images of house numbers collected by
Google Street View [10]. Details of the datasets are shown in Table 1.
In all of our experiments, we trained feed-forward networks with two hidden layers, each containing
4000 hidden units. We used mini-batches of size 100 and a step-size of 10^−α, where α is an integer
between 0 and 10. To choose α, for each dataset, we considered the validation errors over the validation set
Table 1: General information on datasets used in the experiments.

Data Set Dimensionality Classes Training Set Test Set


CIFAR-10 3072 (32 × 32 color) 10 50000 10000
CIFAR-100 3072 (32 × 32 color) 100 50000 10000
MNIST 784 (28 × 28 grayscale) 10 60000 10000
SVHN 3072 (32 × 32 color) 10 73257 26032

[Figure 2 appears here: for each of CIFAR-10, CIFAR-100, MNIST and SVHN, three panels (cross-entropy training loss, 0/1 training error, 0/1 test error) over 100 epochs, comparing Path-SGD (unbalanced), SGD (balanced/unbalanced) and AdaGrad (balanced/unbalanced).]

Figure 2: Learning curves using different optimization methods for 4 datasets without dropout. Left panel displays
the cross-entropy objective function; middle and right panels show the corresponding values of the training and test
errors, where the values are reported at different epochs during the course of optimization. Best viewed in color.

(10000 randomly chosen points that are kept out during the initial training) and picked the value that reached
the minimum error fastest. We then trained the network over the entire training set. All the networks were
trained both with and without dropout. When training with dropout, at each update step, we retained each

[Figure 3 appears here: for each of CIFAR-10, CIFAR-100, MNIST and SVHN, three panels (cross-entropy training loss, 0/1 training error, 0/1 test error) over 100 epochs with dropout, comparing Path-SGD, SGD and AdaGrad.]

Figure 3: Learning curves using different optimization methods for 4 datasets with dropout. Left panel displays the
cross-entropy objective function; middle and right panels show the corresponding values of the training and test errors.
Best viewed in color.

unit with probability 0.5.


We tried both balanced and unbalanced initializations. In balanced initialization, incomingpweights to
each unit v are initialized to i.i.d samples from a Gaussian distribution with standard deviation 1/ fan-in(v).
In the unbalanced setting, we first initialized the weights to be the same as the balanced weights. We then
picked 2000 hidden units randomly with replacement. For each unit, we multiplied its incoming edge and
divided its outgoing edge by 10c, where c was chosen randomly from log-normal distribution.
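As an illustration, the following sketch (ours; the exact parameters of the log-normal distribution are an assumption, as they are not specified in the text) produces the balanced and unbalanced initializations for the architecture used here:

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_init(layer_sizes):
    """Incoming weights of each unit: i.i.d. N(0, 1/fan-in)."""
    return [rng.normal(scale=np.sqrt(1.0 / n_in), size=(n_out, n_in))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def unbalance(weights, n_rescale=2000):
    """Apply rescalings rho_{10c,v} to hidden units chosen randomly with
    replacement; the log-normal parameters below are assumptions."""
    hidden = [(k, v) for k in range(len(weights) - 1)
              for v in range(weights[k].shape[0])]
    for _ in range(n_rescale):
        k, v = hidden[rng.integers(len(hidden))]
        c = 10.0 * rng.lognormal(mean=0.0, sigma=1.0)
        weights[k][v, :] *= c        # incoming edges of hidden unit v
        weights[k + 1][:, v] /= c    # outgoing edges of hidden unit v
    return weights

balanced = balanced_init([784, 4000, 4000, 10])   # MNIST network used here
unbalanced = unbalance([W.copy() for W in balanced])
```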
The optimization results without dropout are shown in Figure 2. For each of the four datasets, the plots
of the objective function (cross-entropy), the training error and the test error are shown from left to right, where
in each plot the values are reported at different epochs during the optimization. Although we proved that
Path-SGD updates are equivalent for balanced and unbalanced initializations, to verify that despite numerical
issues they are indeed identical, we trained Path-SGD with both balanced and unbalanced initializations.
Since the curves were exactly the same, we only show a single curve.
We can see that as expected, the unbalanced initialization considerably hurts the performance of SGD

and AdaGrad (in many cases their training and test errors are not even within the displayed range of the plots),
while Path-SGD performs essentially the same. Another interesting observation is that even in
the balanced settings, not only does Path-SGD often reach the same value of the objective function, training
error and test error faster, but the final generalization error for Path-SGD is also sometimes considerably lower
than for SGD and AdaGrad (except on CIFAR-100, where the generalization error for SGD is slightly better
compared to Path-SGD). The plots of the test errors could also imply that the implicit regularization due to steepest
descent with respect to the path-regularizer leads to a solution that generalizes better. This view is similar to
observations in [11] on the role of implicit regularization in deep learning.
The results for training with dropout are shown in Figure 3, where we suppressed the (very poor)
results on unbalanced initializations. We observe that except for MNIST, Path-SGD converges much
faster than SGD or AdaGrad. It also generalizes better to the test set, which again shows the effectiveness
of path-normalized updates.
The results suggest that Path-SGD outperforms SGD and AdaGrad in two different ways. First, it can
achieve the same accuracy much faster, and second, the implicit regularization by Path-SGD leads to a local
minimum that can generalize better even when the training error is zero. This can be analyzed further by
looking at the plots over more epochs, which we provide in Figures 4 and 5. We should also
point out that Path-SGD can easily be combined with AdaGrad to take advantage of an adaptive stepsize, or
used together with a momentum term. This could potentially perform even better than Path-SGD alone.

6 Discussion
We revisited the choice of the Euclidean geometry on the weights of RELU networks, suggested an alterna-
tive optimization method approximately corresponding to a different geometry, and showed that using such
an alternative geometry can be beneficial. In this work we show proof-of-concept success, and we expect
Path-SGD to be beneficial also in large-scale training for very deep convolutional networks. Combining
Path-SGD with AdaGrad, with momentum or with other optimization heuristics might further enhance re-
sults.
Although we do believe Path-SGD is a very good optimization method, and an easy plug-in replacement for SGD,
we hope this work will also inspire others to consider other geometries, other regularizers and perhaps better
update rules. A particular property of Path-SGD is its rescaling invariance, which we argue is appropriate
for RELU networks. But Path-SGD is certainly not the only rescaling-invariant update possible, and other
invariant geometries might be even better.
Finally, we chose to use steepest descent because of its simplicity of implementation. A better choice
might be mirror descent with respect to an appropriate potential function, but such a construction seems
particularly challenging given the non-convexity of neural networks.

Acknowledgments
Research was partially funded by NSF award IIS-1302662 and Intel ICRI-CI. We thank Hao Tang for in-
sightful discussions.

References
[1] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In AISTATS, 2010.

[3] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Max-
out networks. In Proceedings of the 30th International Conference on Machine Learning, ICML, pages
1319–1327, 2013.

[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. arXiv preprint arXiv:1502.01852, 2015.

[5] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[6] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[7] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Com-
puter Science Department, University of Toronto, Tech. Rep, 1(4):7, 2009.

[8] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[9] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate
curvature. In ICML, 2015.

[10] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading
digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and
unsupervised feature learning, 2011.

[11] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the
role of implicit regularization in deep learning. International Conference on Learning Representations
(ICLR) workshop track, 2015.

[12] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural net-
works. COLT, 2015.

[13] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In Learning Theory, pages
545–560. Springer, 2005.

[14] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In
Advances in neural information processing systems, pages 2645–2653, 2011.

[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning
Research, 15(1):1929–1958, 2014.

[16] I. Sutskever, J. Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and
momentum in deep learning. In ICML, 2013.

[Figure 4 appears here: the same panels as Figure 2, extended to 300 epochs.]

Figure 4: Learning curves over more epochs using different optimization methods for 4 datasets without
dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding
values of the training and test errors, where the values are reported at different epochs during the course of
optimization. Best viewed in color.

[Figure 5 appears here: the same panels as Figure 3, extended to 400 epochs.]

Figure 5: Learning curves over more epochs using different optimization methods for 4 datasets with
dropout. Left panel displays the cross-entropy objective function; middle and right panels show the corresponding
values of the training and test errors. Best viewed in color.

