
AdaNet: Adaptive Structural Learning of Artificial Neural Networks

Corinna Cortes 1   Xavier Gonzalvo 1   Vitaly Kuznetsov 1   Mehryar Mohri 2,1   Scott Yang 2

Abstract

We present new algorithms for adaptively learning artificial neural networks. Our algorithms (AdaNet) adaptively learn both the structure of the network and its weights. They are based on a solid theoretical analysis, including data-dependent generalization guarantees that we prove and discuss in detail. We report the results of large-scale experiments with one of our algorithms on several binary classification tasks extracted from the CIFAR-10 dataset and on the Criteo dataset. The results demonstrate that our algorithm can automatically learn network structures with very competitive performance accuracies when compared with those achieved by neural networks found by standard approaches.

1. Introduction

Multi-layer artificial neural networks form a powerful learning model which has helped achieve remarkable performance in several applications in recent years. Representing the input through increasingly more abstract layers of feature representation has shown to be very effective in natural language processing, image captioning, speech recognition and several other areas (Krizhevsky et al., 2012; Sutskever et al., 2014). However, despite the compelling arguments for adopting multi-layer neural networks as a general template for tackling learning problems, training these models and designing the right network for a given task has raised several theoretical questions and faced numerous practical challenges.

A critical step in learning a large multi-layer neural network for a specific task is the choice of its architecture, which includes the number of layers and the number of units within each layer. Standard training methods for neural networks return a model admitting precisely the number of layers and units specified, since there needs to be at least one path through the network for the hypothesis to be non-trivial. Single weights can be pruned (Han et al., 2015) via a technique originally termed Optimal Brain Damage (LeCun et al., 1990), but the global architecture remains unchanged. Thus, this imposes a stringent lower bound on the complexity of the model, which may not match that of the learning task considered: complex networks trained with insufficient data may be prone to overfitting and, in reverse, simpler architectures may not suffice to achieve adequate performance.

This places a considerable burden on the user, who is left with the requirement to specify an architecture with the right complexity, which is often a difficult task even with a significant level of experience and domain knowledge. As a result, the choice of the network is typically left to a hyperparameter search using a validation set. This search space can quickly become exorbitantly large (Szegedy et al., 2015; He et al., 2015) and large-scale hyperparameter tuning to find an effective network architecture is often wasteful of data, time, and resources (e.g., grid search, random search (Bergstra et al., 2011)).

This paper seeks precisely to address some of these issues. We present a theoretical analysis of the problem of learning simultaneously both the network architecture and its parameters. To the best of our knowledge, our results are the first generalization bounds for the problem of structural learning of neural networks. These general guarantees can guide the design of a variety of different algorithms for learning in this setting. We describe in detail two such algorithms (the AdaNet algorithms) that directly benefit from our theory.

Rather than enforcing a pre-specified architecture and thus a fixed network complexity, our AdaNet algorithms adaptively learn the appropriate network architecture for a learning task. Starting from a simple linear model, our algorithms incrementally augment the network with more units and additional layers, as needed. The choice of the additional subnetworks depends on their complexity and is directly guided by our learning guarantees. Remarkably, the optimization problems for both of our algorithms turn out to be strongly convex and thus guaranteed to admit a unique global solution.

1 Google Research, New York, NY, USA; 2 Courant Institute of Mathematical Sciences, New York, NY, USA. Correspondence to: Vitaly Kuznetsov <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Figure 1. An example of a general network architecture: the output layer (in green) is connected to all of the hidden units as well as some input units. Some hidden units (in red and yellow) are connected not only to the units in the layer directly below, but also to units at other levels.

The paper is organized as follows. In Appendix A, we give a detailed discussion of previous work related to this topic. Section 2 describes the general network architecture and therefore the hypothesis set that we consider. Section 3 provides a formal description of our learning scenario. In Section 4, we prove strong generalization guarantees for learning in this setting, which help guide the design of the algorithm described in Section 5, as well as a variant described in Appendix C. We report the results of our experiments with AdaNet in Section 6.

2. Network architecture

In this section, we describe the general network architecture we consider for feedforward neural networks, thereby also defining our hypothesis set. To simplify the presentation, we restrict our attention to the case of binary classification. However, all our results can be straightforwardly extended to multi-class classification, including the network architecture, by augmenting the number of output units, and our generalization guarantees, by using existing multi-class counterparts of the binary classification ensemble margin bounds we use.

A common model for feedforward neural networks is the multi-layer architecture where units in each layer are only connected to those in the layer below. We will consider more general architectures where a unit can be connected to units in any of the layers below, as illustrated by Figure 1. In particular, the output unit in our network architectures can be connected to any other unit. These more general architectures include as special cases standard multi-layer networks (by zeroing out appropriate connections) as well as somewhat more exotic ones (He et al., 2015; Huang et al., 2016). In fact, our definition covers any architecture that can be represented as a directed acyclic graph (DAG).

More formally, the artificial neural networks we consider are defined as follows. Let l denote the number of intermediate layers in the network and n_k the maximum number of units in layer k ∈ [l]. Each unit j ∈ [n_k] in layer k represents a function denoted by h_{k,j} (before composition with an activation function). Let X denote the input space and, for any x ∈ X, let Ψ(x) ∈ R^{n_0} denote the corresponding feature vector. Then, the family of functions defined by the first-layer functions h_{1,j}, j ∈ [n_1], is the following:

    H_1 = \{ x \mapsto u \cdot \Psi(x) : u \in \mathbb{R}^{n_0}, \|u\|_p \le \Lambda_{1,0} \},    (1)

where p ≥ 1 defines an l_p-norm and Λ_{1,0} ≥ 0 is a hyperparameter on the weights connecting layer 0 and layer 1. The family of functions h_{k,j}, j ∈ [n_k], in a higher layer k > 1 is then defined as follows:

    H_k = \Big\{ x \mapsto \sum_{s=1}^{k-1} u_s \cdot (\varphi_s \circ h_s)(x) : u_s \in \mathbb{R}^{n_s}, \|u_s\|_p \le \Lambda_{k,s}, h_{k,s} \in H_s \Big\},    (2)

where, for each unit function h_{k,s}, u_s in (2) denotes the vector of weights for connections from that unit to a lower layer s < k. The Λ_{k,s}'s are non-negative hyperparameters and φ_s ∘ h_s abusively denotes a coordinate-wise composition: φ_s ∘ h_s = (φ_s ∘ h_{s,1}, ..., φ_s ∘ h_{s,n_s}). The φ_s's are assumed to be 1-Lipschitz activation functions. In particular, they can be chosen to be the Rectified Linear Unit function (ReLU function) x ↦ max{0, x}, or the sigmoid function x ↦ 1/(1 + e^{-x}). The choice of the parameter p ≥ 1 determines the sparsity of the network and the complexity of the hypothesis sets H_k.

For the networks we consider, the output unit can be connected to all intermediate units, which therefore defines a function f as follows:

    f = \sum_{k=1}^{l} \sum_{j=1}^{n_k} w_{k,j} h_{k,j} = \sum_{k=1}^{l} w_k \cdot h_k,    (3)

where h_k = [h_{k,1}, ..., h_{k,n_k}]^T ∈ H_k^{n_k} and w_k ∈ R^{n_k} is the vector of connection weights to units of layer k. Observe that, for u_s = 0 for s < k − 1 and w_k = 0 for k < l, our architecture coincides with standard multi-layer feedforward ones.
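
To make the definitions in (1)-(3) concrete, the following is a minimal sketch, not the authors' implementation, of a forward pass through such a generalized architecture: layer 1 reads the feature vector Ψ(x), every higher layer may read the activated outputs of any subset of lower layers, and the output unit is a weighted sum over all intermediate units as in (3). The data layout (dictionaries keyed by layer index) and the function names are assumptions made here for illustration.

```python
import numpy as np

def forward(psi_x, U, W, phi=lambda z: np.maximum(0.0, z)):
    """Forward pass for the generalized architecture of Eqs. (1)-(3) (sketch).

    psi_x : feature vector Psi(x), shape (n0,).
    U     : dict of dicts; U[k][s] is the weight matrix from layer s (< k) to
            layer k, shape (n_k, n_s).  U[1] = {0: ...} connects the input.
    W     : dict; W[k] is the output-connection vector w_k, shape (n_k,).
    phi   : 1-Lipschitz activation applied to lower-layer units (ReLU here).
    """
    h = {0: np.asarray(psi_x)}           # layer 0 holds the raw features
    for k in sorted(U):                  # layers 1, 2, ..., l
        # each unit in layer k may read any lower layer listed in U[k]
        h[k] = sum(U[k][s] @ (h[s] if s == 0 else phi(h[s])) for s in U[k])
    return sum(W[k] @ h[k] for k in W)   # output unit: Eq. (3)
```

For instance, U = {1: {0: A}, 2: {1: B}} with W = {1: w1, 2: w2} is an ordinary two-layer network whose output unit also reads layer 1 directly, matching the observation above about recovering standard feedforward architectures.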

We will denote by F the family of functions f defined by (3) with the absolute value of the weights summing to one:

    F = \Big\{ \sum_{k=1}^{l} w_k \cdot h_k : h_k \in H_k^{n_k}, \ \sum_{k=1}^{l} \|w_k\|_1 = 1 \Big\}.

Let H̃_k denote the union of H_k and its reflection, H̃_k = H_k ∪ (−H_k), and let H denote the union of the families H̃_k: H = ∪_{k=1}^{l} H̃_k. Then, F coincides with the convex hull of H: F = conv(H).

For any k ∈ [l], we will also consider the family H*_k derived from H_k by setting Λ_{k,s} = 0 for s < k − 1, which corresponds to units connected only to the layer below.

We similarly define H̃*_k = H*_k ∪ (−H*_k) and H* = ∪_{k=1}^{l} H*_k, and define F* to be the convex hull F* = conv(H*). Note that the architecture corresponding to the family of functions F* is still more general than standard feedforward neural network architectures, since the output unit can be connected to units in different layers.

3. Learning problem

We consider the standard supervised learning scenario and assume that training and test points are drawn i.i.d. according to some distribution D over X × {−1, +1}, and denote by S = ((x_1, y_1), ..., (x_m, y_m)) a training sample of size m drawn according to D^m.

For a function f taking values in R, we denote by R(f) = E_{(x,y)∼D}[1_{yf(x)≤0}] its generalization error and, for any ρ > 0, by R̂_{S,ρ}(f) its empirical margin error on the sample S:

    \hat{R}_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}.
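
As a concrete reading of this definition, here is a minimal sketch (my own helper, not from the paper) that computes the empirical margin error of a scoring function f on a labeled sample; with rho = 0 it reduces to the ordinary training error, the empirical counterpart of R(f).

```python
import numpy as np

def empirical_margin_error(f, X, y, rho):
    """Empirical margin error: (1/m) * sum_i 1[y_i * f(x_i) <= rho]."""
    scores = np.array([f(x) for x in X])     # f(x_i) for every sample point
    return float(np.mean(np.asarray(y) * scores <= rho))
```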

The learning problem consists of using the training sample S to determine a function f defined by (3) with small generalization error R(f). For an accurate predictor f, we expect many of the weights to be zero and the corresponding architecture to be quite sparse, with fewer than n_k units at layer k and relatively few non-zero connections. In that sense, learning an accurate function f implies also learning the underlying architecture.

In the next section, we present data-dependent learning bounds for this problem that will help guide the design of our algorithms.

4. Generalization bounds

Our learning bounds are expressed in terms of the Rademacher complexities of the hypothesis sets H_k. The empirical Rademacher complexity of a hypothesis set G for a sample S is denoted by R̂_S(G) and defined as follows:

    \hat{\mathfrak{R}}_S(G) = \frac{1}{m} \, \mathbb{E}_{\sigma}\Big[ \sup_{h \in G} \sum_{i=1}^{m} \sigma_i h(x_i) \Big],

where σ = (σ_1, ..., σ_m), with the σ_i's independent uniformly distributed random variables taking values in {−1, +1}. Its Rademacher complexity is defined by R_m(G) = E_{S∼D^m}[R̂_S(G)]. These are data-dependent complexity measures that lead to finer learning guarantees (Koltchinskii & Panchenko, 2002; Bartlett & Mendelson, 2002).
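
Since the expectation over σ rarely has a closed form, it can be approximated by sampling Rademacher vectors. The sketch below is my own illustration of the definition, under the assumption that the supremum over G is approximated by a maximum over a finite pool of hypotheses evaluated on the sample.

```python
import numpy as np

def empirical_rademacher(hypothesis_outputs, n_draws=200, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity (sketch).

    hypothesis_outputs : array of shape (num_hypotheses, m); row j holds
                         (h_j(x_1), ..., h_j(x_m)) for a finite pool standing in for G.
    """
    H = np.asarray(hypothesis_outputs, dtype=float)
    m = H.shape[1]
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)     # i.i.d. uniform Rademacher variables
        total += np.max(H @ sigma) / m              # sup_h (1/m) sum_i sigma_i h(x_i) over the pool
    return total / n_draws
```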

As pointed out earlier, the family of functions F is the convex hull of H. Thus, generalization bounds for ensemble methods can be used to analyze learning with F. In particular, we can leverage the recent margin-based learning guarantees of Cortes et al. (2014), which are finer than those that can be derived via a standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), and which admit an explicit dependency on the mixture weights w_k defining the ensemble function f. That leads to the following learning guarantee.

Theorem 1 (Learning bound). Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^{l} w_k · h_k ∈ F:

    R(f) \le \hat{R}_{S,\rho}(f) + \frac{4}{\rho} \sum_{k=1}^{l} \|w_k\|_1 \, \mathfrak{R}_m(\tilde{H}_k) + \frac{2}{\rho}\sqrt{\frac{\log l}{m}} + C(\rho, l, m, \delta),

    where \ C(\rho, l, m, \delta) = \sqrt{\Big\lceil \tfrac{4}{\rho^2}\log\Big(\tfrac{\rho^2 m}{\log l}\Big)\Big\rceil \tfrac{\log l}{m} + \tfrac{\log(2/\delta)}{2m}} = \tilde{O}\Big(\tfrac{1}{\rho}\sqrt{\tfrac{\log l}{m}}\Big).

The proof of this result, as well as that of all the other main theorems, is given in Appendix B. The bound of the theorem can be generalized to hold uniformly for all ρ ∈ (0, 1], at the price of an additional term of the form \sqrt{\log\log_2(2/\rho)/m}, using standard techniques (Koltchinskii & Panchenko, 2002).

Observe that the bound of the theorem depends only logarithmically on the depth of the network l. But, perhaps more remarkably, the complexity term of the bound is a ‖w_k‖_1-weighted average of the complexities of the layer hypothesis sets H_k, where the weights are precisely those defining the network, or the function f. This suggests that a function f with a small empirical margin error and a deep architecture benefits nevertheless from a strong generalization guarantee, if it allocates more weights to lower-layer units and less to higher ones. Of course, when the weights are sparse, that will imply an architecture with relatively fewer units or connections at higher layers than at lower ones. The bound of the theorem further gives a quantitative guide for apportioning the weights, depending on the Rademacher complexities of the layer hypothesis sets.

This data-dependent learning guarantee will serve as a foundation for the design of our structural learning algorithms in Section 5 and Appendix C. However, to fully exploit it, the Rademacher complexity measures need to be made explicit. One advantage of these data-dependent measures is that they can be estimated from data, which can lead to more informative bounds. Alternatively, we can derive useful upper bounds for these measures which can be more conveniently used in our algorithms. The next results in this section provide precisely such upper bounds, thereby leading to a more explicit generalization bound.

We will denote by q the conjugate of p, that is 1/p + 1/q = 1, and define r_∞ = max_{i∈[1,m]} ‖Ψ(x_i)‖_∞.

Our first result gives an upper bound on the Rademacher complexity of H_k in terms of the Rademacher complexity of other layer families.

Lemma 1. For any k > 1, the empirical Rademacher complexity of H_k for a sample S of size m can be upper-bounded as follows in terms of those of the H_s's with s < k:

    \hat{\mathfrak{R}}_S(H_k) \le 2 \sum_{s=1}^{k-1} \Lambda_{k,s} \, n_s^{1/q} \, \hat{\mathfrak{R}}_S(H_s).

For the family H*_k, which is directly relevant to many of our experiments, the following more explicit upper bound can be derived, using Lemma 1.

Lemma 2. Let Λ_k = Π_{s=1}^{k} 2Λ_{s,s−1} and N_k = Π_{s=1}^{k} n_{s−1}. Then, for any k ≥ 1, the empirical Rademacher complexity of H*_k for a sample S of size m can be upper bounded as follows:

    \hat{\mathfrak{R}}_S(H^*_k) \le r_\infty \Lambda_k N_k^{1/q} \sqrt{\frac{\log(2 n_0)}{2m}}.

Note that N_k, which is the product of the number of units in layers below k, can be large. This suggests that values of p closer to one, that is larger values of q, could be more helpful to control complexity in such cases. More generally, similar explicit upper bounds can be given for the Rademacher complexities of subfamilies of H_k with units connected only to layers k, k − 1, ..., k − d, with d fixed, d < k. Combining Lemma 2 with Theorem 1 helps derive the following explicit learning guarantee for feedforward neural networks with an output unit connected to all the other units.
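
In the experiments reported in Section 6, the complexities r_j used by the algorithm are instantiated with this Lemma 2 upper bound. The following is a small sketch, under my own naming conventions, of how that bound can be computed from the hyperparameters Λ_{s,s−1}, the layer widths and the sample statistics.

```python
import math

def lemma2_bound(k, lambdas, layer_sizes, r_inf, m, q):
    """Upper bound of Lemma 2 on the empirical Rademacher complexity of H*_k.

    lambdas     : [Lambda_{1,0}, Lambda_{2,1}, ..., Lambda_{l,l-1}]
    layer_sizes : [n_0, n_1, ..., n_{l-1}]
    r_inf       : max_i ||Psi(x_i)||_inf over the sample
    m, q        : sample size and the conjugate exponent of p
    """
    Lambda_k = math.prod(2.0 * lam for lam in lambdas[:k])   # prod_{s=1}^k 2 Lambda_{s,s-1}
    N_k = math.prod(layer_sizes[:k])                         # prod_{s=1}^k n_{s-1}
    return r_inf * Lambda_k * N_k ** (1.0 / q) * math.sqrt(math.log(2 * layer_sizes[0]) / (2 * m))
```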

Corollary 1 (Explicit learning bound). Fix ρ > 0. Let Λ_k = Π_{s=1}^{k} 4Λ_{s,s−1} and N_k = Π_{s=1}^{k} n_{s−1}. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^{l} w_k · h_k ∈ F*:

    R(f) \le \hat{R}_{S,\rho}(f) + \frac{2}{\rho} \sum_{k=1}^{l} \|w_k\|_1 \, \bar{r}_\infty \Lambda_k N_k^{1/q} \sqrt{\frac{2\log(2 n_0)}{m}} + \frac{2}{\rho}\sqrt{\frac{\log l}{m}} + C(\rho, l, m, \delta),

where C(\rho, l, m, \delta) = \sqrt{\big\lceil \tfrac{4}{\rho^2}\log\big(\tfrac{\rho^2 m}{\log l}\big)\big\rceil \tfrac{\log l}{m} + \tfrac{\log(2/\delta)}{2m}} = \tilde{O}\big(\tfrac{1}{\rho}\sqrt{\tfrac{\log l}{m}}\big), and where \bar{r}_\infty = \mathbb{E}_{S\sim D^m}[r_\infty].

The learning bound of Corollary 1 is a finer guarantee than previous ones by Bartlett (1998), Neyshabur et al. (2015), or Sun et al. (2016). This is because it explicitly differentiates between the weights of different layers, while previous bounds treat all weights indiscriminately. This is crucial for algorithm design, since the network complexity no longer needs to grow exponentially as a function of depth. Our bounds are also more general and apply to other network architectures, such as those introduced in (He et al., 2015; Huang et al., 2016).

5. Algorithm

This section describes our algorithm, AdaNet, for adaptive learning of neural networks. AdaNet adaptively grows the structure of a neural network, balancing model complexity with empirical risk minimization. We also describe in detail in Appendix C another variant of AdaNet which admits some favorable properties.

Let x ↦ Φ(−x) be a non-increasing convex function upper-bounding the zero-one loss, x ↦ 1_{x≤0}, such that Φ is differentiable over R and Φ'(x) ≠ 0 for all x. This surrogate loss Φ may be, for instance, the exponential function Φ(x) = e^x as in AdaBoost (Freund & Schapire, 1997), or the logistic function Φ(x) = log(1 + e^x) as in logistic regression.

5.1. Objective function

Let {h_1, ..., h_N} be a subset of H*. In the most general case, N is infinite. However, as discussed later, in practice the search is limited to a finite set. For any j ∈ [N], we will denote by r_j the Rademacher complexity of the family H_{k_j} that contains h_j: r_j = R_m(H_{k_j}).

AdaNet seeks to find a function f = Σ_{j=1}^{N} w_j h_j ∈ F* (or neural network) that directly minimizes the data-dependent generalization bound of Corollary 1. This leads to the following objective function:

    F(w) = \frac{1}{m} \sum_{i=1}^{m} \Phi\Big(1 - y_i \sum_{j=1}^{N} w_j h_j(x_i)\Big) + \sum_{j=1}^{N} \Gamma_j |w_j|,    (4)

where w ∈ R^N and Γ_j = λ r_j + β, with λ ≥ 0 and β ≥ 0 hyperparameters. The objective function (4) is a convex function of w. It is the sum of a convex surrogate of the empirical error and a regularization term, which is a weighted-l1 penalty containing two sub-terms: a standard norm-1 regularization which admits β as a hyperparameter, and a term that discriminates the functions h_j based on their complexity.

The optimization problem consisting of minimizing the objective function F in (4) is defined over a very large space of base functions h_j. AdaNet consists of applying coordinate descent to (4). In that sense, our algorithm is similar to the DeepBoost algorithm of Cortes et al. (2014). However, unlike DeepBoost, which combines decision trees, AdaNet learns a deep neural network, which requires new methods for constructing and searching the space of functions h_j.

Figure 2. Illustration of the algorithm's incremental construction of a neural network. The input layer is indicated in blue, the output layer in green. Units in the yellow block are added at the first iteration while units in purple are added at the second iteration. Two candidate extensions of the architecture are considered at the third iteration (shown in red): (a) a two-layer extension; (b) a three-layer extension. Here, a line between two blocks of units indicates that these blocks are fully-connected.

Both of these aspects differ significantly from the decision tree framework. In particular, the search is particularly challenging. In fact, the main difference between the algorithm presented in this section and the variant described in Appendix C is the way new candidates h_j are examined at each iteration.

5.2. Description

We start with an informal description of AdaNet. Let B ≥ 1 be a fixed parameter determining the number of units per layer of a candidate subnetwork. The algorithm proceeds in T iterations. Let l_{t−1} denote the depth of the neural network constructed before the start of the t-th iteration. At iteration t, the algorithm selects one of the following two options:

1. augmenting the current neural network with a subnetwork h ∈ (H*_{l_{t−1}})^B with the same depth as that of the current network and B units per layer. Each unit in layer k of this subnetwork may have connections to existing units in layer k − 1 of AdaNet in addition to connections to units in layer k − 1 of the subnetwork.

2. augmenting the current neural network with a deeper subnetwork h' ∈ (H*_{l_{t−1}+1})^B, with depth l_{t−1} + 1. The set of connections allowed is defined in the same way as for h.

The option selected is the one leading to the best reduction of the current value of the objective function, which depends both on the empirical error and the complexity of the subnetwork added, which is penalized differently in these two options.

Figure 2 illustrates this construction and the two options just described. An important aspect of our algorithm is that the units of a subnetwork learned at a previous iteration (say h_{1,1} in Figure 2) can serve as input to a deeper subnetwork added later (for example h_{2,2} or h_{2,3} in the figure). Thus, the deeper subnetworks added later can take advantage of the embeddings that were learned at the previous iterations. The algorithm terminates after T rounds or if the AdaNet architecture can no longer be extended to improve the objective (4).

More formally, AdaNet is a boosting-style algorithm that applies (block) coordinate descent to (4). At each iteration of block coordinate descent, descent coordinates h (base learners in the boosting literature) are selected from the space of functions H*. These coordinates correspond to the direction of the largest decrease in (4). Once these coordinates are determined, an optimal step size in each of these directions is chosen, which is accomplished by solving an appropriate convex optimization problem.

Note that, in general, the search for the optimal descent coordinate in an infinite-dimensional space, or even in finite but large sets such as that of all decision trees of some large depth, may be intractable, and it is common to resort to a heuristic search (weak learning algorithm) that returns δ-optimal coordinates. For instance, in the case of boosting with trees one often grows trees according to some particular heuristic (Freund & Schapire, 1997).

We denote the AdaNet model after t − 1 rounds by f_{t−1}, which is parameterized by w_{t−1}. Let h_{k,t−1} denote the vector of outputs of units in the k-th layer of the AdaNet model, l_{t−1} the depth of the AdaNet architecture, and n_{k,t−1} the number of units in the k-th layer after t − 1 rounds. At round t, we select descent coordinates by considering two candidate subnetworks h ∈ H̃*_{l_{t−1}} and h' ∈ H̃*_{l_{t−1}+1} that are generated by a weak learning algorithm WeakLearner. Some choices for this algorithm in our setting are described below. Once we obtain h and h', we select one of these vectors of units, as well as a vector of weights w ∈ R^B, so that the result yields the best improvement in (4). This is equivalent to minimizing the following objective function over w ∈ R^B and u ∈ {h, h'}:

    F_t(w, u) = \frac{1}{m} \sum_{i=1}^{m} \Phi\big(1 - y_i f_{t-1}(x_i) - y_i\, w \cdot u(x_i)\big) + \Gamma_u \|w\|_1,    (5)

where Γ_u = λ r_u + β and r_u is R_m(H_{l_{t−1}}) if u = h and R_m(H_{l_{t−1}+1}) otherwise. In other words, if min_w F_t(w, h) ≤ min_w F_t(w, h'), then

    w^* = \operatorname*{argmin}_{w \in \mathbb{R}^B} F_t(w, h), \qquad h_t = h,

and otherwise

    w^* = \operatorname*{argmin}_{w \in \mathbb{R}^B} F_t(w, h'), \qquad h_t = h'.

If F(w_{t−1} + w^*) < F(w_{t−1}), then we set f_t = f_{t−1} + w^* · h_t, and otherwise we terminate the algorithm.

    AdaNet(S = ((x_i, y_i))_{i=1}^m)
     1  f_0 ← 0
     2  for t ← 1 to T do
     3      h, h' ← WeakLearner(S, f_{t−1})
     4      w ← Minimize(F_t(w, h))
     5      w' ← Minimize(F_t(w, h'))
     6      if F_t(w, h) ≤ F_t(w', h') then
     7          h_t ← h
     8      else h_t ← h'
     9      if F(w_{t−1} + w^*) < F(w_{t−1}) then
    10          f_t ← f_{t−1} + w^* · h_t
    11      else return f_{t−1}
    12  return f_T

Figure 3. Pseudocode of the AdaNet algorithm. On line 3 two candidate subnetworks are generated (e.g. randomly or by solving (6)). On lines 4 and 5, (5) is solved for each of these candidates. On lines 6-8 the best subnetwork is selected, and on lines 9-11 the termination condition is checked.
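
The loop body of Figure 3 can be read as the following sketch of one round; it is not the authors' code. Here weak_learner, step_objective (Eq. (5)) and objective (Eq. (4) evaluated on a tentatively extended model) are assumed callables, and a generic convex solver stands in for the Minimize routine.

```python
import numpy as np
from scipy.optimize import minimize

def adanet_round(objective, step_objective, weak_learner, S, f_prev, B):
    """One round of the Figure 3 loop (sketch; every callable here is an assumption).

    objective(f_prev, w, u) : value of (4) for f_prev extended by w . u
                              (w=None, u=None gives the value for f_prev alone)
    step_objective(w, u)    : F_t(w, u) from Eq. (5)
    weak_learner(S, f_prev) : returns the two candidate subnetworks h, h'
    """
    h, h_deeper = weak_learner(S, f_prev)                        # line 3
    best = None
    for u in (h, h_deeper):                                      # lines 4-5: solve (5) per candidate
        res = minimize(lambda w, u=u: step_objective(w, u), x0=np.zeros(B))
        if best is None or res.fun < best[0]:                    # lines 6-8: keep the better candidate
            best = (res.fun, res.x, u)
    _, w_star, h_t = best
    if objective(f_prev, w_star, h_t) < objective(f_prev, None, None):   # lines 9-11
        return w_star, h_t                                       # f_t = f_prev + w_star . h_t
    return None, None                                            # no improvement: terminate
```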

There are many different choices for the WeakLearner algorithm. For instance, one may generate a large number of random networks and select the one that optimizes (5). Another option is to directly minimize (5) or its regularized version:

    \tilde{F}_t(w, h) = \frac{1}{m} \sum_{i=1}^{m} \Phi\big(1 - y_i f_{t-1}(x_i) - y_i\, w \cdot h(x_i)\big) + R(w, h),    (6)

over both w and h. Here R(w, h) is a regularization term that, for instance, can be used to enforce that ‖u_s‖_p ≤ Λ_{k,s} in (2). Note that, in general, (6) is a non-convex objective. However, we do not rely on finding a global solution to the corresponding optimization problem. In fact, standard guarantees for regularized boosting only require that each h that is added to the model decreases the objective by a constant amount (i.e. it satisfies a δ-optimality condition) for a boosting algorithm to converge (Rätsch et al., 2001; Luo & Tseng, 1992).

Furthermore, the algorithm that we present in Appendix C uses a weak-learning algorithm that solves a convex subproblem at each step and that additionally has a closed-form solution. This comes at the cost of a more restricted search space for finding a descent coordinate at each step of the algorithm.

We conclude this section by observing that in our description of AdaNet we have fixed B for all iterations and only two candidate subnetworks are considered at each step. Our approach easily extends to an arbitrary number of candidate subnetworks (for instance of different depth l) as well as a varying number of units per layer B. Furthermore, selecting an optimal subnetwork among the candidates is easily parallelizable, allowing for an efficient and effective search for optimal descent directions. We also note that the choice of subnetworks need not be restricted to standard feedforward architectures, and more exotic choices can be employed, including the ones in (He et al., 2015; Huang et al., 2016). In our experiments we will restrict attention to simple feedforward subnetworks.

6. Experiments

In this section we present the results of our experiments with AdaNet. Some additional experimental results are given in Appendix D and further implementation details are presented in Appendix E.

6.1. CIFAR-10

In our first set of experiments, we used the CIFAR-10 dataset (Krizhevsky, 2009). This dataset consists of 60,000 images evenly categorized in 10 different classes. To reduce the problem to binary classification, we considered five pairs of classes: deer-truck, deer-horse, automobile-truck, cat-dog, dog-horse. Raw images have been pre-processed to obtain color histograms and histogram-of-gradient features. The result is 154 real-valued features with ranges in [0, 1].

We compared AdaNet to standard feedforward neural networks (NN) and logistic regression (LR) models. Note that convolutional neural networks are often a more natural choice for image classification problems such as CIFAR-10. However, the goal of our experiments was not to obtain state-of-the-art results for this particular task, but a proof of concept showing that our structural learning approach can be very competitive with traditional approaches for finding efficient architectures and training the corresponding networks.

Our AdaNet algorithm requires knowledge of the complexities r_j, which, in some cases, can be estimated from data. In our experiments, we used the upper bound of Lemma 2. Our algorithm admits a number of hyperparameters: the regularization hyperparameters λ and β, the number of units B in each layer of the new subnetworks that are used to extend the model at each iteration, and a bound Λ_k on the weights u in each unit. As discussed in Section 5, there are different approaches to finding candidate subnetworks in each iteration. In our experiments, we searched for candidate subnetworks by minimizing (6) with R = 0.

Table 1. Experimental results for AdaNet, NN, LR and NN-GP for different pairs of labels in CIFAR-10. Boldfaced results are statistically significant at a 5% confidence level.

    Label pair          AdaNet            LR                NN                NN-GP
    deer-truck          0.9372 ± 0.0082   0.8997 ± 0.0066   0.9213 ± 0.0065   0.9220 ± 0.0069
    deer-horse          0.8430 ± 0.0076   0.7685 ± 0.0119   0.8055 ± 0.0178   0.8060 ± 0.0181
    automobile-truck    0.8461 ± 0.0069   0.7976 ± 0.0076   0.8063 ± 0.0064   0.8056 ± 0.0138
    cat-dog             0.6924 ± 0.0129   0.6664 ± 0.0099   0.6595 ± 0.0141   0.6607 ± 0.0097
    dog-horse           0.8350 ± 0.0089   0.7968 ± 0.0128   0.8066 ± 0.0087   0.8087 ± 0.0109

This also requires a learning rate hyperparameter η. These hyperparameters have been optimized over the following ranges: λ ∈ {0, 10^-8, 10^-7, 10^-6, 10^-5, 10^-4}, B ∈ {100, 150, 250}, η ∈ {10^-4, 10^-3, 10^-2, 10^-1}. We have used a single Λ_k for all k > 1, optimized over {1.0, 1.005, 1.01, 1.1, 1.2}. For simplicity, we chose β = 0.

Neural network models also admit a learning rate η and a regularization coefficient λ as hyperparameters, as well as the number of hidden layers l and the number of units n in each hidden layer. The range of η was the same as for AdaNet, and we varied l in {1, 2, 3}, n in {100, 150, 512, 1024, 2048} and λ in {0, 10^-5, 10^-4, 10^-3, 10^-2, 10^-1}. Logistic regression only admits as hyperparameters η and λ, which were optimized over the same ranges. Note that the total number of hyperparameter settings for AdaNet and standard neural networks is exactly the same. Furthermore, the same holds for the number of hyperparameters that determine the resulting architecture of the model: Λ and B for AdaNet and l and n for neural network models. Observe that, while a particular setting of l and n determines a fixed architecture, Λ and B parameterize a structural learning procedure that may result in a different architecture depending on the data.

In addition to the grid search procedure, we have conducted a hyperparameter optimization for neural networks using Gaussian process bandits (NN-GP), which is a sophisticated Bayesian non-parametric method for response-surface modeling in conjunction with a bandit algorithm (Snoek et al., 2012). Instead of operating on a pre-specified grid, this allows one to search for hyperparameters in a given range. We used the following ranges: λ ∈ [10^-5, 1], η ∈ [10^-5, 1], l ∈ [1, 3] and n ∈ [100, 2048]. This algorithm was run for 500 trials, which is more than the number of hyperparameter settings considered by AdaNet and NN. Observe that this search procedure can also be applied to our algorithm, but we chose not to use it in this set of experiments to further demonstrate the competitiveness of our structural learning approach.

In all our experiments, we use ReLU as the activation function. NN, NN-GP and LR are trained using the stochastic gradient method with a batch size of 100 and a maximum of 10,000 iterations. The same configuration is used for solving (6). We use T = 30 for AdaNet in all our experiments, although in most cases the algorithm terminates after 10 rounds.

In each of the experiments, we used standard 10-fold cross-validation for performance evaluation and model selection. In particular, the dataset was randomly partitioned into 10 folds, and each algorithm was run 10 times, with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each i ∈ {0, ..., 9}, fold i was used for testing, fold i + 1 (mod 10) was used for validation, and the remaining folds were used for training. For each setting of the parameters, we computed the average validation error across the 10 folds, and selected the parameter setting with maximum average accuracy across validation folds. We report the average accuracy (and standard deviations) of the selected hyperparameter setting across test folds in Table 1.
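
For concreteness, the fold rotation just described amounts to the following small sketch (my own helper, not from the paper):

```python
def fold_splits(num_folds=10):
    """Yield (test_fold, validation_fold, training_folds) for each of the runs."""
    for i in range(num_folds):
        test, val = i, (i + 1) % num_folds
        train = [j for j in range(num_folds) if j not in (test, val)]
        yield test, val, train
```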

Our results show that AdaNet outperforms the other methods on each of the datasets. The average architectures for all label pairs are provided in Table 2. Note that NN and NN-GP always select a one-layer architecture. The architectures selected by AdaNet also typically admit a single layer, with fewer nodes than those selected by NN and NN-GP. However, for the more challenging problem cat-dog, AdaNet opts for a more complex model with two layers, which results in a better performance. This further illustrates how our approach helps learn network architectures in an adaptive fashion, based on the complexity of the task.

As discussed in Section 5, different heuristics can be used to generate candidate subnetworks on each iteration of AdaNet. In a second set of experiments, we varied the objective function (6), as well as the domain over which it is optimized. This allowed us to study the sensitivity of AdaNet to the choice of the heuristic used to generate candidate subnetworks. In particular, we considered the following variants of AdaNet. AdaNet.R uses R(w, h) = Γ_h ‖w‖_1 as a regularization term in (6).

Table 2. Average number of units in each layer.

    Label pair          AdaNet (1st layer)   AdaNet (2nd layer)   NN     NN-GP
    deer-truck          990                  0                    2048   1050
    deer-horse          1475                 0                    2048   488
    automobile-truck    2000                 0                    2048   1595
    cat-dog             1800                 25                   512    155
    dog-horse           1600                 0                    2048   1273

Table 3. Experimental results for different variants of AdaNet, for the deer-truck label pair in CIFAR-10.

    Algorithm    Accuracy (± std. dev.)
    AdaNet.SD    0.9309 ± 0.0069
    AdaNet.R     0.9336 ± 0.0075
    AdaNet.P     0.9321 ± 0.0065
    AdaNet.D     0.9376 ± 0.0080

As the AdaNet architecture grows, each new subnetwork is connected to all the previous subnetworks, which significantly increases the number of connections in the network and the overall complexity of the model. AdaNet.P and AdaNet.D restrict connections to existing subnetworks in different ways. AdaNet.P connects each new subnetwork only to the subnetwork that was added on the previous iteration. AdaNet.D uses dropout on the connections to previously added subnetworks. Finally, while AdaNet is based on the upper bounds on the Rademacher complexities of Lemma 2, AdaNet.SD instead uses standard deviations of the outputs of the last hidden layer on the training data as surrogates for Rademacher complexities. The advantage of using this data-dependent measure of complexity is that it eliminates the hyperparameter Λ, thereby reducing the hyperparameter search space. We report the average accuracies across test folds for the deer-truck pair in Table 3.

6.2. Criteo Click Rate Prediction

We also compared AdaNet to NN on the Criteo Click Rate Prediction dataset (https://www.kaggle.com/c/criteo-display-ad-challenge). This dataset consists of 7 days of data where each instance is an impression and a binary label (clicked or not clicked). Each impression admits 13 count features and 26 categorical features. Count features have been transformed by taking the natural logarithm. The values of categorical features appearing less than 100 times are replaced by 0. The rest of the values are then converted to integers, which are then used as keys to look up embeddings (that are trained together with each model). If the number of possible values for a feature f is d(f), then the embedding dimension is set to 6·d(f)^{1/4} for d(f) > 25. Otherwise, the embedding dimension is d(f). Missing feature values are set to 0.
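
The embedding sizing rule just described corresponds to the following small helper (my own sketch; rounding to an integer is an assumption, since the text does not specify it):

```python
def embedding_dimension(d):
    """Embedding size for a categorical feature with d possible values.

    Implements the rule above: 6 * d**(1/4) when d > 25, otherwise d.
    Rounding to the nearest integer is an assumption made here.
    """
    return int(round(6 * d ** 0.25)) if d > 25 else d
```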

Table 4. Experimental results for the Criteo dataset.

    Algorithm    Accuracy
    AdaNet       0.7846
    NN           0.7811

We split the labeled set provided in the link above into training, validation and test sets.¹ Our training set covered the first 5 days of data (32,743,299 instances) and the validation and test sets consisted of 1 day (6,548,659 instances). Gaussian process bandits were used to find the best hyperparameter settings on the validation set, both for AdaNet and NN. For AdaNet we optimized over the following hyperparameter ranges: B ∈ {125, 256, 512}, Λ ∈ [1, 1.5], η ∈ [10^-4, 10^-1], λ ∈ [10^-12, 10^-4]. For NN the ranges were as follows: l ∈ [1, 6], n ∈ {250, 512, 1024, 2048}, η ∈ [10^-5, 10^-1], λ ∈ [10^-6, 10^-1]. We trained NNs for 100,000 iterations using the mini-batch stochastic gradient method with a batch size of 512. The same configuration was used at each iteration of AdaNet to solve (6). The maximum number of hyperparameter trials was 2,000 for both methods. The results are presented in Table 4. In this experiment, NN chooses an architecture with four hidden layers and 512 units in each hidden layer. Remarkably, AdaNet achieves a better accuracy with an architecture consisting of a single layer with just 512 nodes. While the difference in performance appears to be small, it is in fact statistically significant on this challenging task.

7. Conclusion

We presented a new framework and algorithms for adaptively learning artificial neural networks. Our algorithm, AdaNet, benefits from strong theoretical guarantees. It simultaneously learns a neural network architecture and its parameters, by balancing a trade-off between model complexity and empirical risk minimization. We reported favorable experimental results demonstrating that our algorithm is able to learn network architectures that perform better than those found via a grid search. Our techniques are general and can be applied to other neural network architectures such as CNNs and RNNs.

Acknowledgments

The work of M. Mohri and that of S. Yang were partly funded by NSF awards IIS-1117591 and CCF-1535987.

¹ The test set available from this link does not include ground truth labels and therefore could not be used in our experiments.

References

Alvarez, Jose M and Salzmann, Mathieu. Learning the number of neurons in deep networks. In NIPS, 2016.

Arora, Sanjeev, Bhaskara, Aditya, Ge, Rong, and Ma, Tengyu. Provable bounds for learning some deep representations. In ICML, pp. 584–592, 2014.

Arora, Sanjeev, Liang, Yingyu, and Ma, Tengyu. Why are deep nets reversible: A simple theory, with implications for training. arXiv:1511.05653, 2015.

Baker, Bowen, Gupta, Otkrist, Naik, Nikhil, and Raskar, Ramesh. Designing neural network architectures using reinforcement learning. CoRR, 2016.

Bartlett, Peter L. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Information Theory, IEEE Transactions on, 44(2), 1998.

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.

Bergstra, James S, Bardenet, Rémi, Bengio, Yoshua, and Kégl, Balázs. Algorithms for hyper-parameter optimization. In NIPS, pp. 2546–2554, 2011.

Chen, Tianqi, Goodfellow, Ian J., and Shlens, Jonathon. Net2net: Accelerating learning via knowledge transfer. CoRR, 2015.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. arXiv:1412.0233, 2014.

Cohen, Nadav, Sharir, Or, and Shashua, Amnon. On the expressive power of deep learning: a tensor analysis. arXiv, 2015.

Cortes, Corinna, Mohri, Mehryar, and Syed, Umar. Deep boosting. In ICML, pp. 1179–1187, 2014.

Daniely, Amit, Frostig, Roy, and Singer, Yoram. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.

Eldan, Ronen and Shamir, Ohad. The power of depth for feedforward neural networks. arXiv:1512.03965, 2015.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Ha, David, Dai, Andrew M., and Le, Quoc V. Hypernetworks. CoRR, 2016.

Han, Hong-Gui and Qiao, Jun-Fei. A structure optimisation algorithm for feedforward neural network construction. Neurocomputing, 99:347–357, 2013.

Han, Song, Pool, Jeff, Tran, John, and Dally, William J. Learning both weights and connections for efficient neural networks. In NIPS, 2015.

Hardt, Moritz, Recht, Benjamin, and Singer, Yoram. Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

Huang, Gao, Liu, Zhuang, and Weinberger, Kilian Q. Densely connected convolutional networks. CoRR, 2016.

Islam, Md. Monirul, Yao, Xin, and Murase, Kazuyuki. A constructive algorithm for training cooperative neural network ensembles. IEEE Trans. Neural Networks, 14(4):820–834, 2003.

Islam, Md. Monirul, Sattar, Md. Abdus, Amin, Md. Faijul, Yao, Xin, and Murase, Kazuyuki. A new adaptive merging and growing algorithm for designing artificial neural networks. IEEE Trans. Systems, Man, and Cybernetics, Part B, 39(3):705–722, 2009.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, Anima. Generalization bounds for neural networks through tensor factorization. arXiv:1506.08473, 2015.

Kawaguchi, Kenji. Deep learning without poor local minima. In NIPS, 2016.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Kotani, Manabu, Kajiki, Akihiro, and Akazawa, Kenzo. A structural learning algorithm for multi-layered neural networks. In International Conference on Neural Networks, volume 2, pp. 1105–1110. IEEE, 1997.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.

Kuznetsov, Vitaly, Mohri, Mehryar, and Syed, Umar. Multi-class deep boosting. In NIPS, 2014.

Kwok, Tin-Yau and Yeung, Dit-Yan. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.

LeCun, Yann, Denker, John S., and Solla, Sara A. Optimal brain damage. In NIPS, 1990.

Lehtokangas, Mikko. Modelling with constructive backpropagation. Neural Networks, 12(4):707–716, 1999.

Leung, Frank HF, Lam, Hak-Keung, Ling, Sai-Ho, and Tam, Peter KS. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, 2003.

Lian, Xiangru, Huang, Yijun, Li, Yuncheng, and Liu, Ji. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, pp. 2719–2727, 2015.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In NIPS, pp. 855–863, 2014.

Luo, Zhi-Quan and Tseng, Paul. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Ma, Liying and Khorasani, Khashayar. A new strategy for adaptively constructing multilayer feedforward neural networks. Neurocomputing, 51:361–385, 2003.

Narasimha, Pramod L, Delashmit, Walter H, Manry, Michael T, Li, Jiang, and Maldonado, Francisco. An integrated growing-pruning method for feedforward network training. Neurocomputing, 71(13):2831–2847, 2008.

Neyshabur, Behnam, Tomioka, Ryota, and Srebro, Nathan. Norm-based capacity control in neural networks. In COLT, 2015.

Rätsch, Gunnar, Mika, Sebastian, and Warmuth, Manfred K. On the convergence of leveraging. In NIPS, pp. 487–494, 2001.

Sagun, Levent, Guney, V Ugur, Arous, Gerard Ben, and LeCun, Yann. Explorations on high dimensional landscapes. arXiv:1412.6615, 2014.

Saxena, Shreyas and Verbeek, Jakob. Convolutional neural fabrics. CoRR, abs/1606.02492, 2016.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.

Sun, Shizhao, Chen, Wei, Wang, Liwei, Liu, Xiaoguang, and Liu, Tie-Yan. On the depth of deep neural networks: A theoretical view. In AAAI, 2016.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott E., Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR, 2015.

Telgarsky, Matus. Benefits of depth in neural networks. In COLT, 2016.

Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memisevic, Roland, Salakhutdinov, Ruslan, and Bengio, Yoshua. Architectural complexity measures of recurrent neural networks. CoRR, 2016.

Zhang, Yuchen, Lee, Jason D, and Jordan, Michael I. ℓ1-regularized neural networks are improperly learnable in polynomial time. arXiv:1510.03528, 2015.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. CoRR, 2016.
