AdaNet: Adaptive Structural Learning of Artificial Neural Networks
Corinna Cortes 1, Xavier Gonzalvo 1, Vitaly Kuznetsov 1, Mehryar Mohri 2,1, Scott Yang 2
A critical step in learning a large multi-layer neural network for a specific task is the choice of its architecture, which includes the number of layers and the number of units within each layer. Standard training methods for neural networks return a model admitting precisely the number [...]

1 Google Research, New York, NY, USA; 2 Courant Institute of Mathematical Sciences, New York, NY, USA. Correspondence to: Vitaly Kuznetsov <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Rather than enforcing a pre-specified architecture and thus a fixed network complexity, our AdaNet algorithms adaptively learn the appropriate network architecture for a learning task. Starting from a simple linear model, our algorithms incrementally augment the network with more units and additional layers, as needed. The choice of the additional subnetworks depends on their complexity and is directly guided by our learning guarantees. Remarkably, the optimization problems for both of our algorithms turn out to be strongly convex and thus guaranteed to admit a unique global solution.
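As a rough illustration of the incremental procedure sketched here, the outer loop can be pictured as in the Python sketch below. This is schematic only: candidate_generator and objective are hypothetical placeholders, and the concrete objective and coordinate-descent updates that AdaNet actually uses are described in Section 5.

    def grow_network(candidate_generator, objective, num_rounds=30):
        """Schematic structure-growing loop: start from the simplest model and,
        at each round, add the candidate subnetwork (more units or an additional
        layer) that most decreases a complexity-regularized objective."""
        subnetworks = []  # start from a simple (e.g. linear) model
        for _ in range(num_rounds):
            candidates = list(candidate_generator(subnetworks))
            if not candidates:
                break
            best = min(candidates, key=lambda h: objective(subnetworks + [h]))
            if objective(subnetworks + [best]) >= objective(subnetworks):
                break  # no candidate lowers the regularized objective: stop growing
            subnetworks.append(best)
        return subnetworks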
corresponds to units connected only to the layer below. We similarly define H̃*_k = H*_k ∪ (−H*_k) and H* = ∪_{k=1}^l H*_k, and define F* as the convex hull F* = conv(H*). Note that the architecture corresponding to the family of functions F* is still more general than standard feedforward neural network architectures, since the output unit can be connected to units in different layers.

3. Learning problem

[...] that can be derived via a standard Rademacher complexity analysis (Koltchinskii & Panchenko, 2002), and which admit an explicit dependency on the mixture weights w_k defining the ensemble function f. That leads to the following learning guarantee.

Theorem 1 (Learning bound). Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^l w_k · h_k ∈ F: [...]
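The risk notation used here and in Corollary 1 below follows the standard margin-bound conventions of Koltchinskii & Panchenko (2002). As a reminder, and under the assumption of binary labels y ∈ {−1, +1}, the usual definitions are as follows (our rendering for convenience, not a restatement of the theorem's displayed bound):

    % Standard margin-loss quantities (assumed definitions):
    % R(f) is the generalization error of f; \widehat{R}_{S,\rho}(f) is its empirical
    % margin error at confidence rho on the sample S = ((x_1, y_1), ..., (x_m, y_m)).
    R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[ 1_{y f(x) \le 0} \big],
    \qquad
    \widehat{R}_{S,\rho}(f) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \rho}.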
Our first result gives an upper bound on the Rademacher complexity of H_k in terms of the Rademacher complexities of other layer families.

Lemma 1. For any k > 1, the empirical Rademacher complexity of H_k for a sample S of size m can be upper-bounded as follows in terms of those of H_s with s < k:

R̂_S(H_k) ≤ 2 Σ_{s=1}^{k−1} Λ_{k,s} n_s^{1/q} R̂_S(H_s).

For the family H*_k, which is directly relevant to many of our experiments, the following more explicit upper bound can be derived, using Lemma 1.

Lemma 2. Let Λ_k = Π_{s=1}^k 2Λ_{s,s−1} and N_k = Π_{s=1}^k n_{s−1}. Then, for any k ≥ 1, the empirical Rademacher complexity of H*_k for a sample S of size m can be upper bounded as follows:

R̂_S(H*_k) ≤ r_∞ Λ_k N_k^{1/q} √( log(2 n_0) / (2m) ).

Note that N_k, which is the product of the number of units in layers below k, can be large. This suggests that values of p closer to one, that is larger values of q, could be more helpful to control complexity in such cases. More generally, similar explicit upper bounds can be given for the Rademacher complexities of subfamilies of H_k with units connected only to layers k, k − 1, . . . , k − d, with d fixed, d < k. Combining Lemma 2 with Theorem 1 helps derive the following explicit learning guarantee for feedforward neural networks with an output unit connected to all the other units.

Corollary 1 (Explicit learning bound). Fix ρ > 0. Let Λ_k = Π_{s=1}^k 4Λ_{s,s−1} and N_k = Π_{s=1}^k n_{s−1}. Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size m from D^m, the following inequality holds for all f = Σ_{k=1}^l w_k · h_k ∈ F*:

R(f) ≤ R̂_{S,ρ}(f) + (2/ρ) Σ_{k=1}^{l} ‖w_k‖_1 r̄_∞ Λ_k N_k^{1/q} √( 2 log(2 n_0) / m ) + (2/ρ) √( (log l) / m ) + C(ρ, l, m, δ),

where C(ρ, l, m, δ) = √( (4/ρ²) log( ρ² m / log l ) (log l)/m + log(2/δ)/(2m) ) = Õ( (1/ρ) √( (log l)/m ) ), and where r̄_∞ = E_{S∼D^m}[r_∞].

The learning bound of Corollary 1 is a finer guarantee than previous ones by Bartlett (1998), Neyshabur et al. (2015), or Sun et al. (2016). This is because it explicitly differentiates between the weights of different layers, while previous bounds treat all weights indiscriminately. This is crucial to algorithmic design, since the network complexity no longer needs to grow exponentially as a function of depth. Our bounds are also more general and apply to other network architectures, such as those introduced in (He et al., 2015; Huang et al., 2016).

5. Algorithm

This section describes our algorithm, AdaNet, for adaptive learning of neural networks. AdaNet adaptively grows the structure of a neural network, balancing model complexity with empirical risk minimization. We also describe in detail in Appendix C another variant of AdaNet which admits some favorable properties.

Let x ↦ Φ(−x) be a non-increasing convex function upper-bounding the zero-one loss x ↦ 1_{x≤0}, such that Φ is differentiable over ℝ and Φ′(x) ≠ 0 for all x. This surrogate loss Φ may be, for instance, the exponential function Φ(x) = e^x as in AdaBoost (Freund & Schapire, 1997), or the logistic function Φ(x) = log(1 + e^x) as in logistic regression.

5.1. Objective function

Let {h_1, . . . , h_N} be a subset of H*. In the most general case, N is infinite. However, as discussed later, in practice the search is limited to a finite set. For any j ∈ [N], we will denote by r_j the Rademacher complexity of the family H_{k_j} that contains h_j: r_j = R_m(H_{k_j}).

AdaNet seeks to find a function f = Σ_{j=1}^N w_j h_j ∈ F* (or neural network) that directly minimizes the data-dependent generalization bound of Corollary 1. This leads to the following objective function:

F(w) = (1/m) Σ_{i=1}^{m} Φ( 1 − y_i Σ_{j=1}^{N} w_j h_j(x_i) ) + Σ_{j=1}^{N} Γ_j |w_j|,    (4)

where w ∈ ℝ^N and Γ_j = λ r_j + β, with λ ≥ 0 and β ≥ 0 hyperparameters. The objective function (4) is a convex function of w. It is the sum of a convex surrogate of the empirical error and a regularization term, which is a weighted-ℓ1 penalty containing two sub-terms: a standard norm-1 regularization which admits β as a hyperparameter, and a term that discriminates the functions h_j based on their complexity.

The optimization problem consisting of minimizing the objective function F in (4) is defined over a very large space of base functions h_j. AdaNet consists of applying coordinate descent to (4). In that sense, our algorithm is similar to the DeepBoost algorithm of Cortes et al. (2014). However, unlike DeepBoost, which combines decision trees, AdaNet learns a deep neural network, which requires new methods for constructing and searching the space of functions [...]
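To make (4) concrete, here is a minimal NumPy sketch of evaluating the objective, assuming the logistic surrogate Φ(x) = log(1 + e^x) and a precomputed matrix of base-function outputs h_j(x_i); the names adanet_objective, H, r, lam and beta are illustrative and not taken from the paper's implementation.

    import numpy as np

    def adanet_objective(w, H, y, r, lam=1e-6, beta=0.0):
        """Evaluate the complexity-regularized objective of Eq. (4).
        w   : (N,) mixture weights over the base functions h_1, ..., h_N.
        H   : (m, N) matrix with H[i, j] = h_j(x_i).
        y   : (m,) labels in {-1, +1}.
        r   : (N,) Rademacher-complexity estimates r_j of the families containing h_j.
        lam, beta : hyperparameters defining Gamma_j = lam * r_j + beta.
        """
        margins = y * (H @ w)                         # y_i * sum_j w_j h_j(x_i)
        surrogate = np.logaddexp(0.0, 1.0 - margins)  # logistic Phi applied to 1 - margin
        gamma = lam * r + beta
        return surrogate.mean() + np.sum(gamma * np.abs(w))

Coordinate descent on (4) would then repeatedly pick a single base function h_j and update its weight w_j so as to decrease this value while keeping the other coordinates fixed, in line with the description above.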
Table 1. Experimental results for AdaNet, NN, LR and NN-GP for different pairs of labels in CIFAR-10. Boldfaced results are statistically significant at a 5% confidence level.
[...] works by minimizing (6) with R = 0. This also requires a learning rate hyperparameter η. These hyperparameters have been optimized over the following ranges: λ ∈ {0, 10^−8, 10^−7, 10^−6, 10^−5, 10^−4}, B ∈ {100, 150, 250}, η ∈ {10^−4, 10^−3, 10^−2, 10^−1}. We have used a single Λ_k for all k > 1, optimized over {1.0, 1.005, 1.01, 1.1, 1.2}. For simplicity, we chose β = 0.

Neural network models also admit a learning rate η and a regularization coefficient λ as hyperparameters, as well as the number of hidden layers l and the number of units n in each hidden layer. The range of η was the same as for AdaNet, and we varied l in {1, 2, 3}, n in {100, 150, 512, 1024, 2048} and λ ∈ {0, 10^−5, 10^−4, 10^−3, 10^−2, 10^−1}. Logistic regression only admits as hyperparameters η and λ, which were optimized over the same ranges. Note that the total number of hyperparameter settings for AdaNet and standard neural networks is exactly the same. Furthermore, the same holds for the number of hyperparameters that determine the resulting architecture of the model: Λ and B for AdaNet and l and n for neural network models. Observe that, while a particular setting of l and n determines a fixed architecture, Λ and B parameterize a structural learning procedure that may result in a different architecture depending on the data.

In addition to the grid search procedure, we have conducted a hyperparameter optimization for neural networks using Gaussian process bandits (NN-GP), which is a sophisticated Bayesian non-parametric method for response-surface modeling in conjunction with a bandit algorithm (Snoek et al., 2012). Instead of operating on a pre-specified grid, this allows one to search for hyperparameters in a given range. We used the following ranges: λ ∈ [10^−5, 1], η ∈ [10^−5, 1], l ∈ [1, 3] and n ∈ [100, 2048]. This algorithm was run for 500 trials, which is more than the number of hyperparameter settings considered by AdaNet and NN. Observe that this search procedure can also be applied to our algorithm, but we chose not to use it in this set of experiments, to further demonstrate the competitiveness of our structural learning approach.

In all our experiments, we use ReLU as the activation function. NN, NN-GP and LR are trained using a stochastic gradient method with a batch size of 100 and a maximum of 10,000 iterations. The same configuration is used for solving (6). We use T = 30 for AdaNet in all our experiments, although in most cases the algorithm terminates after 10 rounds.

In each of the experiments, we used standard 10-fold cross-validation for performance evaluation and model selection. In particular, the dataset was randomly partitioned into 10 folds, and each algorithm was run 10 times, with a different assignment of folds to the training set, validation set and test set for each run. Specifically, for each i ∈ {0, . . . , 9}, fold i was used for testing, fold i + 1 (mod 10) was used for validation, and the remaining folds were used for training. For each setting of the parameters, we computed the average validation error across the 10 folds, and selected the parameter setting with maximum average accuracy across validation folds. We report the average accuracy (and standard deviations) of the selected hyperparameter setting across test folds in Table 1.

Our results show that AdaNet outperforms the other methods on each of the datasets. The average architectures for all label pairs are provided in Table 2. Note that NN and NN-GP always select a one-layer architecture. The architectures selected by AdaNet also typically admit a single layer, with fewer nodes than those selected by NN and NN-GP. However, for the more challenging problem cat-dog, AdaNet opts for a more complex model with two layers, which results in better performance. This further illustrates how our approach helps learn network architectures in an adaptive fashion, based on the complexity of the task.

As discussed in Section 5, different heuristics can be used to generate candidate subnetworks on each iteration of AdaNet. In a second set of experiments, we varied the objective function (6), as well as the domain over which it is optimized. This allowed us to study the sensitivity of AdaNet to the choice of the heuristic used to generate candidate subnetworks. In particular, we considered the following variants of AdaNet. AdaNet.R uses R(w, h) = Γ_h ‖w‖_1 as a regularization term in (6). [...]
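As an illustration of the fold rotation in the cross-validation protocol described above (fold i for testing, fold i + 1 mod 10 for validation, the remaining folds for training), the splits for the 10 runs could be generated as follows; this sketch is illustrative only and is not the authors' experimental code.

    import numpy as np

    def fold_splits(num_examples, num_folds=10, seed=0):
        """Yield (train, validation, test) index arrays: fold i is the test set,
        fold (i + 1) mod num_folds the validation set, and the remaining folds
        form the training set."""
        rng = np.random.RandomState(seed)
        fold_of = rng.permutation(num_examples) % num_folds  # random, near-equal folds
        for i in range(num_folds):
            test = np.where(fold_of == i)[0]
            val = np.where(fold_of == (i + 1) % num_folds)[0]
            train = np.where((fold_of != i) & (fold_of != (i + 1) % num_folds))[0]
            yield train, val, test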
Table 2. Average number of units in each layer.

Table 4. Experimental results for Criteo dataset.
References

Arora, Sanjeev, Liang, Yingyu, and Ma, Tengyu. Why are deep nets reversible: A simple theory, with implications for training. arXiv:1511.05653, 2015.

Baker, Bowen, Gupta, Otkrist, Naik, Nikhil, and Raskar, Ramesh. Designing neural network architectures using reinforcement learning. CoRR, 2016.

Bartlett, Peter L. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 1998.

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.

Bergstra, James S., Bardenet, Rémi, Bengio, Yoshua, and Kégl, Balázs. Algorithms for hyper-parameter optimization. In NIPS, pp. 2546–2554, 2011.

Chen, Tianqi, Goodfellow, Ian J., and Shlens, Jonathon. Net2net: Accelerating learning via knowledge transfer. CoRR, 2015.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. arXiv:1412.0233, 2014.

Cohen, Nadav, Sharir, Or, and Shashua, Amnon. On the expressive power of deep learning: a tensor analysis. arXiv, 2015.

Cortes, Corinna, Mohri, Mehryar, and Syed, Umar. Deep boosting. In ICML, pp. 1179–1187, 2014.

Daniely, Amit, Frostig, Roy, and Singer, Yoram. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.

Eldan, Ronen and Shamir, Ohad. The power of depth for feedforward neural networks. arXiv:1512.03965, 2015.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Ha, David, Dai, Andrew M., and Le, Quoc V. Hypernetworks. CoRR, 2016.

Hardt, Moritz, Recht, Benjamin, and Singer, Yoram. Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

Huang, Gao, Liu, Zhuang, and Weinberger, Kilian Q. Densely connected convolutional networks. CoRR, 2016.

Islam, Md. Monirul, Yao, Xin, and Murase, Kazuyuki. A constructive algorithm for training cooperative neural network ensembles. IEEE Trans. Neural Networks, 14(4):820–834, 2003.

Islam, Md. Monirul, Sattar, Md. Abdus, Amin, Md. Faijul, Yao, Xin, and Murase, Kazuyuki. A new adaptive merging and growing algorithm for designing artificial neural networks. IEEE Trans. Systems, Man, and Cybernetics, Part B, 39(3):705–722, 2009.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, Anima. Generalization bounds for neural networks through tensor factorization. arXiv:1506.08473, 2015.

Kawaguchi, Kenji. Deep learning without poor local minima. In NIPS, 2016.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Kotani, Manabu, Kajiki, Akihiro, and Akazawa, Kenzo. A structural learning algorithm for multi-layered neural networks. In International Conference on Neural Networks, volume 2, pp. 1105–1110. IEEE, 1997.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In NIPS, pp. 1097–1105, 2012.
Kuznetsov, Vitaly, Mohri, Mehryar, and Syed, Umar. Multi-class deep boosting. In NIPS, 2014.

Kwok, Tin-Yau and Yeung, Dit-Yan. Constructive algorithms for structure learning in feedforward neural networks for regression problems. IEEE Transactions on Neural Networks, 8(3):630–645, 1997.

LeCun, Yann, Denker, John S., and Solla, Sara A. Optimal brain damage. In NIPS, 1990.

Lehtokangas, Mikko. Modelling with constructive backpropagation. Neural Networks, 12(4):707–716, 1999.

Leung, Frank H.F., Lam, Hak-Keung, Ling, Sai-Ho, and Tam, Peter K.S. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, 2003.

Lian, Xiangru, Huang, Yijun, Li, Yuncheng, and Liu, Ji. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, pp. 2719–2727, 2015.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In NIPS, pp. 855–863, 2014.

Luo, Zhi-Quan and Tseng, Paul. On the convergence of coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35, 1992.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2951–2959, 2012.

Sun, Shizhao, Chen, Wei, Wang, Liwei, Liu, Xiaoguang, and Liu, Tie-Yan. On the depth of deep neural networks: A theoretical view. In AAAI, 2016.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In NIPS, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott E., Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR, 2015.

Telgarsky, Matus. Benefits of depth in neural networks. In COLT, 2016.

Zhang, Saizheng, Wu, Yuhuai, Che, Tong, Lin, Zhouhan, Memisevic, Roland, Salakhutdinov, Ruslan, and Bengio, Yoshua. Architectural complexity measures of recurrent neural networks. CoRR, 2016.

Zhang, Yuchen, Lee, Jason D., and Jordan, Michael I. ℓ1-regularized neural networks are improperly learnable in polynomial time. arXiv:1510.03528, 2015.

Zoph, Barret and Le, Quoc V. Neural architecture search with reinforcement learning. CoRR, 2016.