Regularization For Deep Learning
Tsz-Chiu Au
[email protected]
As model complexity goes up, the model requires much more data in order not to overfit.
Regularization
Any modification we make to a learning algorithm that is intended to reduce its
generalization error, but not its training error
Expected prediction error: $\mathbb{E}\big[(y - h(x))^2\big]$
We will learn to derive this decomposition of the expected error into bias, variance, and
noise terms.
Bias
Discrepancy between the averaged estimate and the true function
Incorrect assumptions
- e.g.) a linear assumption for data generated from a higher-degree polynomial model.
$\mathrm{Bias}\big[h(x)\big] = \mathbb{E}\big[h(x)\big] - f(x)$, where $f$ is the true function and the expectation is over training sets.
Source of Variance
Statistical sources
- Classifiers that are too local and can easily fit the data.
- e.g.) nearest neighbor, large decision trees.
Computational sources
- Learning algorithms that make sharp decisions can be
unstable, as the decision boundary can change if one
training example changes.
- Randomization in the learning algorithm
- e.g.) neural networks with random initial weights.
Noise
$\mathrm{Noise}\big[h(x)\big] = \mathbb{E}\big[(y - f(x))^2\big] = \mathbb{E}\big[\epsilon^2\big] = \sigma^2$
Bias-Variance Decomposition
We can expand the expected prediction error as follows (the last step uses the fact that the noise in $y$ is independent of the learned hypothesis $h(x)$):
$\mathbb{E}\big[(y - h(x))^2\big] = \mathbb{E}\big[h(x)^2 - 2\,y\,h(x) + y^2\big] = \mathbb{E}\big[h(x)^2\big] + \mathbb{E}\big[y^2\big] - 2\,\mathbb{E}[y]\,\mathbb{E}\big[h(x)\big]$
Let $\bar{h}(x) = \mathbb{E}\big[h(x)\big]$ denote the mean prediction of the hypothesis at $x$.
The first term can be decomposed as follows, using the variance lemma
$\mathbb{E}\big[Z^2\big] = \mathrm{Var}[Z] + \mathbb{E}[Z]^2$
$\mathbb{E}\big[h(x)^2\big] = \mathbb{E}\big[\big(h(x) - \bar{h}(x)\big)^2\big] + \bar{h}(x)^2$
$\mathbb{E}\big[y^2\big] = \mathbb{E}\big[\big(y - f(x)\big)^2\big] + f(x)^2$
Bias-Variance Decomposition
Putting everything together, we have,
$\mathbb{E}\big[(y - h(x))^2\big] = \underbrace{\mathbb{E}\big[\big(h(x) - \bar{h}(x)\big)^2\big]}_{\text{variance}} + \underbrace{\big(\bar{h}(x) - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[\big(y - f(x)\big)^2\big]}_{\text{noise}}$
(using $\mathbb{E}[y] = f(x)$ to combine the cross terms).
However, finding the right complexity of the model is not a simple matter.
Most regularization methods aim to reduce variance at the cost of added bias.
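To make the decomposition concrete, here is a minimal Monte Carlo sketch on an assumed 1-D toy regression problem (the true function, noise level, and model class are illustrative choices, not from the slides): it estimates the bias², variance, and noise terms by repeatedly drawing training sets and refitting a model.

```python
# Minimal sketch: estimate bias^2, variance, and noise by Monte Carlo,
# refitting a polynomial on freshly drawn training sets each time.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)            # the true function f(x)
sigma = 0.3                                     # noise std, so the noise term is sigma^2
x_test = np.linspace(0, 1, 50)

def fit_and_predict(degree=5, n_train=30):
    """Fit a polynomial on a freshly sampled training set and predict on x_test."""
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    return np.polyval(np.polyfit(x, y, degree), x_test)

preds = np.stack([fit_and_predict() for _ in range(500)])   # h(x) over many training sets D
h_bar = preds.mean(axis=0)                                   # the mean prediction \bar{h}(x)
bias2 = np.mean((h_bar - f(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias2:.4f}, variance ~ {variance:.4f}, noise = {sigma**2:.4f}")
```

Increasing the polynomial degree in this sketch typically lowers the estimated bias term while raising the variance term.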
Norm-based Regularizations
Regularization in General Machine Learning
Context
Introduction of penalty terms to guide the learning with some additional information.
$J(\theta) = \frac{1}{N}\,\sum_{i=1}^{N} \ell(\theta, x_i, y_i)$
One type of penalty term we can add is a norm of the parameter vector $\theta$, which brings many desirable properties to the learning process.
L2 Regularization
L2 regularization, based on the L2 norm, is one of the most common types of regularization.
Penalized form: $J(w) + \frac{\lambda}{2}\,\|w\|_2^2$, with $\|w\|_2^2 = \sum_j w_j^2$
Constrained form: minimize $J(w)$ subject to $\|w\|_2^2 \le t$
[Figure: contours of the loss function and the L2 norm ball; the regularized minimum lies where a loss contour first touches the constraint region.]
Shrinks the value of the variables while favoring similar weights among them
Ridge Regression
Linear regression with L2-regularization
$\min_{w}\ \frac{1}{N}\,\|y - Xw\|_2^2 + \lambda\,\Omega(w), \qquad \Omega(w) = \frac{1}{2}\,\|w\|_2^2$
Closed-form solution: $w = \big(X^\top X + \lambda I\big)^{-1} X^\top y$
[Figure: ridge regularization path ($w/\max(w)$); the coefficients shrink smoothly toward zero but rarely become exactly zero.]
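A minimal numpy sketch of the closed-form ridge solution above; the synthetic data and the choice of $\lambda$ values are illustrative only.

```python
# Ridge regression via the normal equations: w = (X^T X + lam*I)^{-1} X^T y.
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Example: larger lambda shrinks all weights toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```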
Regularized objective: $\tilde{J}(w) = J(w) + \frac{\lambda}{2}\,\|w\|_2^2$
The gradient update becomes $w \leftarrow w - \epsilon\big(\nabla_w J(w) + \lambda w\big) = (1 - \epsilon\lambda)\,w - \epsilon\,\nabla_w J(w)$.
At each step, the weight decay term shrinks the weights by the constant factor $(1 - \epsilon\lambda)$ before applying the usual gradient update.
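As a sketch of this shrinkage effect (assuming plain gradient descent with learning rate `lr` and weight-decay coefficient `lam`; the default values are illustrative):

```python
import numpy as np

def weight_decay_step(w, grad_J, lr=0.1, lam=0.01):
    """One gradient step on J(w) + (lam/2)*||w||^2: shrink w, then follow the gradient of J."""
    return (1.0 - lr * lam) * w - lr * grad_J
```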
Further Analysis on the L2 Regularization
The quadratic approximation of the objective function around the weights $w^*$ that minimize the unregularized objective is given as follows
$\hat{J}(w) = J(w^*) + \frac{1}{2}\,(w - w^*)^\top H\,(w - w^*)$, where $H$ is the Hessian of $J$ at $w^*$.
Weight decay rescales $w^*$ along each eigenvector of $H$ in proportion to the size of the corresponding eigenvalue. Thus, only the directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact; directions with small eigenvalues are shrunk strongly toward zero.
L1 regularization
Results in parameter selection
Constrained form: $\|w\|_1 \le t$
As the dimensionality of $w$ increases, the norm ball has increasingly many corners, so the loss contours are more likely to first touch the constraint region at a corner, where some coordinates are exactly zero.
Sparsity
We define the L0 pseudo-norm as follows:
$\|w\|_0 = \#\{\, j = 1, \dots, d \;:\; w_j \neq 0 \,\}$
Sparsity enables feature selection, model compression, and results in models with
better interpretability
Why is Sparsity Good?
Model compression – With most of the parameters set to zero, sparsity can greatly
reduce the memory and computational requirements.
[Hwang et al.] S. J. Hwang and L. Sigal, A Unified Semantic Embedding: Relating Taxonomies with Attributes, NIPS 2014
Why is Sparsity Good?
Feature selection - Identifies features that are truly relevant to the target task.
$\min_{w}\ \frac{1}{N}\,\|y - Xw\|_2^2 + \lambda\,\Omega(w), \qquad \Omega(w) = \|w\|_1$
[Figure: lasso regularization path ($w/\max(w)$); coefficients hit exactly zero one after another as the regularization strength grows.]
Not all variables become zero at the same time – results in variable selection
Subgradient Method
The L1 norm is non-smooth (non-differentiable at zero). In such cases, we can work with a set of generalized gradients – the subgradient – which is defined even at the non-smooth points.
$f : \mathbb{R} \to \mathbb{R},\quad f(x) = |x|$
For $x \neq 0$, the subgradient is $g = \mathrm{sign}(x)$.
For $x = 0$, the subgradient $g$ is any element of $[-1, 1]$.
Gradient of L1-Regularized Objectives
The gradient of the L1-regularized objective involves this subgradient: $\nabla_w \tilde{J}(w) = \nabla_w J(w) + \lambda\,g$, with $g \in \partial\,\|w\|_1$.
Proximal methods handle the non-smooth part $h$ through the proximal operator
$\mathrm{prox}_{\lambda h}(w) = \arg\min_{w^*}\ \frac{1}{2}\,\|w - w^*\|_2^2 + \lambda\,h(w^*)$
Euclidean Projection
The proximal operator $\mathrm{prox}_{\lambda h}(w)$ should be efficient to evaluate – a closed-form solution is preferred.
Proximal Operator for L1-norm regularization
Iterative soft-thresholding operator – reduce the absolute value of each component of $w$ by $\lambda$, and if the result falls below zero, set it to zero.
$\min_{w}\ J(w) + h(w), \qquad h(w) = \lambda\,\|w\|_1$
$\mathrm{prox}_{\lambda}(w)_j = \mathrm{sign}(w_j)\,\max\big(|w_j| - \lambda,\ 0\big)$
Summary: L2- vs L1-regularization
L2 regularization promotes grouping – results in equal weights for correlated features
L1 regularization promotes sparsity – selects few informative features
L2-regularization: $\Omega(w) = \frac{1}{2}\,\|w\|_2^2$
L1-regularization: $\Omega(w) = \|w\|_1$
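A minimal numpy sketch of the soft-thresholding operator from the previous slide, together with proximal gradient descent (ISTA) for the L1-regularized least-squares problem; the step size and problem setup are illustrative assumptions.

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink each component of w by lam; values that would cross zero become zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ista(X, y, lam, n_iters=500):
    """Proximal gradient descent for (1/2n)||y - Xw||^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # conservative step size (<= 1/Lipschitz)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

Running `ista` with a growing `lam` drives more coefficients exactly to zero, which is the sparsity effect discussed above.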
Lp-Norms
General case $\|w\|_p$, where $p$ can be any non-negative number.
Auxiliary Tasks
Auxiliary tasks trained with a shared representation:
1) 1 or 2 lanes?
2) Location of the centerline
3) Location of the left edge of the road
4) Location of the right edge of the road
5) Location of the road center
Sharing the representation across these tasks gave a 27.7% reduction in error.
[Baxter95] showed that sharing parameters can result in improved generalization and
generalization error bounds.
[Baxter95] J. Baxter, Learning Internal Representations, COLT 95
Multitask Learning
[Figure: taxonomy-regularized deep convolutional network; base-level categories (Cheetah, Jaguar, Leopard) are combined by difference-pooling into category-specific specialization features and by min-pooling into generalization features for the superclass (Felid).]
Accuracy on CIFAR-100:
- Previous state-of-the-art: 67.38%
- Ours (taxonomy-regularized network): 71.81%
[Goo16] Wonjun Goo, Juyong Kim, Gunhee Kim, and Sung Ju Hwang, Taxonomy-Regularized Semantic Deep Convolutional Network, ECCV 2016
Ensemble Methods
Ensemble
Combining multiple hypotheses into one.
[Figure: $k$ models trained on the labeled data $D = \{(x_i, y_i)\}_{i=1}^{N}$ are combined into a single ensemble model.]
If the accuracy of each of 5 independent classifiers is 70%, the accuracy of majority voting is
$0.7^5 + 5\,(0.7^4)(0.3) + 10\,(0.7^3)(0.3^2) = 0.8369$
If there are 101 such classifiers, the majority-voting accuracy exceeds 0.99.
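A small sketch that reproduces these numbers, assuming the $k$ classifiers make independent errors:

```python
from math import comb

def majority_vote_accuracy(p, k):
    """Probability that more than half of k independent classifiers are correct."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range((k // 2) + 1, k + 1))

print(majority_vote_accuracy(0.7, 5))    # ~0.8369
print(majority_vote_accuracy(0.7, 101))  # > 0.99
```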
Why Do Ensemble Models Work?
Ensemble gives a global picture of the actual distribution
[Figure: $k$ models are trained on labeled data $D = \{(x_i, y_i)\}_{i=1}^{N}$ drawn from some unknown distribution; at test time their predictions on an unlabeled instance $x^*$ are combined by majority voting or averaging into the final prediction.]
Types of Ensemble Methods
Combine by learning – Boosting, rule ensemble, Bayesian model averaging…
Bagging – create ensembles by repeatedly random-sampling the training data; the trained models are then combined by averaging or voting into a final prediction.
Given n data samples, and a class of learning models (e.g. neural networks, decision
trees), train k models on different samples, and average their predictions.
Bagging
Training
In each iteration $t = 1, \dots, k$,
Randomly sample, with replacement, $n$ examples from the training set.
Train a chosen base model on the samples.
Test
For each test example,
Run all $k$ trained base models.
Then, combine the results of all $k$ trained models.
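A minimal bagging sketch, assuming scikit-learn decision trees as the base model and non-negative integer class labels; names and parameters are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k base models, each on a bootstrap sample of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))   # sample n examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the k models by majority voting (assumes integer class labels)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)   # shape (k, n_test)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```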
Why Does Bagging Work?
Example
Original dataset
First resampled dataset
Second resampled dataset
[Figure: each resampled dataset repeats some original examples and omits others, so each trained model makes somewhat different errors.]
Why Does Bagging Work?
Assume that we measure a random variable $x \sim \mathcal{N}(\mu, \sigma^2)$. Averaging $k$ independent measurements keeps the expectation $\mu$ but reduces the variance of the estimate from $\sigma^2$ to $\sigma^2 / k$; bagging exploits the same averaging effect, to the extent that the models' errors are independent.
Dropout
[Figure: at training time, dropout randomly removes units from the network; at test time the full network is used.]
[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Why Dropout Works
With dropout, we can train exponentially many different networks that share weights, in a single training phase – up to $2^n$ sub-networks, where $n$ is the number of units that can be dropped.
This has the effect of combining many different neural networks from a single training run, as in bagging – and thus effectively prevents overfitting.
Dropout vs. Bagging
However, dropout training differs from bagging in some aspects
1) All models share parameters – they are no longer really trained on separate
subsets of the dataset
2) Individual models are not trained to convergence – in fact, the vast majority of sub-networks will never be trained.
3) Since there are too many models to average explicitly, dropout averages them
together with a fast approximation – geometric mean rather than arithmetic
mean.
Dropout as Bagging
Suppose that each model outputs a probability distribution
In the case of bagging, each model produces a probability distribution $p_i(y \mid x)$, and the prediction of the ensemble is given as the arithmetic mean of all these distributions.
$\frac{1}{k}\,\sum_{i=1}^{k} p_i(y \mid x)$
In the case of dropout, each sub-model defined by a mask vector $\mu$ defines a probability distribution $p(y \mid x, \mu)$, and the arithmetic mean over all masks is given by
$\sum_{\mu} p(\mu)\, p(y \mid x, \mu)$
However, since this sum has an exponential number of terms, it is intractable to evaluate
→ Average together the outputs from a small number of sampled masks, for tractability.
Dropout as Bagging
We can instead use geometric mean rather than the arithmetic mean of the
ensemble members, which empirically works well for model averaging.
$\tilde{p}_{\text{ensemble}}(y \mid x) = \sqrt[2^d]{\,\prod_{\mu} p(y \mid x, \mu)\,}$, where $d$ is the number of units that may be dropped.
/
However, since the geometric mean of multiple probability distributions is not
guaranteed to be a probability distribution, we renormalize the resulting distribution
"!#$%#&'(# ) *
"#$%#&'(# ) * =
∑34 "!#$%#&'(# ) 5 *
More specifically, dropout does exact model averaging in deep networks given that
they are locally linear along the space of inputs to each layer that are visited by
applying different dropout masks.
Thus, in effect the network is acting like a linear one, making the approximate
model averaging more exact (in contrast to a network with nonlinear units)
Further, while dropout sets most activations to 0 (for ReLU units), maxout units are always active and can thus better utilize the model capacity.
[Goodfellow14] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout Networks, ICML 2013
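In practice, this averaging is usually approximated with a single forward pass. A minimal numpy sketch of the common "inverted dropout" implementation for one layer is shown below; the rescaling keeps the expected activation unchanged so that the full network can be used at test time. This is a practical approximation, not the exact geometric-mean averaging described above.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True, rng=None):
    """h: activations of one layer; p: probability of dropping each unit."""
    rng = rng or np.random.default_rng()
    if not train or p == 0.0:
        return h                                    # test time: use the full network as-is
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)                     # rescale so the expected activation is unchanged
```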
Advantages of Dropout
1) Dropout is more effective than other standard regularizers.
Classification error on the MNIST dataset:
- L2: 1.62%
- L2 + L1 applied at the end of training: 1.60%
- Max-norm: 1.35%
- Dropout + L2: 1.25%
- Dropout + Max-norm: 1.05%
3) Dropout does not significantly limit the type of model or training procedure – it works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent.
Effect of Dropout on Features
Dropout prevents the features from co-adaptation – each neuron must do well
even when other neurons are not there.
[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Experimental Results
Dropout consistently improves performance on almost all datasets
[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Effect of Dropout on Sparsity
Dropout results in sparse activations, even when no sparsity inducing regularizers
are present.
[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Effect of Dropout Rate
Randomly drop a neuron with some predefined probability at each iteration.
[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
DropConnect
Randomly drop a weight, instead of a neuron
[Wan13] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, Regularization of Neural Networks using DropConnect, ICML 2013
DropConnect
DropConnect often outperforms dropout
[Wan13] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, Regularization of Neural Networks using DropConnect, ICML 2013
Regularizations via Numerical Optimization
Early Stopping
When training large models, we often observe that the training error decreases steadily over time, while the validation error begins to rise again after a certain number of iterations.
We can treat the number of iterations as a hyperparameter to tune, and stop training before the training loss reaches its (local) minimum, based on the validation loss.
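A minimal, framework-agnostic sketch of this procedure; `model`, `train_step`, and `validate` are assumed callables provided by the user, and `patience` is an illustrative stopping criterion.

```python
import copy

def train_with_early_stopping(model, train_step, validate, max_epochs=200, patience=10):
    """Keep the parameters with the best validation loss; stop after `patience` epochs without improvement."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                      # one epoch of gradient updates
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_state, best_loss
```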
Early Stopping and L2-Regularization
Under a quadratic approximation of the loss, early stopping is equivalent to L2 regularization.
However, early stopping only needs to train the model once, while the weight decay
requires many training experiments with different values of its hyperparameters.
Early Stopping and L2-Regularization
We simplify the problem by assuming that the only parameters are linear weights, and
consider the quadratic approximation of the cost function.
$\hat{J}(w) = J(w^*) + \frac{1}{2}\,(w - w^*)^\top H\,(w - w^*)$, where $H$ is the Hessian of $J$ at $w^*$.
Let's suppose that we update the parameters via gradient descent with learning rate $\epsilon$. The gradient of the above quadratic approximation is $\nabla_w \hat{J}(w) = H\,(w - w^*)$, so each update satisfies
$w^{(\tau)} - w^* = (I - \epsilon H)\,\big(w^{(\tau-1)} - w^*\big)$
Writing $H = Q\Lambda Q^\top$ in terms of its eigendecomposition,
$Q^\top\big(w^{(\tau)} - w^*\big) = (I - \epsilon\Lambda)\,Q^\top\big(w^{(\tau-1)} - w^*\big)$
Early Stopping and L2-Regularization
Assuming that $w^{(0)} = 0$, and that $\epsilon$ is chosen small enough to guarantee $|1 - \epsilon\lambda_i| < 1$ for every eigenvalue $\lambda_i$, the parameter trajectory after $\tau$ parameter updates is as follows:
$Q^\top w^{(\tau)} = \big[\,I - (I - \epsilon\Lambda)^{\tau}\,\big]\,Q^\top w^*$
For comparison, the L2-regularized solution with coefficient $\alpha$ satisfies
$Q^\top \tilde{w} = (\Lambda + \alpha I)^{-1}\Lambda\,Q^\top w^* = \big[\,I - (\Lambda + \alpha I)^{-1}\alpha\,\big]\,Q^\top w^*$
If we choose the hyperparameters $\epsilon$, $\alpha$, and $\tau$ such that $(I - \epsilon\Lambda)^{\tau} = (\Lambda + \alpha I)^{-1}\alpha$ (roughly, $\tau \approx 1/(\epsilon\alpha)$ when $\epsilon\lambda_i \ll 1$), then the above two formulations become equivalent.
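A small numerical sketch of this correspondence on a toy quadratic with a diagonal Hessian, chosen with small eigenvalues so that the approximation $\tau \approx 1/(\epsilon\alpha)$ applies; all numbers are illustrative.

```python
import numpy as np

H = np.diag([0.02, 0.05, 0.1, 0.2])           # small eigenvalues so the approximation applies
w_star = np.array([1.0, -2.0, 0.5, 3.0])

eps, tau = 0.1, 5
w = np.zeros(4)
for _ in range(tau):                           # tau early-stopped gradient steps from w = 0
    w -= eps * H @ (w - w_star)

alpha = 1.0 / (eps * tau)                      # the approximate correspondence tau ~ 1/(eps*alpha)
w_l2 = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

print(np.round(w, 4))                          # both shrink w* by roughly lambda_i / (lambda_i + alpha)
print(np.round(w_l2, 4))
```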
Batch Normalization
While the gradient for each layer is computed with the assumption that the other
layers do not change, they are actually simultaneously trained, and thus changed.
[Figure: a small two-layer network of sigmoid units, $\sigma(x) = 1/(1 + e^{-x})$ with derivative $\hat{y}(1 - \hat{y})$, trained with the squared-error loss $\frac{1}{2}(\hat{y} - y)^2$; when the first layer's weights change during training, the distribution of inputs to the second layer shifts as well.]
Batch normalization greatly speeds up the training process, since the network will be
less sensitive to learning rate.
[Ioffe15] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015
Batch Normalization as Regularization
The primary purpose of batch normalization is to improve optimization, but the added
noise also has regularization effect, making dropout unnecessary at times.
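For reference, a minimal numpy sketch of the training-time batch-normalization forward pass for one layer; the noise mentioned above comes from the per-mini-batch statistics used here, while at test time running averages of those statistics would be used instead.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features) pre-activations; gamma, beta: learned per-feature scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)      # zero mean, unit variance per feature
    return gamma * x_hat + beta
```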
Parameter Tying and Parameter Sharing
When we have prior knowledge that two sets of parameters should be similar, we can tie them. One way is to use an L2 penalty on the difference between the two weight vectors:
$\Omega\big(w^{(A)}, w^{(B)}\big) = \big\|w^{(A)} - w^{(B)}\big\|_2^2$
However, a more popular way is to force the two sets of parameters to be equal, $w^{(A)} = w^{(B)}$, which is often referred to as parameter sharing.
By sharing the parameters, we can significantly lower the model complexity (and
the degree of freedom) of the problem.
Parameter Sharing in CNN
Natural images have many statistical properties that are invariant to translation →
Share set of features for all different positions of an image
Adversarial Examples
[Figure: adding a small perturbation (scaled by 0.007) to a correctly classified image yields an adversarial example that is confidently misclassified.]
If we change each input by $\epsilon$, then a linear function with weights $w$ can change by as much as $\epsilon\,\|w\|_1$. This can be a very large amount when $w$ is high-dimensional.
$w^\top \tilde{x} = w^\top x + w^\top \eta$
We can maximize the increase, subject to $\|\eta\|_\infty \le \epsilon$, by assigning $\eta = \epsilon\,\mathrm{sign}(w)$.
Adversarial training can discourage this highly sensitive, locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Fast Gradient Method
We can linearize the cost function around the current value of $\theta$, obtaining an optimal max-norm-constrained perturbation of
$\eta = \epsilon\,\mathrm{sign}\big(\nabla_{x} J(\theta, x, y)\big)$
We can add this vector to the original input data to generate adversarial examples.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
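A minimal PyTorch-style sketch of this fast gradient sign method; `model` and `loss_fn` are assumed to be a differentiable network and a suitable loss, and the epsilon value is illustrative.

```python
import torch

def fgsm_example(model, loss_fn, x, y, epsilon=0.007):
    """Return an adversarial example x + epsilon * sign(grad_x J)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()                                    # gradient of the cost w.r.t. the input
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```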
Adversarial Training of Deep Networks
Weights of the learned networks change significantly.
Weights learned with adversarial training are more localized and interpretable.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Why Do Adversarial Examples Generalize?
An example generated for one model is often misclassified by other models, even when they have different architectures or are trained on disjoint training sets.
[Figure: moving along the adversarial direction, correctly classified examples occupy only a narrow range of $\epsilon$; both sufficiently negative and sufficiently positive $\epsilon$ yield misclassification.]
Adversarial examples occur reliably for any sufficiently large value of $\epsilon$, given that we move in the correct direction.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Other Methods
Data Augmentation
The best way to make a machine learning model generalize better is to train it with
more data – however, in practice, the amount of data we have is limited.
Then, we can artificially enlarge the data by generating some variation of inputs
e.g.) flipping, cropping, translation, rotation, etc., for image inputs
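A minimal numpy sketch of two such label-preserving transformations for image arrays of shape (H, W, C): random horizontal flip and random crop with reflection padding. The padding size is an arbitrary illustrative choice.

```python
import numpy as np

def augment(img, pad=4, rng=None):
    """img: (H, W, C) array. Apply a random horizontal flip and a random crop."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                          # horizontal flip
    h, w, c = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    return padded[top:top + h, left:left + w, :]       # random crop back to (H, W, C)
```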
Data Augmentation
[Jaitly13] showed that data augmentation is effective for the speech recognition task as well. Data was generated by applying a random warp factor to each utterance, which maps the input frequencies to new frequencies.
Injecting noise in the input to a neural network can be also seen as a form of data
augmentation.
[Poole14] showed that injecting noise to hidden units can be highly effective if the
magnitude of the noise is carefully tuned.
[Jaitly13] N. Jaitly and G. E. Hinton, Vocal Tract Length Perturbation (VTLP) Improves Speech Recognition, ICML 2013
[Poole14] B. Poole, J. Sohl-Dickstein, and S. Ganguli, Analyzing Noise in Autoencoders and Deep Networks, arXiv:1406.1831
Tangent Propagation
In the tangent distance algorithm, the distance between two points is computed as the distance between the tangent planes spanned by their manifold tangent vectors.
This makes the distance invariant to the local factors of variation that correspond to movement along the manifold.
[Simard93] P. Y. Simard, Y, LeCun, and J. Denker, Efficient Pattern Recognition Using a New Transformation Distance, NIPS 1992
Tangent Propagation
Tangent propagation makes use of the manifold tangent vectors to make each output $f(x)$ of the neural net locally invariant to known factors of variation.
$\Omega(f) = \sum_{i} \big(\nabla_{x} f(x)^\top v^{(i)}\big)^2$
[Simard92] P. Y. Simard, Y, LeCun, and J. Denker, TangentProp – A Formalism for Specifying Selected Invariances in an Adaptive Network, NIPS 1991
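A minimal PyTorch sketch of this penalty, assuming a `model` with one scalar output per example and precomputed tangent vectors; names and the averaging over the batch are illustrative choices.

```python
import torch

def tangent_prop_penalty(model, x, tangents):
    """x: (batch, d) inputs; tangents: (batch, d) tangent vectors of the data manifold."""
    x = x.clone().requires_grad_(True)
    out = model(x).sum()                               # scalar output per example, summed over the batch
    grads, = torch.autograd.grad(out, x, create_graph=True)
    directional = (grads * tangents).sum(dim=1)        # directional derivative (grad_x f)^T v per example
    return (directional ** 2).mean()
```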
Manifold Tangent Classifier
The manifold tangent classifier [Rifai11] estimates the manifold tangent vectors itself, eliminating the need to know the tangent vectors a priori.
[Rifai11] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller, The Manifold Tangent Classifier, NIPS 2011
Tangent Propagation vs. Data Augmentation
TangentProp is very similar to data augmentation, in that the user encodes the prior
knowledge of the task by specifying a set of transformations that should not alter
the output of the network.
The difference is that data augmentation trains the network on explicitly transformed examples, while TangentProp adds an analytic penalty and does not require generating such examples.