
Regularization for Deep Learning

Tsz-Chiu Au
chiu@unist.ac.kr

Acknowledgment: The content of this file is based on the textbook as well as the slides provided by Prof. Sung Ju Hwang.
Deep Neural Networks
Generally have a lot more parameters to tune than non-deep neural networks.

However, this poses a new challenge in the training process.


Overfitting
One of the main problems that hindered the progress in deep learning research was
overfitting – the model works well on the training data, but not on the test data.

As model complexity goes up, the model requires much more data in order not to overfit.
Regularization
Any modification we make to a learning algorithm that is intended to reduce its
generalization error, but not its training error

There exist various strategies for regularization


1) Adding extra constraints on a machine learning model, such as restrictions on the parameter values.
2) Adding extra terms to the learning objective that can be thought of as soft constraints on the parameter values.
3) Combining multiple hypotheses that explain the training data.
4) Modifying the numerical optimization process.
5) Making use of prior knowledge of the data distribution to generate extra training data, or to regularize the model training.
Review: Bias and Variance
Example
Suppose that we generate 20 samples as follows, and use linear regression to fit the
generated data.
Example
If we repeat this 50 times, the line fitted to the generated samples will be different at each run.
Generalization Error
Given a new data point $x$, what is the expected prediction error of the hypothesis $h$?

$$E\left[\left(y - h(x)\right)^2\right]$$

We will derive the decomposition of this expected error into bias, variance, and noise terms.
Bias
Discrepancy between the averaged estimate and the true function

$$\operatorname{Bias}\left[h(x)\right] = \bar{h}(x) - f(x)$$

Source of Bias
Inability to represent certain decision boundaries
- e.g.) linear classifiers, decision trees.

Incorrect assumptions
- e.g.) a linear assumption for data generated from a higher-degree polynomial model.

Models that are too global (or too smooth)
- e.g.) a single linear separator.

If the bias is high, the model is underfitting the data.


Variance
Discrepancy between models trained on different training sets

$$\operatorname{Var}\left[h(x)\right] = E_D\left[\left(h(x) - \bar{h}(x)\right)^2\right]$$
Source of Variance
Statistical sources
- Classifiers that are too local and can easily fit the data.
- e.g.) nearest neighbors, large decision trees.

Computational sources
- Learning algorithms that make sharp decisions can be unstable, as the decision boundary can change if one training example changes.
- Randomization in the learning algorithm
- e.g.) neural networks with random initial weights.

If the variance is high, the model is overfitting the data.


Noise
Irreducible error that describes how $y$ varies from $f(x)$

$$\operatorname{Noise}(x) = E\left[\left(y - f(x)\right)^2\right] = E\left[\varepsilon^2\right] = \sigma^2$$
Bias-Variance Decomposition
We can decompose the expected prediction error using the variance lemma.

$$E_D\left[\left(y - h(x)\right)^2\right] = E_D\left[h(x)^2 - 2\,y\,h(x) + y^2\right] = E_D\left[h(x)^2\right] + E_D\left[y^2\right] - 2\,E_D[y]\,E_D\left[h(x)\right]$$

Let $\bar{h}(x) = E_D\left[h(x)\right]$ denote the mean prediction of the hypothesis at $x$.

The first term can be decomposed as follows, using the variance lemma $E\left[Z^2\right] = \operatorname{Var}[Z] + E[Z]^2$:

$$E_D\left[h(x)^2\right] = E_D\left[\left(h(x) - \bar{h}(x)\right)^2\right] + \bar{h}(x)^2$$

The second term can be decomposed as follows:

$$E_D\left[y^2\right] = E_D\left[\left(y - f(x)\right)^2\right] + f(x)^2$$
Bias-Variance Decomposition
Putting everything together, we have

$$
\begin{aligned}
E_D\left[\left(y - h(x)\right)^2\right]
&= E_D\left[h(x)^2 - 2\,y\,h(x) + y^2\right] \\
&= E_D\left[h(x)^2\right] + E_D\left[y^2\right] - 2\,E_D[y]\,E_D\left[h(x)\right] \\
&= E_D\left[\left(h(x) - \bar{h}(x)\right)^2\right] + \bar{h}(x)^2 + E_D\left[\left(y - f(x)\right)^2\right] + f(x)^2 - 2\,f(x)\,\bar{h}(x) \\
&= \underbrace{E_D\left[\left(h(x) - \bar{h}(x)\right)^2\right]}_{\text{Variance}} + \underbrace{\left(\bar{h}(x) - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{E_D\left[\left(y - f(x)\right)^2\right]}_{\text{Noise}}
\end{aligned}
$$
Bias / Variance and Model Complexity
Simple models have high bias, but low variance.

Complex models have low bias, but high variance.


Bias / Variance Trade-off
We want to reduce both sources of error to obtain a model that is ‘just right’.

However, finding the right complexity of the model is not a simple matter.
Most regularization methods aim to reduce variance at the cost of added bias.
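
As a concrete illustration of this decomposition, here is a minimal simulation sketch. The sinusoidal ground truth, the noise level, and the choice of a straight-line fit are illustrative assumptions, not taken from the slides; the sketch repeats the 20-sample linear fit many times and estimates the bias², variance, and noise terms numerically.

```python
# Hedged sketch: estimate bias^2, variance, and noise for a linear fit,
# assuming y = sin(2*pi*x) + Gaussian noise (an illustrative setup).
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

n_runs, n_samples, sigma = 50, 20, 0.3
x_test = np.linspace(0, 1, 200)
preds = np.empty((n_runs, x_test.size))

for r in range(n_runs):
    x = rng.uniform(0, 1, n_samples)
    y = true_f(x) + sigma * rng.normal(size=n_samples)
    w1, w0 = np.polyfit(x, y, deg=1)        # fit a straight line h(x) = w1*x + w0
    preds[r] = w1 * x_test + w0

h_bar = preds.mean(axis=0)                              # mean prediction
bias2 = np.mean((h_bar - true_f(x_test)) ** 2)          # Bias^2
variance = np.mean(preds.var(axis=0))                   # Variance
noise = sigma ** 2                                      # irreducible Noise
print(f"bias^2={bias2:.3f} variance={variance:.3f} noise={noise:.3f}")
print(f"sum={bias2 + variance + noise:.3f}  (approximately the expected squared error)")
```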
Norm-based Regularizations
Regularization in General Machine Learning
Context
Introduction of penalty terms to guide the learning with some additional information.

"! # = " # + &Ω #


Regularization term

.
1
" # = * /(#, 2+ , 3+ )
)
+,-

One type of penalty term we can add is a norm of the parameter $\theta$, which can bring many useful effects into the learning process.
L2 Regularization
L2 regularization, based on the L2 norm, is one of the most common types of regularization.

$$J(w) + \frac{\lambda}{2}\,\|w\|_2^2, \qquad \|w\|_2^2 = \sum_j w_j^2$$

The penalty is equivalent to the constraint $\|w\|_2^2 \le t$ for some budget $t$.

(Figure: the loss function contours and the L2 norm ball; the regularized solution lies between the minimum of the loss and the origin.)

Shrinks the value of the variables while favoring similar weights among them.
Ridge Regression
Linear regression with L2 regularization:

$$\min_w \; \frac{1}{N}\,\|y - Xw\|_2^2 + \lambda\,\Omega(w), \qquad \Omega(w) = \frac{1}{2}\,\|w\|_2^2$$

The closed-form solution is

$$w = \left(X^\top X + \lambda I\right)^{-1} X^\top y$$

(Figure: regularization paths, plotting $w/\max(w)$ as the amount of shrinkage varies.)

Shrinks all variables toward zero together – reduces variance while introducing bias.
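
A minimal sketch of ridge regression via this closed-form solution. The data, dimensions, and the name ridge_fit are illustrative, and constant factors in the objective are absorbed into lambda.

```python
# Hedged sketch: ridge regression weights w = (X^T X + lam * I)^{-1} X^T y.
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution for design matrix X and targets y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, 0.5, 0.0, -2.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    # Larger lam shrinks all weights toward zero together.
    print(lam, np.round(ridge_fit(X, y, lam), 3))
```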


L2 Regularization
Often referred to as the weight decay regularizer, as optimizing the L2 norm shrinks the value of the weights at each iteration.

Regularized objective: $\tilde{J}(w) = J(w) + \frac{\alpha}{2}\,w^\top w$

Corresponding gradient: $\nabla_w \tilde{J}(w) = \nabla_w J(w) + \alpha\,w$

Update at each gradient step:

$$w \leftarrow w - \epsilon\left(\alpha\,w + \nabla_w J(w)\right)$$

$$w \leftarrow \left(1 - \epsilon\alpha\right) w - \epsilon\,\nabla_w J(w)$$

At each step, the weight decay term will shrink the weight by a constant factor.
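
A minimal sketch of this update rule; epsilon, alpha, and grad_J are illustrative names for the learning rate, the decay coefficient, and the data-loss gradient.

```python
# Hedged sketch: one weight-decay gradient step, w <- (1 - eps*alpha)*w - eps*grad_J.
import numpy as np

def weight_decay_step(w, grad_J, epsilon, alpha):
    return (1.0 - epsilon * alpha) * w - epsilon * grad_J

w = np.array([1.0, -2.0, 0.5])
# With a zero data gradient, each step shrinks the weights by the constant
# factor (1 - epsilon*alpha).
print(weight_decay_step(w, grad_J=np.zeros(3), epsilon=0.1, alpha=0.5))  # [0.95 -1.9 0.475]
```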
Further Analysis on the L2 Regularization
The quadratic approximation of the objective function in the neighborhood of the weights $w^*$ that minimize the unregularized objective is

$$\hat{J}(w) = J(w^*) + \frac{1}{2}\left(w - w^*\right)^\top H \left(w - w^*\right)$$

where $H$ is the Hessian of $J$ at $w^*$.

The minimum of $\hat{J}$ occurs when $\nabla_w \hat{J}(w) = H\left(w - w^*\right) = 0$.

We then solve for the minimum $\tilde{w}$ of the regularized version of $\hat{J}$:

$$H\left(\tilde{w} - w^*\right) + \alpha\,\tilde{w} = 0$$
$$\left(H + \alpha I\right)\tilde{w} = H\,w^*$$
$$\tilde{w} = \left(H + \alpha I\right)^{-1} H\,w^*$$

What will happen then, as $\alpha$ grows?


Further Analysis on the L2 Regularization
Since $H$ is symmetric and real, we can decompose it into $H = Q\Lambda Q^\top$, where $\Lambda$ is a diagonal matrix of eigenvalues and $Q$ is the matrix of eigenvectors.

By applying this decomposition to the previous solution, we obtain

$$\tilde{w} = \left(Q\Lambda Q^\top + \alpha I\right)^{-1} Q\Lambda Q^\top w^* = \left[Q\left(\Lambda + \alpha I\right)Q^\top\right]^{-1} Q\Lambda Q^\top w^* = Q\left(\Lambda + \alpha I\right)^{-1}\Lambda\,Q^\top w^*$$

Thus, the effect of weight decay is to rescale $w^*$ along the axes defined by the eigenvectors of $H$.

Specifically, the component of $w^*$ that is aligned with the $i$-th eigenvector of $H$ is rescaled by a factor of $\frac{\lambda_i}{\lambda_i + \alpha}$.
Further Analysis on the L2 Regularization
" !
With such rescaling scheme , the effect of regularization is relatively small when
!" #$
% ≫ ', or large enough to shrink the value to have near zero magnitude if % ≪ '

The horizontal term does not increase


much when moving away from )∗

The vertical term decreases the objective


much when moving away from )∗

Thus, only the directions along which the parameter contribute significantly to reducing
the objective function will be preserved relatively intact.
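
A quick numerical check of this rescaling claim. The Hessian H, the unregularized optimum w_star, and alpha below are arbitrary illustrative values.

```python
# Hedged sketch: verify that (H + alpha*I)^{-1} H w* rescales each eigen-component
# of w* by lambda_i / (lambda_i + alpha).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
H = A @ A.T                      # a symmetric positive semi-definite "Hessian"
w_star = rng.normal(size=3)
alpha = 0.5

w_tilde = np.linalg.solve(H + alpha * np.eye(3), H @ w_star)

lam, Q = np.linalg.eigh(H)       # eigenvalues and eigenvectors of H
print(Q.T @ w_tilde)                         # regularized solution in the eigenbasis
print(lam / (lam + alpha) * (Q.T @ w_star))  # same values: each component rescaled
```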
L1 regularization
Results in parameter selection.

$$\min_w \; J(w) + \lambda\,\|w\|_1, \qquad \|w\|_1 = \sum_j |w_j|$$

The penalty is equivalent to the constraint $\|w\|_1 \le t$ for some budget $t$.

(Figure: the loss contours and the L1 norm ball; the solution is found at the “corners” of the ball, where some coordinates are exactly zero.)

As the dimensionality of $w$ increases, the norm ball has increasingly many corners.
Sparsity
We define the L0 pseudonorm as follows:

$$\|w\|_0 = \left|\left\{\, i = 1, \dots, d \;:\; w_i \neq 0 \,\right\}\right|$$

This is a measure of how many of the variables are non-zero.

L0 regularization results in learning sparse models by selecting variables. L1 regularization is a convex relaxation of L0 regularization.

Sparsity enables feature selection and model compression, and results in models with better interpretability.
Why is Sparsity Good?
Model compression – With most of the parameters set to zero, sparsity can greatly
reduce the memory and computational requirements.

An 8x8 matrix requires 512 bytes to store in double precision.

With only 12 nonzero entries, it requires 96 bytes plus additional memory to store the indices.
Why is Sparsity Good?
Better Interpretability – The learned model can be better explained in terms of the selected non-zero entries.

[Hwang et al.] S. J. Hwang and L. Sigal, A Unified Semantic Embedding: Relating Taxonomies with Attributes, NIPS 2014
Why is Sparsity Good?
Feature selection - Identifies features that are truly relevant to the target task.

Useful for high-dimensional learning and data-driven methods


Lasso (L1-regularized Linear Regression)
Shorthand for Least Absolute Shrinkage and Selection Operator
Linear regression with L1 regularization:

$$\min_w \; \frac{1}{N}\,\|y - Xw\|_2^2 + \lambda\,\Omega(w), \qquad \Omega(w) = \|w\|_1$$

(Figure: regularization paths, plotting $w/\max(w)$ as the amount of shrinkage varies.)

Not all variables become zero at the same time – this results in variable selection.
Subgradient Method
The L1 norm is non-smooth (non-differentiable). In such cases, we can compute a set of subgradients, which define the gradient at the non-smooth points.

$$f: \mathbb{R} \to \mathbb{R}, \qquad f(x) = |x|$$

For $x \neq 0$, the subgradient is $g = \operatorname{sign}(x)$.
For $x = 0$, the subgradient $g$ is any element of $[-1, 1]$.
Gradient of L1-Regularized Objectives

Regularized objective: $\tilde{J}(w) = J(w) + \lambda\,\|w\|_1$

Corresponding gradient: $\nabla_w \tilde{J}(w) = \nabla_w J(w) + \lambda\,\operatorname{sign}(w)$

The regularization contribution to the gradient no longer scales linearly with each $w_i$; instead, it is a constant factor with a sign equal to $\operatorname{sign}(w_i)$.
Stochastic Subgradient Descent
Deep networks mostly use stochastic gradient descent due to its scalability; it computes the gradient only on randomly sampled training instances.

$$x^{(t+1)} = x^{(t)} - \alpha_t\, g^{(t)}, \qquad g^{(t)} \in \partial f\left(x^{(t)}\right)$$

where $\alpha_t$ is the step size and $g^{(t)}$ is a subgradient direction at iteration $t$.

A random vector $\tilde{g}$ is a noisy unbiased subgradient for $f: \mathbb{R}^n \to \mathbb{R}$ at $x$ if, for all $z$,

$$f(z) \ge f(x) + E\left[\tilde{g}\right]^\top (z - x)$$

Define the best value found so far as $f^{*} = \min\left\{f\left(x^{(1)}\right), \dots, f\left(x^{(t)}\right)\right\}$.

However, SGD often does not generate sparse solutions for L1-regularized objectives.
Proximal Gradient
Specifically tailored for regularized optimization problems where $g(w)$ is differentiable (smooth) but $h(w)$ is a general convex function (smooth or nonsmooth):

$$\min_w \; g(w) + h(w)$$

The iterate is obtained by taking a gradient step only on $g(w)$, followed by the proximal operator:

$$w^{(t+1)} = \operatorname{prox}_{\lambda h}\left(w^{(t)} - \lambda\,\nabla g\left(w^{(t)}\right)\right)$$

$$\operatorname{prox}_{\lambda h}(w) = \operatorname*{argmin}_{w^*} \; \frac{1}{2}\,\|w - w^*\|_2^2 + \lambda\,h(w^*)$$

(The first term is a Euclidean projection-like proximity term.)

The proximal operator $\operatorname{prox}_{\lambda h}(w)$ should be efficient to evaluate – a closed-form solution is preferred.
Proximal Operator for L1-norm regularization
Iterative soft-thresholding operator – reduce the absolute value of each component of $w$ by $\lambda$, and if the resulting value is below zero, set it to zero.

$$\min_w \; g(w) + h(w), \qquad h(w) = \lambda\,\|w\|_1$$

$$\operatorname{prox}_\lambda(w^*) = \operatorname{sign}(w^*)\left(|w^*| - \lambda\right)_+$$
Summary: L2- vs L1-regularization
L2 regularization promotes grouping – results in equal weights for correlated features
L1 regularization promotes sparsity – selects few informative features
L2-regularization: $\Omega(w) = \frac{1}{2}\,\|w\|_2^2$

L1-regularization: $\Omega(w) = \|w\|_1$
Lp-Norms
General case $\|w\|_p$, where $p$ can be any non-negative number.

Lp norms with $p < 2$ promote sparsity.

Lp norms with $p > 2$ promote grouping.
Multi-task Learning
Multitask Learning
Learning a model considering only a single task, using the labeled data only for the
specific task, is prone to overfitting.

Given a resized image of the roads for training, we want to learn an ANN (Artificial Neural Network), which can predict the steering direction.

[Caruana97] R. Caruana, Multitask Learning, CMU Thesis CMU-CS-97-203, 1997


Multitask Learning
Auxiliary tasks relevant to the target task can help learn a model that generalizes better when they are trained alongside the target task, sharing information across multiple tasks.

Auxiliary tasks (sharing the hidden representation with the steering task):
1) 1 or 2 lanes?
2) Location of the centerline
3) Location of the left edge of the road
4) Location of the right edge of the road
5) Location of the road center

This gave a 27.7% reduction in error.

[Baxter95] showed that sharing parameters can result in improved generalization and
generalization error bounds.
[Baxter95] J. Baxter, Learning Internal Representations, COLT 95
Multitask Learning
(Figure: a taxonomy-regularized deep convolutional network. Base-level categories such as cheetah, jaguar, and leopard are related through a superclass (felid): specialization via difference-pooling and generalization via min-pooling on top of category-specific features from a deep convolutional neural network.)

Accuracy on CIFAR-100: previous state of the art 67.38%, this model 71.81%.
[Goo16] Wonjun Goo, Juyong Kim, Gunhee Kim, and Sung Ju Hwang, Taxonomy-Regularized Semantic Deep Convolutional Network, TO APPEAR, ECCV 2016
Ensemble Methods
Ensemble
Combining multiple hypotheses into one.

(Figure: k models trained on the labeled data $D = \{(x_i, y_i)\}$ are combined into a single ensemble model.)

Improves accuracy and robustness over a single model.


Ensemble
Most state-of-the-art results are obtained from ensembles of deep networks.
Why Do Ensemble Models Work?
Suppose that we have 5 independent classifiers for majority voting.

If the accuracy is 70% for each of the five classifiers, the accuracy of majority voting is:

$$0.7^5 + 5\,(0.7)^4(0.3) + 10\,(0.7)^3(0.3)^2 = 0.8369$$

If there are 101 such classifiers, the majority-voting accuracy exceeds 0.99.
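
A quick check of these numbers, assuming independent classifiers that are each correct with probability p (the function name is illustrative).

```python
# Hedged sketch: accuracy of majority voting over n independent classifiers.
from math import comb

def majority_vote_accuracy(n, p):
    """P(more than half of n independent classifiers are correct), n odd."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

print(round(majority_vote_accuracy(5, 0.7), 4))    # 0.8369
print(round(majority_vote_accuracy(101, 0.7), 4))  # ~1.0, i.e. well above 0.99
```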
Why Do Ensemble Models Work?
Ensemble gives a global picture of the actual distribution

(Figure: several models, each capturing a different part of some unknown distribution; together they cover it better than any single model.)
Types of Ensemble Methods
Combine by consensus – Bagging, random forest, model averaging of probabilities…

(Figure: training – k models are trained on the labeled data $D = \{(x_i, y_i)\}$; test – for an unlabeled input $x^*$, the k predictions are combined by majority voting or averaging into the final prediction.)
Types of Ensemble Methods
Combine by learning – Boosting, rule ensemble, Bayesian model averaging…

(Figure: training – k models are trained on the labeled data $D = \{(x_i, y_i)\}$ and combined into a learned ensemble model; test – the ensemble model produces the final prediction for an unlabeled input $x^*$.)
Bagging
Create ensembles by repeatedly resampling the training data.

(Figure: k models, each trained on a different random sample of the data; their outputs are combined by averaging or voting into the final prediction.)

Given n data samples and a class of learning models (e.g. neural networks, decision trees), train k models on different samples and average their predictions.
Bagging
Training
In each iteration $t = 1, \dots, k$:
- Randomly sample, with replacement, $n$ samples from the training set.
- Train a chosen base model on these samples.

Test
For each test example:
- Run all $k$ trained base models.
- Combine the results of all $k$ trained models.
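
A minimal sketch of this procedure; base_model_factory is an illustrative callable that returns a fresh model with fit and predict methods.

```python
# Hedged sketch of bagging: bootstrap-sample the data k times, train one model
# per sample, and average the predictions at test time.
import numpy as np

def bagging_train(X, y, base_model_factory, k, seed=0):
    rng = np.random.default_rng(seed)
    n, models = len(X), []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)     # sample n points with replacement
        model = base_model_factory()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    # Average for regression; use majority voting for classification instead.
    return np.mean([m.predict(X) for m in models], axis=0)
```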
Why Does Bagging Work?
Example

(Figure: the original dataset and two datasets resampled from it with replacement.)
Why Does Bagging Work?
Assume that we measure a random variable $X \sim \mathcal{N}(\mu, \sigma^2)$.

If we only have a single measurement $X_1$,

$$E[X_1] = \mu, \qquad \operatorname{Var}[X_1] = \sigma^2$$

If we measure $X$ $k$ times and estimate its value as $(X_1 + X_2 + \cdots + X_k)/k$,

$$E\left[\tfrac{1}{k}(X_1 + X_2 + \cdots + X_k)\right] = \mu; \quad \text{however,} \quad \operatorname{Var}\left[\tfrac{1}{k}(X_1 + X_2 + \cdots + X_k)\right] = \frac{k\sigma^2}{k^2} = \frac{\sigma^2}{k}$$

Thus, bagging results in a model with smaller variance.


Dropout
Dropout
Randomly drop a neuron with some predefined probability at each iteration.

Training

Testing

[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Why Dropout Works
With dropout, we can learn exponentially many different networks that share weights in a single training phase – up to $2^n$ networks for $n$ units.

This has the effect of combining many different neural networks from a single training run, as in bagging – and thus effectively prevents overfitting.
Dropout vs. Bagging
However, dropout training differs from bagging in some aspects

1) All models share parameters – they are no longer really trained on separate
subsets of the dataset

2) Individual models are not trained to convergence – in fact, the vast majority of sub-networks will never be trained

3) Since there are too many models to average explicitly, dropout averages them
together with a fast approximation – geometric mean rather than arithmetic
mean.
Dropout as Bagging
Suppose that each model outputs a probability distribution

In the case of bagging, each model produces a probability distribution $p_i(y \mid x)$, and the prediction of the ensemble is given as the arithmetic mean of all these distributions:

$$\frac{1}{k} \sum_{i=1}^{k} p_i(y \mid x)$$

In the case of dropout, each sub-model defined by a mask vector $\mu$ defines a probability distribution $p(y \mid x, \mu)$, and the arithmetic mean over all masks is given by

$$\sum_{\mu} p(\mu)\, p(y \mid x, \mu)$$

However, since this sum has an exponential number of terms, it is intractable to evaluate
→ average together the outputs from masks obtained by sampling, for tractability.
Dropout as Bagging
We can instead use the geometric mean rather than the arithmetic mean of the ensemble members, which empirically works well for model averaging:

$$\tilde{p}_{\text{ensemble}}(y \mid x) = \left(\prod_{\mu} p(y \mid x, \mu)\right)^{1/2^d}$$

where $d$ is the number of units that may be dropped.

However, since the geometric mean of multiple probability distributions is not guaranteed to be a probability distribution, we renormalize the resulting distribution:

$$p_{\text{ensemble}}(y \mid x) = \frac{\tilde{p}_{\text{ensemble}}(y \mid x)}{\sum_{y'} \tilde{p}_{\text{ensemble}}(y' \mid x)}$$

$p_{\text{ensemble}}(y \mid x)$ can be approximated by evaluating $p(y \mid x)$ in one model: the model with all units, but with the weights out of each unit multiplied by the probability of including it
→ the weight scaling inference rule – it captures the right expected value of the output from each unit.
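
A minimal sketch of dropout for one layer of activations: Bernoulli masking at training time and the weight-scaling rule at test time. The value of p_keep and the array shapes are illustrative.

```python
# Hedged sketch: training-time dropout masking and test-time weight scaling.
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p_keep):
    mask = rng.random(h.shape) < p_keep   # keep each unit with probability p_keep
    return h * mask

def dropout_test(h, p_keep):
    # Weight-scaling inference rule: keep all units but scale by p_keep,
    # so the expected input to the next layer matches training time.
    return h * p_keep

h = rng.normal(size=(4, 8))               # activations of one hidden layer
print(dropout_train(h, p_keep=0.5))       # roughly half of the units are zeroed
print(dropout_test(h, p_keep=0.5))        # all units kept, scaled by 0.5
```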
Dropout as Bagging
For a single softmax layer, the predictive distribution defined by renormalizing the geometric mean over all ensemble members' predictions is simply $\operatorname{softmax}\left(\frac{1}{2} W^\top x + b\right)$.

In other words, the average prediction of exponentially many sub-models can be computed simply by running the full model with the weights divided by 2 (for all-linear deep neural networks, or for a single-layer neural network with softmax activation).

More specifically, dropout does exact model averaging in deep networks given that
they are locally linear along the space of inputs to each layer that are visited by
applying different dropout masks.

For more general networks, weight scaling is only an approximation.


[Goodfellow14] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout Networks, ICML 2013
Maxout Networks with Dropout
Maxout trained with dropout may have the identity of the maximal filter in each
unit change relatively rarely as the dropout mask changes

Thus, in effect the network is acting like a linear one, making the approximate
model averaging more exact (in contrast to a network with nonlinear units)
Further, while dropout drives most of the activations to 0 (for ReLU units), maxout units are always active and thus can better utilize the model capacity.
[Goodfellow14] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout Networks, ICML 2013
Advantages of Dropout
1) Dropout is more effective than other standard regularizers.
Classification error on the MNIST dataset (%):
L2: 1.62
L2 + L1 applied at the end of training: 1.60
Max-norm: 1.35
Dropout + L2: 1.25
Dropout + Max-norm: 1.05

2) It is computationally cheap, requiring only $O(n)$ computation per example.

3) Dropout does not significantly limit the type of model or training procedure – it works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent.
Effect of Dropout on Features
Dropout prevents co-adaptation of features – each neuron must do well even when the other neurons are not there.

Without dropout With dropout

[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Experimental Results
Dropout consistently improves performance on almost all datasets

[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Effect of Dropout on Sparsity
Dropout results in sparse activations, even when no sparsity inducing regularizers
are present.

[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
Effect of Dropout Rate
Randomly drop a neuron with some predefined probability at each iteration.

[Srivastava14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
DropConnect
Randomly drop a weight, instead of a neuron

[Wan13] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, Regularization of Neural Networks using DropConnect, ICML 2013
DropConnect
DropConnect often outperforms dropout

[Wan13] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, Regularization of Neural Networks using DropConnect, ICML 2013
Regularizations via Numerical Optimization
Early Stopping
When training large models, we often observe that the training error decreases steadily over time, but the validation error begins to rise again after a certain number of iterations.

We can treat the number of iterations as a hyperparameter to tune, and stop the iterations before reaching the local minimum of the training loss, based on the validation loss.
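
A minimal early-stopping loop sketch; train_step, validation_loss, and the patience value are illustrative names and choices, not taken from the slides.

```python
# Hedged sketch: stop training when the validation loss has not improved
# for `patience` consecutive iterations, and return the best model seen.
import copy

def train_with_early_stopping(model, train_step, validation_loss,
                              max_iters=10000, patience=20):
    best_loss, best_model, since_best = float("inf"), copy.deepcopy(model), 0
    for _ in range(max_iters):
        train_step(model)                  # one optimization step on the training set
        val = validation_loss(model)       # monitor generalization on held-out data
        if val < best_loss:
            best_loss, best_model, since_best = val, copy.deepcopy(model), 0
        else:
            since_best += 1
            if since_best >= patience:     # stop before overfitting sets in
                break
    return best_model, best_loss
```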
Early Stopping and L2-Regularization
Early stopping can be shown to be equivalent to L2 regularization (under the quadratic approximation derived below).

However, early stopping only needs to train the model once, while weight decay requires many training runs with different values of its hyperparameter.
Early Stopping and L2-Regularization
We simplify the problem by assuming that the only parameters are linear weights, and consider the quadratic approximation of the cost function:

$$\hat{J}(w) = J(w^*) + \frac{1}{2}\left(w - w^*\right)^\top H \left(w - w^*\right)$$

Let us suppose that we update the parameters via gradient descent. The gradient of the above quadratic approximation is $\nabla_w \hat{J}(w) = H\left(w - w^*\right)$, so

$$w^{(\tau)} = w^{(\tau-1)} - \epsilon\,\nabla_w \hat{J}\left(w^{(\tau-1)}\right) = w^{(\tau-1)} - \epsilon H\left(w^{(\tau-1)} - w^*\right)$$

$$w^{(\tau)} - w^* = \left(I - \epsilon H\right)\left(w^{(\tau-1)} - w^*\right)$$

We can perform an eigendecomposition on $H$, $H = Q\Lambda Q^\top$, and get the following:

$$Q^\top\left(w^{(\tau)} - w^*\right) = \left(I - \epsilon\Lambda\right) Q^\top\left(w^{(\tau-1)} - w^*\right)$$
Early Stopping and L2-Regularization
Assuming that $w^{(0)} = 0$, and that $\epsilon$ is chosen small enough to guarantee $\left|1 - \epsilon\lambda_i\right| < 1$, the parameter trajectory after $\tau$ parameter updates is as follows:

$$Q^\top w^{(\tau)} = \left[I - \left(I - \epsilon\Lambda\right)^\tau\right] Q^\top w^*$$

For L2 regularization, we can get the expression for $Q^\top \tilde{w}$:

$$Q^\top \tilde{w} = \left[I - \left(\Lambda + \alpha I\right)^{-1}\alpha\right] Q^\top w^*$$

If we choose the hyperparameters $\epsilon$, $\alpha$, and $\tau$ such that $\left(I - \epsilon\Lambda\right)^\tau = \left(\Lambda + \alpha I\right)^{-1}\alpha$, then the above two formulations become equivalent.
Batch Normalization
While the gradient for each layer is computed under the assumption that the other layers do not change, all layers are actually trained simultaneously, and thus they all change.
(Figure: a small two-layer sigmoid network with the squared-error loss $\frac{1}{2}\sum_k (y_k - t_k)^2$, the sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$, and its derivative $\frac{\partial y_k}{\partial x_k} = y_k(1 - y_k)$; the input distribution of each layer shifts as the earlier layers are updated.)

Thus, the layers must continuously adapt to continuously changing distributions.

This problem is referred to as internal covariate shift – it makes the network difficult to train, requiring careful tuning of parameters such as the learning rate.
Batch Normalization
Batch normalization solves this by adding a normalization transformation that estimates the mean and variance from mini-batch statistics.

Batch normalization greatly speeds up the training process, since the network becomes less sensitive to the learning rate.
[Ioffe15] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015
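
A minimal sketch of the batch-normalization transform at training time, using mini-batch statistics. Here gamma and beta stand for the learned scale and shift parameters, and eps is a small constant for numerical stability.

```python
# Hedged sketch: batch-norm forward pass for a fully connected layer.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # learned scale and shift restore capacity

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```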
Batch Normalization as Regularization
The primary purpose of batch normalization is to improve optimization, but the added noise also has a regularization effect, sometimes making dropout unnecessary.

Batch normalization provides a regularization effect similar to dropout, since the activations observed for a training example are affected by the random selection of the other examples in the same mini-batch.
Weight Tying (Parameter Sharing)
Parameter Sharing
If we have prior knowledge that two weights should be close to each other, how
can we enforce it in the learned model?

One way is to use an L2 penalty on the difference between the two sets of weights:

$$\Omega\left(w^{(A)}, w^{(B)}\right) = \left\|w^{(A)} - w^{(B)}\right\|_2^2$$

However, a more popular way is to force the sets of parameters to be equal, $w^{(A)} = w^{(B)}$, which is often referred to as parameter sharing.

By sharing the parameters, we can significantly lower the model complexity (and the degrees of freedom) of the problem.
Parameter Sharing in CNN
Natural images have many statistical properties that are invariant to translation → share a set of features across all positions of an image.

(Figure: units in layer m are connected to layer m-1 through the same, shared set of weights at every position.)

This allows the network to detect features regardless of their position in the visual field.

It also greatly reduces the number of parameters to learn.
Parameter Sharing in CNN
Results in learning a shared common filter for an input image (or a feature map)
Parameter Sharing in RNN
RNN also shares weights across different timesteps.

&"#$ &" &"'$


()0 ()0 ()0
ℎ" = +((-) %"#$ + ()) ℎ"#$ )
()) ()) ()) ())
ℎ"#$ ℎ" ℎ"
&" = + ()0 ℎ"

(-) (-) (-)

%"#$ %" %"'$

Enable the RNN to work with sequences of varying length.
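
A minimal sketch of a recurrent forward pass in which the same weight matrices are applied at every timestep. The tanh nonlinearity and the array shapes are illustrative choices.

```python
# Hedged sketch: an RNN forward pass with weights shared across time.
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, h0):
    h, ys = h0, []
    for x_t in x_seq:                     # the same W_xh, W_hh, W_hy at every step
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        ys.append(W_hy @ h)
    return np.stack(ys), h

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 3, 5, 2, 7          # works for any sequence length T
W_xh, W_hh, W_hy = (rng.normal(size=s) for s in [(d_h, d_in), (d_h, d_h), (d_out, d_h)])
ys, h_T = rnn_forward(rng.normal(size=(T, d_in)), W_xh, W_hh, W_hy, np.zeros(d_h))
print(ys.shape)                           # (7, 2): one output per timestep
```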


Adversarial Training
Adversarial Examples
[Szegedy14] found that it is possible to intentionally construct examples that make networks that perform at human-level accuracy fail almost always.

(Figure: $x$ plus $0.007 \times \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$ gives an adversarial example. The original image is classified as a panda with 57.7% confidence, the perturbation as a nematode with 8.2% confidence, and the perturbed image as a gibbon with 99.3% confidence.)
[Szegedy14] C. Szegedy, W. Zeremba, I, Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, Intriguing Properties of Neural Networks, ICLR 2014
Adversarial Examples
By adding small noise to the original images, all of them are recognized as “ostrich”.
Adversarial Examples
[Goodfellow15] showed that a primary cause of this problem is excessive linearity,
since the neural networks are built out of primarily linear building blocks.

If we change each input by $\epsilon$, then a linear function with weights $w$ can change by as much as $\epsilon\,\|w\|_1$. This can be a very large amount when $w$ is high-dimensional:

$$w^\top \tilde{x} = w^\top x + w^\top \eta$$

We can maximize the increase, subject to the max-norm constraint on $\eta$, by assigning $\eta = \epsilon\,\operatorname{sign}(w)$.

Adversarial training can discourage this highly sensitive, locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Fast Gradient Method
We can linearize the cost function around the current value of $\theta$, obtaining an optimal max-norm constrained perturbation of

$$\eta = \epsilon\,\operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$$

We can add this vector to the original input data to generate adversarial examples.

This method is referred to as the fast gradient sign method.
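
A minimal sketch of this perturbation; grad_x_loss is an illustrative stand-in for $\nabla_x J(\theta, x, y)$, which in practice would come from an autograd framework.

```python
# Hedged sketch: fast gradient sign perturbation of an input x.
import numpy as np

def fgsm_perturb(x, y, grad_x_loss, epsilon=0.007):
    eta = epsilon * np.sign(grad_x_loss(x, y))   # max-norm constrained perturbation
    x_adv = x + eta
    return np.clip(x_adv, 0.0, 1.0)              # keep pixel values in a valid range
```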


Fast Gradient Method
The fast gradient method applied to logistic regression

(Figure: the weights of a logistic regression model trained on MNIST, and the sign of those weights.)
Fast Gradient Method
The fast gradient method applied to logistic regression

(Figure: MNIST classes 3 and 7. The logistic regression model has a 1.6% error rate on the clean examples, but a 99% error rate on the generated adversarial examples.)
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Adversarial Training of Deep Networks
Training with an adversarial objective function based on the fast gradient sign
method is an effective regularizer.

"! #, %, & = (" #, %, & + 1 − ( " #, % + ,-./0 1% " #, %, & , &

[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
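
A minimal sketch of this objective; loss_fn and grad_x are illustrative stand-ins for the data loss $J$ and its input gradient, which in practice come from an autograd framework, and the values of alpha and epsilon are illustrative.

```python
# Hedged sketch: adversarial training loss mixing clean and FGSM-perturbed inputs.
import numpy as np

def adversarial_loss(theta, x, y, loss_fn, grad_x, alpha=0.5, epsilon=0.25):
    x_adv = x + epsilon * np.sign(grad_x(theta, x, y))   # fast gradient sign perturbation
    return alpha * loss_fn(theta, x, y) + (1.0 - alpha) * loss_fn(theta, x_adv, y)
```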
Adversarial Training of Deep Networks
Weights of the learned networks change significantly.

Naïve model Model with adversarial training

Weights learned with adversarial training are more localized and interpretable.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Why Do Adversarial Examples Generalize?
An example generated for one model is often misclassified by other models, even when they have different architectures or were trained on disjoint training sets.

(Figure: predictions as the perturbation coefficient $\epsilon$ is swept from negative to positive values; the examples remain correctly classified only in a narrow band around $\epsilon = 0$.)

Adversarial examples occur reliably for any sufficiently large value of !, given that
we move in the correct direction.
[Goodfellow15] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and Harnessing Adversarial Examples, ICLR 2015
Other Methods
Data Augmentation
The best way to make a machine learning model generalize better is to train it with
more data – however, in practice, the amount of data we have is limited.

We can then artificially enlarge the data by generating variations of the inputs,
e.g.) flipping, cropping, translating, rotating, etc., for image inputs.
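
A minimal sketch of two of these image transformations. The array layout (H, W, C), the padding amount, and the function names are illustrative.

```python
# Hedged sketch: random horizontal flip and random crop for image augmentation.
import numpy as np

def random_flip(x, rng):
    return x[:, ::-1, :] if rng.random() < 0.5 else x      # flip left-right

def random_crop(x, rng, pad=4):
    h, w, _ = x.shape
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)       # random crop offset
    return padded[top:top + h, left:left + w, :]

def augment(x, rng=np.random.default_rng(0)):
    return random_crop(random_flip(x, rng), rng)

print(augment(np.zeros((32, 32, 3))).shape)                # (32, 32, 3)
```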
Data Augmentation
[Jaitly13] showed that data augmentation is effective for the speech recognition task as well. Data was generated by applying a random warp factor to each utterance, mapping the input frequencies to new frequencies.

Injecting noise into the input of a neural network can also be seen as a form of data augmentation.

[Poole14] showed that injecting noise to hidden units can be highly effective if the
magnitude of the noise is carefully tuned.

[Jaitly13] N. Jaitly and G. E. Hinton, Vocal Tract Length Perturbation (VTLP) Improves Speech Recognition, ICML 2013
[Poole14] B. Poole, J. Sohl-Dickstein, and S. Ganguli, Analyzing Noise in Autoencoders and Deep Networks, arXiv:1406.1831
Tangent Propagation
In the tangent distance algorithm, the distance between two points is computed as the distance between their manifold tangent vectors.

This makes the distance invariant to the local factors of variation that correspond to movement on the manifold.
[Simard93] P. Y. Simard, Y, LeCun, and J. Denker, Efficient Pattern Recognition Using a New Transformation Distance, NIPS 1992
Tangent Propagation
Tangent propagation makes use of the manifold tangent vectors to enforce that each output $f(x)$ of the neural net is locally invariant to known factors of variation:

$$\Omega(f) = \sum_i \left(\left(\nabla_x f(x)\right)^\top v^{(i)}\right)^2$$

[Simard92] P. Y. Simard, Y, LeCun, and J. Denker, TangentProp – A Formalism for Specifying Selected Invariances in an Adaptive Network, NIPS 1991
Manifold Tangent Classifier
The manifold tangent classifier [Rifai11] estimates the manifold tangent vectors, eliminating the need to know the tangent vectors a priori.

This can be done using the following simple algorithm:


1) Use an autoencoder to learn the manifold structure by unsupervised learning.
2) Use these tangents to regularize a neural net classifier as in tangent prop.

[Rifai11] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller, The Manifold Tangent Classifier, NIPS 2011
Tangent Propagation vs. Data Augmentation
TangentProp is very similar to data augmentation, in that the user encodes the prior
knowledge of the task by specifying a set of transformations that should not alter
the output of the network.

The difference is that data augmentation trains the network with explicitly transformed examples, while TangentProp does not require generating such examples.

Drawbacks of TangentProp relative to Data augmentation


1) It only regularizes the model to resist very small perturbations

2) The infinitesimal approach poses difficulties for models based on ReLU


Any questions?
Next class – Optimization for Deep Learning
