(DL) Ch04-Regularization

The document discusses various regularization techniques in deep learning, including parameter norm penalties, constrained optimization, and dataset augmentation, to improve model generalization. Key methods such as L1 and L2 regularization, dropout, and semi-supervised learning are highlighted, emphasizing their roles in reducing test error and enhancing model performance. The lecturer is Duc Dung Nguyen from the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology.


Deep Learning

Regularization

Lecturer: Duc Dung Nguyen, PhD.


Contact: [email protected]

Faculty of Computer Science and Engineering


Ho Chi Minh City University of Technology
Contents

1. Parameter Norm Penalties

2. Constrained Optimization

3. Dataset Augmentation

4. Other Regularization Approaches

5. Dropout



Parameter Norm Penalties
Regularization

• Problem in ML: generalization!


• Regularization: strategies are explicitly designed to reduce the test error, possibly at the
expense of increased training error.



Regularization

• Most regularization strategies are based on regularizing estimators


• Regularization of an estimator works by trading increased bias for reduced variance
• An effective regularizer: makes a profitable trade, reducing variance significantly while
not overly increasing the bias



Parameter Norm Penalties

• Main regularization approach: limit the capacity of the model by adding a parameter
norm penalty Ω(θ) to the objective function J

$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$   (1)

• Different choices of the parameter norm Ω can result in different solutions being preferred
• In NNs, Ω is chosen to penalize only the weights of the affine transformation at each layer
(leave the bias unregularized)



Parameter Norm Penalties

• The most common parameter norm penalty: L2 (weight decay), also known as ridge
regression or Tikhonov regularization

$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \Omega(w) = J(w; X, y) + \frac{\alpha}{2} w^\top w$   (2)
with the corresponding parameter gradient

$\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$   (3)

• Gradient step

$w \leftarrow w - \epsilon(\alpha w + \nabla_w J(w; X, y)) = (1 - \epsilon\alpha) w - \epsilon \nabla_w J(w; X, y)$   (4)
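
A minimal NumPy sketch of the weight-decay step in Eq. (4). The quadratic toy loss, the step size ε, and the coefficient α below are illustrative choices, not taken from the slides.

```python
import numpy as np

# Toy least-squares loss J(w) = ||Xw - y||^2 / (2m), an illustrative stand-in for J
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
eps, alpha = 0.01, 0.1                      # step size and weight-decay coefficient

for _ in range(2000):
    grad_J = X.T @ (X @ w - y) / len(y)     # gradient of the unregularized loss
    w -= eps * (alpha * w + grad_J)         # Eq. (4): shrink w, then follow grad_J
print(w)                                    # weights pulled toward zero relative to plain least squares
```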



Parameter Norm Penalties

• L1 regularization
$\Omega(\theta) = \|w\|_1 = \sum_i |w_i|$   (5)

• Regularized objective function

$\tilde{J}(w; X, y) = J(w; X, y) + \alpha \|w\|_1$   (6)

with the corresponding parameter gradient

$\nabla_w \tilde{J}(w; X, y) = \alpha\, \mathrm{sign}(w) + \nabla_w J(w; X, y)$   (7)
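
The matching L1 subgradient step from Eq. (7), in the same illustrative toy setting as the L2 sketch above; the constant-magnitude pull α·sign(w) drives weakly useful weights toward zero, unlike the proportional shrinkage of L2.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
eps, alpha = 0.01, 0.1

for _ in range(2000):
    grad_J = X.T @ (X @ w - y) / len(y)
    w -= eps * (alpha * np.sign(w) + grad_J)   # Eq. (7): subgradient of alpha * ||w||_1
print(np.round(w, 3))                          # weights tied to weak features sit near zero
```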



Parameter Norm Penalties

• The regularization contribution to the gradient no longer scales linearly with each $w_i$
• L1 regularization results in a solution that is more sparse than the L2 solution: some
parameters have an optimal value of exactly zero.
• The sparsity property of L1 regularization has been used extensively as a feature selection
mechanism



Constrained optimization
Constrained Optimization

Constrained optimization

• Find the maximal or minimal value of f (x) for values of x in some set S
• Feasible points: points x that lie within the set S
• Find a solution that is small in some sense
• Common approach: impose a norm constraint, such as $\|x\| \le 1$



Constrained Optimization

Approach to constrained optimization

• Modify gradient descent to take the constraint into account


• If we use a small constant step size ε, we can make gradient descent steps and then project
the result back into S.
• If we use a line search, we can search only over step sizes ε that yield new x points that
are feasible, or we can project each point on the line back into the constraint region.
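
A minimal sketch of the first approach (constant step size ε followed by projection), assuming the feasible set is the unit L2 ball ‖x‖ ≤ 1; the quadratic objective below is an illustrative choice, not from the slides.

```python
import numpy as np

def project_to_ball(x, radius=1.0):
    """Project x back onto the feasible set S = {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

# Illustrative objective f(x) = ||x - c||^2 whose unconstrained minimizer lies outside S
c = np.array([2.0, 2.0])
x = np.zeros(2)
eps = 0.1                                  # small constant step size

for _ in range(200):
    grad = 2 * (x - c)                     # gradient descent step on f ...
    x = project_to_ball(x - eps * grad)    # ... then project the result back into S
print(x, np.linalg.norm(x))                # converges to c / ||c||, on the boundary of S
```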



Constrained Optimization

Karush–Kuhn–Tucker (KKT): a very general solution to constrained optimization.

• KKT multipliers: introduce new variables $\lambda_i$ and $\alpha_j$ for each constraint


• The generalized Lagrangian is then defined as
$L(x, \lambda, \alpha) = f(x) + \sum_i \lambda_i g^{(i)}(x) + \sum_j \alpha_j h^{(j)}(x)$   (8)



Constrained Optimization

• Solve a constrained minimization problem via unconstrained optimization of the
generalized Lagrangian
• The constrained minimization $\min_{x \in S} f(x)$ is equivalent to

$\min_x \max_\lambda \max_{\alpha,\, \alpha \ge 0} L(x, \lambda, \alpha)$   (9)



Constrained Optimization

• This follows because any time the constraints are satisfied,

$\max_\lambda \max_{\alpha,\, \alpha \ge 0} L(x, \lambda, \alpha) = f(x)$   (10)

while any time a constraint is violated

$\max_\lambda \max_{\alpha,\, \alpha \ge 0} L(x, \lambda, \alpha) = \infty$   (11)

• No infeasible point will ever be optimal


• The optimum within the feasible points is unchanged
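
A small worked example (not from the slides) of Eqs. (8)–(11): minimize $f(x) = x^2$ subject to the single inequality constraint $h(x) = 1 - x \le 0$ (i.e. $x \ge 1$).

$L(x, \alpha) = x^2 + \alpha(1 - x), \qquad \alpha \ge 0$

Stationarity: $\partial_x L = 2x - \alpha = 0 \Rightarrow x = \alpha/2$. Complementary slackness: $\alpha(1 - x) = 0$. With $\alpha = 0$ we get $x = 0$, which is infeasible; with $\alpha > 0$ we get $x = 1$ and hence $\alpha = 2 \ge 0$, so the constrained minimum is $x^* = 1$, as recovered by $\min_x \max_{\alpha,\, \alpha \ge 0} L(x, \alpha)$.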



Dataset Augmentation
Dataset Augmentation

• Making an ML model generalize better: train on more data!


• In practice: the amount of data is limited!
• Solution: create fake data for training



Dataset Augmentation

• Dataset augmentation: a particularly effective technique for object recognition


• Images: high-dimensional, with an enormous variety of factors of variation
• E.g.: rotating, scaling, affine transformations, etc. (sketched after this list)
• Dataset augmentation is effective for speech recognition tasks as well
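
A minimal NumPy sketch of label-preserving input transformations for image data (random horizontal flip, small circular shift as a crude translation, additive input noise); all parameters below are illustrative assumptions, not from the slides.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of a (H, W) grayscale image."""
    if rng.random() < 0.5:                                       # random horizontal flip
        image = image[:, ::-1]
    dy, dx = rng.integers(-2, 3, size=2)                         # small random translation
    image = np.roll(image, (dy, dx), axis=(0, 1))                # circular shift as a crude stand-in
    image = image + rng.normal(scale=0.05, size=image.shape)     # inject noise at the input
    return image

rng = np.random.default_rng(0)
batch = rng.random((8, 28, 28))                  # stand-in for real images
augmented = np.stack([augment(img, rng) for img in batch])
print(augmented.shape)                           # (8, 28, 28): new inputs, same labels
```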



Dataset Augmentation

• Injecting noise
• NNs prove not to be very robust to noise
• Unsupervised learning: denoising autoencoder
• Noise in hidden units: dataset augmentation at multiple levels of abstraction
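
A sketch of injecting noise into the hidden units of a one-hidden-layer network during training, illustrating augmentation at a higher level of abstraction; the layer sizes and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.normal(size=(784, 128)), np.zeros(128)
W2, b2 = 0.01 * rng.normal(size=(128, 10)), np.zeros(10)

def forward(x, train=True, noise_scale=0.1):
    h = np.maximum(0.0, x @ W1 + b1)             # hidden representation
    if train:                                    # perturb hidden units only during training
        h = h + rng.normal(scale=noise_scale, size=h.shape)
    return h @ W2 + b2                           # output logits

x = rng.random((32, 784))
print(forward(x, train=True).shape)              # (32, 10)
```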





Other Regularization Approaches
Semi-Supervised Learning

• Semi-supervised learning: usually refers to learning a representation h = f (x)


• Learn a representation so that examples from the same class have similar representations
• Provide useful cues for how to group examples in representation space
• A linear classifier in the new space may achieve better generalization in many cases
• Principal components analysis (PCA): a pre-processing step before applying a classifier
(on the projected data)
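
A minimal scikit-learn sketch of the PCA-then-linear-classifier idea mentioned above; the digits dataset and the choice of 20 components are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project onto the leading principal components, then fit a linear classifier
clf = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```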



Semi-Supervised Learning

• Construct models in which a generative model of either P(x) or P(x, y) shares
parameters with a discriminative model of P(y|x)
• Trade off the supervised criterion − log P(y|x) with the unsupervised or generative one
(such as − log P(x) or − log P(x, y))
• The generative criterion then expresses a particular form of prior belief about the solution to
the supervised learning problem
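
A schematic sketch (illustrative, not from the slides) of combining the two criteria with shared parameters: a shared encoder feeds both a classifier head (supervised term) and a decoder head whose reconstruction error stands in for the generative term, and the weighted sum is minimized.

```python
import numpy as np

def combined_loss(x, y_onehot, enc_W, clf_W, dec_W, lam=0.5):
    """Weighted sum of a supervised and an unsupervised criterion with a shared encoder."""
    h = np.tanh(x @ enc_W)                         # shared representation h = f(x)
    logits = h @ clf_W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    supervised = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))   # -log P(y|x)
    recon = h @ dec_W
    unsupervised = np.mean((recon - x) ** 2)       # reconstruction error as a stand-in for -log P(x)
    return supervised + lam * unsupervised

rng = np.random.default_rng(0)
x = rng.random((16, 20))
y = np.eye(3)[rng.integers(0, 3, size=16)]
enc_W, clf_W, dec_W = rng.normal(size=(20, 8)), rng.normal(size=(8, 3)), rng.normal(size=(8, 20))
print(combined_loss(x, y, enc_W, clf_W, dec_W))
```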



Multitask Learning



Early Stopping



Parameter Tying and Parameter Sharing

• Parameter sharing: force sets of parameters to be equal


• Interpret the various models or model components as sharing a unique set of parameters.
• Only a subset of the parameters (the unique set) needs to be stored
• CNNs: the classic example, where the same kernel parameters are reused at every spatial location (see the sketch below)
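
A sketch of the convolutional case in pure NumPy: one small kernel is applied at every position, so only 3 parameters are stored no matter how long the input is; purely illustrative.

```python
import numpy as np

def conv1d(x, kernel):
    """1-D 'convolution' (cross-correlation, as in DL frameworks): the same
    kernel parameters are shared across every input position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.arange(10, dtype=float)
kernel = np.array([1.0, 0.0, -1.0])   # 3 shared parameters for the whole layer
print(conv1d(x, kernel))              # 8 outputs, all produced by the same 3 weights
```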



Dropout
Bagging

• Bagging (bootstrap aggregating): a technique to reduce generalization error by
combining several models (Breiman, 1994)
• General strategy in ML: model averaging → ensemble methods (a minimal sketch follows)
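
A minimal bagging sketch: draw k bootstrap resamples of the training set, fit one model on each, and average their predicted distributions. Scikit-learn decision trees and this particular dataset are used purely as convenient stand-ins, not taken from the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
models = []
for _ in range(10):                                          # k bootstrap replicas
    idx = rng.integers(0, len(X_train), size=len(X_train))   # sample with replacement
    models.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Ensemble prediction: average the per-model probabilities, then take the argmax
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
print((avg_proba.argmax(axis=1) == y_test).mean())           # ensemble accuracy
```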





Dropout

• Dropout: a computationally inexpensive but powerful method of regularizing a
broad family of models
• A method of making bagging practical for ensembles of very large neural networks.



Dropout

• Bagging itself is practical only for ensembles of roughly five to ten neural networks
• Dropout trains the ensemble consisting of all sub-networks that can be formed by
removing non-output units from an underlying base network





Dropout

• Bagging
• The models are independent
• Each model is trained to convergence on its respective training set
• Dropout
• Models share parameters
• Most models are not explicitly trained at all
• It is infeasible to sample all possible sub-networks within the lifetime of the universe
• A tiny fraction of possible sub-networks are each trained for a single step; parameter
sharing causes the remaining sub-networks to arrive at good settings of the parameters
• Dropout can represent an exponential number of models with a tractable amount of
memory



Dropout

• Assume that the model’s role is to output a probability distribution


• Bagging:
• Each model i produces a probability distribution $p^{(i)}(y|x)$
• The prediction of the ensemble is given by the arithmetic mean of all of these distributions

$\frac{1}{k} \sum_{i=1}^{k} p^{(i)}(y|x)$   (12)

• Dropout:
• Each sub-model defined by mask vector µ defines a probability distribution p(y|x, µ)
• The arithmetic mean over all masks is given by
$\sum_\mu p(\mu)\, p(y|x, \mu)$   (13)

where p(µ) is the probability distribution that was used to sample µ at training time
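
A NumPy sketch of dropout for one hidden layer: at training time a binary mask µ is sampled and multiplied into the hidden state, and at test time the weight-scaling rule (multiplying activations by the keep probability) approximates the ensemble mean of Eq. (13). The keep probability and layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = 0.01 * rng.normal(size=(784, 128)), np.zeros(128)
keep_prob = 0.8                                   # probability a hidden unit is retained

def hidden(x, train):
    h = np.maximum(0.0, x @ W + b)
    if train:
        mu = rng.random(h.shape) < keep_prob      # sample binary mask mu ~ p(mu)
        return h * mu                             # one multiply per unit: O(n) extra work
    return h * keep_prob                          # weight-scaling approximation to Eq. (13)

x = rng.random((32, 784))
print(hidden(x, train=True).shape, hidden(x, train=False).shape)
```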
Dropout

• Very computationally cheap


• Using dropout during training requires only O(n) computation per example per update, to
generate n random binary numbers and multiply them by the state
• Dropout does not significantly limit the type of model or training procedure that can be
used



Dropout

• The cost of using dropout in a complete system can be significant


• To offset the reduced effective capacity, we typically must increase the size of the model
• Typically the optimal validation set error is much lower when using dropout
• The cost: a much larger model and many more iterations of the training algorithm



Dropout

• For very large datasets


• Regularization confers little reduction in generalization error
• The computational cost may outweigh the benefit of regularization
• For datasets with very few labeled examples
• Dropout is less effective
• When additional unlabeled data is available, unsupervised feature learning can gain an
advantage over dropout

