Lecture 05 - Regularization - 4p
“Thinking is the hardest work there is, which is probably the reason why so few engage in it.”
-Henry Ford
CSE555 Deep Learning
Regularization
Spring 2025
(These slides are a summarized version of the book by Goodfellow et al.)
© 2016-2025 Yakup Genç & Y. Sinan Akgul

What is Regularization
[Figure: machine learning as a search through a hypothesis/program space for a selected hypothesis, guided by a goodness/performance measure. Source: WikiPedia, 11/15/2016]
• Finding the exact optimum is often not desired (we minimize a surrogate loss)
• The objective is non-convex, which makes optimization difficult
Regularization
• Controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters
• Instead, we might find (and indeed in practical deep learning scenarios, we almost always do find) that the best fitting model, in the sense of minimizing generalization error, is a large model that has been regularized appropriately

Regularization
• Parameter Norm Penalties
  • L2 Parameter Regularization (Tikhonov or Ridge)
  • L1 Regularization (Lasso)
• Norm Penalties as Constrained Optimization
• Regularization and Under-Constrained Problems
• Dataset Augmentation
• Noise Robustness
  • Injecting Noise at the Output Targets
$\tilde{J}(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) = J(\boldsymbol{\theta}; \boldsymbol{X}, \boldsymbol{y}) + \alpha\,\Omega(\boldsymbol{\theta})$
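A minimal sketch (my own, not from the slides) of how the regularized objective above can be evaluated, assuming a linear model with squared-error loss and an L2 penalty; the names (regularized_objective, theta, alpha) and the toy data are illustrative.

```python
import numpy as np

def regularized_objective(theta, X, y, alpha):
    """Sketch of J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)
    for a linear model with squared-error loss and an L2 penalty."""
    J = 0.5 * np.mean((X @ theta - y) ** 2)   # data-fit term J(theta; X, y)
    omega = 0.5 * np.sum(theta ** 2)          # penalty Omega(theta) = (1/2)||theta||_2^2
    return J + alpha * omega                  # J~ = J + alpha * Omega

# Toy usage with random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
theta = rng.normal(size=5)
print(regularized_objective(theta, X, y, alpha=0.1))
```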
• Sum of the squared weights: $r(w, b) = \sum_j w_j^2$
• Squared weights penalize large values more
• Sum of the weights will penalize small values more
Adapted from: David Kauchak

• Sum of the squared weights: $r(w, b) = \sum_j w_j^2$
• p-norm: $r(w, b) = \left(\sum_j |w_j|^p\right)^{1/p} = \|w\|_p$
• Smaller values of p (p < 2) encourage sparser vectors
• Larger values of p discourage large weights more
Adapted from: David Kauchak
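The short sketch below (my own illustration, not part of the original slides) evaluates the p-norm regularizer on a sparse and a dense weight vector with the same L2 norm, to make the two claims above concrete: small p favors sparse vectors, while large p punishes large individual weights more.

```python
import numpy as np

def p_norm_penalty(w, p):
    """r(w) = (sum_j |w_j|^p)^(1/p) = ||w||_p"""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

sparse = np.array([2.0, 0.0, 0.0, 0.0])  # one large weight
dense  = np.array([1.0, 1.0, 1.0, 1.0])  # same L2 norm as `sparse`

for p in (1, 2, 4):
    print(f"p={p}: sparse={p_norm_penalty(sparse, p):.3f}  dense={p_norm_penalty(dense, p):.3f}")
# p=1 charges the dense vector more, nudging solutions toward sparsity;
# p=4 charges the vector that concentrates its mass in one large weight more.
```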
• p < 2 tends to create sparse solutions (i.e., lots of zero weights)
• If we can ensure that the loss + regularizer is convex, then we could still use gradient descent
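As a rough sketch of the "convex loss + regularizer, optimized by gradient descent" idea, here is a plain gradient-descent loop for squared-error loss with an L2 regularizer (ridge regression); the function name, learning rate, and toy data are illustrative assumptions, not from the slides.

```python
import numpy as np

def ridge_gd_step(w, X, y, lam, lr=0.1):
    """One gradient-descent step on the convex objective
    (1/n) * sum_i (x_i . w - y_i)^2 + lam * ||w||_2^2."""
    n = len(y)
    grad_loss = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the squared-error term
    grad_reg = 2.0 * lam * w                   # gradient of the L2 regularizer
    return w - lr * (grad_loss + grad_reg)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -1.0, 0.0]) + 0.1 * rng.normal(size=200)
w = np.zeros(3)
for _ in range(500):
    w = ridge_gd_step(w, X, y, lam=0.1)
print(w)  # weights are shrunk toward zero relative to the unregularized fit
```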
Newly generated data should have a distribution closely similar to that of the original data!
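A minimal sketch of dataset augmentation under that constraint, assuming an image-classification setting; the specific transforms (flips, small shifts, low-variance noise) are illustrative choices, not prescribed by the slides, and are kept small precisely so the augmented distribution stays close to the original one.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly perturbed copy of `image` (an H x W array).
    The transforms are small and label-preserving, so the augmented
    examples stay close to the original data distribution."""
    out = image.copy()
    if rng.random() < 0.5:                                 # random horizontal flip
        out = out[:, ::-1]
    out = np.roll(out, int(rng.integers(-2, 3)), axis=1)   # small horizontal shift
    out = out + rng.normal(0.0, 0.01, size=out.shape)      # low-variance pixel noise
    return out

rng = np.random.default_rng(0)
image = rng.random((28, 28))                               # stand-in for a training image
batch = np.stack([augment(image, rng) for _ in range(8)])  # 8 new examples, same label
```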
Parameter Tying and Parameter Sharing
• Regularization when we need other ways to express our prior knowledge about suitable values of the model parameters
• Sometimes we might not know precisely what values the parameters should take, but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters
• A common type of dependency that we often want to express is that certain parameters should be close to one another

Parameter Tying and Parameter Sharing
• Parameter norm penalty – example (see the sketch below)
• The more popular way is to use constraints
  • to force sets of parameters to be equal
  • often referred to as parameter sharing
  • we interpret the various models or model components as sharing a unique set of parameters
  • a significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) need to be stored in memory
  • In certain models, such as the convolutional neural network, this can lead to a significant reduction in the memory footprint of the model
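As an illustrative sketch (not from the slides), the snippet below contrasts a parameter-tying penalty, which encourages two parameter sets to stay close, with parameter sharing, where a single array is simply reused; the names tying_penalty, w_a, w_b are made up for this example.

```python
import numpy as np

def tying_penalty(w_a, w_b):
    """Parameter tying: Omega(w_A, w_B) = ||w_A - w_B||_2^2.
    Added to the loss with some coefficient, it pulls the two parameter
    sets toward each other without forcing them to be exactly equal."""
    return np.sum((w_a - w_b) ** 2)

# Parameter sharing: the *same* array is reused in several places, so only one
# copy needs to be stored and updated (as with a convolutional kernel applied
# at every spatial position).
rng = np.random.default_rng(0)
shared_w = rng.normal(size=(4, 4))
component_one_weights = shared_w
component_two_weights = shared_w  # same object, not a copy

w_a, w_b = rng.normal(size=8), rng.normal(size=8)
print(tying_penalty(w_a, w_b))
```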
Dropout
• Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models
• Can be thought of as a method of making bagging practical for ensembles of very many large neural networks
• Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks

Dropout
• Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network
Dropout
• In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero
• This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value

Dropout
• To train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent
• Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently from all of the others
• The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins
  • It is not a function of the current value of the model parameters or the input example
  • Typically, an input unit is included with probability 0.8 and a hidden unit is included with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual (see the sketch below)
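A rough sketch of the masking procedure described above, assuming a tiny two-layer fully connected network; the layer sizes, names, and use of NumPy are illustrative assumptions, with input units kept with probability 0.8 and hidden units with probability 0.5 as in the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W1, W2, train=True, p_input=0.8, p_hidden=0.5):
    """Forward pass of a small 2-layer network with dropout masks.
    Each mask entry is sampled independently; multiplying a unit's
    output by zero effectively removes it from the network."""
    if train:
        x = x * rng.binomial(1, p_input, size=x.shape)     # drop input units with prob 0.2
    h = np.maximum(0.0, W1 @ x)                            # hidden layer (ReLU)
    if train:
        h = h * rng.binomial(1, p_hidden, size=h.shape)    # drop hidden units with prob 0.5
    return W2 @ h                                          # output units are never dropped

x = rng.normal(size=10)
W1 = 0.1 * rng.normal(size=(32, 10))
W2 = 0.1 * rng.normal(size=(3, 32))
y = dropout_forward(x, W1, W2, train=True)  # each call samples a fresh sub-network
```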
Dropout
• Dropout training is not quite the same as bagging training
  • In the case of bagging, the models are all independent
  • In the case of dropout, the models share parameters, with each model inheriting a different subset of parameters from the parent neural network
  • This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory
  • In the case of bagging, each model is trained to convergence on its respective training set
  • In the case of dropout, typically most models are not explicitly trained at all; usually, the model is large enough that it would be infeasible to sample all possible sub-networks within the lifetime of the universe
  • Instead, a tiny fraction of the possible sub-networks are each trained for a single step, and the parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters
• These are the only differences
  • Beyond these, dropout follows the bagging algorithm
  • For example, the training set encountered by each sub-network is indeed a subset of the original training set sampled with replacement

Dropout
• To make a prediction, a bagged ensemble must accumulate votes from all of its members
• We refer to this process as inference in this context
Dropout
• Srivastava (2014) showed that dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization
• Dropout may also be combined with other forms of regularization to yield a further improvement
• One advantage of dropout is that it is very computationally cheap
  • Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state
  • Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage
  • Running inference in the trained model has the same cost per example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples (see the sketch below)

Dropout
• One significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used
  • It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent
• This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a)
• Many other regularization strategies of comparable power impose more severe restrictions on the architecture of the model
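A minimal sketch of the weight-scaling step mentioned above (dividing the weights by 2 before inference), assuming hidden units were kept with probability 0.5 during training; the names here are illustrative, not from the slides.

```python
import numpy as np

def weight_scaling_inference(W_out, p_hidden=0.5):
    """Weight-scaling inference rule: multiply the weights leaving each hidden
    unit by its inclusion probability (0.5, i.e. 'divide the weights by 2'),
    done once; forward passes are then run with no dropout masks."""
    return W_out * p_hidden

rng = np.random.default_rng(0)
W2 = 0.1 * rng.normal(size=(3, 32))     # hidden-to-output weights from training
W2_test = weight_scaling_inference(W2)  # scaled once before running inference
```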