
Lecture 05 - Regularization - 4p

The document discusses regularization in deep learning, explaining its role in preventing overfitting by introducing additional information to solve ill-posed problems. It highlights the bias-variance tradeoff and various regularization techniques such as L1 and L2 penalties, dataset augmentation, and early stopping. The content is summarized from the book by Goodfellow et al. and focuses on practical applications of regularization in complex domains like images and text.



“Thinking is the hardest work there is, which is probably the reason why so few engage in it.”

- Henry Ford

CSE555 Deep Learning
Regularization
(these slides are a summarized version of the book by Goodfellow et al.)
Spring 2025
© 2016-2025 Yakup Genç & Y. Sinan Akgul

What is Regularization?


Supervised Machine Learning

Select the hypothesis that maximizes a goodness/performance measure over the observed data:

$\hat{f} = \arg\max_{f \in \mathcal{H}} \sum_{i=1}^{N} g(y_i, f(x_i))$

where $N$ is the number of experiences, $y_i$ the observed outcome, $x_i$ the observed input, $g$ the goodness/performance measure, $\mathcal{H}$ the hypothesis/program space, and $\hat{f}$ the selected hypothesis.

Regularization

• Regularization, in mathematics and statistics and particularly in the fields of machine learning and inverse problems, refers to a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting:

$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$

[Wikipedia, 11/15/2016]
Notes: the training objective is usually a surrogate loss (we do not actually want its exact optimum), and it is non-convex, which makes optimization difficult.

Bias and Variance Tradeoff (Simple Problem)

(figure slide)

Bias and Variance

(figure slide)


Regularization
(these slides are a summarized version of the draft book by Bengio et al.)

Regularization in the Context of DL

• Most regularization strategies are based on regularizing estimators
• Regularization of an estimator works by trading increased bias for reduced variance
• An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias


Bias and Variance

• Three situations, where the model family being trained either
  • excluded the true data generating process, corresponding to underfitting and inducing bias, or
  • matched the true data generating process, or
  • included the generating process but also many other possible generating processes: the overfitting regime, where variance rather than bias dominates the estimation error
• The goal of regularization is to take a model from the third regime into the second regime

Regularization

• In practice, an overly complex model family does not necessarily include the target function or the true data generating process, or even a close approximation of either
• We almost never have access to the true data generating process, so we can never know for sure whether the model family being estimated includes the generating process or not
• However, most applications of deep learning algorithms are to domains where the true data generating process is almost certainly outside the model family
• Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe
• To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family)


Regularization

• Controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters
• Instead, we might find (and in practical deep learning scenarios, we almost always do find) that the best fitting model, in the sense of minimizing generalization error, is a large model that has been regularized appropriately

Regularization

• Parameter Norm Penalties
  • L2 Parameter Regularization (Tikhonov or Ridge)
  • L1 Regularization (Lasso)
• Norm Penalties as Constrained Optimization
• Regularization and Under-Constrained Problems
• Dataset Augmentation
• Noise Robustness
  • Injecting Noise at the Output Targets
• Semi-Supervised Learning
• Multi-Task Learning
• Early Stopping
• Parameter Tying and Parameter Sharing
• Sparse Representations

Parameter Norm Penalties

• Many regularization approaches are based on limiting the capacity of models by adding a parameter norm penalty to the objective function:

$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$

• where $\alpha \in [0, \infty)$ is a hyperparameter that weights the relative contribution of the norm penalty term
• When the training algorithm minimizes the regularized objective function, it will decrease both
  • the original objective J on the training data, and
  • some measure of the size of the parameters θ (or some subset of the parameters)

Parameter Norm Penalty for NN

• For neural networks, the typical parameter norm penalty penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized:

$z = w_1 x_1 + w_2 x_2 + b$

• The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact; fitting a weight well requires observing both variables in a variety of conditions.
• Each bias controls only a single variable, so we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting.
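As a minimal sketch (the function name and signature are illustrative, not from the lecture), the penalized objective can be computed by summing the penalty over weight matrices only, deliberately leaving biases out:

```python
import numpy as np

def l2_penalized_loss(loss, weights, biases, alpha):
    """Sketch of the regularized objective J~ = J + alpha * Omega(w).

    Only the weight matrices enter the penalty; the biases are left
    unregularized, as the slide recommends."""
    omega = 0.5 * sum(np.sum(W ** 2) for W in weights)  # Omega(w) = 1/2 ||w||^2
    return loss + alpha * omega                          # biases excluded on purpose
```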


Parameter Norm Penalty for NN

• For neural networks, it is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network
• Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the search space

L2 Parameter Regularization

• Commonly known as weight decay (aka ridge regression or Tikhonov regularization)
• This strategy drives the weights closer to the origin by adding the regularization term

$\Omega(\theta) = \frac{1}{2}\|w\|_2^2$

to the objective function


L2 Parameter Regularization

• Assume no bias term…

$\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^\top w$

• Parameter gradient

$\nabla_w \tilde{J}(w; X, y) = \nabla_w J(w; X, y) + \alpha w$

• A single gradient step (with learning rate ε) to update the weights:

$w \leftarrow w - \epsilon\,(\nabla_w J(w; X, y) + \alpha w)$

or

$w \leftarrow (1 - \epsilon\alpha)\,w - \epsilon\,\nabla_w J(w; X, y)$

L2 Parameter Regularization

• Quadratic approximation to the objective function around its minimum:

$\hat{J}(\theta) = J(w^*) + \frac{1}{2}(w - w^*)^\top H (w - w^*)$

where

$w^* = \arg\min_w J(w)$

and $H$ is the Hessian matrix of $J$ with respect to $w$ evaluated at $w^*$

• The minimum occurs when the gradient equals 0:

$\nabla_w \hat{J}(w) = H(w - w^*)$

L2 Parameter Regularization

• Adding the weight decay gradient $\alpha w$ and solving for the regularized minimum $\tilde{w}$ gives

$\alpha\tilde{w} + H(\tilde{w} - w^*) = 0 \;\Rightarrow\; \tilde{w} = (H + \alpha I)^{-1} H\, w^*$

• As α approaches 0, the regularized solution $\tilde{w}$ approaches $w^*$
• What happens as α grows? Writing $H = Q\Lambda Q^\top$, the component of $w^*$ along the $i$-th eigenvector of $H$ is rescaled by $\lambda_i / (\lambda_i + \alpha)$: directions in which the objective is strongly curved are affected little, while directions with small eigenvalues are shrunk toward zero


L2 Parameter Regularization

(figure slides: contour plot illustrating the effect of the penalty on the minimizer of $J(\theta; X, y) + \alpha\Omega(\theta)$)


L2 Parameter Regularization

• For linear regression, these assumptions hold. The cost function is

$(Xw - y)^\top (Xw - y)$

• When the L2 penalty is added, it becomes

$(Xw - y)^\top (Xw - y) + \frac{1}{2}\alpha\, w^\top w$

• Solution

$w = (X^\top X + \alpha I)^{-1} X^\top y$
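A small NumPy sketch of this closed-form ridge solution (the data is synthetic, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

alpha = 0.5
d = X.shape[1]
# w = (X^T X + alpha I)^(-1) X^T y, solved without forming the inverse explicitly
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```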


L2 Parameter Regularization

(figure slide)

L1 Parameter Regularization

• While L2 is the most common form of weight decay, L1 can also be used…

$\Omega(\theta) = \|w\|_1 = \sum_i |w_i|$

• For linear regression with no bias parameter, a similar analysis applies…


L1 Parameter Regularization

• The differences from L2 are clear…
• The regularization contribution to the gradient no longer scales linearly with each $w_i$; instead it is a constant factor with a sign equal to $\mathrm{sign}(w_i)$
• One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J as we did for L2 regularization

L1 Parameter Regularization

• Our simple linear model has a quadratic cost function that we can represent via its Taylor series. Alternately, we could imagine that this is a truncated Taylor series approximating the cost function of a more sophisticated model. The gradient in this setting is given by

$\nabla_w \hat{J}(w) = H(w - w^*)$

• To simplify the analysis further, assume H is diagonal


L1 Parameter Regularization

• This has an analytic solution (per component, for diagonal $H$):

$w_i = \mathrm{sign}(w_i^*)\,\max\Big\{|w_i^*| - \frac{\alpha}{H_{i,i}},\, 0\Big\}$

• The penalty can push a component all the way to zero: L1 regularization produces sparse solutions
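A sketch of this soft-thresholding solution in NumPy (the $w^*$ and $H$ values below are made-up illustrations):

```python
import numpy as np

def l1_solution(w_star, H_diag, alpha):
    """Per-component minimizer of a quadratic loss (diagonal Hessian)
    plus an L1 penalty: soft thresholding toward zero."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.8, -0.05, 2.0])
H_diag = np.array([1.0, 1.0, 4.0])
print(l1_solution(w_star, H_diag, alpha=0.1))  # small components snap to exactly 0
```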


Parameter Regularization

sum of the weights: $r(w, b) = \sum_j |w_j|$

sum of the squared weights: $r(w, b) = \sum_j w_j^2$

• Squared weights penalize large values more
• The sum of the (absolute) weights penalizes small values more

Adapted from: David Kauchak

Parameter Regularization

p-norm: $r(w, b) = \sqrt[p]{\sum_j |w_j|^p} = \|w\|_p$

• Smaller values of p (p < 2) encourage sparser vectors
• Larger values of p discourage large weights more

Adapted from: David Kauchak

Parameter Regularization

• All p-norms penalize larger weights
• p < 2 tends to create sparse weight vectors (i.e., lots of 0 weights)
• p > 2 tends to prefer similar weights
• p-norms are convex for p >= 1

Adapted from: David Kauchak

Parameter Regularization

Solve convex minimization problems using gradient descent:

$\arg\min_{w,b} \sum_{i=1}^{n} \mathrm{loss}(y, y')$

If we can ensure that the loss + regularizer is convex, then we can still use gradient descent:

$\arg\min_{w,b} \sum_{i=1}^{n} \mathrm{loss}(y, y') + \lambda\,\mathrm{regularizer}(w)$

This is convex as long as both the loss and the regularizer are convex.

Adapted from: David Kauchak
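A minimal sketch of minimizing loss + λ·regularizer by gradient descent, here with a squared loss and an L2 regularizer (data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0])

w, b = np.zeros(4), 0.0
lam, eps = 0.1, 0.01  # regularization strength and learning rate

for _ in range(500):
    resid = X @ w + b - y
    grad_w = X.T @ resid / len(y) + lam * 2 * w  # d/dw [loss + lam * ||w||^2]
    grad_b = resid.mean()                        # the bias is not regularized
    w -= eps * grad_w
    b -= eps * grad_b
```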
Norm Penalties as Constrained Optimization

• Consider the norm penalty added to the objective function:

$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\Omega(\theta)$

• We can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties
• Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain $\Omega(\theta)$ to be less than some constant k, we could construct the generalized Lagrange function

$L(\theta, \alpha; X, y) = J(\theta; X, y) + \alpha(\Omega(\theta) - k)$

Norm Penalties as Constrained Optimization

• The solution to this constrained problem is

$\theta^* = \arg\min_{\theta} \max_{\alpha,\, \alpha \ge 0} L(\theta, \alpha; X, y)$

• Solving this problem requires modifying both θ and α
  • Gradient descent
  • Analytical solutions for where the gradient is zero
• All solutions say
  • increase α whenever $\Omega(\theta) > k$
  • decrease α whenever $\Omega(\theta) < k$
• All positive α encourage $\Omega(\theta)$ to shrink. The optimal value $\alpha^*$ will encourage $\Omega(\theta)$ to shrink, but not so strongly as to make $\Omega(\theta)$ become less than k

Norm Penalties as Constrained Optimization

• Another reason to use explicit constraints and re-projection rather than enforcing constraints with penalties is that penalties can cause non-convex optimization procedures to get stuck in local minima corresponding to small θ
• When training neural networks, this usually manifests as networks that train with several "dead units." These are units that do not contribute much to the behavior of the function learned by the network because the weights going into or out of them are all very small. When training with a penalty on the norm of the weights, these configurations can be locally optimal, even if it is possible to significantly reduce J by making the weights larger
• Explicit constraints implemented by re-projection can work much better in these cases because they do not encourage the weights to approach the origin. Explicit constraints implemented by re-projection only have an effect when the weights become large and attempt to leave the constraint region


Norm Penalties as Constrained Optimization

• Finally, explicit constraints with reprojection can be useful because they impose some stability on the optimization procedure
• When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients, which then induce a large update to the weights
• If these updates consistently increase the size of the weights, then θ rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection allow us to terminate this feedback loop after the weights have reached a certain magnitude
• Hinton et al. (2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of parameter space while maintaining some stability (see the sketch below)

Regularization and Under-Constrained Problems

• Regularization requires well-defined problems
• In many cases, including linear regression and PCA, inversion of $X^\top X$ cannot be done due to singularity
• This matrix can be singular whenever the data truly has no variance in some direction, or when there are fewer examples (rows of X) than input features (columns of X)
• Regularizing these problems requires inverting

$X^\top X + \alpha I$

• These linear problems have closed form solutions when the relevant matrix is invertible
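As an illustrative sketch (not from the slides), re-projection can be implemented by clipping a weight vector back onto a norm ball after each gradient step; the radius k is a hypothetical hyperparameter:

```python
import numpy as np

def project_to_norm_ball(W, k):
    """If ||W|| exceeds k, rescale W back onto the constraint region;
    inside the ball the weights are untouched (no pull toward the origin)."""
    norm = np.linalg.norm(W)
    return W if norm <= k else W * (k / norm)

# usage inside a training loop (W, eps, and grad assumed defined):
# W = project_to_norm_ball(W - eps * grad, k=3.0)
```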


Regularization and Under-Constrained Problems

• It is also possible for a problem with no closed form solution to be underdetermined
  • E.g., logistic regression applied to a problem where the classes are linearly separable: if a weight vector w is able to achieve perfect classification, then 2w will also achieve perfect classification and higher likelihood
  • An iterative optimization procedure like stochastic gradient descent will continually increase the magnitude of w and, in theory, will never halt
  • In practice, a numerical implementation of gradient descent will eventually reach sufficiently large weights to cause numerical overflow, at which point its behavior will depend on how the programmer has decided to handle values that are not real numbers

Regularization and Under-Constrained Problems

• Most forms of regularization are able to guarantee the convergence of iterative methods applied to underdetermined problems
• For example, weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient


Dataset Augmentation

• The best way to make a machine learning model generalize better is to train it on more data
• Of course, in practice, the amount of data we have is limited
• One way to get around this problem is to create fake data and add it to the training set
• For some machine learning tasks it is reasonably straightforward to create new fake data, e.g., classification
• For others it is not, e.g., density estimation problems

Dataset Augmentation

• Dataset augmentation is effective for classification problems, e.g., object recognition
  • Images are very high dimensional and include an enormous variety of factors of variation
  • Translate, rotate, scale images… (but beware: a 6 becomes a 9 when rotated 180°)
• Transformations that change the correct class should be avoided
• Injecting noise into a NN can be seen as dataset augmentation
  • Input and hidden nodes: apply noise
  • Careful tuning of the noise magnitude is key (Poole et al. 2014)
  • Dropout, a powerful regularization method, constructs new inputs by multiplying by noise

Note: newly generated data should have a distribution closely similar to that of the original data!
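A tiny NumPy sketch of label-preserving augmentation for images (the image shape and the particular transformations are illustrative assumptions):

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and translated copy of an (H, W) image.
    These are label-preserving for many object classes, but not for
    digits like 6 vs. 9!"""
    if rng.random() < 0.5:
        image = image[:, ::-1]       # horizontal flip
    shift = rng.integers(-2, 3)      # small horizontal translation
    return np.roll(image, shift, axis=1)

rng = np.random.default_rng(0)
fake_batch = [augment(np.zeros((28, 28)), rng) for _ in range(8)]
```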

Dataset Augmentation

• Be careful when you evaluate algorithms using data augmentation
• Especially when using a hand-designed dataset, feed the same data to both algorithms for comparison

Noise

• For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b)
• In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units
  • e.g., in the context of recurrent neural networks


Noise

• For FFNNs, this form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output
• In other words, it pushes the model into regions where it is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions

Semi-Supervised Learning

• You have unlabeled examples from P(x) and labeled examples from P(x,y) to estimate P(y|x) (or predict y from x)
• One solution
  • Separate unsupervised and supervised components in the model


Semi-Supervised Learning

• Another solution
  • Construct models in which a generative model of either P(x) or P(x,y) shares parameters with a discriminative model of P(y|x)
  • Trade off the supervised criterion −log P(y|x) with the unsupervised or generative one (such as −log P(x) or −log P(x,y))
  • The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem, namely that the structure of P(x) is connected to the structure of P(y|x) in a way that is captured by the shared parametrization
  • By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative or a purely discriminative training criterion

Generative vs. Discriminative Models

• Generative Approach: separately model the class-conditional densities and priors, then evaluate posterior probabilities using Bayes' theorem
• Discriminative Approach: directly model the posterior probabilities

(C. Bishop 04)

Generative vs. Discriminative Models

Generative Approach
• Assume some functional form for P(y) and P(x|y)
• Estimate the parameters of P(x|y) and P(y) directly from training data
• Use Bayes rule to calculate P(y|x)

Discriminative Approach
• Assume some functional form for P(y|x)
• Estimate the parameters of P(y|x) directly from training data

(T. Mitchell 03)

Analogy

• The task is to determine the language that someone is speaking
• Generative approach:
  • learn each language and determine which language the speech belongs to
• Discriminative approach:
  • determine the linguistic differences without learning any language: a much easier task!
  • e.g., when comparing German and Turkish, you can immediately tell it is Turkish if you see letters such as ü or ç

(Sargur N. Srihari)



Graphical Model Relationship

(figure slide, from Sargur N. Srihari)

Multi-Task Learning

• Learning a problem together with other related problems at the same time, using a shared representation, often leads to a better model for the main task, since it allows the learner to use the commonality among the tasks (inductive transfer) [rephrased from Wikipedia]

Multi-Task Learning

• Multi-task learning is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks
• Similar to how additional training examples put more pressure on the parameters of the model towards values that generalize well: when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization

(figure: a network with shared lower layers Q1 and Q2 feeding task-specific heads Q3_A, Q3_B, Q3_C)

Note: in a single batch, Q3_A, Q3_B and Q3_C are each updated one time, whereas Q1 and Q2 are updated three times.

Multi-Task Learning

• The model can be divided into two kinds of parts with associated parameters:
  • Task-specific parameters (which only benefit from the examples of their task to achieve good generalization): upper layers in the figure
  • Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks): lower layers
• From the point of view of deep learning, the underlying prior belief is:
  • among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks

Early Stopping

• When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again

Early Stopping

• Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time
• This strategy is known as early stopping
• It is probably the most commonly used form of regularization in deep learning
• Its popularity is due both to its effectiveness and its simplicity

Early Stopping

• Early stopping is a very efficient hyperparameter selection algorithm
  • the number of training steps is just another hyperparameter
  • we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set
• Cost: periodic evaluations on the validation set
• Additional cost: the need to maintain a copy of the best parameters (generally negligible)
• Early stopping is a very unobtrusive form of regularization, requiring almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values
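A minimal sketch of the early-stopping loop (the train_step/val_error callables and the patience value are illustrative assumptions, not part of the lecture):

```python
import copy

def train_with_early_stopping(model, train_step, val_error, patience=10):
    """Stop when validation error has not improved for `patience` evaluations;
    keep a copy of the best parameters seen so far."""
    best_err, best_model, waited, steps = float("inf"), None, 0, 0
    while waited < patience:
        train_step(model)                       # one (or a few) gradient steps
        steps += 1
        err = val_error(model)                  # the periodic evaluation cost
        if err < best_err:
            best_err, waited = err, 0
            best_model = copy.deepcopy(model)   # the extra memory cost
        else:
            waited += 1
    return best_model, steps                    # steps is the tuned hyperparameter
```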


Early Stopping

• Early stopping requires a validation set, which means some training data is not fed to the model
• To best exploit this extra data
  • one can perform extra training after the initial training with early stopping has completed
  • in the second, extra training step, all of the training data is included
• There are two basic strategies one can use for this second training procedure:
  • initialize the model again and retrain on all of the data (for the same number of steps the early-stopping run determined)
  • keep the parameters obtained from the first round of training and then continue training, but now using all of the data
    • avoids the high cost of retraining from scratch, but is not as well-behaved

Early Stopping and L2 Norm

(figure slide: parameter trajectories illustrating how early stopping acts like L2 regularization)


Parameter Tying and Parameter Sharing

• Regularization for when we need other ways to express our prior knowledge about suitable values of the model parameters
• Sometimes we might not know precisely what values the parameters should take, but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters
• A common type of dependency that we often want to express is that certain parameters should be close to one another

Parameter Tying and Parameter Sharing

• A parameter norm penalty is one way (e.g., penalizing the distance between two parameter sets)
• The more popular way is to use constraints
  • to force sets of parameters to be equal
  • often referred to as parameter sharing
  • we interpret the various models or model components as sharing a unique set of parameters
  • a significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) needs to be stored in memory
  • in certain models, such as the convolutional neural network, this can lead to a significant reduction in the memory footprint of the model

Note: parameter sharing is used extensively in convolutional neural networks.


Sparse Representations

• Instead of weight decay, which acts by placing a penalty directly on the model parameters, we can place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse
• This indirectly imposes a complicated penalty on the model parameters
• Recall that L1 induces sparse parametrization: many parameters become zero
• Representational sparsity, on the other hand, describes a representation where many of the elements of the representation are zero (or close to zero)

(figure slide: contrasting a sparsely parametrized linear regression model with a linear regression whose representation is sparse)

Sparse Representations

• Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization
• Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation, denoted Ω(h)
• Other approaches obtain representational sparsity with a hard constraint on the activation values
• Essentially any model that has hidden units can be made sparse

Bagging and Other Ensemble Methods

• Bagging (bootstrap aggregating) is a technique for reducing generalization error by combining several models
• Model averaging, ensemble methods, …

Note: dropout (below) is applied in each iteration (mini-batch).


Why Ensemble Works?

• Intuition
  • combining diverse, independent opinions in human decision-making acts as a protective mechanism (e.g., a stock portfolio)
  • uncorrelated errors are reduced by averaging
• Suppose we have 5 completely independent classifiers for majority voting
  • If the accuracy is 70% for each, the majority vote is correct when at least 3 of the 5 are correct:
    $\binom{5}{3}(0.7)^3(0.3)^2 + \binom{5}{4}(0.7)^4(0.3) + (0.7)^5 \approx 83.7\%$ majority vote accuracy
  • With 101 such classifiers: 99.9% majority vote accuracy

Bagging and Other Ensemble Methods

• Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all of the models are trained on the same dataset
  • Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of NNs are often enough to cause different members of the ensemble to make partially independent errors
• Model averaging is an extremely powerful and reliable method for reducing generalization error
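A quick sketch verifying these numbers with the binomial formula (standard library only):

```python
from math import comb

def majority_vote_accuracy(n, p):
    """Probability that more than half of n independent classifiers,
    each correct with probability p, are correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(majority_vote_accuracy(5, 0.7))    # ~0.837
print(majority_vote_accuracy(101, 0.7))  # ~0.999
```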

Dropout

• Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models
• Can be thought of as a method of making bagging practical for ensembles of very many large neural networks
• Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks

Dropout

• Dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network (figure)


Dropout

• In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero
• This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value

Dropout

• To train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent
• Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently from all of the others
• The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins
  • It is not a function of the current value of the model parameters or the input example
  • Typically, an input unit is included with probability 0.8 and a hidden unit is included with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual (see the sketch below)
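A minimal NumPy sketch of one dropout forward pass for a single-hidden-layer network (the layer sizes are illustrative; the keep probabilities follow the typical values above):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)                   # input vector
W1, b1 = rng.normal(size=(32, 16)), np.zeros(32)
W2, b2 = rng.normal(size=(1, 32)), np.zeros(1)

# sample a fresh binary mask for every example/minibatch
mu_in = rng.random(16) < 0.8              # input units kept with prob 0.8
mu_h  = rng.random(32) < 0.5              # hidden units kept with prob 0.5

h = np.maximum(0, W1 @ (x * mu_in) + b1)  # masked input, ReLU hidden layer
y = W2 @ (h * mu_h) + b2                  # masked hidden units
```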

Dropout Example

• A feedforward network with two input units, one hidden layer with two hidden units, and one output unit (figure)

Dropout Example

• To perform forward propagation with dropout, randomly sample a vector μ with one entry for each input or hidden unit in the network. The entries of μ are binary and are sampled independently from each other. Each unit in the network is multiplied by the corresponding mask, and then forward propagation continues through the rest of the network as usual. This is equivalent to randomly selecting one of the sub-networks in the figure


Dropout

• Dropout training is not quite the same as bagging training
  • In the case of bagging, the models are all independent
  • In the case of dropout, the models share parameters, with each model inheriting a different subset of parameters from the parent neural network
  • This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory
  • In the case of bagging, each model is trained to convergence on its respective training set
  • In the case of dropout, typically most models are not explicitly trained at all; usually, the model is large enough that it would be infeasible to sample all possible sub-networks within the lifetime of the universe
  • Instead, a tiny fraction of the possible sub-networks are each trained for a single step, and the parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters
• These are the only differences; beyond these, dropout follows the bagging algorithm
  • For example, the training set encountered by each sub-network is indeed a subset of the original training set sampled with replacement

Dropout

• To make a prediction, a bagged ensemble must accumulate votes from all of its members
• We refer to this process as inference in this context

Dropout

• Srivastava et al. (2014) showed that dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization
• Dropout may also be combined with other forms of regularization to yield a further improvement
• One advantage of dropout is that it is very computationally cheap
  • Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state
  • Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage
  • Running inference in the trained model has the same cost per example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples (see the sketch below)

Dropout

• One significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used
  • It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent
• This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a)
• Many other regularization strategies of comparable power impose more severe restrictions on the architecture of the model
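A sketch of this weight scaling at inference time, matching the earlier illustrative network (scaling each unit's outgoing weights by its keep probability; "dividing the weights by 2" corresponds to the 0.5 keep probability of hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)
W1, b1 = rng.normal(size=(32, 16)), np.zeros(32)
W2, b2 = rng.normal(size=(1, 32)), np.zeros(1)

# weight scaling inference rule: no sampled masks at test time; instead,
# multiply each unit's outgoing weights by its keep probability
h = np.maximum(0, (W1 * 0.8) @ x + b1)  # inputs were kept with prob 0.8
y = (W2 * 0.5) @ h + b2                 # hidden units were kept with prob 0.5
```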

Dropout – Ensemble Models with Shared Hidden Units

• Dropout is not just a means of performing efficient, approximate bagging
• There is another view of dropout that goes further than this
• Dropout trains not just a bagged ensemble of models, but an ensemble of models that share hidden units

Thanks for listening!
