Lecture Slides for Chapter 7 of Deep Learning (Ian Goodfellow, 2016-09-27)

1. Regularization is any modification made to a learning algorithm intended to reduce generalization error without increasing training error.
2. Common regularization techniques for deep learning include weight decay, norm penalties like L1 and L2, dataset augmentation, multi-task learning, early stopping, dropout, and adversarial training.
3. Regularization helps improve generalization by preventing overfitting, encouraging sparsity, introducing invariance, and making the model robust.


Regularization for Deep Learning
Lecture slides for Chapter 7 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
Definition

• “Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.”

(Goodfellow 2016)
Weight Decay as Constrained Optimization

Figure 7.1 (weight space with axes w1 and w2; w* marks the unregularized optimum)
(Goodfellow 2016)
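A sketch of the constrained-optimization view the figure illustrates, in the book's general notation (Ω is a norm penalty on the parameters and k is an assumed size of the constraint region). The penalized objective

\tilde{J}(\theta) = J(\theta) + \alpha \, \Omega(\theta)

corresponds, for some k that depends on α, to the constrained problem

\min_{\theta} J(\theta) \quad \text{subject to} \quad \Omega(\theta) \le k,

with generalized Lagrangian

\mathcal{L}(\theta, \alpha) = J(\theta) + \alpha \, (\Omega(\theta) - k), \qquad \alpha \ge 0.

Minimizing the penalized objective can thus be seen as finding the best parameters inside a norm ball of size k; a larger α corresponds to a smaller constraint region.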
Norm Penalties

• L1: Encourages sparsity; equivalent to MAP Bayesian estimation with a Laplace prior

• Squared L2: Encourages small weights; equivalent to MAP Bayesian estimation with a Gaussian prior

(Goodfellow 2016)
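A minimal NumPy sketch of how these penalties enter the training objective; the linear model, data, and coefficient alpha here are illustrative assumptions rather than anything specified in the slides:

import numpy as np

def penalized_loss(w, X, y, alpha=0.01, penalty="l2"):
    """Mean squared error plus a norm penalty Omega(w) on the weights."""
    mse = np.mean((X @ w - y) ** 2)
    if penalty == "l1":
        omega = np.sum(np.abs(w))      # L1: gradient alpha * sign(w), drives weights to exactly zero
    else:
        omega = 0.5 * np.sum(w ** 2)   # squared L2: gradient alpha * w, i.e. weight decay
    return mse + alpha * omega

The L2 term shrinks every weight toward zero in proportion to its size, while the L1 term applies a constant-magnitude pull that produces exact zeros, which is the sparsity mentioned above.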
Dataset Augmentation

Example transformations: affine distortion, elastic deformation, noise, horizontal flip, random translation, hue shift.

(Goodfellow 2016)
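A hedged sketch of label-preserving augmentation for image arrays; the particular transforms, shift range, and noise scale are illustrative choices, not values from the slides:

import numpy as np

def augment(image, rng):
    """Apply simple random, label-preserving transforms to an H x W x C array."""
    out = image.copy()
    if rng.random() < 0.5:                        # horizontal flip
        out = out[:, ::-1, :]
    shift = int(rng.integers(-2, 3))              # small random translation along the width
    out = np.roll(out, shift, axis=1)
    out = out + rng.normal(0.0, 0.01, out.shape)  # additive noise
    return out

rng = np.random.default_rng(0)
batch = rng.random((8, 32, 32, 3))                # stand-in for a batch of training images
augmented = np.stack([augment(img, rng) for img in batch])

Affine distortions, elastic deformations, and hue shifts follow the same pattern: random, class-preserving perturbations applied on the fly during training.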
Multi-Task Learning

1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in figure 7.2.

2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in figure 7.2.

Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks (a shared representation h(shared) feeds task-specific layers h(1), h(2), h(3) and outputs y(1), y(2)).
(Goodfellow 2016)
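A minimal sketch of the parameter sharing in figure 7.2: one shared lower layer trained on the pooled data and a separate head per task. The layer sizes and the forward-only NumPy implementation are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_shared, n_task = 10, 16, 8

# Generic parameters: the shared lower layer h(shared).
W_shared = rng.normal(size=(n_in, n_shared))

# Task-specific parameters: one upper layer and output per task, e.g. h(1) -> y(1).
heads = {task: (rng.normal(size=(n_shared, n_task)), rng.normal(size=(n_task, 1)))
         for task in ("task1", "task2")}

def forward(x, task):
    h_shared = np.tanh(x @ W_shared)   # shared representation
    W_h, W_y = heads[task]
    h_task = np.tanh(h_shared @ W_h)   # task-specific hidden layer
    return h_task @ W_y                # task-specific output

y1 = forward(rng.normal(size=(4, n_in)), "task1")

Gradients from every task update W_shared, while each head sees only its own task's examples.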
Learning Curves

Early stopping: terminate while validation set performance is better

Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (loss plotted against time in epochs, for the training set and the validation set).
(Goodfellow 2016)
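A minimal sketch of the early-stopping procedure the curves motivate: keep the parameters from the epoch with the lowest validation loss and stop once it has not improved for patience epochs. The train_one_epoch and validation_loss callables are assumed placeholders for any model and data pipeline:

import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=250, patience=10):
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # one pass over the training set
        val = validation_loss(model)           # evaluate on held-out data
        if val < best_loss:
            best_loss = val
            best_model = copy.deepcopy(model)  # remember the best parameters seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                          # validation loss has stopped improving
    return best_model, best_loss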
Early Stopping and Weight Decay

Figure 7.4: trajectories in weight space (axes w1 and w2), comparing the unregularized optimum w* with the solution w̃ reached under early stopping and under L2 weight decay.
(Goodfellow 2016)
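For reference, a sketch of the correspondence the figure illustrates: under the book's quadratic approximation of the cost around w* (and its small-eigenvalue assumptions), stopping after τ gradient steps with learning rate ε behaves approximately like L2 weight decay with coefficient α, where

\alpha \approx \frac{1}{\tau \epsilon}
\qquad \Longleftrightarrow \qquad
\tau \approx \frac{1}{\epsilon \alpha}.

So fewer training steps (or a smaller learning rate) correspond to stronger effective weight decay.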
Sparse Representations

\begin{bmatrix} -14 \\ 1 \\ 19 \\ -2 \\ 23 \end{bmatrix}
=
\begin{bmatrix}
 3 & -1 &  2 & -5 &  4 &  1 \\
 4 &  2 & -3 & -1 &  1 &  3 \\
-1 &  5 &  4 &  2 & -3 & -2 \\
 3 & -1 &  2 & -3 &  0 & -3 \\
-5 &  4 & -2 &  2 & -5 & -1
\end{bmatrix}
\begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ -3 \\ 0 \end{bmatrix}
\qquad (7.47)

y \in \mathbb{R}^m, \quad B \in \mathbb{R}^{m \times n}, \quad h \in \mathbb{R}^n
In the first expression (equation 7.46 in the book, not shown here), we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector. Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization: norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation, Ω(h).
(Goodfellow 2016)
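A minimal sketch of representational regularization: the L1 penalty is applied to the activations h rather than to the parameters. The one-layer model, data, and coefficient alpha are illustrative assumptions:

import numpy as np

def loss_with_sparse_representation(W, V, x, y, alpha=0.1):
    """Squared error plus an L1 penalty Omega(h) on the representation."""
    h = np.tanh(W @ x)            # representation h of the data x
    y_hat = V @ h
    squared_error = np.sum((y - y_hat) ** 2)
    omega_h = np.sum(np.abs(h))   # penalizing activations encourages a sparse h
    return squared_error + alpha * omega_h

rng = np.random.default_rng(0)
W, V = rng.normal(size=(6, 4)), rng.normal(size=(3, 6))
x, y = rng.normal(size=4), rng.normal(size=3)
value = loss_with_sparse_representation(W, V, x, y)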
Bagging

Original dataset → first resampled dataset → first ensemble member; second resampled dataset → second ensemble member

Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on the dataset depicted above, containing an 8, a 6 and a 9. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8.
(Goodfellow 2016)
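A minimal sketch of the bagging procedure the figure depicts: each ensemble member is trained on a dataset of the original size drawn by sampling with replacement, and the members' predictions are averaged. The fit and predict callables are assumed placeholders for any base learner:

import numpy as np

def bagging_predict(X_train, y_train, X_test, fit, predict, n_members=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_train)
    member_predictions = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        model = fit(X_train[idx], y_train[idx])   # train one ensemble member
        member_predictions.append(predict(model, X_test))
    return np.mean(member_predictions, axis=0)    # average the ensemble's outputs

Because each resampled dataset omits and repeats different examples, the members make partly independent errors, and averaging them reduces the ensemble's generalization error.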
Dropout

Figure 7.6: the base network (inputs x1, x2; hidden units h1, h2; output y) and the ensemble of subnetworks obtained by removing non-output units from it.
(Goodfellow 2016)
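A minimal sketch of (inverted) dropout on a hidden layer, which samples one of the subnetworks in figure 7.6 at each training step; the keep probability and layer shapes are illustrative assumptions:

import numpy as np

def dropout(h, keep_prob, rng, training=True):
    """Randomly zero hidden units; rescale so expected activations match test time."""
    if not training:
        return h                              # at test time, use the full network
    mask = rng.random(h.shape) < keep_prob    # sample which units to keep
    return h * mask / keep_prob               # inverted-dropout rescaling

rng = np.random.default_rng(0)
h = np.tanh(rng.normal(size=(4, 2)))          # activations of h1, h2 for a small batch
h_dropped = dropout(h, keep_prob=0.5, rng=rng)

Each training step thus trains a different randomly chosen subnetwork, while all subnetworks share the same underlying weights.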
Adversarial Examples

Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b). (In the figure, x is classified as "panda" with 57.7% confidence; the perturbation ε sign(∇x J(θ, x, y)) is classified as "nematode" with 8.2% confidence; and x + ε sign(∇x J(θ, x, y)), with ε = .007, is classified as "gibbon" with 99.3% confidence.)

Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.

Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖₁, which can be a very large amount if w is high-dimensional. Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant.
(Goodfellow 2016)
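A minimal sketch of the fast gradient sign construction shown in figure 7.8, written for a simple logistic-regression model; the loss, its input gradient, and the perturbation size epsilon are illustrative stand-ins for J(θ, x, y) in a deep network:

import numpy as np

def input_gradient(w, b, x, y):
    """Gradient of the logistic cross-entropy loss with respect to the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted probability of the positive class
    return (p - y) * w                       # dJ/dx for sigmoid + cross-entropy

def fgsm_example(w, b, x, y, epsilon=0.007):
    """Return x + epsilon * sign(grad_x J(theta, x, y))."""
    return x + epsilon * np.sign(input_gradient(w, b, x, y))

rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.0
x, y = rng.normal(size=8), 1.0
x_adv = fgsm_example(w, b, x, y)   # adversarial training would add (x_adv, y) to the training set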
Tangent Propagation

Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function f(x). (Axes x1 and x2; the normal and tangent directions of the data manifold are shown at a point.)
(Goodfellow 2016)
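For reference, a sketch of the tangent prop penalty in the book's notation, where the v^(i) are known tangent vectors of the data manifold at x: the regularizer penalizes the directional derivative of the output f(x) along each tangent direction,

\Omega(f) = \sum_i \left( \left( \nabla_x f(x) \right)^{\top} v^{(i)} \right)^2 ,

so that f changes slowly as x moves along the manifold while remaining free to change across it.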
