Deep Learning Basics

Lecture 4: Regularization II
Princeton University COS 495
Instructor: Yingyu Liang
Review
Regularization as hard constraint
• Constrained optimization
$$\min_{\theta} \; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i)$$

subject to: $R(\theta) \le r$
Regularization as soft constraint
• Unconstrained optimization
$$\min_{\theta} \; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda R(\theta)$$

for some regularization parameter $\lambda > 0$
Regularization as Bayesian prior
• Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

• Maximum A Posteriori (MAP):

$$\max_{\theta} \; \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \left[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \right]$$

where $\log p(\theta)$ corresponds to the regularization and $\log p(\{x_i, y_i\} \mid \theta)$ to the MLE loss.
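As a standard concrete instance (not spelled out on the slide), a Gaussian prior on the parameters recovers the $l_2$ penalty:

% Assume a Gaussian prior p(\theta) \propto \exp\!\left( -\tfrac{\|\theta\|_2^2}{2\sigma^2} \right).
% Then, up to an additive constant,
\max_{\theta} \; \log p(\theta \mid \{x_i, y_i\})
    = \max_{\theta} \left[ \log p(\{x_i, y_i\} \mid \theta) - \frac{1}{2\sigma^2} \|\theta\|_2^2 \right]
% i.e. maximum likelihood plus an l_2 penalty with \lambda = 1/(2\sigma^2).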


Classical regularizations
• Norm penalty
• 𝑙2 regularization
• 𝑙1 regularization
More examples
Other types of regularizations
• Robustness to noise
• Noise to the input
• Noise to the weights
• Noise to the output
• Data augmentation
• Early stopping
• Dropout
Multiple optimal solutions?

[Figure: linearly separable data from Class +1 and Class -1, with three separating hyperplanes $w_1$, $w_2$, $w_3$; prefer $w_2$ (higher confidence).]


Add noise to the input

[Figure: the same data with noise added to the inputs; $w_2$ (higher confidence) still separates the noisy points.]


Caution: not too much noise
Too much noise leads to data points crossing the boundary.

[Figure: with excessive input noise, noisy points from Class +1 and Class -1 cross the separator $w_2$, so even the preferred $w_2$ (higher confidence) misclassifies them.]


Equivalence to weight decay
• Suppose the hypothesis is $f(x) = w^T x$ and the noise is $\epsilon \sim N(0, \lambda I)$
• After adding noise to the input, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f(x+\epsilon) - y \right]^2 = \mathbb{E}_{x,y,\epsilon} \left[ f(x) + w^T \epsilon - y \right]^2$$

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f(x) - y \right]^2 + 2\, \mathbb{E}_{x,y,\epsilon} \left[ w^T \epsilon \left( f(x) - y \right) \right] + \mathbb{E}_{x,y,\epsilon} \left[ w^T \epsilon \right]^2$$

$$L(f) = \mathbb{E}_{x,y} \left[ f(x) - y \right]^2 + \lambda \| w \|^2$$

The cross term vanishes because $\epsilon$ is zero-mean and independent of $(x, y)$, and $\mathbb{E}\left[ w^T \epsilon \right]^2 = w^T (\lambda I) w = \lambda \| w \|^2$.
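A minimal numerical sketch of this equivalence (not from the slides; the data, dimensions, and value of lam below are made-up illustration choices):

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100_000, 5, 0.1

# A fixed linear hypothesis f(x) = w^T x and some synthetic data.
w = rng.normal(size=d)
x = rng.normal(size=(n, d))
y = x @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Empirical loss with Gaussian noise eps ~ N(0, lam * I) added to the inputs.
eps = rng.normal(scale=np.sqrt(lam), size=(n, d))
noisy_loss = np.mean(((x + eps) @ w - y) ** 2)

# Clean loss plus the weight-decay term lam * ||w||^2.
clean_plus_decay = np.mean((x @ w - y) ** 2) + lam * (w @ w)

print(noisy_loss, clean_plus_decay)  # the two numbers should nearly agree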
Add noise to the weights
• For the loss on each data point, add a noise term to the weights before computing the prediction:

$$\epsilon \sim N(0, \eta I), \qquad w' = w + \epsilon$$

• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$
Add noise to the weights
• The loss becomes

$$L(f) = \mathbb{E}_{x,y,\epsilon} \left[ f_{w+\epsilon}(x) - y \right]^2$$

• To simplify, use a Taylor expansion in $\epsilon$ around $w$:

$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T \nabla_w f_w(x) + \frac{\epsilon^T \nabla_w^2 f_w(x)\, \epsilon}{2}$$

• Plug in:

$$L(f) \approx \mathbb{E} \left[ f_w(x) - y \right]^2 + \eta\, \mathbb{E} \left[ \left( f_w(x) - y \right) \nabla_w^2 f_w(x) \right] + \eta\, \mathbb{E} \left\| \nabla_w f_w(x) \right\|^2$$

The second term is small and can be ignored; the last term acts as the regularization term.
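For the linear case $f_w(x) = w^T x$ the gradient $\nabla_w f_w(x)$ is simply $x$ (and the Hessian term is exactly zero), so the extra term is $\eta\, \mathbb{E}\|x\|^2$. A rough numerical check under the same kind of made-up synthetic setup as before (illustrative only):

import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 200_000, 5, 0.05

w = rng.normal(size=d)
x = rng.normal(size=(n, d))
y = x @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Empirical loss with Gaussian noise eps ~ N(0, eta * I) added to the weights.
eps = rng.normal(scale=np.sqrt(eta), size=(n, d))
weight_noise_loss = np.mean((np.sum((w + eps) * x, axis=1) - y) ** 2)

# Clean loss plus eta * E||grad_w f_w(x)||^2, which equals eta * E||x||^2 here.
clean_plus_penalty = np.mean((x @ w - y) ** 2) + eta * np.mean(np.sum(x ** 2, axis=1))

print(weight_noise_loss, clean_plus_penalty)  # should be close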


Data augmentation

[Figure from "Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7", by Keven Wang]
Data augmentation
• Adding noise to the input: a special kind of augmentation

• Be careful about which transformations are applied (see the sketch after this list):
• Example: classifying ‘b’ and ‘d’ (a horizontal flip changes the label)
• Example: classifying ‘6’ and ‘9’ (a 180° rotation changes the label)
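A minimal augmentation sketch using small random rotations (scipy and the 15-degree range are assumptions made for illustration, not part of the lecture):

import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, max_angle=15.0, rng=None):
    # Return one randomly rotated copy of each image in `images`
    # (an array of shape [num_images, height, width]).
    # Keep the angle range small: a 180-degree rotation would turn a '6'
    # into a '9' and silently change the correct label.
    rng = rng or np.random.default_rng()
    rotated = []
    for img in images:
        angle = rng.uniform(-max_angle, max_angle)
        rotated.append(rotate(img, angle, reshape=False, mode="nearest"))
    return np.stack(rotated)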
Early stopping
• Idea: don’t train the network to too small a training error

• Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the training data and the true distribution

• Prevent overfitting: do not push the hypothesis too far; use the validation error to decide when to stop
Early stopping

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
Early stopping
• When training, also monitor the validation error
• Every time the validation error improves, store a copy of the weights
• When the validation error has not improved for some time, stop
• Return the stored copy of the weights (a code sketch of this loop follows)
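A sketch of this loop in Python; `train_one_epoch`, `validation_error`, and the patience value are placeholders assumed for illustration, not part of the lecture:

import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    # weights: any copyable parameter object
    # train_one_epoch(weights) -> weights after one more pass over the training data
    # validation_error(weights) -> scalar validation error
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    epochs_since_improvement = 0

    for _ in range(max_epochs):
        weights = train_one_epoch(weights)
        error = validation_error(weights)
        if error < best_error:
            # Validation error improved: store a copy of the weights.
            best_error = error
            best_weights = copy.deepcopy(weights)
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                # Not improved for `patience` epochs: stop.
                break

    # Return the stored copy, not the latest weights.
    return best_weights, best_error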
Early stopping
• Hyperparameter-selection view: the number of training steps is the hyperparameter being tuned

• Advantages
• Efficient: runs along with training; only an extra copy of the weights is stored
• Simple: no change to the model or algorithm

• Disadvantage: needs held-out validation data


Early stopping
• Strategy to get rid of the disadvantage
• After early stopping of the first run, train a second run and reuse the validation data

• How to reuse the validation data:
1. Start fresh; train with both the training data and the validation data up to the number of epochs found in the first run
2. Start from the weights of the first run; train with both the training data and the validation data until the validation loss falls below the training loss at the early-stopping point
Early stopping as a regularizer

[Figure from Deep Learning, Goodfellow, Bengio and Courville]
Dropout
• Randomly select which units (and hence which weights) participate in each update

• More precisely, in each update step:
• Randomly sample a different binary mask for all the input and hidden units
• Multiply the mask bits with the units and do the update as usual (see the sketch after this list)

• Typical dropout probabilities: 0.2 for input units and 0.5 for hidden units
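A minimal sketch of one training-time dropout forward pass for a two-layer network, assuming numpy arrays for the weights; the "inverted dropout" rescaling by the keep probability is a common implementation choice, not something stated on the slide:

import numpy as np

def dropout_forward(x, W1, b1, W2, b2, p_input=0.2, p_hidden=0.5, rng=None):
    # One forward pass with dropout masks on the input and hidden units.
    # Activations are divided by the keep probability ("inverted dropout")
    # so no rescaling is needed at test time.
    rng = rng or np.random.default_rng()

    # Mask input units with drop probability p_input.
    keep_in = 1.0 - p_input
    x = x * rng.binomial(1, keep_in, size=x.shape) / keep_in

    # Hidden layer (ReLU), then mask hidden units with drop probability p_hidden.
    h = np.maximum(0.0, x @ W1 + b1)
    keep_h = 1.0 - p_hidden
    h = h * rng.binomial(1, keep_h, size=h.shape) / keep_h

    # Output layer: no dropout applied to the outputs.
    return h @ W2 + b2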
Dropout

[Figures from Deep Learning, Goodfellow, Bengio and Courville]
What regularizations are frequently used?
• 𝑙2 regularization
• Early stopping
• Dropout

• Data augmentation, if the transformations are known and easy to implement
