MIT Art Design and Technology University
MIT School of Computing, Pune
21BTCS031 – Deep Learning & Neural Networks
Class - L.Y. CORE (SEM-I)
Unit - II Deep Networks
Dr. Anant Kaulage, Dr. Sunita Parinam, Dr. Mayura Shelke, Dr. Aditya Pai
AY 2024-2025 SEM-I

Regularization
Unit II Introduction
A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization: "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."

Intuition
The loss function is the sum of squared differences between the actual values and the predicted values:

J(θ) = Σᵢ (yᵢ − ŷᵢ)²
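As a quick numpy illustration of this loss (the arrays y and y_hat are toy values invented for the example):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])        # actual values (toy data)
y_hat = np.array([1.1, 1.9, 3.2])    # model predictions (toy data)
loss = np.sum((y - y_hat) ** 2)      # sum of squared differences
print(loss)
```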
When we penalize the weights θ₃ and θ₄ and make them very small, close to zero, those terms become negligible, which helps simplify the model.

Parameter Norm Penalties
Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. The regularized objective function J̃ is:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term Ω relative to the standard objective function J.
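A minimal sketch of this composition; `loss_fn`, `omega`, and the example `mse`/`l2` helpers below are hypothetical stand-ins, not a fixed API:

```python
import numpy as np

def regularized_objective(theta, X, y, loss_fn, omega, alpha):
    """J~(theta; X, y) = J(theta; X, y) + alpha * Omega(theta)."""
    return loss_fn(theta, X, y) + alpha * omega(theta)

# Example with a hypothetical mean-squared-error loss and L2 penalty:
mse = lambda theta, X, y: np.mean((X @ theta - y) ** 2)
l2 = lambda theta: 0.5 * np.sum(theta ** 2)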
We typically choose a parameter norm penalty Ω that penalizes only the weights at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights, and regularizing the bias parameters can introduce a significant amount of underfitting. It is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network.

L2 Parameter Regularization
The L2 parameter norm penalty is commonly known as weight decay. This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = ½||w||₂² to the objective function. It is also known as ridge regression or Tikhonov regularization. To study the behavior of weight decay, consider the regularized objective function and its gradient; assume no bias parameter, so θ is just w:

J̃(w; X, y) = (α/2) wᵀw + J(w; X, y)
with the corresponding parameter gradient

∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y)

To take a single gradient step to update the weights:

w ← w − ε(αw + ∇_w J(w; X, y))

Written another way, the update is:

w ← (1 − εα)w − ε ∇_w J(w; X, y)

The weight decay term has modified the learning rule to multiplicatively shrink the weight vector by the constant factor (1 − εα) on each step.
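A minimal sketch of this update rule, assuming a hypothetical callable `grad_J` that returns ∇_w J(w):

```python
import numpy as np

def weight_decay_step(w, grad_J, alpha, eps):
    """One gradient step on the L2-regularized objective:
    w <- (1 - eps * alpha) * w - eps * grad_J(w)."""
    return (1.0 - eps * alpha) * w - eps * grad_J(w)
```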
Consider a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w* = arg min_w J(w). If the objective function is truly quadratic, the approximation Ĵ is given by

Ĵ(w) = J(w*) + ½ (w − w*)ᵀ H (w − w*)

where H is the Hessian matrix of J with respect to w, evaluated at w*. There is no first-order term because w* is a minimum, where the gradient vanishes.
To study the effect of weight decay, we modify this equation by adding the weight decay gradient. We can now solve for the minimum of the regularized version of Ĵ. We use the variable w̃ to represent the location of the minimum:

αw̃ + H(w̃ − w*) = 0
(H + αI) w̃ = H w*
w̃ = (H + αI)⁻¹ H w*

Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQᵀ. Applying the decomposition gives

w̃ = Q(Λ + αI)⁻¹ Λ Qᵀ w*
We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H. Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λᵢ/(λᵢ + α), as checked numerically in the sketch below.
1. The weight vector w* is rotated to w̃.
2. All of its elements shrink, but some shrink more than others.
3. This ensures that only important features are given high weights.
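A toy numpy check of this rescaling; the Hessian H, minimum w*, and α are made-up values for the example:

```python
import numpy as np

# Toy 2-D check of w~ = Q (Lambda + alpha*I)^-1 Lambda Q^T w*.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # Hessian at w* (real, symmetric)
w_star = np.array([1.0, -2.0])      # unregularized minimum (made up)
alpha = 0.5

lam, Q = np.linalg.eigh(H)          # H = Q diag(lam) Q^T
shrink = lam / (lam + alpha)        # per-eigendirection factor lam_i / (lam_i + alpha)
w_tilde = Q @ (shrink * (Q.T @ w_star))

# Agrees with the direct solution of (H + alpha*I) w~ = H w*.
assert np.allclose(w_tilde, np.linalg.solve(H + alpha * np.eye(2), H @ w_star))
```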
L1 Regularization

L1 regularization on the model parameter w is defined as the sum of absolute values of the individual parameters:

Ω(θ) = ||w||₁ = Σᵢ |wᵢ|

As with L2 weight decay, L1 weight decay controls the strength of the regularization by scaling the penalty Ω with a positive hyperparameter α. The regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α||w||₁ + J(w; X, y)
with the corresponding gradient

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(w; X, y)

where sign(w) is simply the sign of w applied element-wise; a subgradient step built on this is sketched after the list below. In comparison to L2 regularization, L1 regularization results in a solution that is more sparse. "Sparse" solutions, with many parameters set to zero:
● can be more interpretable
● can require less memory and less computation
● might generalize better (but also often not!)
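A minimal subgradient-step sketch for this objective (`grad_J` is a hypothetical callable); in practice, proximal/soft-thresholding updates are often preferred because they produce exact zeros:

```python
import numpy as np

def l1_subgradient_step(w, grad_J, alpha, eps):
    """One subgradient step on J~(w) = alpha * ||w||_1 + J(w):
    w <- w - eps * (alpha * sign(w) + grad_J(w))."""
    return w - eps * (alpha * np.sign(w) + grad_J(w))
```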
Like L2 regularization, we penalize weights with large magnitudes. However, the solutions are qualitatively different: with L1 regularization, some of the parameters will often be exactly zero.
Why L1?
● The L1 regularizer is popular because it gives sparse solutions and it is convex.
● If the error function is also convex, it is possible to find the global optimum.

L2 VS L1
L1:
• penalizes the sum of absolute values of weights
• has a sparse solution
• has multiple solutions
• has built-in feature selection
• is robust to outliers
• generates models that are simple and interpretable but cannot learn complex patterns

L2:
• penalizes the sum of squared weights
• has a non-sparse solution
• has one solution
• has no feature selection
• is not robust to outliers
• gives better prediction when the output variable is a function of all input features
• is able to learn complex data patterns

Data Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Data augmentation is a particularly effective technique for a specific classification problem: object recognition. One must be careful not to apply transformations that would change the correct class.
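A toy sketch of label-preserving transforms for images, assuming (H, W, C) arrays; the flip and shift used here are illustrative, and transforms must be chosen per task so the correct class is unchanged:

```python
import numpy as np

def augment(image, rng):
    """Label-preserving augmentations for object recognition:
    random horizontal flip plus a small horizontal shift."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]          # horizontal flip
    shift = int(rng.integers(-2, 3))       # shift by up to 2 pixels
    return np.roll(image, shift, axis=1)

rng = np.random.default_rng(0)
augmented = augment(np.zeros((32, 32, 3)), rng)
```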
Noise Robustness

Noise can be applied to the inputs as a dataset augmentation strategy. Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Consider the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar, using the least-squares cost function between the model predictions ŷ(x) and the true values y:

J = E_{p(x,y)} [(ŷ(x) − y)²]

We can show that for a simple input-output neural network, adding Gaussian noise to the input is equivalent to weight decay (L2 regularization). It can also be viewed as data augmentation.
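A minimal sketch of injecting input noise during training (`sigma` sets the noise scale):

```python
import numpy as np

def noisy_inputs(X, sigma, rng):
    """Add zero-mean Gaussian noise (scale sigma) to the inputs;
    for a simple linear model under least squares this acts like
    L2 weight decay, and it can also be viewed as data augmentation."""
    return X + sigma * rng.standard_normal(X.shape)
```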
Injecting Noise to Output

Most datasets have some number of mistakes in the y labels. It can be harmful to maximize log p(y | x) when y is a mistake. One way to prevent this is to explicitly model the noise on the labels: we can assume that, for some small constant ε, the training set label y is correct with probability 1 − ε, and otherwise any of the other possible labels might be correct.
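A minimal sketch of this label-smoothing scheme for one-hot targets:

```python
import numpy as np

def smooth_labels(y_onehot, eps):
    """Replace the hard targets 0 and 1 with eps/(k-1) and 1-eps,
    where k is the number of classes."""
    k = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + (1.0 - y_onehot) * eps / (k - 1)

# e.g. smooth_labels(np.eye(3), 0.1) turns [1, 0, 0] into [0.9, 0.05, 0.05]
```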
Early Stopping

When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again.
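A minimal sketch of the early-stopping loop, with hypothetical `train_epoch` and `validate` callables standing in for the actual training code:

```python
import numpy as np

def fit_with_early_stopping(train_epoch, validate, patience=5, max_epochs=200):
    """Stop when validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, since_best = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch, since_best = val_loss, epoch, 0
            # in practice, checkpoint the parameters here
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, best_loss
```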
Ensemble Methods

Model averaging (bagging ensembles) typically always helps, but training several large neural networks to form an ensemble is prohibitively expensive.
Option 1: Train several neural networks with different architectures (obviously expensive).
Option 2: Train multiple instances of the same network using different training samples (again expensive).
Even if we manage to train with option 1 or option 2, combining several models at test time is infeasible in real-time applications.
Dropout

Dropout is a technique that addresses both of these issues. Effectively, it allows training several neural networks without any significant computational overhead. It also gives an efficient approximate way of combining exponentially many different neural networks.
Dropout refers to dropping out units: temporarily removing a node and all its incoming/outgoing connections, resulting in a thinned network.
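A minimal sketch of the standard (inverted) dropout mask; integration into a full training loop is omitted:

```python
import numpy as np

def dropout(h, p_drop, rng, training=True):
    """Inverted dropout: during training, zero each unit with probability
    p_drop and rescale survivors by 1/(1 - p_drop) so the expected
    activation is unchanged; at test time the input passes through."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop   # keep with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)
```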
SUMMARY