Unit Online 1.3

The document discusses regularization techniques in deep learning to prevent overfitting, including L1 and L2 regularization, dropout, and early stopping. It explains the importance of model parameters and hyper-parameters in training deep neural networks, as well as the processes of model optimization and selection. Additionally, it covers concepts like training, validation, and test sets, along with error metrics such as Mean Squared Error and the Delta Learning Rule.


NEURAL NETWORKS & DEEP LEARNING

(21MCA24DB3)

Prepared & Presented By:


Dr. Balkishan
Assistant Professor
Department of Computer Science & Applications
Maharshi Dayanand University
Rohtak
Regularizing a Deep Network
(Technique to prevent overfitting)
• Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better.
• This in turn improves the model's performance on unseen data.
• It reduces the complexity of the model.
Regularization
• Regularization is a technique used to reduce errors by fitting the function appropriately on the given training set and avoiding overfitting.
The commonly used regularization techniques are:

- L2 regularization
- L1 regularization
- Dropout regularization
- Early stopping regularization

• A regression model that uses the L2 regularization technique is called Ridge regression. Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function (L).

• A regression model that uses the L1 regularization technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso regression adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function (L).
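As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of how the two penalty terms attach to an ordinary least-squares loss; the function name, data shapes and α value are assumptions made for this example:

    import numpy as np

    def regularized_loss(w, X, y, alpha, kind="l2"):
        """Squared-error loss plus an L1 or L2 penalty on the weights w."""
        residuals = X @ w - y
        ls_obj = np.sum(residuals ** 2)      # least-squares objective
        if kind == "l2":                     # Ridge: alpha * sum(w_j^2)
            penalty = alpha * np.sum(w ** 2)
        else:                                # Lasso: alpha * sum(|w_j|)
            penalty = alpha * np.sum(np.abs(w))
        return ls_obj + penalty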
L2 Regularization
[Slide figure: Equation (1), a linear regression in which x is the independent variable, y is the dependent variable, and 0.7, 1.2, 21 and 39 are the regression coefficients, shown alongside a scaled-down version of equation (1) in which L2 regularization has shrunk the coefficients.]
What is Ridge Regression?

• Ridge regression is a model tuning method that is used to analyze data that suffers from multicollinearity.
• This method performs L2 regularization.
• When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, which results in predicted values being far from the actual values.
Important Observations
• In simple terms, the minimization objective = LS Obj + α (sum of the squares of the coefficients)
• where LS Obj is the Least Squares Objective, i.e. the linear regression objective without regularization.
• Here α is the tuning factor that controls the strength of the penalty term.
• If α = 0, the objective becomes the same as simple linear regression, so we get the same coefficients as simple linear regression.
• If α = ∞, the coefficients will be zero, because any nonzero coefficient would receive infinite weight from the penalty term and make the objective infinite.
• If 0 < α < ∞, the magnitude of α decides the weightage given to the different parts of the objective.
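To make the role of α concrete, here is a hedged NumPy sketch (the toy data are invented; the closed-form ridge solution w = (XᵀX + αI)⁻¹Xᵀy is standard but not from the slides) showing how a larger α shrinks the coefficients:

    import numpy as np

    # Toy data (invented for illustration): y depends on two correlated features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)   # near-multicollinear
    y = X @ np.array([2.0, 1.0]) + 0.1 * rng.normal(size=50)

    for alpha in [0.0, 1.0, 100.0]:
        # Closed-form ridge solution: w = (X^T X + alpha*I)^(-1) X^T y
        w = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)
        print(alpha, w)   # larger alpha shrinks the coefficients toward zero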
L1 Regularization
Dropout Regularization
• Randomly selected neurons are ignored ("dropped out") during each training step.
• Dropped neurons have no effect on the next layers.
• Dropped neurons are not updated during backward training.
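A minimal NumPy sketch of dropout as described above, in its common "inverted" form (the function name and drop rate are assumptions for this example; the slides do not specify the rescaling detail):

    import numpy as np

    def dropout(activations, drop_rate=0.5, training=True):
        """Inverted dropout: zero out random units and rescale the survivors."""
        if not training or drop_rate == 0.0:
            return activations
        keep_prob = 1.0 - drop_rate
        mask = np.random.rand(*activations.shape) < keep_prob
        # Scaling by 1/keep_prob keeps the expected activation unchanged,
        # so no rescaling is needed at inference time.
        return activations * mask / keep_prob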


Model Exploration and Hyper Parameter Tuning

Model Parameters (Learned during training)


• Model parameters are the entities learned via training from the training data.
• They are not set manually by the designer.
With respect to deep neural networks, the model parameters are:
- Weights
- Biases
Model Hyper-parameters (Control the parameters)
• These are parameters that govern (control) the determination of the model parameters during training.
- They are typically set manually via heuristics.
- They are tuned during a cross-validation phase.
Examples:
Learning rate, number of layers, number of units in each layer, activation functions, and many others.
• What is a model?
• A model is described by its hyper-parameters, because the hyper-parameters govern (control) the parameters of the network.

Implicitly, the model contains:

- The topology of the deep neural network (i.e., the layers and their interconnections)
- The learned parameters (i.e., the learned weights and biases)

The model depends on the hyper-parameters because the hyper-parameters determine the learned parameters (weights and biases).

Hyper-parameters include:

- Learning rate
- Number of layers
- Number of units in each layer
- Activation functions
- Etc.
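As a small illustration (the specific names and values below are invented for this example), the hyper-parameters can be gathered into a configuration that is fixed before training, while the weights and biases emerge from training itself:

    # Hyper-parameters: chosen by the designer before training (illustrative values).
    hyper_params = {
        "learning_rate": 0.01,
        "num_layers": 3,
        "units_per_layer": [128, 64, 10],
        "activation": "relu",
    }

    # Model parameters: learned from the training data, never set by hand.
    model_params = {"weights": [], "biases": []}   # filled in by the training loop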
Model Optimization

• To optimize the model (its inference-time behavior), a process known as model selection is performed.
• Model selection involves selecting the hyper-parameters that yield the best performance of the neural network.
• The hyper-parameters are tuned using an iterative process of either:
- Validation
- Cross-validation
• Many models may be evaluated during the validation/cross-validation phase, and the optimal model is selected.
• The optimal model is then evaluated on the test dataset to determine how well it performs on data never seen before.
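A minimal sketch of validation-based model selection (a ridge regression stands in for the neural network here, and all data, function names and α candidates are invented for this example):

    import numpy as np

    def fit_ridge(X, y, alpha):
        """Closed-form ridge fit, standing in for 'training a model'."""
        n_features = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

    def val_error(w, X, y):
        return np.mean((X @ w - y) ** 2)

    rng = np.random.default_rng(1)
    X_train, y_train = rng.normal(size=(80, 3)), rng.normal(size=80)
    X_val, y_val = rng.normal(size=(20, 3)), rng.normal(size=20)

    # Try several hyper-parameter candidates; keep the best on the validation set.
    best_alpha, best_err = None, np.inf
    for alpha in [0.01, 0.1, 1.0, 10.0]:
        w = fit_ridge(X_train, y_train, alpha)
        err = val_error(w, X_val, y_val)
        if err < best_err:
            best_alpha, best_err = alpha, err
    print("selected alpha:", best_alpha)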
Training Set, Validation Set and Test Set
• Training set – the data set used to learn the optimal model parameters (weights, biases).
• Validation ("dev") set – the data set used to perform model selection (tuning of the hyper-parameters).
• It is used to estimate the generalization error during training, allowing the hyper-parameters to be updated accordingly.
• Test set – the data set used to assess the fully trained model.
• A fully trained model is a model that has been selected via hyper-parameter tuning and has subsequently been trained to determine the optimal weights and biases (e.g., using back-propagation).
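A minimal NumPy sketch of the three-way split described above (the 60/20/20 proportions, seed and function name are assumptions for this example; X and y are assumed to be NumPy arrays):

    import numpy as np

    def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
        """Shuffle the data and split it into training, validation and test sets."""
        n = len(X)
        idx = np.random.default_rng(seed).permutation(n)
        n_test = int(n * test_frac)
        n_val = int(n * val_frac)
        test = idx[:n_test]
        val = idx[n_test:n_test + n_val]
        train = idx[n_test + n_val:]
        return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])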
Train, Validation and Test Sets
Gradient Descent Learning
Mean Squared Error
• Mean Squared Error (MSE) = SSE/n, where n is the number of instances in the data set.
– SSE means Sum of Squared Errors.
– Dividing by n normalizes the error for data sets of different sizes.
– MSE is the average squared error per pattern.
• Root Mean Squared Error (RMSE) is the square root of the MSE.
– This puts the error value back into the same units as the features and can thus be more intuitive.
– RMSE is the average distance (error) of the targets from the outputs, in the same scale as the features.
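The two error metrics in a minimal NumPy sketch (the function names are assumptions for this example):

    import numpy as np

    def mse(targets, outputs):
        """Mean Squared Error: SSE divided by the number of instances n."""
        sse = np.sum((targets - outputs) ** 2)   # Sum of Squared Errors
        return sse / len(targets)

    def rmse(targets, outputs):
        """Root Mean Squared Error: back in the same units as the targets."""
        return np.sqrt(mse(targets, outputs))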
Gradient Descent Learning: Minimize
(or Maximize) the Objective Function

[Slide figure: the error landscape, plotting SSE (Sum Squared Error, Σ (t_i − z_i)²) against the weight values.]
Delta Learning Rule
(Widrow-Hoff Rule)
• The goal is to decrease the overall error each time a weight is changed.
• The Total Sum of Squared Errors (SSE) is called the objective function: E = Σ (t_i − z_i)²
• The delta learning rule is valid only for continuous activation functions and in the supervised training mode.
• The delta rule may be stated as: "the adjustment made to a synaptic weight of a neuron is proportional to the product of the error signal and the input signal of the synapse".
Delta Rule for Single Output Unit
• The delta rule changes the weight of the connection to minimize the difference between the net input to the output unit, y_in, and the target value t.
The delta rule is given as

Δw_i = α (t − y_in) x_i

- where x is the vector of activations of the input units
- y_in is the net input to the output unit
- t is the target value and α is the learning rate
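A minimal NumPy sketch of one delta-rule update for a single output unit (the input activations, weights, target and learning rate are invented values for illustration):

    import numpy as np

    x = np.array([1.0, 0.5, -0.3])     # activations of the input units
    w = np.array([0.2, -0.1, 0.4])     # current weights
    t, alpha = 1.0, 0.1                # target value and learning rate

    y_in = w @ x                       # net input to the output unit
    delta_w = alpha * (t - y_in) * x   # delta rule: proportional to error * input
    w = w + delta_w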
Difference between Perceptron and Delta
(Widrow-Hoff) Learning Rule
• The Widrow-Hoff rule is very similar to the perceptron learning rule, but their origins are different.
• The perceptron learning rule originates from the Hebbian assumption, while the delta rule is derived from the gradient-descent method.
• The perceptron learning rule stops after a finite number of learning steps, but the gradient-descent approach continues forever, converging only asymptotically to the solution.
• The delta rule updates the weights of the connections so as to minimize the difference between the net input to the output unit and the target value.
