
DEEP LEARNING

UNIT - II
By Dr. Himani Deshpande

UNIT – II
TRAINING, OPTIMIZATION AND REGULARIZATION OF DEEP NEURAL NETWORKS

Training Feedforward DNN
- Multi-Layered Feed Forward Neural Network, Learning Factors
- Activation functions: Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU
- Loss functions: Squared Error loss, Cross Entropy; choosing the output function and loss function

Optimization
- Learning with backpropagation
- Learning Parameters: Gradient Descent (GD), Stochastic and Mini-Batch GD, Momentum-Based GD, Nesterov Accelerated GD, AdaGrad, Adam, RMSProp

Regularization
- Overview of Overfitting, Types of biases, Bias-Variance Tradeoff
- Regularization Methods: L1, L2 regularization, Parameter sharing, Dropout, Weight Decay, Batch normalization, Early stopping, Data Augmentation, Adding noise to input and output
2.1 TRAINING FEEDFORWARD DNN

- Multi-Layered Feed Forward Neural Network, Learning Factors
- Activation functions: Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU
- Loss functions: Squared Error loss, Cross Entropy; choosing the output function and loss function
MULTILAYER FEED-FORWARD NEURAL NETWORK

- A Multilayer Feed-Forward Neural Network (MFFNN) is an interconnected Artificial Neural Network with multiple layers, in which neurons carry associated weights and compute their outputs using activation functions.
- It is a type of Neural Network in which information flows from the input units to the output units: there are no loops and no feedback, and no signal moves backward, that is, from the output layer to the hidden or input layers.
APPLICATIONS OF MULTILAYER FEED-FORWARD NEURAL NETWORKS

1. Medical field
2. Speech regeneration
3. Data processing and compression
4. Image processing
LEARNING FACTORS

Key learning factors for training such networks:

- Activation Function
- Learning Rate
- Loss Function
- Regularization Techniques
- Weight Initialization
- Optimization Algorithms
- Batch Size
OPTIMIZATION ALGORITHMS

- Optimization algorithms, such as gradient descent and its variants (e.g., stochastic gradient descent, Adam, RMSProp), determine how the network's weights are updated during training.
- These algorithms use the gradients of the loss function with respect to the network parameters to iteratively update the weights in the direction of decreasing error.
REGULARIZATION TECHNIQUES

- Regularization techniques help prevent overfitting, where the network becomes overly specialized to the training data and performs poorly on new, unseen data.
- Techniques like L1 and L2 regularization, dropout, and early stopping can be applied to prevent overfitting and improve the network's generalization ability.
ACTIVATION FUNCTIONS
Tanh, Logistic, Linear, Softmax, ReLU, Leaky ReLU

- In the process of building a neural network, one of the choices you get to make is which activation function to use in the hidden layers as well as at the output layer of the network.
- An activation function decides whether a neuron should be activated or not: it activates the neuron if, through its mathematical operations, it finds the input important to the prediction process.


BINARY STEP FUNCTION

- The binary step function is one of the simplest activation functions. It produces a binary output, hence the name. The function outputs 1 (true) when the input passes a threshold and 0 (false) when it does not.


BIPOLAR ACTIVATION FUNCTION

- The bipolar activation function converts the activation level of a unit (neuron) into an output signal. It is also known as a transfer function or squashing function, because of its ability to squeeze the amplitude range of the output signal to some finite value.


RAMP ACTIVATION FUNCTION

- The ramp function looks very similar to the sigmoid activation function: it maps inputs to outputs over the range (0, 1), but instead of a smooth curve it rises linearly, with sharp corners at the cut-off points. It is a truncated version of the linear function.


RELU FUNCTION

- One of the most popular activation functions in DL models, the rectified linear unit (ReLU) is a fast-learning activation function that delivers state-of-the-art results. Compared to functions like sigmoid and tanh, ReLU offers much better performance and generalization in deep learning. The function is nearly linear, so it retains the properties of linear models that make them easy to optimize with gradient-descent methods.


EXPONENTIAL LINEAR UNITS (ELU)

- The exponential linear unit (ELU) is an activation function that, like ReLU, is used to speed up the training of neural networks. Its biggest advantage is that it can mitigate the vanishing gradient problem by using the identity for positive values, while also improving the learning characteristics of the model.


HYPERBOLIC TANGENT FUNCTION (TANH)

- The hyperbolic tangent (tanh) function is another type of activation function. It is a smooth, zero-centered function with a range between -1 and 1.


SIGMOID FUNCTION

- In an ANN, the sigmoid function is a non-linear activation function used primarily in feedforward neural networks. It is a differentiable real function, defined for all real input values, with positive derivatives everywhere and a specific degree of smoothness. The sigmoid function appears in the output layer of deep learning models and is used for predicting probability-based outputs.
- Like the sigmoid, the tanh function is continuous and differentiable at all points.


SOFTMAX FUNCTION

- The softmax function is another type of activation function, used in neural networks to compute a probability distribution from a vector of real numbers. It generates outputs that range between 0 and 1, with the sum of the probabilities equal to 1.


SIGMOID VS SOFTMAX

- Sigmoid is used for binary classification, where there are only 2 classes, while softmax applies to multiclass problems. In fact, the softmax function is an extension of the sigmoid function. Both are sketched in code below.
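A minimal NumPy sketch of the activation functions above (Leaky ReLU, listed in the syllabus, is included too); the test vector is illustrative:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1); suits binary outputs.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered squashing into (-1, 1).
    return np.tanh(x)

def relu(x):
    # Passes positive inputs unchanged, zeroes out negatives.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope alpha for negative inputs.
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    # Turns a score vector into probabilities that sum to 1; subtracting
    # the max is a standard numerical-stability trick.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))   # approx [0.659 0.242 0.099], sums to 1
print(sigmoid(z))   # element-wise, each value in (0, 1)
```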
SIGMOID ACTIVATION FUNCTION APPLICATIONS

1. Binary classification in the logistic regression model: The sigmoid function is used in logistic regression models to classify data into two different categories.
2. Hidden layers of neural networks: In various deep learning architectures, the sigmoid activation function introduces nonlinearity during the information transformation process.
SOFTMAX ACTIVATION FUNCTION APPLICATIONS

1. Artificial and convolutional neural networks: In neural networks, output normalization is often applied to map non-normalized outputs to a probability distribution over the output classes. Softmax is commonly used in the final layers of neural-network-based classifiers, such as artificial and convolutional neural networks.
2. Multiclass classification methods: The softmax function is also employed in various multiclass classification techniques, including Multiclass Linear Discriminant Analysis (MLDA) and Naive Bayes classifiers.
3. Reinforcement learning: Softmax functions can be used to convert input values into scaled probabilities over different actions in reinforcement learning scenarios.
LOSS FUNCTIONS
Squared Error loss, Cross Entropy, Choosing output function and loss function

COST FUNCTION

- When we build a network, it tries to predict the output as close as possible to the actual value.
- We measure this accuracy of the network using the cost/loss function.
- The cost or loss function penalizes the network when it makes errors.

Gradient descent is an optimization algorithm for minimizing the cost.
LOSS & COST FUNCTION

The terms cost function and loss function are often used interchangeably, with a subtle distinction:

- Loss function: refers to the error for a single training example.
- Cost function: refers to the average of the loss function over the entire training dataset.
LOSS FUNCTION

- In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event, or the values of one or more variables, onto a real number intuitively representing some "cost" associated with the event.
LOSS FUNCTIONS

- Squared Error loss
- Cross Entropy
MEAN SQUARE ERROR

- Mean squared error (MSE) loss is a widely used loss function in machine learning and statistics that measures the average squared difference between the predicted values and the actual target values (see the sketch below).
- It is particularly useful for regression problems, where the goal is to predict continuous numerical values.
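A minimal NumPy sketch with made-up targets and predictions:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions.
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical regression targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # hypothetical model predictions
print(mse(y_true, y_pred))                  # 0.375
```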
MSE LOSS FUNCTION

Advantages:
1. Easy to interpret.
2. Differentiable, so it can be used with gradient descent.
3. Has only one local minimum.

Disadvantages:
1. The error is expressed in squared units of the target.
2. Not robust to outliers, since large errors are penalized quadratically.
MEAN ABSOLUTE ERROR

L = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

where y_i is the actual target and \hat{y}_i the predicted value.
CROSS ENTROPY

- Cross-entropy builds upon the idea of entropy from information theory and measures the difference between two probability distributions for a given random variable or set of events.
- Cross entropy can be applied in both binary and multi-class classification problems.

Entropy measures the average amount of uncertainty in a probability distribution.
- Cross entropy loss, also known as log loss, is a widely used loss function in machine learning, particularly for classification problems.
- It quantifies the difference between the predicted probability distribution and the actual (true) distribution of the target classes.
- Cross entropy loss is often used when training models that output probability estimates, such as logistic regression and neural networks.
BINARY AND MULTI-CLASS CROSS ENTROPY

- Binary cross entropy applies when there are exactly two classes.
- Categorical (multi-class) cross entropy applies when there are more than two classes.
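Written out, the standard definitions are:

```latex
% Binary cross entropy over N examples, with targets y_i \in \{0,1\}
% and predicted probabilities \hat{y}_i:
L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \Big]

% Categorical cross entropy over C classes, with one-hot targets y_{i,c}
% and predicted class probabilities \hat{y}_{i,c}:
L_{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c}
```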
SELECTING A LOSS FUNCTION

- MSE is particularly well suited for regression tasks, as it quantifies the average squared difference between predicted and actual target values.
- Cross entropy loss is more commonly used in classification problems, where it measures the divergence between the predicted probability distribution and the true distribution of target classes.

Selecting the appropriate loss function for your model can significantly impact its performance and its ability to generalize to unseen data.
2.2 OPTIMIZATION

- Learning with backpropagation
- Learning Parameters: Gradient Descent (GD), Stochastic and Mini-Batch GD, Momentum-Based GD, Nesterov Accelerated GD, AdaGrad, Adam, RMSProp
GRADIENT DESCENT VARIANTS

[Figures: behaviour of batch, stochastic, and mini-batch gradient descent over the first and subsequent epochs, and on large datasets]

- Which is faster (given the same number of epochs)?
- Which converges faster?
- Why does stochastic GD fluctuate?
STOCHASTIC GRADIENT DESCENT

- In SGD, only one training example is used to compute the gradient and update the parameters at each iteration. This can be faster than batch gradient descent but leads to more noise in the updates.
SGD

Advantages:
1. Frequent updates of the model parameters, hence it converges in less time.
2. Requires less memory, as it processes one example at a time.
3. The noisy updates may help it discover new minima.

Disadvantages:
1. High variance in the model parameters.
2. May keep overshooting even after reaching the region of the global minimum.
3. To get the same convergence as gradient descent, the learning rate must be slowly reduced.
BATCH GRADIENT DESCENT

- To update model parameters such as weights and biases, the entire training dataset is used to compute the gradient at each iteration.
- This can be slow for large datasets but may lead to a more accurate model.
- It is effective for convex or relatively smooth error manifolds, because it moves directly toward an optimal solution by taking a large step in the direction of the negative gradient of the cost function.
- Because it computes the gradient over the entire training dataset at every iteration, it can be slow on large datasets, resulting in longer training times and higher computational costs.

Batch gradient descent is also known as vanilla gradient descent.
ADVANTAGES

1. Fewer model updates mean this variant is more computationally efficient per epoch than stochastic gradient descent.
2. The reduced update frequency produces a more stable error gradient and, for some problems, more stable convergence.
3. Separating the prediction-error calculations from the model updates lends the algorithm to parallel implementations.
DISADVANTAGES

1. Updating only at the end of each training epoch adds the complexity of accumulating prediction errors across all training examples.
2. Batch gradient descent typically requires the entire training dataset to be available in memory to the algorithm.
3. Large datasets can result in very slow model updates and training speeds.
4. It is slow and requires more computational power.
5. A more stable error gradient can cause the model to converge prematurely to a suboptimal set of parameters.
MINI-BATCH GRADIENT DESCENT

- In mini-batch gradient descent, a small batch of training examples is used to compute the gradient and update the parameters at each iteration.
- This is a good compromise between batch gradient descent and stochastic gradient descent: it can be faster than batch gradient descent and less noisy than stochastic gradient descent.
ADVANTAGES

1. The model is updated more frequently than with batch gradient descent, allowing for more robust convergence and helping avoid local minima.
2. Batched updates are computationally more efficient than pure stochastic gradient descent.
3. Batch processing is memory-efficient, since the algorithm does not need all the training data in memory at once.
DISADVANTAGES

1. Mini-batching introduces an additional hyperparameter, the mini-batch size, that must be set for the learning algorithm.
2. Error information must be accumulated over each mini-batch of training samples, as in batch gradient descent.
3. The resulting error curve can be noisier and more complex than with batch gradient descent.
MINI-BATCH GD

1. Considered the best of the gradient-descent-based algorithms.
2. Updates the model parameters frequently while also having lower variance than SGD.
3. Requires a medium amount of memory.

The three variants are contrasted in the sketch below.
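A minimal NumPy sketch contrasting the three variants on a toy linear-regression problem (all data and hyperparameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # toy design matrix
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy targets

def mse_gradient(Xb, yb, w):
    # Gradient of the MSE loss for linear regression on the batch (Xb, yb).
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

lr, epochs = 0.05, 50

# Batch GD: one update per epoch, using the full dataset.
w_batch = np.zeros(3)
for _ in range(epochs):
    w_batch -= lr * mse_gradient(X, y, w_batch)

# Stochastic GD: one (noisy) update per training example.
w_sgd = np.zeros(3)
for _ in range(epochs):
    for i in rng.permutation(len(y)):
        w_sgd -= lr * mse_gradient(X[i:i+1], y[i:i+1], w_sgd)

# Mini-batch GD: one update per small batch (here, 16 examples).
w_mini, batch_size = np.zeros(3), 16
for _ in range(epochs):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        w_mini -= lr * mse_gradient(X[b], y[b], w_mini)

print(w_batch, w_sgd, w_mini)   # all three should land near true_w
```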
MOMENTUM-BASED GRADIENT DESCENT

- Momentum is a variant of gradient descent that incorporates information from previous weight updates to help the algorithm converge more quickly to the optimal solution.
- Momentum adds a term to the weight update that is proportional to a running average of the past gradients, allowing the algorithm to move more quickly toward the optimal solution.
- The updates to the parameters are based on the current gradient and the previous updates, as in the sketch below.
- This can help prevent the optimization process from getting stuck in local minima and reach the global minimum faster.
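A minimal sketch of the momentum update on a toy quadratic loss (the loss, beta, and learning rate are all illustrative):

```python
import numpy as np

def grad(w):
    # Gradient of a toy quadratic loss f(w) = 0.5 * w.T @ A @ w,
    # deliberately ill-conditioned so plain GD would oscillate.
    A = np.diag([1.0, 10.0])
    return A @ w

w = np.array([1.0, 1.0])
velocity = np.zeros(2)
lr, beta = 0.05, 0.9                       # beta is the momentum coefficient
for _ in range(100):
    velocity = beta * velocity + grad(w)   # running average of past gradients
    w = w - lr * velocity                  # step along the accumulated direction
print(w)                                   # approaches the minimum at [0, 0]
```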
MOMENTUM-BASED GRADIENT DESCENT

Advantages:
1. Reduces the oscillations and high variance of the parameter updates.
2. Converges faster than plain gradient descent.

Disadvantages:
1. One more hyperparameter is added, which needs to be selected manually and accurately.
2. If the momentum is too high, the algorithm may overshoot minima and keep moving past them.
NESTEROV ACCELERATED GRADIENT

- NAG is an extension of momentum gradient descent.
- It evaluates the gradient at a hypothetical position ahead of the current position, based on the current momentum vector, instead of evaluating the gradient at the current position (see the sketch below).
- This can result in faster convergence and better performance.
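A minimal sketch of the Nesterov update on the same style of toy quadratic loss (all values illustrative); note the gradient is taken at the lookahead point:

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss f(w) = 0.5 * w.T @ A @ w.
    return np.diag([1.0, 10.0]) @ w

w = np.array([1.0, 1.0])
velocity = np.zeros(2)
lr, beta = 0.05, 0.9
for _ in range(100):
    lookahead = w - lr * beta * velocity          # peek ahead along the momentum
    velocity = beta * velocity + grad(lookahead)  # gradient at the lookahead point
    w = w - lr * velocity
print(w)                                          # approaches [0, 0]
```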
NESTEROV ACCELERATED GRADIENT

Advantages:
1. Less likely to overshoot minima.
2. Slows down as it approaches a minimum.

Disadvantages:
1. The momentum hyperparameter still needs to be selected manually.
ADAPTIVE GRADIENT (ADAGRAD)

- A limitation of gradient descent is that it uses the same step size (learning rate) for every input variable. This can be a problem on objective functions that have different amounts of curvature in different dimensions, and that may therefore require differently sized steps.
- Adaptive Gradient (AdaGrad) is an extension of gradient descent that allows the step size in each dimension to be automatically adapted, based on the gradients (partial derivatives) seen for that variable over the course of the search.
- The learning rate is adaptively adjusted for each parameter based on the historical gradient information, as in the sketch below.
- This allows larger updates for infrequently updated parameters and smaller updates for frequently updated ones.
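A minimal sketch of the AdaGrad update on a toy quadratic loss (values illustrative):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss f(w) = 0.5 * w.T @ A @ w.
    return np.diag([1.0, 10.0]) @ w

w = np.array([1.0, 1.0])
grad_sq_sum = np.zeros(2)        # accumulated squared gradients, per parameter
lr, eps = 0.5, 1e-8
for _ in range(200):
    g = grad(w)
    grad_sq_sum += g ** 2                          # the history never decays
    w = w - lr * g / (np.sqrt(grad_sq_sum) + eps)  # per-parameter step size
print(w)   # each coordinate shrinks toward 0 at its own, ever-decreasing rate
```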
ADAPTIVE GRADIENT (ADAGRAD)

Advantages:
1. The learning rate changes adaptively for each trainable parameter.
2. No need to manually tune the learning rate.
3. Able to train effectively on sparse data.

Disadvantages:
1. Computationally more expensive, since a running sum of squared gradients must be maintained and applied for every parameter.
2. The learning rate is always decreasing, which eventually makes training slow.
ROOT MEAN SQUARED PROPAGATION (RMSPROP)

- Root Mean Squared Propagation (RMSProp) is an extension of gradient descent, and of the AdaGrad variant, that uses a decaying average of partial gradients to adapt the step size for each parameter.
- The decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients seen during the search, overcoming AdaGrad's ever-shrinking learning rate.
- The learning rate is adaptively adjusted for each parameter based on a moving average of the squared gradient, as in the sketch below.
- This helps the algorithm converge faster in the presence of noisy gradients.
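A minimal sketch of the RMSProp update on a toy quadratic loss (values illustrative):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss f(w) = 0.5 * w.T @ A @ w.
    return np.diag([1.0, 10.0]) @ w

w = np.array([1.0, 1.0])
avg_sq = np.zeros(2)             # decaying average of squared gradients
lr, rho, eps = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(w)
    avg_sq = rho * avg_sq + (1 - rho) * g ** 2   # old gradients are forgotten
    w = w - lr * g / (np.sqrt(avg_sq) + eps)
print(w)   # settles near the minimum, without AdaGrad's vanishing step size
```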
ADAM

- Adam stands for Adaptive Moment Estimation. It combines the benefits of momentum-based gradient descent, AdaGrad, and RMSProp: the learning rate is adaptively adjusted for each parameter based on moving averages of the gradient and of the squared gradient, which allows faster convergence and better performance on non-convex optimization problems.
- It keeps track of two exponentially decaying averages: the first-moment estimate, an exponentially decaying average of past gradients, and the second-moment estimate, an exponentially decaying average of past squared gradients.
- The first-moment estimate provides the momentum, and the second-moment estimate scales the learning rate for each parameter (sketched below).
- Adam is one of the most popular optimization algorithms for deep learning.
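A minimal sketch of the Adam update on a toy quadratic loss (values illustrative):

```python
import numpy as np

def grad(w):
    # Gradient of the toy quadratic loss f(w) = 0.5 * w.T @ A @ w.
    return np.diag([1.0, 10.0]) @ w

w = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)              # first and second moment estimates
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction (m, v start at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)                                     # settles near the minimum at [0, 0]
```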
ADVANTAGES OF GRADIENT DESCENT

1. Widely used: Gradient descent and its variants are widely used in machine learning and optimization because they are effective and easy to implement.
2. Convergence: Gradient descent and its variants can converge to a global minimum or a good local minimum of the cost function, depending on the problem and the variant used.
3. Scalability: Many variants of gradient descent can be parallelized and scale to large datasets and high-dimensional models.
4. Flexibility: Different variants offer a range of trade-offs between accuracy and speed, and can be adjusted to optimize performance on a specific problem.
DISADVANTAGES OF GD

- Choice of learning rate: The choice of learning rate is crucial for the convergence of gradient descent and its variants. A learning rate that is too large can lead to oscillations or overshooting, while one that is too small can lead to slow convergence or getting stuck in local minima.
- Sensitivity to initialization: Gradient descent and its variants can be sensitive to the initialization of the model's parameters, which can affect convergence and the quality of the solution.
- Time-consuming: Gradient descent and its variants can be time-consuming, especially on large datasets and high-dimensional models. Convergence speed also varies with the variant used and the specific problem.
- Local optima: In non-convex problems, gradient descent and its variants can converge to a local minimum instead of the global minimum of the cost function, which can affect the quality of the solution. Techniques like random initialization and multiple restarts can mitigate this issue.
2.3 REGULARIZATION

- Overview of Overfitting
- Types of biases
- Bias-Variance Tradeoff
- Regularization Methods: L1, L2 regularization
- Parameter sharing, Dropout, Weight Decay
- Batch normalization
- Early stopping
- Data Augmentation
- Adding noise to input and output
MACHINE LEARNING ERRORS
BIAS

- Bias is the inability of the model to capture the true relationship, because of which there is some difference or error between the model's predicted value and the actual value.
- These differences between the actual or expected values and the predicted values are known as bias error, or error due to bias.
- Bias is a systematic error that occurs due to wrong training of the network.
- When this error rate is high we speak of high bias; when it is low, of low bias.
VARIANCE

- Variance is the measure of spread in data from its mean position.
- In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data.
- Variance thus refers to how much the model changes when using different portions of the training data set.
UNDERSTANDING BIAS AND VARIANCE
OVERFITTING & UNDERFITTING

[Figure: three fits compared: (1) High Bias, High Variance; (2) Low Bias, Low Variance; (3) Low Bias, High Variance]
TYPES OF BIAS

In general, a bias is either implicitly (unconsciously) or explicitly (consciously) added to the learner model.
SAMPLE BIAS

- Sample bias occurs when the data collected is not representative of the environment in which the program is expected to run.
- No algorithm can be trained on all the data in the universe; rather, it is trained on a carefully chosen subset.
EXCLUSION BIAS

- Exclusion bias occurs when some features are excluded from the dataset, usually during data wrangling.
- When there is a very large amount of data, say petabytes, choosing a small sample for training is the best option; but in doing so, features might be accidentally excluded from the sample, resulting in a biased sample.
- Exclusion bias can also arise from removing duplicates from the sample.
EXPERIMENTER OR OBSERVER BIAS

- Experimenter or observer bias occurs while gathering data: the experimenter or observer might record only certain instances of data and skip others. The skipped portion could have been beneficial for the learner, but instead the learner learns from instances that are biased with respect to the environment. Thus a biased learner is built.
MEASUREMENT BIAS

- Measurement bias is the result of not recording the data accurately.
- For example, suppose an insurance company samples the weight of customers for health insurance and the weighing machine is faulty, but the data is recorded without the fault being noticed.
- The result is that the learner would classify customers into the wrong categories.
PREJUDICE BIAS

- Prejudice bias is the result of human cultural differences and stereotyping.
- When prejudiced data is fed to the learner, it reproduces the same stereotyping that exists in real life.
ALGORITHM BIAS

- Algorithm bias refers to parameters of an algorithm that cause it to create unfair or subjective outcomes, unfairly favouring someone or something over another person or thing. It can exist because of the design of the algorithm.
- For example, suppose an algorithm decides whether to approve credit card applications, and the data fed to it includes the gender of the applicant.
- On this basis, the algorithm might conclude that women earn less than men, and therefore women's applications would be rejected.
REGULARIZATION

- Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the accuracy of a deep learning model when facing completely new data from the problem domain.
- Regularization adds information to a model to prevent overfitting. It works like a constrained form of regression that shrinks the coefficient estimates toward zero, reducing the capacity (size) of the model. In this context, reducing the capacity of a model involves the removal of extra weights.
REGULARIZATION

- Regularization prevents overfitting by including a penalty term in the model's loss function.
- Regularization has two objectives:
  - to lessen the model's complexity, and
  - to improve its ability to generalize to new inputs.
TYPES OF REGULARIZATION IN DL

- L1
- L2
- Parameter sharing
- Dropout
- Weight decay
- Early stopping
REGULARIZATION IN DL

- L1 and L2 are the most common types of regularization. They update the general cost function by adding another term, known as the regularization term:

  Cost function = Loss (say, binary cross entropy) + Regularization term

- Due to this regularization term, the values in the weight matrices decrease, because the method assumes that a neural network with smaller weight matrices leads to a simpler model. It therefore reduces overfitting to quite an extent.
- In deep learning, the penalty is applied to the weight matrices of the nodes.
OVERFITTING REVISITED: REGULARIZATION

- A regularizer is an additional criterion added to the loss function to make sure the model doesn't overfit.
- It is called a regularizer because it tries to keep the parameters more normal/regular.
- It is a bias on the model that forces the learning to prefer certain types of weights over others:

\arg\min_{w,b} \sum_{i=1}^{n} \text{loss}(y_i, y_i') + \lambda \, \text{regularizer}(w, b)
L1 REGULARIZATION

- L1 regularization, also known as Lasso regularization, is a machine-learning strategy that inhibits overfitting by introducing into the model's loss function a penalty term based on the absolute values of the model's parameters.
- L1 regularization seeks to shrink some model parameters toward zero, in order to lower the number of non-zero parameters in the model (a sparse model).
- L1 regularization is particularly useful when working with high-dimensional data, since it enables one to choose a subset of the most important attributes.
- This lessens the risk of overfitting and also makes the model easier to understand.
- The size of the penalty term is controlled by the hyperparameter lambda, which regulates the regularization strength. As lambda rises, more parameters are driven to zero, increasing regularization.
L1 REGULARIZATION

- In L1, we have (in the usual form):

  Cost function = Loss + \frac{\lambda}{2m} \sum \|w\|_1

- Here we penalize the absolute values of the weights. Unlike L2, the weights may be reduced exactly to zero, which makes L1 very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
L2 REGULARIZATION

- L2 regularization, also known as Ridge regularization, is a machine learning technique that avoids overfitting by introducing into the model's loss function a penalty term based on the squares of the model's parameters.
- To achieve L2 regularization, a term proportional to the squares of the model's parameters is added to the loss function.
- It acts as a limiter on the parameters' size, preventing them from growing out of control.
- The size of the penalty term is again controlled by the hyperparameter lambda, which sets the regularization's intensity.
L2 REGULARIZATION

- In L2, we have (in the usual form):

  Cost function = Loss + \frac{\lambda}{2m} \sum \|w\|^2

- Here, lambda is the regularization parameter: the hyperparameter whose value is optimized for better results. L2 regularization is a type of weight decay, as it forces the weights to decay towards zero (but not exactly zero).
LAMBDA HYPERPARAMETER AND THE LOSS FUNCTION

- Squared weights (L2) penalize large values more.
- A sum of absolute weights (L1) penalizes small values relatively more.
L1 & L2 REGULARIZATION

The main intuitive difference between L1 and L2 regularization is that L1 regularization tries to estimate the median of the data, while L2 regularization tries to estimate the mean of the data, to avoid overfitting.
L2 REGULARIZATION

- In the code sketch below, the value 0.01 is the regularization parameter, i.e., lambda, which we would need to optimize further.
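A minimal Keras sketch (assuming TensorFlow's Keras API is available; the layer sizes and input shape are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# regularizers.l2(0.01) adds 0.01 * sum(w**2) to the training loss;
# swap in regularizers.l1(0.01) for an L1 (lasso) penalty instead.
model = keras.Sequential([
    keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```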
L1 & L2

- L1 is robust to outliers; L2 is much more sensitive to outliers.
PARAMETER SHARING

Parameter sharing (weight replication):

- Parameter sharing is the method of sharing weights across all neurons in a particular feature map.
- Parameter sharing reduces training time, since it directly reduces the number of weight updates during backpropagation.
- It therefore helps reduce the number of parameters in the whole system, making it computationally cheap.
- Parameter sharing is used in all convolution layers in the network.
PARAMETER SHARING

- What are the benefits of parameter sharing in CNNs? Convolutional Neural Networks use two related techniques, parameter sharing and parameter tying. Parameter sharing, the sharing of weights by all neurons in a particular feature map, helps reduce the number of parameters in the whole system, making it computationally cheap.
PARAMETER SHARING

- Why is sharing parameters a good idea? Parameter sharing is used in all convolution layers in the network, and it reduces training time by directly reducing the number of weight updates during backpropagation.
MULTI-TASK LEARNING (MTL)

- Multi-task learning is a machine learning approach in which we try to learn multiple tasks simultaneously, optimizing multiple loss functions at once. Rather than training independent models for each task, we allow a single model to learn to complete all of the tasks at once. In this process, the model uses all of the available data across the different tasks to learn generalized representations of the data that are useful in multiple contexts.
- Generally, multi-task learning should be used when the tasks have some level of correlation. In other words, multi-task learning improves performance when there are underlying principles or information shared between tasks.

Example: two tasks involving classifying images of animals are likely to be correlated, as both will involve learning to detect fur patterns and colors.
HARD PARAMETER SHARING

- Perhaps the most widely used approach for MTL with neural networks is hard parameter sharing, in which we learn a common representation space for all tasks (i.e., we completely share weights/parameters between tasks).
- This shared feature space is used to model the different tasks, usually with additional task-specific layers that are learned independently for each task.
- Hard parameter sharing acts as regularization and reduces the risk of overfitting, as the model learns a representation that will (hopefully) generalize well for all tasks.
SOFT PARAMETER SHARING

- Instead of sharing exactly the same parameter values, in soft parameter sharing we add a constraint to encourage similarity among related parameters.
- More specifically, we learn a model for each task and penalize the distance between the different models' parameters.
- Unlike hard sharing, this approach gives the tasks more flexibility by only loosely coupling the shared-space representations.
PARAMETER SHARING: TYPES

[Figure: hard parameter sharing vs. soft parameter sharing architectures]
DROPOUT

- Dropout is a regularization technique that prevents overfitting of the network.
- As the name suggests, during training a certain number of neurons in the hidden layers are randomly dropped.
- This means that training happens on several architectures of the neural network, over different combinations of the neurons.
- You can think of dropout as an ensemble technique, where the outputs of multiple networks are combined to produce the final output (a minimal sketch follows).
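A minimal NumPy sketch of inverted dropout, the scaling convention most frameworks use (the activations are illustrative):

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=np.random.default_rng()):
    # Inverted dropout: randomly zero a fraction p_drop of the units and
    # scale the survivors by 1/(1 - p_drop) so the expected activation is
    # unchanged; at test time the layer is simply the identity.
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones((2, 8))               # hypothetical hidden-layer activations
print(dropout(h, p_drop=0.5))     # roughly half the units zeroed, the rest scaled to 2.0
```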
WEIGHT DECAY

- Weight decay is a popular technique in machine learning that helps improve the accuracy of predictions.
- It helps in building machine learning models with higher generalization performance.
- Weight decay is a regularization technique that regularizes the size of the weights of certain parameters in machine learning models.
- It is the most widely used regularization technique for parametric machine learning models.
WEIGHT DECAY

- Weight decay is a form of regularization that penalizes large weights in the network.
- It does this by adding a term to the loss function that is proportional to the sum of the squared weights.
- This term reduces the magnitude of the weights and prevents them from growing too large; as the sketch below shows, it amounts to shrinking the weights at every step.
- L2 regularization is a form of weight decay.
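A minimal NumPy sketch of the update rule; the gradient values here are made up:

```python
import numpy as np

# Minimizing loss + (lam/2) * sum(w**2) adds lam * w to the gradient,
# so each step shrinks the weights by a factor (1 - lr * lam).
lr, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])
grad_loss = np.array([0.2, -0.1, 0.0])         # hypothetical data-loss gradient

w_explicit = w - lr * (grad_loss + lam * w)    # penalty folded into the gradient
w_decay = (1 - lr * lam) * w - lr * grad_loss  # equivalent "decay then step" form
print(w_explicit, w_decay)                     # identical results
```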
EARLY STOPPING

- Early stopping is a regularization technique for deep neural networks that stops training when parameter updates no longer yield improvement on a validation set.
- In essence, we store and update the current best parameters during training, and when parameter updates no longer yield an improvement (after a set number of iterations) we stop training and use the last best parameters.
- It works as a regularizer by restricting the optimization procedure to a smaller volume of parameter space.
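A hedged sketch using Keras's built-in EarlyStopping callback (assuming TensorFlow's Keras API; the monitored quantity and patience are illustrative):

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,                  # epochs tolerated without improvement
    restore_best_weights=True,   # roll back to the best parameters seen
)
# Hypothetical usage, assuming a compiled `model` and training data exist:
# model.fit(x_train, y_train, validation_split=0.2,
#           epochs=100, callbacks=[early_stop])
```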
BATCH NORMALIZATION

- Batch normalization (also known as batch norm) is a method used to make the training of artificial neural networks faster and more stable by normalizing the layers' inputs, re-centering and re-scaling them. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.
- Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers.
- Batch norm is simply another network layer inserted between one hidden layer and the next. Its job is to take the outputs of the first hidden layer and normalize them before passing them on as the input of the next hidden layer.
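A minimal NumPy sketch of the training-time computation (gamma and beta would normally be learned; here they are fixed for illustration):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Re-center and re-scale each feature over the batch, then apply the
    # learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 5.0 + 3.0      # a batch with shifted statistics
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1
```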
DATA AUGMENTATION

- Data augmentation is a technique for artificially enlarging the training set by creating modified copies of a dataset using existing data.
- It includes making minor changes to the dataset or using deep learning to generate new data points.
- In other words, data augmentation is a set of techniques that artificially increase the amount of data by generating new data points from existing data, either by making small changes to the data or by using deep learning models to generate new points (sketched below).
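A hedged Keras sketch using preprocessing layers (assuming TensorFlow's Keras API; the transformation strengths are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Each pass through these layers yields a randomly flipped/rotated/zoomed
# copy of the input batch, artificially enlarging the training set.
augment = keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),   # up to +/- 10% of a full turn
    layers.RandomZoom(0.2),
])
# Hypothetical usage, assuming `images` is a batch of image tensors:
# augmented_images = augment(images, training=True)
```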
ADDING NOISE TO INPUT AND OUTPUT

- Why does it prevent overfitting? Noise destroys information: your data becomes harder to fit, and thus harder to over-fit. The extreme case is pure noise, where the classifier learns to ignore the input and predict a fixed probability for each class. That is the opposite of overfitting: on your validation set you will reach exactly the same performance as during training.
NOISE

- Why does this help with generalization? By adding noise, you augment the training set with additional information: you tell your NN that the kind of noise you are adding should not change its prediction much. If this is true, the network will generalize better, because it has learned about a larger part of the input space. If it is false, it can actually make generalization worse, for example if you are learning the XOR function from a 10-bit input.
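A minimal NumPy sketch of input-noise injection (sigma and the batch are illustrative):

```python
import numpy as np

def add_input_noise(x, sigma=0.1, rng=np.random.default_rng()):
    # Jitter each training input with zero-mean Gaussian noise: the model is
    # being told that perturbations of this size should not change its output.
    return x + rng.normal(0.0, sigma, size=x.shape)

x_batch = np.random.randn(16, 8)                 # hypothetical input batch
x_noisy = add_input_noise(x_batch, sigma=0.1)    # a fresh noisy copy per epoch
print(np.abs(x_noisy - x_batch).mean())          # average jitter, roughly 0.08
```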
THANK YOU
