DL Unit2 HD
UNIT - II
By,
Dr. Himani Deshpande
UNIT – II
TRAINING, OPTIMIZATION AND REGULARIZATION OF DEEP NEURAL NETWORK
Optimization
¡ Learning with backpropagation, Learning Parameters: Gradient Descent (GD), Stochastic and Mini
Batch GD, Momentum Based GD , Nesterov Accelerated GD, AdaGrad, Adam, RMSProp
Regularization
¡ Overview of Overfitting, Types of biases, Bias Variance Tradeoff, Regularization Methods: L1, L2
regularization, Parameter sharing, Dropout, Weight Decay, Batch normalization, Early stopping, Data
Augmentation, Adding noise to input and output
2.1 TRAINING FEEDFORWARD DNN
¡ A feedforward network is a type of neural network in which signals flow in one
direction only, from the input units to the output units. It has no loops and no
feedback connections, so no signal travels backward from the output layer to
the hidden or input layers.
APPLICATION OF MULTILAYER FEED-FORWARD NEURAL
NETWORK
1. Medical field
2. Speech regeneration
3. Data processing and compression
4. Image processing
LEARNING FACTORS
¡ Optimization algorithms, such as gradient descent and its variants (e.g., stochastic
gradient descent, Adam, RMSprop), determine how the network's weights are
updated during training.
¡ These algorithms use the gradients of the loss function with respect to the network
parameters to iteratively update the weights in the direction of decreasing error.
REGULARIZATION TECHNIQUES
¡ In the process of building a neural network, one of the choices you get to make is
what Activation Function to use in the hidden layer as well as at the output layer of
the network.
ACTIVATION FUNCTIONS
RELU FUNCTION
¡ The rectified linear unit (ReLU) function is one of the most popular AFs in DL models
and tends to train quickly. Compared with AFs such as the sigmoid and tanh functions,
ReLU often gives better performance and generalization in deep networks. The
function is piecewise linear (zero for negative inputs, identity for positive inputs),
so it retains many properties of linear models, which makes it easy to optimize
with gradient-descent methods.
¡ The exponential linear unit (ELU) function is an AF that, like the ReLU function,
is used to speed up the training of neural networks. Its biggest advantage is
that it can alleviate the vanishing gradient problem by using the identity for
positive values, which improves the learning characteristics of the model.
¡ The hyperbolic tangent function, a.k.a. the tanh function, is another type of AF.
It is a smooth, zero-centered function with a range between -1 and 1.
¡ In an ANN, the sigmoid function is a non-linear AF used primarily in feedforward neural networks. It is a
smooth, differentiable real function, defined for all real input values, with a positive derivative
everywhere. The sigmoid function typically appears in the output layer of deep learning models,
where it is used for predicting probability-based outputs.
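The four activation functions described above can be sketched in a few lines. This is a minimal NumPy sketch, not a library implementation; the function names and the ELU parameter alpha=1.0 are assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # Identity for positive inputs, alpha * (exp(x) - 1) otherwise.
    # alpha=1.0 is an assumed default, not taken from the slides.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def tanh(x):
    # Zero-centered, output strictly between -1 and 1.
    return np.tanh(x)

def sigmoid(x):
    # Output strictly between 0 and 1, usable as a probability.
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
for name, fn in [("relu", relu), ("elu", elu), ("tanh", tanh), ("sigmoid", sigmoid)]:
    print(name, fn(x))
```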
SOFTMAX FUNCTION
Sigmoid is used for binary classification, where there are only 2 classes,
while SoftMax applies to multiclass problems. In fact, the SoftMax function is a
generalization of the Sigmoid function to more than two classes.
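One way to see the relationship between the two functions is to check it numerically. The sketch below assumes a two-class SoftMax whose second logit is fixed at 0; the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = 1.7  # arbitrary logit chosen for the demonstration
two_class = softmax(np.array([z, 0.0]))

# With logits [z, 0], the first SoftMax output equals sigmoid(z).
print(two_class[0], sigmoid(z))
```

SoftMax outputs always sum to 1, so with two classes the second probability is simply one minus the first.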
SOFTMAX VS SIGMOID
SIGMOID ACTIVATION FUNCTION APPLICATIONS
q When we build a network, the network tries to predict an output as close as
possible to the actual value.
q We measure how far the prediction is from the actual value using the cost/loss function.
q The cost or loss function penalizes the network when it makes errors.
¡ Loss function:
Used when we refer to the error for a single training example.
¡ Cost function:
Used to refer to an average of the loss functions over an entire training dataset.
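The distinction can be illustrated with squared error. The choice of squared error and the sample values below are assumptions for illustration only.

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    # Loss: error for each individual training example.
    return (y_true - y_pred) ** 2

def cost(y_true, y_pred):
    # Cost: average of the per-example losses over the dataset.
    return np.mean(squared_error_loss(y_true, y_pred))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

per_example = squared_error_loss(y_true, y_pred)
total_cost = cost(y_true, y_pred)
print(per_example, total_cost)
```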
LOSS FUNCTION
CROSS ENTROPY
q Cross entropy loss, also known as log loss, is a widely-used loss function in
machine learning, particularly for classification problems.
q It quantifies the difference between the predicted probability distribution and
the actual or true distribution of the target classes.
q Cross entropy loss is often used when training models that output probability
estimates, such as logistic regression and neural networks.
Multi-class: categorical cross entropy
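A minimal sketch of binary and categorical cross entropy follows. It assumes one-hot targets and predicted probabilities that already sum to 1; the clipping constant eps is an assumption added to avoid log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    # y_true holds 0/1 labels, p the predicted probability of class 1.
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    # Each row of probs is a predicted distribution over the classes.
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

y_onehot = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])

cce = categorical_cross_entropy(y_onehot, probs)
bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2]))
print(cce, bce)
```

Only the probability assigned to the true class contributes to the categorical loss, so confident wrong predictions are penalized heavily.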
SELECTING LOSS FUNCTION
Selecting the appropriate loss function for your model can significantly
impact its performance and ability to generalize to unseen data.
¡ Learning with backpropagation,
¡ In SGD, only one training example is used to compute the gradient and update
the parameters at each iteration. This can be faster than batch gradient descent
but may lead to more noise in the updates.
SGD
¡ Advantages:
1. Model parameters are updated frequently, so it often converges in less time.
2. Requires less memory, since loss values for the whole dataset need not be stored.
3. The noisy updates may help it find new minima.
¡ Disadvantages:
1. High variance in model parameters.
2. May overshoot even after reaching the global minimum.
3. To achieve the same convergence as gradient descent, the learning rate must be reduced slowly.
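A minimal sketch of SGD on a toy problem, where each update uses a single training example. The linear model, the noiseless synthetic data (y = 2x + 1), the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0  # synthetic, noiseless targets

w, b = 0.0, 0.0
lr = 0.1
for epoch in range(50):
    # Visit the examples in a fresh random order every epoch.
    for i in rng.permutation(len(X)):
        err = (w * X[i] + b) - y[i]
        # Gradient of 0.5 * err**2 for this single example.
        w -= lr * err * X[i]
        b -= lr * err

print(w, b)
```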
BATCH GRADIENT DESCENT
¡ The entire training dataset is used to compute the gradient and update the
model parameters (weights and biases) at each iteration.
¡ It is effective for convex or relatively smooth error manifolds because it moves
directly toward an optimal solution by stepping in the direction of the
negative gradient of the cost function.
¡ It can be slow for large datasets because every update requires a pass over
the entire training dataset, which means longer training times and higher
computational costs. In return, the gradient is computed exactly rather than
estimated from a sample.
Also known as vanilla gradient descent
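For comparison, a minimal sketch of batch gradient descent, where every update averages the gradient over the whole dataset. The linear model, synthetic data, and hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0  # synthetic, noiseless targets

w, b = 0.0, 0.0
lr = 0.5
for step in range(500):
    err = (w * X + b) - y          # errors for ALL examples
    grad_w = np.mean(err * X)      # gradient of the mean of 0.5 * err**2
    grad_b = np.mean(err)
    w -= lr * grad_w               # one update per full pass
    b -= lr * grad_b

print(w, b)
```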
ADVANTAGES
1. Fewer model updates mean that this variant of gradient descent is
more computationally efficient than stochastic gradient descent.
2. The lower update frequency yields a more stable error gradient and, for
some problems, a more stable convergence.
3. Separating the calculation of prediction errors from the model update lends
itself to parallel-processing implementations.
ADVANTAGES OF MINI-BATCH GRADIENT DESCENT
1. The model is updated more frequently than in batch gradient descent,
which allows more robust convergence and helps avoid local minima.
2. Batched updates are computationally more efficient than
stochastic gradient descent.
3. Batching means the whole training set need not be held in memory at
once, and it suits efficient algorithm implementations.
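A minimal sketch of mini-batch gradient descent, which sits between the two extremes by averaging the gradient over a small batch. The batch size of 10, the learning rate, and the synthetic data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0  # synthetic, noiseless targets

w, b = 0.0, 0.0
lr, batch_size = 0.3, 10
for epoch in range(100):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # indices of one mini-batch
        err = (w * X[idx] + b) - y[idx]
        w -= lr * np.mean(err * X[idx])
        b -= lr * np.mean(err)

print(w, b)
```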
DISADVANTAGES
¡ Advantages of momentum-based GD:
1. Reduces the oscillations and high variance of the parameters.
2. Converges faster than gradient descent.
¡ Disadvantages:
1. One more hyper-parameter is added, which needs to be selected manually and accurately.
2. If the momentum is too high, the algorithm may overshoot a local minimum and continue
up the opposite slope.
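A minimal sketch of the momentum update on a toy quadratic. The objective, the starting point, the learning rate, and the momentum coefficient beta=0.9 are assumptions; the velocity form shown here is one common way to write the update.

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2),
    # a bowl that is much steeper along the second axis.
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
velocity = np.zeros_like(theta)
lr, beta = 0.05, 0.9  # beta is the extra momentum hyper-parameter
for step in range(300):
    # The velocity accumulates a decaying sum of past gradients.
    velocity = beta * velocity - lr * grad(theta)
    theta = theta + velocity

print(theta)
```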
NESTEROV ACCELERATED GRADIENT
¡ Advantages:
1. Less likely to overshoot a local minimum.
2. Slows down as a minimum approaches.
¡ Disadvantages:
1. The momentum hyperparameter still needs to be selected manually.
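A minimal sketch of Nesterov accelerated gradient on a toy quadratic. It differs from plain momentum only in where the gradient is evaluated: at a look-ahead point rather than at the current parameters. The objective and hyperparameters are assumptions for illustration.

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
velocity = np.zeros_like(theta)
lr, beta = 0.05, 0.9
for step in range(300):
    # Look ahead to where the current velocity would carry the parameters.
    lookahead = theta + beta * velocity
    # Use the gradient at the look-ahead point to correct the velocity.
    velocity = beta * velocity - lr * grad(lookahead)
    theta = theta + velocity

print(theta)
```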
ADAPTIVE GRADIENT (ADAGRAD)
A limitation of gradient descent is that it uses the same step size (learning rate) for every
parameter. This can be a problem on objective functions that have different amounts of
curvature in different dimensions, which may in turn call for a different step size in each
dimension.
In this variant, the learning rate is adaptively adjusted for each parameter
based on the historical gradient information.
This allows for larger updates for infrequently updated parameters and smaller
updates for frequently updated ones.
¡ Advantages:
1. The learning rate adapts separately for each training parameter.
2. Less need to tune the learning rate manually.
3. Able to train on sparse data.
¡ Disadvantages:
1. Extra computation and memory, since a running sum of squared gradients must be kept for every parameter.
2. The effective learning rate only ever decreases, which can make training slow.
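A minimal sketch of the AdaGrad update on a toy quadratic. Each parameter's step is divided by the square root of its own accumulated squared gradients. The objective, the base learning rate, and eps are assumptions for illustration.

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
accum = np.zeros_like(theta)   # running sum of squared gradients
lr, eps = 1.0, 1e-8
for step in range(500):
    g = grad(theta)
    accum += g ** 2            # only grows, so the step size only shrinks
    theta -= lr * g / (np.sqrt(accum) + eps)

print(theta)
```

Because the scaling is per parameter, the steep and the shallow direction make similar progress here even though their gradients differ by a factor of 10.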
ROOT MEAN SQUARED PROPAGATION (RMSPROP)
¡ In this variant, the learning rate is adaptively adjusted for each parameter
based on the moving average of the squared gradient.
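A minimal sketch of the RMSProp update on a toy quadratic. It replaces AdaGrad's ever-growing sum with an exponentially decaying average, so the step size does not shrink toward zero. The objective, the decay rate 0.9, and the other hyperparameters are assumptions for illustration.

```python
import numpy as np

def grad(theta):
    # Gradient of f(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2).
    return np.array([1.0, 10.0]) * theta

theta = np.array([5.0, 5.0])
avg_sq = np.zeros_like(theta)  # moving average of squared gradients
lr, decay, eps = 0.01, 0.9, 1e-8
for step in range(2000):
    g = grad(theta)
    avg_sq = decay * avg_sq + (1.0 - decay) * g ** 2
    theta -= lr * g / (np.sqrt(avg_sq) + eps)

print(theta)
```

With a constant learning rate the iterates settle into a small neighborhood of the minimum rather than converging exactly, which is why the learning rate is often decayed in practice.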
ADVANTAGES OF GD
1. Widely used: Gradient descent and its variants are widely used in machine learning
and optimization problems because they are effective and easy to implement.
2. Convergence: Gradient descent and its variants can converge to a global minimum
or a good local minimum of the cost function, depending on the problem and the
variant used.
3. Scalability: Many variants of gradient descent can be parallelized and are scalable
to large datasets and high-dimensional models.
4. Flexibility: Different variants of gradient descent offer a range of trade-offs
between accuracy and speed, and can be tuned to suit a specific problem.
DISADVANTAGES OF GD
¡ Choice of learning rate: The choice of learning rate is crucial for the convergence of gradient descent and
its variants. Choosing a learning rate that is too large can lead to oscillations or overshooting while choosing
a learning rate that is too small can lead to slow convergence or getting stuck in local minima.
¡ Sensitivity to initialization: Gradient descent and its variants can be sensitive to the initialization of the
model’s parameters, which can affect the convergence and the quality of the solution.
¡ Time-consuming: Gradient descent and its variants can be time-consuming, especially when dealing with
large datasets and high-dimensional models. The convergence speed can also vary depending on the variant
used and the specific problem.
¡ Local optima: Gradient descent and its variants can converge to a local minimum instead of the global
minimum of the cost function, especially in non-convex problems. This can affect the quality of the solution,
and techniques like random initialization and multiple restarts may be used to mitigate this issue.
2.3 REGULARIZATION
¡ Overview of Overfitting,
¡ Types of biases,
¡ Bias Variance Tradeoff
¡ Regularization Methods: L1, L2 regularization,
¡ Parameter sharing, Dropout, Weight Decay,
¡ Batch normalization,
¡ Early stopping,
¡ Data Augmentation,
¡ Adding noise to input and output
MACHINE LEARNING ERRORS
¡ Bias is the inability of the model to capture the true relationship in the data, which
produces a difference (error) between the model's predicted value and the actual value.
¡ These differences between actual or expected values and the predicted values are known as
error, bias error, or error due to bias.
¡ Bias is a systematic error that occurs due to wrong assumptions made while training the network.
¡ When the error rate is high, we call it High Bias, and when the error rate is low, we call it
Low Bias.
VARIANCE
OVER FITTING & UNDER FITTING
¡ Exclusion bias occurs when some features are excluded from the dataset, usually during
data wrangling.
¡ When there is a very large amount of data, say petabytes, training on a small sample
is often the practical option, but features might be accidentally
excluded from the sample in the process, resulting in a biased sample.
¡ Exclusion bias can also arise from removing duplicates from the sample.
EXPERIMENTER OR OBSERVER BIAS
¡ Experimenter or observer bias occurs while gathering data. The experimenter or
observer might record only certain instances and skip others. The skipped
instances could have been useful to the learner, which instead learns only from
instances that are biased toward the environment. Thus a biased learner is built.
MEASUREMENT BIAS
¡ Algorithm bias refers to properties of an algorithm that cause it to produce
unfair or subjective outcomes.
¡ When this happens, it unfairly favours someone or something over another person or thing.
It can exist because of the design of the algorithm.
¡ For example, suppose an algorithm decides whether to approve credit card applications and
the data fed to it includes the gender of the applicant.
¡ On this basis, the algorithm might infer that women earn less than men and
therefore reject women's applications.
OVER FITTING ISSUE
REGULARIZATION
¡ L1
¡ L2
¡ Parameter sharing
¡ Dropout
¡ Weight decay
¡ Early stopping
REGULARIZATION IN DL
¡ L1 and L2 are the most common types of regularization. They update the general cost
function by adding another term, known as the regularization term.
¡ Because of this regularization term, the values of the weight matrices
decrease. The underlying assumption is that a neural network with smaller weight
matrices is a simpler model.
¡ Therefore, it also reduces overfitting to quite an extent.
¡ A regularizer is an additional criterion added to the loss function to make sure that the
model does not overfit.
¡ It is called a regularizer because it tries to keep the parameters more normal/regular.
¡ It is a bias on the model that forces the learning to prefer certain types of weights over
others.
$$\operatorname*{argmin}_{w,b} \; \sum_{i=1}^{n} \text{loss}(yy') + \lambda \, \text{regularizer}(w, b)$$
L1 REGULARIZATION
¡ L1 regularization, also known as Lasso regularization, is a machine-learning strategy
that inhibits overfitting by introducing a penalty term into the model's loss function
based on the absolute values of the model's parameters.
¡ L1 regularization seeks to reduce some model parameters toward zero in order to
lower the number of non-zero parameters in the model (sparse model).
¡ L1 regularization is particularly useful when working with high-dimensional data
since it enables one to choose a subset of the most important attributes.
¡ This lessens the risk of overfitting and also makes the model easier to understand.
¡ The strength of L1 regularization is controlled by the hyperparameter lambda, which
sets the size of the penalty term. As lambda rises, more parameters are driven
to zero, which strengthens the regularization.
¡ In L1, we have:
¡ In this, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to zero
here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer
L2 over it.
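A minimal sketch of an L1 penalty and its (sub)gradient. It assumes the common form lambda times the sum of absolute weights, without any additional scaling factor; the weight values and lambda are illustrative.

```python
import numpy as np

def l1_penalty(w, lam):
    # lambda times the sum of the absolute values of the weights.
    return lam * np.sum(np.abs(w))

def l1_subgradient(w, lam):
    # The pull toward zero has the same size for every non-zero weight,
    # which is why small weights can be driven exactly to zero.
    return lam * np.sign(w)

w = np.array([0.5, -0.25, 0.0, 2.0])
lam = 0.01
penalty = l1_penalty(w, lam)
pull = l1_subgradient(w, lam)
print(penalty, pull)
```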
L2 REGULARIZATION
¡ In L2, we have:
¡ Here, lambda is the regularization parameter. It is the hyperparameter whose value is optimized for
better results. L2 regularization is a type of weight decay as it forces the weights to decay towards zero
(but not exactly zero).
LAMBDA HYPERPARAMETER
LOSS FUNCTION
¡ Here the value 0.01 is the value of regularization parameter, i.e., lambda,
which we need to optimize further.
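A minimal NumPy sketch of a cost with an L2 term, using 0.01 as the regularization parameter. The penalty form (lambda times the sum of squared weights, with no extra scaling), the weight values, and the data loss of 0.3 are assumptions for illustration.

```python
import numpy as np

LAMBDA = 0.01  # regularization parameter, a hyperparameter to be tuned

def l2_penalty(w, lam=LAMBDA):
    # lambda times the sum of the squared weights.
    return lam * np.sum(w ** 2)

def l2_gradient(w, lam=LAMBDA):
    # The pull toward zero is proportional to the weight itself, so
    # weights shrink toward zero but rarely become exactly zero.
    return 2.0 * lam * w

def regularized_cost(data_loss, w, lam=LAMBDA):
    return data_loss + l2_penalty(w, lam)

w = np.array([0.5, -0.25, 0.0, 2.0])
total = regularized_cost(0.3, w)  # 0.3 is a made-up data loss
print(total, l2_gradient(w))
```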
L1 & L2
Example: two tasks that both involve classifying images of animals are likely to be correlated, as both
tasks involve learning to detect fur patterns and colors.
HARD PARAMETER SHARING
¡ Perhaps the most widely used approach for multi-task learning (MTL) with neural
networks is hard parameter sharing, in which we learn a common space representation
for all tasks (i.e. completely share weights/parameters between tasks).
¡ This shared feature space is used to model the different tasks, usually with
additional, task-specific layers (that are learned independently for each task).
¡ Hard parameter sharing acts as regularization and reduces the risk of overfitting, as
the model learns a representation that will (hopefully) generalize well for all tasks.
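A minimal sketch of the forward pass under hard parameter sharing: one shared hidden layer feeds two task-specific output layers. The layer sizes, the random weights, and the two hypothetical tasks (A with 3 outputs, B with 2) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared layer: the same weights are used by every task.
W_shared = rng.normal(size=(4, 8))
# Task-specific layers: one set of weights per task.
W_task_a = rng.normal(size=(8, 3))
W_task_b = rng.normal(size=(8, 2))

def forward(x):
    h = np.maximum(0.0, x @ W_shared)   # shared representation (ReLU)
    return h @ W_task_a, h @ W_task_b   # separate heads per task

x = rng.normal(size=(5, 4))  # batch of 5 examples with 4 features
out_a, out_b = forward(x)
print(out_a.shape, out_b.shape)
```

During training, gradients from both task losses would flow into W_shared, which is what ties the tasks together.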
SOFT PARAMETER SHARING
¡ Instead of sharing exactly the same value of the parameters, in soft parameter
sharing, we add a constraint to encourage similarities among related parameters.
¡ More specifically, we learn a model for each task and penalize the distance between
the different models’ parameters.
¡ Unlike hard sharing, this approach gives more flexibility for the tasks by only loosely
coupling the shared space representations.
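A minimal sketch of a soft parameter sharing penalty. It assumes the distance between the two models' parameters is measured by the squared L2 distance, which is one common choice; the parameter values and lambda are illustrative.

```python
import numpy as np

def soft_sharing_penalty(w_a, w_b, lam):
    # Penalize the squared distance between corresponding parameters
    # of the two task models. Added to the sum of the task losses.
    return lam * np.sum((w_a - w_b) ** 2)

w_task_a = np.array([1.0, 2.0, 3.0])
w_task_b = np.array([1.0, 2.5, 2.0])
penalty = soft_sharing_penalty(w_task_a, w_task_b, lam=0.1)
print(penalty)
```

A larger lambda couples the models more tightly, and in the limit the scheme behaves like hard sharing.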
PARAMETER SHARING- TYPES
¡ Weight decay is a form of regularization that penalizes large weights in the network.
¡ It does this by adding a term to the loss function that is proportional to the sum of
the squared weights.
¡ This term reduces the magnitude of the weights and prevents them from growing
too large.
¡ L2 regularization is a form of weight decay.
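The link between the squared-weight penalty and "decay" can be checked numerically for plain gradient descent. The sketch assumes the penalty is written as (lambda / 2) times the sum of squared weights, so its gradient is lambda times w; the numbers are illustrative.

```python
import numpy as np

w = np.array([1.0, -2.0])
grad_loss = np.array([0.3, -0.1])  # made-up gradient of the data loss
lr, lam = 0.1, 0.01

# Option 1: add the penalty gradient (lam * w) to the loss gradient.
step_with_penalty = w - lr * (grad_loss + lam * w)

# Option 2: shrink the weights by a constant factor, then take the step.
step_with_decay = (1.0 - lr * lam) * w - lr * grad_loss

print(step_with_penalty, step_with_decay)
```

Both options give the same update for plain gradient descent. For adaptive optimizers such as Adam the two are generally not identical.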
EARLY STOPPING
¡ Batch normalization (also known as batch norm) is a method used to make training of
artificial neural networks faster and more stable through normalization of the layers'
inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian
Szegedy in 2015.
¡ Batch Norm is just another network layer that gets inserted between a hidden layer
and the next hidden layer. Its job is to take the outputs from the first hidden layer
and normalize them before passing them on as the input of the next hidden layer.
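A minimal sketch of the batch norm computation at training time. It re-centers and re-scales each feature using the statistics of the current batch, then applies a learnable scale (gamma) and shift (beta). The running averages used at inference time are omitted, and the input data and eps value are assumptions for illustration.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, features); statistics are per feature.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # re-center and re-scale
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
hidden = rng.normal(loc=3.0, scale=5.0, size=(64, 4))  # outputs of a hidden layer
normalized = batch_norm(hidden, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0), normalized.std(axis=0))
```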
DATA AUGMENTATION
Data augmentation is a technique of artificially increasing the training set by creating modified
copies of a dataset using existing data. It includes making minor changes to the dataset or using
deep learning to generate new data points.
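A minimal sketch of two "minor changes" applied to an image-like array: a horizontal flip and added Gaussian noise. The 8x8 random array stands in for a real image, and the noise level 0.05 is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(8, 8))  # stand-in for a grayscale image

# Modified copy 1: mirror the image left to right.
flipped = np.fliplr(image)

# Modified copy 2: add small Gaussian noise, keeping pixel values in [0, 1].
noisy = np.clip(image + rng.normal(0.0, 0.05, size=image.shape), 0.0, 1.0)

# The augmented set holds the original plus both modified copies.
augmented = np.stack([image, flipped, noisy])
print(augmented.shape)
```

Each copy keeps the label of the original example, so the training set grows without collecting new data.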