Module-2

Deep Learning

Deep learning is a branch of machine learning that is based on artificial neural networks.
It is capable of learning complex patterns and relationships within data, and it does not
require every rule to be explicitly programmed. It has become increasingly popular in recent
years due to advances in processing power and the availability of large datasets. Deep
learning is built on artificial neural networks (ANNs), also known as deep neural networks
(DNNs). These networks are inspired by the structure and function of the biological neurons
in the human brain, and they are designed to learn from large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of neural
networks to model and solve complex problems. Neural networks are modeled
after the structure and function of the human brain and consist of layers of
interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks, which
have multiple layers of interconnected nodes. These networks can learn complex
representations of data by discovering hierarchical patterns and features in the
data. Deep Learning algorithms can automatically learn and improve from data
without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including image
recognition, natural language processing, speech recognition, and
recommendation systems. Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and the
development of specialized hardware, such as Graphics Processing Units (GPUs),
have made it easier to train deep neural networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the use of deep
neural networks to model and solve complex problems. Deep Learning has achieved
significant success in various fields, and its use is expected to continue to grow as more data
and more powerful computing resources become available.
In a fully connected Deep neural network, there is an input layer and one or more hidden
layers connected one after the other. Each neuron receives input from the previous layer
neurons or the input layer. The output of one neuron becomes the input to other neurons in
the next layer of the network, and this process continues until the final layer produces the
output of the network. The layers of the neural network transform the input data through a
series of nonlinear transformations, allowing the network to learn complex representations
of the input data.
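To make the forward pass described above concrete, here is a minimal NumPy sketch of a fully connected network with one hidden layer. The layer sizes, the random weights, and the choice of ReLU/sigmoid activations are illustrative assumptions, not values fixed by the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: weighted sum followed by a nonlinear activation
    h = relu(x @ W1 + b1)
    # Output layer: another weighted sum, squashed to (0, 1) for a binary task
    return sigmoid(h @ W2 + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 samples, 3 input features
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)    # input -> hidden (5 neurons)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)    # hidden -> output (1 neuron)
print(forward(x, W1, b1, W2, b2))                # one prediction per sample
```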
Difference between Machine Learning and Deep Learning:

Machine learning and deep learning both are subsets of artificial intelligence but there are
many similarities and differences between them.
Machine Learning | Deep Learning

Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset.

Can work on a smaller amount of data. | Requires a larger volume of data compared to machine learning.

Better for simpler, low-label tasks. | Better for complex tasks like image processing, natural language processing, etc.

Takes less time to train the model. | Takes more time to train the model.

A model is created from relevant features that are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images; it is an end-to-end learning process.

Less complex, and it is easy to interpret the results. | More complex; it works like a black box, so interpreting the results is not easy.

Can work on a CPU, or requires less computing power compared to deep learning. | Requires a high-performance computer with a GPU.

Applications of Deep Learning:

The main applications of deep learning can be divided into computer vision, natural language
processing (NLP), and reinforcement learning.
Computer vision
In computer vision, Deep learning models can enable machines to identify and understand
visual data. Some of the main applications of deep learning in computer vision include:
• Object detection and recognition: Deep learning models can be used to identify
and locate objects within images and videos, enabling applications such as
self-driving cars, surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such
as medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used for image
segmentation into different regions, making it possible to identify specific features
within images.
Natural language processing (NLP):
In NLP, deep learning models enable machines to understand and generate human
language. Some of the main applications of deep learning in NLP include:
• Automatic text generation: Deep learning models can learn from a corpus of text,
and new text such as summaries and essays can be automatically generated using
these trained models.
• Language translation: Deep learning models can translate text from one
language to another, making it possible to communicate with people from
different linguistic backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece
of text, making it possible to determine whether the text is positive, negative, or
neutral. This is used in applications such as customer service, social media
monitoring, and political analysis.
• Speech recognition: Deep learning models can recognize and transcribe spoken
words, making it possible to perform tasks such as speech-to-text conversion,
voice search, and voice-controlled devices.
Reinforcement learning:
In reinforcement learning, deep learning is used to train agents to take actions in an
environment so as to maximize a reward. Some of the main applications of deep learning in
reinforcement learning include:
• Game playing: Deep reinforcement learning models have been able to beat
human experts at games such as Go, Chess, and Atari.
• Robotics: Deep reinforcement learning models can be used to train robots to
perform complex tasks such as grasping objects, navigation, and manipulation.
• Control systems: Deep reinforcement learning models can be used to control
complex systems such as power grids, traffic management, and supply chain
optimization.

Challenges in Deep Learning


Deep learning has made significant advancements in various fields, but there are still some
challenges that need to be addressed. Here are some of the main challenges in deep learning:
1. Data availability: Deep learning requires large amounts of data to learn from,
and gathering enough data for training is a major concern.
2. Computational resources: Training a deep learning model is computationally
expensive because it requires specialized hardware such as GPUs and TPUs.
3. Time-consuming: Depending on the computational resources and the data
(especially sequential data), training can take a very long time, even days or months.
4. Interpretability: Deep learning models are complex and work like a black box,
so it is very difficult to interpret their results.
5. Overfitting: When the model is trained repeatedly, it can become too specialized
to the training data, leading to overfitting and poor performance on new data.
Advantages of Deep Learning:

1. High accuracy: Deep Learning algorithms can achieve state-of-the-art
performance in various tasks, such as image recognition and natural language
processing.
2. Automated feature engineering: Deep Learning algorithms can automatically
discover and learn relevant features from data without the need for manual feature
engineering.
3. Scalability: Deep Learning models can scale to handle large and complex datasets,
and can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can
handle various types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.

Disadvantages of Deep Learning:

1. High computational requirements: Deep Learning models require large amounts
of data and computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often require a
large amount of labeled data for training, which can be expensive and time-
consuming to acquire.
3. Interpretability: Deep Learning models can be challenging to interpret, making it
difficult to understand how they make decisions.
4. Overfitting: Deep Learning models can sometimes overfit to the training data,
resulting in poor performance on new and unseen data.
5. Black-box nature: Deep Learning models are often treated as black boxes, making
it difficult to understand how they work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages, including high
accuracy and scalability, it also has some disadvantages, such as high
computational requirements, the need for large amounts of labeled data, and
interpretability challenges. These limitations need to be carefully considered when
deciding whether to use Deep Learning for a specific task.

Multi-Layer Neural Networks (Multi-Layer Perceptrons (MLPs) / Deep Feedforward Networks)
As the name suggests, multi-layer neural networks, or multi-layer perceptrons (MLPs),
consist of more than one layer. Unlike the perceptron, they can be used for non-linearly
separable problems. They achieve this with the activation functions they use in their layers. The
activation functions make the output of the neurons nonlinear, which enables the network to
solve more complex problems. (Without activation functions, ANNs effectively become a
linear regression model.)

The basic layers of MLPs are:


• Input layer → it includes 1 neuron per input x.

• Hidden layers (one or more) → the number of neurons each contains depends on the
problem.

• Output layer → the number of neurons it contains depends on the problem.

In the first step, for every neuron of the hidden layers, the same process as in the perceptron
is applied:

1. The weighted sum (z) is calculated.

2. The sum is transmitted to the corresponding hidden neuron, and then the activation
function present in that neuron is applied.

In the next step, the outputs of the hidden layers are transmitted to the output layer. As said
before, the number of neurons here depends on the problem:

Regression: consists of 1 neuron,
Binary Classification: consists of 1 neuron,
Multi-label Classification: consists of 1 neuron per label,
Multi-class Classification: consists of 1 neuron per class in the output layer.

The activation functions in the neurons of the output layer also depend on the task:
Regression: none, or ReLU/Softplus (if the outputs must be positive), or logistic/tanh (if the outputs are bounded),
Binary Classification: Logistic(sigmoid) function,
Multi-label Classification: Logistic(sigmoid) function,
Multi-class Classification: Softmax function.

The main goal is to enable the ANN to learn the most accurate weight values (and so achieve
the most accurate results) with the right number of hidden layers and neurons. We can do this
by applying certain processes in our artificial neural network and optimizing it.
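As a concrete illustration of the output-layer choices listed above, here is a minimal sketch using TensorFlow/Keras (an assumed dependency, not named in the text); the layer sizes, input dimension, and losses are illustrative only.

```python
import tensorflow as tf

# Binary classification: 1 output neuron with a sigmoid activation
binary_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                       # 10 input features (assumed)
    tf.keras.layers.Dense(16, activation="relu"),      # one hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),    # 1 neuron, probability output
])
binary_model.compile(optimizer="adam", loss="binary_crossentropy")

# Multi-class classification: 1 output neuron per class with a softmax activation
n_classes = 4
multiclass_model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
multiclass_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```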

Activation Functions

NOTE → (activation functions are explained here as part of the deep feedforward neural network)

An activation function may be defined as the extra transformation applied to a neuron's input
to obtain its output. In an ANN, we apply activation functions over the weighted input to
produce the neuron's output.

In other words, it is simply the function used to compute the output of a node. It is also known
as a Transfer Function.

Why we use Activation functions with Neural Networks?

Activation functions are used to determine the output of a neural network, such as yes or no.
They map the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).

The Activation Functions can be basically divided into 2 types-

1. Linear Activation Function

2. Non-linear Activation Functions

Linear or Identity Activation Function

The linear (identity) function is a straight line, so the output of the function is not
confined to any range.
Equation: f(x) = x
Range: (-infinity, infinity)
It does not help with the complexity or the various parameters of the usual data that is fed to
neural networks.

Non-linear Activation Function

The nonlinear activation functions are the most commonly used activation functions.
Nonlinearity gives the activation curve its characteristic shape and makes it easy for the model
to generalize or adapt to a variety of data and to differentiate between the outputs.

The main terminologies needed to understand for nonlinear functions are:

Derivative or Differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.

Monotonic function: A function which is either entirely non-increasing or non-decreasing.

The Nonlinear Activation Functions are mainly divided on the basis of their range or curves-

1. Sigmoid or Logistic Activation Function

The sigmoid function curve looks like an S-shape.

The main reason we use the sigmoid function is that its output lies between 0 and 1. It is
therefore especially used for models where we have to predict a probability as an output: since
the probability of anything exists only between 0 and 1, sigmoid is the right choice.

The function is differentiable, which means we can find the slope of the sigmoid curve at any
point.

The function is monotonic but function’s derivative is not.

The logistic sigmoid function can cause a neural network to get stuck during training, because its gradient becomes very small for large positive or negative inputs.

The softmax function is a more generalized logistic activation function which is used for
multiclass classification.
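A minimal NumPy sketch of the sigmoid function, its derivative, and its softmax generalization, included purely as an illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # not monotonic: peaks at z = 0

def softmax(z):
    # Generalization of the logistic function for multi-class outputs
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(np.array([-2.0, 0.0, 2.0])))
print(softmax(np.array([1.0, 2.0, 3.0])))
```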

2. Tanh or hyperbolic tangent Activation Function

tanh is also like the logistic sigmoid, but better. The range of the tanh function is (-1, 1), and
tanh is also sigmoidal (S-shaped).
The advantage is that negative inputs are mapped strongly negative and zero inputs are
mapped near zero on the tanh graph.

The function is differentiable.

The function is monotonic while its derivative is not monotonic.

The tanh function is mainly used for classification between two classes.

Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
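A corresponding NumPy sketch of tanh and its derivative, again purely illustrative:

```python
import numpy as np

def tanh(z):
    # Maps any real number into the range (-1, 1), centered at zero
    return np.tanh(z)

def tanh_derivative(z):
    return 1.0 - np.tanh(z) ** 2   # largest at z = 0, shrinks for large |z|

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z))
print(tanh_derivative(z))
```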

3. ReLU (Rectified Linear Unit) Activation Function

The ReLU is the most used activation function in the world right now, since it is used in almost
all convolutional neural networks and deep learning models.
The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is
equal to z when z is greater than or equal to zero.

Range: [ 0 to infinity)

The function and its derivative both are monotonic.

But the issue is that all negative values become zero immediately, which decreases the ability
of the model to fit or train on the data properly. Any negative input given to the ReLU
activation function is turned into zero immediately, which in turn affects the resulting
function by not mapping negative values appropriately.
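And a NumPy sketch of ReLU and its derivative (illustrative only); note how every negative input maps to zero:

```python
import numpy as np

def relu(z):
    # f(z) = max(0, z): zero for negative inputs, identity for non-negative ones
    return np.maximum(0.0, z)

def relu_derivative(z):
    # 0 for z < 0 and 1 for z > 0 (the value at exactly 0 is conventionally set to 0 here)
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]
```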

Training Deep Learning Model


Gathering more data is the smart approach
A deep learning training program is only as good as the data it is trained on. When it comes to
training, more data is preferable to the alternative. Some deep learning use cases can require
up to millions of records during the process of training in order to be effective. TechTarget
editor Ed Burns speaks with Patrick Lucey, now chief scientist at Stats Perform, about how
much data is the right amount for deep learning projects.

When using limited data sets, focus on the approach


For companies that struggle to gather large data sets, there are still pathways to successful deep
learning model training. Companies can take a grow-more approach and use generative
adversarial networks to generate more data to train a model on, or take a know-more approach
and use transfer learning. George Lawton discusses the advantages of both these approaches as
well as the importance of labelled data when it comes to deep learning and machine learning.
Decrease strain through federated deep learning
Training deep learning models requires significant compute resources which can create a drag
on an enterprise's infrastructure. A recent development known as federated deep learning was
created to combat this drag. Federated deep learning spreads out the compute power to
numerous individual devices in order to ease the burden. George Lawton writes on specific use
cases for federated deep learning and industries that have adopted the approach.

Consider reinforcement learning


Reinforcement learning is a relatively new training method that is based on rewarding desired
behaviors and punishing undesired ones. It allows models to teach themselves despite having
limited data. Maria Korolov discusses what
role reinforcement learning has in deep learning model training, as well as practical
applications of the technology and other advancements.

Continue to retrain the model and build the right workflow


To prevent model decay and loss of accuracy, companies must continuously check on their
models and adjust. This is true for both machine learning as well as deep learning models. Jack
Vaughan speaks with James Kobielus, a research director as well as a tech consultant and
analyst, about the importance of this and how creating a DevOps workflow can ease this
burden.
Optimization techniques
Challenges with Optimization

When talking about optimization in the context of neural networks, we are discussing non-
convex optimization.

Convex optimization involves a function in which there is only one optimum, corresponding to
the global optimum (maximum or minimum). There is no concept of local optima for convex
optimization problems, making them relatively easy to solve — these are common introductory
topics in undergraduate and graduate optimization classes.

Non-convex optimization involves a function which has multiple optima, only one of which is
the global optimum. Depending on the loss surface, it can be very difficult to locate the global
optimum.

For a neural network, the curve or surface that we are talking about is the loss surface. Since we
are trying to minimize the prediction error of the network, we are interested in finding the global
minimum on this loss surface — this is the aim of neural network training.

There are multiple problems associated with this:

• What is a reasonable learning rate to use? Too small a learning rate takes too
long to converge, and too large a learning rate will mean that the network will
not converge.

• How do we avoid getting stuck in local optima? One local optimum may be
surrounded by a particularly steep loss function, and it may be difficult to
‘escape’ this local optimum.

• What if the loss surface morphology changes? Even if we can find the global
minimum, there is no guarantee that it will remain the global minimum
indefinitely. A good example of this is training on a dataset that is not
representative of the actual data distribution: when the model is applied to new
data, the loss surface will be different. This is one reason why making the training
and test datasets representative of the total data distribution is of such high
importance. Another good example is data whose distribution habitually changes
due to its dynamic nature, such as user preferences for popular music or movies,
which change day-to-day and month-to-month.

Fortunately, there are methods available that provide ways to tackle all of these challenges, thus
mitigating their potentially negative ramifications.

Local Optima
Previously, local minima were viewed as a major problem in neural network training.
Nowadays, researchers have found that when using sufficiently large neural networks, most
local minima incur a low cost, and thus it is not particularly important to find the true global
minimum — a local minimum with reasonably low error is acceptable.

Saddle Points

Recent studies indicate that in high dimensions, saddle points are more likely than local minima.
Saddle points are also more problematic than local minima because, close to a saddle point, the
gradient can be very small. Thus, gradient descent will result in negligible updates to the
network and hence network training will cease.
A saddle point is a point that is a local minimum along some dimensions and a local maximum
along others.

What is Optimization?

An optimizer updates the attributes (weights and biases) of the neural network during backward
propagation.

It also determines how much data is used for each backpropagation step and provides only that
amount of data to the network.
Non Momentum Based Optimization

In non-momentum-based optimization, the new weight does not depend on any accumulated
history of previous updates: every time we feed a new set of inputs, the weight is updated using
only the current gradient, with no momentum term carried over from earlier steps.

1. Batch Gradient Descent

Suppose you have a dataset with n training examples. Sending all of the training data through
the network to compute a single update of the attributes is known as batch gradient descent.
The update averages the gradient of the loss over all n examples:

w ← w − α · (1/n) · Σᵢ ∇L(ŷᵢ, yᵢ),   where α is the learning rate.
Advantage:

To calculate the new weights, the network takes the cost (the average of the loss over all of the
inputs that are passed), which gives a kind of stable learning capability to the network.

Disadvantage:

• As the user has to send all of the input data to calculate the loss, it is a very
time-consuming process.

• It has high memory consumption.

• To handle a large dataset we need a high-end computational system, resulting in high
cost and complexity.

2. Mini Batch Gradient Descent

Suppose we have 1000 data points in a dataset. In mini-batch gradient descent (MGD), unlike
BGD, we do not calculate the loss over the whole dataset; instead we randomly select small
batches of data points, ideally in a range of 50–256, and then we calculate the attributes.

Advantage:

• Due to the small size of the batches, the system has lower memory consumption,
which overcomes the memory issue present in BGD.

• Less computational complexity as compared to BGD.

Disadvantage:

With a small learning rate, the convergence rate (the speed of finding the minimum) will be
too low.

3. Stochastic Gradient Descent

SGD randomly selects one training example out of the n training examples to calculate the new
weights and biases.
• Neither a high nor a low learning rate drastically affects the change in the attributes.

• It has less computational complexity per update compared to BGD and MGD.

• It works tremendously well with momentum-based gradient descent.
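The three variants differ only in how many examples contribute to each update. Here is a hedged NumPy sketch for a linear model with squared error; setting batch_size to the full dataset gives batch GD, a small value gives mini-batch GD, and 1 gives SGD. The synthetic data, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                            # 1000 examples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

def train(X, y, batch_size, lr=0.05, epochs=20):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                          # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on this batch
            w -= lr * grad
    return w

print(train(X, y, batch_size=len(X)))   # batch gradient descent
print(train(X, y, batch_size=64))       # mini-batch gradient descent
print(train(X, y, batch_size=1))        # stochastic gradient descent
```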

Momentum Based Optimization

In this technique, the previously accumulated update is taken into consideration when
calculating the new weight.

In the calculation of the new weight there is a contribution from both the current gradient and
the previous updates, which makes the learning process smoother.
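A minimal sketch of the classical momentum update; grad_fn is a hypothetical helper that returns the gradient of the loss at w, and the coefficients are typical defaults rather than values from the text.

```python
import numpy as np

def momentum_step(w, velocity, grad_fn, lr=0.01, beta=0.9):
    # velocity accumulates an exponentially decaying sum of past gradients
    g = grad_fn(w)
    velocity = beta * velocity + g      # previous updates contribute via beta * velocity
    w = w - lr * velocity
    return w, velocity

# toy usage: minimize f(w) = ||w||^2, whose gradient is 2w
w, v = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = momentum_step(w, v, lambda w: 2 * w)
print(w)   # approaches [0, 0]
```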
Adagrad

• Adapts the learning rate per parameter: for frequently updated parameters the
effective learning rate becomes small, while for infrequently updated parameters it
stays larger.

• It is best suited to sparse datasets (non-frequent, biased datasets).

• It reduces the need for manual tuning of the learning rate.


In the Adagrad update, G(i,t) is the sum of the squares of all past gradients of weight i, g(i,t) is
the current gradient, and ε (epsilon) provides numerical smoothness:

w(t+1, i) = w(t, i) − (α / √(G(i,t) + ε)) · g(i,t)

Disadvantage:

Because the sum of squared gradients in the denominator only grows, the effective learning rate
decreases monotonically and can eventually become vanishingly small.
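An illustrative NumPy sketch of the Adagrad update described above; grad_fn is a hypothetical helper returning the gradient at w.

```python
import numpy as np

def adagrad_step(w, G, grad_fn, lr=0.1, eps=1e-8):
    g = grad_fn(w)
    G = G + g ** 2                       # accumulate squared gradients per parameter
    w = w - lr * g / (np.sqrt(G) + eps)  # frequently-updated parameters get smaller steps
    return w, G

w, G = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, G = adagrad_step(w, G, lambda w: 2 * w)   # minimize ||w||^2
print(w)
```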

2. Adadelta

It is an extension of Adagrad. It overcomes Adagrad's issue of a monotonically decreasing
learning rate.

Advantage:

It eliminates the requirement for manually tuning the learning rate.

Disadvantage:

It can result in a slower training period.

3. RmsProp

It is also used as a better replacement for Adagrad. Instead of taking the full sum of the squares
of the gradients of the updated weights, we take a running average of the squared gradients.

Running Gradient?

The running gradient E[g²] is an exponentially decaying average of the squared gradients rather
than their full sum:
E[g²](t) = ρ · E[g²](t−1) + (1 − ρ) · g(t)²,   where ρ is the decay rate.

Advantage:

• It adjusts the learning rate automatically and does not depend on manual tuning.

• Better performance and efficiency than Adagrad and Adadelta.
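An illustrative NumPy sketch of the RMSProp update using the running average E[g²] defined above; grad_fn is again a hypothetical helper.

```python
import numpy as np

def rmsprop_step(w, Eg2, grad_fn, lr=0.01, rho=0.9, eps=1e-8):
    g = grad_fn(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2       # running average of squared gradients
    w = w - lr * g / (np.sqrt(Eg2) + eps)      # the learning rate adapts per parameter
    return w, Eg2

w, Eg2 = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, Eg2 = rmsprop_step(w, Eg2, lambda w: 2 * w)   # minimize ||w||^2
print(w)
```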

4. Adam

• Adam stands for adaptive moment estimation. It is one of the most frequently used
algorithms in neural networks.

• Achieves Adaptive learning rate.

• It works well with infrequent data.

• It combines the ideas of RMSProp and Adagrad.

• It works well in online setting and offline setting.

• It has a very small memory requirement.

Let's understand the reasoning behind Adam:

Previously, whenever optimizers were formulated, the raw gradient was used as the base of the
update. As the field developed, researchers looked for a richer signal than the raw gradient
alone, and Adam instead builds its update from running estimates of the mean (first moment)
and the uncentered variance (second moment) of the gradients.

Adam uses bias-corrected (regularized) estimates of this mean and variance in its formulation,
and the user retains control over the learning rate and the decay rates of the moment estimates.
In Adagrad, Adadelta and RMSProp the adaptation acts on the learning rate only, whereas Adam
additionally shapes the update through these moment estimates of the gradient.
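A minimal sketch of the Adam update using the first- and second-moment estimates described above, with the commonly published default hyperparameters; grad_fn is a hypothetical helper and the toy problem is illustrative.

```python
import numpy as np

def adam_step(w, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(w)
    m = beta1 * m + (1 - beta1) * g          # estimate of the mean (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2     # estimate of the uncentered variance (second moment)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([3.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, t, lambda w: 2 * w, lr=0.05)   # minimize ||w||^2
print(w)   # settles near [0, 0]
```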
Regularization
In regression analysis, the features are estimated using coefficients while modelling. If these
estimates can be restricted, shrunk, or regularized towards zero, then the impact of
insignificant features is reduced and the model is prevented from having high variance, giving
a stable fit.

Regularization is the most widely used technique to penalize complex models in machine
learning. It is deployed to reduce overfitting (that is, to shrink the generalization error) by
keeping the network weights small. It also enhances the performance of models on new inputs.

It avoids overfitting by penalizing regression coefficients of high value. More specifically,
it shrinks (simplifies) the parameters of the model. This more streamlined model will
perform more efficiently when making predictions.

Since it keeps the magnitude of the weights in a model low, the regularization technique is
also referred to as weight decay.

Moreover, regularization appends penalties to more complex models and ranks potential
models from least overfit to greatest. Regularization assumes that smaller weights produce
simpler models and hence help in avoiding overfitting.

The model with the least overfitting score is taken as the preferred choice for prediction.

In general, regularization is adopted universally because simple data models generalize better
and are less prone to overfitting. Examples of regularization include:

• K-means: Restricting the segments for avoiding redundant groups.


• Neural networks: Confining the complexity (weights) of a model.
• Random Forest: Reducing the depth of tree and branches (new features)

There are various regularization techniques; some well-known techniques are L1, L2 and
dropout regularization. However, in this discussion, L1 and L2 regularization are our
main focus.

Regularization Term

Both L1 and L2 add a penalty to the cost that depends on the model complexity, so instead of
computing the cost using only a loss function, an auxiliary component, known as the
regularization term, is added in order to penalize complex models.

By adding the regularization term, the values in the weight matrices decrease, because a neural
network with smaller weights leads to simpler models. Hence, it reduces overfitting to a
certain extent.

Penalty Terms
Regularization biases the coefficient estimates towards specific values, such as very small
values close to zero. It achieves this biasing by adding a tuning parameter to the loss that
strengthens this effect. For example:

1. L1 regularization: It adds an L1 penalty that is equal to the absolute value of the


magnitude of coefficient, or simply restricting the size of coefficients. For example,
Lasso regression implements this method.
2. L2 Regularization: It adds an L2 penalty which is equal to the square of the magnitude
of coefficients. For example, Ridge regression and SVM implement this method.
3. Elastic Net: When L1 and L2 regularization are combined, the result is the elastic
net method, which adds an additional hyperparameter.

What is L1 Regularization?

L1 regularization is the preferred choice when there is a high number of features, as it provides
sparse solutions. We even obtain a computational advantage, because features with zero
coefficients can be ignored.

The regression model that uses L1 regularization technique is called Lasso Regression.

Mathematical Formula for L1 regularization

For instance, we define the simple linear regression model Y with an independent variable to
understand how L1 regularization works.

For this model, W and b represent the "weights" and "bias" respectively, such that

W= w1, w2, w3, ......... wn


And,
b=b1, b2, b3, ......... bn
And Ŷ is the predicted result, such that
Ŷ = w1x1 + w2x2 + ... + wnxn + b
The function below calculates the error without the regularization term:
Loss = Error(Y, Ŷ)
And the function that calculates the error with the L1 regularization term is:
Loss = Error(Y, Ŷ) + λ Σ |wi|

Here 𝝺 is called the regularization parameter, and 𝝺 > 0 is tuned manually. If 𝝺 = 0 the above
loss function reduces to Ordinary Least Squares, whereas a very large value of 𝝺 pushes the
coefficients (weights) towards 0 and makes the model underfit.

Now, |w| is differentiable everywhere except at w = 0, where its (sub)gradient is taken as sign(w).

Substituting into the gradient descent update for calculating new weights:
w_new = w − α · ∂Loss/∂w
and putting the L1 penalty into the above equation:
w_new = w − α · (∂Error/∂w + λ · sign(w))

From the above formula, we can say that:

• When w is positive, the regularization term (with λ > 0) pushes w to be less positive,
by subtracting α·λ from w.
• When w is negative, the regularization term pushes w to be less negative, by adding
α·λ to w.

In both cases, the penalty drives the weight towards zero.
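A small NumPy sketch of lasso-style (L1-regularized) gradient descent for linear regression, following the update above; the synthetic data, λ, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([2.0, 0.0, 0.0, -3.0, 0.0])      # a sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=200)

def l1_gradient_descent(X, y, lam=0.1, lr=0.01, epochs=500):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        error_grad = 2 * X.T @ (X @ w - y) / n     # gradient of the squared error
        w -= lr * (error_grad + lam * np.sign(w))  # L1 penalty pushes weights toward 0
    return w

print(l1_gradient_descent(X, y))   # irrelevant weights are driven close to zero
```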


What is L2 regularization?

L2 regularization can deal with multicollinearity (independent variables that are highly
correlated) by constricting the coefficients while keeping all the variables.

L2 regression can be used to estimate the significance of predictors and based on that it can
penalize the insignificant predictors.

A regression model that uses L2 regularization techniques is called Ridge Regression.

Mathematical Formula for L2 regularization

For instance, we define the simple linear regression model Y with an independent variable to
understand how L2 regularization works.

For this model, W and b represent the "weights" and "bias" respectively, such that

W= w1, w2, w3, ......... wn


And,
b=b1, b2, b3, ......... bn
And Ŷ is the predicted result, such that
Ŷ = w1x1 + w2x2 + ... + wnxn + b
The function below calculates the error without the regularization term:
Loss = Error(Y, Ŷ)

And the function that calculates the error with the L2 regularization term is:
Loss = Error(Y, Ŷ) + λ Σ wi²

Here, 𝝺 is known as the regularization parameter. If lambda is zero, this again acts as OLS, and
if lambda is extremely large, it penalizes the weights so heavily that they shrink towards zero,
which yields underfitting.

Substituting into the gradient descent update for calculating new weights and putting the L2
penalty into the equation gives:
w_new = w − α · (∂Error/∂w + 2λw)
Each step therefore shrinks every weight by an amount proportional to λ, which is why L2
regularization is also referred to as weight decay.
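A matching NumPy sketch of ridge-style (L2-regularized) gradient descent, following the update above; the same illustrative data assumptions as in the L1 sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, 0.0, -3.0, 0.0]) + 0.1 * rng.normal(size=200)

def l2_gradient_descent(X, y, lam=0.1, lr=0.01, epochs=500):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        error_grad = 2 * X.T @ (X @ w - y) / n
        w -= lr * (error_grad + 2 * lam * w)   # L2 penalty shrinks every weight each step
    return w

print(l2_gradient_descent(X, y))   # weights are shrunk but rarely become exactly zero
```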


L2 vs L1 Regularization

It is often observed that people get confused when selecting a suitable regularization approach
to avoid overfitting while training a machine learning model.

Among the many regularization techniques, such as L2 and L1 regularization, dropout, data
augmentation, and early stopping, we will look here at the intuitive differences between L1 and
L2 regularization.

1. Where L1 regularization attempts to estimate the median of the data, L2 regularization
estimates the mean of the data in order to evade overfitting.

2. By including the absolute values of the weight parameters, L1 regularization adds its
penalty term to the cost function. On the other hand, L2 regularization appends the
squared values of the weights to the cost function.

3. Sparsity is the property of having only a few significant (non-zero) coefficients, with
the rest very close to zero; in general, coefficients approaching zero can be eliminated
later.

Feature selection goes a step further than sparsity: instead of merely confining coefficients
to be near zero, feature selection brings them exactly to zero, and hence expels certain
features from the data model.

In this context, L1 regularization can be helpful for feature selection by eradicating the
unimportant features, whereas L2 regularization is not recommended for feature
selection.

4. L2 has a closed-form solution because its penalty is the square of the weights; L1, on
the other hand, does not have a closed-form solution because it involves an absolute
value, which is a non-differentiable function.

For this reason, L1 regularization is relatively more expensive computationally: it cannot
be solved analytically in matrix form and relies heavily on approximations.

L2 regularization tends to be more accurate in most circumstances, although the overall
training can still come at a considerable computational cost.

The table below shows the summarized differences between L1 and L2 regularization;

S.No | L1 Regularization | L2 Regularization
1 | Penalizes the sum of the absolute values of the weights. | Penalizes the sum of the squared weights.
2 | It has a sparse solution. | It has a non-sparse solution.
3 | It can give multiple solutions. | It has only one solution.
4 | Built-in feature selection. | No feature selection.
5 | Robust to outliers. | Not robust to outliers.
6 | It generates simple and interpretable models. | It gives more accurate predictions when the output variable is a function of all the input variables.
7 | Unable to learn complex data patterns. | Able to learn complex data patterns.
8 | Computationally inefficient over non-sparse conditions. | Computationally efficient because it has analytical solutions.

Regularization and Early Stopping:


The general set of strategies against this curse of overfitting is called regularization and early
stopping is one such technique.

The idea is very simple. The model tries to chase the loss function on the training data by
tuning its parameters. Now, we keep another set of data aside as the validation set, and as we
train, we keep a record of the loss on the validation data; when we see that there is no more
improvement on the validation set, we stop, rather than running through all the epochs. This
strategy of stopping early based on the validation set performance is called Early Stopping.
From the training and validation accuracy curves, it can be observed that:

• The training set accuracy continues to increase through all the epochs.

• The validation set accuracy, however, saturates after around 8 to 10 epochs. This is
where training of the model can be stopped.

Early stopping therefore not only protects against overfitting but also needs considerably fewer
epochs to train.
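A hedged sketch of a patience-based early stopping loop; train_one_epoch and validation_loss are hypothetical helpers standing in for a real training setup, and the patience value is an illustrative choice.

```python
def fit_with_early_stopping(model, train_one_epoch, validation_loss,
                            max_epochs=100, patience=5):
    # Stop when the validation loss has not improved for `patience` consecutive epochs.
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical: one pass over the training data
        val_loss = validation_loss(model)      # hypothetical: loss on the held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0     # improvement: reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}")
                break
    return model
```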

Data Augmentation
Data augmentation is a technique of artificially increasing the training set by creating modified
copies of a dataset using existing data. It includes making minor changes to the dataset or using
deep learning to generate new data points.

Augmented vs. Synthetic data

Augmented data is derived from original data with some minor changes. In the case of image
augmentation, we make geometric and color space transformations (flipping, resizing,
cropping, changing brightness and contrast) to increase the size and diversity of the training set.

Synthetic data is generated artificially without using the original dataset. It often uses DNNs
(Deep Neural Networks) and GANs (Generative Adversarial Networks) to generate synthetic
data.
Note: the augmentation techniques are not limited to images. You can augment audio, video,
text, and other types of data too.

When Should You Use Data Augmentation?

1. To prevent models from overfitting.


2. The initial training set is too small.
3. To improve the model accuracy.
4. To Reduce the operational cost of labeling and cleaning the raw dataset.

Limitations of Data Augmentation

• The biases in the original dataset persist in the augmented data.


• Quality assurance for data augmentation is expensive.
• Research and development are required to build a system with advanced applications.
For example, generating high-resolution images using GANs can be challenging.
• Finding an effective data augmentation approach can be challenging.

Data Augmentation Techniques

In this section, we will learn about audio, text, image, and advanced data augmentation
techniques.

Audio Data Augmentation

1. Noise injection: add gaussian or random noise to the audio dataset to improve the
model performance.
2. Shifting: shift audio left (fast forward) or right with random seconds.
3. Changing the speed: stretches times series by a fixed rate.
4. Changing the pitch: randomly change the pitch of the audio.

Text Data Augmentation

1. Word or sentence shuffling: randomly changing the position of a word or sentence.


2. Word replacement: replace words with synonyms.
3. Syntax-tree manipulation: paraphrase the sentence using the same word.
4. Random word insertion: inserts words at random.
5. Random word deletion: deletes words at random.
Image Augmentation

1. Geometric transformations: randomly flip, crop, rotate, stretch, and zoom images.
You need to be careful about applying multiple transformations on the same images, as
this can reduce model performance.
2. Color space transformations: randomly change RGB color channels, contrast, and
brightness.
3. Kernel filters: randomly change the sharpness or blurring of the image.
4. Random erasing: delete some part of the initial image.
5. Mixing images: blending and mixing multiple images.
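A minimal NumPy sketch of a few of the geometric and color transformations listed above, applied to an image represented as an array; this is illustrative only, and real pipelines typically rely on a library such as torchvision or albumentations.

```python
import numpy as np

def random_augment(image, rng):
    # image: H x W x C array with float values in [0, 1]
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                          # horizontal flip
    if rng.random() < 0.5:
        h, w, _ = image.shape
        top, left = rng.integers(0, h // 8), rng.integers(0, w // 8)
        image = image[top:top + 7 * h // 8, left:left + 7 * w // 8, :]   # random crop
    brightness = rng.uniform(0.8, 1.2)                     # random brightness change
    return np.clip(image * brightness, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
augmented = random_augment(img, rng)
print(augmented.shape)
```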

Advanced Techniques

1. Generative adversarial networks (GANs): used to generate new data points or


images. It does not require existing data to generate synthetic data.
2. Neural Style Transfer: a series of convolutional layers trained to deconstruct images
and separate context and style.

Data Augmentation Applications

Healthcare
Self-Driving Cars
Natural Language Processing
Automatic Speech Recognition

Parameter Tying
Parameter tying is a regularization technique. We divide the parameters or weights of a
machine learning model into groups by leveraging prior knowledge, and all parameters in each
group are constrained to take the same value. In simple terms, we want to express that specific
parameters should be close to each other.
Example
Two models perform the same classification task (with the same set of classes) but with
different input data.
• Model A with parameters wt(A).
• Model B with parameters wt(B).
The two models map the input to two different but related outputs.

Some standard regularizers like L1 and L2 penalize model parameters for deviating from the
fixed value of zero. One of the side effects of Lasso or group-Lasso regularization when learning
a deep neural network is that many of the parameters may become zero, thus reducing the
amount of memory required to store the model and lowering the computational cost of applying
it. A significant drawback of Lasso (or group-Lasso)
regularization is that in the presence of groups of highly correlated features, it tends to select
only one or an arbitrary convex combination of elements from each group. Moreover, the
learning process of Lasso tends to be unstable because the subsets of parameters that end up
selected may change dramatically with minor changes in the data or algorithmic procedure. In
Deep Neural Networks, it is almost unavoidable to encounter correlated features due to the
high dimensionality of the input to each layer and because neurons tend to adapt, producing
strongly correlated features that we pass as an input to the subsequent layer.

Parameter Sharing
Parameter sharing forces sets of parameters to be equal: we interpret various models or
model components as sharing a single set of parameters, so we only need to store one copy of
those parameters in memory.
Suppose two models A and B, perform a classification task on similar input and output
distributions. In such a case, we'd expect the parameters for both models to be identical to each
other as well. We could impose a norm penalty on the distance between the weights, but a more
popular method is to force the parameters to be equal. The idea behind Parameter Sharing is
the essence of forcing the parameters to be similar. A significant benefit here is that we need
to store only a subset of the parameters (e.g., storing only the parameters for model A instead
of storing for both A and B), which leads to significant memory savings.

Example
The most extensive use of parameter sharing is in convolutional neural networks. Natural
images have specific statistical properties that are robust to translation; for example, a photo of
a cat remains a photo of a cat if it is translated one pixel to the right. Convolutional neural
networks exploit this property by sharing parameters across multiple image locations, so we
can find a cat with the same cat detector whether it appears in column i or column i+1 of the
image.
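A tiny NumPy sketch of parameter sharing: the same 1D convolution kernel (one shared set of weights) is applied at every position of the input, instead of learning a separate weight for each position. The kernel values and signal are illustrative.

```python
import numpy as np

def conv1d(x, kernel):
    # The same `kernel` weights are reused at every window position: parameter sharing.
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
edge_detector = np.array([-1.0, 0.0, 1.0])     # one shared detector, 3 parameters total
print(conv1d(signal, edge_detector))           # responds wherever the pattern occurs
```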

Injecting noise at input


The concept of noise injection is simple. We know that one of the main root causes of
overfitting is the size of the dataset: if the dataset we are dealing with is too small, our model
will achieve near-perfect accuracy on the training data but will not show much accuracy on the
holdout dataset. So we need to increase the size of the dataset, either by collecting new data or
by adding some noise to the existing data. Collecting new data samples and adding them to the
dataset is a routine but effort-intensive task, which is why the concept of injecting noise into
the dataset was developed. The type of noise you add to the dataset depends on the actual
dataset.

Why add Noise?

The small-dataset problem is what motivated this procedure. The main problem we face with a
small dataset is that we have very few samples, so our model will effectively memorize all of
them and work well on the training data. At the same time, because the model has learned from
so few samples, it cannot build a good mapping between the inputs and outputs, resulting in
poor predictions for new inputs.
Won't the addition of noise degrade the model's performance? The answer is no. Adding noise
to a dataset that is causing overfitting has a regularization effect during training and thus
improves model performance. Adding noise effectively expands the training dataset: each time
a training sample is exposed to the model, some random noise is added to the input variables,
making the sample slightly different on every exposure. In this way, adding noise to input
samples is a simple form of data augmentation. The noise prevents the model from memorizing
the samples too efficiently, resulting in a smoother mapping function.

Key Points on adding Noise

The most common noise added during the training of a model is Gaussian noise, or white
noise. We usually use zero-mean Gaussian noise, with the standard deviation chosen to control
its strength. The addition of this Gaussian noise to the inputs of a neural network is traditionally
called "jitter".
The next and most important point to note is how much noise to add: if you add too little noise,
it has no effect, while if you add too much, the model will be unable to learn the underlying
mapping.
Example: adding Gaussian noise

The main advantage of Gaussian noise is that we can look at the standard deviation of the
random noise and thus control the amount of spread. The key point is that noise should only be
added during the training stage, whether it is added to the inputs, activations, weights,
gradients, or outputs.
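A minimal NumPy sketch of input noise injection ("jitter") during training; the noise standard deviation and the toy batch are illustrative assumptions.

```python
import numpy as np

def add_jitter(batch, noise_std=0.05, rng=None):
    # Add zero-mean Gaussian noise to the inputs; used only at training time.
    rng = rng or np.random.default_rng()
    return batch + rng.normal(scale=noise_std, size=batch.shape)

rng = np.random.default_rng(0)
x_batch = rng.random((8, 10))                   # 8 training samples, 10 features
noisy_batch = add_jitter(x_batch, noise_std=0.05, rng=rng)
print(np.abs(noisy_batch - x_batch).mean())     # small perturbation, different every epoch
```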

Ensemble methods
Ensemble methods are machine learning techniques that combine several base models in order
to produce one optimal predictive model. To better understand this definition, let's take a step
back to the ultimate goal of machine learning and model building.
I will largely utilize Decision Trees to outline the definition and practicality of Ensemble
Methods (however it is important to note that Ensemble Methods do not only pertain to
Decision Trees).
A Decision Tree determines the predictive value based on a series of questions and conditions.
For instance, consider a simple Decision Tree deciding whether an individual should play
outside or not. The tree takes several weather factors into account and, given each factor, either
makes a decision or asks another question. In this example, every time it is overcast, we will
play outside. However, if it is raining, we must ask whether it is windy or not: if it is windy, we
will not play, but given no wind, tie those shoelaces tight because we're going outside to play.

Decision Trees can also solve quantitative problems with the same format. In a tree of this kind,
we might want to know whether or not to invest in a commercial real estate property: Is it
an office building? A warehouse? An apartment building? Are economic conditions good or
poor? How much will the investment return? These questions are answered
and solved using the decision tree.
When making Decision Trees, there are several factors we must take into consideration: On
what features do we base our decisions? What is the threshold for classifying each question
into a yes or no answer? In the first Decision Tree, what if we wanted to ask ourselves whether
we had friends to play with or not? If we have friends, we will play every time. If not, we might
continue to ask ourselves questions about the weather. By adding an additional question, we
hope to define the Yes and No classes more precisely.
This is where Ensemble Methods come in handy! Rather than just relying on one Decision Tree
and hoping we made the right decision at each split, Ensemble Methods allow us to take a
sample of Decision Trees into account, calculate which features to use or questions to ask at
each split, and make a final predictor based on the aggregated results of the sampled Decision
Trees.
Types of Ensemble Methods
1. BAGGing, or Bootstrap AGGregating. BAGGing gets its name because it
combines Bootstrapping and Aggregation to form one ensemble model. Given a sample
of data, multiple bootstrapped subsamples are pulled. A Decision Tree is formed on
each of the bootstrapped subsamples. After each subsample Decision Tree has been
formed, an algorithm is used to aggregate over the Decision Trees to form the most
efficient predictor. The image below will help explain:

Given a Dataset, bootstrapped subsamples are pulled. A Decision Tree is formed on each
bootstrapped sample. The results of each tree are aggregated to yield the strongest, most
accurate predictor.
2. Random Forest Models. Random Forest models can be thought of as BAGGing with a
slight tweak. When deciding where to split and how to make decisions, BAGGed Decision
Trees have the full set of features to choose from. Therefore, although the bootstrapped
samples may be slightly different, the data will largely break off at the same features
throughout each model. By contrast, Random Forest models decide where to split based on a
random selection of features. Rather than splitting on similar features at each node throughout,
Random Forest models implement a level of differentiation because each tree will split based
on different features. This level of differentiation provides a greater ensemble to aggregate
over, thereby producing a more accurate predictor. Refer to the image description below for a
better understanding.

Similar to BAGGing, bootstrapped subsamples are pulled from a larger dataset. A decision tree
is formed on each subsample. However, each decision tree is split on a different subset of
features (in the original diagram the features are represented by shapes).
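A brief scikit-learn sketch contrasting bagging and random forests (scikit-learn is an assumed dependency; the synthetic dataset and hyperparameters are illustrative only).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: each tree (the default base estimator) sees a bootstrapped sample
# but may choose from all features at every split.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random forest: bootstrapped samples plus a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("bagging:", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```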
In Summary
The goal of any machine learning problem is to find a single model that will best predict our
desired outcome. Rather than making one model and hoping it is the best or most accurate
predictor we can make, ensemble methods take a myriad of models into account and average
those models to produce one final model. It is important to note that Decision Trees are not the
only base models used in ensemble methods, just the most popular and relevant in data science
today.

Dropout in deep neural networks


Dropout refers to units, and the signals they transmit, that are intentionally and randomly
dropped from a neural network during training to improve processing and time to results. A
neural network is software attempting to emulate the actions of the human brain. The human
brain contains billions of neurons that fire electrical and chemical signals to each other to
coordinate thoughts and life functions. A neural network uses a software equivalent of these
neurons, called units. Each unit receives signals from other units and then computes an output
that it passes on to other units, or nodes, in the network.
Why do we need dropout?
The challenge for software-based neural networks is that they must find ways to reduce the
noise of billions of neuron nodes communicating with each other, so the networks' processing
capabilities aren't overrun. To do this, a network eliminates communications transmitted by its
neuron nodes that are not directly related to the problem or training it is working on. The term
for this neuron-node elimination is dropout.

Dropout layers
Like the neurons of the human brain, the units of a neural network randomly process myriad
inputs and then fire off myriad outputs at any given time. The process and outputs of each unit
may be intermediate output firings that are passed to another unit for further processing, long
before an end output or conclusion results. Some of this processing ends up as noise that's an
intermediate output from processing activities but isn't a final output.

When data scientists apply dropout to a neural network, they consider the nature of this random
processing. They make decisions about which data noise to exclude and then apply dropout to
the different layers of a neural network as follows:

• Input layer. This is the top-most layer of artificial intelligence (AI) and machine
learning where the initial raw data is being ingested. Dropout can be applied to this
layer of visible data based on which data is deemed to be irrelevant to the business
problem being worked on.
• Intermediate or hidden layers. These are the layers of processing after data
ingestion. These layers are hidden because we can't exactly see what they do. The
layers, which could be one or many, process data and then pass along intermediate
-- but not final -- results that they send to other neurons for additional processing.
Because much of this intermediate processing will end up as noise, data
scientists use dropout to exclude some of it.
• Output layer. This is the final, visible processing output from all neuron units.
Dropout is not used on this layer.
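A small NumPy sketch of (inverted) dropout applied to a layer's activations during training; the dropout rate is an illustrative choice, and deep learning frameworks provide their own dropout layers for real use.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    # Randomly zero out a fraction `rate` of the units and rescale the rest so the
    # expected activation stays the same (inverted dropout). Do nothing at inference.
    if not training or rate == 0.0:
        return activations
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = rng.random((2, 8))                       # activations of a hidden layer
print(dropout(hidden, rate=0.5, rng=rng))         # roughly half the units are zeroed
print(dropout(hidden, rate=0.5, training=False))  # unchanged at inference time
```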
These images show the different layers of a neural network before and after dropout has been
applied.
