DL Mod1
🡪The sensory neurons carry information from the sensory receptor cells present
throughout the body to the brain.
🡪The motor neurons transmit information from the brain to the muscles.
🡪The interneurons transmit information between different neurons in the body.
• All neurons have three different parts – dendrites, cell body and axon.
• Parts of Neuron
• Following are the different parts of a neuron:
• Dendrites
• These are branch-like structures that receive messages from other neurons and
allow the transmission of messages to the cell body.
• Cell Body
• Each neuron has a cell body with a nucleus, Golgi body, endoplasmic reticulum,
mitochondria and other components.
• Axon
• Axon is a tube-like structure that carries electrical impulse from the cell body to
the axon terminals that pass the impulse to another neuron.
• Synapse
• It is the chemical junction between the terminal of one neuron and the dendrites of
another neuron.
Artificial Neural Network
🡪An artificial neural network (ANN) is a computational network modeled on the biological neural
networks that make up the structure of the human brain.
🡪Just as the human brain has neurons that are interconnected, an artificial neural network
has neurons that are linked to each other across the various layers of the network.
🡪These neurons are known as nodes.
Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Synapse                      Weights
Axon                         Output
• Input Layer:
• As the name suggests, it accepts the inputs supplied to the network, which can arrive in
several different formats.
• Hidden Layer:
• The hidden layer sits between the input and output layers. It performs all the
calculations needed to find hidden features and patterns.
• Output Layer:
• The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
• The artificial neural network takes input and computes the weighted sum of the inputs
and includes a bias. This computation is represented in the form of a transfer function.
• Feedback ANN
🡪In this type of ANN, the output is fed back into the network to reach the best-evolved results.
🡪Feedback networks feed information back into themselves and are well suited to solving
optimization problems.
🡪Internal system error correction makes use of feedback ANNs.
Neural networks are an efficient way to solve machine learning problems and can be used in
various situations. Neural networks offer precision and accuracy. Finding the correct neural
network for each project can increase efficiency.
Recurrent neural networks (RNNs) remember previously learned predictions to help make future
predictions with accuracy.
● Long short term memory network (LSTM) - LSTM adds extra structures, or gates, to an
RNN to improve memory capabilities.
● Echo state network (ESN) - A type of RNN whose hidden layers are sparsely connected.
Convolutional neural networks (CNNs) are a type of feed-forward network that are used for
image analysis and language processing. There are hidden convolutional layers that form
ConvNets and detect patterns. CNNs use features such as edges, shapes, and textures to detect
patterns. Examples of CNNs include:
Generative adversarial networks (GAN) are a type of unsupervised learning where data is
generated from patterns that were discovered from the input data. GANs have two main parts
that compete against one another:
● Generator - creates synthetic data based on what the model learned during training. It takes
random input (noise) and transforms it into a generated sample, such as an image.
● Discriminator - decides whether or not the images produced are fake or genuine.
GANs are used to help predict what the next frame in a video might be, text to image generation,
or image to image translation.
Unlike RNNs, transformer neural networks do not process their input one timestep at a time. This
enables them to take in multiple inputs at once, making them a more efficient way to process data.
• The perceptron is one of the oldest neural networks and among the first to be introduced.
• It was proposed by Frank Rosenblatt in 1958.
• The perceptron is the simplest form of an artificial neural network: a single artificial neuron.
• The perceptron is mainly used to compute logic gates such as AND, OR, and NOR, which
have binary inputs and a binary output.
• The perceptron is a building block of an Artificial Neural Network.
• The perceptron is a linear machine learning algorithm used for supervised learning of
binary classifiers. It enables a neuron to learn by processing the training examples
one at a time.
The main functionality of the perceptron is:
• Take the inputs from the input layer.
• Multiply each input by its weight and sum the results.
• Pass the sum to the activation function to produce the output.
In the first step, multiply each input value by its corresponding weight and then add
the products to obtain the weighted sum. Mathematically, the weighted sum is
∑wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called the bias, b, to this weighted sum to improve the model's performance:
∑wi*xi + b
In the second step, an activation function f is applied to the above weighted sum,
which gives an output either in binary form or as a continuous value:
Y = f(∑wi*xi + b)
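The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the step activation, the function names, and the hand-picked weights and bias for an AND gate are all assumptions made for the example (they are not learned and do not come from the notes).

```python
def step(z):
    """Binary step activation: returns 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def perceptron(x, w, b, f=step):
    """Compute Y = f(sum_i w_i * x_i + b) for a single input vector x."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return f(weighted_sum + b)


# Hand-picked (assumed) parameters that make the perceptron behave like AND.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

# Evaluate all four binary input pairs in the order (0,0), (0,1), (1,0), (1,1).
and_outputs = [
    perceptron([x1, x2], AND_WEIGHTS, AND_BIAS)
    for x1 in (0, 1)
    for x2 in (0, 1)
]
print(and_outputs)
```

Only the input (1, 1) pushes the weighted sum plus bias above zero, so only that input produces 1.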
Types of Perceptron Models
• Based on the layers, Perceptron models are divided into two types. These are as follows:
• Single-layer Perceptron Model
• Multi-layer Perceptron model
• The number of units in each layer is referred to as the dimensionality of that layer.
• Repeat the same procedure for every remaining hidden layer, continuing until the last
set of weights is reached.
• The activation function is the function used to obtain the output of a node. It is also known as
a transfer function.
• The primary role of the activation function is to transform the summed weighted input
from the node into an output value to be fed to the next hidden layer or used as the output.
• It is used in a neural network to determine the output of the network, such as yes or no. It
maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).
• The Activation Functions can be basically divided into 2 types-
1. Linear Activation Function
2. Non-linear Activation Functions
• The main terms needed to understand nonlinear functions are:
• Derivative or differential: the change along the y-axis with respect to the change along
the x-axis. It is also known as the slope.
• Monotonic function: a function that is either entirely non-increasing or
entirely non-decreasing.
1. Sigmoid or Logistic Activation Function
• The sigmoid function curve is S-shaped.
• The main reason we use the sigmoid function is that its output lies in the range (0, 1).
• Therefore, it is especially used for models where we have to predict a probability as the
output.
• Since the probability of anything lies only between 0 and 1, sigmoid is the
right choice.
• The function is differentiable. That means we can find the slope of the sigmoid curve at
any point.
• The function is monotonic, but its derivative is not.
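The properties listed above can be checked numerically. The sketch below assumes the standard logistic form 1 / (1 + e^(-z)) and its well-known derivative s * (1 - s); the sample points are arbitrary and chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Slope of the sigmoid at z, using the identity s'(z) = s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


points = [-4.0, -2.0, 0.0, 2.0, 4.0]
values = [sigmoid(z) for z in points]
slopes = [sigmoid_derivative(z) for z in points]

for z, v, s in zip(points, values, slopes):
    print(f"z={z:+.1f}  sigmoid={v:.4f}  slope={s:.4f}")
```

The values rise steadily with z (monotonic), while the slopes rise to a peak at z = 0 and then fall again, which is why the derivative is not monotonic.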
3. ReLU (Rectified Linear Unit) Activation Function
• The ReLU is half rectified (from the bottom): f(z) is zero when z is less than
zero, and f(z) is equal to z when z is greater than or equal to zero.
• Range: [0, infinity)
• The function and its derivative are both monotonic.
• The issue is that all negative values become zero immediately, which decreases the
ability of the model to fit or train on the data properly.
• That is, any negative input given to the ReLU activation function is turned into
zero immediately, which in turn affects the resulting graph by not mapping
the negative values appropriately.
4. Leaky ReLU
• It is an attempt to solve the dying ReLU problem.
• The leak helps to increase the range of the ReLU function. For negative inputs the output
is a*z, where the slope a is usually 0.01 or so.
• When a is not 0.01, it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity, infinity).
• Both the Leaky and Randomized ReLU functions are monotonic. Their
derivatives are also monotonic.
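A short sketch comparing ReLU with Leaky ReLU on the same inputs may make the difference concrete. The function names and sample inputs are chosen for illustration; the default slope a = 0.01 follows the value mentioned above.

```python
def relu(z):
    """ReLU: zero for negative inputs, identity for non-negative inputs."""
    return max(0.0, z)


def leaky_relu(z, a=0.01):
    """Leaky ReLU: a small slope a for negative inputs instead of a flat zero."""
    return z if z >= 0 else a * z


inputs = [-3.0, -0.5, 0.0, 2.0]
relu_out = [relu(z) for z in inputs]
leaky_out = [leaky_relu(z) for z in inputs]

print(relu_out)
print(leaky_out)
```

ReLU discards every negative input, whereas Leaky ReLU keeps a small negative output, so negative inputs still carry some signal.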
5. Softmax activation function
• The softmax activation function takes in a vector of raw outputs of the neural network
and returns a vector of probability scores.
• In the vector z of raw outputs, the maximum value is 1.23, which on applying softmax
activation maps to 0.664: the largest entry in the softmax output vector. Likewise, 0.25
and -0.8 map to 0.249 and 0.087: the second and the third largest entries in the softmax
output respectively. Thus, applying softmax preserves the relative ordering of scores.
• All entries in the softmax output vector are between 0 and 1.
• In a multiclass classification problem, where the classes are mutually exclusive, notice
how the entries of the softmax output sum up to 1: 0.664 + 0.249 + 0.087 = 1.
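The numbers quoted above can be reproduced with a small softmax sketch. The ordering of the entries in z is an assumption (the notes give the three values but not their positions), and subtracting the maximum before exponentiating is a common numerical-stability step that does not change the result.

```python
import math


def softmax(z):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


z = [1.23, 0.25, -0.8]  # raw outputs from the example; order assumed
probs = softmax(z)
rounded = [round(p, 3) for p in probs]
print(rounded)
```

The largest raw score receives the largest probability, and the entries sum to 1.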
🡪Training is the process by which the model maps the relationship between the training
data and the outputs. During training, the neural network updates its parameters, the weights, wT,
and biases, b, to satisfy the equation above.
🡪Each training input is loaded into the neural network in a process called forward
propagation. Once the model has produced an output, this predicted output is compared
against the given target output, and the resulting error is propagated back through the
network in a process called backpropagation; the parameters of the model are then adjusted
so that it now outputs a result closer to the target output.
• A loss function is a function that compares the target and predicted output values; it
measures how well the neural network models the training data. When training, we aim to
minimize this loss between the predicted and target outputs.
• The parameters are adjusted to minimize the average loss: we find the weights,
wT, and biases, b, that minimize the value of J (average loss).
• Types of Loss Functions
• In supervised learning, there are two main types of loss functions : regression and
classification loss functions
• Regression Loss Functions: used in regression neural networks; given
an input value, the model predicts a corresponding output value (rather
than pre-selected labels). Ex. Mean Squared Error, Mean Absolute Error
• Classification Loss Functions: used in classification neural networks;
given an input, the neural network produces a vector of probabilities of the
input belonging to various pre-set categories, and the category with the
highest probability can then be selected. Ex. Binary
Cross-Entropy, Categorical Cross-Entropy
• Mean Squared Error (MSE)
• One of the most popular loss functions, MSE finds the average of the squared differences
between the target and the predicted outputs
• Mean Absolute Error (MAE)
• MAE finds the average of the absolute differences between the target and the predicted
outputs.
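Both regression losses described above are short enough to write out directly. The sketch below uses made-up target and prediction values purely for illustration.

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    n = len(targets)
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n


def mean_absolute_error(targets, predictions):
    """Average of the absolute differences between targets and predictions."""
    n = len(targets)
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / n


# Illustrative values only.
targets = [3.0, 5.0, 2.0, 7.0]
predictions = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(targets, predictions)   # (1 + 0 + 4 + 1) / 4
mae = mean_absolute_error(targets, predictions)  # (1 + 0 + 2 + 1) / 4
print(mse, mae)
```

The single error of size 2 contributes 4 to the MSE sum but only 2 to the MAE sum, which shows how MSE weights large errors more heavily.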
Practical Issues in Neural Network Training
• I. The Problem of Overfitting
• II. The Vanishing and Exploding Gradient Problems
• III. Difficulties in Convergence
• IV. Local and Spurious Optima
• V. Computational Challenges
• Overfitting during training can be spotted when the error on training data decreases to a
very small value but the error on the new data or test data increases to a large value.
• The error vs iteration graph shows how a deep neural network overfits on training data.
• The blue curve indicates the error on training data & the red curve the error on test data.
• The point where the green line intersects the curves is the instant at which the network
begins to overfit.
• Beyond that point, the error on test data increases sharply while the error on training data
continues to decrease.
• The model/network performs poorly on a new set of data points because it has fit
all the training points very closely, including those that are noise and outliers.
• The error on the training points is minimal or very small, but the error on the new data
points will be high.
• One of the main reasons for the network to overfit is that the size of the training dataset is
small.
• When the network tries to learn from a small dataset, it tends to have greater control
over the dataset and will try to satisfy all the data points exactly.
• In order to understand this point, consider a simple single-layer neural network on a data
set with five attributes, where we use the identity activation to learn a real-valued target
• Consider a situation in which the observed target value is real and is always twice the
value of the first attribute, whereas other attributes are completely unrelated to the target.
However, we have only four training instances, which is one less than the number of
features (free parameters). For example, the training instances could be as follows:
🡪The correct parameter vector in this case is W = [2, 0, 0, 0, 0], based on the known relationship
between the first feature and the target.
🡪The training data also gives zero error with this solution, although the relationship needs to
be learned from the given instances.
🡪However, the problem is that the number of training points is fewer than the number of
parameters, so it is possible to find an infinite number of solutions with zero error.
🡪For example, the parameter set [0, 2, 4, 6, 8] also provides zero error on the training data.
🡪However, if we used this solution on unseen test data, it would likely perform very
poorly, because the learned parameters are spuriously inferred and are unlikely to
generalize to new points in which the target is twice the first attribute (and the other attributes
are random).
🡪As a result, the solution does not generalize well to unseen test data.
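The argument above can be checked with a small sketch. The four training instances below are hypothetical: they were chosen only so that the target is twice the first attribute and so that both parameter vectors mentioned above fit them exactly. The two test points are likewise made up.

```python
def predict(weights, x):
    """Single-layer network with identity activation: a plain dot product."""
    return sum(w * xi for w, xi in zip(weights, x))


def squared_error(weights, data):
    """Sum of squared errors of the linear model over (x, y) pairs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in data)


# Hypothetical training set: five attributes, target = 2 * first attribute.
train = [
    ([1, 1, 0, 0, 0], 2),
    ([2, 0, 1, 0, 0], 4),
    ([3, 0, 0, 1, 0], 6),
    ([4, 0, 0, 0, 1], 8),
]

correct_w = [2, 0, 0, 0, 0]
spurious_w = [0, 2, 4, 6, 8]

train_error_correct = squared_error(correct_w, train)
train_error_spurious = squared_error(spurious_w, train)

# Hypothetical unseen points: same rule for the target, other attributes arbitrary.
test = [([5, 1, 0, 1, 0], 10), ([1, 0, 1, 1, 1], 2)]
test_error_correct = squared_error(correct_w, test)
test_error_spurious = squared_error(spurious_w, test)

print(train_error_correct, train_error_spurious)
print(test_error_correct, test_error_spurious)
```

Both vectors fit the four training points perfectly, but only the correct one keeps zero error on the unseen points; the spurious one fails as soon as the other attributes stop lining up the way they did in training.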
Underfitting happens when the network can model neither the training data nor the test data,
which results in poor overall performance.
As the graph shows, the model does not cover all the data points and has a high error on both
training and test data.
Underfitting can be caused by the limited capacity of the network, a limited
number of features provided as input to the network, noisy data, etc.
• It represents the inability of the model to learn the training data effectively, resulting in poor
performance on both the training and testing data.
• In simple terms, an underfit model’s predictions are inaccurate, especially when applied to new,
unseen examples.
• It mainly happens when we use a very simple model with overly simplified assumptions.
• To address the underfitting problem, we need to use more complex models, with
enhanced feature representation and less regularization.
• Note: An underfitting model has high bias and low variance.
🡪There is no general rule as to how many layers are to be removed or how many neurons must
be in a layer before the network can overfit.
🡪Popular approaches for reducing the network complexity are:
--Grid search can be applied to find the number of neurons and/or layers that reduces or
removes overfitting.
--The overfit model can be pruned (trimmed) by removing nodes or connections until it
reaches suitable performance on test data.
2. Data Augmentation
🡪One of the best strategies to avoid overfitting is to increase the size of the training
dataset.
🡪As discussed, when the size of the training data is small, the network tends to have
greater control over the training data.
🡪But in real-world scenarios gathering large amounts of data is a tedious and
time-consuming task, so collecting new data is often not a viable option.
🡪Data augmentation provides techniques to increase the size of existing training data
without any external addition.
🡪If our training data consists of images, image augmentation techniques such as rotation,
horizontal and vertical flipping, translation, increasing or decreasing the brightness, adding
noise, cutouts, etc. can be applied to the existing training images to increase the number of
training images.
🡪By applying the above-mentioned data augmentation strategies, the network is trained
on multiple instances of the same class of object from different perspectives.
🡪An augmented set for a lion’s photograph will include an instance of the lion
viewed in a rotated manner, an instance of the lion viewed upside down, and an instance in which
the portion of the image that encloses the lion’s mane is cut out.
🡪By applying the last augmentation (cutout), the network learns to associate the feature
that male lions have a mane with its class.
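The augmentation techniques listed above can be sketched on a tiny grayscale "image" stored as a list of rows. This is an illustrative sketch using only the standard library; the helper names, the 3x3 image, and the parameter values are assumptions, and a real pipeline would normally use an image library instead.

```python
import random


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows (upside down)."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamping the result to [0, 1]."""
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in image]


def add_noise(image, rng, sigma=0.05):
    """Add Gaussian noise to every pixel, clamping the result to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def cutout(image, top, left, size):
    """Zero out a size x size square whose top-left corner is (top, left)."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[r]))):
            out[r][c] = 0.0
    return out


rng = random.Random(0)
image = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6],
         [0.7, 0.8, 0.9]]

augmented = [
    flip_horizontal(image),
    flip_vertical(image),
    rotate_90_clockwise(image),
    adjust_brightness(image, 0.2),
    add_noise(image, rng),
    cutout(image, 0, 0, 2),
]
print(len(augmented), "augmented copies from one original image")
```

Each helper returns a new image and leaves the original untouched, so one training image yields several additional training samples.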
3. Weight Regularization
🡪Weight regularization is a technique that aims to stabilize an overfitted network by
penalizing large weight values in the network.
🡪An overfitted network usually has large weight values, so that a
small change in the input can lead to large changes in the output.
🡪For instance, when the network is given new or test data, it produces incorrect
predictions.
🡪Weight regularization penalizes the network’s large weights, forcing the optimization
algorithm to reduce the larger weight values to smaller ones, which stabilizes the
network and leads to good performance.
🡪In weight regularization, the network configuration remains unchanged; only
the values of the weights are modified.
🡪Weight regularization reduces overfitting by adding a penalty or constraint to the
loss function.
🡪Regularization terms are constraints that the optimization algorithm (like Stochastic
Gradient Descent) must adhere to when minimizing the loss function, in addition to minimizing the
error between the predicted value and the actual value.
🡪There are two types of weight regularization, L1 and L2.
🡪The regularized loss has two parts: the first part is the error between the actual target and
the predicted target (loss function).
🡪The second part is the weight penalty or the regularization term.
🡪A regression model that uses the L1 regularization technique is called Lasso Regression,
and a model that uses L2 is called Ridge Regression.
🡪The key difference between these two is the penalty term.
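The two-part structure of a regularized loss can be sketched as follows. This assumes the usual forms, with the L1 penalty as the sum of absolute weights and the L2 penalty as the sum of squared weights, each scaled by a strength `lam`; MSE as the error term, the name `lam`, and the sample numbers are illustrative choices.

```python
def mse(targets, predictions):
    """First part: error between the actual and predicted targets."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def l1_penalty(weights):
    """L1 (Lasso-style) penalty: sum of absolute weight values."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """L2 (Ridge-style) penalty: sum of squared weight values."""
    return sum(w * w for w in weights)


def regularized_loss(targets, predictions, weights, lam, kind="l2"):
    """Loss = error term + lam * weight penalty."""
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return mse(targets, predictions) + lam * penalty


# Illustrative values only.
targets = [1.0, 2.0]
predictions = [1.5, 2.5]
weights = [3.0, -4.0]

loss_l1 = regularized_loss(targets, predictions, weights, lam=0.1, kind="l1")
loss_l2 = regularized_loss(targets, predictions, weights, lam=0.1, kind="l2")
print(loss_l1, loss_l2)
```

With the same error term, the L2 penalty grows much faster for large weights (25 versus 7 here), which is the practical difference in the penalty term.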
4. Dropouts
🡪Dropout is a regularization strategy that prevents deep neural networks from overfitting.
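As a rough illustration of how dropout is commonly applied, the sketch below randomly zeroes activations during training and rescales the survivors ("inverted dropout"). The notes do not spell out this mechanism, so the function name, the 0.5 drop probability, and the rescaling convention are assumptions for the example.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training.

    Kept activations are divided by (1 - drop_prob) so that the expected
    value stays the same. Outside training the input passes through unchanged.
    drop_prob is assumed to be in [0, 1).
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(42)
hidden = [0.5, 1.2, 0.8, 2.0, 0.1, 0.9]

train_out = dropout(hidden, 0.5, rng, training=True)
eval_out = dropout(hidden, 0.5, rng, training=False)
print(train_out)
print(eval_out)
```

Because a different random subset of units is silenced on every pass, the network cannot rely too heavily on any single unit.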
5. Early Stopping
The error vs iteration graph indicates the point after which the network begins to overfit.
The network parameters at the point of early termination are the best fit for the model.
The test error can be decreased beyond the point of early termination by:
🡪Decreasing the learning rate, for example by applying a learning rate scheduler.
🡪Applying a different optimization algorithm.
🡪Applying regularization.
6. Neural Architecture and Parameter Sharing
The most effective way of building a neural network is by constructing the architecture of
the neural network after giving some thought to the underlying data domain.
For example, the successive words in a sentence are often related to one another, whereas
the nearby pixels in an image are typically related.
These types of insights are used to create specialized architectures for text and image
data with fewer parameters.
Furthermore, many of the parameters might be shared. For example, a convolutional
neural network uses the same set of parameters to learn the characteristics of a local block of the
image.
🡪Networks with more layers (i.e., greater depth) tend to require far fewer units per layer
because the composition functions created by successive layers make the neural network
more powerful.
🡪Increased depth is a form of regularization, as the features in later layers are forced to
obey a particular type of structure imposed by the earlier layers.
🡪The number of units in each layer can typically be reduced to such an extent that a deep
network often has far fewer parameters even when added up over the greater
number of layers.
8. Ensemble Methods
🡪 A variety of ensemble methods like bagging are used in order to increase the
generalization power of the model.
🡪These methods are applicable not just to neural networks but to any type of machine
learning algorithm.
🡪 However, in recent years, a number of ensemble methods that are specifically focused
on neural networks have also been proposed.
🡪Two such methods include Dropout and Dropconnect.
🡪These methods can be combined with many neural network architectures to obtain an
additional accuracy improvement of about 2% in many real settings.
🡪However, the precise improvement depends on the type of data and the nature of the
underlying training.
Model Parameters
o Model parameters are configuration variables that are internal to the model, and the model
learns them on its own. Examples include the weights or coefficients of the independent
variables in a linear regression model or in an SVM, the weights and biases of a neural
network, and the cluster centroids in clustering.
Model Hyperparameters:
Hyperparameters are those parameters that are explicitly defined by the user to control the
learning process. Some key points for model parameters are as follows:
Broadly hyperparameters can be divided into two categories, which are given below:
The process of selecting the best hyperparameters to use is known as hyperparameter tuning, and
the tuning process is also known as hyperparameter optimization. Optimization parameters are
used for optimizing the model.
• Learning Rate: The learning rate is the hyperparameter in optimization algorithms that
controls how much the model changes in response to the estimated error each
time the model's weights are updated. It is one of the crucial parameters when
building a neural network, and it also determines the frequency of cross-checking with
model parameters. Selecting a good learning rate is a challenging task: if
the learning rate is too small, training may be slow; on the other
hand, if the learning rate is too large, the model may not be optimized properly.
• Note: Learning rate is a crucial hyperparameter for optimizing the model, so if
there is a requirement of tuning only a single hyperparameter, it is suggested to
tune the learning rate.
• Batch Size: To enhance the speed of the learning process, the training set is divided into
different subsets, which are known as batches. The batch size is the number of training
examples in each batch.
• Number of Epochs: An epoch can be defined as one complete pass through the training data
when training the machine learning model. Epochs represent an iterative learning process. The
number of epochs varies from model to model, and most models are trained for more than one
epoch. To determine the right number of epochs, the validation error is taken into account.
• The number of epochs is increased for as long as the validation error keeps decreasing.
If the validation error shows no improvement over consecutive epochs, this
indicates that the number of epochs should not be increased further.
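The effect of the learning rate described above can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2; the function, the starting point, the step count, and the three learning rates are all chosen purely for illustration.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) starting from w0; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w = w - learning_rate * grad
    return w


w_small = gradient_descent(0.001)  # too small: barely moves from the start
w_good = gradient_descent(0.3)     # reasonable: ends very close to the minimum at 0
w_large = gradient_descent(1.1)    # too large: overshoots further on every step

print(w_small, w_good, w_large)
```

After the same 20 updates, the small rate is still near the starting point, the moderate rate is essentially at the minimum, and the large rate has moved away from it.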
Hyperparameters that are involved in the structure of the model are known as hyperparameters
for specific models. These are given below:
o Number of Hidden Units: Hidden units are part of neural networks; they are the
components that make up the layers of processors between the input and output units in a
neural network.
It is important to specify the number of hidden units for the neural network. As a rule of thumb,
it should be between the size of the input layer and the size of the output layer. More
specifically, the number of hidden units is often taken to be 2/3 of the size of the input layer,
plus the size of the output layer.
For complex functions, more hidden units are needed, but not so many that
the model overfits.
Validation Set
● Training set: The data you will use to train your model. This will be fed into an
algorithm that generates a model. It maps inputs to outputs.
● Validation set: This is smaller than the training set, and is used to evaluate the
performance of models with different hyperparameter values. It's also used to detect
overfitting during the training stages.
● Test set: This set is used to get an idea of the final performance of a model after
hyperparameter tuning. It's also useful to get an idea of how different models (SVMs,
Neural Networks, Random forests...) perform against each other.
🡪The validation and test sets are usually much smaller than the training set.
🡪The validation and test sets are put aside at the beginning of the project and are not
used for training.
The validation set is used to fine-tune the hyperparameters of the model and is considered a
part of the training of the model. The model only sees this data for evaluation but does not learn
from this data
Train vs. Validation vs. Test set
For training and testing purposes of our model, we should have our data broken down into three
distinct dataset splits.
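A three-way split can be sketched with the standard library alone. The 70/15/15 proportions, the function name, and the fixed seed are assumptions for illustration; the notes only state that the validation and test sets are smaller than the training set and are set aside at the start.

```python
import random


def train_val_test_split(data, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the data once, then cut it into train, validation and test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_fraction)
    n_test = int(len(items) * test_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


samples = list(range(100))
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Shuffling before splitting keeps the three sets disjoint while avoiding any ordering effects in the original data.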
Bias is the inability of the model to capture the true relationship, which causes a difference
or error between the model’s predicted value and the actual value.
These differences between the actual or expected values and the predicted values are known as
error, bias error, or error due to bias.
Bias is a systematic error that occurs due to wrong assumptions in the machine learning
process.
Let Y be the true value of a parameter, and let Y’ be an estimator of Y based on a sample of
data. Then, the bias of the estimator Y’ is given by:
• Bias(Y’) = E(Y’) - Y
• where E(Y’) is the expected value of the estimator Y’. Bias measures
how well the model fits the data.
Variance is the measure of the spread of data around its mean.
In machine learning, variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data.
More specifically, variance describes how sensitive the model is to
a different subset of the training dataset, i.e. how much it adjusts to the new subset of the
training dataset.
Let Y be the actual values of the target variable, and Y’ be the predicted values of the
target variable.
Then the variance of a model can be measured as the expected value of the square of the
difference between predicted values and the expected value of the predicted values.
Variance = E[(Y’ - E[ Y’])^2]
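The bias and variance formulas can be estimated from a handful of predictions, with the expected value approximated by a sample mean. In the sketch below the true value and the five estimates are made-up numbers, imagined as predictions of the same quantity by models trained on different subsets of the data.

```python
def mean(values):
    """Sample mean, used here as an estimate of the expected value E[.]."""
    return sum(values) / len(values)


def bias(estimates, true_value):
    """Bias(Y') = E(Y') - Y."""
    return mean(estimates) - true_value


def variance(estimates):
    """Variance = E[(Y' - E[Y'])^2]."""
    expected = mean(estimates)
    return mean([(e - expected) ** 2 for e in estimates])


# Illustrative values only.
true_value = 10.0
estimates = [11.0, 12.0, 13.0, 12.0, 12.0]

model_bias = bias(estimates, true_value)
model_variance = variance(estimates)
print(model_bias, model_variance)
```

Here the estimates cluster tightly (low variance) but around the wrong value (non-zero bias), which is the pattern associated with underfitting.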
Ways to Reduce Variance in Machine Learning:
Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify if a model is overfitting or underfitting and can be used to tune
hyperparameters to reduce variance.
Feature selection: Choosing only the relevant features decreases the model’s
complexity, which can reduce the variance error.
Regularization: We can use L1 or L2 regularization to reduce variance in machine
learning models.
Ensemble methods: These combine multiple models to improve generalization
performance. Bagging, boosting, and stacking are common ensemble methods that can help
reduce variance.
Simplifying the model: Reducing the complexity of the model, such as decreasing the
number of parameters or layers in a neural network, can also help reduce variance and improve
generalization performance.
Early stopping: Early stopping is a technique used to prevent overfitting by stopping the
training of the deep learning model when the performance on the validation set stops improving.
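Early stopping can be sketched as a small loop over per-epoch validation errors. The `patience` parameter (how many non-improving epochs to tolerate), the function name, and the error values are assumptions for illustration; a real training loop would compute the validation error after each epoch rather than read it from a list.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; best_epoch is where the error was lowest.
    """
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error = error
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1


# Illustrative validation errors: improving at first, then getting worse.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping_epoch(val_errors, patience=2)
print(best_epoch, stop_epoch)
```

The parameters saved at `best_epoch` are the ones to keep, since later epochs only made the validation error worse.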
Deep Learning
Deep learning is a method in artificial intelligence (AI) that teaches computers to process
data in a way that is inspired by the human brain. Deep learning models can recognize complex
patterns in pictures, text, sounds, and other data to produce accurate insights and predictions.
Machine learning: takes less time to train the model. Deep learning: takes more time to train the model.
The neural network can compare the outputs of its nodes with the desired values using a property
known as the delta rule, allowing the network to alter its weights through training to create more
accurate output values. This training and learning procedure results in gradient descent. The
technique of updating weights in multi-layered perceptrons is virtually the same, however, the
process is referred to as back-propagation. In such circumstances, the output values provided by
the final layer are used to alter each hidden layer inside the network.
A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle. The feed forward model is the simplest form of neural
network as information is only processed in one direction. While the data may pass through
multiple hidden nodes, it always moves in one direction and never backwards.
The structure of a deep feed-forward (DFF) network is very similar to that of a feed-forward (FF)
network. The major difference between them is the number of hidden layers. Currently, people
refer to a Neural Network with one hidden layer as a “shallow” network or simply a
Feed-Forward network.
Feedforward neural networks perform well when solving basic problems like identifying
simple patterns or classifying information. However, they will struggle with more complex tasks.
On the other hand, deep learning algorithms can process and analyze vast data volumes due to
several hidden layers of abstraction.