1) Stochastic Gradient Descent
Ans: Stochastic Gradient Descent (SGD):
The word 'stochastic' describes a system or a process that is linked with a random
probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly
instead of the whole data set for each iteration.
In Gradient Descent, the term "batch" denotes the total number of samples from a dataset
that are used to calculate the gradient in each iteration. In typical Gradient Descent
optimization, such as Batch Gradient Descent, the batch is taken to be the whole dataset.
Using the whole dataset is useful for reaching the minimum in a less noisy and less random
manner, but a problem arises when the dataset gets big.
Suppose you have a million samples in your dataset. With a typical Gradient Descent
optimization technique, you have to use all one million samples to complete one
iteration, and this has to be done for every iteration until the minimum is reached.
Hence, it becomes computationally very expensive to perform.
This problem is solved by Stochastic Gradient Descent. SGD uses only a single sample,
i.e., a batch size of one, to perform each iteration. The samples are randomly shuffled,
and one is selected for each iteration.
SGD algorithm:
So, in SGD, we compute the gradient of the cost function for a single example at each iteration
instead of the sum of the gradients of the cost function over all the examples.
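As a sketch of that difference, the two update steps can be written side by side. The notation is an assumption, since none of these symbols appear in the notes: θ for the parameters, η for the learning rate, J_i for the cost on example i, and N for the number of examples. The 1/N factor in the batch step is sometimes folded into η.

```latex
\text{Batch GD:}\quad \theta \leftarrow \theta - \eta \cdot \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta J_i(\theta)
\qquad\qquad
\text{SGD:}\quad \theta \leftarrow \theta - \eta \, \nabla_\theta J_i(\theta), \quad i \text{ chosen at random}
```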
In SGD, since only one sample from the dataset is chosen at random for each iteration,
the path taken by the algorithm to reach the minimum is usually noisier than that of
typical Gradient Descent. That matters little, as long as we reach the minimum with a
significantly shorter training time.
The path taken by Batch Gradient Descent:
The path taken by Stochastic Gradient Descent:
One thing to note is that, because SGD is generally noisier than typical Gradient Descent, it
usually takes a higher number of iterations to reach the minimum, owing to the randomness
in its descent.
Even though it requires a higher number of iterations to reach the minima than typical
Gradient Descent, it is still computationally much less expensive than typical Gradient
Descent. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for
optimizing a learning algorithm.
The function f may have an infimum but no minimizer. Take for example f(x) = 1/|x|, which
approaches 0 as |x| grows but never attains it. Practically speaking, we want to show that as
we iterate through the algorithm (i→∞), the value of f at the iterates approaches that infimum:
$f(x_i) \to \inf_x f(x)$ as $i \to \infty$.
Batch Gradient Descent is great for convex or relatively smooth error manifolds. In this case, we
move somewhat directly towards an optimum solution.
Cost vs Epochs
The graph of cost vs epochs is also quite smooth because we are averaging over all the
gradients of training data for a single step. The cost keeps on decreasing over the epochs.
Stochastic Gradient Descent
In Batch Gradient Descent we considered all the examples for every step of Gradient
Descent. But what if our dataset is very large? Deep learning models crave data: the more
data, the better the chances of a good model. Suppose our dataset has 5 million examples;
then just to take one step the model has to calculate the gradients of all 5 million
examples. This is not an efficient way. To tackle this problem we have Stochastic
Gradient Descent. In Stochastic Gradient Descent (SGD), we consider just one example at a time
to take a single step.
We do the following steps in one epoch for SGD:
1. Take an example
2. Feed it to the Neural Network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
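The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the notes' own implementation: a linear model with squared error stands in for the neural network, and the names `sgd_epoch` and `lr` and the synthetic data are all hypothetical.

```python
import numpy as np

def sgd_epoch(X, y, w, b, lr=0.1, rng=None):
    """One epoch of SGD for a linear model with squared error (assumed stand-in model)."""
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(len(X)):   # step 1: take one example, in shuffled order
        pred = X[i] @ w + b             # step 2: feed it to the model
        err = pred - y[i]
        grad_w = err * X[i]             # step 3: gradient of 0.5 * err**2 w.r.t. w
        grad_b = err                    #         and w.r.t. b
        w = w - lr * grad_w             # step 4: update the weights
        b = b - lr * grad_b
    return w, b                         # step 5: the loop has visited every example

# Synthetic, noise-free data generated from y = 3*x + 2
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0

w, b = np.zeros(1), 0.0
for epoch in range(50):
    w, b = sgd_epoch(X, y, w, b, lr=0.1, rng=rng)
print(w, b)   # should end up close to 3 and 2
```

Each update uses a single example, so the weights change 200 times per epoch here, compared with once per epoch for Batch Gradient Descent.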
Since we are considering just one example at a time the cost will fluctuate over the training
examples and it will not necessarily decrease. But in the long run, you will see the cost
decreasing with fluctuations.
Cost vs Epochs in SGD
Also, because the cost fluctuates so much, with a fixed learning rate it will not settle exactly
at the minimum but will keep dancing around it.
SGD can be used for larger datasets. It converges faster when the dataset is large because it
updates the parameters more frequently.
4) Mini Batch Gradient Descent
Ans: We have seen Batch Gradient Descent. We have also seen Stochastic Gradient
Descent. Batch Gradient Descent can be used for smoother curves. SGD can be used when the
dataset is large. Batch Gradient Descent converges directly to the minimum. SGD converges
faster for larger datasets. But since in SGD we use only one example at a time, we cannot use
a vectorized implementation. This can slow down the computations. To tackle this problem,
a mixture of Batch Gradient Descent and SGD is used.
We use neither the whole dataset at once nor a single example at a time. We use a
batch of a fixed number of training examples, smaller than the full dataset, and call it a
mini-batch. Doing this helps us achieve the advantages of both the former variants we saw. So,
after creating the mini-batches of fixed size, we do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to the Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for the mini-batches we created
Just like SGD, the cost in mini-batch gradient descent fluctuates over the epochs
because we are averaging over a small number of examples at a time.
Cost vs no of mini-batch
So, with mini-batch gradient descent we update our parameters frequently and can also
use a vectorized implementation for faster computations.
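A minimal sketch of one mini-batch epoch, again assuming a linear model with squared error in place of the neural network. The function name `minibatch_epoch`, the batch size of 32, and the synthetic data are illustrative assumptions. Note how the whole mini-batch goes through the model in one vectorized matrix product.

```python
import numpy as np

def minibatch_epoch(X, y, w, b, batch_size=32, lr=0.1, rng=None):
    """One epoch of mini-batch gradient descent for a linear model (assumed stand-in)."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # pick a mini-batch
        err = X[idx] @ w + b - y[idx]           # vectorized forward pass on the batch
        grad_w = X[idx].T @ err / len(idx)      # mean gradient over the mini-batch
        grad_b = err.mean()
        w = w - lr * grad_w                     # update with the mean gradient
        b = b - lr * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 1.5*x1 - 2*x2 + 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

w, b = np.zeros(2), 0.0
for epoch in range(100):
    w, b = minibatch_epoch(X, y, w, b, batch_size=32, lr=0.1, rng=rng)
print(w, b)   # should end up close to [1.5, -2.0] and 0.5
```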
1) Normal Equation in linear regression.
Normal Equation is an analytical approach to Linear Regression with a Least Squares cost
function. We can directly find the value of θ without using Gradient Descent. Following this
approach is an effective and time-saving option when we are working with a dataset with a small
number of features.
The Normal Equation is as follows:
$$\theta = (X^T X)^{-1} X^T y$$
where X is the matrix of input features and y is the vector of output values.
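A short sketch of the Normal Equation in NumPy, using the usual closed form θ = (XᵀX)⁻¹Xᵀy. The function name `normal_equation` and the toy data are assumptions for illustration. The code solves the linear system instead of computing the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

def normal_equation(X, y):
    """Least-squares parameters theta, with the intercept as theta[0]."""
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend a column of ones for the intercept
    # Solve (Xb^T Xb) theta = Xb^T y rather than forming the inverse explicitly
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])    # generated exactly from y = 1 + 2*x
theta = normal_equation(X, y)
print(theta)                           # approximately [1. 2.]
```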
However, if we scale the features, i.e. divide x1 by 2000 and x2 by 5, and then plot the
cost function, the contours may look much more like circles. This gives gradient descent a
more direct path than the very complicated trajectory it had on the initial plot of
the cost function with unscaled features.
The mean normalization method can be used as an alternative form of feature scaling. In this
method we scale the size as (size - mean size) divided by the range of the size.
Example: In the previous case x1 (size of the house) can be scaled as x1 = (size - 1000)/2000.
This changes the range of x1 to: -0.5 < x1 < +0.5
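The mean normalization rule above can be sketched as a one-line function. The house sizes below are hypothetical values chosen so that the mean is 1000 and the range is 2000, matching the example.

```python
import numpy as np

def mean_normalize(x):
    """Scale a feature as (x - mean) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

sizes = np.array([0.0, 500.0, 1000.0, 1500.0, 2000.0])   # hypothetical house sizes
scaled = mean_normalize(sizes)
print(scaled)   # every value lies between -0.5 and +0.5
```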
This is how feature scaling and mean normalization can be used to speed up the
gradient descent process, by keeping the values of the input variables more or less in the same
range.
1) Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms which are used both for classification and regression.
Working of SVM
An SVM model is basically a representation of different classes in a hyperplane in
multidimensional space. The hyperplane will be generated in an iterative manner by SVM so that
the error can be minimized. The goal of SVM is to divide the datasets into classes to find a
maximum marginal hyperplane (MMH).
Working of SVM
The following are important concepts in SVM:
Support Vectors – Data points that are closest to the hyperplane are called support vectors.
The separating line is defined with the help of these data points.
Hyperplane – As we can see in the above diagram, it is a decision plane or space that
divides a set of objects having different classes.
Margin – It may be defined as the gap between the two lines on the closest data points of
different classes. It can be calculated as the perpendicular distance from the line to the
support vectors. A large margin is considered a good margin and a small margin a bad margin.
The main goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH), and it can be done in the following two steps:
First, SVM will iteratively generate hyperplanes that segregate the classes in the best way.
Then, it will choose the hyperplane that separates the classes correctly.
SVM Kernels
In practice, the SVM algorithm is implemented with a kernel that transforms the input data
space into the required form. SVM uses a technique called the kernel trick, in which the kernel
takes a low-dimensional input space and transforms it into a higher-dimensional space. In simple
words, the kernel converts non-separable problems into separable problems by adding more
dimensions. It makes SVM more powerful, flexible and accurate. The following are some of the
types of kernels used by SVM.
Linear Kernel
It can be used as a dot product between any two observations. The formula of the linear kernel
is as below:
$$K(x, x_i) = \sum_j x_j \, x_{ij}$$
From the above formula, we can see that the product between two vectors x and xi is the sum
of the multiplication of each pair of input values.
Polynomial Kernel
It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input
space. Following is the formula for the polynomial kernel:
Here d is the degree of the polynomial, which we need to specify manually in the learning algorithm.
Radial Basis Function (RBF) Kernel
RBF kernel, mostly used in SVM classification, maps the input space into an infinite-dimensional
space. The following formula explains it mathematically:
Here, gamma is a positive parameter, typically chosen between 0 and 1. We need to specify it
manually in the learning algorithm. A good default value of gamma is 0.1.
As we implemented SVM for linearly separable data, we can implement it in Python for the data
that is not linearly separable. It can be done by using kernels.
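The three kernels described above can be sketched in plain NumPy. The exact formulas are missing from these notes, so the forms below are the commonly used ones and should be read as assumptions: the dot product for the linear kernel, (c + x·xi)^d for the polynomial kernel with a hypothetical offset `c`, and exp(-gamma * ||x - xi||²) for the RBF kernel.

```python
import numpy as np

def linear_kernel(x, xi):
    """Sum of the products of each pair of input values (the dot product)."""
    return float(np.dot(x, xi))

def polynomial_kernel(x, xi, d=2, c=1.0):
    """(c + x . xi)^d; with d=1 and c=0 this reduces to the linear kernel."""
    return float((c + np.dot(x, xi)) ** d)

def rbf_kernel(x, xi, gamma=0.1):
    """exp(-gamma * squared Euclidean distance between x and xi)."""
    diff = np.asarray(x, dtype=float) - np.asarray(xi, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = np.array([1.0, 2.0])
xi = np.array([3.0, 4.0])
print(linear_kernel(x, xi))       # 1*3 + 2*4 = 11.0
print(polynomial_kernel(x, xi))   # (1 + 11)^2 = 144.0
print(rbf_kernel(x, xi))          # exp(-0.1 * 8), since ||x - xi||^2 = 8
```

The RBF kernel equals 1 when the two points coincide and decays toward 0 as they move apart; a larger gamma makes that decay faster.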
2) Support Vector.
Support vectors are data points that are closer to the hyperplane and influence the position and
orientation of the hyperplane. Using these support vectors, we maximize the margin of the
classifier. Deleting the support vectors will change the position of the hyperplane. These are the
points that help us build our SVM.
3) Non linear regression handling via SVM.
Ans: SVM addresses non-linearly separable cases by introducing two concepts: Soft
Margin and Kernel Tricks.
Soft Margin: try to find a line to separate the classes, but tolerate one or a few
misclassified dots (e.g. the dots circled in red)
Kernel Trick: try to find a non-linear decision boundary
Soft Margin
Two types of misclassifications are tolerated by SVM under soft margin:
1. The dot is on the wrong side of the decision boundary but on the
correct side of, or on, the margin (shown on the left)
2. The dot is on the wrong side of the decision boundary and on the
wrong side of the margin (shown on the right)
Applying Soft Margin, SVM tolerates a few misclassified dots and tries to
balance the trade-off between finding a line that maximizes the margin and
one that minimizes the misclassification.
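One common way to express this trade-off, given here as an assumption since the notes do not state a formula, is the soft-margin objective 0.5·||w||² + C·Σ max(0, 1 - y(w·x + b)). The first term rewards a wide margin, and the second (the hinge loss) penalizes dots that fall inside the margin or on the wrong side. The names `soft_margin_objective` and `C` and the toy points are illustrative.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """0.5*||w||^2 plus C times the total hinge loss (assumed standard formulation)."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # zero for dots safely outside the margin
    return float(0.5 * np.dot(w, w) + C * hinge.sum())

# Four toy points on a line, with labels +1 / -1
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0], [-0.2, 0.0]])
y = np.array([1.0, -1.0, 1.0, 1.0])
w, b = np.array([1.0, 0.0]), 0.0

# Points 1 and 2 are outside the margin (penalty 0), point 3 is inside the
# margin (penalty 0.5), point 4 is misclassified (penalty 1.2).
print(soft_margin_objective(w, b, X, y, C=1.0))    # 0.5 + 1.0 * 1.7
print(soft_margin_objective(w, b, X, y, C=10.0))   # 0.5 + 10.0 * 1.7
```

A larger C makes misclassification more expensive, so the optimizer tolerates fewer misclassified dots at the cost of a narrower margin.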
Degree of tolerance
How much tolerance (softness) we allow when finding the decision boundary is a
hyperparameter of the SVM.
SVM in linearly separable cases
Obviously, infinitely many lines exist that separate the red and green dots in the example above.
SVM needs to find the optimal line under the constraint of correctly classifying both classes:
Follow the constraint: only look into the separating hyperplanes (e.g. separating lines), i.e.
hyperplanes that classify the classes correctly.
Conduct optimization: pick the one that maximizes the margin.
4) SVM formulation for separable training data.
Ans: Infinitely many lines exist that separate the red and green dots in the example above.
SVM needs to find the optimal line under the constraint of correctly classifying both classes.
Follow the constraint: only look into the separating hyperplanes (e.g. separating lines), i.e.
hyperplanes that classify the classes correctly.
Conduct optimization: pick the one that maximizes the margin.
A hyperplane is an (n minus 1)-dimensional subspace of an n-dimensional space. For a
2-dimensional space, its hyperplane is 1-dimensional, which is just a line. For a 3-dimensional
space, its hyperplane is 2-dimensional, which is a plane that slices the cube. Okay, you get
the idea.
Assuming the label y is either 1 (for green) or -1 (for red), all three lines below are
separating hyperplanes, because they all share the same property: above the line is green;
below the line is red.
What is margin?
The distance from either dashed line to the solid line is the margin. We can think
of this optimal line as the mid-line of the widest stretch we can possibly have between the red
and green dots.
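The two steps (follow the constraint, then maximize the margin) are usually combined into a single optimization problem. The form below is the standard hard-margin formulation and is stated here as an assumption, with w the normal vector of the hyperplane, b its offset, and (x_i, y_i) the training points with labels y_i in {+1, -1}. Under this scaling the distance from the solid line to each dashed line is 1/||w||, so minimizing ||w|| maximizes the margin.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 \quad \text{for every training point } i
```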
One of the commonly used loss functions for classification is cross-entropy loss.
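Binary cross-entropy is a special case of categorical cross-entropy where there are only two
classes, so a single output suffices. For example, classification between red and blue.
To better understand it, let’s suppose there is only a single output variable Y.
The error in binary classification is calculated as the mean of the cross-entropy over all N
training examples, which means:
$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$
where $y_i$ is the true label (0 or 1) and $\hat{y}_i$ is the predicted probability for example i.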
Categorical cross-entropy is designed for multi-class classification, with the target values
ranging over the class labels 0, 1, 2, …, n.
In a multi-class classification problem, cross-entropy generates a score that summarizes the
mean difference between the actual and predicted probability distributions.
For a perfect model the cross-entropy is zero, which is the minimum of the score.
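As a small illustration, binary cross-entropy can be computed as the mean of -[y·log(p) + (1 - y)·log(1 - p)] over the examples. The function name `binary_cross_entropy`, the clipping constant `eps`, and the sample predictions are assumptions for this sketch.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() never receives 0
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

y_true = [1, 0, 1, 0]
good = binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.2])     # confident and correct
bad = binary_cross_entropy(y_true, [0.1, 0.9, 0.2, 0.8])      # confident and wrong
perfect = binary_cross_entropy(y_true, [1.0, 0.0, 1.0, 0.0])  # exact predictions
print(good, bad, perfect)   # good is much smaller than bad; perfect is essentially zero
```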
What are Artificial Neural Networks?
A neural network is a group of connected I/O units where each connection has an associated
weight. It helps you to build predictive models from large databases. This model is inspired
by the human nervous system. It helps you with tasks such as image understanding, human
learning, computer speech, etc.
What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the gradient
of a loss function with respect to all the weights in the network.
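To make the idea concrete, here is a minimal backpropagation sketch for a tiny network with one sigmoid hidden layer, a linear output, and squared-error loss. The architecture, the names (`forward`, `backprop`, `params`) and the step size 0.01 are assumptions chosen for illustration. The last lines compare one backpropagated gradient entry against a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)     # hidden-layer activations
    y_hat = W2 @ h + b2          # linear output layer
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * float(np.sum((y_hat - y) ** 2))

def backprop(params, x, y):
    """Gradient of the loss with respect to every weight, via the chain rule."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    d_out = y_hat - y                           # error at the output
    d_hidden = (W2.T @ d_out) * h * (1.0 - h)   # error propagated back through the sigmoid
    return [np.outer(d_hidden, x), d_hidden, np.outer(d_out, h), d_out]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1)]
x, y = np.array([0.5, -1.0]), np.array([0.3])

grads = backprop(params, x, y)
loss_before = loss(params, x, y)
params_after = [p - 0.01 * g for p, g in zip(params, grads)]   # one small gradient step
loss_after = loss(params_after, x, y)

# Finite-difference check of a single entry of the W1 gradient
eps = 1e-6
bumped_up = [p.copy() for p in params]
bumped_up[0][0, 0] += eps
bumped_down = [p.copy() for p in params]
bumped_down[0][0, 0] -= eps
numeric_grad = (loss(bumped_up, x, y) - loss(bumped_down, x, y)) / (2 * eps)
print(grads[0][0, 0], numeric_grad)   # the two values should agree closely
print(loss_before, loss_after)        # the loss should drop after the step
```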
The two types of backpropagation networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input to a static
output. It is useful for solving static classification issues like optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, the signal is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between these two methods is that the mapping is rapid in static
back-propagation while it is nonstatic in recurrent backpropagation.
Here, X1 and X2 are inputs to the artificial neurons, f(X) represents the processing done on the
inputs and y represents the output of the neuron.
Weight Initialization Techniques for Deep Neural Networks
While building and training neural networks, it is crucial to initialize the weights appropriately
to ensure a model with high accuracy. If the weights are not correctly initialized, it may give rise
to the Vanishing Gradient problem or the Exploding Gradient problem.
Weight Initialization Techniques
1. Zero Initialization
As the name suggests, in zero initialization all the weights are assigned zero as the initial
value. This kind of initialization is highly ineffective as neurons learn the same feature
during each iteration. In fact, the same issue occurs with any kind of constant initialization.
Thus, constant initializations are not preferred.
2. Random Initialization
In an attempt to overcome the shortcomings of Zero or Constant Initialization, random
initialization assigns random non-zero values as weights to neuron paths. However, when
values are assigned randomly to the weights, problems such as Overfitting, the Vanishing
Gradient Problem, or the Exploding Gradient Problem might occur.
a) Random Normal: The weights are initialized from values in a normal distribution.
b) Random Uniform: The weights are initialized from values in a uniform distribution.
3. Xavier/Glorot Initialization
In Xavier/Glorot weight initialization, the weights are assigned from values of a uniform
distribution as follows:
Xavier/Glorot Initialization, often termed Xavier Uniform Initialization, is suitable for layers
where the activation function used is Sigmoid.
4. Normalized Xavier/Glorot Initialization
In Normalized Xavier/Glorot weight initialization, the weights are assigned from values of a
normal distribution as follows:
Normalized Xavier/Glorot Initialization, too, is suitable for layers where the activation
function used is Sigmoid.
5. He Uniform Initialization
In He Uniform weight initialization, the weights are assigned from values of a uniform
distribution as follows:
He Uniform Initialization is suitable for layers where ReLU activation function is used.
6. He Normal Initialization
In He Normal weight initialization, the weights are assigned from values of a normal distribution
as follows.
He Normal Initialization, too, is suitable for layers where ReLU activation function is used.
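The formulas for these initializers did not survive in the notes, so the sketch below uses the commonly cited forms as an assumption: Xavier/Glorot scales by fan_in + fan_out, He scales by fan_in only, and the uniform variants use a limit of sqrt(6 / ...) while the normal variants use a standard deviation of sqrt(2 / ...). The function names and the (fan_in, fan_out) weight shape are illustrative choices.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def xavier_normal(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out, rng):
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_sigmoid_layer = xavier_uniform(256, 128, rng)   # e.g. for a Sigmoid layer
W_relu_layer = he_normal(256, 128, rng)           # e.g. for a ReLU layer
print(np.abs(W_sigmoid_layer).max(), W_relu_layer.std())
```

In both families the spread of the weights shrinks as the layer gets wider, which is what keeps the scale of activations and gradients from vanishing or exploding across layers.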
6) Architecture of ANN.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
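The weighted-sum-plus-bias computation described above can be sketched as a small forward pass through the three layers. The layer sizes, the weight values, and the choice of a sigmoid in the hidden layer are assumptions made only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(x, W, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(W @ x + b)

x = np.array([1.0, 2.0])                     # input layer: two features
W_hidden = np.array([[0.5, -0.5],
                     [1.0, 1.0],
                     [0.0, 0.0]])             # hidden layer: three neurons
b_hidden = np.zeros(3)
W_out = np.array([[1.0, 1.0, 1.0]])          # output layer: one neuron
b_out = np.array([0.0])

hidden = dense(x, W_hidden, b_hidden, sigmoid)       # hidden features
output = dense(hidden, W_out, b_out, lambda z: z)    # identity activation at the output
print(hidden, output)
```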
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters.
The value of k should be predetermined in this algorithm.
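Assuming the algorithm described here is k-means, a minimal sketch looks like this. The function name `kmeans`, the optional `init` argument, and the toy points are hypothetical. Initial centroids are passed explicitly in the example because k-means can settle into a poor clustering from an unlucky random start.

```python
import numpy as np

def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    if init is None:
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                      # assignments have stopped changing
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2, init=X[[0, 3]])   # one starting centroid per group
print(labels)      # the first three points share one label, the last three the other
print(centroids)
```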
PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data.
PCA works by considering the variance of each attribute, because high variance
indicates a good split between the classes, and hence it reduces the
dimensionality.
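A short sketch of PCA via the singular value decomposition of the centered data. The function name `pca` and the synthetic data are assumptions. The example data lie almost on a line in 2-D, so a single component captures nearly all of the variance.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal directions (highest variance first)."""
    Xc = X - X.mean(axis=0)                        # center each attribute
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (S ** 2) / (len(X) - 1)
    return Xc @ components.T, components, explained_variance[:n_components]

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200)])   # nearly 1-D data in 2-D

Z, components, var = pca(X, n_components=1)
print(Z.shape)         # (200, 1): each point is now described by one number
print(components[0])   # roughly parallel to (1, 2), up to sign
```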
Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of those
predictions, predicts the final output.
A greater number of trees in the forest generally leads to higher accuracy
and helps prevent the problem of overfitting.
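The majority-vote step can be sketched without any tree-building code. The three "trees" below are hypothetical threshold rules standing in for trained decision trees, and the function names are illustrative.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, x):
    """Ask every tree for a prediction, then take the majority vote."""
    return majority_vote([tree(x) for tree in trees])

# Three hypothetical "trees", each a simple threshold rule on the features
trees = [
    lambda x: "apple" if x[0] > 0.5 else "banana",
    lambda x: "apple" if x[1] > 0.3 else "banana",
    lambda x: "apple" if x[0] + x[1] > 1.5 else "banana",
]

# Two trees say "apple" and one says "banana", so the forest says "apple"
print(forest_predict(trees, (0.8, 0.4)))
```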
The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset,
it is possible that some decision trees may predict the correct output, while others
may not. But together, all the trees predict the correct output. Therefore, below are
two assumptions for a better Random forest classifier:
There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
The Working process can be explained in the below steps and diagram:
Banking: The banking sector mostly uses this algorithm to identify loan risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
Land Use: We can identify the areas of similar land use by this algorithm.
Marketing: Marketing trends can be identified using this algorithm.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped distributions are formed as long as the dense regions can
be connected. The algorithm does this by identifying different clusters in the dataset
and connecting the areas of high density into clusters. The dense areas in data space
are separated from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.
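The notes do not name a specific density-based algorithm, so the sketch below assumes a DBSCAN-style procedure: a point with at least `min_pts` neighbours within radius `eps` is treated as dense, clusters grow by connecting dense points, and points in sparse regions are left as noise (label -1). All names and parameter values are illustrative.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN-style clustering; returns one label per point, -1 meaning noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]   # includes the point itself
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                    # already clustered, or not in a dense area
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:   # only dense points extend the cluster
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # dense group
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # second dense group
              [10.0, 0.0]])                                     # isolated point
labels = dbscan(X, eps=0.2, min_pts=3)
print(labels)   # two clusters, with the isolated point marked -1
```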
In distance-based clustering, a distance metric is used to determine the similarity
between data objects.
The distance metric can be used to cluster observations by considering the distances
between objects directly or by considering distances between objects and cluster
centroids (or some other cluster representative points).
Most distance metrics, and hence the distance-based clustering methods, work
either with continuous-only or categorical-only data. In applications, however,
observations are often described by a combination of both continuous and
categorical variables.
Such data sets can be referred to as mixed or mixed-type data. In this review, we
consider different methods for distance-based cluster analysis of mixed data.
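As one illustration of a distance for mixed data, the sketch below uses a Gower-style measure. This is an assumption, since the notes do not specify a metric: continuous variables contribute a range-scaled absolute difference, categorical variables contribute 0 for a match and 1 for a mismatch, and the contributions are averaged. The records and ranges are hypothetical.

```python
def mixed_distance(a, b, num_idx, cat_idx, num_ranges):
    """Gower-style distance between two records with continuous and categorical fields."""
    parts = []
    for k, j in enumerate(num_idx):
        parts.append(abs(a[j] - b[j]) / num_ranges[k])   # continuous: scaled difference
    for j in cat_idx:
        parts.append(0.0 if a[j] == b[j] else 1.0)       # categorical: simple mismatch
    return sum(parts) / len(parts)

# Hypothetical records: (age, income, city)
p = (30, 50000, "Paris")
q = (40, 70000, "Paris")
r = (30, 50000, "Rome")
ranges = [50, 100000]   # assumed ranges of age and income in the dataset

d_pq = mixed_distance(p, q, num_idx=[0, 1], cat_idx=[2], num_ranges=ranges)
d_pr = mixed_distance(p, r, num_idx=[0, 1], cat_idx=[2], num_ranges=ranges)
print(d_pq, d_pr)   # (0.2 + 0.2 + 0) / 3 and (0 + 0 + 1) / 3
```

A distance like this can be plugged into any clustering method that only needs pairwise distances between observations.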