Document 2

Artificial Intelligence and Machine Learning
(Ch-10)
1) Stochastic Gradient Descent .
Ans: Stochastic Gradient Descent (SGD):
The word ‘stochastic‘ means a system or a process that is linked with a random
probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly
instead of the whole data set for each iteration.
In Gradient Descent, there is a term called “batch” which denotes the total number of
samples from a dataset that is used for calculating the gradient for each iteration. In
typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to
be the whole dataset. Although, using the whole dataset is really useful for getting to
the minima in a less noisy and less random manner, but the problem arises when our
datasets get big.
Suppose, you have a million samples in your dataset, so if you use a typical Gradient
Descent optimization technique, you will have to use all of the one million samples for
completing one iteration while performing the Gradient Descent, and it has to be done
for every iteration until the minima are reached. Hence, it becomes computationally
very expensive to perform.
This problem is solved by Stochastic Gradient Descent. In SGD, it uses only a single
sample, i.e., a batch size of one, to perform each iteration. The sample is randomly
shuffled and selected for performing the iteration.
SGD algorithm:
So, in SGD, we find out the gradient of the cost function of a single example at each iteration
instead of the sum of the gradient of the cost function of all the examples.
In SGD, since only one sample from the dataset is chosen at random for each iteration,
the path taken by the algorithm to reach the minima is usually noisier than your typical
Gradient Descent algorithm. But that doesn’t matter all that much because the path
taken by the algorithm does not matter, as long as we reach the minima and with a
significantly shorter training time.
The path took by Batch Gradient Descent –
A path has been taken by Stochastic Gradient Descent –
One thing to be noted is that, as SGD is generally noisier than typical Gradient Descent, it
usually took a higher number of iterations to reach the minima, because of its randomness
in its descent.
Even though it requires a higher number of iterations to reach the minima than typical
Gradient Descent, it is still computationally much less expensive than typical Gradient
Descent. Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for
optimizing a learning algorithm.
2) Stochastic Gradient Descent Convergence.
Ans: Mathematical Notion of Convergence

A natural thing to do is get some math to back up the nice behaviour seen in
practice when using SGD. For that, one must come up with a convergence
proof. We say an algorithm converges if we are able to find a minimizer:
The function f may have a minimum, but no minimizer. Take for example f(x) = 1/|x|.
Practically speaking, we want to show that as we iterate through the algorithm
(i→∞), the value of the iterates approaches that of the minimum:
Alternatively, one can say that for i large, ||wᵢ-w*|| is small.
Given an algorithm, it is usually easier to provide a bound on the difference ||

wᵢ — w*||. For example, the following is enough to prove convergence:
When i→∞, we have wᵢ→w* since 1\i→0.
Since a and b are constants, we say the above examples converges O(1/i).
3) Batch Gradient Descent.

Ans : In Batch Gradient Descent, all the training data is taken into consideration to take a single
step. We take the average of the gradients of all the training examples and then use that mean
gradient to update our parameters. So that’s just one step of gradient descent in one epoch.
Batch Gradient Descent is great for convex or relatively smooth error manifolds. In this case, we
move somewhat directly towards an optimum solution.
Cost vs Epochs
The graph of cost vs epochs is also quite smooth because we are averaging over all the
gradients of training data for a single step. The cost keeps on decreasing over the epochs.
Stochastic Gradient Descent
In Batch Gradient Descent we were considering all the examples for every step of Gradient
Descent. But what if our dataset is very huge. Deep learning models crave for data. The more
the data the more chances of a model to be good. Suppose our dataset has 5 million examples,
then just to take one step the model will have to calculate the gradients of all the 5 million
examples. This does not seem an efficient way. To tackle this problem we have Stochastic
Gradient Descent. In Stochastic Gradient Descent (SGD), we consider just one example at a time
to take a single step.
We do the following steps in one epoch for SGD:
1. Take an example
2. Feed it to Neural Network
3. Calculate it’s gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in training dataset
Since we are considering just one example at a time the cost will fluctuate over the training
examples and it will not necessarily decrease. But in the long run, you will see the cost
decreasing with fluctuations.
Cost vs Epochs in SGD
Also because the cost is so fluctuating, it will never reach the minima but it will keep dancing
around it.
SGD can be used for larger datasets. It converges faster when the dataset is large as it causes
updates to the parameters more frequently.
4) Mini Batch Gradient Descent
Ans : have seen the Batch Gradient Descent. We have also seen the Stochastic Gradient
Descent. Batch Gradient Descent can be used for smoother curves. SGD can be used when the
dataset is large. Batch Gradient Descent converges directly to minima. SGD converges faster for
larger datasets. But, since in SGD we use only one example at a time, we cannot implement the
vectorized implementation on it. This can slow down the computations. To tackle this problem,
a mixture of Batch Gradient Descent and SGD is used.
Neither we use all the dataset all at once nor we use the single example at a time. We use a
batch of a fixed number of training examples which is less than the actual dataset and call it a
mini-batch. Doing this helps us achieve the advantages of both the former variants we saw. So,
after creating the mini-batches of fixed size, we do the following steps in one epoch:
Pick a mini-batch
Feed it to Neural Network
Calculate the mean gradient of the mini-batch
Use the mean gradient we calculated in step 3 to update the weights
Repeat steps 1–4 for the mini-batches we created
Just like SGD, the average cost over the epochs in mini-batch gradient descent fluctuates
because we are averaging a small number of examples at a time.
Cost vs no of mini-batch
So, when we are using the mini-batch gradient descent we are updating our parameters
frequently as well as we can use vectorized implementation for faster computations.
(Ch-3)
1) Normal Equation in linear regression.
Normal Equation is an analytical approach to Linear Regression with a Least Square Cost
Function. We can directly find out the value of θ without using Gradient Descent. Following this
approach is an effective and time-saving option when are working with a dataset with small
features.
Normal Equation is a follows :
In the above equation,

Θ: hypothesis parameters that define it the best.
X: Input feature value of each instance.
Y: Output value of each instance.
2) Multiple Linear regression.
Ans : linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response
variable.
Multiple regression is an extension of linear (OLS) regression that uses just one explanatory
variable.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response
variable.
The goal of multiple linear regression is to model the linear relationship between the
explanatory (independent) variables and response (dependent) variables. In essence, multiple
regression is the extension of ordinary least-squares (OLS) regression because it involves more
than one explanatory variable.
Formula and Calculation of Multiple(In notebook)
The multiple regression model is based on the following assumptions:

 There is a linear relationship between the dependent variables and the
independent variables
 The independent variables are not too highly correlated with each other
 yi observations are selected independently and randomly from the
population
 Residuals should be normally distributed with a mean of 0
and variance σ
3) Multiple Features (Variables).
Ans:
New hypothesis
Multivariate linear regression

Can reduce hypothesis to single number with a transposed theta matrix multiplied by x matrix
4) Gradient Descent in multiple variables.

5) Gradient Descent vs Normal Equation
Gradient Descent Normal Equation
Need to choose alpha No need to choose alpha
Needs many iterations Don’t need to iterate
Works with large n (10,000) Slow if n is large (100, 1000 is fine)
Number of features > 1000 So long number features < 1000
6) Feature scaling in GD.

Ans: Gradient descent will take longer to reach the global minimum when the features are not
on a similar scale Feature scaling allows you to reach the global minimum faster.
Feature scaling is an idea that makes sure that the features involved in the gradient descent
computation are on the similar scale. This implies that the features involved should take similar
ranges of values. Feature scaling aims to speed up the process of convergence of gradient
descent.
Example: The above shown picture is a housing price example, where ‘x1’ is the size of the
house and ‘x2’ refers to the number of bedrooms. The contours of the cost-function J(θ0,θ1)are
elliptical-shaped and extremely skewed as shown initially. This may cause the gradient descent
to converge to the global minimum in a very long time.
However, if we scale the features, i.e. divide x1 by 2000 and divide x2 by 5and then plot the
cost-function, the contours may look much more like circles. This provides a more direct path
for gradient descent, which had a very complicated trajectory as compared to the initial plot of
the cost function with unscaled features.
Alternative for Feature Scaling
Mean normalization method can be used as an alternative for feature scaling. In this method
we can scale up the size as (size — mean size) divided by the range of the size.
Example: In the previous case x1 (size of the house) can be scaled as x1 = (size — 1000)/2000.
This modifies the range of x1 as: -0.5<x1<+0.5
This is how feature scaling and mean normalization method can be used to speed up the
gradient descent process, by having the values of the input variables, more or less in the same
range.
(CH-5)
1) Introduction to SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms which are used both for classification and regression.
Working of SVM
An SVM model is basically a representation of different classes in a hyperplane in
multidimensional space. The hyperplane will be generated in an iterative manner by SVM so that
the error can be minimized. The goal of SVM is to divide the datasets into classes to find a
maximum marginal hyperplane (MMH).
Working of SVM
The followings are important concepts in SVM –
Support Vectors – Datapoints that are closest to the hyperplane is called support vectors.
Separating line will be defined with the help of these data points.
Hyperplane – As we can see in the above diagram, it is a decision plane or space which is
divided between a set of objects having different classes.
Margin – It may be defined as the gap between two lines on the closet data points of different
classes. It can be calculated as the perpendicular distance from the line to the support vectors.
Large margin is considered as a good margin and small margin is considered as a bad margin.
The main goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH) and it can be done in the following two steps –
First, SVM will generate hyperplanes iteratively that segregates the classes in best way.
Then, it will choose the hyperplane that separates the classes correctly.
SVM Kernels
In practice, SVM algorithm is implemented with kernel that transforms an input data space into
the required form. SVM uses a technique called the kernel trick in which kernel takes a low
dimensional input space and transforms it into a higher dimensional space. In simple words,
kernel converts non-separable problems into separable problems by adding more dimensions to
it. It makes SVM more powerful, flexible and accurate. The following are some of the types of
kernels used by SVM.
Linear Kernel
It can be used as a dot product between any two observations. The formula of linear kernel is as
below –
K(x,xi)=sum(x∗xi)
From the above formula, we can see that the product between two vectors say 𝑥 & 𝑥𝑖 is the sum
of the multiplication of each pair of input values.
Polynomial Kernel
It is more generalized form of linear kernel and distinguish curved or nonlinear input space.
Following is the formula for polynomial kernel
K(X,Xi)=1+sum(X∗Xi)^d
Here d is the degree of polynomial, which we need to specify manually in the learning algorithm.
Radial Basis Function (RBF) Kernel
RBF kernel, mostly used in SVM classification, maps input space in indefinite dimensional
space. Following formula explains it mathematically –
K(x,xi)=exp(−gamma∗sum(x−xi^2))
Here, gamma ranges from 0 to 1. We need to manually specify it in the learning algorithm. A
good default value of gamma is 0.1.
As we implemented SVM for linearly separable data, we can implement it in Python for the data
that is not linearly separable. It can be done by using kernels.
2) Support Vector.
Support vectors are data points that are closer to the hyperplane and influence the position and
orientation of the hyperplane. Using these support vectors, we maximize the margin of the
classifier. Deleting the support vectors will change the position of the hyperplane. These are the
points that help us build our SVM.
3) Non linear regression handling via SVM.
Ans : SVM address non-linearly separable cases by introducing two concepts: Soft
Margin and Kernel Tricks.
Soft Margin: try to find a line to separate, but tolerate one or few misclassified dots
(e.g. the dots circled in red)
Kernel Trick: try to find a non-linear decision boundary
Soft Margin
Two types of misclassifications are tolerated by SVM under soft margin:
1. The dot is on the wrong side of the decision boundary but on the
correct side/ on the margin (shown in left)
2. The dot is on the wrong side of the decision boundary and on the
wrong side of the margin (shown in right)
Applying Soft Margin, SVM tolerates a few dots to get misclassified and tries to
balance the trade-off between finding a line that maximizes the margin and
minimizes the misclassification.
Degree of tolerance
How much tolerance(soft) we want to give when finding the SVM in linear
separable cases
Obviously, infinite lines exist to separate the red and green dots in the example above. SVM
needs to find the optimal line with the constraint of correctly classifying either class:
Follow the constraint: only look into the separate hyperplanes(e.g. separate lines), hyperplanes
that classify classes correctly
Conduct optimization: pick up the one that maximizes the margin.
4) SVM formulation for seperable training data.
Ans : , infinite lines exist to separate the red and green dots in the example above.l
SVM needs to find the optimal line with the constraint of correctly classifying either cclass
Follow the constraint: only look into the separate hyperplanes (e.g. separate lines), hyperplanes
that classify classes correctly.
Conduct optimization : pick up the one that maximizes the margin.
Hyperplane is an (n minus 1)-dimensional subspace for an n-dimensional space. For a 2-
dimension space, its hyperplane will be 1-dimension, which is just a line. For a 3-dimension
space, its hyperplane will be 2-dimension, which is a plane that slice the cube. Okay, you got
the idea.
Any Hyperplane can be written mathematically as above

For a 2-dimensional space, the Hyperplane, which is the line.
The dots above this line, are those x1, x2 satisfy the formula above
The dots below this line, similar logic.

What is a Separating Hyperplane?
Assuming the label y is either 1 (for green) or -1 (for red), all those three lines below are
separating hyperplanes. Because they all share the same property — above the line, is green;
below the line, is red.
This property can be written in math again as followed:
If we further generalize these two into one, it becomes:
What is margin?
Let’s say we have a hyperplane — line X

Calculate the perpendicular distance from all those 40 dots to line X, it will be 40 different
distances
Out of the 40, the smallest distance, that’s our margin!
The distance between either side of the dashed line to the solid line is the margin. We can think
of this optimal line as the mid-line of the widest stretching we can possibly have between red
and green dots.
To sum up, SVM in the linear separable cases:
Constrain/ensure that each observation is on the correct side of the Hyperplane

Pick up the optimal line so that the distance from those closest dots to the Hyperplane, so-
called margin, is maximized.
(CH-5)
1) Cost function representation in classification problem.
Binary Classification Cost Functions
Classification models are used to make predictions of categorical variables, such as predictions
for 0 or 1, Cat or dog, etc. The cost function used in the classification problem is known as the
Classification cost function. However, the classification cost function is different from the
Regression cost function.
One of the commonly used loss functions for classification is cross-entropy loss.
The binary Cost function is a special case of Categorical cross-entropy, where there is only one
output class. For example, classification between red and blue.
To better understand it, let’s suppose there is only a single output variable Y.
The error in binary classification is calculated as the mean of cross-entropy for all N training
data. Which means:
Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N

3. Multi-class Classification Cost Function
A multi-class classification cost function is used in the classification problems for which
instances are allocated to one of more than two classes. Here also, similar to binary class
classification cost function, cross-entropy or categorical cross-entropy is commonly used cost
function.
It is designed in a way that it can be used with multi-class classification with the target values
ranging from 0 to 1, 3, ….,n classes.
In a multi-class classification problem, cross-entropy will generate a score that summarizes the
mean difference between actual and anticipated probability distribution.
For a perfect cross-entropy, the value should be zero when the score is minimized.
What is Artificial Neural Networks?
A neural network is a group of connected I/O units where each connection has a weight
associated with its computer programs. It helps you to build predictive models from large
databases. This model builds upon the human nervous system. It helps you to conduct image
understanding, human learning, computer speech, etc.
What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the gradient
of a loss function with respect to all the weights in the network.
How Backpropagation Algorithm Works

The Back propagation algorithm in neural network computes the gradient of the
loss function for a single weight by the chain rule. It efficiently computes one layer
at a time, unlike a native direct computation. It computes the gradient, but it does
not define how the gradient is used. It generalizes the computation in the delta
rule.
Why We Need Backpropagation?
Most prominent advantages of Backpropagation are:
Backpropagation is fast, simple and easy to program

It has no parameters to tune apart from the numbers of input
It is a flexible method as it does not require prior knowledge about the network
It is a standard method that generally works well
It does not need any special mention of the features of the function to be learned.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
 Static Back-propagation
 Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between both of these methods is: that the mapping is rapid in static back-
propagation while it is nonstatic in recurrent backpropagation
What are Neurons in a Neural Network?

A layer consists of small individual units called neurons. A neuron in a neural network can be
better understood with the help of biological neurons. An artificial neuron is similar to a
biological neuron. It receives input from the other neurons, performs some processing, and
produces an output.
Now let’s see an artificial neuron-
Artificial neuron for neural network
Here, X1 and X2 are inputs to the artificial neurons, f(X) represents the processing done on the
inputs and y represents the output of the neuron.
Weight Initialization Techniques for Deep Neural Networks
While building and training neural networks, it is crucial to initialize the weights appropriately
to ensure a model with high accuracy. If the weights are not correctly initialized, it may give rise
to the Vanishing Gradient problem or the Exploding Gradient problem.
Weight Initialization Techniques
1. Zero Initialization
As the name suggests, all the weights are assigned zero as the initial value is zero initialization.
This kind of initialization is highly ineffective as neurons learn the same feature during each
iteration. Rather, during any kind of constant initialization, the same issue happens to occur.
Thus, constant initializations are not preferred.
2. Random Initialization
In an attempt to overcome the shortcomings of Zero or Constant Initialization, random
initialization assigns random values except for zeros as weights to neuron paths. However,
assigning values randomly to the weights, problems such as Overfitting, Vanishing Gradient
Problem, Exploding Gradient Problem might occur.
Random Initialization can be of two kinds:
Random Normal
Random Uniform
a) Random Normal: The weights are initialized from values in a normal distribution.
b) Random Uniform: The weights are initialized from values in a uniform distribution.
3.Xavier/Glorot Initialization
In Xavier/Glorot weight initialization, the weights are assigned from values of a uniform
distribution as follows:
Xavier/Glorot Initialization often termed as Xavier Uniform Initialization, is suitable for layers
where the activation function used is Sigmoid.
4. Normalized Xavier/Glorot Initialization
In Normalized Xavier/Glorot weight initialization, the weights are assigned from values of a
normal distribution as follows:
Here,
Xavier/Glorot Initialization, too, is suitable for layers where the activation function used is
Sigmoid
5. He Uniform Initialization
In He Uniform weight initialization, the weights are assigned from values of a uniform
distribution as follows:
He Uniform Initialization is suitable for layers where ReLU activation function is used.
6. He Normal Initialization
In He Normal weight initialization, the weights are assigned from values of a normal distribution
as follows.
He Uniform Initialization, too, is suitable for layers where ReLU activation function is used.
6)Architecture of ANN.
Input Layer
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which
finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
What is Artificial Neural Network

It determines weighted total is passed as an input to an activation function to
produce the output. Activation functions choose whether a node should fire or not.
Only those who are fired make it to the output layer. There are distinctive activation
functions available that can be applied upon the sort of task we are performing.
7 ) What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the
unlabeled dataset into different clusters. Here K defines the number of pre-defined
clusters that need to be created in the process, as if K=2, there will be two clusters,
and for K=3, there will be three clusters, and so on.
It allows us to cluster the data into different groups and a convenient way to
discover the categories of groups in the unlabeled dataset on its own without the
need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances between the data
point and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-
number of clusters, and repeats the process until it does not find the best clusters.
The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:

Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.
The below diagram explains the working of the K-means Clustering Algorithm:
The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the input
dataset).
Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third steps, which means reassign each datapoint to the
new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
7) Advantages of k-means
1. Relatively simple to implement.
2. Scales to large data sets.
3. Guarantees convergence.
4. Can warm-start the positions of centroids.
5. Easily adapts to new examples.
6. Generalizes to clusters of different shapes and sizes, such as elliptical
clusters.
Disadvantages of k-means
1. Choosing manually.
2. Being dependent on initial values.
3. Clustering data of varying sizes and density.
4. Clustering outliers.
5. Scaling with number of dimensions.
(CH-6)
1) The Curse of Dimensionality
In machine learning, “dimensionality” simply refers to the number of

features (i.e. input variables) in your dataset.
While the performance of any machine learning model increases if we add

additional features/dimensions, at some point a further insertion leads to
performance degradation that is when the number of features is very large
commensurate with the number of observations in your dataset, several
linear algorithms strive hard to train efficient models. This is called the
“Curse of Dimensionality”.
Dimensionality reduction is a set of techniques that studies how to shrivel

the size of data while preserving the most important information and further
eliminating the curse of dimensionality. It plays an important role in the
performance of classification and clustering problems.
The various techniques used for dimensionality reduction include:
Principal Component Analysis (PCA)

Linear Discriminant Analysis (LDA)
Generalized Discriminant Analysis (GDA)
Multi-Dimension Scaling (MDS)
2) Principal Component Analysis (PCA) is an unsupervised linear

transformation technique that is widely used across different fields,
most prominently for feature extraction and dimensionality
reduction.
It is a statistical process that converts the observations of correlated features
into a set of linearly uncorrelated features with the help of orthogonal
transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data
analysis and predictive modeling. It is a technique to draw strong patterns
from the given dataset by reducing the variances.
PCA generally tries to find the lower-dimensional surface to project the high-
dimensional data.
PCA works by considering the variance of each attribute because the high
attribute shows the good split between the classes, and hence it reduces the
dimensionality.
Some real-world applications of PCA are image processing, movie

recommendation system, optimizing the power allocation in various
communication channels
The PCA algorithm is based on some mathematical concepts such as:
Variance and Covariance

Eigenvalues and Eigen factors
Steps for PCA algorithm
1. Getting the dataset
2. Standardizing the data
3. Calculating the Covariance of Z
4. Calculating the Eigen Values and Eigen Vectors
5. Sorting the Eigen Vectors
6. Calculating the new features Or Principal Components
7. Remove less or unimportant features from the new dataset.
Applications of Principal Component Analysis
PCA is mainly used as the dimensionality reduction technique in various AI

applications such as computer vision, image compression, etc.
It can also be used for finding hidden patterns if data has high dimensions.
Some fields where PCA is used are Finance, data mining, Psychology, etc.
(CH-8)
1)Random Forest Algorithm.
Random Forest is a popular machine learning algorithm that belongs to the

supervised learning technique. It can be used for both Classification and
Regression problems in ML.
It is based on the concept of ensemble learning, which is a process of

combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
As the name suggests, “Random Forest is a classifier that contains a number

of decision trees on various subsets of the given dataset and takes the
average to improve the predictive accuracy of that dataset.”
Instead of relying on one decision tree, the random forest takes the
prediction from each tree and based on the majority votes of predictions,
and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset,
it is possible that some decision trees may predict the correct output, while others
may not. But together, all the trees predict the correct output. Therefore, below are
two assumptions for a better Random forest classifier:
There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
Why use Random Forest?

Below are some points that explain why we should use the Random Forest
algorithm:
 It takes less training time as compared to other algorithms.

 It predicts output with high accuracy, even for the large dataset it runs
efficiently.
 It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm work?
Random Forest works in two-phase first is to create the random forest by combining
N decision tree, and second is to make predictions for each tree created in the first
phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign
the new data points to the category that wins the majority votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into subsets
and given to each decision tree. During the training phase, each decision tree
produces a prediction result, and when a new data point occurs, then based on the
majority of results, the Random Forest classifier predicts the final decision. Consider
the below image:
Random Forest Algorithm.
Applications of Random Forest
There are mainly four sectors where Random forest mostly used:
Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
Medicine: With the help of this algorithm, disease trends and risks of the disease
can be identified.
Land Use: We can identify the areas of similar land use by this algorithm.
Marketing: Marketing trends can be identified using this algorithm.
Advantages of Random Forest

Random Forest is capable of performing both Classification and Regression tasks.
It is capable of handling large datasets with high dimensionality.
It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest

Although random forest can be used for both classification and regression tasks, it is
not more suitable for Regression tasks.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters,
and the arbitrarily shaped distributions are formed as long as the dense region can
be connected. This algorithm does it by identifying different clusters in the dataset
and connects the areas of high densities into clusters. The dense areas in data space
are divided from each other by sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has
varying densities and high dimensions.
In distance-based clustering, a distance metric is used to determine the similarity
between data objects.
The distance metric can be used to cluster observations by considering the distances
between objects directly or by considering distances between objects and cluster
centroids (or some other cluster representative points).
Most distance metrics, and hence the distance-based clustering methods, work
either with continuous-only or categorical-only data. In applications, however,
observations are often described by a combination of both continuous and
categorical variables.
Such data sets can be referred to as mixed or mixed-type data. In this review, we
consider different methods for distance-based cluster analysis of mixed data.

Document 2

Uploaded by

Copyright:

Available Formats

Document 2

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Document 2

Uploaded by

Copyright:

Available Formats

Artificial Intelligence and Machine Learning

2) Stochastic Gradient Descent Convergence.

Ans: Mathematical Notion of Convergence

Alternatively, one can say that for i large, ||wᵢ-w*|| is small.

Given an algorithm, it is usually easier to provide a bound on the difference ||

When i→∞, we have wᵢ→w* since 1\i→0.

Since a and b are constants, we say the above examples converges O(1/i).

3) Batch Gradient Descent.

In the above equation,

The multiple regression model is based on the following assumptions:

Multivariate linear regression

4) Gradient Descent in multiple variables.

6) Feature scaling in GD.

Alternative for Feature Scaling

Any Hyperplane can be written mathematically as above

The dots below this line, similar logic.

This property can be written in math again as followed:

If we further generalize these two into one, it becomes:

Let’s say we have a hyperplane — line X

To sum up, SVM in the linear separable cases:

Constrain/ensure that each observation is on the correct side of the Hyperplane

Binary Cross-Entropy = (Sum of Cross-Entropy for N data)/N

How Backpropagation Algorithm Works

Backpropagation is fast, simple and easy to program

What are Neurons in a Neural Network?

Artificial neuron for neural network

Random Initialization can be of two kinds:

4. Normalized Xavier/Glorot Initialization

What is Artificial Neural Network

7 ) What is K-Means Algorithm?

The k-means clustering algorithm mainly performs two tasks:

The working of the K-Means algorithm is explained in the below steps:

In machine learning, “dimensionality” simply refers to the number of

While the performance of any machine learning model increases if we add

Dimensionality reduction is a set of techniques that studies how to shrivel

The various techniques used for dimensionality reduction include:

Principal Component Analysis (PCA)

2) Principal Component Analysis (PCA) is an unsupervised linear

Some real-world applications of PCA are image processing, movie

The PCA algorithm is based on some mathematical concepts such as:

Variance and Covariance

Steps for PCA algorithm

1. Getting the dataset

2. Standardizing the data

3. Calculating the Covariance of Z

4. Calculating the Eigen Values and Eigen Vectors

5. Sorting the Eigen Vectors

6. Calculating the new features Or Principal Components

7. Remove less or unimportant features from the new dataset.

Applications of Principal Component Analysis

PCA is mainly used as the dimensionality reduction technique in various AI

1)Random Forest Algorithm.

Random Forest is a popular machine learning algorithm that belongs to the

It is based on the concept of ensemble learning, which is a process of

As the name suggests, “Random Forest is a classifier that contains a number

Why use Random Forest?

 It takes less training time as compared to other algorithms.

Step-1: Select random K data points from the training set.

Advantages of Random Forest

Disadvantages of Random Forest