ML Module II
The Perceptron model is treated as one of the simplest and best-known types of Artificial Neural Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
In Machine Learning, a binary classifier is defined as a function that decides whether an input, represented as a vector of numbers, belongs to a specific class. The perceptron is a linear binary classifier: in simple words, it is a classification algorithm whose prediction is based on a linear predictor function combining a weight vector with the feature vector.
o Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Activation Function:
This is the final and most important component, which determines whether the neuron will fire or not. The activation function can be considered primarily as a step function. Commonly used activation functions include:
o Sign function
o Step function, and
o Sigmoid function
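To make these concrete, here is a minimal Python sketch of the three activation functions listed above, using their usual textbook definitions; the example net-sum values are made up for illustration.

```python
import numpy as np

def sign_function(z):
    # Outputs -1, 0, or +1 depending on the sign of the net sum
    return np.sign(z)

def step_function(z):
    # Outputs 1 if the net sum is non-negative, else 0
    return np.where(z >= 0, 1, 0)

def sigmoid_function(z):
    # Smoothly squashes the net sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # made-up net-sum values
print(sign_function(z), step_function(z), sigmoid_function(z))
```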
Step-1
In the first step, multiply all input values by their corresponding weight values and add them together to determine the weighted sum. A special term called the bias 'b' is added to this weighted sum to improve the model's performance. Mathematically, the weighted sum is calculated as:
∑ wi·xi + b
Step-2
In the second step, an activation function is applied to the weighted sum computed above, which gives an output that is either binary or a continuous value:
Y = f(∑ wi·xi + b)
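Putting Step-1 and Step-2 together, the following NumPy sketch computes a perceptron output; the weights, bias, inputs, and the choice of a step activation are illustrative assumptions, not values from the text.

```python
import numpy as np

def step_activation(net_sum):
    # Step function: fire (output 1) if the net sum is non-negative, else 0
    return 1 if net_sum >= 0 else 0

def perceptron_output(x, w, b):
    # Step-1: weighted sum of inputs plus bias
    net_sum = np.dot(w, x) + b
    # Step-2: apply the activation function to the net sum
    return step_activation(net_sum)

# Example with made-up inputs, weights, and bias
x = np.array([1.0, 0.5, -0.3])   # input values
w = np.array([0.4, -0.2, 0.7])   # weights
b = 0.1                          # bias
print(perceptron_output(x, w, b))  # prints 1 or 0
```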
2. Explain about SVM Linear Classification
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane.
Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train our model with many images of cats and dogs so that it learns the different features of each, and then we test it with this strange creature. Since the support vectors create a decision boundary between the two classes (cat and dog) using the extreme cases, the model will look at the extreme cases of cats and dogs and, on the basis of the support vectors, classify it as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
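As a rough sketch of the linear case, the scikit-learn snippet below fits a Linear SVM classifier on a small, made-up, linearly separable dataset; the data points and the C parameter are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data: two clusters of points
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # class labels

# Linear SVM classifier: finds the maximum-margin hyperplane (a line in 2-D)
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[3, 2], [7, 6]]))  # expected: [0 1]
print(clf.support_vectors_)           # the extreme points that define the margin
```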
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either
green or blue. Consider the below image:
Since it is a 2-D space, we can easily separate these two classes with just a straight line. However, there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If the data is linearly arranged, we can separate it by using a straight line, but for non-linear data we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space becomes as shown in the image below:
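The idea of adding the dimension z = x² + y² can be sketched as follows; the ring-shaped data are made up, and a linear SVM is fitted on the lifted 3-D points to show that they become linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up non-linearly separable data: an inner cluster and an outer ring
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.3, (100, 2))                    # class 0 near the origin
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]   # class 1 on a ring of radius 2
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

# Explicitly add the third dimension z = x^2 + y^2 and fit a linear SVM on it
z = (X ** 2).sum(axis=1, keepdims=True)
X3 = np.hstack([X, z])
clf = SVC(kernel="linear").fit(X3, y)
print(clf.score(X3, y))  # close to 1.0: the lifted data is linearly separable
```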
3. Explain about Backpropagation in Artificial Neural Networks
1. Forward Pass:
The input data is fed forward through the neural network to generate
predictions. Each neuron's output is calculated by applying an activation
function to the weighted sum of its inputs.
2. Calculate Loss:
The output of the neural network is compared to the actual target values,
and the error is quantified using a loss function. Common loss functions
include mean squared error for regression tasks and cross-entropy loss for
classification tasks.
3. Backward Pass (Backpropagation):
The goal of the backward pass is to calculate the gradient of the loss with
respect to the weights and biases of the network. This gradient represents
how much the loss would increase or decrease with small changes to each
weight and bias.
4. Gradient Descent Optimization:
The calculated gradients are used to update the weights and biases in the
opposite direction of the gradient, aiming to minimize the loss. The
magnitude of the update is controlled by a learning rate.
5. Repeat:
Steps 1-4 are repeated for multiple iterations (epochs) or until the network
reaches a satisfactory level of performance.
6. Activation Functions:
The choice of activation functions is crucial in backpropagation. Common
activation functions include:
Sigmoid: Used in the output layer for binary classification.
Hyperbolic Tangent (tanh): Similar to sigmoid but centered
around zero.
Rectified Linear Unit (ReLU): Commonly used in hidden layers
due to faster convergence.
7. Backpropagation Math:
The chain rule of calculus is used to calculate the gradients during the
backward pass. The gradients are calculated layer by layer, starting from
the output layer and moving backward to the input layer.
8. Batch Training:
Backpropagation is often performed on batches of training data rather
than on individual samples. This is known as batch training, and it helps to
smooth out noisy gradients and improve convergence.
9. Regularization Techniques:
To prevent overfitting, regularization techniques such as L1 and L2
regularization or dropout can be applied during backpropagation.
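Steps 1-4 above can be summarised in a minimal NumPy sketch of backpropagation for a one-hidden-layer network; the XOR-style data, sigmoid activations, layer sizes, and learning rate are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: XOR-like problem, 4 samples with 2 features each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))  # hidden -> output
lr = 0.5  # learning rate

for epoch in range(5000):
    # 1. Forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = sigmoid(h @ W2 + b2)      # predictions

    # 2. Calculate loss (mean squared error)
    loss = np.mean((y_hat - y) ** 2)

    # 3. Backward pass: chain rule, output layer first
    d_yhat = 2 * (y_hat - y) / len(X)        # dLoss/dy_hat
    d_z2 = d_yhat * y_hat * (1 - y_hat)      # through the output sigmoid
    dW2, db2 = h.T @ d_z2, d_z2.sum(axis=0, keepdims=True)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * h * (1 - h)                 # through the hidden sigmoid
    dW1, db1 = X.T @ d_z1, d_z1.sum(axis=0, keepdims=True)

    # 4. Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)           # should be small after training
print(y_hat.round())  # should approximate [[0], [1], [1], [0]]
```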
4. Describe Maximum Likelihood Estimation
Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed. The above definition may still sound a little cryptic, so let's go through an example.
Let's suppose we have observed 10 data points from some process. For example, each data point could represent the length of time in seconds that it takes a student to answer a specific exam question.
We first have to decide which model we think best describes the process of generating the data.
This part is very important. At the very least, we should have a good idea about which model to
use. This usually comes from having some domain expertise, but we won't discuss this here.
For these data we’ll assume that the data generation process can be adequately described by a
Gaussian (normal) distribution. Visual inspection of the figure above suggests that a Gaussian
distribution is plausible because most of the 10 points are clustered in the middle with few points
scattered to the left and the right. (Making this sort of decision on the fly with only 10 data points
is ill-advised but given that I generated these data points we’ll go with it).
Recall that the Gaussian distribution has 2 parameters. The mean, μ, and the standard deviation, σ.
Different values of these parameters result in different curves (just like with the straight lines
above).
Now that we have an intuitive understanding of what maximum likelihood estimation is, we can move on to learning how to calculate the parameter values. The values that we find are called the maximum likelihood estimates.
Again we’ll demonstrate this with an example. Suppose we have three data points this time and
we assume that they have been generated from a process that is adequately described by a
Gaussian distribution.
What we want to calculate is the total probability of observing all of the data, i.e. the joint
probability distribution of all observed data points. To do this we would need to calculate some
conditional probabilities, which can get very difficult. So it is here that we’ll make our first
assumption. The assumption is that each data point is generated independently of the others. This
assumption makes the maths much easier. If the events (i.e. the process that generates the data)
are independent, then the total probability of observing all of the data is the product of the probabilities of observing each data point individually.
The probability density of observing a single data point x that is generated from a Gaussian distribution is given by:
P(x; μ, σ) = (1 / (σ√(2π))) · exp( −(x − μ)² / (2σ²) )
The semicolon in P(x; μ, σ) is used to emphasise that the symbols after it are parameters of the distribution; it should not be confused with a conditional probability (which is typically represented with a vertical line, e.g. P(A | B)).
In our example, the total (joint) probability density of observing the three data points is given by:
P(x₁, x₂, x₃; μ, σ) = P(x₁; μ, σ) · P(x₂; μ, σ) · P(x₃; μ, σ)
Why is it called 'maximum likelihood' and not 'maximum probability'?
Well this is just statisticians being pedantic (but for good reason). Most people tend to use
probability and likelihood interchangeably but statisticians and probability theorists distinguish
between the two. The reason for the confusion is best highlighted by looking at the equation:
L(μ, σ; data) = P(data; μ, σ)
These two expressions are equal! So what does this mean? Let's first define P(data; μ, σ). It means "the probability density of observing the data with model parameters μ and σ". It's worth noting that we can generalise this to any number of parameters and any distribution.
On the other hand, L(μ, σ; data) means "the likelihood of the parameters μ and σ taking certain values, given that we have observed the data".
The equation above says that the probability density of the data given the parameters is equal to
the likelihood of the parameters given the data. But despite these two things being equal, the
likelihood and the probability density are fundamentally asking different questions — one is
asking about the data and the other is asking about the parameter values. This is why the method is called maximum likelihood estimation and not maximum probability estimation.
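As a small sketch of the calculation, assuming the Gaussian model discussed above, the snippet below evaluates the log-likelihood for a made-up sample and uses the closed-form maximum likelihood estimates of μ and σ (the sample mean and the uncorrected sample standard deviation).

```python
import numpy as np

# Made-up observations (e.g. response times in seconds)
data = np.array([9.2, 9.5, 8.8, 10.1, 9.9, 9.4, 10.3, 9.0, 9.7, 10.0])

def gaussian_log_likelihood(mu, sigma, x):
    # Sum of log densities: log L(mu, sigma; data) under the independence assumption
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# For a Gaussian, maximising the log-likelihood has a closed-form solution:
mu_mle = data.mean()           # sample mean
sigma_mle = data.std(ddof=0)   # uncorrected sample standard deviation

print(mu_mle, sigma_mle)
print(gaussian_log_likelihood(mu_mle, sigma_mle, data))      # maximum value
print(gaussian_log_likelihood(mu_mle + 1, sigma_mle, data))  # any other mu gives a lower value
```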
In the context of neural networks, training, initialization, and validation are crucial
concepts that play distinct roles in building and evaluating effective models.
1. Training:
Definition: Training a neural network involves adjusting its parameters
(weights and biases) based on a labeled dataset to minimize a predefined
objective function (loss or cost function).
Process: During training, the neural network iteratively processes input
data, makes predictions, compares these predictions to the actual target
values, calculates the loss, and then updates its parameters using
optimization algorithms (e.g., gradient descent).
Objective: The goal of training is to enable the network to generalize well
to unseen data, capturing the underlying patterns in the training set.
2. Initialization:
Definition: Initialization refers to the process of setting the initial values of
the weights and biases in a neural network before training.
Importance: Proper initialization is crucial, as it can significantly impact
the convergence and performance of the network during training.
Common Methods:
Zero Initialization: Setting all weights to zero. Not commonly used
in deep networks due to symmetry issues.
Random Initialization: Assigning small random values to the
weights. Commonly used to break symmetry and promote faster
convergence.
Xavier/Glorot Initialization: Adjusting the scale of random
initialization based on the number of input and output neurons.
Suitable for sigmoid and hyperbolic tangent activation functions.
He Initialization: Similar to Xavier, but adapted for ReLU (Rectified
Linear Unit) activation functions.
3. Validation:
Definition: Validation is the process of evaluating a trained neural
network on a separate dataset not used during training, called the
validation set.
Purpose: The primary goal of validation is to assess the generalization
performance of the model and identify potential issues such as overfitting
or underfitting.
Procedure: The model's performance metrics (e.g., accuracy, precision,
recall) are calculated on the validation set, providing insights into how well
the model is expected to perform on new, unseen data.
Hyperparameter Tuning: Validation is often used in conjunction with
hyperparameter tuning to find the best configuration for the model. This
involves adjusting parameters not learned during training, such as learning
rate or the number of hidden units.
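The initialization schemes listed above differ only in how the random values are scaled; the NumPy sketch below illustrates this for one layer, with the layer sizes chosen arbitrarily for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128  # assumed layer sizes (inputs and outputs of one layer)

# Zero initialization: all neurons start identical (symmetry problem)
w_zero = np.zeros((fan_in, fan_out))

# Random initialization: small random values break the symmetry
w_random = rng.normal(0.0, 0.01, (fan_in, fan_out))

# Xavier/Glorot initialization: variance scaled by fan_in and fan_out
# (suited to sigmoid/tanh activations)
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), (fan_in, fan_out))

# He initialization: variance scaled by fan_in only (suited to ReLU)
w_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out))

print(w_xavier.std(), w_he.std())  # He weights have a larger spread than Xavier
```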
The SVM model or Support Vector Machine model is a popular set of supervised learning
models that are used for regression as well as classification analysis. It is a model based on
the statistical learning framework and is known for being robust and effective in multiple
use cases. At its core a non-probabilistic binary linear classifier, a support vector machine separates different classes with the help of various kernels.
One of the main reasons practitioners lean towards support vector machine models over other models is that Support Vector Machines can deliver comparatively high accuracy while requiring relatively little computation from the system.
Why are SVMs Used in Machine Learning?
The two main reasons why support vector machines are used in machine learning are:
Relatively High Accuracy: One of the main advantages of a support vector machine is that, compared with more basic algorithms, it often achieves higher accuracy. This means that when the model is deployed in the real world, we tend to see better results from the implemented machine learning models.
Minimal Computation Time: Due to the "kernel trick", the computation time of support vector machines is reduced, which means that as data scientists we are able to get better results in less time while utilizing fewer resources. This is a win-win: we get better results without increasing hardware utilization costs, and with faster turnaround.
Types of Support Vector Machines Algorithm
In this section, we will understand more about the types of SVM based on the kind of data
that we use. This is more specific to classification as that is the primary use case
for Support Vector Machines.
1. Linear SVM
The Linear Support Vector Machine algorithm is used when we have linearly separable data.
In simple language, if we have a dataset that can be classified into two groups using a simple
straight line, we call it linearly separable data, and the classifier used for this is known as
Linear SVM Classifier.
2. Non-Linear SVM
The non-linear support vector machine algorithm is used when we have non-linearly separable
data. In simple language, if we have a dataset that cannot be classified into two groups using a
simple straight line, we call it non-linearly separable data, and the classifier used for this is known as a Non-Linear SVM classifier.
Hyperplane and Support Vectors in SVM Algorithm
In this section, we will discuss the Hyperplane and Support Vectors in SVM in more detail:
1. Hyperplane
When given a set of points, there can be multiple ways to separate the classes in an n-dimensional space. When the data cannot be separated in its original space, SVM transforms the lower-dimensional data into a higher-dimensional space and then separates the points there. There are multiple ways to separate
the data, and these can be called Decision Boundaries. However, the main idea
behind SVM classification is to find the best possible decision boundary. The hyperplane is
the optimal, generalized and best-fit boundary for the support vector machine classifier.
The aim is to create a hyperplane with the highest possible margin, which gives a generalized model. This means the distance between the hyperplane and the nearest data points of each class is as large as possible.
2. Support Vectors
The term support vector indicates that these vectors "support" the main hyperplane. The larger the distance between the support vectors of the two classes, the better the fit. So, support vectors are the data points closest to the hyperplane, and they determine the overall position of the hyperplane.
SVM works on the principle of maximizing the margin, i.e. the distance between the hyperplane and the support vectors of each class. This gives us the largest possible margin and, therefore, a generalized model. The aim of Support Vector Machine classification is to maximize this margin defined by the Support Vectors.
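As a brief illustration of support vectors and the margin, the following scikit-learn sketch fits a linear SVM on made-up data and inspects the fitted support vectors and the resulting margin width; the formula margin = 2/||w|| applies to the linear view of SVM described above.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable data
X = np.array([[1, 1], [2, 2], [1, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)        # the closest points that define the hyperplane
w = clf.coef_[0]                   # weight vector of the separating hyperplane
margin = 2.0 / np.linalg.norm(w)   # width of the margin maximized by SVM
print(margin)
```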
Support Vector Machines (SVM) are known for their effectiveness in classification tasks and can often provide high accuracy in certain scenarios. Part of the reason is the margin-based (hinge) loss used during training, which penalises points that fall on the wrong side of the margin:
L(y, f(x)) = max(0, 1 − y·f(x))
Here,
y – the actual class (−1 or 1)
f(x) – the output of the classifier for the data point
Let's understand it with the help of the graph of this loss.
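Assuming the standard hinge-loss form given above, a minimal NumPy sketch looks like this; the labels and classifier outputs are made up.

```python
import numpy as np

def hinge_loss(y, f_x):
    # y: true labels in {-1, +1}; f_x: raw classifier outputs
    # Zero loss when the point is correctly classified outside the margin
    # (y * f_x >= 1), and a linearly growing penalty otherwise.
    return np.maximum(0.0, 1.0 - y * f_x)

y = np.array([1, 1, -1, -1])
f_x = np.array([2.0, 0.5, -1.5, 0.3])  # made-up classifier outputs
print(hinge_loss(y, f_x))              # [0.  0.5 0.  1.3]
```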