
MODULE-II

1. Explain Perceptron Learning

The Perceptron is a Machine Learning algorithm for supervised learning of binary classification tasks. Further, the Perceptron can also be understood as an Artificial Neuron, or neural network unit, that helps detect patterns in input data (for example, in business intelligence applications).

The Perceptron model is one of the simplest types of Artificial Neural Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can consider it a single-layer neural network with four main components: input values, weights and bias, net sum, and an activation function.

In Machine Learning, a binary classifier is a function that decides whether an input, represented as a vector of numbers, belongs to one of two classes.

The Perceptron is a linear classifier. In simple words, it is a classification algorithm that makes predictions using a linear predictor function computed from the weight and feature vectors.

o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.

o Weight and Bias:

Weights represent the strength of the connection between units and are another important parameter of the Perceptron. The weight of an input is directly proportional to the influence of the associated input neuron on the output. Further, the bias can be thought of as the intercept term in a linear equation.

o Activation Function:

This is the final and important component; it determines whether the neuron will fire or not. The activation function of a Perceptron is primarily a step function.

Types of Activation functions:

o Sign function
o Step function
o Sigmoid function

How does Perceptron work?


In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias,
net sum, and an activation function. The perceptron model begins with the
multiplication of all input values and their weights, then adds these values together to
create the weighted sum. Then this weighted sum is applied to the activation function 'f'
to obtain the desired output. This activation function is also known as the step
function and is represented by 'f'.
This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the weight
of input is indicative of the strength of a node. Similarly, an input's bias value gives the
ability to shift the activation function curve up or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values by their corresponding weight values and add them to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:

∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn

Add a special term called the bias 'b' to this weighted sum to improve the model's performance:

∑ wi*xi + b

Step-2

In the second step, an activation function is applied to the weighted sum above, which gives us an output either in binary form or as a continuous value, as follows:

Y = f(∑ wi*xi + b)
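The two steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the AND-gate data, the learning rate, and the use of the perceptron learning rule for the weight updates are assumptions made for this example, not part of the definition above.

```python
import numpy as np

def step(z):
    # Step activation: maps the net sum to a binary output (0 or 1)
    return np.where(z >= 0, 1, 0)

def perceptron_output(x, w, b):
    # Step 1: weighted sum plus bias; Step 2: activation function
    return step(np.dot(w, x) + b)

# Illustrative training with the perceptron learning rule on made-up data (AND gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = np.zeros(2)      # weights
b = 0.0              # bias
lr = 0.1             # learning rate (assumed)

for epoch in range(10):
    for xi, target in zip(X, y):
        pred = perceptron_output(xi, w, b)
        # Nudge weights and bias in proportion to the error (target - prediction)
        w = w + lr * (target - pred) * xi
        b = b + lr * (target - pred)

print([perceptron_output(xi, w, b) for xi in X])  # expected: [0, 0, 0, 1]
```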
2. Explain SVM Linear Classification

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is called
a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane.

Example: SVM can be understood with the example used for the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created using the SVM algorithm. We first train our model with many images of cats and dogs so that it can learn their different features, and then we test it with this strange creature. The support vector machine creates a decision boundary between the two classes (cat and dog) and chooses the extreme cases (support vectors) of each class. On the basis of the support vectors, it classifies the new example as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary that classifies the data points. This best boundary is known as the hyperplane of the SVM.

The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood using an example. Suppose we have a dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the below image:

Since this is a 2-D space, we can easily separate the two classes with a straight line. But there can be multiple lines that separate these classes. Consider the below image:

The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points from both classes that are closest to the boundary. These points are called support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
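As a rough illustration (assuming scikit-learn is available and using made-up 2-D data), a linear SVM can be fitted as follows; the fitted model exposes the hyperplane parameters and the support vectors that define the margin.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D data with two tags (0 = blue, 1 = green)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear SVM: finds the maximum-margin hyperplane
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b of the hyperplane w.x + b = 0
print(clf.support_vectors_)        # the closest points, which define the margin
```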

Non-Linear SVM:

If the data is linearly separable, we can separate it with a straight line, but non-linear data cannot be separated by a single straight line. Consider the below image:

To separate these data points, we need to add one more dimension. For linear data we have used the two dimensions x and y, so for non-linear data we add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:
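A small sketch of this idea, under the assumption of toy ring-shaped data generated with NumPy: after appending the third dimension z = x² + y², a plain linear SVM can separate the two classes.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up non-linear data: one class on a small circle, one on a large circle
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = np.c_[0.5 * np.cos(angles[:50]), 0.5 * np.sin(angles[:50])]
outer = np.c_[2.0 * np.cos(angles[50:]), 2.0 * np.sin(angles[50:])]
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# Add a third dimension z = x^2 + y^2; the classes become linearly separable in 3-D
z = (X ** 2).sum(axis=1, keepdims=True)
X3d = np.hstack([X, z])

clf = SVC(kernel="linear").fit(X3d, y)
print(clf.score(X3d, y))  # expected to be close to 1.0 on this toy data
```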
3. Explain Backpropagation in Artificial Neural Networks

Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and make the model reliable by increasing its generalization.

Backpropagation in a neural network is short for “backward propagation of errors.” It is a standard method of training artificial neural networks. This method helps calculate the gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works


The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.

1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually selected randomly.
3. Calculate the output of every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs:
Error = Actual Output − Desired Output
5. Travel back from the output layer to the hidden layers and adjust the weights so that the error decreases.

Keep repeating the process until the desired output is achieved.
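A minimal sketch of steps 1-5 for a tiny one-hidden-layer network with sigmoid activations; the XOR data, the 2-4-1 architecture, and the learning rate are illustrative assumptions, not prescribed by the algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data (XOR) and a tiny 2-4-1 network; weights are selected randomly (step 2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # desired outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5                                                 # learning rate (assumed)

for epoch in range(10000):
    # Step 3: forward pass from the input layer, to the hidden layer, to the output layer
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # Step 4: error in the outputs
    E = Y - T
    # Step 5: travel back and compute gradients by the chain rule
    dY = E * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    # Adjust weights and biases so that the error decreases
    W2 -= lr * (H.T @ dY)
    b2 -= lr * dY.sum(axis=0)
    W1 -= lr * (X.T @ dH)
    b1 -= lr * dH.sum(axis=0)

print(np.round(Y, 2))  # should approach [[0], [1], [1], [0]]; may vary with the seed
```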

Why Do We Need Backpropagation?


The most prominent advantages of Backpropagation are:

 Backpropagation is fast, simple, and easy to program.
 It has no parameters to tune apart from the number of inputs.
 It is a flexible method, as it does not require prior knowledge about the network.
 It is a standard method that generally works well.
 It does not need any special mention of the features of the function to be learned.

Backpropagation (short for "backward propagation of errors") is a supervised learning


algorithm commonly used for training artificial neural networks. It is a supervised
learning algorithm, meaning it requires a labeled dataset to learn from. The primary goal
of backpropagation is to minimize the error between the predicted output of the neural
network and the actual target values by adjusting the weights and biases of the network.

Here's a step-by-step explanation of how backpropagation works in artificial neural


networks:

1. Forward Pass:
 The input data is fed forward through the neural network to generate
predictions. Each neuron's output is calculated by applying an activation
function to the weighted sum of its inputs.
2. Calculate Loss:
 The output of the neural network is compared to the actual target values,
and the error is quantified using a loss function. Common loss functions
include mean squared error for regression tasks and cross-entropy loss for
classification tasks.
3. Backward Pass (Backpropagation):
 The goal of the backward pass is to calculate the gradient of the loss with
respect to the weights and biases of the network. This gradient represents
how much the loss would increase or decrease with small changes to each
weight and bias.
4. Gradient Descent Optimization:
 The calculated gradients are used to update the weights and biases in the
opposite direction of the gradient, aiming to minimize the loss. The
magnitude of the update is controlled by a learning rate.
5. Repeat:
 Steps 1-4 are repeated for multiple iterations (epochs) or until the network
reaches a satisfactory level of performance.
6. Activation Functions:
 The choice of activation functions is crucial in backpropagation. Common
activation functions include:
 Sigmoid: Used in the output layer for binary classification.
 Hyperbolic Tangent (tanh): Similar to sigmoid but centered
around zero.
 Rectified Linear Unit (ReLU): Commonly used in hidden layers
due to faster convergence.
7. Backpropagation Math:
 The chain rule of calculus is used to calculate the gradients during the
backward pass. The gradients are calculated layer by layer, starting from
the output layer and moving backward to the input layer.
8. Batch Training:
 Backpropagation is often performed on batches of training data rather
than on individual samples. This is known as batch training, and it helps to
smooth out noisy gradients and improve convergence.
9. Regularization Techniques:
 To prevent overfitting, regularization techniques such as L1 and L2
regularization or dropout can be applied during backpropagation.
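As a small sketch of steps 4 and 8 above (gradient descent updates and batch training): the helper names sgd_update and iterate_minibatches are hypothetical, and the data shapes are made up for illustration.

```python
import numpy as np

def sgd_update(params, grads, lr=0.01):
    # Step 4 above: move each parameter a small step opposite to its gradient
    return [p - lr * g for p, g in zip(params, grads)]

def iterate_minibatches(X, y, batch_size=32, seed=0):
    # Step 8 above: shuffle the data and yield batches; averaging gradients over a
    # batch smooths out noise compared with single-sample updates
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Usage sketch with made-up shapes
X, y = np.random.rand(100, 3), np.random.rand(100, 1)
for xb, yb in iterate_minibatches(X, y, batch_size=32):
    print(xb.shape, yb.shape)   # (32, 3) (32, 1) ... the last batch may be smaller
```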
4. Describe Maximum Likelihood Estimation

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximise the likelihood that the process described by the model produced the data that were actually observed. The above definition may still sound a little cryptic, so let's go through an example to help understand this.

Let's suppose we have observed 10 data points from some process. For example, each data point could represent the length of time in seconds that it takes a student to answer a specific exam question. These 10 data points are shown in the figure below.

We first have to decide which model we think best describes the process of generating the data. This part is very important. At the very least, we should have a good idea about which model to use; this usually comes from having some domain expertise, but we won't discuss it here.

For these data we'll assume that the data generation process can be adequately described by a Gaussian (normal) distribution. Visual inspection of the figure above suggests that a Gaussian distribution is plausible because most of the 10 points are clustered in the middle with a few points scattered to the left and the right. (Making this sort of decision on the fly with only 10 data points is ill-advised, but since the data points were generated just for this example, we'll go with it.)

Recall that the Gaussian distribution has 2 parameters: the mean, μ, and the standard deviation, σ. Different values of these parameters result in different curves.

Now that we have an intuitive understanding of what maximum likelihood estimation is, we can move on to learning how to calculate the parameter values. The values that we find are called the maximum likelihood estimates (MLE).

Again we'll demonstrate this with an example. Suppose we have three data points this time, and we assume that they have been generated from a process that is adequately described by a Gaussian distribution.

What we want to calculate is the total probability of observing all of the data, i.e. the joint probability distribution of all observed data points. To do this we would need to calculate some conditional probabilities, which can get very difficult. So it is here that we'll make our first assumption: each data point is generated independently of the others. This assumption makes the maths much easier. If the events (i.e. the processes that generate the data) are independent, then the total probability of observing all of the data is the product of observing each data point individually (i.e. the product of the marginal probabilities).

The probability density of observing a single data point x that is generated from a Gaussian distribution is given by:

P(x; μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

The semicolon used in the notation P(x; μ, σ) is there to emphasise that the symbols that appear after it are parameters of the probability distribution. So it shouldn't be confused with a conditional probability (which is typically represented with a vertical line, e.g. P(A | B)).

In our example, the total (joint) probability density of observing the three data points x1, x2 and x3 is the product of the three individual densities:

P(x1, x2, x3; μ, σ) = (1 / (σ√(2π))) exp(−(x1 − μ)² / (2σ²)) · (1 / (σ√(2π))) exp(−(x2 − μ)² / (2σ²)) · (1 / (σ√(2π))) exp(−(x3 − μ)² / (2σ²))
So why maximum likelihood and not maximum probability?

Well, this is just statisticians being pedantic (but for good reason). Most people tend to use probability and likelihood interchangeably, but statisticians and probability theorists distinguish between the two. The reason for the confusion is best highlighted by looking at the equation

L(μ, σ; data) = P(data; μ, σ)

These two expressions are equal! So what does this mean? Let's first define P(data; μ, σ). It means "the probability density of observing the data with model parameters μ and σ". It's worth noting that we can generalise this to any number of parameters and any distribution.

On the other hand, L(μ, σ; data) means "the likelihood of the parameters μ and σ taking certain values given that we've observed a bunch of data."

The equation above says that the probability density of the data given the parameters is equal to the likelihood of the parameters given the data. But despite these two things being equal, the likelihood and the probability density are fundamentally asking different questions: one is asking about the data and the other is asking about the parameter values. This is why the method is called maximum likelihood and not maximum probability.
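A short sketch of maximum likelihood estimation for the Gaussian example, assuming SciPy is available; the three data values are made up, and for a Gaussian the MLEs happen to have closed forms (the sample mean and standard deviation).

```python
import numpy as np
from scipy.stats import norm

# Three made-up observations (values chosen only for illustration)
x = np.array([9.0, 9.5, 11.0])

def log_likelihood(mu, sigma, data):
    # Log of the joint density = sum of log marginal densities (independence assumption)
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# For a Gaussian, the maximum likelihood estimates have closed forms:
mu_mle = x.mean()             # sample mean
sigma_mle = x.std(ddof=0)     # MLE standard deviation (divides by n, not n-1)

print(mu_mle, sigma_mle)
print(log_likelihood(mu_mle, sigma_mle, x))
# Any other (mu, sigma) gives a lower log-likelihood, e.g.:
print(log_likelihood(mu_mle + 1.0, sigma_mle, x))
```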

5. Describe Parameter Estimation in a Bayesian Network

Parameter estimation in a Bayesian network involves determining the probabilities associated with the nodes and edges in the network. A Bayesian network is a probabilistic graphical model that represents a set of variables and their probabilistic dependencies in the form of a directed acyclic graph (DAG). The nodes in the graph represent random variables, and the edges represent probabilistic dependencies between the variables.
Here's a step-by-step description of parameter estimation in a Bayesian network:

1. Define the Bayesian Network Structure:


 Specify the variables of interest and their relationships.
 Represent these relationships using a directed acyclic graph (DAG). Nodes
represent variables, and directed edges represent probabilistic
dependencies.
2. Assign Conditional Probability Tables (CPTs):
 Each node in the Bayesian network is associated with a Conditional
Probability Table (CPT).
 The CPT for a node specifies the conditional probability distribution of that
node given its parents in the graph.
 For each configuration of the parent variables, the probabilities in the corresponding row of the CPT should sum to 1.
3. Data Collection:
 Gather a dataset that includes observations for the variables in the
Bayesian network.
 Each observation should provide values for the variables represented by
the nodes in the network.
4. Maximum Likelihood Estimation (MLE):
 Use the collected data to estimate the parameters of the Bayesian
network.
 For each CPT, calculate the maximum likelihood estimates of the
probabilities based on the observed data.
 MLE involves finding the parameter values that maximize the likelihood
function, which measures how well the model explains the observed data.
5. Smoothing and Regularization (Optional):
 Depending on the size of the dataset and the complexity of the network,
you may need to apply smoothing techniques or regularization methods
to avoid overfitting and improve the robustness of the parameter
estimates.
6. Bayesian Parameter Estimation (Optional):
 Instead of relying solely on MLE, you can incorporate prior beliefs or
information into the parameter estimation process using Bayesian
methods.
 Bayesian parameter estimation involves updating the prior beliefs based
on the observed data to obtain posterior probability distributions for the
parameters.
7. Model Validation:
 Assess the performance of the Bayesian network by validating it against
additional data not used during the parameter estimation process.
 Common validation techniques include cross-validation and holdout
validation.
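A minimal sketch of step 4 (maximum likelihood estimation of a CPT by counting), assuming a toy network Rain → WetGrass and a small made-up dataset; setting alpha > 0 adds the Laplace smoothing mentioned in step 5.

```python
from collections import Counter

# Made-up observations for a toy network: Rain -> WetGrass
data = [
    {"Rain": 1, "WetGrass": 1},
    {"Rain": 1, "WetGrass": 1},
    {"Rain": 1, "WetGrass": 0},
    {"Rain": 0, "WetGrass": 0},
    {"Rain": 0, "WetGrass": 1},
    {"Rain": 0, "WetGrass": 0},
]

def estimate_cpt(data, child, parent, alpha=0.0):
    # MLE of P(child | parent) by counting; alpha > 0 adds Laplace smoothing
    joint = Counter((d[parent], d[child]) for d in data)
    parent_counts = Counter(d[parent] for d in data)
    cpt = {}
    for p in (0, 1):
        denom = parent_counts[p] + 2 * alpha
        cpt[p] = {c: (joint[(p, c)] + alpha) / denom for c in (0, 1)}
    return cpt

print(estimate_cpt(data, child="WetGrass", parent="Rain"))
# e.g. P(WetGrass=1 | Rain=1) = 2/3 from the counts above
```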

6.Explain training,initialization & validation in Neural Networks

In the context of neural networks, training, initialization, and validation are crucial
concepts that play distinct roles in building and evaluating effective models.

1. Training:
 Definition: Training a neural network involves adjusting its parameters
(weights and biases) based on a labeled dataset to minimize a predefined
objective function (loss or cost function).
 Process: During training, the neural network iteratively processes input
data, makes predictions, compares these predictions to the actual target
values, calculates the loss, and then updates its parameters using
optimization algorithms (e.g., gradient descent).
 Objective: The goal of training is to enable the network to generalize well
to unseen data, capturing the underlying patterns in the training set.
2. Initialization:
 Definition: Initialization refers to the process of setting the initial values of
the weights and biases in a neural network before training.
 Importance: Proper initialization is crucial, as it can significantly impact
the convergence and performance of the network during training.
 Common Methods:
 Zero Initialization: Setting all weights to zero. Not commonly used
in deep networks due to symmetry issues.
 Random Initialization: Assigning small random values to the
weights. Commonly used to break symmetry and promote faster
convergence.
 Xavier/Glorot Initialization: Adjusting the scale of random
initialization based on the number of input and output neurons.
Suitable for sigmoid and hyperbolic tangent activation functions.
 He Initialization: Similar to Xavier, but adapted for ReLU (Rectified
Linear Unit) activation functions.
3. Validation:
 Definition: Validation is the process of evaluating a trained neural
network on a separate dataset not used during training, called the
validation set.
 Purpose: The primary goal of validation is to assess the generalization
performance of the model and identify potential issues such as overfitting
or underfitting.
 Procedure: The model's performance metrics (e.g., accuracy, precision,
recall) are calculated on the validation set, providing insights into how well
the model is expected to perform on new, unseen data.
 Hyperparameter Tuning: Validation is often used in conjunction with
hyperparameter tuning to find the best configuration for the model. This
involves adjusting parameters not learned during training, such as learning
rate or the number of hidden units.
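A small sketch of the Xavier/Glorot and He initialization schemes listed above, using NumPy; the layer sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Xavier/Glorot: scale depends on fan-in and fan-out, suited to sigmoid/tanh
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    # He: variance 2 / fan-in, adapted for ReLU activations
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W1 = he_init(784, 128)      # hidden layer with ReLU (sizes are assumptions)
W2 = xavier_init(128, 10)   # output layer with a sigmoid/tanh-style activation
print(W1.std(), W2.std())
```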

7. What is the principle of SVM? Why does it give higher accuracy?

The SVM model or Support Vector Machine model is a popular set of supervised learning
models that are used for regression as well as classification analysis. It is a model based on
the statistical learning framework and is known for being robust and effective in multiple
use cases. Based on a non-probabilistic binary linear classifier, a support vector machine is
used for separating different classes with the help of various kernels.

One of the main reasons companies lean towards support vector machine models compared to other models is that Support Vector Machines can achieve significantly higher accuracy while requiring less computation from the system.
Why are SVMs Used in Machine Learning?

The two main reasons why support vector machines used in machine learning are:

 Relatively High Accuracy: One of the main advantages of a support vector machine is
that, as compared to more fundamental algorithms, it has a much higher relative
accuracy. This means that when deploying the model in the real world, we see better
results from the machine learning models implemented.
 Minimal Computation Time: Due to the “kernel trick”, the computation time of SVM support vector machines is reduced, which means that as data scientists, we are able to get better results in less time while utilizing fewer resources. This is a win-win, as we get better results without increasing hardware utilization costs, and in less time.
Types of Support Vector Machines Algorithm

In this section, we will understand more about the types of SVM based on the kind of data
that we use. This is more specific to classification as that is the primary use case
for Support Vector Machines.

1. Linear SVM

The Linear Support Vector Machine algorithm is used when we have linearly separable data.
In simple language, if we have a dataset that can be classified into two groups using a simple
straight line, we call it linearly separable data, and the classifier used for this is known as
Linear SVM Classifier.

2. Non-Linear SVM

The non-linear support vector machine algorithm is used when we have non-linearly separable
data. In simple language, if we have a dataset that cannot be classified into two groups using a
simple straight line, we call it non-linear separable data, and the classifier used for this is
known as a Non-Linear SVM classifier.
Hyperplane and Support Vectors in SVM Algorithm

In this section, we will discuss more Hyperplane and Support Vectors in SVM:

1. Hyperplane

When given a set of points, there can be multiple ways to separate the classes in an n-dimensional space. SVM may transform lower-dimensional data into a higher-dimensional space and then separate the points there. The different ways of separating the data are called decision boundaries. However, the main idea behind SVM classification is to find the best possible decision boundary: the hyperplane is the optimal, generalized, best-fit boundary for the support vector machine classifier.

For instance, in a two-dimensional space, as discussed in our example, the hyperplane will be a straight line. If the data exists in a three-dimensional space, the hyperplane will be a two-dimensional plane. A good rule of thumb is that for an n-dimensional space, the hyperplane will generally have n-1 dimensions.

The aim is to create a hyperplane that has the highest possible margin, giving a generalized model. This means there is the maximum possible distance between the hyperplane and the nearest data points of each class.

2. Support Vectors

The term support vector indicates that certain vectors support the main hyperplane. Having the maximum distance between the support vectors of the two classes indicates the best fit. Support vectors are the data points closest to the hyperplane; they determine the overall position of the hyperplane.

How Does SVM Work in Machine Learning?

SVM works based on the principle of maximizing the distance between the support vectors.
This ensures that we have the maximum margin possible between points, thus, giving us a
generalized model. The aim of Support Vector Machine classification is to maximize the
margin between the Support Vectors.
Support Vector Machines (SVM) are known for their effectiveness in classification tasks
and can often provide high accuracy. There are several reasons why SVMs are capable of
achieving high accuracy in certain scenarios:

1. Effective in High-Dimensional Spaces:


 SVMs work well in high-dimensional spaces, making them suitable for
problems with a large number of features. They can efficiently handle
datasets where the number of features is greater than the number of
samples.
2. Robust to Overfitting:
 SVMs are less prone to overfitting, especially in high-dimensional spaces.
The margin maximization objective of SVMs encourages a simpler decision
boundary, reducing the risk of fitting noise in the data.
3. Kernel Trick for Nonlinear Data:
 SVMs can effectively model complex, nonlinear relationships in the data by
using the kernel trick. The kernel function allows SVMs to implicitly map
the input data into a higher-dimensional space, making it easier to find a
linear separation.
4. Global Optimization Objective:
 The training of an SVM involves solving a convex optimization problem,
and the objective function aims to maximize the margin between different
classes. This leads to a global optimum, ensuring that the learned model is
robust and less sensitive to the initialization of parameters.
5. Memory Efficiency:
 SVMs typically use only a subset of training samples called support vectors
to define the decision boundary. This property makes them memory-
efficient, particularly when dealing with large datasets.
6. Effective in Binary Classification:
 SVMs are inherently binary classifiers. However, they can be extended to
handle multiclass problems through techniques such as one-vs-one or
one-vs-all. In binary classification, SVMs often perform well and achieve
high accuracy.
7. Well-Separated Classes:
 SVMs work best when classes are well-separated, and there is a clear
margin between them. In such cases, the algorithm can confidently
identify decision boundaries.
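As an illustration of points 3 and 5 above (assuming scikit-learn and a toy two-ring dataset), an RBF-kernel SVM handles a non-linear class boundary and keeps only a subset of the training points as support vectors.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy non-linear problem: two concentric rings of points
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print(clf.score(X, y))     # training accuracy, close to 1.0 on this toy data
print(len(clf.support_))   # only a subset of points act as support vectors
```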
8. Discuss the Hinge Loss Formulation of SVM

Support Vector Machine


Hinge Loss
Hinge loss is used in binary classification problems where the objective is to separate the data points into two classes, typically labeled as +1 and -1.
Mathematically, the hinge loss for a data point can be represented as:

L(t, y) = max(0, 1 − t·y)

Here,
 t is the actual class (-1 or +1)
 y = f(x) is the output of the classifier for the data point
Let's understand it with the help of the graph below.

Case 1: Correct classification and |y| ≥ 1

In this case the product t·y is positive with a value greater than or equal to 1, so 1 − t·y is negative (or zero). The loss max(0, 1 − t·y) is therefore zero. This is indicated by the green region in the graph above. There is no penalty, as the model classifies the data point correctly and confidently.
Case 2: Correct classification and |y| < 1
In this case the product t·y is positive but less than 1, so 1 − t·y is positive with a value between 0 and 1. The loss is therefore 1 − t·y. This is indicated by the yellow region in the graph above. Although the model classifies the data point correctly, we still penalize it because it has not classified it with enough confidence (|y| < 1), i.e. the classification score is less than 1. We want the model to have a classification score of at least 1 for all the points.
Case 3: Incorrect classification
In this case either t or y is negative, so the product t·y is negative and 1 − t·y is positive and greater than 1. The loss max(0, 1 − t·y) is therefore 1 − t·y, and it increases linearly as the magnitude of the wrongly signed score y increases. This is indicated by the red region in the graph above.
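The three cases can be checked numerically with a small sketch; the labels and classification scores below are made up for illustration.

```python
import numpy as np

def hinge_loss(t, y):
    # t: true labels in {-1, +1}; y: classifier scores f(x)
    return np.maximum(0.0, 1.0 - t * y)

t = np.array([+1, +1, -1, -1])
y = np.array([2.0, 0.4, -0.3, 1.5])   # made-up classification scores

print(hinge_loss(t, y))
# [0.  0.6 0.7 2.5]  -> case 1, case 2, case 2, case 3 respectively
```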
Relationship Between Hinge Loss and SVM
Let us understand the relationship between hinge loss and SVM mathematically.
Hard Margin and Hinge Loss
A hard margin SVM is a type of SVM that aims to find a hyperplane that perfectly separates the two classes without any misclassification. It assumes that the data is linearly separable, and the objective is to maximize the margin while ensuring that all training data points are correctly classified.
So, mathematically speaking, for a hard margin we want our model to classify all the points in such a way that the score t·(w·x + b) is at least 1 for every point, while minimizing the weight vector w. Thus a good classifier, i.e. a good hyperplane, will be one that gives a large positive value of t·(w·x + b) for all the points. This encourages the SVM to find a hyperplane that not only separates the two classes but also maximizes the margin between them. Mathematically:

minimize (1/2)·||w||²  subject to  ti·(w·xi + b) ≥ 1 for every training point (xi, ti)
