
Unit 4

Support Vector Machine (SVM) is a popular supervised learning algorithm primarily used for classification, aiming to create a hyperplane that separates classes in n-dimensional space. It includes linear and non-linear SVM types for different data separability scenarios. Additionally, Linear Discriminant Analysis (LDA) is discussed as a dimensionality reduction technique for classification problems, along with the Perceptron model, which serves as a foundational binary classifier in neural networks.

Uploaded by

upmakaprasad
Copyright
© All Rights Reserved

UNIT-IV

SUPPORT VECTOR MACHINE (SVM):

INTRODUCTION:
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms.
It can be used for both classification and regression problems, but in machine learning
it is primarily used for classification.
The goal of the SVM algorithm is to find the best line or decision boundary that
divides n-dimensional space into classes, so that new data points can be placed in
the correct category. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine. Consider the diagram below, in which
two different categories are separated by a decision boundary or
hyperplane:
Example: SVM can be understood with the example that we used for the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs, and we want a model that can accurately identify whether it is a cat or a
dog. Such a model can be created with the SVM algorithm. We first
train the model with many images of cats and dogs so that it learns the
different features of each, and then we test it with this strange creature.
The SVM creates a decision boundary between the two classes (cat
and dog) and chooses the extreme cases (support vectors) of each class.
On the basis of the support vectors, it will classify the creature as a cat.
Consider the diagram below:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

o Linear SVM: Linear SVM is used for linearly separable data. If a
dataset can be divided into two classes by a single straight line
(a hyperplane in higher dimensions), the data is termed linearly
separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for data that is not linearly
separable. If a dataset cannot be divided by a straight line, the
data is termed non-linear, and the classifier used is called a
Non-linear SVM classifier.
LINEAR DISCRIMINANT FUNCTIONS FOR BINARY CLASSIFICATION:

Linear Discriminant Analysis (LDA) is one of the commonly used
dimensionality reduction techniques in machine learning for
classification problems with two or more classes. It is also known as Normal
Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).

LDA can be used to project features from a higher-dimensional space into a
lower-dimensional space in order to reduce computational resources and the cost
of working in many dimensions. In this topic, "Linear Discriminant Analysis (LDA)
in machine learning", we will discuss the LDA algorithm for classification
predictive modeling problems, the limitation of logistic regression, the
representation of the linear discriminant analysis model, how to make a
prediction using LDA, how to prepare data for LDA, extensions to LDA and much
more. So, let's start with a quick introduction to Linear Discriminant Analysis
(LDA) in machine learning.

What is Linear Discriminant Analysis (LDA)?


Basic logistic regression is designed for two-class problems, whereas
linear discriminant analysis applies directly to classification problems
with more than two classes.

Linear discriminant analysis is one of the most popular
dimensionality reduction techniques used for supervised
classification problems in machine learning. It is also used as a
pre-processing step when modeling differences between classes in ML and in
pattern classification applications.

Whenever two or more classes with multiple features need to be separated
efficiently, the Linear Discriminant Analysis model is one of the most common
techniques for such classification problems. For example, suppose we
have two classes with multiple features and need to separate them efficiently.
If we classify them using only a single feature, the classes may overlap.

To overcome the overlap in the classification process, we keep increasing
the number of features used.
How does Linear Discriminant Analysis (LDA) work?
Linear discriminant analysis is used as a dimensionality reduction
technique in machine learning. With it we can, for example, project 2-D
or 3-D data onto a 1-dimensional axis.

Consider an example where we have two classes in a 2-D plane
with X and Y axes, and we need to classify them efficiently. LDA
lets us find a straight line that separates the two classes of data points
as well as possible. Here, LDA uses the X and Y axes to create a new axis and
projects the data onto it, so that the classes can be separated along that
single axis.

To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between the means of the two classes.
o It minimizes the variance within each individual class.
o Using these two conditions, LDA generates a new axis that
maximizes the distance between the means of the
two classes while minimizing the variation within each class.
o In other words, the new axis increases the
separation between the data points of the two classes when they are
projected onto it.
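
The two criteria above can be illustrated with a small sketch. It is not taken from the original notes: it assumes the classical Fisher formulation for two classes, in which the new axis is w = Sw^-1 (m_b - m_a), where m_a and m_b are the class means and Sw is the summed within-class scatter matrix. The data points and all function names are hypothetical, chosen only for illustration.

```python
# Minimal two-class Fisher LDA sketch in 2-D (standard library only).
# Assumption: the new axis is w = Sw^-1 (m_b - m_a); data below is made up.

def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def scatter(points, m):
    # 2x2 within-class scatter matrix: sum of (x - m)(x - m)^T
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    # Project each 2-D point onto the 1-D axis defined by w
    return [p[0] * w[0] + p[1] * w[1] for p in points]

class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]

w = fisher_direction(class_a, class_b)
proj_a = project(class_a, w)
proj_b = project(class_b, w)
print("new axis:", w)
print("class A on new axis:", proj_a)
print("class B on new axis:", proj_b)
```

On this toy data every projected point of class A falls below every projected point of class B, so a single threshold on the new axis separates the classes.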

Drawbacks of Linear Discriminant Analysis (LDA)


LDA is used to solve supervised classification
problems with two or more classes, which basic logistic
regression does not handle directly. However, LDA fails in cases where
the classes share the same mean. In that case, LDA cannot find a
new axis that makes the classes linearly separable.

To overcome such problems, non-linear discriminant
analysis is used in machine learning.

PERCEPTRON ALGORITHM:
In Machine Learning and Artificial Intelligence, the Perceptron is one of the most commonly
encountered terms. It is an early step in learning Machine Learning and Deep Learning
technologies, and it consists of a set of weights, input values or scores, and a
threshold. The Perceptron is a building block of an Artificial Neural Network. Frank
Rosenblatt invented the Perceptron in the mid-20th century (late 1950s) to perform
calculations that detect features in input data. The Perceptron is a linear
Machine Learning algorithm used for supervised learning of binary classifiers. This
algorithm enables neurons to learn by processing the elements of the training set one at a
time. In this section we discuss the Perceptron and its basic functions, starting with a
basic introduction.

What is the Perceptron model in Machine Learning?


The Perceptron is a Machine Learning algorithm for supervised learning of
binary classification tasks. It is also
understood as an artificial neuron or neural network unit that
performs computations to detect features in the input data.

The Perceptron model is also regarded as one of the simplest types of
Artificial Neural Networks. It is a supervised learning algorithm
for binary classifiers, and we can consider it a single-layer neural
network with four main parameters: input values, weights and
bias, net sum, and an activation function.

Basic Components of Perceptron


Frank Rosenblatt's perceptron model is a binary classifier that
contains three main components. These are as follows:

o Input Nodes or Input Layer:

This is the component of the Perceptron that accepts the initial data
into the system for further processing. Each input node holds a real
numerical value.

o Weight and Bias:

The weight parameter represents the strength of the connection between
units. The larger the magnitude of a weight, the more strongly the
associated input influences the output. The bias can be
considered the intercept in a linear equation.

o Activation Function:

This is the final component, which determines
whether the neuron will fire or not. In the basic perceptron, the activation
function is a step function.

Types of Activation functions:


o Sign function
o Step function, and
o Sigmoid function

The activation function is chosen according to the problem statement and the
desired form of the output. The choice (e.g., Sign, Step, or Sigmoid) may also
be revisited in perceptron models if the learning process is slow or suffers
from vanishing or exploding gradients.

How does Perceptron work?


In Machine Learning, the Perceptron is considered a single-layer neural network
that consists of four main parameters: input values (input nodes), weights
and bias, net sum, and an activation function. The perceptron model begins by
multiplying each input value by its weight, then adds these products
together to create the weighted sum. This weighted sum is then passed to the
activation function 'f' to obtain the desired output. In the basic perceptron,
this activation function is a step function.
Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values by their corresponding weight
values and then add them to obtain the weighted sum.
Mathematically, the weighted sum is:

∑wi*xi = w1*x1 + w2*x2 + … + wn*xn

Add a special term called the bias 'b' to this weighted sum. It shifts the
decision threshold and so makes the model more flexible.

∑wi*xi + b

Step-2

In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives an output either in binary form or as a
continuous value:

Y = f(∑wi*xi + b)
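
The two steps can be sketched in a few lines of Python. This is only an illustration: the step function's threshold convention (output 1 when the sum is at least 0) and the AND-gate weights and bias are assumptions chosen for the example, not values from the notes.

```python
# Sketch of the perceptron forward pass: Y = f(sum(wi * xi) + b).
# Assumption: f is a step function returning 1 when its argument is >= 0.

def step(z):
    return 1 if z >= 0 else 0

def perceptron_output(inputs, weights, bias):
    # Step 1: weighted sum of the inputs plus the bias
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function
    return step(weighted_sum)

# Hypothetical weights and bias that make the unit behave like an AND gate
weights = [1.0, 1.0]
bias = -1.5

outputs = [perceptron_output(x, weights, bias)
           for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)  # [0, 0, 0, 1]
```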

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These
are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model

Single Layer Perceptron Model:


This is one of the simplest types of artificial neural network (ANN). A
single-layer perceptron model consists of a feed-forward network and includes
a threshold transfer function. The main objective of the
single-layer perceptron model is to classify linearly separable objects
with binary outcomes.

A single-layer perceptron has no prior knowledge of the
data, so it begins with randomly initialized weight parameters.
It then computes the weighted sum of its inputs. If this
sum is more than a pre-determined threshold value, the model
activates and outputs +1.

If the output matches the desired (target) value, the
performance of the model is considered satisfactory and the weights do
not change. If the output differs from the target, the weights are
adjusted to reduce the error, and the process repeats over the
training examples.
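
The training procedure described above can be sketched with the classic perceptron update rule, w <- w + lr * (target - output) * x. This is a minimal illustration under stated assumptions: the weights start at zero rather than at random values so that the run is reproducible, the outputs are 0/1 rather than +1/-1, and the OR-gate data, learning rate, and function names are hypothetical.

```python
# Sketch of single-layer perceptron training on a linearly separable
# problem (the OR gate). Zero initialization is an assumption made for
# reproducibility; the notes describe random initialization.

def step(z):
    return 1 if z >= 0 else 0

def predict(x, weights, bias):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_perceptron(samples, targets, lr=1.0, max_epochs=20):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(samples, targets):
            error = t - predict(x, weights, bias)
            if error != 0:
                # Output differs from the target: adjust weights and bias
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                errors += 1
        if errors == 0:
            break  # every example is classified correctly
    return weights, bias

samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]  # OR gate

weights, bias = train_perceptron(samples, targets)
predictions = [predict(x, weights, bias) for x in samples]
print(weights, bias, predictions)
```

Because the OR gate is linearly separable, the loop stops after a few epochs with all four examples classified correctly.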

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:


Like a single-layer perceptron model, a multi-layer perceptron model
has the same basic structure but adds one or more hidden layers.

The multi-layer perceptron model is typically trained with the backpropagation
algorithm, which executes in two stages as follows:

o Forward Stage: Activations are computed starting from the input layer
and propagated forward until they reach the output layer.
o Backward Stage: In the backward stage, weight and bias values
are modified to reduce the error. The error between the actual output
and the desired output is propagated backward from the
output layer toward the input layer.

Hence, a multi-layer perceptron can be regarded as a network of multiple
layers of artificial neurons in which the activation function is not
limited to a step or linear function, as it is in a single-layer perceptron
model. Instead, the activation function can be sigmoid,
TanH, ReLU, etc.
A multi-layer perceptron model has greater processing power and can
process both linear and non-linear patterns. It can also implement logic
gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
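
As an illustration of why a hidden layer matters, the sketch below builds XOR, which no single-layer perceptron can represent, from two hidden units and one output unit. The weights are hand-picked for the example (they are an assumption, not learned by backpropagation), and step activations are used to keep the arithmetic easy to follow.

```python
# Sketch: a two-layer perceptron with hand-set weights that computes XOR.
# Hidden unit h_or acts as OR, h_and acts as AND; output = h_or AND NOT h_and.

def step(z):
    return 1 if z >= 0 else 0

def xor_network(x1, x2):
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    return step(h_or - h_and - 0.5)

xor_outputs = [xor_network(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0, 1, 1, 0]
```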

Advantages of Multi-Layer Perceptron:

o A multi-layer perceptron model can be used to solve complex
non-linear problems.
o It can work with both small and large input data.
o It gives quick predictions after training.
o It can often reach comparable accuracy with large as well as small
data, although this depends on the problem.

Disadvantages of Multi-Layer Perceptron:

o In a multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to determine how much each
independent variable affects the dependent variable.
o The model's performance depends on the quality of the training.

LARGE MARGIN CLASSIFIER FOR LINEARLY SEPARABLE DATA:


A large margin classifier for linearly separable data aims to find a decision boundary
(hyperplane) that maximizes the distance (margin) between the boundary and the closest data
points from each class. The most common example of such a classifier is the Support Vector
Machine (SVM). Here's a detailed explanation:

Support Vector Machine (SVM)


1. Objective

The objective of an SVM is to find a hyperplane that not only separates the classes but does
so with the maximum margin. The margin is defined as the distance between the hyperplane
and the nearest data points from either class, which are known as the support vectors.

2. Formulating the Hyperplane

o In a $d$-dimensional space, a hyperplane is defined by the equation:

$$w \cdot x + b = 0$$

where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias
term.

o For a given training set $\{(x_i, y_i)\}$, where $x_i$ are the feature
vectors and $y_i \in \{-1, 1\}$ are the class labels, the SVM aims
to find $w$ and $b$ such that:

$$y_i (w \cdot x_i + b) \geq 1, \quad \forall i$$

3. Maximizing the Margin

The margin is defined as $2 / \|w\|$. To maximize the margin, we minimize $\|w\|$.
This leads to the following optimization problem:

$$\min_{w, b} \frac{1}{2} \|w\|^2$$

subject to:

$$y_i (w \cdot x_i + b) \geq 1, \quad \forall i$$

4. Dual Formulation

The above problem can be solved more efficiently using its dual form. Introducing Lagrange
multipliers $\alpha_i \geq 0$, the dual problem becomes:

$$\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

subject to:

$$\sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \leq \alpha_i, \quad \forall i$$

5. Finding the Optimal Hyperplane

Once the optimal $\alpha_i$ values are found, the weight vector $w$ can be computed as:

$$w = \sum_{i=1}^n \alpha_i y_i x_i$$

The bias term $b$ can be determined using the support vectors:

$$b = y_k - w \cdot x_k \quad \text{for any support vector } (x_k, y_k)$$

6. Decision Function

The decision function for classifying a new data point $x$ is:

$$f(x) = \text{sign}(w \cdot x + b)$$
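
Steps 5 and 6 can be checked on a toy problem small enough to solve by hand. The sketch below is an illustration, not part of the original notes: the two training points are hypothetical, and the multipliers alpha = 0.25 were derived by hand for this dataset (they maximize the dual objective 2a - 4a^2 and satisfy the constraint that the alpha_i * y_i sum to zero). A real problem would obtain the multipliers from a quadratic programming solver.

```python
# Sketch: recover w, b and the decision function from known dual variables.
# Assumption: a hypothetical two-point dataset whose optimal multipliers
# (alpha = 0.25 each) were worked out by hand.
import math

X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# w = sum_i alpha_i * y_i * x_i
w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(2)]

# b = y_k - w . x_k for any support vector (both points are support vectors)
b = y[0] - dot(w, X[0])

def decide(x):
    # f(x) = sign(w . x + b); ties at exactly 0 are mapped to +1 here
    return 1 if dot(w, x) + b >= 0 else -1

margin = 2 / math.sqrt(dot(w, w))
print("w =", w, "b =", b, "margin =", margin)
print(decide((2.0, 0.5)), decide((-3.0, 1.0)))
```

Here the margin 2/||w|| comes out equal to the distance between the two training points, as expected when both points lie exactly on the margin boundaries.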

Key Properties

o Maximizing the Margin: By maximizing the margin, SVMs aim to improve the classifier's
ability to generalize to unseen data.
o Support Vectors: The decision boundary is determined by the support vectors, which are the
data points closest to the hyperplane.
o Robustness: Large margin classifiers like SVMs are less likely to overfit, especially in
high-dimensional spaces.

Practical Considerations

o Feature Scaling: It is important to scale features so that all features contribute equally to the
margin calculation.
o Linear Separability: The basic SVM assumes that the data is linearly separable. For data that
is not linearly separable, kernel methods can be used to map the data into a
higher-dimensional space where a linear separator can be found.

Example

Consider a binary classification problem with two classes of data points in a 2D space. An
SVM will find the line (hyperplane) that separates the two classes with the maximum margin.
The data points that lie closest to the line are the support vectors. The distance between this
line and the support vectors is the margin, which the SVM maximizes.

In summary, a large margin classifier such as an SVM is an effective method for linearly
separable data. It finds the hyperplane that maximizes the margin between the classes,
leading to better generalization and robustness.

LINEAR SOFT MARGIN CLASSIFIER FOR OVERLAPPING CLASSES:


A Linear Soft Margin Classifier is a type of Support Vector Machine (SVM) used for binary
classification tasks where the classes may overlap. Unlike the hard margin SVM, which
assumes that the data is perfectly separable, the soft margin SVM allows for some
misclassifications to create a more flexible decision boundary.

Key Concepts:

1. Hyperplane: The decision boundary that separates different classes.
2. Margin: The distance between the hyperplane and the closest data points from either class.
In soft margin SVM, this margin is maximized while allowing some misclassifications.
3. Slack Variables ($\xi$): These variables measure the degree of misclassification of a data
point. If $\xi_i = 0$, the point is correctly classified and lies on the correct side of the
margin. If $0 < \xi_i \leq 1$, the point is correctly classified but within the margin.
If $\xi_i > 1$, the point is misclassified.
4. Regularization Parameter ($C$): This parameter controls the trade-off between maximizing
the margin and minimizing the classification error. A large $C$ puts more emphasis on
minimizing misclassification, resulting in a narrower margin, while a small $C$ allows more
misclassifications, leading to a wider margin.
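
The three slack-variable cases can be made concrete with a short sketch. It assumes the usual closed form for the slack of a point under a fixed hyperplane, xi_i = max(0, 1 - y_i (w . x_i + b)). The hyperplane (w, b) and the four data points are hypothetical values chosen so that each case appears.

```python
# Sketch: compute slack variables for a fixed, hypothetical hyperplane and
# label each point according to the three cases listed above.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

w = (0.5, 0.5)   # assumed weight vector
b = 0.0          # assumed bias

points = [
    ((2.0, 2.0), 1),     # well beyond the margin
    ((0.5, 0.5), 1),     # correct side, but inside the margin
    ((-1.0, 0.0), 1),    # wrong side of the hyperplane
    ((-2.0, -2.0), -1),  # well beyond the margin
]

def slack(x, label):
    return max(0.0, 1.0 - label * (dot(w, x) + b))

def describe(xi):
    if xi == 0:
        return "correct, outside or on the margin"
    if xi <= 1:
        return "correct, inside the margin"
    return "misclassified"

slacks = [slack(x, label) for x, label in points]
cases = [describe(xi) for xi in slacks]
for (x, label), xi, case in zip(points, slacks, cases):
    print(x, label, xi, case)
```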

Mathematical Formulation:

The objective is to find the optimal hyperplane that minimizes the following cost function:

$$\min_{\mathbf{w}, b, \xi} \left( \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \right)$$

subject to the constraints:

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i$$

Here:

o $\mathbf{w}$ is the weight vector perpendicular to the hyperplane.
o $b$ is the bias term.
o $\xi_i$ are the slack variables.
o $y_i$ are the class labels ($+1$ or $-1$).
o $\mathbf{x}_i$ are the feature vectors.

Steps to Implement a Linear Soft Margin Classifier:

1. Data Preprocessing: Scale the data to have zero mean and unit variance.
2. Formulate the Optimization Problem: Define the objective function and constraints.
3. Solve the Optimization Problem: Use quadratic programming solvers to find the optimal
$\mathbf{w}$ and $b$.
4. Make Predictions: For a new data point $\mathbf{x}$, the prediction is based on the sign of
$\mathbf{w} \cdot \mathbf{x} + b$.
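
The steps above can be sketched end to end. Note the substitution: instead of a quadratic programming solver, this sketch minimizes the equivalent unconstrained objective (1/2)||w||^2 + C * sum(max(0, 1 - y_i (w . x_i + b))) with plain batch subgradient descent, which is simpler to show but converges only approximately. The dataset (including one deliberately overlapping point), the learning rate, and the step count are all hypothetical, and feature scaling is skipped because the toy features already share a scale.

```python
# Sketch: linear soft-margin classifier trained by batch subgradient descent
# on the hinge-loss form of the objective (a stand-in for a QP solver).

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_soft_margin(X, y, C=1.0, lr=0.01, steps=2000):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(steps):
        grad_w = list(w)   # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:   # point has positive slack
                for k in range(len(w)):
                    grad_w[k] -= C * yi * xi[k]
                grad_b -= C * yi
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Two clusters plus one positive point that overlaps the negative side
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
     (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0),
     (-0.5, -0.5)]
y = [1, 1, 1, -1, -1, -1, 1]

w, b = train_soft_margin(X, y)
print("w =", w, "b =", b)
print(predict(w, b, (3.0, 3.0)), predict(w, b, (-3.0, -3.0)))
```

The overlapping point keeps a positive slack throughout training, yet the classifier still separates the two main clusters, which is the behavior the soft margin is meant to allow.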

KERNEL INDUCED FEATURE SPACE:

In Support Vector Machines (SVMs), there are several types of kernel
functions that can be used to map the input data into a
higher-dimensional feature space. The choice of kernel function depends on
the specific problem and the characteristics of the data.

Here are some of the most commonly used kernel functions in SVMs:


Linear Kernel
A linear kernel is the simplest and one of the most commonly used kernel
functions in machine learning, including in SVMs (Support Vector Machines).
It is defined as the dot product between
the input vectors in the original feature space.

The linear kernel can be defined as:

K(x, y) = x · y

where x and y are the input feature vectors. The dot product of the input
vectors is a measure of their similarity in the original feature
space.

When using a linear kernel in an SVM, the decision boundary is a
hyperplane that separates the different classes in the feature space. This
linear boundary is useful when the data is already linearly separable
or when dealing with high-dimensional data,
where more complex kernel functions may lead to overfitting.

Polynomial Kernel
A polynomial kernel is a kernel function used in machine learning, such as in
SVMs (Support Vector Machines). It is a nonlinear
kernel function that uses polynomial functions to map the input
data into a higher-dimensional feature space.

The polynomial kernel can be defined as:

K(x, y) = (x · y + c)^d

where x and y are the input feature vectors, c is a constant term, and d is
the degree of the polynomial. The constant term is
added to the dot product of the input vectors, and the result is raised to the
degree of the polynomial.
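
The phrase "kernel induced feature space" can be made concrete with the polynomial kernel. The sketch below assumes the special case c = 0, d = 2 in two dimensions, where the kernel equals an ordinary dot product after the explicit mapping phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2). The function names and sample vectors are hypothetical.

```python
# Sketch: for c = 0 and d = 2 in 2-D, the polynomial kernel K(x, y) = (x . y)^2
# equals the dot product of explicitly mapped 3-D feature vectors.
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(x, y, c=0.0, d=2):
    return (dot(x, y) + c) ** d

def phi(x):
    # Explicit feature map matching the degree-2 kernel with c = 0
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x = (1.0, 2.0)
y = (3.0, 4.0)

kernel_value = polynomial_kernel(x, y)
explicit_value = dot(phi(x), phi(y))
print(kernel_value, explicit_value)
```

Both values agree (121 for these vectors, up to floating-point rounding), which shows that the kernel computes a similarity in the higher-dimensional space without ever constructing the mapped vectors.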
In general, the polynomial kernel is an effective tool for converting the
input data into a higher-dimensional feature space in order to capture
nonlinear correlations between the input characteristics.

Gaussian (RBF) Kernel


The Gaussian kernel, also known as the radial basis function (RBF) kernel,
is a popular kernel function used in machine learning, particularly in SVMs
(Support Vector Machines). It is a nonlinear kernel function that maps the
input data into a higher-dimensional feature space using a Gaussian
function.

The Gaussian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||^2)

Where x and y are the input feature vectors, gamma is a parameter that
controls the width of the Gaussian function, and ||x - y||^2 is the squared
Euclidean distance between the input vectors.

When using a Gaussian kernel in an SVM, the decision boundary is
nonlinear in the original input space and can capture complex nonlinear
relationships between the input features. The width of the Gaussian function,
controlled by the gamma parameter, determines the degree of nonlinearity in the
decision boundary.

Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential
kernel, is a kernel function used in machine learning, including in
SVMs (Support Vector Machines). It measures the similarity between two input
feature vectors.

The Laplacian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||_1)

where x and y are the input feature vectors, gamma is a parameter that
controls the width of the Laplacian function, and ||x - y||_1 is the L1 norm, or
Manhattan distance, between the input vectors.

When using a Laplacian kernel in an SVM, the decision boundary is
nonlinear in the original input space and can capture complex relationships
between the input features. The width of the Laplacian function, controlled by
the gamma parameter, determines the degree of nonlinearity in the decision
boundary.

One advantage of the Laplacian kernel is its robustness to outliers: its
exponent grows linearly with distance rather than quadratically, so it
penalizes large distances between the input vectors less steeply than the
Gaussian kernel. However, like the Gaussian kernel, choosing the correct
value of the gamma parameter can be challenging.
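
The Gaussian and Laplacian kernels can be compared directly from their definitions. This is a minimal sketch: the value gamma = 0.1 and the sample vectors are arbitrary choices for illustration.

```python
# Sketch: Gaussian (RBF) and Laplacian kernels computed from their formulas.
import math

def gaussian_kernel(x, y, gamma=0.1):
    # exp(-gamma * squared Euclidean distance)
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

def laplacian_kernel(x, y, gamma=0.1):
    # exp(-gamma * L1 (Manhattan) distance)
    l1_distance = sum(abs(xi - yi) for xi, yi in zip(x, y))
    return math.exp(-gamma * l1_distance)

x = (0.0, 0.0)
y = (3.0, 4.0)   # squared Euclidean distance 25, L1 distance 7

g = gaussian_kernel(x, y)
l = laplacian_kernel(x, y)
print("gaussian:", g)    # exp(-2.5)
print("laplacian:", l)   # exp(-0.7)
```

For these two distant points the Laplacian kernel still reports noticeably more similarity than the Gaussian kernel, reflecting its slower decay with distance. Both kernels return 1 when the two vectors are identical.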

NONLINEAR CLASSIFIER:
Non-linear classification refers to categorizing instances that
are not linearly separable. Some of the classifiers that use non-linear
functions to separate classes are the Quadratic Discriminant
Classifier, Multi-Layer Perceptron (MLP), Decision Trees, Random
Forest, and K-Nearest Neighbours (KNN).
Introduction to Non-Linear Classifiers
If you're new to this field, you might have heard about linear
classifiers and non-linear classifiers. These two distinct types of
algorithms help us make sense of complex data and make accurate
predictions.

In the context of machine learning, classification is a supervised
learning task where we train a model to predict which category (or
class) a new observation belongs to, based on previously seen examples.
There are two main types of classifiers: linear and non-linear.

Linear classifiers work by finding a straight line, plane, or hyperplane
that separates the different classes in the feature space.

They are relatively simple, easy to interpret, and fast to train. Some
common linear classifiers include logistic regression and linear support
vector machines.

Non-linear classifiers, on the other hand, can find more complex decision
boundaries to separate the classes. They can capture intricate patterns
and relationships within the data that linear classifiers might miss.

Non-linear classifiers include decision trees, neural networks, kernel
support vector machines, and many others.
Popular Non-linear Classification Algorithms

Real-world Applications of Non-linear Classifiers
o Image Recognition: Non-linear classifiers, especially deep
neural networks, have revolutionized image recognition tasks,
achieving unprecedented performance in object detection, facial
recognition, and image classification.
o Natural Language Processing: Non-linear classifiers, such as
recurrent and transformer-based neural networks, have greatly
improved the state of natural language processing, enabling
applications like machine translation, sentiment analysis, and
question-answering systems.
o Anomaly Detection: Non-linear classifiers, like SVM and
autoencoders, can effectively detect unusual patterns in data,
making them useful for applications like fraud detection, network
intrusion detection, and industrial equipment monitoring.
o Bioinformatics: In the field of bioinformatics, non-linear
classifiers have been employed for tasks like protein structure
prediction, gene expression analysis, and disease diagnosis based
on genomic data.
REGRESSION BY SUPPORT VECTOR MACHINES:
Support vector regression (SVR) is a type of support vector machine
(SVM) that is used for regression tasks. It tries to find a function that
best predicts the continuous output value for a given input value.

SVR can use both linear and non-linear kernels. A linear kernel is a
simple dot product between two input vectors, while a non-linear
kernel is a more complex function that can capture more intricate
patterns in the data. The choice of kernel depends on the data’s
characteristics and the task’s complexity.

In the scikit-learn package for Python, you can use the 'SVR' class to
perform SVR with a linear or non-linear kernel. To specify the kernel,
set the 'kernel' parameter to 'linear' or 'rbf' (radial basis
function).

Concepts related to the Support vector regression (SVR):


There are several concepts related to support vector regression (SVR)
that you may want to understand in order to use it effectively. Here
are a few of the most important ones:
o Support vector machines (SVMs): SVR is a type of support
vector machine (SVM), a supervised learning algorithm that can
be used for classification or regression tasks. SVMs try to find
the hyperplane in a high-dimensional space that maximally
separates different classes or best fits the output values.
o Kernels: SVR can use different types of kernels, which are
functions that determine the similarity between input vectors.
A linear kernel is a simple dot product between two input
vectors, while a non-linear kernel is a more complex function
that can capture more intricate patterns in the data. The choice
of kernel depends on the data's characteristics and the task's
complexity.
o Hyperparameters: SVR has several hyperparameters that you
can adjust to control the behavior of the model. For example,
the 'C' parameter controls the trade-off between keeping the
model simple (flat) and penalizing errors that fall outside the
epsilon-insensitive tube. A larger value of 'C' means that the
model will try harder to reduce those errors, while a
smaller value of 'C' means that the model will be more lenient in
allowing larger errors.
o Model evaluation: Like any machine learning model, it's
important to evaluate the performance of an SVR model. One
common way to do this is to split the data into a training set
and a test set, and use the training set to fit the model and the
test set to evaluate it. You can then use metrics like mean
squared error (MSE) or mean absolute error (MAE) to measure
the error between the predicted and true output values.

A Support Vector Machine can also be used as a regression method, maintaining all
the main features that characterize the algorithm (maximal margin). Support
Vector Regression (SVR) uses the same principles as the SVM for classification,
with a few differences. First, because the output is a real number with
infinitely many possible values, an exact prediction cannot be required.
Instead, a margin of tolerance (epsilon) is set around the prediction, and
errors smaller than epsilon are ignored. The optimization problem is also
somewhat more involved than in classification. However,
the main idea is always the same: to minimize error by finding the
hyperplane which maximizes the margin, keeping in mind that part of the error is
tolerated.
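
The tolerated part of the error corresponds to what is usually called the epsilon-insensitive loss, max(0, |y - f(x)| - epsilon). The sketch below assumes that standard form. The target values, predictions, and epsilon = 0.1 are hypothetical.

```python
# Sketch: epsilon-insensitive loss. Errors smaller than epsilon cost nothing;
# larger errors are charged only for the part that exceeds epsilon.

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    return max(0.0, abs(y_true - y_pred) - epsilon)

targets = [3.0, 3.0, 3.0]
predictions = [3.05, 2.95, 3.5]   # first two fall inside the epsilon tube

losses = [epsilon_insensitive_loss(t, p) for t, p in zip(targets, predictions)]
print(losses)
```

The first two predictions lie within 0.1 of the target and incur no loss, while the third is charged roughly 0.4, the amount by which its error exceeds epsilon.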
Linear SVR

Non-linear SVR
The kernel functions transform the data into a higher-dimensional feature space to make it
possible to perform a linear separation.
Kernel functions
LEARNING WITH NEURAL NETWORKS:

NEURON MODELS:

Biological neuron models, also known as spiking neuron models,
are mathematical descriptions of the conduction of electrical signals
in neurons. Neurons (or nerve cells) are electrically excitable cells
within the nervous system, able to fire electric signals, called action
potentials, across a neural network.
Artificial Neuron
Artificial neurons bear only a modest resemblance to the real thing. They model
approximately three of the processes that biological neurons perform.
An artificial neuron
(i) evaluates the input signals, determining the strength of each one;
(ii) calculates a total for the combined input signals and compares that total to
some threshold level; and
(iii) determines what the output should be.

Input and Outputs


Just as there are many inputs (stimulation levels) to a biological neuron, there
should be many input signals to our artificial neuron (AN). All of them should
come to our AN simultaneously. In response, a biological neuron either 'fires'
or 'doesn't fire' depending upon some threshold level. Our AN will be allowed a
single output signal, just as is present in a biological neuron: many inputs,
one output.

Weighting Factors
Each input will be given a relative weighting, which will affect the impact of that
input (Fig. 5.3). This is something like the varying synaptic strengths of
biological neurons: some inputs are more important than others in the way they
combine to produce an impulse. Weights are adaptive coefficients within the
network that determine the intensity of the input signal. In fact, this
adaptability of connection strength is precisely what gives neural networks
their ability to learn and store information and, consequently, is an essential
element of all neuron models.
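The three processes described above (weighting the inputs, summing them, and comparing the total with a threshold) can be sketched in a few lines. The function name, the weights, and the threshold of 0.7 are assumptions made for this illustration.

```python
def artificial_neuron(inputs, weights, threshold):
    """Weight each input, total them, and fire (1) if the total reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# Hypothetical weights: the third input is the most important one.
weights = [0.6, 0.4, 0.9]

fires = artificial_neuron([1.0, 0.5, 0.0], weights, threshold=0.7)
quiet = artificial_neuron([0.0, 0.5, 0.0], weights, threshold=0.7)
```

Many inputs go in and a single output comes out; changing the weights changes which input patterns make the neuron fire.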

The Mathematical Model of Neuron
The basic rules are that neurons are added when training is slow or when the
mean squared error is larger than a specified value, and that neurons are
removed when a change in a neuron's value does not correspond to a change in
the network's response or when the weight values that are
Linear Neuron and Widrow-Hoff Learning Rule:
Linear Neuron:

An artificial neural network, inspired by the human neural system, is a network used to
process data. It can consist of three types of layer: the input layer, the hidden layer, and the
output layer. The most basic neural network contains only two layers, the input and output
layers. The layers are connected by weighted paths, which are used to compute the net input.
In this section, we discuss two basic types of neural network: Adaline, which does not
have a hidden layer, and Madaline, which has one hidden layer.

1.Adaptive Linear Neuron (Adaline):

Adaline, which stands for Adaptive Linear Neuron, is a network having a single linear unit. It
was developed by Widrow and Hoff in 1960. Some important points about Adaline are as
follows −

• It uses a bipolar activation function.
• An Adaline neuron can be trained using the Delta rule, also known as the Least Mean
Square (LMS) rule or the Widrow-Hoff rule.
• The net input is compared with the target value to compute the error signal.
• The weights are adjusted on the basis of the adaptive training algorithm.

The basic structure of Adaline is similar to that of the perceptron, with an extra feedback loop
through which the actual output is compared with the desired/target output. After this
comparison, the weights and bias are updated according to the training algorithm.
Adaptive Linear Neuron Learning algorithm

Step 0 − Initialize the weights and the bias to some random values, but not to zero. Also
initialize the learning rate α.

Step 1 − Perform steps 2-7 while the stopping condition is false.

Step 2 − Perform steps 3-5 for each bipolar training pair s:t.

Step 3 − Activate each input unit as follows −

$x_i = s_i \quad (i = 1 \text{ to } n)$

Step 4 − Obtain the net input with the following relation −

$y_{in} = b + \sum_{i=1}^{n} x_i w_i$

Here ‘b’ is the bias and ‘n’ is the total number of input neurons.

Step 5 − Until the least mean square error is obtained, adjust the weights and bias using the
error (t - yin) as follows −

$w_i(\text{new}) = w_i(\text{old}) + \alpha (t - y_{in}) x_i$

$b(\text{new}) = b(\text{old}) + \alpha (t - y_{in})$

Step 6 − Calculate the error using $E = (t - y_{in})^2$

Step 7 − Test for the stopping condition: if the error generated is less than or equal to the
specified tolerance, then stop.
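The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the fixed non-zero starting weights (used instead of random values so the run is reproducible), the learning rate, the tolerance, the epoch cap, and the bipolar AND data set are all assumptions.

```python
def train_adaline(samples, alpha=0.05, tolerance=0.01, max_epochs=100):
    """Train a single Adaline with the LMS (Widrow-Hoff) update."""
    n = len(samples[0][0])
    weights = [0.1] * n   # Step 0: small non-zero values (fixed here, not random)
    bias = 0.1
    for _ in range(max_epochs):                   # Step 1
        total_error = 0.0
        for x, t in samples:                      # Steps 2-3: x_i = s_i
            y_in = bias + sum(xi * wi for xi, wi in zip(x, weights))  # Step 4
            delta = t - y_in
            # Step 5: LMS update of weights and bias
            weights = [wi + alpha * delta * xi for wi, xi in zip(weights, x)]
            bias += alpha * delta
            total_error += delta ** 2             # Step 6
        if total_error <= tolerance:              # Step 7
            break
    return weights, bias


def predict(x, weights, bias):
    """Bipolar output obtained by thresholding the net input at zero."""
    y_in = bias + sum(xi * wi for xi, wi in zip(x, weights))
    return 1 if y_in >= 0 else -1


# Bipolar AND function, used here only as an example data set.
and_data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], -1)]
weights, bias = train_adaline(and_data)
outputs = [predict(x, weights, bias) for x, _ in and_data]
```

For this data set the squared error cannot reach zero with a linear unit, so the loop ends at the epoch cap; the thresholded outputs still reproduce the AND targets.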

2.Multiple Adaptive Linear Neuron (Madaline):

Madaline, which stands for Multiple Adaptive Linear Neuron, is a network that consists of
many Adalines in parallel. It has a single output unit. Some important points about
Madaline are as follows −

• It is just like a multilayer perceptron, where the Adalines act as hidden units between
the input and the Madaline layer.
• The weights and the bias between the input and Adaline layers, as we saw in the
Adaline architecture, are adjustable.
• The weights between the Adaline and Madaline layers are fixed, with a bias of 1.
• Training can be done with the help of the Delta rule.

It consists of “n” units of input layer and “m” units of Adaline layer and “1” unit of the
Madaline layer. Each neuron in the Adaline and Madaline layers has a bias of excitation “1”.
The Adaline layer is present between the input layer and the Madaline layer; the Adaline
layer is considered as the hidden layer.
Multiple Adaptive Linear Neuron (Madaline) Training Algorithm

By now we know that only the weights and bias between the input and the Adaline layer are
to be adjusted, and the weights and bias between the Adaline and the Madaline layer are
fixed.

Step 0 − Initialize the weights and the bias (for easy calculation they can be set to zero). Also
initialize the learning rate α (0 < α ≤ 1); for simplicity, α is set to 1.

Step 1 − Perform steps 2-6 while the stopping condition is false.

Step 2 − Perform steps 3-5 for each bipolar training pair s:t.

Step 3 − Activate each input unit as follows −

$x_i = s_i \quad (i = 1 \text{ to } n)$

Step 4 − Obtain the net input at each hidden unit, i.e. in the Adaline layer, with the following
relation −

$Q_{inj} = b_j + \sum_{i=1}^{n} x_i w_{ij} \quad (j = 1 \text{ to } m)$

Here ‘b’ is bias and ‘n’ is the total number of input neurons.

Step 5 − Apply the following activation function to obtain the final output at the Adaline and
the Madaline layer −

$f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ -1 & \text{if } x < 0 \end{cases}$

Output at the hidden (Adaline) unit

$Q_j = f(Q_{inj})$

Final output of the network

$y_{in} = b_0 + \sum_{j=1}^{m} Q_j v_j$

$y = f(y_{in})$

Step 6 − Calculate the error and adjust the weights as follows −

If t ≠ y and t = +1, update the weights on the Adaline unit Zj whose net input is closest to 0 (zero):
wij(new) = wij(old) + α(1 - Qinj)xi
bj(new) = bj(old) + α(1 - Qinj)
Else if t ≠ y and t = -1, update the weights on every Adaline unit Zk whose net input is positive:
wik(new) = wik(old) + α(-1 - Qink)xi
bk(new) = bk(old) + α(-1 - Qink)

Else if y = t, no weight update is required.

Step 7 − Test for the stopping condition, which is met when there is no change in the
weights or when the largest weight change during training is smaller than the specified
tolerance.
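The forward pass and the update of Step 6 can be sketched as below. This is an illustration under stated assumptions: the starting weights, the fixed output weights v = (0.5, 0.5) with bias b0 = 0.5, the learning rate α = 0.5, and the single training pair are chosen for the example and are not specified in the notes. Only one update step is shown, not the full training loop.

```python
def f(x):
    """Bipolar step activation from Step 5."""
    return 1 if x >= 0 else -1


def forward(x, W, b, v, b0):
    """Return the Adaline net inputs Q_in and the network output y."""
    q_in = [bj + sum(xi * wij for xi, wij in zip(x, wj)) for wj, bj in zip(W, b)]
    q = [f(z) for z in q_in]
    y_in = b0 + sum(qj * vj for qj, vj in zip(q, v))
    return q_in, f(y_in)


def update(x, t, W, b, v, b0, alpha=0.5):
    """Apply Step 6 once, modifying W and b in place."""
    q_in, y = forward(x, W, b, v, b0)
    if y == t:
        return  # no weight update required
    if t == 1:
        # unit whose net input is closest to zero
        units = [min(range(len(q_in)), key=lambda j: abs(q_in[j]))]
    else:
        # every unit whose net input is positive
        units = [k for k, z in enumerate(q_in) if z > 0]
    for j in units:
        err = t - q_in[j]
        W[j] = [wij + alpha * err * xi for wij, xi in zip(W[j], x)]
        b[j] += alpha * err


# W[j] holds the weights from all inputs to Adaline unit j (assumed values).
W = [[0.05, 0.2], [0.1, 0.2]]
b = [0.3, 0.15]
v, b0 = [0.5, 0.5], 0.5   # fixed output-layer weights and bias (assumed)

x, t = [1, 1], -1         # one bipolar training pair, e.g. from XOR
_, y_before = forward(x, W, b, v, b0)
update(x, t, W, b, v, b0)
_, y_after = forward(x, W, b, v, b0)
```

Before the update the network answers +1 for this pair; because the target is -1 and both Adalines have a positive net input, both are pushed towards -1, after which the network output matches the target.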

The Widrow-Hoff Learning Rule:

In 1960 Bernard Widrow and Ted Hoff published an improved learning rule for artificial
neurons. The main difference is the way the weights and the bias unit are updated. A
schematic view of the process is shown in the following figure.

Unlike the perceptron, where we defined the error based on the predicted output, here we use
the result of the activation function to define a loss or cost function. The goal of this loss
function is to find its minimum, which corresponds to the optimal solution to our task.

One way to find this minimum is to use a technique called gradient descent. Depending on
the number of training records used for each learning step, we differentiate between full
batch, stochastic and minibatch gradient descent.

Full batch gradient descent


The main reason for using the activation function instead of the predicted output is that it is a
differentiable function, so we can use calculus to find the minimum of the loss function. The
gradient of a function is a vector that points in the direction of steepest ascent, so the
negative of the gradient points in the direction of steepest descent. If we follow the negative
gradient, we will find the minimum of the function (at least if the activation function is
linear; otherwise we might just end up in a local minimum). This technique is called gradient
descent, and it means that at each step the weights and the bias unit are moved a small
distance, scaled by η, against the gradient of the loss function.

As with the perceptron, η is a learning rate. If η is too small, the descent takes a lot of steps;
if it is too high, we might overshoot the minimum. Therefore, finding an appropriate value is
an important step in fine-tuning the training.

By using the gradient descent, we have implicitly changed the learning step from online
learning, as with the perceptron, to full batch learning. This means, that the loss function is
calculated from the results of the activation function from the entire training data set.
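A minimal sketch of full batch gradient descent for a single linear neuron with one input is given below. The loss is assumed to be half the mean squared error, and the data, learning rate, and number of epochs are illustrative choices rather than values from the notes.

```python
def batch_gradient_descent(xs, ys, eta=0.1, epochs=1000):
    """Fit y = w*x + b by following the negative gradient of the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # The errors are computed on the entire training set before any update.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= eta * grad_w   # step against the gradient, scaled by eta
        b -= eta * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = batch_gradient_descent(xs, ys)
```

Each epoch performs exactly one update, which is what distinguishes full batch learning from the online learning used with the perceptron.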

Stochastic gradient descent (SGD)


An alternative to the full batch learning is the so-called stochastic gradient descent, which
updates the weights and the bias unit after every single training record. For the MSE loss
function, the update process is then:

SGD usually reaches the minimum faster than full batch gradient descent because the weights
are updated more often. However, the path over the error surface is noisier because the loss
function is different for each training record. This can actually be an advantage, because it
also means that for nonlinear activation functions it is easier to escape local minima and find
the global minimum. To avoid patterns arising from the order of the training records, the
training set is shuffled at the beginning of each epoch, which is where the word stochastic in
SGD comes from.
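The same linear neuron can be trained with stochastic updates, as in the sketch below. The squared-error loss per record, the fixed random seed (used so the run is reproducible), and the data are assumptions made for the illustration.

```python
import random


def sgd(xs, ys, eta=0.05, epochs=500, seed=0):
    """Fit y = w*x + b, updating after every single training record."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)            # new record order at the start of each epoch
        for x, y in data:
            error = (w * x + b) - y
            w -= eta * error * x     # update from this one record only
            b -= eta * error
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = sgd(xs, ys)
```

With four records there are four weight updates per epoch instead of one.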

Mini-batch gradient descent


Defining the loss function on a subset of the training set is a compromise between full-batch
and stochastic gradient descent. This has the advantage of reaching the minimum faster than
in full-batch mode, and it allows the use of vectorized operations, which improves the
computational efficiency.
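A mini-batch variant can be sketched in the same style. The batch size of 2, the seed, and the data are illustrative assumptions; plain Python lists are used here for self-containment, whereas in practice the per-batch sums would typically be computed with vectorized array operations.

```python
import random


def minibatches(data, batch_size, rng):
    """Yield shuffled, consecutive slices of the training data."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


def minibatch_gd(xs, ys, eta=0.05, epochs=500, batch_size=2, seed=0):
    """Fit y = w*x + b with one update per mini-batch."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in minibatches(data, batch_size, rng):
            errors = [(w * x + b) - y for x, y in batch]
            grad_w = sum(e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(errors) / len(batch)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = minibatch_gd(xs, ys)
```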

Different regression types


Since the Widrow Hoff rule does not specify the activation and loss functions, we can build
different types of learning algorithms.

Linear regression

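A common loss function is the mean squared error (MSE), the average of the squared
differences between the target values and the outputs of the activation function.

If the MSE is small, the model fits the training data well. If we additionally choose a linear
(identity) activation function, the whole learning rule is nothing other than simple linear
regression.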

Logistic regression

If we use the logistic sigmoid function, $\sigma(z) = 1 / (1 + e^{-z})$, as the activation
function and define the loss function as the negative log-likelihood, we have logistic
regression.

NB: We can use both versions for either classification or regression. For classification, we
apply the threshold function after learning to assign the class; for regression, we simply use
the output of the activation function as the regression result.
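The two choices of activation and loss can be written out as small functions. The function names and the 0.5 classification threshold are assumptions for this sketch; the loss for the logistic case is written as the negative log-likelihood averaged over the records.

```python
import math


def mse(targets, outputs):
    """Mean squared error between targets and activation outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_log_likelihood(targets, probs):
    """Average negative log-likelihood for 0/1 targets and predicted probabilities."""
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(targets, probs))
    return -total / len(targets)


def classify(z, threshold=0.5):
    """Threshold applied after learning to turn the activation into a class label."""
    return 1 if sigmoid(z) >= threshold else 0


example_mse = mse([1.0, 2.0], [1.0, 4.0])
example_nll = negative_log_likelihood([1], [0.5])
```

For regression the value of the activation is used directly, while `classify` shows the extra thresholding step used for classification.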
Delta learning rule:

The delta rule in an artificial neural network is a gradient-descent learning rule that can be
viewed as a special case of backpropagation for networks with a single layer of weights. It
refines the network by adjusting the associations between the inputs and outputs of the
artificial neurons. The Delta rule is also called the Delta learning rule.

Generally, backpropagation has to do with recalculating input weights for artificial neurons
utilizing a gradient technique. Delta learning does this by using the difference between a
target activation and an obtained activation. By using a linear activation function, network
connections are balanced. Another approach to explain the Delta rule is that it uses an error
function to perform gradient descent learning.

The Delta rule compares the actual output with the target output and changes the weights to
reduce the mismatch between them. The details of applying the Delta rule vary with the
network and its composition. Still, with a linear activation function, the delta rule can be
useful in refining several kinds of neural networks trained with backpropagation.

The Delta rule was introduced by Widrow and Hoff and is one of the most significant learning
rules based on supervised learning.

This rule states that the change in the weight of a node is equal to the product of the learning
rate, the error, and the input.

Mathematical equation:

The delta learning rule is given by the following equations:

∆w = µ.x.z

∆w = µ(t-y)x

Here,

∆w = the weight change.
µ = the constant, positive learning rate.
x = the input value from the pre-synaptic neuron.
z = (t-y), the difference between the desired output t and the actual output y.

The above rule can be used only for a single output unit.

The weights are updated according to the following two cases.

Case 1 - When t ≠ y, then

w(new) = w(old) + ∆w

Case 2 - When t = y, then

No change in weight
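The rule and its two cases can be sketched for a single output unit as follows. A linear output (the weighted sum of the inputs) is assumed, and the function name, learning rate, and sample values are illustrative.

```python
def delta_rule_update(weights, x, t, mu=0.1):
    """Return the weights after one delta-rule step for a single linear output unit."""
    y = sum(wi * xi for wi, xi in zip(weights, x))   # actual output
    if y == t:
        return list(weights)                         # Case 2: no change in weight
    # Case 1: w(new) = w(old) + mu * (t - y) * x
    return [wi + mu * (t - y) * xi for wi, xi in zip(weights, x)]


# The output is 0 but the target is 1, so the weights move towards the input.
changed = delta_rule_update([0.0, 0.0], [1.0, 2.0], t=1.0)

# The output already equals the target, so the weights stay as they are.
unchanged = delta_rule_update([0.5, 0.25], [1.0, 2.0], t=1.0)
```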

The Error Correction Delta Rule:

The Delta Rule uses the difference between target activation (i.e., target output values) and
obtained activation to drive learning. For reasons discussed below, the threshold
activation function (as used in both the McCulloch-Pitts network and the perceptron) is
dropped, and instead a linear sum of products is used to calculate the activation of the output
neuron (alternative activation functions can also be applied). The activation function is
therefore called a Linear Activation function, in which the output node’s activation is simply
equal to the sum of the network’s respective input/weight products. The strengths of the
network connections (i.e., the values of the weights) are adjusted to reduce the difference
between target and actual output activation (i.e., the error). A graphical depiction of a simple
two-layer network capable of deploying the Delta Rule is given in the figure below (such a
network is not limited to having only one output node):

During forward propagation through a network, the output (activation) of a given node is a
function of its inputs. The inputs to a node, which are simply the products of the output of
preceding nodes with their associated weights, are summed and then passed through an
activation function before being sent out from the node. Thus, we have the following:

and

where ‘Sj’ is the sum of all relevant products of weights and outputs from the previous layer
i, ‘wij’ represents the relevant weights connecting layer i with layer j, ‘ai’ represents the
activation of nodes in the previous layer i, ‘aj’ is the activation of the node at hand, and ‘f’ is
the activation function.
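From these definitions, the two relations referred to above can be written out as follows. This is a reconstruction from the surrounding text, so the exact layout of the original equations is an assumption.

```latex
S_j = \sum_{i} w_{ij}\, a_i
\qquad\text{and}\qquad
a_j = f(S_j)
```

For the Linear Activation function used by the Delta Rule, f is the identity, so the activation of the output node equals the weighted sum itself.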

Error function with just 2 weights w1 and w2

For any given set of input data and weights, there will be an associated magnitude of error,
which is measured by an error function (also known as a cost function) (e.g., Oh, 1997; Yam
and Chow, 1997). The Delta Rule employs the error function for what is known as Gradient
Descent learning, which involves the ‘modification of weights along the most direct path in
weight-space to minimize error’, so change applied to a given weight is proportional to the
negative of the derivative of the error with respect to that weight (McClelland and Rumelhart
1988, pp.126–130). The Error/Cost function is commonly given as the sum of the squares
of the differences between all target and actual node activation for the output layer. For a
particular training pattern (i.e., training case), error is thus given by:

where ‘Ep’ is the total error over the training pattern, ½ is a value applied to simplify the
function’s derivative, ‘n’ ranges over all output nodes for a given training pattern, ‘tj’ sub n
represents the target value for node n in output layer j, and ‘aj’ sub n represents the actual
activation for the same node. This particular error measure is attractive because its derivative,
whose value is needed in the employment of the Delta Rule, is easily calculated. Error
over an entire set of training patterns (i.e., over one iteration, or epoch) is calculated by
summing all ‘Ep’:

Error/Cost Function

where ‘E’ is total error, and ‘p’ represents all training patterns. An equivalent term for E in
earlier equation is Sum-of-squares error. A normalized version of this equation is given by
the Mean Squared Error (MSE) equation:

where ‘P’ and ‘N’ are the total number of training patterns and output nodes, respectively. It
is the error of both previous equations that gradient descent attempts to minimize (this is not
strictly true if weights are changed after each input pattern is submitted to the network;
Rumelhart et al., 1986: v1, p.324; Reed and Marks, 1999: pp. 57–62). Error over a given
training pattern is commonly expressed in terms of the Total Sum of Squares (‘tss’) error,
which is simply equal to the sum of all squared errors over all output nodes and all training
patterns. ‘The negative of the derivative of the error function is required in order to perform
Gradient Descent Learning’. The derivative of our equation (which measures error for a
given pattern ‘p’) above, with respect to a particular weight ‘wij’ sub ‘x’, is given by the
chain rule as:
where ‘aj’ sub ‘z’ is activation of the node in the output layer that corresponds to weight
‘wij’ sub x (subscripts refer to particular layers of nodes or weights, and the ‘sub-subscripts’
simply refer to individual weights and nodes within these layers). It follows that:

and

Thus, the derivative of the error over an individual training pattern is given by the product of
the derivatives of our prior equation:

Because Gradient Descent learning requires that any change in a particular weight be
proportional to the negative of the derivative of the error, the change in a given weight must
be proportional to the negative of our prior equation. Replacing the difference between the
target and actual activation of the relevant output node by d, and introducing a learning rate
epsilon, that equation can be re-written in the final form of the Delta Rule:
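The chain of equations described in this section can be reconstructed from the definitions in the text as shown below. This is a sketch: the sub-subscript notation and the 1/(2PN) normalization of the MSE (chosen to be consistent with the ½ factor in the pattern error) are assumptions, and δ stands for the difference that the text calls d.

```latex
\begin{aligned}
E_p &= \tfrac{1}{2} \sum_{n} \left( t_{j_n} - a_{j_n} \right)^2
  && \text{error for one training pattern} \\
E &= \sum_{p} E_p
  && \text{error over all training patterns} \\
\mathrm{MSE} &= \frac{1}{2PN} \sum_{p} \sum_{n} \left( t_{j_n} - a_{j_n} \right)^2
  && \text{normalized error} \\
\frac{\partial E_p}{\partial w_{ij_x}}
  &= \frac{\partial E_p}{\partial a_{j_z}} \cdot \frac{\partial a_{j_z}}{\partial w_{ij_x}}
  && \text{chain rule} \\
\frac{\partial E_p}{\partial a_{j_z}} &= -\left( t_{j_z} - a_{j_z} \right),
  \qquad
  \frac{\partial a_{j_z}}{\partial w_{ij_x}} = a_{i_x}
  && \text{linear activation} \\
\frac{\partial E_p}{\partial w_{ij_x}} &= -\left( t_{j_z} - a_{j_z} \right) a_{i_x} \\
\Delta w_{ij_x} &= -\varepsilon\, \frac{\partial E_p}{\partial w_{ij_x}}
  = \varepsilon\, \delta\, a_{i_x},
  \qquad \delta = t_{j_z} - a_{j_z}
  && \text{Delta Rule}
\end{aligned}
```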

Delta Rule for Perceptrons

The reasoning behind the use of a Linear Activation function here instead of a Threshold
Activation function can now be justified: the Threshold activation function that characterizes
both the McCulloch and Pitts network and the perceptron is not differentiable at the
transition between the activations of 0 and 1 (slope = infinity), and its derivative is 0 over the
remainder of the function. Hence, the Threshold activation function cannot be used in
Gradient Descent learning, whereas a Linear Activation function (or any other function
that is differentiable) allows the derivative of the error to be calculated.
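This argument can be checked numerically with a small sketch that estimates the derivative of the pattern error by finite differences, once for a linear activation and once for a threshold activation. The weights, inputs, target, and step size are illustrative assumptions, and the pattern error is taken to be ½(t − a)² for a single output node.

```python
def pattern_error(weights, inputs, target, activation):
    """Error 0.5 * (t - a)^2 for one pattern and one output node."""
    s = sum(w * a for w, a in zip(weights, inputs))
    return 0.5 * (target - activation(s)) ** 2


def numerical_gradient(weights, inputs, target, activation, index, h=1e-6):
    """Central finite-difference estimate of dE/dw for the weight at `index`."""
    up, down = list(weights), list(weights)
    up[index] += h
    down[index] -= h
    return (pattern_error(up, inputs, target, activation)
            - pattern_error(down, inputs, target, activation)) / (2 * h)


def linear(s):
    return s


def threshold(s):
    return 1.0 if s >= 0 else 0.0


weights, inputs, target = [0.3, -0.4], [1.0, 0.5], 1.0

# Analytic derivative for the linear case: -(t - a) * a_i
a = sum(w * x for w, x in zip(weights, inputs))
analytic = -(target - a) * inputs[0]

grad_linear = numerical_gradient(weights, inputs, target, linear, index=0)
grad_threshold = numerical_gradient(weights, inputs, target, threshold, index=0)
```

With the linear activation the estimate agrees with −(t − a)·a_i, so the weight receives a usable learning signal; with the threshold activation the estimate is zero away from the transition, so gradient descent has nothing to follow.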

Three-dimensional depiction of an Actual error surface (Leverington, 2001)

Two-dimensional depiction of the error surface
