Unit 4
INTRODUCTION:
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed Support Vector Machine. Consider the diagram below, in which
two different categories are separated by a decision boundary or
hyperplane:
Example: SVM can be understood with the example that we used for the
KNN classifier. Suppose we see a strange cat that also has some features of
dogs, and we want a model that can accurately identify whether it is a cat or
a dog. Such a model can be created by using the SVM algorithm. We first
train the model with many images of cats and dogs so that it can learn the
different features of cats and dogs, and then we test it with this strange creature.
The SVM creates a decision boundary between the two classes (cat
and dog) and chooses the extreme cases (support vectors) of each class.
On the basis of the support vectors, it will classify the creature as a cat.
Consider the diagram below:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data. If a
dataset can be classified into two classes by using a single straight
line, the data is termed linearly separable data, and the classifier
used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable
data. If a dataset cannot be classified by using a straight line, the
data is termed non-linear data, and the classifier used is called a
Non-linear SVM classifier.
LINEAR DISCRIMINANT FUNCTIONS FOR BINARY CLASSIFICATION:
Linear Discriminant Analysis (LDA) can be used to project the features of a
higher-dimensional space into a lower-dimensional space in order to reduce
resources and dimensional costs. It is used for classification predictive
modeling problems, and it finds a linear combination of the features that
separates the classes.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
o Maximize the distance between the means of the two classes.
o Minimize the variation within each class.
Using these two conditions, LDA generates a new axis in such a way that
the separation between the data points of the two classes increases
when they are projected onto the new axis.
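The two criteria above can be illustrated with a small sketch. It assumes two classes in a 2-D feature space and uses the classical Fisher direction, w proportional to Sw^-1 (mean_a - mean_b), where Sw is the within-class scatter. The data points and function names are made up for illustration and are not part of the notes.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_direction(class_a, class_b):
    """Direction of the new axis: Sw^-1 (mean_a - mean_b), for 2-D data."""
    ma, mb = mean(class_a), mean(class_b)
    axx, axy, ayy = scatter(class_a, ma)
    bxx, bxy, byy = scatter(class_b, mb)
    # Within-class scatter Sw is the sum of the per-class scatter matrices.
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy  # assumed non-zero for this toy data
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # Apply the inverse of [[sxx, sxy], [sxy, syy]] to (dx, dy).
    return ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)


def project(point, w):
    """Coordinate of a point on the new axis."""
    return point[0] * w[0] + point[1] * w[1]


# Hypothetical toy data: two well-separated classes.
class_a = [(1, 2), (2, 3), (3, 3)]
class_b = [(6, 5), (7, 8), (8, 7)]

w = fisher_direction(class_a, class_b)
proj_a = [project(p, w) for p in class_a]
proj_b = [project(p, w) for p in class_b]
```

On the new axis, every projected point of the first class lies on one side of every projected point of the second class, which is the separation LDA aims for.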
PERCEPTRON ALGORITHM:
In Machine Learning and Artificial Intelligence, the Perceptron is one of the most commonly
used terms. It is a first step in learning Machine Learning and Deep Learning
technologies, and it consists of a set of weights, input values or scores, and a
threshold. The Perceptron is a building block of an Artificial Neural Network. Frank
Rosenblatt introduced the Perceptron in the late 1950s for performing certain
calculations to detect features in the input data. The Perceptron is a linear
Machine Learning algorithm used for supervised learning of binary classifiers. This
algorithm enables neurons to learn by processing the elements of the training set one
at a time.
The Perceptron model is also treated as one of the best and simplest types of
Artificial Neural Networks. It is a supervised learning algorithm
for binary classifiers. Hence, we can consider it a single-layer neural
network with four main parameters, i.e., input values, weights and
bias, net sum, and an activation function.
o Input Values:
This is the primary component of the Perceptron, which accepts the initial data
into the system for further processing. Each input node contains a real
numerical value.
o Activation Function:
This is the final and important component that helps to determine
whether the neuron will fire or not. The activation function can be considered
primarily as a step function.
The data scientist chooses the activation function based on the problem
statement and the desired outputs. The activation function may differ
(e.g., Sign, Step, and Sigmoid) in perceptron models, depending on
whether the learning process is slow or has vanishing or exploding
gradients.
Step-1
First, multiply all input values by their corresponding weight
values and then add them to determine the weighted sum.
Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = w1*x1 + w2*x2 + ... + wn*xn
Add a special term called bias 'b' to this weighted sum to improve the
model's performance:
∑wi*xi + b
Step-2
An activation function f is applied to the above weighted sum, which gives
the output Y:
Y = f(∑wi*xi + b)
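The two steps above can be sketched in a few lines. This is a minimal illustration, assuming a step activation that fires at zero and hand-picked weights that make the perceptron behave like a logical AND gate. The weights, bias and function names are illustrative assumptions.

```python
def step(z):
    """Step activation: the neuron fires (1) when z is non-negative."""
    return 1 if z >= 0 else 0


def perceptron_output(inputs, weights, bias):
    # Step-1: weighted sum of the inputs plus the bias term b.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step-2: apply the activation function f to the weighted sum.
    return step(weighted_sum)


# Hand-picked (not learned) parameters that implement AND on 0/1 inputs.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

outputs = [perceptron_output([a, b], AND_WEIGHTS, AND_BIAS)
           for a in (0, 1) for b in (0, 1)]
```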
The objective of an SVM is to find a hyperplane that not only separates the classes but does
so with the maximum margin. The margin is defined as the distance between the hyperplane
and the nearest data points from either class, which are known as the support vectors.
The separating hyperplane is defined by
$$w \cdot x + b = 0$$
where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias
term.
For a given training set $\{(x_i, y_i)\}$, where $x_i$ are the feature
vectors and $y_i \in \{-1, 1\}$ are the class labels, the SVM aims
to find $w$ and $b$ such that:
$$y_i (w \cdot x_i + b) \geq 1 \quad \text{for all } i$$
The margin is defined as $2 / \|w\|$. To maximize the margin, we minimize $\|w\|$.
This leads to the following optimization problem:
$$\min_{w, b} \ \frac{1}{2} \|w\|^2$$
subject to:
$$y_i (w \cdot x_i + b) \geq 1 \quad \text{for all } i$$
4. Dual Formulation
The above problem can be solved more efficiently using its dual form. Introducing Lagrange
multipliers $\alpha_i \geq 0$, the dual problem becomes:
$$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
subject to:
$$\sum_i \alpha_i y_i = 0, \quad \alpha_i \geq 0 \quad \text{for all } i$$
Once the optimal $\alpha_i$ values are found, the weight vector $w$ can be computed as:
$$w = \sum_i \alpha_i y_i x_i$$
The bias term $b$ can be determined using the support vectors:
$$b = y_k - w \cdot x_k \quad \text{for any support vector } (x_k, y_k)$$
6. Decision Function
The decision function for classifying a new data point $x$ is:
$$f(x) = \text{sign}(w \cdot x + b)$$
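To make the formulas concrete, the sketch below works through a deliberately tiny problem with one point per class. The multipliers alpha = 0.25 were derived by hand for this toy data (for two points the dual objective reduces to 2α − 4α², which peaks at α = 0.25); they are an assumption of the example, not the output of a solver. The code then recovers w and b, the margin, and the decision function.

```python
import math

# Toy training set (hypothetical): one point per class, both are support vectors.
X = [(0.0, 0.0), (2.0, 2.0)]
y = [-1, 1]
alpha = [0.25, 0.25]  # hand-derived optimum for this data; satisfies sum(alpha_i * y_i) = 0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# b = y_k - w . x_k for any support vector (here k = 1)
b = y[1] - dot(w, X[1])

# Margin width 2 / ||w||
margin = 2 / math.sqrt(dot(w, w))


def decision(x):
    """f(x) = sign(w . x + b), returning +1 on the boundary by convention."""
    return 1 if dot(w, x) + b >= 0 else -1


# Every training point must satisfy y_i (w . x_i + b) >= 1.
constraints = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]
```

Here w = (0.5, 0.5) and b = -1, and the margin equals the distance between the two points, as expected when both points are support vectors.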
Key Properties
Maximizing the Margin: By maximizing the margin, SVMs aim to improve the classifier's
ability to generalize to unseen data.
Support Vectors: The decision boundary is determined by the support vectors, which are the
data points closest to the hyperplane.
Robustness: Large margin classifiers like SVMs are less likely to overfit, especially in high-
dimensional spaces.
Practical Considerations
Feature Scaling: It is important to scale features so that all features contribute equally to the
margin calculation.
Linear Separability: The basic SVM assumes that the data is linearly separable. For non-
linearly separable data, kernel methods can be used to map the data into a higher-
dimensional space where a linear separator can be found.
Example
Consider a binary classification problem with two classes of data points in a 2D space. An
SVM will find the line (hyperplane) that separates the two classes with the maximum margin.
The data points that lie closest to the line are the support vectors. The distance between this
line and the support vectors is the margin, which the SVM maximizes.
In summary, a large margin classifier such as an SVM is an effective method for linearly
separable data. It finds the hyperplane that maximizes the margin between the classes,
leading to better generalization and robustness.
Key Concepts:
Mathematical Formulation:
The objective is to find the optimal hyperplane that minimizes the following cost function:
Here:
1. Data Preprocessing: Scale the data to have zero mean and unit variance.
2. Formulate the Optimization Problem: Define the objective function and constraints.
3. Solve the Optimization Problem: Use quadratic programming solvers to find the optimal
$w$ and $b$.
4. Make Predictions: For a new data point $x$, the prediction is based on the sign of
$w \cdot x + b$.
Linear Kernel
K(x, y) = x · y
where x and y are the input feature vectors. The dot product of the input
vectors is a measure of their similarity in the original feature
space.
Polynomial Kernel
The polynomial kernel is a kernel function used in machine learning, for
example in SVMs (Support Vector Machines). It is a nonlinear
kernel function that employs polynomial functions to map the input
data into a higher-dimensional feature space.
K(x, y) = (x · y + c)^d
where x and y are the input feature vectors, c is a constant term, and d is
the degree of the polynomial. The constant term is added to the dot
product of the input vectors, and the result is raised to the degree
of the polynomial.
In general, the polynomial kernel is an effective tool for converting the
input data into a higher-dimensional feature space in order to capture
nonlinear correlations between the input characteristics.
Gaussian (RBF) Kernel
K(x, y) = exp(-gamma * ||x - y||^2)
where x and y are the input feature vectors, gamma is a parameter that
controls the width of the Gaussian function, and ||x - y||^2 is the squared
Euclidean distance between the input vectors.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential
kernel, is a type of kernel function used in machine learning, including in
SVMs (Support Vector Machines). It is a non-parametric kernel that can be
used to measure the similarity or distance between two input feature
vectors.
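The kernels discussed in this section can be written as short functions. The sketch below is illustrative only: the default values of c, d and gamma are arbitrary, and the Laplacian kernel is given in one common form, exp(-gamma * ||x - y||_1) with the L1 (Manhattan) distance, which is an assumption because the notes do not state its formula.

```python
import math


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (x . y + c)^d"""
    return (linear_kernel(x, y) + c) ** d


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2), squared Euclidean distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def laplacian_kernel(x, y, gamma=0.5):
    """Assumed form: K(x, y) = exp(-gamma * ||x - y||_1), L1 distance."""
    l1_distance = sum(abs(a - b) for a, b in zip(x, y))
    return math.exp(-gamma * l1_distance)


u, v = (1.0, 2.0), (3.0, 4.0)
k_lin = linear_kernel(u, v)       # 1*3 + 2*4 = 11
k_poly = polynomial_kernel(u, v)  # (11 + 1)^2 = 144
k_rbf = rbf_kernel(u, v)          # exp(-0.5 * 8)
k_lap = laplacian_kernel(u, v)    # exp(-0.5 * 4)
```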
NONLINEAR CLASSIFIER:
Non-linear classification refers to categorizing instances that
are not linearly separable. Some of the classifiers that use
non-linear functions to separate classes are Quadratic Discriminant
Classifier, Multi-Layer Perceptron (MLP), Decision Trees, Random
Forest, and K-Nearest Neighbours (KNN).
Introduction to Non-Linear Classifiers
If you're new to this field, you might have heard about linear
classifiers and non-linear classifiers. These two distinct types of
algorithms help us make sense of complex data and make accurate
predictions.
Linear classifiers separate the classes with a straight line or hyperplane.
They are relatively simple, easy to interpret, and fast to train. Some
common linear classifiers include logistic regression and linear support
vector machines.
Non-linear classifiers, on the other hand, can find more complex decision
boundaries to separate the classes. They can capture intricate patterns
and relationships within the data that linear classifiers might miss.
SVR can use both linear and non-linear kernels. A linear kernel is a
simple dot product between two input vectors, while a non-linear
kernel is a more complex function that can capture more intricate
patterns in the data. The choice of kernel depends on the data’s
characteristics and the task’s complexity.
In the scikit-learn package for Python, you can use the ‘SVR’ class to
perform SVR with a linear or non-linear kernel. To specify the kernel,
set the ‘kernel’ parameter to ‘linear’ or ‘rbf’ (radial basis
function).
Support Vector Machine can also be used as a regression method, maintaining all
the main features that characterize the algorithm (maximal margin). Support
Vector Regression (SVR) uses the same principles as the SVM for classification,
with only a few minor differences. First of all, because the output is a real number, it
has infinitely many possible values, which makes it difficult to predict exactly.
In the case of regression, a margin of tolerance (epsilon) is therefore set around
the prediction, and errors smaller than epsilon are ignored. Besides this, the
algorithm itself is more complicated than in the classification case. However,
the main idea is always the same: to minimize error by finding the
hyperplane which maximizes the margin, keeping in mind that part of the error is
tolerated.
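The margin of tolerance can be illustrated with the epsilon-insensitive loss commonly associated with SVR: errors inside the tolerance band cost nothing, and larger errors are penalized only by the amount they exceed epsilon. This is a minimal sketch; the function name and the value epsilon = 0.1 are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the tolerance band, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


inside_band = epsilon_insensitive_loss(1.0, 1.05)   # error 0.05 is tolerated
outside_band = epsilon_insensitive_loss(1.0, 1.5)   # error 0.5 exceeds epsilon by 0.4
```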
Linear SVR
Non-linear SVR
The kernel functions transform the data into a higher-dimensional feature space to make it possible to perform a
linear separation.
Kernel functions
LEARNING WITH NEURAL NETWORKS:
NEURON MODELS:
Weighting Factors
Each input will be given a relative weighting, which will affect the impact of that
input (Fig. 5.3). This is something like the varying synaptic strengths of biological
neurons: some inputs are more important than others in the way they combine to
produce an impulse. Weights are adaptive coefficients within the network that
determine the intensity of the input signal. In fact, this adaptability of connection
strength is precisely what gives neural networks their ability to learn and store
information and, consequently, is an essential element of all neuron models.
The Mathematical Model of Neuron
The basic rules are that neurons
are added when training is slow or when the mean squared error is
larger than a specified value, and that neurons are removed when a
change in a neuron's value does not correspond to a change in the
network's response or when the weight values that are
Linear Neuron and Widrow-Hoff Learning Rule:
Linear Neuron:
An artificial neural network, inspired by the human neural system, is a network used to process
data. It consists of three types of layers, i.e., the input layer, the hidden layer, and the output
layer. The most basic neural network contains only two layers, the input and output
layers. The layers are connected by weighted paths, which are used to compute the net input.
In this section, we will discuss two basic types of neural networks: Adaline, which doesn’t
have any hidden layer, and Madaline, which has one hidden layer.
Adaline which stands for Adaptive Linear Neuron, is a network having a single linear unit. It
was developed by Widrow and Hoff in 1960. Some important points about Adaline are as
follows −
The basic structure of Adaline is similar to the perceptron, with an extra feedback loop with the
help of which the actual output is compared with the desired/target output. After this comparison,
the weights and bias are updated on the basis of the training algorithm.
Adaptive Linear Neuron Learning algorithm
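Step 0 − Initialize the weights and the bias to some random values other than zero. Also
initialize the learning rate α.
Step 1 − Repeat Steps 2-6 until the stopping condition in Step 7 is met.
Step 2 − Perform Steps 3-5 for each bipolar training pair s:t.
Step 3 − Set the activation of each input unit:
xi = si (i = 1 to n)
Step 4 − Calculate the net input to the output unit:
yin = b + ∑ xi wi (sum over i = 1 to n)
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 5 − Adjust the weights and bias to reduce the squared error (t − yin)^2, as follows −
wi(new) = wi(old) + α(t − yin)xi
b(new) = b(old) + α(t − yin)
Step 6 − Calculate the error (t − yin)^2 for the training pair.
Step 7 − Test for the stopping condition: if the error generated is less than or equal to the specified
tolerance, then stop.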
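The training loop above can be sketched as follows. This is an illustrative implementation, not a reference one: the initial weights are fixed at 0.1 instead of random so that the run is reproducible, the values of α, the epoch cap and the tolerance are arbitrary, and the bipolar OR function is used as hypothetical training data.

```python
def train_adaline(samples, alpha=0.1, max_epochs=50, tolerance=0.01):
    """Train a single Adaline unit on (inputs, target) pairs with the LMS update."""
    n = len(samples[0][0])
    weights = [0.1] * n  # Step 0: small non-zero start (fixed here for reproducibility)
    bias = 0.1
    for _ in range(max_epochs):  # Step 1: repeat until the stopping condition holds
        total_error = 0.0
        for x, t in samples:  # Step 2: each bipolar training pair s:t
            # Step 4: net input yin = b + sum(xi * wi)
            y_in = bias + sum(xi * wi for xi, wi in zip(x, weights))
            error = t - y_in
            # Step 5: wi(new) = wi(old) + alpha * (t - yin) * xi, same for the bias
            weights = [wi + alpha * error * xi for wi, xi in zip(weights, x)]
            bias += alpha * error
            total_error += error ** 2  # Step 6: accumulate the squared error
        if total_error <= tolerance:  # Step 7: stopping condition
            break
    return weights, bias


def classify(x, weights, bias):
    """Threshold the linear output to obtain a bipolar class label."""
    y_in = bias + sum(xi * wi for xi, wi in zip(x, weights))
    return 1 if y_in >= 0 else -1


# Hypothetical training data: bipolar OR.
OR_SAMPLES = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

weights, bias = train_adaline(OR_SAMPLES)
predictions = [classify(x, weights, bias) for x, _ in OR_SAMPLES]
```

Because the output unit is linear, the squared error for OR cannot reach zero, so in this run the loop ends at the epoch cap. The thresholded predictions still match the targets.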
Madaline, which stands for Multiple Adaptive Linear Neuron, is a network which consists of
many Adalines in parallel. It has a single output unit. Some important points about
Madaline are as follows −
o It is just like a multilayer perceptron, where the Adalines act as hidden units between
the input and the Madaline layer.
o The weights and the bias between the input and Adaline layers, as we saw in the
Adaline architecture, are adjustable.
o The weights and bias between the Adaline and Madaline layers are fixed, with a value of 1.
o Training can be done with the help of the Delta rule.
It consists of “n” units in the input layer, “m” units in the Adaline layer, and “1” unit in the
Madaline layer. Each neuron in the Adaline and Madaline layers has a bias of excitation “1”.
The Adaline layer is present between the input layer and the Madaline layer; the Adaline
layer is considered the hidden layer.
Multiple Adaptive Linear Neuron (Madaline) Training Algorithm
By now we know that only the weights and bias between the input and the Adaline layer are
to be adjusted, and the weights and bias between the Adaline and the Madaline layer are
fixed.
Step 0 − Initialize the weights and the bias (for easy calculation they can be set to zero). Also
initialize the learning rate α (0 < α ≤ 1); for simplicity α is set to 1.
Step 1 − Repeat Steps 2-6 until the stopping condition in Step 7 is met.
Step 2 − Perform Steps 3-5 for each bipolar training pair s:t.
Step 3 − Set the activation of each input unit:
xi = si (i = 1 to n)
Step 4 − Obtain the net input at each hidden unit, i.e. each unit of the Adaline layer, with the following
relation −
Qinj = bj + ∑ xi wij (sum over i = 1 to n; j = 1 to m)
Here ‘b’ is bias and ‘n’ is the total number of input neurons.
Step 5 − Apply the following activation function to obtain the final output at the Adaline and
the Madaline layer −
f(x) = 1 if x ≥ 0
f(x) = −1 if x < 0
Output at the hidden (Adaline) unit:
Qj = f(Qinj)
Final output of the network:
y = f(yin)
i.e. yin = b0 + ∑ Qj vj (sum over j = 1 to m)
where vj and b0 are the fixed weights and bias of the Madaline (output) unit.
Step 6 − Calculate the error and adjust the weights as follows −
If t ≠ y and t = +1, update the weights on the Adaline unit Zj whose net input is closest to 0 (zero):
wij(new) = wij(old) + α(1 − Qinj)xi
bj(new) = bj(old) + α(1 − Qinj)
Else if t ≠ y and t = −1, update the weights on every Adaline unit Zk whose net input is positive:
wik(new) = wik(old) + α(−1 − Qink)xi
bk(new) = bk(old) + α(−1 − Qink)
Else if y = t, no weight update is required.
Step 7 − Test for the stopping condition, which is met when there is no change in
weight or the highest weight change that occurred during training is smaller than the specified
tolerance.
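The forward pass of Steps 4 and 5 can be sketched as follows. The hidden-layer weights below are hand-picked so that two Adaline units solve the bipolar XOR problem; they are an illustrative assumption, not the result of running the training algorithm. The output unit uses fixed weights and bias of 1, which makes it act as a logical OR of the hidden units.

```python
def activation(x):
    """Bipolar step function: f(x) = 1 if x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def madaline_forward(x, hidden_weights, hidden_biases, v, b0):
    """Compute the Madaline output for one input pattern x."""
    hidden_outputs = []
    for w_j, b_j in zip(hidden_weights, hidden_biases):
        # Step 4: Qinj = bj + sum(xi * wij)
        q_in = b_j + sum(xi * wij for xi, wij in zip(x, w_j))
        # Step 5: Qj = f(Qinj)
        hidden_outputs.append(activation(q_in))
    # Step 5: yin = b0 + sum(Qj * vj), then y = f(yin)
    y_in = b0 + sum(qj * vj for qj, vj in zip(hidden_outputs, v))
    return activation(y_in)


# Hand-picked hidden units: "x1 and not x2" and "x2 and not x1".
HIDDEN_WEIGHTS = [(1, -1), (-1, 1)]
HIDDEN_BIASES = [-1, -1]
V = [1, 1]  # fixed output weights
B0 = 1      # fixed output bias

patterns = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
xor_outputs = [madaline_forward(p, HIDDEN_WEIGHTS, HIDDEN_BIASES, V, B0)
               for p in patterns]
```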
In 1960, Bernard Widrow and Ted Hoff published an improved learning rule for artificial
neurons. The main difference from the perceptron is the way the weights and the bias unit are
updated. A schematic view of the process is shown in the following figure.
Unlike the perceptron, where we defined the error based on the predicted output, here we use
the result of the activation function to define a loss or cost function. The goal of this loss
function is to find its minimum, which corresponds to the optimal solution to our task.
One way to find this minimum is to use a technique called gradient descent. Depending on
the number of training records used for each learning step, we differentiate between full
batch, stochastic and minibatch gradient descent.
The size of each step is controlled by the learning rate. If it is too small, the descent takes
a lot of steps; if it is too high, we might overshoot the minimum.
Therefore, finding an appropriate value is an important step in fine-tuning the training.
By using the gradient descent, we have implicitly changed the learning step from online
learning, as with the perceptron, to full batch learning. This means, that the loss function is
calculated from the results of the activation function from the entire training data set.
Stochastic gradient descent (SGD) usually reaches the minimum faster than full batch gradient
descent because the weights are updated more often. However, the error surface is noisier
because the loss function is different for each training record. This can actually be an
advantage, because it also means that for nonlinear activation functions, it is easier to escape
local minima and find the global minimum. To avoid patterns arising from the order of the
training records, the training set is shuffled at the beginning of each epoch, which is where
the word stochastic in SGD comes from.
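The difference between full batch and stochastic gradient descent can be sketched for a single linear neuron with one input. The example is illustrative: the data lie exactly on the line t = 2x + 1, and the learning rate, epoch count and random seed are arbitrary choices.

```python
import random

# Hypothetical noise-free data generated from t = 2x + 1.
DATA = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def batch_gradient_descent(data, lr=0.05, epochs=2000):
    """One weight update per epoch, using the gradient over the whole data set."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradient of the mean of 0.5 * (t - (w*x + b))^2 over all records.
        grad_w = sum(-(t - (w * x + b)) * x for x, t in data) / n
        grad_b = sum(-(t - (w * x + b)) for x, t in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def stochastic_gradient_descent(data, lr=0.05, epochs=2000, seed=0):
    """One weight update per training record, in a freshly shuffled order each epoch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    records = list(data)
    for _ in range(epochs):
        rng.shuffle(records)  # avoid patterns arising from the record order
        for x, t in records:
            error = t - (w * x + b)
            w += lr * error * x
            b += lr * error
    return w, b


w_batch, b_batch = batch_gradient_descent(DATA)
w_sgd, b_sgd = stochastic_gradient_descent(DATA)
```

Both variants recover a weight close to 2 and a bias close to 1 on this data. The stochastic variant performs four updates per epoch where the batch variant performs one.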
Linear regression
With a linear activation function, the whole learning rule is nothing else than a simple linear regression.
Logistic regression
NB: We can use both versions for either classification or regression. In the case of
classification, we apply the threshold function after the learning to do the classification; in the
case of regression, we simply use the outcome of the activation function as the regression
result.
Delta learning rule:
The delta rule in an artificial neural network is a special case of backpropagation for
networks with a single layer of weights. It refines the network by adjusting the
associations between inputs and outputs. The Delta rule is also called the
Delta learning rule.
Generally, backpropagation has to do with recalculating input weights for artificial neurons
using a gradient technique. Delta learning does this by using the difference between a
target activation and an obtained activation. With a linear activation function, the network
connections are adjusted in proportion to this difference. Another way to explain the Delta
rule is that it uses an error function to perform gradient descent learning.
The Delta rule compares the actual output with the target output and changes the weights
to bring the two closer together. The actual implementation of the Delta rule
varies with the network and its composition. Still, by applying a linear activation
function, the delta rule can be useful in refining several sorts of neural networks.
The Delta rule was introduced by Widrow and Hoff and is one of the most significant learning
rules based on supervised learning.
This rule states that the change in the weight of a node is proportional to the product of the
error and the input.
Mathematical equation:
The delta learning rule is given by the following equation:
∆w = µ.x.z
∆w = µ(t-y)x
Here,
∆w = weight change.
µ = the constant and positive learning rate.
x = the input value.
z = (t-y) is the difference between the desired output t and the actual output y. The above
mathematical rule can be used only for a single output unit.
The new weights are determined with respect to the following two cases.
Case 1: when t ≠ y, then
w(new) = w(old) + ∆w
Case 2: when t = y, then
no change in weight
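A single application of the rule can be worked through in code. The sketch assumes a linear output unit, so the actual output y is the weighted sum of the inputs; the weights, input, target and learning rate are made-up values.

```python
def delta_rule_update(weights, x, t, mu=0.1):
    """Apply one delta-rule step: w(new) = w(old) + mu * (t - y) * x."""
    y = sum(w * xi for w, xi in zip(weights, x))  # linear output of the unit
    z = t - y                                     # error between target and output
    if z == 0:
        return list(weights)                      # t = y: no change in weight
    return [w + mu * z * xi for w, xi in zip(weights, x)]


old_weights = [0.2, -0.1]
inputs = [1.0, 2.0]
target = 1.0

# y = 0.2*1 + (-0.1)*2 = 0, so z = 1 and the change is 0.1 * 1 * x = [0.1, 0.2].
new_weights = delta_rule_update(old_weights, inputs, target)
```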
The Delta Rule uses the difference between target activation (i.e., target output values) and
obtained activation to drive learning. For reasons discussed below, the use of a threshold
activation function (as used in both the McCulloch-Pitts network and the perceptron) is
dropped, and instead a linear sum of products is used to calculate the activation of the output
neuron (alternative activation functions can also be applied). Thus, the activation function is
called a Linear Activation function, in which the output node’s activation is simply equal to
the sum of the network’s respective input/weight products. The strengths of the network
connections (i.e., the values of the weights) are adjusted to reduce the difference between
target and actual output activation (i.e., error). A graphical depiction of a simple two-layer
network capable of deploying the Delta Rule is given in the figure below (Such a network is
not limited to having only one output node):
During forward propagation through a network, the output (activation) of a given node is a
function of its inputs. The inputs to a node, which are simply the products of the output of
preceding nodes with their associated weights, are summed and then passed through an
activation function before being sent out from the node. Thus, we have the following:
$$S_j = \sum_i w_{ij} a_i$$
and
$$a_j = f(S_j)$$
where ‘Sj’ is the sum of all relevant products of weights and outputs from the previous layer
i, ‘wij’ represents the relevant weights connecting layer i with layer j, ‘ai’ represents the
activation of nodes in the previous layer i, ‘aj’ is the activation of the node at hand, and ‘f’ is
the activation function.
For any given set of input data and weights, there will be an associated magnitude of error,
which is measured by an error function (also known as a cost function) (e.g., Oh, 1997; Yam
and Chow, 1997). The Delta Rule employs the error function for what is known as Gradient
Descent learning, which involves the ‘modification of weights along the most direct path in
weight-space to minimize error’, so change applied to a given weight is proportional to the
negative of the derivative of the error with respect to that weight (McClelland and Rumelhart
1988, pp.126–130). The Error/Cost function is commonly given as the sum of the squares
of the differences between all target and actual node activation for the output layer. For a
particular training pattern (i.e., training case), error is thus given by:
$$E_p = \frac{1}{2} \sum_n (t_{j_n} - a_{j_n})^2$$
where ‘Ep’ is total error over the training pattern, ½ is a value applied to simplify the
function’s derivative, ‘n’ runs over all output nodes for a given training pattern, ‘tj’ sub n
represents the Target value for node n in output layer j, and ‘aj’ sub n represents the actual
activation for the same node. This particular error measure is attractive because its derivative,
whose value is needed in the employment of the Delta Rule, is easily calculated. Error
over an entire set of training patterns (i.e., over one iteration, or epoch) is calculated by
summing all ‘Ep’:
$$E = \sum_p E_p = \frac{1}{2} \sum_p \sum_n (t_{j_n} - a_{j_n})^2$$
where ‘E’ is total error, and ‘p’ runs over all training patterns. An equivalent term for E in
this equation is Sum-of-squares error. A normalized version of this equation is given by
the Mean Squared Error (MSE) equation:
$$MSE = \frac{1}{2PN} \sum_p \sum_n (t_{j_n} - a_{j_n})^2$$
where ‘P’ and ‘N’ are the total number of training patterns and output nodes, respectively. It
is the error of both previous equations that gradient descent attempts to minimize (not
strictly true if weights are changed after each input pattern is submitted to the network
(Rumelhart et al., 1986: v1, p.324; Reed and Marks, 1999: pp. 57–62)). Error over a given
training pattern is commonly expressed in terms of the Total Sum of Squares (‘tss’) error,
which is simply equal to the sum of all squared errors over all output nodes and all training
patterns. ‘The negative of the derivative of the error function is required in order to perform
Gradient Descent Learning’. The derivative of the equation for ‘Ep’ above (which measures
error for a given pattern ‘p’), with respect to a particular weight ‘wij’ sub ‘x’, is given by the
chain rule as:
$$\frac{\partial E_p}{\partial w_{ij_x}} = \frac{\partial E_p}{\partial a_{j_z}} \frac{\partial a_{j_z}}{\partial w_{ij_x}}$$
where ‘aj’ sub ‘z’ is activation of the node in the output layer that corresponds to weight
‘wij’ sub x (subscripts refer to particular layers of nodes or weights, and the ‘sub-subscripts’
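simply refer to individual weights and nodes within these layers). It follows that:
$$\frac{\partial E_p}{\partial a_{j_z}} = -(t_{j_z} - a_{j_z})$$
and, for the Linear Activation function,
$$\frac{\partial a_{j_z}}{\partial w_{ij_x}} = a_{i_x}$$
Thus, the derivative of the error over an individual training pattern is given by the product of
these two derivatives:
$$\frac{\partial E_p}{\partial w_{ij_x}} = -(t_{j_z} - a_{j_z}) \, a_{i_x}$$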
Because Gradient Descent learning requires that any change in a particular weight be
proportional to the negative of the derivative of the error, the change in a given weight must
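be proportional to the negative of the prior equation. Replacing the difference between the
target and actual activation of the relevant output node by δ, and introducing a learning rate
epsilon, that equation can be re-written in the final form of the Delta Rule:
$$\Delta w_{ij_x} = -\varepsilon \frac{\partial E_p}{\partial w_{ij_x}} = \varepsilon \, \delta \, a_{i_x}, \quad \delta = t_{j_z} - a_{j_z}$$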
The use of a Linear Activation function here instead of a Threshold
Activation function can now be justified: the Threshold activation function that characterizes
both the McCulloch-Pitts network and the perceptron is not differentiable at the
transition between the activations of 0 and 1 (slope = infinity), and its derivative is 0 over the
remainder of the function. Hence, the Threshold activation function cannot be used in
Gradient Descent learning, whereas a Linear Activation function (or any other function
that is differentiable) allows the derivative of the error to be calculated.
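The argument above can be checked numerically. The sketch below assumes a single output node and arbitrary example weights. It compares the analytic derivative of the pattern error for a linear activation, -(t - a) * a_i, with a finite-difference estimate, and then shows that the same estimate for a threshold activation is zero away from the transition, so gradient descent receives no signal.

```python
def linear_activation(weights, inputs):
    return sum(w * a for w, a in zip(weights, inputs))


def threshold_activation(weights, inputs):
    return 1.0 if linear_activation(weights, inputs) >= 0 else 0.0


def pattern_error(activation_fn, weights, inputs, target):
    """E_p = 0.5 * (t - a)^2 for a single output node."""
    return 0.5 * (target - activation_fn(weights, inputs)) ** 2


def analytic_gradient(weights, inputs, target):
    """dE_p/dw_i = -(t - a) * a_i for the linear activation."""
    a = linear_activation(weights, inputs)
    return [-(target - a) * a_i for a_i in inputs]


def numeric_gradient(activation_fn, weights, inputs, target, h=1e-6):
    """Central finite-difference estimate of dE_p/dw_i."""
    grads = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += h
        down[i] -= h
        diff = (pattern_error(activation_fn, up, inputs, target)
                - pattern_error(activation_fn, down, inputs, target))
        grads.append(diff / (2 * h))
    return grads


weights = [0.5, -0.3]   # arbitrary example values
inputs = [1.0, 2.0]
target = 1.0

grad_analytic = analytic_gradient(weights, inputs, target)
grad_linear = numeric_gradient(linear_activation, weights, inputs, target)
grad_threshold = numeric_gradient(threshold_activation, weights, inputs, target)
```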