ANN Unit III
UNIT III – Support Vector Machines and Radial Basis Function: Learning from
Examples, Statistical Learning Theory, Support Vector Machines, SVM application to
Image Classification, Radial Basis Function Regularization theory, Generalized RBF
Networks, Learning in RBFNs, RBF application to face recognition.
PART A
2. What is SVM?
A Support Vector Machine (SVM) is usually presented as a classification approach, but it
can be employed for both classification and regression problems. It can easily handle
multiple continuous and categorical variables. SVM constructs a hyperplane in a
multidimensional space to separate the different classes, and it generates the optimal
hyperplane iteratively so as to minimize the classification error.
The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides
the dataset into classes.
RBF networks are used for function approximation, pattern recognition, and time-series
prediction problems.
SUPPORT VECTOR MACHINES
A support vector machine (SVM) is a supervised machine learning algorithm that classifies
data by finding the optimal line or hyperplane that maximizes the distance between the
classes in an N-dimensional space.
Objective:
Find the hyperplane that best separates the two classes.
The question:
An infinite number of hyperplanes can separate the two classes perfectly, so which one
should we choose?
LINEAR SVM
Linear SVM can be used only when the data is perfectly linearly separable, i.e. the data
points can be classified into two classes using a single straight line (in 2D).
NON-LINEAR SVM
When the data points cannot be separated into two classes using a straight line (in 2D),
we use more advanced techniques such as the kernel trick to classify them, as in the sketch
below.
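As a minimal sketch of this contrast (assuming scikit-learn is available; make_moons is only an illustrative stand-in dataset, not part of these notes), a linear-kernel and an RBF-kernel SVM can be fitted on the same non-linearly separable data:

```python
# Minimal sketch: linear vs. RBF-kernel SVM on a non-linearly separable toy set.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# The RBF kernel usually scores higher here, because the two classes cannot be
# separated by a single straight line in the original 2-D space.
print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))
```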
SUPPORT VECTORS
Support Vectors are the points that are closest to the hyperplane.
A separating line will be defined with the help of these data points.
MARGINS
Margin is the distance between the hyperplane and the observations closest to the
hyperplane (support vectors).
In SVM a large margin is considered a good margin. There are two types of margins: hard
margin and soft margin.
From the figure above it is clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate the data points,
i.e. classify the red and blue circles. So how do we choose the best line, or in general the
best hyperplane, to segregate our data points?
One reasonable choice as the best hyperplane is the one that represents the largest separation
or margin between the two classes.
Multiple hyperplanes separate the data from two classes
So we choose the hyperplane whose distance to the nearest data point on each side is
maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane
(hard margin). So from the figure above, we choose L2.
Here we have one blue ball inside the boundary of the red balls. So how does SVM classify
the data? It's simple! The blue ball lying among the red ones is an outlier of the blue class.
The SVM algorithm has the characteristic of ignoring such outliers and still finding the
hyperplane that maximizes the margin; SVM is robust to outliers.
So for this type of data, SVM finds the maximum-margin hyperplane as it did for the
previous data sets, but in addition it adds a penalty each time a point crosses the margin.
The margins in such cases are called soft margins.
With a soft margin, the SVM tries to minimize (1/margin) + λ(∑ penalty).
Hinge loss is the commonly used penalty: if there are no violations there is no hinge loss,
and if there are violations the hinge loss is proportional to the distance of the violation, as
written out below.
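In symbols, this soft-margin objective is usually written in the following standard form (a textbook formulation, not derived in these notes), where the slack variable plays the role of the hinge-loss penalty and C controls the trade-off:

\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_{i}
\quad\text{subject to}\quad y_{i}\,(w\cdot x_{i}+b)\ \ge\ 1-\xi_{i},\qquad \xi_{i}\ \ge\ 0,

with the hinge loss \xi_{i}=\max\bigl(0,\;1-y_{i}(w\cdot x_{i}+b)\bigr).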
Say our data is as shown in the figure above. SVM solves this by creating a new variable
using a kernel: for a point xi on the line, we create a new variable yi as a function of its
distance from the origin o. Plotting this gives something like the figure below.
In this case, the new variable y is created as a function of distance from the origin. A non-
linear function that creates a new variable is referred to as a kernel.
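A minimal numeric sketch of this idea (the data values and threshold are made up purely for illustration): one class sits near the origin and the other far from it, so the single new variable y = |x - o| makes the classes separable by a threshold.

```python
import numpy as np

# Toy 1-D data: class 0 lies near the origin, class 1 lies far from it on both
# sides, so no single threshold on x alone separates the two classes.
x = np.array([-4.0, -3.5, -0.5, 0.0, 0.6, 3.8, 4.2])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

# New variable: distance from the origin (a radial, non-linear feature).
y = np.abs(x)

# In the new variable the classes are separable by a simple threshold.
predicted = (y > 2.0).astype(int)
print(predicted)   # matches `labels`
```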
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data points
of different classes in a feature space. In the case of linear classifications, it will be a linear
equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the data points closest to the hyperplane, which
play a critical role in deciding the hyperplane and the margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The wider
margin indicates better classification performance.
4. Kernel: The kernel is a mathematical function used in SVM to map the original input
data points into a high-dimensional feature space, so that the hyperplane can be found
even if the data points are not linearly separable in the original input space.
Some common kernel functions are linear, polynomial, radial basis function (RBF),
and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM permits a
soft-margin technique. The soft-margin formulation introduces a slack variable for each
data point, which relaxes the strict margin requirement and permits some
misclassifications or violations. It strikes a compromise between increasing the margin
and reducing the violations.
7. C: The regularisation parameter C balances margin maximisation against
misclassification penalties. It decides the penalty for crossing the margin or misclassifying
data items: a larger value of C imposes a stricter penalty, which results in a smaller margin
and possibly fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently formed
by combining it with the regularisation term.
9. Dual Problem: SVM can be solved through the dual of the optimisation problem, which
requires finding the Lagrange multipliers associated with the support vectors. The dual
formulation enables the use of kernel tricks and more efficient computation, as written
out below this list.
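For reference, the standard dual form alluded to above (a textbook formulation, not specific to these notes) is:

\max_{\alpha}\ \sum_{i}\alpha_{i}-\frac{1}{2}\sum_{i}\sum_{j}\alpha_{i}\alpha_{j}\,y_{i}y_{j}\,K(x_{i},x_{j})
\quad\text{subject to}\quad 0\le\alpha_{i}\le C,\qquad \sum_{i}\alpha_{i}y_{i}=0.

The training points with \alpha_{i}>0 are the support vectors, and the data enter only through the kernel K(x_{i},x_{j}), which is what makes the kernel trick possible.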
Advantages of SVM
Effective in high-dimensional cases.
Memory efficient, since it uses only a subset of the training points, called the support
vectors, in the decision function.
Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
SVM APPLICATIONS TO IMAGE CLASSIFICATION
The SVM algorithm works by finding the hyperplane that separates the different classes in the
feature space. The key idea behind SVMs is to find the hyperplane that maximizes the margin,
which is the distance between the closest points of the different classes. The points that are
closest to the hyperplane are called support vectors.
In machine learning, the model is trained on input data and the expected output data. To
create a model, it is necessary to go through the following phases:
1- Data Collection and Preprocessing:
Capture a substantial dataset of images containing, for example, ripe and unripe oranges.
Consider acquiring images from different angles and distances to enhance model
generalizability. Typical preprocessing steps include:
Resizing
Normalization
Color Space Conversion
Data Augmentation (Optional)
2- Feature Extraction:
To classify an image using an SVM, we first need to extract features from the image. These
features can be the color values of the pixels, edge information, or the textures present in
the image. Once the features are extracted, we can use them as input for the SVM
algorithm (a hedged color-histogram sketch follows the list below). An image can contain:
Color Features
Texture Features
Shape Features
Combining Features
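As a hedged example of one possible color feature (assuming images are already loaded as H x W x 3 uint8 NumPy arrays; the bin count of 16 is an arbitrary illustrative choice), a per-channel color histogram can be flattened into a fixed-length vector for the SVM:

```python
import numpy as np

def color_histogram_features(image, bins=16):
    """Concatenated per-channel color histogram for an H x W x 3 uint8 image."""
    features = []
    for channel in range(3):
        hist, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        features.append(hist / hist.sum())   # normalize so image size does not matter
    return np.concatenate(features)          # shape: (3 * bins,)

# Example with a random stand-in "image"; real code would load the orange photos.
dummy_image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(color_histogram_features(dummy_image).shape)   # (48,)
```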
3- SVM Classification
Kernel Selection:
Start with a linear kernel if the feature distribution appears linearly separable.
If data exhibits nonlinear patterns, consider kernels like RBF (Radial Basis Function),
polynomial, or sigmoid.
Experiment with different kernel types and parameters (e.g., gamma for RBF) through
grid search or random search to find the combination that maximizes performance; the
standard kernel forms are listed below.
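The usual definitions of these kernels (standard formulas, with gamma, r and d as user-chosen parameters) are:

K_{\text{linear}}(x,x') = x^{\top}x'
K_{\text{poly}}(x,x') = (\gamma\,x^{\top}x' + r)^{d}
K_{\text{RBF}}(x,x') = \exp\!\left(-\gamma\,\lVert x-x'\rVert^{2}\right)
K_{\text{sigmoid}}(x,x') = \tanh(\gamma\,x^{\top}x' + r)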
Hyperparameter Tuning
Fine-tune SVM hyperparameters like the regularization parameter (C) and kernel-
specific parameters using techniques like grid search or random search.
Aim to balance training accuracy with generalization ability to avoid overfitting.
Model Training
Split your preprocessed data into training, validation, and testing sets. Train the SVM
model on the training set and evaluate its performance on the validation set to prevent
overfitting. Refine hyperparameters based on validation results.
Model Evaluation
Evaluate the final model's performance on the unseen testing set using metrics like
accuracy, precision, recall, F1-score, and confusion matrix.
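A compact sketch tying the tuning, training and evaluation steps together (scikit-learn assumed; make_classification is only a stand-in for real feature vectors extracted from the orange images):

```python
from sklearn.datasets import make_classification   # stand-in for real image features
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# X: (n_images, n_features) feature matrix, y: ripe/unripe labels.
X, y = make_classification(n_samples=300, n_features=48, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Grid search over C and gamma with 5-fold cross-validation on the training data.
param_grid = {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# Final evaluation on the held-out test set.
y_pred = search.best_estimator_.predict(X_test)
print(search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```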
RBF NETWORKS
The idea of Radial Basis Function (RBF) Networks derives from the theory of function
approximation. We have already seen how Multi-Layer Perceptron (MLP) networks
with a hidden layer of sigmoidal units can learn to approximate functions. RBF
Networks take a slightly different approach. Their main features are:
1. They are two-layer feed-forward networks.
2. The hidden nodes implement a set of radial basis functions (e.g. Gaussian functions).
3. The output nodes implement linear summation functions as in an MLP.
4. The network training is divided into two stages: first the weights from the input
to the hidden layer are determined, and then the weights from the hidden to the
output layer.
5. The training/learning is very fast.
6. The networks are very good at interpolation.
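A minimal sketch of this two-layer structure (Gaussian hidden units followed by a linear output layer); the centers, width and weights below are made-up numbers chosen only to show the forward pass:

```python
import numpy as np

def rbf_forward(X, centers, sigma, weights, bias):
    """Forward pass of a simple RBF network: Gaussian hidden layer + linear output."""
    # Squared distances from every input to every center -> hidden activations.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    hidden = np.exp(-d2 / (2.0 * sigma ** 2))
    # Output nodes are plain linear summations of the hidden activations.
    return hidden @ weights + bias

# Illustrative numbers only: 2-D inputs, 3 Gaussian hidden units, 1 output.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, -0.2, 0.8])
print(rbf_forward(X, centers, sigma=0.7, weights=weights, bias=0.1))
```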
Refer to the class notes also.
Like other neural networks, RBFNs have an input layer, a hidden layer with radial basis
neurons, and an output layer for producing the network's output.
RBFNs are susceptible to overfitting when the number of basis functions is not
appropriately selected, leading to reduced generalization capabilities.
They can be computationally expensive when dealing with high-dimensional data,
which may limit their scalability in certain applications.
Designing the architecture and selecting appropriate parameters for RBFNs can be a
challenging task, requiring domain expertise and rigorous experimentation.
LEARNING IN RBFNS
⁃ In a single perceptron we only have linear separability, because it is composed of just
input and output layers (an MLP adds some hidden layers).
⁃ For example, the AND and OR functions are linearly separable, while the XOR function
is not linearly separable.
⁃ An RBNN is composed of an input, a hidden, and an output layer. An RBNN is strictly
limited to exactly one hidden layer. We call this hidden layer the feature vector.
⁃ The RBNN increases the dimension of the feature vector.
⁃ We apply a non-linear transfer function to the feature vector before we go on to the
classification problem.
⁃ When we increase the dimension of the feature vector, the linear separability of the
feature vector increases.
A non-linearly separable problem (pattern classification problem) is more likely to be
linearly separable in a high-dimensional space than in a low-dimensional space.
[Cover's Theorem]
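A classic illustration of this theorem (a standard textbook example, sketched here with NumPy) is the XOR problem: the four points are not linearly separable in the original 2-D space, but after mapping them through two Gaussian basis functions centered at (0,0) and (1,1) they become separable by a single straight line in the new feature space.

```python
import numpy as np

# XOR inputs and targets: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

# Two Gaussian receptors (centers) at (0,0) and (1,1).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))

# In the (phi1, phi2) plane the class-1 points sit apart from the class-0 points,
# so a single line such as phi1 + phi2 = 1 now separates the two classes.
print(np.round(phi, 3))
print((phi.sum(axis=1) < 1).astype(int))   # reproduces the XOR targets
```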
⁃ What is a Radial Basis Function?
⁃ We define a receptor t.
⁃ We draw contour maps around the receptor.
⁃ Gaussian functions are generally used as the radial basis function (for this contour
mapping), so we define the radial distance r = ||x - t||.
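With the radial distance r = ||x - t|| defined above, the Gaussian basis function takes the standard form (sigma controls the width of the receptive field around the receptor t):

\varphi(r) = \exp\!\left(-\frac{r^{2}}{2\sigma^{2}}\right) = \exp\!\left(-\frac{\lVert x-t\rVert^{2}}{2\sigma^{2}}\right)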
Classification happens only in the second phase, where a linear combination of the hidden
functions is driven to the output layer.
Advantages of using an RBNN over an MLP :-
1. Training in an RBNN is faster than in a Multi-layer Perceptron (MLP), which takes many
iterations.
2. We can easily interpret the meaning / function of each node in the hidden layer of the
RBNN. This is difficult in an MLP.
3. Deciding the number of nodes in the hidden layer and the number of hidden layers:
this parameterization is difficult in an MLP, but the problem does not arise in an RBNN.
4. However, classification takes more time in an RBNN than in an MLP.
Interpolation problem:
It requires every input vector to be mapped exactly onto the corresponding target vector.
The interpolation problem is to determine the real coefficients and the polynomial term.
The function is called a radial basis function if the interpolation problem has a unique
solution, as formalized below.
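Concretely, for N input-target pairs (x_i, d_i), exact interpolation means finding weights w_j such that (standard formulation):

F(x_{i})=\sum_{j=1}^{N} w_{j}\,\varphi\bigl(\lVert x_{i}-x_{j}\rVert\bigr)=d_{i},\qquad i=1,\dots,N,

i.e. solving the linear system \Phi w = d with interpolation matrix \Phi_{ij}=\varphi(\lVert x_{i}-x_{j}\rVert); the problem has a unique solution whenever \Phi is non-singular.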
The learning of a neural network, viewed as a hypersurface reconstruction problem, is
an ill-posed inverse problem for the following reasons-
i) Lack of sufficient information in the training data to reconstruct the input-output
mapping uniquely.
ii) Presence of noise in the input data adds uncertainty.
Regularization theory-
The regularization technique is a way of controlling the smoothness properties of a
mapping function. It involves adding to the error function an extra term that is designed
to penalize mappings which are not smooth. Instead of restricting the number of hidden
units, an alternative approach for preventing overfitting in RBF networks comes from
regularization theory.
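In the standard (Tikhonov) formulation, this extra term appears as follows, where lambda is the regularization parameter and D a differential (smoothness) operator:

E(F)=\frac{1}{2}\sum_{i=1}^{N}\bigl(d_{i}-F(x_{i})\bigr)^{2}+\frac{\lambda}{2}\,\lVert DF\rVert^{2}

The first term measures the fit to the data, the second penalizes mappings that are not smooth, and lambda controls the trade-off between them.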
Regularization network-
The RBF network can be seen as a special case of a regularization network. RBF networks
have a sound theoretical foundation in regularization theory and fit naturally into the
framework of regularized interpolation/approximation tasks. For these problems,
regularization means smoothing of the interpolation/approximation curve or surface.
This approach to RBF networks is also known as the regularization network.
RBF networks have good generalization ability and a simple network structure that avoids
unnecessary and lengthy calculation. The modified or generalized RBF network has the
following characteristics-
i) The Gaussian function is modified.
ii) Hidden-neuron activations are normalized.
iii) The output weights are functions of the input variables.
iv) A sequential learning algorithm is used.
Kernel regression and RBF networks relationship:
The theory of kernel regression provides another viewpoint on the use of RBF networks
for function approximation. It provides a framework for estimating a regression function
from noisy data using kernel density estimation techniques. The objective of function
approximation is to find a mapping from the input space to the output space. The mapping
is provided by forming the regression, or conditional average, of the target data conditioned
on the input variables. The resulting regression function is known as the Nadaraya-Watson
estimator, written out below.
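For reference, the Nadaraya-Watson estimator has the standard form (K is a kernel such as the Gaussian and h its bandwidth):

F(x)=\frac{\sum_{i=1}^{N} K\!\left(\frac{x-x_{i}}{h}\right) d_{i}}{\sum_{i=1}^{N} K\!\left(\frac{x-x_{i}}{h}\right)}

i.e. a weighted (conditional) average of the targets d_i, with weights given by normalized kernel values; this is exactly the normalized-RBF form of the network.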
Learning strategies:
Common learning strategies are the orthogonal least squares (OLS) method and the hybrid
learning method. In OLS, the hidden neurons (RBF centers) are selected one by one in a
supervised manner. The computationally more efficient hybrid learning method combines
both self-organized and supervised learning strategies, as sketched below.
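A minimal sketch of the hybrid strategy (a common recipe rather than the exact algorithm from these notes; the toy sine-regression data, the number of centers and the width heuristic are all illustrative choices): centers are found by unsupervised k-means, widths are set heuristically, and the output weights are solved by supervised least squares.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy regression data: learn y = sin(x) from noisy samples.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Stage 1 (self-organized): choose RBF centers by k-means, width from center spacing.
n_centers = 10
centers = KMeans(n_clusters=n_centers, n_init=10, random_state=0).fit(X).cluster_centers_
sigma = (X.max() - X.min()) / n_centers

# Stage 2 (supervised): hidden activations, then output weights by least squares.
d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
H = np.exp(-d2 / (2 * sigma ** 2))
w, *_ = np.linalg.lstsq(H, y, rcond=None)

print("training RMSE:", np.sqrt(np.mean((H @ w - y) ** 2)))
```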