SML Unit 4
SVM
• Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges.
• Support vector machines are mainly classified into three types based on their working principles:
  - Maximum margin classifiers
  - Support vector classifiers
  - Support vector machines
Maximum margin classifier
• People usually equate support vector machines with maximum margin classifiers; however, there is much more to SVMs than the maximum margin classifier.
• Given a separating hyperplane, observations fall into one of the two regions on either side of it, also called the regions of the classes.
• The mathematical representation of the maximum margin classifier is the following optimization problem:
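A standard way to write this optimization problem (the notation β for the coefficients and M for the margin is an assumption, following the usual textbook formulation):

\[
\begin{aligned}
&\max_{\beta_0,\beta_1,\dots,\beta_p,\,M} \; M \\
&\text{subject to} \quad \sum_{j=1}^{p}\beta_j^{2}=1 \qquad \text{(Constraint 1)} \\
&\phantom{\text{subject to}} \quad y_i\bigl(\beta_0+\beta_1 x_{i1}+\dots+\beta_p x_{ip}\bigr)\ge M \quad \text{for all } i=1,\dots,n \qquad \text{(Constraint 2)}
\end{aligned}
\]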
• Constraint 2 ensures that each observation is on the correct side of the hyperplane: the product of the coefficients with the x variables, multiplied by the class indicator, must be at least the margin M.
• In non-separable cases, the maximum margin classifier has no separating hyperplane, a situation also known as having no feasible solution.
• This issue will be solved with support vector classifiers, which we will
be covering in the next section.
How does it work?
• SVM works by segregating the two classes with a hyper-plane.
• How can we identify the right hyper-plane?
Identify the right hyper-plane (Scenario-1):
• Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-
plane to classify stars and circles.
• You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which segregates the two classes better”. In this scenario, hyper-plane “B” performs this job excellently.
Identify the right hyper-plane (Scenario-2)
• Here, we have three hyper-planes (A, B, and C) and all segregate the classes well. Now, how can we identify the right hyper-plane?
• Maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the margin.
• You can see that the margin for hyper-plane C is high compared to both A and B. Hence, we name C as the right hyper-plane. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane with a low margin, there is a high chance of misclassification.
Identify the right hyper-plane (Scenario-3):
• Hint: Use the rules discussed in the previous section to identify the right hyper-plane.
• Some of you may have selected hyper-plane B as it has a higher margin compared to A. But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin. Here, hyper-plane B has a classification error and A has classified everything correctly. Therefore, the right hyper-plane is A.
Can we classify two classes (Scenario-4)?
• Here, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
• The SVM algorithm has a feature to ignore outliers and still find the hyper-plane with the maximum margin; hence, it is robust to outliers.
Find the hyper-plane to segregate the classes (Scenario-5):
• In fact, in real-life scenarios, we hardly find data with purely separable classes; most classes have a few or more overlapping observations.
• A high value of C (the budget for margin violations) leads to a more robust model, whereas a lower value creates a more flexible model, since fewer violations of the error terms are tolerated.
• Another way of handling such data is the kernel trick, which uses a kernel function to work with non-linearly separable data.
• A polynomial kernel with degree 2 can be applied to transform the data from 1-dimensional to 2-dimensional data, as sketched below.
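A minimal sketch of this 1-D to 2-D transformation (the data points and the separating threshold below are illustrative assumptions, not taken from the slide):

```python
import numpy as np

# 1-D points: class 0 sits in the middle, class 1 on both sides,
# so no single threshold on x can separate them in 1-D.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Degree-2 polynomial mapping: each point x -> (x, x^2)
X_2d = np.column_stack([x, x ** 2])

# In the new 2-D space a horizontal line (here x^2 = 2.5) separates the classes
predicted = (X_2d[:, 1] > 2.5).astype(int)
print(predicted)          # [1 1 0 0 0 1 1]
print((predicted == y).all())  # True: linearly separable after the mapping
```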
1-Dimensional Data Transformation
• The degree of the polynomial kernel is a tuning parameter
• A kernel evaluated on the original feature vectors returns the same value as the dot product of their corresponding mapped feature vectors.
• Kernel functions do not explicitly map the feature vectors to a higher-dimensional space, nor do they calculate the dot product of the mapped vectors.
• Kernels produce the same value through a different series of operations that
can often be computed more efficiently.
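A small sketch illustrating this equivalence for a degree-2 polynomial kernel on 2-D vectors (the explicit feature map phi and the example vectors are assumptions for illustration):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel computed directly on the original vectors."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(y)))   # 121.0 -> dot product in the mapped space
print(poly2_kernel(x, y))       # 121.0 -> same value, no explicit mapping needed
```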
REASON
To eliminate the computational requirement of explicitly deriving the higher-dimensional vector space from the given basic vector space, while still allowing observations to be separated linearly in the higher dimensions.
• The derived vector space grows very rapidly with the increase in dimensions, and it becomes almost too difficult to continue the computation even when you have only around 30 variables.
Kernel Functions
• The following example shows how the number of derived variables grows.
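As a hedged illustration of this growth (the exact example from the slide is not reproduced here): a degree-2 polynomial mapping of d original features produces the linear terms, the squared terms, and all pairwise products, i.e.

\[
d + d + \binom{d}{2} = \frac{d(d+3)}{2}
\]

derived features. For d = 30 this is already 495 features, and the count grows much faster for higher polynomial degrees.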
(A) Polynomial Kernel:
• Polynomial kernels are popularly used, especially with degree 2.
• In fact, Vladimir N. Vapnik, the inventor of support vector machines, developed a classifier using a degree-2 kernel for classifying handwritten digits.
• Polynomial kernels are given by the following equation:
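A commonly used form (the parameter names γ, r, and d follow the usual convention and are an assumption, not taken from the slide):

\[
K(x, x') = \bigl(\gamma\, x^{\top} x' + r\bigr)^{d}
\]

where d is the degree of the polynomial (d = 2 for the popular quadratic kernel), γ is a scale parameter, and r is a constant offset.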
(B) Radial Basis Function (RBF) / Gaussian Kernel:
• RBF kernels are a good first choice for problems requiring nonlinear models.
• The feature space produced by the Gaussian kernel can have an infinite
number of dimensions, a feat that would be impossible otherwise.
The RBF kernel can be written in simplified form as:
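One standard simplified form, written with the usual γ parameterization (an assumption of notation, consistent with the Gaussian kernel definition):

\[
K(x, x') = \exp\bigl(-\gamma \lVert x - x' \rVert^{2}\bigr), \qquad \gamma = \frac{1}{2\sigma^{2}}
\]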
RBF Kernel Model
Artificial Neural Networks (ANN)
• An ANN models the relationship between a set of input signals and output signals using a model derived from our understanding of the biological brain, which responds to stimuli from its sensory inputs.
• ANN methods try to model problems using interconnected artificial neurons (or
nodes) to solve machine learning problems.
• The cell body accumulates the incoming signals until a threshold is reached, at which point the cell fires and the output signal is transmitted down the axon via an electrochemical process.
Artificial Neural Networks (ANN)
• The logistic function returns a value between 0 and 1, and a set threshold converts this into a binary output.
• For example, here we set the threshold at 0.7: any accumulated signal whose logistic output is greater than 0.7 gives a signal of 1, and any accumulated signal whose output is less than 0.7 returns 0.
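A minimal sketch of this thresholded logistic activation (the function names and the 0.7 cut-off applied to the logistic output are illustrative assumptions):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fire(accumulated_signal, threshold=0.7):
    """Return 1 if the logistic output exceeds the threshold, else 0."""
    return int(logistic(accumulated_signal) > threshold)

print(fire(2.0))   # logistic(2.0) ~ 0.88 > 0.7 -> 1
print(fire(0.0))   # logistic(0.0) = 0.50 < 0.7 -> 0
```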
Biological and Artificial Neurons
Neural Network Model
• Neural network models are considered universal approximators, which means that, by using a neural network methodology, we can approximate almost any underlying functional relationship between inputs and outputs.
• Here, only one neuron is used in each layer; however, the reader can attempt to create additional neurons within the same layer. Weights and biases are initialized with random numbers, so that in both the forward and backward passes they can be updated in order to minimize the overall error.
Forward and Backward Propagation-Intro
• During forward propagation, features are input to the network and fed
through the following layers to produce the output activation.
• In some cases, the weighted combination of the signals is low; in those cases, the bias compensates by adjusting the aggregated value so that it can trigger the neurons in the next layer.
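A minimal forward-pass sketch for the one-neuron-per-layer network described above (the random initialization, sigmoid activation, and single input value are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random initialization of weights and biases, one neuron per layer
rng = np.random.default_rng(0)
w1, b1 = rng.normal(), rng.normal()   # input -> hidden layer 1
w2, b2 = rng.normal(), rng.normal()   # hidden layer 1 -> hidden layer 2
w3, b3 = rng.normal(), rng.normal()   # hidden layer 2 -> output

x = 0.5                               # a single input feature (assumed)

# Forward propagation: each layer takes the weighted combination plus bias
h1 = sigmoid(w1 * x + b1)
h2 = sigmoid(w2 * h1 + b2)
y_hat = sigmoid(w3 * h2 + b3)
print(y_hat)
```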
Forward and Backward Propagation-Intro
• In the last layer (also known as the output layer), outputs are calculated in the same way, by taking the weighted combination of the weights and the outputs obtained from hidden layer 2.
• Once we obtain the output from the model, a comparison is made with the actual value, and we need to backpropagate the errors across the network in order to correct the weights of the entire neural network.
Forward and Backward Propagation
• We take the derivative of the output value and multiply it by the error component, which is obtained by differencing the actual value with the model output.
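Written out for a sigmoid output unit (a standard derivation under that assumption, not taken verbatim from the slide):

\[
\delta_{\text{out}} = (y - \hat{y})\,\sigma'(z_{\text{out}}), \qquad \sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)
\]

where y is the actual value, \(\hat{y}\) is the model output, and \(z_{\text{out}}\) is the pre-activation of the output neuron.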
Forward and Backward Propagation
• We will backpropagate the error from the second hidden layer as well.
• Once all the neurons in hidden layer 1 are updated, the weights between the inputs and the hidden layer also need to be updated, as we cannot update anything on the input variables themselves.
• So we update the weights from the inputs and, at the same time, the neurons in hidden layer 1, as the neurons in layer 1 utilize the weights from the inputs only.
Forward and Backward Propagation
• We have not shown the next iteration, in which the neurons in the output layer are updated with the errors and backpropagation starts again.
• In a similar way, all the weights get updated until the solution converges or the maximum number of iterations is reached.
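A compact end-to-end sketch of these forward and backward passes for a tiny one-hidden-neuron network (the learning rate, iteration count, and training example are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w1, b1 = rng.normal(), rng.normal()      # input  -> hidden
w2, b2 = rng.normal(), rng.normal()      # hidden -> output
lr = 0.5                                 # learning rate (assumed)

x, y = 0.5, 1.0                          # a single training example (assumed)

for _ in range(1000):                    # repeat until convergence / max iterations
    # Forward pass
    h = sigmoid(w1 * x + b1)
    y_hat = sigmoid(w2 * h + b2)

    # Backward pass: output-layer delta, then hidden-layer delta
    delta_out = (y - y_hat) * y_hat * (1 - y_hat)
    delta_h = delta_out * w2 * h * (1 - h)

    # Update weights and biases with the backpropagated errors
    w2 += lr * delta_out * h
    b2 += lr * delta_out
    w1 += lr * delta_h * x
    b1 += lr * delta_h

print(y_hat)   # should be close to the target y after training
```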
Optimization of neural networks
Various techniques have been used for optimizing the weights of neural
networks:
• Stochastic gradient descent (SGD)
• Momentum
• Nesterov accelerated gradient (NAG)
• Adaptive gradient (Adagrad)
• Adadelta
• RMSprop
• Adaptive moment estimation (Adam)
• Limited memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)
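As an illustration of the first two techniques in this list, plain SGD and momentum weight updates can be sketched as follows (the learning rate, momentum coefficient, and toy loss are assumed values for illustration):

```python
import numpy as np

lr, mu = 0.01, 0.9          # learning rate and momentum coefficient (assumed)
w = np.zeros(3)             # weight vector being optimized
v = np.zeros_like(w)        # velocity term used by momentum

def grad(w):
    """Gradient of a toy quadratic loss ||w - 1||^2 (illustrative only)."""
    return 2.0 * (w - 1.0)

for _ in range(100):
    # Plain SGD would step directly against the gradient:
    #   w = w - lr * grad(w)
    # Momentum instead accumulates a velocity and steps along it:
    v = mu * v - lr * grad(w)
    w = w + v

print(w)   # moves towards the minimizer [1, 1, 1]
```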