
Module 3_Modified

The document covers key concepts in Neural Networks (NN) and Support Vector Machines (SVM), including perceptrons, activation functions, backpropagation, and the mathematical foundations of SVMs. It discusses the importance of margin maximization in SVMs, the role of kernel functions for non-linear classification, and the advantages of using different activation functions like sigmoid, tanh, and ReLU. Additionally, it highlights the training processes for perceptrons and neural networks, including the delta rule and gradient descent.


Module-3

Neural Networks (NN) and Support Vector Machines (SVM)

Perceptron, Neural Network - Multilayer feed forward network, Activation functions (Sigmoid, ReLU, Tanh), Backpropagation algorithm.

SVM - Introduction, Maximum Margin Classification, Mathematics behind Maximum Margin Classification, Maximum Margin linear separators, soft margin SVM classifier, non-linear SVM, Kernels for learning non-linear functions, polynomial kernel, Radial Basis Function (RBF).
Biological Neuron
Artificial Neuron
Perceptron
The output of the perceptron can also be expressed as a dot product: o(x) = sgn(w · x), where x includes a constant input x0 = 1 for the bias weight w0.
Net input function
Activation function
https://cs231n.github.io/neural-networks-1/
https://towardsdatascience.com/perceptron-the-artificial-neuron-
Perceptron learning rule
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example.
This process is repeated, iterating through the training examples as many times as needed, until the perceptron classifies all training examples correctly.
Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule

wi ← wi + Δwi, where Δwi = η (t − o) xi

Here, t is the target output for the current training example, o is the output generated by the perceptron, and η is a positive constant called the learning rate.
The role of the learning rate is to moderate the degree to which weights are changed at each step.
It is usually set to some small value (e.g., 0.1).
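The rule above can be sketched directly in code. This is a minimal illustration (the AND-style toy data and learning rate are assumptions, not from the slides):

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1 for the bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1        # thresholded output
            if o != target:                    # update only on a misclassification
                w += eta * (target - o) * x
                errors += 1
        if errors == 0:                        # all training examples classified correctly
            break
    return w

# Linearly separable toy data (AND function), labels in {-1, +1}
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
preds = [1 if w @ np.r_[1.0, x] > 0 else -1 for x in X]
print(preds)  # [-1, -1, -1, 1]
```

Because the data are linearly separable, the loop terminates once every example is classified correctly, exactly as the slide describes.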
Gradient Descent and the
Delta Rule
Although the perceptron rule finds a successful weight vector when the training examples are linearly separable, it can fail to converge if the examples are not linearly separable.
The delta rule is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target.
The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.
The delta training rule is best understood by considering the task of training an unthresholded perceptron; that is, a linear unit for which the output o is given by

o(x) = w · x

Training error

E(w) = (1/2) Σ_{d∈D} (td − od)²

where D is the set of training examples, td is the target output for training example d, and od is the output of the linear unit for training example d.
Since the gradient ∇E(w) specifies the direction of steepest increase of E, the training rule for gradient descent is

w ← w + Δw, where Δw = −η ∇E(w)

Here the learning rate η is a positive constant, which determines the step size in the gradient descent search.
The negative sign is present because we want to move the weight vector in the direction that decreases E.
This training rule can also be written in its component form

wi ← wi + Δwi, where Δwi = −η ∂E/∂wi

That is,

∂E/∂wi = Σ_{d∈D} (td − od)(−xid)

Therefore,

Δwi = η Σ_{d∈D} (td − od) xid
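The delta rule update above can be sketched as batch gradient descent on a linear unit. The toy data and learning rate below are illustrative assumptions:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.01, epochs=500):
    """Batch gradient descent with the delta rule: dw_i = eta * sum_d (t_d - o_d) * x_id."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # x0 = 1 for the bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                    # unthresholded linear outputs o_d
        w += eta * X.T @ (t - o)     # delta rule weight update
    return w

# Noisy targets (no exact linear fit); the delta rule still converges to a best fit
X = np.array([[0.], [1.], [2.], [3.]])
t = np.array([0.1, 0.9, 2.1, 2.9])   # roughly t = x
w = train_linear_unit(X, t)
error = 0.5 * np.sum((t - np.hstack([np.ones((4, 1)), X]) @ w) ** 2)
print(w, error)
```

Even though no weight vector fits these targets exactly, the training error E settles at the least-squares minimum, which is the "best-fit approximation" the slide refers to.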
Multilayer Feed Forward
Network
Feed forward neural network
Each layer is made up of units.
The inputs to the network correspond to the
attributes measured for each training tuple.
The inputs are fed simultaneously into the
units making up the input layer.
These inputs pass through the input layer
and are then weighted and fed
simultaneously to a second layer of
“neuronlike” units, known as a hidden
layer.
The outputs of the hidden layer units can be
input to another hidden layer, and so on.
The weighted outputs of the last hidden
layer are input to units making up
the output layer, which emits the network's
prediction for given tuples.
The units in the input layer are called input
units.
The units in the hidden layers and output
layer are sometimes referred to
as neurodes, due to their symbolic
biological basis, or as output units.
A network containing two hidden layers is
called a three-layer neural network, and so
on.
It is a feed-forward network since none of
the weights cycles back to an input unit or
to a previous layer's output unit.
https://www.sciencedirect.com/topics/computer-science/backpropagation-

Each output unit takes, as input, a
weighted sum of the outputs from units in
the previous layer.
It applies a nonlinear (activation) function
to the weighted input.
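The forward pass just described can be sketched in a few lines. The layer sizes and random weights below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Feed-forward pass: each layer applies a nonlinear activation to its weighted input."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)   # weighted sum of the previous layer's outputs, then activation
    return a

rng = np.random.default_rng(0)
# 3 inputs -> 4 hidden units -> 2 output units
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y = forward(np.array([0.5, -1.0, 2.0]), layers)
print(y.shape)  # (2,)
```

Note that signals only ever flow forward through `layers`; nothing cycles back, which is what makes this a feed-forward network.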
Compute the number of parameters for the
given network.
The network has 4 + 2 = 6 neurons (not
counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of
26 learnable parameters.
Compute the number of parameters for the
given network.
The network has 4 + 4 + 1 = 9 neurons
(not counting inputs), [3 x 4] + [4 x 4] + [4
x 1] = 12 + 16 + 4 = 32 weights and 4 + 4
+ 1 = 9 biases, for a total of 41 learnable
parameters.
Sigmoid function
ReLU Function
Tanh Function
Sigmoid function
Sigmoid outputs are not zero centered.
If the activation function of a layer is not zero centered, its output y = f(wᵀx) is always positive (or always negative).
Thus, the output of a layer is always pushed toward either positive or negative values.
As a result, the weight vector needs more updates to be trained properly.
Tanh vs Sigmoid
The tanh function is a stretched and shifted version of the sigmoid.
Both sigmoid and tanh belong to the family of S-shaped functions that squash the input value into a bounded range.
This helps the network keep its weights bounded and prevents the exploding gradient problem, where the value of the gradients becomes very large.
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions
The maximum gradient of tanh is four times greater than that of the sigmoid function.
This means that using the tanh activation function results in larger gradient values during training and bigger updates to the weights of the network.
So, if we want strong gradients and big learning steps, we should use the tanh activation function.
Another difference is that the output of tanh is symmetric around zero, which leads to faster convergence.
The output of tanh ranges from -1 to 1, with equal mass on both sides of the zero axis, so it is a zero-centered function.
So, tanh overcomes the non-zero-centered issue of the logistic (sigmoid) activation function.
Hence optimization becomes comparatively easier than with the logistic function, and tanh is generally preferred over it.
Comparison with ReLU
Sigmoid and tanh functions suffer from the vanishing gradient problem.
It is encountered while training artificial neural networks with gradient-based learning methods and backpropagation.
In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight.
The problem is that in some cases the gradient will be vanishingly small, effectively preventing the weight from changing its value.
In the worst case, this may completely stop the neural network from training further.
The ReLU activation function can fix the vanishing gradient problem.
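A quick numerical check of these claims: the sigmoid's gradient peaks at 0.25, tanh's at 1.0 (four times larger), while ReLU's gradient stays exactly 1 for every positive input, which is why it avoids the vanishing gradient.

```python
import numpy as np

z = np.linspace(-5, 5, 1001)

sigmoid = 1 / (1 + np.exp(-z))
d_sigmoid = sigmoid * (1 - sigmoid)   # peaks at 0.25 when z = 0
d_tanh = 1 - np.tanh(z) ** 2          # peaks at 1.0 when z = 0 (4x the sigmoid's)
d_relu = (z > 0).astype(float)        # exactly 1 for all positive z: never shrinks

print(d_sigmoid.max(), d_tanh.max(), d_relu.max())
```

Multiplying many factors of at most 0.25 across layers is what makes sigmoid gradients vanish; the ReLU factors of 1 do not shrink the product.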
Backpropagation
A feedforward phase - where an input
vector is applied and the signal propagates
through the network layers, modified by
the current weights and biases and by the
nonlinear activation functions.
Corresponding output values then emerge,
and these can be compared with the target
outputs for the given input vector using a
loss function.
A feedback phase - the error signal is
then fed back (backpropagated) through
the network layers to modify the weights in
a way that minimizes the error across the
entire training set, effectively minimizing
the error surface in weight-space.
Backpropagation Algorithm
(Stochastic gradient descent version)
Determine the number of trainable
parameters of the following neural net:
Input layer: 4 units.
Hidden layer 1: 16 units.
Hidden layer 2: 8 units.
Hidden layer 3: 4 units.
Output layer: 2 units.

262 trainable parameters.
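The counting rule used in all three examples (one weight per input-unit pair, one bias per unit) can be sketched as a small helper:

```python
def count_parameters(layer_sizes):
    """Weights + biases for a fully connected feed-forward net."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total

print(count_parameters([3, 4, 2]))         # 26
print(count_parameters([3, 4, 4, 1]))      # 41
print(count_parameters([4, 16, 8, 4, 2]))  # 262
```

This reproduces the 26, 41, and 262 parameter counts worked out on the preceding slides.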


Support Vector Machine
(Developed at AT&T Bell Laboratories by Vladimir Vapnik and colleagues in 1995)

Support Vector Machines — General Philosophy

(Figure: a small-margin separator vs. a large-margin separator, with the support vectors marked.)

https://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
A learned classifier (hyperplane) achieves maximum separation between the classes.
The two planes parallel to the classifier which pass through one or more points in the dataset are called bounding planes.
The distance between these bounding planes is called the margin.
By SVM learning, we mean finding a hyperplane which maximizes this margin.

https://towardsdatascience.com/support-vector-machines-dual-formulation-quadratic-programming-sequential-minimal-optimization-57f4387ce4dd
Linearly Separable SVM

The optimal hyperplane is given by

w · x + b = 0

where w = {w1, w2, ..., wn} is a weight vector and b a scalar (bias).

https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
Maximum Margin

The distance between a point P(x0, y0, z0) and a plane Ax + By + Cz + D = 0 is given by

|Ax0 + By0 + Cz0 + D| / √(A² + B² + C²)

Here we have two bounding planes,

w · x + b = 1 and w · x + b = -1
The distance of the bounding hyperplane w · x + b = 1 from the origin is |1 - b| / ||w||.
The distance of the bounding hyperplane w · x + b = -1 from the origin is |-1 - b| / ||w||.
Therefore the distance between the two bounding planes (which needs to be maximized) is

|(1 - b) - (-1 - b)| / ||w|| = 2 / ||w||
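A quick numeric check of the 2/||w|| result (w and b below are arbitrary illustrative values):

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
b = 2.0

# Distance of the plane w.x + b = c from the origin: |c - b| / ||w||
norm_w = np.linalg.norm(w)
d_plus = abs(1 - b) / norm_w    # bounding plane w.x + b = 1
d_minus = abs(-1 - b) / norm_w  # bounding plane w.x + b = -1

margin = 2 / norm_w
print(d_plus, d_minus, margin)  # the two distances differ by exactly 2/||w||
```

Since the margin is 2/||w||, maximizing it is equivalent to minimizing ||w||, which is how the optimization problem on the next slides is stated.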
Mathematics behind SVM

For the training data to be linearly separable:

w · xi + b ≥ 1, if yi = +1
w · xi + b ≤ -1, if yi = -1

Or, equivalently,

yi (w · xi + b) ≥ 1, for i = 1, 2, ..., n
Vectors xi for which yi (w · xi + b) = 1 (points which fall on the bounding planes) are termed support vectors.

Primal problem

minimize (1/2) ||w||²
subject to yi (w · xi + b) ≥ 1, i = 1, 2, ..., n   (1)
Linearly Separable SVM

The optimal hyperplane is given by

w · x + b = 0

where w = {w1, w2, ..., wn} is a weight vector and b a scalar (bias).

The linear decision function I(x) is then given by

I(x) = sgn(w · x + b)

https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
SVM – Soft Margin
Here, C is a hyperparameter that decides
the trade-off between maximizing the
margin and minimizing the mistakes.
When C is small, classification mistakes are
given less importance and focus is more on
maximizing the margin, whereas when C is
large, the focus is more on avoiding
misclassification at the expense of keeping
the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
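The trade-off controlled by C can be seen directly in the soft-margin objective, (1/2)||w||² + C Σi ξi, with hinge loss ξi = max(0, 1 − yi(w · xi + b)). Below is a minimal subgradient-descent sketch (the toy data, step size, and C values are illustrative assumptions, not a production solver):

```python
import numpy as np

def soft_margin_svm(X, y, C=1.0, lr=0.001, epochs=5000):
    """Subgradient descent on (1/2)||w||^2 + C * sum(max(0, 1 - y_i(w.x_i + b)))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                               # examples with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 1-D data with one point on the wrong side of its class
X = np.array([[-2.0], [-1.0], [1.0], [2.0], [-0.5]])
y = np.array([-1, -1, 1, 1, 1])
w_small, b_small = soft_margin_svm(X, y, C=0.01)  # small C: wide margin, tolerates the mistake
w_large, b_large = soft_margin_svm(X, y, C=10.0)  # large C: shrinks the margin to fix it
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Since the margin width is 2/||w||, the much larger ||w|| produced by the large-C run corresponds to the smaller margin described above.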
Mathematics behind Soft Margin SVM

minimize (1/2) ||w||² + C Σi ξi
subject to yi (w · xi + b) ≥ 1 - ξi, ξi ≥ 0, i = 1, 2, ..., n   (1)
Non-Linear SVM
XOR Problem
X  Y  X XOR Y
0  0  0
0  1  1
1  0  1
1  1  0
https://www.tech-quantum.com/solving-xor-problem-using-neural-network-c/
https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
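XOR is not linearly separable in the original (X, Y) plane, but it becomes linear after mapping to a higher-dimensional space, since X XOR Y = X + Y − 2XY. A minimal sketch of this mapping:

```python
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])

# Map each point to a higher-dimensional space: (x, y) -> (x, y, x*y)
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In that space XOR is exactly linear: x + y - 2*x*y
w = np.array([1, 1, -2])
preds = phi @ w
print(preds)  # [0 1 1 0]
```

This is the same idea the next slides formalize: transform the data, then separate it with a hyperplane in the new space.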
SVM — Linearly Inseparable

Transform the original input data into a higher dimensional space.
Search for a linear separating hyperplane in the new space.
Kernel Functions
Kernel functions are generalized functions that take two vectors (of any dimension) as input and output a score that denotes how similar the input vectors are.
An example is the dot product function: if the dot product is small, we conclude that the vectors are different, and if the dot product is large, we conclude that the vectors are more similar.
Kernel Trick
We can use any fancy kernel function in place of the dot product, one that has the capability of measuring similarity in higher dimensions (where it could be more accurate), without increasing the computational cost much.
This is essentially known as the Kernel Trick.
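A quick sketch of why the trick works: the degree-2 polynomial kernel K(a, b) = (a · b)² returns exactly the dot product of explicit degree-2 feature maps φ(a) · φ(b), without ever constructing φ:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed entirely in the original space."""
    return (a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

explicit = phi(a) @ phi(b)  # dot product in the higher-dimensional space
trick = poly_kernel(a, b)   # same value, without building phi
print(explicit, trick)      # 16.0 16.0
```

The kernel evaluation costs one dot product regardless of how large the implicit feature space is, which is where the computational saving comes from.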
Polynomial Kernel

K(xi, xj) = (xi · xj + c)^d, where d is the degree and c ≥ 0 a constant.
Kernel Matrix

The kernel (Gram) matrix K has entries Kij = K(xi, xj) for every pair of training examples.
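A minimal sketch of building such a matrix with the RBF kernel, K(a, b) = exp(−γ ||a − b||²) (the toy points and γ are illustrative assumptions):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
K = rbf_kernel_matrix(X)
print(K.shape)     # (3, 3)
print(np.diag(K))  # all ones: every point is maximally similar to itself
```

The matrix is symmetric, has ones on the diagonal, and nearby points get entries close to 1 while distant points get entries near 0, matching the similarity interpretation of kernels given above.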
Why is SVM Effective on High Dimensional Data?

The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data.
The support vectors are the essential or critical training examples: they lie closest to the decision boundary (the maximum margin hyperplane).
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.
References

Dunham M. H., "Data Mining: Introductory and Advanced Topics", Pearson Education, New Delhi, 2003.
Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier, 2006.
K. P. Soman, Shyam Diwakar, V. Ajay, "Insight into Data Mining: Theory and Practice", PHI Pvt. Ltd., New Delhi, 2008.
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm
