Module 3_Modified
Multilayer Feed-Forward Network
Feed-forward neural network
Each layer is made up of units.
The inputs to the network correspond to the
attributes measured for each training tuple.
The inputs are fed simultaneously into the
units making up the input layer.
These inputs pass through the input layer
and are then weighted and fed
simultaneously to a second layer of
“neuronlike” units, known as a hidden
layer.
The outputs of the hidden layer units can be
input to another hidden layer, and so on.
The weighted outputs of the last hidden
layer are input to units making up
the output layer, which emits the network's
prediction for given tuples.
The units in the input layer are called input
units.
The units in the hidden layers and output
layer are sometimes referred to
as neurodes, due to their symbolic
biological basis, or as output units.
A network containing two hidden layers is
called a three-layer neural network, and so
on.
It is a feed-forward network since none of
the weights cycles back to an input unit or
to a previous layer's output unit.
https://www.sciencedirect.com/topics/computer-science/backpropagation-
Each output unit takes, as input, a
weighted sum of the outputs from units in
the previous layer.
It applies a nonlinear (activation) function
to the weighted input.
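To make this concrete, here is a minimal NumPy sketch (not from the slides) of a forward pass through one hidden layer and one output layer; the layer sizes, the random weights, and the sigmoid activation are assumptions chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy dimensions: 3 inputs, 4 hidden units, 2 output units.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 4))    # weights: input -> hidden layer
b_hidden = np.zeros(4)                # hidden-layer biases
W_out = rng.normal(size=(4, 2))       # weights: hidden -> output layer
b_out = np.zeros(2)                   # output-layer biases

x = np.array([0.5, -1.2, 3.0])        # one training tuple (its measured attributes)

# Each unit takes a weighted sum of the previous layer's outputs
# and applies a nonlinear activation function to that sum.
h = sigmoid(x @ W_hidden + b_hidden)  # hidden-layer outputs
o = sigmoid(h @ W_out + b_out)        # the network's prediction
print(o)
```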
Compute the number of parameters for the
given network.
The network has 4 + 2 = 6 neurons (not
counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of
26 learnable parameters.
Compute the number of parameters for the
given network.
The network has 4 + 4 + 1 = 9 neurons
(not counting inputs), [3 x 4] + [4 x 4] + [4
x 1] = 12 + 16 + 4 = 32 weights and 4 + 4
+ 1 = 9 biases, for a total of 41 learnable
parameters.
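As a sanity check, a small Python helper (hypothetical, not part of the slides) that counts weights and biases for a fully connected network from its layer sizes; it reproduces the two totals worked out above.

```python
def count_parameters(layer_sizes):
    """layer_sizes includes the input layer, e.g. [3, 4, 2]."""
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])        # one bias per non-input neuron
    return weights + biases

print(count_parameters([3, 4, 2]))       # 20 weights + 6 biases = 26
print(count_parameters([3, 4, 4, 1]))    # 32 weights + 9 biases = 41
```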
Sigmoid Function
ReLU Function
Tanh Function
Sigmoid outputs are not zero-centered.
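For reference, a minimal NumPy sketch of the three activations; the input values are arbitrary, and the range comments restate the standard properties (sigmoid in (0, 1), hence not zero-centered; tanh in (-1, 1); ReLU in [0, inf)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # range (0, 1), not zero-centered

def relu(z):
    return np.maximum(0.0, z)         # range [0, inf)

def tanh(z):
    return np.tanh(z)                 # range (-1, 1), zero-centered

z = np.linspace(-3, 3, 7)
print(sigmoid(z), relu(z), tanh(z), sep="\n")
```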
Support Vector Machines: General Philosophy
The aim is to find the separating hyperplane that gives the maximum separation (margin) between the classes.
https://towardsdatascience.com/support-vector-machines-dual-formulation-quadratic-programming-sequential-minimal-optimization-57f4387ce4dd
Linearly Separable SVM
The optimal hyperplane is given by
w.x + b = 0
where w = {w1, w2, …, wn} is a weight vector and b is a scalar (bias).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
Maximum Margin
The distance between a point P(x0, y0, z0) and a given plane Ax + By + Cz = D is given by
|A x0 + B y0 + C z0 - D| / sqrt(A^2 + B^2 + C^2)
Distance of the bounding hyperplane w.x + b = 1 from the origin:
|1 - b| / ||w||
Distance of the bounding hyperplane w.x + b = -1 from the origin:
|-1 - b| / ||w||
Distance between the two planes (which needs to be maximized):
2 / ||w||
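A quick numeric check of these distances, with the weight vector and bias assumed purely for illustration:

```python
import numpy as np

w = np.array([3.0, 4.0])             # assumed weight vector, ||w|| = 5
b = -2.0                             # assumed bias

norm_w = np.linalg.norm(w)
d_plus = abs(1 - b) / norm_w         # distance of w.x + b = 1 from the origin  -> 0.6
d_minus = abs(-1 - b) / norm_w       # distance of w.x + b = -1 from the origin -> 0.2
print(d_plus - d_minus, 2 / norm_w)  # distance between the planes = 2/||w|| = 0.4
```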
Mathematics behind SVM
For the training data to be linearly separable:
w.xi + b ≥ +1, if yi = +1
w.xi + b ≤ -1, if yi = -1
Or, combining the two constraints:
yi (w.xi + b) ≥ 1, for all i
Vectors xi for which yi (w.xi + b) = 1 (points which fall on the bounding planes) are termed support vectors.
Primal problem:
Minimize (1/2) ||w||^2
subject to yi (w.xi + b) ≥ 1, for all i    (1)
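A minimal sketch of solving this on a made-up, linearly separable toy dataset with scikit-learn's SVC; the very large C value is an assumption used to approximate the hard-margin primal above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed for illustration).
X = np.array([[1, 1], [2, 2], [2, 0],        # class +1
              [-1, -1], [-2, -2], [-2, 0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin (linearly separable) problem.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)  # points with yi(w.xi + b) = 1
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
```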
SVM – Soft Margin
In the soft-margin formulation (see the mathematics below), C is a hyperparameter that decides the trade-off between maximizing the margin and minimizing the classification mistakes.
When C is small, classification mistakes are
given less importance and focus is more on
maximizing the margin, whereas when C is
large, the focus is more on avoiding
misclassification at the expense of keeping
the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
Mathematics behind Soft Margin SVM
Primal problem:
Minimize (1/2) ||w||^2 + C Σi ξi
subject to yi (w.xi + b) ≥ 1 - ξi and ξi ≥ 0, for all i    (1)
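A minimal NumPy sketch of minimizing this soft-margin objective by plain subgradient descent (standard SVM solvers use quadratic programming or SMO instead, as in the linked article); the toy data, learning rate, and epoch count are assumptions for illustration.

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=1e-3, epochs=2000):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - yi(w.xi + b)) by subgradient descent."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                       # points violating the margin (slack > 0)
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Assumed toy data: two clusters plus one noisy point labelled against its cluster.
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 1.5],
              [-1.0, -1.0], [-2.0, -0.5], [-0.5, -0.5]])
y = np.array([1, 1, 1, -1, -1, 1])   # last point is the noisy one

for C in (0.1, 10.0):
    w, b = train_soft_margin_svm(X, y, C=C)
    print(C, w, b)  # small C favours a wide margin; large C tries harder to fit the noisy point
```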
Non-Linear SVM
XOR Problem
X Y X XOR Y
0 0 0
0 1 1
1 0 1
1 1 0
https://www.tech-quantum.com/solving-xor-problem-using-neural-network-c/
https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
SVM—Linearly Inseparable
Transform the original input data into a higher
dimensional space.
Search for a linear separating hyperplane in the
new space.
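To make this concrete, a sketch (assumed for illustration, not from the slides) of one explicit mapping phi(x1, x2) = (x1, x2, x1*x2) under which the XOR classes above become linearly separable; the separating hyperplane shown is one hand-picked choice.

```python
import numpy as np

# XOR truth table: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([0, 1, 1, 0])

def phi(x):
    """Map (x1, x2) to the higher-dimensional space (x1, x2, x1*x2)."""
    return np.array([x[0], x[1], x[0] * x[1]])

# In the new space the hyperplane w.z + b = 0 with w = (1, 1, -2), b = -0.5
# separates the two XOR classes.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
for x, t in zip(X, labels):
    score = w @ phi(x) + b
    print(x, t, "positive side" if score > 0 else "negative side")
```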
Kernel Functions
Kernel functions are generalized functions
that take two vectors (of any dimension) as
input and output a score that denotes how
similar the input vectors are.
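A sketch of three commonly used kernels written out directly in NumPy; the degree, coef0, and gamma values below are assumed defaults chosen for illustration.

```python
import numpy as np

def linear_kernel(u, v):
    return u @ v

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (u @ v + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

u, v = np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 0.0])
print(linear_kernel(u, v), polynomial_kernel(u, v), rbf_kernel(u, v))
```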
The complexity of the trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data.
The support vectors are the essential or critical training examples: they lie closest to the decision boundary (the maximum margin hyperplane).
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high.
References
Dunham M H, “Data Mining: Introductory and
Advanced Topics”, Pearson Education, New Delhi,
2003.
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm