Module-3
(Neural Networks (NN) and Support
Vector Machines (SVM))
Perceptron, Neural Network - Multilayer feed
forward network, Activation functions
(Sigmoid, ReLU, Tanh), Backpropagation
algorithm.
SVM - Introduction, Maximum Margin
Classification, Mathematics behind Maximum
Margin Classification, Maximum Margin
linear separators, soft margin SVM classifier,
non-linear SVM, Kernels for learning non-
linear functions, polynomial kernel, Radial
Basis Function(RBF).
Biological Neuron
Artificial Neuron
Perceptron
The output of the perceptron can also be expressed as a dot product: o(x) = sgn(w · x), where sgn(y) = 1 if y > 0 and −1 otherwise.
Net input function: the weighted sum w · x.
Activation function: the threshold sgn applied to the net input.
https://cs231n.github.io/neural-networks-1/
https://towardsdatascience.com/perceptron-the-artificial-neuron-
Perceptron learning rule
One way to learn an acceptable weight
vector is to begin with random weights,
then iteratively apply the perceptron to
each training example, modifying the
perceptron weights whenever it
misclassifies an example.
This process is repeated, iterating through
the training examples as many times as
needed until the perceptron classifies all
training examples correctly.
Weights are modified at each step according
to the perceptron training rule, which
revises the weight wi associated with input
xi according to the rule
wi ← wi + Δwi, where Δwi = η(t − o)xi.
Here, t is the target output for the
current training example, o is the output
generated by the perceptron, and η is a
positive constant called the learning rate.
The role of the learning rate is to moderate
the degree to which weights are changed at
each step.
It is usually set to some small value (e.g., 0.1).
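The rule above can be sketched in code (a minimal illustration of my own; the data, function name, and hyperparameters are not from the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (n_samples, n_features) inputs; y: targets in {-1, +1}.
    A constant 1 is appended to each input so w[-1] acts as the bias.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias input
    w = np.random.default_rng(0).normal(scale=0.01, size=Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(Xb, y):
            o = 1 if w @ x > 0 else -1   # thresholded output
            if o != t:                   # update only on mistakes
                w += eta * (t - o) * x
                errors += 1
        if errors == 0:                  # all examples classified correctly
            break
    return w

# Linearly separable toy data: the AND function with {-1, +1} labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
```

Because the data is linearly separable, the loop is guaranteed to terminate with all examples classified correctly (the perceptron convergence theorem).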
Gradient Descent and the
Delta Rule
Although the perceptron rule finds a
successful weight vector when the training
examples are linearly separable, it can fail
to converge if the examples are not linearly
separable.
The delta rule is designed to overcome this
difficulty.
If the training examples are not linearly
separable, the delta rule converges toward
a best-fit approximation to the target concept.
The key idea behind the delta rule is to use
gradient descent to search the
hypothesis space of possible weight
vectors to find the weights that best fit the
training examples.
The delta training rule is best understood
by considering the task of training an
unthresholded perceptron; that is, a
linear unit for which the output o is
given by o(x) = w · x.
Training error:
E(w) = (1/2) Σ_{d∈D} (t_d − o_d)²
where D is the set of training examples, t_d
is the target output for training
example d, and o_d is the output of the
linear unit for training example d.
Since the gradient specifies the direction of
steepest increase of E, the training rule
for gradient descent is
w ← w + Δw, where Δw = −η∇E(w).
Here the learning rate η is a positive constant
which determines the step size in the
gradient descent search.
The negative sign is present because we
want to move the weight vector in the
direction that decreases E.
This training rule can also be written in
its component form
wi ← wi + Δwi, where Δwi = −η ∂E/∂wi.
That is, differentiating E(w) from the definition above,
∂E/∂wi = Σ_{d∈D} (t_d − o_d)(−x_id).
Therefore,
Δwi = η Σ_{d∈D} (t_d − o_d) x_id
where x_id denotes the input component xi for training example d.
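The batch form of this update can be sketched as follows (an illustrative toy example assuming NumPy; the helper name and data are my own):

```python
import numpy as np

def train_linear_unit(X, y, eta=0.01, epochs=500):
    """Batch gradient descent for a linear unit (delta rule).

    Minimizes E(w) = 1/2 * sum_d (t_d - o_d)^2 with o = X @ w.
    Update: delta_w_i = eta * sum_d (t_d - o_d) * x_id.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        o = Xb @ w                  # linear outputs for all examples
        w += eta * Xb.T @ (y - o)   # one gradient descent step
    return w

# Fit a slightly noisy line t = 2x + 1
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.01, size=50)
w = train_linear_unit(X, y)
```

Unlike the thresholded perceptron rule, this converges toward the least-squares fit even when the data cannot be classified perfectly.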
Multilayer Feed Forward
Network
Feed forward neural network
Each layer is made up of units.
The inputs to the network correspond to the
attributes measured for each training tuple.
The inputs are fed simultaneously into the
units making up the input layer.
These inputs pass through the input layer
and are then weighted and fed
simultaneously to a second layer of
“neuronlike” units, known as a hidden
layer.
The outputs of the hidden layer units can be
input to another hidden layer, and so on.
The weighted outputs of the last hidden
layer are input to units making up
the output layer, which emits the network's
prediction for given tuples.
The units in the input layer are called input
units.
The units in the hidden layers and output
layer are sometimes referred to
as neurodes, due to their symbolic
biological basis, or as output units.
A network containing two hidden layers is
called a three-layer neural network, and so
on.
It is a feed-forward network since none of
the weights cycles back to an input unit or
to a previous layer's output unit.
https://www.sciencedirect.com/topics/computer-science/backpropagation-
Each output unit takes, as input, a
weighted sum of the outputs from units in
the previous layer.
It applies a nonlinear (activation) function
to the weighted input.
Compute the number of parameters for the
given network.
The network has 4 + 2 = 6 neurons (not
counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of
26 learnable parameters.
Compute the number of parameters for the
given network.
The network has 4 + 4 + 1 = 9 neurons
(not counting inputs), [3 x 4] + [4 x 4] + [4
x 1] = 12 + 16 + 4 = 32 weights and 4 + 4
+ 1 = 9 biases, for a total of 41 learnable
parameters.
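The counting rule used in both examples (weights = product of each pair of adjacent layer sizes, plus one bias per non-input unit) can be written as a small helper (my own sketch):

```python
def count_params(layer_sizes):
    """Count learnable parameters of a fully connected feed-forward net.

    layer_sizes: e.g. [3, 4, 2] means 3 inputs, one hidden layer of 4
    units, and 2 output units. Each non-input unit has one weight per
    incoming connection plus one bias.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_params([3, 4, 2]))        # 20 weights + 6 biases = 26
print(count_params([3, 4, 4, 1]))     # 32 weights + 9 biases = 41
print(count_params([4, 16, 8, 4, 2])) # 232 weights + 30 biases = 262
```

The last call checks the later exercise with layers 4-16-8-4-2.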
Sigmoid function
Relu Function
Tanh Function
Sigmoid function
Sigmoid outputs are not zero centered.
If the activation function of the network is not
zero centered, the layer output y = f(x · w) is always
positive or always negative.
Thus, the output of a layer is always shifted
toward either positive values or negative values.
As a result, the weight vector needs more updates
to be trained properly.
Tanh vs Sigmoid
The tanh function is a stretched and shifted
version of the sigmoid.
Both sigmoid and tanh functions belong
to the S-like functions that suppress the
input value to a bounded range.
This helps the network to keep its weights
bounded and prevents the exploding
gradient problem where the value of the
gradients becomes very large.
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions
The maximum gradient of tanh (1 at z = 0) is four
times the maximum gradient of the sigmoid function (0.25 at z = 0).
This means that using the tanh activation
function results in higher values of gradient
during training and higher updates in the
weights of the network.
So, if we want strong gradients and
big learning steps, we should use the
tanh activation function.
Another difference is that the output
of tanh is symmetric around zero
leading to faster convergence.
The output of tanh ranges from −1 to 1 and
has equal mass on both sides of the
zero-axis, so it is a zero-centered function.
So, tanh overcomes the non-zero
centric issue of the logistic activation
function.
Hence optimization becomes comparatively
easier than with the logistic function, and tanh is
generally preferred over it.
Comparison with ReLU
Sigmoid and tanh functions suffer from vanishing
gradient problem.
It is encountered while training artificial neural
networks with gradient-based learning
methods and backpropagation.
In such methods, during each iteration of training
each of the neural network's weights receives an
update proportional to the partial derivative of the
error function with respect to the current weight.
The problem is that in some cases, the gradient
will be vanishingly small, effectively preventing
the weight from changing its value.
In the worst case, this may completely stop the
neural network from further training.
ReLU activation function can fix the
vanishing gradient problem.
Backpropagation
A feedforward phase - where an input
vector is applied and the signal propagates
through the network layers, modified by
the current weights and biases and by the
nonlinear activation functions.
Corresponding output values then emerge,
and these can be compared with the target
outputs for the given input vector using a
loss function.
A feedback phase - the error signal is
then fed back (backpropagated) through
the network layers to modify the weights in
a way that minimizes the error across the
entire training set, effectively minimizing
the error surface in weight-space.
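The two phases can be sketched for a one-hidden-layer network of sigmoid units, trained here on the OR function as a smoke test (the network size, learning rate, and names are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, y, hidden=4, eta=0.5, epochs=5000, seed=0):
    """Stochastic-gradient backpropagation for a 2-layer sigmoid net.

    Feedforward phase: compute hidden and output activations.
    Feedback phase: propagate the output error back to update weights.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):    # one example at a time (SGD)
            x, t = X[i], y[i]
            # feedforward phase
            h = sigmoid(x @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            # feedback phase: delta terms for squared error 1/2 (t - o)^2
            delta_o = (o - t) * o * (1 - o)
            delta_h = (delta_o @ W2.T) * h * (1 - h)
            W2 -= eta * np.outer(h, delta_o); b2 -= eta * delta_o
            W1 -= eta * np.outer(x, delta_h); b1 -= eta * delta_h
    return lambda X: sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])  # OR targets
net = train_backprop(X, y)
```

The delta_h line is the key step: each hidden unit's error is the weighted sum of the output errors it contributed to, scaled by its own activation derivative.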
Backpropagation Algorithm
(Stochastic gradient descent version)
Determine the number of trainable
parameters of the following neural net:
Input layer: 4 units.
Hidden layer 1: 16 units.
Hidden layer 2: 8 units.
Hidden layer 3: 4 units.
Output layer: 2 units.
262 trainable parameters.
Support Vector Machine
(Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues in
1995)
01/04/2025
Support Vector Machines—General Philosophy
[Figure: small-margin vs. large-margin separators; the points on the bounding planes are the support vectors]
http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
A learned classifier (hyperplane) achieves
maximum separation between the classes.
The two planes parallel to the classifier and
which pass through one or more points in
the dataset are called bounding planes.
The distance between these bounding
planes is called margin.
By SVM learning, we mean finding a
hyperplane which maximizes this margin.
https://towardsdatascience.com/support-vector-machines-dual-formulation-quadratic-programming-sequential-minimal-optimization-57f4387ce4dd
Linearly Separable SVM
The optimal hyperplane is given by
w · x + b = 0
where w = {w1, w2, …, wn} is a weight vector and b
a scalar (bias).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
Maximum Margin
The distance between a point P(x0, y0, z0) and a
plane Ax + By + Cz + D = 0 is given by
|Ax0 + By0 + Cz0 + D| / √(A² + B² + C²).
Here we have two bounding planes:
w · x + b = 1 and w · x + b = −1
Distance of the bounding hyperplane w · x + b = 1 from the origin
= |1 − b| / ‖w‖
Distance of the bounding hyperplane w · x + b = −1 from the origin
= |−1 − b| / ‖w‖
Distance between the planes (which needs to be maximized)
= |(1 − b) − (−1 − b)| / ‖w‖
= 2 / ‖w‖
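As a numeric check of the 2/‖w‖ formula (toy data of my own, with the closest points placed exactly on the bounding planes):

```python
import numpy as np

def margin(w, b, X, y):
    """Geometric margin of a separating hyperplane w.x + b = 0.

    Returns 2 * min_i y_i (w.x_i + b) / ||w||: the distance between
    the two bounding planes through the closest points.
    """
    return 2 * np.min(y * (X @ w + b)) / np.linalg.norm(w)

# 2-D data separated by x1 + x2 = 0, closest points on w.x + b = +/-1
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X, y))  # 2 / ||w|| = 2 / sqrt(2) ~ 1.4142
```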
Mathematics behind SVM
For the training data to be linearly
separable:
w · xi + b ≥ 1, if yi = +1
w · xi + b ≤ −1, if yi = −1
Or, combining the two:
yi(w · xi + b) ≥ 1, i = 1, 2, …, n
Vectors xi for which yi(w · xi + b) = 1
(points which fall on the bounding planes)
are termed support vectors.
Primal problem
minimize (1/2)‖w‖²
subject to yi(w · xi + b) ≥ 1, i = 1, 2, …, n    (1)
The linear decision function I(x) is then given by
I(x) = sign(w · x + b)
SVM – Soft Margin
Here, C is a hyperparameter that decides
the trade-off between maximizing the
margin and minimizing the mistakes.
When C is small, classification mistakes are
given less importance and focus is more on
maximizing the margin, whereas when C is
large, the focus is more on avoiding
misclassification at the expense of keeping
the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
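The role of C can be seen in a bare-bones primal formulation trained by subgradient descent (a sketch of my own, not the standard solver; real SVM libraries solve the dual with quadratic programming):

```python
import numpy as np

def train_soft_margin(X, y, C=1.0, eta=0.01, epochs=2000):
    """Soft-margin linear SVM via subgradient descent on the primal:

        min_{w,b}  1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))

    Larger C penalizes margin violations more heavily; smaller C lets
    the margin grow at the cost of more violations.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # examples inside the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_soft_margin(X, y, C=10.0)
```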
Mathematics behind Soft Margin SVM
minimize (1/2)‖w‖² + C Σi ξi
subject to yi(w · xi + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, 2, …, n    (1)
Here ξi are slack variables that measure how far example i violates the margin.
Non-Linear SVM
XOR Problem
X Y X XOR Y
0 0 0
0 1 1
1 0 1
1 1 0
https://www.tech-quantum.com/solving-xor-problem-using-neural-network-c/
https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
SVM—Linearly Inseparable
Transform the original input data into a higher
dimensional space.
Search for a linear separating hyperplane in the
new space.
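For example, XOR (not separable in 2-D) becomes linearly separable after adding a product feature (the lifted plane below is a hand-picked choice of mine, not produced by an SVM solver):

```python
import numpy as np

# XOR points and labels in {-1, +1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def lift(x):
    """Map 2-D input to 3-D by appending the product feature x1*x2."""
    return np.array([x[0], x[1], x[0] * x[1]])

Z = np.array([lift(x) for x in X])
# In the lifted space the plane z1 + z2 - 2*z3 = 0.5 separates the classes
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Z @ w + b))  # matches y: XOR is linearly separable here
```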
Kernel Functions
Kernel functions are generalized functions
that take two vectors (of any dimension) as
input and output a score that denotes how
similar the input vectors are.
An example is the dot product function: if
the dot product is small, we conclude that
vectors are different and if the dot product
is large, we conclude that vectors are more
similar.
Kernel Trick
We can use any kernel function in place of
the dot product, one that has the capability
of measuring similarity in higher
dimensions (where it could be more
accurate), without increasing the
computational cost much.
This is essentially known as the Kernel
Trick.
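A quick check of the trick for the degree-2 polynomial kernel (my own sketch; phi is the explicit feature map the kernel implicitly uses for 2-D inputs):

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (x.z)^d: the dot product in a
    higher-dimensional feature space, without mapping the data there."""
    return (x @ z) ** degree

def phi(x):
    """Explicit degree-2 feature map for 2-D input:
    phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma ||x - z||^2); its implicit
    feature space is infinite-dimensional."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# Kernel trick: (x.z)^2 equals the dot product after the explicit map
print(poly_kernel(x, z), phi(x) @ phi(z))
```

The kernel needs only the 2-D dot product, yet agrees exactly with the dot product computed after mapping both points into 3-D.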
Polynomial Kernel
K(x, z) = (x · z + c)^d, for degree d and constant c ≥ 0.
Kernel Matrix
The n × n matrix with entries Kij = K(xi, xj), holding the kernel value for every pair of training examples; by Mercer's condition it must be symmetric and positive semi-definite.
Why is SVM Effective on High Dimensional Data?
The complexity of the trained classifier is
characterized by the number of support vectors rather
than the dimensionality of the data.
The support vectors are the essential or critical
training examples: they lie closest to the decision
boundary (Maximum Margin Hyperplane).
Thus, an SVM with a small number of support
vectors can have good generalization, even when
the dimensionality of the data is high.
References
Dunham M H, "Data Mining: Introductory and
Advanced Topics", Pearson Education, New Delhi,
2003.
Jiawei Han, Micheline Kamber, "Data Mining:
Concepts and Techniques", Elsevier, 2006.
K.P. Soman, Shyam Diwakar, V. Ajay, "Insight into
Data Mining: Theory and Practice", PHI Pvt. Ltd.,
New Delhi, 2008.
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm