ML-II
CH:02
Supervised Learning algorithms
Supervised Learning
Perceptron Learning
Single Layer Feedforward NN
Multi Layer Feedforward NN
Delta Learning
Back Propagation Algorithm
Supervised Learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well labeled,
which means some data is already tagged with the correct answer.
After that, the machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (set of training examples) and produces a correct outcome from
labeled data.
Perceptron Rule
Algorithm
1. Initialize all weights and biases to random values (for simplicity, zero) and the learning rate to 1.
2. Present the first input pattern (data point).
3. Calculate the net input.
4. Apply the activation function (bipolar step).
5. Update the weights and bias.
6. Repeat the procedure for all data points until there is no weight change.
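The steps above can be sketched as a small training loop. This is a minimal sketch, assuming zero initialization and the update Δw = c(d − y)x on misclassification (consistent with the worked examples later in the deck); the function name is illustrative.

```python
def train_perceptron(patterns, targets, lr=1.0):
    """Perceptron rule with a bipolar step activation."""
    w = [0.0] * len(patterns[0])                # step 1: weights start at zero
    b = 0.0
    changed = True
    while changed:                              # step 6: repeat until no change
        changed = False
        for x, d in zip(patterns, targets):     # step 2: present each pattern
            net = sum(wi * xi for wi, xi in zip(w, x)) + b   # step 3: net input
            y = 1 if net >= 0 else -1           # step 4: bipolar step activation
            if y != d:                          # step 5: update on error only
                w = [wi + lr * (d - y) * xi for wi, xi in zip(w, x)]
                b += lr * (d - y)
                changed = True
    return w, b
```

For linearly separable data the loop terminates (the convergence theorem below), at which point every pattern is classified correctly.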
Perceptron Convergence Theorem
If the training patterns are linearly separable, the perceptron learning rule is guaranteed to converge to a separating weight vector in a finite number of updates.
Multiple Neuron Perceptron
[Figure: a single-layer perceptron with several output neurons Z1 … Zj; each input i connects to each output j with weight Wij, e.g. W11, W1i, W21, W2j.]
Delta Learning
Also called the Least Mean Square (LMS) or Widrow-Hoff rule
Similar to Perceptron Learning
Delta uses a gradient descent approach, hence it can continue forever,
whereas the Perceptron stops after a limited number of iterations
The major aim is to minimize the error over all training patterns
Works on a continuous activation function
ΔW = c (d − y) f′(y_in) X
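One delta-rule update can be sketched as follows. This is a minimal sketch assuming the bipolar sigmoid f(x) = (1 − e⁻ˣ)/(1 + e⁻ˣ) as the continuous activation (the same activation used in the backpropagation example later); the function name is illustrative.

```python
import math

def delta_rule_step(w, x, d, c=0.1):
    """One delta-rule update: dW = c * (d - y) * f'(y_in) * x."""
    y_in = sum(wi * xi for wi, xi in zip(w, x))
    y = (1 - math.exp(-y_in)) / (1 + math.exp(-y_in))  # bipolar sigmoid
    f_prime = 0.5 * (1 - y * y)                        # f'(y_in)
    return [wi + c * (d - y) * f_prime * xi for wi, xi in zip(w, x)]
```

Unlike the perceptron rule, this update is proportional to the continuous error (d − y), so it keeps refining the weights even when the pattern is already classified on the correct side.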
Delta Learning – Problem 1
[Worked example: the delta rule is applied successively to training patterns X = X1, X = X2 and X = X3; the numerical steps were lost in extraction.]
Back Propagation Algorithm – Worked Example
Apply the backpropagation algorithm to find the final weights for the following network.
Input: x = [0.0, 1.0]
Weights between hidden and output layers: w = [0.4, 0.2]; bias on the output node O: w0 = −0.4
Weights between input and hidden layers: v = [2, 1; 1, 2]; biases on the hidden unit nodes: v0 = [0.1, 0.3]
Desired output: d = 1.0
Activation: bipolar sigmoid f(x) = (1 − e⁻ˣ)/(1 + e⁻ˣ), with f′(x) = ½(1 − f(x)²)

Forward pass:
z_in1 = 2(0.0) + 1(1.0) + 0.1 = 1.1 → Z1 = f(1.1) = 0.5005
z_in2 = 1(0.0) + 2(1.0) + 0.3 = 2.3 → Z2 = f(2.3) = 0.8178
y_in = 0.4(0.5005) + 0.2(0.8178) − 0.4 = −0.0362 → Y = f(−0.0362) = −0.0181

Backward pass, output layer (assume α = 1):
δk = (d − y) f′(y_in) = (1 − (−0.0181)) × ½(1 − 0.0181²) = 0.5089
w1(new) = w1 + α δk Z1 = 0.4 + 1 × 0.5089 × 0.5005 = 0.6547
w2(new) = w2 + α δk Z2 = 0.2 + 1 × 0.5089 × 0.8178 = 0.6162
w0(new) = w0 + α δk = −0.4 + 0.5089 = 0.1089

Backward pass, hidden layer:
δz1 = δk w1 f′(z_in1) = 0.5089 × 0.4 × ½(1 − 0.5005²) = 0.0763
δz2 = δk w2 f′(z_in2) = 0.5089 × 0.2 × ½(1 − 0.8178²) = 0.0169

Vij(new) = Vij(old) + α δzj Xi, assume α = 1:
V11 = 2 + 1 × 0.0763 × 0 = 2
V12 = 1 + 1 × 0.0169 × 0 = 1
V21 = 1 + 1 × 0.0763 × 1 = 1.0763
V22 = 2 + 1 × 0.0169 × 1 = 2.0169
V0j(new) = V0j(old) + α δzj:
V01 = 0.1 + 1 × 0.0763 = 0.1763
V02 = 0.3 + 1 × 0.0169 = 0.3169

Cycle error = ½(d − y)² = ½(1 − (−0.0181))² = 0.5183
Final weights: draw the updated network diagram.
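The forward and backward passes of this example can be checked with a short script. This is a sketch assuming the bipolar sigmoid activation, which reproduces the Z, Y and δ values shown in the worked example.

```python
import math

def f(a):
    # Bipolar sigmoid activation
    return (1 - math.exp(-a)) / (1 + math.exp(-a))

def f_prime(out):
    # Derivative expressed through the output: f'(a) = 0.5 * (1 - f(a)^2)
    return 0.5 * (1 - out * out)

x = [0.0, 1.0]
v = [[2.0, 1.0], [1.0, 2.0]]   # v[i][j]: input i -> hidden j
v0 = [0.1, 0.3]
w = [0.4, 0.2]                 # hidden -> output
w0 = -0.4
d, alpha = 1.0, 1.0

# Forward pass
z = [f(v[0][j] * x[0] + v[1][j] * x[1] + v0[j]) for j in range(2)]
y = f(w[0] * z[0] + w[1] * z[1] + w0)

# Backward pass
delta_k = (d - y) * f_prime(y)
delta_z = [delta_k * w[j] * f_prime(z[j]) for j in range(2)]

# Weight updates
w_new = [w[j] + alpha * delta_k * z[j] for j in range(2)]
w0_new = w0 + alpha * delta_k
v_new = [[v[i][j] + alpha * delta_z[j] * x[i] for j in range(2)] for i in range(2)]
v0_new = [v0[j] + alpha * delta_z[j] for j in range(2)]
cycle_error = 0.5 * (d - y) ** 2
```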
Hypothesis space search
Contour plot
How to interpret contour map
Saddle Point
Problem with stochastic Gradient Descent
Contour map: gradient descent vs. stochastic gradient descent
• Gradient descent: convergence is fast, but each update is computationally costly.
• Stochastic gradient descent: convergence is fast, and each update is computationally easy.
Optimization for Training Deep Models:
Exponential moving average
Temp1 = 45 °C
Temp2 = 44 °C
Temp3 = 47 °C
:
Temp365 = 35 °C
Optimization for Training Deep Models:
Exponential moving average
β = 0.9
β²⁰ ≈ 0.12 → only the previous ~20 points are considered
β = 0.5
β³ = 0.125 → only the previous ~3 points are considered
If the β value is large, more previous data points are taken into account.
How many points are considered is based on
#points ≈ 1/(1 − β)
β = 0.9 → #points = 10
β = 0.5 → #points = 2
β = 0.98 → #points = 50
[Figure: exponentially weighted average curves for different β values, e.g. β = 0.5.]
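The recurrence behind these curves, vₜ = β·vₜ₋₁ + (1 − β)·xₜ, can be sketched as a minimal helper (the function name is illustrative):

```python
def exponential_moving_average(values, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * x_t."""
    v = 0.0
    out = []
    for xt in values:
        v = beta * v + (1 - beta) * xt   # older points decay geometrically
        out.append(v)
    return out
```

With β = 0.5 the average tracks the data within a few points, matching the "#points ≈ 1/(1 − β)" rule of thumb above.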
Optimizer Technique
SGD with Momentum
SGD with Momentum
In the momentum-based gradient descent update rule, ‘w’ and ‘b’ are updated not just based on the current update (derivative), but also on the past updates (derivatives).
SGD with Momentum
Advantages – Disadvantages
Momentum-based GD is faster than GD, but it oscillates in and out of the minima valley. Can this oscillation be reduced? Of course: Nesterov Accelerated GD helps us reduce the oscillations.
Nesterov Accelerated GD:
Why not move by the history component first, and then calculate the derivative at that look-ahead point and update later?
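This look-ahead idea can be sketched as follows; `grad_fn`, `gamma` and `eta` are illustrative names, with `grad_fn` a callable returning the derivative at a point.

```python
def nesterov_step(w, velocity, grad_fn, gamma=0.9, eta=0.1):
    """Nesterov accelerated gradient: move by the history component first,
    then evaluate the derivative at the look-ahead point and update."""
    lookahead = w - gamma * velocity              # move by history first
    velocity = gamma * velocity + eta * grad_fn(lookahead)
    return w - velocity, velocity
```

Because the gradient is evaluated after the history move, the update "looks ahead" and corrects course before overshooting, which damps the oscillations of plain momentum.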
Ada Grad and Ada Delta Optimizers
Adaptive Gradient
• Take a different learning rate for each weight/dimension.
• The learning rate decreases based on previous updates.
• Sparse dimension: few accumulated updates, so the learning rate stays high → take a large step.
• Dense dimension: many accumulated updates, so the learning rate becomes small → take a small step.
[Figure: contour plot over W1 and W2 illustrating the different step sizes along sparse and dense dimensions.]
Adagrad
If some inputs are dense and some are sparse:
Algorithm
• Take different learning rates for different weights.
• The learning rate is based on previous updates.

Update step:           1    2    3    4    5
w1 (dense feature):    w1   w1   w1   w1   w1   → updated at every step
w2 (sparse feature):   w2   0    0    0    0    → updated only rarely
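A per-parameter Adagrad step can be sketched as follows; the state layout (a running sum of squared gradients per weight) and the names `eta`/`eps` are illustrative.

```python
import math

def adagrad_step(w, grad, accum, eta=0.1, eps=1e-8):
    """Adagrad: per-parameter learning rate. The accumulated squared
    gradients grow for dense (frequently updated) weights, shrinking
    their effective step; sparse weights keep a large step."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    w = [wi - eta * g / (math.sqrt(a) + eps)
         for wi, g, a in zip(w, grad, accum)]
    return w, accum
```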
RMSProp — Root Mean Square Propagation
AdaGrad decays the learning rate very aggressively (as the denominator grows). As a result, after a while, the frequent parameters will start receiving very small updates because of the decayed learning rate. To avoid this, why not decay the denominator and prevent its rapid growth?
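The fix described above, replacing AdaGrad's growing sum with a decaying average, can be sketched as follows (names and defaults are illustrative):

```python
import math

def rmsprop_step(w, grad, avg_sq, eta=0.01, beta=0.9, eps=1e-8):
    """RMSProp: an exponentially decaying average of squared gradients
    replaces AdaGrad's raw sum, so the denominator cannot grow without
    bound and frequent parameters keep receiving useful updates."""
    avg_sq = [beta * a + (1 - beta) * g * g for a, g in zip(avg_sq, grad)]
    w = [wi - eta * g / (math.sqrt(a) + eps)
         for wi, g, a in zip(w, grad, avg_sq)]
    return w, avg_sq
```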
Adam Optimizer
ADAM combines the advantages of the momentum-based GD update rule and RMSProp. ADAM uses a cumulative history of gradients and another history to adjust the learning rate. Added to this, ADAM also does bias correction.
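One Adam step combining both histories can be sketched as follows; `m` and `v` are the gradient and squared-gradient histories, `t` the step count, and the defaults match the values quoted on the next slide.

```python
import math

def adam_step(w, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-style gradient history (m) + RMSProp-style
    squared-gradient history (v), with bias correction for the
    zero-initialized averages."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - eta * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

Without the bias correction, the zero-initialized m and v would make the first steps far too small; dividing by (1 − βᵗ) compensates exactly for that start-up bias.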
Which Optimizer to use
Adam seems to be more or less the default choice now
(β1 = 0.9, β2 = 0.999 and ϵ = 1e − 8 ).
Although it is supposed to be robust to initial learning
rates, we have observed that for sequence generation
problems η = 0.001 or 0.0001 works best.
Having said that, many papers report that SGD with
momentum (Nesterov or classical) with a simple
annealing learning rate schedule also works well in
practice (typically starting with η = 0.001 or 0.0001 for
sequence generation problems).
Adam might just be the best choice overall.
Some recent work suggests that there is a problem with
Adam and it will not converge in some cases.
Gradient Descent → Stochastic Gradient Descent → Mini-Batch Gradient Descent → Momentum-based Gradient Descent → Nesterov Accelerated Gradient Descent → Adagrad
• Generalization, overfitting and stopping criteria
• Regularization using Lp norm
• Regularization for Deep Learning: Parameter Norm Penalties
• Dataset Augmentation, Noise Robustness, Early Stopping, Sparse Representation, Dropout
Parameter sharing and tying
Parameter sharing: parameters are shared between parts of the model.
If parameters are shared, the number of parameters is reduced;
hence the complexity of the system reduces.
Adding noise to the inputs
Adding (Gaussian) noise in the input layer is:
• the same as adding L2 regularization
• the same as data augmentation
Adding Noise to the outputs
Ensemble method
Dropout
Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies.
Thank you !
Supervised Learning – Perceptron Learning (worked example)
Initial weights: w(1) = [1, −1, 0, 0.5]; learning rate c = 0.1; bipolar step activation.
Training patterns (augmented with the bias input):
x1 = [1, −2, 0, −1]ᵀ, d1 = −1
x2 = [0, 1.5, −0.5, −1]ᵀ, d2 = −1
x3 = [−1, 1, 0.5, −1]ᵀ, d3 = 1

Step 1, let us consider X = X1:
net = w(1)·x1 = 1(1) + (−1)(−2) + 0(0) + 0.5(−1) = 2.5 → Y = 1, but D1 = −1.
Update: w(2) = w(1) + c(d1 − y)x1 = w(1) − 0.2 x1 = [0.8, −0.6, 0, 0.7]

Step 2, consider X = X2:
net = w(2)·x2 = 0.8(0) + (−0.6)(1.5) + 0(−0.5) + 0.7(−1) = −1.6 → Y = −1 = D2, so no update: w(3) = w(2).

Step 3, consider X = X3:
net = w(3)·x3 = 0.8(−1) + (−0.6)(1) + 0(0.5) + 0.7(−1) = −2.1 → Y = −1, but D3 = 1.
Update: w(4) = w(3) + c(d3 − y)x3 = w(3) + 0.2 x3 = [0.6, −0.4, 0.1, 0.5]
(The slide's first attempt subtracted the update, giving [1, −0.8, −0.1, 0.9], and was marked "This is wrongly done"; the correct rule adds c(d − y)x.)

This is one Cycle/EPOCH.
Problem on Perceptron-2
[Worked example: one input pattern is completed, then X = X2 is presented; step 3 is the weight update. Completing all patterns is one EPOCH; a second EPOCH repeats the procedure for the first and second inputs. The numerical details were lost in extraction.]
Perceptron Problem-3
EPOCH 1 – Input 1: Y = 1 but D = −1, so Δb = (D − Y) = −2 and b(new) = 0.5 + (−2) = −1.5 (the weights, e.g. W2, are updated similarly).
EPOCH 1 – Input 2: the net input evaluates to −3.5, so Y = −1; since D = 1, Δb = (D − Y) = +2 and b(new) = b + Δb = −1.5 + 2 = 0.5.
[Remaining numerical details were lost in extraction.]
Perceptron Learning Network – Problem-4
W1 = 0.5, W2 = 0.8, b = 0
Dec 18 SC
Recap