Chapter-2 Single Layer Feedforward Network

This document discusses various supervised learning algorithms. It begins by introducing perceptron learning using a single-layer feedforward neural network. It then discusses delta learning (least mean square), which uses gradient descent to minimize the error over all training patterns. Finally, it introduces the backpropagation algorithm, which can be used to calculate gradients in multi-layer neural networks.

Uploaded by

shahdharmil3103

ML-II

CH:02
Supervised Learning algorithms
Supervised Learning

 Perceptron Learning
 Single Layer Feedforward NN
 Multi Layer Feedforward NN
 Delta Learning
 Back Propagation Algorithm
Supervised Learning

Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher. Basically, supervised learning means we teach or train the machine using data that is well labeled, i.e., some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
Supervised Learning

Perceptron Learning
Single Layer Feedforward NN
Multi Layer Feedforward NN
Delta Learning
Back Propagation Algorithm
Perceptron Rule
Algorithm
1. Initialize all weights and biases to random values (zero for simplicity) and set the learning rate α to 1.
2. Present the first input pattern (data point).
3. Calculate the net input: y_in = b + Σi xi·wi
4. Apply the activation function (bipolar step): y = 1 if y_in > 0, y = 0 if y_in = 0, y = −1 if y_in < 0
5. Update the weights and bias: if y ≠ t (target), then wi(new) = wi(old) + α·t·xi and b(new) = b(old) + α·t; otherwise leave them unchanged.
6. Repeat the procedure for all data points until there is no weight change.
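The numbered steps above can be sketched as a short Python routine (the AND data at the bottom is illustrative, not from the slides):

```python
def bipolar_step(net):
    """Bipolar step activation: +1 if net > 0, -1 if net < 0, 0 at net == 0."""
    if net > 0:
        return 1
    if net < 0:
        return -1
    return 0

def train_perceptron(patterns, targets, lr=1.0, max_epochs=100):
    """Perceptron rule: sweep all patterns repeatedly until no weight changes."""
    n = len(patterns[0])
    w = [0.0] * n          # weights initialized to zero for simplicity
    b = 0.0                # bias
    for _ in range(max_epochs):
        changed = False
        for x, d in zip(patterns, targets):
            net = b + sum(wi * xi for wi, xi in zip(w, x))   # net input
            y = bipolar_step(net)                            # activation
            if y != d:                                       # update only on error
                w = [wi + lr * d * xi for wi, xi in zip(w, x)]
                b = b + lr * d
                changed = True
        if not changed:                                      # converged
            break
    return w, b

# AND function with bipolar inputs/targets (illustrative data)
X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
D = [1, -1, -1, -1]
w, b = train_perceptron(X, D)
```

With this data the loop stops after the second epoch, once a full pass produces no weight change.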
Perceptron Convergence Theorem


Multiple Neuron Perceptron

[Figure: a single-layer network with several output neurons Z1 … Zj; each input unit i connects to each output unit j through a weight Wij.]
Supervised Learning

Perceptron Learning
Single Layer Feedforward NN
Multi Layer Feedforward NN
Delta Learning
Back Propagation Algorithm
Delta Learning

 Also called the Least Mean Square (LMS) or Widrow-Hoff rule
 Similar to perceptron learning
 Delta learning uses a gradient descent approach, hence it can continue indefinitely, whereas the perceptron stops after a limited number of iterations
 The major aim is to minimize the error over all training patterns
 Works with a continuous activation function

 ΔW = c·(d − y)·f′(y_in)·X, where d is the desired output, y = f(y_in) is the actual output, and c is the learning rate
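The update rule above can be sketched as a single training step in Python; the bipolar sigmoid is one common choice of continuous activation (an assumption here, matching the worked examples later in the chapter):

```python
import math

def f(x):
    """Bipolar sigmoid, a continuous activation suitable for the delta rule."""
    return (1 - math.exp(-x)) / (1 + math.exp(-x))

def f_prime(x):
    """Derivative of the bipolar sigmoid: 0.5 * (1 - f(x)^2)."""
    return 0.5 * (1 - f(x) ** 2)

def delta_rule_step(w, b, x, d, c=0.1):
    """One delta-rule (LMS / Widrow-Hoff) update: dW = c*(d - y)*f'(y_in)*X."""
    y_in = b + sum(wi * xi for wi, xi in zip(w, x))
    y = f(y_in)
    delta = c * (d - y) * f_prime(y_in)
    w = [wi + delta * xi for wi, xi in zip(w, x)]
    b = b + delta               # bias treated as a weight on a constant input 1
    return w, b, y
```

Unlike the perceptron rule, the weights move on every step (by an amount proportional to the error), not only on misclassifications.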


Delta Learning - Problem 1

[Worked example: the delta rule is applied in turn to training patterns X1, X2, and X3; the step-by-step calculations appeared as figures on the original slides.]
Supervised Learning

Perceptron Learning
Single Layer Feedforward NN
Multi Layer Feedforward NN
Delta Learning
Back Propagation Algorithm
Apply the backpropagation algorithm to find the final weights for the following network.
Input: x = [0.0, 1.0]
Weights between hidden and output layer: w = [0.4, 0.2]
Bias on the output node O: W0 = −0.4
Weights between input and hidden layer: v = [2, 1; 1, 2]
Biases on the hidden unit nodes: V0 = [0.1, 0.3]
Desired output: d = 1.0

Worked solution (bipolar sigmoid activation f(x) = (1 − e^(−x))/(1 + e^(−x)), f′(x) = ½(1 − f(x)²), learning rate α = 1):

Forward pass:
Z1_in = V01 + V11·x1 + V21·x2 = 0.1 + 2(0.0) + 1(1.0) = 1.1, so Z1 = f(1.1) = 0.5005
Z2_in = V02 + V12·x1 + V22·x2 = 0.3 + 1(0.0) + 2(1.0) = 2.3, so Z2 = f(2.3) = 0.8178
Y_in = W0 + W1·Z1 + W2·Z2 = −0.4 + 0.4(0.5005) + 0.2(0.8178) = −0.0362, so Y = f(−0.0362) = −0.0181

Backward pass (D = 1):
δk = (D − Y)·f′(Y_in) = (1 − (−0.0181)) · ½(1 − 0.0181²) = 0.5089
δZ1 = δk·W1·f′(Z1_in) = 0.5089 · 0.4 · ½(1 − 0.5005²) = 0.0763
δZ2 = δk·W2·f′(Z2_in) = 0.5089 · 0.2 · ½(1 − 0.8178²) = 0.0169

Output-layer updates, Wj(new) = Wj(old) + α·δk·Zj, assume α = 1:
W1 = 0.4 + 1 · 0.5089 · 0.5005 = 0.6547
W2 = 0.2 + 1 · 0.5089 · 0.8178 = 0.6161
W0 = −0.4 + 1 · 0.5089 = 0.1089

Hidden-layer updates, Vij(new) = Vij(old) + α·δZj·Xi, assume α = 1:
V11 = 2 + 1 · 0.0763 · 0 = 2
V12 = 1 + 1 · 0.0169 · 0 = 1
V21 = 1 + 1 · 0.0763 · 1 = 1.0763
V22 = 2 + 1 · 0.0169 · 1 = 2.0169
V01 = 0.1 + 1 · 0.0763 = 0.1763
V02 = 0.3 + 1 · 0.0169 = 0.3169

Cycle error = ½(D − Y)² = ½(1 − (−0.0181))² = 0.5183

Final weights: redraw the network diagram with the updated weights above.
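The arithmetic of this worked example can be checked with a short script (a sketch assuming the bipolar sigmoid activation used throughout the example):

```python
import math

def f(x):  # bipolar sigmoid activation
    return (1 - math.exp(-x)) / (1 + math.exp(-x))

def f_prime(x):  # derivative: 0.5 * (1 - f(x)^2)
    return 0.5 * (1 - f(x) ** 2)

# network parameters from the problem statement
x = [0.0, 1.0]
v = [[2.0, 1.0], [1.0, 2.0]]   # v[i][j]: input i -> hidden j
v0 = [0.1, 0.3]                # hidden biases
w = [0.4, 0.2]                 # hidden -> output weights
w0 = -0.4                      # output bias
d, alpha = 1.0, 1.0

# forward pass
z_in = [v0[j] + sum(x[i] * v[i][j] for i in range(2)) for j in range(2)]
z = [f(zj) for zj in z_in]
y_in = w0 + sum(w[j] * z[j] for j in range(2))
y = f(y_in)

# backward pass
delta_k = (d - y) * f_prime(y_in)
delta_z = [delta_k * w[j] * f_prime(z_in[j]) for j in range(2)]

# weight updates
w_new = [w[j] + alpha * delta_k * z[j] for j in range(2)]
w0_new = w0 + alpha * delta_k
v_new = [[v[i][j] + alpha * delta_z[j] * x[i] for j in range(2)] for i in range(2)]
v0_new = [v0[j] + alpha * delta_z[j] for j in range(2)]
error = 0.5 * (d - y) ** 2      # cycle error
```

Running it reproduces the slide's intermediate values (Z1 = 0.5005, Z2 = 0.8178, Y = −0.0181, δk = 0.5089, δZ1 = 0.0763, δZ2 = 0.0169).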
Hypothesis space search
Contour plot
How to interpret a contour map
Saddle Point
Problem with stochastic gradient descent

Contour map: gradient descent vs. stochastic gradient descent
• Gradient descent: the path to the minimum is smooth, but each update is computationally costly (it uses the whole dataset)
• Stochastic gradient descent: each update is computationally easy (one sample at a time), but the path oscillates on the contour map
Optimization for Training Deep Models:
Exponential moving average
Temp1 = 45°C
Temp2 = 44°C
Temp3 = 47°C
:
Temp365 = 35°C


Optimization for Training Deep Models:
 Exponential moving average
β = 0.9: β²⁰ = 0.9²⁰ ≈ 0.12, so roughly only the previous 20 points still contribute noticeably
β = 0.5: β³ = 0.5³ = 0.125, so roughly only the previous 3 points still contribute noticeably
If the β value is large, more previous data points are taken into account.
A rule of thumb for how many points are considered:
#points ≈ 1/(1 − β)
β = 0.9 → #points = 10
β = 0.5 → #points = 2
β = 0.98 → #points = 50
[Figure: exponentially weighted average curves for different β values, e.g. β = 0.5.]
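The averaging described above can be sketched in a few lines; the temperature list is illustrative (shortened from the 365-day example):

```python
def ema(values, beta):
    """Exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0
    out = []
    for xt in values:
        v = beta * v + (1 - beta) * xt
        out.append(v)
    return out

temps = [45, 44, 47, 46, 45, 43, 44]   # daily temperatures in deg C (illustrative)
smooth = ema(temps, beta=0.9)

# effective window: roughly 1 / (1 - beta) previous points, i.e. 10 for beta = 0.9
window = 1 / (1 - 0.9)
```

Note that with v initialized to zero the early averages are biased low; Adam (later in this chapter) corrects exactly this bias.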
Optimizer Technique
SGD with Momentum

update_t = γ·update_{t−1} + η·∇w_t
w_{t+1} = w_t − update_t

In the momentum-based gradient descent update rule, 'w' and 'b' are updated based not just on the current update (derivative) but also on the past updates (derivatives).

SGD with Momentum
Advantages - Disadvantages
Momentum-based GD is faster than GD, but it oscillates in and out of the minima valley. Can this oscillation be reduced? Of course: Nesterov Accelerated GD helps us to reduce the oscillations.
Nesterov Accelerated GD:
Why not move by the history component first, and then calculate the derivative at that look-ahead point and make the update?
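The two update rules can be compared on a toy one-dimensional loss L(w) = (w − 3)², an illustrative objective not taken from the slides:

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)

def momentum_gd(w, eta=0.1, gamma=0.9, steps=200):
    """Momentum GD: update_t = gamma*update_{t-1} + eta*grad(w); w -= update_t."""
    update = 0.0
    for _ in range(steps):
        update = gamma * update + eta * grad(w)
        w = w - update
    return w

def nesterov_gd(w, eta=0.1, gamma=0.9, steps=200):
    """NAG: move by the history component first, then evaluate the gradient
    at the look-ahead point and make the update."""
    update = 0.0
    for _ in range(steps):
        w_look = w - gamma * update              # look-ahead step
        update = gamma * update + eta * grad(w_look)
        w = w - update
    return w
```

Both reach the minimum at w = 3; Nesterov's look-ahead gradient damps the in-and-out oscillation around the valley.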
Ada Grad and Ada Delta Optimizers
Adaptive Gradient

• Take a different learning rate for each weight/dimension
• The learning rate decreases based on the previous updates

Sparse vs. dense features:
• For a sparse feature (few updates so far), the accumulated history is small, so the learning rate stays high → take large steps
• For a dense feature (many updates so far), the accumulated history is large, so the learning rate becomes small → take small steps
Adagrad
 If some inputs are dense and some are sparse
 Algorithm
 Take different learning rates for different weights
 The learning rate is based on the previous updates

Step:        1    2    3    4    5
w1 (dense):  w1   w1   w1   w1   w1   (updated every step)
w2 (sparse): w2   0    0    0    0    (updated only at step 1)
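The Adagrad idea can be sketched as follows; the gradient sequence below mimics the table above, with w2 receiving a gradient only at the first step:

```python
import math

def adagrad_step(w, g, G, eta=0.1, eps=1e-8):
    """Adagrad: per-weight learning rate eta / sqrt(G + eps), where G
    accumulates the squared gradients seen so far for each weight."""
    G = [Gi + gi ** 2 for Gi, gi in zip(G, g)]
    w = [wi - eta / math.sqrt(Gi + eps) * gi for wi, Gi, gi in zip(w, G, g)]
    return w, G

# w1 is dense (nonzero gradient every step), w2 is sparse (one update only)
w, G = [0.0, 0.0], [0.0, 0.0]
grads = [[1.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
for g in grads:
    w, G = adagrad_step(w, g, G)
```

After the loop, the dense weight's accumulator G[0] is large, so its effective learning rate has shrunk, while the sparse weight's accumulator stays small and its steps remain large.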
RMSProp — Root Mean Square Propagation
 AdaGrad decays the learning rate very aggressively (as the denominator grows). As a result, after a while, the frequent parameters start receiving very small updates because of the decayed learning rate. To avoid this, why not decay the denominator itself and prevent its rapid growth?
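A sketch of that fix: replace Adagrad's ever-growing sum with an exponential moving average of squared gradients, so the denominator stays bounded:

```python
import math

def rmsprop_step(w, g, v, eta=0.01, beta=0.9, eps=1e-8):
    """RMSProp: decay the accumulated squared gradient (unlike Adagrad's
    unbounded sum) so the effective learning rate does not vanish."""
    v = [beta * vi + (1 - beta) * gi ** 2 for vi, gi in zip(v, g)]
    w = [wi - eta / (math.sqrt(vi) + eps) * gi for wi, vi, gi in zip(w, v, g)]
    return w, v
```

With a constant gradient, v settles near the squared gradient instead of growing forever, so updates keep a roughly constant size.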
Adam Optimizer
ADAM combines the advantages of the momentum-based GD update rule and RMSProp: it uses a cumulative history of gradients (first moment) and another history of squared gradients (second moment) to adjust the learning rate. Added to this, ADAM also performs bias correction on both moving averages.
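A minimal sketch of one Adam step, using the default hyperparameters quoted below (β1 = 0.9, β2 = 0.999, ε = 1e−8):

```python
import math

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment m, RMSProp-style second moment v,
    plus bias correction (dividing by 1 - beta^t) for the zero-initialized
    moving averages. t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi ** 2 for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]       # bias-corrected moments
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - eta * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

On the very first step the bias correction exactly cancels the (1 − β) factors, so a unit gradient produces a step of size ≈ η, rather than the tiny step the uncorrected averages would give.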
Which Optimizer to use

 Adam seems to be more or less the default choice now (β1 = 0.9, β2 = 0.999 and ε = 1e−8).
 Although it is supposed to be robust to initial learning rates, we have observed that for sequence generation problems η = 0.001, 0.0001 works best.
 Having said that, many papers report that SGD with momentum (Nesterov or classical) with a simple annealing learning rate schedule also works well in practice (typically starting with η = 0.001, 0.0001 for sequence generation problems).
 Adam might just be the best choice overall.
 Some recent work suggests that there is a problem with Adam and that it will not converge in some cases.
• Gradient Descent
• Stochastic Gradient Descent
• Mini-Batch Gradient Descent
• Momentum-based Gradient Descent
• Nesterov Accelerated Gradient Descent
• Adagrad
• Generalization, overfitting and stopping criteria
• Regularization using Lp Norm
• Regularization for Deep Learning: Parameter Norm Penalties
• Dataset Augmentation, Noise Robustness, Early Stopping, Sparse Representation, Dropout
Parameter sharing and tying

 Parameter sharing: parameters are shared across parts of the model
 If parameters are shared, the number of parameters is reduced
 Hence the complexity of the system reduces
Adding noise to the inputs

 Adding (Gaussian) noise in the input layer
 Same effect as adding L2 regularization
 Same effect as data augmentation
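A minimal sketch of this augmentation, assuming a fixed noise level sigma (the function name and default seed are illustrative):

```python
import random

def add_gaussian_noise(x, sigma=0.1, rng=None):
    """Return a noisy copy of an input vector: x_i + N(0, sigma).
    Training on such noisy copies acts like data augmentation and,
    for small sigma, like an L2 (weight-decay) penalty."""
    rng = rng or random.Random(0)   # seeded for reproducibility
    return [xi + rng.gauss(0.0, sigma) for xi in x]

x = [1.0, 2.0, 3.0]
noisy = add_gaussian_noise(x)
```

Each training epoch would typically draw fresh noise, so the network never sees exactly the same input twice.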
Adding Noise to the outputs
Ensemble method
Dropout
Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies.
Thank you !

Supervised Learning - Perceptron Learning (worked example)

[Worked example: the perceptron rule is applied pattern by pattern. For the first pattern X1 the network output is Y = 1 while the desired output is D = −1, so the weights and bias are updated; patterns for which Y = D leave the weights unchanged. One pass through all input patterns is one cycle/EPOCH. The step-by-step calculations, including one slide marked "This is Wrongly done", appeared as figures on the original slides.]

 This is one Cycle/EPOCH
Problem on Perceptron-2

[Worked example: after completing the first input pattern, pattern X2 is presented and the weights are updated (Step 3: weight updation); this completes one EPOCH. A second EPOCH then re-presents the first and second input patterns. The calculations appeared as figures on the original slides.]
Perceptron Problem-3

[Worked example. EPOCH 1, input 1: Y = 1 but D = −1, so the weights and bias are updated, e.g. b(new) = b + Δb = 0.5 + (−2) = −1.5. EPOCH 1, input 2: b2 + Δb = −1.5 + 2 = 0.5. The remaining calculations appeared as figures on the original slides.]
Perceptron Learning Network - Problem-4

W1 = 0.5, W2 = 0.8, b = 0
Dec 18 SC
Recap
