ML-II
CH:02
Supervised Learning algorithms
Supervised Learning
Perceptron Learning
Single Layer Feedforward NN
Multi Layer Feedforward NN
Delta Learning
Back Propagation Algorithm
Supervised Learning
Supervised learning, as the name indicates, involves the presence of a supervisor acting as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well labeled,
which means some data is already tagged with the correct answer.
After that, the machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (set of training examples) and produces a correct outcome from
labeled data.
Perceptron Rule
Algorithm
1. Initialize all weights and biases to random values (for simplicity, zero) and the learning rate to 1.
2. Present the first input pattern (data point).
3. Calculate the net input.
4. Apply the activation function (bipolar step).
5. Update the weights and bias.
6. Repeat the procedure for all data points until there is no weight change.
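The steps above can be sketched as a small training loop. This is a minimal sketch, assuming zero initialization and the update Δw = c(d − y)x on misclassification (consistent with the worked examples later in the deck); the function name is illustrative.

```python
def train_perceptron(patterns, targets, lr=1.0):
    """Perceptron rule with a bipolar step activation."""
    w = [0.0] * len(patterns[0])                # step 1: weights start at zero
    b = 0.0
    changed = True
    while changed:                              # step 6: repeat until no change
        changed = False
        for x, d in zip(patterns, targets):     # step 2: present each pattern
            net = sum(wi * xi for wi, xi in zip(w, x)) + b   # step 3: net input
            y = 1 if net >= 0 else -1           # step 4: bipolar step activation
            if y != d:                          # step 5: update on error only
                w = [wi + lr * (d - y) * xi for wi, xi in zip(w, x)]
                b += lr * (d - y)
                changed = True
    return w, b
```

For linearly separable data the loop terminates (the convergence theorem below), at which point every pattern is classified correctly.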
Perceptron Convergence Theorem
If the training patterns are linearly separable, the perceptron learning rule is guaranteed to converge to a separating weight vector in a finite number of updates.
Multiple Neuron Perceptron
[Figure: a single-layer perceptron with several output neurons Z1 … Zj; each input i connects to each output j with weight Wij, e.g. W11, W1i, W21, W2j.]
Delta Learning
Also called the Least Mean Square (LMS) or Widrow-Hoff rule
Similar to Perceptron Learning
Delta uses a gradient descent approach, hence it can continue forever,
whereas the Perceptron stops after a limited number of iterations
The major aim is to minimize the error over all training patterns
Works on a continuous activation function
ΔW = c (d − y) f′(y_in) X
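One delta-rule update can be sketched as follows. This is a minimal sketch assuming the bipolar sigmoid f(x) = (1 − e⁻ˣ)/(1 + e⁻ˣ) as the continuous activation (the same activation used in the backpropagation example later); the function name is illustrative.

```python
import math

def delta_rule_step(w, x, d, c=0.1):
    """One delta-rule update: dW = c * (d - y) * f'(y_in) * x."""
    y_in = sum(wi * xi for wi, xi in zip(w, x))
    y = (1 - math.exp(-y_in)) / (1 + math.exp(-y_in))  # bipolar sigmoid
    f_prime = 0.5 * (1 - y * y)                        # f'(y_in)
    return [wi + c * (d - y) * f_prime * xi for wi, xi in zip(w, x)]
```

Unlike the perceptron rule, this update is proportional to the continuous error (d − y), so it keeps refining the weights even when the pattern is already classified on the correct side.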
Delta Learning – Problem 1
[Worked example: the delta rule is applied successively to training patterns X = X1, X = X2 and X = X3; the numerical steps were lost in extraction.]
Back Propagation Algorithm – Worked Example
Apply the backpropagation algorithm to find the final weights for the following network.
Input: x = [0.0, 1.0]
Weights between hidden and output layers: w = [0.4, 0.2]; bias on the output node O: w0 = −0.4
Weights between input and hidden layers: v = [2, 1; 1, 2]; biases on the hidden unit nodes: v0 = [0.1, 0.3]
Desired output: d = 1.0
Activation: bipolar sigmoid f(x) = (1 − e⁻ˣ)/(1 + e⁻ˣ), with f′(x) = ½(1 − f(x)²)

Forward pass:
z_in1 = 2(0.0) + 1(1.0) + 0.1 = 1.1 → Z1 = f(1.1) = 0.5005
z_in2 = 1(0.0) + 2(1.0) + 0.3 = 2.3 → Z2 = f(2.3) = 0.8178
y_in = 0.4(0.5005) + 0.2(0.8178) − 0.4 = −0.0362 → Y = f(−0.0362) = −0.0181

Backward pass, output layer (assume α = 1):
δk = (d − y) f′(y_in) = (1 − (−0.0181)) × ½(1 − 0.0181²) = 0.5089
w1(new) = w1 + α δk Z1 = 0.4 + 1 × 0.5089 × 0.5005 = 0.6547
w2(new) = w2 + α δk Z2 = 0.2 + 1 × 0.5089 × 0.8178 = 0.6162
w0(new) = w0 + α δk = −0.4 + 0.5089 = 0.1089

Backward pass, hidden layer:
δz1 = δk w1 f′(z_in1) = 0.5089 × 0.4 × ½(1 − 0.5005²) = 0.0763
δz2 = δk w2 f′(z_in2) = 0.5089 × 0.2 × ½(1 − 0.8178²) = 0.0169

Vij(new) = Vij(old) + α δzj Xi, assume α = 1:
V11 = 2 + 1 × 0.0763 × 0 = 2
V12 = 1 + 1 × 0.0169 × 0 = 1
V21 = 1 + 1 × 0.0763 × 1 = 1.0763
V22 = 2 + 1 × 0.0169 × 1 = 2.0169
V0j(new) = V0j(old) + α δzj:
V01 = 0.1 + 1 × 0.0763 = 0.1763
V02 = 0.3 + 1 × 0.0169 = 0.3169

Cycle error = ½(d − y)² = ½(1 − (−0.0181))² = 0.5183
Final weights: draw the updated network diagram.
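The forward and backward passes of this example can be checked with a short script. This is a sketch assuming the bipolar sigmoid activation, which reproduces the Z, Y and δ values shown in the worked example.

```python
import math

def f(a):
    # Bipolar sigmoid activation
    return (1 - math.exp(-a)) / (1 + math.exp(-a))

def f_prime(out):
    # Derivative expressed through the output: f'(a) = 0.5 * (1 - f(a)^2)
    return 0.5 * (1 - out * out)

x = [0.0, 1.0]
v = [[2.0, 1.0], [1.0, 2.0]]   # v[i][j]: input i -> hidden j
v0 = [0.1, 0.3]
w = [0.4, 0.2]                 # hidden -> output
w0 = -0.4
d, alpha = 1.0, 1.0

# Forward pass
z = [f(v[0][j] * x[0] + v[1][j] * x[1] + v0[j]) for j in range(2)]
y = f(w[0] * z[0] + w[1] * z[1] + w0)

# Backward pass
delta_k = (d - y) * f_prime(y)
delta_z = [delta_k * w[j] * f_prime(z[j]) for j in range(2)]

# Weight updates
w_new = [w[j] + alpha * delta_k * z[j] for j in range(2)]
w0_new = w0 + alpha * delta_k
v_new = [[v[i][j] + alpha * delta_z[j] * x[i] for j in range(2)] for i in range(2)]
v0_new = [v0[j] + alpha * delta_z[j] for j in range(2)]
cycle_error = 0.5 * (d - y) ** 2
```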
Hypothesis space search
Contour plot
How to interpret contour map
Saddle Point
Problem with stochastic Gradient Descent
Contour map: gradient descent vs. stochastic gradient descent
• Gradient descent: convergence is fast, but each update is computationally costly.
• Stochastic gradient descent: convergence is fast, and each update is computationally easy.
Optimization for Training Deep Models:
Exponential moving average
Temp1 = 45 °C
Temp2 = 44 °C
Temp3 = 47 °C
:
Temp365 = 35 °C
Optimization for Training Deep Models:
Exponential moving average
β = 0.9
β²⁰ ≈ 0.12 → only the previous ~20 points are considered
β = 0.5
β³ = 0.125 → only the previous ~3 points are considered
If the β value is large, more previous data points are taken into account.
How many points are considered is based on
#points ≈ 1/(1 − β)
β = 0.9 → #points = 10
β = 0.5 → #points = 2
β = 0.98 → #points = 50
[Figure: exponentially weighted average curves for different β values, e.g. β = 0.5.]
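The recurrence behind these curves, vₜ = β·vₜ₋₁ + (1 − β)·xₜ, can be sketched as a minimal helper (the function name is illustrative):

```python
def exponential_moving_average(values, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_(t-1) + (1 - beta) * x_t."""
    v = 0.0
    out = []
    for xt in values:
        v = beta * v + (1 - beta) * xt   # older points decay geometrically
        out.append(v)
    return out
```

With β = 0.5 the average tracks the data within a few points, matching the "#points ≈ 1/(1 − β)" rule of thumb above.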
Optimizer Technique
SGD with Momentum
SGD with Momentum
In the momentum-based gradient descent update rule, ‘w’ and ‘b’ are updated not just based on the current update (derivative), but also on the past updates (derivatives).
SGD with Momentum
Advantages – Disadvantages
Momentum-based GD is faster than GD, but it oscillates in and out of the minima valley. Can this oscillation be reduced? Of course: Nesterov Accelerated GD helps us reduce the oscillations.
Nesterov Accelerated GD:
Why not move by the history component first, and then calculate the derivative at that look-ahead point and update later?
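This look-ahead idea can be sketched as follows; `grad_fn`, `gamma` and `eta` are illustrative names, with `grad_fn` a callable returning the derivative at a point.

```python
def nesterov_step(w, velocity, grad_fn, gamma=0.9, eta=0.1):
    """Nesterov accelerated gradient: move by the history component first,
    then evaluate the derivative at the look-ahead point and update."""
    lookahead = w - gamma * velocity              # move by history first
    velocity = gamma * velocity + eta * grad_fn(lookahead)
    return w - velocity, velocity
```

Because the gradient is evaluated after the history move, the update "looks ahead" and corrects course before overshooting, which damps the oscillations of plain momentum.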
Ada Grad and Ada Delta Optimizers
Adaptive Gradient
• Take a different learning rate for each weight/dimension.
• The learning rate decreases based on previous updates.
• Sparse dimension: few accumulated updates, so the learning rate stays high → take a large step.
• Dense dimension: many accumulated updates, so the learning rate becomes small → take a small step.
[Figure: contour plot over W1 and W2 illustrating the different step sizes along sparse and dense dimensions.]
Adagrad
If some inputs are dense and some are sparse:
Algorithm
• Take different learning rates for different weights.
• The learning rate is based on previous updates.

Update step:           1    2    3    4    5
w1 (dense feature):    w1   w1   w1   w1   w1   → updated at every step
w2 (sparse feature):   w2   0    0    0    0    → updated only rarely
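A per-parameter Adagrad step can be sketched as follows; the state layout (a running sum of squared gradients per weight) and the names `eta`/`eps` are illustrative.

```python
import math

def adagrad_step(w, grad, accum, eta=0.1, eps=1e-8):
    """Adagrad: per-parameter learning rate. The accumulated squared
    gradients grow for dense (frequently updated) weights, shrinking
    their effective step; sparse weights keep a large step."""
    accum = [a + g * g for a, g in zip(accum, grad)]
    w = [wi - eta * g / (math.sqrt(a) + eps)
         for wi, g, a in zip(w, grad, accum)]
    return w, accum
```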
RMSProp — Root Mean Square Propagation
AdaGrad decays the learning rate very aggressively (as the denominator grows). As a result, after a while, the frequent parameters will start receiving very small updates because of the decayed learning rate. To avoid this, why not decay the denominator and prevent its rapid growth?
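The fix described above, replacing AdaGrad's growing sum with a decaying average, can be sketched as follows (names and defaults are illustrative):

```python
import math

def rmsprop_step(w, grad, avg_sq, eta=0.01, beta=0.9, eps=1e-8):
    """RMSProp: an exponentially decaying average of squared gradients
    replaces AdaGrad's raw sum, so the denominator cannot grow without
    bound and frequent parameters keep receiving useful updates."""
    avg_sq = [beta * a + (1 - beta) * g * g for a, g in zip(avg_sq, grad)]
    w = [wi - eta * g / (math.sqrt(a) + eps)
         for wi, g, a in zip(w, grad, avg_sq)]
    return w, avg_sq
```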
Adam Optimizer
ADAM combines the advantages of the momentum-based GD update rule and RMSProp. ADAM uses a cumulative history of gradients and another history to adjust the learning rate. Added to this, ADAM also does bias correction.
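One Adam step combining both histories can be sketched as follows; `m` and `v` are the gradient and squared-gradient histories, `t` the step count, and the defaults match the values quoted on the next slide.

```python
import math

def adam_step(w, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-style gradient history (m) + RMSProp-style
    squared-gradient history (v), with bias correction for the
    zero-initialized averages."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]   # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    w = [wi - eta * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v
```

Without the bias correction, the zero-initialized m and v would make the first steps far too small; dividing by (1 − βᵗ) compensates exactly for that start-up bias.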
Which Optimizer to use
Adam seems to be more or less the default choice now
(β1 = 0.9, β2 = 0.999 and ϵ = 1e − 8 ).
Although it is supposed to be robust to initial learning
rates, we have observed that for sequence generation
problems η = 0.001 or 0.0001 works best.
Having said that, many papers report that SGD with
momentum (Nesterov or classical) with a simple
annealing learning rate schedule also works well in
practice (typically starting with η = 0.001 or 0.0001 for
sequence generation problems).
Adam might just be the best choice overall.
Some recent work suggests that there is a problem with
Adam and it will not converge in some cases.
Gradient Descent → Stochastic Gradient Descent → Mini-Batch Gradient Descent → Momentum-based Gradient Descent → Nesterov Accelerated Gradient Descent → Adagrad
• Generalization, overfitting and stopping criteria
• Regularization using Lp norm
• Regularization for Deep Learning: Parameter Norm Penalties
• Dataset Augmentation, Noise Robustness, Early Stopping, Sparse Representation, Dropout
Parameter sharing and tying
Parameter sharing: parameters are shared between parts of the model.
If parameters are shared, the number of parameters is reduced;
hence the complexity of the system reduces.
Adding noise to the inputs
Adding (Gaussian) noise in the input layer is:
• the same as adding L2 regularization
• the same as data augmentation
Adding Noise to the outputs
Ensemble method
Dropout
Challenges in Neural Network Optimization, Basic Algorithms, Parameter Initialization Strategies.
Thank you !
Supervised Learning – Perceptron Learning (worked example)
Initial weights: w(1) = [1, −1, 0, 0.5]; learning rate c = 0.1; bipolar step activation.
Training patterns (augmented with the bias input):
x1 = [1, −2, 0, −1]ᵀ, d1 = −1
x2 = [0, 1.5, −0.5, −1]ᵀ, d2 = −1
x3 = [−1, 1, 0.5, −1]ᵀ, d3 = 1

Step 1, let us consider X = X1:
net = w(1)·x1 = 1(1) + (−1)(−2) + 0(0) + 0.5(−1) = 2.5 → Y = 1, but D1 = −1.
Update: w(2) = w(1) + c(d1 − y)x1 = w(1) − 0.2 x1 = [0.8, −0.6, 0, 0.7]

Step 2, consider X = X2:
net = w(2)·x2 = 0.8(0) + (−0.6)(1.5) + 0(−0.5) + 0.7(−1) = −1.6 → Y = −1 = D2, so no update: w(3) = w(2).

Step 3, consider X = X3:
net = w(3)·x3 = 0.8(−1) + (−0.6)(1) + 0(0.5) + 0.7(−1) = −2.1 → Y = −1, but D3 = 1.
Update: w(4) = w(3) + c(d3 − y)x3 = w(3) + 0.2 x3 = [0.6, −0.4, 0.1, 0.5]
(The slide's first attempt subtracted the update, giving [1, −0.8, −0.1, 0.9], and was marked "This is wrongly done"; the correct rule adds c(d − y)x.)

This is one Cycle/EPOCH.
Problem on Perceptron-2
[Worked example: one input pattern is completed, then X = X2 is presented; step 3 is the weight update. Completing all patterns is one EPOCH; a second EPOCH repeats the procedure for the first and second inputs. The numerical details were lost in extraction.]
Perceptron Problem-3
EPOCH 1 – Input 1: Y = 1 but D = −1, so Δb = (D − Y) = −2 and b(new) = 0.5 + (−2) = −1.5 (the weights, e.g. W2, are updated similarly).
EPOCH 1 – Input 2: the net input evaluates to −3.5, so Y = −1; since D = 1, Δb = (D − Y) = +2 and b(new) = b + Δb = −1.5 + 2 = 0.5.
[Remaining numerical details were lost in extraction.]
Perceptron Learning Network – Problem-4
W1 = 0.5, W2 = 0.8, b = 0
Dec 18 SC
Recap