Week 4
Topic
Lecture 16: Optimization
Concepts Covered:
Optimization
Batch Optimization
Mini-Batch Optimization
Optimizing Loss Function
Hinge (SVM) loss with L2 regularization:

$L = \frac{1}{N}\sum_i \sum_{j \neq y_i} \max\big(0,\ W_j^t X_i - W_{y_i}^t X_i\big) + \lambda \sum_k \sum_l W_{kl}^2$

Gradients with respect to the class weight vectors:

$\nabla_{W_{y_i}} L = -\frac{1}{N}\sum_i \Big[\sum_{j \neq y_i} \mathbb{1}\big(W_j^t X_i - W_{y_i}^t X_i > 0\big)\Big] X_i$

$\nabla_{W_j} L = \frac{1}{N}\sum_i \Big[\mathbb{1}\big(W_j^t X_i - W_{y_i}^t X_i > 0\big)\Big] X_i$
Source - http://cs231n.github.io
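As an illustration (not from the lecture), here is a minimal NumPy sketch of the loss and gradient above; it assumes X holds one sample per row, W one class-weight row per class, and y the integer labels. The function name and the regularization strength lam are illustrative choices.

```python
import numpy as np

def svm_loss_and_grad(W, X, y, lam=1e-3):
    """Multiclass hinge loss (zero margin, as on the slide) and its gradient.

    W : (C, D) weights, one row per class; X : (N, D) samples; y : (N,) labels.
    """
    N = X.shape[0]
    scores = X @ W.T                              # scores[i, j] = W_j^t X_i
    correct = scores[np.arange(N), y][:, None]    # W_{y_i}^t X_i
    margins = np.maximum(0.0, scores - correct)   # hinge terms
    margins[np.arange(N), y] = 0.0                # exclude j == y_i
    loss = margins.sum() / N + lam * np.sum(W * W)

    # Indicator 1(W_j^t X_i - W_{y_i}^t X_i > 0) for each i, j
    active = (margins > 0).astype(float)
    active[np.arange(N), y] = -active.sum(axis=1)   # minus the count for the true class
    grad = active.T @ X / N + 2 * lam * W           # dL/dW, shape (C, D)
    return loss, grad
```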
Optimizing Loss Function
$\nabla_{W_{y_i}} L = -\frac{1}{N}\sum_i \Big[\sum_{j \neq y_i} \mathbb{1}\big(W_j^t X_i - W_{y_i}^t X_i > 0\big)\Big] X_i$

$\nabla_{W_j} L = \frac{1}{N}\sum_i \Big[\mathbb{1}\big(W_j^t X_i - W_{y_i}^t X_i > 0\big)\Big] X_i$
Gradient descent
$W_{y_i}(k+1) = (1 - \eta\lambda)\,W_{y_i}(k) + \frac{\eta}{N}\sum_i \Big[\sum_{j \neq y_i} \mathbb{1}\big(W_j^t X_i - W_{y_i}^t X_i > 0\big)\Big] X_i$

$W_j(k+1) = (1 - \eta\lambda)\,W_j(k) - \frac{\eta}{N}\sum_i \Big[\mathbb{1}\big(W_j^t X_i - W_{y_i}^t X_i > 0\big)\Big] X_i$

where $\eta$ is the learning rate and $\lambda$ the regularization weight.
Source - http://cs231n.github.io
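A sketch of the update rule above as an iterative loop; it assumes some loss_and_grad callable returning (L, dL/dW), such as the svm_loss_and_grad sketch earlier, and the step size eta (the learning rate) and iteration count are illustrative.

```python
def gradient_descent(W, X, y, loss_and_grad, eta=1e-2, steps=200):
    """Full-batch gradient descent: W(k+1) = W(k) - eta * dL/dW."""
    for _ in range(steps):
        _, grad = loss_and_grad(W, X, y)   # gradient over the whole training set
        W = W - eta * grad                 # step against the gradient
    return W
```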
Local and Global Minima
Stochastic / Batch / Mini-batch Optimization
Stochastic Gradient Descent
Upsides
The frequent updates immediately give an insight into the
performance of the model and the rate of improvement.
The noisy update process can allow the model to escape local minima (i.e., avoid premature convergence).
Stochastic Gradient Descent
Downsides
Updating the model so frequently is more computationally
expensive than other configurations of gradient descent, taking
significantly longer to train models on large datasets.
The noisy progress down the error gradient can also make it hard for the
algorithm to settle on an error minimum for the model.
Batch Gradient Descent
Upsides
Fewer updates to the model means this variant of gradient
descent is more computationally efficient than stochastic
gradient descent.
The batching allows both the efficiency of not having to hold all training
data in memory and the efficiency of algorithm implementations.
Mini-Batch Gradient Descent
Downsides
Mini-batch gradient descent requires the configuration of an additional
“mini-batch size” hyperparameter for the learning algorithm.
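The three variants above differ only in how many samples feed each parameter update. A minimal sketch, again assuming a loss_and_grad callable as in the earlier sketches: batch_size = 1 gives stochastic gradient descent, batch_size = N gives batch gradient descent, and anything in between gives mini-batch gradient descent (all hyperparameter values are illustrative).

```python
import numpy as np

def minibatch_sgd(W, X, y, loss_and_grad, batch_size=32, eta=1e-2, epochs=10):
    """Mini-batch gradient descent; batch_size sets the stochastic/batch trade-off."""
    N = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(N)             # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            _, grad = loss_and_grad(W, X[idx], y[idx])
            W = W - eta * grad                       # update from this batch only
    return W
```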
Topic
Lecture 17: Optimization in ML
Concepts Covered:
Optimization
Stochastic Gradient Descent
Batch Optimization
Mini-batch optimization
Optimization in ML
Linear and Logistic Regression
Softmax classifier
Nonlinearity
Optimization in Machine Learning
Goal of optimization is to reduce a cost function J(W) to optimize
some performance measure P.
In pure optimization minimizing J is the goal in and of itself.
In machine learning, J(W) is minimized w.r.t. the parameters W on
training data (the training error), and we want the error to be low on
unseen (test) data as well.
Test error (generalization error) should be low.
Optimization in Machine Learning
Assumptions
Test and training data are generated by a probability distribution: the data-generating process.
Data samples in each data set are independent.
Training set and Test set are identically distributed.
Logistic Regression
$p(y \mid X; W) = \sigma(W^t X)$
Linear Regression
[Figure: data in the (X1, X2) plane]
Logistic Regression
$\sigma(W^t X) = \frac{1}{1 + e^{-W^t X}}$
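A small NumPy sketch of the logistic model above (the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict_proba(W, X):
    """p(y = 1 | X; W) = sigma(W^t X) for each row of X."""
    return sigmoid(X @ W)
```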
Softmax Classifier
Generalization of Binary Logistic Classifier to
Multiple Classes
$s_{y_i} = f(X_i, W)_{y_i} = (W X_i)_{y_i} = W_{y_i}^t X_i$
Softmax Classifier
$p(y_i \mid X_i; W) = \dfrac{e^{s_{y_i}}}{\sum_j e^{s_j}}$
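A sketch of the softmax probabilities above; the max subtraction is a standard numerical-stability detail, not something shown on the slide.

```python
import numpy as np

def softmax_proba(W, X):
    """p(y_i = c | X_i; W) = exp(s_c) / sum_j exp(s_j), with s_j = W_j^t X_i."""
    scores = X @ W.T                                     # (N, C) class scores
    scores = scores - scores.max(axis=1, keepdims=True)  # shift for stability
    exp_s = np.exp(scores)
    return exp_s / exp_s.sum(axis=1, keepdims=True)      # each row sums to 1
```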
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 18: Nonlinearity
Concepts Covered:
Optimization in ML
Linear and Logistic Regression
Softmax classifier
Nonlinearity
Neural Network
Nonlinearity
Linear Separability
[Figure: two linearly separable classes in the (X1, X2) plane]
Nonlinearity
[Figure: classes that are not linearly separable in the (X1, X2) plane; positive samples surrounded by negative samples]
Nonlinearity
Threshold
$y = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$
Logistic Regression
$\sigma(W^t X) = \frac{1}{1 + e^{-W^t X}}$
Nonlinearity
ReLU : Rectified Linear Unit
$y = \max(0, x)$
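The three nonlinearities discussed so far, side by side in a short NumPy sketch:

```python
import numpy as np

def threshold(x):
    """Hard threshold: 1 if x >= 0, else 0."""
    return (x >= 0).astype(float)

def sigmoid(x):
    """Logistic function: smooth, saturating, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return np.maximum(0.0, x)
```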
Topic
Lecture 19: Neural Network
Concepts Covered:
Nonlinearity
Neural Network
AND Logic
OR Logic
XOR Logic
Threshold:
$y = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$
Logistic Regression
$\sigma(W^t X) = \frac{1}{1 + e^{-W^t X}}$
Nonlinearity
ReLU : Rectified Linear Unit
$y = \max(0, x)$
Neuron
• Dendrite: receives signals from
other neurons
• Synapse : point of connection
to other neurons
• Soma : processes the
information
• Axon : transmits the output of
this neuron.
Neuron
$y = f(W^t X)$, where $X$ is the input vector and $W$ the weight vector.
Neural Network
Each neuron computes $y = f(W^t X)$ with its own weight vector $W$.
[Figure: network of such neurons, with a weight vector W at each stage]
AND Function
X1  X2  y
 0   0  0
 0   1  0
 1   0  0
 1   1  1

Decision boundary: $X_1 + X_2 - 1.5 = 0$
[Figure: the four input points in the (X1, X2) plane with the separating line]
AND Function
Choose the weight vector $W = [-1.5,\ 1,\ 1]^t$ and augment each input as $X = [1,\ X_1,\ X_2]^t$, giving the input vectors $(1,0,0)$, $(1,0,1)$, $(1,1,0)$, $(1,1,1)$.
AND Function

Computing $X^t W$ for each augmented input and thresholding at 0:

X^t          X^t W    y
[1, 0, 0]    -1.5     0
[1, 0, 1]    -0.5     0
[1, 1, 0]    -0.5     0
[1, 1, 1]     0.5     1

[Figure: single threshold neuron realizing the AND function]
OR Function
X1  X2  y
 0   0  0
 0   1  1
 1   0  1
 1   1  1

Decision boundary: $X_1 + X_2 - 0.5 = 0$
[Figure: the four input points in the (X1, X2) plane with the separating line]
OR Function
With $W = [-0.5,\ 1,\ 1]^t$ and augmented input $X = [1,\ X_1,\ X_2]^t$:

X^t          X^t W    y
[1, 0, 0]    -0.5     0
[1, 0, 1]     0.5     1
[1, 1, 0]     0.5     1
[1, 1, 1]     1.5     1

[Figure: single threshold neuron realizing the OR function]
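A sketch of a single threshold neuron using the AND and OR weight vectors from the tables above (the helper name neuron is illustrative):

```python
import numpy as np

def neuron(X, W):
    """Threshold neuron: y = 1 if W^t [1, X1, X2] >= 0, else 0."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias input 1
    return (X_aug @ W >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W_and = np.array([-1.5, 1.0, 1.0])    # fires only when X1 + X2 - 1.5 >= 0
W_or  = np.array([-0.5, 1.0, 1.0])    # fires when X1 + X2 - 0.5 >= 0
print(neuron(X, W_and))               # [0 0 0 1]
print(neuron(X, W_or))                # [0 1 1 1]
```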
Topic
Lecture 20: Neural Network - II
Concepts Covered:
Neural Network
AND Logic
OR Logic
XOR Logic
Feed Forward NN
Back Propagation Learning
AND/OR Function
XOR Function
X1  X2  y
 0   0  0
 0   1  1
 1   0  1
 1   1  0

[Figure: the four XOR input points in the (X1, X2) plane; the two classes are not linearly separable]
XOR Function
$X_1 \oplus X_2 = (X_1 + X_2)\cdot\overline{X_1 X_2}$

X1  X2   h1 = X1 + X2   h2 = $\overline{X_1 X_2}$   y = h1 · h2 = X1 ⊕ X2
 0   0         0               1                          0
 0   1         1               1                          1
 1   0         1               1                          1
 1   1         1               0                          0
XOR Function

Layer 1 (augmented input $X = [1,\ X_1,\ X_2]^t$):
  $h_1 = X_1 + X_2$ (OR): $W_{h_1} = [-0.5,\ 1,\ 1]^t$
  $h_2 = \overline{X_1 X_2}$ (NAND): $W_{h_2} = [1.5,\ -1,\ -1]^t$

$W_1^t X$, followed by the threshold, gives $h = [h_1,\ h_2]^t$:

X^t          W_{h1}^t X   h1    W_{h2}^t X   h2
[1, 0, 0]      -0.5        0       1.5        1
[1, 0, 1]       0.5        1       0.5        1
[1, 1, 0]       0.5        1       0.5        1
[1, 1, 1]       1.5        1      -0.5        0

Layer 2 takes the augmented hidden vector $h = [1,\ h_1,\ h_2]^t$ and computes the AND of $h_1$ and $h_2$ with $W_2 = [-1.5,\ 1,\ 1]^t$:

h^t          h^t W_2    y = X1 ⊕ X2
[1, 0, 1]     -0.5         0
[1, 1, 1]      0.5         1
[1, 1, 1]      0.5         1
[1, 1, 0]     -0.5         0

[Figure: two-layer network of threshold neurons realizing XOR]
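The same two-layer realization written out as a short NumPy sketch; the weights are taken from the tables above, everything else is an illustrative choice.

```python
import numpy as np

def threshold(x):
    return (x >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
X_aug = np.hstack([np.ones((4, 1)), X])     # augmented input [1, X1, X2]

W1 = np.array([[-0.5,  1.5],                # column 1: OR weights, column 2: NAND weights
               [ 1.0, -1.0],
               [ 1.0, -1.0]])
h = threshold(X_aug @ W1)                   # hidden layer [h1, h2]

h_aug = np.hstack([np.ones((4, 1)), h])     # augmented hidden vector [1, h1, h2]
W2 = np.array([-1.5, 1.0, 1.0])             # AND of h1 and h2
y = threshold(h_aug @ W2)
print(y)                                    # [0 1 1 0] = X1 XOR X2
```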
Neural Network
A feed-forward network with K layers is a composition of functions:
$y = f^{(K)}\big(f^{(K-1)}(\cdots f^{(i)}(f^{(i-1)}(\cdots f^{(2)}(f^{(1)}(X))\cdots))\cdots)\big)$
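A minimal sketch of that composition, assuming simple fully connected layers with a ReLU nonlinearity (all names and shapes are illustrative):

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def layer(W, f=relu):
    """One layer: X -> f(X W), with X a batch of row vectors."""
    return lambda X: f(X @ W)

def compose(*layers):
    """Feed-forward network: apply f^(1) first, f^(K) last."""
    def network(X):
        for g in layers:
            X = g(X)
        return X
    return network

# Example: a 3-layer network with random (purely illustrative) weights
rng = np.random.default_rng(0)
net = compose(layer(rng.normal(size=(4, 8))),
              layer(rng.normal(size=(8, 8))),
              layer(rng.normal(size=(8, 2))))
print(net(rng.normal(size=(5, 4))).shape)   # (5, 2)
```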