Week 4

The document covers various optimization techniques in deep learning, including Stochastic Gradient Descent, Batch Optimization, and Mini-Batch Optimization, along with their advantages and disadvantages. It discusses the importance of minimizing loss functions and the role of optimization in machine learning, particularly in relation to linear and logistic regression, and the softmax classifier. Additionally, it introduces concepts of nonlinearity and neural networks, including the implementation of logical functions such as AND, OR, and XOR.

Course Name: Deep Learning

Faculty Name: Prof. P. K. Biswas


Department: E & ECE, IIT Kharagpur

Topic
Lecture 16: Optimization
Concepts Covered:
• Multiclass SVM Loss Function
• Optimization
• Stochastic Gradient Descent
• Batch Optimization
• Mini-Batch Optimization
Optimizing Loss Function

Multiclass SVM loss with L2 regularization:

$$L = \frac{1}{N}\sum_{i}\sum_{j \neq y_i} \max\!\left(0,\; W_j^t X_i - W_{y_i}^t X_i + \Delta\right) + \lambda \sum_{k}\sum_{l} W_{kl}^2$$

Gradient of the data loss with respect to the correct-class weight row $W_{y_i}$ and any other row $W_j$:

$$\nabla_{W_{y_i}} L = -\frac{1}{N}\sum_{i}\sum_{j \neq y_i} \mathbb{1}\!\left(W_j^t X_i - W_{y_i}^t X_i + \Delta > 0\right) X_i$$

$$\nabla_{W_j} L = \frac{1}{N}\sum_{i} \mathbb{1}\!\left(W_j^t X_i - W_{y_i}^t X_i + \Delta > 0\right) X_i$$
Source - http://cs231n.github.io
Optimizing Loss Function

Gradient descent

With learning rate $\eta$ and regularization weight $\lambda$, the weights are updated at iteration $k$ as

$$W_{y_i}(k+1) = (1 - \eta\lambda)\, W_{y_i}(k) + \frac{\eta}{N}\sum_{i}\sum_{j \neq y_i} \mathbb{1}\!\left(W_j^t X_i - W_{y_i}^t X_i + \Delta > 0\right) X_i$$

$$W_j(k+1) = (1 - \eta\lambda)\, W_j(k) - \frac{\eta}{N}\sum_{i} \mathbb{1}\!\left(W_j^t X_i - W_{y_i}^t X_i + \Delta > 0\right) X_i$$
Source - http://cs231n.github.io
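The following is a minimal NumPy sketch of these formulas, not the lecture's own code; the margin Δ = 1 and the values of `lam` (λ) and `eta` (η) are hypothetical choices, with `X` holding one example per row and `W` one weight row per class.

```python
import numpy as np

def svm_loss_and_grad(W, X, y, delta=1.0, lam=1e-3):
    """Multiclass SVM loss L and its gradient dL/dW.

    W: (C, D) one weight row per class; X: (N, D); y: (N,) integer labels.
    """
    N = X.shape[0]
    scores = X @ W.T                              # scores W_j^t X_i, shape (N, C)
    correct = scores[np.arange(N), y][:, None]    # W_{y_i}^t X_i
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(N), y] = 0.0                # only j != y_i contribute
    loss = margins.sum() / N + lam * np.sum(W * W)

    ind = (margins > 0).astype(float)             # indicator of positive margin
    ind[np.arange(N), y] = -ind.sum(axis=1)       # correct class gets minus the count
    dW = ind.T @ X / N + 2 * lam * W
    return loss, dW

def gd_step(W, X, y, eta=0.1):
    """One gradient-descent update W <- W - eta * dL/dW."""
    loss, dW = svm_loss_and_grad(W, X, y)
    return W - eta * dW, loss
```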
Local and Global Minima
Stochastic / Batch / Mini-Batch Optimization
Stochastic Gradient Descent
Upsides
• The frequent updates immediately give an insight into the performance of the model and the rate of improvement.
• This variant of gradient descent may be the simplest to understand and implement.
• The increased model update frequency can result in faster learning on some problems.
• The noisy update process can allow the model to avoid local minima (e.g. premature convergence).
Stochastic Gradient Descent
Downsides
• Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking significantly longer to train models on large datasets.
• The frequent updates can result in a noisy gradient signal, which may cause the model parameters, and in turn the model error, to jump around (have a higher variance over training epochs).
• The noisy learning process down the error gradient can also make it hard for the algorithm to settle on an error minimum for the model.
Batch Gradient Descent
Upsides
• Fewer updates to the model mean this variant of gradient descent is more computationally efficient than stochastic gradient descent.
• The decreased update frequency results in a more stable error gradient and may result in a more stable convergence on some problems.
• The separation of the calculation of prediction errors and the model update lends the algorithm to parallel-processing-based implementations.
Batch Gradient Descent
Downsides
• The more stable error gradient may result in premature convergence of the model to a less optimal set of parameters.
• The updates at the end of the training epoch require the additional complexity of accumulating prediction errors across all training examples.
• It requires the entire training dataset to be in memory and available to the algorithm.
• Model updates, and in turn training speed, may become very slow for large datasets.
Mini-Batch Gradient Descent
Upsides
• The model update frequency is higher than batch gradient descent, which allows for a more robust convergence, avoiding local minima.
• The batched updates provide a computationally more efficient process than stochastic gradient descent.
• Batching allows both the efficiency of not having all training data in memory and more efficient algorithm implementations.
Mini-Batch Gradient Descent
Downsides
• Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
• Error information must be accumulated across mini-batches of training examples, as in batch gradient descent.
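A minimal sketch contrasting the three variants; `grad_fn`, `batch_size`, `eta`, and `epochs` are hypothetical placeholders rather than anything defined in the lecture. Setting `batch_size = 1` gives stochastic gradient descent, `batch_size = N` gives batch gradient descent, and anything in between gives mini-batch gradient descent.

```python
import numpy as np

def train(W, X, y, grad_fn, batch_size=32, eta=0.01, epochs=10):
    """grad_fn(W, X_batch, y_batch) must return the loss gradient on that batch."""
    N = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(N)               # reshuffle the examples each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            W = W - eta * grad_fn(W, X[idx], y[idx])   # one model update per (mini-)batch
    return W
```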
Error minimization with iterations
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 17: Optimization in ML
Concepts Covered:
• Optimization
• Stochastic Gradient Descent
• Batch Optimization
• Mini-Batch Optimization
• Optimization in ML
• Linear and Logistic Regression
• Softmax Classifier
• Nonlinearity
Optimization in Machine Learning
• The goal of optimization is to reduce a cost function J(W) in order to optimize some performance measure P.
• In pure optimization, minimizing J is the goal in and of itself.
• In Machine Learning, J(W) is minimized w.r.t. the parameters W on training data (training error), and we want the error to be low on unseen (test) data.
• The test error (generalization error) should be low.
Optimization in Machine Learning
Assumptions
• Test and training data are generated by a probability distribution: the data-generating process.
• Data samples in each data set are independent.
• Training set and test set are identically distributed.

The performance of an ML model depends on its ability to:
• Make the training error small.
• Reduce the gap between training and test error.
Underfitting and Overfitting
• Underfitting: the model is not able to obtain a sufficiently low training error.
• Overfitting: the gap between training and test error is too large.

We can control overfitting/underfitting by altering the model's capacity: the set of functions the learning algorithm can select as being the solution.
Linear and Logistic Regression

Linear & Logistic Regression: Binary Classification

Linear Regression
$$f : X \in \mathbb{R}^d \;\rightarrow\; y \in \mathbb{R}, \qquad \hat{y} = W^t X$$

Logistic Regression
$$p(y \mid X; W) = \sigma(W^t X)$$

Linear Regression
[Figure: data in the X1–X2 plane]

Logistic Regression
$$\sigma(W^t X) = \frac{1}{1 + e^{-W^t X}}$$
[Figure: the sigmoid σ(W^t X) plotted against W^t X]
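A small sketch of the two predictors; the weight vector `W` and the data matrix `X` are placeholders, with `X` holding one example per row.

```python
import numpy as np

def linear_predict(W, X):
    """Linear regression: y_hat = W^t X for each example (row of X)."""
    return X @ W

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(W, X):
    """Logistic regression: p(y = 1 | X; W) = sigma(W^t X)."""
    return sigmoid(X @ W)
```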
Softmax Classifier
• Generalization of the binary logistic classifier to multiple classes:

$$s_{y_i} = f(X_i, W)_{y_i} = (W X_i)_{y_i} = W_{y_i}^t X_i$$

• Softmax classifier:

$$p(y_i \mid X_i; W) = \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}$$
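A minimal sketch of the softmax probabilities; subtracting the row maximum before exponentiating is a standard numerical-stability step, not something stated on the slide.

```python
import numpy as np

def softmax_probs(W, X):
    """p(y | X; W) for each example; W: (C, D), X: (N, D)."""
    scores = X @ W.T                               # s_j = W_j^t X_i
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the exponentials
    exp_s = np.exp(scores)
    return exp_s / exp_s.sum(axis=1, keepdims=True)
```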
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 18: Nonlinearity
Concepts Covered:
• Optimization in ML
• Linear and Logistic Regression
• Softmax Classifier
• Nonlinearity
• Neural Network
Nonlinearity

Linear Separability
[Figure: two linearly separable classes in the X1–X2 plane]

Nonlinearity
[Figure: a non-linearly-separable arrangement: a cluster of '+' samples surrounded by '−' samples in the X1–X2 plane]
Nonlinearity

Threshold
$$y = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}$$
[Plot: y versus x]

Logistic Regression
$$\sigma(W^t X) = \frac{1}{1 + e^{-W^t X}}$$
[Plot: σ(W^t X) versus W^t X]

Nonlinearity
ReLU: Rectified Linear Unit
$$y = \max(0, x)$$
[Plot: y versus x]
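The three activations as short NumPy functions (an illustrative sketch; whether the threshold fires at exactly x = 0 is a convention not fixed by the slide).

```python
import numpy as np

def threshold(x):
    """Hard threshold: 1 where x >= 0, else 0."""
    return (x >= 0).astype(float)

def sigmoid(x):
    """Logistic nonlinearity 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified Linear Unit max(0, x)."""
    return np.maximum(0.0, x)
```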
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 19: Neural Network
Concepts Covered:
• Nonlinearity
• Neural Network
• AND Logic
• OR Logic
• XOR Logic
• Feed Forward NN
• Back Propagation Learning
Recap of the nonlinearities from Lecture 18:

$$\text{Threshold: } y = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases} \qquad
\text{Logistic: } \sigma(W^t X) = \frac{1}{1 + e^{-W^t X}} \qquad
\text{ReLU: } y = \max(0, x)$$
Neuron
• Dendrite: receives signals from other neurons
• Synapse: point of connection to other neurons
• Soma: processes the information
• Axon: transmits the output of this neuron

Neuron (artificial model)
$$y = f(W^t X)$$
[Figure: inputs X weighted by W feeding a single unit with activation f]

Neural Network
$$y = f(W^t X)$$
[Figure: a network built from such units, each with its own weight vector W]
AND Function

X1  X2 | y
 0   0 | 0
 0   1 | 0
 1   0 | 0
 1   1 | 1

Decision boundary: $X_1 + X_2 - 1.5 = 0$; the output is 1 when $X_1 + X_2 - 1.5 > 0$.
[Figure: the four input points in the X1–X2 plane, with the line X1 + X2 − 1.5 = 0 separating (1, 1) from the other three points]

With a constant bias input of 1, take

$$W = \begin{bmatrix} -1.5 \\ 1 \\ 1 \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}$$

$$X^t W = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} -1.5 \\ 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} -1.5 \\ -0.5 \\ -0.5 \\ 0.5 \end{bmatrix}
\;\xrightarrow{\text{threshold}}\;
y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$

[Figure: a single threshold neuron realizing the AND function]
OR Function

X1  X2 | y
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 1

Decision boundary: $X_1 + X_2 - 0.5 = 0$; the output is 1 when $X_1 + X_2 - 0.5 > 0$.
[Figure: the four input points in the X1–X2 plane, with the line X1 + X2 − 0.5 = 0 separating (0, 0) from the other three points]

With $W = \begin{bmatrix} -0.5 \\ 1 \\ 1 \end{bmatrix}$ and the same augmented inputs,

$$X^t W = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} -0.5 \\ 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} -0.5 \\ 0.5 \\ 0.5 \\ 1.5 \end{bmatrix}
\;\xrightarrow{\text{threshold}}\;
y = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$

[Figure: a single threshold neuron realizing the OR function]
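A small sketch of both gates as single threshold neurons, using the weight vectors from the slides (the bias is folded in as the first component of each augmented input).

```python
import numpy as np

def threshold_neuron(W, X):
    """y = 1 where W^t x > 0, else 0; X holds one augmented input [1, x1, x2] per row."""
    return (X @ W > 0).astype(int)

X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]])           # bias input 1, then X1, X2

W_and = np.array([-1.5, 1.0, 1.0])  # fires when X1 + X2 - 1.5 > 0
W_or  = np.array([-0.5, 1.0, 1.0])  # fires when X1 + X2 - 0.5 > 0

print(threshold_neuron(W_and, X))   # [0 0 0 1]
print(threshold_neuron(W_or,  X))   # [0 1 1 1]
```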
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 20: Neural Network - II
Concepts Covered:
• Neural Network
• AND Logic
• OR Logic
• XOR Logic
• Feed Forward NN
• Back Propagation Learning

AND / OR Function
[Figure: recap of the AND and OR threshold neurons]
XOR Function

X1  X2 | y
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0

[Figure: the four input points in the X1–X2 plane; no single line separates the two classes]

XOR as a composition of linearly separable functions:

$$X_1 \oplus X_2 = (X_1 + X_2) \cdot \overline{(X_1 \cdot X_2)}$$

X1  X2 | h1 = X1 + X2 | h2 = (X1 · X2)' | h1 · h2 = X1 ⊕ X2
 0   0 |      0       |        1        |        0
 0   1 |      1       |        1        |        1
 1   0 |      1       |        1        |        1
 1   1 |      1       |        0        |        0

First (hidden) layer, an OR neuron and a NAND neuron acting on the augmented inputs:

$$W_1 = \begin{bmatrix} -0.5 & 1.5 \\ 1 & -1 \\ 1 & -1 \end{bmatrix}, \qquad
X = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \end{bmatrix}$$

$$W_1^t X = \begin{bmatrix} -0.5 & 0.5 & 0.5 & 1.5 \\ 1.5 & 0.5 & 0.5 & -0.5 \end{bmatrix}
\;\xrightarrow{\text{threshold}}\;
h = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \end{bmatrix}$$

Second (output) layer, an AND neuron on $(1, h_1, h_2)$ with $W_2 = \begin{bmatrix} -1.5 \\ 1 \\ 1 \end{bmatrix}$:

$$h^t W_2 = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} -1.5 \\ 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} -0.5 \\ 0.5 \\ 0.5 \\ -0.5 \end{bmatrix}
\;\xrightarrow{\text{threshold}}\;
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} = X_1 \oplus X_2$$

[Figure: two-layer network of threshold neurons realizing XOR]
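A sketch of the same two-layer computation with the slide's weights, using hard-threshold neurons throughout.

```python
import numpy as np

def threshold(z):
    """1 where z > 0, else 0."""
    return (z > 0).astype(int)

# Augmented inputs [1, x1, x2], one column per example
X = np.array([[1, 1, 1, 1],
              [0, 0, 1, 1],
              [0, 1, 0, 1]])

W1 = np.array([[-0.5, 1.5],      # first column: OR neuron, second column: NAND neuron
               [ 1.0, -1.0],
               [ 1.0, -1.0]])
W2 = np.array([-1.5, 1.0, 1.0])  # AND neuron over (bias, h1, h2)

h = threshold(W1.T @ X)                        # hidden layer outputs, shape (2, 4)
h_aug = np.vstack([np.ones(4, dtype=int), h])  # prepend the bias row
y = threshold(W2 @ h_aug)                      # output layer
print(y)                                       # [0 1 1 0] = X1 XOR X2
```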
Neural Network Function

A K-layer network is the composition of the layer functions $f^{(1)}, f^{(2)}, \ldots, f^{(i)}, \ldots, f^{(K)}$:

$$y = f^{(K)}\!\left(f^{(K-1)}\!\left(\cdots f^{(i)}\!\left(\cdots f^{(2)}\!\left(f^{(1)}(X)\right)\cdots\right)\cdots\right)\right)$$

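A minimal sketch of this composition; the layer sizes, the random placeholder weights, and the choice of ReLU for every $f^{(i)}$ are assumptions for illustration only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights):
    """Compute f^(K)(... f^(2)(f^(1)(x)) ...), where each f^(i)(h) = relu(W_i @ h)."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h

# Usage with placeholder weights for a 3 -> 4 -> 2 network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x = np.array([1.0, 0.5, -0.2])
print(forward(x, weights))
```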