Assignment 1
Hello Everyone!
The objective of this assignment is to enhance your understanding of gradient descent algorithms and
the impact of learning rate policies on their performance. You are welcome to discuss any issues during office hours or via email.
Submission Instructions: For questions that require a written response, write the answer in comments using ’#’ within your code. Submit your Jupyter Notebook file as ’Assignment1_<RollNumber>.ipynb’ (e.g., Assignment1_21060003.ipynb).
Submission Deadline: October 3, 2023 11:55 PM
Dataset: The dataset is based on the model $y = ax + b + \epsilon$, where $\epsilon$ is sampled from a normal distribution with mean 0 and variance $\sigma^2$. Dataset samples are available in the form $\{x_i\}_{i=1}^{N}$ and $\{y_i\}_{i=1}^{N}$, where $x_i$ is the input and $y_i$ is its desired/true output value.

Cost Function: $J = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$, also known as Mean Squared Error (MSE). This is the objective we aim to minimize.
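To make the setup concrete, here is a minimal sketch of generating such a dataset and computing the MSE cost and its gradients with NumPy. The values of a, b, σ, and N below are illustrative placeholders, not the ones you are given:

import numpy as np

# Illustrative values only; use the a, b, sigma, and N that come with your dataset.
a_true, b_true, sigma, N = 2.0, 1.0, 1.0, 200
rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, size=N)
y = a_true * x + b_true + rng.normal(0.0, sigma, size=N)

def predict(x, a, b):
    return a * x + b

def mse(y_true, y_pred):
    # J = (1/N) * sum_i (y_i - y_hat_i)^2
    return np.mean((y_true - y_pred) ** 2)

def gradients(x, y_true, a, b):
    # dJ/da = -(2/N) * sum_i x_i (y_i - y_hat_i),  dJ/db = -(2/N) * sum_i (y_i - y_hat_i)
    err = y_true - predict(x, a, b)
    return -2.0 * np.mean(x * err), -2.0 * np.mean(err)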
(a) Stochastic Gradient Descent (SGD) - A single optimization step is taken for each example, so one epoch consists of one pass over all examples. Train the model on this dataset for 50 epochs, with the learning rate set to 0.001. Keep a record of the mean loss per epoch while training. Additionally, you can tweak these parameters to get better results.
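As a starting point, a per-example SGD loop might look like the following sketch, which reuses the hypothetical predict/mse/gradients helpers sketched above:

def train_sgd(x, y, lr=0.001, epochs=50):
    # Uses the hypothetical predict/mse/gradients helpers sketched above.
    a, b = 0.0, 0.0
    mean_losses = []
    for epoch in range(epochs):
        losses = []
        for xi, yi in zip(x, y):
            ga, gb = gradients(np.array([xi]), np.array([yi]), a, b)
            a -= lr * ga          # one update per example
            b -= lr * gb
            losses.append(mse(np.array([yi]), predict(np.array([xi]), a, b)))
        mean_losses.append(float(np.mean(losses)))
    return a, b, mean_losses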
(b) Mini-batch SGD - A single optimization step covers a subset (mini-batch) of the whole dataset. This is midway between iterating over one sample at a time and iterating over the complete dataset in one go, offering the best of both worlds. Try out mini-batch sizes of 5, 10, and 20, and show results for each case. Comment on which batch size gives the lowest error. How does this method compare to SGD and full-batch GD?
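A possible mini-batch loop, again reusing the hypothetical helpers above (shuffling each epoch is a free choice, not a requirement):

def train_minibatch_sgd(x, y, batch_size=10, lr=0.001, epochs=50):
    a, b = 0.0, 0.0
    mean_losses = []
    n = len(x)
    for epoch in range(epochs):
        perm = np.random.permutation(n)
        losses = []
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            ga, gb = gradients(x[idx], y[idx], a, b)
            a -= lr * ga          # one update per mini-batch
            b -= lr * gb
            losses.append(mse(y[idx], predict(x[idx], a, b)))
        mean_losses.append(float(np.mean(losses)))
    return a, b, mean_losses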
(c) SGD with Nesterov Momentum - This performs an optimization step over each example per iteration, as in SGD, but uses Nesterov momentum in the update. Train the model with a learning rate of 0.001 and a momentum of 0.5 for 50 epochs. You can adjust the parameters to get better results.
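One common formulation of the Nesterov momentum update (the "look-ahead" form; $\mu$ is the momentum, $\eta$ the learning rate, and $v_t$ the velocity, symbols of our own choosing) is:

$$
v_{t+1} = \mu\, v_t - \eta\, \nabla J\big(\theta_t + \mu\, v_t\big), \qquad \theta_{t+1} = \theta_t + v_{t+1}
$$

The gradient is evaluated at the look-ahead point $\theta_t + \mu\, v_t$ rather than at $\theta_t$, which is what distinguishes this from classical momentum.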
(d) Adaptive Moment Estimation (Adam) - This performs the optimization step over the entire dataset per iteration (i.e., full-batch updates). Train the model with a learning rate of 0.5 for 50 epochs. You can adjust the parameters to get better results.
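For reference, the standard Adam update uses first and second moment estimates $m_t$ and $v_t$ of the gradient $g_t$, decay rates $\beta_1, \beta_2$, learning rate $\eta$, and a small constant $\epsilon$ (typical defaults are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, though this handout does not fix them):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
$$
$$
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$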
After training, plot the ’mean loss per epoch’ versus ’number of epochs’ for each optimizer for comparison.
This will give more insight into how training progressed with each epoch.
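A small plotting helper like the sketch below may be convenient; the structure of the histories argument is an assumption of ours, not a requirement:

import matplotlib.pyplot as plt

def plot_histories(histories):
    # histories: dict mapping an optimizer name to its list of mean losses per epoch.
    for name, losses in histories.items():
        plt.plot(range(1, len(losses) + 1), losses, label=name)
    plt.xlabel("Epoch")
    plt.ylabel("Mean loss per epoch (MSE)")
    plt.legend()
    plt.show()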
Info: The widely used (but not strictly followed) definition of an epoch is the duration in which the algorithm completes one full pass over the dataset. Thus, in the above case, after SGD has iterated over each sample one by one and there are no samples left, we say that one epoch has passed. After this, the algorithm iterates over the samples again, and we can keep doing this until a certain criterion is met, which in our case is the total number of epochs (50).
In this case, at each iteration, the learning rate is scaled by $\gamma^{\alpha}$, where $\gamma \in (0, 1)$ is the decay rate and $\alpha$ is the current epoch. An additional parameter $p$, called 'patience' (the number of epochs to wait before reducing the learning rate if validation performance does not improve), is also used. Experiment with this strategy using any optimizer of your choice. Explain how this strategy affects convergence compared to your results in Part 1.
Implement this policy with SGD Nesterov Momentum and explain your observations and potential reasons for the behaviour of the optimizer under the given policy.
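One way to realize this policy in code, combining the $\gamma^{\alpha}$ decay with the patience counter described above (the class and variable names here are our own, illustrative choices, not a prescribed interface):

class DecayWithPatience:
    # One possible interpretation: the learning rate is base_lr * gamma**alpha,
    # where alpha is incremented only after `patience` epochs without improvement
    # in validation loss.
    def __init__(self, base_lr, gamma=0.9, patience=3):
        self.base_lr = base_lr
        self.gamma = gamma
        self.patience = patience
        self.alpha = 0                  # number of decay steps taken so far
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:        # improvement: reset the wait counter
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.alpha += 1         # decay the learning rate
                self.wait = 0
        return self.base_lr * (self.gamma ** self.alpha)

Here step would be called once per epoch with the current validation loss and returns the learning rate to use for the next epoch.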
2. There are other learning rate policies which can be employed, such as cyclical learning rates and the cosine annealing schedule. Further details are present in this paper. Implement any one of them and observe any differences in your results.
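For reference, the cosine annealing schedule sets the learning rate at step $T_{cur}$ of an annealing period $T_{\max}$ as below, where $\eta_{\min}$ and $\eta_{\max}$ are the lower and upper learning-rate bounds (notation ours):

$$
\eta_t = \eta_{\min} + \tfrac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{T_{cur}}{T_{\max}}\,\pi\right)\right)
$$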
3. Explore other optimizers such as Adagrad, RMSprop, etc. This link discusses other optimizers in detail. Implement any one of them and observe any difference(s) in your results.
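As one concrete option, here is a minimal full-batch RMSprop sketch for the $y = ax + b$ model, reusing the hypothetical helpers from the dataset sketch; the hyperparameter values are common defaults, not requirements:

def train_rmsprop(x, y, lr=0.01, rho=0.9, eps=1e-8, epochs=50):
    a, b = 0.0, 0.0
    sa, sb = 0.0, 0.0                   # running averages of squared gradients
    mean_losses = []
    for epoch in range(epochs):
        ga, gb = gradients(x, y, a, b)
        sa = rho * sa + (1 - rho) * ga ** 2
        sb = rho * sb + (1 - rho) * gb ** 2
        a -= lr * ga / (np.sqrt(sa) + eps)   # per-parameter adaptive step size
        b -= lr * gb / (np.sqrt(sb) + eps)
        mean_losses.append(mse(y, predict(x, a, b)))
    return a, b, mean_losses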