
CS-437/CS-5317, EE-414/EE-517: Deep Learning

Assignment # 1

Dated: September 19, 2023

Hello Everyone!
The objective of this assignment is to deepen your understanding of gradient descent algorithms and the impact of learning rate policies on their performance. You are welcome to discuss any issues in office hours or via email.

Submission Instructions: For questions that require a written response, write the answer in comments using '#' within your code. Submit your Jupyter Notebook file as
'Assignment1_<RollNumber>.ipynb' (e.g., Assignment1_21060003.ipynb).
Submission Deadline: October 3, 2023, 11:55 PM

1 Optimization Methods for Linear Regression (CLO 3): 70 points


Model: ŷᵢ = m·xᵢ + c, where m and c are the parameters to be trained over the dataset, and ŷᵢ is the estimate of the model for input xᵢ (with true/desired output value yᵢ).

Dataset: The dataset is based on the model y = ax + b + ε, where ε is sampled from a normal distribution with mean 0 and variance σ². Dataset samples are available as {xᵢ} and {yᵢ} for i = 1, …, N, where xᵢ is the input and yᵢ is its desired/true output value.
Cost Function: J = (1/N) Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)², also known as the Mean Squared Error (MSE). This is the objective we aim to minimize.

1.1 Direct Method (10 points)


Optimization methods like gradient descent can be used to minimize the cost function of linear regression. However, there also exists an analytical solution for linear regression. Given below is the formula for this analytical solution, where θ is the minimizing parameter vector, X is the matrix of input variables (with a column of ones appended so that the intercept c is learned as well), and Y is the vector of target values.

θ = (XᵀX)⁻¹XᵀY

Use this formula to compute the exact solution for the given dataset.
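A minimal NumPy sketch of this computation is given below. The synthetic dataset here (a = 2, b = 1, σ = 0.5, N = 100) is an assumption for illustration only; in your submission, apply the formula to the dataset provided in the notebook.

```python
import numpy as np

# Illustrative dataset (parameters a=2, b=1, sigma=0.5 are assumptions).
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=100)   # y = ax + b + eps

# Design matrix with a column of ones so theta = [c, m] includes the intercept.
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.inv(X.T @ X) @ X.T @ y             # normal equation
c, m = theta
```

In practice, `np.linalg.lstsq` or `np.linalg.solve` is preferred over an explicit matrix inverse for numerical stability, but the expression above mirrors the formula as stated.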

1.2 Iterative Method (60 points)


In this section, you will employ gradient descent algorithms to optimize for the solution. The full-batch gradient descent algorithm (GD) has already been implemented for you. Modify the weight-update methods and the train method in the Linear Regression class to implement the following optimizers:

(a) Stochastic Gradient Descent (SGD) - A single optimizing step is taken over each individual example, so one epoch consists of N updates. Train the model on this dataset for 50 epochs, with the learning rate set to 0.001. Keep a record of the mean loss per epoch while training. You may also tweak these parameters to get better results.
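The per-example inner loop might be sketched as follows (the standalone variables m, c and the inline dataset are assumptions for illustration; in the assignment this logic belongs inside the Linear Regression class):

```python
import numpy as np

# Illustrative dataset (a=2, b=1, sigma=0.5 assumed).
rng = np.random.default_rng(1)
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 100)

m, c, lr = 0.0, 0.0, 0.001
losses = []
for epoch in range(50):
    order = rng.permutation(len(x))          # shuffle each epoch
    epoch_loss = 0.0
    for i in order:                          # one update per example
        err = y[i] - (m * x[i] + c)          # residual for this sample
        m += lr * 2 * err * x[i]             # from dJ/dm = -2*err*x
        c += lr * 2 * err                    # from dJ/dc = -2*err
        epoch_loss += err ** 2
    losses.append(epoch_loss / len(x))       # mean loss per epoch
```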

(b) Mini-batch SGD - A single optimizing step encompasses a subset (mini-batch) of the whole dataset. This is midway between iterating over one sample at a time and iterating over the complete dataset in one go, offering the best of both worlds. Try out mini-batch sizes of 5, 10, and 20, and show results for each case. Comment on which batch size gives the lowest error. How does this method compare to SGD and GD?
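One way to organize the mini-batch comparison is sketched below; the helper name `train_minibatch` and the inline dataset are assumptions, not the provided class interface:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 100)   # assumed dataset

def train_minibatch(batch_size, epochs=50, lr=0.001):
    m, c = 0.0, 0.0
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(x))
        epoch_loss = 0.0
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            err = y[idx] - (m * x[idx] + c)       # residuals for the batch
            m += lr * 2 * np.mean(err * x[idx])   # averaged gradient step
            c += lr * 2 * np.mean(err)
            epoch_loss += np.sum(err ** 2)
        losses.append(epoch_loss / len(x))
    return m, c, losses

results = {b: train_minibatch(b) for b in (5, 10, 20)}
```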
(c) SGD Nesterov Momentum - This performs the optimization step over each example in each iteration.
Train the model with a learning rate of 0.001 and momentum of 0.5 for 50 epochs. You can adjust
the parameters to get better results.
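The Nesterov update evaluates the gradient at a look-ahead point before applying the momentum step. A sketch for the parameter vector θ = [m, c] (names and the inline dataset are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 100)    # assumed dataset

theta = np.zeros(2)                  # [m, c]
v = np.zeros(2)                      # velocity
lr, mu = 0.001, 0.5
for _ in range(50):
    for i in rng.permutation(len(x)):
        look = theta + mu * v                         # look-ahead point
        err = y[i] - (look[0] * x[i] + look[1])       # residual at look-ahead
        grad = np.array([-2 * err * x[i], -2 * err])  # gradient of squared error
        v = mu * v - lr * grad                        # update velocity
        theta = theta + v                             # apply step
m, c = theta
```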

(d) Adaptive Moment Estimation (Adam) - This performs the optimization step over the entire dataset per iteration. Train the model with a learning rate of 0.5 for 50 epochs. You can adjust the parameters to get better results.
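A full-batch Adam sketch is given below; the β₁ = 0.9, β₂ = 0.999, ε = 1e-8 values are the common defaults and are assumed here, as is the inline dataset:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-5, 5, 100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 100)    # assumed dataset

theta = np.zeros(2)                               # [m, c]
mom, vel = np.zeros(2), np.zeros(2)               # first/second moment estimates
lr, b1, b2, eps = 0.5, 0.9, 0.999, 1e-8
for t in range(1, 51):                            # 50 full-batch steps
    err = y - (theta[0] * x + theta[1])
    grad = np.array([-2 * np.mean(err * x), -2 * np.mean(err)])
    mom = b1 * mom + (1 - b1) * grad              # first moment (momentum)
    vel = b2 * vel + (1 - b2) * grad ** 2         # second moment (scale)
    m_hat = mom / (1 - b1 ** t)                   # bias correction
    v_hat = vel / (1 - b2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
final_loss = np.mean((y - (theta[0] * x + theta[1])) ** 2)
```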

After training, plot the 'mean loss per epoch' versus the 'number of epochs' for each optimizer for comparison. This will give more insight into how training progressed with each epoch.
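A matplotlib sketch for the comparison plot; the `histories` dictionary here holds made-up placeholder values and should be replaced with the per-epoch loss lists you recorded during training:

```python
import os
import matplotlib
matplotlib.use("Agg")                      # headless backend; only saves a file
import matplotlib.pyplot as plt

# Placeholder loss histories (values are illustrative, not real results).
histories = {"GD": [5.0, 2.0, 1.2, 1.0], "SGD": [4.0, 1.8, 1.0, 0.8]}
for name, losses in histories.items():
    plt.plot(range(1, len(losses) + 1), losses, label=name)
plt.xlabel("Epoch")
plt.ylabel("Mean loss per epoch")
plt.legend()
plt.savefig("loss_comparison.png")
saved = os.path.exists("loss_comparison.png")
```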

Info: The widely used (but not strictly followed) definition of an epoch is the duration in which the algorithm completes one pass over the complete dataset. Thus, in the above case, after SGD has iterated over each sample one by one and there are no samples left, we say that one epoch has passed. After this, the algorithm iterates over the samples again, and we can keep doing this until a certain criterion is met, which in our case is the total number of epochs (50).

2 Learning Rate Decay (CLO 3): 30 points


In this section, you will add the learning rate decay functionality to the Linear Regression class from
Part 1 to allow for the following policies:

2.1 Constant Learning Rate


The equation given below shows that at each iteration the learning rate remains constant. Note that
η(0) is the initial learning rate while η(t) is the learning rate at the tth iteration. You can show its
implementation along with any optimizer of your choice from Part 1.

η(t) = η(0), ∀t ∈ [0, N ]

2.2 Auto-reduce Learning Rate


The idea is based on reducing the learning rate when the validation performance plateaus or diverges. When the network parameters have approached the vicinity of a local minimum in the loss landscape and further training does not improve validation performance, the learning rate is reduced.

In this case, at each iteration, the learning rate is scaled by γᵅ, where γ ∈ (0, 1) is the decay rate and α is the current epoch. An additional parameter p, called 'patience' (the number of epochs to wait for an improvement in validation performance before reducing the learning rate), is also used. Experiment with this strategy using any optimizer of your choice. Explain how this strategy affects convergence compared to your results in Part 1.

η(t) = η(0)γᵅ, ∀t ∈ [0, N]
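The patience logic might be sketched as below. The function name, the improvement tolerance, and the multiplicative decay on each plateau are assumptions for illustration; adapt the details to match the γᵅ schedule and your class interface:

```python
# Sketch of plateau-based learning rate reduction (names are assumptions).
def reduce_on_plateau(val_losses, lr0=0.1, gamma=0.5, patience=3):
    lr, best, wait = lr0, float("inf"), 0
    schedule = []
    for loss in val_losses:                 # one entry per epoch
        if loss < best - 1e-12:             # improvement resets the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:            # plateau detected
                lr *= gamma                 # decay the learning rate
                wait = 0
        schedule.append(lr)
    return schedule

# After three non-improving epochs (the patience), the rate is halved.
sched = reduce_on_plateau([1.0, 0.8, 0.8, 0.8, 0.8, 0.7], patience=3)
```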

2.3 Polynomial Decay


Implement the polynomial decay schedule, in which η(0) > η(N), to non-linearly decrease the learning rate between an upper and lower bound. Note that the value of the decay rate γ must be greater than zero.

η(t) = [η(N) − η(0)](t/N)^γ + η(0), ∀t ∈ [0, N]

Implement this policy with SGD Nesterov Momentum and explain your observations and the potential reasons for the optimizer's behaviour under the given policy.
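The schedule itself is a one-line function; a sketch (parameter names and the example η(0) = 0.1, η(N) = 0.001, γ = 2 values are assumptions) follows:

```python
# Polynomial decay: eta(t) = [eta(N) - eta(0)] * (t / N) ** gamma + eta(0).
# eta0/eta_n/gamma values here are illustrative assumptions.
def poly_decay(t, n_total, eta0=0.1, eta_n=0.001, gamma=2.0):
    return (eta_n - eta0) * (t / n_total) ** gamma + eta0

# The rate starts at eta0, ends at eta_n, and decreases monotonically.
rates = [poly_decay(t, 50) for t in range(51)]
```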

Thought-provoking practice questions


These will not be graded.
1. What difference is observed if the data is noisy? This can be explored by increasing the value of the standard deviation (or variance) of the distribution in the dataset generator.

2. There are other learning rate policies that can be employed, such as cyclical learning rates and the cosine annealing schedule. Further details are present in this paper. Implement any one of them and observe any difference in your results.
3. Explore other optimizers such as Adagrad, RMSprop, etc. This link discusses other optimizers in detail. Implement any one of them and observe any difference(s) in your results.
