CST414 DEEP LEARNING
Category: PEC | L: 2 | T: 1 | P: 0 | Credit: 3 | Year of Introduction: 2019
Preamble: Deep learning is a recently emerged branch of machine learning, particularly
designed to solve a wide range of problems in Computer Vision and Natural Language Processing.
This course introduces the building blocks used in deep learning: neural networks, deep neural
networks, convolutional neural networks and recurrent neural networks. Learning and
optimization strategies such as Gradient Descent, Nesterov Accelerated Gradient Descent, Adam,
AdaGrad and RMSProp are also discussed. The course helps students attain a sound knowledge of
the deep architectures used for solving various Vision and NLP tasks. In future, learners can
master modern techniques in deep learning such as attention mechanisms, generative models and
reinforcement learning.
Prerequisite: Basic understanding of probability theory, linear algebra and machine learning
Course Outcomes: After the completion of the course, the student will be able to
CO1 Illustrate the basic concepts of neural networks and their practical issues
(Cognitive Knowledge Level: Apply)
CO2 Outline the standard regularization and optimization techniques for deep neural
networks (Cognitive Knowledge Level: Understand)
CO5 Use different neural network/deep learning models for practical applications.
(Cognitive Knowledge Level: Apply)
Mapping of Course Outcomes (CO1-CO5) with Program Outcomes (PO1-PO12)
Assessment Pattern

Bloom's Category    Continuous Assessment Tests         End Semester
                    Test 1 (%)       Test 2 (%)         Examination (%)
Remember            30               30                 30
Understand          30               30                 30
Apply               40               40                 40
Analyze
Evaluate
Create
Mark Distribution

Total Marks    CIE Marks    ESE Marks    ESE Duration
150            50           100          3 hours
Syllabus
Module-1 (Neural Networks)
Introduction to neural networks - Single layer perceptrons, Multi-layer perceptrons (MLPs),
Representation power of MLPs, Activation functions - Sigmoid, Tanh, ReLU, Softmax, Risk
minimization, Loss function, Training MLPs with backpropagation, Practical issues in neural
network training - The problem of overfitting, Vanishing and exploding gradient problems,
Difficulties in convergence, Local and spurious optima, Computational challenges, Applications
of neural networks.
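As an indicative companion to this module (not part of the prescribed syllabus), the sketch below trains a tiny multi-layer perceptron with backpropagation on the XOR problem in plain NumPy; the layer sizes, seed, learning rate and epoch count are arbitrary illustrative choices.

```python
import numpy as np

# Minimal 2-4-1 MLP with sigmoid activations, trained by backpropagation and
# plain gradient descent on XOR. All hyperparameters here are illustrative.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.normal(scale=1.0, size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(scale=1.0, size=(4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(10000):
    h = sigmoid(X @ W1 + b1)              # forward pass: hidden layer
    out = sigmoid(h @ W2 + b2)            # forward pass: output layer
    d_out = (out - y) * out * (1 - out)   # backprop through MSE loss and output sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)    # backprop into the hidden layer
    W2 -= lr * h.T @ d_out                # gradient descent updates
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))   # outputs should approach [0, 1, 1, 0] for most seeds
```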
Module-2 (Deep Learning)
Introduction to deep learning, Deep feed forward network, Training deep models, Optimization
techniques - Gradient Descent (GD), GD with momentum, Nesterov accelerated GD, Stochastic
GD, AdaGrad, RMSProp, Adam. Regularization Techniques - L1 and L2 regularization, Early
stopping, Dataset augmentation, Parameter sharing and tying, Injecting noise at input, Ensemble
methods, Dropout, Parameter initialization.
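To make the listed update rules concrete, here is a small sketch (not taken from the textbooks) of plain gradient descent, gradient descent with momentum and Adam applied to a toy quadratic loss; the hyperparameter values are conventional defaults rather than course-mandated settings.

```python
import numpy as np

# Illustrative update rules for minimising f(w) = 0.5 * ||w||^2, whose gradient is w.
def grad(w):
    return w  # gradient of the toy quadratic loss

w_gd, w_mom, w_adam = (np.array([5.0, -3.0]) for _ in range(3))
v = np.zeros(2)                  # momentum buffer
m, s = np.zeros(2), np.zeros(2)  # Adam first/second moment estimates
lr, beta, beta1, beta2, eps = 0.1, 0.9, 0.9, 0.999, 1e-8

for t in range(1, 101):
    # Plain gradient descent
    w_gd -= lr * grad(w_gd)
    # Gradient descent with momentum (heavy-ball form)
    v = beta * v + grad(w_mom)
    w_mom -= lr * v
    # Adam: bias-corrected first and second moments
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g * g
    m_hat, s_hat = m / (1 - beta1 ** t), s / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (np.sqrt(s_hat) + eps)

print(w_gd, w_mom, w_adam)  # all three should end up close to the minimiser [0, 0]
```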
Module-4 (Recurrent Neural Networks)
Recurrent neural networks - Computational graphs, RNN design, encoder-decoder sequence-to-sequence
architectures, deep recurrent networks, recursive neural networks, modern RNNs - LSTM and GRU.
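As a pointer to how the LSTM gating equations look in code, the sketch below implements a single LSTM time step in NumPy; the gate ordering, dimensions and random weights are assumptions made for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    Gate pre-activations are stacked as [input, forget, output, candidate]."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell state
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

# Toy usage with arbitrary sizes: input dimension 3, hidden size 2.
rng = np.random.default_rng(0)
D, H = 3, 2
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):    # a sequence of 5 random input vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h, c)
```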
Module-5 (Application Areas)
Applications - computer vision, speech recognition, natural language processing; common word
embeddings - Continuous Bag-of-Words (CBOW), Word2Vec, Global Vectors for Word Representation
(GloVe). Research areas - autoencoders, representation learning, Boltzmann machines, deep
belief networks.
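As an indicative illustration of the Continuous Bag-of-Words idea mentioned above, the following untrained sketch averages context-word embeddings and scores every vocabulary word as the centre-word prediction; the corpus, window size and embedding dimension are invented for the example.

```python
import numpy as np

# Tiny CBOW-style forward pass: predict a centre word from the average of its
# context-word embeddings. Corpus, window size and dimensions are made up.
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
V, D = len(vocab), 8
E = rng.normal(scale=0.1, size=(V, D))    # input (context) embeddings
O = rng.normal(scale=0.1, size=(D, V))    # output (centre-word) weights

def cbow_probs(context_words):
    h = E[[idx[w] for w in context_words]].mean(axis=0)  # average context embedding
    scores = h @ O
    exp = np.exp(scores - scores.max())                  # softmax over the vocabulary
    return exp / exp.sum()

p = cbow_probs(["quick", "fox"])          # window-1 context around "brown"
print(vocab[int(p.argmax())], p.max())    # untrained, so the prediction is arbitrary
```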
Text Books
1. Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, MIT Press, 2016.
2. Charu C. Aggarwal, Neural Networks and Deep Learning: A Textbook, Springer, 2018.
3. Nikhil Buduma and Nicholas Locascio, Fundamentals of Deep Learning: Designing
Next-Generation Machine Intelligence Algorithms, 1st ed., O'Reilly Media, 2017.
Reference Books
1. Satish Kumar, Neural Networks: A Classroom Approach, Tata McGraw-Hill Education,
2004.
2. B. Yegnanarayana, Artificial Neural Networks, PHI Learning Pvt. Ltd, 2009.
3. Michael Nielsen, Neural Networks and Deep Learning, 2018.
Sample Course Level Assessment Questions

2. Design a single layer perceptron to compute the NAND (not-AND) function. This
function receives two binary-valued inputs x1 and x2, and returns 0 if both inputs are
1, and returns 1 otherwise.
3. Suppose we have a fully connected, feed-forward network with no hidden layer, and
5 input units connected directly to 3 output units. Briefly explain why adding a
hidden layer with 8 linear units does not make the network any more powerful.
4. Briefly explain one thing you would use a validation set for, and why you can’t just
do it using the test set.
6. You would like to train a fully-connected neural network with 5 hidden layers, each
with 10 hidden units. The input is 20-dimensional and the output is a scalar. What is
the total number of trainable parameters in your network?
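A quick way to check the count asked for in question 6 is to multiply out weights and biases layer by layer, as in the short sketch below (illustrative only, not part of the question set).

```python
# Parameter count for a fully connected network:
# 20 inputs -> five hidden layers of 10 units -> 1 scalar output, with biases.
sizes = [20] + [10] * 5 + [1]
params = sum(n_in * n_out + n_out            # weights plus biases per layer
             for n_in, n_out in zip(sizes[:-1], sizes[1:]))
print(params)  # 20*10+10 + 4*(10*10+10) + 10*1+1 = 661
```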
2. In stochastic gradient descent, each pass over the dataset requires the same number
of arithmetic operations, whether we use minibatches of size 1 or size 1000. Why
can it nevertheless be more computationally efficient to use minibatches of size
1000?
3. State how to apply early stopping in the context of learning using Gradient Descent.
Why is it necessary to use a validation set (instead of simply using the test set) when
using early stopping?
4. Suppose that a model does well on the training set, but only achieves an accuracy of
85% on the validation set. You conclude that the model is overfitting, and plan to
use L1 or L2 regularization to fix the issue. However, you learn that some of the
examples in the data may be incorrectly labeled. Which form of regularization
would you prefer to use and why?
5. Describe one advantage of using the Adam optimizer instead of basic gradient descent.
QP CODE:
PART A
2. List the advantages and disadvantages of sigmoid and ReLU activation functions.
3. Derive the weight update rule in gradient descent when the error function is (a)
mean squared error, (b) cross entropy.
5. What happens if the stride of the convolutional layer increases? What can be the
maximum stride? Explain.
6. Draw the architecture of a simple CNN and write short notes on each block.
Part B
(Answer any one question from each module. Each question carries 14 Marks)
11. (a) Update the parameters of the given MLP using gradient descent with a learning
rate of 0.5 and ReLU as the activation function. The initial weights are given as
V = [0.1 0.2; 0.1 0.1] and W = [0.1 0.1]. (10)
(b) Explain the importance of choosing the right step size in neural networks. (4)
OR
12. (a) Draw the architecture of a multi-layer perceptron. Derive the update rules for
the parameters of a multi-layer neural network through gradient descent. (10)
(b) Calculate the output of the following neuron Y if the activation function is a
binary sigmoid. (4)
13. (a) Explain what might happen in AdaGrad, where the parameter update is expressed
as ∆w_t = −η g_t / √(∑_{τ=1}^{t} g_τ²). Here the denominator computes the L2
norm of all previous gradients on a per-dimension basis and η is a global
learning rate shared by all dimensions. (6)
(b) Differentiate gradient descent with and without momentum. Give the equations
for weight update in GD with and without momentum. Illustrate plateaus,
saddle points and slowly varying gradients. (8)
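For reference alongside question 13, a compact sketch of the AdaGrad update written in the question is given below (illustrative only, not part of the question paper); the toy objective and hyperparameters are invented to show how the accumulated squared gradients shrink the per-dimension step size.

```python
import numpy as np

# AdaGrad update as in question 13(a): accumulated squared gradients divide
# the step size per dimension. Toy ill-conditioned quadratic objective.
def grad(w):
    return np.array([10.0, 1.0]) * w   # gradient is steep in dimension 0

w = np.array([1.0, 1.0])
accum = np.zeros(2)
eta, eps = 0.5, 1e-8
for t in range(50):
    g = grad(w)
    accum += g * g                          # sum of squared past gradients
    w -= eta * g / (np.sqrt(accum) + eps)   # per-dimension scaled step
print(w, np.sqrt(accum))  # effective step sizes shrink as the accumulated squares grow
```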
OR
14. (a) Suppose a supervised learning problem is given to model a deep feed forward
neural network. Suggest solutions for the following: (i) a small-sized dataset for
training, (ii) a dataset with unlabeled data, (iii) a large dataset but with data from a
different distribution. (9)
(b) Describe the effect on bias and variance when a neural network is modified
to have a larger number of hidden units followed by dropout regularization. (5)
15. (a) Draw and explain the architecture of Convolutional Neural Networks. (8)
(b) Suppose that a CNN was trained to classify images into different categories.
It performed well on a validation set taken from the same source as the
training set, but not on a test set that comes from another distribution.
What could be the problem with the training of such a CNN? How will you
ascertain the problem? How can those problems be solved? (6)
OR
16. (a) What is the motivation behind convolutional neural networks? (4)
(b) Discuss all the variants of the basic convolution function. (10)
17. (a) Describe how an LSTM takes care of the vanishing gradient problem. Use
some hypothetical numbers for the input and output signals to explain the
concept. (8)
(b) Draw and explain the architecture of Recurrent Neural Networks. (6)
OR
18. (a) Explain the application of LSTM in Natural Language Processing. (8)
(b) Explain the merits and demerits of using Autoencoders in Computer Vision. (6)
OR
20. (a) Illustrate the use of representation learning in object classification. (7)
Teaching Plan
No    Contents    No. of Lecture Hours (36 hrs)