UCS664 EST 23
Note: Attempt any six questions. Assume missing data, if any, suitably
Q.2 i. A sequential model has been designed to train a Named Entity Recognizer (NER). The model consists of an embedding layer followed by a GRU layer (with 100 neurons) and a time-distributed output layer. The model has been trained on the standard CoNLL-2003 dataset containing 30,000 sentences with 17 distinct NER tags. The model has been trained on the 36,000 most frequent words with a maximum sequence length of 110 words. The embedding layer generates embeddings of 300 dimensions.
a) Compute the total number of parameters in the embedding layer, the GRU layer, and the output layer. (4)
b) If the model has been trained using the Adam optimization technique with a learning rate of 2e-5 and a batch size of 128, how many iterations of Adam are required to complete one epoch? (1) CO4 L4
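For reference, a minimal sketch of the described model, assuming a TensorFlow/Keras implementation (the framework is not named in the question); model.summary() reports the per-layer counts asked for in part (a), and the last line gives the batches per epoch for part (b). Note that Keras's default GRU (reset_after=True) uses two bias vectors per gate, so its count can differ slightly from the textbook GRU formula.

import math
import tensorflow as tf

VOCAB, SEQ_LEN, EMB_DIM, GRU_UNITS, NUM_TAGS = 36_000, 110, 300, 100, 17

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(input_dim=VOCAB, output_dim=EMB_DIM),
    tf.keras.layers.GRU(GRU_UNITS, return_sequences=True),
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="categorical_crossentropy")
model.summary()                    # per-layer parameter counts (part a)
print(math.ceil(30_000 / 128))     # Adam iterations per epoch over 30,000 sentences (part b)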
ii. Write an expression to generate the activation (c^<t>) at any t-th time-stamp in a GRU network. Also derive the gradient of the loss function at the t-th time-stamp w.r.t. W_uc and W_rx (where symbols have their usual meanings). Consider binary cross-entropy as the loss function. (5) CO3 L2
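For reference, one common GRU formulation consistent with the c^<t>, W_uc and W_rx symbols used above (the exact split of the weight matrices is an assumption), written in LaTeX:

\Gamma_u = \sigma\big(W_{uc}\, c^{\langle t-1 \rangle} + W_{ux}\, x^{\langle t \rangle} + b_u\big)
\Gamma_r = \sigma\big(W_{rc}\, c^{\langle t-1 \rangle} + W_{rx}\, x^{\langle t \rangle} + b_r\big)
\tilde{c}^{\langle t \rangle} = \tanh\big(W_{cc}\,(\Gamma_r \odot c^{\langle t-1 \rangle}) + W_{cx}\, x^{\langle t \rangle} + b_c\big)
c^{\langle t \rangle} = \Gamma_u \odot \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u) \odot c^{\langle t-1 \rangle}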
Q.3 i. Consider the following unique words present in a corpus:
ABLE, APE, BEATABLE, CAP, CHILDREN, CHIEF, CHILDLESS, CHILL,
CHILDLIKE, CHILDISH, CODE, FIXABLE, READ, READABLE, READING,
READS, RED, ROPE, RIPE
Using the given vocabulary and the entropy-based letter successor variety method, find the stem of the word CHILDREN. (4) CO2 L3
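A minimal sketch, in Python, of how the successor-letter entropy of each prefix of CHILDREN could be tabulated from the vocabulary above; the rule for reading off the stem boundary (e.g. a peak or threshold in the entropy profile) follows whatever convention the course defines.

from collections import Counter
from math import log2

vocab = ["ABLE", "APE", "BEATABLE", "CAP", "CHILDREN", "CHIEF", "CHILDLESS", "CHILL",
         "CHILDLIKE", "CHILDISH", "CODE", "FIXABLE", "READ", "READABLE", "READING",
         "READS", "RED", "ROPE", "RIPE"]

def successor_entropy(prefix, words):
    # Distribution of the letter that follows `prefix` among words sharing it.
    nxt = Counter(w[len(prefix)] for w in words
                  if w.startswith(prefix) and len(w) > len(prefix))
    total = sum(nxt.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in nxt.values())

for i in range(1, len("CHILDREN") + 1):
    prefix = "CHILDREN"[:i]
    print(prefix, round(successor_entropy(prefix, vocab), 3))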
ii. Explain how the Word2Vec Skip-gram model can be modeled using a feed-forward network. (2) CO1 L1
iii. What is the major limitation of the Skip-gram model? How does Skip-gram with Negative Sampling (SGNS) handle it? Write an expression for the cost
function of SGNS and derive the gradient of the cost function w.r.t. the word vector of the central word (v_c). (4) CO1 L2
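For reference, one standard form of the SGNS objective for a centre word c and an observed context word o, with K negative samples (the notation may differ from the course's), written in LaTeX:

J = -\log \sigma\big(u_o^{\top} v_c\big) - \sum_{k=1}^{K} \log \sigma\big(-u_k^{\top} v_c\big)

\frac{\partial J}{\partial v_c} = \big(\sigma(u_o^{\top} v_c) - 1\big)\, u_o + \sum_{k=1}^{K} \sigma\big(u_k^{\top} v_c\big)\, u_k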
Q.4 i. A BERT-small model with 8 encoder blocks, each with 6 self-attention heads, has been trained for a sequence classification task with three output classes. The input to the model is a sequence of maximum length 100, with each token represented by a 512-dimensional embedding. The feed-forward layer used in each BERT block has 128 neurons in its first hidden layer. Compute the total number of trainable parameters of the model. (5) CO4 L4
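As an organizational aid only, the sketch below tallies the self-attention and feed-forward parameters per encoder block from the figures given in the question; whether LayerNorm, token/position embeddings, or a pooler are also counted depends on the assumptions stated in the answer, so they are left out here and flagged in comments.

d, heads, blocks = 512, 6, 8              # hidden size, attention heads, encoder blocks
d_ff, num_classes = 128, 3

attn = 4 * (d * d + d)                    # Q, K, V and output projections with biases
ffn = (d * d_ff + d_ff) + (d_ff * d + d)  # d -> d_ff -> d feed-forward with biases
per_block = attn + ffn                    # LayerNorm parameters omitted (assumption)

head = d * num_classes + num_classes      # classification head on the [CLS] vector
# Token embeddings need the vocabulary size (not given); position embeddings
# would add 100 * d if the chosen convention counts them.
print(per_block, blocks * per_block + head)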
ii. Explain how the BERT model is fine-tuned for intent classification, slot filling, and extractive question answering tasks. Explain the input layer, output layer, and loss function used when fine-tuning the model for each of these tasks. (5) CO4 L3
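If ready-made task heads are acceptable as an illustration, the Hugging Face transformers library mirrors the three set-ups directly (the checkpoint name and slot-label count below are placeholders, not part of the question):

from transformers import (BertForSequenceClassification,  # intent: [CLS] -> linear -> softmax, cross-entropy loss
                          BertForTokenClassification,     # slot filling: per-token linear -> softmax, cross-entropy loss
                          BertForQuestionAnswering)       # extractive QA: start/end span logits, cross-entropy over positions

intent = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
slots = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=10)
qa = BertForQuestionAnswering.from_pretrained("bert-base-uncased")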
Q.5 i. A single-layer LSTM model has been designed with a single neuron in the hidden layer. The weight and bias matrices for the forget gate, output gate, input gate, and candidate update are given below (where symbols have their usual meanings). For example, W_fx indicates the weight matrix of the forget gate that takes the x-type input.
W_fx = [0.7 0.45], W_fa = [0.1], b_f = [0.15]; W_ix = [0.95 0.8], W_ia = [0.8], b_i = [0.65]
W_ox = [0.6 0.4], W_oa = [0.25], b_o = [0.1]; W_cx = [0.45 0.25], W_ca = [0.15], b_c = [0.2]
Also consider the input at the second time-stamp and the previous long- and short-term cell contents as follows:
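The numeric inputs for the second time-stamp are not reproduced in this copy of the paper. Purely as a reference for the computation, the sketch below evaluates one LSTM step with the weight values above; the input x_t and the previous cell contents are hypothetical placeholders, not the exam's data.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights and biases from the question (single hidden unit, two-dimensional input).
W_fx, W_fa, b_f = np.array([0.7, 0.45]), 0.1, 0.15
W_ix, W_ia, b_i = np.array([0.95, 0.8]), 0.8, 0.65
W_ox, W_oa, b_o = np.array([0.6, 0.4]), 0.25, 0.1
W_cx, W_ca, b_c = np.array([0.45, 0.25]), 0.15, 0.2

# Hypothetical placeholders -- NOT the exam's second-time-stamp values.
x_t, a_prev, c_prev = np.array([1.0, 2.0]), 0.0, 0.0

f_t = sigmoid(W_fx @ x_t + W_fa * a_prev + b_f)     # forget gate
i_t = sigmoid(W_ix @ x_t + W_ia * a_prev + b_i)     # input gate
o_t = sigmoid(W_ox @ x_t + W_oa * a_prev + b_o)     # output gate
c_hat = np.tanh(W_cx @ x_t + W_ca * a_prev + b_c)   # candidate update
c_t = f_t * c_prev + i_t * c_hat                    # new long-term cell content
a_t = o_t * np.tanh(c_t)                            # new short-term cell content
print(c_t, a_t)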
Q.6 i. What are positional encodings? Why are they required in transformer
models? (2) CO4 L1
ii. A transformer-based machine translation model is trained to translate French sentences to English. Generate the positional encodings for the input French sentence ‘cette image est cliqué par moi’. Use dimensionality (d) = 4 and scaling factor (n) = 100. (4) CO4 L3
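A minimal sketch, assuming the standard sinusoidal scheme PE(pos, 2i) = sin(pos / n^{2i/d}) and PE(pos, 2i+1) = cos(pos / n^{2i/d}), for the six tokens of the given sentence with d = 4 and n = 100:

import numpy as np

def positional_encoding(seq_len, d, n):
    pe = np.zeros((seq_len, d))
    for pos in range(seq_len):
        for i in range(d // 2):
            angle = pos / (n ** (2 * i / d))
            pe[pos, 2 * i] = np.sin(angle)      # even dimensions use sine
            pe[pos, 2 * i + 1] = np.cos(angle)  # odd dimensions use cosine
    return pe

# 'cette image est cliqué par moi' -> 6 tokens, d = 4, n = 100
print(np.round(positional_encoding(6, 4, 100), 4))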
iii. The output of the machine translation model trained in part (ii) is ‘the picture the picture by me’. Compute the BLEU score for this translation, given the following reference translations:
Reference Translation-1: this picture is clicked by me
Reference Translation-2: this picture was clicked by me (4) CO4 L3
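A minimal sketch of the clipped n-gram precisions and brevity penalty for this candidate against the two references; it assumes the standard BLEU-4 geometric mean without smoothing, whereas the course may restrict the computation to fewer n-gram orders.

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, refs, n):
    cand_counts = Counter(ngrams(cand, n))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

cand = "the picture the picture by me".split()
refs = [r.split() for r in ["this picture is clicked by me",
                            "this picture was clicked by me"]]

ps = [modified_precision(cand, refs, n) for n in range(1, 5)]
ref_len = 6                                         # both references have six tokens
bp = 1.0 if len(cand) >= ref_len else exp(1 - ref_len / len(cand))
bleu = bp * exp(sum(log(p) for p in ps) / len(ps)) if all(p > 0 for p in ps) else 0.0
print([round(p, 3) for p in ps], round(bleu, 4))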
Q.7 i. Why can deep neural networks not handle sequential data? (2) CO1 L2
ii. What do you mean by language modeling? How are Recurrent Neural Networks (RNNs) used for language modeling? (2) CO2 L1
iii. What do you mean by the perplexity of a language model? How is it computed? (2) CO2 L1
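For reference, one standard definition of the perplexity of a language model on a held-out sequence w_1, …, w_N, written in LaTeX:

PP(W) = P(w_1, \dots, w_N)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P\big(w_i \mid w_1, \dots, w_{i-1}\big)\right)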
iv. What is the use of a dropout layer in neural networks? How is inverted dropout implemented? (2) CO1 L2
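A minimal sketch of inverted dropout in Python (framework-free, for illustration only): activations are masked and rescaled by the keep probability during training so that no scaling is needed at test time.

import numpy as np

def inverted_dropout(a, keep_prob, training=True):
    if not training:
        return a                                 # test time: use activations as-is
    mask = np.random.rand(*a.shape) < keep_prob  # keep each unit with probability keep_prob
    return a * mask / keep_prob                  # rescale so the expected activation is unchanged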
v. What do you mean by the vanishing gradient problem in RNNs? How is it handled? (2) CO2 L2