
Exercise #2 28/4/2025

Problems with sequence models:


1. Vanilla RNN
2. LSTM
3. GRU
I. MCQ:
1. Weight-tying in sequence models:
A. Increases model parameters
B. Forces the decoder to use the same embedding matrix as the encoder
C. Only applies to convolutional networks
D. Prevents overfitting completely
2. What is the primary role of the hidden state in a vanilla RNN?
A. To store the model’s weights
B. To capture information from previous time steps
C. To compute the output activation function
D. To initialize the input embeddings
3. In a vanilla RNN, gradient vanishing occurs because:
A. The activation functions produce zero outputs
B. The chain of repeated multiplications by the same weight matrix shrinks gradients exponentially when its spectral radius < 1
C. The learning rate is too high
D. The loss function is not differentiable
4. Which activation function is commonly used inside RNN cells to mitigate vanishing gradients?
A. Sigmoid
B. Tanh
C. ReLU
D. Softmax
5. Which of the following helps prevent overfitting in RNNs during training?
A. Increasing the hidden state size without regularization
B. Applying dropout on recurrent connections
C. Removing all activation functions
D. Using a single-layer RNN only
6. Gradient clipping in RNN training is used to:
A. Prevent gradient explosion by capping the norm
B. Prevent gradient vanishing by boosting small gradients
C. Clip weights instead of gradients
D. Speed up backward propagation
7. In an LSTM cell, the forget gate controls:
A. How much new information can be added from the current input
B. How much past information to retain or discard
C. The activation function applied to the cell state
D. The learning rate for the weight updates
8. In an LSTM cell, the output gate controls:
A. How much of the cell state to expose as the hidden state
B. How much new information can be added to the cell state
C. How much past information to forget
D. The learning rate of the network
9. Which of the following is not a component of an LSTM cell?
A. Input gate
B. Forget gate
C. Reset gate
D. Output gate
10. A GRU differs from an LSTM because it:
A. Has separate input, forget, and output gates
B. Uses a single “update” gate instead of input+forget gates
C. Cannot model long-term dependencies
D. Always outperforms LSTMs on sequence tasks
11. In a GRU cell, the reset gate primarily controls:
A. How much past information to forget when computing the new candidate activation
B. The overall learning rate
C. The nonlinearity applied at the output
D. The dropout rates
12. A key advantage of GRUs over LSTMs is that GRUs:
A. Always achieve higher accuracy
B. Have fewer parameters and can train faster
C. Use convolutional operations
D. Require no gating mechanisms
13. The update gate in a GRU mechanism:
A. Determines how much past information to keep versus update with new input
B. Controls the final output activation
C. Resets the hidden state to zero
D. Computes the attention weights

II. Answer the following questions:


1. Explain why vanilla RNNs struggle with long-term dependencies. Sketch how gradients can vanish or explode through many time steps (a reference sketch of the relevant Jacobian product follows this list).
2. Describe the role of the forget gate in the LSTM cell. How does it help preserve long-term information?
3. Contrast the GRU’s update gate with the LSTM’s forget and input gates. Why might GRUs train faster?
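
A reference sketch for question 1, assuming the standard tanh cell h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h): backpropagating the loss from step T to an earlier step k multiplies one Jacobian per intermediate step,

\frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_T} \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}, \qquad \frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(1 - h_t^{2}\right) W_{hh}.

Each factor's norm is at most \|W_{hh}\| times the largest tanh derivative (itself at most 1), so if the spectral norm of W_{hh} stays below 1 every factor contracts and the product shrinks roughly geometrically in T - k (vanishing gradients); if the Jacobians consistently expand some direction, the product grows geometrically (exploding gradients).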

III. Coding:
a) Implement a vanilla RNN to process sequences of length 50 with hidden size 16 (a starting-point sketch in PyTorch follows this list).
b) Train it on a synthetic task where the label depends only on the first input token.
c) Observe and report how test accuracy changes as sequence length increases from 10 to 100.
d) Explain your observations in terms of vanishing gradients.
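
A minimal PyTorch sketch for parts (a)-(c), assuming a synthetic task in which the label is simply the first token of the sequence; the names used here (make_first_token_task, VanillaRNNClassifier) are illustrative, not prescribed by the exercise:

import torch
import torch.nn as nn

def make_first_token_task(n_samples, seq_len, n_classes=2):
    # Synthetic data: random token sequences; the label is simply the first token.
    x = torch.randint(0, n_classes, (n_samples, seq_len))
    y = x[:, 0].clone()
    return x, y

class VanillaRNNClassifier(nn.Module):
    def __init__(self, n_tokens=2, embed_dim=8, hidden_size=16, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)  # tanh vanilla RNN
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))   # h: (batch, seq_len, hidden)
        return self.head(h[:, -1, :])    # classify from the final hidden state

def test_accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

for seq_len in (10, 25, 50, 100):        # part (c): vary the sequence length
    x_tr, y_tr = make_first_token_task(2000, seq_len)
    x_te, y_te = make_first_token_task(500, seq_len)
    model = VanillaRNNClassifier(hidden_size=16)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(200):                 # short full-batch training loop
        opt.zero_grad()
        loss_fn(model(x_tr), y_tr).backward()
        opt.step()
    print(f"seq_len={seq_len:3d}  test accuracy={test_accuracy(model, x_te, y_te):.3f}")

With a plain tanh RNN, accuracy usually degrades toward chance as seq_len grows, because the gradient from the label must travel back through every time step to reach the first token, which is the observation part (d) asks you to explain.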

IV. Numerical problems with solutions

[1] Forward Pass in a Simple RNN


Given an RNN cell defined as:
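
For reference, a vanilla RNN cell is commonly defined as follows (the exercise's specific weights and inputs apply on top of this form):

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y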
[2] LSTM Gate Computation
For an LSTM cell, you are given the following:
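
For reference, the standard LSTM gate computations (with \sigma the logistic sigmoid and \odot element-wise multiplication) are:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(c_t)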
[3] GRU Reset and Update Gate Dynamics
In a GRU, the update and reset gates are computed as:
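
For reference, the gates of a standard GRU are commonly written as:

z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)   (update gate)
r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)   (reset gate)
\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

(Conventions differ on whether z_t or 1 - z_t weights the previous hidden state; either way, the update gate interpolates between h_{t-1} and the candidate \tilde{h}_t.)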
[4] Backpropagation Through a Single RNN Cell
Consider the RNN cell from problem 1:

Solution:
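
A generic sketch of this backward pass, assuming the standard cell h_t = \tanh(a_t) with a_t = W_{xh} x_t + W_{hh} h_{t-1} + b_h and upstream gradient g = \partial L / \partial h_t (the exercise's particular numbers are not used here):

\partial L / \partial a_t = g \odot (1 - h_t^{2})
\partial L / \partial W_{xh} = (\partial L / \partial a_t)\, x_t^\top
\partial L / \partial W_{hh} = (\partial L / \partial a_t)\, h_{t-1}^\top
\partial L / \partial b_h = \partial L / \partial a_t
\partial L / \partial h_{t-1} = W_{hh}^\top (\partial L / \partial a_t)
\partial L / \partial x_t = W_{xh}^\top (\partial L / \partial a_t)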

[5] Gradient of the Loss with Respect to the Input in an LSTM
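
For reference, under the standard LSTM equations above, x_t enters the cell only through the four gate pre-activations, so (writing W_{\cdot,x} for the input-facing block of each gate's weight matrix and \delta_\cdot for the loss gradient at that gate's pre-activation):

\partial L / \partial x_t = W_{f,x}^\top \delta_f + W_{i,x}^\top \delta_i + W_{c,x}^\top \delta_{\tilde{c}} + W_{o,x}^\top \delta_o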


[6] Sequence Classification Output Gradient

Problem:
A sequence classifier uses the final hidden state h_T of an RNN to predict class scores via
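
Assuming the common setup of a linear output layer followed by softmax and cross-entropy loss (the exercise's exact expression continues in the original statement), a sketch of the output gradient:

z = W_{hy} h_T + b_y, \quad \hat{y} = \mathrm{softmax}(z), \quad L = -\log \hat{y}_c
\partial L / \partial z = \hat{y} - e_c   (e_c: one-hot vector for the true class c)
\partial L / \partial h_T = W_{hy}^\top (\hat{y} - e_c), \qquad \partial L / \partial W_{hy} = (\hat{y} - e_c)\, h_T^\top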
