Curriculum: Tuesday, February 15, 2022 3:30 PM
1. Data Dependency
2. Hardware Dependency
3. Training Time
4. Feature Selection
5. Interpretability
There are three variants of gradient descent (batch, stochastic, and mini-batch), which differ in how much
data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the
accuracy of the parameter update and the time it takes to perform an
update.
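A minimal Python sketch of this trade-off, assuming a generic grad_fn(params, examples) helper (hypothetical) that returns the gradient of the loss over the given examples:

```python
import numpy as np

def batch_gd(params, data, grad_fn, lr=0.01, epochs=100):
    # Batch GD: one update per epoch, using the full dataset
    for _ in range(epochs):
        params -= lr * grad_fn(params, data)
    return params

def stochastic_gd(params, data, grad_fn, lr=0.01, epochs=100):
    # SGD: one (noisy but cheap) update per training example
    for _ in range(epochs):
        np.random.shuffle(data)
        for example in data:
            params -= lr * grad_fn(params, [example])
    return params

def minibatch_gd(params, data, grad_fn, lr=0.01, epochs=100, batch_size=32):
    # Mini-batch GD: one update per small batch - the usual compromise
    for _ in range(epochs):
        np.random.shuffle(data)
        for i in range(0, len(data), batch_size):
            params -= lr * grad_fn(params, data[i:i + batch_size])
    return params
```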
In artificial neural networks, each neuron forms a weighted sum of its inputs and
passes the resulting scalar value through a function referred to as an activation
function or transfer function. If a neuron has n inputs x1, x2, ..., xn with weights
w1, w2, ..., wn and a bias b, then the output or activation of the neuron is
a = f(w1x1 + w2x2 + ... + wnxn + b), where f is the activation function.
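A minimal NumPy sketch of this computation; tanh is used here purely as an example activation:

```python
import numpy as np

def neuron(x, w, b, activation=np.tanh):
    # Weighted sum of inputs plus bias, passed through the activation function
    z = np.dot(w, x) + b
    return activation(z)

# Example: a neuron with n = 3 inputs
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.7])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))
```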
1. Number of Gates:
• LSTM: Has three gates — input (or update) gate, forget gate, and output gate.
• GRU: Has two gates — reset gate and update gate.
2. Memory Units:
• LSTM: Uses two separate states - the cell state (ct) and the hidden state (ht). The cell
state acts as an "internal memory" and is crucial for carrying long-term dependencies.
• GRU: Simplifies this by using a single hidden state (ht) to both capture and output the
memory.
3. Parameter Count:
• LSTM: Generally has more parameters than a GRU because of its additional gate and
separate cell state. For an input size of d and a hidden size of h, the LSTM has
4 × ((d×h) + (h×h) + h) parameters (see the sketch after this list).
• GRU: Has fewer parameters. For the same sizes, the GRU has 3 × ((d×h) + (h×h) + h) parameters.
4. Computational Complexity:
• LSTM: Due to the extra gate and cell state, LSTMs are typically more computationally
intensive than GRUs.
• GRU: Is simpler and can be faster to compute, especially on smaller datasets or when
computational resources are limited.
5. Empirical Performance:
• LSTM: In many tasks, especially more complex ones, LSTMs have been observed to
perform slightly better than GRUs.
• GRU: Can perform comparably to LSTMs on certain tasks, especially when data is
limited or tasks are simpler. They can also train faster due to fewer parameters.
6. Choice in Practice:
• The choice between LSTM and GRU often comes down to empirical testing. Depending
on the dataset and task, one might outperform the other. However, GRUs, due to their
simplicity, are often the first choice when starting out.
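A small Python sketch of the parameter-count formulas above; the sizes d = 100 and h = 128 are arbitrary examples:

```python
def lstm_param_count(d, h):
    # 4 blocks (input, forget, output gates plus the candidate cell state),
    # each with an input weight matrix (d*h), a recurrent matrix (h*h) and a bias (h)
    return 4 * ((d * h) + (h * h) + h)

def gru_param_count(d, h):
    # 3 blocks (reset gate, update gate, candidate hidden state)
    return 3 * ((d * h) + (h * h) + h)

d, h = 100, 128  # example input and hidden sizes
print("LSTM:", lstm_param_count(d, h))  # 4 * (12800 + 16384 + 128) = 117,248
print("GRU: ", gru_param_count(d, h))   # 3 * (12800 + 16384 + 128) =  87,936
```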
1. Hierarchical Representation
2. Customization for Advanced Tasks
Sequence to Sequence Learning with Neural Networks (Ilya Sutskever et al., 2014) - key details:
Special End-of-Sentence Symbol: Each sentence in the dataset was terminated with a unique
end-of-sentence symbol ("<EOS>"), enabling the model to recognize the end of a sequence.
Dataset: The model was trained on a subset of 12 million sentences, comprising 348 million
French words and 304 million English words, taken from a publicly available dataset.
Reversing Input Sequences: The input sentences (English) were reversed before feeding them
into the model, which was found to significantly improve the model's learning efficiency,
especially for longer sentences.
Word Embeddings: The model used a 1000-dimensional word embedding layer to represent
input words, providing dense, meaningful representations of each word.
Architecture Details: Both the input (encoder) and output (decoder) models had 4 layers, with
each layer containing 1,000 units, showcasing a deep LSTM-based architecture.
Output Layer and Training: The output layer employed a Softmax function to generate the
probability distribution over the target vocabulary. The model was trained end-to-end with
these settings.
Performance - BLEU Score: The model achieved a BLEU score of 34.81, surpassing the baseline
Statistical Machine Translation (SMT) system's score of 33.30 on the same dataset, marking a
significant advancement in neural machine translation.
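A hedged PyTorch sketch of an encoder-decoder with roughly these hyperparameters (4 layers, 1,000 units, 1,000-dimensional embeddings); the vocabulary sizes, class name, and variable names are illustrative assumptions, not taken from the paper:

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=160_000, tgt_vocab=80_000,
                 emb_dim=1000, hidden=1000, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)   # 1000-dim word embeddings
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)           # softmax over target vocabulary

    def forward(self, src_reversed, tgt):
        # src_reversed: source token ids, already reversed as in the paper
        _, state = self.encoder(self.src_emb(src_reversed))
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(dec_out)  # logits; softmax / cross-entropy applied during training
```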
Example sentence: "The man saw the astronomer with a telescope." (ambiguous: the prepositional phrase can attach to "saw" or to "the astronomer")
Normalization in deep learning refers to the process of transforming data or model outputs to
have specific statistical properties, typically a mean of zero and a variance of one.
What do we normalize? Typically the inputs to the network and the activations of hidden layers
(as in batch normalization and layer normalization).
Why do we normalize?
• Stable Training:
○ Normalization helps to stabilize and accelerate the training process by reducing the
likelihood of extreme values that can cause gradients to explode or vanish.
• Faster Convergence:
○ By normalizing inputs or activations, models can converge more quickly because the
gradients have more consistent magnitudes. This allows for more stable updates
during backpropagation.
• Reduced Internal Covariate Shift:
○ Internal covariate shift refers to the change in the distribution of layer inputs during
training. Normalization techniques, like batch normalization, help to reduce this
shift, making the training process more robust.
• Regularization Effect:
○ Batch normalization uses per-mini-batch statistics, which adds a small amount of noise
to the activations and can act as a mild regularizer (see the sketch below).
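A minimal PyTorch sketch of the two most common cases, standardizing raw inputs and normalizing hidden activations with batch normalization (layer sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

# Standardizing raw input features to zero mean and unit variance
x = torch.randn(32, 10) * 5 + 3                        # batch of 32 examples, 10 features
x_norm = (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-8)

# Normalizing hidden activations with batch normalization inside a network
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),   # normalizes each of the 64 activations over the batch
    nn.ReLU(),
    nn.Linear(64, 1),
)
print(model(x_norm).shape)  # torch.Size([32, 1])
```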
Review               Sentiment
Hi Nitish            1
How are you today    0
I am good            0
You?                 1
Embedding dimension - 3
Batch Size - 2
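A hedged Keras sketch of this setup: tokenize the four reviews, pad them to a common length, embed each token into 3 dimensions, and iterate in batches of 2 (the tokenizer and padding choices are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

reviews = ["Hi Nitish", "How are you today", "I am good", "You?"]
labels = [1, 0, 0, 1]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(reviews)
seqs = tokenizer.texts_to_sequences(reviews)      # each review becomes a list of word ids
padded = pad_sequences(seqs, padding="post")      # pad to the longest review (length 4)

embedding = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=3)  # dim = 3
dataset = tf.data.Dataset.from_tensor_slices((padded, labels)).batch(2)       # batch size = 2

for batch_x, batch_y in dataset:
    print(embedding(batch_x).shape)   # (2, 4, 3): batch, timesteps, embedding dim
```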
Inference
1. Shifting
2. Tokenization
3. Embedding
4. Positional Encoding
Query Sentence: "We are friends" (see the sketch below)
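A minimal NumPy sketch of steps 2-4 for this query sentence (the toy vocabulary, d_model = 8, and random embedding table are illustrative assumptions; step 1, shifting, refers to right-shifting the decoder input and is omitted here):

```python
import numpy as np

# Step 2 - Tokenization: map words to ids using a toy vocabulary
vocab = {"we": 1, "are": 2, "friends": 3}
tokens = [vocab[w] for w in "We are friends".lower().split()]   # [1, 2, 3]

# Step 3 - Embedding: look up a d_model-dimensional vector per token
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab) + 1, d_model))
embedded = embedding_table[tokens]                              # shape (3, 8)

# Step 4 - Positional Encoding: sinusoidal encoding added to the embeddings
def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

x = embedded + positional_encoding(len(tokens), d_model)        # model input
print(x.shape)   # (3, 8)
```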