Mandatory - Exercise 2
Matr.Nr.: 1655735
Machine Learning for Natural Language Understanding
Exercise WS 2023/2024
(Upload solutions via moodle test)
Good Luck!
1) Use pseudocode to fill in the steps (1 to 4) in such a way that the model goes through the process
of training. Stopping criteria can be ignored. Assign and reuse variables if needed. (5 Points)
Algorithm 1: Generic machine learning model training
input : batches = {samples, targets}, learningrate = λ, Parameters = Θ, loss = MSE,
model = NN
init model(parameters);
for batch in data do
1
2
3
4
end
return model;
Filled-in steps for Algorithm 1:
1: predictions = model(batch.samples);
2: loss_value = loss(predictions, batch.targets);
3: gradients = ∇Θ loss_value;
4: Θ = Θ - λ · gradients;
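A minimal runnable sketch of the same training loop in PyTorch, with made-up data and a simple linear model standing in for NN (all sizes and names below are illustrative, not from the exercise sheet):

import torch
from torch import nn

torch.manual_seed(0)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]  # (samples, targets)

model = nn.Linear(4, 1)                                    # init model(parameters)
loss_fn = nn.MSELoss()                                     # loss = MSE
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate = lambda

for samples, targets in batches:
    predictions = model(samples)                 # step 1: forward pass
    loss_value = loss_fn(predictions, targets)   # step 2: compute the loss
    optimizer.zero_grad()
    loss_value.backward()                        # step 3: compute gradients w.r.t. the parameters
    optimizer.step()                             # step 4: update the parameters with the learning rate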
2) The following table contains predicted values from a simplified linear model (Y_i = β_0 + x_i)
and their true (i.e. expected) counterparts. Show the calculation of the MSE for this model! How
should β_0 be updated to minimize the MSE? (5 Points)
Predicted Expected
2 1
3 2
5 4
8 7
MSE = ((2-1)² + (3-2)² + (5-4)² + (8-7)²) / 4 = (1 + 1 + 1 + 1) / 4 = 4 / 4 = 1
So, the Mean Squared Error (MSE) for this simplified linear model is 1. Every prediction overshoots its expected value by exactly 1, so β_0 should be decreased; lowering β_0 by 1 removes this constant offset and reduces the MSE to 0 (equivalently, gradient descent moves β_0 in the direction of the negative gradient).
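A quick check of this calculation in plain Python; the gradient line assumes the model Y_i = β_0 + x_i, so only the intercept is updated:

predicted = [2, 3, 5, 8]
expected = [1, 2, 4, 7]

squared_errors = [(p - e) ** 2 for p, e in zip(predicted, expected)]
mse = sum(squared_errors) / len(squared_errors)
print(squared_errors, mse)  # [1, 1, 1, 1] 1.0

# Gradient of the MSE with respect to beta_0: 2/N * sum of the residuals
grad_beta0 = 2 / len(predicted) * sum(p - e for p, e in zip(predicted, expected))
print(grad_beta0)  # 2.0 -> positive, so beta_0 must be decreased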
1) Which types of neural networks do you know and for which tasks are they typically used? (2
Points)
Feedforward Neural Networks (FNN): Used for general-purpose classification and regression tasks.
Convolutional Neural Networks (CNNs): Primarily used for image recognition, processing, and computer vision.
Recurrent Neural Networks (RNNs): Suited for sequential data such as time series analysis or natural language
processing.
Long Short-Term Memory Networks (LSTMs): A type of RNN particularly useful for long-term dependencies in
time series and sequence data.
Autoencoders: Used for unsupervised learning tasks such as anomaly detection or feature reduction.
Generative Adversarial Networks (GANs): Applied to generate new data that's similar to the training data,
commonly used for image generation.
2) Explain what distinguishes a Long Short-Term Memory model (LSTM) from a conventional
Recurrent Neural Network (RNN). (3 Points)
LSTMs differ from conventional RNNs mainly by having a memory cell and three gates (input, forget, and
output gates) to control the flow of information. These allow LSTMs to retain long-term dependencies and
mitigate the vanishing gradient problem that hampers traditional RNNs.
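A small illustration of this structural difference, assuming PyTorch and made-up tensor sizes: the LSTM returns a separate cell state in addition to the hidden state, while the plain RNN does not.

import torch
from torch import nn

x = torch.randn(2, 5, 8)  # toy batch: 2 sequences, length 5, 8 features per step

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

out_rnn, h_rnn = rnn(x)               # conventional RNN: hidden state only
out_lstm, (h_lstm, c_lstm) = lstm(x)  # LSTM: hidden state plus a separate memory cell state

print(h_rnn.shape, h_lstm.shape, c_lstm.shape)  # the cell state c is what the plain RNN lacks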
3) Name at least three NLP tasks for which an LSTM is suitable! (3 Points)
LSTMs are well suited to sequence-based NLP tasks, for example: machine translation (as part of sequence-to-sequence
models), language modeling and text generation (predicting the next token from the preceding context), and sequence
labeling tasks such as named entity recognition or part-of-speech tagging. Sentiment classification is another common application.
1) Explain the terms overfitting and underfitting! When can they each occur? (2 Points)
Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in
poor generalization to new data. It typically happens when a model is too complex relative to the simplicity of
the task or the amount of noise in the training data.
Underfitting happens when a model cannot capture the underlying trend of the data, often due to its
simplicity. It typically occurs when a model is too simple to handle the complexity of the task or when there is
insufficient training data.
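A minimal numerical illustration of both effects, using NumPy polynomial fits on made-up noisy linear data (degree 0 underfits, degree 9 overfits the 10 training points):

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=x_train.size)  # noisy linear data
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test                                               # noise-free ground truth

for degree in (0, 1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# degree 0: high train and test error (underfitting)
# degree 9: near-zero train error but clearly higher test error (overfitting)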
2) Explain the differences between parameters and hyperparameters in a machine learning model (3
Points)
Parameters are the configuration variables internal to the model that are learned from the data during
training. They are adjusted automatically to better predict the training data. Examples include weights and
biases in neural networks.
Hyperparameters are settings of the learning algorithm that are fixed before the learning process
begins. They are chosen by the practitioner and are not learned from the data. Examples include the learning rate,
the number of hidden layers, or the batch size.
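A short sketch of the distinction, assuming PyTorch (the sizes and values are arbitrary): the hyperparameters are fixed by hand up front, while the parameters live inside the model and are updated by the optimizer.

import torch
from torch import nn

# Hyperparameters: chosen before training, not learned.
learning_rate = 1e-3
hidden_size = 32

# Parameters: weights and biases inside the model, learned from data during training.
model = nn.Sequential(nn.Linear(4, hidden_size), nn.ReLU(), nn.Linear(hidden_size, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

num_parameters = sum(p.numel() for p in model.parameters())
print(f"{num_parameters} learnable parameters will be updated by the optimizer")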
Show that the network correctly classifies the following data. Assume sgn as activation function (5
Points)
sgn(x) := +1, if x > 0; -1, if x <= 0

x0   x1   Class
 2    1    1
-1    2    1
-3    2   -1
In conclusion, the neural network correctly classifies the first and third inputs, but not the second input, when
using the sign function as the activation function.
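A sketch of how this check can be carried out in Python; the weights and bias below are placeholders for the values given in the exercise's network diagram (not reproduced here), not the actual ones.

def sgn(x):
    return 1 if x > 0 else -1

# Hypothetical weights/bias for illustration only -- substitute the values from the sheet.
w0, w1, bias = 1.0, 1.0, -2.0

data = [((2, 1), 1), ((-1, 2), 1), ((-3, 2), -1)]  # ((x0, x1), expected class)

for (x0, x1), expected in data:
    predicted = sgn(w0 * x0 + w1 * x1 + bias)
    status = "correct" if predicted == expected else "misclassified"
    print(f"({x0}, {x1}): predicted {predicted}, expected {expected} -> {status}")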
1) Name and describe a task usually used for the pretraining of language models (e.g. BERT). (2
Points)
One task commonly used for the pretraining of language models like BERT is Masked Language Modeling
(MLM). In MLM, a percentage of the input tokens are randomly masked, and the model is trained to predict
the original vocabulary id of the masked word based on its context. This enables the model to understand
bidirectional context and learn a rich representation of language syntax and semantics.
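A toy sketch of how MLM training inputs are constructed. Whitespace tokenization and a bare [MASK] token are simplifications; real BERT pretraining uses subword tokens, masks roughly 15% of them, and sometimes substitutes random tokens instead of [MASK].

import random

random.seed(1)
tokens = "the cat sat on the mat".split()
mask_prob = 0.15

masked_input, labels = [], []
for tok in tokens:
    if random.random() < mask_prob:
        masked_input.append("[MASK]")
        labels.append(tok)      # the model is trained to predict the original token here
    else:
        masked_input.append(tok)
        labels.append(None)     # positions that do not contribute to the MLM loss

print("input :", masked_input)
print("labels:", labels)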
2) What are positional embeddings and why are they used in the context of Transformer models? (2 Points)
Positional embeddings are vectors added to the input embeddings in Transformer models to provide
information about the position of tokens in a sequence. Since Transformers process the sequence elements in
parallel rather than sequentially, they lack the inherent notion of order in the input sequence. Positional
embeddings encode the order of the words and enable the model to take into account the position of words
when processing language, which is crucial for understanding the meaning and structure in many linguistic
tasks.
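One concrete scheme is the fixed sinusoidal encoding from the original Transformer paper, sketched below with NumPy (note that BERT instead learns its position embeddings during pretraining):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16) -- added element-wise to the token embeddings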
3) Name at least four downstream tasks at token or text level and briefly explain them. (4 Points)
Named Entity Recognition (NER): A token-level task where the model identifies and classifies named entities
(like names of people, organizations, locations, etc.) in text.
Part-of-Speech Tagging (POS): Another token-level task that involves labeling each word in a sentence with its
appropriate part of speech (noun, verb, adjective, etc.), based on its definition and context.
Sentiment Analysis: A text-level task where the model determines the sentiment expressed in a piece of text,
such as positive, negative, or neutral.
Question Answering (QA): A text-level task that requires the model to answer questions based on a given text
passage. The model must understand the passage and the question to provide a specific answer or a text span
from the passage.
4) Discuss where even the largest language models reach their limits! (2 Points)
Understanding Context: They may struggle with nuanced contexts or with understanding the deep semantics
in complex texts.
Common Sense Reasoning: Language models often lack common sense reasoning or the ability to apply
worldly knowledge that humans consider obvious.
Causality: They can predict statistically likely next words but may not truly understand causal relationships.
Domain-Specific Knowledge: They may underperform in highly specialized domains without extensive fine-tuning.
Bias and Fairness: Large language models can perpetuate and amplify biases present in their training data.
Explainability: They are often seen as "black boxes," with decisions difficult to interpret or explain.
Generalization: While they can generalize well in many cases, they sometimes fail to apply learned knowledge
to fundamentally new or unseen tasks.
Resource Intensity: Training and running large models require significant computational resources, which can
be costly and have environmental impacts.
Benchmarks (10 Points)
G(y, n) := \left[ (y_1, \ldots, y_n),\ (y_2, \ldots, y_{n+1}),\ \ldots,\ (y_{|y|-n+1}, \ldots, y_{|y|}) \right]   (1)

P(\hat{y}, y, n) := \frac{\sum_{g \in G(\hat{y}, n)} \min\left( C(g, \hat{y}, n),\ C(g, y, n) \right)}{\sum_{g \in G(\hat{y}, n)} C(g, \hat{y}, n)}   (3)
1) Calculate the uni-/bi-grams for G(ŷ, 1), G(y, 1), G(ŷ, 2), G(y, 2). (4 Points)
Common n-grams:
- n-grams that appear in both ŷ and y
- Common unigrams: 'a', 'is', 'on', 'the'
- Common bigrams: 'is on', 'on the'
Precision calculation:
- Calculate P(ŷ, y, n) from the counts of the common n-grams
- Since the n-grams are unique within their sequences, each count is 1
Resulting values:
- P(ŷ, y, 1) = 4/6 ≈ 0.67 (4 common unigrams out of the 6 unigrams in ŷ)
- P(ŷ, y, 2) = 2/5 = 0.40 (2 common bigrams out of the 5 bigrams in ŷ)
2) Calculate the uni-/bi-gram precision for P(ŷ, y, 1) and P(ŷ, y, 2). (6 Points)
The unigram and bigram precision for P(ŷ, y, 1) and P(ŷ, y, 2) are:
- Unigram precision P(ŷ, y, 1) = 0.67
- Bigram precision P(ŷ, y, 2) = 0.40
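A small implementation of G(y, n) and P(ŷ, y, n) for reference; the two sentences below are made-up stand-ins, not the ones from the exercise sheet.

from collections import Counter

def ngrams(tokens, n):                    # G(y, n) from equation (1)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision(y_hat, y, n):               # P(y_hat, y, n) from equation (3)
    cand_counts = Counter(ngrams(y_hat, n))
    ref_counts = Counter(ngrams(y, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

y_hat = "a cat is on the mat".split()     # hypothetical candidate sentence
y = "the cat sat on the mat".split()      # hypothetical reference sentence

print(precision(y_hat, y, 1))             # unigram precision
print(precision(y_hat, y, 2))             # bigram precision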