
ML4NLU: Machine Learning for Natural Language Understanding
Exercise WS 2023/2024
Mandatory Exercise, Due: 04.02.2024
(Upload solutions via moodle test)
Good Luck!

Noman Tahir, Matr.Nr.: 1655735

Multiple Choice 10 Points


Are the following statements true or false?

 #  Statement                                                                        True  False
 1  The sigmoid function hθ(x) is smooth and symmetric at x = 0.
 2  BERT is a language model based on RNNs.
 3  A Multilayer Perceptron encodes a simple linear discriminant function.
 4  Every continuous function can be approximated arbitrarily closely by a
    multi-layer Artificial Neural Network.
 5  GPT-4 is known for its language generation abilities.
 6  RNNs capture long-term dependencies.
 7  Hyperparameters should be tuned on the validation set.
 8  An overfitted model performs well on unknown data.
 9  The classification of unbalanced data is measured best with error and accuracy.
10  The harmonic mean of precision and recall is called F-measure.

Aspects of Machine Learning Models 10 Points

1) Use pseudocode to fill in the steps (1 to 4) in such a way that the model goes through the process
of training. Stopping criteria can be ignored. Assign and reuse variables if needed. (5 Points)

Algorithm 1: Generic machine learning model training
input : batches = {samples, targets}, learning_rate = λ, parameters = Θ, loss = MSE, model = NN
init model(parameters);
for batch in data do
    1
    2
    3
    4
end
return model;

input : batches = {samples, targets}, learning_rate = λ, parameters = Θ, loss = MSE, model = NN
init model(parameters);
for batch in batches do
    predictions = model.forward(batch.samples)              // 1: forward pass
    loss_value  = loss(predictions, batch.targets)          // 2: compute loss
    gradients   = loss.backward(loss_value)                 // 3: backward pass to compute gradients
    model.parameters = model.parameters - λ * gradients     // 4: gradient-descent parameter update
end
return model;
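The same loop, written as a minimal PyTorch-style sketch; the model, data, and sizes below are illustrative assumptions, not part of the exercise:

# Minimal PyTorch sketch of the generic training loop above.
# Model, data, and hyperparameter choices are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # stand-in for "NN" with parameters Θ
loss_fn = nn.MSELoss()                   # loss = MSE
lr = 0.01                                # learning rate λ

# Dummy batches: a list of (samples, targets) pairs
batches = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(5)]

for samples, targets in batches:
    predictions = model(samples)                   # 1: forward pass
    loss_value = loss_fn(predictions, targets)     # 2: compute loss
    model.zero_grad()
    loss_value.backward()                          # 3: backward pass (gradients)
    with torch.no_grad():                          # 4: plain SGD parameter update
        for p in model.parameters():
            p -= lr * p.grad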

2) The following table contains predicted values from a simplified linear model (Ŷi = β0 + xi)
and their true (i.e. expected) counterparts. Show the calculation of the MSE for this model! How
should β0 be updated to minimize the MSE? (5 Points)

Predicted Expected
2 1
3 2
5 4
8 7

Step 1: Calculate the squared difference for each data point.

For the first data point:
  Predicted value (Ŷ1) = 2
  Expected value (y1) = 1
  Squared difference = (2 - 1)^2 = 1

For the second data point:
  Predicted value (Ŷ2) = 3
  Expected value (y2) = 2
  Squared difference = (3 - 2)^2 = 1

For the third data point:
  Predicted value (Ŷ3) = 5
  Expected value (y3) = 4
  Squared difference = (5 - 4)^2 = 1

For the fourth data point:
  Predicted value (Ŷ4) = 8
  Expected value (y4) = 7
  Squared difference = (8 - 7)^2 = 1
Step 2: Calculate the average of these squared differences (MSE).

MSE = (1 + 1 + 1 + 1) / 4 = 4 / 4 = 1

So, the Mean Squared Error (MSE) for this simplified linear model is 1. Every prediction is exactly 1
above its expected value, so the gradient of the MSE with respect to β0 is (2/n) Σ (Ŷi − yi) = 2 > 0;
β0 should therefore be decreased (here by 1), which drives the MSE down to 0.
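A quick numerical check of the MSE and of the gradient with respect to β0 (plain Python, values taken from the table above):

# Numerical check of the MSE and of the gradient with respect to β0.
predicted = [2, 3, 5, 8]
expected  = [1, 2, 4, 7]

n = len(predicted)
mse = sum((p - e) ** 2 for p, e in zip(predicted, expected)) / n
grad_beta0 = 2 / n * sum(p - e for p, e in zip(predicted, expected))

print(mse)         # 1.0
print(grad_beta0)  # 2.0 -> positive gradient, so decrease β0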

Neural Networks I 10 Points

1) Which types of neural networks do you know and for which tasks are they typically used? (2
Points)

Feedforward Neural Networks (FNN): Used for general-purpose classification and regression tasks.
Convolutional Neural Networks (CNNs): Primarily used for image recognition, processing, and computer vision.
Recurrent Neural Networks (RNNs): Suited for sequential data such as time series analysis or natural language
processing.
Long Short-Term Memory Networks (LSTMs): A type of RNN particularly useful for long-term dependencies in
time series and sequence data.
Autoencoders: Used for unsupervised learning tasks such as anomaly detection or feature reduction.
Generative Adversarial Networks (GANs): Applied to generate new data that's similar to the training data,
commonly used for image generation.

2) Explain what distinguishes a Long Short-Term Memory model (LSTM) from a conventional
Recurrent Neural Network (RNN). (3 Points)

LSTMs differ from conventional RNNs mainly by having a memory cell and three gates (input, forget, and
output gates) to control the flow of information. These allow LSTMs to retain long-term dependencies and
mitigate the vanishing gradient problem that hampers traditional RNNs.
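As a small illustration of this structural difference, assuming PyTorch and arbitrary tensor sizes: nn.RNN carries only a hidden state, while nn.LSTM additionally carries the gated cell state.

# Illustrative comparison of the recurrent state carried by an RNN vs. an LSTM.
import torch
import torch.nn as nn

x = torch.randn(8, 20, 32)      # (batch, sequence length, features) -- arbitrary sizes

rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
out_rnn, h_n = rnn(x)           # plain RNN: only a hidden state h_n

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
out_lstm, (h_n, c_n) = lstm(x)  # LSTM: hidden state h_n plus a gated cell state c_n

print(h_n.shape, c_n.shape)     # torch.Size([1, 8, 64]) torch.Size([1, 8, 64])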

3) Name at least three NLP tasks for which an LSTM is suitable! (3 Points)

Language Modeling: Predicting the next word in a sentence.


Machine Translation: Translating text from one language to another.
Sentiment Analysis: Determining the sentiment behind text content, such as identifying if a review is positive or
negative.

4) Describe how Transformers handle sequential information (2 Points)

Transformers handle sequential information by using self-attention to process the entire sequence in parallel
and positional encodings to maintain the sequence order, allowing them to efficiently capture long-range
dependencies within the data.
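As a sketch of the self-attention step described above, here is a generic scaled dot-product attention in Python; it is not the exact implementation of any particular Transformer, and the sizes are arbitrary assumptions.

# Generic scaled dot-product self-attention over a whole sequence at once.
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    # q, k, v: (batch, sequence length, d_model)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # pairwise token-to-token scores
    weights = F.softmax(scores, dim=-1)           # attention distribution per token
    return weights @ v                            # weighted mix over all positions

x = torch.randn(2, 7, 16)            # toy sequence, arbitrary sizes
out = self_attention(x, x, x)        # self-attention: Q, K, V all come from x
print(out.shape)                     # torch.Size([2, 7, 16])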

Neural Networks II 10 Points

1) Explain the terms overfitting and underfitting! When can they each occur? (2 Points)

Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in
poor generalization to new data. It typically happens when a model is too complex relative to the simplicity of
the task or the amount of noise in the training data.

Underfitting happens when a model cannot capture the underlying trend of the data, often due to its
simplicity. It typically occurs when a model is too simple to handle the complexity of the task or when there is
insufficient training data.

2) Explain the differences between parameters and hyperparameters in a machine learning model (3
Points)

Parameters are the configuration variables internal to the model that are learned from the data during
training. They are adjusted automatically to better predict the training data. Examples include weights and
biases in neural networks.

Hyperparameters are the settings or configurations of the learning algorithm that are fixed before the
learning process begins. They are set by the practitioner and are not learned from the data. Examples
include the learning rate, the number of hidden layers, or the batch size.
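A small illustration of the distinction, assuming a toy PyTorch model (all names and values are arbitrary): the hyperparameters are fixed before training, while the parameters are what training adjusts.

# Hyperparameters: chosen by the practitioner before training starts.
import torch.nn as nn

hidden_size = 64       # hyperparameter
learning_rate = 1e-3   # hyperparameter

model = nn.Sequential(
    nn.Linear(10, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, 1),
)

# Parameters: the weights and biases learned from data during training.
num_params = sum(p.numel() for p in model.parameters())
print(num_params)      # size of Θ for this toy model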

3) Take a look at the following Neural Network:

[Network diagram: input layer, hidden layer, output layer]

Show that the network correctly classifies the following data. Assume sgn as activation function (5
Points)

 x0   x1   Class
  2    1    1
 -1    2    1
 -3    2   -1

sgn(x) := +1, if x > 0
          -1, if x <= 0

1. For the input (2, 1):
   Hidden layer calculations:
   - Neuron 1: 4·2 + 1·1 − 2 = 8 + 1 − 2 = 7 (before activation)
   - Neuron 2: 2·2 + 1·1 − 3 = 4 + 1 − 3 = 2 (before activation)
   Since both inputs to the sign function are positive, both neurons output +1 after the sign
   activation function is applied.
   Output layer calculation:
   - 1·(+1) + 1·(+1) + 0.5 = 1 + 1 + 0.5 = 2.5
   Since the input to the sign function is positive, the output neuron outputs +1, which matches the
   expected class 1.

2. For the input (−1, 2):
   Hidden layer calculations:
   - Neuron 1: 4·(−1) + 1·2 − 2 = −4 + 2 − 2 = −4 (before activation)
   - Neuron 2: 2·(−1) + 1·2 − 3 = −2 + 2 − 3 = −3 (before activation)
   Since both inputs to the sign function are negative, both neurons output −1 after the sign
   activation function is applied.
   Output layer calculation:
   - 1·(−1) + 1·(−1) + 0.5 = −1 − 1 + 0.5 = −1.5
   Since the input to the sign function is negative, the output neuron outputs −1, which does not
   match the expected class 1.

3. For the input (−3, 2):
   Hidden layer calculations:
   - Neuron 1: 4·(−3) + 1·2 − 2 = −12 + 2 − 2 = −12 (before activation)
   - Neuron 2: 2·(−3) + 1·2 − 3 = −6 + 2 − 3 = −7 (before activation)
   Since both inputs to the sign function are negative, both neurons output −1 after the sign
   activation function is applied.
   Output layer calculation:
   - 1·(−1) + 1·(−1) + 0.5 = −1 − 1 + 0.5 = −1.5
   Since the input to the sign function is negative, the output neuron outputs −1, which matches
   the expected class −1.

In conclusion, the neural network correctly classifies the first and third inputs, but not the second input, when
using the sign function as the activation function.
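The same forward passes can be checked with a few lines of Python, assuming the weights read off the diagram and used in the calculation above (hidden neuron 1: weights (4, 1), bias −2; hidden neuron 2: weights (2, 1), bias −3; output neuron: weights (1, 1), bias 0.5):

# Forward pass of the two-layer sign network, with the weights assumed above.
def sgn(x):
    return 1 if x > 0 else -1

def forward(x0, x1):
    h1 = sgn(4 * x0 + 1 * x1 - 2)       # hidden neuron 1
    h2 = sgn(2 * x0 + 1 * x1 - 3)       # hidden neuron 2
    return sgn(1 * h1 + 1 * h2 + 0.5)   # output neuron

for (x0, x1), expected in [((2, 1), 1), ((-1, 2), 1), ((-3, 2), -1)]:
    # prints the input, the network output, and the expected class;
    # only (-1, 2) is misclassified, as in the calculation above
    print((x0, x1), forward(x0, x1), expected)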

Language Models 10 Points

1) Name and describe a task usually used for the pretraining of language models (e.g. BERT). (2
Points)

One task commonly used for the pretraining of language models like BERT is Masked Language Modeling
(MLM). In MLM, a percentage of the input tokens are randomly masked, and the model is trained to predict
the original vocabulary id of the masked word based on its context. This enables the model to understand
bidirectional context and learn a rich representation of language syntax and semantics.
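A minimal sketch of the masking step itself, in plain Python; the 15% masking rate follows the original BERT setup, while the helper name and mask token here are illustrative, and the real BERT recipe additionally replaces some selected tokens with random or unchanged tokens:

# Randomly mask a fraction of tokens for a masked-language-modeling objective.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)   # model must reconstruct this position
            labels.append(tok)          # original token becomes the target
        else:
            masked.append(tok)
            labels.append(None)         # no prediction loss at this position
    return masked, labels

print(mask_tokens("the cat sat on the mat".split()))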

2) What are positional embeddings and why are they used in the context of Transformer models? (2 Points)

Positional embeddings are vectors added to the input embeddings in Transformer models to provide
information about the position of tokens in a sequence. Since Transformers process the sequence elements in
parallel rather than sequentially, they lack the inherent notion of order in the input sequence. Positional
embeddings encode the order of the words and enable the model to take into account the position of words
when processing language, which is crucial for understanding the meaning and structure in many linguistic
tasks.
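One common choice, used in the original Transformer paper, is fixed sinusoidal positional encodings; a minimal NumPy sketch:

# Sinusoidal positional encodings: each position gets a unique, fixed vector
# built from sines and cosines of geometrically spaced frequencies.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe                                            # added to the token embeddings

print(positional_encoding(seq_len=4, d_model=8).shape)   # (4, 8)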

3) Name at least four downstream tasks at token or text level and briefly explain them. (4 Points)

Named Entity Recognition (NER): A token-level task where the model identifies and classifies named entities
(like names of people, organizations, locations, etc.) in text.

Part-of-Speech Tagging (POS): Another token-level task that involves labeling each word in a sentence with its
appropriate part of speech (noun, verb, adjective, etc.), based on its definition and context.

Sentiment Analysis: A text-level task where the model determines the sentiment expressed in a piece of text,
such as positive, negative, or neutral.

Question Answering (QA): A text-level task that requires the model to answer questions based on a given text
passage. The model must understand the passage and the question to provide a specific answer or a text span
from the passage.

4) Discuss where even the largest language models reach their limits! (2 Points)

Understanding Context: They may struggle with nuanced contexts or with understanding the deep semantics
in complex texts.

Common Sense Reasoning: Language models often lack common sense reasoning or the ability to apply
worldly knowledge that humans consider obvious.

Causality: They can predict statistically likely next words but may not truly understand causal relationships.

Domain-Specific Knowledge: They may underperform in highly specialized domains without extensive fine-
tuning.

Bias and Fairness: Large language models can perpetuate and amplify biases present in their training data.

Explainability: They are often seen as "black boxes," with decisions difficult to interpret or explain.

Generalization: While they can generalize well in many cases, they sometimes fail to apply learned knowledge
to fundamentally new or unseen tasks.

Resource Intensity: Training and running large models require significant computational resources, which can
be costly and have environmental impacts.

Benchmarks 10 Points

G(y, n) := [ (y1, ..., yn), (y2, ..., yn+1), ..., (y|y|−n+1, ..., y|y|) ]                              (1)

C(g, y, n) := |[ g' | g' ∈ G(y, n), g' = g ]|    (number of occurrences of the n-gram g in y)          (2)

P(ŷ, y, n) := ( Σ_{g ∈ G(ŷ,n)} min( C(g, ŷ, n), C(g, y, n) ) ) / ( Σ_{g ∈ G(ŷ,n)} C(g, ŷ, n) )         (3)

ŷ = [a cat is on the mat]


y = [a dog is on the couch]

1 Calculate the uni-/bi-grams for G(ŷ, 1), G(y, 1), G(ŷ, 2), G(y, 2). (4 Points)

1 Unigrams and bigrams:
  - G(ŷ, 1) = [a, cat, is, on, the, mat]
  - G(y, 1) = [a, dog, is, on, the, couch]
  - G(ŷ, 2) = [(a, cat), (cat, is), (is, on), (on, the), (the, mat)]
  - G(y, 2) = [(a, dog), (dog, is), (is, on), (on, the), (the, couch)]

2 Common n-grams (appearing in both ŷ and y):
  - Common unigrams: 'a', 'is', 'on', 'the'
  - Common bigrams: 'is on', 'on the'

3 Precision calculation:
  - P(ŷ, y, n) is calculated from the counts of the common n-grams
  - Since every n-gram occurs only once within its sequence, each count is 1

4 Resulting values:
  - For unigrams (n = 1), the precision is the sum of the minimum counts of the common unigrams divided by
    the total number of unigrams in ŷ
  - For bigrams (n = 2), the precision is the sum of the minimum counts of the common bigrams divided by
    the total number of bigrams in ŷ
  - P(ŷ, y, 1) = 4/6 ≈ 0.67 (4 common unigrams out of 6 unigrams in ŷ)
  - P(ŷ, y, 2) = 2/5 = 0.40 (2 common bigrams out of 5 bigrams in ŷ)
2 Calculate the uni-/bi-gram precision P(ŷ, y, 1) and P(ŷ, y, 2). (6 Points)

1 Identify the common n-grams:
  - The common unigrams and bigrams between ŷ and y were already identified above

2 Calculate the n-gram precision:
  - Precision is the number of common n-grams between ŷ and y divided by the total number of
    n-grams in ŷ

For unigram precision (n = 1):
  - P(ŷ, y, 1) = number of common unigrams of ŷ and y / total number of unigrams in ŷ = 4/6 ≈ 0.67

For bigram precision (n = 2):
  - P(ŷ, y, 2) = number of common bigrams of ŷ and y / total number of bigrams in ŷ = 2/5 = 0.40

The unigram and bigram precisions are:
- Unigram precision P(ŷ, y, 1) = 0.67
- Bigram precision P(ŷ, y, 2) = 0.40
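The computation can be reproduced directly from definitions (1) to (3); the following is a straightforward Python transcription, not a reference BLEU implementation:

# n-gram precision P(ŷ, y, n) as defined in equations (1)-(3).
from collections import Counter

def ngrams(tokens, n):                        # G(y, n)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def precision(hyp, ref, n):                   # P(ŷ, y, n)
    hyp_counts = Counter(ngrams(hyp, n))      # C(g, ŷ, n)
    ref_counts = Counter(ngrams(ref, n))      # C(g, y, n)
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return overlap / sum(hyp_counts.values())

hyp = "a cat is on the mat".split()
ref = "a dog is on the couch".split()
print(round(precision(hyp, ref, 1), 2))  # 0.67
print(round(precision(hyp, ref, 2), 2))  # 0.4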
