Exam ml4nlp1 HS21: Example Solution
Please DO NOT answer in this PDF file via comments, write in your own document.
Question #:           1   2   3   4   5   6   7   8   9   Sum
Points:               9  10   4   9   8   8   6   9   7    70
Achieved:
Grade (exam):
Grade (assignments):
Final Grade:
Important Information
Maximum number of points: 70 (plan for approx. 1 minute per point)
Please order your answers in the submission document in the order of the questions.
Note: Please answer in English in a concise, clear style in your own words (no copy-paste). Avoid answering with just a single keyword or formula, unless bullet lists are requested. If right and wrong answers are freely mixed without giving a coherent impression, points can be deducted.
• they contain a lot of morphological, syntactic and semantic signals for words
• continuous representation with a lot of information in each component
• not as sparse as one-hot encodings over the vocabulary; fixed-size representations
• they can handle ambiguous words (e.g. Washington as person or city) or words with different aspects (Washington as the capital of the USA, representing the government)
3 (c) What are the downsides of word embeddings in general? Mention 3 aspects.
• they contain the biases of the text data they are trained on
• they are dependent on the training data domain
• due to Zipf's law, many rare words do not get very good representations
• word embeddings are often trained without normalizing/lemmatizing (inflected) input and thus miss generalizations
• it is sometimes hard to grasp what kind of similarity the cosine distance in the vector space actually measures (see the sketch below)
• dealing with the out-of-vocabulary problem (although sub-tokens or character-based representations do a nice job there).
• they do not disambiguate words with multiple meanings.
• it can be difficult to select the best embeddings
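To make the cosine-similarity point above concrete, here is a minimal numpy sketch; the 4-dimensional vectors and the word list are invented toy values, not trained embeddings:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors; 1.0 means same direction.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hand-made toy vectors purely for illustration (not trained embeddings).
    embeddings = {
        "washington": np.array([0.9, 0.8, 0.1, 0.3]),
        "government": np.array([0.8, 0.9, 0.0, 0.2]),
        "river":      np.array([0.1, 0.0, 0.9, 0.8]),
    }

    print(cosine_similarity(embeddings["washington"], embeddings["government"]))  # high
    print(cosine_similarity(embeddings["washington"], embeddings["river"]))       # low
    # The number only tells us that two vectors point in a similar direction;
    # it does not say which kind of similarity (topical, syntactic, ...) is captured.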
2. Tokenization and Text Representation (10 Points) In NLP, different techniques for the representation of textual input have been explored. Let's assume a script system exemplified by the following sentence: "Nobody knows the word jentacular."
6 (a) Describe 3 different NLP techniques for text representation. Give the answer in bullet list format.
4 (b) Select 2 techniques and list their potential benefits and weaknesses. Give the answer in bullet list
format.
• Symbolic BOW representations have problems with similarity that goes beyond word identity; they are more brittle and sparse; they can process arbitrarily long texts; efficient with linear methods (a word-level vs. character-level sketch follows after this list)
• character-based RNNs can better deal with spelling errors or strongly inflecting languages
(but suffer more on long-distance phenomena); they can process arbitrarily long texts
• transformer-based approaches with subwords are good for NLU and can cope with rare
words; they have a fixed input width; efficient parallel computation during training
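As a small illustration of the word-level vs. character-level contrast mentioned above, here is a pure-Python sketch; the tiny "training" vocabulary is invented and the exam sentence is tokenized naively:

    # Word-level bag-of-words vs. character-level view of the exam sentence.
    train_vocab = ["nobody", "knows", "the", "word"]    # assume "jentacular" was never seen
    word2id = {w: i for i, w in enumerate(train_vocab)}

    sentence = "nobody knows the word jentacular".split()

    # Word-level BOW: a sparse count vector over the known vocabulary;
    # the unseen word "jentacular" is simply lost (out-of-vocabulary problem).
    bow = [0] * len(train_vocab)
    oov = []
    for tok in sentence:
        if tok in word2id:
            bow[word2id[tok]] += 1
        else:
            oov.append(tok)
    print(bow, "OOV:", oov)                             # [1, 1, 1, 1] OOV: ['jentacular']

    # Character-level view: every token becomes a sequence of character ids,
    # so there is no OOV problem, at the price of much longer sequences.
    char2id = {c: i for i, c in enumerate(sorted(set("".join(sentence))))}
    print([[char2id[c] for c in tok] for tok in sentence])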
4 3. Perceptron (4 Points) Kevin claims that the Perceptron’s learning algorithm is more related to Hinge
Loss than to Log Loss. Do you agree, and why?
Agreed. Both learning algorithms, the Perceptron and SGD with Hinge Loss, work on the principle of error-driven updates: the weights only change when an example is misclassified (or violates the margin). The Log Loss instead tries to maximize the probability of the true label, which means it keeps making updates even when no error is made by the most probable prediction.
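A minimal numpy sketch with invented toy numbers illustrates the difference: on an example that is already classified correctly with margin, the Perceptron and the Hinge Loss give a zero update, while the Log Loss gradient never fully vanishes:

    import numpy as np

    w = np.array([1.0, -0.5])        # current weights (toy values)
    x = np.array([2.0, 1.0])         # feature vector of one example
    y = 1.0                          # gold label in {-1, +1}
    score = y * np.dot(w, x)         # = 1.5: correct, and the margin (>= 1) is met

    # Perceptron: update only on a misclassification (score <= 0).
    perceptron_update = y * x if score <= 0 else np.zeros_like(w)

    # Hinge loss max(0, 1 - y*w.x): zero (sub)gradient once the margin is met.
    hinge_grad = -y * x if score < 1 else np.zeros_like(w)

    # Log loss log(1 + exp(-y*w.x)): the gradient is never exactly zero,
    # so the weights keep moving even though the prediction is already correct.
    log_grad = -y * x / (1.0 + np.exp(score))

    print(perceptron_update, hinge_grad, log_grad)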
4. Parameter Sharing (9 Points) is a technique that is important in ML for NLP in many neural network
architectures.
3 (a) Name 2 different neural architectures where parameter sharing is used prominently for processing
texts (e.g. for text classification). Explain how they differ.
• CNNs share parameters for each convolution filter. The same filter is applied all over the
input.
• RNNs share parameters for weighting the input and recurrent signals. The same shared
weight matrices are applied for each time step.
• Transformers apply the same FFN for each input subtoken inside a transformer module.
• parameter sharing between pretraining and fine-tuning in transfer learning
It reduces the capacity of the network, helps avoid overfitting and encourages generalization. [Some other answers also give partial points.]
In multitask setups, the same representations and parameters serve different end tasks. The tasks share all or a subset of the parameters. Sometimes the parameters are shared in a softer fashion (coupled during training). [Some other answers also give partial points.]
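A small numpy sketch with invented toy dimensions shows the first two bullets above: the same RNN matrices are reused at every time step, and the same convolution filter is slid over every input position:

    import numpy as np
    rng = np.random.default_rng(0)

    T, d_in, d_hid = 5, 4, 3                  # toy sequence length and layer sizes
    X = rng.normal(size=(T, d_in))            # one input sequence

    # RNN: W_x, W_h and b are created once and reused at every time step.
    W_x = rng.normal(size=(d_in, d_hid))
    W_h = rng.normal(size=(d_hid, d_hid))
    b = np.zeros(d_hid)
    h = np.zeros(d_hid)
    for t in range(T):                        # same parameters, different time steps
        h = np.tanh(X[t] @ W_x + h @ W_h + b)

    # CNN: one shared filter is applied at every position of the input.
    k = 3
    conv_filter = rng.normal(size=(k, d_in))
    feature_map = np.array([np.sum(conv_filter * X[t:t + k]) for t in range(T - k + 1)])
    print(h.shape, feature_map.shape)         # (3,) (3,)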
8 5. GPT (8 Points) How would you explain to a layperson the important ideas behind GPT-3 in your own words?
Here is a sample explanation: GPT uses generative language modeling as pretraining. That means it
learns to predict the next word given a piece of text. Actually, it does not work on "normal" words but on statistical subtokens, that is, a set of frequent character sequences found in texts; in the extreme a subtoken can be just a character or a byte. This allows GPT to represent and produce even rare words with
a limited vocabulary. GPT's architecture is based on transformers that process a fixed number of subtokens in one go. In order not to peek into the future during training, a special masking technique is used (sketched below). GPT uses a lot of parameters and a very deep transformer architecture. With that, it can
predict the next tokens with a high accuracy in longer texts. GPT’s approach to typical NLP tasks is to
turn them into a prompt that elicits further text that is the answer. GPT is especially strong in all tasks
that involve generation of text. For other tasks such as machine translation, dedicated architectures
are typically a lot better.
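A toy numpy sketch of the masking idea (the subtoken sequence and the scores are invented): a causal mask blocks attention to future positions, so during training the model cannot see the tokens it is supposed to predict:

    import numpy as np

    tokens = ["GPT", "predict", "s", "the", "next", "sub", "token"]   # toy subtoken sequence
    n = len(tokens)
    scores = np.random.default_rng(0).normal(size=(n, n))   # toy attention scores (query x key)

    # Causal mask: position i may only look at positions <= i.
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -1e9)                    # blocked positions get ~zero weight

    weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
    print(np.round(weights, 2))                              # upper triangle is (numerically) zero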
8 6. Attention (8 Points) Alfred Dullard does not understand why everybody talks about "attention" in Machine Learning. Can you give him an example of 2 different types of attention and corresponding explanations of why this concept is important?
Attention was first used in seq2seq architectures with an encoder/decoder part, e.g. in Machine Translation. When translating a sentence, it is beneficial to focus on the encoded parts of the input that are most helpful for the decoder. Therefore, this attention often works like a word alignment, but generally it is just a softmax/soft gate applied to the encoded input vectors.
In morphological tasks such as inflection generation, hard attention is used successfully: while decoding the output, the attention is placed entirely on one specific character of the input. This is important because input and output most often have a strongly monotonic alignment at the level of characters (fliegen => flog).
Another form of attention, self-attention, has been introduced by the transformer architecture. There, each subword attends to every other subword in a sentence (or segment), and this "self-attention" models different types of binary relations between (sub)tokens in a sentence. The transformer learns which tokens are in a relevant relation to each other (first for the pretraining MLM task, then for the actual task).
Generally, attention lets the network focus dynamically on the relevant signal in the context (be it the encoded input and/or already produced output).
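A minimal numpy sketch of attention as a soft gate over encoded input vectors (all values are invented toy numbers):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    d = 4
    encoder_states = rng.normal(size=(6, d))   # encoded input tokens (toy values)
    decoder_state = rng.normal(size=d)         # current decoder state acting as the query

    # Score every input position against the query, turn the scores into
    # a soft gate with softmax, and mix the encoded inputs accordingly.
    scores = encoder_states @ decoder_state
    weights = softmax(scores)                  # sums to 1: a "soft alignment"
    context = weights @ encoder_states         # weighted sum fed to the decoder

    print(np.round(weights, 2), context.shape)
    # Self-attention works the same way, except that queries, keys and values
    # are all (projections of) the same token representations.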
A CRF layer is used on top of a sequence labeling architecture to model the output label dependencies in a factored way. Typical CRF layers take into account the transition between neighboring labels; concretely speaking, the model learns a matrix that gives weights to two adjacent output labels. In this way, output label dependencies can be effectively modeled by the architecture.
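A toy numpy sketch of how such a transition matrix contributes to the score of a label sequence (the label set, emission scores and transition weights are all invented):

    import numpy as np

    labels = ["B", "I", "O"]                   # toy label set
    emissions = np.array([                     # per-token label scores from the encoder (toy)
        [2.0, 0.1, 0.3],
        [0.2, 1.5, 0.4],
        [0.1, 0.2, 1.8],
    ])
    transitions = np.array([                   # learned weight for label_prev -> label_cur (toy)
        [0.1, 1.2, 0.3],                       # from B
        [0.2, 0.8, 0.5],                       # from I
        [1.0, -2.0, 0.4],                      # from O: O -> I is strongly penalised
    ])

    def sequence_score(label_ids):
        # Unnormalised CRF score: emission scores plus pairwise transition scores.
        score = sum(emissions[t, l] for t, l in enumerate(label_ids))
        score += sum(transitions[prev, cur] for prev, cur in zip(label_ids, label_ids[1:]))
        return score

    print(sequence_score([0, 1, 2]))   # B I O: plausible sequence, high score
    print(sequence_score([2, 1, 2]))   # O I O: punished by the O -> I transition weight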
3 (b) What are the reasons that certain neural network architectures profit more or less from CRF layers?
If a model has very strong sequence encoding capabilities (e.g. fine-tuned transformers), it not only encodes the dependencies in the input, it also prevents local information from being weighted too strongly without considering the prediction activations around the currently produced output; such a model profits less from a CRF layer. If a sequence encoder is not bidirectional (e.g. a simple left-to-right RNN or CNN encoder), it cannot consider the right context of an item during local decoding. Such a directional architecture (which resembles a typical classic HMM) would profit more strongly from a CRF layer on top.
8. Black Boxes. (9 Points) Neural networks in NLP are often characterized as being black boxes.
3 (a) What does this mean?
If things go wrong, we cannot explain why they didn’t work and cannot fix it easily in the model.
Thus, we cannot know for sure with which new examples the model will have problems.
To fix model behavior, we would need to retrain the model with different training material.
It is also hard to learn from these models what they actually learned.
Or, it is hard to fix certain biases that are learned by the model.
If models are deployed in real-world contexts where they have impact, explanations for their decisions are ethically relevant and necessary. A black-box model has problems delivering these.
3 (c) Which techniques/approaches help us take a look into the black box?
We can visualize the inner states and correlate them with some existing external knowledge.
We can associate neural activation levels with certain categories in the input or output.
We can do ablation experiments that modify certain parts of an architecture and evaluate the results (a toy occlusion-style sketch follows after this list).
We can provide special test sets or properties (e.g. sentence length) that probe for specific questions/evaluations that we are interested in.
We can apply some gradient-based techniques (see the AllenNLP MLM demo) to measure the influence of a token on the loss of a task.
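A toy sketch of the ablation/occlusion idea referenced above: re-score the input with each token left out and treat the score change as that token's influence. The tiny lexicon-based scorer is invented and only stands in for a real black-box model:

    # Leave-one-token-out probe over a stand-in "model".
    lexicon = {"great": 2.0, "boring": -2.0, "not": -0.5}    # hypothetical weights

    def score(tokens):
        # Stand-in for any black-box model that returns a single score.
        return sum(lexicon.get(t, 0.0) for t in tokens)

    sentence = "the plot was great but the ending was boring".split()
    full = score(sentence)
    influence = {tok: full - score(sentence[:i] + sentence[i + 1:])
                 for i, tok in enumerate(sentence)}
    print(sorted(influence.items(), key=lambda kv: -abs(kv[1]))[:3])
    # Gradient-based saliency methods estimate the same kind of per-token
    # influence analytically instead of by re-running the model.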
9. Clustering. (7 Points)
2 (a) What are the principal goals of clustering?
Categorize data items without knowing the categories (or how many) beforehand.
Make the items inside a cluster as similar as possible.
Make the clusters as different as possible from each other.
2 (b) How does this translate into concrete evaluation measurements of quality for topic modeling? Use
the relevant terminology.
Topic exclusivity (e.g. by measuring the exclusivity of top n words in the topics)
Topic coherence (e.g. by measuring the similarity of the top n words inside a topic, e.g. by comparing their cosine similarity in some word embedding vector space; see the sketch below)
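A toy sketch of both measures (the word vectors and the topics are invented): coherence as the average pairwise cosine similarity of a topic's top words, exclusivity as the share of top words that no other topic also ranks highly:

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Invented toy word vectors and topics, purely to illustrate the two measures.
    vec = {
        "goal":  np.array([0.9, 0.1, 0.0]), "match": np.array([0.8, 0.2, 0.1]),
        "team":  np.array([0.7, 0.3, 0.1]), "vote":  np.array([0.1, 0.9, 0.2]),
        "party": np.array([0.2, 0.8, 0.1]), "law":   np.array([0.1, 0.7, 0.3]),
    }
    topics = {"sports": ["goal", "match", "team"], "politics": ["vote", "party", "law"]}

    def coherence(words):
        # Average pairwise cosine similarity of a topic's top-n words.
        pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
        return sum(cos(vec[a], vec[b]) for a, b in pairs) / len(pairs)

    def exclusivity(name):
        # Share of a topic's top-n words that appear in no other topic's top-n.
        others = {w for t, ws in topics.items() if t != name for w in ws}
        return sum(w not in others for w in topics[name]) / len(topics[name])

    for name, words in topics.items():
        print(name, round(coherence(words), 2), exclusivity(name))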