
Final Exam “Machine Learning for Natural Language

Processing 1”

Questions: Simon Clematide


Exam – January 3rd, 2022
Department of Computational Linguistics
University of Zurich

Please DO NOT answer in this PDF file via comments; write your answers in your own document.

Please include the following personal information in your submission file:

First name:
Last name:
UZH registration number:
MA study program:

Question #: 1 2 3 4 5 6 7 8 9 Sum
Points: 9 10 4 9 8 8 6 9 7 70
Achieved:

Grade (exam):

Grade (assignments):

Final Grade:

Important Information
Maximal number of points: 70 (schedule approx. 1 minute per point)
Please order your answers in the submission document in the order of the questions.
Note: Please answer in English in a concise, clear style in your own words (no copy-paste). Avoid answering with just a single keyword or formula, unless bullet lists are requested. If right and wrong answers are freely mixed without giving a coherent impression, points may be deducted.

Code of Honor (PDF-Link)


Don’t forget to include the following text at the beginning of your answer document: «I hereby confirm that
this assessment in no way violates the Code of Honor for official assessments of the Faculty of Arts and Social
Sciences of the University of Zurich.»

Good luck! Viel Erfolg! Machät’s guät!


Machine Learning for Natural Language Processing 1 HS21 Exam — January 3 2022 — Page 2/6

1. Word Embeddings (9 Points)


4 (a) What are the main reasons that static word embeddings became popular in Machine Learning?

Roughly one point for any of these reasons or variants thereof:

• efficient software became available with word2vec or GloVe
• dense representations are well-suited for neural ML
• only raw text data are needed for generating word embeddings
• no knowledge engineering is required
• WEs gave great performance due to the incorporation of language and world knowledge
• static WEs provided the first well-working pre-training technique
• they contain a lot of morphological, syntactic and semantic signals for words
• continuous representations with a lot of information in each component
• not sparse like one-hot-encoded vocabularies; fixed-size window.
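The "dense representations" point can be made concrete with a tiny sketch (the 4-dimensional toy vectors below are invented purely for illustration; real word2vec/GloVe vectors have hundreds of dimensions): dense embeddings let us measure word similarity with cosine similarity.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, invented for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.2]),
    "queen": np.array([0.7, 0.7, 0.2, 0.2]),
    "apple": np.array([0.1, 0.0, 0.9, 0.8]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```

A symbolic one-hot encoding would give similarity 0 for every pair of distinct words; the dense vectors carry graded similarity in every component.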

2 (b) Why are contextualized word embeddings even more useful?

• they can handle ambiguous words (e.g. Washington as person or city) or words with different aspects (Washington as the capital of the USA, representing the government)
• they take into account the context of a word
• they should be able to represent local constructions, e.g. “not bad” or “of course”
• they normally have better performance than static embeddings in typical NLP tasks (given their improved sequence encoding quality)
• more information is exploited
• they do not mix all meanings of a word into one ‘average’ representation.

3 (c) What are the downsides of word embeddings in general? Mention 3 aspects.

• they contain the biases of the text data they are trained on
• they are dependent on the training data domain
• due to Zipf’s law, many rare words do not have very good representations
• WEs often do not normalize/lemmatize (inflection) the input and miss generalizations
• it is sometimes hard to grasp the nature of similarity that the cosine distance measures in the vector space
• dealing with the out-of-vocabulary problem (although sub-tokens or character-based representations do a nice job there)
• they do not disambiguate words with multiple meanings
• it can be difficult to select the best embeddings.

2. Tokenization and Text Representation (10 Points) In NLP, different techniques for the representation of textual input have been explored. Let’s assume a script system exemplified by the following sentence: “Nobody knows the word jentacular.”

6 (a) Describe 3 different NLP techniques for text representation. Give the answer in bullet list format.

• Word Embeddings vs. symbolic representations (sometimes also with word shapes)
• Bag-of-Words with symbolic representation in sparse vectors
• Character-based encoding with RNNs
• Subwords with transformers
• Local n-gram windows of embedded words with CNNs
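A minimal sketch of the Bag-of-Words technique from the list above (the vocabulary and the handling of punctuation are simplified for illustration): each text becomes a sparse count vector over a fixed symbolic vocabulary.

```python
from collections import Counter

def bag_of_words(text: str, vocab: list[str]) -> list[int]:
    """Map a text to a sparse count vector over a fixed symbolic vocabulary."""
    counts = Counter(text.lower().rstrip(".").split())
    return [counts[w] for w in vocab]

vocab = ["nobody", "knows", "the", "word", "jentacular", "everyone"]
vec = bag_of_words("Nobody knows the word jentacular.", vocab)
print(vec)  # [1, 1, 1, 1, 1, 0]
```

Note how the representation is purely symbolic: "jentacular" gets its own dimension regardless of how rare it is, and any word outside the vocabulary is simply dropped, which is exactly the OOV weakness discussed in (b).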

4 (b) Select 2 techniques and list their potential benefits and weaknesses. Give the answer in bullet list
format.

• Symbolic BOW representations have problems with similarity that goes beyond word identity; they are more brittle and sparse; they can process arbitrarily long texts; they are efficient with linear methods
• Character-based RNNs can better deal with spelling errors or strongly inflecting languages (but suffer more on long-distance phenomena); they can process arbitrarily long texts
• Transformer-based approaches with subwords are good for NLU and can cope with rare words; they have a fixed input width; efficient parallel computation during training

4 3. Perceptron (4 Points) Kevin claims that the Perceptron’s learning algorithm is more related to Hinge
Loss than to Log Loss. Do you agree, and why?

Agreed. Both the Perceptron algorithm and SGD with Hinge Loss update the weights only when an error (or margin violation) occurs. Log Loss instead tries to maximize the probability of the true solution, which means updates happen even when the most probable prediction is already correct.
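The difference can be sketched in a few lines (the weight vector, example, and learning rate are toy values chosen for illustration): the Perceptron and hinge-loss updates skip examples that are classified correctly (with margin), while the log-loss gradient never fully vanishes.

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    """Perceptron: update only when the prediction is wrong."""
    if y * np.dot(w, x) <= 0:
        return w + lr * y * x
    return w

def hinge_update(w, x, y, lr=1.0):
    """SGD on hinge loss max(0, 1 - y*w.x): update only inside the margin."""
    if y * np.dot(w, x) < 1:
        return w + lr * y * x
    return w

def logloss_update(w, x, y, lr=1.0):
    """SGD on log loss: the gradient is nonzero even for correct predictions."""
    p = 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))
    return w + lr * (1.0 - p) * y * x

w = np.array([2.0, -1.0])
x = np.array([1.0, 0.5])   # confidently correct example: y * w.x = 1.5
y = 1

print(perceptron_update(w, x, y))  # unchanged: no error
print(hinge_update(w, x, y))       # unchanged: margin satisfied
print(logloss_update(w, x, y))     # still moves the weights
```

On a misclassified example (y·w·x ≤ 0) the Perceptron and hinge updates coincide; only the log-loss update keeps pushing on examples that are already correct, which is Kevin's point.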

4. Parameter Sharing (9 Points) is a technique that is important in ML for NLP in many neural network
architectures.
3 (a) Name 2 different neural architectures where parameter sharing is used prominently for processing
texts (e.g. for text classification). Explain how they differ.

• CNNs share parameters for each convolution filter. The same filter is applied all over the input.
• RNNs share parameters for weighting the input and recurrent signals. The same shared weight matrices are applied at each time step.
• Transformers apply the same FFN to each input subtoken inside a transformer module.
• Parameter sharing between pretraining and fine-tuning in transfer learning.
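The RNN case can be sketched minimally (random toy weights and dimensions, purely illustrative): one set of weight matrices is reused at every time step, so the parameter count is independent of the sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3

# One shared set of parameters, reused at EVERY time step.
W_x = rng.normal(size=(d_hid, d_in))
W_h = rng.normal(size=(d_hid, d_hid))

def rnn_forward(inputs):
    """Simple RNN: the same W_x and W_h are applied at each time step."""
    h = np.zeros(d_hid)
    for x_t in inputs:                     # sequence length is arbitrary ...
        h = np.tanh(W_x @ x_t + W_h @ h)   # ... but the parameters never change
    return h

seq = [rng.normal(size=d_in) for _ in range(7)]
print(rnn_forward(seq).shape)  # (3,) — independent of sequence length
```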

3 (b) Why do we do parameter sharing in Machine Learning?

It reduces the capacity of the network, avoids overfitting and enforces generalization. [Some other
answers also give partial points.]

3 (c) How is multitask learning related with parameter sharing?


Machine Learning for Natural Language Processing 1 HS21 Exam — January 3 2022 — Page 4/6

In multitask setups, the same representations and parameters serve different end tasks. The tasks share all or a subset of the parameters. Sometimes the parameters are shared in a softer fashion (coupled during training). [Some other answers also give partial points.]

8 5. GPT (8 Points) How would you explain to a layperson the important ideas behind GPT-3 in your own words?

Here is a sample explanation: GPT uses generative language modeling as pretraining. That means it learns to predict the next word given a piece of text. Actually, it does not work on “normal” words but on statistical subtokens, that is, a set of frequent character sequences found in texts; in the extreme this can just be a character or a byte. This allows GPT to represent and produce even rare words with a limited vocabulary. GPT’s architecture is based on transformers that process a fixed number of subtokens in one go. In order not to peek into the future during training, a special masking technique is used. GPT uses a lot of parameters and a very deep transformer architecture. With that, it can predict the next tokens with high accuracy in longer texts. GPT’s approach to typical NLP tasks is to turn them into a prompt that elicits further text that constitutes the answer. GPT is especially strong in all tasks that involve generation of text. For other tasks such as machine translation, dedicated architectures are typically a lot better.
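The "not peeking into the future" masking can be sketched as follows (a simplified causal mask over dummy uniform attention scores, not GPT's actual implementation): each position may only attend to itself and earlier positions.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Set future positions to -inf before the softmax so they get zero weight."""
    masked = np.where(causal_mask(scores.shape[0]), scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))          # dummy uniform attention scores
weights = masked_softmax(scores)
print(weights[0])  # [1. 0. 0. 0.] — the first token sees only itself
print(weights[3])  # [0.25 0.25 0.25 0.25] — the last token sees everything
```

During training the whole sequence is processed in parallel, but thanks to the mask the prediction for each position is based only on its past — which is what makes next-token pretraining on raw text possible.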

8 6. Attention (8 Points) Alfred Dullard does not understand why everybody talks about "attention" in Machine Learning. Can you give him examples of 2 different types of attention and corresponding explanations of why this concept is important?

Attention was first used in seq2seq architectures with an encoder/decoder part, e.g. in Machine Translation. When translating a sentence, it is beneficial to focus on the encoded parts of the input that are most helpful for the decoder. This attention often works like a word alignment, but in general it is just a softmax/soft gate applied to the encoded input vectors.
In morphological tasks such as inflection generation, an even harder form of attention is successfully used. While decoding the output, hard attention is placed on a specific character in the input. This is important because input and output most often have a strong monotonic alignment at the level of characters (fliegen => flog).
Another form of attention, self-attention, has been introduced by the transformer architecture. There, each subword attends to every other subword in a sentence (or segment), and this “self-attention” models different types of binary relations between (sub)tokens in a sentence. The transformer learns which tokens are in a relevant relation to each other (first for the pretraining MLM task, then for the actual task).
Generally, attention lets the network focus dynamically on relevant signals in the context (be it the encoded input and/or the already produced output).
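The "softmax/soft gate over encoded input vectors" can be sketched minimally (one-hot toy encodings and constant value vectors, chosen so the result is easy to verify by hand):

```python
import numpy as np

def attention(query, keys, values):
    """Soft attention: a softmax over similarity scores gates the encoded inputs."""
    scores = keys @ query / np.sqrt(query.shape[0])  # one score per input position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                # softmax: weights sum to 1
    return weights, weights @ values                 # weighted sum of value vectors

keys = np.eye(5, 8)                                  # 5 toy one-hot "encodings", dim 8
values = np.arange(5.0)[:, None] * np.ones((5, 8))   # value i is the constant vector i
query = 10.0 * keys[2]                               # a query strongly matching position 2

weights, context = attention(query, keys, values)
print(int(weights.argmax()))  # 2 — attention focuses on input position 2
print(context[:3])            # ≈ 2.0 everywhere: the output is pulled to value 2
```

The decoder never makes a hard choice: every input position contributes, but the softmax concentrates the mass on the relevant one — the "soft gate" described above.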

7. CRF Layers (6 Points)


3 (a) What are CRF layers? What kind of issues are they supposed to solve?

A CRF layer is used on top of a sequence labeling architecture to model the output label dependencies in a factored way. Typical CRF layers take into account the transition between neighboring labels; concretely speaking, the model learns a matrix that gives weights to two adjacent output labels. In this way, output label dependencies can be effectively modeled by the architecture.
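Decoding with such a transition matrix can be sketched with Viterbi (the BIO scores and transition weights below are toy values invented for illustration; a real CRF layer learns the transition matrix during training):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding over per-token label scores plus a
    label-to-label transition matrix (the core of a CRF layer)."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions     # (prev label, current label)
        back[t] = cand.argmax(axis=0)           # best previous label per current label
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):               # follow the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy BIO tagging: labels O=0, B=1, I=2; "I" must not follow "O".
emissions = np.array([[0.1, 2.0, 0.0],    # token 1 looks like B
                      [1.1, 0.0, 1.0],    # token 2: ambiguous between O and I
                      [2.0, 0.0, 0.0]])   # token 3 looks like O
transitions = np.array([[0.0, 0.0, -9.0],   # O -> {O, B, I}: forbid O -> I
                        [0.0, 0.0,  1.0],   # B -> {O, B, I}: reward B -> I
                        [0.0, 0.0,  1.0]])  # I -> {O, B, I}: reward I -> I
print(viterbi(emissions, transitions))  # [1, 2, 0]: B I O
```

Note that for the second token the per-token scores alone would prefer O, but the learned B→I transition weight lets the decoder pick I — exactly the output-label dependency that a purely local softmax layer cannot model.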

3 (b) What are the reasons that certain neural network architectures profit more or less from CRF layers?

If a model has a very strong sequence encoding capability (e.g. fine-tuned transformers), it not only encodes the dependencies of the input but also prevents local information from being weighted too strongly without considering the prediction activations around the currently produced output; such models profit little from a CRF layer. If a sequence encoder is not bidirectional (e.g. a simple left-to-right RNN or CNN encoder), it cannot consider the right context of an item during local decoding. Such a directional architecture (which resembles a typical classic HMM) profits more strongly from a CRF layer on top.

8. Black Boxes. (9 Points) Neural networks in NLP are often characterized as being black boxes.
3 (a) What does this mean?

It is difficult to attribute a task-specific meaning to the components (weights/activations) of a model.
The information/knowledge in the model is distributed in the model weights.
Thus, it is difficult to know why models produce a specific output, be it correct or wrong.

3 (b) Why is this an issue?

If things go wrong, we cannot explain why they did not work and cannot easily fix them in the model.
Thus, we cannot know for sure with which new examples the model will have problems.
To fix model behavior, we would need to retrain the model with different training material.
It is also hard to learn from these models what they actually learned, and hard to fix certain biases that the model has picked up.
If models are deployed in real-world contexts where they have impact, explanations for their decisions are ethically relevant and necessary. A black-box model struggles to deliver that.

3 (c) Which techniques/approaches help that we can take a look into the black box?

We can visualize the inner states and correlate them with some existing external knowledge.
We can associate neural activation levels with certain categories in the input or output.
We can do ablation experiments that modify certain parts of an architecture and evaluate the results.
We can provide special test sets or properties (e.g. sentence length) that probe for specific questions/evaluations that we are interested in.
We can apply gradient-based techniques (see the AllenNLP MLM demo) to measure the influence of a token on the loss of a task.

9. Clustering. (7 Points)
2 (a) What are the principal goals of clustering?

Categorize data items without knowing the categories (or how many) beforehand.
Make the items inside a cluster as similar as possible.
Make the clusters as different as possible from each other.
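These goals can be sketched with a minimal k-means implementation (toy 2-D "document vectors" and deterministic seed points, chosen purely for illustration): items inside a cluster end up close to their centroid, and the centroids end up far apart.

```python
import numpy as np

def kmeans(X, centroids, iters=10):
    """Minimal k-means: alternate point assignment and centroid updates."""
    centroids = centroids.copy()
    for _ in range(iters):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Two clearly separated groups of toy "document vectors".
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
              [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels = kmeans(X, centroids=X[[0, 3]])  # one seed point from each group
print(labels)  # [0 0 0 1 1 1] — similar items share a cluster
```

Note that the algorithm never sees category labels — it only uses similarity in the vector space, which is exactly the "categorize without knowing the categories" goal above.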

2 (b) How does this translate into concrete evaluation measurements of quality for topic modeling? Use
the relevant terminology.

Topic exclusivity (e.g. by measuring the exclusivity of top n words in the topics)
Topic coherence (e.g. by measuring the similarity of top n words inside a topic; e.g. by comparing
their cosine similarity in some word embedding vector space)

3 (c) What are 3 typical difficulties in clustering of text data?



Labeling of clusters; number of clusters; interpretation of clusters; preprocessing of texts; handling of outliers; (maybe half a point: high dimensionality of input vectors).
