Exam ml4nlp1 HS21: Example Solution
Please DO NOT answer in this PDF file via comments, write in your own document.
Question #:           1   2   3   4   5   6   7   8   9   Sum
Points:               9  10   4   9   8   8   6   9   7    70
Achieved:
Grade (exam):
Grade (assignments):
Final Grade:
Important Information
Maximum number of points: 70 (plan for approx. 1 minute per point)
Please order your answers in the submission document in the order of the questions.
Note: Please answer in English in a concise, clear style in your own words (no copy-paste). Avoid answering with just a single keyword or formula, unless bullet lists are requested. If right and wrong answers are freely mixed without giving a coherent impression, points can be deducted.
• they contain a lot of morphological, syntactic and semantic signals for words
• continuous representation with a lot of information in each component
• not as sparse as one-hot encodings over the vocabulary; fixed-size representations
• they can handle ambiguous words (e.g. Washington as person or city) or words with different aspects (Washington as the capital of the USA, representing the government)
3 (c) What are the downsides of word embeddings in general? Mention 3 aspects.
• they contain the biases of the text data they are trained on
• they are dependent on the training data domain
• due to Zipf's law, many rare words do not get very good representations
• word embeddings are often trained without normalizing/lemmatizing (inflected) input and thus miss generalizations
• it is sometimes hard to grasp what kind of similarity the cosine distance in the vector space actually measures (see the sketch below)
• dealing with the out-of-vocabulary problem (although sub-tokens or character-based representations do a nice job there).
• they do not disambiguate words with multiple meanings.
• it can be difficult to select the best embeddings
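To make the cosine-similarity point above concrete, here is a minimal numpy sketch; the 4-dimensional vectors and the word list are invented toy values, not trained embeddings:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors; 1.0 means same direction.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hand-made toy vectors purely for illustration (not trained embeddings).
    embeddings = {
        "washington": np.array([0.9, 0.8, 0.1, 0.3]),
        "government": np.array([0.8, 0.9, 0.0, 0.2]),
        "river":      np.array([0.1, 0.0, 0.9, 0.8]),
    }

    print(cosine_similarity(embeddings["washington"], embeddings["government"]))  # high
    print(cosine_similarity(embeddings["washington"], embeddings["river"]))       # low
    # The number only tells us that two vectors point in a similar direction;
    # it does not say which kind of similarity (topical, syntactic, ...) is captured.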
2. Tokenization and Text Representation (10 Points) In NLP, different techniques for the representation of textual input have been explored. Let's assume a script system exemplified by the following sentence: "Nobody knows the word jentacular."
6 (a) Describe 3 different NLP techniques for text representation. Give the answer in bullet list format.
4 (b) Select 2 techniques and list their potential benefits and weaknesses. Give the answer in bullet list
format.
• Symbolic BOW representations have problems with similarity that goes beyond word identity; they are more brittle and sparse; they can process arbitrarily long texts; efficient with linear methods (a word-level vs. character-level sketch follows after this list)
• character-based RNNs can better deal with spelling errors or strongly inflecting languages
(but suffer more on long-distance phenomena); they can process arbitrarily long texts
• transformer-based approaches with subwords are good for NLU and can cope with rare
words; they have a fixed input width; efficient parallel computation during training
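As a small illustration of the word-level vs. character-level contrast mentioned above, here is a pure-Python sketch; the tiny "training" vocabulary is invented and the exam sentence is tokenized naively:

    # Word-level bag-of-words vs. character-level view of the exam sentence.
    train_vocab = ["nobody", "knows", "the", "word"]    # assume "jentacular" was never seen
    word2id = {w: i for i, w in enumerate(train_vocab)}

    sentence = "nobody knows the word jentacular".split()

    # Word-level BOW: a sparse count vector over the known vocabulary;
    # the unseen word "jentacular" is simply lost (out-of-vocabulary problem).
    bow = [0] * len(train_vocab)
    oov = []
    for tok in sentence:
        if tok in word2id:
            bow[word2id[tok]] += 1
        else:
            oov.append(tok)
    print(bow, "OOV:", oov)                             # [1, 1, 1, 1] OOV: ['jentacular']

    # Character-level view: every token becomes a sequence of character ids,
    # so there is no OOV problem, at the price of much longer sequences.
    char2id = {c: i for i, c in enumerate(sorted(set("".join(sentence))))}
    print([[char2id[c] for c in tok] for tok in sentence])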
4 3. Perceptron (4 Points) Kevin claims that the Perceptron’s learning algorithm is more related to Hinge
Loss than to Log Loss. Do you agree, and why?
Agreed. Both learning algorithms, the Perceptron and SGD with Hinge Loss, work on the principle of error-driven updates: the weights only change when an example is misclassified (or violates the margin). The Log Loss instead tries to maximize the probability of the true label, which means it keeps making updates even when no error is made by the most probable prediction.
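A minimal numpy sketch with invented toy numbers illustrates the difference: on an example that is already classified correctly with margin, the Perceptron and the Hinge Loss give a zero update, while the Log Loss gradient never fully vanishes:

    import numpy as np

    w = np.array([1.0, -0.5])        # current weights (toy values)
    x = np.array([2.0, 1.0])         # feature vector of one example
    y = 1.0                          # gold label in {-1, +1}
    score = y * np.dot(w, x)         # = 1.5: correct, and the margin (>= 1) is met

    # Perceptron: update only on a misclassification (score <= 0).
    perceptron_update = y * x if score <= 0 else np.zeros_like(w)

    # Hinge loss max(0, 1 - y*w.x): zero (sub)gradient once the margin is met.
    hinge_grad = -y * x if score < 1 else np.zeros_like(w)

    # Log loss log(1 + exp(-y*w.x)): the gradient is never exactly zero,
    # so the weights keep moving even though the prediction is already correct.
    log_grad = -y * x / (1.0 + np.exp(score))

    print(perceptron_update, hinge_grad, log_grad)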
4. Parameter Sharing (9 Points) is a technique that is important in ML for NLP in many neural network
architectures.
3 (a) Name 2 different neural architectures where parameter sharing is used prominently for processing
texts (e.g. for text classification). Explain how they differ.
• CNNs share parameters for each convolution filter. The same filter is applied all over the
input.
• RNNs share parameters for weighting the input and recurrent signals. The same shared
weight matrices are applied for each time step.
• Transformers apply the same FFN for each input subtoken inside a transformer module.
• parameter sharing between pretraining and fine-tuning in transfer learning
It reduces the capacity of the network, helps avoid overfitting and encourages generalization. [Some other answers also give partial points.]
In multitask setups, the same representations and parameters serve different end tasks. The tasks share all or a subset of the parameters. Sometimes the parameters are shared in a softer fashion (coupled during training). [Some other answers also give partial points.]
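A small numpy sketch with invented toy dimensions shows the first two bullets above: the same RNN matrices are reused at every time step, and the same convolution filter is slid over every input position:

    import numpy as np
    rng = np.random.default_rng(0)

    T, d_in, d_hid = 5, 4, 3                  # toy sequence length and layer sizes
    X = rng.normal(size=(T, d_in))            # one input sequence

    # RNN: W_x, W_h and b are created once and reused at every time step.
    W_x = rng.normal(size=(d_in, d_hid))
    W_h = rng.normal(size=(d_hid, d_hid))
    b = np.zeros(d_hid)
    h = np.zeros(d_hid)
    for t in range(T):                        # same parameters, different time steps
        h = np.tanh(X[t] @ W_x + h @ W_h + b)

    # CNN: one shared filter is applied at every position of the input.
    k = 3
    conv_filter = rng.normal(size=(k, d_in))
    feature_map = np.array([np.sum(conv_filter * X[t:t + k]) for t in range(T - k + 1)])
    print(h.shape, feature_map.shape)         # (3,) (3,)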
8 5. GPT (8 Points) How would you explain to a layperson the important ideas behind GPT-3 in your own words?
Here is a sample explanation: GPT uses generative language modeling as pretraining. That means it
learns to predict the next word given a piece of text. Actually, it does not work on "normal" words but on statistical subtokens, that is, a set of frequent character sequences found in texts; in the extreme a subtoken can be just a character or a byte. This allows GPT to represent and produce even rare words with
a limited vocabulary. GPT's architecture is based on transformers that process a fixed number of subtokens in one go. In order not to peek into the future during training, a special masking technique is used (sketched below). GPT uses a lot of parameters and a very deep transformer architecture. With that, it can
predict the next tokens with a high accuracy in longer texts. GPT’s approach to typical NLP tasks is to
turn them into a prompt that elicits further text that is the answer. GPT is especially strong in all tasks
that involve generation of text. For other tasks such as machine translation, dedicated architectures
are typically a lot better.
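A toy numpy sketch of the masking idea (the subtoken sequence and the scores are invented): a causal mask blocks attention to future positions, so during training the model cannot see the tokens it is supposed to predict:

    import numpy as np

    tokens = ["GPT", "predict", "s", "the", "next", "sub", "token"]   # toy subtoken sequence
    n = len(tokens)
    scores = np.random.default_rng(0).normal(size=(n, n))   # toy attention scores (query x key)

    # Causal mask: position i may only look at positions <= i.
    mask = np.tril(np.ones((n, n), dtype=bool))
    masked = np.where(mask, scores, -1e9)                    # blocked positions get ~zero weight

    weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)
    print(np.round(weights, 2))                              # upper triangle is (numerically) zero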
8 6. Attention (8 Points) Alfred Dullard does not understand why everybody talks about "attention" in Machine Learning. Can you give him an example of 2 different types of attention and corresponding explanations of why this concept is important?
Attention was first used in seq2seq architectures with an encoder/decoder part, e.g. in Machine Translation. When translating a sentence, it is beneficial to focus on the encoded parts of the input that are most helpful for the decoder. Therefore, this attention often works like a word alignment, but generally it is just a softmax/soft gate applied to the encoded input vectors.
In morphological tasks such as inflection generation, hard attention is used successfully: while decoding the output, the attention is placed entirely on one specific character of the input. This is important because input and output most often have a strongly monotonic alignment at the level of characters (fliegen => flog).
Another form of attention, self-attention, has been introduced by the transformer architecture. There, each subword attends to every other subword in a sentence (or segment), and this "self-attention" models different types of binary relations between (sub)tokens in a sentence. The transformer learns which tokens are in a relevant relation to each other (first for the pretraining MLM task, then for the actual task).
Generally, attention lets the network focus dynamically on the relevant signal in the context (be it the encoded input and/or already produced output).
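A minimal numpy sketch of attention as a soft gate over encoded input vectors (all values are invented toy numbers):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    d = 4
    encoder_states = rng.normal(size=(6, d))   # encoded input tokens (toy values)
    decoder_state = rng.normal(size=d)         # current decoder state acting as the query

    # Score every input position against the query, turn the scores into
    # a soft gate with softmax, and mix the encoded inputs accordingly.
    scores = encoder_states @ decoder_state
    weights = softmax(scores)                  # sums to 1: a "soft alignment"
    context = weights @ encoder_states         # weighted sum fed to the decoder

    print(np.round(weights, 2), context.shape)
    # Self-attention works the same way, except that queries, keys and values
    # are all (projections of) the same token representations.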
A CRF layer is used on top of a sequence labeling architecture to model the output label dependencies in a factored way. Typical CRF layers take into account the transition between neighboring labels; concretely speaking, the model learns a matrix that gives weights to two adjacent output labels. In this way, output label dependencies can be effectively modeled by the architecture.
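A toy numpy sketch of how such a transition matrix contributes to the score of a label sequence (the label set, emission scores and transition weights are all invented):

    import numpy as np

    labels = ["B", "I", "O"]                   # toy label set
    emissions = np.array([                     # per-token label scores from the encoder (toy)
        [2.0, 0.1, 0.3],
        [0.2, 1.5, 0.4],
        [0.1, 0.2, 1.8],
    ])
    transitions = np.array([                   # learned weight for label_prev -> label_cur (toy)
        [0.1, 1.2, 0.3],                       # from B
        [0.2, 0.8, 0.5],                       # from I
        [1.0, -2.0, 0.4],                      # from O: O -> I is strongly penalised
    ])

    def sequence_score(label_ids):
        # Unnormalised CRF score: emission scores plus pairwise transition scores.
        score = sum(emissions[t, l] for t, l in enumerate(label_ids))
        score += sum(transitions[prev, cur] for prev, cur in zip(label_ids, label_ids[1:]))
        return score

    print(sequence_score([0, 1, 2]))   # B I O: plausible sequence, high score
    print(sequence_score([2, 1, 2]))   # O I O: punished by the O -> I transition weight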
3 (b) What are the reasons that certain neural network architectures profit more or less from CRF layers?
If a model has very strong sequence encoding capabilities (e.g. fine-tuned transformers), it not only encodes the dependencies in the input, it also prevents local information from being weighted too strongly without considering the prediction activations around the currently produced output; such a model profits less from a CRF layer. If a sequence encoder is not bidirectional (e.g. a simple left-to-right RNN or CNN encoder), it cannot consider the right context of an item during local decoding. Such a directional architecture (which resembles a typical classic HMM) would profit more strongly from a CRF layer on top.
8. Black Boxes. (9 Points) Neural networks in NLP are often characterized as being black boxes.
3 (a) What does this mean?
If things go wrong, we cannot explain why they didn’t work and cannot fix it easily in the model.
Thus, we cannot know for sure with which new examples the model will have problems.
To fix model behavior, we would need to retrain the model with different training material.
It is also hard to learn from these models what they actually learned.
Or, it is hard to fix certain biases that are learned by the model.
If models are deployed in real-world contexts where they have impact, explanations for their decisions are ethically relevant and necessary. A black-box model has problems delivering these.
3 (c) Which techniques/approaches help us take a look into the black box?
We can visualize the inner states and correlate them with some existing external knowledge.
We can associate neural activation levels with certain categories in the input or output.
We can do ablation experiments that modify certain parts of an architecture and evaluate the results (a toy occlusion-style sketch follows after this list).
We can provide special test sets or properties (e.g. sentence length) that probe for specific questions/evaluations that we are interested in.
We can apply some gradient-based techniques (see the AllenNLP MLM demo) to measure the influence of a token on the loss of a task.
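A toy sketch of the ablation/occlusion idea referenced above: re-score the input with each token left out and treat the score change as that token's influence. The tiny lexicon-based scorer is invented and only stands in for a real black-box model:

    # Leave-one-token-out probe over a stand-in "model".
    lexicon = {"great": 2.0, "boring": -2.0, "not": -0.5}    # hypothetical weights

    def score(tokens):
        # Stand-in for any black-box model that returns a single score.
        return sum(lexicon.get(t, 0.0) for t in tokens)

    sentence = "the plot was great but the ending was boring".split()
    full = score(sentence)
    influence = {tok: full - score(sentence[:i] + sentence[i + 1:])
                 for i, tok in enumerate(sentence)}
    print(sorted(influence.items(), key=lambda kv: -abs(kv[1]))[:3])
    # Gradient-based saliency methods estimate the same kind of per-token
    # influence analytically instead of by re-running the model.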
9. Clustering. (7 Points)
2 (a) What are the principal goals of clustering?
Categorize data items without knowing the categories (or how many) beforehand.
Make the items inside a cluster as similar as possible.
Make the clusters as different as possible from each other.
2 (b) How does this translate into concrete evaluation measurements of quality for topic modeling? Use
the relevant terminology.
Topic exclusivity (e.g. by measuring the exclusivity of top n words in the topics)
Topic coherence (e.g. by measuring the similarity of the top n words inside a topic, e.g. by comparing their cosine similarity in some word embedding vector space; see the sketch below)
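A toy sketch of both measures (the word vectors and the topics are invented): coherence as the average pairwise cosine similarity of a topic's top words, exclusivity as the share of top words that no other topic also ranks highly:

    import numpy as np

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Invented toy word vectors and topics, purely to illustrate the two measures.
    vec = {
        "goal":  np.array([0.9, 0.1, 0.0]), "match": np.array([0.8, 0.2, 0.1]),
        "team":  np.array([0.7, 0.3, 0.1]), "vote":  np.array([0.1, 0.9, 0.2]),
        "party": np.array([0.2, 0.8, 0.1]), "law":   np.array([0.1, 0.7, 0.3]),
    }
    topics = {"sports": ["goal", "match", "team"], "politics": ["vote", "party", "law"]}

    def coherence(words):
        # Average pairwise cosine similarity of a topic's top-n words.
        pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
        return sum(cos(vec[a], vec[b]) for a, b in pairs) / len(pairs)

    def exclusivity(name):
        # Share of a topic's top-n words that appear in no other topic's top-n.
        others = {w for t, ws in topics.items() if t != name for w in ws}
        return sum(w not in others for w in topics[name]) / len(topics[name])

    for name, words in topics.items():
        print(name, round(coherence(words), 2), exclusivity(name))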