NLP Final
1.1 (0 P) Just to be sure: Write your first (given) name, your last (family) name, and your matriculation number (just as a sanity check).
1.2 (3 P) Describe / list the patterns that may be detected by the regular expression
o+u?h|a+h+|hm+
(1 P) What is the possible purpose of this RegEx?
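Not part of the original exam sheet: a minimal sketch, assuming Python's standard re module, of how one could probe which strings this expression fully matches; the sample strings are illustrative assumptions, not taken from the exam.

```python
import re

# The regular expression from question 1.2.
pattern = re.compile(r"o+u?h|a+h+|hm+")

# Illustrative probe strings (assumed, not from the exam sheet).
for s in ["oh", "oooh", "ouh", "ah", "aahh", "hm", "hmmm", "hello"]:
    print(f"{s!r}: {bool(pattern.fullmatch(s))}")
```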
1.3 (2 P) You fit Heaps' law |V| = k·N^β to two different documents and you get the following values:
1. β = 0.99, k = 1
2. β = 0.71, k = 80
Provide a reasonable suggestion of the nature of the documents!
1.4 (2 P) You fit Heaps' law |V| = k·N^β to the Java source code of the JDK class library. Provide a meaningful guess for the value of β you would get and give a short reasoning!
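Not part of the original sheet: a minimal sketch, assuming Python with NumPy, of how values for k and β could be obtained in practice; the whitespace tokenizer and the toy text are placeholder assumptions, and the log-log least-squares fit is just one common estimation method.

```python
import numpy as np

def fit_heaps(tokens):
    """Fit Heaps' law |V| = k * N^beta by linear regression in log-log space."""
    vocab, N, V = set(), [], []
    for i, tok in enumerate(tokens, start=1):
        vocab.add(tok)
        N.append(i)           # number of tokens read so far
        V.append(len(vocab))  # vocabulary size so far
    # log|V| = log k + beta * log N  ->  ordinary least squares on the logs
    beta, log_k = np.polyfit(np.log(N), np.log(V), deg=1)
    return np.exp(log_k), beta

# Placeholder text; in the exam setting this would be the document's token stream.
tokens = "to be or not to be that is the question".split()
k, beta = fit_heaps(tokens)
print(f"k = {k:.2f}, beta = {beta:.2f}")
```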
1.5 (2 P) What are the mathematical consequences in terms of conditional independence when applying a tri-gram approximation for a language model for four-word sequences P(w1 w2 w3 w4) (using no beginning- or end-of-sequence markers)?
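As a reminder of the definitions the question builds on (not part of the original sheet): the chain rule for a word sequence and the general trigram (second-order Markov) approximation of each factor.

```latex
P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1}),
\qquad
P(w_i \mid w_1 \dots w_{i-1}) \approx P(w_i \mid w_{i-2}\, w_{i-1})
```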
Problem 2 Language Models (ctd.), Naive Bayes vs. Logistic Regression, Embeddings (10 credits)
2.1 (3 P) Language models: Absolute Discounting. We have seen: introducing priors → adjusted ("discounted") counts are smaller than the original counts (shifting some probability mass to unseen words / N-grams ("zeros")). Church & Gale 1991: What would the second column of the table ("number of bigrams in first half of data that occurred n times") look like for the second half of the data?
[Slide figure, not fully reproduced here: the Church & Gale (1991) table with columns "n", "number of bigrams in first half of data that occurred n times", "total number of occurrences of ...", "average no. of occurrences of ...", together with the remark "Suppose we wanted to subtract a little from a count of 4 to save ...".]
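Not part of the original sheet: a minimal sketch, assuming Python, of the held-out counting procedure the question refers to (split the data into two halves; for the bigrams occurring n times in the first half, look at their counts in the second half). The toy token stream is a placeholder; Church & Gale worked with two halves of a large newswire corpus.

```python
from collections import Counter, defaultdict

def heldout_table(tokens):
    """For each first-half count n: how many bigrams have it, and their average held-out count."""
    half = len(tokens) // 2
    bigrams = lambda toks: list(zip(toks, toks[1:]))
    c_first = Counter(bigrams(tokens[:half]))
    c_second = Counter(bigrams(tokens[half:]))

    by_n = defaultdict(list)
    for bg, n in c_first.items():
        by_n[n].append(c_second[bg])  # count of the same bigram in the held-out half

    return {n: (len(cs), sum(cs) / len(cs)) for n, cs in sorted(by_n.items())}

tokens = "the cat sat on the mat and the cat sat on the hat".split()
for n, (num_bigrams, avg_heldout) in heldout_table(tokens).items():
    print(f"n = {n}: {num_bigrams} bigrams in first half, average held-out count {avg_heldout:.2f}")
```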
2.2 (3 P) Naive Bayes vs. Logistic Regression: How is P(x, y|θ) mathematically decomposed for a generative classifier, and how is it decomposed for a discriminative classifier? (In your answer, you can e.g. write θ as "theta".)
2.3 (2 P) What is an advantage of Logistic Regression compared to Naive Bayes? State and explain one advantage (not more than one)!
"
2.4 (2 P) Your new PhD student suggests to train Word2Vec Skip-Gram embeddings, replacing the inner-product-based similarity t · c in P(+|t, c) = 1/(1 + exp(−t · c)) with a similarity measure based on the Jensen-Shannon divergence between the vectors t / Σ_i t_i and c / Σ_i c_i. Provide one reasonable counter-argument for doing that (not more than one)!
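For reference (not part of the sheet): the standard Skip-Gram-with-negative-sampling probability that the question starts from, as a minimal NumPy sketch; the example vectors are arbitrary assumptions.

```python
import numpy as np

def p_positive(t, c):
    """Skip-Gram probability that (t, c) is a genuine target/context pair: sigma(t . c)."""
    return 1.0 / (1.0 + np.exp(-np.dot(t, c)))

# Arbitrary illustrative embeddings; note that entries may be negative,
# so t / sum_i t_i is not in general a probability distribution.
t = np.array([0.5, -1.2, 0.3])
c = np.array([0.4, -0.9, 0.1])
print(p_positive(t, c))
```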
"
Problem 3 CCG parsing, Constituency Grammars (10 credits)
3.1 (3 P) In the following incomplete CCG parse, provide the missing three categories!
[Figure, not reproduced here: an incomplete CCG derivation with three blank categories.]
3.2 (4 P) CCG parsing with the A* algorithm: provide expressions for w, x, y and z in terms of the a_i, b_i and c_i, assuming that a1 < a2 < a3 < a4, b1 < b2 and c1 < c2 < c3!
[Figure, not fully reproduced here: an A* parsing chart over "Bayern beats Schalke" with cost axis −log P, chart items such as Bayern: N/N and Bayern beats: N[0,2], the entries w, x, y, z, the goal state, and the initial agenda N/N: a1, NP: a2, S/S: a3, S\S: a4, (S\NP)/NP: b1, N: b2, NP: c1, N/N: c2, S/(S\NP): c3.]
3.3 (3 P) Explain the motivation for subcategorization of verb phrases, especially for training machine-…
"
"
Problem 4 GloVe embeddings, LSTM neurons (10 credits)
4.1 (3 P) Motivation for GloVe embeddings: Given the following table, sort the numbers a1, a2, a3, a4 in ascending order (e.g. a2 = a4 < a1 < a3)!
4.2 (2 P) Motivation for GloVe embeddings: aside from symmetry or group-homomorphism considerations, motivate why setting u_i^T v_k = log P(i|k) also makes intuitive sense!
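For reference (not part of the sheet): in the GloVe setting, P(i|k) is the co-occurrence probability estimated from the word-word count matrix X.

```latex
P(i \mid k) = \frac{X_{ki}}{X_k}, \qquad X_k = \sum_{j} X_{kj}
```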
4.3 (3 P) Assuming you had never heard of contextual embeddings (such as BERT), how can static embeddings (GloVe, Word2Vec etc.) deal with different word senses? Why is that not really practical?
4.4 (2 P) LSTM neurons: why do we use the Hadamard product in connection with the gates and not the …?
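Not part of the original sheet: a minimal NumPy sketch of the LSTM cell-state update under standard notation (f_t and i_t are gate activations, c_t the cell state; biases and the output gate are omitted for brevity). The * operator below is the element-wise (Hadamard) product the question refers to, and all sizes and weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions and randomly initialised parameters (illustrative assumptions).
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W_f, W_i, W_c = (rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
h_prev, c_prev, x_t = rng.normal(size=d_h), rng.normal(size=d_h), rng.normal(size=d_x)

z = np.concatenate([h_prev, x_t])
f_t = sigmoid(W_f @ z)               # forget gate activation
i_t = sigmoid(W_i @ z)               # input gate activation
c_tilde = np.tanh(W_c @ z)           # candidate cell state
c_t = f_t * c_prev + i_t * c_tilde   # Hadamard products: each gate scales each component
print(c_t)
```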
"
"
Problem 5 Modern neural models (10 credits)
5.1 (2 P) Some systems use hybrid combinations of word-based approaches and character-based approaches for neural machine translation. Provide one pro-argument for these approaches and provide one counter-argument for these approaches! Do not provide more than one argument each!
5.2 (2 P) What is the problem with just sticking to the standard word-based softmax architectures when vocabularies become large?
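Not part of the sheet: a tiny NumPy sketch making the setup concrete. The output layer has to produce and normalise one logit per vocabulary entry, so this step grows with |V|; the hidden size and vocabulary size below are arbitrary assumptions.

```python
import numpy as np

d_h, vocab_size = 512, 100_000       # illustrative sizes
W_out = np.zeros((vocab_size, d_h))  # output projection: one row per vocabulary word
h = np.zeros(d_h)                    # decoder hidden state

logits = W_out @ h                   # |V| x d_h matrix-vector product per time step
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # normalisation touches all |V| entries
print(probs.shape)
```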
5.3 (3 P) Another idea to deal with the problems associated with large vocabularies is combining standard word-based softmax architectures with Pointer Networks. What do pointer networks and basic attention have in common?
5.4 (3 P) Paper "Attention is all you need" (Vaswani et al., 2017): Multi-Head Attention is defined as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (5.1)
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)     (5.2)

Someone states: "This is very similar to CNN approaches as in the paper Kim, Y. (2014), 'Convolutional neural networks for sentence classification'."
Provide one argument supporting the statement and one counter-argument! Do not provide more than one argument each!
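Not part of the original sheet: a minimal NumPy sketch of equations (5.1) and (5.2), with scaled dot-product attention as the Attention(...) building block; dimensions and the random projections are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multihead(Q, K, V, h=4, d_model=32):
    d_k = d_model // h
    heads = []
    for _ in range(h):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(Q @ W_Q, K @ W_K, V @ W_V))  # eq. (5.2)
    W_O = rng.normal(size=(h * d_k, d_model))
    return np.concatenate(heads, axis=-1) @ W_O             # eq. (5.1)

X = rng.normal(size=(5, 32))     # toy sequence of 5 token representations
print(multihead(X, X, X).shape)  # (5, 32)
```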
"
"
Problem 6 BERT and GPT (10 credits)
6.1 (3 P) Explain why BERT is not immediately usable for generation (e.g. as a language model)!
6.2 (3 P) Someone suggests using BERT as an encoder and GPT as a decoder in a seq-to-seq architecture. The decoder would attend to the encoder in a similar way as in the original Transformer model (Vaswani et al. 2017). Motivate the usefulness of this architecture for seq-to-seq tasks!
6.3 (4 P) How could such a system as suggested in the previous assignment be trained and used for …
"