NLP Final

The document consists of a series of problems related to regular expressions, language models, and various machine learning techniques including Naive Bayes, Logistic Regression, and embeddings. It includes tasks such as fitting Heaps' law to documents, comparing classifiers, and discussing the implications of different parsing techniques. Additionally, it covers modern neural models, including BERT and GPT, and their applications in sequence-to-sequence tasks.


Problem 1 Regular Expressions, Heaps’ Law, Language Models (10 credits)

1.1 (0 P) Just to be sure: Write your first (given) name, your last (family) name, and your matriculation number (just as a sanity check).

1.2 (3 P) Describe / list the patterns that may be detected by the regular expression
o+u?h|a+h+|hm+
(1 P) What is the possible purpose of this RegEx?
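For experimenting with the pattern, a minimal Python sketch; the test strings below are made-up examples, not taken from the exam:

```python
import re

# The pattern from 1.2: "o...(u)h", "a...h..." or "hm..." style tokens.
pattern = re.compile(r"o+u?h|a+h+|hm+")

# Made-up test tokens; fullmatch shows which whole tokens the RegEx accepts.
for token in ["oh", "oooh", "ouh", "ah", "aaahh", "hm", "hmmm", "ha", "uh"]:
    print(token, bool(pattern.fullmatch(token)))
```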

1.3 (2 P) You fit Heaps' law |V| = kN^β to two different documents and you get the following values:

1. β = 0.99, k = 1
2. β = 0.71, k = 80

Provide a reasonable suggestion of the nature of the documents!

1.4 (2 P) You fit Heaps' law |V| = kN^β to the Java source code of the JDK class library. Provide a meaningful guess for the value of β you would get and give a short reasoning!
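One common way to fit Heaps' law to a tokenized text is a linear regression in log-log space, since log|V| = log k + β log N. A short sketch, where the input file and the whitespace tokenization are placeholder assumptions:

```python
import numpy as np

def fit_heaps(tokens):
    """Fit |V| = k * N**beta by linear regression of log|V| on log N."""
    vocab, sizes = set(), []
    for token in tokens:
        vocab.add(token)
        sizes.append(len(vocab))          # |V| after seeing the first N tokens
    N = np.arange(1, len(tokens) + 1)
    beta, log_k = np.polyfit(np.log(N), np.log(sizes), 1)
    return np.exp(log_k), beta

# Placeholder corpus: any whitespace-tokenized text file works here.
tokens = open("document.txt").read().split()
k, beta = fit_heaps(tokens)
print(f"k = {k:.2f}, beta = {beta:.2f}")
```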

1.5 (2 P) What are the mathematical consequences in terms of conditional independence when applying a tri-gram approximation for a language model for four word sequences P(w1 w2 w3 w4) (using no beginning- or end-of-sequence markers)?
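As a general reminder (not the model answer), the chain-rule factorization of a four-word sequence and its trigram (second-order Markov) approximation can be written as:

```latex
P(w_1 w_2 w_3 w_4) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\,P(w_4 \mid w_1 w_2 w_3)
                 \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\,P(w_4 \mid w_2 w_3)
```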
Problem 2 Language Models (ctd.), Naive Bayes vs. Logistic Regression, Embeddings (10 credits)
2.1 (3 P) Language models, Absolute Discounting (Church & Gale 1991): What would the second column of the table below ("N, number of bigrams in first half of data that occurred n times") look like for the second half of the data?

From the lecture slide reproduced with the question: we have seen that introducing priors means the adjusted ("discounted") counts are smaller than the original counts, shifting some probability mass to unseen words / n-grams ("zeros"). Suppose we wanted to subtract a little from a count of 4 to save probability mass for the zeros: how much should we subtract? Church & Gale (1991) divided a 44 × 10^6 word corpus into two 22 × 10^6 word halves and asked, for all bigrams that occurred exactly n times in the first half, how often they occur on average in the second half. The table suggests subtracting ≈ 0.75 (the probability mass shifted to the zeros).

n | number of bigrams in first half of data that occurred n times | total occurrences of these bigrams in second half of data | average occurrences of these bigrams in second half of data
0 | 74 671 100 000 | 2 019 187 | 0.000027
1 | 2 018 046 | 903 206 | 0.448
2 | 449 721 | 564 153 | 1.25
3 | 188 933 | 424 015 | 2.24
4 | 105 664 | 341 099 | 3.23
5 | 68 379 | 287 776 | 4.21
6 | 48 190 | 251 951 | 5.23
7 | 35 709 | 221 693 | 6.21
8 | 27 710 | 199 779 | 7.21
9 | 22 280 | 183 971 | 8.26
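A minimal sketch, assuming a toy list of bigrams and the discount d ≈ 0.75 motivated by the table, of how absolute discounting reassigns probability mass; the toy corpus is invented for illustration:

```python
from collections import Counter

def discounted_bigram_probs(bigrams, d=0.75):
    """Absolute discounting: subtract d from every observed bigram count; the freed
    mass per context w1 is what gets redistributed to unseen continuations of w1."""
    bigram_counts = Counter(bigrams)
    context_counts = Counter(w1 for w1, _ in bigrams)
    probs = {(w1, w2): (c - d) / context_counts[w1]
             for (w1, w2), c in bigram_counts.items()}
    reserved = {w1: d * sum(1 for b in bigram_counts if b[0] == w1) / context_counts[w1]
                for w1 in context_counts}
    return probs, reserved

# Made-up toy data: bigrams from a tiny corpus.
bigrams = [("the", "cat"), ("the", "dog"), ("the", "cat"), ("a", "cat")]
probs, reserved = discounted_bigram_probs(bigrams)
print(probs)     # discounted P(w2|w1) for seen bigrams
print(reserved)  # probability mass shifted to unseen continuations per context w1
```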

2.2 (3 P) Naive Bayes vs. Logistic Regression: How is P(x, y | θ) mathematically decomposed for a generative classifier, and how is it decomposed for a discriminative classifier? (In your answer, you can e.g. write θ as "theta").

2.3 (2 P) What is an advantage of Logistic Regression compared to Naive Bayes? State and explain one advantage (not more than one)!
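To see the practical difference between the two classifiers on a text task, a quick scikit-learn comparison could look like the sketch below; the dataset, the two newsgroup categories and the pipeline are placeholder assumptions chosen only for illustration:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder data: two newsgroup categories as a binary text classification task.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, data.data, data.target, cv=5)
    print(name, scores.mean())
```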
"
2.4 (2 P) Your new PhD student suggests training Word2Vec Skip-Gram embeddings with the inner-product based similarity t · c in P(+|t, c) = 1/(1 + exp(−t · c)) replaced by a similarity measure based on the Jensen-Shannon divergence between the vectors t/Σ_i t_i and c/Σ_i c_i. Provide one reasonable counter-argument for doing that (not more than one)!
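A small numpy sketch contrasting the two similarity notions; the target/context vectors are random placeholders, and note that the Jensen-Shannon divergence is only well defined once the vectors are turned into non-negative, normalized distributions:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
t, c = rng.normal(size=50), rng.normal(size=50)   # made-up target/context vectors

# Standard Skip-Gram with negative sampling: sigmoid of the inner product.
p_pos = 1.0 / (1.0 + np.exp(-(t @ c)))

# Suggested alternative: JS divergence between t/sum(t) and c/sum(c).
# This is only well defined for non-negative vectors, which embedding vectors
# generally are not; taking absolute values here is a crude fix just to run it.
t_pos, c_pos = np.abs(t), np.abs(c)
js_distance = jensenshannon(t_pos / t_pos.sum(), c_pos / c_pos.sum())
print(f"sigmoid(t.c) = {p_pos:.3f}, JS distance = {js_distance:.3f}")
```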
"
Problem 3 CCG parsing, Constituency Grammars (10 credits)

3.1 (3 P) In the following incomplete CCG parse, provide the missing three categories!

3.2 (4 P) CCG parsing with the A* algorithm: provide expressions for w, x, y and z in terms of the a_i, b_i and c_i, assuming that a1 < a2 < a3 < a4, b1 < b2 and c1 < c2 < c3!
[Figure for 3.1: an incomplete CCG derivation of "Bayern beats Schalke", combining categories by forward application (>) into S\NP and by backward application (<) into S, with three categories left blank.]

[Figure for 3.2: an A* chart for "Bayern beats Schalke" with costs given as −log10 of the probabilities. Initial agenda: Bayern: N/N: a1, NP: a2, S/S: a3, S\S: a4; beats: (S\NP)/NP: b1, N: b2; Schalke: NP: c1, N/N: c2, S/(S\NP): c3. Chart items Bayern: N/N, beats: (S\NP)/NP, Schalke: NP, Schalke: S/(S\NP)[0,1], Bayern beats: N[0,2], beats Schalke: S\NP[1,3] and the goal state Bayern beats Schalke: S[0,3] are annotated with the unknown costs w, x, y and z.]
3.3 (3 P) Explain the motivation for subcategorization of verb-phrases, especially for training machine-
"
"
Problem 4 GloVe embeddings, LSTM neurons (10 credits)

4.1 (3 P) Motivation for GloVe embeddings: Given the following table, sort the numbers a1, a2, a3, a4 in ascending order (e.g. a2 = a4 < a1 < a3)!

                       k = guitar   k = engine   k = wheel   k = saddle
P(k|car)/P(k|bike)         a1           a2           a3          a4
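The intuition behind such ratios can be reproduced on invented co-occurrence counts; all numbers in this sketch are made up for illustration:

```python
# Made-up co-occurrence counts of context word k with target words "car" and "bike".
cooc = {
    "car":  {"guitar": 2, "engine": 90, "wheel": 80, "saddle": 5},
    "bike": {"guitar": 2, "engine": 10, "wheel": 85, "saddle": 60},
}

def p(k, target):
    """P(k | target) estimated from raw co-occurrence counts."""
    counts = cooc[target]
    return counts[k] / sum(counts.values())

for k in ["guitar", "engine", "wheel", "saddle"]:
    ratio = p(k, "car") / p(k, "bike")
    print(f"P({k}|car)/P({k}|bike) = {ratio:.2f}")
# Ratios far above 1 or far below 1 signal words specific to one target;
# ratios near 1 signal words related to both or to neither.
```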

4.2 (2 P) Motivation for GloVe embeddings: aside from symmetry or group-homomorphism considerations, motivate why setting u_i^T v_k = log P(i|k) also makes intuitive sense!

4.3 (3 P) Assuming you had never heard of contextual embeddings (such as BERT), how can static embeddings (GloVe, Word2Vec etc.) deal with different word senses? Why is that not really practical?

4.4 (2 P) LSTM neurons: why do we use the Hadamard product in connection with the gates and not the ordinary matrix product?
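A compact numpy sketch of a single LSTM step; the weights, dimensions and inputs are random placeholders. The gates act on the cell state purely element-wise via the Hadamard product:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the stacked parameters of the four gates."""
    z = W @ x + U @ h_prev + b                 # matrix products only for the pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g                     # Hadamard products: gates scale each unit separately
    h = o * np.tanh(c)
    return h, c

d, n = 8, 16                                   # made-up input and hidden sizes
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(4 * n, d)), rng.normal(size=(4 * n, n)), np.zeros(4 * n)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), W, U, b)
print(h.shape, c.shape)
```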
"
"
Problem 5 Modern neural models (10 credits)

5.1 (2 P) Some systems use hybrid combinations of word-based approaches and character-based approaches for neural machine translation. Provide one pro-argument for these approaches and provide one counter-argument for these approaches! Do not provide more than one argument each!

5.2 (2 P) What is the problem with just sticking to the standard word-based softmax architectures when vocabularies become large?

5.3 (3 P) Another idea to deal with the problems associated with large vocabularies is combining standard word-based softmax architectures with Pointer Networks. What do pointer networks and basic attention have in common?
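A numpy sketch of the shared machinery, using a made-up decoder state and made-up encoder outputs: both mechanisms score the source positions with the same softmax-normalized attention weights, which basic attention uses to mix a context vector and a pointer network reads directly as an output distribution over source tokens:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, src_len = 32, 6
enc = rng.normal(size=(src_len, d))     # made-up encoder outputs, one row per source token
dec = rng.normal(size=d)                # made-up decoder state at the current step

scores = enc @ dec                      # dot-product attention scores over source positions
weights = softmax(scores)

context = weights @ enc                 # basic attention: weighted sum of encoder states
copy_distribution = weights             # pointer network: the weights *are* P(copy source token i)

print(context.shape, copy_distribution.round(3))
```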

5.4 (3 P) Paper "Attention is all you need" (Vaswani et al., 2017): Multi-Head Attention is defined as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O          (5.1)
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)           (5.2)

Someone states: "This is very similar to CNN approaches as in the paper Kim, Y. (2014), 'Convolutional neural networks for sentence classification'."
Provide one argument supporting the statement and one counter-argument! Do not provide more than one argument each!
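A self-contained numpy sketch of equations (5.1)-(5.2) with scaled dot-product attention; all dimensions and the random projection matrices are placeholder assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multihead(Q, K, V, W_Q, W_K, W_V, W_O):
    """Eq. (5.1)/(5.2): project per head, attend, concatenate, project back."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 64, 8                       # made-up sequence length, model size, #heads
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))
print(multihead(X, X, X, W_Q, W_K, W_V, W_O).shape)   # (5, 64)
```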
"
"
Problem 6 BERT and GPT (10 credits)

6.1 (3 P) Explain why BERT is not immediately usable for generation (e.g. as a language model)!
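One way to make the architectural difference concrete is to compare the attention masks: a GPT-style decoder only attends to earlier positions, which is what permits left-to-right sampling, while BERT attends in both directions. A tiny numpy illustration with an arbitrary toy sequence length:

```python
import numpy as np

n = 5  # toy sequence length

# BERT-style (bidirectional) self-attention: every token may attend to every token.
bidirectional_mask = np.ones((n, n), dtype=int)

# GPT-style (causal) self-attention: token i may only attend to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=int))

print(bidirectional_mask)
print(causal_mask)
```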

6.2 (3 P) Someone suggests using BERT as an encoder and GPT as a decoder in a seq-to-seq architecture. The decoder would attend to the encoder in a similar way as in the original Transformer model (Vaswani et al. 2017). Motivate the usefulness of this architecture for seq-to-seq tasks!

6.3 (4 P) How could such a system as suggested in the previous assignment be trained and used for
"
