Texttech Ex06 Solution
Summer 2024
Johannes Dellert
IMPORTANT: For all tasks that require usage of the command line: either copy and paste the com-
mand you used and the output you are shown or include a screenshot of the command line
with the command and the output.
HINT FOR WINDOWS USERS: If you cannot get OpenNLP to run in your Linux subsystem, you can
instead run it natively on the Windows command line (run cmd) using the .bat file provided. Pipes and
redirection are available, but not the usual range of Unix tools (like grep, sed, wc). Our recommendation
is to run the Linux subsystem and Cmd in parallel, to run OpenNLP within Cmd, and then to redirect
the output into a file (using >), which you can then read in (using <) for further processing by the Unix tools.
Task 1: Tokeniser
In this task, we will return to the OpenSubtitles parallel corpora. You are provided with a file to work with
(grail_corpus.txt), but if your parallel corpus from Exercise Sheet 03 involves a language for which
OpenNLP-compatible models are available for download, feel free to use your own data. Your goal in this
task is to compare the naive token count obtained from the raw source data with the token count based on
tokenised data.
Hint: To use OpenNLP with languages other than English, you need to download additional models.
For a selection of languages, you can find official models at https://opennlp.apache.org/models.html
and https://opennlp.sourceforge.net/models-1.5/. The provided file contains English, German, and
Swedish.
Hint: This time, our full solution comprises more than a single line, and it combines the following tools:
awk, nl, OpenNLP, join.
Example solutions:
1. Using awk, we can filter out the languages by taking the line number modulo 4: remainder 1 gives us
the English lines, remainder 2 the German lines, and remainder 3 the Swedish lines.
2.
3.
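The filtering step can be sketched as follows. This is a minimal, self-contained sketch: the stand-in corpus, the file names, and the line-number-mod-4 layout are assumptions based on the description in step 1, and the OpenNLP tokenisation step is only indicated as a comment, since it requires the toolkit and models to be installed.

```shell
# Assumption: grail_corpus.txt cycles through English, German, and Swedish
# lines plus one separator line, so "line number mod 4" selects a language.
# A tiny stand-in corpus so the sketch is self-contained:
printf 'Hello there.\nHallo du.\nHej du.\n---\nNice day!\nSchöner Tag!\nFin dag!\n---\n' > corpus.txt

awk 'NR % 4 == 1' corpus.txt > en.txt   # English lines
awk 'NR % 4 == 2' corpus.txt > de.txt   # German lines
awk 'NR % 4 == 3' corpus.txt > sv.txt   # Swedish lines

# Naive token count: whitespace-separated words per language.
wc -w en.txt de.txt sv.txt

# For the tokenised count, each file would first be piped through OpenNLP:
# bin/opennlp TokenizerME models/en-token.bin < en.txt | wc -w
# nl can number both outputs so that join aligns them line by line.
```

The separator lines (line number mod 4 equal to 0) fall through all three filters, so they never distort the per-language counts.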
The most visible difference is that OpenNLP splits off punctuation marks as separate tokens. The effect
this has can be seen, for example, in the first sentence of the corpus. We see that English and German
are very similar, and the difference can presumably be explained by punctuation marks: there is always
at least one at the end of each sentence, and some also occur within sentences. Swedish deviates somewhat
more, but the difference is still not large. The difference is probably not particularly informative, since all
three languages are quite similar and presumably use punctuation marks in similar ways.
word tag
Then compare tag.txt with the provided hand-labelled en_pud-ud-text.conllu using the diff command.
Save the diff output to a file compare.txt. This comparison can be done as a separate command. Submit
both tag.txt and compare.txt.
Solution:
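A minimal sketch of the comparison step, using hypothetical stand-in files: in practice, tag.txt comes from the OpenNLP POS tagger and the gold file is en_pud-ud-text.conllu, whose CoNLL-U format may first need to be reduced to the same word/tag layout before diff produces a meaningful comparison.

```shell
# Hypothetical stand-ins for the tagger output and the gold standard:
printf 'The_DT\ndog_NN\nbarks_NN\n' > tag.txt
printf 'The_DT\ndog_NN\nbarks_VBZ\n' > gold_tags.txt

# diff exits with status 1 when the files differ, so append || true
# if the shell is running with "set -e".
diff tag.txt gold_tags.txt > compare.txt || true
cat compare.txt
```

Each hunk in compare.txt then points at a token where the tagger's label disagrees with the gold annotation.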
or all together:
bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
  | bin/opennlp TokenizerME models/en-token.bin \
  | bin/opennlp TokenNameFinder models/en-ner-person.bin \
  | bin/opennlp TokenNameFinder models/en-ner-organisation.bin
Problem: foreign or uncommon names that the model was not exposed to during training cause mistakes
in NER ...
Task 4: Parser I
Using the OpenNLP toolkit, parse the provided file rawParse.txt to obtain the parse tree. Then, based on
the parse tree and using the Unix command line, identify the position of each noun and verb within the
tree. Based on the output, analyse the characteristics of their placement.
Solution:
By examining the result, you can see that the verbs occupy shallower positions in the tree than the nouns.
Hint: Our full solution comprises more than a single line and combines the following tools: OpenNLP,
grep, awk, sort, uniq.
Example Solution:
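The position extraction can be sketched as follows. The inline parse string, the tag patterns, and the depth convention (one level per opening bracket) are assumptions for illustration; a real run would take the bracketed output of OpenNLP's Parser tool instead.

```shell
# An inline stand-in for an OpenNLP parse (assumption: bracketed
# Penn-Treebank-style output, as produced by the Parser tool).
parse='(TOP (S (NP (DT The) (NN dog)) (VP (VBZ barks) (NP (DT a) (NN lot)))))'

# Walk the bracket structure, tracking the current depth, and print
# the depth of every noun (NN*) and verb (VB*) tag.
printf '%s\n' "$parse" | awk '{
  depth = 0
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "(") {
      depth++
      tag = ""
      for (j = i + 1; j <= length($0); j++) {
        d = substr($0, j, 1)
        if (d == " " || d == "(" || d == ")") break
        tag = tag d
      }
      if (tag ~ /^(NN|VB)/) print tag, depth
    } else if (c == ")") depth--
  }
}' | sort | uniq -c
```

On this example string the nouns appear at depths 4 and 5 while the verb appears only at depth 4, matching the observation that verbs tend to sit shallower in the tree than nouns.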
Task 6: Challenges
You do not need to use any command line tools for this task.
Imagine you were to use an NLP tool like OpenNLP to perform NLP tasks on non-standard texts, as listed
below. For each scenario, state what problems you would expect to occur and why.
[Chorus]
You say one love, one life
When it's one need in the night
One love, we get to share it
[...]
(taken from https://genius.com/U2-one-lyrics)
3. Using a parser on a recipe such as
- 1 tablespoon olive oil
- 1 garlic clove, finely chopped
- 8 thin-stemmed asparagus stalks, trimmed
[...]
Directions
Step 1
Heat a small skillet over medium-high heat. Add olive oil and garlic; cook and stir until
garlic is fragrant, about 30 seconds.
Step 2
Add asparagus
[...]
Example solution:
1. incorrect tokenisation due to, e.g., slashes, full stops, and a lack of spaces; a URL is also not structured
like the natural language that tokenisers are usually trained on ...
2. incorrect sentence detection due to, e.g., a lack of full stops, the special bracket notation in this case,
or the fact that the lyrics do not contain proper sentences in the first place; lyrics are also not structured
like the natural language that sentence detectors are usually trained on ...
3. incorrect parsing due to, e.g., bullet-point item listings, headings, non-standard punctuation, and
instructions with non-standard sentence structure; recipes are also not structured like the natural
language that parsers are usually trained on ...