
Text Technology

Summer 2024
Johannes Dellert

Exercise Sheet 06:


NLP Tools on the Command Line,
Challenges of Using NLP Tools (Solution)
handed out: 9 June, 18:00
to be submitted by: 17 June, 06:00

IMPORTANT: For all tasks that require usage of the command line: either copy and paste the command
you used and the output you were shown, or include a screenshot of the command line showing the
command and the output.

HINT FOR WINDOWS USERS: If you cannot get OpenNLP to run in your Linux subsystem, you can
instead run it natively on the Windows command line (run cmd) using the .bat file provided. Pipes and
redirection are available, but not the usual range of Unix tools (like grep, sed, wc). Our recommendation
is to run the Linux subsystem and Cmd in parallel, to run OpenNLP within Cmd, and then to redirect
the output into a file (using >), which you can then read in (using <) for further processing by the Unix tools.

Task 1: Tokeniser
In this task, we will return to the OpenSubtitles parallel corpora. You are provided with a file to work with
(grail_corpus.txt), but if your parallel corpus from Exercise Sheet 03 has a language for which models
compatible with OpenNLP are available to download, feel free to use your data. Your goal in this task will
be to compare a naive token count on the raw source data with a token count based on the tokenised data.

For each of the three languages, do the following:


1. Build a command line pipeline to select lines in the given language from the corpus and do a naive
token count for each line, i.e. using spaces as delimiters and using the raw data. Save the output to
a file, so that you can later use it for comparison.
2. Introduce the OpenNLP tokeniser into your pipeline, i.e. tokenise the lines and then perform the
count. Again, save the output to a file.
3. Using the data in the two saved files, compute the average difference in word count between the naive
and tokenised approaches using the command line.
Compare the raw data with the output from the tokeniser: what could explain the differences
you found? How does this differ across the languages in the corpus?

Hint: To use OpenNLP with languages other than English, you need to download additional models.
For a selection of languages, you can find official models at https://opennlp.apache.org/models.html
and https://opennlp.sourceforge.net/models-1.5/. The provided file contains English, German, and
Swedish.
Hint: This time, our full solution comprises more than a single line, and it combines the following tools:
awk, nl, OpenNLP, join.

Example solutions:

1. Using awk, we can "filter" out the individual languages via the line number modulo 4: remainder 1
gives us the English lines, remainder 2 the German lines, and remainder 3 the Swedish lines.
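A minimal sketch of step 1, assuming the corpus consists of four-line blocks (English, German, Swedish, separator); the tiny corpus created here is a stand-in for the real grail_corpus.txt, and the output filename is an assumption:

```shell
# Stand-in for grail_corpus.txt: four-line blocks (EN, DE, SV, blank),
# matching the layout assumed by the modulo-4 filter.
printf '%s\n' "Hello there ." "Hallo du ." "Hej du ." "" \
              "It is a fish ." "Es ist ein Fisch ." "Det är en fisk ." "" > grail_corpus.txt

# Step 1: select the English lines (line number mod 4 == 1) and do a
# naive token count (NF counts whitespace-delimited fields); nl numbers
# the lines so the counts can later be joined with the tokenised counts.
awk 'NR % 4 == 1' grail_corpus.txt | awk '{print NF}' | nl > en_naive_counts.txt
cat en_naive_counts.txt
```

The same pipeline with `NR % 4 == 2` or `== 3` covers German and Swedish.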

2.
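For step 2, the OpenNLP call is shown as a comment, since it requires the toolkit and the en-token.bin model to be installed; the counting step then runs on a stand-in of the tokeniser's output:

```shell
# With OpenNLP available, the tokenised counts would come from:
#   awk 'NR % 4 == 1' grail_corpus.txt \
#     | bin/opennlp TokenizerME models/en-token.bin \
#     | awk '{print NF}' | nl > en_tok_counts.txt

# Stand-in for the tokeniser output (punctuation split off as tokens):
printf '%s\n' "Hello there ." "It is a fish ." > tokenised_en.txt
awk '{print NF}' tokenised_en.txt | nl > en_tok_counts.txt
cat en_tok_counts.txt
```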

3.
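Step 3 can be sketched with join and awk; the two count files created here are stand-ins in nl's "line-number, count" format, and the filenames are assumptions:

```shell
# Stand-in count files as nl would produce them (line number, then count).
printf '1\t3\n2\t5\n' > en_naive_counts.txt
printf '1\t4\n2\t7\n' > en_tok_counts.txt

# Join the two files on the line number, take the per-line difference
# (tokenised minus naive), and average it over all lines.
join en_naive_counts.txt en_tok_counts.txt \
  | awk '{sum += $3 - $2} END {print "average difference:", sum / NR}'
```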

The most visible difference is that OpenNLP separates punctuation marks into their own tokens. The effect
this has can be seen, for example, in the first sentence of the corpus. We see that English and German
are very similar, and, presumably, the difference can be explained by punctuation marks: there is always
at least one at the end of each sentence, and some also appear within sentences. Although Swedish
deviates more, the difference is not large. The difference is probably not particularly informative, since all
three languages are quite similar and presumably use punctuation marks in similar ways.

Task 2: POS Tagger


Using the OpenNLP toolkit and the Unix command line, annotate the provided rawPOS.txt file and save the
output into tag.txt. You must write all commands as a single line. The saved output should be
tab-separated, like this:

word tag
Then compare tag.txt with the provided hand-labelled en_pud-ud-text.conllu using the diff command.
Save the diff output to a file compare.txt. This comparison can be done as a separate command. Submit
both tag.txt and compare.txt.

Solution:
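A sketch of one possible solution. The OpenNLP invocation is shown as a comment (it requires the toolkit and the en-pos-maxent.bin model; the model name is an assumption), and the word/tag reformatting step runs on a stand-in line of tagger output in OpenNLP's "word_TAG" format:

```shell
# With OpenNLP installed, a single-line command could look like:
#   bin/opennlp POSTagger models/en-pos-maxent.bin < rawPOS.txt \
#     | tr ' ' '\n' \
#     | awk '{k=match($0,/_[^_]*$/); print substr($0,1,k-1) "\t" substr($0,k+1)}' \
#     > tag.txt
#   diff tag.txt en_pud-ud-text.conllu > compare.txt

# The reformatting step on a stand-in line of tagger output: put one
# "word_TAG" pair per line, then split each pair at its LAST underscore
# (so words containing underscores survive) into word<TAB>tag.
echo 'The_DT cat_NN sleeps_VBZ ._.' | tr ' ' '\n' \
  | awk '{k=match($0,/_[^_]*$/); print substr($0,1,k-1) "\t" substr($0,k+1)}' > tag.txt
cat tag.txt
```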

Task 3: Named Entity Recognition


bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
    | bin/opennlp TokenizerME models/en-token.bin \
    | bin/opennlp TokenNameFinder models/en-ner-person.bin

and

bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
    | bin/opennlp TokenizerME models/en-token.bin \
    | bin/opennlp TokenNameFinder models/en-ner-organization.bin

or all together:

bin/opennlp SentenceDetector models/en-sent.bin < input/bheki.txt \
    | bin/opennlp TokenizerME models/en-token.bin \
    | bin/opennlp TokenNameFinder models/en-ner-person.bin \
    | bin/opennlp TokenNameFinder models/en-ner-organization.bin

Problem: foreign or uncommon names that the model was never exposed to during training cause mistakes
in the NER output.

Task 4: Parser I
Using the OpenNLP toolkit, parse the provided rawParse.txt file to obtain the parse tree. Then, based on the
parse tree and using the Unix command line, identify the position of each noun and verb within the tree.
Based on the output, analyse the characteristics of their placement.
Solution:
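A sketch under the assumption that the parser's bracketed output is saved in a file parsed.txt; the OpenNLP call is shown as a comment (it requires the toolkit and a parser model), and a small awk scanner then reports the bracket depth of every noun (NN*) and verb (VB*) tag:

```shell
# With OpenNLP installed, the parse would come from something like:
#   bin/opennlp Parser models/en-parser-chunking.bin < rawParse.txt > parsed.txt

# Stand-in parse tree (one bracketed sentence per line):
echo '(TOP (S (NP (DT The) (NN cat)) (VP (VBZ sleeps))))' > parsed.txt

# Walk each line character by character, tracking bracket depth, and
# print the depth at which every NN*/VB* tag occurs.
awk '{
  depth = 0; label = ""
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "(")      { depth++; label = "" }
    else if (c == ")") { depth--; label = "" }
    else if (c == " ") {
      if (label ~ /^(NN|VB)/) print label, depth
      label = ""
    }
    else label = label c
  }
}' parsed.txt > noun_verb_depths.txt
cat noun_verb_depths.txt
```

On real parser output, comparing the printed depths shows how deep nouns sit relative to verbs.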

By examining the result, you can see that the verbs occupy shallower positions in the tree than the nouns.

Task 5: Analysing Parses II


In this task, you will analyse noun phrases across a large amount of data to see if you can find general
linguistic patterns. You will work with the English ELRC-837-Legal corpus (provided as elrc_legal_en.txt).
Create an OpenNLP command line pipeline that detects sentences, then tokenises and parses them.
Since this could take a while, it may be a good idea to save the output into a file and then use that as
further input. Use the resulting parses in a new command which finds noun phrases and extracts the
leftmost POS tag in each noun phrase (if you encounter nested noun phrases, take the deepest tag that is not
an NP). For example:
• From “(NP (DT those)”, we want to extract DT.
• From “(NP (NP (JJ new) (NNS challenges)) (, ,) (PP (VBG including) (NP (VBG rising)
(NN inequality) (CC and) (NN worker) (NN vulnerability))))”, we want to extract JJ and
VBG.
Extend the pipeline to create a list of matched POS tags together with their frequencies. Which adnominal
is the most common?

Hint: Our full solution comprises more than a single line, and it combines the following tools: OpenNLP,
grep, awk, sort, uniq.

Example Solution:
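A sketch of one way to do this. The OpenNLP pipeline is shown as a comment (it requires the toolkit and the English sentence, token, and parser models; the filenames are assumptions), and the extraction step runs on stand-in parser output: the grep pattern matches "(NP (" followed by any further nested "(NP (", so the tag captured is the deepest leftmost non-NP tag, as required.

```shell
# With OpenNLP installed, the parses would come from something like:
#   bin/opennlp SentenceDetector models/en-sent.bin < elrc_legal_en.txt \
#     | bin/opennlp TokenizerME models/en-token.bin \
#     | bin/opennlp Parser models/en-parser-chunking.bin > parses.txt

# Stand-in parser output (one bracketed tree per line):
printf '%s\n' \
  '(TOP (S (NP (DT those)) (VP (VBD failed))))' \
  '(TOP (S (NP (NP (JJ new) (NNS challenges)) (PP (IN of) (NP (NN work))))))' \
  > parses.txt

# Leftmost tag of each NP, descending through nested NPs, with counts:
grep -o '(NP (\(NP (\)*[A-Z.,:$]*' parses.txt \
  | awk '{sub(/^\(/, "", $NF); print $NF}' \
  | sort | uniq -c | sort -rn
```

The first line of the sorted output answers which adnominal is the most common.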

Task 6: Challenges
You do not need to use any command line tools for this task.
Imagine you were to use an NLP tool like OpenNLP to perform NLP tasks on non-standard texts, as listed
below. For each scenario, state what problems you would expect to occur and why.

1. Using a tokeniser on a URL such as https://www.reddit.com/r/EthicalLifeProTips/comments/9fua2c6/elpt_request_how_to_get_to_university_by_bus/
2. Using a sentence detector on lyrics such as
[Verse 1]
Is it getting better?
Or do you feel the same?
Will it make it easier on you
Now you got someone to blame?

[Chorus]
You say one love, one life
When it's one need in the night
One love, we get to share it
[...]

(taken from https://genius.com/U2-one-lyrics)
3. Using a parser on a recipe such as
- 1 tablespoon olive oil
- 1 garlic clove, finely chopped
- 8 thin-stemmed asparagus stalks, trimmed
[...]

Directions
Step 1
Heat a small skillet over medium-high heat. Add olive oil and garlic; cook and stir until
garlic is fragrant, about 30 seconds.

Step 2
Add asparagus
[...]

(taken from https://www.allrecipes.com/asparagus-and-eggs-recipe-8634304)

Example solution:
1. Incorrect tokenisation due to, e.g., slashes, full stops, and the lack of spaces; a URL is not structured
like the natural language text that tokenisers are usually trained on.
2. Incorrect sentence detection due to, e.g., the lack of full stops and, in this case, the special bracket
notation; the lyrics may not contain proper sentences in the first place, and lyrics in general are not
structured like the natural language text that sentence detectors are usually trained on.
3. Incorrect parsing due to, e.g., bullet-point ingredient lists, headings, and non-standard punctuation;
the instructions may have non-standard sentence structure, and recipes in general are not structured
like the natural language text that parsers are usually trained on.

